agent platform

Comprehensive Multi-Machine Development Network - Technical Implementation Plan

Executive Summary

Transform two Mac laptops (M4 + M3) into a unified, distributed Kubernetes cluster for autonomous agent development, leveraging existing OrbStack infrastructure and Tailscale mesh networking. This plan optimizes resource utilization across machines while maintaining the sophisticated agent platform already running on M4.

Current State Analysis

Hardware Inventory

Machine	Tailscale IP	Current Role	Specs	Utilization
Mac M4 (Bluefly)	100.108.129.7	Primary dev	M4 chip, Darwin kernel 25.1.0	80-90%
Mac M3 (GitLab)	100.108.180.36	GitLab only	M3 chip	20-30%
GL-BE3600	100.116.110.123	Subnet router	192.168.8.0/24	N/A

Current M4 Infrastructure (OrbStack)

Orchestration:

OrbStack 2.0.5 (Docker 28.5.2 + Kubernetes)
kubectl v1.32.3
25 active namespaces
100+ deployments/StatefulSets
30+ running containers

Key Services Running:

Agent Services (Ports 3000-3015): 16 microservices
Data Layer: PostgreSQL, MongoDB, Redis, ClickHouse, Neo4j, Qdrant, MinIO
Message Broker: RabbitMQ
Observability: Phoenix (6006), Prometheus (9090), Grafana (3009), Jaeger (16686)
AI Infrastructure: LiteLLM (4000), LibreChat (3080), Ollama (11434)

Network:

Local IP: 192.168.8.109
Gateway: 192.168.8.1 (GL-BE3600)
OrbStack bridges: bridge100-103
Tailscale mesh: tailcf98b3.ts.net

Storage:

overlay2 on btrfs
100+ Docker images (~50GB)
Local volumes only (single point of failure)

Software Already Installed (Both Machines)

Container & Orchestration:

docker (via OrbStack)
kubernetes-cli (kubectl v1.32.3)
k9s (TUI management)
kubectx (context switching)
stern (log tailing)
kustomize (config management)

DevOps:

gitlab-runner
glab (GitLab CLI)
git + git-lfs

Infrastructure:

tailscale
nginx
caddy

Observability:

prometheus (K8s)
grafana (K8s)
jaeger (Docker image available)

CLI Tools:

jq, yq (JSON/YAML)
fzf, ripgrep, bat, eza
btop (monitoring)

Gaps Identified

Category	Missing Component	Priority	Impact
Cluster	Multi-node K8s	Critical	Can't distribute workloads
Networking	Cilium CNI	High	No advanced networking
Storage	Distributed storage	Critical	No failover for data
Load Balancing	MetalLB	High	No proper service exposure
GitOps	Argo CD	Medium	Manual deployments
Logging	Loki	Medium	No centralized logs
Backup	Velero	High	No disaster recovery
Secrets	External Secrets	Medium	Manual secret management

Target Architecture

Hybrid Cloud Architecture: Cloudflare + Tailscale + vast.ai

Location-independent development - work from anywhere with consistent URLs.


PUBLIC INTERNET
  GitLab Duo  --> mesh.bluefly.internal --> Cloudflare Tunnel
  API clients --> api.blueflyagents.com --> Cloudflare Tunnel
  |
  +---> cloudflared (routes to wherever you are)
        |
        v
TAILSCALE PRIVATE MESH (tailcf98b3.ts.net)
  |
  +---> YOUR LOCATION (home, hotel, vacation - GL-BE3600 travel router)
  |     +---> Mac M4 (Control) - 100.108.129.7
  |     |     +-- agent-mesh
  |     |     +-- cloudflared
  |     |     +-- 16 OSSA agents
  |     +---> Mac M3 (Data) - 100.108.180.36
  |           +-- PostgreSQL
  |           +-- Redis, Neo4j
  |           +-- Qdrant, backups
  |
  +---> vast.ai GPU CLUSTER (always online, cloud)
        +-- Ollama (qwen2.5-coder, deepseek-r1, llama3)
        +-- vLLM for high-throughput inference
        +-- GPU-accelerated embeddings

Layer	Technology	Purpose
Public URLs	Cloudflare DNS + Tunnel	Stable endpoints (mesh.bluefly.internal)
Private Mesh	Tailscale	Encrypted connectivity between all devices
Local Network	GL-BE3600	Consistent 192.168.8.x subnet anywhere
GPU Compute	vast.ai	LLM inference (always available)

Cloudflare Domains

Domain	Backend	Purpose
mesh.bluefly.internal	agent-mesh:3005	GitLab Duo gateway
api.blueflyagents.com	agent-router:3006	LLM routing
brain.bluefly.internal	agent-brain:3000	Vector DB
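
The domain-to-backend mapping above corresponds to the ingress section of a cloudflared config.yml. A minimal sketch of generating that section follows; the helper name render_cloudflared_ingress and the http:// scheme for the backends are assumptions, not part of the existing setup.

```shell
# Render the `ingress:` section of a cloudflared config.yml from
# "hostname backend" pairs read on stdin.
# Assumption: every backend speaks plain HTTP inside the cluster.
render_cloudflared_ingress() {
  echo "ingress:"
  while read -r hostname backend; do
    # Skip blank lines
    [ -n "$hostname" ] || continue
    echo "  - hostname: $hostname"
    echo "    service: http://$backend"
  done
  # cloudflared requires a final catch-all rule without a hostname
  echo "  - service: http_status:404"
}

# Pairs copied from the Cloudflare Domains table
render_cloudflared_ingress <<'PAIRS'
mesh.bluefly.internal agent-mesh:3005
api.blueflyagents.com agent-router:3006
brain.bluefly.internal agent-brain:3000
PAIRS
```

The output can be appended to a config.yml that already names the tunnel and its credentials file.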

Network Topology (Local)

Internet
  |
Spectrum Modem (or hotel WiFi via GL-BE3600)
  +---> Deco X60 Mesh (general network - home only)
  +---> GL-BE3600 Router (BlueflyMesh: 192.168.8.0/24 - travels with you)
      +---> Tailscale Subnet Router (100.116.110.123)
      |
      +---> Mac M4 (100.108.129.7) - K8s Control Plane
      |   +-- etcd, kube-apiserver, kube-scheduler
      |   +-- Cilium CNI, MetalLB
      |   +-- Core Services: PostgreSQL, RabbitMQ, Redis
      |   +-- Control Services: Agent Mesh, Agent Router
      |   +-- Observability: Prometheus, Grafana, Loki
      |   +-- cloudflared (tunnel to Cloudflare)
      |
      +---> Mac M3 (100.108.180.36) - K8s Worker Node
      |   +-- kubelet, kube-proxy
      |   +-- Compute: Agent Brain, Phoenix
      |   +-- Data: ClickHouse, MongoDB replica, Qdrant
      |   +-- Heavy Workloads: LibreChat, LiteLLM replicas
      |
      +---> vast.ai GPU (dynamic Tailscale IP)
          +-- Ollama with large models (32B, 70B)
          +-- vLLM for batch inference
          +-- GPU-accelerated agent-brain

Service Distribution Strategy

Mac M4 (Control Plane + Stateful Core)

Kubernetes Components:

Cluster datastore (K3s uses embedded SQLite on a single server; embedded etcd requires --cluster-init)
kube-apiserver
kube-controller-manager
kube-scheduler
CoreDNS

Core Data Services (Primary):

PostgreSQL 15 (primary, port 5432)
Redis 7 (master, port 6379)
RabbitMQ 3.13 (primary, port 5672)
Neo4j 5 (primary, port 7687)

Coordination Services:

Agent Mesh (port 3003) - Service registry
Agent Router (port 3006) - LLM gateway
Agent Protocol (port 3005) - OSSA coordination

Observability Stack:

Prometheus (port 9090)
Grafana (port 3009)
Loki (new, port 3100)
Jaeger Query (port 16686)

Ingress & Gateway:

Ingress NGINX controller
CSMA Gateway (192.168.139.2:8090)

Mac M3 (Worker Node + Compute Intensive)

Heavy Compute Services:

Agent Brain (port 3000) - Qdrant vector operations
Ollama (port 11434) - Local LLM inference
Agent Studio (port 3007) - IDE workloads
Phoenix (port 6006) - Tracing processing
Agent Tracer (port 3008) - Trace collection

Distributed Data (Replicas):

PostgreSQL 15 (replica, read-only)
MongoDB 7.0 (replica set member)
Qdrant (distributed mode)
ClickHouse (distributed table)

User-Facing Services:

LibreChat (port 3080) - Chat UI
LiteLLM (port 4000) - Additional replicas
Studio UI (port 3014) - Frontend

Agent Workloads:

Agent Chat (port 3001)
Agent Docker (port 3002)
Agent Ops (port 3004)
Agentic Flows (port 3009)
Compliance Engine (port 3010)
Doc Engine (port 3011)

Storage Architecture

+-------------------------------------------------------------------+
| Longhorn Distributed Storage Cluster                              |
+-------------------------------------------------------------------+
|                                                                   |
|  Mac M4 Node                     Mac M3 Node                      |
|  +-- /var/lib/longhorn (50GB)    +-- /var/lib/longhorn (50GB)     |
|  |                               |                                |
|  +-- Volumes (2 replicas):       +-- Volumes (2 replicas):        |
|      * postgres-data (replica 1)     * postgres-data (replica 2)  |
|      * mongodb-data (replica 1)      * mongodb-data (replica 2)   |
|      * redis-data (replica 1)        * redis-data (replica 2)     |
|      * neo4j-data (replica 1)        * neo4j-data (replica 2)     |
|      * qdrant-data (replica 1)       * qdrant-data (replica 2)    |
|                                                                   |
|  Configuration:                                                   |
|  * Replica Count: 2                                               |
|  * Stale Replica Timeout: 30m                                     |
|  * Backup Target: S3 (MinIO)                                      |
|  * Snapshot Schedule: Daily 2am                                   |
+-------------------------------------------------------------------+
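
Because every volume is stored once per replica, the replicated pool is smaller than the raw disk total. The sketch below estimates usable capacity and compares it with the claim sizes requested later in this plan (20Gi PostgreSQL, 50Gi Loki, 50Gi Prometheus). The helper names are hypothetical, and the arithmetic ignores Longhorn's reserved space and over-provisioning settings.

```shell
# usable_gb <replicas> <node_gb>...
# Rough usable capacity: total raw space divided by the replica count.
usable_gb() {
  replicas=$1; shift
  total=0
  for node_gb in "$@"; do
    total=$((total + node_gb))
  done
  echo $((total / replicas))
}

# fits <usable_gb> <request_gb>...
# Prints the totals and succeeds only if the requests fit in the pool.
fits() {
  usable=$1; shift
  requested=0
  for r in "$@"; do
    requested=$((requested + r))
  done
  echo "requested=${requested}GB usable=${usable}GB"
  [ "$requested" -le "$usable" ]
}

pool=$(usable_gb 2 50 50)   # two 50GB nodes, 2 replicas per volume
fits "$pool" 20 50 50 || echo "requests exceed the replicated pool"
```

If the check fails, either enlarge /var/lib/longhorn on both nodes or shrink the claims before migrating data.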

Implementation Phases

Phase 1: Cluster Foundation (Week 1)

Day 1-2: Multi-Node Kubernetes Setup

Objective: Create K3s cluster spanning both machines

Prerequisite: K3s runs only on Linux. On each Mac, run the commands below inside a Linux environment (for example an OrbStack Linux machine or another VM) that has Tailscale installed, and substitute that environment's Tailscale IP if it differs from the addresses shown.

On Mac M4 (Control Plane):

# 1. Stop OrbStack's built-in Kubernetes (keep Docker)
orb stop k8s

# 2. Install K3s as server (--node-name keeps node names short for the selectors used later)
curl -sfL https://get.k3s.io | sh -s - server \
  --node-name mac-m4 \
  --node-ip 100.108.129.7 \
  --node-external-ip 100.108.129.7 \
  --flannel-backend=none \
  --disable-network-policy \
  --disable traefik \
  --disable servicelb \
  --write-kubeconfig-mode 644 \
  --tls-san 100.108.129.7 \
  --bind-address 100.108.129.7

# 3. Get join token
sudo cat /var/lib/rancher/k3s/server/node-token

# 4. Update kubeconfig
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

On Mac M3 (Worker Node):

# Install K3s as agent
export K3S_TOKEN="<token-from-m4>"
export K3S_URL="https://100.108.129.7:6443"

curl -sfL https://get.k3s.io | sh -s - agent \
  --node-name mac-m3 \
  --node-ip 100.108.180.36 \
  --node-external-ip 100.108.180.36

# Verify node joined (run on M4, where the kubeconfig lives)
kubectl get nodes

Expected Output (STATUS stays NotReady until the CNI is installed on Day 3):

NAME     STATUS   ROLES                  AGE   VERSION
mac-m4   Ready    control-plane,master   5m    v1.30.x
mac-m3   Ready    <none>                 2m    v1.30.x
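
Cluster formation can also be checked from a script instead of by eye. The sketch below reads the output of kubectl get nodes --no-headers and succeeds only when the expected number of nodes is listed and all are Ready (which happens once the CNI is running). The function name all_nodes_ready is hypothetical.

```shell
# all_nodes_ready <expected_count>
# Reads `kubectl get nodes --no-headers` output on stdin. Succeeds only if
# exactly <expected_count> nodes are listed and every STATUS is "Ready".
all_nodes_ready() {
  awk -v want="$1" '
    { total++ }
    $2 == "Ready" { ready++ }
    END { exit !(total == want && ready == want) }
  '
}

# Against a live cluster (run where the kubeconfig is available):
#   kubectl get nodes --no-headers | all_nodes_ready 2 && echo "cluster formed"
```

A cordoned node reports "Ready,SchedulingDisabled" and is deliberately counted as not ready here.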

Day 3: Network Layer (Cilium)

# On M4 only:

# 1. Install Cilium CLI
brew install cilium-cli

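# 2. Install Cilium CNI
#    routingMode=native replaces the deprecated tunnel=disabled flag.
#    ipv4NativeRoutingCIDR must cover the pod CIDR (K3s default: 10.42.0.0/16), not the Tailscale range.
#    Native routing needs each node to reach the other's pod CIDR directly; if cross-node pod
#    traffic fails over Tailscale, drop these three flags to use Cilium's default VXLAN tunnel.
cilium install \
  --set ipam.mode=kubernetes \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR="10.42.0.0/16" \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true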

# 3. Verify installation
cilium status --wait
cilium connectivity test

# 4. Enable Hubble (network observability)
cilium hubble enable --ui

Validation:

# Check Cilium pods running on both nodes
kubectl get pods -n kube-system -l k8s-app=cilium -o wide

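# Test pod-to-pod connectivity across nodes
# (pod names are not resolvable in DNS, so curl the target pod's IP)
kubectl run test-m4 --image=nginx --overrides='{"spec":{"nodeName":"mac-m4"}}'
kubectl run test-m3 --image=nginx --overrides='{"spec":{"nodeName":"mac-m3"}}'
kubectl wait --for=condition=Ready pod/test-m4 pod/test-m3 --timeout=120s
M3_POD_IP=$(kubectl get pod test-m3 -o jsonpath='{.status.podIP}')
kubectl exec test-m4 -- curl -s "http://$M3_POD_IP"
kubectl delete pod test-m4 test-m3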

Day 4-5: Storage Layer (Longhorn)

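# On BOTH M4 and M3 (in the Linux environment that runs K3s):

# 1. Install dependencies (Longhorn needs open-iscsi on every Linux node; Debian/Ubuntu shown)
sudo apt-get update && sudo apt-get install -y open-iscsi nfs-common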

# 2. Create Longhorn storage directory
sudo mkdir -p /var/lib/longhorn
sudo chmod 755 /var/lib/longhorn

# On M4 only:

# 3. Install Longhorn
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

# 4. Wait for Longhorn to be ready
kubectl -n longhorn-system get pods

# 5. Access Longhorn UI
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80

# Open http://localhost:8080

# 6. Configure Longhorn settings
#    Setting objects carry a top-level "value" field. The stale replica timeout is a
#    per-StorageClass parameter, so it is set in step 7 rather than here.
cat <<EOF | kubectl apply -f -
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: default-replica-count
  namespace: longhorn-system
value: "2"
EOF

# 7. Create StorageClass
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
EOF

Validation:

# Test PVC creation
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF

kubectl get pvc test-pvc
kubectl delete pvc test-pvc

Day 6-7: Load Balancing (MetalLB)

# On M4:

# 1. Install MetalLB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/main/config/manifests/metallb-native.yaml

# 2. Wait for MetalLB to be ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s

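# 3. Configure IP pool
#    L2 mode announces addresses with ARP, which only works on a shared Ethernet segment.
#    Tailscale is a routed overlay, so use free addresses on the 192.168.8.0/24 LAN instead
#    (reachable across the tailnet through the subnet router).
#    The range below is an example; pick one outside the router's DHCP pool.
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.8.200-192.168.8.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - lan-pool
EOF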

Validation:

# Test LoadBalancer service
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer

# Check external IP assigned
kubectl get svc nginx
# Should show EXTERNAL-IP from pool

# Test access
curl <EXTERNAL-IP>

# Cleanup
kubectl delete svc nginx
kubectl delete deployment nginx
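
The "EXTERNAL-IP from pool" check can be automated. The sketch below converts dotted addresses to integers to size an address range and to test whether an assigned IP falls inside it. The function names are hypothetical, and the 10.0.0.x range is a placeholder rather than this cluster's pool.

```shell
# ip_to_int <a.b.c.d> -> prints the address as a 32-bit integer
ip_to_int() {
  o1=${1%%.*}; rest=${1#*.}
  o2=${rest%%.*}; rest=${rest#*.}
  o3=${rest%%.*}; o4=${rest#*.}
  echo $(( (o1 << 24) + (o2 << 16) + (o3 << 8) + o4 ))
}

# range_size <start-end> -> number of addresses in an inclusive range
range_size() {
  echo $(( $(ip_to_int "${1#*-}") - $(ip_to_int "${1%-*}") + 1 ))
}

# ip_in_range <ip> <start-end> -> succeeds if ip lies inside the range
ip_in_range() {
  ip=$(ip_to_int "$1")
  [ "$ip" -ge "$(ip_to_int "${2%-*}")" ] && [ "$ip" -le "$(ip_to_int "${2#*-}")" ]
}

# Placeholder range; substitute the pool configured for MetalLB
echo "pool size: $(range_size 10.0.0.100-10.0.0.150)"
# EXTERNAL_IP=$(kubectl get svc nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# ip_in_range "$EXTERNAL_IP" 10.0.0.100-10.0.0.150 && echo "assigned from pool"
```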

Phase 2: Core Infrastructure Migration (Week 2)

Day 1-2: Ingress & Gateway

# 1. Install NGINX Ingress Controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

# 2. Configure for Tailscale
kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{"spec":{"type":"LoadBalancer"}}'

# 3. Get ingress IP
kubectl get svc -n ingress-nginx ingress-nginx-controller

# 4. Install Cert-Manager (for TLS)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# 5. Create ClusterIssuer for self-signed certs
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
EOF

Day 3-4: Migrate Stateless Services

Strategy: Start with services that have no persistent data

# 1. Create namespace structure
kubectl create namespace development
kubectl create namespace production
kubectl create namespace staging
kubectl create namespace csma
kubectl create namespace csma-agents
kubectl create namespace databases
kubectl create namespace monitoring

# 2. Deploy Agent Router (first stateless service)
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-router
  namespace: development
spec:
  replicas: 2  # 1 on M4, 1 on M3
  selector:
    matchLabels:
      app: agent-router
  template:
    metadata:
      labels:
        app: agent-router
    spec:
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm:main-latest
        ports:
        - containerPort: 4000
        env:
        - name: OLLAMA_API_BASE
          value: "http://ollama.development.svc.cluster.local:11434"
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - agent-router
              topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: agent-router
  namespace: development
spec:
  type: LoadBalancer
  selector:
    app: agent-router
  ports:
  - port: 4000
    targetPort: 4000
EOF

# 3. Verify deployment
kubectl get pods -n development -o wide
# Should see 1 pod on M4, 1 pod on M3

# 4. Test service
ROUTER_IP=$(kubectl get svc agent-router -n development -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://$ROUTER_IP:4000/health
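
The anti-affinity rule is only a preference, so it is worth confirming that the replicas really landed on different nodes. A minimal sketch, assuming node names are fed in one per line (for example from kubectl custom-columns output); the function name spread_ok is hypothetical.

```shell
# spread_ok <min_nodes>
# Reads one node name per line on stdin and succeeds if the pods are
# running on at least <min_nodes> distinct nodes.
spread_ok() {
  distinct=$(awk 'NF' | sort -u | wc -l | tr -d ' ')
  echo "pods are running on $distinct node(s)"
  [ "$distinct" -ge "$1" ]
}

# Against a live cluster:
#   kubectl get pods -n development -l app=agent-router \
#     -o custom-columns=NODE:.spec.nodeName --no-headers | spread_ok 2
```

Reading the node name from .spec.nodeName avoids parsing the wide table, whose column count changes when a pod has restarted.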

Day 5-7: Migrate Stateful Services with Longhorn

PostgreSQL Migration:

# 1. Backup existing data from OrbStack
docker exec postgres-container pg_dumpall -U postgres > /tmp/postgres-backup.sql

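# 2. Deploy PostgreSQL with Longhorn PVC
#    The StatefulSet reads its password from postgres-secret, so create that first.
kubectl create secret generic postgres-secret -n databases \
  --from-literal=password='<postgres-password>'

cat <<EOF | kubectl apply -f -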
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: databases
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: databases
spec:
  serviceName: postgresql
  replicas: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      nodeSelector:
        kubernetes.io/hostname: mac-m4  # Pin to M4 for primary
      containers:
      - name: postgresql
        image: postgres:15
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
      volumes:
      - name: postgres-data
        persistentVolumeClaim:
          claimName: postgres-data
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: databases
spec:
  clusterIP: None
  selector:
    app: postgresql
  ports:
  - port: 5432
    targetPort: 5432
EOF

# 3. Restore data
kubectl exec -n databases postgresql-0 -i -- psql -U postgres < /tmp/postgres-backup.sql

# 4. Verify Longhorn replication
kubectl get volumes -n longhorn-system
# Should show 2 replicas

Repeat for MongoDB, Redis, Neo4j, Qdrant:

# Template for other databases (adjust per service)
# - Backup from OrbStack
# - Create PVC with Longhorn
# - Deploy StatefulSet with nodeSelector
# - Restore data
# - Verify replication
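
The "Create PVC with Longhorn" step of this template can be generated rather than copied by hand. The sketch below renders one claim per remaining database; render_pvc is a hypothetical helper, and the 10Gi size is a placeholder that is not taken from this plan.

```shell
# render_pvc <name> <namespace> <size>
# Prints a PersistentVolumeClaim manifest that uses the longhorn StorageClass.
render_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
  namespace: $2
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: $3
EOF
}

# Placeholder size; adjust per service before applying.
for db in mongodb redis neo4j qdrant; do
  render_pvc "${db}-data" databases 10Gi
  echo "---"
done
```

Piping the loop into kubectl apply -f - creates the claims; the claim names match the volumes listed in the storage architecture.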

Phase 3: Service Distribution & Optimization (Week 3)

Day 1-3: Deploy Compute-Intensive Services to M3

# 1. Label nodes
kubectl label node mac-m3 workload-type=compute-intensive
kubectl label node mac-m4 workload-type=control-plane

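# 2. Deploy Ollama to M3 (the ollama-data claim it mounts is created first; 30Gi is a placeholder size)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-data
  namespace: development
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 30Gi
---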
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: development
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        workload-type: compute-intensive  # Force to M3
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-data
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: development
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
EOF

# 3. Deploy remaining compute services to M3
# - Agent Brain (Qdrant intensive)
# - Phoenix (trace processing)
# - LibreChat (user-facing)
# - Agent Studio (IDE workloads)

Day 4-5: Implement Database Replication

PostgreSQL Streaming Replication:

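# 1. Create replica on M3
#    Assumes the primary already has a "replication" role with the REPLICATION privilege and a
#    pg_hba.conf entry that lets it connect; supply its password via PGPASSWORD or .pgpass if required.
#    The heredoc is quoted so the local shell does not expand the variables inside the script.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-replica-data
  namespace: databases
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql-replica
  namespace: databases
spec:
  serviceName: postgresql-replica
  replicas: 1
  selector:
    matchLabels:
      app: postgresql-replica
  template:
    metadata:
      labels:
        app: postgresql-replica
    spec:
      nodeSelector:
        kubernetes.io/hostname: mac-m3
      initContainers:
      # Clone the primary before PostgreSQL starts. Scripts in /docker-entrypoint-initdb.d run
      # only after initdb has created a fresh cluster, which is too late for pg_basebackup.
      - name: clone-primary
        image: postgres:15
        command: ["/bin/bash", "/scripts/setup-replication.sh"]
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        - name: POSTGRES_PRIMARY_HOST
          value: postgresql.databases.svc.cluster.local
        volumeMounts:
        - name: postgres-replica-data
          mountPath: /var/lib/postgresql/data
        - name: recovery-config
          mountPath: /scripts
      containers:
      - name: postgresql
        image: postgres:15
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: postgres-replica-data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: postgres-replica-data
        persistentVolumeClaim:
          claimName: postgres-replica-data
      - name: recovery-config
        configMap:
          name: postgres-replica-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-replica-config
  namespace: databases
data:
  setup-replication.sh: |
    #!/bin/bash
    set -euo pipefail
    # Nothing to do if an earlier run already cloned the primary.
    if [ -s "$PGDATA/PG_VERSION" ]; then exit 0; fi
    mkdir -p "$PGDATA"
    # -R writes standby.signal and primary_conninfo, which PostgreSQL 12+ uses instead of recovery.conf.
    pg_basebackup -h "$POSTGRES_PRIMARY_HOST" -p 5432 -U replication -D "$PGDATA" -R -X stream -v -P
    chown -R postgres:postgres "$PGDATA"
    chmod 700 "$PGDATA"
EOF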

# 2. Create read-only service pointing to replica
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: postgresql-readonly
  namespace: databases
spec:
  selector:
    app: postgresql-replica
  ports:
  - port: 5432
    targetPort: 5432
EOF

Redis Sentinel:

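# Install Redis with Sentinel via Helm
# With sentinel.enabled=true the chart runs one Sentinel beside each Redis node, so
# replica.replicaCount must be at least 3 for a quorum of 2 to survive the loss of one pod.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

helm install redis bitnami/redis \
  --namespace databases \
  --set sentinel.enabled=true \
  --set sentinel.quorum=2 \
  --set master.persistence.storageClass=longhorn \
  --set replica.replicaCount=3 \
  --set replica.persistence.storageClass=longhorn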
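
The quorum value interacts with the number of Sentinel processes: the quorum decides when the master is considered down, but a failover must also be authorized by a majority of all Sentinels. A small sketch of that rule, with hypothetical function names:

```shell
# sentinel_majority <sentinels> -> smallest count that is more than half
sentinel_majority() {
  echo $(( $1 / 2 + 1 ))
}

# can_failover <sentinels_total> <sentinels_reachable> <quorum>
# Succeeds if the reachable Sentinels can agree the master is down (quorum)
# and can also authorize the failover (majority of all Sentinels).
can_failover() {
  majority=$(sentinel_majority "$1")
  [ "$2" -ge "$3" ] && [ "$2" -ge "$majority" ]
}

can_failover 3 2 2 && echo "3 sentinels, 1 lost, quorum 2: failover possible"
can_failover 1 1 2 || echo "1 sentinel, quorum 2: failover can never start"
```

With only two machines, losing the node that hosts two of three Sentinels still blocks failover, so placement matters as much as the count.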

Day 6-7: Deploy Observability Stack

# 1. Install Loki via Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set grafana.enabled=false \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=longhorn \
  --set loki.persistence.size=50Gi

# 2. Migrate Prometheus to cluster
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# 3. Configure Grafana with Loki datasource
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
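metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"  # lets the Grafana sidecar in kube-prometheus-stack load this ConfigMap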
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus-kube-prometheus-prometheus:9090
EOF

Phase 4: GitOps & Automation (Week 4)

Day 1-2: Argo CD Setup

# 1. Install Argo CD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 2. Expose Argo CD UI via LoadBalancer
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'

# 3. Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

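# 4. Log in to the Argo CD UI and CLI (the server presents a self-signed certificate by default)
ARGOCD_IP=$(kubectl get svc argocd-server -n argocd -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
open "https://$ARGOCD_IP"
brew install argocd
argocd login "$ARGOCD_IP" --username admin --insecure

# 5. Create GitLab integration
argocd repo add https://gitlab.com/blueflyio/agent-platform.git \
  --username <username> \
  --password <token>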

Day 3-4: External Secrets Operator (1Password Integration)

# 1. Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets-system \
  --create-namespace

# 2. Install 1Password Connect (optional, or use 1Password CLI)
# Follow: https://developer.1password.com/docs/connect/

# 3. Create SecretStore
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: onepassword
  namespace: default
spec:
  provider:
    onepassword:
      auth:
        secretRef:
          connectTokenSecretRef:
            name: onepassword-token
            key: token
      connectHost: http://onepassword-connect:8080
      vaults:
        csma-secrets: 1
EOF

# 4. Create ExternalSecret for GitLab tokens
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gitlab-tokens
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: onepassword
    kind: SecretStore
  target:
    name: gitlab-tokens
    creationPolicy: Owner
  data:
  - secretKey: token
    remoteRef:
      key: gitlab-api-token
EOF

Day 5-7: Argo Workflows for Agent Orchestration

# 1. Install Argo Workflows
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml

# 2. Expose Argo Workflows UI
kubectl patch svc argo-server -n argo -p '{"spec": {"type": "LoadBalancer"}}'

# 3. Create workflow template for agent execution
cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: agent-execution
  namespace: argo
spec:
  entrypoint: main
  arguments:
    parameters:
    - name: agent-name
    - name: task
  templates:
  - name: main
    inputs:
      parameters:
      - name: agent-name
      - name: task
    container:
      image: "ghcr.io/blueflyio/{{inputs.parameters.agent-name}}:latest"
      command: ["/bin/sh"]
      args: ["-c", "echo '{{inputs.parameters.task}}' | agent-executor"]
EOF

Phase 5: Production Readiness (Week 5)

Day 1-2: Backup & Disaster Recovery

# 1. Install Velero
brew install velero

# 2. Set up MinIO as backup target (already have MinIO)
kubectl create namespace velero

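# 3. Create MinIO bucket for backups
#    The minio/mc image's entrypoint is mc itself, so run both commands through a shell.
kubectl run minio-client --rm -it --image=minio/mc --restart=Never --command -- \
  /bin/sh -c 'mc alias set minio http://minio.databases.svc.cluster.local:9000 minioadmin minioadmin && mc mb minio/velero-backups'

# 4. Install Velero with MinIO backend
#    ./credentials-velero holds the MinIO keys in AWS credentials-file format.
#    The AWS plugin can snapshot only EBS volumes, so Longhorn volumes are backed up
#    at file level through the node agent instead.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=false \
  --use-node-agent \
  --default-volumes-to-fs-backup \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.databases.svc.cluster.local:9000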

# 5. Create backup schedule
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces development,production,databases,csma

# 6. Test backup/restore
velero backup create test-backup --include-namespaces development
velero backup describe test-backup
velero restore create --from-backup test-backup
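
The install step reads ./credentials-velero, which this plan does not otherwise create. A minimal sketch of writing it in the AWS credentials-file format that the Velero AWS plugin expects; the function name is hypothetical, and real keys should come from your secret manager rather than the command line.

```shell
# write_velero_credentials <path> <access_key> <secret_key>
# Writes an AWS-style credentials file readable only by the current user.
write_velero_credentials() {
  (
    umask 077
    cat > "$1" <<EOF
[default]
aws_access_key_id=$2
aws_secret_access_key=$3
EOF
  )
}

# Example with placeholder values:
#   write_velero_credentials ./credentials-velero "<minio-access-key>" "<minio-secret-key>"
```

Delete the file once velero install has stored it as a cluster secret.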

Day 3-4: Performance Tuning

# 1. Deploy Vertical Pod Autoscaler (installed with the script in the autoscaler repository)
git clone https://github.com/kubernetes/autoscaler.git
(cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh)

# 2. Create VPA for Agent Router
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: agent-router-vpa
  namespace: development
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-router
  updatePolicy:
    updateMode: "Off"  # recommendations only; "Auto" would fight a CPU-based HPA on the same Deployment
EOF

# 3. Deploy Horizontal Pod Autoscaler
kubectl autoscale deployment agent-router \
  --namespace development \
  --cpu-percent=70 \
  --min=2 \
  --max=4

# 4. Enable resource quotas per namespace
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: development-quota
  namespace: development
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
EOF
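
The HorizontalPodAutoscaler in step 3 scales with the rule desired = ceil(current * currentUtilization / targetUtilization), clamped to the min and max. The sketch below reproduces that arithmetic for the 70% target and 2-4 replica bounds used above; it ignores the controller's tolerance band and stabilization window, and the function name is hypothetical.

```shell
# hpa_desired <current_replicas> <current_cpu_pct> <target_cpu_pct> <min> <max>
hpa_desired() {
  # Integer ceiling of current * (current utilization / target utilization)
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
}

echo "2 replicas at 90% CPU  -> $(hpa_desired 2 90 70 2 4) replicas"
echo "2 replicas at 140% CPU -> $(hpa_desired 2 140 70 2 4) replicas"
echo "2 replicas at 35% CPU  -> $(hpa_desired 2 35 70 2 4) replicas"
```

Utilization is measured against the CPU request (250m per agent-router pod), not the limit.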

Day 5: Security Hardening

# 1. Enable Pod Security Standards
kubectl label namespace development pod-security.kubernetes.io/enforce=baseline

# 2. Create Network Policies with Cilium
cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: agent-router-policy
  namespace: development
spec:
  endpointSelector:
    matchLabels:
      app: agent-router
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: agent-mesh
    toPorts:
    - ports:
      - port: "4000"
        protocol: TCP
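  egress:
  - toEndpoints:
    - matchLabels:
        app: ollama
    toPorts:
    - ports:
      - port: "11434"
        protocol: TCP
  - toEndpoints:  # allow DNS; otherwise the policy also blocks resolving the Ollama service name
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY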
EOF

# 3. Install Falco for runtime security
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set tty=true

Day 6-7: Documentation & Runbooks

Create operational runbooks:

# Runbooks (to create)
1. Node Failure Recovery
2. Database Failover Procedure
3. Service Scaling Guidelines
4. Backup/Restore Procedures
5. Security Incident Response
6. Performance Troubleshooting
7. GitOps Deployment Process

Validation & Testing

Cluster Health Checks

# 1. Node status
kubectl get nodes -o wide

# 2. Pod distribution
kubectl get pods --all-namespaces -o wide | grep -E 'mac-m4|mac-m3'

# 3. Storage health
kubectl get pv,pvc --all-namespaces
kubectl -n longhorn-system get volumes

# 4. Network connectivity
kubectl run test-pod --image=nicolaka/netshoot --rm -it -- /bin/bash
# Inside pod: ping services, curl endpoints

# 5. Service endpoints
kubectl get svc --all-namespaces -o wide

Performance Benchmarks

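# 1. Database performance (PostgreSQL): initialize the pgbench tables (-i), then run the benchmark
kubectl run pgbench --rm -it --image=postgres:15 --env="PGPASSWORD=<postgres-password>" -- \
  sh -c 'H=postgresql.databases.svc.cluster.local; pgbench -h $H -U postgres -i postgres && pgbench -h $H -U postgres -c 10 -j 2 -t 1000 postgres'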

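# 2. Network throughput (iperf3)
# Server on M4 (--expose creates the iperf-server Service that the client resolves):
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 --expose \
  --overrides='{"spec":{"nodeName":"mac-m4"}}' -- -s
# Client on M3:
kubectl run iperf-client --rm -it --image=networkstatic/iperf3 \
  --overrides='{"spec":{"nodeName":"mac-m3"}}' -- \
  -c iperf-server.default.svc.cluster.local -t 30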

# 3. Load testing Agent Router
kubectl run load-test --rm -it --image=williamyeh/wrk -- \
  -t 4 -c 100 -d 30s http://agent-router.development.svc.cluster.local:4000/health

Failover Testing

# 1. Simulate M3 node failure
kubectl drain mac-m3 --ignore-daemonsets --delete-emptydir-data

# 2. Verify services moved to M4
kubectl get pods --all-namespaces -o wide | grep mac-m4

# 3. Check the primary is still serving (the replica is pinned to M3 and stays Pending while M3 is drained)
kubectl exec -n databases postgresql-0 -- pg_isready

# 4. Restore M3
kubectl uncordon mac-m3

Monitoring & Alerting

Grafana Dashboards to Create

Cluster Overview
- Node CPU/Memory utilization
- Pod distribution
- Network traffic
Service Health
- Agent Router throughput
- Database query latency
- LLM inference time
Storage Metrics
- Longhorn volume health
- IOPS per node
- Backup status
Cost Tracking
- Resource usage per namespace
- Pod efficiency scores

Prometheus Alerts

# alerts.yaml
groups:
- name: cluster-health
  rules:
  - alert: NodeDown
    expr: up{job="node-exporter"} == 0
    for: 5m
    annotations:
      summary: "Node {{ $labels.instance }} is down"

  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 10m
    annotations:
      summary: "Pod {{ $labels.pod }} is crash looping"

  - alert: HighMemoryUsage
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    for: 5m
    annotations:
      summary: "Node {{ $labels.instance }} has less than 10% memory available"

  - alert: StorageVolumeUnhealthy
    expr: longhorn_volume_robustness > 1  # 0=unknown, 1=healthy, 2=degraded, 3=faulted
    for: 5m
    annotations:
      summary: "Longhorn volume {{ $labels.volume }} is unhealthy"
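
The longhorn_volume_robustness gauge encodes volume state as an integer, so an alert threshold only makes sense with the mapping in hand. The sketch below decodes the values as Longhorn documents them at the time of writing and marks which ones should page; the function names are hypothetical.

```shell
# robustness_name <value> -> state name for a longhorn_volume_robustness sample
robustness_name() {
  case $1 in
    0) echo unknown ;;
    1) echo healthy ;;
    2) echo degraded ;;
    3) echo faulted ;;
    *) echo invalid; return 1 ;;
  esac
}

# should_alert <value> -> succeeds for degraded or faulted volumes
should_alert() {
  [ "$1" -ge 2 ]
}

for v in 0 1 2 3; do
  if should_alert "$v"; then action=alert; else action=ok; fi
  echo "$v $(robustness_name "$v") $action"
done
```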

Migration Checklist

Pre-Migration

Document current OrbStack configuration
Backup all databases and persistent data
Export existing Kubernetes manifests
Test Tailscale connectivity between machines
Verify disk space on both machines (50GB+ free)
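
Two of the checks above are easy to script. The sketch below confirms that an address lies in Tailscale's 100.64.0.0/10 range and that a filesystem has a minimum amount of free space. The function names are hypothetical, and the path to check is an assumption: point it at wherever the cluster data will live.

```shell
# is_tailscale_ip <a.b.c.d>
# Tailscale assigns addresses from 100.64.0.0/10: first octet 100,
# second octet between 64 and 127.
is_tailscale_ip() {
  first=${1%%.*}
  rest=${1#*.}
  second=${rest%%.*}
  [ "$first" -eq 100 ] && [ "$second" -ge 64 ] && [ "$second" -le 127 ]
}

# enough_disk <path> <min_gb>
# Succeeds if the filesystem holding <path> has at least <min_gb> GB free.
enough_disk() {
  free_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$free_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

for ip in 100.108.129.7 100.108.180.36; do
  is_tailscale_ip "$ip" && echo "$ip is a tailnet address"
done
enough_disk / 50 || echo "less than 50GB free on /"
```

Reachability itself still needs a live test, for example tailscale ping from each machine to the other.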

Migration Execution

Install K3s on both machines
Verify cluster formation
Install Cilium CNI
Install Longhorn storage
Install MetalLB load balancer
Migrate databases one at a time
Deploy stateless services
Configure service distribution
Set up replication for stateful services
Deploy observability stack

Post-Migration

Validate all services are running
Test failover scenarios
Run performance benchmarks
Configure backups
Set up monitoring alerts
Update documentation
Train on new operational procedures

Expected Outcomes

Performance Improvements

2x CPU capacity for agent workloads
2x RAM capacity for memory-intensive operations
40-50% reduction in M4 resource utilization
Parallel execution of multiple agent tasks
Faster LLM inference with dedicated compute on M3

Reliability Improvements

No single point of failure for replicated data services (the K3s control plane on M4 remains a single node)
Automatic failover for databases and services
Data replication across machines (2 copies minimum)
Self-healing pods via Kubernetes
Backup/restore capability via Velero

Operational Improvements

GitOps workflows via Argo CD
Centralized logging via Loki
Advanced networking via Cilium
Secret management via External Secrets
Automated scaling via HPA/VPA

Cost Savings

$0 cloud costs during development
Both machines' hardware in active use
Eliminate resource waste from idle M3

Risk Mitigation

Risk	Likelihood	Impact	Mitigation
Network latency via Tailscale	Medium	Medium	Use subnet routing, monitor with Hubble
Storage sync issues	Low	High	Longhorn battle-tested, 2 replicas minimum
Split-brain scenarios	Low	High	Single K3s server holds cluster state; use an odd number of servers (3+) if the control plane is made highly available
Service disruption during migration	High	Medium	Migrate incrementally, keep OrbStack running in parallel
Configuration complexity	High	Low	Document everything, use GitOps, runbooks
M3 failure affecting critical services	Medium	High	Keep critical services on M4, use anti-affinity rules

Rollback Plan

If issues arise during migration:

# 1. Stop K3s services
sudo systemctl stop k3s  # M4
sudo systemctl stop k3s-agent  # M3

# 2. Restore OrbStack Kubernetes
orb start k8s

# 3. Restore databases from backups
# (restoration commands per database)

# 4. Restart OrbStack containers
docker start $(docker ps -aq)

# 5. Verify services are running
curl http://localhost:4000/health  # Agent Router
curl http://localhost:3003/health  # Agent Mesh

Maintenance Windows

Recommended maintenance schedule:

Daily: Automated backups (2am)
Weekly: Health checks and log review (Sunday)
Monthly: Security updates and patches
Quarterly: Performance review and optimization

Next Steps After Completion

Scale to additional machines (if available)
Implement advanced features:
- Ray for distributed Python workloads
- Kubeflow for ML pipelines
- Temporal for complex workflows
Optimize costs:
- Fine-tune resource requests/limits
- Implement spot instance patterns
Production readiness:
- Disaster recovery drills
- Load testing at scale
- Security audits

Contact & Support

Internal Resources:

GitLab: https://gitlab.com/blueflyio/agent-platform
Documentation: https://docs.blueflyagents.com
Wiki: $LLM_ROOT/WIKIs/technical-docs.wiki

End of Technical Implementation Plan