Comprehensive Multi-Machine Development Network - Technical Implementation Plan

Executive Summary

Transform two Mac laptops (M4 + M3) into a unified, distributed Kubernetes cluster for autonomous agent development, leveraging existing OrbStack infrastructure and Tailscale mesh networking. This plan optimizes resource utilization across machines while maintaining the sophisticated agent platform already running on M4.


Current State Analysis

Hardware Inventory

Machine            Tailscale IP      Current Role    Specs                    Utilization
Mac M4 (Bluefly)   100.108.129.7     Primary dev     M4 chip, macOS 25.1.0    80-90%
Mac M3 (GitLab)    100.108.180.36    GitLab only     M3 chip                  20-30%
GL-BE3600          100.116.110.123   Subnet router   192.168.8.0/24           N/A

Current M4 Infrastructure (OrbStack)

Orchestration:

  • OrbStack 2.0.5 (Docker 28.5.2 + Kubernetes)
  • kubectl v1.32.3
  • 25 active namespaces
  • 100+ deployments/StatefulSets
  • 30+ running containers

Key Services Running:

  • Agent Services (Ports 3000-3015): 16 microservices
  • Data Layer: PostgreSQL, MongoDB, Redis, ClickHouse, Neo4j, Qdrant, MinIO
  • Message Broker: RabbitMQ
  • Observability: Phoenix (6006), Prometheus (9090), Grafana (3009), Jaeger (16686)
  • AI Infrastructure: LiteLLM (4000), LibreChat (3080), Ollama (11434)

Network:

  • Local IP: 192.168.8.109
  • Gateway: 192.168.8.1 (GL-BE3600)
  • OrbStack bridges: bridge100-103
  • Tailscale mesh: tailcf98b3.ts.net

Storage:

  • overlay2 on btrfs
  • 100+ Docker images (~50GB)
  • Local volumes only (single point of failure)

Software Already Installed (Both Machines)

Container & Orchestration:

  • docker (via OrbStack)
  • kubernetes-cli (kubectl v1.32.3)
  • k9s (TUI management)
  • kubectx (context switching)
  • stern (log tailing)
  • kustomize (config management)

DevOps:

  • gitlab-runner
  • glab (GitLab CLI)
  • git + git-lfs

Infrastructure:

  • tailscale
  • nginx
  • caddy

Observability:

  • prometheus (K8s)
  • grafana (K8s)
  • jaeger (Docker image available)

CLI Tools:

  • jq, yq (JSON/YAML)
  • fzf, ripgrep, bat, eza
  • btop (monitoring)

Gaps Identified

Category         Missing Component     Priority   Impact
Cluster          Multi-node K8s        Critical   Can't distribute workloads
Networking       Cilium CNI            High       No advanced networking
Storage          Distributed storage   Critical   No failover for data
Load Balancing   MetalLB               High       No proper service exposure
GitOps           Argo CD               Medium     Manual deployments
Logging          Loki                  Medium     No centralized logs
Backup           Velero                High       No disaster recovery
Secrets          External Secrets      Medium     Manual secret management

Target Architecture

Hybrid Cloud Architecture: Cloudflare + Tailscale + vast.ai

Location-independent development - work from anywhere with consistent URLs.


PUBLIC INTERNET
  |   GitLab Duo  -> mesh.bluefly.internal -> Cloudflare Tunnel
  |   API clients -> api.blueflyagents.com -> Cloudflare Tunnel
  |
  +---> cloudflared (routes to wherever you are)
          |
TAILSCALE PRIVATE MESH (tailcf98b3.ts.net)
  |
  +---> YOUR LOCATION (home, hotel, vacation - GL-BE3600 travel router)
  |       +---> Mac M4 (Control) - 100.108.129.7
  |       |       +-- agent-mesh
  |       |       +-- cloudflared
  |       |       +-- 16 OSSA agents
  |       +---> Mac M3 (Data) - 100.108.180.36
  |               +-- PostgreSQL
  |               +-- Redis, Neo4j
  |               +-- Qdrant, backups
  |
  +---> vast.ai GPU CLUSTER (always online, cloud)
          +-- Ollama (qwen2.5-coder, deepseek-r1, llama3)
          +-- vLLM for high-throughput inference
          +-- GPU-accelerated embeddings

Layer           Technology                Purpose
Public URLs     Cloudflare DNS + Tunnel   Stable endpoints (mesh.bluefly.internal)
Private Mesh    Tailscale                 Encrypted connectivity between all devices
Local Network   GL-BE3600                 Consistent 192.168.8.x subnet anywhere
GPU Compute     vast.ai                   LLM inference (always available)

Cloudflare Domains

Domain                   Backend             Purpose
mesh.bluefly.internal    agent-mesh:3005     GitLab Duo gateway
api.blueflyagents.com    agent-router:3006   LLM routing
brain.bluefly.internal   agent-brain:3000    Vector DB

Network Topology (Local)

Internet
  |
Spectrum Modem (or hotel WiFi via GL-BE3600)
  +---> Deco X60 Mesh (general network - home only)
  +---> GL-BE3600 Router (BlueflyMesh: 192.168.8.0/24 - travels with you)
      +---> Tailscale Subnet Router (100.116.110.123)
      |
      +---> Mac M4 (100.108.129.7) - K8s Control Plane
      |   +-- etcd, kube-apiserver, kube-scheduler
      |   +-- Cilium CNI, MetalLB
      |   +-- Core Services: PostgreSQL, RabbitMQ, Redis
      |   +-- Control Services: Agent Mesh, Agent Router
      |   +-- Observability: Prometheus, Grafana, Loki
      |   +-- cloudflared (tunnel to Cloudflare)
      |
      +---> Mac M3 (100.108.180.36) - K8s Worker Node
      |   +-- kubelet, kube-proxy
      |   +-- Compute: Agent Brain, Phoenix
      |   +-- Data: ClickHouse, MongoDB replica, Qdrant
      |   +-- Heavy Workloads: LibreChat, LiteLLM replicas
      |
      +---> vast.ai GPU (dynamic Tailscale IP)
          +-- Ollama with large models (32B, 70B)
          +-- vLLM for batch inference
          +-- GPU-accelerated agent-brain

Service Distribution Strategy

Mac M4 (Control Plane + Stateful Core)

Kubernetes Components:

  • etcd (cluster state)
  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler
  • CoreDNS

Core Data Services (Primary):

  • PostgreSQL 15 (primary, port 5432)
  • Redis 7 (master, port 6379)
  • RabbitMQ 3.13 (primary, port 5672)
  • Neo4j 5 (primary, port 7687)

Coordination Services:

  • Agent Mesh (port 3003) - Service registry
  • Agent Router (port 3006) - LLM gateway
  • Agent Protocol (port 3005) - OSSA coordination

Observability Stack:

  • Prometheus (port 9090)
  • Grafana (port 3009)
  • Loki (new, port 3100)
  • Jaeger Query (port 16686)

Ingress & Gateway:

  • Ingress NGINX controller
  • CSMA Gateway (192.168.139.2:8090)

Mac M3 (Worker Node + Compute Intensive)

Heavy Compute Services:

  • Agent Brain (port 3000) - Qdrant vector operations
  • Ollama (port 11434) - Local LLM inference
  • Agent Studio (port 3007) - IDE workloads
  • Phoenix (port 6006) - Tracing processing
  • Agent Tracer (port 3008) - Trace collection

Distributed Data (Replicas):

  • PostgreSQL 15 (replica, read-only)
  • MongoDB 7.0 (replica set member)
  • Qdrant (distributed mode)
  • ClickHouse (distributed table)

User-Facing Services:

  • LibreChat (port 3080) - Chat UI
  • LiteLLM (port 4000) - Additional replicas
  • Studio UI (port 3014) - Frontend

Agent Workloads:

  • Agent Chat (port 3001)
  • Agent Docker (port 3002)
  • Agent Ops (port 3004)
  • Agentic Flows (port 3009)
  • Compliance Engine (port 3010)
  • Doc Engine (port 3011)
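
The port assignments in the two lists above can be checked mechanically for collisions. The sketch below is only an illustration: the `name:port` pairs are copied from the lists, while the short service names and the `find_port_collisions` helper are made up for this example. A shared port only matters when two services are exposed on the same host or LoadBalancer address.

```shell
#!/usr/bin/env bash
# Port assignments copied from the M4 and M3 service lists (name:port).
# The short names are illustrative labels, not real Service names.
PLANNED_PORTS="
postgresql:5432 redis:6379 rabbitmq:5672 neo4j:7687
agent-mesh:3003 agent-router:3006 agent-protocol:3005
prometheus:9090 grafana:3009 loki:3100 jaeger-query:16686
agent-brain:3000 ollama:11434 agent-studio:3007 phoenix:6006
agent-tracer:3008 librechat:3080 litellm:4000 studio-ui:3014
agent-chat:3001 agent-docker:3002 agent-ops:3004
agentic-flows:3009 compliance-engine:3010 doc-engine:3011
"

# Print "port: name1 name2" for every port claimed by more than one service.
find_port_collisions() {
  # $1 is intentionally unquoted so each name:port pair becomes one line.
  printf '%s\n' $1 | awk -F: 'NF == 2 {
      count[$2]++
      names[$2] = (names[$2] ? names[$2] " " : "") $1
    }
    END { for (p in count) if (count[p] > 1) print p ": " names[p] }' | sort -n
}

find_port_collisions "$PLANNED_PORTS"
# -> 3009: grafana agentic-flows
```

With the ports as listed, Grafana and Agentic Flows both claim 3009, so one of them needs a different port before both are exposed on the same address.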

Storage Architecture

+-------------------------------------------------------------+
| Longhorn Distributed Storage Cluster                         |
+-------------------------------------------------------------+
|                                                               |
|  Mac M4 Node                     Mac M3 Node                 |
|  +-- /var/lib/longhorn (50GB)   +-- /var/lib/longhorn (50GB) |
|  |                              |                            |
|  +-- Volumes (2 replicas):       +-- Volumes (2 replicas):    |
|     * postgres-data (replica 1)    * postgres-data (replica 2)|
|     * mongodb-data (replica 1)     * mongodb-data (replica 2)|
|     * redis-data (replica 1)       * redis-data (replica 2)  |
|     * neo4j-data (replica 1)       * neo4j-data (replica 2)  |
|     * qdrant-data (replica 1)      * qdrant-data (replica 2) |
|                                                               |
|  Configuration:                                               |
|  * Replica Count: 2                                          |
|  * Stale Replica Timeout: 30m                                |
|  * Backup Target: S3 (MinIO)                                 |
|  * Snapshot Schedule: Daily 2am                              |
+-------------------------------------------------------------+
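
A rough capacity check for the layout above, as a sketch only. It treats GB and GiB as equal and ignores Longhorn's reserved-space and over-provisioning settings. The three PVC sizes are the ones requested later in this plan (postgres-data 20Gi, Loki 50Gi, Prometheus 50Gi); the helper names are made up for the example.

```shell
#!/usr/bin/env bash
# usable_gib NODES PER_NODE_GIB REPLICAS
# Every volume is stored REPLICAS times, so usable space is raw space / REPLICAS.
usable_gib() { echo $(( $1 * $2 / $3 )); }

# total_gib SIZE...  -> sum of the requested PVC sizes
total_gib() {
  local sum=0 s
  for s in "$@"; do sum=$(( sum + s )); done
  echo "$sum"
}

usable=$(usable_gib 2 50 2)        # 2 nodes x 50 GB, replica count 2
requested=$(total_gib 20 50 50)    # postgres-data, Loki, Prometheus
echo "usable=${usable}Gi requested=${requested}Gi"
[ "$requested" -le "$usable" ] || echo "over-committed by $(( requested - usable ))Gi"
```

Under these assumptions the three volumes alone ask for 120Gi against roughly 50Gi of usable space, so either the Longhorn directories or the PVC requests need resizing before Phase 3.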

Implementation Phases

Phase 1: Cluster Foundation (Week 1)

Day 1-2: Multi-Node Kubernetes Setup

Objective: Create K3s cluster spanning both machines

On Mac M4 (Control Plane):

# NOTE: K3s requires a Linux host. On macOS, run the K3s steps inside a Linux VM.

# 1. Disable OrbStack K8s (keep Docker)
orbstack config set kubernetes.enabled false

# 2. Install K3s as server
curl -sfL https://get.k3s.io | sh -s - server \
  --node-ip 100.108.129.7 \
  --node-external-ip 100.108.129.7 \
  --flannel-backend=none \
  --disable-network-policy \
  --disable traefik \
  --disable servicelb \
  --write-kubeconfig-mode 644 \
  --tls-san 100.108.129.7 \
  --bind-address 100.108.129.7

# 3. Get join token
sudo cat /var/lib/rancher/k3s/server/node-token

# 4. Update kubeconfig
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

On Mac M3 (Worker Node):

# Install K3s as agent
export K3S_TOKEN="<token-from-m4>"
export K3S_URL="https://100.108.129.7:6443"
curl -sfL https://get.k3s.io | sh -s - agent \
  --node-ip 100.108.180.36 \
  --node-external-ip 100.108.180.36

# Verify node joined (run on M4; the agent node has no kubeconfig)
kubectl get nodes

Expected Output:

NAME                        STATUS   ROLES                  AGE   VERSION
mac-m4.tail<hash>.ts.net    Ready    control-plane,master   5m    v1.30.x
mac-m3.tail<hash>.ts.net    Ready    <none>                 2m    v1.30.x
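
Rather than re-running `kubectl get nodes` by hand, the wait can be scripted. This is a minimal sketch; the function names and the 5-second poll interval are arbitrary choices, not part of K3s.

```shell
#!/usr/bin/env bash
# count_ready_nodes: read `kubectl get nodes` output on stdin and print how many
# rows have STATUS exactly "Ready". The header row is skipped; "NotReady" and
# "Ready,SchedulingDisabled" do not count.
count_ready_nodes() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# wait_for_nodes EXPECTED [TIMEOUT_SECONDS]: poll until EXPECTED nodes are Ready.
wait_for_nodes() {
  local expected=$1 timeout=${2:-300} waited=0
  while [ "$(kubectl get nodes 2>/dev/null | count_ready_nodes)" -lt "$expected" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $expected Ready nodes" >&2
      return 1
    fi
    sleep 5
    waited=$(( waited + 5 ))
  done
  echo "$expected nodes Ready"
}

# Usage (on M4): wait_for_nodes 2 600
```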

Day 3: Network Layer (Cilium)

# On M4 only:
# 1. Install Cilium CLI
brew install cilium-cli

# 2. Install Cilium CNI
#    (Cilium 1.14+ uses routingMode=native in place of the deprecated tunnel=disabled)
cilium install \
  --set ipam.mode=kubernetes \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set ipv4NativeRoutingCIDR="100.64.0.0/10" \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

# 3. Verify installation
cilium status --wait
cilium connectivity test

# 4. Enable Hubble (network observability)
cilium hubble enable --ui

Validation:

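# Check Cilium pods running on both nodes
kubectl get pods -n kube-system -l k8s-app=cilium -o wide

# Test pod-to-pod connectivity across nodes
# (nodeName must match the names printed by `kubectl get nodes`)
kubectl run test-m4 --image=nginx --overrides='{"spec":{"nodeName":"mac-m4"}}'
kubectl run test-m3 --image=nginx --overrides='{"spec":{"nodeName":"mac-m3"}}'

# Bare pods get no DNS name, so target the pod IP
kubectl exec test-m4 -- curl -s "http://$(kubectl get pod test-m3 -o jsonpath='{.status.podIP}')"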

Day 4-5: Storage Layer (Longhorn)

# On BOTH M4 and M3 (on the Linux hosts running K3s):
# 1. Install dependencies (Longhorn needs the open-iscsi initiator, a Linux package; Homebrew has no equivalent)
sudo apt-get install -y open-iscsi

# 2. Create Longhorn storage directory
sudo mkdir -p /var/lib/longhorn
sudo chmod 755 /var/lib/longhorn

# On M4 only:
# 3. Install Longhorn (prefer a release tag over master for repeatable installs)
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

# 4. Wait for Longhorn to be ready
kubectl -n longhorn-system get pods

# 5. Access Longhorn UI
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80
# Open http://localhost:8080

# 6. Configure Longhorn settings (a Setting carries its value at the top level, not under spec)
cat <<EOF | kubectl apply -f -
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: default-replica-count
  namespace: longhorn-system
value: "2"
---
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: stale-replica-timeout
  namespace: longhorn-system
value: "30"
EOF

# 7. Create StorageClass
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  fromBackup: ""
  fsType: "ext4"
EOF

Validation:

# Test PVC creation
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF

kubectl get pvc test-pvc
kubectl delete pvc test-pvc

Day 6-7: Load Balancing (MetalLB)

# On M4:
# 1. Install MetalLB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/main/config/manifests/metallb-native.yaml

# 2. Wait for MetalLB to be ready
kubectl wait --namespace metallb-system \
  --for=condition=ready pod \
  --selector=app=metallb \
  --timeout=90s

# 3. Configure IP pool (using Tailscale IPs)
#    NOTE: L2 mode announces addresses via ARP/NDP on the local segment. Tailscale is a
#    layer-3 overlay, so confirm these 100.x addresses are actually reachable before relying
#    on them; an unused range in 192.168.8.0/24 (advertised by the subnet router) is the
#    usual alternative.
cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: tailscale-pool
  namespace: metallb-system
spec:
  addresses:
    - 100.108.129.100-100.108.129.150
    - 100.108.180.100-100.108.180.150
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: tailscale-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - tailscale-pool
EOF

Validation:

# Test LoadBalancer service
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer

# Check external IP assigned
kubectl get svc nginx
# Should show EXTERNAL-IP from pool

# Test access
curl <EXTERNAL-IP>

# Cleanup
kubectl delete svc nginx
kubectl delete deployment nginx

Phase 2: Core Infrastructure Migration (Week 2)

Day 1-2: Ingress & Gateway

# 1. Install NGINX Ingress Controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

# 2. Configure for Tailscale
kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{"spec":{"type":"LoadBalancer"}}'

# 3. Get ingress IP
kubectl get svc -n ingress-nginx ingress-nginx-controller

# 4. Install Cert-Manager (for TLS)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# 5. Create ClusterIssuer for self-signed certs
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
EOF

Day 3-4: Migrate Stateless Services

Strategy: Start with services that have no persistent data

# 1. Create namespace structure
kubectl create namespace development
kubectl create namespace production
kubectl create namespace staging
kubectl create namespace csma
kubectl create namespace csma-agents
kubectl create namespace databases
kubectl create namespace monitoring

# 2. Deploy Agent Router (first stateless service)
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-router
  namespace: development
spec:
  replicas: 2  # 1 on M4, 1 on M3
  selector:
    matchLabels:
      app: agent-router
  template:
    metadata:
      labels:
        app: agent-router
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          env:
            - name: OLLAMA_API_BASE
              value: "http://ollama.development.svc.cluster.local:11434"
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - agent-router
                topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: agent-router
  namespace: development
spec:
  type: LoadBalancer
  selector:
    app: agent-router
  ports:
    - port: 4000
      targetPort: 4000
EOF

# 3. Verify deployment
kubectl get pods -n development -o wide
# Should see 1 pod on M4, 1 pod on M3

# 4. Test service
ROUTER_IP=$(kubectl get svc agent-router -n development -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://$ROUTER_IP:4000/health

Day 5-7: Migrate Stateful Services with Longhorn

PostgreSQL Migration:

# 1. Backup existing data from OrbStack
docker exec postgres-container pg_dumpall -U postgres > /tmp/postgres-backup.sql

# 2. Create the Secret the StatefulSet reads its password from,
#    then deploy PostgreSQL with a Longhorn PVC
kubectl create secret generic postgres-secret -n databases --from-literal=password='<password>'

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: databases
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: databases
spec:
  serviceName: postgresql
  replicas: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      nodeSelector:
        kubernetes.io/hostname: mac-m4  # Pin to M4 for primary
      containers:
        - name: postgresql
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
      volumes:
        - name: postgres-data
          persistentVolumeClaim:
            claimName: postgres-data
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: databases
spec:
  clusterIP: None
  selector:
    app: postgresql
  ports:
    - port: 5432
      targetPort: 5432
EOF

# 3. Restore data
kubectl exec -n databases postgresql-0 -i -- psql -U postgres < /tmp/postgres-backup.sql

# 4. Verify Longhorn replication
kubectl get volumes.longhorn.io -n longhorn-system
# Should show 2 replicas

Repeat for MongoDB, Redis, Neo4j, Qdrant:

# Template for other databases (adjust per service)
# - Backup from OrbStack
# - Create PVC with Longhorn
# - Deploy StatefulSet with nodeSelector
# - Restore data
# - Verify replication
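
The "Create PVC with Longhorn" step is identical for each database apart from the name and size, so it can be generated. A small sketch follows; `render_pvc` is a hypothetical helper, and the 10Gi size in the usage comment is a placeholder, not a figure from this plan.

```shell
#!/usr/bin/env bash
# render_pvc NAME SIZE: print a Longhorn-backed PVC manifest for the
# databases namespace, mirroring the postgres-data claim.
render_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
  namespace: databases
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: $2
EOF
}

# Usage (10Gi is a placeholder; size each volume from its current data):
#   for db in mongodb redis neo4j qdrant; do
#     render_pvc "${db}-data" 10Gi | kubectl apply -f -
#   done
```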

Phase 3: Service Distribution & Optimization (Week 3)

Day 1-3: Deploy Compute-Intensive Services to M3

# 1. Label nodes
kubectl label node mac-m3 workload-type=compute-intensive
kubectl label node mac-m4 workload-type=control-plane

# 2. Deploy Ollama to M3 (assumes a PVC named ollama-data already exists in development)
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: development
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        workload-type: compute-intensive  # Force to M3
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-data
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: development
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
EOF

# 3. Deploy remaining compute services to M3
# - Agent Brain (Qdrant intensive)
# - Phoenix (trace processing)
# - LibreChat (user-facing)
# - Agent Studio (IDE workloads)

Day 4-5: Implement Database Replication

PostgreSQL Streaming Replication:

# 1. Create replica on M3
#    (assumes a PVC named postgres-replica-data and a "replication" role on the primary)
#    The heredoc delimiter is quoted so the shell does not expand $POSTGRES_PRIMARY_HOST
#    and $PGDATA while applying the manifest.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql-replica
  namespace: databases
spec:
  serviceName: postgresql-replica
  replicas: 1
  selector:
    matchLabels:
      app: postgresql-replica
  template:
    metadata:
      labels:
        app: postgresql-replica
    spec:
      nodeSelector:
        kubernetes.io/hostname: mac-m3
      containers:
        - name: postgresql
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PRIMARY_HOST
              value: postgresql.databases.svc.cluster.local
          volumeMounts:
            - name: postgres-replica-data
              mountPath: /var/lib/postgresql/data
            - name: recovery-config
              mountPath: /docker-entrypoint-initdb.d
      volumes:
        - name: postgres-replica-data
          persistentVolumeClaim:
            claimName: postgres-replica-data
        - name: recovery-config
          configMap:
            name: postgres-replica-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-replica-config
  namespace: databases
data:
  setup-replication.sh: |
    #!/bin/bash
    # NOTE: initdb.d scripts run after initdb has populated PGDATA, while pg_basebackup
    # needs an empty target directory; in practice run this from an initContainer.
    # -R writes standby.signal and primary_conninfo (required on PostgreSQL 12+).
    pg_basebackup -h "$POSTGRES_PRIMARY_HOST" -D "$PGDATA" -U replication -R -v -P
    echo "hot_standby = on" >> "$PGDATA/postgresql.conf"
EOF

# 2. Create read-only service pointing to replica
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: postgresql-readonly
  namespace: databases
spec:
  selector:
    app: postgresql-replica
  ports:
    - port: 5432
      targetPort: 5432
EOF

Redis Sentinel:

# Install Redis with Sentinel via Helm
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
  --namespace databases \
  --set sentinel.enabled=true \
  --set sentinel.quorum=2 \
  --set master.persistence.storageClass=longhorn \
  --set replica.replicaCount=1 \
  --set replica.persistence.storageClass=longhorn

Day 6-7: Deploy Observability Stack

# 1. Install Loki via Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set grafana.enabled=false \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=longhorn \
  --set loki.persistence.size=50Gi

# 2. Migrate Prometheus to cluster
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# 3. Configure Grafana with Loki datasource
#    (the grafana_datasource label lets the chart's Grafana sidecar pick up this ConfigMap)
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-kube-prometheus-prometheus:9090
EOF

Phase 4: GitOps & Automation (Week 4)

Day 1-2: Argo CD Setup

# 1. Install Argo CD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 2. Expose Argo CD UI via LoadBalancer
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'

# 3. Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# 4. Login to Argo CD UI (argocd-server serves HTTPS with a self-signed certificate by default)
ARGOCD_IP=$(kubectl get svc argocd-server -n argocd -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
open https://$ARGOCD_IP

# 5. Create GitLab integration (the CLI must be logged in first)
argocd login $ARGOCD_IP --username admin --insecure
argocd repo add https://gitlab.com/blueflyio/agent-platform.git \
  --username <username> \
  --password <token>

Day 3-4: External Secrets Operator (1Password Integration)

# 1. Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets-system \
  --create-namespace

# 2. Install 1Password Connect (optional, or use 1Password CLI)
# Follow: https://developer.1password.com/docs/connect/

# 3. Create SecretStore
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: onepassword
  namespace: default
spec:
  provider:
    onepassword:
      auth:
        secretRef:
          connectTokenSecretRef:
            name: onepassword-token
            key: token
      connectHost: http://onepassword-connect:8080
      vaults:
        csma-secrets: 1
EOF

# 4. Create ExternalSecret for GitLab tokens
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gitlab-tokens
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: onepassword
    kind: SecretStore
  target:
    name: gitlab-tokens
    creationPolicy: Owner
  data:
    - secretKey: token
      remoteRef:
        key: gitlab-api-token
EOF

Day 5-7: Argo Workflows for Agent Orchestration

# 1. Install Argo Workflows
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml

# 2. Expose Argo Workflows UI
kubectl patch svc argo-server -n argo -p '{"spec": {"type": "LoadBalancer"}}'

# 3. Create workflow template for agent execution
cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: agent-execution
  namespace: argo
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: agent-name
      - name: task
  templates:
    - name: main
      inputs:
        parameters:
          - name: agent-name
          - name: task
      container:
        image: "ghcr.io/blueflyio/{{inputs.parameters.agent-name}}:latest"
        command: ["/bin/sh"]
        args: ["-c", "echo '{{inputs.parameters.task}}' | agent-executor"]
EOF

Phase 5: Production Readiness (Week 5)

Day 1-2: Backup & Disaster Recovery

# 1. Install Velero
brew install velero

# 2. Set up MinIO as backup target (MinIO is already deployed)
kubectl create namespace velero

# 3. Create MinIO bucket for backups
#    (the minio/mc image's entrypoint is mc, so both commands run through a shell)
kubectl run minio-client --rm -it --image=minio/mc --restart=Never --command -- \
  /bin/sh -c 'mc alias set minio http://minio.databases.svc.cluster.local:9000 minioadmin minioadmin && mc mb minio/velero-backups'

# 4. Install Velero with MinIO backend
#    (./credentials-velero holds the MinIO access key in AWS credentials-file format)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --snapshot-location-config region=minio \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.databases.svc.cluster.local:9000

# 5. Create backup schedule
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces development,production,databases,csma

# 6. Test backup/restore
velero backup create test-backup --include-namespaces development
velero backup describe test-backup
velero restore create --from-backup test-backup

Day 3-4: Performance Tuning

# 1. Deploy Vertical Pod Autoscaler (installed from the autoscaler repository's script)
git clone https://github.com/kubernetes/autoscaler.git
(cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh)

# 2. Create VPA for Agent Router
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: agent-router-vpa
  namespace: development
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-router
  updatePolicy:
    updateMode: "Auto"
EOF

# 3. Deploy Horizontal Pod Autoscaler
#    (VPA in Auto mode and a CPU-based HPA on the same Deployment work against each
#    other; keep only one of them acting on CPU)
kubectl autoscale deployment agent-router \
  --namespace development \
  --cpu-percent=70 \
  --min=2 \
  --max=4

# 4. Enable resource quotas per namespace
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: development-quota
  namespace: development
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
EOF
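
To see what the HPA settings above (`--cpu-percent=70 --min=2 --max=4`) imply, the sketch below applies the documented HPA scaling rule, desiredReplicas = ceil(currentReplicas x currentMetric / targetMetric), then clamps to the min/max bounds. It ignores the controller's tolerance band and stabilization window, and `hpa_desired` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# hpa_desired CURRENT_REPLICAS CURRENT_CPU_PCT TARGET_CPU_PCT MIN MAX
hpa_desired() {
  # Integer form of ceil(replicas * current / target).
  local want=$(( ($1 * $2 + $3 - 1) / $3 ))
  if [ "$want" -lt "$4" ]; then want=$4; fi
  if [ "$want" -gt "$5" ]; then want=$5; fi
  echo "$want"
}

hpa_desired 2 105 70 2 4   # ceil(2 * 105 / 70) = 3           -> 3
hpa_desired 2 200 70 2 4   # ceil(5.71) = 6, capped at max    -> 4
hpa_desired 4 20 70 2 4    # ceil(1.14) = 2, equal to the min -> 2
```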

Day 5: Security Hardening

# 1. Enable Pod Security Standards
kubectl label namespace development pod-security.kubernetes.io/enforce=baseline

# 2. Create Network Policies with Cilium
cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: agent-router-policy
  namespace: development
spec:
  endpointSelector:
    matchLabels:
      app: agent-router
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: agent-mesh
      toPorts:
        - ports:
            - port: "4000"
              protocol: TCP
  egress:
    - toEndpoints:
        - matchLabels:
            app: ollama
      toPorts:
        - ports:
            - port: "11434"
              protocol: TCP
EOF

# 3. Install Falco for runtime security
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set tty=true

Day 6-7: Documentation & Runbooks

Create operational runbooks:

# Runbooks (to create)
1. Node Failure Recovery
2. Database Failover Procedure
3. Service Scaling Guidelines
4. Backup/Restore Procedures
5. Security Incident Response
6. Performance Troubleshooting
7. GitOps Deployment Process

Validation & Testing

Cluster Health Checks

# 1. Node status
kubectl get nodes -o wide

# 2. Pod distribution
kubectl get pods --all-namespaces -o wide | grep -E 'mac-m4|mac-m3'

# 3. Storage health
kubectl get pv,pvc --all-namespaces
kubectl -n longhorn-system get volumes

# 4. Network connectivity
kubectl run test-pod --image=nicolaka/netshoot --rm -it -- /bin/bash
# Inside pod: ping services, curl endpoints

# 5. Service endpoints
kubectl get svc --all-namespaces -o wide
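
The pod-distribution check is easier to read as a per-node count than as grep output. A minimal sketch; `pods_per_node` is a hypothetical helper that counts node names fed to it on stdin.

```shell
#!/usr/bin/env bash
# pods_per_node: read one node name per line on stdin, print "node count" per node.
pods_per_node() {
  awk 'NF { c[$1]++ } END { for (n in c) print n, c[n] }' | sort
}

# Usage: custom-columns avoids parsing the wide table, whose RESTARTS column
# can contain spaces.
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
```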

Performance Benchmarks

# 1. Database performance (PostgreSQL)
#    (initialize the benchmark tables once with `pgbench -i` before the first run)
kubectl run pgbench --rm -it --image=postgres:15 -- \
  pgbench -h postgresql.databases.svc.cluster.local -U postgres -c 10 -j 2 -t 1000

# 2. Network throughput (iperf3)
# On M4 (--expose creates the Service that gives iperf-server its DNS name):
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 --expose -- -s
# On M3:
kubectl run iperf-client --rm -it --image=networkstatic/iperf3 -- \
  -c iperf-server.default.svc.cluster.local -t 30

# 3. Load testing Agent Router
kubectl run load-test --rm -it --image=williamyeh/wrk -- \
  -t 4 -c 100 -d 30s http://agent-router.development.svc.cluster.local:4000/health

Failover Testing

# 1. Simulate M3 node failure
kubectl drain mac-m3 --ignore-daemonsets --delete-emptydir-data

# 2. Verify services moved to M4
kubectl get pods --all-namespaces -o wide | grep mac-m4

# 3. Check the database still answers
#    (the replica is pinned to mac-m3 and stays Pending while that node is drained,
#    so probe the primary on M4)
kubectl exec -n databases postgresql-0 -- pg_isready

# 4. Restore M3
kubectl uncordon mac-m3

Monitoring & Alerting

Grafana Dashboards to Create

  1. Cluster Overview

    • Node CPU/Memory utilization
    • Pod distribution
    • Network traffic
  2. Service Health

    • Agent Router throughput
    • Database query latency
    • LLM inference time
  3. Storage Metrics

    • Longhorn volume health
    • IOPS per node
    • Backup status
  4. Cost Tracking

    • Resource usage per namespace
    • Pod efficiency scores

Prometheus Alerts

# alerts.yaml
groups:
  - name: cluster-health
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 10m
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

      - alert: HighMemoryUsage
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        annotations:
          summary: "Node {{ $labels.instance }} has less than 10% memory available"

      - alert: StorageVolumeUnhealthy
        # longhorn_volume_robustness: 0 unknown, 1 healthy, 2 degraded, 3 faulted
        expr: longhorn_volume_robustness > 1
        for: 5m
        annotations:
          summary: "Longhorn volume {{ $labels.volume }} is unhealthy"

Migration Checklist

Pre-Migration

  • Document current OrbStack configuration
  • Backup all databases and persistent data
  • Export existing Kubernetes manifests
  • Test Tailscale connectivity between machines
  • Verify disk space on both machines (50GB+ free)
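
The last two pre-migration items lend themselves to a quick script. A sketch under stated assumptions: the helper names are made up, the 50 GiB threshold and the two Tailscale IPs come from this plan, and the check uses the root filesystem because the plan does not say which volume needs the space.

```shell
#!/usr/bin/env bash
# free_gib PATH: whole GiB free on the filesystem that holds PATH.
# `df -Pk` prints POSIX-format output in 1024-byte blocks on macOS and Linux.
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_space PATH MIN_GIB: succeed when at least MIN_GIB GiB are free.
has_space() {
  [ "$(free_gib "$1")" -ge "$2" ]
}

# peer_reachable TAILSCALE_IP: succeed when the peer answers one Tailscale ping.
peer_reachable() {
  tailscale ping -c 1 "$1" >/dev/null 2>&1
}

preflight() {
  local ip
  has_space / 50 || echo "WARN: less than 50 GiB free on /"
  for ip in 100.108.129.7 100.108.180.36; do
    peer_reachable "$ip" || echo "WARN: $ip unreachable over Tailscale"
  done
}

# Usage: run `preflight` on each Mac before starting Phase 1.
```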

Migration Execution

  • Install K3s on both machines
  • Verify cluster formation
  • Install Cilium CNI
  • Install Longhorn storage
  • Install MetalLB load balancer
  • Migrate databases one at a time
  • Deploy stateless services
  • Configure service distribution
  • Set up replication for stateful services
  • Deploy observability stack

Post-Migration

  • Validate all services are running
  • Test failover scenarios
  • Run performance benchmarks
  • Configure backups
  • Set up monitoring alerts
  • Update documentation
  • Train on new operational procedures

Expected Outcomes

Performance Improvements

  • 2x CPU capacity for agent workloads
  • 2x RAM capacity for memory-intensive operations
  • 40-50% reduction in M4 resource utilization
  • Parallel execution of multiple agent tasks
  • Faster LLM inference with dedicated compute on M3

Reliability Improvements

  • No single point of failure for replicated data services (the control plane itself still runs only on M4)
  • Automatic failover for databases and services
  • Data replication across machines (2 copies minimum)
  • Self-healing pods via Kubernetes
  • Backup/restore capability via Velero

Operational Improvements

  • GitOps workflows via Argo CD
  • Centralized logging via Loki
  • Advanced networking via Cilium
  • Secret management via External Secrets
  • Automated scaling via HPA/VPA

Cost Savings

  • $0 cloud costs during development
  • Both machines share the load instead of M4 carrying it alone
  • Eliminate resource waste from idle M3

Risk Mitigation

Risk                                     Likelihood   Impact   Mitigation
Network latency via Tailscale            Medium       Medium   Use subnet routing, monitor with Hubble
Storage sync issues                      Low          High     Longhorn battle-tested, 2 replicas minimum
Split-brain scenarios                    Low          High     K8s leader election, odd-numbered etcd quorum
Service disruption during migration      High         Medium   Migrate incrementally, keep OrbStack running in parallel
Configuration complexity                 High         Low      Document everything, use GitOps, runbooks
M3 failure affecting critical services   Medium       High     Keep critical services on M4, use anti-affinity rules

Rollback Plan

If issues arise during migration:

# 1. Stop K3s services
sudo systemctl stop k3s        # M4
sudo systemctl stop k3s-agent  # M3

# 2. Restore OrbStack Kubernetes
orbstack config set kubernetes.enabled true

# 3. Restore databases from backups
# (restoration commands per database)

# 4. Restart OrbStack containers
docker start $(docker ps -aq)

# 5. Verify services are running
curl http://localhost:4000/health  # Agent Router
curl http://localhost:3003/health  # Agent Mesh
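
Containers restarted in step 4 may take a few seconds to answer, so the final verification is more reliable with retries. A minimal sketch; the two URLs are the ones from step 5, while the function names, retry count, and 2-second delay are arbitrary choices.

```shell
#!/usr/bin/env bash
# check_url URL [ATTEMPTS]: succeed as soon as URL returns an HTTP success
# status; give up after ATTEMPTS tries (default 5), pausing 2s between tries.
check_url() {
  local url=$1 attempts=${2:-5} i=1
  while :; do
    curl -fsS --max-time 5 "$url" >/dev/null 2>&1 && return 0
    [ "$i" -ge "$attempts" ] && return 1
    sleep 2
    i=$(( i + 1 ))
  done
}

# verify_rollback: probe both health endpoints; non-zero exit if any fails.
verify_rollback() {
  local failed=0 url
  for url in http://localhost:4000/health http://localhost:3003/health; do
    if check_url "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: verify_rollback || echo "rollback incomplete"
```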

Maintenance Windows

Recommended maintenance schedule:

  • Daily: Automated backups (2am)
  • Weekly: Health checks and log review (Sunday)
  • Monthly: Security updates and patches
  • Quarterly: Performance review and optimization

Next Steps After Completion

  1. Scale to additional machines (if available)
  2. Implement advanced features:
    • Ray for distributed Python workloads
    • Kubeflow for ML pipelines
    • Temporal for complex workflows
  3. Optimize costs:
    • Fine-tune resource requests/limits
    • Implement spot instance patterns
  4. Production readiness:
    • Disaster recovery drills
    • Load testing at scale
    • Security audits
