agent platform
Comprehensive Multi-Machine Development Network - Technical Implementation Plan
Executive Summary
Transform two Mac laptops (M4 + M3) into a unified, distributed Kubernetes cluster for autonomous agent development, leveraging existing OrbStack infrastructure and Tailscale mesh networking. This plan optimizes resource utilization across machines while maintaining the sophisticated agent platform already running on M4.
Current State Analysis
Hardware Inventory
| Machine | Tailscale IP | Current Role | Specs | Utilization |
|---|---|---|---|---|
| Mac M4 (Bluefly) | 100.108.129.7 | Primary dev | M4 chip, macOS 25.1.0 | 80-90% |
| Mac M3 (GitLab) | 100.108.180.36 | GitLab only | M3 chip | 20-30% |
| GL-BE3600 | 100.116.110.123 | Subnet router | 192.168.8.0/24 | N/A |
Current M4 Infrastructure (OrbStack)
Orchestration:
- OrbStack 2.0.5 (Docker 28.5.2 + Kubernetes)
- kubectl v1.32.3
- 25 active namespaces
- 100+ deployments/StatefulSets
- 30+ running containers
Key Services Running:
- Agent Services (Ports 3000-3015): 16 microservices
- Data Layer: PostgreSQL, MongoDB, Redis, ClickHouse, Neo4j, Qdrant, MinIO
- Message Broker: RabbitMQ
- Observability: Phoenix (6006), Prometheus (9090), Grafana (3009), Jaeger (16686)
- AI Infrastructure: LiteLLM (4000), LibreChat (3080), Ollama (11434)
Network:
- Local IP: 192.168.8.109
- Gateway: 192.168.8.1 (GL-BE3600)
- OrbStack bridges: bridge100-103
- Tailscale mesh: tailcf98b3.ts.net
Storage:
- overlay2 on btrfs
- 100+ Docker images (~50GB)
- Local volumes only (single point of failure)
Software Already Installed (Both Machines)
Container & Orchestration:
- docker (via OrbStack)
- kubernetes-cli (kubectl v1.32.3)
- k9s (TUI management)
- kubectx (context switching)
- stern (log tailing)
- kustomize (config management)
DevOps:
- gitlab-runner
- glab (GitLab CLI)
- git + git-lfs
Infrastructure:
- tailscale
- nginx
- caddy
Observability:
- prometheus (K8s)
- grafana (K8s)
- jaeger (Docker image available)
CLI Tools:
- jq, yq (JSON/YAML)
- fzf, ripgrep, bat, eza
- btop (monitoring)
Gaps Identified
| Category | Missing Component | Priority | Impact |
|---|---|---|---|
| Cluster | Multi-node K8s | Critical | Can't distribute workloads |
| Networking | Cilium CNI | High | No advanced networking |
| Storage | Distributed storage | Critical | No failover for data |
| Load Balancing | MetalLB | High | No proper service exposure |
| GitOps | Argo CD | Medium | Manual deployments |
| Logging | Loki | Medium | No centralized logs |
| Backup | Velero | High | No disaster recovery |
| Secrets | External Secrets | Medium | Manual secret management |
Target Architecture
Hybrid Cloud Architecture: Cloudflare + Tailscale + vast.ai
Location-independent development - work from anywhere with consistent URLs.
PUBLIC INTERNET
  +-- GitLab Duo  -> mesh.bluefly.internal -> Cloudflare Tunnel
  +-- API clients -> api.blueflyagents.com -> Cloudflare Tunnel
  |
  cloudflared (routes to wherever you are)
  |
TAILSCALE PRIVATE MESH (tailcf98b3.ts.net)
  +-- YOUR LOCATION (home, hotel, vacation - GL-BE3600 travel router)
  |     +-- Mac M4 (Control) - 100.108.129.7
  |     |     +-- agent-mesh
  |     |     +-- cloudflared
  |     |     +-- 16 OSSA agents
  |     +-- Mac M3 (Data) - 100.108.180.36
  |           +-- PostgreSQL
  |           +-- Redis, Neo4j
  |           +-- Qdrant, backups
  +-- vast.ai GPU CLUSTER (always online, cloud)
        +-- Ollama (qwen2.5-coder, deepseek-r1, llama3)
        +-- vLLM for high-throughput inference
        +-- GPU-accelerated embeddings
| Layer | Technology | Purpose |
|---|---|---|
| Public URLs | Cloudflare DNS + Tunnel | Stable endpoints (mesh.bluefly.internal) |
| Private Mesh | Tailscale | Encrypted connectivity between all devices |
| Local Network | GL-BE3600 | Consistent 192.168.8.x subnet anywhere |
| GPU Compute | vast.ai | LLM inference (always available) |
Cloudflare Domains
| Domain | Backend | Purpose |
|---|---|---|
| mesh.bluefly.internal | agent-mesh:3005 | GitLab Duo gateway |
| api.blueflyagents.com | agent-router:3006 | LLM routing |
| brain.bluefly.internal | agent-brain:3000 | Vector DB |
Network Topology (Local)
Internet
  |
Spectrum Modem (or hotel WiFi via GL-BE3600)
  +---> Deco X60 Mesh (general network - home only)
  +---> GL-BE3600 Router (BlueflyMesh: 192.168.8.0/24 - travels with you)
          +---> Tailscale Subnet Router (100.116.110.123)
          |
          +---> Mac M4 (100.108.129.7) - K8s Control Plane
          |       +-- etcd, kube-apiserver, kube-scheduler
          |       +-- Cilium CNI, MetalLB
          |       +-- Core Services: PostgreSQL, RabbitMQ, Redis
          |       +-- Control Services: Agent Mesh, Agent Router
          |       +-- Observability: Prometheus, Grafana, Loki
          |       +-- cloudflared (tunnel to Cloudflare)
          |
          +---> Mac M3 (100.108.180.36) - K8s Worker Node
          |       +-- kubelet, kube-proxy
          |       +-- Compute: Agent Brain, Phoenix
          |       +-- Data: ClickHouse, MongoDB replica, Qdrant
          |       +-- Heavy Workloads: LibreChat, LiteLLM replicas
          |
          +---> vast.ai GPU (dynamic Tailscale IP)
                  +-- Ollama with large models (32B, 70B)
                  +-- vLLM for batch inference
                  +-- GPU-accelerated agent-brain
Service Distribution Strategy
Mac M4 (Control Plane + Stateful Core)
Kubernetes Components:
- etcd (cluster state)
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- CoreDNS
Core Data Services (Primary):
- PostgreSQL 15 (primary, port 5432)
- Redis 7 (master, port 6379)
- RabbitMQ 3.13 (primary, port 5672)
- Neo4j 5 (primary, port 7687)
Coordination Services:
- Agent Mesh (port 3003) - Service registry
- Agent Router (port 3006) - LLM gateway
- Agent Protocol (port 3005) - OSSA coordination
Observability Stack:
- Prometheus (port 9090)
- Grafana (port 3009)
- Loki (new, port 3100)
- Jaeger Query (port 16686)
Ingress & Gateway:
- Ingress NGINX controller
- CSMA Gateway (192.168.139.2:8090)
Mac M3 (Worker Node + Compute Intensive)
Heavy Compute Services:
- Agent Brain (port 3000) - Qdrant vector operations
- Ollama (port 11434) - Local LLM inference
- Agent Studio (port 3007) - IDE workloads
- Phoenix (port 6006) - Tracing processing
- Agent Tracer (port 3008) - Trace collection
Distributed Data (Replicas):
- PostgreSQL 15 (replica, read-only)
- MongoDB 7.0 (replica set member)
- Qdrant (distributed mode)
- ClickHouse (distributed table)
User-Facing Services:
- LibreChat (port 3080) - Chat UI
- LiteLLM (port 4000) - Additional replicas
- Studio UI (port 3014) - Frontend
Agent Workloads:
- Agent Chat (port 3001)
- Agent Docker (port 3002)
- Agent Ops (port 3004)
- Agentic Flows (port 3009)
- Compliance Engine (port 3010)
- Doc Engine (port 3011)
Storage Architecture
+------------------------------------------------------------------+
|               Longhorn Distributed Storage Cluster               |
+------------------------------------------------------------------+
|                                                                  |
|  Mac M4 Node                      Mac M3 Node                    |
|  +-- /var/lib/longhorn (50GB)     +-- /var/lib/longhorn (50GB)   |
|  |                                |                              |
|  +-- Volumes (2 replicas):        +-- Volumes (2 replicas):      |
|      * postgres-data (replica 1)      * postgres-data (replica 2)|
|      * mongodb-data (replica 1)       * mongodb-data (replica 2) |
|      * redis-data (replica 1)         * redis-data (replica 2)   |
|      * neo4j-data (replica 1)         * neo4j-data (replica 2)   |
|      * qdrant-data (replica 1)        * qdrant-data (replica 2)  |
|                                                                  |
|  Configuration:                                                  |
|    * Replica Count: 2                                            |
|    * Stale Replica Timeout: 30m                                  |
|    * Backup Target: S3 (MinIO)                                   |
|    * Snapshot Schedule: Daily 2am                                |
+------------------------------------------------------------------+
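Because every Longhorn volume is stored once per replica, the usable capacity of the pool is smaller than the raw disk total. A minimal sketch of that arithmetic, assuming the 50 GB per node and the replica count of 2 shown in the diagram (the function name is illustrative, and Longhorn's own reservation and over-provisioning settings are ignored):

```shell
#!/bin/sh
# usable_capacity_gb NODES PER_NODE_GB REPLICAS
# Raw capacity divided by the replica count (integer GB, rounded down).
usable_capacity_gb() {
  echo $(( $1 * $2 / $3 ))
}

usable_capacity_gb 2 50 2   # two nodes, 50 GB each, 2 replicas
```

Under these assumptions the pool holds roughly 50 GB of replicated data, so the PVC sizes requested later in the plan should be totalled against that figure.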
Implementation Phases
Phase 1: Cluster Foundation (Week 1)
Day 1-2: Multi-Node Kubernetes Setup
Objective: Create a K3s cluster spanning both machines.
On Mac M4 (Control Plane):
    # NOTE: K3s runs only on Linux. These commands assume a Linux environment
    # on each Mac (for example a Linux VM), not the macOS host shell.

    # 1. Disable OrbStack K8s (keep Docker)
    orbstack config set kubernetes.enabled false

    # 2. Install K3s as server
    curl -sfL https://get.k3s.io | sh -s - server \
      --node-ip 100.108.129.7 \
      --node-external-ip 100.108.129.7 \
      --flannel-backend=none \
      --disable-network-policy \
      --disable traefik \
      --disable servicelb \
      --write-kubeconfig-mode 644 \
      --tls-san 100.108.129.7 \
      --bind-address 100.108.129.7

    # 3. Get join token
    sudo cat /var/lib/rancher/k3s/server/node-token

    # 4. Update kubeconfig
    export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
On Mac M3 (Worker Node):
    # Install K3s as agent
    export K3S_TOKEN="<token-from-m4>"
    export K3S_URL="https://100.108.129.7:6443"
    curl -sfL https://get.k3s.io | sh -s - agent \
      --node-ip 100.108.180.36 \
      --node-external-ip 100.108.180.36

    # Verify node joined (run on M4, where the kubeconfig lives)
    kubectl get nodes
Expected Output:
    NAME                       STATUS   ROLES                  AGE   VERSION
    mac-m4.tail<hash>.ts.net   Ready    control-plane,master   5m    v1.30.x
    mac-m3.tail<hash>.ts.net   Ready    <none>                 2m    v1.30.x
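The check against the expected output can be scripted rather than read by eye. A minimal sketch, assuming the default column layout of `kubectl get nodes --no-headers` (node name first, STATUS second); `nodes_ready` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and succeeds only if at least EXPECTED nodes report STATUS exactly "Ready".
nodes_ready() {
  expected=$1
  ready=$(awk '$2 == "Ready" { n++ } END { print n + 0 }')
  echo "ready nodes: $ready/$expected"
  [ "$ready" -ge "$expected" ]
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers | nodes_ready 2
```

A cordoned node reports `Ready,SchedulingDisabled` and is deliberately not counted, so the same helper can be reused after the drain step in failover testing.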
Day 3: Network Layer (Cilium)
    # On M4 only:
    # 1. Install Cilium CLI
    brew install cilium-cli

    # 2. Install Cilium CNI
    #    (on Cilium 1.14+ "tunnel=disabled" is replaced by "routingMode=native")
    cilium install \
      --set ipam.mode=kubernetes \
      --set tunnel=disabled \
      --set autoDirectNodeRoutes=true \
      --set ipv4NativeRoutingCIDR="100.64.0.0/10" \
      --set hubble.relay.enabled=true \
      --set hubble.ui.enabled=true

    # 3. Verify installation
    cilium status --wait
    cilium connectivity test

    # 4. Enable Hubble (network observability)
    cilium hubble enable --ui
Validation:
    # Check Cilium pods running on both nodes
    kubectl get pods -n kube-system -l k8s-app=cilium -o wide

    # Test pod-to-pod connectivity across nodes
    kubectl run test-m4 --image=nginx --overrides='{"spec":{"nodeName":"mac-m4"}}'
    kubectl run test-m3 --image=nginx --overrides='{"spec":{"nodeName":"mac-m3"}}'

    # Bare pod names are not resolvable in DNS, so target the pod IP directly
    M3_POD_IP=$(kubectl get pod test-m3 -o jsonpath='{.status.podIP}')
    kubectl exec test-m4 -- curl -s "http://$M3_POD_IP"
Day 4-5: Storage Layer (Longhorn)
    # On BOTH M4 and M3:
    # 1. Install the iSCSI initiator (a Longhorn requirement; this is a Linux
    #    package and is not available via Homebrew)
    sudo apt-get install -y open-iscsi

    # 2. Create Longhorn storage directory
    sudo mkdir -p /var/lib/longhorn
    sudo chmod 755 /var/lib/longhorn

    # On M4 only:
    # 3. Install Longhorn
    kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

    # 4. Wait for Longhorn to be ready
    kubectl -n longhorn-system get pods

    # 5. Access Longhorn UI
    kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80
    # Open http://localhost:8080

    # 6. Configure Longhorn settings
    #    (the Setting resource carries "value" at the top level, not under "spec")
    cat <<EOF | kubectl apply -f -
    apiVersion: longhorn.io/v1beta2
    kind: Setting
    metadata:
      name: default-replica-count
      namespace: longhorn-system
    value: "2"
    ---
    apiVersion: longhorn.io/v1beta2
    kind: Setting
    metadata:
      name: stale-replica-timeout
      namespace: longhorn-system
    value: "30"
    EOF

    # 7. Create StorageClass
    cat <<EOF | kubectl apply -f -
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: longhorn
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    parameters:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
      fromBackup: ""
      fsType: "ext4"
    EOF
Validation:
    # Test PVC creation
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-pvc
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
    EOF

    kubectl get pvc test-pvc
    kubectl delete pvc test-pvc
Day 6-7: Load Balancing (MetalLB)
    # On M4:
    # 1. Install MetalLB
    kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/main/config/manifests/metallb-native.yaml

    # 2. Wait for MetalLB to be ready
    kubectl wait --namespace metallb-system \
      --for=condition=ready pod \
      --selector=app=metallb \
      --timeout=90s

    # 3. Configure IP pool (using Tailscale IPs)
    cat <<EOF | kubectl apply -f -
    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: tailscale-pool
      namespace: metallb-system
    spec:
      addresses:
      - 100.108.129.100-100.108.129.150
      - 100.108.180.100-100.108.180.150
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: tailscale-l2
      namespace: metallb-system
    spec:
      ipAddressPools:
      - tailscale-pool
    EOF
Validation:
    # Test LoadBalancer service
    kubectl create deployment nginx --image=nginx
    kubectl expose deployment nginx --port=80 --type=LoadBalancer

    # Check external IP assigned
    kubectl get svc nginx
    # Should show EXTERNAL-IP from pool

    # Test access
    curl <EXTERNAL-IP>

    # Cleanup
    kubectl delete svc nginx
    kubectl delete deployment nginx
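The "should show EXTERNAL-IP from pool" check can be made explicit. The sketch below uses hypothetical helper names (`ip_to_int`, `ip_in_range`) and plain POSIX shell arithmetic to test whether an address falls inside one of the ranges configured in the IPAddressPool above; it handles IPv4 dotted-quad addresses only.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a single integer.
# The subshell keeps the IFS change from leaking into the caller.
ip_to_int() {
  (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
  )
}

# ip_in_range IP START END: succeeds if START <= IP <= END.
ip_in_range() {
  ip=$(ip_to_int "$1")
  lo=$(ip_to_int "$2")
  hi=$(ip_to_int "$3")
  [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}

# Example with an address inside the first pool range.
if ip_in_range 100.108.129.120 100.108.129.100 100.108.129.150; then
  echo "in pool"
fi
```

Against a live cluster, the first argument could come from `kubectl get svc nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'`.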
Phase 2: Core Infrastructure Migration (Week 2)
Day 1-2: Ingress & Gateway
    # 1. Install NGINX Ingress Controller
    kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

    # 2. Configure for Tailscale
    kubectl patch svc ingress-nginx-controller -n ingress-nginx -p '{"spec":{"type":"LoadBalancer"}}'

    # 3. Get ingress IP
    kubectl get svc -n ingress-nginx ingress-nginx-controller

    # 4. Install Cert-Manager (for TLS)
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

    # 5. Create ClusterIssuer for self-signed certs
    cat <<EOF | kubectl apply -f -
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: selfsigned-issuer
    spec:
      selfSigned: {}
    EOF
Day 3-4: Migrate Stateless Services
Strategy: Start with services that have no persistent data
    # 1. Create namespace structure
    kubectl create namespace development
    kubectl create namespace production
    kubectl create namespace staging
    kubectl create namespace csma
    kubectl create namespace csma-agents
    kubectl create namespace databases
    kubectl create namespace monitoring

    # 2. Deploy Agent Router (first stateless service)
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: agent-router
      namespace: development
    spec:
      replicas: 2  # 1 on M4, 1 on M3
      selector:
        matchLabels:
          app: agent-router
      template:
        metadata:
          labels:
            app: agent-router
        spec:
          containers:
          - name: litellm
            image: ghcr.io/berriai/litellm:main-latest
            ports:
            - containerPort: 4000
            env:
            - name: OLLAMA_API_BASE
              value: "http://ollama.development.svc.cluster.local:11434"
            resources:
              requests:
                memory: "512Mi"
                cpu: "250m"
              limits:
                memory: "2Gi"
                cpu: "1000m"
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                    - key: app
                      operator: In
                      values:
                      - agent-router
                  topologyKey: kubernetes.io/hostname
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: agent-router
      namespace: development
    spec:
      type: LoadBalancer
      selector:
        app: agent-router
      ports:
      - port: 4000
        targetPort: 4000
    EOF

    # 3. Verify deployment
    kubectl get pods -n development -o wide
    # Should see 1 pod on M4, 1 pod on M3

    # 4. Test service
    ROUTER_IP=$(kubectl get svc agent-router -n development -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    curl http://$ROUTER_IP:4000/health
Day 5-7: Migrate Stateful Services with Longhorn
PostgreSQL Migration:
    # 1. Backup existing data from OrbStack
    docker exec postgres-container pg_dumpall -U postgres > /tmp/postgres-backup.sql

    # 2. Deploy PostgreSQL with Longhorn PVC
    #    (assumes the Secret "postgres-secret" with key "password" already
    #    exists in the "databases" namespace)
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-data
      namespace: databases
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 20Gi
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: postgresql
      namespace: databases
    spec:
      serviceName: postgresql
      replicas: 1
      selector:
        matchLabels:
          app: postgresql
      template:
        metadata:
          labels:
            app: postgresql
        spec:
          nodeSelector:
            kubernetes.io/hostname: mac-m4  # Pin to M4 for primary
          containers:
          - name: postgresql
            image: postgres:15
            ports:
            - containerPort: 5432
            env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
            resources:
              requests:
                memory: "2Gi"
                cpu: "500m"
              limits:
                memory: "4Gi"
                cpu: "2000m"
          volumes:
          - name: postgres-data
            persistentVolumeClaim:
              claimName: postgres-data
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: postgresql
      namespace: databases
    spec:
      clusterIP: None
      selector:
        app: postgresql
      ports:
      - port: 5432
        targetPort: 5432
    EOF

    # 3. Restore data
    kubectl exec -n databases postgresql-0 -i -- psql -U postgres < /tmp/postgres-backup.sql

    # 4. Verify Longhorn replication
    kubectl get volumes -n longhorn-system
    # Should show 2 replicas
Repeat for MongoDB, Redis, Neo4j, Qdrant:
    # Template for other databases (adjust per service)
    # - Backup from OrbStack
    # - Create PVC with Longhorn
    # - Deploy StatefulSet with nodeSelector
    # - Restore data
    # - Verify replication
Phase 3: Service Distribution & Optimization (Week 3)
Day 1-3: Deploy Compute-Intensive Services to M3
    # 1. Label nodes
    kubectl label node mac-m3 workload-type=compute-intensive
    kubectl label node mac-m4 workload-type=control-plane

    # 2. Deploy Ollama to M3
    #    (assumes a PVC named "ollama-data" already exists in "development")
    cat <<EOF | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ollama
      namespace: development
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: ollama
      template:
        metadata:
          labels:
            app: ollama
        spec:
          nodeSelector:
            workload-type: compute-intensive  # Force to M3
          containers:
          - name: ollama
            image: ollama/ollama:latest
            ports:
            - containerPort: 11434
            volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
            resources:
              requests:
                memory: "4Gi"
                cpu: "2000m"
              limits:
                memory: "8Gi"
                cpu: "4000m"
          volumes:
          - name: ollama-data
            persistentVolumeClaim:
              claimName: ollama-data
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ollama
      namespace: development
    spec:
      selector:
        app: ollama
      ports:
      - port: 11434
        targetPort: 11434
    EOF

    # 3. Deploy remaining compute services to M3
    # - Agent Brain (Qdrant intensive)
    # - Phoenix (trace processing)
    # - LibreChat (user-facing)
    # - Agent Studio (IDE workloads)
Day 4-5: Implement Database Replication
PostgreSQL Streaming Replication:
    # 1. Create replica on M3
    #    (assumes a PVC named "postgres-replica-data" exists and a "replication"
    #    role is configured on the primary; the heredoc is quoted so the local
    #    shell does not expand the $VARIABLES in the script)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: postgresql-replica
      namespace: databases
    spec:
      serviceName: postgresql-replica
      replicas: 1
      selector:
        matchLabels:
          app: postgresql-replica
      template:
        metadata:
          labels:
            app: postgresql-replica
        spec:
          nodeSelector:
            kubernetes.io/hostname: mac-m3
          containers:
          - name: postgresql
            image: postgres:15
            ports:
            - containerPort: 5432
            env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PRIMARY_HOST
              value: postgresql.databases.svc.cluster.local
            volumeMounts:
            - name: postgres-replica-data
              mountPath: /var/lib/postgresql/data
            - name: recovery-config
              mountPath: /docker-entrypoint-initdb.d
          volumes:
          - name: postgres-replica-data
            persistentVolumeClaim:
              claimName: postgres-replica-data
          - name: recovery-config
            configMap:
              name: postgres-replica-config
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: postgres-replica-config
      namespace: databases
    data:
      setup-replication.sh: |
        #!/bin/bash
        # NOTE: pg_basebackup needs an empty target directory, so this script
        # has to run before the data directory is initialized (for example
        # from an init container rather than docker-entrypoint-initdb.d).
        # -R writes standby.signal and primary_conninfo (needed on PostgreSQL 12+).
        pg_basebackup -h "$POSTGRES_PRIMARY_HOST" -D "$PGDATA" -U replication -R -v -P
    EOF

    # 2. Create read-only service pointing to replica
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: postgresql-readonly
      namespace: databases
    spec:
      selector:
        app: postgresql-replica
      ports:
      - port: 5432
        targetPort: 5432
    EOF
Redis Sentinel:
    # Install Redis with Sentinel via Helm
    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm install redis bitnami/redis \
      --namespace databases \
      --set sentinel.enabled=true \
      --set sentinel.quorum=2 \
      --set master.persistence.storageClass=longhorn \
      --set replica.replicaCount=1 \
      --set replica.persistence.storageClass=longhorn
Day 6-7: Deploy Observability Stack
    # 1. Install Loki via Helm
    helm repo add grafana https://grafana.github.io/helm-charts
    helm install loki grafana/loki-stack \
      --namespace monitoring \
      --set promtail.enabled=true \
      --set grafana.enabled=false \
      --set loki.persistence.enabled=true \
      --set loki.persistence.storageClassName=longhorn \
      --set loki.persistence.size=50Gi

    # 2. Migrate Prometheus to cluster
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install prometheus prometheus-community/kube-prometheus-stack \
      --namespace monitoring \
      --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
      --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

    # 3. Configure Grafana with Loki datasource
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: grafana-datasources
      namespace: monitoring
    data:
      datasources.yaml: |
        apiVersion: 1
        datasources:
        - name: Loki
          type: loki
          access: proxy
          url: http://loki:3100
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://prometheus-kube-prometheus-prometheus:9090
    EOF
Phase 4: GitOps & Automation (Week 4)
Day 1-2: Argo CD Setup
    # 1. Install Argo CD
    kubectl create namespace argocd
    kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

    # 2. Expose Argo CD UI via LoadBalancer
    kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'

    # 3. Get admin password
    kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

    # 4. Login to Argo CD UI
    ARGOCD_IP=$(kubectl get svc argocd-server -n argocd -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    open http://$ARGOCD_IP

    # 5. Create GitLab integration (the CLI must be logged in first)
    argocd login "$ARGOCD_IP" --username admin --insecure
    argocd repo add https://gitlab.com/blueflyio/agent-platform.git \
      --username <username> \
      --password <token>
Day 3-4: External Secrets Operator (1Password Integration)
    # 1. Install External Secrets Operator
    helm repo add external-secrets https://charts.external-secrets.io
    helm install external-secrets \
      external-secrets/external-secrets \
      -n external-secrets-system \
      --create-namespace

    # 2. Install 1Password Connect (optional, or use 1Password CLI)
    # Follow: https://developer.1password.com/docs/connect/

    # 3. Create SecretStore
    cat <<EOF | kubectl apply -f -
    apiVersion: external-secrets.io/v1beta1
    kind: SecretStore
    metadata:
      name: onepassword
      namespace: default
    spec:
      provider:
        onepassword:
          auth:
            secretRef:
              connectTokenSecretRef:
                name: onepassword-token
                key: token
          connectHost: http://onepassword-connect:8080
          vaults:
            csma-secrets: 1
    EOF

    # 4. Create ExternalSecret for GitLab tokens
    cat <<EOF | kubectl apply -f -
    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: gitlab-tokens
      namespace: default
    spec:
      refreshInterval: 1h
      secretStoreRef:
        name: onepassword
        kind: SecretStore
      target:
        name: gitlab-tokens
        creationPolicy: Owner
      data:
      - secretKey: token
        remoteRef:
          key: gitlab-api-token
    EOF
Day 5-7: Argo Workflows for Agent Orchestration
    # 1. Install Argo Workflows
    kubectl create namespace argo
    kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml

    # 2. Expose Argo Workflows UI
    kubectl patch svc argo-server -n argo -p '{"spec": {"type": "LoadBalancer"}}'

    # 3. Create workflow template for agent execution
    cat <<EOF | kubectl apply -f -
    apiVersion: argoproj.io/v1alpha1
    kind: WorkflowTemplate
    metadata:
      name: agent-execution
      namespace: argo
    spec:
      entrypoint: main
      arguments:
        parameters:
        - name: agent-name
        - name: task
      templates:
      - name: main
        inputs:
          parameters:
          - name: agent-name
          - name: task
        container:
          image: "ghcr.io/blueflyio/{{inputs.parameters.agent-name}}:latest"
          command: ["/bin/sh"]
          args: ["-c", "echo '{{inputs.parameters.task}}' | agent-executor"]
    EOF
Phase 5: Production Readiness (Week 5)
Day 1-2: Backup & Disaster Recovery
    # 1. Install Velero
    brew install velero

    # 2. Set up MinIO as backup target (already have MinIO)
    kubectl create namespace velero

    # 3. Create MinIO bucket for backups
    #    (both mc commands run in one shell inside the pod)
    kubectl run minio-client --rm -it --image=minio/mc --restart=Never --command -- \
      /bin/sh -c "mc alias set minio http://minio.databases.svc.cluster.local:9000 minioadmin minioadmin && mc mb minio/velero-backups"

    # 4. Install Velero with MinIO backend
    #    (assumes ./credentials-velero holds the MinIO access key and secret)
    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.8.0 \
      --bucket velero-backups \
      --secret-file ./credentials-velero \
      --use-volume-snapshots=true \
      --snapshot-location-config region=minio \
      --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.databases.svc.cluster.local:9000

    # 5. Create backup schedule
    velero schedule create daily-backup \
      --schedule="0 2 * * *" \
      --include-namespaces development,production,databases,csma

    # 6. Test backup/restore
    velero backup create test-backup --include-namespaces development
    velero backup describe test-backup
    velero restore create --from-backup test-backup
Day 3-4: Performance Tuning
    # 1. Deploy Vertical Pod Autoscaler
    #    (VPA is installed from the autoscaler repository, not a single manifest)
    git clone https://github.com/kubernetes/autoscaler.git
    (cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh)

    # 2. Create VPA for Agent Router
    cat <<EOF | kubectl apply -f -
    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: agent-router-vpa
      namespace: development
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: agent-router
      updatePolicy:
        updateMode: "Auto"
    EOF

    # 3. Deploy Horizontal Pod Autoscaler
    #    (avoid letting VPA and HPA both act on CPU for the same Deployment)
    kubectl autoscale deployment agent-router \
      --namespace development \
      --cpu-percent=70 \
      --min=2 \
      --max=4

    # 4. Enable resource quotas per namespace
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: development-quota
      namespace: development
    spec:
      hard:
        requests.cpu: "10"
        requests.memory: 20Gi
        limits.cpu: "20"
        limits.memory: 40Gi
    EOF
Day 5: Security Hardening
    # 1. Enable Pod Security Standards
    kubectl label namespace development pod-security.kubernetes.io/enforce=baseline

    # 2. Create Network Policies with Cilium
    cat <<EOF | kubectl apply -f -
    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: agent-router-policy
      namespace: development
    spec:
      endpointSelector:
        matchLabels:
          app: agent-router
      ingress:
      - fromEndpoints:
        - matchLabels:
            app: agent-mesh
        toPorts:
        - ports:
          - port: "4000"
            protocol: TCP
      egress:
      - toEndpoints:
        - matchLabels:
            app: ollama
        toPorts:
        - ports:
          - port: "11434"
            protocol: TCP
    EOF

    # 3. Install Falco for runtime security
    helm repo add falcosecurity https://falcosecurity.github.io/charts
    helm install falco falcosecurity/falco \
      --namespace falco \
      --create-namespace \
      --set tty=true
Day 6-7: Documentation & Runbooks
Create operational runbooks:
Runbooks (to create):
1. Node Failure Recovery
2. Database Failover Procedure
3. Service Scaling Guidelines
4. Backup/Restore Procedures
5. Security Incident Response
6. Performance Troubleshooting
7. GitOps Deployment Process
Validation & Testing
Cluster Health Checks
    # 1. Node status
    kubectl get nodes -o wide

    # 2. Pod distribution
    kubectl get pods --all-namespaces -o wide | grep -E 'mac-m4|mac-m3'

    # 3. Storage health
    kubectl get pv,pvc --all-namespaces
    kubectl -n longhorn-system get volumes

    # 4. Network connectivity
    kubectl run test-pod --image=nicolaka/netshoot --rm -it -- /bin/bash
    # Inside pod: ping services, curl endpoints

    # 5. Service endpoints
    kubectl get svc --all-namespaces -o wide
Performance Benchmarks
    # 1. Database performance (PostgreSQL)
    kubectl run pgbench --rm -it --image=postgres:15 -- \
      pgbench -h postgresql.databases.svc.cluster.local -U postgres -c 10 -j 2 -t 1000

    # 2. Network throughput (iperf3)
    # Server (intended for M4; exposed as a Service so the client can resolve it by name):
    kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 --expose -- -s
    # Client (intended for M3):
    kubectl run iperf-client --rm -it --image=networkstatic/iperf3 -- \
      -c iperf-server.default.svc.cluster.local -t 30

    # 3. Load testing Agent Router
    kubectl run load-test --rm -it --image=williamyeh/wrk -- \
      -t 4 -c 100 -d 30s http://agent-router.development.svc.cluster.local:4000/health
Failover Testing
    # 1. Simulate M3 node failure
    kubectl drain mac-m3 --ignore-daemonsets --delete-emptydir-data

    # 2. Verify services moved to M4
    kubectl get pods --all-namespaces -o wide | grep mac-m4

    # 3. Check database replication
    kubectl exec -n databases postgresql-replica-0 -- pg_isready

    # 4. Restore M3
    kubectl uncordon mac-m3
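To confirm that workloads actually moved during the drain, it helps to compare per-node pod counts before and after. A minimal sketch with a hypothetical helper, `pods_per_node`, which reads one node name per line and prints a count for each node:

```shell
#!/bin/sh
# Hypothetical helper: reads one node name per line on stdin and prints
# "<node> <pod-count>", sorted by node name.
pods_per_node() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

Pods that are not yet scheduled show up under `<none>`, which makes workloads stuck in Pending after the drain easy to spot.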
Monitoring & Alerting
Grafana Dashboards to Create
- Cluster Overview
  - Node CPU/Memory utilization
  - Pod distribution
  - Network traffic
- Service Health
  - Agent Router throughput
  - Database query latency
  - LLM inference time
- Storage Metrics
  - Longhorn volume health
  - IOPS per node
  - Backup status
- Cost Tracking
  - Resource usage per namespace
  - Pod efficiency scores
Prometheus Alerts
    # alerts.yaml
    groups:
    - name: cluster-health
      rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        annotations:
          summary: "Node {{ $labels.instance }} is down"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 10m
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: HighMemoryUsage
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        annotations:
          summary: "Node {{ $labels.instance }} has less than 10% memory available"
      - alert: StorageVolumeUnhealthy
        # Longhorn reports robustness as 0=unknown, 1=healthy, 2=degraded, 3=faulted
        expr: longhorn_volume_robustness != 1
        for: 5m
        annotations:
          summary: "Longhorn volume {{ $labels.volume }} is unhealthy"
Migration Checklist
Pre-Migration
- Document current OrbStack configuration
- Backup all databases and persistent data
- Export existing Kubernetes manifests
- Test Tailscale connectivity between machines
- Verify disk space on both machines (50GB+ free)
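Part of the pre-migration checklist can be automated with a small pre-flight script. The sketch below is a hypothetical helper (`require_tools`) that reports which of the command-line tools used in this plan are missing from the current machine; the tool list in the usage comment is an assumption to adjust as needed.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: prints each missing command and returns
# non-zero if any of the named tools is not on PATH.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example (run on both machines before starting the migration):
#   require_tools kubectl helm tailscale docker jq
```

Helm is worth including in the list: later phases rely on `helm install`, but it does not appear in the installed-software inventory.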
Migration Execution
- Install K3s on both machines
- Verify cluster formation
- Install Cilium CNI
- Install Longhorn storage
- Install MetalLB load balancer
- Migrate databases one at a time
- Deploy stateless services
- Configure service distribution
- Set up replication for stateful services
- Deploy observability stack
Post-Migration
- Validate all services are running
- Test failover scenarios
- Run performance benchmarks
- Configure backups
- Set up monitoring alerts
- Update documentation
- Train on new operational procedures
Expected Outcomes
Performance Improvements
- Up to roughly 2x CPU capacity for agent workloads (M4 + M3 combined)
- Up to roughly 2x RAM capacity for memory-intensive operations
- 40-50% reduction in M4 resource utilization
- Parallel execution of multiple agent tasks
- Faster LLM inference with dedicated compute on M3
Reliability Improvements
- No single point of failure for replicated data services (the single control plane on M4 remains one)
- Automatic failover for databases and services
- Data replication across machines (2 copies minimum)
- Self-healing pods via Kubernetes
- Backup/restore capability via Velero
Operational Improvements
- GitOps workflows via Argo CD
- Centralized logging via Loki
- Advanced networking via Cilium
- Secret management via External Secrets
- Automated scaling via HPA/VPA
Cost Savings
- $0 cloud costs during development
- Fuller hardware utilization across both machines
- Eliminate resource waste from idle M3
Risk Mitigation
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Network latency via Tailscale | Medium | Medium | Use subnet routing, monitor with Hubble |
| Storage sync issues | Low | High | Longhorn battle-tested, 2 replicas minimum |
| Split-brain scenarios | Low | High | K8s leader election, odd-numbered etcd quorum |
| Service disruption during migration | High | Medium | Migrate incrementally, keep OrbStack running in parallel |
| Configuration complexity | High | Low | Document everything, use GitOps, runbooks |
| M3 failure affecting critical services | Medium | High | Keep critical services on M4, use anti-affinity rules |
Rollback Plan
If issues arise during migration:
    # 1. Stop K3s services
    sudo systemctl stop k3s        # M4
    sudo systemctl stop k3s-agent  # M3

    # 2. Restore OrbStack Kubernetes
    orbstack config set kubernetes.enabled true

    # 3. Restore databases from backups
    # (restoration commands per database)

    # 4. Restart OrbStack containers
    docker start $(docker ps -aq)

    # 5. Verify services are running
    curl http://localhost:4000/health  # Agent Router
    curl http://localhost:3003/health  # Agent Mesh
Maintenance Windows
Recommended maintenance schedule:
- Daily: Automated backups (2am)
- Weekly: Health checks and log review (Sunday)
- Monthly: Security updates and patches
- Quarterly: Performance review and optimization
Next Steps After Completion
- Scale to additional machines (if available)
- Implement advanced features:
  - Ray for distributed Python workloads
  - Kubeflow for ML pipelines
  - Temporal for complex workflows
- Optimize costs:
  - Fine-tune resource requests/limits
  - Implement spot instance patterns
- Production readiness:
  - Disaster recovery drills
  - Load testing at scale
  - Security audits
Contact & Support
Internal Resources:
- GitLab: https://gitlab.com/blueflyio/agent-platform
- Documentation: https://docs.blueflyagents.com
- Wiki: $LLM_ROOT/WIKIs/technical-docs.wiki
External Resources:
- K3s: https://k3s.io
- Cilium: https://cilium.io
- Longhorn: https://longhorn.io
- Argo CD: https://argo-cd.readthedocs.io
End of Technical Implementation Plan