Kubernetes Cluster Setup
Production-ready Kubernetes deployment for the LLM Platform
Last Updated: 2025-01-XX
Status: Production + Planning
Overview
This guide covers setting up a production Kubernetes cluster for the LLM Platform with high availability, auto-scaling, and comprehensive monitoring.
Supported Platforms: AWS EKS, Google GKE, Azure AKS, On-premise, OrbStack (local), Multi-machine K3s
Kubernetes Version: 1.28+
Namespace: llm-platform
Current Infrastructure State
OrbStack (Mac M4 - Local Development)
Version & Status:
- OrbStack 2.0.5 (build 2000500)
- Docker 28.5.2 + Kubernetes
- Platform: macOS Darwin 25.1.0 (arm64)
- kubectl v1.32.3
Active Resources:
- 25 active namespaces
- 100+ deployments/StatefulSets
- 30+ running containers
- 100+ Docker images (30 active, ~50GB total)
Key Namespaces:
- development - Primary local development environment
- production - Production deployments
- staging - Staging environment
- kagent - Knowledge agent platform (24 deployments)
- csma - CSMA core services
- databases - Database services
- monitoring - Prometheus, Grafana
Target Architecture: Multi-Machine K3s Cluster
Hardware:
- Mac M4 (100.108.129.7) - K3s Control Plane
- Mac M3 (100.108.180.36) - K3s Worker Node
- GL-BE3600 Router (100.116.110.123) - Subnet router via Tailscale
Network:
- Tailscale mesh (tailcf98b3.ts.net)
- Subnet routing: 192.168.8.0/24
- MagicDNS enabled
Service Distribution:
- Mac M4 (Control Plane): etcd, kube-apiserver, PostgreSQL (primary), Redis (master), Agent Mesh, Agent Router, Observability stack
- Mac M3 (Worker Node): Agent Brain, Ollama, Phoenix, Agent Studio, LibreChat, LiteLLM replicas, Database replicas
Storage:
- Longhorn distributed storage (2 replicas per volume across both nodes)
- Backup to MinIO (9000/9001)
See System Architecture Overview for complete details.
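The two-replica Longhorn layout described above could be expressed as a StorageClass. The following is a minimal sketch, not the platform's actual manifest: the class name `longhorn-2r` and the helper `render_longhorn_storageclass` are hypothetical, and the reclaim policy and stale-replica timeout are assumed defaults that should be checked against the real cluster.

```shell
#!/bin/sh
# Print a Longhorn StorageClass manifest to stdout.
# $1 = class name (hypothetical default: longhorn-2r)
# $2 = replicas per volume (default 2, matching the two-node layout above)
render_longhorn_storageclass() {
  _sc_name="${1:-longhorn-2r}"
  _sc_replicas="${2:-2}"
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${_sc_name}
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete          # assumption: adjust if volumes must outlive their claims
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "${_sc_replicas}"
  staleReplicaTimeout: "2880"  # minutes; assumed value, tune as needed
EOF
}
```

Piping the output into `kubectl apply -f -` (for example `render_longhorn_storageclass | kubectl apply -f -`) would create the class once Longhorn is installed. With only two nodes, a replica count above 2 cannot be satisfied.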
Prerequisites
# Install kubectl
brew install kubectl

# Install Helm
brew install helm

# Verify installations
kubectl version --client
helm version
Quick Start (OrbStack - Local Development)
# 1. Install OrbStack
brew install orbstack

# 2. Start OrbStack's built-in Kubernetes cluster
#    (OrbStack provides a single cluster; `orb create` is for Linux machines)
orb start k8s

# 3. Set kubectl context
kubectl config use-context orbstack

# 4. Deploy platform
cd "$LLM_ROOT/llm-platform"
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace

# 5. Wait for pods to be ready
kubectl wait --for=condition=ready pod --all -n llm-platform --timeout=300s

# 6. Access platform (port-forward blocks, so run `open` from a second terminal)
kubectl port-forward -n llm-platform svc/drupal 8080:80
open http://localhost:8080
Production Cluster Setup
1. Create Cluster
AWS EKS
# Install eksctl
brew install eksctl

# Create cluster
eksctl create cluster \
  --name llm-platform-prod \
  --version 1.28 \
  --region us-east-1 \
  --nodegroup-name standard-workers \
  --node-type m5.2xlarge \
  --nodes 3 \
  --nodes-min 3 \
  --nodes-max 10 \
  --managed

# Configure kubectl
aws eks update-kubeconfig \
  --name llm-platform-prod \
  --region us-east-1
Google GKE
# Install gcloud CLI
brew install google-cloud-sdk

# Create cluster
gcloud container clusters create llm-platform-prod \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 3 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade

# Configure kubectl
gcloud container clusters get-credentials llm-platform-prod \
  --zone us-central1-a
2. Install Cluster Components
Install Ingress Controller
# NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.nodeSelector."kubernetes\.io/os"=linux \
  --set defaultBackend.nodeSelector."kubernetes\.io/os"=linux
Install Cert Manager (TLS)
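# Cert Manager for automatic TLS certificates
helm repo add jetstack https://charts.jetstack.io
helm repo update

# crds.enabled replaces the deprecated installCRDs value in recent chart
# versions; on older charts use --set installCRDs=true instead
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true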
Install Metrics Server
# Metrics Server for auto-scaling
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
3. Create Namespace and Secrets
# Create namespace
kubectl create namespace llm-platform

# Create secrets (values are quoted so special characters survive the shell)
kubectl create secret generic api-keys \
  --from-literal=anthropic-api-key="${ANTHROPIC_API_KEY}" \
  --from-literal=openai-api-key="${OPENAI_API_KEY}" \
  --namespace llm-platform

kubectl create secret generic database \
  --from-literal=password="${DB_PASSWORD}" \
  --namespace llm-platform

kubectl create secret generic redis \
  --from-literal=password="${REDIS_PASSWORD}" \
  --namespace llm-platform
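The secret-creation commands above silently store empty values if any of the environment variables is unset. A small pre-flight check can catch that first. This is a sketch only; `require_env` is a hypothetical helper, not part of the platform tooling.

```shell
#!/bin/sh
# Return non-zero if any of the named environment variables is unset or empty.
# Each missing name is reported on stderr so all problems show up in one run.
require_env() {
  _missing=0
  for _name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $_name
    eval "_value=\${${_name}:-}"
    if [ -z "$_value" ]; then
      echo "missing required variable: ${_name}" >&2
      _missing=1
    fi
  done
  return "$_missing"
}

# Example guard before creating secrets:
#   require_env ANTHROPIC_API_KEY OPENAI_API_KEY DB_PASSWORD REDIS_PASSWORD || exit 1
```

Placing the guard at the top of a provisioning script keeps a half-configured shell from producing secrets that look valid but hold empty strings.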
Helm Chart Deployment
Chart Structure
infrastructure/helm-chart/
  Chart.yaml
  values.yaml
  values-dev.yaml
  values-staging.yaml
  values-prod.yaml
  templates/
    deployments/
      drupal.yaml
      llm-gateway.yaml
      agent-mesh.yaml
      agent-tracer.yaml
    services/
      drupal-service.yaml
      llm-gateway-service.yaml
      agent-mesh-service.yaml
    statefulsets/
      postgres.yaml
      redis.yaml
      qdrant.yaml
    ingress.yaml
    hpa.yaml
    configmaps/
Deploy Platform
# Development
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace \
  --values infrastructure/helm-chart/values-dev.yaml

# Staging
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace \
  --values infrastructure/helm-chart/values-staging.yaml

# Production
# The --set arguments are quoted because zsh (the macOS default shell)
# otherwise treats [0] as a glob pattern and aborts with "no matches found"
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace \
  --values infrastructure/helm-chart/values-prod.yaml \
  --set "ingress.hosts[0].host=llm-platform.example.com" \
  --set "ingress.tls[0].secretName=llm-platform-tls"
Update Deployment
# Upgrade deployment
helm upgrade llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --values infrastructure/helm-chart/values-prod.yaml

# Rollback if needed (rolls back to the previous revision)
helm rollback llm-platform -n llm-platform
Component Configuration
Drupal Platform
# values-prod.yaml
drupal:
  replicaCount: 3
  image:
    repository: registry.gitlab.com/bluefly/llm-platform
    tag: "1.0.0"
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: llm-platform.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: llm-platform-tls
      hosts:
        - llm-platform.example.com
PostgreSQL
postgres:
  enabled: true
  primary:
    persistence:
      enabled: true
      size: 100Gi
      storageClass: gp3
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi
      limits:
        cpu: 4000m
        memory: 8Gi
  auth:
    username: llm_user
    database: llm_platform
    existingSecret: database
Redis
redis:
  enabled: true
  architecture: replication
  master:
    persistence:
      enabled: true
      size: 10Gi
  replica:
    replicaCount: 2
  auth:
    existingSecret: redis
Qdrant
qdrant:
  enabled: true
  replicaCount: 3
  persistence:
    enabled: true
    size: 100Gi
  resources:
    requests:
      cpu: 2000m
      memory: 4Gi
    limits:
      cpu: 4000m
      memory: 8Gi
LLM Gateway
llmGateway:
  replicaCount: 3
  image:
    repository: registry.gitlab.com/bluefly/llm-gateway
    tag: "1.0.0"
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  env:
    - name: ANTHROPIC_API_KEY
      valueFrom:
        secretKeyRef:
          name: api-keys
          key: anthropic-api-key
Agent Mesh
agentMesh:
  replicaCount: 2
  image:
    repository: registry.gitlab.com/bluefly/agent-mesh
    tag: "1.0.0"
  service:
    type: ClusterIP
    httpPort: 3005
    grpcPort: 50051
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi
Auto-Scaling Configuration
Horizontal Pod Autoscaler (HPA)
# templates/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: drupal-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: drupal
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
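To reason about how the HPA above reacts to load, it helps to work through the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current utilization / target utilization), evaluated per metric, with the largest result winning and the answer clamped to the min/max bounds. The sketch below hard-codes the 70% CPU and 80% memory targets and the 3 to 10 replica bounds from the manifest. The function names are hypothetical, and the estimate ignores the controller's tolerance band and stabilization windows, so it is an approximation rather than a reimplementation.

```shell
#!/bin/sh
# ceil(current_replicas * current_util / target_util) using integer arithmetic.
# $1 = current replicas, $2 = observed utilization (%), $3 = target utilization (%)
desired_replicas() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Clamp a replica count into [min, max].
# $1 = value, $2 = min, $3 = max
clamp_replicas() {
  if [ "$1" -lt "$2" ]; then
    echo "$2"
  elif [ "$1" -gt "$3" ]; then
    echo "$3"
  else
    echo "$1"
  fi
}

# Estimate the replica count the drupal HPA would aim for.
# $1 = current replicas, $2 = average CPU utilization (%), $3 = average memory utilization (%)
hpa_estimate() {
  _cpu=$(desired_replicas "$1" "$2" 70)
  _mem=$(desired_replicas "$1" "$3" 80)
  # With several metrics, the HPA uses the largest proposed replica count
  if [ "$_cpu" -gt "$_mem" ]; then _want=$_cpu; else _want=$_mem; fi
  clamp_replicas "$_want" 3 10
}
```

For example, `hpa_estimate 3 90 50` prints `4`: three pods at 90% CPU against a 70% target need ceil(3 × 90 / 70) = 4 pods, while memory at 50% would ask for fewer.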
Vertical Pod Autoscaler (VPA)
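# Note: the VPA project advises against running VPA in "Auto" mode together
# with an HPA that scales the same Deployment on CPU or memory, as configured
# for drupal in this guide. Use updateMode "Off" (recommendations only) or drive
# the HPA from custom metrics if both are needed.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: drupal-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: drupal
  updatePolicy:
    updateMode: "Auto"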
Cluster Autoscaler
# AWS EKS
eksctl create nodegroup \
  --cluster llm-platform-prod \
  --name autoscaling-workers \
  --node-type m5.2xlarge \
  --nodes-min 3 \
  --nodes-max 20 \
  --managed \
  --asg-access
Monitoring Setup
Prometheus Stack
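# Install Prometheus + Grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# GRAFANA_ADMIN_PASSWORD is a placeholder variable: export a strong value
# first instead of shipping the default "admin" password to production
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d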
Arize Phoenix
# Deploy Phoenix for LLM tracing
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phoenix
  namespace: llm-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phoenix
  template:
    metadata:
      labels:
        app: phoenix
    spec:
      containers:
        - name: phoenix
          image: arizephoenix/phoenix:latest
          ports:
            - containerPort: 6006
            - containerPort: 4317
---
apiVersion: v1
kind: Service
metadata:
  name: phoenix
  namespace: llm-platform
spec:
  selector:
    app: phoenix
  ports:
    - name: ui
      port: 6006
      targetPort: 6006
    - name: otlp
      port: 4317
      targetPort: 4317
EOF
Verify Deployment
# Check all pods are running
kubectl get pods -n llm-platform

# Check services
kubectl get svc -n llm-platform

# Check ingress
kubectl get ingress -n llm-platform

# View logs
kubectl logs -n llm-platform deployment/drupal -f

# Check auto-scaling
kubectl get hpa -n llm-platform
Access Platform
# Get external IP (cloud)
kubectl get ingress -n llm-platform

# Port forward (local/testing)
kubectl port-forward -n llm-platform svc/drupal 8080:80

# Access specific services
kubectl port-forward -n llm-platform svc/llm-gateway 4000:4000
kubectl port-forward -n llm-platform svc/qdrant 6333:6333

# Grafana is installed by the kube-prometheus-stack release "prometheus" in the
# monitoring namespace, so its service is named prometheus-grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Troubleshooting
Pods Not Starting
# Check pod status
kubectl describe pod -n llm-platform <pod-name>

# Check events
kubectl get events -n llm-platform --sort-by='.lastTimestamp'

# Check resource usage against limits (requires Metrics Server)
kubectl top pods -n llm-platform
kubectl top nodes
Database Connection Issues
# Test database connection
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h postgres.llm-platform.svc.cluster.local -U llm_user -d llm_platform

# Check database pod
kubectl logs -n llm-platform statefulset/postgres
Performance Issues
# Check resource usage
kubectl top pods -n llm-platform
kubectl top nodes

# Check HPA status
kubectl get hpa -n llm-platform

# Scale manually if needed
kubectl scale deployment drupal --replicas=5 -n llm-platform