Kubernetes Cluster Setup

Production-ready Kubernetes deployment for the LLM Platform

Last Updated: 2025-01-XX
Status: Production + Planning

Overview

This guide covers setting up a production Kubernetes cluster for the LLM Platform with high availability, auto-scaling, and comprehensive monitoring.

Supported Platforms: AWS EKS, Google GKE, Azure AKS, On-premise, OrbStack (local), Multi-machine K3s
Kubernetes Version: 1.28+
Namespace: llm-platform
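A quick pre-flight check can confirm that a cluster or client meets the 1.28+ requirement before any charts are installed. The sketch below is a hypothetical helper (the name `version_at_least` is not part of the platform) that compares a Kubernetes version string against a minimum major.minor pair in plain POSIX shell. Patch levels and vendor suffixes are ignored.

```shell
# Hypothetical helper: succeed when version $1 (e.g. "v1.32.3") is at least
# major.minor $2 (e.g. "1.28"). Patch levels and suffixes are ignored.
version_at_least() {
  have=${1#v}
  want=${2#v}

  have_major=${have%%.*}
  have_rest=${have#*.}
  have_minor=${have_rest%%.*}
  have_minor=${have_minor%%[!0-9]*}   # drop suffixes such as "28+"

  want_major=${want%%.*}
  want_rest=${want#*.}
  want_minor=${want_rest%%.*}
  want_minor=${want_minor%%[!0-9]*}

  [ "$have_major" -gt "$want_major" ] && return 0
  [ "$have_major" -lt "$want_major" ] && return 1
  [ "$have_minor" -ge "$want_minor" ]
}
```

For example, `version_at_least "$(kubectl version --client -o json | jq -r '.clientVersion.gitVersion')" 1.28` checks the local client, assuming `jq` is installed.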

Current Infrastructure State

OrbStack (Mac M4 - Local Development)

Version & Status:

  • OrbStack 2.0.5 (build 2000500)
  • Docker 28.5.2 + Kubernetes
  • Platform: macOS Darwin 25.1.0 (arm64)
  • kubectl v1.32.3

Active Resources:

  • 25 active namespaces
  • 100+ deployments/StatefulSets
  • 30+ running containers
  • 100+ Docker images (30 active, ~50GB total)

Key Namespaces:

  • development - Primary local development environment
  • production - Production deployments
  • staging - Staging environment
  • kagent - Knowledge agent platform (24 deployments)
  • csma - CSMA core services
  • databases - Database services
  • monitoring - Prometheus, Grafana

Target Architecture: Multi-Machine K3s Cluster

Hardware:

  • Mac M4 (100.108.129.7) - K3s Control Plane
  • Mac M3 (100.108.180.36) - K3s Worker Node
  • GL-BE3600 Router (100.116.110.123) - Subnet router via Tailscale

Network:

  • Tailscale mesh (tailcf98b3.ts.net)
  • Subnet routing: 192.168.8.0/24
  • MagicDNS enabled

Service Distribution:

  • Mac M4 (Control Plane): etcd, kube-apiserver, PostgreSQL (primary), Redis (master), Agent Mesh, Agent Router, Observability stack
  • Mac M3 (Worker Node): Agent Brain, Ollama, Phoenix, Agent Studio, LibreChat, LiteLLM replicas, Database replicas

Storage:

  • Longhorn distributed storage (2 replicas per volume across both nodes)
  • Backup to MinIO (9000/9001)
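The two-replica Longhorn layout could be expressed as a StorageClass. The following is a minimal sketch, not the platform's actual manifest: the function name and the class name `longhorn-2r` are placeholders, while `driver.longhorn.io` and `numberOfReplicas` are the provisioner and parameter Longhorn uses.

```shell
# Print a Longhorn StorageClass manifest with the requested replica count.
# "longhorn-2r" is a placeholder class name, not taken from the platform.
longhorn_storageclass() {
  replicas=${1:-2}
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2r
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "${replicas}"
EOF
}
```

The output can be reviewed first and then piped to the cluster with `longhorn_storageclass 2 | kubectl apply -f -`.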

See System Architecture Overview for complete details.

Prerequisites

    # Install kubectl
    brew install kubectl

    # Install Helm
    brew install helm

    # Verify installations
    kubectl version --client
    helm version

Quick Start (OrbStack - Local Development)

    # 1. Install OrbStack
    brew install orbstack

    # 2. Start OrbStack's built-in Kubernetes cluster
    orb start k8s

    # 3. Set kubectl context
    kubectl config use-context orbstack

    # 4. Deploy platform
    cd "$LLM_ROOT/llm-platform"
    helm install llm-platform infrastructure/helm-chart/ \
      --namespace llm-platform \
      --create-namespace

    # 5. Wait for pods to be ready
    kubectl wait --for=condition=ready pod --all -n llm-platform --timeout=300s

    # 6. Access platform (port-forward blocks, so run `open` from a second terminal)
    kubectl port-forward -n llm-platform svc/drupal 8080:80
    open http://localhost:8080
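With 25 namespaces and several clusters in play, it is easy to run the quick start against the wrong context. A small guard like the one below can help. It is a sketch under the assumption that the context is named `orbstack`, as in step 3, and `require_context` is a hypothetical name.

```shell
# Hypothetical guard: fail unless kubectl currently points at the expected context.
require_context() {
  expected=$1
  current=$(kubectl config current-context 2>/dev/null) || current=""
  if [ "$current" != "$expected" ]; then
    echo "Expected kubectl context '$expected' but found '${current:-none}'" >&2
    return 1
  fi
}
```

Used as `require_context orbstack && helm install ...`, the deploy step runs only when the local cluster is selected.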

Production Cluster Setup

1. Create Cluster

AWS EKS

    # Install eksctl
    brew install eksctl

    # Create cluster
    eksctl create cluster \
      --name llm-platform-prod \
      --version 1.28 \
      --region us-east-1 \
      --nodegroup-name standard-workers \
      --node-type m5.2xlarge \
      --nodes 3 \
      --nodes-min 3 \
      --nodes-max 10 \
      --managed

    # Configure kubectl
    aws eks update-kubeconfig \
      --name llm-platform-prod \
      --region us-east-1

Google GKE

    # Install gcloud CLI
    brew install google-cloud-sdk

    # Create cluster
    gcloud container clusters create llm-platform-prod \
      --zone us-central1-a \
      --num-nodes 3 \
      --machine-type n1-standard-4 \
      --enable-autoscaling \
      --min-nodes 3 \
      --max-nodes 10 \
      --enable-autorepair \
      --enable-autoupgrade

    # Configure kubectl
    gcloud container clusters get-credentials llm-platform-prod \
      --zone us-central1-a

2. Install Cluster Components

Install Ingress Controller

    # NGINX Ingress Controller
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update

    helm install ingress-nginx ingress-nginx/ingress-nginx \
      --namespace ingress-nginx \
      --create-namespace \
      --set controller.replicaCount=2 \
      --set controller.nodeSelector."kubernetes\.io/os"=linux \
      --set defaultBackend.nodeSelector."kubernetes\.io/os"=linux

Install Cert Manager (TLS)

    # Cert Manager for automatic TLS certificates
    helm repo add jetstack https://charts.jetstack.io
    helm repo update

    helm install cert-manager jetstack/cert-manager \
      --namespace cert-manager \
      --create-namespace \
      --set installCRDs=true

Install Metrics Server

    # Metrics Server for auto-scaling
    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

3. Create Namespace and Secrets

    # Create namespace
    kubectl create namespace llm-platform

    # Create secrets (values are quoted so keys containing spaces or shell metacharacters are passed intact)
    kubectl create secret generic api-keys \
      --from-literal=anthropic-api-key="${ANTHROPIC_API_KEY}" \
      --from-literal=openai-api-key="${OPENAI_API_KEY}" \
      --namespace llm-platform

    kubectl create secret generic database \
      --from-literal=password="${DB_PASSWORD}" \
      --namespace llm-platform

    kubectl create secret generic redis \
      --from-literal=password="${REDIS_PASSWORD}" \
      --namespace llm-platform
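If any of the four variables is unset, `kubectl create secret` still succeeds and stores an empty value, which tends to surface later as an authentication failure. A pre-flight check along these lines can catch that early. This is a sketch, and `require_env` is a hypothetical helper name.

```shell
# Hypothetical pre-flight check: report every named variable that is unset or empty.
# Returns 0 only when all of them have a value.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `require_env ANTHROPIC_API_KEY OPENAI_API_KEY DB_PASSWORD REDIS_PASSWORD || exit 1` before the secret commands stops the script when a value is missing.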

Helm Chart Deployment

Chart Structure

infrastructure/helm-chart/
├── Chart.yaml
├── values.yaml
├── values-dev.yaml
├── values-staging.yaml
├── values-prod.yaml
└── templates/
    ├── deployments/
    │   ├── drupal.yaml
    │   ├── llm-gateway.yaml
    │   ├── agent-mesh.yaml
    │   └── agent-tracer.yaml
    ├── services/
    │   ├── drupal-service.yaml
    │   ├── llm-gateway-service.yaml
    │   └── agent-mesh-service.yaml
    ├── statefulsets/
    │   ├── postgres.yaml
    │   ├── redis.yaml
    │   └── qdrant.yaml
    ├── ingress.yaml
    ├── hpa.yaml
    └── configmaps/

Deploy Platform

    # Development
    helm install llm-platform infrastructure/helm-chart/ \
      --namespace llm-platform \
      --create-namespace \
      --values infrastructure/helm-chart/values-dev.yaml

    # Staging
    helm install llm-platform infrastructure/helm-chart/ \
      --namespace llm-platform \
      --create-namespace \
      --values infrastructure/helm-chart/values-staging.yaml

    # Production (the --set arguments are quoted so zsh does not treat [0] as a glob pattern)
    helm install llm-platform infrastructure/helm-chart/ \
      --namespace llm-platform \
      --create-namespace \
      --values infrastructure/helm-chart/values-prod.yaml \
      --set 'ingress.hosts[0].host=llm-platform.example.com' \
      --set 'ingress.tls[0].secretName=llm-platform-tls'
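The three commands differ only in the values file, so the choice can be reduced to an environment name. The helpers below are a sketch with hypothetical names (`values_file_for`, `helm_deploy_cmd`). They print the command rather than run it, and they use `helm upgrade --install` so the same line works for a first install and for later updates.

```shell
# Hypothetical helper: map an environment name to its values file.
values_file_for() {
  case "$1" in
    dev|staging|prod)
      echo "infrastructure/helm-chart/values-$1.yaml"
      ;;
    *)
      echo "Unknown environment: $1 (expected dev, staging, or prod)" >&2
      return 1
      ;;
  esac
}

# Print the Helm command for an environment so it can be reviewed before running.
helm_deploy_cmd() {
  values=$(values_file_for "$1") || return 1
  echo "helm upgrade --install llm-platform infrastructure/helm-chart/ --namespace llm-platform --create-namespace --values $values"
}
```

For example, `helm_deploy_cmd staging` prints the staging command, and an unrecognized name such as `qa` fails before Helm is ever invoked.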

Update Deployment

    # Upgrade deployment
    helm upgrade llm-platform infrastructure/helm-chart/ \
      --namespace llm-platform \
      --values infrastructure/helm-chart/values-prod.yaml

    # Rollback if needed
    helm rollback llm-platform -n llm-platform

Component Configuration

Drupal Platform

    # values-prod.yaml
    drupal:
      replicaCount: 3
      image:
        repository: registry.gitlab.com/bluefly/llm-platform
        tag: "1.0.0"
      resources:
        requests:
          cpu: 1000m
          memory: 2Gi
        limits:
          cpu: 2000m
          memory: 4Gi
      autoscaling:
        enabled: true
        minReplicas: 3
        maxReplicas: 10
        targetCPUUtilizationPercentage: 70

    ingress:
      enabled: true
      className: nginx
      hosts:
        - host: llm-platform.example.com
          paths:
            - path: /
              pathType: Prefix
      tls:
        - secretName: llm-platform-tls
          hosts:
            - llm-platform.example.com

PostgreSQL

    postgres:
      enabled: true
      primary:
        persistence:
          enabled: true
          size: 100Gi
          storageClass: gp3
        resources:
          requests:
            cpu: 2000m
            memory: 4Gi
          limits:
            cpu: 4000m
            memory: 8Gi
      auth:
        username: llm_user
        database: llm_platform
        existingSecret: database

Redis

    redis:
      enabled: true
      architecture: replication
      master:
        persistence:
          enabled: true
          size: 10Gi
      replica:
        replicaCount: 2
      auth:
        existingSecret: redis

Qdrant

    qdrant:
      enabled: true
      replicaCount: 3
      persistence:
        enabled: true
        size: 100Gi
      resources:
        requests:
          cpu: 2000m
          memory: 4Gi
        limits:
          cpu: 4000m
          memory: 8Gi

LLM Gateway

    llmGateway:
      replicaCount: 3
      image:
        repository: registry.gitlab.com/bluefly/llm-gateway
        tag: "1.0.0"
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 1000m
          memory: 1Gi
      env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: api-keys
              key: anthropic-api-key

Agent Mesh

    agentMesh:
      replicaCount: 2
      image:
        repository: registry.gitlab.com/bluefly/agent-mesh
        tag: "1.0.0"
      service:
        type: ClusterIP
        httpPort: 3005
        grpcPort: 50051
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 1000m
          memory: 1Gi
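Summing the resource requests above gives a rough lower bound for node sizing. The sketch below multiplies replicas by per-pod requests for each component. It assumes a single PostgreSQL primary and leaves Redis out, because the values shown set no requests for it. The table and the function name `total_requests` are illustrative only.

```shell
# Columns: component, replicas, CPU request (millicores), memory request (Mi).
# Figures are copied from the values shown above; one PostgreSQL primary is assumed.
requests='drupal 3 1000 2048
postgres 1 2000 4096
qdrant 3 2000 4096
llm-gateway 3 500 512
agent-mesh 2 500 512'

# Read the table on stdin and print "<total CPU millicores> <total memory Mi>".
total_requests() {
  awk '{ cpu += $2 * $3; mem += $2 * $4 } END { print cpu, mem }'
}

printf '%s\n' "$requests" | total_requests   # prints "13500 25088"
```

That is 13.5 vCPU and 24.5 GiB at minimum replica counts. Three m5.2xlarge nodes (8 vCPU and 32 GiB each) cover it with headroom, though system pods, the monitoring stack, and HPA scale-out all draw on the remainder.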

Auto-Scaling Configuration

Horizontal Pod Autoscaler (HPA)

    # templates/hpa.yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: drupal-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: drupal
      minReplicas: 3
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80
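To see how these targets translate into replica counts, the HPA scaling rule can be worked through by hand: desired = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to the min and max. The function below is a simplified sketch of that rule (the name `hpa_desired_replicas` is hypothetical). It ignores the controller's tolerance band and stabilization windows.

```shell
# Simplified HPA rule: ceil(current * utilization / target), clamped to [min, max].
# Arguments: current replicas, observed utilization %, target utilization %, min, max.
hpa_desired_replicas() {
  current=$1
  utilization=$2
  target=$3
  min=$4
  max=$5
  # Integer ceiling division.
  desired=$(( (current * utilization + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired=$min
  [ "$desired" -gt "$max" ] && desired=$max
  echo "$desired"
}

hpa_desired_replicas 3 90 70 3 10    # prints 4
hpa_desired_replicas 8 140 70 3 10   # prints 10 (capped at maxReplicas)
```

With both CPU and memory metrics defined, as in the manifest above, the controller computes a count for each metric and uses the larger one.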

Vertical Pod Autoscaler (VPA)

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: drupal-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: drupal
      updatePolicy:
        updateMode: "Auto"

Cluster Autoscaler

    # AWS EKS
    eksctl create nodegroup \
      --cluster llm-platform-prod \
      --name autoscaling-workers \
      --node-type m5.2xlarge \
      --nodes-min 3 \
      --nodes-max 20 \
      --managed \
      --asg-access

Monitoring Setup

Prometheus Stack

    # Install Prometheus + Grafana
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    # Set GRAFANA_ADMIN_PASSWORD beforehand rather than shipping a default such as "admin"
    helm install prometheus prometheus-community/kube-prometheus-stack \
      --namespace monitoring \
      --create-namespace \
      --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD}" \
      --set prometheus.prometheusSpec.retention=30d

Arize Phoenix

    # Deploy Phoenix for LLM tracing
    kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: phoenix
      namespace: llm-platform
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: phoenix
      template:
        metadata:
          labels:
            app: phoenix
        spec:
          containers:
            - name: phoenix
              image: arizephoenix/phoenix:latest
              ports:
                - containerPort: 6006
                - containerPort: 4317
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: phoenix
      namespace: llm-platform
    spec:
      selector:
        app: phoenix
      ports:
        - name: ui
          port: 6006
          targetPort: 6006
        - name: otlp
          port: 4317
          targetPort: 4317
    EOF
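Workloads that export traces need the in-cluster address of the `otlp` port defined above. The sketch below builds it from the Service name and namespace, assuming the default `cluster.local` cluster domain. `svc_dns` is a hypothetical helper, and `OTEL_EXPORTER_OTLP_ENDPOINT` is the standard OpenTelemetry SDK variable.

```shell
# Hypothetical helper: build the in-cluster DNS name for a Service,
# assuming the default "cluster.local" cluster domain.
svc_dns() {
  echo "$1.$2.svc.cluster.local"
}

# OTLP gRPC endpoint for the phoenix Service defined above (port 4317).
OTEL_EXPORTER_OTLP_ENDPOINT="http://$(svc_dns phoenix llm-platform):4317"
echo "$OTEL_EXPORTER_OTLP_ENDPOINT"   # prints http://phoenix.llm-platform.svc.cluster.local:4317
```

The same helper gives `postgres.llm-platform.svc.cluster.local`, the host used in the database connection test under Troubleshooting.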

Verify Deployment

    # Check all pods are running
    kubectl get pods -n llm-platform

    # Check services
    kubectl get svc -n llm-platform

    # Check ingress
    kubectl get ingress -n llm-platform

    # View logs
    kubectl logs -n llm-platform deployment/drupal -f

    # Check auto-scaling
    kubectl get hpa -n llm-platform

Access Platform

    # Get external IP (cloud)
    kubectl get ingress -n llm-platform

    # Port forward (local/testing)
    kubectl port-forward -n llm-platform svc/drupal 8080:80

    # Access specific services
    kubectl port-forward -n llm-platform svc/llm-gateway 4000:4000
    kubectl port-forward -n llm-platform svc/qdrant 6333:6333

    # Grafana runs in the monitoring namespace; the kube-prometheus-stack release
    # named "prometheus" exposes it as the prometheus-grafana Service
    kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Troubleshooting

Pods Not Starting

    # Check pod status
    kubectl describe pod -n llm-platform <pod-name>

    # Check events
    kubectl get events -n llm-platform --sort-by='.lastTimestamp'

    # Check resource limits
    kubectl top pods -n llm-platform
    kubectl top nodes
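Before describing individual pods, it can help to list only the ones that are not healthy. The helper below is a sketch (the name `pods_not_running` is hypothetical) that uses kubectl's `--field-selector` flag to hide pods in the Running or Succeeded phase.

```shell
# Hypothetical helper: list pods that are neither Running nor Succeeded.
# The namespace defaults to llm-platform.
pods_not_running() {
  kubectl get pods -n "${1:-llm-platform}" \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
}
```

Pods whose containers are crash-looping can still report the Running phase, so this narrows the search but does not replace `kubectl describe pod` and the event log.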

Database Connection Issues

    # Test database connection
    kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
      psql -h postgres.llm-platform.svc.cluster.local -U llm_user -d llm_platform

    # Check database pod
    kubectl logs -n llm-platform statefulset/postgres

Performance Issues

    # Check resource usage
    kubectl top pods -n llm-platform
    kubectl top nodes

    # Check HPA status
    kubectl get hpa -n llm-platform

    # Scale manually if needed
    kubectl scale deployment drupal --replicas=5 -n llm-platform