Kubernetes Cluster Setup

Production-ready Kubernetes deployment for the LLM Platform

Last Updated: 2025-01-XX
Status: Production + Planning

Overview

This guide covers setting up a production Kubernetes cluster for the LLM Platform with high availability, auto-scaling, and comprehensive monitoring.

Supported Platforms: AWS EKS, Google GKE, Azure AKS, On-premise, OrbStack (local), Multi-machine K3s
Kubernetes Version: 1.28+
Namespace: llm-platform

Current Infrastructure State

OrbStack (Mac M4 - Local Development)

Version & Status:

OrbStack 2.0.5 (build 2000500)
Docker 28.5.2 + Kubernetes
Platform: macOS Darwin 25.1.0 (arm64)
kubectl v1.32.3

Active Resources:

25 active namespaces
100+ deployments/StatefulSets
30+ running containers
100+ Docker images (30 active, ~50GB total)

Key Namespaces:

development - Primary local development environment
production - Production deployments
staging - Staging environment
kagent - Knowledge agent platform (24 deployments)
csma - CSMA core services
databases - Database services
monitoring - Prometheus, Grafana

Target Architecture: Multi-Machine K3s Cluster

Hardware:

Mac M4 (100.108.129.7) - K3s Control Plane
Mac M3 (100.108.180.36) - K3s Worker Node
GL-BE3600 Router (100.116.110.123) - Subnet router via Tailscale

Network:

Tailscale mesh (tailcf98b3.ts.net)
Subnet routing: 192.168.8.0/24
MagicDNS enabled

Service Distribution:

Mac M4 (Control Plane): etcd, kube-apiserver, PostgreSQL (primary), Redis (master), Agent Mesh, Agent Router, Observability stack
Mac M3 (Worker Node): Agent Brain, Ollama, Phoenix, Agent Studio, LibreChat, LiteLLM replicas, Database replicas

Storage:

Longhorn distributed storage (2 replicas per volume across both nodes)
Backup to MinIO (ports 9000/9001)

See System Architecture Overview for complete details.

Prerequisites

# Install kubectl
brew install kubectl

# Install Helm
brew install helm

# Verify installations
kubectl version --client
helm version
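
Before moving on, it can help to fail fast when a required CLI is missing. The sketch below is one way to do that; `require_tools` is a hypothetical helper name, not part of kubectl or Helm.

```shell
# Sketch: fail early when a required CLI is missing from PATH.
# require_tools is a hypothetical helper name.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_tools kubectl helm || exit 1
```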

Quick Start (OrbStack - Local Development)

# 1. Install OrbStack
brew install orbstack

# 2. Start OrbStack's built-in Kubernetes cluster
orb start k8s

# 3. Set kubectl context
kubectl config use-context orbstack

# 4. Deploy platform
cd "$LLM_ROOT/llm-platform"
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace

# 5. Wait for pods to be ready
kubectl wait --for=condition=ready pod --all -n llm-platform --timeout=300s

# 6. Access platform
kubectl port-forward -n llm-platform svc/drupal 8080:80
open http://localhost:8080
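
Because the same Helm commands work against any cluster, a guard that checks the active kubectl context before deploying can prevent installing into the wrong one. This is a minimal sketch; `assert_context` is a hypothetical helper, and `orbstack` is the context name used in step 3.

```shell
# Sketch: refuse to continue unless kubectl points at the expected context.
# assert_context is a hypothetical helper name.
assert_context() {
  expected="$1"
  current="$(kubectl config current-context)" || return 1
  if [ "$current" != "$expected" ]; then
    echo "kubectl context is '$current', expected '$expected'" >&2
    return 1
  fi
}

# Example: assert_context orbstack || exit 1
```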

Production Cluster Setup

1. Create Cluster

AWS EKS

# Install eksctl
brew install eksctl

# Create cluster
eksctl create cluster \
  --name llm-platform-prod \
  --version 1.28 \
  --region us-east-1 \
  --nodegroup-name standard-workers \
  --node-type m5.2xlarge \
  --nodes 3 \
  --nodes-min 3 \
  --nodes-max 10 \
  --managed

# Configure kubectl
aws eks update-kubeconfig \
  --name llm-platform-prod \
  --region us-east-1

Google GKE

# Install gcloud CLI
brew install google-cloud-sdk

# Create cluster
gcloud container clusters create llm-platform-prod \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 3 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade

# Configure kubectl
gcloud container clusters get-credentials llm-platform-prod \
  --zone us-central1-a

2. Install Cluster Components

Install Ingress Controller

# NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.nodeSelector."kubernetes\.io/os"=linux \
  --set defaultBackend.nodeSelector."kubernetes\.io/os"=linux

Install Cert Manager (TLS)

# Cert Manager for automatic TLS certificates
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
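
cert-manager issues certificates only after an issuer exists. The sketch below prints a Let's Encrypt ClusterIssuer manifest that uses the HTTP-01 solver through the NGINX ingress class. The issuer name `letsencrypt-prod`, the account-key Secret name, and the email address are illustrative assumptions, not values from this guide.

```shell
# Sketch: print a Let's Encrypt ClusterIssuer manifest for cert-manager.
# The issuer name, account-key Secret name, and email are assumptions.
cluster_issuer_manifest() {
  email="$1"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
EOF
}

# Example: cluster_issuer_manifest ops@example.com | kubectl apply -f -
```

An Ingress would then typically reference the issuer with the `cert-manager.io/cluster-issuer` annotation.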

Install Metrics Server

# Metrics Server for auto-scaling
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

3. Create Namespace and Secrets

# Create namespace
kubectl create namespace llm-platform

# Create secrets
kubectl create secret generic api-keys \
  --from-literal=anthropic-api-key="${ANTHROPIC_API_KEY}" \
  --from-literal=openai-api-key="${OPENAI_API_KEY}" \
  --namespace llm-platform

kubectl create secret generic database \
  --from-literal=password="${DB_PASSWORD}" \
  --namespace llm-platform

kubectl create secret generic redis \
  --from-literal=password="${REDIS_PASSWORD}" \
  --namespace llm-platform
  --namespace llm-platform
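
The commands above silently create empty secrets when a variable is unset. A small pre-flight check, sketched here with a hypothetical `require_env` helper, can catch that before anything reaches the cluster.

```shell
# Sketch: verify that every named environment variable is set and non-empty.
# require_env is a hypothetical helper name.
require_env() {
  status=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "environment variable $name is not set" >&2
      status=1
    fi
  done
  return "$status"
}

# Example:
# require_env ANTHROPIC_API_KEY OPENAI_API_KEY DB_PASSWORD REDIS_PASSWORD || exit 1
```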

Helm Chart Deployment

Chart Structure

infrastructure/helm-chart/
 Chart.yaml
 values.yaml
 values-dev.yaml
 values-staging.yaml
 values-prod.yaml
 templates/
     deployments/
        drupal.yaml
        llm-gateway.yaml
        agent-mesh.yaml
        agent-tracer.yaml
     services/
        drupal-service.yaml
        llm-gateway-service.yaml
        agent-mesh-service.yaml
     statefulsets/
        postgres.yaml
        redis.yaml
        qdrant.yaml
     ingress.yaml
     hpa.yaml
     configmaps/

Deploy Platform

# Development
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace \
  --values infrastructure/helm-chart/values-dev.yaml

# Staging
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace \
  --values infrastructure/helm-chart/values-staging.yaml

# Production
helm install llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --create-namespace \
  --values infrastructure/helm-chart/values-prod.yaml \
  --set 'ingress.hosts[0].host=llm-platform.example.com' \
  --set 'ingress.tls[0].secretName=llm-platform-tls'
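
The three commands differ only in the values file, so they can be folded into one function. The sketch below uses `helm upgrade --install`, which installs on the first run and upgrades afterwards. `deploy_env` is a hypothetical wrapper; the release name, chart path, and values file names come from this guide.

```shell
# Sketch: one idempotent deploy function for all three environments.
# deploy_env is a hypothetical wrapper name.
deploy_env() {
  env_name="$1"
  case "$env_name" in
    dev|staging|prod) ;;
    *) echo "unknown environment: $env_name" >&2; return 1 ;;
  esac
  # --wait blocks until resources are ready or the timeout expires.
  helm upgrade --install llm-platform infrastructure/helm-chart/ \
    --namespace llm-platform \
    --create-namespace \
    --values "infrastructure/helm-chart/values-${env_name}.yaml" \
    --wait \
    --timeout 10m
}

# Example: deploy_env staging
```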

Update Deployment

# Upgrade deployment
helm upgrade llm-platform infrastructure/helm-chart/ \
  --namespace llm-platform \
  --values infrastructure/helm-chart/values-prod.yaml

# Rollback if needed
helm rollback llm-platform -n llm-platform
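
Without a revision number, `helm rollback` returns to the previous release. To target a specific revision, list the history first with `helm history llm-platform -n llm-platform` and pass the number explicitly. A minimal sketch, with `rollback_to` as a hypothetical helper name:

```shell
# Sketch: roll back the llm-platform release to an explicit revision.
# rollback_to is a hypothetical helper name.
rollback_to() {
  revision="$1"
  case "$revision" in
    ''|*[!0-9]*) echo "revision must be a number" >&2; return 1 ;;
  esac
  helm rollback llm-platform "$revision" --namespace llm-platform --wait
}

# Example: rollback_to 3
```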

Component Configuration

Drupal Platform

# values-prod.yaml
drupal:
  replicaCount: 3
  image:
    repository: registry.gitlab.com/bluefly/llm-platform
    tag: "1.0.0"

  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi

  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

  ingress:
    enabled: true
    className: nginx
    hosts:
      - host: llm-platform.example.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: llm-platform-tls
        hosts:
          - llm-platform.example.com

PostgreSQL

postgres:
  enabled: true
  primary:
    persistence:
      enabled: true
      size: 100Gi
      storageClass: gp3
    resources:
      requests:
        cpu: 2000m
        memory: 4Gi
      limits:
        cpu: 4000m
        memory: 8Gi

  auth:
    username: llm_user
    database: llm_platform
    existingSecret: database

Redis

redis:
  enabled: true
  architecture: replication
  master:
    persistence:
      enabled: true
      size: 10Gi
  replica:
    replicaCount: 2
  auth:
    existingSecret: redis

Qdrant

qdrant:
  enabled: true
  replicaCount: 3
  persistence:
    enabled: true
    size: 100Gi
  resources:
    requests:
      cpu: 2000m
      memory: 4Gi
    limits:
      cpu: 4000m
      memory: 8Gi

LLM Gateway

llmGateway:
  replicaCount: 3
  image:
    repository: registry.gitlab.com/bluefly/llm-gateway
    tag: "1.0.0"

  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

  env:
    - name: ANTHROPIC_API_KEY
      valueFrom:
        secretKeyRef:
          name: api-keys
          key: anthropic-api-key

Agent Mesh

agentMesh:
  replicaCount: 2
  image:
    repository: registry.gitlab.com/bluefly/agent-mesh
    tag: "1.0.0"

  service:
    type: ClusterIP
    httpPort: 3005
    grpcPort: 50051

  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

Auto-Scaling Configuration

Horizontal Pod Autoscaler (HPA)

# templates/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: drupal-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: drupal
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
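
To see how these targets translate into replica counts, the HPA's documented scaling rule is desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), clamped to the min and max. The sketch below reproduces that arithmetic with integers. It ignores the controller's tolerance band and stabilization window, and `hpa_desired` is a hypothetical helper name.

```shell
# Sketch: approximate the HPA replica calculation for one metric.
# Arguments: current replicas, current utilization %, target utilization %,
#            minReplicas, maxReplicas.
hpa_desired() {
  current_replicas="$1"; current_util="$2"; target_util="$3"
  min_replicas="$4"; max_replicas="$5"
  # Integer ceiling of current_replicas * current_util / target_util.
  desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
  if [ "$desired" -lt "$min_replicas" ]; then desired="$min_replicas"; fi
  if [ "$desired" -gt "$max_replicas" ]; then desired="$max_replicas"; fi
  echo "$desired"
}

# Example: 3 replicas at 90% CPU against the 70% target
hpa_desired 3 90 70 3 10   # prints 4
```

With several metrics configured, as in the manifest above, the HPA computes a count per metric and uses the largest.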

Vertical Pod Autoscaler (VPA)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: drupal-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: drupal
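  updatePolicy:
    # Caution: do not run "Auto" mode alongside the CPU/memory HPA above on the
    # same Deployment; use "Off" (recommendations only) if both are installed.
    updateMode: "Auto"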

Cluster Autoscaler

# AWS EKS
eksctl create nodegroup \
  --cluster llm-platform-prod \
  --name autoscaling-workers \
  --node-type m5.2xlarge \
  --nodes-min 3 \
  --nodes-max 20 \
  --managed \
  --asg-access
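
The node group above only grants the nodes permission to manage their Auto Scaling group; the Cluster Autoscaler itself still has to run in the cluster. One way to install it is the upstream Helm chart, sketched below. `install_cluster_autoscaler` is a hypothetical wrapper, and the chart values shown should be checked against the chart version in use.

```shell
# Sketch: install the Cluster Autoscaler from the upstream Helm chart (AWS).
# install_cluster_autoscaler is a hypothetical wrapper name.
install_cluster_autoscaler() {
  cluster_name="$1"; region="$2"
  helm repo add autoscaler https://kubernetes.github.io/autoscaler
  helm repo update
  helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
    --namespace kube-system \
    --set "autoDiscovery.clusterName=${cluster_name}" \
    --set "awsRegion=${region}"
}

# Example: install_cluster_autoscaler llm-platform-prod us-east-1
```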

Monitoring Setup

Prometheus Stack

# Install Prometheus + Grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

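# Set GRAFANA_ADMIN_PASSWORD (an assumed variable name) to a strong value first;
# do not ship the default "admin" password to production.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d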

Phoenix Arize

# Deploy Phoenix for LLM tracing
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phoenix
  namespace: llm-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phoenix
  template:
    metadata:
      labels:
        app: phoenix
    spec:
      containers:
      - name: phoenix
        image: arizephoenix/phoenix:latest
        ports:
        - containerPort: 6006
        - containerPort: 4317
---
apiVersion: v1
kind: Service
metadata:
  name: phoenix
  namespace: llm-platform
spec:
  selector:
    app: phoenix
  ports:
  - name: ui
    port: 6006
    targetPort: 6006
  - name: otlp
    port: 4317
    targetPort: 4317
EOF
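
Services that export traces need the in-cluster address of the Phoenix OTLP port. The sketch below builds that URL from the Service name, namespace, and port. `otlp_endpoint` is a hypothetical helper, and whether a given workload reads the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable is an assumption to verify per service.

```shell
# Sketch: build the in-cluster OTLP endpoint URL for a Service.
# otlp_endpoint is a hypothetical helper name.
otlp_endpoint() {
  service="$1"; namespace="$2"; port="$3"
  echo "http://${service}.${namespace}.svc.cluster.local:${port}"
}

# Example (assumes the llm-gateway Deployment honors the standard
# OpenTelemetry variable OTEL_EXPORTER_OTLP_ENDPOINT):
# kubectl set env deployment/llm-gateway -n llm-platform \
#   "OTEL_EXPORTER_OTLP_ENDPOINT=$(otlp_endpoint phoenix llm-platform 4317)"
```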

Verify Deployment

# Check all pods are running
kubectl get pods -n llm-platform

# Check services
kubectl get svc -n llm-platform

# Check ingress
kubectl get ingress -n llm-platform

# View logs
kubectl logs -n llm-platform deployment/drupal -f

# Check auto-scaling
kubectl get hpa -n llm-platform
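
For scripted checks, `kubectl rollout status` gives a pass/fail answer per Deployment instead of a table to read. A minimal sketch follows; `check_rollouts` is a hypothetical helper, and the Deployment names in the example are assumed to match the chart's template file names.

```shell
# Sketch: report which Deployments have not finished rolling out.
# check_rollouts is a hypothetical helper name.
check_rollouts() {
  namespace="$1"; shift
  failed=0
  for name in "$@"; do
    if ! kubectl rollout status "deployment/${name}" \
        --namespace "$namespace" --timeout=120s; then
      echo "rollout not complete: $name" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example: check_rollouts llm-platform drupal llm-gateway agent-mesh agent-tracer
```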

Access Platform

# Get external IP (cloud)
kubectl get ingress -n llm-platform

# Port forward (local/testing)
kubectl port-forward -n llm-platform svc/drupal 8080:80

# Access specific services
kubectl port-forward -n llm-platform svc/llm-gateway 4000:4000
kubectl port-forward -n llm-platform svc/qdrant 6333:6333
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Troubleshooting

Pods Not Starting

# Check pod status
kubectl describe pod -n llm-platform <pod-name>

# Check events
kubectl get events -n llm-platform --sort-by='.lastTimestamp'

# Check resource limits
kubectl top pods -n llm-platform
kubectl top nodes

Database Connection Issues

# Test database connection
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h postgres.llm-platform.svc.cluster.local -U llm_user -d llm_platform

# Check database pod
kubectl logs -n llm-platform statefulset/postgres
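
The connection test prompts for a password, which lives in the `database` Secret created earlier. The sketch below reads and decodes one key from a Secret. `secret_value` is a hypothetical helper; keys containing dots would need escaping in the JSONPath expression.

```shell
# Sketch: print the decoded value of one key in a Kubernetes Secret.
# secret_value is a hypothetical helper name; namespace defaults to llm-platform.
secret_value() {
  secret_name="$1"; key="$2"; namespace="${3:-llm-platform}"
  kubectl get secret "$secret_name" --namespace "$namespace" \
    -o "jsonpath={.data.${key}}" | base64 --decode
}

# Example: pass the password to the debug pod so psql does not prompt.
# kubectl run -it --rm debug --image=postgres:15 --restart=Never \
#   --env="PGPASSWORD=$(secret_value database password)" -- \
#   psql -h postgres.llm-platform.svc.cluster.local -U llm_user -d llm_platform
```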

Performance Issues

# Check resource usage
kubectl top pods -n llm-platform
kubectl top nodes

# Check HPA status
kubectl get hpa -n llm-platform

# Scale manually if needed
kubectl scale deployment drupal --replicas=5 -n llm-platform
