Deployment Handbook
Separation of Duties: See Separation of Duties. Deployment documentation owns the description of deployment procedures. It does NOT own agent manifests, execution, or infrastructure configuration.
Vast.ai Integration: See BULLETPROOF_VASTAI_PLAN.md for the complete Vast.ai implementation plan with Cloudflare Tunnel + Tailscale integration.
Single-source deployment reference for all BlueFly Agent Platform projects
This handbook consolidates deployment patterns used across 12+ projects into one authoritative reference. Link to specific sections from project READMEs.
Table of Contents
- Quick Start
- Kubernetes Deployment
- Docker Compose
- GitLab CI/CD Integration
- Environment Configuration
- Health Checks and Monitoring
Quick Start
Choose your deployment target:
| Target | Command | Use Case |
|---|---|---|
| Local Docker | docker compose up -d | Development |
| OrbStack K8s | helm install <app> ./helm-chart | Local K8s testing |
| GitLab CI/CD | Push to branch | Automated deployment |
| Production K8s | helm upgrade --install | Production |
Kubernetes Deployment
Helm Charts
Chart Structure (standard for all projects):
    infrastructure/helm-chart/
      Chart.yaml             # Chart metadata
      values.yaml            # Default values
      values-dev.yaml        # Development overrides
      values-staging.yaml    # Staging overrides
      values-prod.yaml       # Production overrides
      templates/
        _helpers.tpl         # Template helpers
        NOTES.txt            # Post-install notes
        deployment.yaml      # Main deployment
        service.yaml         # Service definition
        configmap.yaml       # Configuration
        secret.yaml          # Secrets (external-secrets recommended)
        ingress.yaml         # Ingress rules
        hpa.yaml             # Horizontal Pod Autoscaler
        pvc.yaml             # Persistent Volume Claims
Chart.yaml Template:
    apiVersion: v2
    name: <service-name>
    description: <Service description>
    type: application
    version: 1.0.0
    appVersion: "1.0.0"
    keywords:
      - ai
      - agent
      - llm
    maintainers:
      - name: BlueFly Team
        email: dev@bluefly.io
    dependencies:
      - name: postgresql
        version: 12.x.x
        repository: https://charts.bitnami.com/bitnami
        condition: postgres.enabled
      - name: redis
        version: 18.x.x
        repository: https://charts.bitnami.com/bitnami
        condition: redis.enabled
values.yaml Template:
    # Global settings
    global:
      environment: development
      domain: service.local

    # Application configuration
    app:
      replicaCount: 1

      image:
        repository: registry.gitlab.com/blueflyio/<project>
        pullPolicy: IfNotPresent
        tag: "latest"

      service:
        type: ClusterIP
        port: 3000

      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi

      autoscaling:
        enabled: false
        minReplicas: 1
        maxReplicas: 10
        targetCPUUtilizationPercentage: 70

      # Health checks
      livenessProbe:
        httpGet:
          path: /health
          port: http
        initialDelaySeconds: 30
        periodSeconds: 10

      readinessProbe:
        httpGet:
          path: /health/ready
          port: http
        initialDelaySeconds: 10
        periodSeconds: 5

    # Ingress
    ingress:
      enabled: false
      className: nginx
      annotations:
        cert-manager.io/cluster-issuer: letsencrypt-prod
      hosts:
        - host: service.local
          paths:
            - path: /
              pathType: Prefix
      tls: []
Environment-Specific Values:
    # values-prod.yaml
    global:
      environment: production
      domain: llm-platform.example.com

    app:
      replicaCount: 3

      autoscaling:
        enabled: true
        minReplicas: 3
        maxReplicas: 10
        targetCPUUtilizationPercentage: 70

      resources:
        requests:
          cpu: 1000m
          memory: 1Gi
        limits:
          cpu: 2000m
          memory: 2Gi

    ingress:
      enabled: true
      tls:
        - secretName: service-tls
          hosts:
            - llm-platform.example.com
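Helm layers each `--values` file over the chart defaults: maps are merged key by key, while scalars and lists from the override replace the default outright. The sketch below imitates that behaviour on trimmed-down copies of the two templates above so the effective production values can be previewed. It is only an illustration: `mergeValues` is a hypothetical helper, not Helm's implementation, and the nesting of `autoscaling` under `app` is assumed from the templates.

```typescript
// Sketch of Helm-style value layering (hypothetical helper, not Helm itself):
// maps merge recursively; scalars and lists in the override replace the base.
type Values = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Values {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function mergeValues(base: Values, override: Values): Values {
  const result: Values = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const existing = result[key];
    result[key] =
      isPlainObject(existing) && isPlainObject(value)
        ? mergeValues(existing, value)
        : value;
  }
  return result;
}

// Trimmed-down copy of values.yaml.
const defaults: Values = {
  global: { environment: "development", domain: "service.local" },
  app: {
    replicaCount: 1,
    autoscaling: { enabled: false, minReplicas: 1, maxReplicas: 10 },
  },
  ingress: { enabled: false, className: "nginx", tls: [] },
};

// Trimmed-down copy of values-prod.yaml.
const prod: Values = {
  global: { environment: "production", domain: "llm-platform.example.com" },
  app: { replicaCount: 3, autoscaling: { enabled: true, minReplicas: 3 } },
  ingress: {
    enabled: true,
    tls: [{ secretName: "service-tls", hosts: ["llm-platform.example.com"] }],
  },
};

const effective = mergeValues(defaults, prod);
console.log(JSON.stringify(effective, null, 2));
```

Keys that the production file does not mention, such as `app.autoscaling.maxReplicas` and `ingress.className`, keep their defaults, which is why environment files only need to list what differs.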
Helm Commands:
    # Install
    helm install <release> ./infrastructure/helm-chart \
      --namespace <namespace> \
      --create-namespace \
      --values ./infrastructure/helm-chart/values-dev.yaml

    # Upgrade
    helm upgrade <release> ./infrastructure/helm-chart \
      --namespace <namespace> \
      --values ./infrastructure/helm-chart/values-prod.yaml

    # Rollback
    helm rollback <release> -n <namespace>

    # Uninstall
    helm uninstall <release> -n <namespace>

    # Dry run / Debug
    helm install <release> ./infrastructure/helm-chart \
      --dry-run --debug \
      --namespace <namespace>

    # Template rendering
    helm template <release> ./infrastructure/helm-chart \
      --values ./infrastructure/helm-chart/values-prod.yaml
Raw Manifests
For simpler deployments without Helm:
Deployment Template:
    # k8s/deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ${SERVICE_NAME}
      labels:
        app: ${SERVICE_NAME}
        app.kubernetes.io/name: ${SERVICE_NAME}
        app.kubernetes.io/version: "${VERSION}"
        app.kubernetes.io/component: api
    spec:
      replicas: ${REPLICAS:-1}
      selector:
        matchLabels:
          app: ${SERVICE_NAME}
      template:
        metadata:
          labels:
            app: ${SERVICE_NAME}
        spec:
          containers:
            - name: ${SERVICE_NAME}
              image: ${IMAGE}:${TAG:-latest}
              ports:
                - name: http
                  containerPort: ${PORT:-3000}
              env:
                - name: NODE_ENV
                  value: "${ENVIRONMENT:-production}"
                - name: PORT
                  value: "${PORT:-3000}"
              envFrom:
                - configMapRef:
                    name: ${SERVICE_NAME}-config
                - secretRef:
                    name: ${SERVICE_NAME}-secrets
              livenessProbe:
                httpGet:
                  path: /health
                  port: http
                initialDelaySeconds: 30
                periodSeconds: 10
              readinessProbe:
                httpGet:
                  path: /health/ready
                  port: http
                initialDelaySeconds: 10
                periodSeconds: 5
              resources:
                requests:
                  cpu: 250m
                  memory: 256Mi
                limits:
                  cpu: 500m
                  memory: 512Mi
Service Template:
    # k8s/service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: ${SERVICE_NAME}
      labels:
        app: ${SERVICE_NAME}
    spec:
      type: ClusterIP
      ports:
        - name: http
          port: 80
          targetPort: http
          protocol: TCP
      selector:
        app: ${SERVICE_NAME}
OrbStack Local
OrbStack provides local Kubernetes for macOS development:
    # Start OrbStack's built-in Kubernetes cluster
    orb start k8s

    # Set context
    kubectl config use-context orbstack

    # Deploy
    kubectl apply -f k8s/

    # Port forward for local access
    kubectl port-forward svc/<service> 8080:80

    # Access via orb.local domain
    # http://<service>.orb.local
OrbStack-Specific Ingress:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: ${SERVICE_NAME}
    spec:
      # Replaces the deprecated kubernetes.io/ingress.class annotation
      ingressClassName: nginx
      rules:
        - host: ${SERVICE_NAME}.orb.local
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: ${SERVICE_NAME}
                    port:
                      number: 80
Multi-Machine K3s
For production-like local clusters across multiple machines:
Control Plane (Mac M4):
    # Install K3s server
    curl -sfL https://get.k3s.io | sh -s - server \
      --cluster-init \
      --tls-san=100.108.129.7 \
      --disable traefik \
      --flannel-backend=wireguard-native

    # Get join token (the file is readable by root only)
    sudo cat /var/lib/rancher/k3s/server/node-token
Worker Node (Mac M3):
    # Join cluster
    curl -sfL https://get.k3s.io | K3S_URL=https://100.108.129.7:6443 \
      K3S_TOKEN=<node-token> sh -

    # Verify
    kubectl get nodes
Longhorn Storage:
    # Install Longhorn for distributed storage
    kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

    # Set as default StorageClass
    kubectl patch storageclass longhorn \
      -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Docker Compose
Development Stack
docker-compose.yml (development):
    version: "3.9"

    services:
      app:
        build:
          context: .
          dockerfile: Dockerfile
          target: development
        volumes:
          - .:/app
          - node_modules:/app/node_modules
        ports:
          - "${PORT:-3000}:3000"
        environment:
          - NODE_ENV=development
          - DATABASE_URL=postgresql://user:pass@postgres:5432/db
          - REDIS_URL=redis://redis:6379
        depends_on:
          postgres:
            condition: service_healthy
          redis:
            condition: service_healthy
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s

      postgres:
        image: postgres:16-alpine
        environment:
          POSTGRES_USER: user
          POSTGRES_PASSWORD: pass
          POSTGRES_DB: db
        volumes:
          - postgres_data:/var/lib/postgresql/data
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U user -d db"]
          interval: 10s
          timeout: 5s
          retries: 5

      redis:
        image: redis:7-alpine
        volumes:
          - redis_data:/data
        healthcheck:
          test: ["CMD", "redis-cli", "ping"]
          interval: 10s
          timeout: 5s
          retries: 5

    volumes:
      node_modules:
      postgres_data:
      redis_data:
Production Stack
docker-compose.prod.yml:
    version: "3.9"

    services:
      app:
        image: registry.gitlab.com/blueflyio/${PROJECT}:${TAG:-latest}
        restart: unless-stopped
        # NOTE: a single fixed host port cannot be bound by more than one replica.
        # With replicas > 1, drop this mapping and publish through a reverse proxy.
        ports:
          - "${PORT:-3000}:3000"
        environment:
          - NODE_ENV=production
        env_file:
          - .env.production
        deploy:
          replicas: 2
          resources:
            limits:
              cpus: '1.0'
              memory: 1G
            reservations:
              cpus: '0.5'
              memory: 512M
          restart_policy:
            condition: on-failure
            delay: 5s
            max_attempts: 3
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
          interval: 30s
          timeout: 10s
          retries: 3
          start_period: 40s
        logging:
          driver: "json-file"
          options:
            max-size: "10m"
            max-file: "3"
Usage:
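    # Development
    docker compose up -d

    # Production (docker-compose.prod.yml is self-contained; layering it over the
    # development file would also pull in the dev build, bind mounts, and ports)
    docker compose -f docker-compose.prod.yml up -d

    # With an environment-specific env file
    docker compose --env-file .env.production up -d

    # Scale service (only works if app does not publish a fixed host port)
    docker compose up -d --scale app=3

    # View logs
    docker compose logs -f app

    # Stop and remove containers and networks; -v also deletes named volumes
    docker compose down -v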
Service Templates
Common service definitions:
    # PostgreSQL with backup
    postgres:
      image: postgres:16-alpine
      restart: unless-stopped
      environment:
        POSTGRES_USER: ${POSTGRES_USER:-llm}
        POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
        POSTGRES_DB: ${POSTGRES_DB:-llm_platform}
      volumes:
        - postgres_data:/var/lib/postgresql/data
        - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
      healthcheck:
        test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-llm}"]
        interval: 10s
        timeout: 5s
        retries: 5

    # Redis with persistence
    redis:
      image: redis:7-alpine
      restart: unless-stopped
      command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
      volumes:
        - redis_data:/data
      healthcheck:
        test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
        interval: 10s
        timeout: 5s
        retries: 5

    # Qdrant vector database
    qdrant:
      image: qdrant/qdrant:latest
      restart: unless-stopped
      ports:
        - "6333:6333"
        - "6334:6334"
      volumes:
        - qdrant_data:/qdrant/storage
      environment:
        QDRANT__SERVICE__GRPC_PORT: 6334

    # MinIO object storage
    minio:
      image: minio/minio:latest
      restart: unless-stopped
      command: server /data --console-address ":9001"
      ports:
        - "9000:9000"
        - "9001:9001"
      environment:
        MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minioadmin}
        MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
      volumes:
        - minio_data:/data
      healthcheck:
        test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
        interval: 30s
        timeout: 20s
        retries: 3
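The application container still needs connection URLs that agree with these service definitions. As a minimal sketch, the helpers below derive them from the same variables, applying the same defaults (`llm`, `llm_platform`) and percent-encoding the passwords so that characters such as `@` or `/` do not break the URL. The function names are hypothetical, and the host names `postgres` and `redis` assume the Compose service names shown above.

```typescript
// Hypothetical helpers: build connection URLs from the variables used by the
// service templates. Hosts are the Compose service names; ports are the
// images' default ports.
type Env = Record<string, string | undefined>;

function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) {
    throw new Error(`Missing required variable: ${name}`);
  }
  return value;
}

function databaseUrl(env: Env): string {
  // `||` mirrors Compose's ${VAR:-default}: unset and empty both fall back.
  const user = encodeURIComponent(env.POSTGRES_USER || "llm");
  const database = encodeURIComponent(env.POSTGRES_DB || "llm_platform");
  const password = encodeURIComponent(requireVar(env, "POSTGRES_PASSWORD"));
  return `postgresql://${user}:${password}@postgres:5432/${database}`;
}

function redisUrl(env: Env): string {
  const password = encodeURIComponent(requireVar(env, "REDIS_PASSWORD"));
  return `redis://:${password}@redis:6379/0`;
}

// Example values only; in a real service pass process.env instead.
const exampleEnv: Env = { POSTGRES_PASSWORD: "p@ss/word", REDIS_PASSWORD: "s3cret" };
console.log(databaseUrl(exampleEnv));
console.log(redisUrl(exampleEnv));
```

Failing fast on a missing password is a deliberate choice here: the templates give the passwords no default, so an empty value most likely means a misconfigured environment.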
GitLab CI/CD Integration
Pipeline Configuration
.gitlab-ci.yml (standard template):
    include:
      # Golden workflow component
      - component: gitlab.com/blueflyio/agent-platform/gitlab_components/golden-workflow@v1
        inputs:
          project_type: npm  # npm | drupal | python | go
          enable_security_scan: true
          deploy_environments: ["dev", "staging", "prod"]

    # Global variables
    variables:
      DOCKER_REGISTRY: registry.gitlab.com/blueflyio/${CI_PROJECT_NAME}
      HELM_CHART_PATH: infrastructure/helm-chart
      KUBERNETES_NAMESPACE: ${CI_PROJECT_NAME}

    # Stages
    stages:
      - validate
      - test
      - build
      - deploy:dev
      - deploy:staging
      - deploy:production

    # Build job
    build:
      stage: build
      image: docker:24
      services:
        - docker:24-dind
      before_script:
        # Pass the password on stdin so it does not appear in the process list
        - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
      script:
        # Build once, then add the branch/tag slug as a second tag
        - docker build -t $DOCKER_REGISTRY:$CI_COMMIT_SHA .
        - docker tag $DOCKER_REGISTRY:$CI_COMMIT_SHA $DOCKER_REGISTRY:$CI_COMMIT_REF_SLUG
        - docker push $DOCKER_REGISTRY:$CI_COMMIT_SHA
        - docker push $DOCKER_REGISTRY:$CI_COMMIT_REF_SLUG
      only:
        - main
        - development
        - tags
Environment Deployment
GitLab Environments:
    # Shared settings for Helm jobs. alpine/helm sets `helm` as its entrypoint and
    # ships no kubectl, so the entrypoint is cleared and the cluster context is
    # selected with helm's own --kube-context flag.
    .helm:
      image:
        name: alpine/helm:latest
        entrypoint: [""]

    # Development (automatic)
    deploy:dev:
      extends: .helm
      stage: deploy:dev
      environment:
        name: development
        url: https://dev.${CI_PROJECT_NAME}.example.com
        on_stop: stop:dev
        auto_stop_in: 7 days
      script:
        # image.* is nested under app: in values.yaml
        - >-
          helm upgrade --install ${CI_PROJECT_NAME} $HELM_CHART_PATH
          --kube-context $KUBE_CONTEXT_DEV
          --namespace ${KUBERNETES_NAMESPACE}-dev
          --create-namespace
          --values $HELM_CHART_PATH/values-dev.yaml
          --set app.image.tag=$CI_COMMIT_SHA
          --wait
      only:
        - development

    # Stop job referenced by on_stop above
    stop:dev:
      extends: .helm
      stage: deploy:dev
      variables:
        GIT_STRATEGY: none
      environment:
        name: development
        action: stop
      script:
        - helm uninstall ${CI_PROJECT_NAME} --kube-context $KUBE_CONTEXT_DEV --namespace ${KUBERNETES_NAMESPACE}-dev
      when: manual
      only:
        - development

    # Staging (automatic on main)
    deploy:staging:
      extends: .helm
      stage: deploy:staging
      environment:
        name: staging
        url: https://staging.${CI_PROJECT_NAME}.example.com
      script:
        - >-
          helm upgrade --install ${CI_PROJECT_NAME} $HELM_CHART_PATH
          --kube-context $KUBE_CONTEXT_STAGING
          --namespace ${KUBERNETES_NAMESPACE}-staging
          --create-namespace
          --values $HELM_CHART_PATH/values-staging.yaml
          --set app.image.tag=$CI_COMMIT_SHA
          --wait
      only:
        - main

    # Production (manual, tags only)
    deploy:production:
      extends: .helm
      stage: deploy:production
      environment:
        name: production
        url: https://${CI_PROJECT_NAME}.example.com
      script:
        - >-
          helm upgrade --install ${CI_PROJECT_NAME} $HELM_CHART_PATH
          --kube-context $KUBE_CONTEXT_PROD
          --namespace ${KUBERNETES_NAMESPACE}
          --create-namespace
          --values $HELM_CHART_PATH/values-prod.yaml
          --set app.image.tag=$CI_COMMIT_TAG
          --wait
      when: manual
      only:
        - tags
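The three deploy jobs follow one pattern: the release is named after the project, development and staging get a `-dev` or `-staging` namespace suffix while production uses the bare project name, and each environment has its own values file. The sketch below captures that mapping so it can be reused in scripts or checked in tests. `deployTarget` and `helmUpgradeArgs` are hypothetical helpers, and the `app.image.tag` key in the example assumes the image tag is nested under `app:` in values.yaml; use whatever path your chart defines.

```typescript
// Hypothetical helpers mirroring the deploy jobs above.
type Environment = "dev" | "staging" | "prod";

interface DeployTarget {
  release: string;
  namespace: string;
  valuesFile: string;
}

const CHART_PATH = "infrastructure/helm-chart";

function deployTarget(project: string, env: Environment): DeployTarget {
  return {
    release: project,
    // Production uses the bare project namespace; other environments get a suffix.
    namespace: env === "prod" ? project : `${project}-${env}`,
    valuesFile: `${CHART_PATH}/values-${env}.yaml`,
  };
}

function helmUpgradeArgs(
  project: string,
  env: Environment,
  overrides: Record<string, string> = {},
): string[] {
  const target = deployTarget(project, env);
  const args = [
    "upgrade", "--install", target.release, CHART_PATH,
    "--namespace", target.namespace,
    "--create-namespace",
    "--values", target.valuesFile,
  ];
  for (const [key, value] of Object.entries(overrides)) {
    args.push("--set", `${key}=${value}`);
  }
  args.push("--wait");
  return args;
}

// "my-service" and "abc123" are placeholder values.
const stagingArgs = helmUpgradeArgs("my-service", "staging", { "app.image.tag": "abc123" });
console.log(["helm", ...stagingArgs].join(" "));
```

Returning an argument array rather than a single string keeps the values safe to pass to a process spawner without shell quoting.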
Review Apps (for MRs):
    deploy:review:
      stage: deploy:dev
      image:
        name: alpine/helm:latest
        entrypoint: [""]  # the image's default entrypoint is `helm`
      environment:
        name: review/$CI_COMMIT_REF_SLUG
        url: https://$CI_COMMIT_REF_SLUG.review.example.com
        on_stop: stop:review
        auto_stop_in: 1 week
      script:
        # image.* is nested under app: in values.yaml
        - >-
          helm upgrade --install review-$CI_COMMIT_REF_SLUG $HELM_CHART_PATH
          --namespace review
          --create-namespace
          --values $HELM_CHART_PATH/values-dev.yaml
          --set app.image.tag=$CI_COMMIT_SHA
          --set "ingress.hosts[0].host=$CI_COMMIT_REF_SLUG.review.example.com"
          --wait
      only:
        - merge_requests
      except:
        - main

    stop:review:
      stage: deploy:dev
      image:
        name: alpine/helm:latest
        entrypoint: [""]
      variables:
        GIT_STRATEGY: none  # the branch may already be deleted when this runs
      environment:
        name: review/$CI_COMMIT_REF_SLUG
        action: stop
      script:
        - helm uninstall review-$CI_COMMIT_REF_SLUG --namespace review
      when: manual
      only:
        - merge_requests
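Review app URLs and release names both derive from `CI_COMMIT_REF_SLUG`, which GitLab documents as the ref name lowercased, with everything except `a-z` and `0-9` replaced by `-`, shortened to 63 bytes, and without leading or trailing `-`. The sketch below approximates that rule so a branch name can be checked before pushing. It truncates by characters rather than bytes, and `refSlug`, `reviewUrl`, and `reviewRelease` are hypothetical helper names.

```typescript
// Approximation of GitLab's CI_COMMIT_REF_SLUG rule (hypothetical helper).
function refSlug(refName: string): string {
  return refName
    .toLowerCase()
    .replace(/[^a-z0-9]/g, "-")
    .slice(0, 63)
    .replace(/^-+|-+$/g, "");
}

function reviewUrl(refName: string): string {
  return `https://${refSlug(refName)}.review.example.com`;
}

// Helm limits release names to 53 characters, so "review-" plus a long slug
// can be rejected at install time. This reports whether the name fits.
const HELM_RELEASE_NAME_MAX = 53;

function reviewRelease(refName: string): { name: string; fits: boolean } {
  const name = `review-${refSlug(refName)}`;
  return { name, fits: name.length <= HELM_RELEASE_NAME_MAX };
}

// Placeholder branch name.
const branch = "feature/Add_OAuth-Login";
console.log(refSlug(branch));
console.log(reviewUrl(branch));
console.log(reviewRelease(branch));
```

Because the slug alone may be up to 63 characters, long branch names are worth checking against the release name limit before a review deployment fails.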
Runner Configuration
Runner Tags:
| Tag | Use Case | Runner |
|---|---|---|
| docker, local | Generic Docker jobs | docker-runner |
| npm-package, docker, local | Node.js/TypeScript | npm-runner |
| drupal-module, docker, local | Drupal/PHP | drupal-module-runner |
| python, docker, local | Python projects | python-runner |
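The table can also be read as a lookup from project type to runner tags. The sketch below encodes it that way, using the `project_type` values listed in the pipeline template (`npm`, `drupal`, `python`, `go`). Falling back to the generic `docker, local` tags for types without a dedicated runner, such as `go`, is an assumption rather than documented behaviour, and `runnerTagsFor` is a hypothetical name.

```typescript
// Runner tag lookup derived from the table above (hypothetical helper).
type ProjectType = "npm" | "drupal" | "python" | "go";

const GENERIC_TAGS: readonly string[] = ["docker", "local"];

const RUNNER_TAGS: Partial<Record<ProjectType, readonly string[]>> = {
  npm: ["npm-package", "docker", "local"],
  drupal: ["drupal-module", "docker", "local"],
  python: ["python", "docker", "local"],
};

function runnerTagsFor(projectType: ProjectType): readonly string[] {
  // Assumption: project types without a dedicated runner use the generic one.
  return RUNNER_TAGS[projectType] ?? GENERIC_TAGS;
}

for (const projectType of ["npm", "drupal", "python", "go"] as const) {
  console.log(`${projectType}: ${runnerTagsFor(projectType).join(", ")}`);
}
```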
Runner Selector Component:
    include:
      - component: gitlab.com/blueflyio/agent-platform/gitlab_components/runner-selector@v0.1.x
        inputs:
          runner_type: auto
          fallback_to_shared: true

    build:
      extends: .runner-npm
      script:
        - npm ci
        - npm run build
Environment Configuration
Environment Variables
Standard Variables:
    # Application
    NODE_ENV=production
    PORT=3000
    LOG_LEVEL=info

    # Database
    DATABASE_URL=postgresql://user:pass@host:5432/db
    DATABASE_POOL_MIN=2
    DATABASE_POOL_MAX=10

    # Redis
    REDIS_URL=redis://:password@host:6379/0

    # AI Providers
    ANTHROPIC_API_KEY=sk-ant-...
    OPENAI_API_KEY=sk-...
    OLLAMA_BASE_URL=http://ollama:11434

    # Observability (4318 is OTLP/HTTP; use 4317 only with a gRPC exporter)
    OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318
    OTEL_SERVICE_NAME=my-service
    OTEL_RESOURCE_ATTRIBUTES=service.version=1.0.0,deployment.environment=production

    # Security
    JWT_SECRET=...
    CORS_ORIGINS=https://example.com
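Services tend to fail in confusing ways when one of these variables is missing or malformed, so it can help to validate them once at startup. The following is a minimal sketch of that idea, not a prescribed loader: the `AppConfig` shape and `loadConfig` name are hypothetical, the defaults are taken from the example values above, and treating `DATABASE_URL` and `REDIS_URL` as required is an assumption.

```typescript
// Hypothetical startup validation for the standard variables.
type Env = Record<string, string | undefined>;

interface AppConfig {
  nodeEnv: string;
  port: number;
  logLevel: string;
  databaseUrl: string;
  redisUrl: string;
  poolMin: number;
  poolMax: number;
}

function intVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): AppConfig {
  // Assumption: these two variables are mandatory for every service.
  const missing = ["DATABASE_URL", "REDIS_URL"].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  const poolMin = intVar(env, "DATABASE_POOL_MIN", 2);
  const poolMax = intVar(env, "DATABASE_POOL_MAX", 10);
  if (poolMin > poolMax) {
    throw new Error("DATABASE_POOL_MIN must not exceed DATABASE_POOL_MAX");
  }

  return {
    nodeEnv: env.NODE_ENV ?? "development",
    port: intVar(env, "PORT", 3000),
    logLevel: env.LOG_LEVEL ?? "info",
    databaseUrl: env.DATABASE_URL as string,
    redisUrl: env.REDIS_URL as string,
    poolMin,
    poolMax,
  };
}

// In a service entry point this would typically be: loadConfig(process.env)
```

Collecting all missing names into one error, rather than stopping at the first, saves a redeploy cycle per forgotten variable.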
Environment Files:
.env # Default (development)
.env.local # Local overrides (gitignored)
.env.development # Development
.env.staging # Staging
.env.production # Production
.env.test # Testing
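The list above does not state which file wins when the same variable appears in several of them. The sketch below assumes a common convention: `.env` is loaded first, then `.env.<NODE_ENV>`, then `.env.local`, with later files overriding earlier ones and `.env.local` skipped for tests so that test runs stay reproducible. Both the order and the helper names are assumptions, and the parser handles only plain `KEY=VALUE` lines without quoting.

```typescript
// Assumed load order: later files override earlier ones.
function envFilesFor(nodeEnv: string): string[] {
  const files = [".env", `.env.${nodeEnv}`];
  if (nodeEnv !== "test") files.push(".env.local");
  return files;
}

// Minimal KEY=VALUE parser: skips blank lines and comments, no quote handling.
function parseEnv(text: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    result[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return result;
}

// Resolves the effective variables from an in-memory map of file contents.
function resolveEnv(
  nodeEnv: string,
  files: Record<string, string | undefined>,
): Record<string, string> {
  let merged: Record<string, string> = {};
  for (const name of envFilesFor(nodeEnv)) {
    const content = files[name];
    if (content !== undefined) {
      merged = { ...merged, ...parseEnv(content) };
    }
  }
  return merged;
}

const resolved = resolveEnv("production", {
  ".env": "PORT=3000\nLOG_LEVEL=debug",
  ".env.production": "# production overrides\nLOG_LEVEL=info",
});
console.log(resolved);
```

Taking file contents as a map keeps the sketch free of filesystem access; a real loader would read the files from disk or use an existing dotenv library.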
Secrets Management
GitLab CI/CD Variables:
    # Set in Settings > CI/CD > Variables
    ANTHROPIC_API_KEY    # Masked, protected
    DATABASE_PASSWORD    # Masked, protected
    KUBECONFIG_PROD      # File, protected
Kubernetes Secrets:
    apiVersion: v1
    kind: Secret
    metadata:
      name: api-keys
    type: Opaque
    stringData:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
External Secrets Operator (recommended for production):
    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: api-keys
    spec:
      refreshInterval: 1h
      secretStoreRef:
        name: vault-backend
        kind: SecretStore
      target:
        name: api-keys
      data:
        - secretKey: ANTHROPIC_API_KEY
          remoteRef:
            key: secret/data/api-keys
            property: anthropic
Local Secrets (development):
    # Store tokens in ~/.tokens/
    ~/.tokens/gitlab
    ~/.tokens/anthropic
    ~/.tokens/openai

    # Load in shell
    export ANTHROPIC_API_KEY="$(cat ~/.tokens/anthropic)"
Configuration Files
ConfigMap Pattern:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: app-config
    data:
      config.json: |
        {
          "port": 3000,
          "logLevel": "info",
          "features": {
            "enableMetrics": true,
            "enableTracing": true
          }
        }
Mounting Configuration:
    containers:
      - name: app
        volumeMounts:
          - name: config
            mountPath: /app/config
            readOnly: true
    volumes:
      - name: config
        configMap:
          name: app-config
Health Checks and Monitoring
Health Endpoints
Standard Health Endpoints:
| Endpoint | Purpose | Returns |
|---|---|---|
| /health | Basic liveness | {"status": "ok"} |
| /health/ready | Readiness with deps | {"status": "ok", "checks": {...}} |
| /health/live | Kubernetes liveness | {"status": "ok"} |
| /metrics | Prometheus metrics | Prometheus format |
TypeScript Implementation:
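    // health.controller.ts
    import { Router } from 'express';

    interface HealthCheck {
      healthy: boolean;
      latency?: number; // milliseconds
      error?: string;
    }

    // Assumed to be provided elsewhere in the application.
    declare const db: { query(sql: string): Promise<unknown> };
    declare function checkRedis(): Promise<HealthCheck>;
    declare function checkExternalDeps(): Promise<HealthCheck>;

    const router = Router();

    // Liveness - is the process alive? Served on both liveness paths from the table.
    router.get(['/health', '/health/live'], (_req, res) => {
      res.json({ status: 'ok', timestamp: new Date().toISOString() });
    });

    // Readiness - can the service handle requests?
    router.get('/health/ready', async (_req, res) => {
      // Run the dependency checks concurrently
      const [database, redis, external] = await Promise.all([
        checkDatabase(),
        checkRedis(),
        checkExternalDeps()
      ]);
      const checks = { database, redis, external };
      const allHealthy = Object.values(checks).every(c => c.healthy);

      res.status(allHealthy ? 200 : 503).json({
        status: allHealthy ? 'ok' : 'degraded',
        checks,
        timestamp: new Date().toISOString()
      });
    });

    async function checkDatabase(): Promise<HealthCheck> {
      const started = Date.now();
      try {
        await db.query('SELECT 1');
        return { healthy: true, latency: Date.now() - started };
      } catch (error) {
        return { healthy: false, error: error instanceof Error ? error.message : String(error) };
      }
    }

    export default router;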
Readiness and Liveness
Kubernetes Probes:
    containers:
      - name: app
        livenessProbe:
          httpGet:
            path: /health/live
            port: http
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 30  # 5s * 30 = 150s max startup time
Probe Best Practices:
| Probe | Purpose | Failure Action |
|---|---|---|
| startupProbe | App initialization | Delays other probes |
| livenessProbe | Deadlock detection | Container restart |
| readinessProbe | Traffic routing | Remove from service |
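How long Kubernetes tolerates a failing probe follows from its settings: roughly `periodSeconds * failureThreshold`, plus `initialDelaySeconds` for the first run after the container starts. The sketch below works through that arithmetic for the probe values shown earlier. It is an approximation that ignores `timeoutSeconds` and scheduling jitter, and the function names are hypothetical. The fallbacks are the Kubernetes defaults (`initialDelaySeconds` 0, `periodSeconds` 10, `failureThreshold` 3).

```typescript
// Approximate probe timing from the probe settings (hypothetical helpers).
interface Probe {
  initialDelaySeconds?: number;
  periodSeconds?: number;
  failureThreshold?: number;
}

// Time a running container may keep failing before the failure action fires.
function failureBudgetSeconds(probe: Probe): number {
  const period = probe.periodSeconds ?? 10;
  const threshold = probe.failureThreshold ?? 3;
  return period * threshold;
}

// Time from container start until the probe gives up, including the initial delay.
function deadlineFromStartSeconds(probe: Probe): number {
  return (probe.initialDelaySeconds ?? 0) + failureBudgetSeconds(probe);
}

const startupProbe: Probe = { initialDelaySeconds: 5, periodSeconds: 5, failureThreshold: 30 };
const livenessProbe: Probe = { initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3 };
const readinessProbe: Probe = { initialDelaySeconds: 10, periodSeconds: 5, failureThreshold: 3 };

// Startup: 5 + 5 * 30 = 155 seconds before the container is killed.
console.log("startup deadline (s):", deadlineFromStartSeconds(startupProbe));
// Liveness: 10 * 3 = 30 seconds of failures before a restart.
console.log("liveness budget (s):", failureBudgetSeconds(livenessProbe));
// Readiness: 5 * 3 = 15 seconds of failures before removal from the Service.
console.log("readiness budget (s):", failureBudgetSeconds(readinessProbe));
```

If a service's slowest realistic cold start approaches the startup deadline, raising `failureThreshold` on the startup probe is usually safer than lengthening the liveness settings, because liveness and readiness only begin once the startup probe has succeeded.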
Observability Stack
OpenTelemetry Configuration:
    // tracing.ts
    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
    import { Resource } from '@opentelemetry/resources'; // class export in SDK 1.x

    const sdk = new NodeSDK({
      resource: new Resource({
        'service.name': process.env.OTEL_SERVICE_NAME,
        'service.version': process.env.npm_package_version,
        'deployment.environment': process.env.NODE_ENV
      }),
      // With no explicit url, the HTTP exporter reads OTEL_EXPORTER_OTLP_ENDPOINT
      // and appends /v1/traces. The endpoint must be the collector's OTLP/HTTP
      // port (4318), not the gRPC port (4317).
      traceExporter: new OTLPTraceExporter(),
      instrumentations: [getNodeAutoInstrumentations()]
    });

    sdk.start();
Prometheus Metrics:
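    import express from 'express';
    import { collectDefaultMetrics, Registry, Counter, Histogram } from 'prom-client';

    const app = express();
    const register = new Registry();
    collectDefaultMetrics({ register });

    // Custom metrics
    const httpRequestDuration = new Histogram({
      name: 'http_request_duration_seconds',
      help: 'Duration of HTTP requests in seconds',
      labelNames: ['method', 'route', 'status'],
      registers: [register]
    });

    const httpRequestTotal = new Counter({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'route', 'status'],
      registers: [register]
    });

    // Record both metrics once each response has been sent
    app.use((req, res, next) => {
      const stopTimer = httpRequestDuration.startTimer();
      res.on('finish', () => {
        const labels = {
          method: req.method,
          route: req.route?.path ?? req.path,
          status: String(res.statusCode)
        };
        stopTimer(labels);
        httpRequestTotal.inc(labels);
      });
      next();
    });

    // Metrics endpoint
    app.get('/metrics', async (_req, res) => {
      res.set('Content-Type', register.contentType);
      res.end(await register.metrics());
    });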
Observability Stack Deployment:
    # docker-compose.observability.yml
    services:
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus_data:/prometheus

      grafana:
        image: grafana/grafana:latest
        ports:
          - "3001:3000"
        environment:
          GF_SECURITY_ADMIN_PASSWORD: admin  # local development only
        volumes:
          - grafana_data:/var/lib/grafana

      jaeger:
        image: jaegertracing/all-in-one:latest
        ports:
          - "16686:16686"  # UI
          - "4317:4317"    # OTLP gRPC
          - "4318:4318"    # OTLP HTTP

      phoenix:
        image: arizephoenix/phoenix:latest
        ports:
          - "6006:6006"    # UI
          # Host port 4317 is already taken by Jaeger, so Phoenix's OTLP port is
          # published on another host port (4319 is an arbitrary free choice).
          - "4319:4317"    # OTLP

    volumes:
      prometheus_data:
      grafana_data:
Prometheus Scrape Config:
    # prometheus.yml
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'app'
        static_configs:
          - targets: ['app:3000']
        metrics_path: '/metrics'

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
Related Documentation
- Kubernetes Cluster Setup
- Helm Charts Reference
- Golden Workflow CI/CD
- OpenTelemetry Configuration
- Development Workflow
Version: 1.0.0
Last Updated: 2026-01-01
Maintainer: BlueFly Agent Platform Team