README
Vast.ai GPU Cloud Integration
AUTHORITATIVE SOURCE: BULLETPROOF_VASTAI_PLAN.md
Version: 3.0.0 | Last Updated: 2026-01-04 | Status: PRODUCTION - Canonical Event Schema, Registry Service, Security Complete
Complete Implementation Plan: See BULLETPROOF_VASTAI_PLAN.md for full details including Cloudflare Tunnel + Tailscale integration, agent-docker service, and CI/CD components.
TL;DR
Production-ready Vast.ai integration with:
- Canonical Event Schema - Single source of truth for all Vast.ai events
- Security - HMAC verification, replay protection, rate limiting
- Service Discovery - Authoritative registry for GPU instances
- OpenAPI 3.1 - Complete API specification
- PyWorker SDK - TypeScript port for serverless deployment
Implementation Status
| Component | Location | Status |
|---|---|---|
| Canonical Event Schema | agent-router/src/infrastructure/deployment/vastai/events.ts | Complete (289 lines) |
| Security Utilities | agent-router/src/infrastructure/deployment/vastai/security.ts | Complete (247 lines) |
| Registry Service | agent-mesh/src/services/vastai-registry.service.ts | Complete (218 lines) |
| Registry API | agent-mesh/src/api/vastai-registry.routes.ts | Complete (143 lines) |
| OpenAPI Spec | agent-mesh/openapi/vastai-registry.openapi.yml | Complete (OpenAPI 3.1) |
| Duo Gateway Integration | agent-mesh/src/api/duo-gateway.routes.ts | Complete (17 event types) |
| PyWorker SDK | agent-router/src/infrastructure/deployment/vastai/ | Complete |
| Cloudflared Tunnel | mesh.bluefly.internal | Configured |
| agent-mesh | common_npm/agent-mesh (port 3005) | Running |
Architecture
Separation of Duties
Complete Reference: See Separation of Duties and Separation of Duties Audit
| Responsibility | Project | Location |
|---|---|---|
| Event Definitions | agent-router | src/infrastructure/deployment/vastai/events.ts |
| Security | agent-router | src/infrastructure/deployment/vastai/security.ts |
| Service Discovery | agent-mesh | src/services/vastai-registry.service.ts |
| Registry API | agent-mesh | src/api/vastai-registry.routes.ts |
| Event Routing | agent-mesh | src/api/duo-gateway.routes.ts |
| Webhook Handling | platform-agents | src/triggers/vastai-webhook.ts |
| Docker Operations | agent-docker | src/services/vastai-docker.service.ts |
| CI/CD Components | gitlab_components | templates/vastai-deploy/template.yml |
Canonical Event Schema
Single Source of Truth: @bluefly/agent-router/infrastructure/deployment/vastai/events
17 event types using dot-notation:
- vastai.instance.* - Lifecycle events (created, provisioning, ready, failed, terminated)
- vastai.deployment.* - Deployment events (started, completed, failed)
- vastai.cost.* - Cost events (sampled, threshold_warning, threshold_exceeded, budget_exceeded)
- vastai.health.* - Health events (check, degraded, unhealthy)
- vastai.mesh.* - Mesh events (registered, unregistered, heartbeat)
Usage:
    import { createEventEnvelope, VastEventType } from '@bluefly/agent-router/infrastructure/deployment/vastai/events';

    const event = createEventEnvelope('vastai.instance.created', payload, {
      triggerId: 'gitlab-pipeline-123',
      source: 'gitlab',
      idempotencyKey: crypto.randomUUID(),
    });
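To illustrate how dot-notation event names can be checked before an envelope is built, here is a minimal sketch of a type guard. The event names are transcribed from the category list above; `VAST_EVENT_TYPES`, `isVastEventType`, and `eventCategory` are hypothetical names for this example and may not match what events.ts actually exports.

```typescript
// Event names transcribed from the category list in this README.
// The real events.ts may define them differently.
const VAST_EVENT_TYPES = [
  'vastai.instance.created',
  'vastai.instance.provisioning',
  'vastai.instance.ready',
  'vastai.instance.failed',
  'vastai.instance.terminated',
  'vastai.deployment.started',
  'vastai.deployment.completed',
  'vastai.deployment.failed',
  'vastai.cost.sampled',
  'vastai.cost.threshold_warning',
  'vastai.cost.threshold_exceeded',
  'vastai.cost.budget_exceeded',
  'vastai.health.check',
  'vastai.health.degraded',
  'vastai.health.unhealthy',
  'vastai.mesh.registered',
  'vastai.mesh.unregistered',
  'vastai.mesh.heartbeat',
] as const;

type KnownVastEventType = (typeof VAST_EVENT_TYPES)[number];

// Type guard: narrows an arbitrary string to a known event name.
function isVastEventType(value: string): value is KnownVastEventType {
  return VAST_EVENT_TYPES.some((known) => known === value);
}

// Hypothetical helper: returns the middle segment ("instance", "cost", ...).
function eventCategory(type: KnownVastEventType): string {
  const [, category = ''] = type.split('.');
  return category;
}
```

A guard like this lets a webhook handler reject unknown event names before they reach routing.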
Security
Location: @bluefly/agent-router/infrastructure/deployment/vastai/security
Security features:
- HMAC Signature Verification - Timing-safe comparison
- Replay Protection - Event ID cache with TTL
- Rate Limiting - Per trigger_id (100 req/min default)
- Timestamp Validation - 5-minute window
- Payload Size Limits - 1MB maximum
- Control Character Stripping - Input sanitization
Usage:
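    // TimestampValidator, InMemoryRateLimiter, and PayloadSizeValidator are
    // assumed to be exported from the same security module.
    import {
      WebhookVerifier,
      WebhookSignatureVerifier,
      InMemoryReplayCache,
      TimestampValidator,
      InMemoryRateLimiter,
      PayloadSizeValidator,
    } from '@bluefly/agent-router/infrastructure/deployment/vastai/security';

    const verifier = new WebhookVerifier(
      new WebhookSignatureVerifier(secret),
      new InMemoryReplayCache(),
      new TimestampValidator(300),           // 5-minute window, in seconds
      new InMemoryRateLimiter(),
      new PayloadSizeValidator(1024 * 1024)  // 1MB maximum
    );

    const result = await verifier.verify(rawPayload, headers, triggerId);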
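For readers unfamiliar with the first and fourth checks, the following is a minimal sketch of HMAC verification with a timing-safe comparison plus a timestamp window. It is not the contents of security.ts: the HMAC-SHA256 algorithm, the hex encoding, and the function names `signPayload`, `verifySignature`, and `isTimestampFresh` are all assumptions made for illustration.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumption: the signature is a hex-encoded HMAC-SHA256 of the raw body.
function signPayload(secret: string, rawPayload: string): string {
  return createHmac('sha256', secret).update(rawPayload).digest('hex');
}

function verifySignature(secret: string, rawPayload: string, signatureHex: string): boolean {
  const expected = Buffer.from(signPayload(secret, rawPayload), 'hex');
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}

// Accepts events whose timestamp is within the window (300s = 5 minutes).
function isTimestampFresh(eventTimeMs: number, nowMs: number, windowSeconds = 300): boolean {
  return Math.abs(nowMs - eventTimeMs) <= windowSeconds * 1000;
}
```

Verification must run against the raw request body, since re-serializing parsed JSON can change the bytes and invalidate the signature.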
Service Discovery Registry
Location: agent-mesh/src/services/vastai-registry.service.ts
Authoritative registry for active GPU instances:
- TTL-based expiration (300s default)
- Health monitoring via heartbeat
- EventEmitter for lifecycle events
- Filtering by environment, trigger_id, status, capabilities
API Endpoints (/api/v1/vastai/registry):
- POST /register - Register instance
- GET / - List instances (with filters)
- GET /:instanceId - Get instance by ID
- DELETE /:instanceId - Deregister instance
- POST /:instanceId/heartbeat - Update heartbeat
OpenAPI Spec: agent-mesh/openapi/vastai-registry.openapi.yml
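The TTL and heartbeat behavior described above can be sketched as a small in-memory map. This is an illustration under stated assumptions, not the code in vastai-registry.service.ts: the class name `TtlRegistry`, its methods, and the injectable clock are hypothetical.

```typescript
interface RegistryEntry {
  instanceId: number;
  status: string;
  lastHeartbeatMs: number;
}

class TtlRegistry {
  private readonly entries = new Map<number, RegistryEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number = 300000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  register(instanceId: number, status: string): void {
    this.entries.set(instanceId, { instanceId, status, lastHeartbeatMs: this.now() });
  }

  // Returns false when the instance is unknown or already expired.
  heartbeat(instanceId: number): boolean {
    this.prune();
    const entry = this.entries.get(instanceId);
    if (entry === undefined) return false;
    entry.lastHeartbeatMs = this.now();
    return true;
  }

  list(): RegistryEntry[] {
    this.prune();
    return Array.from(this.entries.values());
  }

  // Drops entries whose last heartbeat is older than the TTL.
  private prune(): void {
    const cutoff = this.now() - this.ttlMs;
    for (const entry of Array.from(this.entries.values())) {
      if (entry.lastHeartbeatMs < cutoff) this.entries.delete(entry.instanceId);
    }
  }
}
```

With the 300s default, an instance that heartbeats every 60 seconds tolerates several missed beats before it drops out of the list.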
Event Routing
Location: agent-mesh/src/api/duo-gateway.routes.ts
All Vast.ai events route to appropriate OSSA agents:
- Lifecycle events -> cluster-operator
- Cost events -> cost-intelligence-monitor
- Health events -> cluster-operator
- Mesh events -> cluster-operator
Machine events (vastai.*) require explicit routing - no defaults.
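One way to enforce "explicit routing, no defaults" is a lookup that throws on any unmapped category. The sketch below uses only the four mappings listed above and assumes that lifecycle events correspond to the vastai.instance.* category; `AGENT_BY_CATEGORY` and `routeEvent` are hypothetical names, not the duo-gateway implementation.

```typescript
// Mappings taken from the list in this README. Categories without an
// entry here are deliberately unroutable.
const AGENT_BY_CATEGORY = new Map<string, string>([
  ['instance', 'cluster-operator'],
  ['cost', 'cost-intelligence-monitor'],
  ['health', 'cluster-operator'],
  ['mesh', 'cluster-operator'],
]);

function routeEvent(eventType: string): string {
  const parts = eventType.split('.');
  if (parts[0] !== 'vastai' || parts.length < 3) {
    throw new Error(`Not a vastai.* event: ${eventType}`);
  }
  const agent = AGENT_BY_CATEGORY.get(parts[1] ?? '');
  if (agent === undefined) {
    // No fallback agent: an unmapped event is an error, not a default route.
    throw new Error(`No explicit route for ${eventType}`);
  }
  return agent;
}
```

Failing loudly on an unmapped category surfaces a missing route at the gateway instead of silently delivering the event to the wrong agent.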
Network Endpoints
| Context | Endpoint | Use |
|---|---|---|
| Registry API (public) | https://mesh.bluefly.internal/api/v1/vastai/registry | Cloudflare tunnel |
| Registry API (local) | http://localhost:3005/api/v1/vastai/registry | Local development |
| Vast.ai (public) | storage.blueflyagents.com | Cloudflare tunnel |
| Local (private) | blueflynas.tailcf98b3.ts.net:9000 | Tailscale mesh |
Rule: Cloudflare = Public ONLY. Tailscale = Private ONLY. Never mix.
Quick Start - PyWorker SDK
    import { createVastWorker, WorkerConfig } from '@bluefly/agent-router/infrastructure/deployment/vastai';

    const worker = createVastWorker({
      modelServerUrl: 'http://127.0.0.1',
      modelServerPort: 8000,
      workerPort: 3000, // Vast.ai expects port 3000
      handlers: [{
        route: '/v1/embeddings',
        workloadCalculator: (data) => (data.input as string[]).length,
        allowParallelRequests: true,
      }],
      logActionConfig: {
        onLoad: ['Application startup complete'],
        onError: ['RuntimeError', 'CUDA out of memory'],
      },
    });

    await worker.run();
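The `workloadCalculator` in the quick start casts `data.input` to `string[]`. If the model server also accepts a single string as `input` (an assumption here, based on OpenAI-style embeddings requests), that cast would count characters instead of inputs. A more defensive calculator might look like the following sketch; `embeddingWorkload` is a hypothetical name.

```typescript
// Counts how many inputs an embeddings request carries.
// Assumption: `input` is either one string or an array of inputs.
function embeddingWorkload(data: Record<string, unknown>): number {
  const input = data['input'];
  if (typeof input === 'string') return 1;
  if (Array.isArray(input)) return input.length;
  throw new TypeError('`input` must be a string or an array');
}
```

It can be passed directly as `workloadCalculator: embeddingWorkload` in the handler configuration.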
GPU Pricing
| GPU | $/hr | Best For |
|---|---|---|
| RTX 4090 | $0.34 | Inference, embeddings |
| A100 40GB | $0.66 | Training |
| H100 | $1.99 | Production fine-tuning |
Related Documentation
- vast-ai-serverless.md - PyWorker SDK Guide
- gpu-cluster-status.md - Active instances status
- OpenAPI Spec - Registry API specification
Code Locations
- Event Schema: common_npm/agent-router/src/infrastructure/deployment/vastai/events.ts
- Security: common_npm/agent-router/src/infrastructure/deployment/vastai/security.ts
- Registry Service: common_npm/agent-mesh/src/services/vastai-registry.service.ts
- Registry API: common_npm/agent-mesh/src/api/vastai-registry.routes.ts
- OpenAPI Spec: common_npm/agent-mesh/openapi/vastai-registry.openapi.yml
- Duo Gateway: common_npm/agent-mesh/src/api/duo-gateway.routes.ts
Best Practices
- DRY: Single source of truth for events (agent-router)
- SOLID: Clear separation of responsibilities
- OpenAPI-First: Spec before implementation
- Type Safety: TypeScript + Zod validation
- Security: HMAC, replay protection, rate limiting
- Idempotency: Required for mutating actions
Network Architecture
Network Separation (CRITICAL)
Key Principle: Cloudflare = Public Ingress ONLY. Tailscale = Private Access ONLY. These planes must NEVER be mixed.
Public access (Cloudflare Tunnel), for Vast.ai and external clients:
- Vast.ai GPU instances -> HTTPS -> storage.blueflyagents.com (Cloudflare Tunnel, :9000)
- Cloudflare Tunnel -> Synology NAS (192.168.68.60) -> MinIO container (:9000)

Private access (Tailscale), for local development:
- Mac M4/M3 (OrbStack) -> Tailscale (WireGuard) -> blueflynas.tailcf98b3.ts.net:9000 (private mesh network)

Existing routes:
| Hostname | Service |
|---|---|
| nas.blueflyagents.com | DSM Web UI (:5001) |
| api.blueflyagents.com | Webhook Server (:3001) |
| storage.blueflyagents.com | MinIO S3 (:9000) |
| mesh.bluefly.internal | Registry API (:3005) |
Synology NAS Integration
Network Configuration:
    synology_nas:
      # Physical device
      local_ip: "192.168.68.60"
      dsm_port: 5001

      # Public access (Cloudflare Tunnel) - for Vast.ai
      public:
        s3_endpoint: "https://storage.blueflyagents.com"
        dsm_ui: "https://nas.blueflyagents.com"

      # Private access (Tailscale) - for local development
      private:
        s3_endpoint: "http://blueflynas.tailcf98b3.ts.net:9000"
        dsm_ui: "https://blueflynas.tailcf98b3.ts.net:5001"

      # S3 bucket structure
      buckets:
        bluefly-models:
          paths:
            training-data: "/volume1/llm-platform/training-data/"
            checkpoints: "/volume1/llm-platform/checkpoints/"
            models: "/volume1/llm-platform/models/"
            artifacts: "/volume1/llm-platform/artifacts/"
Using agent-tailscale for NAS Access:
    import { TailscaleDiscovery } from '@bluefly/agent-tailscale';
    import { S3Client } from '@aws-sdk/client-s3';

    const discovery = new TailscaleDiscovery();

    // Find the NAS by hostname
    const peers = await discovery.discoverPeers({ online: true });
    const nas = peers.find(p => p.hostname === 'blueflynas');
    if (!nas) throw new Error('NAS not found on Tailscale');

    const s3Client = new S3Client({
      endpoint: `http://${nas.tailscaleIP}:9000`, // 100.104.119.76
      region: 'us-east-1', // placeholder; the SDK requires a region even for MinIO
      credentials: {
        accessKeyId: process.env.MINIO_ACCESS_KEY!,
        secretAccessKey: process.env.MINIO_SECRET_KEY!,
      },
      forcePathStyle: true, // MinIO serves buckets as URL paths
    });

    // Or use the known hostname directly:
    // endpoint: 'http://blueflynas.tailcf98b3.ts.net:9000'
GPU Pricing & Strategy
Pricing Model
| Instance Type | Characteristics | Best For | Discount |
|---|---|---|---|
| On-Demand | Fixed price, guaranteed | Production inference | Baseline |
| Interruptible | Bidding, may interrupt | Batch training | 50-80% off |
| Reserved | Pre-paid commitment | Long-term projects | 20-40% off |
GPU Selection Matrix
| GPU | Price/hr | VRAM | Best Use Case |
|---|---|---|---|
| RTX 4090 | ~$0.34 | 24GB | Testing, inference, embeddings |
| RTX 3090 | ~$0.25 | 24GB | Budget training |
| A100 40GB | ~$0.66 | 40GB | Model training |
| A100 80GB | ~$0.80 | 80GB | Large model training |
| H100 | ~$1.99 | 80GB | Production training, fine-tuning |
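The matrix can be turned into a simple selection rule: pick the cheapest GPU that meets a VRAM requirement. The sketch below hard-codes the approximate prices from the table; `GPU_OPTIONS` and `cheapestGpu` are hypothetical names, and live Vast.ai prices will differ.

```typescript
interface GpuOption {
  name: string;
  pricePerHour: number; // approximate USD/hr from the table above
  vramGb: number;
}

const GPU_OPTIONS: GpuOption[] = [
  { name: 'RTX 4090', pricePerHour: 0.34, vramGb: 24 },
  { name: 'RTX 3090', pricePerHour: 0.25, vramGb: 24 },
  { name: 'A100 40GB', pricePerHour: 0.66, vramGb: 40 },
  { name: 'A100 80GB', pricePerHour: 0.80, vramGb: 80 },
  { name: 'H100', pricePerHour: 1.99, vramGb: 80 },
];

// Returns the lowest-priced option with at least `minVramGb` of VRAM,
// or undefined when nothing in the table is large enough.
function cheapestGpu(minVramGb: number): GpuOption | undefined {
  return GPU_OPTIONS
    .filter((gpu) => gpu.vramGb >= minVramGb)
    .sort((a, b) => a.pricePerHour - b.pricePerHour)[0];
}
```

Price alone ignores throughput, so this is a starting filter rather than a full selection policy.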
Cost Projections
Monthly Estimate (Moderate Usage):
| Category | Hours/Month | Rate | Monthly Cost |
|---|---|---|---|
| Model Training (A100) | 40 | $0.66/hr | $26.40 |
| Embedding Generation (RTX 4090) | 80 | $0.34/hr | $27.20 |
| Inference Endpoints (serverless) | ~200 | $0.34/hr | $68.00 |
| Total GPU | | | ~$122/month |
| Synology NAS (self-hosted) | - | - | $0 |
| Grand Total | | | ~$122/month |
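The total follows directly from hours multiplied by hourly rate. A short sketch of the arithmetic, using the table's figures (the ~200 inference hours are an estimate); `USAGE` and `monthlyCost` are names chosen for this example.

```typescript
interface UsageLine {
  category: string;
  hours: number;
  ratePerHour: number;
}

const USAGE: UsageLine[] = [
  { category: 'Model Training (A100)', hours: 40, ratePerHour: 0.66 },
  { category: 'Embedding Generation (RTX 4090)', hours: 80, ratePerHour: 0.34 },
  { category: 'Inference Endpoints (serverless)', hours: 200, ratePerHour: 0.34 },
];

// Sums hours * rate and rounds to whole cents.
function monthlyCost(lines: UsageLine[]): number {
  const total = lines.reduce((sum, line) => sum + line.hours * line.ratePerHour, 0);
  return Math.round(total * 100) / 100;
}
```

For the three lines above this gives 26.40 + 27.20 + 68.00 = 121.60, which the table rounds to ~$122/month.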
Comparison with Cloud Providers:
| Provider | Similar Workload | Monthly Cost | Savings |
|---|---|---|---|
| AWS SageMaker | Same GPU hours | ~$800 | 85% |
| GCP Vertex AI | Same GPU hours | ~$700 | 83% |
| Azure ML | Same GPU hours | ~$750 | 84% |
| Vast.ai | Same GPU hours | ~$122 | Baseline |
Savings: 60-85% vs major cloud providers
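The percentages in the comparison table are one minus the ratio of the Vast.ai monthly cost to the other provider's monthly cost. A minimal sketch, with `savingsPercent` as a hypothetical helper name:

```typescript
// Percentage saved by paying `vastMonthly` instead of `otherMonthly`,
// rounded to the nearest whole percent.
function savingsPercent(vastMonthly: number, otherMonthly: number): number {
  return Math.round((1 - vastMonthly / otherMonthly) * 100);
}
```

For example, against the ~$800 AWS SageMaker estimate: 1 - 122/800 = 0.8475, which rounds to 85%.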
Registry API Examples
    # Register instance
    curl -X POST https://mesh.bluefly.internal/api/v1/vastai/registry/register \
      -H "Content-Type: application/json" \
      -d '{
        "instance_id": 29484611,
        "contract_id": 12345,
        "tailscale_ip": "100.113.211.78",
        "tailscale_hostname": "vastai-gpu-worker-1",
        "capabilities": ["inference", "embeddings"],
        "status": "ready",
        "environment": "prod",
        "trigger_id": "gitlab-pipeline-123",
        "gpu_type": "RTX_4090",
        "gpu_name": "NVIDIA RTX 4090",
        "cost_per_hour": 0.34
      }'

    # List instances
    curl "https://mesh.bluefly.internal/api/v1/vastai/registry?environment=prod&status=ready"

    # Heartbeat
    curl -X POST https://mesh.bluefly.internal/api/v1/vastai/registry/29484611/heartbeat
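The same three calls can be prepared from TypeScript. The sketch below only builds request descriptions and does not send them; the `RegisterPayload` fields mirror the curl body above, while `HttpRequest`, `registerRequest`, `listRequest`, and `heartbeatRequest` are hypothetical helpers, not part of any published client.

```typescript
interface RegisterPayload {
  instance_id: number;
  contract_id: number;
  tailscale_ip: string;
  tailscale_hostname: string;
  capabilities: string[];
  status: string;
  environment: string;
  trigger_id: string;
  gpu_type: string;
  gpu_name: string;
  cost_per_hour: number;
}

interface HttpRequest {
  url: string;
  method: 'GET' | 'POST';
  headers: Record<string, string>;
  body?: string;
}

const REGISTRY_BASE = 'https://mesh.bluefly.internal/api/v1/vastai/registry';

function registerRequest(payload: RegisterPayload): HttpRequest {
  return {
    url: `${REGISTRY_BASE}/register`,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  };
}

// Filters become URL-encoded query parameters, e.g. environment=prod.
function listRequest(filters: Record<string, string>): HttpRequest {
  const query = Object.keys(filters)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(filters[key] ?? '')}`)
    .join('&');
  return { url: query ? `${REGISTRY_BASE}?${query}` : REGISTRY_BASE, method: 'GET', headers: {} };
}

function heartbeatRequest(instanceId: number): HttpRequest {
  return { url: `${REGISTRY_BASE}/${instanceId}/heartbeat`, method: 'POST', headers: {} };
}
```

Each result can be handed to an HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.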
Vast.ai CLI Commands
    # Search for RTX 4090 instances
    vastai search offers 'gpu_name=RTX_4090 reliability>0.95' -o 'dph+'

    # Create instance
    vastai create instance <OFFER_ID> --image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime --disk 50

    # Show running instances
    vastai show instances

    # Destroy instance
    vastai destroy instance <INSTANCE_ID>
NAS Storage Operations
    # List checkpoints on NAS (via Tailscale - private)
    aws s3 ls s3://bluefly-models/checkpoints/ \
      --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

    # Upload training data (via Tailscale - private)
    aws s3 sync ./data s3://bluefly-models/datasets/gov-rfp/ \
      --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

    # Download model checkpoint (via Tailscale - private)
    aws s3 cp s3://bluefly-models/checkpoints/gov-rfp/latest.pt ./model.pt \
      --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000
Project Mapping
GPU-Intensive Workloads (Offload to Vast.ai)
| Workload | Project | GPU Need | Vast.ai Instance Type | Estimated Cost |
|---|---|---|---|---|
| RFP Document Processing | models/gov-rfp_model | HIGH | A100 40GB | $0.66/hr |
| Policy Compliance Training | models/civicpolicy_model | HIGH | A100 40GB | $0.66/hr |
| Platform Optimization | models/llm-platform_model | MEDIUM | RTX 4090 | $0.34/hr |
| Agent Development Patterns | models/agent-studio_model | MEDIUM | RTX 4090 | $0.34/hr |
| Vector Embeddings | common_npm/agent-brain | HIGH | RTX 4090 | $0.34/hr |
| Document Analysis | common_npm/rfp-automation | HIGH | A100 40GB | $0.66/hr |
CPU Workloads (Keep Local on OrbStack)
| Service | Project | Why Keep Local |
|---|---|---|
| Chat Interface | agent-chat | Low latency required |
| Workflow Engine | workflow-engine | Stateful, Langflow integration |
| Agent Operations | agent-ops | Local orchestration |
| PostgreSQL | Infrastructure | Stateful, data sovereignty |
| Redis | Infrastructure | Low-latency cache |
| Qdrant | Infrastructure | Vector DB (query only) |
CI/CD Variables
| Variable | Type | Protected | Masked | Value/Description |
|---|---|---|---|---|
| VASTAI_API_KEY | Variable | Yes | Yes | Vast.ai API key |
| VASTAI_SSH_KEY | File | Yes | No | SSH private key for GPU instances |
| SYNOLOGY_S3_ENDPOINT | Variable | No | No | https://storage.blueflyagents.com (public, Cloudflare) |
| SYNOLOGY_S3_ENDPOINT_PRIVATE | Variable | Yes | No | http://blueflynas.tailcf98b3.ts.net:9000 (private, Tailscale) |
| MINIO_ACCESS_KEY | Variable | Yes | Yes | MinIO access key |
| MINIO_SECRET_KEY | Variable | Yes | Yes | MinIO secret key |
| MLFLOW_TRACKING_URI | Variable | No | No | GitLab MLflow endpoint (auto-set) |
| WEBHOOK_SECRET | Variable | Yes | Yes | HMAC secret for webhook verification |
Network Note: Vast.ai instances run externally and MUST use the public endpoint via Cloudflare Tunnel. GitLab runners on the local network can use the private Tailscale endpoint for faster access.
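A job script can make this choice explicit instead of relying on whichever variable happens to be set. The following is a minimal sketch that reads the two endpoint variables from the table above; `RunnerLocation` and `resolveS3Endpoint` are hypothetical names, and how a job learns its own location is left as an assumption.

```typescript
type RunnerLocation = 'vastai' | 'local';

// Picks the S3 endpoint for the given location:
// external Vast.ai instances get the public Cloudflare endpoint,
// local runners get the private Tailscale endpoint.
function resolveS3Endpoint(
  location: RunnerLocation,
  env: Record<string, string | undefined>,
): string {
  const name = location === 'vastai' ? 'SYNOLOGY_S3_ENDPOINT' : 'SYNOLOGY_S3_ENDPOINT_PRIVATE';
  const value = env[name];
  if (value === undefined || value === '') {
    // Fail instead of falling back to the other plane.
    throw new Error(`${name} is not set`);
  }
  return value;
}
```

In a Node script this would be called as `resolveS3Endpoint('vastai', process.env)`. Throwing when the variable is missing keeps a misconfigured job from silently crossing between the public and private planes.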
Environment Variables
    # Vast.ai API tokens (set in CI/CD or .env)
    VASTAI_CLUSTER_OP_KEY=       # Instance management
    VASTAI_COST_MONITOR_KEY=     # Billing/cost access
    VASTAI_TASK_DISPATCH_KEY=    # Task coordination

    # Tailscale (optional - for automated joining)
    TAILSCALE_AUTHKEY=           # Pre-auth key for mesh join

    # Webhook security
    WEBHOOK_SECRET=              # HMAC secret for webhook verification
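A startup check can report missing variables before any service tries to use them. This sketch assumes that every variable listed above except the optional `TAILSCALE_AUTHKEY` is required, which the source does not state outright; `REQUIRED_ENV_VARS` and `missingEnvVars` are hypothetical names.

```typescript
// Assumption: all listed variables are required except TAILSCALE_AUTHKEY,
// which this README marks as optional.
const REQUIRED_ENV_VARS = [
  'VASTAI_CLUSTER_OP_KEY',
  'VASTAI_COST_MONITOR_KEY',
  'VASTAI_TASK_DISPATCH_KEY',
  'WEBHOOK_SECRET',
];

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}
```

Called with `process.env` at startup, a non-empty result is a reason to exit early with the list of missing names.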
Troubleshooting
Instance Not Appearing in Registry
- Check instance heartbeat: curl -X POST /api/v1/vastai/registry/:instanceId/heartbeat
- Verify TTL: Default is 300s, instance must heartbeat within this window
- Check registry logs: agent-mesh service logs
Webhook Verification Failing
- Verify WEBHOOK_SECRET matches sender
- Check timestamp: Events older than 5 minutes are rejected
- Verify signature header: X-Signature must be present
- Check rate limits: 100 req/min per trigger_id
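To reason about when the rate limit starts rejecting, it helps to see one possible limiter. The sketch below uses a fixed one-minute window per trigger_id. The algorithm actually used by `InMemoryRateLimiter` is not described in this README, so the fixed-window choice and the class name `FixedWindowRateLimiter` are assumptions.

```typescript
interface RateWindow {
  startMs: number;
  count: number;
}

class FixedWindowRateLimiter {
  private readonly windows = new Map<string, RateWindow>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number = 100, windowMs: number = 60000, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true when the request is within the limit for this trigger_id.
  allow(triggerId: string): boolean {
    const t = this.now();
    const current = this.windows.get(triggerId);
    if (current === undefined || t - current.startMs >= this.windowMs) {
      // First request, or the previous window has elapsed: start a new one.
      this.windows.set(triggerId, { startMs: t, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

Under this scheme a burst that exhausts the limit stays blocked until the window that began with its first request ends, which can explain rejections that clear up on their own within a minute.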
Event Routing Errors
- Verify event type: Must be valid vastai.* event type
- Check agent mapping: All vastai.* events must have explicit routing
- Review duo-gateway logs: agent-mesh service logs
Last Updated: 2026-01-04 | Status: Production-ready | Total Code: 1,270+ lines + OpenAPI spec
Purpose: Complete reference for AI bots and developers
Related Documentation
- BULLETPROOF_VASTAI_PLAN.md - Complete implementation plan with Cloudflare Tunnel + Tailscale integration
- Separation of Duties - Project responsibilities
- Separation of Duties Audit - Complete audit of all projects