README

Vast.ai GPU Cloud Integration

AUTHORITATIVE SOURCE: BULLETPROOF_VASTAI_PLAN.md

Version: 3.0.0 | Last Updated: 2026-01-04 | Status: PRODUCTION - Canonical Event Schema, Registry Service, Security Complete

Complete Implementation Plan: See BULLETPROOF_VASTAI_PLAN.md for full details including Cloudflare Tunnel + Tailscale integration, agent-docker service, and CI/CD components.

TL;DR

Production-ready Vast.ai integration with:

  • Canonical Event Schema - Single source of truth for all Vast.ai events
  • Security - HMAC verification, replay protection, rate limiting
  • Service Discovery - Authoritative registry for GPU instances
  • OpenAPI 3.1 - Complete API specification
  • PyWorker SDK - TypeScript port for serverless deployment

Implementation Status

| Component | Location | Status |
| --- | --- | --- |
| Canonical Event Schema | agent-router/src/infrastructure/deployment/vastai/events.ts | Complete (289 lines) |
| Security Utilities | agent-router/src/infrastructure/deployment/vastai/security.ts | Complete (247 lines) |
| Registry Service | agent-mesh/src/services/vastai-registry.service.ts | Complete (218 lines) |
| Registry API | agent-mesh/src/api/vastai-registry.routes.ts | Complete (143 lines) |
| OpenAPI Spec | agent-mesh/openapi/vastai-registry.openapi.yml | Complete (OpenAPI 3.1) |
| Duo Gateway Integration | agent-mesh/src/api/duo-gateway.routes.ts | Complete (17 event types) |
| PyWorker SDK | agent-router/src/infrastructure/deployment/vastai/ | Complete |
| Cloudflared Tunnel | mesh.bluefly.internal | Configured |
| agent-mesh | common_npm/agent-mesh (port 3005) | Running |

Architecture

Separation of Duties

Complete Reference: See Separation of Duties and Separation of Duties Audit

| Responsibility | Project | Location |
| --- | --- | --- |
| Event Definitions | agent-router | src/infrastructure/deployment/vastai/events.ts |
| Security | agent-router | src/infrastructure/deployment/vastai/security.ts |
| Service Discovery | agent-mesh | src/services/vastai-registry.service.ts |
| Registry API | agent-mesh | src/api/vastai-registry.routes.ts |
| Event Routing | agent-mesh | src/api/duo-gateway.routes.ts |
| Webhook Handling | platform-agents | src/triggers/vastai-webhook.ts |
| Docker Operations | agent-docker | src/services/vastai-docker.service.ts |
| CI/CD Components | gitlab_components | templates/vastai-deploy/template.yml |

Canonical Event Schema

Single Source of Truth: @bluefly/agent-router/infrastructure/deployment/vastai/events

17 event types using dot-notation:

  • vastai.instance.* - Lifecycle events (created, provisioning, ready, failed, terminated)
  • vastai.deployment.* - Deployment events (started, completed, failed)
  • vastai.cost.* - Cost events (sampled, threshold_warning, threshold_exceeded, budget_exceeded)
  • vastai.health.* - Health events (check, degraded, unhealthy)
  • vastai.mesh.* - Mesh events (registered, unregistered, heartbeat)

Usage:

    import { createEventEnvelope, VastEventType } from '@bluefly/agent-router/infrastructure/deployment/vastai/events';

    const event = createEventEnvelope('vastai.instance.created', payload, {
      triggerId: 'gitlab-pipeline-123',
      source: 'gitlab',
      idempotencyKey: crypto.randomUUID(),
    });
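To make the idea of an envelope concrete, the following is a minimal sketch of what a helper of this kind might produce. The interface fields, the `makeEnvelope` name, and the validation pattern are assumptions for illustration only; the authoritative schema is the one in events.ts.

```typescript
// Hypothetical envelope shape; field names are assumptions, not the schema in events.ts.
interface EnvelopeMeta {
  triggerId: string;
  source: string;
  idempotencyKey: string;
}

interface EventEnvelope<T> {
  type: string;      // dot-notation event type, e.g. "vastai.instance.created"
  timestamp: string; // ISO 8601 creation time
  payload: T;
  meta: EnvelopeMeta;
}

// Matches vastai.<category>.<name> for the five categories listed above.
const EVENT_TYPE_PATTERN = /^vastai\.(instance|deployment|cost|health|mesh)\.[a-z_]+$/;

// Hypothetical stand-in for createEventEnvelope: validates the type and stamps the time.
function makeEnvelope<T>(
  type: string,
  payload: T,
  meta: EnvelopeMeta,
  now: Date = new Date(),
): EventEnvelope<T> {
  if (!EVENT_TYPE_PATTERN.test(type)) {
    throw new Error(`Unknown event type: ${type}`);
  }
  return { type, timestamp: now.toISOString(), payload, meta };
}
```

Passing the clock in as a parameter keeps the sketch deterministic in tests; the idempotency key is carried unchanged so that a consumer can deduplicate retries.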

Security

Location: @bluefly/agent-router/infrastructure/deployment/vastai/security

Security features:

  • HMAC Signature Verification - Timing-safe comparison
  • Replay Protection - Event ID cache with TTL
  • Rate Limiting - Per trigger_id (100 req/min default)
  • Timestamp Validation - 5-minute window
  • Payload Size Limits - 1MB maximum
  • Control Character Stripping - Input sanitization

Usage:

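    // TimestampValidator, InMemoryRateLimiter and PayloadSizeValidator are used below,
    // so they are imported here too (assumed to be exported by the same security module).
    import {
      WebhookVerifier,
      WebhookSignatureVerifier,
      InMemoryReplayCache,
      TimestampValidator,
      InMemoryRateLimiter,
      PayloadSizeValidator,
    } from '@bluefly/agent-router/infrastructure/deployment/vastai/security';

    const verifier = new WebhookVerifier(
      new WebhookSignatureVerifier(secret),
      new InMemoryReplayCache(),
      new TimestampValidator(300),           // 5-minute window, in seconds
      new InMemoryRateLimiter(),
      new PayloadSizeValidator(1024 * 1024)  // 1MB maximum
    );

    const result = await verifier.verify(rawPayload, headers, triggerId);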
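For readers unfamiliar with the first two checks, here is a minimal sketch of HMAC verification with a timing-safe comparison, plus the 5-minute timestamp window. It is not the code in security.ts: the function names are hypothetical, and HMAC-SHA256 with hex encoding is an assumption, since the hash algorithm and signature encoding are not stated here.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute a hex-encoded HMAC of the raw request body (SHA-256 is an assumption).
function sign(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Compare signatures in constant time. A length mismatch is rejected up front
// because timingSafeEqual throws when the buffers differ in length.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(sign(rawBody, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}

// Accept only timestamps within the window (default 300 s, i.e. 5 minutes).
function isFresh(timestampMs: number, nowMs: number, windowSeconds = 300): boolean {
  return Math.abs(nowMs - timestampMs) <= windowSeconds * 1000;
}
```

The signature must be computed over the raw payload bytes, before any JSON parsing, which is why the verifier above takes `rawPayload` rather than a parsed object.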

Service Discovery Registry

Location: agent-mesh/src/services/vastai-registry.service.ts

Authoritative registry for active GPU instances:

  • TTL-based expiration (300s default)
  • Health monitoring via heartbeat
  • EventEmitter for lifecycle events
  • Filtering by environment, trigger_id, status, capabilities
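The TTL and heartbeat behaviour described above can be sketched as a small in-memory registry. This is an illustration under stated assumptions, not the code in vastai-registry.service.ts: the `TtlRegistry` class, its method names, and the entry fields are hypothetical, and eviction is done lazily on each call rather than on a timer.

```typescript
interface RegistryEntry {
  instanceId: number;
  status: string;
  lastHeartbeatMs: number;
}

class TtlRegistry {
  private readonly entries = new Map<number, RegistryEntry>();
  private readonly ttlMs: number;

  constructor(ttlSeconds = 300) {
    this.ttlMs = ttlSeconds * 1000;
  }

  register(instanceId: number, status: string, nowMs: number): void {
    this.entries.set(instanceId, { instanceId, status, lastHeartbeatMs: nowMs });
  }

  // Returns false when the instance is unknown (never registered or already expired).
  heartbeat(instanceId: number, nowMs: number): boolean {
    this.evictExpired(nowMs);
    const entry = this.entries.get(instanceId);
    if (!entry) return false;
    entry.lastHeartbeatMs = nowMs;
    return true;
  }

  // Lists only instances whose last heartbeat is within the TTL.
  list(nowMs: number): RegistryEntry[] {
    this.evictExpired(nowMs);
    return Array.from(this.entries.values());
  }

  private evictExpired(nowMs: number): void {
    this.entries.forEach((entry, id) => {
      if (nowMs - entry.lastHeartbeatMs > this.ttlMs) this.entries.delete(id);
    });
  }
}
```

With the 300 s default, an instance that registers and then stays silent drops out of `list()` after five minutes and has to register again.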

API Endpoints (/api/v1/vastai/registry):

  • POST /register - Register instance
  • GET / - List instances (with filters)
  • GET /:instanceId - Get instance by ID
  • DELETE /:instanceId - Deregister instance
  • POST /:instanceId/heartbeat - Update heartbeat

OpenAPI Spec: agent-mesh/openapi/vastai-registry.openapi.yml

Event Routing

Location: agent-mesh/src/api/duo-gateway.routes.ts

All Vast.ai events route to appropriate OSSA agents:

  • Lifecycle events → cluster-operator
  • Cost events → cost-intelligence-monitor
  • Health events → cluster-operator
  • Mesh events → cluster-operator

Machine events (vastai.*) require explicit routing - no defaults.
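The "explicit routing, no defaults" rule can be sketched as a lookup that fails loudly instead of falling back. The `routeEvent` function and the category-keyed map are hypothetical and are not the code in duo-gateway.routes.ts. Lifecycle events are keyed by the `instance` category, and deployment events are left unmapped because this section does not name an agent for them.

```typescript
// Category-to-agent mapping taken from the list above.
const ROUTES = new Map<string, string>([
  ["instance", "cluster-operator"],      // lifecycle events
  ["cost", "cost-intelligence-monitor"],
  ["health", "cluster-operator"],
  ["mesh", "cluster-operator"],
]);

// Resolve the target agent for a vastai.<category>.<name> event type.
// There is deliberately no default branch: an unmapped category is an error.
function routeEvent(eventType: string): string {
  const parts = eventType.split(".");
  if (parts.length !== 3 || parts[0] !== "vastai") {
    throw new Error(`Not a vastai.* event type: ${eventType}`);
  }
  const agent = ROUTES.get(parts[1]);
  if (agent === undefined) {
    throw new Error(`No explicit route for ${eventType}`);
  }
  return agent;
}
```

Throwing on an unmapped category makes a missing route visible at the gateway rather than silently delivering the event to the wrong agent.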

Network Endpoints

| Context | Endpoint | Use |
| --- | --- | --- |
| Registry API (public) | https://mesh.bluefly.internal/api/v1/vastai/registry | Cloudflare tunnel |
| Registry API (local) | http://localhost:3005/api/v1/vastai/registry | Local development |
| Vast.ai (public) | storage.blueflyagents.com | Cloudflare tunnel |
| Local (private) | blueflynas.tailcf98b3.ts.net:9000 | Tailscale mesh |

Rule: Cloudflare = Public ONLY. Tailscale = Private ONLY. Never mix.

Quick Start - PyWorker SDK

    import { createVastWorker, WorkerConfig } from '@bluefly/agent-router/infrastructure/deployment/vastai';

    const worker = createVastWorker({
      modelServerUrl: 'http://127.0.0.1',
      modelServerPort: 8000,
      workerPort: 3000, // Vast.ai expects port 3000
      handlers: [{
        route: '/v1/embeddings',
        workloadCalculator: (data) => (data.input as string[]).length,
        allowParallelRequests: true,
      }],
      logActionConfig: {
        onLoad: ['Application startup complete'],
        onError: ['RuntimeError', 'CUDA out of memory'],
      },
    });

    await worker.run();

GPU Pricing

| GPU | $/hr | Best For |
| --- | --- | --- |
| RTX 4090 | $0.34 | Inference, embeddings |
| A100 40GB | $0.66 | Training |
| H100 | $1.99 | Production fine-tuning |

Code Locations

  • Event Schema: common_npm/agent-router/src/infrastructure/deployment/vastai/events.ts
  • Security: common_npm/agent-router/src/infrastructure/deployment/vastai/security.ts
  • Registry Service: common_npm/agent-mesh/src/services/vastai-registry.service.ts
  • Registry API: common_npm/agent-mesh/src/api/vastai-registry.routes.ts
  • OpenAPI Spec: common_npm/agent-mesh/openapi/vastai-registry.openapi.yml
  • Duo Gateway: common_npm/agent-mesh/src/api/duo-gateway.routes.ts

Best Practices

  • DRY: Single source of truth for events (agent-router)
  • SOLID: Clear separation of responsibilities
  • OpenAPI-First: Spec before implementation
  • Type Safety: TypeScript + Zod validation
  • Security: HMAC, replay protection, rate limiting
  • Idempotency: Required for mutating actions



Network Architecture

Network Separation (CRITICAL)

Key Principle: Cloudflare = Public Ingress ONLY. Tailscale = Private Access ONLY. These planes must NEVER be mixed.


NETWORK ARCHITECTURE

PUBLIC ACCESS (Cloudflare Tunnel) - For Vast.ai & External
  Vast.ai GPU Instances --HTTPS--> storage.blueflyagents.com (Cloudflare Tunnel -> :9000)
                                     |
                                     v
                          Synology NAS (192.168.68.60)
                          MinIO Container (:9000)

PRIVATE ACCESS (Tailscale) - For Local Development
  Mac M4/M3 (OrbStack) --Tailscale (WireGuard)--> blueflynas.tailcf98b3.ts.net:9000 (Private mesh network)

EXISTING ROUTES:
  nas.blueflyagents.com      -> DSM Web UI (:5001)
  api.blueflyagents.com      -> Webhook Server (:3001)
  storage.blueflyagents.com  -> MinIO S3 (:9000)
  mesh.bluefly.internal      -> Registry API (:3005)

Synology NAS Integration

Network Configuration:

    synology_nas:
      # Physical device
      local_ip: "192.168.68.60"
      dsm_port: 5001

      # Public access (Cloudflare Tunnel) - for Vast.ai
      public:
        s3_endpoint: "https://storage.blueflyagents.com"
        dsm_ui: "https://nas.blueflyagents.com"

      # Private access (Tailscale) - for local development
      private:
        s3_endpoint: "http://blueflynas.tailcf98b3.ts.net:9000"
        dsm_ui: "https://blueflynas.tailcf98b3.ts.net:5001"

      # S3 bucket structure
      buckets:
        bluefly-models:
          paths:
            training-data: "/volume1/llm-platform/training-data/"
            checkpoints: "/volume1/llm-platform/checkpoints/"
            models: "/volume1/llm-platform/models/"
            artifacts: "/volume1/llm-platform/artifacts/"

Using agent-tailscale for NAS Access:

    import { TailscaleDiscovery } from '@bluefly/agent-tailscale';
    import { S3Client } from '@aws-sdk/client-s3';

    const discovery = new TailscaleDiscovery();

    // Find the NAS by hostname
    const peers = await discovery.discoverPeers({ online: true });
    const nas = peers.find(p => p.hostname === 'blueflynas');
    if (!nas) throw new Error('NAS not found on Tailscale');

    const s3Client = new S3Client({
      endpoint: `http://${nas.tailscaleIP}:9000`, // 100.104.119.76
      credentials: {
        accessKeyId: process.env.MINIO_ACCESS_KEY!,
        secretAccessKey: process.env.MINIO_SECRET_KEY!,
      },
      forcePathStyle: true,
    });

    // Or use the known hostname directly:
    // endpoint: 'http://blueflynas.tailcf98b3.ts.net:9000'

GPU Pricing & Strategy

Pricing Model

| Instance Type | Characteristics | Best For | Discount |
| --- | --- | --- | --- |
| On-Demand | Fixed price, guaranteed | Production inference | Baseline |
| Interruptible | Bidding, may interrupt | Batch training | 50-80% off |
| Reserved | Pre-paid commitment | Long-term projects | 20-40% off |

GPU Selection Matrix

| GPU | Price/hr | VRAM | Best Use Case |
| --- | --- | --- | --- |
| RTX 4090 | ~$0.34 | 24GB | Testing, inference, embeddings |
| RTX 3090 | ~$0.25 | 24GB | Budget training |
| A100 40GB | ~$0.66 | 40GB | Model training |
| A100 80GB | ~$0.80 | 80GB | Large model training |
| H100 | ~$1.99 | 80GB | Production training, fine-tuning |

Cost Projections

Monthly Estimate (Moderate Usage):

| Category | Hours/Month | Rate | Monthly Cost |
| --- | --- | --- | --- |
| Model Training (A100) | 40 | $0.66/hr | $26.40 |
| Embedding Generation (RTX 4090) | 80 | $0.34/hr | $27.20 |
| Inference Endpoints (serverless) | ~200 | $0.34/hr | $68.00 |
| Total GPU | | | ~$122/month |
| Synology NAS (self-hosted) | - | - | $0 |
| Grand Total | | | ~$122/month |

Comparison with Cloud Providers:

| Provider | Similar Workload | Monthly Cost | Savings |
| --- | --- | --- | --- |
| AWS SageMaker | Same GPU hours | ~$800 | 85% |
| GCP Vertex AI | Same GPU hours | ~$700 | 83% |
| Azure ML | Same GPU hours | ~$750 | 84% |
| Vast.ai | Same GPU hours | ~$122 | Baseline |

Savings: 60-85% vs major cloud providers
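The arithmetic behind the monthly estimate and the savings column can be checked with a short sketch. The hours, rates, and competitor costs are the figures from the tables above; the type and function names are hypothetical.

```typescript
type UsageLine = { label: string; hours: number; ratePerHour: number };

const usage: UsageLine[] = [
  { label: "Model Training (A100)", hours: 40, ratePerHour: 0.66 },
  { label: "Embedding Generation (RTX 4090)", hours: 80, ratePerHour: 0.34 },
  { label: "Inference Endpoints (serverless)", hours: 200, ratePerHour: 0.34 },
];

// Sum in whole cents to avoid floating-point drift when adding dollar amounts.
function monthlyCostCents(lines: UsageLine[]): number {
  return lines.reduce((sum, l) => sum + Math.round(l.hours * l.ratePerHour * 100), 0);
}

// Savings relative to another provider, as a whole percentage.
function savingsPercent(ourCost: number, theirCost: number): number {
  return Math.round((1 - ourCost / theirCost) * 100);
}

const totalDollars = monthlyCostCents(usage) / 100; // 121.6, shown as ~$122 in the table
```

Using the rounded $122 figure, `savingsPercent(122, 800)`, `savingsPercent(122, 700)`, and `savingsPercent(122, 750)` give the 85%, 83%, and 84% shown in the comparison table.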

Registry API Examples

    # Register instance
    curl -X POST https://mesh.bluefly.internal/api/v1/vastai/registry/register \
      -H "Content-Type: application/json" \
      -d '{
        "instance_id": 29484611,
        "contract_id": 12345,
        "tailscale_ip": "100.113.211.78",
        "tailscale_hostname": "vastai-gpu-worker-1",
        "capabilities": ["inference", "embeddings"],
        "status": "ready",
        "environment": "prod",
        "trigger_id": "gitlab-pipeline-123",
        "gpu_type": "RTX_4090",
        "gpu_name": "NVIDIA RTX 4090",
        "cost_per_hour": 0.34
      }'

    # List instances
    curl "https://mesh.bluefly.internal/api/v1/vastai/registry?environment=prod&status=ready"

    # Heartbeat
    curl -X POST https://mesh.bluefly.internal/api/v1/vastai/registry/29484611/heartbeat
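When calling the registry from TypeScript rather than curl, the same URLs can be assembled with small helpers. This is a sketch only: the helper names are hypothetical, and just the `environment` and `status` query parameters are modelled because those are the only ones shown in the curl example above.

```typescript
type ListFilters = { environment?: string; status?: string };

// Build the list URL, URL-encoding each filter and skipping undefined ones.
function buildListUrl(baseUrl: string, filters: ListFilters): string {
  const query = Object.entries(filters)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`)
    .join("&");
  return query ? `${baseUrl}?${query}` : baseUrl;
}

// Build the heartbeat URL for a given instance ID.
function buildHeartbeatUrl(baseUrl: string, instanceId: number): string {
  return `${baseUrl}/${instanceId}/heartbeat`;
}
```

The resulting strings can be passed to any HTTP client; the heartbeat URL is used with POST, as in the curl example.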

Vast.ai CLI Commands

    # Search for RTX 4090 instances
    vastai search offers 'gpu_name=RTX_4090 reliability>0.95' -o 'dph+'

    # Create instance
    vastai create instance <OFFER_ID> --image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime --disk 50

    # Show running instances
    vastai show instances

    # Destroy instance
    vastai destroy instance <INSTANCE_ID>

NAS Storage Operations

    # List checkpoints on NAS (via Tailscale - private)
    aws s3 ls s3://bluefly-models/checkpoints/ \
      --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

    # Upload training data (via Tailscale - private)
    aws s3 sync ./data s3://bluefly-models/datasets/gov-rfp/ \
      --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

    # Download model checkpoint (via Tailscale - private)
    aws s3 cp s3://bluefly-models/checkpoints/gov-rfp/latest.pt ./model.pt \
      --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

Project Mapping

GPU-Intensive Workloads (Offload to Vast.ai)

| Workload | Project | GPU Need | Vast.ai Instance Type | Estimated Cost |
| --- | --- | --- | --- | --- |
| RFP Document Processing | models/gov-rfp_model | HIGH | A100 40GB | $0.66/hr |
| Policy Compliance Training | models/civicpolicy_model | HIGH | A100 40GB | $0.66/hr |
| Platform Optimization | models/llm-platform_model | MEDIUM | RTX 4090 | $0.34/hr |
| Agent Development Patterns | models/agent-studio_model | MEDIUM | RTX 4090 | $0.34/hr |
| Vector Embeddings | common_npm/agent-brain | HIGH | RTX 4090 | $0.34/hr |
| Document Analysis | common_npm/rfp-automation | HIGH | A100 40GB | $0.66/hr |

CPU Workloads (Keep Local on OrbStack)

| Service | Project | Why Keep Local |
| --- | --- | --- |
| Chat Interface | agent-chat | Low latency required |
| Workflow Engine | workflow-engine | Stateful, Langflow integration |
| Agent Operations | agent-ops | Local orchestration |
| PostgreSQL | Infrastructure | Stateful, data sovereignty |
| Redis | Infrastructure | Low-latency cache |
| Qdrant | Infrastructure | Vector DB (query only) |

CI/CD Variables

| Variable | Type | Protected | Masked | Value/Description |
| --- | --- | --- | --- | --- |
| VASTAI_API_KEY | Variable | Yes | Yes | Vast.ai API key |
| VASTAI_SSH_KEY | File | Yes | No | SSH private key for GPU instances |
| SYNOLOGY_S3_ENDPOINT | Variable | No | No | https://storage.blueflyagents.com (public, Cloudflare) |
| SYNOLOGY_S3_ENDPOINT_PRIVATE | Variable | Yes | No | http://blueflynas.tailcf98b3.ts.net:9000 (private, Tailscale) |
| MINIO_ACCESS_KEY | Variable | Yes | Yes | MinIO access key |
| MINIO_SECRET_KEY | Variable | Yes | Yes | MinIO secret key |
| MLFLOW_TRACKING_URI | Variable | No | No | GitLab MLflow endpoint (auto-set) |
| WEBHOOK_SECRET | Variable | Yes | Yes | HMAC secret for webhook verification |

Network Note: Vast.ai instances run externally and MUST use the public endpoint via Cloudflare Tunnel. GitLab runners on the local network can use the private Tailscale endpoint for faster access.
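One way to keep the two planes from being mixed in job code is to select the S3 endpoint from the runner's location and refuse to fall back across planes. The sketch below is an assumption about how that could look; `RunnerLocation`, `S3Endpoints`, and `selectS3Endpoint` are hypothetical names, not part of the existing packages.

```typescript
type RunnerLocation = "vastai" | "local";

type S3Endpoints = {
  publicEndpoint?: string;  // value of SYNOLOGY_S3_ENDPOINT
  privateEndpoint?: string; // value of SYNOLOGY_S3_ENDPOINT_PRIVATE
};

// Vast.ai instances get the public (Cloudflare) endpoint; local runners get the
// private (Tailscale) one. A missing value is an error, never a silent fallback
// to the other plane.
function selectS3Endpoint(location: RunnerLocation, endpoints: S3Endpoints): string {
  const endpoint = location === "vastai" ? endpoints.publicEndpoint : endpoints.privateEndpoint;
  if (!endpoint) {
    throw new Error(`No S3 endpoint configured for ${location} runners`);
  }
  return endpoint;
}
```

In a CI job the two fields would typically be filled from the SYNOLOGY_S3_ENDPOINT and SYNOLOGY_S3_ENDPOINT_PRIVATE variables listed in the table above.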

Environment Variables

    # Vast.ai API tokens (set in CI/CD or .env)
    VASTAI_CLUSTER_OP_KEY=       # Instance management
    VASTAI_COST_MONITOR_KEY=     # Billing/cost access
    VASTAI_TASK_DISPATCH_KEY=    # Task coordination

    # Tailscale (optional - for automated joining)
    TAILSCALE_AUTHKEY=           # Pre-auth key for mesh join

    # Webhook security
    WEBHOOK_SECRET=              # HMAC secret for webhook verification

Troubleshooting

Instance Not Appearing in Registry

  1. Check instance heartbeat: curl -X POST /api/v1/vastai/registry/:instanceId/heartbeat
  2. Verify TTL: Default is 300s, instance must heartbeat within this window
  3. Check registry logs: agent-mesh service logs

Webhook Verification Failing

  1. Verify WEBHOOK_SECRET matches sender
  2. Check timestamp: Events older than 5 minutes are rejected
  3. Verify signature header: X-Signature must be present
  4. Check rate limits: 100 req/min per trigger_id
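When debugging rejections, it can help to see how a rate limit and a replay cache typically behave. The sketch below is not the implementation in security.ts: the class names are hypothetical, the fixed-window strategy is an assumption (only the 100 req/min limit is documented here), and the 300 s replay TTL is likewise an assumed value.

```typescript
// Fixed-window limiter: at most `limit` requests per trigger_id per window.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { startMs: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 100, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(triggerId: string, nowMs: number): boolean {
    const current = this.windows.get(triggerId);
    if (!current || nowMs - current.startMs >= this.windowMs) {
      this.windows.set(triggerId, { startMs: nowMs, count: 1 }); // start a new window
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Replay cache: remembers event IDs until their TTL elapses.
class ReplayCache {
  private readonly seen = new Map<string, number>(); // event ID -> expiry time (ms)
  private readonly ttlMs: number;

  constructor(ttlMs = 300000) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an ID is seen, false for a replay within the TTL.
  checkAndRemember(eventId: string, nowMs: number): boolean {
    const expiry = this.seen.get(eventId);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.seen.set(eventId, nowMs + this.ttlMs);
    return true;
  }
}
```

Under this model, a sender that retries the same event ID within the TTL is rejected as a replay even when the signature is valid, and a burst above the limit is rejected until the current window ends.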

Event Routing Errors

  1. Verify event type: Must be valid vastai.* event type
  2. Check agent mapping: All vastai.* events must have explicit routing
  3. Review duo-gateway logs: agent-mesh service logs

Last Updated: 2026-01-04 | Status: Production-ready | Total Code: 1,270+ lines + OpenAPI spec

Purpose: Complete reference for AI bots and developers - everything needed to become an expert