README

Vast.ai GPU Cloud Integration

AUTHORITATIVE SOURCE: BULLETPROOF_VASTAI_PLAN.md

Version: 3.0.0 | Last Updated: 2026-01-04 | Status: PRODUCTION - Canonical Event Schema, Registry Service, Security Complete

Complete Implementation Plan: See BULLETPROOF_VASTAI_PLAN.md for full details including Cloudflare Tunnel + Tailscale integration, agent-docker service, and CI/CD components.

TL;DR

Production-ready Vast.ai integration with:

Canonical Event Schema - Single source of truth for all Vast.ai events
Security - HMAC verification, replay protection, rate limiting
Service Discovery - Authoritative registry for GPU instances
OpenAPI 3.1 - Complete API specification
PyWorker SDK - TypeScript port for serverless deployment

Implementation Status

Component	Location	Status
Canonical Event Schema	`agent-router/src/infrastructure/deployment/vastai/events.ts`	Complete (289 lines)
Security Utilities	`agent-router/src/infrastructure/deployment/vastai/security.ts`	Complete (247 lines)
Registry Service	`agent-mesh/src/services/vastai-registry.service.ts`	Complete (218 lines)
Registry API	`agent-mesh/src/api/vastai-registry.routes.ts`	Complete (143 lines)
OpenAPI Spec	`agent-mesh/openapi/vastai-registry.openapi.yml`	Complete (OpenAPI 3.1)
Duo Gateway Integration	`agent-mesh/src/api/duo-gateway.routes.ts`	Complete (17 event types)
PyWorker SDK	`agent-router/src/infrastructure/deployment/vastai/`	Complete
Cloudflared Tunnel	`mesh.bluefly.internal`	Configured
agent-mesh	`common_npm/agent-mesh` (port 3005)	Running

Architecture

Separation of Duties

Complete Reference: See Separation of Duties and Separation of Duties Audit

Responsibility	Project	Location
Event Definitions	`agent-router`	`src/infrastructure/deployment/vastai/events.ts`
Security	`agent-router`	`src/infrastructure/deployment/vastai/security.ts`
Service Discovery	`agent-mesh`	`src/services/vastai-registry.service.ts`
Registry API	`agent-mesh`	`src/api/vastai-registry.routes.ts`
Event Routing	`agent-mesh`	`src/api/duo-gateway.routes.ts`
Webhook Handling	`platform-agents`	`src/triggers/vastai-webhook.ts`
Docker Operations	`agent-docker`	`src/services/vastai-docker.service.ts`
CI/CD Components	`gitlab_components`	`templates/vastai-deploy/template.yml`

Canonical Event Schema

Single Source of Truth: @bluefly/agent-router/infrastructure/deployment/vastai/events

17 event types using dot-notation:

vastai.instance.* - Lifecycle events (created, provisioning, ready, failed, terminated)
vastai.deployment.* - Deployment events (started, completed, failed)
vastai.cost.* - Cost events (sampled, threshold_warning, threshold_exceeded, budget_exceeded)
vastai.health.* - Health events (check, degraded, unhealthy)
vastai.mesh.* - Mesh events (registered, unregistered, heartbeat)

Usage:

import { createEventEnvelope, VastEventType } from '@bluefly/agent-router/infrastructure/deployment/vastai/events';

const event = createEventEnvelope('vastai.instance.created', payload, {
  triggerId: 'gitlab-pipeline-123',
  source: 'gitlab',
  idempotencyKey: crypto.randomUUID(),
});

Security

Location: @bluefly/agent-router/infrastructure/deployment/vastai/security

Security features:

HMAC Signature Verification - Timing-safe comparison
Replay Protection - Event ID cache with TTL
Rate Limiting - Per trigger_id (100 req/min default)
Timestamp Validation - 5-minute window
Payload Size Limits - 1MB maximum
Control Character Stripping - Input sanitization

Usage:

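import {
  WebhookVerifier,
  WebhookSignatureVerifier,
  InMemoryReplayCache,
  TimestampValidator,
  InMemoryRateLimiter,
  PayloadSizeValidator,
} from '@bluefly/agent-router/infrastructure/deployment/vastai/security';

const verifier = new WebhookVerifier(
  new WebhookSignatureVerifier(secret),   // HMAC check with the shared webhook secret
  new InMemoryReplayCache(),              // event ID cache with TTL
  new TimestampValidator(300),            // 5-minute window, in seconds
  new InMemoryRateLimiter(),              // per trigger_id, 100 req/min default
  new PayloadSizeValidator(1024 * 1024)   // 1MB maximum
);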

const result = await verifier.verify(rawPayload, headers, triggerId);
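
The first two checks in the feature list (timing-safe HMAC comparison and the 5-minute timestamp window) can be illustrated with Node's built-in crypto module alone. This is a minimal sketch, not the code in security.ts: the function names are hypothetical, and HMAC-SHA256 with a hex-encoded signature over the raw payload is an assumption.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: hex HMAC-SHA256 over the raw payload.
// Algorithm and encoding are assumptions, not taken from security.ts.
function signPayload(rawPayload: string, secret: string): string {
  return createHmac("sha256", secret).update(rawPayload, "utf8").digest("hex");
}

// Compares the received signature with the expected one in constant time.
function verifySignature(rawPayload: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(signPayload(rawPayload, secret), "utf8");
  const received = Buffer.from(signature, "utf8");
  // timingSafeEqual throws when lengths differ, so reject early in that case.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}

// Accepts only events whose timestamp lies within the window (300 s = 5 minutes).
function isTimestampFresh(eventTimeMs: number, nowMs: number, windowSeconds = 300): boolean {
  return Math.abs(nowMs - eventTimeMs) <= windowSeconds * 1000;
}
```

The signature must be computed over the raw request body, before any JSON parsing, since re-serializing the payload can change its bytes.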

Service Discovery Registry

Location: agent-mesh/src/services/vastai-registry.service.ts

Authoritative registry for active GPU instances:

TTL-based expiration (300s default)
Health monitoring via heartbeat
EventEmitter for lifecycle events
Filtering by environment, trigger_id, status, capabilities

API Endpoints (/api/v1/vastai/registry):

POST /register - Register instance
GET / - List instances (with filters)
GET /:instanceId - Get instance by ID
DELETE /:instanceId - Deregister instance
POST /:instanceId/heartbeat - Update heartbeat

OpenAPI Spec: agent-mesh/openapi/vastai-registry.openapi.yml
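
The TTL and heartbeat behaviour described above can be sketched as a small in-memory map. This is an illustration under assumptions, not the code in vastai-registry.service.ts: the class name, record shape, and method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```typescript
// Hypothetical record shape; the real service stores more fields.
interface RegistryEntry {
  instanceId: number;
  status: string;
  lastHeartbeatMs: number;
}

class TtlRegistry {
  private readonly entries = new Map<number, RegistryEntry>();
  private readonly ttlMs: number;

  constructor(ttlSeconds = 300) {
    this.ttlMs = ttlSeconds * 1000;
  }

  register(instanceId: number, status: string, nowMs: number): void {
    this.entries.set(instanceId, { instanceId, status, lastHeartbeatMs: nowMs });
  }

  // Returns false when the instance is unknown (never registered or already expired).
  heartbeat(instanceId: number, nowMs: number): boolean {
    this.evictExpired(nowMs);
    const entry = this.entries.get(instanceId);
    if (!entry) return false;
    entry.lastHeartbeatMs = nowMs;
    return true;
  }

  list(nowMs: number): RegistryEntry[] {
    this.evictExpired(nowMs);
    return Array.from(this.entries.values());
  }

  // Drops every entry whose last heartbeat is older than the TTL.
  private evictExpired(nowMs: number): void {
    this.entries.forEach((entry, id) => {
      if (nowMs - entry.lastHeartbeatMs > this.ttlMs) this.entries.delete(id);
    });
  }
}
```

In this sketch an expired instance has to register again; a heartbeat alone does not bring it back.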

Event Routing

Location: agent-mesh/src/api/duo-gateway.routes.ts

All Vast.ai events route to the appropriate OSSA agents:

Lifecycle events -> cluster-operator
Cost events -> cost-intelligence-monitor
Health events -> cluster-operator
Mesh events -> cluster-operator

Machine events (vastai.*) require explicit routing - no defaults.
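
The "explicit routing, no defaults" rule can be sketched as a lookup that throws instead of falling back. The agent names come from the list above; the table shape, the function name, and the mapping of lifecycle events to the vastai.instance.* family are assumptions, and duo-gateway.routes.ts may be organised differently.

```typescript
// Event family (second segment of the dot-notation type) -> OSSA agent.
// A Map is used so that inherited object keys can never match by accident.
const VASTAI_ROUTES = new Map<string, string>([
  ["instance", "cluster-operator"],
  ["cost", "cost-intelligence-monitor"],
  ["health", "cluster-operator"],
  ["mesh", "cluster-operator"],
]);

// Hypothetical router: returns the agent for an event type or throws.
function routeVastEvent(eventType: string): string {
  const parts = eventType.split(".");
  if (parts.length !== 3 || parts[0] !== "vastai") {
    throw new Error(`Not a vastai.* event type: ${eventType}`);
  }
  const family = parts[1] ?? "";
  const agent = VASTAI_ROUTES.get(family);
  if (agent === undefined) {
    // No default agent: an unmapped family is an error, not a fallback.
    throw new Error(`No explicit route for event family: ${family}`);
  }
  return agent;
}
```

Because the list above names no agent for deployment events, vastai.deployment.* throws in this sketch until a route is added.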

Network Endpoints

Context	Endpoint	Use
Registry API (public)	`https://mesh.bluefly.internal/api/v1/vastai/registry`	Cloudflare tunnel
Registry API (local)	`http://localhost:3005/api/v1/vastai/registry`	Local development
Vast.ai (public)	`storage.blueflyagents.com`	Cloudflare tunnel
Local (private)	`blueflynas.tailcf98b3.ts.net:9000`	Tailscale mesh

Rule: Cloudflare = Public ONLY. Tailscale = Private ONLY. Never mix.

Quick Start - PyWorker SDK

import { createVastWorker, WorkerConfig } from '@bluefly/agent-router/infrastructure/deployment/vastai';

const worker = createVastWorker({
  modelServerUrl: 'http://127.0.0.1',
  modelServerPort: 8000,
  workerPort: 3000,  // Vast.ai expects port 3000
  handlers: [{
    route: '/v1/embeddings',
    workloadCalculator: (data) => (data.input as string[]).length,
    allowParallelRequests: true,
  }],
  logActionConfig: {
    onLoad: ['Application startup complete'],
    onError: ['RuntimeError', 'CUDA out of memory'],
  },
});

await worker.run();

GPU Pricing

GPU	$/hr	Best For
RTX 4090	$0.34	Inference, embeddings
A100 40GB	$0.66	Training
H100	$1.99	Production fine-tuning

Related Documentation

vast-ai-serverless.md - PyWorker SDK Guide
gpu-cluster-status.md - Active instances status
OpenAPI Spec - Registry API specification

Code Locations

Event Schema: common_npm/agent-router/src/infrastructure/deployment/vastai/events.ts
Security: common_npm/agent-router/src/infrastructure/deployment/vastai/security.ts
Registry Service: common_npm/agent-mesh/src/services/vastai-registry.service.ts
Registry API: common_npm/agent-mesh/src/api/vastai-registry.routes.ts
OpenAPI Spec: common_npm/agent-mesh/openapi/vastai-registry.openapi.yml
Duo Gateway: common_npm/agent-mesh/src/api/duo-gateway.routes.ts

Best Practices

DRY: Single source of truth for events (agent-router)
SOLID: Clear separation of responsibilities
OpenAPI-First: Spec before implementation
Type Safety: TypeScript + Zod validation
Security: HMAC, replay protection, rate limiting
Idempotency: Required for mutating actions


Network Architecture

Network Separation (CRITICAL)

Key Principle: Cloudflare = Public Ingress ONLY. Tailscale = Private Access ONLY. These planes must NEVER be mixed.


NETWORK ARCHITECTURE

PUBLIC ACCESS (Cloudflare Tunnel) - For Vast.ai & External

  Vast.ai GPU Instances --HTTPS--> storage.blueflyagents.com (Cloudflare Tunnel -> :9000)
                                              |
                                              v
                                Synology NAS (192.168.68.60)
                                  MinIO Container (:9000)

PRIVATE ACCESS (Tailscale) - For Local Development

  Mac M4/M3 OrbStack --Tailscale (WireGuard)--> blueflynas.tailcf98b3.ts.net:9000 (Private mesh network)

EXISTING ROUTES:
  nas.blueflyagents.com -> DSM Web UI (:5001)
  api.blueflyagents.com -> Webhook Server (:3001)
  storage.blueflyagents.com -> MinIO S3 (:9000)
  mesh.bluefly.internal -> Registry API (:3005)

Synology NAS Integration

Network Configuration:

synology_nas:
  # Physical device
  local_ip: "192.168.68.60"
  dsm_port: 5001

  # Public access (Cloudflare Tunnel) - for Vast.ai
  public:
    s3_endpoint: "https://storage.blueflyagents.com"
    dsm_ui: "https://nas.blueflyagents.com"

  # Private access (Tailscale) - for local development
  private:
    s3_endpoint: "http://blueflynas.tailcf98b3.ts.net:9000"
    dsm_ui: "https://blueflynas.tailcf98b3.ts.net:5001"

  # S3 bucket structure
  buckets:
    bluefly-models:
      paths:
        training-data: "/volume1/llm-platform/training-data/"
        checkpoints: "/volume1/llm-platform/checkpoints/"
        models: "/volume1/llm-platform/models/"
        artifacts: "/volume1/llm-platform/artifacts/"

Using agent-tailscale for NAS Access:

import { TailscaleDiscovery } from '@bluefly/agent-tailscale';
import { S3Client } from '@aws-sdk/client-s3';

const discovery = new TailscaleDiscovery();

// Find the NAS by hostname
const peers = await discovery.discoverPeers({ online: true });
const nas = peers.find(p => p.hostname === 'blueflynas');

if (!nas) throw new Error('NAS not found on Tailscale');

const s3Client = new S3Client({
  endpoint: `http://${nas.tailscaleIP}:9000`,  // 100.104.119.76
  credentials: {
    accessKeyId: process.env.MINIO_ACCESS_KEY!,
    secretAccessKey: process.env.MINIO_SECRET_KEY!,
  },
  forcePathStyle: true,
});

// Or use the known hostname directly:
// endpoint: 'http://blueflynas.tailcf98b3.ts.net:9000'

GPU Pricing & Strategy

Pricing Model

Instance Type	Characteristics	Best For	Discount
On-Demand	Fixed price, guaranteed	Production inference	Baseline
Interruptible	Bidding, may interrupt	Batch training	50-80% off
Reserved	Pre-paid commitment	Long-term projects	20-40% off

GPU Selection Matrix

GPU	Price/hr	VRAM	Best Use Case
RTX 4090	~$0.34	24GB	Testing, inference, embeddings
RTX 3090	~$0.25	24GB	Budget training
A100 40GB	~$0.66	40GB	Model training
A100 80GB	~$0.80	80GB	Large model training
H100	~$1.99	80GB	Production training, fine-tuning

Cost Projections

Monthly Estimate (Moderate Usage):

Category	Hours/Month	Rate	Monthly Cost
Model Training (A100)	40	$0.66/hr	$26.40
Embedding Generation (RTX 4090)	80	$0.34/hr	$27.20
Inference Endpoints (serverless)	~200	$0.34/hr	$68.00
Total GPU			~$122/month
Synology NAS (self-hosted)	-	-	$0
Grand Total			~$122/month

Comparison with Cloud Providers:

Provider	Similar Workload	Monthly Cost	Savings
AWS SageMaker	Same GPU hours	~$800	85%
GCP Vertex AI	Same GPU hours	~$700	83%
Azure ML	Same GPU hours	~$750	84%
Vast.ai	Same GPU hours	~$122	Baseline

Savings: 83-85% vs. major cloud providers for the workload estimated above
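
The arithmetic behind the two tables above can be reproduced in a few lines. The hours, rates, and provider costs are the figures from the tables; the type and function names are only for illustration.

```typescript
interface UsageLine {
  label: string;
  hours: number;
  ratePerHour: number;
}

// Figures copied from the monthly estimate table.
const moderateUsage: UsageLine[] = [
  { label: "Model Training (A100)", hours: 40, ratePerHour: 0.66 },
  { label: "Embedding Generation (RTX 4090)", hours: 80, ratePerHour: 0.34 },
  { label: "Inference Endpoints (serverless)", hours: 200, ratePerHour: 0.34 },
];

// Sum of hours * rate over all usage lines.
function monthlyTotal(lines: UsageLine[]): number {
  return lines.reduce((sum, line) => sum + line.hours * line.ratePerHour, 0);
}

// Percentage saved relative to another provider, rounded to a whole percent.
function savingsPercent(vastCost: number, otherCost: number): number {
  return Math.round((1 - vastCost / otherCost) * 100);
}
```

The total comes to 26.40 + 27.20 + 68.00 = 121.60, which the table rounds to ~$122; against the ~$800 SageMaker estimate that is a saving of about 85%.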

Registry API Examples

# Register instance
curl -X POST https://mesh.bluefly.internal/api/v1/vastai/registry/register \
  -H "Content-Type: application/json" \
  -d '{
    "instance_id": 29484611,
    "contract_id": 12345,
    "tailscale_ip": "100.113.211.78",
    "tailscale_hostname": "vastai-gpu-worker-1",
    "capabilities": ["inference", "embeddings"],
    "status": "ready",
    "environment": "prod",
    "trigger_id": "gitlab-pipeline-123",
    "gpu_type": "RTX_4090",
    "gpu_name": "NVIDIA RTX 4090",
    "cost_per_hour": 0.34
  }'

# List instances
curl "https://mesh.bluefly.internal/api/v1/vastai/registry?environment=prod&status=ready"

# Heartbeat
curl -X POST https://mesh.bluefly.internal/api/v1/vastai/registry/29484611/heartbeat
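
The list call above can also be made from TypeScript. The following is a minimal sketch using the standard URL and fetch APIs (Node 18+). The function names are hypothetical and the response is left as unknown, because the response schema lives in the OpenAPI spec rather than in this document.

```typescript
// Builds the list URL with optional query filters (for example environment, status).
function buildListUrl(baseUrl: string, filters: Record<string, string>): string {
  const url = new URL("/api/v1/vastai/registry", baseUrl);
  for (const [key, value] of Object.entries(filters)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

// Fetches the filtered instance list; throws on any non-2xx response.
async function listInstances(baseUrl: string, filters: Record<string, string>): Promise<unknown> {
  const res = await fetch(buildListUrl(baseUrl, filters));
  if (!res.ok) throw new Error(`Registry returned HTTP ${res.status}`);
  return res.json();
}
```

For local development the base URL would be http://localhost:3005, so listInstances("http://localhost:3005", { environment: "prod", status: "ready" }) issues the same request as the curl example.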

Vast.ai CLI Commands

# Search for RTX 4090 instances
vastai search offers 'gpu_name=RTX_4090 reliability>0.95' -o 'dph+'

# Create instance
vastai create instance <OFFER_ID> --image pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime --disk 50

# Show running instances
vastai show instances

# Destroy instance
vastai destroy instance <INSTANCE_ID>

NAS Storage Operations

# List checkpoints on NAS (via Tailscale - private)
aws s3 ls s3://bluefly-models/checkpoints/ \
  --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

# Upload training data (via Tailscale - private)
aws s3 sync ./data s3://bluefly-models/datasets/gov-rfp/ \
  --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

# Download model checkpoint (via Tailscale - private)
aws s3 cp s3://bluefly-models/checkpoints/gov-rfp/latest.pt ./model.pt \
  --endpoint-url http://blueflynas.tailcf98b3.ts.net:9000

Project Mapping

GPU-Intensive Workloads (Offload to Vast.ai)

Workload	Project	GPU Need	Vast.ai Instance Type	Estimated Cost
RFP Document Processing	`models/gov-rfp_model`	HIGH	A100 40GB	$0.66/hr
Policy Compliance Training	`models/civicpolicy_model`	HIGH	A100 40GB	$0.66/hr
Platform Optimization	`models/llm-platform_model`	MEDIUM	RTX 4090	$0.34/hr
Agent Development Patterns	`models/agent-studio_model`	MEDIUM	RTX 4090	$0.34/hr
Vector Embeddings	`common_npm/agent-brain`	HIGH	RTX 4090	$0.34/hr
Document Analysis	`common_npm/rfp-automation`	HIGH	A100 40GB	$0.66/hr

CPU Workloads (Keep Local on OrbStack)

Service	Project	Why Keep Local
Chat Interface	`agent-chat`	Low latency required
Workflow Engine	`workflow-engine`	Stateful, Langflow integration
Agent Operations	`agent-ops`	Local orchestration
PostgreSQL	Infrastructure	Stateful, data sovereignty
Redis	Infrastructure	Low-latency cache
Qdrant	Infrastructure	Vector DB (query only)

CI/CD Variables

Variable	Type	Protected	Masked	Value/Description
`VASTAI_API_KEY`	Variable	Yes	Yes	Vast.ai API key
`VASTAI_SSH_KEY`	File	Yes	No	SSH private key for GPU instances
`SYNOLOGY_S3_ENDPOINT`	Variable	No	No	`https://storage.blueflyagents.com` (public, Cloudflare)
`SYNOLOGY_S3_ENDPOINT_PRIVATE`	Variable	Yes	No	`http://blueflynas.tailcf98b3.ts.net:9000` (private, Tailscale)
`MINIO_ACCESS_KEY`	Variable	Yes	Yes	MinIO access key
`MINIO_SECRET_KEY`	Variable	Yes	Yes	MinIO secret key
`MLFLOW_TRACKING_URI`	Variable	No	No	GitLab MLflow endpoint (auto-set)
`WEBHOOK_SECRET`	Variable	Yes	Yes	HMAC secret for webhook verification

Network Note: Vast.ai instances run externally and MUST use the public endpoint via Cloudflare Tunnel. GitLab runners on the local network can use the private Tailscale endpoint for faster access.
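
One way to enforce this note in job code is to choose the endpoint from where the job runs and to refuse a Tailscale host for Vast.ai jobs. This is a sketch under assumptions: the function and type names are hypothetical, and only the two endpoint variable names come from the table above.

```typescript
type RunnerLocation = "vastai" | "local";

// Picks the S3 endpoint variable for the given location and guards against mixing planes.
function selectS3Endpoint(
  location: RunnerLocation,
  env: Record<string, string | undefined>,
): string {
  const name = location === "vastai" ? "SYNOLOGY_S3_ENDPOINT" : "SYNOLOGY_S3_ENDPOINT_PRIVATE";
  const value = env[name];
  if (!value) throw new Error(`${name} is not set`);
  // Vast.ai instances are outside the tailnet, so a *.ts.net host can never work for them.
  if (location === "vastai" && new URL(value).hostname.endsWith(".ts.net")) {
    throw new Error("Vast.ai jobs must use the public Cloudflare endpoint");
  }
  return value;
}
```

In a pipeline job this would typically be called with process.env as the second argument.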

Environment Variables

# Vast.ai API tokens (set in CI/CD or .env)
VASTAI_CLUSTER_OP_KEY=     # Instance management
VASTAI_COST_MONITOR_KEY=   # Billing/cost access  
VASTAI_TASK_DISPATCH_KEY=  # Task coordination

# Tailscale (optional - for automated joining)
TAILSCALE_AUTHKEY=         # Pre-auth key for mesh join

# Webhook security
WEBHOOK_SECRET=           # HMAC secret for webhook verification
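
A service can fail fast at startup when one of these variables is missing. The sketch below treats every variable except TAILSCALE_AUTHKEY as required, which is an assumption based on the "optional" comment above; the function and type names are hypothetical.

```typescript
const REQUIRED_VASTAI_KEYS = [
  "VASTAI_CLUSTER_OP_KEY",
  "VASTAI_COST_MONITOR_KEY",
  "VASTAI_TASK_DISPATCH_KEY",
  "WEBHOOK_SECRET",
];

interface VastEnv {
  required: Record<string, string>;
  tailscaleAuthKey: string | undefined; // optional, only needed for automated mesh join
}

// Collects the required variables and reports all missing names in one error.
function loadVastEnv(env: Record<string, string | undefined>): VastEnv {
  const required: Record<string, string> = {};
  const missing: string[] = [];
  for (const key of REQUIRED_VASTAI_KEYS) {
    const value = env[key];
    if (value) {
      required[key] = value;
    } else {
      missing.push(key);
    }
  }
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return { required, tailscaleAuthKey: env["TAILSCALE_AUTHKEY"] };
}
```

Reporting every missing name at once avoids fixing variables one failed deploy at a time.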

Troubleshooting

Instance Not Appearing in Registry

Check instance heartbeat: curl -X POST http://localhost:3005/api/v1/vastai/registry/<INSTANCE_ID>/heartbeat
Verify TTL: Default is 300s, instance must heartbeat within this window
Check registry logs: agent-mesh service logs

Webhook Verification Failing

Verify WEBHOOK_SECRET matches sender
Check timestamp: Events older than 5 minutes are rejected
Verify signature header: X-Signature must be present
Check rate limits: 100 req/min per trigger_id
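
To see how a burst from one pipeline can trip the last check, here is a minimal fixed-window limiter keyed by trigger_id. It is a sketch only: the class name is hypothetical, and the real InMemoryRateLimiter may use a different algorithm such as a sliding window or token bucket.

```typescript
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { windowStartMs: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  // Defaults mirror the documented 100 requests per minute.
  constructor(limit = 100, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if this trigger is over its limit.
  allow(triggerId: string, nowMs: number): boolean {
    const current = this.windows.get(triggerId);
    if (!current || nowMs - current.windowStartMs >= this.windowMs) {
      // First request for this trigger, or the previous window has ended.
      this.windows.set(triggerId, { windowStartMs: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

Each trigger_id has its own counter, so one noisy pipeline does not block webhooks from another.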

Event Routing Errors

Verify event type: Must be valid vastai.* event type
Check agent mapping: All vastai.* events must have explicit routing
Review duo-gateway logs: agent-mesh service logs

Last Updated: 2026-01-04 | Status: Production-ready | Total Code: 1,270+ lines + OpenAPI spec | Purpose: Complete reference for AI bots and developers

BULLETPROOF_VASTAI_PLAN.md - Complete implementation plan with Cloudflare Tunnel + Tailscale integration
Separation of Duties - Project responsibilities
Separation of Duties Audit - Complete audit of all projects