Vast.ai Serverless Deployment
AUTHORITATIVE SOURCE: BULLETPROOF_VASTAI_PLAN.md
Complete Implementation Plan: See BULLETPROOF_VASTAI_PLAN.md for full details including Cloudflare Tunnel + Tailscale integration, agent-docker service, and CI/CD components.
Overview
Vast.ai Serverless provides GPU compute at 60-85% lower cost than AWS/GCP. This guide covers deploying agent services with the TypeScript port of the PyWorker SDK in @bluefly/agent-router.
GPU Pricing Reference
| GPU | ~$/hr | Use Case |
|---|---|---|
| RTX 4090 | $0.34 | Embeddings, small models |
| A100 40GB | $0.66 | Large models, inference |
| H100 | $1.50+ | Training, large inference |
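To make the hourly rates easier to compare, the sketch below turns them into a rough monthly figure. The rates are copied from the table; the 730-hour month and the `monthlyCostUsd` helper are assumptions for illustration, and the H100 rate is only a lower bound.

```typescript
// Approximate hourly rates from the table above (H100 is listed as "$1.50+").
const hourlyRateUsd: Record<string, number> = {
  "RTX 4090": 0.34,
  "A100 40GB": 0.66,
  "H100": 1.5,
};

// Assumption: about 730 hours in a month of continuous use.
const HOURS_PER_MONTH = 730;

// utilization is the fraction of the month the instance is actually running.
function monthlyCostUsd(gpu: string, utilization = 1): number {
  const rate = hourlyRateUsd[gpu];
  if (rate === undefined) throw new Error(`Unknown GPU: ${gpu}`);
  return rate * HOURS_PER_MONTH * utilization;
}

console.log(monthlyCostUsd("RTX 4090").toFixed(2));
```

Because serverless workers scale down when idle, a `utilization` below 1 is usually the more realistic input.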
Implementation Location
agent-router/src/infrastructure/deployment/vastai/
index.ts # Barrel exports
types.ts # Zod schemas (WorkerConfig, HandlerConfig, etc.)
worker.ts # VastWorker class implementation
Quick Start
    import { createVastWorker, WorkerConfig } from '@bluefly/agent-router/infrastructure/deployment/vastai';

    const config: WorkerConfig = {
      modelServerUrl: 'http://127.0.0.1',
      modelServerPort: 8000,
      workerPort: 3000, // Vast.ai expects port 3000
      handlers: [{
        route: '/v1/embeddings',
        workloadCalculator: (data) => (data.input as string[]).length,
        allowParallelRequests: true,
        maxQueueTime: 300,
      }],
      logActionConfig: {
        onLoad: ['Application startup complete'],
        onError: ['RuntimeError', 'CUDA out of memory'],
      },
    };

    const worker = createVastWorker(config);
    await worker.run();
Key Concepts
WorkloadCalculator
Returns a numeric workload value that drives autoscaling decisions:
    // Embeddings: count of inputs
    const embeddingWorkload = (data) => (data.input as string[]).length;

    // Completions: max tokens requested
    const completionWorkload = (data) => data.max_tokens || 256;

    // Reranking: number of documents
    const rerankWorkload = (data) => (data.documents as string[]).length;
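The calculators above cast the payload and assume it is well formed. A more defensive variant might look like the sketch below. It assumes an OpenAI-style embeddings payload, where `input` may be a single string or an array, and it returns 0 for malformed input; both the function names and that fallback are illustrative choices, not the SDK's behavior.

```typescript
// Hypothetical payload type: handlers receive untyped JSON, so every field is unknown.
type RequestPayload = Record<string, unknown>;

// Embeddings: one unit of work per input item.
function safeEmbeddingWorkload(data: RequestPayload): number {
  const input = data["input"];
  if (typeof input === "string") return 1;       // single-string form
  if (Array.isArray(input)) return input.length; // batch form
  return 0;                                      // assumption: malformed input counts as no work
}

// Completions: requested tokens, falling back to 256 when missing or invalid.
function safeCompletionWorkload(data: RequestPayload, fallback = 256): number {
  const maxTokens = data["max_tokens"];
  return typeof maxTokens === "number" && Number.isFinite(maxTokens) && maxTokens > 0
    ? maxTokens
    : fallback;
}
```

Returning a number for every input keeps a bad request from throwing inside the autoscaling path; rejecting the request itself is better left to the handler.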
Handler Configuration
| Property | Type | Description |
|---|---|---|
| route | string | HTTP path (e.g., /v1/embeddings) |
| workloadCalculator | function | Returns workload for autoscaling |
| allowParallelRequests | boolean | true for batching, false for sequential |
| maxQueueTime | number | Max seconds request waits in queue |
| requestParser | function | Optional request transformer |
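As a rough illustration of how these properties fit together, the sketch below defines a completions handler. The `HandlerSketch` interface, the `requestParser` signature, and the `queueTimeExceeded` helper are assumptions made for this example; the authoritative definitions are the Zod schemas in types.ts.

```typescript
type HandlerPayload = Record<string, unknown>;

// Hypothetical shape mirroring the property table above.
interface HandlerSketch {
  route: string;
  workloadCalculator: (data: HandlerPayload) => number;
  allowParallelRequests: boolean;
  maxQueueTime: number; // seconds
  requestParser?: (data: HandlerPayload) => HandlerPayload;
}

const completionsHandler: HandlerSketch = {
  route: "/v1/completions",
  workloadCalculator: (data) => {
    const maxTokens = data["max_tokens"];
    return typeof maxTokens === "number" ? maxTokens : 256;
  },
  allowParallelRequests: false, // process requests sequentially
  maxQueueTime: 120,
  // Assumed parser behavior: supply a default max_tokens, keep caller values.
  requestParser: (data) => ({ max_tokens: 256, ...data }),
};

// True once a queued request has waited longer than maxQueueTime seconds.
function queueTimeExceeded(enqueuedAtMs: number, nowMs: number, handler: HandlerSketch): boolean {
  return (nowMs - enqueuedAtMs) / 1000 > handler.maxQueueTime;
}
```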
Log Action Configuration
Detects backend health via log patterns:
    logActionConfig: {
      onLoad: ['Server started', 'Model loaded'], // Triggers 'ready' state
      onError: ['RuntimeError', 'OOM'],           // Triggers 'error' state
      onInfo: ['Downloading', 'Loading'],         // Progress info
    }
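One way to picture the pattern matching is the sketch below, which classifies a single log line. It assumes plain case-sensitive substring matching and checks error patterns first; the `classifyLogLine` function and its types are hypothetical, and the actual VastWorker implementation may match differently.

```typescript
// Hypothetical mirror of the logActionConfig shape shown above.
interface LogActionSketch {
  onLoad: string[];
  onError: string[];
  onInfo?: string[];
}

type LogAction = "ready" | "error" | "info" | null;

function classifyLogLine(line: string, cfg: LogActionSketch): LogAction {
  // Errors are checked first so a line matching several lists is never reported as ready.
  if (cfg.onError.some((pattern) => line.includes(pattern))) return "error";
  if (cfg.onLoad.some((pattern) => line.includes(pattern))) return "ready";
  if ((cfg.onInfo ?? []).some((pattern) => line.includes(pattern))) return "info";
  return null; // no pattern matched: leave the worker state unchanged
}

const exampleLogConfig: LogActionSketch = {
  onLoad: ["Server started", "Model loaded"],
  onError: ["RuntimeError", "OOM"],
  onInfo: ["Downloading", "Loading"],
};

console.log(classifyLogLine("INFO: Model loaded in 12s", exampleLogConfig)); // ready
```

Short patterns such as `OOM` can match unrelated text under substring matching, so more specific strings like `CUDA out of memory` are the safer choice.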
Endpoints
The worker exposes:
| Endpoint | Purpose |
|---|---|
| /health | Health status + metrics (503 if not ready) |
| /metrics | Autoscaling metrics for Vast.ai |
| /* | Routes to configured handlers |
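The `/health` contract can be sketched as a pure function that maps worker state to a status code and body. Only the 503-when-not-ready behavior comes from the table above; the 200 status, the `HealthSnapshot` fields, and the JSON body layout are assumptions for illustration.

```typescript
// Hypothetical snapshot of worker state; field names are illustrative.
interface HealthSnapshot {
  ready: boolean;
  queuedRequests: number;
}

interface HealthReply {
  statusCode: number;
  body: string;
}

function healthReply(snapshot: HealthSnapshot): HealthReply {
  return {
    // 503 while the backend is not ready (from the table); 200 otherwise (assumed).
    statusCode: snapshot.ready ? 200 : 503,
    body: JSON.stringify({
      status: snapshot.ready ? "ready" : "loading",
      queuedRequests: snapshot.queuedRequests,
    }),
  };
}
```

Keeping this logic separate from the HTTP server makes it easy to unit test without opening a port.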
Environment Variables
| Variable | Description |
|---|---|
| CONTAINER_ID | Vast.ai container ID |
| MASTER_TOKEN | Authentication token |
| REPORT_ADDR | Control plane URL |
| WORKER_PORT | Worker HTTP port (default: 3000) |
| MODEL_SERVER_URL | Backend server URL |
| MODEL_SERVER_PORT | Backend server port |
| HF_TOKEN | HuggingFace token (optional) |
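A minimal sketch of reading these variables into a typed object follows. The `loadWorkerEnv` helper is hypothetical. It assumes that `CONTAINER_ID`, `MASTER_TOKEN`, and `REPORT_ADDR` are required, and it borrows the Quick Start values as fallbacks for the backend URL and port; only the `WORKER_PORT` default of 3000 and the optional `HF_TOKEN` come from the table.

```typescript
type EnvSource = Record<string, string | undefined>;

interface WorkerEnvSketch {
  containerId: string;
  masterToken: string;
  reportAddr: string;
  workerPort: number;
  modelServerUrl: string;
  modelServerPort: number;
  hfToken: string | undefined;
}

function requireVar(env: EnvSource, key: string): string {
  const value = env[key];
  if (!value) throw new Error(`Missing required environment variable: ${key}`);
  return value;
}

function parsePort(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port value: ${raw}`);
  }
  return port;
}

function loadWorkerEnv(env: EnvSource): WorkerEnvSketch {
  return {
    // Assumption: these three are mandatory.
    containerId: requireVar(env, "CONTAINER_ID"),
    masterToken: requireVar(env, "MASTER_TOKEN"),
    reportAddr: requireVar(env, "REPORT_ADDR"),
    workerPort: parsePort(env["WORKER_PORT"], 3000), // default from the table
    // Assumption: fall back to the Quick Start values.
    modelServerUrl: env["MODEL_SERVER_URL"] ?? "http://127.0.0.1",
    modelServerPort: parsePort(env["MODEL_SERVER_PORT"], 8000),
    hfToken: env["HF_TOKEN"], // optional
  };
}
```

In a Node.js process this would be called as `loadWorkerEnv(process.env)`; taking the source as a parameter keeps the function testable.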
Deployment Candidates
Based on wiki research, the priority services for Vast.ai deployment are:
- agent-brain - Embedding service (high GPU utilization)
- rfp-automation - Document processing
- compliance-engine - Security scanning with ML
Example: Agent-Brain Deployment
See: agent-router/examples/vastai-worker-example.ts
    # Build and deploy
    docker build -t agent-brain-vastai .
    vastai deploy --template templates/agent-brain.json
Network Architecture
    Cloudflare (Public) -> VastWorker (Port 3000) -> Backend Model (Port 8000)
                                |
                      Vast.ai Control Plane
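For the hop from the worker on port 3000 to the backend on port 8000, the forwarding target can be derived from the `modelServerUrl` and `modelServerPort` settings shown in the Quick Start. The `backendUrl` helper below is a hypothetical sketch of that step, not the worker's actual routing code.

```typescript
// Builds the backend URL a request on `route` would be forwarded to.
function backendUrl(modelServerUrl: string, modelServerPort: number, route: string): string {
  const url = new URL(modelServerUrl); // throws on a malformed base URL
  url.port = String(modelServerPort);
  url.pathname = route;
  return url.toString();
}

console.log(backendUrl("http://127.0.0.1", 8000, "/v1/embeddings"));
```

Using the `URL` class instead of string concatenation avoids doubled or missing slashes when the base URL carries a trailing slash.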
Implementation Status
- TypeScript port of PyWorker SDK
- Zod schemas for configuration validation
- Request queue with timeout handling
- Metrics for autoscaling
- Health monitoring via log patterns
- Integration tests
- Dockerfile templates
- vastai CLI integration