vast ai serverless

Vast.ai Serverless Deployment

AUTHORITATIVE SOURCE: BULLETPROOF_VASTAI_PLAN.md

Complete Implementation Plan: See BULLETPROOF_VASTAI_PLAN.md for full details including Cloudflare Tunnel + Tailscale integration, agent-docker service, and CI/CD components.

Overview

Vast.ai Serverless provides GPU compute at 60-85% cost savings compared to AWS/GCP. This guide covers deploying agent services using the TypeScript PyWorker SDK in @bluefly/agent-router.

GPU Pricing Reference

GPU	~$/hr	Use Case
RTX 4090	$0.34	Embeddings, small models
A100 40GB	$0.66	Large models, inference
H100	$1.50+	Training, large inference

Implementation Location

agent-router/src/infrastructure/deployment/vastai/
 index.ts    # Barrel exports
 types.ts    # Zod schemas (WorkerConfig, HandlerConfig, etc.)
 worker.ts   # VastWorker class implementation

Quick Start

import { createVastWorker, WorkerConfig } from '@bluefly/agent-router/infrastructure/deployment/vastai';

const config: WorkerConfig = {
  modelServerUrl: 'http://127.0.0.1',
  modelServerPort: 8000,
  workerPort: 3000,  // Vast.ai expects port 3000
  handlers: [{
    route: '/v1/embeddings',
    workloadCalculator: (data) => (data.input as string[]).length,
    allowParallelRequests: true,
    maxQueueTime: 300,
  }],
  logActionConfig: {
    onLoad: ['Application startup complete'],
    onError: ['RuntimeError', 'CUDA out of memory'],
  },
};

const worker = createVastWorker(config);
await worker.run();

Key Concepts

WorkloadCalculator

Returns numeric value for autoscaling decisions:

// Embeddings: count of inputs
const embeddingWorkload = (data) => (data.input as string[]).length;

// Completions: max tokens requested
const completionWorkload = (data) => data.max_tokens || 256;

// Reranking: number of documents
const rerankWorkload = (data) => (data.documents as string[]).length;

Handler Configuration

Property	Type	Description
`route`	string	HTTP path (e.g., `/v1/embeddings`)
`workloadCalculator`	function	Returns workload for autoscaling
`allowParallelRequests`	boolean	true for batching, false for sequential
`maxQueueTime`	number	Max seconds request waits in queue
`requestParser`	function	Optional request transformer

Log Action Configuration

Detects backend health via log patterns:

logActionConfig: {
  onLoad: ['Server started', 'Model loaded'],  // Triggers 'ready' state
  onError: ['RuntimeError', 'OOM'],            // Triggers 'error' state
  onInfo: ['Downloading', 'Loading'],          // Progress info
}

Endpoints

The worker exposes:

Endpoint	Purpose
`/health`	Health status + metrics (503 if not ready)
`/metrics`	Autoscaling metrics for Vast.ai
`/*`	Routes to configured handlers

Environment Variables

Variable	Description
`CONTAINER_ID`	Vast.ai container ID
`MASTER_TOKEN`	Authentication token
`REPORT_ADDR`	Control plane URL
`WORKER_PORT`	Worker HTTP port (default: 3000)
`MODEL_SERVER_URL`	Backend server URL
`MODEL_SERVER_PORT`	Backend server port
`HF_TOKEN`	HuggingFace token (optional)

Deployment Candidates

Based on wiki research, priority services for Vast.ai:

agent-brain - Embedding service (high GPU utilization)
rfp-automation - Document processing
compliance-engine - Security scanning with ML

Example: Agent-Brain Deployment

See: agent-router/examples/vastai-worker-example.ts

# Build and deploy
docker build -t agent-brain-vastai .
vastai deploy --template templates/agent-brain.json

Network Architecture

          
   Cloudflare       VastWorker       Backend Model  
   (Public)              (Port 3000)           (Port 8000)    
          
                               
                               
                        
                          Vast.ai Control 
                             Plane

Implementation Status

TypeScript port of PyWorker SDK
Zod schemas for configuration validation
Request queue with timeout handling
Metrics for autoscaling
Health monitoring via log patterns
Integration tests
Dockerfile templates
vastai CLI integration