Vast.ai Serverless Deployment

AUTHORITATIVE SOURCE: BULLETPROOF_VASTAI_PLAN.md

Complete Implementation Plan: See BULLETPROOF_VASTAI_PLAN.md for full details including Cloudflare Tunnel + Tailscale integration, agent-docker service, and CI/CD components.

Overview

Vast.ai Serverless provides GPU compute at roughly 60-85% lower cost than comparable AWS/GCP instances. This guide covers deploying agent services with the TypeScript port of the PyWorker SDK in @bluefly/agent-router.

GPU Pricing Reference

| GPU | Approx. $/hr | Use Case |
| --- | --- | --- |
| RTX 4090 | $0.34 | Embeddings, small models |
| A100 40GB | $0.66 | Large models, inference |
| H100 | $1.50+ | Training, large inference |
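To make the hourly rates easier to compare, here is a minimal sketch that turns them into an estimated spend for a given number of billed hours. The rates are copied from the pricing table above; the function name `estimateCostUsd` and the 30-day usage figure are illustrative assumptions, not part of the SDK.

```typescript
// Hourly rates copied from the pricing table above (approximate, USD).
const HOURLY_RATE_USD: Record<string, number> = {
  "RTX 4090": 0.34,
  "A100 40GB": 0.66,
  "H100": 1.5, // listed as "$1.50+", so treat this as a lower bound
};

// Estimate spend for a GPU type over a number of billed hours,
// rounded to whole cents. Throws for GPU names not in the table.
function estimateCostUsd(gpu: string, hours: number): number {
  const rate = HOURLY_RATE_USD[gpu];
  if (rate === undefined) {
    throw new Error(`No rate listed for GPU: ${gpu}`);
  }
  return Math.round(rate * hours * 100) / 100;
}

// Hypothetical usage: one RTX 4090 running around the clock for 30 days.
console.log(estimateCostUsd("RTX 4090", 24 * 30));
```

Because serverless workers are billed only while running, actual spend depends on how long the autoscaler keeps workers up, so an always-on figure like this is best read as an upper bound for a single worker.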

Implementation Location

agent-router/src/infrastructure/deployment/vastai/
├── index.ts    # Barrel exports
├── types.ts    # Zod schemas (WorkerConfig, HandlerConfig, etc.)
└── worker.ts   # VastWorker class implementation

Quick Start

import { createVastWorker, WorkerConfig } from '@bluefly/agent-router/infrastructure/deployment/vastai';

const config: WorkerConfig = {
  modelServerUrl: 'http://127.0.0.1',
  modelServerPort: 8000,
  workerPort: 3000, // Vast.ai expects port 3000
  handlers: [{
    route: '/v1/embeddings',
    workloadCalculator: (data) => (data.input as string[]).length,
    allowParallelRequests: true,
    maxQueueTime: 300,
  }],
  logActionConfig: {
    onLoad: ['Application startup complete'],
    onError: ['RuntimeError', 'CUDA out of memory'],
  },
};

const worker = createVastWorker(config);
await worker.run();

Key Concepts

WorkloadCalculator

A workload calculator takes the request payload and returns a numeric value that drives autoscaling decisions:

// Embeddings: count of inputs
const embeddingWorkload = (data) => (data.input as string[]).length;

// Completions: max tokens requested
const completionWorkload = (data) => data.max_tokens || 256;

// Reranking: number of documents
const rerankWorkload = (data) => (data.documents as string[]).length;
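The one-line calculators above cast the payload without checking it, so a malformed request could throw or return `NaN`. A more defensive variant might look like the sketch below. The `RequestData` and `WorkloadFn` types are local assumptions for illustration, as is the choice to count a single string input as a workload of 1; the SDK's own type signatures may differ.

```typescript
// Assumed shape: the request body is an arbitrary parsed JSON object.
type RequestData = Record<string, unknown>;
type WorkloadFn = (data: RequestData) => number;

// Embeddings: count of inputs. Tolerates a single string or a missing field.
const safeEmbeddingWorkload: WorkloadFn = (data) => {
  const input = data["input"];
  if (Array.isArray(input)) return input.length;
  return typeof input === "string" ? 1 : 0;
};

// Completions: max tokens requested, falling back to 256 (the same default
// as above) when the field is absent, non-numeric, or not positive.
const safeCompletionWorkload: WorkloadFn = (data) => {
  const maxTokens = data["max_tokens"];
  return typeof maxTokens === "number" && maxTokens > 0 ? maxTokens : 256;
};

// Reranking: number of documents, or 0 when the field is not an array.
const safeRerankWorkload: WorkloadFn = (data) => {
  const documents = data["documents"];
  return Array.isArray(documents) ? documents.length : 0;
};
```

Returning a finite number for every payload keeps the autoscaling metric well defined even when a client sends an unexpected body.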

Handler Configuration

| Property | Type | Description |
| --- | --- | --- |
| route | string | HTTP path (e.g., /v1/embeddings) |
| workloadCalculator | function | Returns workload for autoscaling |
| allowParallelRequests | boolean | true for batching, false for sequential |
| maxQueueTime | number | Max seconds request waits in queue |
| requestParser | function | Optional request transformer |
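To illustrate what `maxQueueTime` implies for queued requests, the following sketch splits a queue into requests that are still within the limit and requests that have waited too long. The `QueuedRequest` shape and the `partitionByQueueTime` helper are hypothetical; how the actual worker tracks and rejects expired requests is not documented here.

```typescript
// Hypothetical record of a request waiting in a handler's queue.
interface QueuedRequest {
  id: string;
  enqueuedAtMs: number; // timestamp (ms) when the request entered the queue
}

// Split a queue into requests still within maxQueueTime (seconds)
// and requests that have waited longer than that.
function partitionByQueueTime(
  queue: QueuedRequest[],
  maxQueueTimeSec: number,
  nowMs: number,
): { live: QueuedRequest[]; expired: QueuedRequest[] } {
  const live: QueuedRequest[] = [];
  const expired: QueuedRequest[] = [];
  for (const req of queue) {
    const waitedSec = (nowMs - req.enqueuedAtMs) / 1000;
    if (waitedSec > maxQueueTimeSec) {
      expired.push(req);
    } else {
      live.push(req);
    }
  }
  return { live, expired };
}

// Example with maxQueueTime: 300, evaluated 400 seconds after request "a" arrived.
const result = partitionByQueueTime(
  [
    { id: "a", enqueuedAtMs: 0 },
    { id: "b", enqueuedAtMs: 200_000 },
  ],
  300,
  400_000,
);
console.log(result.expired.map((r) => r.id), result.live.map((r) => r.id));
```

Passing the current time in as `nowMs` rather than calling `Date.now()` inside the helper keeps the logic deterministic and easy to test.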

Log Action Configuration

The worker infers backend health by matching the model server's log output against configured patterns:

logActionConfig: {
  onLoad: ['Server started', 'Model loaded'],  // Triggers 'ready' state
  onError: ['RuntimeError', 'OOM'],            // Triggers 'error' state
  onInfo: ['Downloading', 'Loading'],          // Progress info
}
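One way such pattern matching could work is sketched below: each log line is checked against the configured patterns and the worker state is updated accordingly. This is an illustration under assumptions, not the VastWorker implementation: the state names, the use of plain substring matching, and the rule that error patterns take precedence over load patterns are all hypothetical.

```typescript
// Assumed worker states; 'loading' is the state before any pattern matches.
type WorkerState = "loading" | "ready" | "error";

interface LogPatterns {
  onLoad: string[];
  onError: string[];
  onInfo?: string[];
}

// Return the state after observing one log line.
// Error patterns are checked first so a failure is never masked by a load message.
function nextState(current: WorkerState, line: string, patterns: LogPatterns): WorkerState {
  if (patterns.onError.some((p) => line.includes(p))) return "error";
  if (patterns.onLoad.some((p) => line.includes(p))) return "ready";
  return current; // onInfo and unmatched lines do not change state
}

const patterns: LogPatterns = {
  onLoad: ["Server started", "Model loaded"],
  onError: ["RuntimeError", "OOM"],
  onInfo: ["Downloading", "Loading"],
};

// Fold a sequence of log lines into a final state.
const lines = ["Downloading weights", "Model loaded in 12s"];
const finalState = lines.reduce<WorkerState>(
  (state, line) => nextState(state, line, patterns),
  "loading",
);
console.log(finalState);
```

Substring matching is the simplest choice; if patterns need anchors or alternation, regular expressions would be a natural replacement.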

Endpoints

The worker exposes:

| Endpoint | Purpose |
| --- | --- |
| /health | Health status + metrics (503 if not ready) |
| /metrics | Autoscaling metrics for Vast.ai |
| /* | Routes to configured handlers |
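The `/health` behaviour (503 until the backend is ready) can be sketched as a pure function that maps worker state to an HTTP status and body. The response body fields and the `WorkerMetrics` shape are assumptions for illustration; only the 503-when-not-ready rule comes from the table above.

```typescript
// Hypothetical metrics snapshot; the real worker's metric names may differ.
interface WorkerMetrics {
  queuedRequests: number;
  currentWorkload: number;
}

interface HealthResponse {
  status: number;
  body: { ready: boolean; state: string; metrics: WorkerMetrics };
}

// Build the /health response: 200 when ready, 503 otherwise.
function buildHealthResponse(
  state: "loading" | "ready" | "error",
  metrics: WorkerMetrics,
): HealthResponse {
  const ready = state === "ready";
  return {
    status: ready ? 200 : 503,
    body: { ready, state, metrics },
  };
}

// Example: a worker that is still loading its model reports 503.
const health = buildHealthResponse("loading", { queuedRequests: 0, currentWorkload: 0 });
console.log(health.status);
```

Keeping the status decision separate from the HTTP server makes it straightforward to reuse with whichever framework the worker is built on.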

Environment Variables

| Variable | Description |
| --- | --- |
| CONTAINER_ID | Vast.ai container ID |
| MASTER_TOKEN | Authentication token |
| REPORT_ADDR | Control plane URL |
| WORKER_PORT | Worker HTTP port (default: 3000) |
| MODEL_SERVER_URL | Backend server URL |
| MODEL_SERVER_PORT | Backend server port |
| HF_TOKEN | HuggingFace token (optional) |
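A minimal sketch of reading these variables into a typed object follows. The `loadWorkerEnv` helper and the `WorkerEnvConfig` shape are hypothetical. It also assumes that every variable except `WORKER_PORT` (default 3000) and `HF_TOKEN` (optional) is required, which the table does not state explicitly.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical typed view of the environment variables listed above.
interface WorkerEnvConfig {
  containerId: string;
  masterToken: string;
  reportAddr: string;
  workerPort: number;
  modelServerUrl: string;
  modelServerPort: number;
  hfToken?: string;
}

// Read a variable that must be present and non-empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Parse a TCP port, rejecting non-integers and out-of-range values.
function parsePort(raw: string, name: string): number {
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`${name} must be a port between 1 and 65535, got "${raw}"`);
  }
  return port;
}

function loadWorkerEnv(env: Env): WorkerEnvConfig {
  return {
    containerId: requireVar(env, "CONTAINER_ID"),
    masterToken: requireVar(env, "MASTER_TOKEN"),
    reportAddr: requireVar(env, "REPORT_ADDR"),
    workerPort: parsePort(env["WORKER_PORT"] ?? "3000", "WORKER_PORT"),
    modelServerUrl: requireVar(env, "MODEL_SERVER_URL"),
    modelServerPort: parsePort(requireVar(env, "MODEL_SERVER_PORT"), "MODEL_SERVER_PORT"),
    hfToken: env["HF_TOKEN"], // optional, may be undefined
  };
}
```

In a Node.js process this would typically be called once at startup as `loadWorkerEnv(process.env)`, so that a missing or malformed variable fails fast instead of surfacing later as a confusing runtime error.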

Deployment Candidates

Based on wiki research, priority services for Vast.ai:

  1. agent-brain - Embedding service (high GPU utilization)
  2. rfp-automation - Document processing
  3. compliance-engine - Security scanning with ML

Example: Agent-Brain Deployment

See: agent-router/examples/vastai-worker-example.ts

# Build and deploy
docker build -t agent-brain-vastai .
vastai deploy --template templates/agent-brain.json

Network Architecture

          
   Cloudflare       VastWorker       Backend Model  
   (Public)              (Port 3000)           (Port 8000)    
          
                               
                               
                        
                          Vast.ai Control 
                             Plane        
                        

Implementation Status

  • TypeScript port of PyWorker SDK
  • Zod schemas for configuration validation
  • Request queue with timeout handling
  • Metrics for autoscaling
  • Health monitoring via log patterns
  • Integration tests
  • Dockerfile templates
  • vastai CLI integration