Skip to main content

OPERATIONS GUIDE

Vast.ai Operations Guide - BlueFly Platform

Purpose: Day-to-day operational procedures for Vast.ai GPU infrastructure.

For: Platform operators, DevOps, oncall engineers

Knowledge Base: /research/vast-ai/ - Comprehensive technical reference


Quick Reference

Endpoints:

  • vLLM: vastai-gpu.tailcf98b3.ts.net:8000/v1
  • Ollama: vastai-gpu.tailcf98b3.ts.net:11434

Access:

Budget:

  • Target: <$100/month
  • Alert: $150/month
  • Hard Cap: $200/month

Integration:

  • Used by: agent-router (NAS:4000)
  • Fallback: NAS CPU models + OpenAI API

Daily Operations

Morning Health Check

Script: /Volumes/AgentPlatform/scripts/vast-ai-health-check.sh

#!/bin/bash echo "=== Vast.ai Health Check $(date) ===" # Test vLLM echo "vLLM Endpoint:" curl -s http://vastai-gpu.tailcf98b3.ts.net:8000/v1/models | jq -r '.data[].id' || echo "❌ OFFLINE" # Test Ollama echo "Ollama Endpoint:" curl -s http://vastai-gpu.tailcf98b3.ts.net:11434/api/tags | jq -r '.models[].name' || echo "❌ OFFLINE" # Check yesterday's costs echo "Yesterday's Spend:" psql -h timescaledb -p 5433 -U bluefly -d analytics -t -c \ "SELECT '$' || ROUND(COALESCE(SUM(cost), 0)::numeric, 2) FROM vast_ai_costs WHERE timestamp >= NOW() - INTERVAL '1 day'" echo "=== Health Check Complete ==="

Expected Result:

vLLM Endpoint:
llama-2-7b

Ollama Endpoint:
llama2:7b

Yesterday's Spend:
$2.34

Actions if OFFLINE:

  1. Check Tailscale: tailscale status | grep vastai
  2. Check worker status on dashboard
  3. If cold start: Wait 30s and retry
  4. If truly down: Check #11 Troubleshooting section

Cost Monitoring

Check Current Month Spend

-- Connect to TimescaleDB psql -h timescaledb -p 5433 -U bluefly -d analytics -- Current month costs SELECT DATE_TRUNC('day', timestamp)::date as day, ROUND(SUM(cost)::numeric, 2) as daily_cost, COUNT(*) as requests FROM vast_ai_costs WHERE timestamp >= DATE_TRUNC('month', NOW()) GROUP BY day ORDER BY day DESC;

Interpret Results:

  • $0-5/day: ✅ Excellent (on track for <$150/month)
  • $5-10/day: ⚠️ Moderate (on track for $150-300/month)
  • $10+/day: ❌ High (review usage, optimize)

Cost Alerts

Alert Thresholds (Grafana):

  1. $50/month: Info (on track)
  2. $100/month: Warning (review usage)
  3. $150/month: Critical (optimize now)
  4. $200/month: Emergency (hard cap, fallback to CPU)

Alert Channels:


Troubleshooting

Symptom: Endpoint Not Responding

Diagnosis Steps:

  1. Test endpoint:
curl -I http://vastai-gpu.tailcf98b3.ts.net:8000/health

Expected: HTTP/1.1 200 OK If timeout: Cold start (wait 30s) or worker down

  1. Check Tailscale:
tailscale status | grep vastai-gpu

Expected: vastai-gpu ... active If offline: Tailscale connection issue

  1. Check worker status:

Common Resolutions:

Cold Start (Expected):

  • First request after idle takes 10-30s
  • Subsequent requests fast
  • This is normal, no action needed

Tailscale Down:

ssh blueflynas sudo tailscale status sudo tailscale down && sudo tailscale up # Restart if needed

Worker Crashed:

  1. Check logs on Vast.ai dashboard
  2. Restart worker or redeploy
  3. Notify team in #platform-alerts

Symptom: High Costs

Quick Check:

-- Top cost drivers today SELECT model, COUNT(*) as requests, ROUND(SUM(cost)::numeric, 2) as total_cost, ROUND(AVG(duration)::numeric, 2) as avg_duration_sec FROM vast_ai_costs WHERE timestamp >= CURRENT_DATE GROUP BY model ORDER BY total_cost DESC;

Common Causes & Fixes:

  1. Many Cold Starts

    • Cause: Worker scaled to zero between requests
    • Fix: Increase idle timeout from 5 min to 10 min
  2. Large Model Running

    • Cause: Using 70B model instead of 7B
    • Fix: Route simpler queries to 7B model
  3. No Batching

    • Cause: Processing requests sequentially
    • Fix: Implement batching (see optimization guide)
  4. Worker Not Scaling Down

    • Cause: Idle timeout too long
    • Fix: Reduce timeout or check for stuck requests

Symptom: Slow Responses (>10s)

Check Latency Distribution:

-- Latency percentiles SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY duration) as p50_sec, percentile_cont(0.95) WITHIN GROUP (ORDER BY duration) as p95_sec, percentile_cont(0.99) WITHIN GROUP (ORDER BY duration) as p99_sec, ROUND(AVG(CASE WHEN cold_start THEN 1 ELSE 0 END)::numeric, 2) as cold_start_rate FROM vast_ai_costs WHERE timestamp >= NOW() - INTERVAL '24 hours';

Interpretation:

  • p50 < 2s, p95 < 5s: ✅ Good performance
  • p50 < 5s, p95 < 10s: ⚠️ Acceptable (some cold starts)
  • p50 > 5s, p95 > 20s: ❌ Investigate (many cold starts or slow GPU)

Fixes:

  • If cold_start_rate > 0.2: Keep worker warm (increase timeout or min_workers=1)
  • If p50 > 5s even when warm: Use faster GPU (RTX 4090 → A6000)

Deployment Procedures

Deploy New Model to Vast.ai

When: Adding new model (e.g., Llama-3-8B)

Steps:

  1. Build container with model:
FROM vllm/vllm-openai:latest ENV MODEL_NAME="meta-llama/Llama-3-8b" RUN huggingface-cli download ${MODEL_NAME} CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \ "--model", "${MODEL_NAME}"]
  1. Push to GitLab Container Registry:
cd agent-router/docker/vllm docker build -t registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest . docker push registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest
  1. Deploy via Vast.ai API:
import requests import os response = requests.post( "https://cloud.vast.ai/api/v0/serverless/workers/", headers={"Authorization": f"Bearer {os.getenv('VAST_API_KEY')}"}, json={ "image": "registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest", "gpu_type": "RTX_4090", "max_workers": 3, "idle_timeout": 300, "min_workers": 0 } ) worker = response.json() print(f"Deployed worker: {worker['id']}") print(f"Endpoint: {worker['endpoint']}")
  1. Update agent-router config:
# Add to .env echo "VLLM_ENDPOINT_LLAMA3=${worker['endpoint']}" >> /Volumes/AgentPlatform/docker/.env # Restart agent-router docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml restart agent-router
  1. Verify:
curl http://${worker['endpoint']}/v1/models

Update Worker Configuration

When: Adjusting idle timeout, GPU type, or max workers

Via Dashboard (easiest):

  1. Login to https://cloud.vast.ai/serverless/
  2. Find worker
  3. Click "Edit"
  4. Update settings
  5. Save

Via API:

# Get worker ID curl -H "Authorization: Bearer $VAST_API_KEY" \ https://cloud.vast.ai/api/v0/serverless/workers/ | jq '.[] | {id, status}' # Update idle timeout curl -X PATCH \ -H "Authorization: Bearer $VAST_API_KEY" \ -H "Content-Type: application/json" \ -d '{"idle_timeout": 600}' \ https://cloud.vast.ai/api/v0/serverless/workers/<worker-id>

Maintenance

Weekly Review

Every Monday:

  1. Review costs:
-- Last 7 days cost trend SELECT DATE_TRUNC('day', timestamp)::date as day, ROUND(SUM(cost)::numeric, 2) as daily_cost, COUNT(*) as requests, ROUND((SUM(cost) / COUNT(*))::numeric, 5) as cost_per_request FROM vast_ai_costs WHERE timestamp >= NOW() - INTERVAL '7 days' GROUP BY day ORDER BY day DESC;
  1. Check for optimization opportunities:

    • High daily costs? → Implement batching or caching
    • Many cold starts? → Increase timeout
    • Low usage? → Reduce max_workers
  2. Update team:

    • Post summary in #platform-updates
    • Flag any concerns

Monthly Review

End of month:

  1. Calculate total spend:
SELECT ROUND(SUM(cost)::numeric, 2) as monthly_cost, COUNT(*) as total_requests, ROUND((SUM(cost) / COUNT(*))::numeric, 5) as avg_cost_per_request FROM vast_ai_costs WHERE timestamp >= DATE_TRUNC('month', NOW());
  1. Compare to budget:

    • Target: <$100/month
    • Acceptable: $100-150/month
    • Over budget: >$150/month (optimize)
  2. Plan optimizations if over budget:

    • Implement batching (80% savings)
    • Add caching (50% savings)
    • Use cheaper GPU
    • Quantize models

Emergency Procedures

Runaway Costs (>$50/day)

Immediate Actions:

  1. Check current spend:
SELECT ROUND(SUM(cost)::numeric, 2) FROM vast_ai_costs WHERE timestamp >= CURRENT_DATE;
  1. Shut down worker (if needed):
# Via API curl -X DELETE \ -H "Authorization: Bearer $VAST_API_KEY" \ https://cloud.vast.ai/api/v0/serverless/workers/<worker-id>
  1. Enable fallback:
# Update agent-router to use CPU models docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml \ exec agent-router \ sed -i 's/USE_VASTAI=true/USE_VASTAI=false/' /app/.env # Restart docker-compose restart agent-router
  1. Investigate root cause:

    • Check logs for stuck requests
    • Review request patterns
    • Identify optimization opportunities
  2. Notify team:

    • Post in #platform-alerts
    • Tag @platform-ops

Service Down (Complete Outage)

Immediate Actions:

  1. Enable fallback routing:

    • agent-router automatically falls back to NAS CPU → OpenAI
    • Verify fallback working: check agent-router logs
  2. Notify users (if impacting service):

    • Post in #incidents
    • Update status page
  3. Investigate:

    • Check Vast.ai dashboard
    • Check Tailscale connectivity
    • Check worker logs
  4. Restore service:

    • Restart worker if crashed
    • Redeploy if needed
    • Update DNS if IP changed
  5. Post-mortem:

    • Document incident
    • Identify preventions
    • Update runbooks

References

Internal:

External:

Team:


Last Updated: 2026-01-28 Maintainer: Platform Operations Status: Production