OPERATIONS GUIDE
Vast.ai Operations Guide - BlueFly Platform
Purpose: Day-to-day operational procedures for Vast.ai GPU infrastructure.
For: Platform operators, DevOps, oncall engineers
Knowledge Base: /research/vast-ai/ - Comprehensive technical reference
Quick Reference
Endpoints:
- vLLM: vastai-gpu.tailcf98b3.ts.net:8000/v1
- Ollama: vastai-gpu.tailcf98b3.ts.net:11434
Access:
- Dashboard: https://cloud.vast.ai/
- API Key: Stored in 1Password (search: "Vast.ai API Key")
Budget:
- Target: <$100/month
- Alert: $150/month
- Hard Cap: $200/month
Integration:
- Used by: agent-router (NAS:4000)
- Fallback: NAS CPU models + OpenAI API
Daily Operations
Morning Health Check
Script: /Volumes/AgentPlatform/scripts/vast-ai-health-check.sh
#!/bin/bash
# pipefail: a failed curl must count as a failure even though jq runs last in the pipe
set -o pipefail

echo "=== Vast.ai Health Check $(date) ==="

# Test vLLM (-f fails on HTTP errors; --max-time leaves room for a cold start)
echo "vLLM Endpoint:"
curl -sf --max-time 45 http://vastai-gpu.tailcf98b3.ts.net:8000/v1/models | jq -r '.data[].id' || echo "❌ OFFLINE"

# Test Ollama
echo "Ollama Endpoint:"
curl -sf --max-time 45 http://vastai-gpu.tailcf98b3.ts.net:11434/api/tags | jq -r '.models[].name' || echo "❌ OFFLINE"

# Check yesterday's costs (the query covers the trailing 24 hours)
echo "Yesterday's Spend:"
psql -h timescaledb -p 5433 -U bluefly -d analytics -t -c \
  "SELECT '\$' || ROUND(COALESCE(SUM(cost), 0)::numeric, 2)
   FROM vast_ai_costs
   WHERE timestamp >= NOW() - INTERVAL '1 day'"

echo "=== Health Check Complete ==="
Expected Result:
vLLM Endpoint:
llama-2-7b
Ollama Endpoint:
llama2:7b
Yesterday's Spend:
$2.34
Actions if OFFLINE:
- Check Tailscale: tailscale status | grep vastai
- Check worker status on dashboard
- If cold start: Wait 30s and retry
- If truly down: Check #11 Troubleshooting section
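The "wait 30s and retry" step can be scripted so that a cold start is not mistaken for an outage. The following is a minimal sketch, not part of the existing scripts: `http_ok` and `wait_for_endpoint` are hypothetical helper names, and the attempt count and timeout are assumptions chosen to match the 30s guidance above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=10):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        return False


def wait_for_endpoint(url, attempts=3, delay=30, probe=http_ok, sleep=time.sleep):
    """Probe url up to `attempts` times, sleeping `delay` seconds between tries.

    Returns the 1-based number of the attempt that succeeded, or 0 if all failed.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return 0
```

Calling `wait_for_endpoint("http://vastai-gpu.tailcf98b3.ts.net:8000/v1/models")` returns 0 only after every attempt has failed, which is the point at which the remaining actions in the list apply. The `probe` and `sleep` parameters exist so the loop can be exercised without network access.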
Cost Monitoring
Check Current Month Spend
-- Connect to TimescaleDB
psql -h timescaledb -p 5433 -U bluefly -d analytics

-- Current month costs
SELECT
  DATE_TRUNC('day', timestamp)::date as day,
  ROUND(SUM(cost)::numeric, 2) as daily_cost,
  COUNT(*) as requests
FROM vast_ai_costs
WHERE timestamp >= DATE_TRUNC('month', NOW())
GROUP BY day
ORDER BY day DESC;
Interpret Results:
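- $0-5/day: ✅ Excellent (projects to at most $150/month; staying under the $100 target needs about $3.33/day or less)
- $5-10/day: ⚠️ Moderate (projects to $150-300/month, at or above the $150 alert threshold)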
- $10+/day: ❌ High (review usage, optimize)
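The monthly figures in parentheses come from multiplying a daily rate by the number of days in the month. A small sketch of that arithmetic follows, assuming a simple linear extrapolation from month-to-date spend; the function names are hypothetical, and the choice to treat exactly $5 as "moderate" and exactly $10 as "high" is an assumption, since the bands above share their boundary values.

```python
import calendar
import datetime


def project_month_end(month_to_date, today):
    """Linearly extrapolate month-to-date spend (USD) to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date / today.day * days_in_month


def daily_band(daily_cost):
    """Map one day's spend (USD) to the bands listed above."""
    if daily_cost < 5:
        return "excellent"
    if daily_cost < 10:
        return "moderate"
    return "high"
```

For example, $30 spent by 10 January averages $3/day and projects to $93 for the 31-day month, which stays under the $100 target.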
Cost Alerts
Alert Thresholds (Grafana):
- $50/month: Info (on track)
- $100/month: Warning (review usage)
- $150/month: Critical (optimize now)
- $200/month: Emergency (hard cap, fallback to CPU)
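The thresholds can be read as a ladder: the alert that fires is the highest one whose threshold has been reached. A minimal sketch of that lookup follows. It is illustrative only and not the Grafana rule definition; `ALERT_THRESHOLDS` and `alert_level` are hypothetical names.

```python
# Thresholds in USD per month, copied from the list above, highest first.
ALERT_THRESHOLDS = [
    (200, "emergency"),
    (150, "critical"),
    (100, "warning"),
    (50, "info"),
]


def alert_level(month_to_date):
    """Return the highest alert level whose threshold has been reached, or None."""
    for threshold, level in ALERT_THRESHOLDS:
        if month_to_date >= threshold:
            return level
    return None
```

Ordering the table from highest to lowest lets the first match win, so a spend of $160 reports "critical" rather than "info".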
Alert Channels:
- Email: platform-ops@bluefly.io
- Slack: #platform-alerts
Troubleshooting
Symptom: Endpoint Not Responding
Diagnosis Steps:
- Test endpoint:
curl -I http://vastai-gpu.tailcf98b3.ts.net:8000/health
Expected: HTTP/1.1 200 OK
If timeout: Cold start (wait 30s) or worker down
- Check Tailscale:
tailscale status | grep vastai-gpu
Expected: vastai-gpu ... active
If offline: Tailscale connection issue
- Check worker status:
- Log in to https://cloud.vast.ai/serverless/
- Verify worker status
- Check logs if errors
Common Resolutions:
Cold Start (Expected):
- First request after idle takes 10-30s
- Subsequent requests fast
- This is normal, no action needed
Tailscale Down:
ssh blueflynas
sudo tailscale status
sudo tailscale down && sudo tailscale up  # Restart if needed
Worker Crashed:
- Check logs on Vast.ai dashboard
- Restart worker or redeploy
- Notify team in #platform-alerts
Symptom: High Costs
Quick Check:
-- Top cost drivers today
SELECT
  model,
  COUNT(*) as requests,
  ROUND(SUM(cost)::numeric, 2) as total_cost,
  ROUND(AVG(duration)::numeric, 2) as avg_duration_sec
FROM vast_ai_costs
WHERE timestamp >= CURRENT_DATE
GROUP BY model
ORDER BY total_cost DESC;
Common Causes & Fixes:
- Many Cold Starts
  - Cause: Worker scaled to zero between requests
  - Fix: Increase idle timeout from 5 min to 10 min
- Large Model Running
  - Cause: Using 70B model instead of 7B
  - Fix: Route simpler queries to 7B model
- No Batching
  - Cause: Processing requests sequentially
  - Fix: Implement batching (see optimization guide)
- Worker Not Scaling Down
  - Cause: Idle timeout too long
  - Fix: Reduce timeout or check for stuck requests
Symptom: Slow Responses (>10s)
Check Latency Distribution:
-- Latency percentiles
SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY duration) as p50_sec,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY duration) as p95_sec,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY duration) as p99_sec,
  ROUND(AVG(CASE WHEN cold_start THEN 1 ELSE 0 END)::numeric, 2) as cold_start_rate
FROM vast_ai_costs
WHERE timestamp >= NOW() - INTERVAL '24 hours';
Interpretation:
- p50 < 2s, p95 < 5s: ✅ Good performance
- p50 < 5s, p95 < 10s: ⚠️ Acceptable (some cold starts)
- p50 > 5s, p95 > 20s: ❌ Investigate (many cold starts or slow GPU)
Fixes:
- If cold_start_rate > 0.2: Keep worker warm (increase timeout or min_workers=1)
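- If p50 > 5s even when warm: Try a different GPU class (RTX 4090 → A6000). The A6000 mainly adds VRAM (48 GB vs 24 GB) rather than raw speed, so benchmark before switching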
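The interpretation bands and fixes can be combined into one check over the query result. The sketch below is one possible reading, not an existing tool: `diagnose_latency` is a hypothetical name, and because the listed bands leave gaps (for example p95 between 10s and 20s), it assumes anything that is neither "good" nor "acceptable" should be investigated.

```python
def diagnose_latency(p50, p95, cold_start_rate):
    """Return (status, suggestions) for 24-hour latency percentiles in seconds."""
    if p50 < 2 and p95 < 5:
        status = "good"
    elif p50 < 5 and p95 < 10:
        status = "acceptable"
    else:
        # Assumption: every remaining combination is worth investigating
        status = "investigate"

    suggestions = []
    if cold_start_rate > 0.2:
        suggestions.append("keep worker warm: raise idle_timeout or set min_workers=1")
    if p50 > 5:
        suggestions.append("check warm-request latency; consider a different GPU")
    return status, suggestions
```

Passing the `p50_sec`, `p95_sec`, and `cold_start_rate` columns from the query gives a status plus the fixes that apply, so both conditions are reported when they occur together.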
Deployment Procedures
Deploy New Model to Vast.ai
When: Adding new model (e.g., Llama-3-8B)
Steps:
- Build container with model:
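FROM vllm/vllm-openai:latest
# Confirm the exact Hugging Face repo ID; gated models also need an HF token at build time
ENV MODEL_NAME="meta-llama/Llama-3-8b"
# Bake the model weights into the image
RUN huggingface-cli download ${MODEL_NAME}
# Exec-form CMD does not expand ${MODEL_NAME}, and the base image may define its own
# ENTRYPOINT, so start the server through a shell that reads the variable at run time
ENTRYPOINT ["/bin/sh", "-c", "exec python3 -m vllm.entrypoints.openai.api_server --model \"$MODEL_NAME\""]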
- Push to GitLab Container Registry:
cd agent-router/docker/vllm
docker build -t registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest .
docker push registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest
- Deploy via Vast.ai API:
import os

import requests

response = requests.post(
    "https://cloud.vast.ai/api/v0/serverless/workers/",
    # os.environ[...] fails fast if the key is missing instead of sending "Bearer None"
    headers={"Authorization": f"Bearer {os.environ['VAST_API_KEY']}"},
    json={
        "image": "registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest",
        "gpu_type": "RTX_4090",
        "max_workers": 3,
        "idle_timeout": 300,
        "min_workers": 0,
    },
    timeout=30,
)
# Stop on an API error rather than failing later on a missing key in the response
response.raise_for_status()

worker = response.json()
print(f"Deployed worker: {worker['id']}")
print(f"Endpoint: {worker['endpoint']}")
- Update agent-router config:
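# Set this to the endpoint printed by the deploy step
WORKER_ENDPOINT="<worker-endpoint>"

# Add to .env
echo "VLLM_ENDPOINT_LLAMA3=${WORKER_ENDPOINT}" >> /Volumes/AgentPlatform/docker/.env

# Recreate agent-router so it picks up the new variable (a plain restart does not re-read .env)
docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml up -d --force-recreate agent-router
- Verify:
curl "http://${WORKER_ENDPOINT}/v1/models"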
Update Worker Configuration
When: Adjusting idle timeout, GPU type, or max workers
Via Dashboard (easiest):
- Log in to https://cloud.vast.ai/serverless/
- Find worker
- Click "Edit"
- Update settings
- Save
Via API:
# Get worker ID
curl -H "Authorization: Bearer $VAST_API_KEY" \
  https://cloud.vast.ai/api/v0/serverless/workers/ | jq '.[] | {id, status}'

# Update idle timeout (replace <worker-id> with an ID from the previous command)
curl -X PATCH \
  -H "Authorization: Bearer $VAST_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"idle_timeout": 600}' \
  "https://cloud.vast.ai/api/v0/serverless/workers/<worker-id>"
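When the update is scripted rather than typed, building the request in one place helps keep the URL, headers, and body consistent. The sketch below mirrors the curl call above using only the standard library. The base URL is copied from this guide and has not been checked against the Vast.ai documentation, and `build_worker_patch` is a hypothetical helper name.

```python
import json
import urllib.request

# Base URL taken from the curl example above; treat it as an assumption.
WORKERS_URL = "https://cloud.vast.ai/api/v0/serverless/workers/"


def build_worker_patch(worker_id, api_key, **settings):
    """Build (but do not send) a PATCH request that updates worker settings."""
    if not settings:
        raise ValueError("at least one setting is required")
    return urllib.request.Request(
        url=f"{WORKERS_URL}{worker_id}",
        data=json.dumps(settings).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

A request built with `build_worker_patch(worker_id, api_key, idle_timeout=600)` can be inspected first and then sent with `urllib.request.urlopen(req, timeout=30)`.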
Maintenance
Weekly Review
Every Monday:
- Review costs:
-- Last 7 days cost trend
SELECT
  DATE_TRUNC('day', timestamp)::date as day,
  ROUND(SUM(cost)::numeric, 2) as daily_cost,
  COUNT(*) as requests,
  ROUND((SUM(cost) / COUNT(*))::numeric, 5) as cost_per_request
FROM vast_ai_costs
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY day
ORDER BY day DESC;
- Check for optimization opportunities:
  - High daily costs? → Implement batching or caching
  - Many cold starts? → Increase timeout
  - Low usage? → Reduce max_workers
- Update team:
  - Post summary in #platform-updates
  - Flag any concerns
Monthly Review
End of month:
- Calculate total spend:
SELECT
  ROUND(SUM(cost)::numeric, 2) as monthly_cost,
  COUNT(*) as total_requests,
  ROUND((SUM(cost) / COUNT(*))::numeric, 5) as avg_cost_per_request
FROM vast_ai_costs
WHERE timestamp >= DATE_TRUNC('month', NOW());
- Compare to budget:
  - Target: <$100/month
  - Acceptable: $100-150/month
  - Over budget: >$150/month (optimize)
- Plan optimizations if over budget:
  - Implement batching (80% savings)
  - Add caching (50% savings)
  - Use cheaper GPU
  - Quantize models
Emergency Procedures
Runaway Costs (>$50/day)
Immediate Actions:
- Check current spend:
SELECT ROUND(SUM(cost)::numeric, 2) FROM vast_ai_costs WHERE timestamp >= CURRENT_DATE;
- Shut down worker (if needed):
# Via API (replace <worker-id> with the worker's ID)
curl -X DELETE \
  -H "Authorization: Bearer $VAST_API_KEY" \
  "https://cloud.vast.ai/api/v0/serverless/workers/<worker-id>"
- Enable fallback:
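# Update agent-router to use CPU models
docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml \
  exec agent-router \
  sed -i 's/USE_VASTAI=true/USE_VASTAI=false/' /app/.env

# Restart (pass the same compose file so the command works from any directory)
docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml restart agent-router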
- Investigate root cause:
  - Check logs for stuck requests
  - Review request patterns
  - Identify optimization opportunities
- Notify team:
  - Post in #platform-alerts
  - Tag @platform-ops
Service Down (Complete Outage)
Immediate Actions:
- Enable fallback routing:
  - agent-router automatically falls back to NAS CPU → OpenAI
  - Verify fallback working: check agent-router logs
- Notify users (if impacting service):
  - Post in #incidents
  - Update status page
- Investigate:
  - Check Vast.ai dashboard
  - Check Tailscale connectivity
  - Check worker logs
- Restore service:
  - Restart worker if crashed
  - Redeploy if needed
  - Update DNS if IP changed
- Post-mortem:
  - Document incident
  - Identify preventions
  - Update runbooks
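The fallback order named above (Vast.ai, then NAS CPU, then OpenAI) follows a common try-in-order pattern. The sketch below illustrates that pattern only. It is not agent-router's actual implementation, and `route_with_fallback` and the backend callables are hypothetical.

```python
def route_with_fallback(prompt, backends):
    """Try each (name, call) backend in order and return (name, result).

    A backend signals failure by raising an exception; the next one is then tried.
    """
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Returning the backend name with the result makes it easy to confirm from logs which tier served a request, which is what the "verify fallback working" step checks.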
References
Internal:
- Knowledge Base: /research/vast-ai/
- Agent Best Practices: /research/vast-ai/11-agent-best-practices.md
- Cost Optimization: /research/vast-ai/07-cost-optimization.md
External:
- Vast.ai Dashboard: https://cloud.vast.ai/
- Vast.ai Docs: https://docs.vast.ai/
- Support: Discord (https://discord.gg/A9rVHm4MXG)
Team:
- Oncall: platform-ops@bluefly.io
- Alerts: #platform-alerts (Slack)
- Updates: #platform-updates (Slack)
Last Updated: 2026-01-28
Maintainer: Platform Operations
Status: Production