OPERATIONS GUIDE

Vast.ai Operations Guide - BlueFly Platform

Purpose: Day-to-day operational procedures for Vast.ai GPU infrastructure.

For: Platform operators, DevOps, on-call engineers

Knowledge Base: /research/vast-ai/ - Comprehensive technical reference

Quick Reference

Endpoints:

vLLM: vastai-gpu.tailcf98b3.ts.net:8000/v1
Ollama: vastai-gpu.tailcf98b3.ts.net:11434

Access:

Dashboard: https://cloud.vast.ai/
API Key: Stored in 1Password (search: "Vast.ai API Key")

Budget:

Target: <$100/month
Alert: $150/month
Hard Cap: $200/month

Integration:

Used by: agent-router (NAS:4000)
Fallback: NAS CPU models + OpenAI API

Daily Operations

Morning Health Check

Script: /Volumes/AgentPlatform/scripts/vast-ai-health-check.sh

#!/bin/bash
# pipefail: a failed curl fails the whole pipeline, so the OFFLINE fallback actually fires
set -o pipefail

echo "=== Vast.ai Health Check $(date) ==="

# Test vLLM (-f: treat HTTP errors as failures; --max-time: do not hang on a dead endpoint)
echo "vLLM Endpoint:"
curl -sf --max-time 10 http://vastai-gpu.tailcf98b3.ts.net:8000/v1/models | jq -r '.data[].id' || echo "❌ OFFLINE"

# Test Ollama
echo "Ollama Endpoint:"
curl -sf --max-time 10 http://vastai-gpu.tailcf98b3.ts.net:11434/api/tags | jq -r '.models[].name' || echo "❌ OFFLINE"

# Check yesterday's costs (previous calendar day, midnight to midnight)
echo "Yesterday's Spend:"
psql -h timescaledb -p 5433 -U bluefly -d analytics -t -c \
  "SELECT '\$' || ROUND(COALESCE(SUM(cost), 0)::numeric, 2) FROM vast_ai_costs WHERE timestamp >= CURRENT_DATE - INTERVAL '1 day' AND timestamp < CURRENT_DATE"

echo "=== Health Check Complete ==="

Expected Result:

vLLM Endpoint:
llama-2-7b

Ollama Endpoint:
llama2:7b

Yesterday's Spend:
$2.34

Actions if OFFLINE:

Check Tailscale: tailscale status | grep vastai
Check worker status on dashboard
If cold start: Wait 30s and retry
If truly down: See the Troubleshooting section below
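
The "wait 30s and retry" step can be scripted so that a cold start is not reported as an outage. The following is a minimal sketch using only the Python standard library; the helper names `http_ok` and `probe_with_retry`, the retry count, and the delay are illustrative assumptions, not part of the existing health-check script.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, timeouts and connection errors are OSError subclasses
        return False


def probe_with_retry(
    probe: Callable[[], bool],
    retries: int = 2,
    delay_sec: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `probe`; on failure wait `delay_sec` and retry, up to `retries` extra times."""
    for attempt in range(retries + 1):
        if probe():
            return True
        if attempt < retries:
            sleep(delay_sec)  # give a cold-starting worker time to come up
    return False
```

A call such as `probe_with_retry(lambda: http_ok("http://vastai-gpu.tailcf98b3.ts.net:8000/v1/models"))` would report the endpoint as down only after all attempts fail. The `sleep` parameter is injectable so the logic can be exercised without real waiting.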

Cost Monitoring

Check Current Month Spend

-- Connect to TimescaleDB
psql -h timescaledb -p 5433 -U bluefly -d analytics

-- Current month costs
SELECT
  DATE_TRUNC('day', timestamp)::date as day,
  ROUND(SUM(cost)::numeric, 2) as daily_cost,
  COUNT(*) as requests
FROM vast_ai_costs
WHERE timestamp >= DATE_TRUNC('month', NOW())
GROUP BY day
ORDER BY day DESC;

Interpret Results:

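$0-5/day: ✅ Excellent (on track for <$150/month; staying under the $100 target needs about $3.30/day or less)
$5-10/day: ⚠️ Moderate (on track for $150-300/month, which passes the $150 alert and can reach the $200 hard cap; review usage)
$10+/day: ❌ High (on track for $300+/month; review usage, optimize)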
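
The monthly figures in these bands are simple linear projections of the daily rate. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the handling of the exact boundaries ($5 counts as moderate, $10 as high) is an assumption, since the bands above overlap at their edges.

```python
import calendar
from datetime import date


def project_month(spend_to_date: float, today: date) -> float:
    """Linear projection: average daily spend so far times the days in the month.

    Counts `today` as a full day, so early-morning runs slightly underestimate.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def classify_daily(daily_cost: float) -> str:
    """Map one day's spend onto the bands listed above."""
    if daily_cost < 5:
        return "excellent"
    if daily_cost < 10:
        return "moderate"
    return "high"
```

For example, $30.00 spent by the 10th of a 31-day month projects to $93.00, just inside the target.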

Cost Alerts

Alert Thresholds (Grafana):

$50/month: Info (on track)
$100/month: Warning (review usage)
$150/month: Critical (optimize now)
$200/month: Emergency (hard cap, fallback to CPU)
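
For scripts that need the same levels outside Grafana, the thresholds can be expressed as a lookup. This is an illustrative sketch only: the real alert rules live in Grafana, and `severity` is a hypothetical helper that copies the values listed above.

```python
from typing import Optional

# (threshold in USD/month, severity), highest first; values copied from the list above
THRESHOLDS = [
    (200.0, "emergency"),
    (150.0, "critical"),
    (100.0, "warning"),
    (50.0, "info"),
]


def severity(month_to_date_usd: float) -> Optional[str]:
    """Return the highest severity whose threshold has been reached, or None below $50."""
    for threshold, level in THRESHOLDS:
        if month_to_date_usd >= threshold:
            return level
    return None
```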

Alert Channels:

Email: platform-ops@bluefly.io
Slack: #platform-alerts

Troubleshooting

Symptom: Endpoint Not Responding

Diagnosis Steps:

Test endpoint:

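# Use GET (-i prints the response headers); a HEAD request (-I) may be rejected with 405
curl -i --max-time 10 http://vastai-gpu.tailcf98b3.ts.net:8000/health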

Expected: HTTP/1.1 200 OK
If timeout: Cold start (wait 30s) or worker down

Check Tailscale:

tailscale status | grep vastai-gpu

Expected: vastai-gpu ... active
If offline: Tailscale connection issue

Check worker status:

Log in to https://cloud.vast.ai/serverless/
Verify worker status
Check logs if errors

Common Resolutions:

Cold Start (Expected):

First request after idle takes 10-30s
Subsequent requests fast
This is normal, no action needed

Tailscale Down:

ssh blueflynas
sudo tailscale status
sudo tailscale down && sudo tailscale up  # Restart if needed

Worker Crashed:

Check logs on Vast.ai dashboard
Restart worker or redeploy
Notify team in #platform-alerts

Symptom: High Costs

Quick Check:

-- Top cost drivers today
SELECT
  model,
  COUNT(*) as requests,
  ROUND(SUM(cost)::numeric, 2) as total_cost,
  ROUND(AVG(duration)::numeric, 2) as avg_duration_sec
FROM vast_ai_costs
WHERE timestamp >= CURRENT_DATE
GROUP BY model
ORDER BY total_cost DESC;

Common Causes & Fixes:

Many Cold Starts
- Cause: Worker scaled to zero between requests
- Fix: Increase idle timeout from 5 min to 10 min
Large Model Running
- Cause: Using 70B model instead of 7B
- Fix: Route simpler queries to 7B model
No Batching
- Cause: Processing requests sequentially
- Fix: Implement batching (see optimization guide)
Worker Not Scaling Down
- Cause: Idle timeout too long
- Fix: Reduce timeout or check for stuck requests

Symptom: Slow Responses (>10s)

Check Latency Distribution:

-- Latency percentiles
SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY duration) as p50_sec,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY duration) as p95_sec,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY duration) as p99_sec,
  ROUND(AVG(CASE WHEN cold_start THEN 1 ELSE 0 END)::numeric, 2) as cold_start_rate
FROM vast_ai_costs
WHERE timestamp >= NOW() - INTERVAL '24 hours';

Interpretation:

p50 < 2s, p95 < 5s: ✅ Good performance
p50 < 5s, p95 < 10s: ⚠️ Acceptable (some cold starts)
p50 >= 5s or p95 >= 10s: ❌ Investigate (many cold starts or slow GPU)

Fixes:

If cold_start_rate > 0.2: Keep worker warm (increase timeout or min_workers=1)
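If p50 > 5s even when warm: Use a faster GPU. Note that the RTX A6000 offers more VRAM than the RTX 4090 (48 GB vs 24 GB) but lower memory bandwidth, so moving RTX 4090 → A6000 helps fit larger models rather than speed up a model that already fits.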
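
The interpretation rules can be applied mechanically to the query output. A minimal sketch, assuming a hypothetical `LatencyStats` container for the query's columns; anything that does not meet the "good" or "acceptable" band is treated as "investigate", which is an assumption about how to handle values between the listed bands.

```python
from dataclasses import dataclass


@dataclass
class LatencyStats:
    p50_sec: float
    p95_sec: float
    cold_start_rate: float  # fraction of requests flagged cold_start, 0.0 to 1.0


def rate_latency(stats: LatencyStats) -> str:
    """Classify a 24-hour latency snapshot using the bands above."""
    if stats.p50_sec < 2 and stats.p95_sec < 5:
        return "good"
    if stats.p50_sec < 5 and stats.p95_sec < 10:
        return "acceptable"
    return "investigate"


def needs_warm_worker(stats: LatencyStats) -> bool:
    """True when more than 20% of requests hit a cold start."""
    return stats.cold_start_rate > 0.2
```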

Deployment Procedures

Deploy New Model to Vast.ai

When: Adding new model (e.g., Llama-3-8B)

Steps:

Build container with model:

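FROM vllm/vllm-openai:latest

# Hugging Face repo ID. Llama 3 is a gated model, so the build needs an HF token with access.
ENV MODEL_NAME="meta-llama/Meta-Llama-3-8B"
RUN huggingface-cli download ${MODEL_NAME}

# Shell form so $MODEL_NAME is expanded at runtime (exec-form arrays do not expand variables).
# This replaces the base image's own ENTRYPOINT, which already launches the API server.
ENTRYPOINT exec python3 -m vllm.entrypoints.openai.api_server --model "$MODEL_NAME"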

Push to GitLab Container Registry:

cd agent-router/docker/vllm
docker build -t registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest .
docker push registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest

Deploy via Vast.ai API:

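import os

import requests

# Fail fast with a KeyError if the API key is not set
api_key = os.environ["VAST_API_KEY"]

response = requests.post(
    "https://cloud.vast.ai/api/v0/serverless/workers/",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "image": "registry.gitlab.com/blueflyio/vast-ai/llama-3-8b:latest",
        "gpu_type": "RTX_4090",
        "max_workers": 3,
        "idle_timeout": 300,  # seconds (5 min)
        "min_workers": 0,     # scale to zero when idle
    },
    timeout=30,
)
# Stop on a 4xx/5xx response instead of failing later on a missing key
response.raise_for_status()

worker = response.json()
print(f"Deployed worker: {worker['id']}")
print(f"Endpoint: {worker['endpoint']}")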

Update agent-router config:

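# Add to .env (replace <worker-endpoint> with the Endpoint value printed by the deploy script)
echo "VLLM_ENDPOINT_LLAMA3=<worker-endpoint>" >> /Volumes/AgentPlatform/docker/.env

# Recreate agent-router so it picks up the new variable (a plain restart keeps the old environment)
docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml up -d --force-recreate agent-router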

Verify:

# Replace <worker-endpoint> with the Endpoint value printed by the deploy script
curl "http://<worker-endpoint>/v1/models"
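
To turn the verification into a pass/fail check, the `/v1/models` response can be parsed for the expected model ID. A minimal sketch, assuming the OpenAI-style response shape that the health check's `jq -r '.data[].id'` filter already relies on; `model_ids` and `is_served` are hypothetical helper names.

```python
import json
from typing import List


def model_ids(models_response: str) -> List[str]:
    """Extract model IDs from a /v1/models JSON body ({"data": [{"id": ...}, ...]})."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


def is_served(models_response: str, expected_id: str) -> bool:
    """True if the endpoint lists `expected_id` among its models."""
    return expected_id in model_ids(models_response)
```

The response body can come from the `curl` command above, for example by piping its output into a script that calls `is_served` with the model ID you expect the worker to serve.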

Update Worker Configuration

When: Adjusting idle timeout, GPU type, or max workers

Via Dashboard (easiest):

Log in to https://cloud.vast.ai/serverless/
Find worker
Click "Edit"
Update settings
Save

Via API:

# Get worker ID
curl -H "Authorization: Bearer $VAST_API_KEY" \
  https://cloud.vast.ai/api/v0/serverless/workers/ | jq '.[] | {id, status}'

# Update idle timeout
curl -X PATCH \
  -H "Authorization: Bearer $VAST_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"idle_timeout": 600}' \
  https://cloud.vast.ai/api/v0/serverless/workers/<worker-id>

Maintenance

Weekly Review

Every Monday:

Review costs:

-- Last 7 days cost trend
SELECT
  DATE_TRUNC('day', timestamp)::date as day,
  ROUND(SUM(cost)::numeric, 2) as daily_cost,
  COUNT(*) as requests,
  ROUND((SUM(cost) / COUNT(*))::numeric, 5) as cost_per_request
FROM vast_ai_costs
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY day
ORDER BY day DESC;

Check for optimization opportunities:
- High daily costs? → Implement batching or caching
- Many cold starts? → Increase timeout
- Low usage? → Reduce max_workers
Update team:
- Post summary in #platform-updates
- Flag any concerns

Monthly Review

End of month:

Calculate total spend:

SELECT
  ROUND(SUM(cost)::numeric, 2) as monthly_cost,
  COUNT(*) as total_requests,
  ROUND((SUM(cost) / COUNT(*))::numeric, 5) as avg_cost_per_request
FROM vast_ai_costs
WHERE timestamp >= DATE_TRUNC('month', NOW());

Compare to budget:
- Target: <$100/month
- Acceptable: $100-150/month
- Over budget: >$150/month (optimize)
Plan optimizations if over budget:
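- Implement batching (estimated savings of up to 80%; see the Cost Optimization guide)
- Add caching (estimated savings of up to 50%; see the Cost Optimization guide)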
- Use cheaper GPU
- Quantize models

Emergency Procedures

Runaway Costs (>$50/day)

Immediate Actions:

Check current spend:

SELECT ROUND(SUM(cost)::numeric, 2) FROM vast_ai_costs WHERE timestamp >= CURRENT_DATE;

Shut down worker (if needed):

# Via API
curl -X DELETE \
  -H "Authorization: Bearer $VAST_API_KEY" \
  https://cloud.vast.ai/api/v0/serverless/workers/<worker-id>

Enable fallback:

# Update agent-router to use CPU models
docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml \
  exec agent-router \
  sed -i 's/USE_VASTAI=true/USE_VASTAI=false/' /app/.env

# Restart (same compose file, so the command works from any directory)
docker-compose -f /Volumes/AgentPlatform/docker/docker-compose.yml restart agent-router

Investigate root cause:
- Check logs for stuck requests
- Review request patterns
- Identify optimization opportunities
Notify team:
- Post in #platform-alerts
- Tag @platform-ops
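
The trigger and the fallback switch can be expressed as two small functions, which is useful if this procedure is ever automated. This is a sketch under stated assumptions: `is_runaway` and `disable_vastai` are hypothetical names, the $50 threshold is taken from the heading above, and unlike the `sed` command this version only rewrites lines that consist of exactly `USE_VASTAI=true`.

```python
import re

RUNAWAY_USD_PER_DAY = 50.0  # threshold from the "Runaway Costs (>$50/day)" heading


def is_runaway(today_spend_usd: float) -> bool:
    """True when today's spend exceeds the runaway threshold."""
    return today_spend_usd > RUNAWAY_USD_PER_DAY


def disable_vastai(env_text: str) -> str:
    """Return the .env contents with USE_VASTAI=true switched to USE_VASTAI=false."""
    return re.sub(
        r"^USE_VASTAI=true$",
        "USE_VASTAI=false",
        env_text,
        flags=re.MULTILINE,  # let ^ and $ match at each line, not just the whole text
    )
```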

Service Down (Complete Outage)

Immediate Actions:

Enable fallback routing:
- agent-router automatically falls back to NAS CPU → OpenAI
- Verify fallback working: check agent-router logs
Notify users (if impacting service):
- Post in #incidents
- Update status page
Investigate:
- Check Vast.ai dashboard
- Check Tailscale connectivity
- Check worker logs
Restore service:
- Restart worker if crashed
- Redeploy if needed
- Update DNS if IP changed
Post-mortem:
- Document incident
- Identify preventions
- Update runbooks
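
For readers unfamiliar with the fallback order (Vast.ai → NAS CPU → OpenAI), the idea can be sketched as an ordered list of backends tried in turn. This is an illustration of the pattern only, not agent-router's actual implementation; `route` and the `Backend` type are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is a (name, callable) pair; the callable takes a prompt and returns a reply
Backend = Tuple[str, Callable[[str], str]]


def route(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, reply) from the first that succeeds."""
    errors: List[str] = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

The returned backend name shows which tier served a request, which is the same information to look for in the agent-router logs when verifying that fallback is working.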

References

Internal:

Knowledge Base: /research/vast-ai/
Agent Best Practices: /research/vast-ai/11-agent-best-practices.md
Cost Optimization: /research/vast-ai/07-cost-optimization.md

External:

Vast.ai Dashboard: https://cloud.vast.ai/
Vast.ai Docs: https://docs.vast.ai/
Support: Discord (https://discord.gg/A9rVHm4MXG)

Team:

Oncall: platform-ops@bluefly.io
Alerts: #platform-alerts (Slack)
Updates: #platform-updates (Slack)

Last Updated: 2026-01-28
Maintainer: Platform Operations
Status: Production