Oracle VM Disaster Recovery

Last Updated: 2026-02-15
Owner: Infrastructure Team
Review Cycle: Monthly


Overview

This document outlines the disaster recovery strategy for the Oracle VM infrastructure running at 100.103.48.75 (oracle.tailcf98b3.ts.net).

Critical Risk: If the Oracle VM disappeared today, the impact would be:

  • Service code: SAFE (8 git repos can be redeployed)
  • Production data: LOST (all databases, vector stores, logs)
  • Production configs: LOST (main .env + 11 service-specific configs)
  • Service orchestration: LOST (main docker-compose.yml not in git)

What Would Be Lost (NOT in Git)

Production Environment Files

CRITICAL - Contains all secrets and production configs

File | Contents | Impact
/opt/bluefly/.env | Main env for docker-compose (POSTGRES_PASSWORD, JWT_SECRET, API keys) | CRITICAL - All services fail
/opt/bluefly/docker-compose.yml | Service orchestration, network config, volume mappings | CRITICAL - Cannot recreate infrastructure

Service-Specific .env Files (11 files):

  • /opt/bluefly/agent-protocol/.env.production
  • /opt/bluefly/agent-chat/infrastructure/.env.production
  • /opt/bluefly/agent-tracer/.env.{observability,k8s,analytics}
  • /opt/bluefly/compliance-engine/.env.{phoenix,unified-auth}
  • /opt/bluefly/agent-router/.env.{litellm,phoenix,vastai}
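
The file inventory above can be checked mechanically, both on the live VM and after a restore. The following is a minimal sketch, assuming the paths listed in this section are complete; the function name `missing_configs` is illustrative and not part of any existing tooling.

```shell
#!/bin/sh
# Print every expected config file that is missing under the given root.
# The list mirrors the paths named in this document; adjust it if the layout differs.
missing_configs() {
    root=$1
    for rel in \
        .env \
        docker-compose.yml \
        agent-protocol/.env.production \
        agent-chat/infrastructure/.env.production \
        agent-tracer/.env.observability \
        agent-tracer/.env.k8s \
        agent-tracer/.env.analytics \
        compliance-engine/.env.phoenix \
        compliance-engine/.env.unified-auth \
        agent-router/.env.litellm \
        agent-router/.env.phoenix \
        agent-router/.env.vastai
    do
        # Report the path only when no regular file exists there.
        [ -f "$root/$rel" ] || printf '%s\n' "$root/$rel"
    done
}
```

Running `missing_configs /opt/bluefly` prints nothing when every file is present, so empty output can serve as the pass condition in a backup or restore checklist.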

Production Data Volumes

CRITICAL - All persistent data would be permanently lost

Volume Path | Service | Data Type | Recovery
/opt/bluefly/data/postgres | PostgreSQL | All application databases | NONE (unless backed up)
/opt/bluefly/data/mongodb | MongoDB | LibreChat conversations, n8n workflows | NONE
/opt/bluefly/data/qdrant | Qdrant | Vector embeddings, semantic search indexes | NONE (re-index required)
/opt/bluefly/data/redis | Redis | Cache, session data, job queues | ACCEPTABLE (ephemeral)
/opt/bluefly/data/dragonfly-db | Dragonfly | Testing artifacts, compliance reports | MODERATE
/opt/bluefly/data/grafana | Grafana | Dashboards, alerts, data sources | MODERATE (can recreate)
/opt/bluefly/data/{loki,tempo,phoenix} | Observability | Logs, traces | LOW (historical only)
/opt/bluefly/data/agents | Agent state | Agent runtime state, task queues | HIGH

What Would Survive (In GitLab)

Service Code (Recoverable)

All 8 core services have oracle-deploy in their .gitlab-ci.yml:

  • ✅ agent-router
  • ✅ agent-mesh
  • ✅ agent-protocol
  • ✅ workflow-engine
  • ✅ agent-tracer
  • ✅ compliance-engine
  • ✅ content-guardian
  • ✅ dragonfly

Recovery: All eight services can be redeployed via CI/CD to a new Oracle VM.


Recovery Procedures

Full Disaster Recovery (Estimated: 4.5 hours)

Prerequisites:

  • Backups exist in secure location
  • GitLab access intact
  • New Oracle VM provisioned

Step 1: Restore Configs (30 minutes)

# SSH to the new Oracle VM and create the directory structure
ssh new-oracle
sudo mkdir -p /opt/bluefly/{data,backups}
sudo chown -R ubuntu:ubuntu /opt/bluefly
exit

# From the machine that holds the backups, restore docker-compose.yml and .env
scp ~/backups/oracle-vm/compose/docker-compose.yml new-oracle:/opt/bluefly/
scp ~/backups/oracle-vm/configs/.env.production new-oracle:/opt/bluefly/.env
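
Before starting any containers, it may be worth confirming that the restored .env actually defines the variables the stack depends on. The sketch below assumes a plain `KEY=value` file format; the helper name `missing_env_keys` is hypothetical, and the two keys in the usage note are the ones this document names for /opt/bluefly/.env.

```shell
#!/bin/sh
# Print each required key that is absent or has an empty value in an env file.
# Usage: missing_env_keys FILE KEY [KEY...]
missing_env_keys() {
    file=$1
    shift
    for key in "$@"; do
        # A key counts as set only if "KEY=" is followed by at least one character.
        grep -Eq "^${key}=.+" "$file" || printf '%s\n' "$key"
    done
}
```

For example, `missing_env_keys /opt/bluefly/.env POSTGRES_PASSWORD JWT_SECRET` prints nothing when both are set. Because the file holds secrets, restricting it with `chmod 600 /opt/bluefly/.env` after the copy is a reasonable extra step.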

Step 2: Restore Databases (1 hour)

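cd /opt/bluefly
docker-compose up -d postgres redis mongodb qdrant

# Restore Postgres (the daily dump is gzip-compressed; pg_dumpall output is replayed against the postgres database)
gunzip -c ~/backups/oracle-vm/postgres-YYYYMMDD.sql.gz | docker exec -i postgres psql -U bluefly -d postgres

# Restore MongoDB (the archive was written with --gzip, so mongorestore needs it too)
docker exec -i mongodb mongorestore --archive --gzip < ~/backups/oracle-vm/mongodb-YYYYMMDD.archive.gz

# Restore Qdrant (-i is required so tar inside the container can read the archive from stdin)
docker exec -i qdrant tar xzf - -C / < ~/backups/oracle-vm/qdrant-YYYYMMDD.tar.gz
docker-compose restart qdrant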

Step 3: Deploy Services via CI/CD (2 hours)

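# Start a pipeline for each service from the GitLab UI, or with glab:
glab ci run --repo blueflyio/agent-platform/agent-router --branch main

# If oracle-deploy is a manual job, trigger it by name once the pipeline exists
# (confirm the exact job name in the service's .gitlab-ci.yml):
glab ci trigger oracle-deploy --repo blueflyio/agent-platform/agent-router --branch main

# Repeat for all 8 services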
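
Repeating the trigger by hand for eight services is error-prone, so a loop helps. The sketch below only prints the commands (a dry run) so they can be reviewed before execution. It assumes every service lives under blueflyio/agent-platform/<name>, which is generalized from the single agent-router path shown in this step and should be verified, and it uses `--repo`, glab's repository selector; check `glab ci run --help` on the installed version.

```shell
#!/bin/sh
# The eight core services listed under "Service Code (Recoverable)".
SERVICES="agent-router agent-mesh agent-protocol workflow-engine agent-tracer compliance-engine content-guardian dragonfly"

# Print one pipeline-trigger command per service without executing anything.
# ASSUMPTION: project paths follow blueflyio/agent-platform/<service>.
print_deploy_commands() {
    for svc in $SERVICES; do
        printf 'glab ci run --repo blueflyio/agent-platform/%s --branch main\n' "$svc"
    done
}
```

After reviewing the output, `print_deploy_commands | sh` would run the commands in sequence.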

Step 4: Start Third-Party Services (30 minutes)

docker-compose up -d librechat langflow n8n grafana loki tempo phoenix

Step 5: Verify (30 minutes)

docker ps --format 'table {{.Names}}\t{{.Status}}'
curl http://localhost:4000/health   # agent-router
curl http://localhost:3005/health   # agent-mesh
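
Services often need some time after start before their health endpoint responds, so a single curl can report a false failure. The following is a minimal retry sketch; `probe` and `wait_healthy` are illustrative names, and the default attempt count and delay are arbitrary assumptions rather than tuned values.

```shell
#!/bin/sh
# Probe one URL. Kept separate so it can be replaced with a different check.
probe() {
    curl -fsS --max-time 5 "$1" > /dev/null 2>&1
}

# Retry the probe until it succeeds or the attempts are used up.
# Usage: wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
wait_healthy() {
    url=$1
    attempts=${2:-10}
    delay=${3:-3}
    i=1
    while :; do
        if probe "$url"; then
            echo "OK $url"
            return 0
        fi
        # Stop once the final attempt has failed; otherwise wait and retry.
        [ "$i" -ge "$attempts" ] && break
        i=$((i + 1))
        sleep "$delay"
    done
    echo "FAIL $url"
    return 1
}
```

With the two endpoints from this step, usage would be `wait_healthy http://localhost:4000/health` and `wait_healthy http://localhost:3005/health`; a non-zero exit status marks the service as not yet recovered.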

Backup Strategy

Daily Automated Backups

Databases:

# Postgres
docker exec postgres pg_dumpall -U bluefly | gzip > /opt/bluefly/backups/postgres-$(date +%Y%m%d).sql.gz

# MongoDB
docker exec mongodb mongodump --archive --gzip > /opt/bluefly/backups/mongodb-$(date +%Y%m%d).archive.gz

# Qdrant
docker exec qdrant tar czf - /qdrant/storage > /opt/bluefly/backups/qdrant-$(date +%Y%m%d).tar.gz
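
A backup that was truncated mid-write is only discovered at restore time unless it is checked when created. As a minimal sketch, the function below rejects files that are missing, empty, or fail gzip's integrity test. The name `verify_backup` is illustrative. Note that this confirms only that the compressed stream is intact, not that the dump inside it restores cleanly; a periodic test restore is still needed for that.

```shell
#!/bin/sh
# Return 0 only if the file exists, is non-empty, and passes gzip's integrity test.
verify_backup() {
    [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}
```

All three daily artifacts are gzip streams, so the same check applies to each, for example `verify_backup "postgres-$(date +%Y%m%d).sql.gz" || echo "postgres backup failed verification"`.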

Retention:

  • Daily backups: 7 days
  • Weekly backups: 30 days
  • Monthly backups: 12 months
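
The retention windows above can be enforced with an age-based prune. The sketch below assumes each tier is stored in its own directory; the `daily`, `weekly`, and `monthly` subdirectory names in the usage note are hypothetical, since this document does not say how the tiers are separated.

```shell
#!/bin/sh
# Delete regular files under DIR whose age in whole days exceeds DAYS.
# Usage: prune_older_than DIR DAYS
prune_older_than() {
    dir=$1
    days=$2
    # -mtime +N matches files last modified more than N full days ago.
    find "$dir" -type f -mtime +"$days" -exec rm -f {} +
}
```

Under that layout, the three tiers would map to `prune_older_than /opt/bluefly/backups/daily 7`, `prune_older_than /opt/bluefly/backups/weekly 30`, and `prune_older_than /opt/bluefly/backups/monthly 365`. Running the prune only after the NAS sync has succeeded avoids deleting the last good copy.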

Sync to NAS:

rsync -avz /opt/bluefly/backups/ blueflynas:/volume1/backups/oracle-vm/

Currently Running Services

Service | Status | Health

Core Platform
agent-router | Running | ✅ Healthy
agent-mesh | Running | ✅ Healthy
agent-protocol | Running | ✅ Healthy
workflow-engine | Running | ✅ Healthy
compliance-engine | Running | ✅ Healthy
agent-tracer | Running | ✅ Healthy
dragonfly | Running | ✅ Healthy

Infrastructure
postgres | Running | ✅ Healthy
redis | Running | ✅ Healthy
qdrant | Running | ✅ Healthy
mongodb | Running | ✅ Healthy

Observability
grafana | Running | ✅ Healthy
loki | Running | ⚠️ No health check
tempo | Running | ⚠️ No health check
phoenix | Running | ❌ Unhealthy
otel-collector | Running | ❌ Unhealthy

Configuration Management

Tracked in Git

Location: gitlab_components/infrastructure/oracle-vm/

  • docker-compose.production.yml - Service orchestration
  • .env.template - Environment variable template (NO secrets)

Secret Management

Location: Secure encrypted backup (NOT in git)

  • Main .env file
  • Service-specific .env files
  • Database passwords
  • API keys

Monitoring & Alerts

Health Checks

All services must have health checks defined in docker-compose.yml:

healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:4000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
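
To find containers that still lack a health check, the status column of `docker ps` can be classified automatically. The sketch below relies on Docker appending "(healthy)", "(unhealthy)", or "(health: starting)" to the status text of containers that define a health check, and nothing for those that do not. The function name `classify_health` is illustrative.

```shell
#!/bin/sh
# Read "name<TAB>status" lines, as printed by
#   docker ps --format '{{.Names}}\t{{.Status}}'
# and print "name<TAB>health", where health is healthy, unhealthy, starting, or none.
classify_health() {
    awk -F '\t' '{
        health = "none"
        if ($2 ~ /\(unhealthy\)/)             health = "unhealthy"
        else if ($2 ~ /\(healthy\)/)          health = "healthy"
        else if ($2 ~ /\(health: starting\)/) health = "starting"
        printf "%s\t%s\n", $1, health
    }'
}
```

Piping `docker ps --format '{{.Names}}\t{{.Status}}'` into `classify_health` and filtering for `none` lists the containers that need a healthcheck entry added to docker-compose.yml.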

Services Needing Fixes

  • ❌ phoenix - LLM observability unhealthy
  • ❌ otel-collector - OpenTelemetry collection failing
  • ❌ langflow - AI workflow builder unhealthy


Next Review: 2026-03-15