Infrastructure Separation Plan — Oracle + NAS + Mac
Overview
| Tier | Host | Role | Resources |
|---|---|---|---|
| Production | Oracle Instance 1 (150.136.74.174) | k3s + core platform Docker | 3 OCPU, 18GB (target) |
| Tools/UI | Oracle Instance 2 (132.145.200.37) | LibreChat, n8n, tracer, grafana | 1 OCPU, 6GB (target) |
| Persistence | NAS (blueflynas.tailcf98b3.ts.net) | NFS volumes, backups, runner cache, wikis | Synology 224+ |
| Dev/Console | Mac M4 | IDE, Langflow (local app), builds, admin | Local |
What Runs Where
Oracle Instance 1 — Core Platform (k3s + Docker)
k3s cluster (kagent namespace):
- kagent-controller, kagent-ui, kagent-tools, kagent-grafana-mcp, kagent-querydoc
- 30+ kagent agents (OSSA + custom)
- kube-prometheus-stack (monitoring namespace)
Docker services (core — must be low-latency):
- postgres (primary DB)
- redis (cache)
- agent-protocol (MCP server, tunneled)
- agent-mesh (routing)
- agent-router (discovery)
- dragonfly
- compliance-engine
- workflow-engine
- a2a-collector, a2a-stream
- content-guardian
- intel-feed (miniflux)
- cloudflared (tunnel — must stay)
- duo-webhook stack
- social-research-agent, whitepaper-writer-agent, content-reviewer-agent
- qdrant (vector DB)
- mongodb (for LibreChat on Instance 2 — may move with it)
Oracle Instance 2 — Tools & UI
Move here (Phase 3):
- LibreChat (chat UI, MCP client, agent framework) + mongodb + redis
- n8n (workflow automation)
- agent-tracer (observability)
- grafana (Docker instance for platform dashboards)
- phoenix → REMOVED (was dev-only)
NAS — Persistence Layer
NFS mounts from Oracle (Phase 2):
- /mnt/nas-data/oracle/postgres: postgres data
- /mnt/nas-data/oracle/redis: redis persistence
- /mnt/nas-data/oracle/qdrant: vector storage
- /mnt/nas-data/oracle/mongodb: MongoDB data
- /mnt/nas-data/oracle/grafana: Grafana dashboards/config
- /mnt/nas-data/oracle/loki: log retention
- /mnt/nas-data/oracle/tempo: trace retention
- /mnt/nas-data/oracle/logs: application logs
- /mnt/nas-data/oracle/agents/state: agent state
Keep local on Oracle (NEVER NFS):
- /var/lib/docker: container runtime
- /var/lib/rancher/k3s: k3s state/etcd
Also on NAS (existing):
- /Volumes/AgentPlatform/applications/Wikis/: wiki clones
- /Volumes/AgentPlatform/applications/__BARE_REPOS/: bare git repos
- /Volumes/AgentPlatform/config/: platform config
- /Volumes/AgentPlatform/services/: development copies of services
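The split between NAS-backed paths and paths that must never sit on NFS can be checked mechanically before Phase 2 changes any bind mounts. The following is a minimal sketch, not part of the existing tooling: the function names `placement` and `nfs_violations` are hypothetical, and it only compares path prefixes taken from the lists above.

```python
from pathlib import PurePosixPath

# Paths the plan says must stay on local disk (never NFS).
LOCAL_ONLY = [PurePosixPath("/var/lib/docker"), PurePosixPath("/var/lib/rancher/k3s")]

# Root of the NAS-backed mounts listed in the plan.
NAS_ROOT = PurePosixPath("/mnt/nas-data/oracle")


def _is_under(path: PurePosixPath, parent: PurePosixPath) -> bool:
    """True if path equals parent or sits somewhere below it."""
    return path == parent or parent in path.parents


def placement(path_str: str) -> str:
    """Classify a host path as 'local-only', 'nas', or 'unclassified'."""
    path = PurePosixPath(path_str)
    if any(_is_under(path, local) for local in LOCAL_ONLY):
        return "local-only"
    if _is_under(path, NAS_ROOT):
        return "nas"
    return "unclassified"


def nfs_violations(planned_nfs_paths: list[str]) -> list[str]:
    """Return the paths proposed for NFS that the plan requires to stay local."""
    return [p for p in planned_nfs_paths if placement(p) == "local-only"]
```

Calling `nfs_violations(["/mnt/nas-data/oracle/postgres", "/var/lib/rancher/k3s/server"])` would flag only the second path, since it falls under the k3s state directory.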
Mac M4 — Dev Console
- Langflow (local Mac app — removed from Oracle, saved 7.75GB)
- IDE (Cursor, Claude Code, VS Code)
- DDEV + Drupal dev
- GitLab runner (optional, for local builds)
- Admin operations (SSH, kubectl via Tailscale)
Chat UI Decision: LibreChat
Chosen: LibreChat over Open WebUI
| Feature | LibreChat | Open WebUI |
|---|---|---|
| Agent framework | First-class, built-in | Basic |
| MCP client | Native, documented | Via proxy |
| Multi-provider | OpenAI, Anthropic, Ollama, custom | OpenAI-compatible, Ollama |
| Extensibility | Agents + tools + presets + endpoints | Plugins |
| Self-hosted | Auth, RBAC, SSO | Auth, basic RBAC |
LibreChat already runs at https://chat.blueflyagents.com branded "OpenStandardAgents".
Open WebUI removed (saved 7GB disk).
Services Removed (2026-02-27)
| Service | Image Size | Reason |
|---|---|---|
| open-webui | 5.88GB | Replaced by LibreChat |
| langflow | 7.75GB | Using Mac local app instead |
| phoenix | 1.02GB | Dev-only, not needed on prod |
OCI Free Tier
Oracle Always Free: 4 A1 OCPUs + 24GB RAM total across all instances.
Target split:
- Instance 1: 3 OCPU / 18GB (core platform)
- Instance 2: 1 OCPU / 6GB (UI + tools)
- Total: 4 OCPU / 24GB = within free tier
IMPORTANT: If both instances currently run at 4 OCPU / 24GB each (8 OCPU, 48GB total), that is double the Always Free allowance. Resize BEFORE Oracle flags the overage.
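The free-tier arithmetic above is simple enough to encode as a check that can run before any resize. This is an illustrative sketch only: the `Shape` class and `overage` function are hypothetical names, and the limits are the 4 OCPU / 24GB figures stated in this section.

```python
from dataclasses import dataclass

# Always Free allowance as stated above: 4 A1 OCPUs and 24 GB RAM in total.
FREE_OCPU = 4
FREE_RAM_GB = 24


@dataclass(frozen=True)
class Shape:
    name: str
    ocpu: int
    ram_gb: int


def overage(shapes):
    """Return (extra_ocpu, extra_ram_gb) above the allowance; zeros mean within budget."""
    total_ocpu = sum(s.ocpu for s in shapes)
    total_ram = sum(s.ram_gb for s in shapes)
    return max(0, total_ocpu - FREE_OCPU), max(0, total_ram - FREE_RAM_GB)


# Target split from the plan, and the worst case the warning describes.
target = [Shape("instance-1", 3, 18), Shape("instance-2", 1, 6)]
worst_case = [Shape("instance-1", 4, 24), Shape("instance-2", 4, 24)]

print(overage(target))      # (0, 0)
print(overage(worst_case))  # (4, 24)
```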
Dead Pod Prevention
CronJob cleanup-dead-pods in kube-system runs every 15 minutes:
- Deletes Succeeded, Failed, and Evicted pods across all namespaces
- Sequential deletion (xargs -L1) — NEVER parallel
- ServiceAccount cleanup-sa with minimal RBAC
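The selection and one-at-a-time deletion described above can be sketched in Python for illustration. This is not the CronJob's actual script (which the plan describes as using xargs -L1); it assumes the input is the parsed output of `kubectl get pods -A -o json`, and the function names are hypothetical. Evicted pods normally report phase Failed with reason Evicted, so the reason check is a belt-and-braces addition.

```python
import subprocess

# Pod phases the cleanup treats as dead.
DEAD_PHASES = {"Succeeded", "Failed"}


def dead_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (namespace, name) for every Succeeded, Failed, or Evicted pod."""
    found = []
    for item in pod_list.get("items", []):
        status = item.get("status", {})
        if status.get("phase") in DEAD_PHASES or status.get("reason") == "Evicted":
            meta = item["metadata"]
            found.append((meta["namespace"], meta["name"]))
    return found


def delete_sequentially(pods, run=subprocess.run):
    """Delete pods one at a time; each command finishes before the next starts."""
    commands = []
    for namespace, name in pods:
        cmd = ["kubectl", "delete", "pod", name, "-n", namespace]
        run(cmd, check=False)  # blocking call, so deletions never overlap
        commands.append(cmd)
    return commands
```

The `run` parameter exists so the loop can be exercised with a stub instead of a live cluster; by default it calls `subprocess.run`, which blocks until each `kubectl delete` returns.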
Phase Plan
- Phase 1 (DONE): Remove open-webui, langflow, phoenix (disk usage 91% → 68%)
- Phase 2: NFS mount NAS volumes on Oracle, migrate Docker bind mounts
- Phase 3: Set up Instance 2, move LibreChat+n8n+tracer+grafana
- Phase 4: Resize both instances to fit free tier (3+1 OCPU, 18+6 GB)
- Phase 5: Automated NAS backups (nightly DB dumps + volume snapshots)
Tunnel Routes Update
After migration, update cloudflared config:
- chat.blueflyagents.com → Instance 2 (LibreChat)
- n8n.blueflyagents.com → Instance 2
- grafana.blueflyagents.com → Instance 2
- All others remain on Instance 1
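The routing rule above reduces to a small lookup: three hostnames move to Instance 2 and everything else stays put. The sketch below is one way to compute which routes need editing; it does not read or write the real cloudflared config, the function names are hypothetical, and `other.blueflyagents.com` is a placeholder hostname used only for the example.

```python
# Hostnames the plan moves to Instance 2 after migration.
MOVED_TO_INSTANCE_2 = {
    "chat.blueflyagents.com",
    "n8n.blueflyagents.com",
    "grafana.blueflyagents.com",
}


def target_for(hostname: str) -> str:
    """Return the instance a hostname should route to after migration."""
    return "instance-2" if hostname.lower() in MOVED_TO_INSTANCE_2 else "instance-1"


def routes_to_update(current: dict[str, str]) -> dict[str, str]:
    """Given hostname -> current target, return only the routes whose target changes."""
    return {
        host: target_for(host)
        for host, target in current.items()
        if target != target_for(host)
    }


# Example: before migration every route points at Instance 1.
before = {
    "chat.blueflyagents.com": "instance-1",
    "n8n.blueflyagents.com": "instance-1",
    "other.blueflyagents.com": "instance-1",  # placeholder, not a real route
}
print(routes_to_update(before))
```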