Infrastructure Separation Plan — Oracle + NAS + Mac

Overview

| Tier | Host | Role | Resources |
|---|---|---|---|
| Production | Oracle Instance 1 (150.136.74.174) | k3s + core platform Docker | 3 OCPU, 18GB (target) |
| Tools/UI | Oracle Instance 2 (132.145.200.37) | LibreChat, n8n, tracer, grafana | 1 OCPU, 6GB (target) |
| Persistence | NAS (blueflynas.tailcf98b3.ts.net) | NFS volumes, backups, runner cache, wikis | Synology 224+ |
| Dev/Console | Mac M4 | IDE, Langflow (local app), builds, admin | Local |

What Runs Where

Oracle Instance 1 — Core Platform (k3s + Docker)

k3s cluster (kagent namespace):

  • kagent-controller, kagent-ui, kagent-tools, kagent-grafana-mcp, kagent-querydoc
  • 30+ kagent agents (OSSA + custom)
  • kube-prometheus-stack (monitoring namespace)

Docker services (core — must be low-latency):

  • postgres (primary DB)
  • redis (cache)
  • agent-protocol (MCP server, tunneled)
  • agent-mesh (routing)
  • agent-router (discovery)
  • dragonfly
  • compliance-engine
  • workflow-engine
  • a2a-collector, a2a-stream
  • content-guardian
  • intel-feed (miniflux)
  • cloudflared (tunnel — must stay)
  • duo-webhook stack
  • social-research-agent, whitepaper-writer-agent, content-reviewer-agent
  • qdrant (vector DB)
  • mongodb (LibreChat's database; planned to move to Instance 2 with LibreChat in Phase 3)

Oracle Instance 2 — Tools & UI

Move here (Phase 3):

  • LibreChat (chat UI, MCP client, agent framework) + mongodb + redis
  • n8n (workflow automation)
  • agent-tracer (observability)
  • grafana (Docker instance for platform dashboards)
  • phoenix: not moving; it was dev-only and was removed in Phase 1

NAS — Persistence Layer

NFS mounts from Oracle (Phase 2):

  • /mnt/nas-data/oracle/postgres — postgres data
  • /mnt/nas-data/oracle/redis — redis persistence
  • /mnt/nas-data/oracle/qdrant — vector storage
  • /mnt/nas-data/oracle/mongodb — MongoDB data
  • /mnt/nas-data/oracle/grafana — Grafana dashboards/config
  • /mnt/nas-data/oracle/loki — log retention
  • /mnt/nas-data/oracle/tempo — trace retention
  • /mnt/nas-data/oracle/logs — application logs
  • /mnt/nas-data/oracle/agents/state — agent state

Keep local on Oracle (NEVER NFS):

  • /var/lib/docker — container runtime
  • /var/lib/rancher/k3s — k3s state/etcd
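To confirm the "never NFS" rule still holds once the Phase 2 mounts are in place, a check along these lines could run on Instance 1. It is a minimal sketch, assuming a Linux host that lists mounts in /proc/mounts and treating `nfs` and `nfs4` as the only NFS filesystem types; the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Directories the plan says must stay on local disk (never NFS)
LOCAL_ONLY = ["/var/lib/docker", "/var/lib/rancher/k3s"]
NFS_TYPES = {"nfs", "nfs4"}


def parse_mounts(text: str) -> list[tuple[str, str]]:
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # /proc/mounts encodes spaces in paths as \040
            mounts.append((fields[1].replace("\\040", " "), fields[2]))
    return mounts


def backing_fs(path: str, mounts: list[tuple[str, str]]) -> str:
    """Return the fs type of the deepest mount point containing `path`."""
    target = PurePosixPath(path)
    best, best_type = None, "unknown"
    for point, fstype in mounts:
        mp = PurePosixPath(point)
        if mp == target or mp in target.parents:
            # ">=" lets a later mount on the same point shadow an earlier one
            if best is None or len(mp.parts) >= len(best.parts):
                best, best_type = mp, fstype
    return best_type


def nfs_violations(mounts_text: str) -> list[str]:
    """Return the local-only directories that are backed by NFS."""
    mounts = parse_mounts(mounts_text)
    return [p for p in LOCAL_ONLY if backing_fs(p, mounts) in NFS_TYPES]
```

On the host, `nfs_violations(open("/proc/mounts").read())` should return an empty list; a non-empty result names the directories that ended up on the NAS.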

Also on NAS (existing):

  • /Volumes/AgentPlatform/applications/Wikis/ — wiki clones
  • /Volumes/AgentPlatform/applications/__BARE_REPOS/ — bare git repos
  • /Volumes/AgentPlatform/config/ — platform config
  • /Volumes/AgentPlatform/services/ — development copies of services

Mac M4 — Dev Console

  • Langflow (local Mac app — removed from Oracle, saved 7.75GB)
  • IDE (Cursor, Claude Code, VS Code)
  • DDEV + Drupal dev
  • GitLab runner (optional, for local builds)
  • Admin operations (SSH, kubectl via Tailscale)

Chat UI Decision: LibreChat

Chosen: LibreChat over Open WebUI

| Feature | LibreChat | Open WebUI |
|---|---|---|
| Agent framework | First-class, built-in | Basic |
| MCP client | Native, documented | Via proxy |
| Multi-provider | OpenAI, Anthropic, Ollama, custom | OpenAI-compatible, Ollama |
| Extensibility | Agents + tools + presets + endpoints | Plugins |
| Self-hosted | Auth, RBAC, SSO | Auth, basic RBAC |

LibreChat already runs at https://chat.blueflyagents.com, branded "OpenStandardAgents". Open WebUI has been removed (saved 7GB of disk).

Services Removed (2026-02-27)

| Service | Image size | Reason |
|---|---|---|
| open-webui | 5.88GB | Replaced by LibreChat |
| langflow | 7.75GB | Using Mac local app instead |
| phoenix | 1.02GB | Dev-only, not needed on prod |

OCI Free Tier

Oracle Always Free: 4 A1 OCPUs + 24GB RAM total across all instances.

Target split:

  • Instance 1: 3 OCPU / 18GB (core platform)
  • Instance 2: 1 OCPU / 6GB (UI + tools)
  • Total: 4 OCPU / 24GB, exactly the free-tier limit (no headroom)

IMPORTANT: If both instances currently run at 4 OCPU + 24GB each (8 OCPU / 48GB total, double the allowance), resize them BEFORE the usage above the Always Free allowance is billed or flagged by Oracle.
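The split arithmetic can be checked mechanically before Phase 4, for example when planning a further resize. This is a minimal sketch using the 4 OCPU / 24GB totals quoted above; the `Shape` type and `fits_free_tier` function are illustrative names, not part of any OCI tooling.

```python
from typing import NamedTuple

# Ampere A1 Always Free totals as stated in this plan
FREE_OCPU = 4
FREE_RAM_GB = 24


class Shape(NamedTuple):
    name: str
    ocpu: int
    ram_gb: int


def fits_free_tier(shapes: list[Shape]) -> tuple[bool, int, int]:
    """Return (fits, total_ocpu, total_ram_gb) for a set of instances."""
    total_ocpu = sum(s.ocpu for s in shapes)
    total_ram = sum(s.ram_gb for s in shapes)
    return (total_ocpu <= FREE_OCPU and total_ram <= FREE_RAM_GB,
            total_ocpu, total_ram)


target = [Shape("instance-1", 3, 18), Shape("instance-2", 1, 6)]
current_worst_case = [Shape("instance-1", 4, 24), Shape("instance-2", 4, 24)]

print(fits_free_tier(target))              # (True, 4, 24)
print(fits_free_tier(current_worst_case))  # (False, 8, 48)
```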

Dead Pod Prevention

CronJob cleanup-dead-pods in kube-system runs every 15 minutes:

  • Deletes pods in the Succeeded or Failed phase across all namespaces (Evicted pods report phase Failed, so they are included)
  • Sequential deletion (xargs -L1) — NEVER parallel
  • ServiceAccount cleanup-sa with minimal RBAC
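The CronJob itself is described as kubectl piped through `xargs -L1`; the same selection and one-at-a-time deletion logic is sketched below in Python for clarity. It assumes pod JSON in the shape returned by `kubectl get pods -A -o json`; the function names are hypothetical, and `--ignore-not-found` is used so a pod that has already disappeared does not count as a failure.

```python
import json
import subprocess
from typing import Callable, Iterable

# Terminal pod phases; Evicted pods report phase "Failed" (reason "Evicted")
DEAD_PHASES = {"Succeeded", "Failed"}


def dead_pods(pod_list_json: str) -> list[tuple[str, str]]:
    """Return (namespace, name) for each pod in a terminal phase."""
    items = json.loads(pod_list_json).get("items", [])
    return [
        (p["metadata"]["namespace"], p["metadata"]["name"])
        for p in items
        if p.get("status", {}).get("phase") in DEAD_PHASES
    ]


def delete_sequentially(pods: Iterable[tuple[str, str]],
                        run: Callable[[list[str]], int]) -> int:
    """Delete pods one at a time, never in parallel; return the failure count."""
    failures = 0
    for namespace, name in pods:
        rc = run(["kubectl", "delete", "pod", name, "-n", namespace,
                  "--ignore-not-found"])
        if rc != 0:
            failures += 1
    return failures


def kubectl_runner(cmd: list[str]) -> int:
    """Run one kubectl command and block until it exits."""
    return subprocess.run(cmd, check=False).returncode
```

For example, feed it `subprocess.run(["kubectl", "get", "pods", "-A", "-o", "json"], capture_output=True, text=True, check=True).stdout` and call `delete_sequentially(dead_pods(out), kubectl_runner)`. Passing the runner in keeps the sequential loop testable without a cluster.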

Phase Plan

  1. Phase 1 (DONE): Removed open-webui, langflow and phoenix (disk usage 91% → 68%)
  2. Phase 2: NFS mount NAS volumes on Oracle, migrate Docker bind mounts
  3. Phase 3: Set up Instance 2, move LibreChat+n8n+tracer+grafana
  4. Phase 4: Resize both instances to fit free tier (3+1 OCPU, 18+6 GB)
  5. Phase 5: Automated NAS backups (nightly DB dumps + volume snapshots)
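For Phase 5, the nightly dump step could be planned as below. This is a hypothetical sketch: the container names `postgres` and `mongodb`, the `postgres` superuser, the /mnt/nas-data/oracle/backups directory and the 14-day retention are all assumptions rather than part of this plan, and volume snapshots are left to the NAS.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

BACKUP_ROOT = PurePosixPath("/mnt/nas-data/oracle/backups")  # assumed location
RETENTION_DAYS = 14                                          # assumed policy


def nightly_commands(day: date) -> list[tuple[list[str], PurePosixPath]]:
    """Return (command, output file) pairs; each command writes to stdout."""
    target = BACKUP_ROOT / day.isoformat()
    return [
        # pg_dumpall emits all databases plus roles as plain SQL
        (["docker", "exec", "postgres", "pg_dumpall", "-U", "postgres"],
         target / "postgres.sql"),
        # mongodump --archive with no filename streams the archive to stdout
        (["docker", "exec", "mongodb", "mongodump", "--archive", "--gzip"],
         target / "mongodb.archive.gz"),
    ]


def expired(dirnames: list[str], today: date) -> list[str]:
    """Return dated (YYYY-MM-DD) backup directories older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    old = []
    for name in dirnames:
        try:
            day = date.fromisoformat(name)
        except ValueError:
            continue  # skip anything that is not a dated directory
        if day < cutoff:
            old.append(name)
    return sorted(old)
```

A runner would create the dated directory, pass an open file on the NAS as `stdout` to `subprocess.run(cmd, stdout=f, check=True)` for each pair, and then delete the directories `expired()` returns.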

Tunnel Routes Update

After migration, update cloudflared config:

  • chat.blueflyagents.com → Instance 2 (LibreChat)
  • n8n.blueflyagents.com → Instance 2
  • grafana.blueflyagents.com → Instance 2
  • All others remain on Instance 1
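The route change could be scripted instead of hand-edited. This sketch assumes cloudflared's config.yml uses the usual `ingress` list of `hostname`/`service` rules ending in a catch-all rule without a hostname. Using Instance 2's public IP from the overview, and the upstream default ports (LibreChat 3080, n8n 5678, Grafana 3000), are assumptions about how the services will be exposed; the function names are hypothetical.

```python
INSTANCE2 = "132.145.200.37"  # assumed; a Tailscale hostname may be preferable

# Hostnames that move in Phase 3; ports are the upstream defaults (assumed)
MOVED = {
    "chat.blueflyagents.com": f"http://{INSTANCE2}:3080",     # LibreChat
    "n8n.blueflyagents.com": f"http://{INSTANCE2}:5678",      # n8n
    "grafana.blueflyagents.com": f"http://{INSTANCE2}:3000",  # Grafana
}


def update_ingress(rules: list[dict]) -> list[dict]:
    """Repoint moved hostnames, keep all other rules, end with the catch-all."""
    updated, seen = [], set()
    catch_all = {"service": "http_status:404"}  # used if none exists yet
    for rule in rules:
        host = rule.get("hostname")
        if host is None:
            catch_all = rule  # cloudflared requires it to be the last rule
            continue
        if host in MOVED:
            rule = {**rule, "service": MOVED[host]}
            seen.add(host)
        updated.append(rule)
    # Add rules for moved hostnames that had no entry before
    updated += [{"hostname": h, "service": s}
                for h, s in MOVED.items() if h not in seen]
    updated.append(catch_all)
    return updated


def to_yaml(rules: list[dict]) -> str:
    """Render flat string-valued rules as a config.yml `ingress:` block.

    Rules with nested settings (e.g. originRequest) need a real YAML library.
    """
    lines = ["ingress:"]
    for rule in rules:
        for i, (key, value) in enumerate(rule.items()):
            lines.append(f"{'  - ' if i == 0 else '    '}{key}: {value}")
    return "\n".join(lines)
```

Keeping every non-moved rule in its original order matches "All others remain on Instance 1"; only the three Phase 3 hostnames change their `service` target.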