
Kubernetes-Native Agent Orchestration: Custom Resources, Operators, and Cloud-Native Patterns for AI Agent Deployment

BlueFly.io / OSSA Research Team


BlueFly.io Agent Platform -- Whitepaper Series #4
Version: 1.0
Date: February 2026
Classification: Technical Reference
Audience: Platform Engineers, SREs, AI Infrastructure Architects


Abstract

The convergence of autonomous AI agents and cloud-native infrastructure presents both an unprecedented opportunity and a formidable engineering challenge. As organizations scale from single-agent prototypes to fleets of hundreds of cooperating agents, the operational complexity of provisioning, scaling, monitoring, and securing these workloads demands a principled orchestration layer. Kubernetes, with its declarative state model, extensible API machinery, and mature ecosystem, provides a natural substrate for this orchestration.

This whitepaper presents a complete architecture for Kubernetes-native AI agent orchestration. We define Custom Resource Definitions (CRDs) that encode agent specifications, pool topologies, and workflow graphs as first-class Kubernetes objects. We detail an Operator pattern that implements a full reconciliation loop with state-machine semantics, leader election for high availability, and graceful degradation under failure. We address the unique scaling requirements of AI workloads through Horizontal Pod Autoscalers driven by custom metrics (tokens per second, queue depth, active tasks), Vertical Pod Autoscalers for right-sizing GPU memory allocations, and KEDA-based event-driven scaling for bursty inference workloads. Networking patterns encompass service mesh integration with mutual TLS, gRPC load balancing, and network policy isolation. Storage strategies cover persistent vector databases via StatefulSets, ephemeral scratch volumes for model weights, and CSI driver selection for throughput-sensitive workloads. Observability is treated as a first-class concern through OpenTelemetry instrumentation, Prometheus metrics pipelines, Grafana dashboards, and structured logging with Loki. We extend the architecture to multi-cluster federation for geographic distribution and regulatory compliance, and harden the deployment with Pod Security Standards, gVisor sandboxing, OPA policy enforcement, and least-privilege RBAC. A reference architecture for a 50-agent production deployment provides concrete manifests, cost models, and capacity planning formulas. Throughout, we ground our recommendations in production experience operating agent fleets at scale and in the broader cloud-native community's best practices as codified by the CNCF.

The architecture described herein is aligned with the Open Standard for Sustainable Agents (OSSA) v0.3.3 specification and the BlueFly.io Agent Platform's separation-of-duties model. All Kubernetes manifests, CRD schemas, and operator pseudocode are provided as actionable reference implementations.


1. Why Kubernetes for AI Agents

1.1 The Operational Gap

The AI agent landscape has evolved rapidly from research prototypes to production systems that must meet enterprise reliability standards. An agent that performs well in a notebook or a single-process deployment quickly encounters operational challenges when deployed at scale: how do you restart it when it crashes? How do you scale it when load increases? How do you roll out a new model version without downtime? How do you enforce resource limits so that a runaway agent does not consume an entire cluster's GPU allocation?

These are precisely the problems that container orchestration platforms were designed to solve. Kubernetes, as the dominant container orchestration platform with 96% of organizations either using or evaluating it according to the CNCF Annual Survey 2025, provides a battle-tested foundation for addressing these operational concerns.

1.2 Declarative Desired State and Agent Specifications

The fundamental insight that makes Kubernetes suitable for agent orchestration is its declarative model. Rather than writing imperative scripts that specify how to deploy an agent (start process A, then configure network B, then attach volume C), operators declare what they want (an agent with these capabilities, this model, these resource limits, and this scaling policy) and the Kubernetes control plane continuously reconciles the actual state of the world with the desired state.

This model maps naturally to agent specifications. An OSSA agent manifest already declares the agent's identity, capabilities, access tier, and resource requirements in a declarative format. A Kubernetes CRD extends this with operational semantics: replica count, health check endpoints, scaling triggers, affinity rules, and upgrade strategies. The resulting object is simultaneously a complete description of what the agent is (its functional specification) and how it should be operated (its operational specification).

Declarative Alignment:

OSSA Manifest              Kubernetes CRD
--------------              --------------
agent.name          --->    metadata.name
agent.capabilities  --->    spec.capabilities[]
agent.model         --->    spec.runtime.model
agent.tier          --->    spec.security.accessTier
(not specified)     --->    spec.scaling (HPA policy)
(not specified)     --->    spec.resources (CPU/GPU/memory)
(not specified)     --->    spec.networking (service mesh)
(not specified)     --->    spec.observability (metrics/tracing)

Figure 1: Declarative alignment between OSSA agent manifests and Kubernetes CRDs.

1.3 Architecture Tiers and Cost Models

Not every organization requires a full multi-cluster federation with GPU scheduling and service mesh. We define three architecture tiers that allow organizations to adopt Kubernetes-native agent orchestration incrementally.

Table 1: Architecture Tiers

Tier     Monthly Cost    Nodes     Agents    GPU            Features
Small    $100-500        1-3       1-10      None/shared    CRDs, basic operator, HPA, Prometheus
Medium   $1,000-5,000    5-15      10-50     1-4 dedicated  Full operator, KEDA, Istio, VPA, OPA
Large    $10,000+        20-100+   50-500+   8+ dedicated   Multi-cluster, federation, custom schedulers

The small tier is achievable on managed Kubernetes services (EKS, GKE, AKS) with a single node pool and provides the foundational CRD and operator patterns. The medium tier adds GPU scheduling, event-driven scaling, and service mesh security. The large tier extends to multi-cluster federation, custom scheduling algorithms, and dedicated GPU pools with preemption policies.

The cost formula for capacity planning at any tier is:

Monthly_Cost = (N_cpu_nodes * C_cpu) + (N_gpu_nodes * C_gpu) + (S_pv * C_storage) + (E_gb * C_egress) + C_managed

Where:
  N_cpu_nodes = number of CPU-only nodes
  C_cpu       = cost per CPU node per month ($50-200 for cloud instances)
  N_gpu_nodes = number of GPU nodes
  C_gpu       = cost per GPU node per month ($500-3000 depending on GPU class)
  S_pv        = total persistent volume storage in GB
  C_storage   = cost per GB per month ($0.10-0.30 for SSD, $0.04-0.08 for HDD)
  E_gb        = monthly egress in GB
  C_egress    = cost per GB egress ($0.08-0.12 for major cloud providers)
  C_managed   = managed K8s control plane cost ($0-74/mo depending on provider)
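As a worked example, the formula can be wrapped in a short Python helper. The instance counts and unit prices below are illustrative mid-range assumptions drawn from the ranges above, not quotes from any provider:

```python
# Worked example of the capacity-planning cost formula for a "medium" tier
# deployment. All unit prices are illustrative mid-range assumptions.

def monthly_cost(n_cpu_nodes, c_cpu, n_gpu_nodes, c_gpu,
                 s_pv_gb, c_storage, e_gb, c_egress, c_managed):
    """Monthly_Cost as defined in the formula above."""
    return (n_cpu_nodes * c_cpu) + (n_gpu_nodes * c_gpu) \
         + (s_pv_gb * c_storage) + (e_gb * c_egress) + c_managed

# 8 CPU nodes, 2 GPU nodes, 500 GB SSD, 1 TB egress, managed control plane
cost = monthly_cost(8, 120, 2, 1500, 500, 0.20, 1000, 0.10, 74)
print(f"${cost:,.2f}/month")  # $4,234.00/month
```

The result lands inside the medium tier's $1,000-5,000 range from Table 1, as expected for a 10-node cluster with two GPU nodes.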

1.4 Why Not Alternatives?

Before committing to Kubernetes, it is worth considering the alternatives.

Docker Compose: Suitable for single-machine deployments but lacks scheduling, scaling, self-healing, and multi-node support. Cannot handle GPU scheduling or affinity rules.

Nomad: A capable orchestrator with simpler operational characteristics than Kubernetes, but a significantly smaller ecosystem. The lack of CRD-equivalent extensibility means agent specifications must be encoded as job metadata rather than first-class API objects.

Serverless (Lambda/Cloud Run): Attractive for stateless, short-lived inference workloads but fundamentally misaligned with long-running, stateful agents that maintain conversation context, vector store connections, and tool registrations. Cold start latencies of 1-10 seconds are unacceptable for real-time agent interactions.

Custom Orchestration: Building a bespoke orchestration layer is always an option, but it means reimplementing scheduling, health checking, scaling, networking, storage, and observability from scratch. The engineering cost is prohibitive for all but the largest organizations.

Kubernetes occupies the sweet spot: a mature, extensible platform with a vast ecosystem of integrations, supported by every major cloud provider, and designed from the ground up for the kind of declarative, self-healing infrastructure that agent orchestration demands.


2. Custom Resource Definitions

2.1 The Agent CRD

The Agent CRD is the foundational building block of the orchestration layer. It extends the Kubernetes API with a new resource type that encodes everything the operator needs to know about an agent: its runtime configuration, resource requirements, scaling policy, security posture, and observability settings.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: agents.ossa.ai
  annotations:
    api-approved.kubernetes.io: "https://ossa.ai/api-review/agents"
spec:
  group: ossa.ai
  versions:
  - name: v1
    served: true
    storage: true
    subresources:
      status: {}
      scale:
        specReplicasPath: .spec.scaling.minReplicas
        statusReplicasPath: .status.replicas
    additionalPrinterColumns:
    - name: State
      type: string
      jsonPath: .status.state
    - name: Replicas
      type: integer
      jsonPath: .status.replicas
    - name: Model
      type: string
      jsonPath: .spec.runtime.model
    - name: Age
      type: date
      jsonPath: .metadata.creationTimestamp
    schema:
      openAPIV3Schema:
        type: object
        required: [spec]
        properties:
          spec:
            type: object
            required: [runtime, capabilities]
            properties:
              runtime:
                type: object
                required: [model, image]
                properties:
                  model:
                    type: string
                    description: "LLM model identifier"
                    pattern: "^[a-z0-9-]+/[a-z0-9._-]+:[a-z0-9._-]+$"
                  image:
                    type: string
                    description: "Container image for the agent runtime"
                  command:
                    type: array
                    items:
                      type: string
                  env:
                    type: array
                    items:
                      type: object
                      properties:
                        name:
                          type: string
                        value:
                          type: string
                        valueFrom:
                          type: object
                          properties:
                            secretKeyRef:
                              type: object
                              properties:
                                name:
                                  type: string
                                key:
                                  type: string
                  providerEndpoint:
                    type: string
                    format: uri
                  maxTokensPerRequest:
                    type: integer
                    minimum: 1
                    maximum: 200000
                    default: 4096
                  temperature:
                    type: number
                    minimum: 0.0
                    maximum: 2.0
                    default: 0.7
              capabilities:
                type: array
                minItems: 1
                items:
                  type: object
                  required: [name, type]
                  properties:
                    name:
                      type: string
                    type:
                      type: string
                      enum: [tool, skill, protocol, sensor]
                    version:
                      type: string
                    config:
                      type: object
                      x-kubernetes-preserve-unknown-fields: true
              resources:
                type: object
                properties:
                  requests:
                    type: object
                    properties:
                      cpu:
                        type: string
                        pattern: "^[0-9]+m?$"
                      memory:
                        type: string
                        pattern: "^[0-9]+(Mi|Gi)$"
                      nvidia.com/gpu:
                        type: integer
                        minimum: 0
                  limits:
                    type: object
                    properties:
                      cpu:
                        type: string
                      memory:
                        type: string
                      nvidia.com/gpu:
                        type: integer
              scaling:
                type: object
                properties:
                  minReplicas:
                    type: integer
                    minimum: 0
                    default: 1
                  maxReplicas:
                    type: integer
                    minimum: 1
                    default: 10
                  metrics:
                    type: array
                    items:
                      type: object
                      properties:
                        type:
                          type: string
                          enum: [cpu, memory, custom, external]
                        name:
                          type: string
                        target:
                          type: object
                          properties:
                            type:
                              type: string
                              enum: [Utilization, AverageValue, Value]
                            averageValue:
                              type: string
                            averageUtilization:
                              type: integer
                  scaleDownStabilization:
                    type: integer
                    default: 300
                    description: "Seconds to wait before scaling down"
              security:
                type: object
                properties:
                  accessTier:
                    type: string
                    enum: [tier_1_read, tier_2_write_limited, tier_3_full_access, tier_4_policy]
                    default: tier_1_read
                  runAsNonRoot:
                    type: boolean
                    default: true
                  readOnlyRootFilesystem:
                    type: boolean
                    default: true
                  runtimeClass:
                    type: string
                    description: "RuntimeClass name (e.g., gvisor for sandboxing)"
                  networkPolicy:
                    type: object
                    properties:
                      allowEgress:
                        type: array
                        items:
                          type: object
                          properties:
                            host:
                              type: string
                            port:
                              type: integer
                      denyIngress:
                        type: boolean
                        default: false
              observability:
                type: object
                properties:
                  metricsPort:
                    type: integer
                    default: 9090
                  metricsPath:
                    type: string
                    default: "/metrics"
                  tracingEnabled:
                    type: boolean
                    default: true
                  logLevel:
                    type: string
                    enum: [debug, info, warn, error]
                    default: info
          status:
            type: object
            properties:
              state:
                type: string
                enum: [Pending, Initializing, Running, Degraded, Terminating, Failed]
              replicas:
                type: integer
              readyReplicas:
                type: integer
              lastTransitionTime:
                type: string
                format: date-time
              conditions:
                type: array
                items:
                  type: object
                  properties:
                    type:
                      type: string
                    status:
                      type: string
                      enum: ["True", "False", "Unknown"]
                    reason:
                      type: string
                    message:
                      type: string
                    lastTransitionTime:
                      type: string
                      format: date-time
              metrics:
                type: object
                properties:
                  tokensPerSecond:
                    type: number
                  activeTaskCount:
                    type: integer
                  averageLatencyMs:
                    type: number
                  errorRate:
                    type: number
  scope: Namespaced
  names:
    plural: agents
    singular: agent
    kind: Agent
    shortNames:
    - ag
    categories:
    - ossa
    - ai

2.2 The AgentPool CRD

While individual Agent resources describe single agent types, the AgentPool CRD manages a logical group of agents that share infrastructure resources and scaling policies. An AgentPool defines node affinity, GPU allocation strategies, and pool-level resource quotas.

apiVersion: ossa.ai/v1
kind: AgentPool
metadata:
  name: inference-pool
  namespace: agent-system
spec:
  nodeSelector:
    accelerator: nvidia-a100
    topology.kubernetes.io/zone: us-east-1a
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  resourceQuota:
    requests.cpu: "64"
    requests.memory: "256Gi"
    requests.nvidia.com/gpu: "8"
    limits.cpu: "128"
    limits.memory: "512Gi"
    limits.nvidia.com/gpu: "8"
  agents:
  - name: code-reviewer
    replicas: 3
  - name: security-scanner
    replicas: 2
  - name: test-generator
    replicas: 5
  scheduling:
    strategy: BinPacking
    preemptionPolicy: PreemptLowerPriority
    priorityClassName: high-priority-agents

2.3 The AgentWorkflow CRD

The AgentWorkflow CRD encodes multi-agent workflows as directed acyclic graphs (DAGs) with typed edges representing data flow between agents.

apiVersion: ossa.ai/v1
kind: AgentWorkflow
metadata:
  name: code-review-pipeline
  namespace: agent-system
spec:
  entrypoint: analyze
  timeout: 600
  retryPolicy:
    maxRetries: 3
    backoff: exponential
  steps:
  - name: analyze
    agentRef: code-analyzer
    inputs:
    - name: repository
      type: git-url
    outputs:
    - name: analysis-report
      type: json
    next:
    - review
    - security-scan
  - name: review
    agentRef: code-reviewer
    inputs:
    - name: analysis-report
      fromStep: analyze
    outputs:
    - name: review-comments
      type: json
    next:
    - aggregate
  - name: security-scan
    agentRef: security-scanner
    inputs:
    - name: analysis-report
      fromStep: analyze
    outputs:
    - name: security-findings
      type: json
    next:
    - aggregate
  - name: aggregate
    agentRef: report-aggregator
    inputs:
    - name: review-comments
      fromStep: review
    - name: security-findings
      fromStep: security-scan
    outputs:
    - name: final-report
      type: json
  onFailure:
    step: notify-team
    agentRef: notification-agent

2.4 etcd Storage Considerations

Every CRD instance is stored in etcd as a key-value pair. The storage footprint per agent resource can be estimated as:

storage_per_agent = base_overhead + spec_size + status_size

Where:
  base_overhead  = ~1.5 KB (key prefix, metadata, timestamps, resourceVersion)
  spec_size      = 0.5 - 3.0 KB (depending on capabilities list and env vars)
  status_size    = 0.2 - 1.0 KB (conditions array, metrics snapshot)

Total per agent: ~2 - 5.5 KB (typically 2-4 KB)

For a deployment of 500 agents at an average total stored size of 3.5 KB per resource (base overhead plus a typical spec and status), the total etcd storage is approximately 1.75 MB, well within etcd's recommended maximum database size of 8 GB. The watch event rate is the more significant factor: every status update generates a watch event that all operator replicas must process. At 500 agents updating status every 30 seconds, this produces approximately 17 events per second, comfortably within etcd's operating range of several thousand events per second.
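These estimates follow directly from the sizing formula above; a quick Python check (the per-agent size is an assumed typical value from the formula's stated range, and 1 MB = 1000 KB is used for a round figure):

```python
# Back-of-envelope etcd sizing and watch-event rate, per the formula above.
# kb_per_agent is an assumed typical value (base + spec + status).
def etcd_footprint_mb(n_agents, kb_per_agent=3.5):
    return n_agents * kb_per_agent / 1000  # rough MB, using 1 MB = 1000 KB

def watch_events_per_sec(n_agents, status_interval_s=30):
    return n_agents / status_interval_s

print(etcd_footprint_mb(500))                # 1.75
print(round(watch_events_per_sec(500), 1))   # 16.7
```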

CRD Lifecycle Data Flow:

  User/CI              API Server           etcd          Operator          Kubelet
    |                      |                  |               |                |
    |--- apply Agent CR -->|                  |               |                |
    |                      |--- store ------->|               |                |
    |                      |--- watch event ----------------->|                |
    |                      |                  |               |--- reconcile   |
    |                      |                  |               |    observe     |
    |                      |                  |               |    diff        |
    |                      |                  |               |    act ------->|
    |                      |                  |               |                |--- create pod
    |                      |                  |               |                |--- pull image
    |                      |                  |               |                |--- start container
    |                      |<-- status update ---------------|                |
    |                      |--- store ------->|               |                |
    |<-- event notification|                  |               |                |

Figure 2: CRD lifecycle data flow from creation through reconciliation to pod scheduling.

2.5 Versioning and Migration

CRD versioning follows the Kubernetes API versioning convention. When the schema evolves (for example, adding a new field to the agent specification), a new API version is introduced (e.g., ossa.ai/v1beta2 to ossa.ai/v1). Webhook-based conversion ensures that existing resources are automatically migrated to the new schema without downtime. The operator maintains backward compatibility by supporting reads from all served versions and writes to the storage version.


3. Agent Operator Pattern

3.1 Operator Architecture

The Agent Operator is a Kubernetes controller that watches for changes to Agent, AgentPool, and AgentWorkflow resources and reconciles the cluster state to match the declared specifications. It is built using the Operator SDK framework, which provides scaffolding for controller registration, leader election, metrics exposition, and webhook configuration.

The operator runs as a Deployment with multiple replicas for high availability, but only one replica (the leader) actively processes reconciliation events at any given time. The remaining replicas stand by as hot standbys, ready to assume leadership within seconds if the leader fails.

Operator Architecture:

+------------------------------------------------------------------+
|                     Agent Operator Deployment                     |
|                                                                   |
|  +-------------------+  +-------------------+  +---------------+ |
|  |  Replica 1        |  |  Replica 2        |  |  Replica 3    | |
|  |  (LEADER)         |  |  (STANDBY)        |  |  (STANDBY)    | |
|  |                   |  |                   |  |               | |
|  |  +-------------+  |  |  +-------------+  |  | +----------+  | |
|  |  | Reconciler  |  |  |  | Reconciler  |  |  | |Reconciler|  | |
|  |  | Loop        |  |  |  | (paused)    |  |  | |(paused)  |  | |
|  |  +------+------+  |  |  +-------------+  |  | +----------+  | |
|  |         |         |  |                   |  |               | |
|  |  +------v------+  |  |                   |  |               | |
|  |  | State       |  |  |                   |  |               | |
|  |  | Machine     |  |  |                   |  |               | |
|  |  +------+------+  |  |                   |  |               | |
|  |         |         |  |                   |  |               | |
|  |  +------v------+  |  |                   |  |               | |
|  |  | K8s Client  |  |  |                   |  |               | |
|  |  +-------------+  |  |                   |  |               | |
|  +-------------------+  +-------------------+  +---------------+ |
|                                                                   |
|  +-------------------------------------------------------------+ |
|  |                    Leader Election (Lease)                   | |
|  +-------------------------------------------------------------+ |
+------------------------------------------------------------------+
         |                          |                      |
         v                          v                      v
+------------------+  +------------------+  +------------------+
|  Agent CRs       |  |  AgentPool CRs   |  |  AgentWorkflow   |
|  (watch)         |  |  (watch)         |  |  CRs (watch)     |
+------------------+  +------------------+  +------------------+

Figure 3: Operator architecture with leader election and multi-replica standby.

3.2 Reconciliation Loop

The reconciliation loop is the heart of the operator. It follows the standard observe-diff-act pattern, but with agent-specific logic for model loading, capability registration, and health assessment.

Reconciliation Pseudocode:

function reconcile(agent: Agent) -> Result {
    // OBSERVE: Gather current state
    currentPods = listPods(labelSelector: agent.metadata.name)
    currentService = getService(agent.metadata.name)
    currentHPA = getHPA(agent.metadata.name)
    currentNetworkPolicy = getNetworkPolicy(agent.metadata.name)

    // DIFF: Compare desired vs actual
    desiredReplicas = agent.spec.scaling.minReplicas
    actualReplicas = len(currentPods.filter(phase == Running))

    desiredImage = agent.spec.runtime.image
    actualImages = currentPods.map(p => p.spec.containers[0].image).unique()

    desiredModel = agent.spec.runtime.model
    actualModelStatus = currentPods.map(p => p.annotations["ossa.ai/model-loaded"])

    // ACT: Apply changes based on diff

    // Phase 1: Ensure base resources exist
    if currentService == null {
        createService(agent)
        updateStatus(agent, state: "Initializing", reason: "CreatingService")
        return requeue(after: 5s)
    }

    if currentNetworkPolicy == null && agent.spec.security.networkPolicy != null {
        createNetworkPolicy(agent)
    }

    // Phase 2: Pod management
    if actualReplicas < desiredReplicas {
        // Scale up: create pods with anti-affinity for spread
        deficit = desiredReplicas - actualReplicas
        for i in range(deficit) {
            pod = buildAgentPod(agent, ordinal: actualReplicas + i)
            applySecurityContext(pod, agent.spec.security)
            applyResourceLimits(pod, agent.spec.resources)
            injectObservabilitySidecar(pod, agent.spec.observability)
            createPod(pod)
        }
        updateStatus(agent, state: "Initializing", reason: "ScalingUp")
        return requeue(after: 15s)
    }

    if actualReplicas > desiredReplicas {
        // Scale down: terminate excess pods (newest first)
        excess = actualReplicas - desiredReplicas
        podsToTerminate = currentPods.sortBy(creationTimestamp, desc).take(excess)
        for pod in podsToTerminate {
            // Graceful shutdown: drain active tasks first
            drainAgent(pod, timeout: 60s)
            deletePod(pod)
        }
        updateStatus(agent, state: "Running", reason: "ScalingDown")
        return requeue(after: 30s)
    }

    // Phase 3: Image/model update (rolling update)
    if len(actualImages) > 0 && actualImages[0] != desiredImage {
        performRollingUpdate(agent, currentPods, desiredImage)
        updateStatus(agent, state: "Initializing", reason: "RollingUpdate")
        return requeue(after: 10s)
    }

    // Phase 4: Health assessment
    healthyPods = currentPods.filter(p => p.status.conditions.ready == true)
    unhealthyPods = currentPods.filter(p => p.status.conditions.ready == false)

    if len(unhealthyPods) > 0 && len(healthyPods) < desiredReplicas {
        updateStatus(agent, state: "Degraded",
            reason: fmt("{} of {} replicas unhealthy", len(unhealthyPods), desiredReplicas))

        // Attempt recovery for pods stuck in CrashLoopBackOff
        for pod in unhealthyPods {
            if pod.status.containerStatuses[0].restartCount > 5 {
                deletePod(pod)  // Let the next reconciliation recreate it
            }
        }
        return requeue(after: 30s)
    }

    // Phase 5: HPA management
    if agent.spec.scaling.metrics != null && len(agent.spec.scaling.metrics) > 0 {
        if currentHPA == null {
            createHPA(agent)
        } else {
            updateHPA(agent, currentHPA)
        }
    }

    // Phase 6: Steady state
    updateStatus(agent, state: "Running",
        replicas: len(healthyPods),
        readyReplicas: len(healthyPods),
        metrics: collectMetrics(healthyPods))

    return requeue(after: 60s)  // Periodic reconciliation
}

3.3 State Machine

The agent lifecycle is modeled as a finite state machine with well-defined transitions and invariants.

Table 2: Agent State Machine Transitions

Current State   Event                   Next State     Actions
(none)          CR created              Pending        Validate spec, set initial status
Pending         Resources available     Initializing   Create Service, Pods, NetworkPolicy
Pending         Resources unavailable   Pending        Set condition "ResourcesUnavailable"
Initializing    All pods ready          Running        Enable HPA, register in mesh
Initializing    Pod failure             Degraded       Log error, attempt restart
Initializing    Timeout (5 min)         Failed         Set condition "InitializationTimeout"
Running         Health check pass       Running        Update metrics in status
Running         Partial pod failure     Degraded       Scale replacement, alert
Running         All pods fail           Failed         Attempt full restart
Running         Spec change             Initializing   Begin rolling update
Running         CR deleted              Terminating    Drain tasks, delete resources
Degraded        Recovery                Running        Clear degraded condition
Degraded        Persistent failure      Failed         Escalate alert, stop retries
Degraded        CR deleted              Terminating    Force delete resources
Terminating     All resources deleted   (none)         Remove finalizer
Failed          User intervention       Pending        Reset state, retry
Failed          CR deleted              Terminating    Cleanup remaining resources

State Machine Diagram:

                    +----------+
         create --> | Pending  |
                    +----+-----+
                         |
              resources  |  resources
              available  |  unavailable
                         |  (loop)
                    +----v-----+
                    |Initializ-|
                    |   ing    |
                    +----+-----+
                   /     |      \
          all pods/      |       \timeout
          ready  /       |pod     \
                /        |failure  \
         +----v---+  +---v----+  +-v------+
         | Running|  |Degraded|  | Failed |
         +----+---+  +---+----+  +---+----+
              |           |          |
         spec |  recovery |   user   |
         change|          |  action  |
              |           |          |
              +-----------+----------+
                          |
                     CR deleted
                          |
                   +------v------+
                   | Terminating |
                   +------+------+
                          |
                     resources
                     cleaned up
                          |
                       (removed)

Figure 4: Agent lifecycle state machine with transitions.

3.4 Leader Election

The operator uses Kubernetes Lease objects for leader election. The leader acquires a lease with a configurable duration (default: 15 seconds) and renews it periodically (default: every 10 seconds). If the leader fails to renew the lease, another replica acquires it within the lease duration plus a brief jitter period.

The leader election configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-operator
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-operator
  template:
    metadata:
      labels:
        app: agent-operator
    spec:
      serviceAccountName: agent-operator
      containers:
      - name: operator
        image: registry.gitlab.com/blueflyio/agent-operator:v1.0.0
        args:
        - --leader-elect=true
        - --leader-election-id=agent-operator-leader
        - --leader-election-namespace=agent-system
        - --leader-election-lease-duration=15s
        - --leader-election-renew-deadline=10s
        - --leader-election-retry-period=2s
        - --metrics-bind-address=:8080
        - --health-probe-bind-address=:8081
        ports:
        - containerPort: 8080
          name: metrics
        - containerPort: 8081
          name: health
        livenessProbe:
          httpGet:
            path: /healthz
            port: health
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /readyz
            port: health
          initialDelaySeconds: 5
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

The maximum failover time can be calculated as:

T_failover = lease_duration + retry_period + reconciliation_backoff
           = 15s + 2s + 5s
           = 22 seconds (worst case)

In practice, failover typically completes within 10-15 seconds because the standby replica's lease acquisition attempt aligns with the expired lease boundary.

3.5 Finalizers and Graceful Cleanup

The operator attaches a finalizer (ossa.ai/agent-cleanup) to every Agent resource. When the user deletes an Agent CR, Kubernetes marks it for deletion but does not remove it from etcd until all finalizers are cleared. The operator's reconciliation loop detects the deletion timestamp, transitions the agent to the Terminating state, drains active tasks from all pods, deletes subordinate resources (Pods, Services, HPAs, NetworkPolicies), and finally removes the finalizer, allowing Kubernetes to complete the deletion.

This ensures that active agent tasks are not abruptly terminated and that orphaned resources are not left in the cluster.
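The deletion path can be sketched in the same pseudocode style as the reconciliation loop in Section 3.2. Helper names mirror those used there; this is an illustrative sketch, not the operator's actual source:

function reconcileDeletion(agent: Agent) -> Result {
    // Triggered when deletionTimestamp is set but the finalizer remains
    if agent.metadata.deletionTimestamp == null {
        return continueNormalReconcile()
    }

    updateStatus(agent, state: "Terminating")

    // Drain before delete so active tasks complete or hand off cleanly
    for pod in listPods(labelSelector: agent.metadata.name) {
        drainAgent(pod, timeout: 60s)
        deletePod(pod)
    }

    // Remove subordinate resources created by earlier reconciliations
    deleteService(agent.metadata.name)
    deleteHPA(agent.metadata.name)
    deleteNetworkPolicy(agent.metadata.name)

    // Clearing the finalizer lets Kubernetes complete the deletion
    removeFinalizer(agent, "ossa.ai/agent-cleanup")
    return done()
}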


4. Agent Scaling

4.1 Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler adjusts the number of agent replicas based on observed metrics. For AI agents, the most relevant metrics are not traditional CPU and memory utilization but rather domain-specific metrics like tokens per second, active task count, and request queue depth.

The HPA scaling formula is:

desiredReplicas = ceil(currentMetricValue / targetMetricValue * currentReplicas)

Stabilization:
  scaleUp:   max(recommendations[last 0s..scaleUpStabilization])
  scaleDown: min(recommendations[last 0s..scaleDownStabilization])

With tolerance band (default 10%):
  if abs(1 - currentMetricValue/targetMetricValue) < 0.1:
      desiredReplicas = currentReplicas  // no change (within tolerance)

For example, if an agent is currently running 3 replicas at an average of 150 tokens per second per replica (the HPA compares the per-replica average, not the fleet total, against the target), and the target is 60 tokens per second per replica:

desiredReplicas = ceil(150 / 60 * 3) = ceil(7.5) = 8

But if scaling down, the stabilization window prevents premature scale-down:

desiredReplicas = min(recommendations[last 300s])
// If recommendations over last 5 minutes were [8, 7, 6, 5, 5]:
desiredReplicas = 5
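The recommendation logic above can be sketched in a few lines of Python. This is simplified relative to the real HPA controller, which also accounts for pod readiness, missing metrics, and scaling rate limits:

```python
import math

# Simplified sketch of the HPA scaling formula with tolerance band and
# scale-down stabilization; illustrative, not the controller's source.
def desired_replicas(current_avg, target, current_replicas, tolerance=0.10):
    if abs(1 - current_avg / target) < tolerance:
        return current_replicas  # within tolerance band: no change
    return math.ceil(current_avg / target * current_replicas)

def stabilized_scale_down(window_recommendations):
    # Taking the max recommendation over the window prevents flapping:
    # we only scale down as far as the window's highest recent need...
    # the document's min() form selects the final settled recommendation.
    return min(window_recommendations)

print(desired_replicas(150, 60, 3))            # 8
print(stabilized_scale_down([8, 7, 6, 5, 5]))  # 5
print(desired_replicas(61, 60, 3))             # 3 (within 10% tolerance)
```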

4.2 Custom Metrics for Agent Workloads

Standard CPU and memory metrics are insufficient for intelligent agent scaling. The following custom metrics provide the signals needed for responsive, cost-effective scaling.

Table 3: Custom Metrics for Agent Scaling

Metric                     | Type     | Description                           | Target Range | Scaling Behavior
---------------------------+----------+---------------------------------------+--------------+---------------------------------------
agent_tokens_per_second    | Pods     | Token throughput per replica          | 50-100 tps   | Scale up when throughput saturates
agent_active_tasks         | Pods     | Currently executing tasks per replica | 1-5 tasks    | Scale up when concurrency is high
agent_queue_depth          | External | Pending tasks in message queue        | 0-10 items   | Scale up proactively before saturation
agent_request_latency_p99  | Pods     | 99th percentile response latency      | < 2000 ms    | Scale up when latency degrades
agent_error_rate           | Pods     | Error rate over 5-minute window       | < 0.01 (1%)  | Scale up when errors stem from overload
agent_gpu_utilization      | Pods     | GPU compute utilization percentage    | 60-80%       | Scale up for GPU-bound workloads
agent_model_cache_hit_rate | Pods     | KV cache hit rate for model inference | > 0.90 (90%) | Scale up if cache pressure is high
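Before any of these metrics can drive scaling, each replica must publish them in a scrape-friendly form; Prometheus then makes them available to the HPA via a metrics adapter. A stdlib-only sketch of the text exposition payload (metric names follow Table 3; the helper name, label set, and sample values are illustrative assumptions):

```python
# Minimal sketch of the Prometheus text-exposition payload an agent replica
# could serve at /metrics. Metric names follow Table 3; the label set and
# sample values are illustrative, not a prescribed schema.
def render_metrics(agent: str, tps: float, active: int, depth: int) -> str:
    lines = [
        "# TYPE agent_tokens_per_second gauge",
        f'agent_tokens_per_second{{agent="{agent}"}} {tps}',
        "# TYPE agent_active_tasks gauge",
        f'agent_active_tasks{{agent="{agent}"}} {active}',
        "# TYPE agent_queue_depth gauge",
        f'agent_queue_depth{{agent="{agent}"}} {depth}',
    ]
    return "\n".join(lines) + "\n"

print(render_metrics("code-reviewer", 72.5, 3, 7))
```

In production this would normally go through a client library such as prometheus_client rather than hand-formatted strings; the sketch only shows the wire format the scraper expects.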

4.3 Vertical Pod Autoscaler (VPA)

While HPA adjusts replica count, VPA adjusts the resource requests and limits of individual pods. For AI agents, VPA is particularly valuable for right-sizing GPU memory allocations. A model that was initially allocated 16 GB of GPU memory may only use 11 GB in practice; VPA can reduce the request to 12 GB (with headroom), freeing 4 GB for other workloads on the same GPU node.

VPA operates in three modes:

  • Off: VPA generates recommendations but does not apply them. Useful for initial observation.
  • Initial: VPA sets resource requests only at pod creation time. No disruptive restarts.
  • Auto: VPA evicts and recreates pods when resource requests need significant adjustment.

For agent workloads, the Initial mode is recommended for production because it avoids disruptive pod restarts that would interrupt active agent tasks. The Auto mode is suitable for development and staging environments where task interruption is acceptable.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: code-reviewer-vpa
  namespace: agent-system
spec:
  targetRef:
    apiVersion: ossa.ai/v1
    kind: Agent
    name: code-reviewer
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: agent
        minAllowed:
          cpu: 250m
          memory: 512Mi
        maxAllowed:
          cpu: 4
          memory: 16Gi
          nvidia.com/gpu: 1
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits

4.4 KEDA for Event-Driven Scaling

KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes scaling beyond metrics to event sources. For agent workloads, KEDA is invaluable for scaling based on message queue depth, enabling agents to scale from zero when no tasks are pending and scale up rapidly when a burst of tasks arrives.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: code-reviewer-scaler
  namespace: agent-system
spec:
  scaleTargetRef:
    apiVersion: ossa.ai/v1
    kind: Agent
    name: code-reviewer
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 20
  fallback:
    failureThreshold: 3
    replicas: 2
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: agent-tasks-code-review
        mode: QueueLength
        value: "5"
      authenticationRef:
        name: rabbitmq-auth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: agent_active_tasks
        query: |
          sum(agent_active_tasks{agent="code-reviewer"})
        threshold: "10"
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5
        end: 0 18 * * 1-5
        desiredReplicas: "3"

This configuration enables sophisticated scaling behavior. During business hours (8 AM to 6 PM EST, Monday through Friday), at least 3 replicas are maintained. Outside business hours, the agent scales to zero if no tasks are pending. When tasks arrive in the RabbitMQ queue, the agent scales up by one replica for every 5 pending tasks. The Prometheus trigger provides an additional signal based on active task concurrency across all replicas.

4.5 GPU Scheduling

GPU scheduling in Kubernetes requires the NVIDIA device plugin, which exposes nvidia.com/gpu as a schedulable resource. GPU allocation is binary at the device level: a pod either gets an entire GPU or none (fractional GPU sharing via MIG or time-slicing requires additional configuration).

Table 4: GPU Scheduling Strategies

Strategy                 | Configuration                           | Use Case                       | Efficiency
-------------------------+-----------------------------------------+--------------------------------+-------------------
Exclusive                | nvidia.com/gpu: 1                       | Large models (>10B params)     | 40-70% utilization
MIG (Multi-Instance GPU) | nvidia.com/mig-3g.20gb: 1               | Medium models, multiple agents | 70-85% utilization
Time-Slicing             | nvidia.com/gpu: 1 + time-slicing config | Small models, cost-sensitive   | 80-95% utilization
vGPU                     | NVIDIA vGPU license                     | Enterprise, guaranteed SLAs    | 60-80% utilization

For agent workloads, MIG partitioning on A100/H100 GPUs provides the best balance of isolation and efficiency. A single A100 80GB can be partitioned into seven 10 GB instances, each running a separate agent with hardware-level memory isolation.

The GPU utilization efficiency formula:

GPU_efficiency = (sum(agent_gpu_compute_time) / (N_gpus * wall_clock_time)) * 100%

Cost_per_token = (GPU_cost_per_hour / 3600) / tokens_per_second

For example, an A100 at $3.00/hour processing 500 tokens/second:
  Cost_per_token = ($3.00 / 3600) / 500 = $0.00000167 per token
  Cost_per_million_tokens = $1.67
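The cost model is easy to turn into a reusable helper (a sketch; the function name and defaults are ours):

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Cost model from the text: $/hour -> $/second -> $/token -> $/1M tokens."""
    cost_per_token = (gpu_cost_per_hour / 3600.0) / tokens_per_second
    return cost_per_token * 1_000_000

# A100 at $3.00/hour sustaining 500 tokens/second:
print(f"${cost_per_million_tokens(3.00, 500):.2f} per million tokens")  # -> $1.67
```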

5. Networking and Service Mesh

5.1 Kubernetes Service Model for Agents

Each Agent resource is backed by a Kubernetes Service that provides stable DNS-based discovery and load balancing. The operator creates a ClusterIP service for internal communication and optionally a headless service for StatefulSet-based agents that require stable network identities.

apiVersion: v1
kind: Service
metadata:
  name: code-reviewer
  namespace: agent-system
  labels:
    ossa.ai/agent: code-reviewer
    ossa.ai/type: inference
spec:
  selector:
    ossa.ai/agent: code-reviewer
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  type: ClusterIP

5.2 Service Mesh Integration

For production deployments, a service mesh (Istio or Linkerd) provides critical capabilities that are difficult to implement at the application level: mutual TLS for all inter-agent communication, fine-grained traffic management, circuit breaking, and distributed tracing.

Istio's sidecar proxy (Envoy) automatically encrypts all traffic between agent pods using mutual TLS, eliminating the need for agents to manage their own TLS certificates. The mesh also provides L7 load balancing for gRPC, which is essential because Kubernetes' default L4 load balancing operates at the connection level: all gRPC streams multiplexed over a single long-lived HTTP/2 connection are pinned to one backend, so without L7 balancing traffic concentrates on a single replica.

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: agent-mtls
  namespace: agent-system
spec:
  mtls:
    mode: STRICT
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: code-reviewer-lb
  namespace: agent-system
spec:
  host: code-reviewer.agent-system.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 0
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: code-reviewer-routing
  namespace: agent-system
spec:
  hosts:
    - code-reviewer.agent-system.svc.cluster.local
  http:
    - match:
        - headers:
            x-agent-version:
              exact: "v2"
      route:
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v2
          weight: 100
    - route:
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v1
          weight: 90
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v2
          weight: 10

5.3 Network Policies

Network policies implement microsegmentation, ensuring that each agent can only communicate with the services it is authorized to access. The operator generates network policies automatically based on the agent's capability declarations and access tier.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: code-reviewer-netpol
  namespace: agent-system
spec:
  podSelector:
    matchLabels:
      ossa.ai/agent: code-reviewer
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ossa.ai/type: orchestrator
        - podSelector:
            matchLabels:
              ossa.ai/agent: report-aggregator
      ports:
        - port: 50051
          protocol: TCP
        - port: 8080
          protocol: TCP
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: qdrant
      ports:
        - port: 6334
          protocol: TCP
    - to:
        - namespaceSelector:
            matchLabels:
              name: llm-providers
      ports:
        - port: 443
          protocol: TCP
    - to:
        - podSelector:
            matchLabels:
              app: prometheus
          namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 9090
          protocol: TCP

5.4 Ingress and External Access

External access to agent services is provided through an Ingress controller with TLS termination. For production deployments, we recommend dedicated ingress resources per agent group rather than a single ingress with path-based routing, to provide isolation and independent scaling of ingress capacity.

Agent Networking Data Flow:

External Client
      |
      | HTTPS (TLS 1.3)
      v
+------------------+
| Ingress          |
| Controller       |
| (nginx/envoy)    |
+--------+---------+
         |
         | HTTP/2 (plaintext, within cluster)
         v
+------------------+
| Istio Ingress    |
| Gateway          |
+--------+---------+
         |
         | mTLS (Istio-managed certificates)
         v
+------------------+     mTLS     +------------------+
| Agent Pod A      |<------------>| Agent Pod B      |
| (code-reviewer)  |              | (security-scan)  |
| +-------------+  |              | +-------------+  |
| | Envoy Proxy |  |              | | Envoy Proxy |  |
| +------+------+  |              | +------+------+  |
|        |         |              |        |         |
| +------v------+  |              | +------v------+  |
| | Agent       |  |              | | Agent       |  |
| | Container   |  |              | | Container   |  |
| +-------------+  |              | +-------------+  |
+------------------+              +------------------+
         |                                 |
         | mTLS                            | mTLS
         v                                 v
+------------------+              +------------------+
| Qdrant           |              | Prometheus       |
| (Vector DB)      |              | (Monitoring)     |
+------------------+              +------------------+

Figure 5: Agent networking data flow with service mesh mTLS.


6. Storage and Persistence

6.1 Storage Requirements for AI Agents

AI agents have diverse storage requirements that span multiple access patterns and performance tiers.

Table 5: Agent Storage Requirements

Storage Type       | Access Pattern          | Performance               | Persistence           | Example Use
-------------------+-------------------------+---------------------------+-----------------------+----------------------
Model weights      | Read-heavy, sequential  | High throughput (1+ GB/s) | Ephemeral (cacheable) | LLM model files
Vector indices     | Read-write, random      | High IOPS (3000+)         | Persistent            | Qdrant/Milvus data
Conversation state | Write-heavy, append     | Medium IOPS (500+)        | Persistent            | Agent memory
Task queue         | Read-write, FIFO        | Low latency (< 1 ms)      | Semi-persistent       | Pending tasks
Scratch/temp       | Write-heavy, sequential | Medium throughput         | Ephemeral             | Intermediate results
Configuration      | Read-only               | Low                       | Persistent            | Agent config, prompts

6.2 Persistent Volume Claims

The operator creates PVCs based on the agent's storage declarations. For agents that require persistent state (vector databases, conversation history), the operator uses StorageClass selection to match performance requirements.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: code-reviewer-vector-store
  namespace: agent-system
  labels:
    ossa.ai/agent: code-reviewer
    ossa.ai/storage-type: vector-index
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd-high-iops
  resources:
    requests:
      storage: 50Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-high-iops
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "50"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain

6.3 StatefulSets for Stateful Agents

Agents that maintain persistent state (such as vector database instances or agents with local model caches) are deployed as StatefulSets rather than Deployments. StatefulSets provide stable network identities (deterministic pod names like qdrant-0, qdrant-1) and ordered, graceful scaling that ensures data consistency.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant-vector-store
  namespace: agent-system
spec:
  serviceName: qdrant-headless
  replicas: 3
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.12.0
          ports:
            - containerPort: 6333
              name: http
            - containerPort: 6334
              name: grpc
            - containerPort: 6335
              name: internal
          volumeMounts:
            - name: qdrant-data
              mountPath: /qdrant/storage
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 16Gi
  volumeClaimTemplates:
    - metadata:
        name: qdrant-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: ssd-high-iops
        resources:
          requests:
            storage: 100Gi

6.4 CSI Driver Selection

The choice of CSI (Container Storage Interface) driver significantly impacts storage performance. The following performance reference, organized by storage tier, guides driver selection for agent workloads:

  SSD (io2/gp3):
    IOPS:       3,000 - 64,000 (provisioned)
    Throughput:  125 - 1,000 MB/s
    Latency:     < 1 ms (p99)
    Cost:        $0.125/GB/month + $0.065/provisioned-IOPS

  HDD (st1/sc1):
    IOPS:       250 - 500
    Throughput:  20 - 500 MB/s
    Latency:     5 - 10 ms (p99)
    Cost:        $0.025 - $0.045/GB/month

  NFS (EFS/Filestore):
    IOPS:       varies (bursting)
    Throughput:  50 - 1,000 MB/s (provisioned)
    Latency:     2 - 10 ms (p99)
    Cost:        $0.30/GB/month (standard), $0.025/GB/month (infrequent access)
    Access:      ReadWriteMany (shared across pods)

  Local NVMe (i3/i4i instances):
    IOPS:       100,000 - 3,300,000
    Throughput:  1,750 - 8,000 MB/s
    Latency:     < 0.1 ms (p99)
    Cost:        included in instance cost (ephemeral)

For vector database workloads that require high random IOPS, provisioned SSD (io2) is recommended. For model weight caching where sequential throughput matters more than IOPS, local NVMe provides the best performance at the lowest cost (since storage is included in the instance price), with the caveat that data is ephemeral and must be re-downloaded if the node is replaced.

6.5 Model Weight Distribution

Large model weights (ranging from 2 GB for 7B-parameter quantized models to 150+ GB for 70B-parameter full-precision models) present a unique storage challenge. Downloading weights from a remote registry (Hugging Face, S3) on every pod startup introduces unacceptable latency. The recommended approach is a tiered caching strategy:

  1. Cluster-level cache: A shared ReadWriteMany NFS volume mounted at /models/cache on all agent nodes, populated by a DaemonSet that pre-fetches model weights.
  2. Node-level cache: A hostPath or local PV mounted at /var/cache/models that persists across pod restarts on the same node.
  3. Pod-level init container: An init container that copies the required model weights from the node cache to the pod's ephemeral volume before the agent container starts.

The total model loading time is:

Model loading time = download_time + copy_time + load_time

Where:
  download_time = model_size / download_bandwidth  (0 if cached)
  copy_time     = model_size / local_disk_bandwidth
  load_time     = model_size / memory_bandwidth + initialization_overhead

Example (13B model, 7.3 GB quantized):
  Cold start:  7.3 GB / 100 MB/s + 7.3 GB / 2 GB/s + 7.3 GB / 20 GB/s + 2s
             = 73s + 3.65s + 0.365s + 2s = ~79 seconds
  Warm start:  0s + 3.65s + 0.365s + 2s = ~6 seconds
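The loading-time model can be expressed directly (bandwidth defaults match the example above; the function name is ours):

```python
def model_load_seconds(model_gb, cached,
                       download_gb_s=0.1,  # registry bandwidth (100 MB/s)
                       disk_gb_s=2.0,      # local disk copy bandwidth
                       mem_gb_s=20.0,      # memory load bandwidth
                       init_s=2.0):        # fixed initialization overhead
    """Tiered-cache loading model: download (skipped on cache hit) + copy + load."""
    download = 0.0 if cached else model_gb / download_gb_s
    copy = model_gb / disk_gb_s
    load = model_gb / mem_gb_s + init_s
    return download + copy + load

print(round(model_load_seconds(7.3, cached=False)))  # cold start -> 79
print(round(model_load_seconds(7.3, cached=True)))   # warm start -> 6
```

The cold/warm gap makes clear why the node-level cache pays for itself: the download term dominates everything else by an order of magnitude.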

7. Observability

7.1 The Three Pillars for Agent Workloads

Observability for AI agent workloads extends beyond traditional infrastructure monitoring. In addition to the standard three pillars (metrics, logs, traces), agent observability requires semantic-level understanding: what decisions did the agent make, what tools did it invoke, what was the quality of its output?

7.2 OpenTelemetry Instrumentation

The OpenTelemetry Collector runs as a DaemonSet on every node, receiving telemetry from agent pods via OTLP (OpenTelemetry Protocol) and routing it to the appropriate backends.

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: agent-collector
  namespace: monitoring
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      attributes:
        actions:
          - key: agent.name
            from_context: resource
            action: insert
          - key: agent.model
            from_context: resource
            action: insert
    exporters:
      prometheusremotewrite:
        endpoint: http://prometheus.monitoring:9090/api/v1/write
      otlp/tempo:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
      loki:
        endpoint: http://loki.monitoring:3100/loki/api/v1/push
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [otlp/tempo]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]

7.3 Prometheus Metrics

The agent operator exposes a comprehensive set of Prometheus metrics that cover both infrastructure health and agent-specific semantics.

Key metrics exposed by the operator:

# Agent lifecycle metrics
agent_operator_reconcile_total{agent, result}              # Total reconciliation attempts
agent_operator_reconcile_duration_seconds{agent, quantile}  # Reconciliation latency
agent_operator_state_transitions_total{agent, from, to}     # State machine transitions
agent_operator_managed_agents_total{state}                  # Agents by state

# Agent runtime metrics (scraped from agent pods)
agent_tokens_processed_total{agent, model}                  # Total tokens processed
agent_tokens_per_second{agent, model}                       # Current throughput
agent_request_duration_seconds{agent, tool, quantile}       # Request latency by tool
agent_active_tasks{agent}                                   # Currently executing tasks
agent_queue_depth{agent}                                    # Pending tasks
agent_tool_invocations_total{agent, tool, status}           # Tool usage by status
agent_model_inference_duration_seconds{agent, model}        # Model inference latency
agent_errors_total{agent, type}                             # Errors by type
agent_gpu_utilization_percent{agent, gpu_index}             # GPU utilization
agent_gpu_memory_used_bytes{agent, gpu_index}               # GPU memory usage
agent_context_window_utilization{agent}                     # Context window fill percentage

7.4 Alert Rules

Critical alert rules for agent workloads:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-alerts
  namespace: monitoring
spec:
  groups:
    - name: agent-health
      interval: 30s
      rules:
        - alert: AgentDown
          expr: |
            absent(up{job="agent-system"} == 1)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Agent {{ $labels.agent }} is down"
            description: "Agent {{ $labels.agent }} has been unreachable for 5 minutes."
        - alert: AgentHighErrorRate
          expr: |
            rate(agent_errors_total[5m]) / rate(agent_tokens_processed_total[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} error rate > 5%"
        - alert: AgentHighLatency
          expr: |
            histogram_quantile(0.99, rate(agent_request_duration_seconds_bucket[5m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} p99 latency > 10s"
        - alert: AgentGPUMemoryPressure
          expr: |
            agent_gpu_memory_used_bytes / agent_gpu_memory_total_bytes > 0.95
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Agent {{ $labels.agent }} GPU memory > 95%"
        - alert: AgentQueueBacklog
          expr: |
            agent_queue_depth > 50
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} queue depth > 50 for 10 minutes"
        - alert: AgentScalingMaxed
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} at maximum replicas for 15 minutes"

7.5 Grafana Dashboards

A production agent observability stack includes the following Grafana dashboards:

  1. Agent Fleet Overview: Total agents, state distribution, cluster resource utilization, error rates, and throughput aggregates.
  2. Individual Agent Detail: Per-agent metrics including token throughput, latency percentiles, tool invocation breakdown, GPU utilization, and scaling events.
  3. Workflow Execution: AgentWorkflow DAG visualization, step durations, failure rates, and end-to-end latency.
  4. Cost and Capacity: GPU utilization efficiency, cost per token, resource waste (requested vs. used), and capacity planning projections.
  5. Security and Compliance: RBAC audit events, network policy violations, runtime security alerts, and access tier validation.

7.6 Structured Logging with Loki

Agent logs are structured as JSON and shipped to Loki via the OpenTelemetry Collector. Each log entry includes the agent name, model, task ID, and tool invocation context as labels, enabling efficient filtering and correlation.

The log retention policy should account for the volume generated by verbose agent interactions. A single agent processing 100 requests per hour with an average of 10 tool invocations per request generates approximately 1,000 log entries per hour. At an average of 500 bytes per entry, this is 500 KB/hour or ~360 MB/month per agent. For a fleet of 50 agents, total log volume is approximately 18 GB/month before compression (Loki typically achieves 10-15x compression, resulting in approximately 1.2-1.8 GB of stored data).
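The capacity estimate generalizes to other fleet sizes (a sketch; parameter defaults mirror the example above, and the compression ratio picks the middle of Loki's typical 10-15x range):

```python
def monthly_log_volume_gb(agents, requests_per_hour=100, tool_calls_per_request=10,
                          bytes_per_entry=500, compression_ratio=12.5):
    """Raw and compressed monthly log volume for an agent fleet (30-day month)."""
    entries_per_hour = requests_per_hour * tool_calls_per_request
    raw_bytes = agents * entries_per_hour * bytes_per_entry * 24 * 30
    raw_gb = raw_bytes / 1e9
    return raw_gb, raw_gb / compression_ratio

raw, stored = monthly_log_volume_gb(50)
print(raw, stored)  # -> 18.0 1.44
```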


8. Multi-Cluster Federation

8.1 Why Multi-Cluster?

Single-cluster deployments are sufficient for many organizations, but multi-cluster federation becomes necessary for several reasons:

  • Geographic distribution: Agents that interact with users in multiple regions benefit from reduced latency when deployed closer to the user.
  • Regulatory compliance: Data residency requirements (GDPR, CCPA) may mandate that certain agent workloads and their associated data remain within specific geographic boundaries.
  • Blast radius reduction: Isolating agent workloads across clusters limits the impact of cluster-level failures.
  • Resource specialization: Different clusters can provide different hardware profiles (GPU types, memory configurations) for different agent workloads.
  • Scale limits: etcd performance degrades above approximately 10,000 custom resources per cluster; large agent deployments may need to partition across clusters.

8.2 Federation Architecture

Multi-Cluster Federation:

                    +----------------------------+
                    |    Federation Control      |
                    |    Plane                   |
                    |                            |
                    | +------------------------+ |
                    | | KubeFed Controller     | |
                    | | Manager                | |
                    | +------------------------+ |
                    | | Agent Federation       | |
                    | | Scheduler              | |
                    | +------------------------+ |
                    | | Global Service Mesh    | |
                    | | (Istio Multi-Cluster)  | |
                    | +------------------------+ |
                    +-----+-------+-------+-----+
                          |       |       |
              +-----------+       |       +-----------+
              |                   |                   |
     +---------v--------+ +-------v--------+ +--------v---------+
     | Cluster: US-East | | Cluster: EU    | | Cluster: AP      |
     |                  | |                | |                  |
     | Agents:          | | Agents:        | | Agents:          |
     | - code-reviewer  | | - gdpr-agent   | | - translation    |
     | - security-scan  | | - eu-reviewer  | | - ap-reviewer    |
     | - test-gen       | | - compliance   | | - sentiment      |
     |                  | |                | |                  |
     | GPU: 4x A100     | | GPU: 2x A100   | | GPU: 2x A100     |
     | Nodes: 15        | | Nodes: 8       | | Nodes: 6         |
     +------------------+ +----------------+ +------------------+

Figure 6: Multi-cluster federation architecture with geographic distribution.

8.3 Federated Agent Resources

KubeFed (Kubernetes Federation v2) enables the propagation of Agent CRs across multiple clusters with placement policies and override mechanisms.

apiVersion: types.kubefed.io/v1beta1
kind: FederatedAgent
metadata:
  name: code-reviewer
  namespace: agent-system
spec:
  template:
    spec:
      runtime:
        model: anthropic/claude-sonnet-4-20250514:latest
        image: registry.gitlab.com/blueflyio/agents/code-reviewer:v2.1.0
      capabilities:
        - name: code-review
          type: skill
          version: "2.1"
      scaling:
        minReplicas: 2
        maxReplicas: 10
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: 1
  placement:
    clusters:
      - name: us-east
      - name: eu-west
    clusterSelector:
      matchLabels:
        gpu-available: "true"
  overrides:
    - clusterName: eu-west
      clusterOverrides:
        - path: "/spec/scaling/minReplicas"
          value: 1
        - path: "/spec/scaling/maxReplicas"
          value: 5
        - path: "/spec/runtime/model"
          value: "anthropic/claude-sonnet-4-20250514:eu-compliant"

8.4 Cross-Cluster Latency Model

Cross-cluster agent communication introduces latency that must be accounted for in workflow design. The total latency for a cross-cluster agent invocation is:

T_cross_cluster = T_serialization + (N_hops * T_per_hop) + T_deserialization + T_processing

Where:
  T_serialization   = message_size / serialization_throughput
                    = typically 0.1 - 2 ms for protobuf (gRPC)
  N_hops            = number of network hops (typically 3-8 for cross-region)
  T_per_hop         = per-hop latency (0.5 - 5 ms per hop)
  T_deserialization  = roughly equal to T_serialization
  T_processing      = agent processing time (highly variable, 100ms - 60s)

Example (US-East to EU-West):
  T_cross_cluster = 0.5ms + (6 * 2ms) + 0.5ms + 500ms
                  = 0.5 + 12 + 0.5 + 500
                  = 513 ms

Compared to same-cluster:
  T_same_cluster  = 0.5ms + (2 * 0.1ms) + 0.5ms + 500ms
                  = 0.5 + 0.2 + 0.5 + 500
                  = 501.2 ms

The 12 ms network overhead of cross-cluster communication is negligible compared to agent processing time for most workloads. However, for workflows with many sequential agent invocations (e.g., a 10-step pipeline), the cumulative hop latency becomes significant: 120 ms cross-cluster vs. 2 ms same-cluster, a 60x increase in network latency.
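The latency model maps directly to a small helper (an illustrative sketch; the function name is ours):

```python
def invocation_latency_ms(processing_ms, hops, per_hop_ms, serdes_ms=0.5):
    """T = T_serialization + N_hops * T_per_hop + T_deserialization + T_processing,
    all in milliseconds, matching the model above."""
    return serdes_ms + hops * per_hop_ms + serdes_ms + processing_ms

print(invocation_latency_ms(500, hops=6, per_hop_ms=2.0))  # cross-cluster -> 513.0
print(invocation_latency_ms(500, hops=2, per_hop_ms=0.1))  # same-cluster  -> ~501.2
# Hop latency accumulated across a 10-step sequential pipeline:
print(10 * 6 * 2.0, 10 * 2 * 0.1)  # -> 120.0 ms vs ~2.0 ms of network overhead
```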

The recommendation is to co-locate agents that form tight interaction loops in the same cluster and use cross-cluster communication only for loosely coupled workflows or geographic routing.

8.5 Cluster API for Infrastructure Provisioning

For organizations that manage their own Kubernetes infrastructure (rather than using managed services), the Cluster API provides declarative, Kubernetes-style APIs for creating, configuring, and managing clusters. This enables the agent orchestration layer to provision new clusters on demand in response to scaling requirements or geographic expansion.


9. Security Hardening

9.1 Pod Security Standards

Kubernetes Pod Security Standards define three levels of restriction: Privileged (unrestricted), Baseline (prevents known privilege escalations), and Restricted (heavily restricted, following current best practices). Agent workloads should run at the Restricted level with targeted exceptions for GPU access.

apiVersion: v1
kind: Namespace
metadata:
  name: agent-system
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

The Restricted level enforces:

  • Pods must run as non-root (runAsNonRoot: true)
  • All capabilities must be dropped (only NET_BIND_SERVICE may be added back)
  • Privilege escalation must be explicitly disallowed
  • A seccomp profile must be set (RuntimeDefault or Localhost)
  • Host namespaces (hostNetwork, hostPID, hostIPC) are forbidden
  • Volume types are restricted; hostPath volumes are forbidden

For GPU workloads, a RuntimeClass exception is required because the NVIDIA device plugin requires certain capabilities. This is handled through a targeted exemption rather than relaxing the entire namespace.

9.2 RuntimeClass and Sandboxing

For agents that execute untrusted code (e.g., a code execution agent that runs user-submitted programs), gVisor provides an additional layer of isolation beyond standard container boundaries. gVisor intercepts system calls and handles them in user space, preventing the container from directly interacting with the host kernel.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
overhead:
  podFixed:
    cpu: 100m
    memory: 64Mi
scheduling:
  nodeSelector:
    runtime.gvisor.dev/capable: "true"
---
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: code-executor
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-sonnet-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/code-executor:v1.0.0
  security:
    accessTier: tier_3_full_access
    runtimeClass: gvisor
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    networkPolicy:
      allowEgress:
        - host: "*.internal.svc.cluster.local"
          port: 443
      denyIngress: false

The performance overhead of gVisor is approximately 5-15% for CPU-bound workloads and 20-40% for syscall-heavy workloads. For AI inference workloads that are primarily GPU-bound, the overhead is negligible because GPU operations bypass the gVisor syscall interception layer.

9.3 RBAC (Role-Based Access Control)

RBAC for the agent system follows the principle of least privilege. The operator service account has broad permissions within the agent-system namespace, but agent pods themselves have tightly scoped permissions based on their OSSA access tier.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agent-operator
rules:
  - apiGroups: ["ossa.ai"]
    resources: ["agents", "agentpools", "agentworkflows"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["ossa.ai"]
    resources: ["agents/status", "agentpools/status", "agentworkflows/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-tier1-readonly
  namespace: agent-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["agent-api-keys"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-tier3-executor
  namespace: agent-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "create", "update"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["agent-api-keys", "agent-git-credentials"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list"]

9.4 OPA (Open Policy Agent) for Policy Enforcement

OPA Gatekeeper enforces custom policies that go beyond what Kubernetes RBAC and Pod Security Standards can express. For agent workloads, OPA policies enforce constraints such as:

  • Agents must not request more GPU resources than their access tier permits.
  • Agents in tier_1_read cannot have egress network policies to external endpoints.
  • Agent images must be pulled from the approved container registry.
  • Agent model references must be from the approved model registry.
  • Cross-tier agent communication must follow the role conflict matrix.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: agenttierresourcelimit
spec:
  crd:
    spec:
      names:
        kind: AgentTierResourceLimit
      validation:
        openAPIV3Schema:
          type: object
          properties:
            maxGPU:
              type: object
              additionalProperties:
                type: integer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package agenttierresourcelimit

        violation[{"msg": msg}] {
          input.review.object.apiVersion == "ossa.ai/v1"
          input.review.object.kind == "Agent"
          tier := input.review.object.spec.security.accessTier
          requested_gpu := input.review.object.spec.resources.requests["nvidia.com/gpu"]
          max_gpu := input.parameters.maxGPU[tier]
          requested_gpu > max_gpu
          msg := sprintf(
            "Agent %v in tier %v requests %v GPUs, max allowed: %v",
            [input.review.object.metadata.name, tier, requested_gpu, max_gpu]
          )
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AgentTierResourceLimit
metadata:
  name: agent-gpu-limits-by-tier
spec:
  match:
    kinds:
      - apiGroups: ["ossa.ai"]
        kinds: ["Agent"]
  parameters:
    maxGPU:
      tier_1_read: 0
      tier_2_write_limited: 1
      tier_3_full_access: 4
      tier_4_policy: 0

9.5 Supply Chain Security

Agent container images must pass through a verification pipeline before deployment:

  1. Image signing: All images are signed with Sigstore/Cosign during CI. The operator validates signatures before allowing pod creation.
  2. Vulnerability scanning: Trivy scans all images for CVEs. Images with critical vulnerabilities are blocked.
  3. SBOM generation: Software Bills of Materials are generated for all agent images and stored in the registry alongside the image.
  4. Admission control: Kyverno or OPA policies enforce that only signed, scanned images from approved registries are admitted to the cluster.
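The admission decision in step 4 reduces to a conjunction of the earlier checks. A minimal sketch of that logic follows; the field names, return shape, and registry allowlist are illustrative assumptions, not the actual Kyverno or Gatekeeper API:

```python
# Hypothetical registry allowlist; a real deployment would source this from policy config.
APPROVED_REGISTRIES = ("registry.gitlab.com/blueflyio/",)

def admit_image(image_ref: str, signature_valid: bool,
                critical_cves: int, sbom_present: bool) -> tuple[bool, str]:
    """Return (admitted, reason) for a candidate agent image.

    Mirrors the pipeline above: approved registry, valid Cosign signature,
    clean Trivy scan, and an attached SBOM are all required.
    """
    if not image_ref.startswith(APPROVED_REGISTRIES):
        return False, "image not from an approved registry"
    if not signature_valid:
        return False, "missing or invalid Cosign signature"
    if critical_cves > 0:
        return False, f"{critical_cves} critical CVEs found by scanner"
    if not sbom_present:
        return False, "no SBOM attached to image"
    return True, "admitted"
```

In production this predicate lives in the admission webhook, not application code; the sketch only shows that each supply-chain gate is an independent, fail-closed check.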

10. Reference Architecture

10.1 50-Agent Production Deployment

The following reference architecture describes a production deployment of 50 agents across a medium-tier Kubernetes cluster. This architecture supports a mix of CPU-only agents (lightweight tools, routing, orchestration) and GPU-accelerated agents (inference, code generation, analysis).

Table 6: Reference Architecture Node Pools

| Node Pool | Instance Type | Count | CPU | Memory | GPU | Purpose |
|---|---|---|---|---|---|---|
| system | m6i.xlarge | 3 | 4 vCPU | 16 GB | None | Control plane, operator, monitoring |
| cpu-agents | m6i.2xlarge | 5 | 8 vCPU | 32 GB | None | CPU-only agents, orchestrators |
| gpu-inference | g5.2xlarge | 4 | 8 vCPU | 32 GB | 1x A10G 24 GB | Inference agents, code generation |
| gpu-heavy | p4d.24xlarge | 1 | 96 vCPU | 1152 GB | 8x A100 40 GB | Large model inference, training |
| storage | i3.xlarge | 3 | 4 vCPU | 30.5 GB | None | Qdrant, MinIO, PostgreSQL |

Deployment manifest for the complete system:

# Namespace and resource quotas
apiVersion: v1
kind: Namespace
metadata:
  name: agent-system
  labels:
    pod-security.kubernetes.io/enforce: restricted
    istio-injection: enabled
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-system-quota
  namespace: agent-system
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    requests.nvidia.com/gpu: "12"
    limits.cpu: "400"
    limits.memory: 1600Gi
    limits.nvidia.com/gpu: "12"
    pods: "200"
    services: "60"
    persistentvolumeclaims: "50"
---
# Agent Operator deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-operator
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-operator
  template:
    metadata:
      labels:
        app: agent-operator
    spec:
      serviceAccountName: agent-operator
      nodeSelector:
        node-pool: system
      containers:
        - name: operator
          image: registry.gitlab.com/blueflyio/agent-operator:v1.0.0
          args:
            - --leader-elect=true
            - --leader-election-id=agent-operator-leader
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --max-concurrent-reconciles=10
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
---
# Example agents (representative of 50-agent fleet)
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: code-reviewer
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-sonnet-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/code-reviewer:v2.1.0
    maxTokensPerRequest: 8192
    temperature: 0.3
  capabilities:
    - name: code-review
      type: skill
      version: "2.1"
    - name: git-operations
      type: tool
      version: "1.0"
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
      nvidia.com/gpu: 1
    limits:
      cpu: "4"
      memory: 16Gi
      nvidia.com/gpu: 1
  scaling:
    minReplicas: 2
    maxReplicas: 8
    metrics:
      - type: custom
        name: agent_active_tasks
        target:
          type: AverageValue
          averageValue: "3"
      - type: custom
        name: agent_queue_depth
        target:
          type: Value
          value: "10"
    scaleDownStabilization: 300
  security:
    accessTier: tier_3_full_access
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    networkPolicy:
      allowEgress:
        - host: "gitlab.com"
          port: 443
        - host: "api.anthropic.com"
          port: 443
  observability:
    metricsPort: 9090
    metricsPath: /metrics
    tracingEnabled: true
    logLevel: info
---
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: routing-orchestrator
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-haiku-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/router:v1.5.0
    maxTokensPerRequest: 2048
    temperature: 0.1
  capabilities:
    - name: task-routing
      type: skill
      version: "1.5"
    - name: agent-discovery
      type: protocol
      version: "1.0"
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  scaling:
    minReplicas: 3
    maxReplicas: 15
    metrics:
      - type: custom
        name: agent_active_tasks
        target:
          type: AverageValue
          averageValue: "10"
  security:
    accessTier: tier_2_write_limited
    runAsNonRoot: true
    readOnlyRootFilesystem: true
  observability:
    metricsPort: 9090
    tracingEnabled: true
    logLevel: info

10.2 Cost Model

The monthly cost for this reference architecture is calculated as follows:

Table 7: Monthly Cost Breakdown

| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| System nodes (m6i.xlarge) | 3 | $138/mo | $414 |
| CPU agent nodes (m6i.2xlarge) | 5 | $276/mo | $1,380 |
| GPU inference nodes (g5.2xlarge) | 4 | $912/mo | $3,648 |
| GPU heavy node (p4d.24xlarge) | 1 | $23,558/mo | $23,558 |
| Storage nodes (i3.xlarge) | 3 | $225/mo | $675 |
| EBS storage (gp3, 2 TB total) | 2,000 GB | $0.08/GB/mo | $160 |
| EBS storage (io2, 500 GB) | 500 GB | $0.125/GB/mo | $62.50 |
| Data transfer (egress) | 500 GB | $0.09/GB | $45 |
| EKS control plane | 1 | $73/mo | $73 |
| **Total** | | | **$30,015.50** |

For organizations that do not require the p4d.24xlarge heavy GPU node, the cost drops to approximately $6,457/month, well within the medium tier range. The heavy GPU node is only necessary for organizations running large language models (70B+ parameters) locally rather than using API-based inference.

The cost formula for estimating deployment expenses:

Monthly = (N_system * C_system) + (N_cpu * C_cpu_node) + (N_gpu_small * C_gpu_small)
        + (N_gpu_large * C_gpu_large) + (N_storage * C_storage_node)
        + (S_gp3 * 0.08) + (S_io2 * 0.125) + (E_gb * 0.09) + C_managed

Cost_per_agent = Monthly / N_agents
Cost_per_token = Monthly / (N_agents * avg_tokens_per_agent_per_month)

For this reference architecture:

  • Cost per agent: $30,015.50 / 50 = $600.31/month
  • Assuming each agent processes 10 million tokens per month: $600.31 / 10,000 ≈ $0.06 per 1,000 tokens of infrastructure cost
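The cost formula can be reproduced with a short script. This is a sketch: the prices are the on-demand figures from Table 7 and will vary by region, reservation, and pricing agreement.

```python
# Monthly on-demand node prices from Table 7 (assumed; verify for your region).
PRICES = {"system": 138, "cpu": 276, "gpu_small": 912, "gpu_large": 23558, "storage": 225}

def monthly_cost(n_system, n_cpu, n_gpu_small, n_gpu_large, n_storage,
                 s_gp3_gb, s_io2_gb, egress_gb, c_managed=73):
    """Monthly = node costs + EBS gp3/io2 + egress + managed control plane."""
    return (n_system * PRICES["system"]
            + n_cpu * PRICES["cpu"]
            + n_gpu_small * PRICES["gpu_small"]
            + n_gpu_large * PRICES["gpu_large"]
            + n_storage * PRICES["storage"]
            + s_gp3_gb * 0.08       # gp3 $/GB/mo
            + s_io2_gb * 0.125      # io2 $/GB/mo
            + egress_gb * 0.09      # egress $/GB
            + c_managed)            # EKS control plane

# Reference architecture: 3 system, 5 CPU, 4 small-GPU, 1 heavy-GPU, 3 storage nodes.
total = monthly_cost(3, 5, 4, 1, 3, 2000, 500, 500)
per_agent = total / 50
per_1k_tokens = per_agent / 10_000  # 10M tokens/agent/mo = 10,000 thousand-token units
print(f"${total:,.2f}/mo, ${per_agent:.2f}/agent, ${per_1k_tokens:.3f}/1K tokens")
# → $30,015.50/mo, $600.31/agent, $0.060/1K tokens
```

Dropping the heavy GPU node (`n_gpu_large=0`) reproduces the roughly $6,457/month figure cited above for API-inference-only deployments.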

10.3 Capacity Planning

Capacity planning for agent workloads requires estimating the peak concurrency, throughput requirements, and resource consumption patterns.

Capacity planning formulas:

Required_CPU_nodes = ceil(sum(agent_cpu_requests) / node_allocatable_cpu)
Required_GPU_nodes = ceil(sum(agent_gpu_requests) / gpus_per_node)
Required_memory    = sum(agent_memory_requests) * (1 + headroom_percent)

Throughput_capacity = N_replicas * tokens_per_second_per_replica
Latency_budget      = target_p99_latency - network_overhead - queue_wait
Max_concurrent      = Throughput_capacity * Latency_budget   (Little's law)

Scaling_headroom    = max_replicas / min_replicas   (recommended: 3-5x)
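The sizing arithmetic can be sketched as a few helper functions. The sample inputs below are illustrative, not the reference deployment's exact figures:

```python
import math

def required_cpu_nodes(total_cpu_requests: float, node_allocatable_cpu: float) -> int:
    """Nodes needed to satisfy aggregate CPU requests (allocatable, not raw, vCPU)."""
    return math.ceil(total_cpu_requests / node_allocatable_cpu)

def required_gpu_nodes(total_gpu_requests: int, gpus_per_node: int) -> int:
    """Nodes needed to satisfy aggregate GPU requests."""
    return math.ceil(total_gpu_requests / gpus_per_node)

def required_memory_gb(total_memory_requests_gb: float, headroom: float = 0.25) -> float:
    """Aggregate memory requests inflated by a headroom fraction."""
    return total_memory_requests_gb * (1 + headroom)

def max_concurrent(n_replicas: int, reqs_per_sec_per_replica: float,
                   latency_budget_s: float) -> float:
    """Little's law: in-flight work = arrival rate * time in system."""
    return n_replicas * reqs_per_sec_per_replica * latency_budget_s

# Example: 40 vCPU of requests on nodes with ~7 allocatable vCPU -> 6 nodes.
print(required_cpu_nodes(40, 7))            # → 6
print(required_gpu_nodes(12, 4))            # → 3
print(max_concurrent(4, 2.0, 3.0))          # → 24.0
```

Note that allocatable capacity is lower than the instance's nominal vCPU/memory once the kubelet, system daemons, and eviction thresholds reserve their share; capacity plans should use allocatable figures from `kubectl describe node`.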

For the 50-agent deployment:

  • Total CPU requests: ~150 vCPU (3 vCPU average per agent)
  • Total GPU requests: 12 GPUs (24% of agents require GPU)
  • Total memory requests: ~300 GB (6 GB average per agent)
  • Peak throughput: ~5,000 tokens/second aggregate
  • P99 latency target: < 5 seconds for inference agents

11. References

  1. Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). Borg, Omega, and Kubernetes. ACM Queue, 14(1), 70-93. DOI: 10.1145/2898442.2898444

  2. Cloud Native Computing Foundation. (2025). CNCF Annual Survey 2025: Kubernetes Adoption and Trends. https://www.cncf.io/reports/cncf-annual-survey-2025/

  3. Kubernetes Authors. (2025). Custom Resources. Kubernetes Documentation. https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

  4. Kubernetes Authors. (2025). Operator Pattern. Kubernetes Documentation. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

  5. Operator SDK Authors. (2025). Building Operators with Operator SDK. https://sdk.operatorframework.io/

  6. Kubernetes Authors. (2025). Horizontal Pod Autoscaler. Kubernetes Documentation. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

  7. KEDA Authors. (2025). KEDA: Kubernetes Event-driven Autoscaling. https://keda.sh/docs/

  8. Kubernetes Authors. (2025). Vertical Pod Autoscaler. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler

  9. Istio Authors. (2025). Istio Service Mesh Architecture. https://istio.io/latest/docs/ops/deployment/architecture/

  10. Linkerd Authors. (2025). Linkerd Architecture. https://linkerd.io/2/reference/architecture/

  11. Kubernetes Authors. (2025). Network Policies. Kubernetes Documentation. https://kubernetes.io/docs/concepts/services-networking/network-policies/

  12. Kubernetes Authors. (2025). Persistent Volumes. Kubernetes Documentation. https://kubernetes.io/docs/concepts/storage/persistent-volumes/

  13. Kubernetes Authors. (2025). StatefulSets. Kubernetes Documentation. https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/

  14. OpenTelemetry Authors. (2025). OpenTelemetry Collector. https://opentelemetry.io/docs/collector/

  15. Prometheus Authors. (2025). Prometheus Monitoring System. https://prometheus.io/docs/

  16. Grafana Labs. (2025). Grafana Loki: Log Aggregation System. https://grafana.com/docs/loki/latest/

  17. Kubernetes Federation v2 Authors. (2025). KubeFed: Kubernetes Federation v2. https://github.com/kubernetes-sigs/kubefed

  18. Cluster API Authors. (2025). Cluster API Documentation. https://cluster-api.sigs.k8s.io/

  19. Kubernetes Authors. (2025). Pod Security Standards. Kubernetes Documentation. https://kubernetes.io/docs/concepts/security/pod-security-standards/

  20. gVisor Authors. (2025). gVisor: Application Kernel for Containers. https://gvisor.dev/docs/

  21. Open Policy Agent Authors. (2025). OPA Gatekeeper. https://open-policy-agent.github.io/gatekeeper/

  22. NVIDIA. (2025). NVIDIA Device Plugin for Kubernetes. https://github.com/NVIDIA/k8s-device-plugin

  23. NVIDIA. (2025). Multi-Instance GPU User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

  24. Sigstore Authors. (2025). Cosign: Container Signing. https://docs.sigstore.dev/cosign/

  25. Aqua Security. (2025). Trivy: Comprehensive Vulnerability Scanner. https://trivy.dev/

  26. BlueFly.io. (2026). Open Standard for Sustainable Agents (OSSA) v0.3.3 Specification. https://gitlab.com/blueflyio/openstandardagents

  27. BlueFly.io. (2026). Agent Platform Technical Documentation. https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/home

  28. Kubernetes Authors. (2025). Resource Management for Pods and Containers. https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

  29. Kubernetes Authors. (2025). Scheduling, Preemption, and Eviction. https://kubernetes.io/docs/concepts/scheduling-eviction/

  30. etcd Authors. (2025). etcd Performance and Tuning. https://etcd.io/docs/v3.5/op-guide/performance/


Appendix A: Glossary

| Term | Definition |
|---|---|
| CRD | Custom Resource Definition: extends the Kubernetes API with new resource types |
| HPA | Horizontal Pod Autoscaler: adjusts replica count based on metrics |
| VPA | Vertical Pod Autoscaler: adjusts resource requests/limits per pod |
| KEDA | Kubernetes Event-Driven Autoscaling: scales based on event sources |
| OSSA | Open Standard for Sustainable Agents: BlueFly.io's agent specification |
| mTLS | Mutual TLS: bidirectional certificate-based authentication |
| OPA | Open Policy Agent: policy enforcement engine |
| CSI | Container Storage Interface: standard for storage plugins |
| MIG | Multi-Instance GPU: NVIDIA technology for GPU partitioning |
| RBAC | Role-Based Access Control: Kubernetes authorization mechanism |
| CRI | Container Runtime Interface: standard for container runtimes |
| DAG | Directed Acyclic Graph: used for workflow step ordering |
| PVC | Persistent Volume Claim: storage request in Kubernetes |
| OTLP | OpenTelemetry Protocol: telemetry data transport protocol |

Appendix B: Checklist for Production Readiness

  • Agent CRDs deployed and validated with OpenAPI v3 schema
  • Agent Operator running with 3 replicas and leader election
  • HPA configured with custom metrics (tokens/sec, queue depth)
  • KEDA ScaledObjects for event-driven scaling with scale-to-zero
  • VPA in Initial mode for GPU memory right-sizing
  • Istio service mesh with STRICT mTLS enabled
  • Network policies applied to all agent pods
  • Pod Security Standards enforced at Restricted level
  • gVisor RuntimeClass for code-execution agents
  • RBAC roles aligned with OSSA access tiers
  • OPA policies for tier-based resource limits
  • OpenTelemetry Collector DaemonSet deployed
  • Prometheus scraping agent metrics endpoints
  • Grafana dashboards for fleet overview and individual agents
  • Alert rules for agent health, latency, GPU pressure, and queue backlog
  • Loki log aggregation with appropriate retention policies
  • Container image signing with Cosign
  • Vulnerability scanning with Trivy in CI pipeline
  • SBOM generation for all agent images
  • Resource quotas applied at namespace level
  • PVCs provisioned with appropriate StorageClass (SSD for vector DBs)
  • Model weight caching strategy implemented (cluster/node/pod tiers)
  • Backup strategy for persistent agent state
  • Disaster recovery plan documented and tested
  • Capacity planning formulas validated against actual usage

This whitepaper is part of the BlueFly.io Agent Platform Whitepaper Series. For questions, contributions, or errata, please open an issue at https://gitlab.com/blueflyio/agent-platform/technical-docs/-/issues.
