Kubernetes-Native Agent Orchestration: Custom Resources, Operators, and Cloud-Native Patterns for AI Agent Deployment
BlueFly.io Agent Platform -- Whitepaper Series #4
Version: 1.0
Date: February 2026
Classification: Technical Reference
Audience: Platform Engineers, SREs, AI Infrastructure Architects
Abstract
The convergence of autonomous AI agents and cloud-native infrastructure presents both an unprecedented opportunity and a formidable engineering challenge. As organizations scale from single-agent prototypes to fleets of hundreds of cooperating agents, the operational complexity of provisioning, scaling, monitoring, and securing these workloads demands a principled orchestration layer. Kubernetes, with its declarative state model, extensible API machinery, and mature ecosystem, provides a natural substrate for this orchestration.
This whitepaper presents a complete architecture for Kubernetes-native AI agent orchestration. We define Custom Resource Definitions (CRDs) that encode agent specifications, pool topologies, and workflow graphs as first-class Kubernetes objects. We detail an Operator pattern that implements a full reconciliation loop with state-machine semantics, leader election for high availability, and graceful degradation under failure. We address the unique scaling requirements of AI workloads through Horizontal Pod Autoscalers driven by custom metrics (tokens per second, queue depth, active tasks), Vertical Pod Autoscalers for right-sizing GPU memory allocations, and KEDA-based event-driven scaling for bursty inference workloads. Networking patterns encompass service mesh integration with mutual TLS, gRPC load balancing, and network policy isolation. Storage strategies cover persistent vector databases via StatefulSets, ephemeral scratch volumes for model weights, and CSI driver selection for throughput-sensitive workloads. Observability is treated as a first-class concern through OpenTelemetry instrumentation, Prometheus metrics pipelines, Grafana dashboards, and structured logging with Loki. We extend the architecture to multi-cluster federation for geographic distribution and regulatory compliance, and harden the deployment with Pod Security Standards, gVisor sandboxing, OPA policy enforcement, and least-privilege RBAC. A reference architecture for a 50-agent production deployment provides concrete manifests, cost models, and capacity planning formulas. Throughout, we ground our recommendations in production experience operating agent fleets at scale and in the broader cloud-native community's best practices as codified by the CNCF.
The architecture described herein is aligned with the Open Standard for Sustainable Agents (OSSA) v0.3.3 specification and the BlueFly.io Agent Platform's separation-of-duties model. All Kubernetes manifests, CRD schemas, and operator pseudocode are provided as actionable reference implementations.
1. Why Kubernetes for AI Agents
1.1 The Operational Gap
The AI agent landscape has evolved rapidly from research prototypes to production systems that must meet enterprise reliability standards. An agent that performs well in a notebook or a single-process deployment quickly encounters operational challenges when deployed at scale: how do you restart it when it crashes? How do you scale it when load increases? How do you roll out a new model version without downtime? How do you enforce resource limits so that a runaway agent does not consume an entire cluster's GPU allocation?
These are precisely the problems that container orchestration platforms were designed to solve. Kubernetes, as the dominant container orchestration platform with 96% of organizations either using or evaluating it according to the CNCF Annual Survey 2025, provides a battle-tested foundation for addressing these operational concerns.
1.2 Declarative Desired State and Agent Specifications
The fundamental insight that makes Kubernetes suitable for agent orchestration is its declarative model. Rather than writing imperative scripts that specify how to deploy an agent (start process A, then configure network B, then attach volume C), operators declare what they want (an agent with these capabilities, this model, these resource limits, and this scaling policy) and the Kubernetes control plane continuously reconciles the actual state of the world with the desired state.
This model maps naturally to agent specifications. An OSSA agent manifest already declares the agent's identity, capabilities, access tier, and resource requirements in a declarative format. A Kubernetes CRD extends this with operational semantics: replica count, health check endpoints, scaling triggers, affinity rules, and upgrade strategies. The resulting object is simultaneously a complete description of what the agent is (its functional specification) and how it should be operated (its operational specification).
Declarative Alignment:
OSSA Manifest Kubernetes CRD
-------------- --------------
agent.name ---> metadata.name
agent.capabilities ---> spec.capabilities[]
agent.model ---> spec.runtime.model
agent.tier ---> spec.security.accessTier
(not specified) ---> spec.scaling (HPA policy)
(not specified) ---> spec.resources (CPU/GPU/memory)
(not specified) ---> spec.networking (service mesh)
(not specified) ---> spec.observability (metrics/tracing)
Figure 1: Declarative alignment between OSSA agent manifests and Kubernetes CRDs.
1.3 Architecture Tiers and Cost Models
Not every organization requires a full multi-cluster federation with GPU scheduling and service mesh. We define three architecture tiers that allow organizations to adopt Kubernetes-native agent orchestration incrementally.
Table 1: Architecture Tiers
| Tier | Monthly Cost | Nodes | Agents | GPU | Features |
|---|---|---|---|---|---|
| Small | $100-500 | 1-3 | 1-10 | None/shared | CRDs, basic operator, HPA, Prometheus |
| Medium | $1,000-5,000 | 5-15 | 10-50 | 1-4 dedicated | Full operator, KEDA, Istio, VPA, OPA |
| Large | $10,000+ | 20-100+ | 50-500+ | 8+ dedicated | Multi-cluster, federation, custom schedulers |
The small tier is achievable on managed Kubernetes services (EKS, GKE, AKS) with a single node pool and provides the foundational CRD and operator patterns. The medium tier adds GPU scheduling, event-driven scaling, and service mesh security. The large tier extends to multi-cluster federation, custom scheduling algorithms, and dedicated GPU pools with preemption policies.
The cost formula for capacity planning at any tier is:
Monthly_Cost = (N_cpu_nodes * C_cpu) + (N_gpu_nodes * C_gpu) + (S_pv * C_storage) + (E_gb * C_egress) + C_managed
Where:
N_cpu_nodes = number of CPU-only nodes
C_cpu = cost per CPU node per month ($50-200 for cloud instances)
N_gpu_nodes = number of GPU nodes
C_gpu = cost per GPU node per month ($500-3000 depending on GPU class)
S_pv = total persistent volume storage in GB
C_storage = cost per GB per month ($0.10-0.30 for SSD, $0.04-0.08 for HDD)
E_gb = monthly egress in GB
C_egress = cost per GB egress ($0.08-0.12 for major cloud providers)
C_managed = managed K8s control plane cost ($0-74/mo depending on provider)
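As a worked example, the formula can be evaluated directly. The node, storage, and egress prices below are illustrative assumptions drawn from the ranges quoted above, not quotes from any particular provider.

```python
def monthly_cost(n_cpu_nodes, c_cpu, n_gpu_nodes, c_gpu,
                 s_pv_gb, c_storage, e_gb, c_egress, c_managed):
    # Monthly_Cost = (N_cpu_nodes * C_cpu) + (N_gpu_nodes * C_gpu)
    #              + (S_pv * C_storage) + (E_gb * C_egress) + C_managed
    return (n_cpu_nodes * c_cpu + n_gpu_nodes * c_gpu
            + s_pv_gb * c_storage + e_gb * c_egress + c_managed)

# Hypothetical medium-tier deployment: 8 CPU nodes at $120, 2 GPU nodes
# at $1,500, 500 GB SSD at $0.15/GB, 200 GB egress at $0.09/GB, and a
# $73/month managed control plane.
cost = monthly_cost(8, 120, 2, 1500, 500, 0.15, 200, 0.09, 73)
print(f"${cost:,.2f}/month")  # $4,126.00/month -- inside the medium tier's band
```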
1.4 Why Not Alternatives?
Before committing to Kubernetes, it is worth considering the alternatives.
Docker Compose: Suitable for single-machine deployments but lacks scheduling, scaling, self-healing, and multi-node support. Cannot handle GPU scheduling or affinity rules.
Nomad: A capable orchestrator with simpler operational characteristics than Kubernetes, but a significantly smaller ecosystem. The lack of CRD-equivalent extensibility means agent specifications must be encoded as job metadata rather than first-class API objects.
Serverless (Lambda/Cloud Run): Attractive for stateless, short-lived inference workloads but fundamentally misaligned with long-running, stateful agents that maintain conversation context, vector store connections, and tool registrations. Cold start latencies of 1-10 seconds are unacceptable for real-time agent interactions.
Custom Orchestration: Building a bespoke orchestration layer is always an option, but it means reimplementing scheduling, health checking, scaling, networking, storage, and observability from scratch. The engineering cost is prohibitive for all but the largest organizations.
Kubernetes occupies the sweet spot: a mature, extensible platform with a vast ecosystem of integrations, supported by every major cloud provider, and designed from the ground up for the kind of declarative, self-healing infrastructure that agent orchestration demands.
2. Custom Resource Definitions
2.1 The Agent CRD
The Agent CRD is the foundational building block of the orchestration layer. It extends the Kubernetes API with a new resource type that encodes everything the operator needs to know about an agent: its runtime configuration, resource requirements, scaling policy, security posture, and observability settings.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: agents.ossa.ai
  annotations:
    api-approved.kubernetes.io: "https://ossa.ai/api-review/agents"
spec:
  group: ossa.ai
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
        scale:
          specReplicasPath: .spec.scaling.minReplicas
          statusReplicasPath: .status.replicas
      additionalPrinterColumns:
        - name: State
          type: string
          jsonPath: .status.state
        - name: Replicas
          type: integer
          jsonPath: .status.replicas
        - name: Model
          type: string
          jsonPath: .spec.runtime.model
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
      schema:
        openAPIV3Schema:
          type: object
          required: [spec]
          properties:
            spec:
              type: object
              required: [runtime, capabilities]
              properties:
                runtime:
                  type: object
                  required: [model, image]
                  properties:
                    model:
                      type: string
                      description: "LLM model identifier"
                      pattern: "^[a-z0-9-]+/[a-z0-9._-]+:[a-z0-9._-]+$"
                    image:
                      type: string
                      description: "Container image for the agent runtime"
                    command:
                      type: array
                      items:
                        type: string
                    env:
                      type: array
                      items:
                        type: object
                        properties:
                          name:
                            type: string
                          value:
                            type: string
                          valueFrom:
                            type: object
                            properties:
                              secretKeyRef:
                                type: object
                                properties:
                                  name:
                                    type: string
                                  key:
                                    type: string
                    providerEndpoint:
                      type: string
                      format: uri
                    maxTokensPerRequest:
                      type: integer
                      minimum: 1
                      maximum: 200000
                      default: 4096
                    temperature:
                      type: number
                      minimum: 0.0
                      maximum: 2.0
                      default: 0.7
                capabilities:
                  type: array
                  minItems: 1
                  items:
                    type: object
                    required: [name, type]
                    properties:
                      name:
                        type: string
                      type:
                        type: string
                        enum: [tool, skill, protocol, sensor]
                      version:
                        type: string
                      config:
                        type: object
                        x-kubernetes-preserve-unknown-fields: true
                resources:
                  type: object
                  properties:
                    requests:
                      type: object
                      properties:
                        cpu:
                          type: string
                          pattern: "^[0-9]+m?$"
                        memory:
                          type: string
                          pattern: "^[0-9]+(Mi|Gi)$"
                        nvidia.com/gpu:
                          type: integer
                          minimum: 0
                    limits:
                      type: object
                      properties:
                        cpu:
                          type: string
                        memory:
                          type: string
                        nvidia.com/gpu:
                          type: integer
                scaling:
                  type: object
                  properties:
                    minReplicas:
                      type: integer
                      minimum: 0
                      default: 1
                    maxReplicas:
                      type: integer
                      minimum: 1
                      default: 10
                    metrics:
                      type: array
                      items:
                        type: object
                        properties:
                          type:
                            type: string
                            enum: [cpu, memory, custom, external]
                          name:
                            type: string
                          target:
                            type: object
                            properties:
                              type:
                                type: string
                                enum: [Utilization, AverageValue, Value]
                              averageValue:
                                type: string
                              averageUtilization:
                                type: integer
                    scaleDownStabilization:
                      type: integer
                      default: 300
                      description: "Seconds to wait before scaling down"
                security:
                  type: object
                  properties:
                    accessTier:
                      type: string
                      enum: [tier_1_read, tier_2_write_limited, tier_3_full_access, tier_4_policy]
                      default: tier_1_read
                    runAsNonRoot:
                      type: boolean
                      default: true
                    readOnlyRootFilesystem:
                      type: boolean
                      default: true
                    runtimeClass:
                      type: string
                      description: "RuntimeClass name (e.g., gvisor for sandboxing)"
                    networkPolicy:
                      type: object
                      properties:
                        allowEgress:
                          type: array
                          items:
                            type: object
                            properties:
                              host:
                                type: string
                              port:
                                type: integer
                        denyIngress:
                          type: boolean
                          default: false
                observability:
                  type: object
                  properties:
                    metricsPort:
                      type: integer
                      default: 9090
                    metricsPath:
                      type: string
                      default: "/metrics"
                    tracingEnabled:
                      type: boolean
                      default: true
                    logLevel:
                      type: string
                      enum: [debug, info, warn, error]
                      default: info
            status:
              type: object
              properties:
                state:
                  type: string
                  enum: [Pending, Initializing, Running, Degraded, Terminating, Failed]
                replicas:
                  type: integer
                readyReplicas:
                  type: integer
                lastTransitionTime:
                  type: string
                  format: date-time
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum: ["True", "False", "Unknown"]
                      reason:
                        type: string
                      message:
                        type: string
                      lastTransitionTime:
                        type: string
                        format: date-time
                metrics:
                  type: object
                  properties:
                    tokensPerSecond:
                      type: number
                    activeTaskCount:
                      type: integer
                    averageLatencyMs:
                      type: number
                    errorRate:
                      type: number
  scope: Namespaced
  names:
    plural: agents
    singular: agent
    kind: Agent
    shortNames:
      - ag
    categories:
      - ossa
      - ai
2.2 The AgentPool CRD
While individual Agent resources describe single agent types, the AgentPool CRD manages a logical group of agents that share infrastructure resources and scaling policies. An AgentPool defines node affinity, GPU allocation strategies, and pool-level resource quotas.
apiVersion: ossa.ai/v1
kind: AgentPool
metadata:
  name: inference-pool
  namespace: agent-system
spec:
  nodeSelector:
    accelerator: nvidia-a100
    topology.kubernetes.io/zone: us-east-1a
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  resourceQuota:
    requests.cpu: "64"
    requests.memory: "256Gi"
    requests.nvidia.com/gpu: "8"
    limits.cpu: "128"
    limits.memory: "512Gi"
    limits.nvidia.com/gpu: "8"
  agents:
    - name: code-reviewer
      replicas: 3
    - name: security-scanner
      replicas: 2
    - name: test-generator
      replicas: 5
  scheduling:
    strategy: BinPacking
    preemptionPolicy: PreemptLowerPriority
    priorityClassName: high-priority-agents
2.3 The AgentWorkflow CRD
The AgentWorkflow CRD encodes multi-agent workflows as directed acyclic graphs (DAGs) with typed edges representing data flow between agents.
apiVersion: ossa.ai/v1
kind: AgentWorkflow
metadata:
  name: code-review-pipeline
  namespace: agent-system
spec:
  entrypoint: analyze
  timeout: 600
  retryPolicy:
    maxRetries: 3
    backoff: exponential
  steps:
    - name: analyze
      agentRef: code-analyzer
      inputs:
        - name: repository
          type: git-url
      outputs:
        - name: analysis-report
          type: json
      next:
        - review
        - security-scan
    - name: review
      agentRef: code-reviewer
      inputs:
        - name: analysis-report
          fromStep: analyze
      outputs:
        - name: review-comments
          type: json
      next:
        - aggregate
    - name: security-scan
      agentRef: security-scanner
      inputs:
        - name: analysis-report
          fromStep: analyze
      outputs:
        - name: security-findings
          type: json
      next:
        - aggregate
    - name: aggregate
      agentRef: report-aggregator
      inputs:
        - name: review-comments
          fromStep: review
        - name: security-findings
          fromStep: security-scan
      outputs:
        - name: final-report
          type: json
  onFailure:
    step: notify-team
    agentRef: notification-agent
2.4 etcd Storage Considerations
Every CRD instance is stored in etcd as a key-value pair. The storage footprint per agent resource can be estimated as:
storage_per_agent = base_overhead + spec_size + status_size
Where:
base_overhead = ~1.5 KB (key prefix, metadata, timestamps, resourceVersion)
spec_size = 0.5 - 3.0 KB (depending on capabilities list and env vars)
status_size = 0.2 - 1.0 KB (conditions array, metrics snapshot)
Total per agent: ~2 - 5.5 KB (typically 2-4 KB)
For a deployment of 500 agents with an average total footprint of 5 KB per agent (a 3 KB spec plus base overhead and status), the total etcd storage is approximately 2.5 MB, well within etcd's recommended maximum database size of 8 GB. The watch event rate is the more significant factor: every status update generates a watch event that all operator replicas must process. At 500 agents updating status every 30 seconds, this produces approximately 17 events per second, which is comfortably within etcd's operating range of several thousand events per second.
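These estimates follow directly from the formula above; the sketch below uses the midpoint figures quoted in the text and is a planning heuristic, not a measurement.

```python
def etcd_footprint_mb(n_agents, base_kb=1.5, spec_kb=3.0, status_kb=0.5):
    # storage_per_agent = base_overhead + spec_size + status_size
    return n_agents * (base_kb + spec_kb + status_kb) / 1024.0

def watch_events_per_second(n_agents, status_interval_s=30):
    # One watch event per agent per status-update interval.
    return n_agents / status_interval_s

print(f"{etcd_footprint_mb(500):.2f} MB")          # 2.44 MB
print(f"{watch_events_per_second(500):.1f} events/s")  # 16.7 events/s
```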
CRD Lifecycle Data Flow:
User/CI API Server etcd Operator Kubelet
| | | | |
|--- apply Agent CR -->| | | |
| |--- store ------->| | |
| |--- watch event ----------------->| |
| | | |--- reconcile |
| | | | observe |
| | | | diff |
| | | | act ------->|
| | | | |--- create pod
| | | | |--- pull image
| | | | |--- start container
| |<-- status update ---------------| |
| |--- store ------->| | |
|<-- event notification| | | |
Figure 2: CRD lifecycle data flow from creation through reconciliation to pod scheduling.
2.5 Versioning and Migration
CRD versioning follows the Kubernetes API versioning convention. When the schema evolves (for example, when a new field is added to the agent specification), a new API version is introduced and the previous one is eventually deprecated (e.g., ossa.ai/v1beta2 is promoted to ossa.ai/v1). A conversion webhook ensures that existing resources are served in whichever version a client requests, without downtime. The operator maintains backward compatibility by supporting reads from all served versions and writing only the storage version.
3. Agent Operator Pattern
3.1 Operator Architecture
The Agent Operator is a Kubernetes controller that watches for changes to Agent, AgentPool, and AgentWorkflow resources and reconciles the cluster state to match the declared specifications. It is built using the Operator SDK framework, which provides scaffolding for controller registration, leader election, metrics exposition, and webhook configuration.
The operator runs as a Deployment with multiple replicas for high availability, but only one replica (the leader) actively processes reconciliation events at any given time. The remaining replicas stand by as hot standbys, ready to assume leadership within seconds if the leader fails.
Operator Architecture:
+------------------------------------------------------------------+
| Agent Operator Deployment |
| |
| +-------------------+ +-------------------+ +---------------+ |
| | Replica 1 | | Replica 2 | | Replica 3 | |
| | (LEADER) | | (STANDBY) | | (STANDBY) | |
| | | | | | | |
| | +-------------+ | | +-------------+ | | +----------+ | |
| | | Reconciler | | | | Reconciler | | | |Reconciler| | |
| | | Loop | | | | (paused) | | | |(paused) | | |
| | +------+------+ | | +-------------+ | | +----------+ | |
| | | | | | | | |
| | +------v------+ | | | | | |
| | | State | | | | | | |
| | | Machine | | | | | | |
| | +------+------+ | | | | | |
| | | | | | | | |
| | +------v------+ | | | | | |
| | | K8s Client | | | | | | |
| | +-------------+ | | | | | |
| +-------------------+ +-------------------+ +---------------+ |
| |
| +-------------------------------------------------------------+ |
| | Leader Election (Lease) | |
| +-------------------------------------------------------------+ |
+------------------------------------------------------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
| Agent CRs | | AgentPool CRs | | AgentWorkflow |
| (watch) | | (watch) | | CRs (watch) |
+------------------+ +------------------+ +------------------+
Figure 3: Operator architecture with leader election and multi-replica standby.
3.2 Reconciliation Loop
The reconciliation loop is the heart of the operator. It follows the standard observe-diff-act pattern, but with agent-specific logic for model loading, capability registration, and health assessment.
Reconciliation Pseudocode:
function reconcile(agent: Agent) -> Result {
// OBSERVE: Gather current state
currentPods = listPods(labelSelector: agent.metadata.name)
currentService = getService(agent.metadata.name)
currentHPA = getHPA(agent.metadata.name)
currentNetworkPolicy = getNetworkPolicy(agent.metadata.name)
// DIFF: Compare desired vs actual
desiredReplicas = agent.spec.scaling.minReplicas
actualReplicas = len(currentPods.filter(phase == Running))
desiredImage = agent.spec.runtime.image
actualImages = currentPods.map(p => p.spec.containers[0].image).unique()
desiredModel = agent.spec.runtime.model
actualModelStatus = currentPods.map(p => p.annotations["ossa.ai/model-loaded"])
// ACT: Apply changes based on diff
// Phase 1: Ensure base resources exist
if currentService == null {
createService(agent)
updateStatus(agent, state: "Initializing", reason: "CreatingService")
return requeue(after: 5s)
}
if currentNetworkPolicy == null && agent.spec.security.networkPolicy != null {
createNetworkPolicy(agent)
}
// Phase 2: Pod management
if actualReplicas < desiredReplicas {
// Scale up: create pods with anti-affinity for spread
deficit = desiredReplicas - actualReplicas
for i in range(deficit) {
pod = buildAgentPod(agent, ordinal: actualReplicas + i)
applySecurityContext(pod, agent.spec.security)
applyResourceLimits(pod, agent.spec.resources)
injectObservabilitySidecar(pod, agent.spec.observability)
createPod(pod)
}
updateStatus(agent, state: "Initializing", reason: "ScalingUp")
return requeue(after: 15s)
}
if actualReplicas > desiredReplicas {
// Scale down: terminate excess pods (newest first)
excess = actualReplicas - desiredReplicas
podsToTerminate = currentPods.sortBy(creationTimestamp, desc).take(excess)
for pod in podsToTerminate {
// Graceful shutdown: drain active tasks first
drainAgent(pod, timeout: 60s)
deletePod(pod)
}
updateStatus(agent, state: "Running", reason: "ScalingDown")
return requeue(after: 30s)
}
// Phase 3: Image/model update (rolling update)
if len(actualImages) > 0 && actualImages[0] != desiredImage {
performRollingUpdate(agent, currentPods, desiredImage)
updateStatus(agent, state: "Initializing", reason: "RollingUpdate")
return requeue(after: 10s)
}
// Phase 4: Health assessment
healthyPods = currentPods.filter(p => p.status.conditions.ready == true)
unhealthyPods = currentPods.filter(p => p.status.conditions.ready == false)
if len(unhealthyPods) > 0 && len(healthyPods) < desiredReplicas {
updateStatus(agent, state: "Degraded",
reason: fmt("{} of {} replicas unhealthy", len(unhealthyPods), desiredReplicas))
// Attempt recovery for pods stuck in CrashLoopBackOff
for pod in unhealthyPods {
if pod.status.containerStatuses[0].restartCount > 5 {
deletePod(pod) // Let the next reconciliation recreate it
}
}
return requeue(after: 30s)
}
// Phase 5: HPA management
if agent.spec.scaling.metrics != null && len(agent.spec.scaling.metrics) > 0 {
if currentHPA == null {
createHPA(agent)
} else {
updateHPA(agent, currentHPA)
}
}
// Phase 6: Steady state
updateStatus(agent, state: "Running",
replicas: len(healthyPods),
readyReplicas: len(healthyPods),
metrics: collectMetrics(healthyPods))
return requeue(after: 60s) // Periodic reconciliation
}
3.3 State Machine
The agent lifecycle is modeled as a finite state machine with well-defined transitions and invariants.
Table 2: Agent State Machine Transitions
| Current State | Event | Next State | Actions |
|---|---|---|---|
| (none) | CR created | Pending | Validate spec, set initial status |
| Pending | Resources available | Initializing | Create Service, Pods, NetworkPolicy |
| Pending | Resources unavailable | Pending | Set condition "ResourcesUnavailable" |
| Initializing | All pods ready | Running | Enable HPA, register in mesh |
| Initializing | Pod failure | Degraded | Log error, attempt restart |
| Initializing | Timeout (5 min) | Failed | Set condition "InitializationTimeout" |
| Running | Health check pass | Running | Update metrics in status |
| Running | Partial pod failure | Degraded | Scale replacement, alert |
| Running | All pods fail | Failed | Attempt full restart |
| Running | Spec change | Initializing | Begin rolling update |
| Running | CR deleted | Terminating | Drain tasks, delete resources |
| Degraded | Recovery | Running | Clear degraded condition |
| Degraded | Persistent failure | Failed | Escalate alert, stop retries |
| Degraded | CR deleted | Terminating | Force delete resources |
| Terminating | All resources deleted | (none) | Remove finalizer |
| Failed | User intervention | Pending | Reset state, retry |
| Failed | CR deleted | Terminating | Cleanup remaining resources |
State Machine Diagram:
+----------+
create --> | Pending |
+----+-----+
|
resources | resources
available | unavailable
| (loop)
+----v-----+
|Initializ-|
| ing |
+----+-----+
/ | \
all pods/ | \timeout
ready / |pod \
/ |failure \
+----v---+ +---v----+ +-v------+
| Running| |Degraded| | Failed |
+----+---+ +---+----+ +---+----+
| | |
spec | recovery | user |
change| | action |
| | |
+-----------+----------+
|
CR deleted
|
+------v------+
| Terminating |
+------+------+
|
resources
cleaned up
|
(removed)
Figure 4: Agent lifecycle state machine with transitions.
3.4 Leader Election
The operator uses Kubernetes Lease objects for leader election. The leader acquires a lease with a configurable duration (default: 15 seconds) and renews it periodically (default: every 10 seconds). If the leader fails to renew the lease, another replica acquires it within the lease duration plus a brief jitter period.
The leader election configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-operator
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-operator
  template:
    metadata:
      labels:
        app: agent-operator
    spec:
      serviceAccountName: agent-operator
      containers:
        - name: operator
          image: registry.gitlab.com/blueflyio/agent-operator:v1.0.0
          args:
            - --leader-elect=true
            - --leader-election-id=agent-operator-leader
            - --leader-election-namespace=agent-system
            - --leader-election-lease-duration=15s
            - --leader-election-renew-deadline=10s
            - --leader-election-retry-period=2s
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
          livenessProbe:
            httpGet:
              path: /healthz
              port: health
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
The maximum failover time can be calculated as:
T_failover = lease_duration + retry_period + reconciliation_backoff
= 15s + 2s + 5s
= 22 seconds (worst case)
In practice, failover typically completes within 10-15 seconds because the standby replica's lease acquisition attempt aligns with the expired lease boundary.
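The worst-case bound can be checked against the leader-election flags in the Deployment above:

```python
def worst_case_failover_s(lease_duration, retry_period, reconcile_backoff):
    # T_failover = lease_duration + retry_period + reconciliation_backoff
    return lease_duration + retry_period + reconcile_backoff

# Values from the operator args: 15s lease, 2s retry period, plus an
# assumed 5s reconciliation backoff on the newly elected leader.
print(worst_case_failover_s(15, 2, 5))  # 22
```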
3.5 Finalizers and Graceful Cleanup
The operator attaches a finalizer (ossa.ai/agent-cleanup) to every Agent resource. When the user deletes an Agent CR, Kubernetes marks it for deletion but does not remove it from etcd until all finalizers are cleared. The operator's reconciliation loop detects the deletion timestamp, transitions the agent to the Terminating state, drains active tasks from all pods, deletes subordinate resources (Pods, Services, HPAs, NetworkPolicies), and finally removes the finalizer, allowing Kubernetes to complete the deletion.
This ensures that active agent tasks are not abruptly terminated and that orphaned resources are not left in the cluster.
4. Agent Scaling
4.1 Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler adjusts the number of agent replicas based on observed metrics. For AI agents, the most relevant metrics are not traditional CPU and memory utilization but rather domain-specific metrics like tokens per second, active task count, and request queue depth.
The HPA scaling formula is:
desiredReplicas = ceil(currentMetricValue / targetMetricValue * currentReplicas)
Stabilization (to prevent replica-count flapping):
scaleUp: min(recommendations over the last scaleUpStabilization seconds)
scaleDown: max(recommendations over the last scaleDownStabilization seconds)
With tolerance band (default 10%):
if abs(1 - currentMetricValue/targetMetricValue) < 0.1:
desiredReplicas = currentReplicas // no change (within tolerance)
For example, if an agent is currently running 3 replicas, each averaging 150 tokens per second against a target of 60 tokens per second per replica:
desiredReplicas = ceil(3 * 150 / 60) = ceil(7.5) = 8
When scaling down, the stabilization window prevents premature scale-down by selecting the highest recommendation in the window:
desiredReplicas = max(recommendations[last 300s])
// If recommendations over the last 5 minutes were [8, 7, 6, 5, 5]:
desiredReplicas = 8 // scale-down deferred until 8 ages out of the window
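The calculation above can be sketched in a few lines. This is a simplification of the real HPA controller; note that the upstream controller's scale-down stabilization selects the highest recommendation in the window (the conservative choice that defers shrinkage).

```python
import math

def desired_replicas(current_replicas, current_avg, target, tolerance=0.1):
    # desiredReplicas = ceil(currentReplicas * currentMetricValue / target),
    # with no change while the ratio sits inside the tolerance band.
    ratio = current_avg / target
    if abs(1.0 - ratio) < tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

def stabilized_scale_down(window_recommendations):
    # Take the largest recommendation in the stabilization window.
    return max(window_recommendations)

print(desired_replicas(3, 150, 60))            # 8
print(desired_replicas(3, 62, 60))             # 3 (within 10% tolerance)
print(stabilized_scale_down([8, 7, 6, 5, 5]))  # 8
```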
4.2 Custom Metrics for Agent Workloads
Standard CPU and memory metrics are insufficient for intelligent agent scaling. The following custom metrics provide the signals needed for responsive, cost-effective scaling.
Table 3: Custom Metrics for Agent Scaling
| Metric | Type | Description | Target Range | Scaling Behavior |
|---|---|---|---|---|
| agent_tokens_per_second | Pods | Token throughput per replica | 50-100 tps | Scale up when throughput saturates |
| agent_active_tasks | Pods | Currently executing tasks per replica | 1-5 tasks | Scale up when concurrency is high |
| agent_queue_depth | External | Pending tasks in message queue | 0-10 items | Scale up proactively before saturation |
| agent_request_latency_p99 | Pods | 99th percentile response latency | < 2000 ms | Scale up when latency degrades |
| agent_error_rate | Pods | Error rate over 5-minute window | < 0.01 (1%) | Scale up if errors stem from overload |
| agent_gpu_utilization | Pods | GPU compute utilization percentage | 60-80% | Scale up for GPU-bound workloads |
| agent_model_cache_hit_rate | Pods | KV cache hit rate for model inference | > 0.90 (90%) | Scale up if cache pressure is high |
4.3 Vertical Pod Autoscaler (VPA)
While HPA adjusts replica count, VPA adjusts the resource requests and limits of individual pods. For AI agents, VPA is particularly valuable for right-sizing GPU memory allocations. A model that was initially allocated 16 GB of GPU memory may only use 11 GB in practice; VPA can reduce the request to 12 GB (with headroom), freeing 4 GB for other workloads on the same GPU node.
VPA operates in three modes:
- Off: VPA generates recommendations but does not apply them. Useful for initial observation.
- Initial: VPA sets resource requests only at pod creation time. No disruptive restarts.
- Auto: VPA evicts and recreates pods when resource requests need significant adjustment.
For agent workloads, the Initial mode is recommended for production because it avoids disruptive pod restarts that would interrupt active agent tasks. The Auto mode is suitable for development and staging environments where task interruption is acceptable.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: code-reviewer-vpa
  namespace: agent-system
spec:
  targetRef:
    apiVersion: ossa.ai/v1
    kind: Agent
    name: code-reviewer
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: agent
        minAllowed:
          cpu: 250m
          memory: 512Mi
        maxAllowed:
          cpu: 4
          memory: 16Gi
          nvidia.com/gpu: 1
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
4.4 KEDA for Event-Driven Scaling
KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes scaling beyond metrics to event sources. For agent workloads, KEDA is invaluable for scaling based on message queue depth, enabling agents to scale from zero when no tasks are pending and scale up rapidly when a burst of tasks arrives.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: code-reviewer-scaler
  namespace: agent-system
spec:
  scaleTargetRef:
    apiVersion: ossa.ai/v1
    kind: Agent
    name: code-reviewer
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 20
  fallback:
    failureThreshold: 3
    replicas: 2
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: agent-tasks-code-review
        mode: QueueLength
        value: "5"
      authenticationRef:
        name: rabbitmq-auth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: agent_active_tasks
        query: |
          sum(agent_active_tasks{agent="code-reviewer"})
        threshold: "10"
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5
        end: 0 18 * * 1-5
        desiredReplicas: "3"
This configuration enables sophisticated scaling behavior. During business hours (8 AM to 6 PM Eastern Time, Monday through Friday), at least 3 replicas are maintained. Outside business hours, the agent scales to zero if no tasks are pending. When tasks arrive in the RabbitMQ queue, the agent scales up by one replica for every 5 pending tasks. The Prometheus trigger provides an additional signal based on active task concurrency across all replicas.
4.5 GPU Scheduling
GPU scheduling in Kubernetes requires the NVIDIA device plugin, which exposes nvidia.com/gpu as a schedulable resource. GPU allocation is binary at the device level: a pod either gets an entire GPU or none (fractional GPU sharing via MIG or time-slicing requires additional configuration).
Table 4: GPU Scheduling Strategies
| Strategy | Configuration | Use Case | Efficiency |
|---|---|---|---|
| Exclusive | nvidia.com/gpu: 1 | Large models (>10B params) | 40-70% utilization |
| MIG (Multi-Instance GPU) | nvidia.com/mig-3g.20gb: 1 | Medium models, multiple agents | 70-85% utilization |
| Time-Slicing | nvidia.com/gpu: 1 + time-slicing config | Small models, cost-sensitive | 80-95% utilization |
| vGPU | NVIDIA vGPU license | Enterprise, guaranteed SLAs | 60-80% utilization |
For agent workloads, MIG partitioning on A100/H100 GPUs provides the best balance of isolation and efficiency. A single A100 80GB can be partitioned into seven 10 GB instances, each running a separate agent with hardware-level memory isolation.
The GPU utilization efficiency formula:
GPU_efficiency = (sum(agent_gpu_compute_time) / (N_gpus * wall_clock_time)) * 100%
Cost_per_token = (GPU_cost_per_hour / 3600) / tokens_per_second
For example, an A100 at $3.00/hour processing 500 tokens/second:
Cost_per_token = ($3.00 / 3600) / 500 = $0.00000167 per token
Cost_per_million_tokens = $1.67
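The two formulas translate directly into code; a minimal sketch reproducing the example above:

```python
def cost_per_token(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost_per_token = (GPU_cost_per_hour / 3600) / tokens_per_second."""
    return (gpu_cost_per_hour / 3600.0) / tokens_per_second

def gpu_efficiency_percent(agent_gpu_compute_seconds: list,
                           n_gpus: int, wall_clock_seconds: float) -> float:
    """GPU_efficiency = sum(agent compute time) / (N_gpus * wall clock) * 100%."""
    return sum(agent_gpu_compute_seconds) / (n_gpus * wall_clock_seconds) * 100.0

# A100 at $3.00/hour processing 500 tokens/second:
per_million = cost_per_token(3.00, 500) * 1_000_000  # ~= $1.67 per million tokens
```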
5. Networking and Service Mesh
5.1 Kubernetes Service Model for Agents
Each Agent resource is backed by a Kubernetes Service that provides stable DNS-based discovery and load balancing. The operator creates a ClusterIP service for internal communication and optionally a headless service for StatefulSet-based agents that require stable network identities.
apiVersion: v1
kind: Service
metadata:
  name: code-reviewer
  namespace: agent-system
  labels:
    ossa.ai/agent: code-reviewer
    ossa.ai/type: inference
spec:
  selector:
    ossa.ai/agent: code-reviewer
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  type: ClusterIP
5.2 Service Mesh Integration
For production deployments, a service mesh (Istio or Linkerd) provides critical capabilities that are difficult to implement at the application level: mutual TLS for all inter-agent communication, fine-grained traffic management, circuit breaking, and distributed tracing.
Istio's sidecar proxy (Envoy) automatically encrypts all traffic between agent pods using mutual TLS, eliminating the need for agents to manage their own TLS certificates. The mesh also provides L7 load balancing for gRPC, which is essential because Kubernetes' default load balancing operates at the connection (L4) level: a long-lived HTTP/2 connection is pinned to a single backend, so every gRPC call multiplexed over it lands on the same pod. Envoy balances individual HTTP/2 requests, spreading gRPC calls across all replicas.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: agent-mtls
  namespace: agent-system
spec:
  mtls:
    mode: STRICT
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: code-reviewer-lb
  namespace: agent-system
spec:
  host: code-reviewer.agent-system.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 0
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: code-reviewer-routing
  namespace: agent-system
spec:
  hosts:
    - code-reviewer.agent-system.svc.cluster.local
  http:
    - match:
        - headers:
            x-agent-version:
              exact: "v2"
      route:
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v2
          weight: 100
    - route:
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v1
          weight: 90
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v2
          weight: 10
5.3 Network Policies
Network policies implement microsegmentation, ensuring that each agent can only communicate with the services it is authorized to access. The operator generates network policies automatically based on the agent's capability declarations and access tier.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: code-reviewer-netpol
  namespace: agent-system
spec:
  podSelector:
    matchLabels:
      ossa.ai/agent: code-reviewer
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ossa.ai/type: orchestrator
        - podSelector:
            matchLabels:
              ossa.ai/agent: report-aggregator
      ports:
        - port: 50051
          protocol: TCP
        - port: 8080
          protocol: TCP
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: qdrant
      ports:
        - port: 6334
          protocol: TCP
    - to:
        - namespaceSelector:
            matchLabels:
              name: llm-providers
      ports:
        - port: 443
          protocol: TCP
    - to:
        - podSelector:
            matchLabels:
              app: prometheus
          namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 9090
          protocol: TCP
5.4 Ingress and External Access
External access to agent services is provided through an Ingress controller with TLS termination. For production deployments, we recommend dedicated ingress resources per agent group rather than a single ingress with path-based routing, to provide isolation and independent scaling of ingress capacity.
Agent Networking Data Flow:
External Client
|
| HTTPS (TLS 1.3)
v
+------------------+
| Ingress |
| Controller |
| (nginx/envoy) |
+--------+---------+
|
| HTTP/2 (plaintext, within cluster)
v
+------------------+
| Istio Ingress |
| Gateway |
+--------+---------+
|
| mTLS (Istio-managed certificates)
v
+------------------+ mTLS +------------------+
| Agent Pod A |<------------>| Agent Pod B |
| (code-reviewer) | | (security-scan) |
| +-------------+ | | +-------------+ |
| | Envoy Proxy | | | | Envoy Proxy | |
| +------+------+ | | +------+------+ |
| | | | | |
| +------v------+ | | +------v------+ |
| | Agent | | | | Agent | |
| | Container | | | | Container | |
| +-------------+ | | +-------------+ |
+------------------+ +------------------+
| |
| mTLS | mTLS
v v
+------------------+ +------------------+
| Qdrant | | Prometheus |
| (Vector DB) | | (Monitoring) |
+------------------+ +------------------+
Figure 5: Agent networking data flow with service mesh mTLS.
6. Storage and Persistence
6.1 Storage Requirements for AI Agents
AI agents have diverse storage requirements that span multiple access patterns and performance tiers.
Table 5: Agent Storage Requirements
| Storage Type | Access Pattern | Performance | Persistence | Example Use |
|---|---|---|---|---|
| Model weights | Read-heavy, sequential | High throughput (1+ GB/s) | Ephemeral (cacheable) | LLM model files |
| Vector indices | Read-write, random | High IOPS (3000+) | Persistent | Qdrant/Milvus data |
| Conversation state | Write-heavy, append | Medium IOPS (500+) | Persistent | Agent memory |
| Task queue | Read-write, FIFO | Low latency (< 1ms) | Semi-persistent | Pending tasks |
| Scratch/temp | Write-heavy, sequential | Medium throughput | Ephemeral | Intermediate results |
| Configuration | Read-only | Low | Persistent | Agent config, prompts |
6.2 Persistent Volume Claims
The operator creates PVCs based on the agent's storage declarations. For agents that require persistent state (vector databases, conversation history), the operator uses StorageClass selection to match performance requirements.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: code-reviewer-vector-store
  namespace: agent-system
  labels:
    ossa.ai/agent: code-reviewer
    ossa.ai/storage-type: vector-index
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd-high-iops
  resources:
    requests:
      storage: 50Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-high-iops
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "50"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
6.3 StatefulSets for Stateful Agents
Agents that maintain persistent state (such as vector database instances or agents with local model caches) are deployed as StatefulSets rather than Deployments. StatefulSets provide stable network identities (deterministic pod names like qdrant-0, qdrant-1) and ordered, graceful scaling that ensures data consistency.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant-vector-store
  namespace: agent-system
spec:
  serviceName: qdrant-headless
  replicas: 3
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.12.0
          ports:
            - containerPort: 6333
              name: http
            - containerPort: 6334
              name: grpc
            - containerPort: 6335
              name: internal
          volumeMounts:
            - name: qdrant-data
              mountPath: /qdrant/storage
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 16Gi
  volumeClaimTemplates:
    - metadata:
        name: qdrant-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: ssd-high-iops
        resources:
          requests:
            storage: 100Gi
6.4 CSI Driver Selection
The choice of CSI (Container Storage Interface) driver significantly impacts storage performance. The following performance reference figures, by storage tier, apply to agent workloads:
Storage Performance Reference:
SSD (io2/gp3):
IOPS: 3,000 - 64,000 (provisioned)
Throughput: 125 - 1,000 MB/s
Latency: < 1 ms (p99)
Cost: $0.125/GB/month + $0.065/provisioned-IOPS
HDD (st1/sc1):
IOPS: 250 - 500
Throughput: 20 - 500 MB/s
Latency: 5 - 10 ms (p99)
Cost: $0.025 - $0.045/GB/month
NFS (EFS/Filestore):
IOPS: varies (bursting)
Throughput: 50 - 1,000 MB/s (provisioned)
Latency: 2 - 10 ms (p99)
Cost: $0.30/GB/month (standard), $0.025/GB/month (infrequent access)
Access: ReadWriteMany (shared across pods)
Local NVMe (i3/i4i instances):
IOPS: 100,000 - 3,300,000
Throughput: 1,750 - 8,000 MB/s
Latency: < 0.1 ms (p99)
Cost: included in instance cost (ephemeral)
For vector database workloads that require high random IOPS, provisioned SSD (io2) is recommended. For model weight caching where sequential throughput matters more than IOPS, local NVMe provides the best performance at the lowest cost (since storage is included in the instance price), with the caveat that data is ephemeral and must be re-downloaded if the node is replaced.
6.5 Model Weight Distribution
Large model weights (ranging from 2 GB for 7B-parameter quantized models to 150+ GB for 70B-parameter full-precision models) present a unique storage challenge. Downloading weights from a remote registry (Hugging Face, S3) on every pod startup introduces unacceptable latency. The recommended approach is a tiered caching strategy:
- Cluster-level cache: A shared ReadWriteMany NFS volume mounted at /models/cache on all agent nodes, populated by a DaemonSet that pre-fetches model weights.
- Node-level cache: A hostPath or local PV mounted at /var/cache/models that persists across pod restarts on the same node.
- Pod-level init container: An init container that copies the required model weights from the node cache to the pod's ephemeral volume before the agent container starts.
Model loading time = max(download_time, 0) + copy_time + load_time
Where:
download_time = model_size / download_bandwidth (0 if cached)
copy_time = model_size / local_disk_bandwidth
load_time = model_size / memory_bandwidth + initialization_overhead
Example (13B model, 7.3 GB quantized):
Cold start: 7.3 GB / 100 MB/s + 7.3 GB / 2 GB/s + 7.3 GB / 20 GB/s + 2s
= 73s + 3.65s + 0.365s + 2s = ~79 seconds
Warm start: 0s + 3.65s + 0.365s + 2s = ~6 seconds
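The loading-time model above can be encoded directly; the default bandwidths below are the assumed figures from the worked example, not measured values:

```python
def model_load_seconds(model_gb: float, cached: bool,
                       download_gbps: float = 0.1,  # 100 MB/s from remote registry
                       disk_gbps: float = 2.0,      # local disk copy bandwidth
                       mem_gbps: float = 20.0,      # load into memory
                       init_s: float = 2.0) -> float:
    """Model loading time = download (0 if cached) + copy + load + init."""
    download = 0.0 if cached else model_gb / download_gbps
    copy = model_gb / disk_gbps
    load = model_gb / mem_gbps + init_s
    return download + copy + load

cold = model_load_seconds(7.3, cached=False)  # ~79 s
warm = model_load_seconds(7.3, cached=True)   # ~6 s
```

This makes it easy to test how sensitivity to download bandwidth dominates cold starts: halving download_gbps nearly doubles the cold-start time while leaving warm starts unchanged.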
7. Observability
7.1 The Three Pillars for Agent Workloads
Observability for AI agent workloads extends beyond traditional infrastructure monitoring. In addition to the standard three pillars (metrics, logs, traces), agent observability requires semantic-level understanding: what decisions did the agent make, what tools did it invoke, what was the quality of its output?
7.2 OpenTelemetry Instrumentation
The OpenTelemetry Collector runs as a DaemonSet on every node, receiving telemetry from agent pods via OTLP (OpenTelemetry Protocol) and routing it to the appropriate backends.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: agent-collector
  namespace: monitoring
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      attributes:
        actions:
          - key: agent.name
            from_context: resource
            action: insert
          - key: agent.model
            from_context: resource
            action: insert
    exporters:
      prometheusremotewrite:
        endpoint: http://prometheus.monitoring:9090/api/v1/write
      otlp/tempo:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
      loki:
        endpoint: http://loki.monitoring:3100/loki/api/v1/push
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [otlp/tempo]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
7.3 Prometheus Metrics
The agent operator exposes a comprehensive set of Prometheus metrics that cover both infrastructure health and agent-specific semantics.
Key metrics exposed by the operator:
# Agent lifecycle metrics
agent_operator_reconcile_total{agent, result} # Total reconciliation attempts
agent_operator_reconcile_duration_seconds{agent, quantile} # Reconciliation latency
agent_operator_state_transitions_total{agent, from, to} # State machine transitions
agent_operator_managed_agents_total{state} # Agents by state
# Agent runtime metrics (scraped from agent pods)
agent_tokens_processed_total{agent, model} # Total tokens processed
agent_tokens_per_second{agent, model} # Current throughput
agent_request_duration_seconds{agent, tool, quantile} # Request latency by tool
agent_active_tasks{agent} # Currently executing tasks
agent_queue_depth{agent} # Pending tasks
agent_tool_invocations_total{agent, tool, status} # Tool usage by status
agent_model_inference_duration_seconds{agent, model} # Model inference latency
agent_errors_total{agent, type} # Errors by type
agent_gpu_utilization_percent{agent, gpu_index} # GPU utilization
agent_gpu_memory_used_bytes{agent, gpu_index} # GPU memory usage
agent_context_window_utilization{agent} # Context window fill percentage
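On the agent side, these runtime metrics can be exposed with the standard Prometheus Python client. The sketch below instruments a hypothetical task handler with three of the metrics listed above; the metric and label names follow the listing, while the handler itself (run_task) and its arguments are illustrative:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TOKENS = Counter("agent_tokens_processed_total", "Total tokens processed",
                 ["agent", "model"])
ACTIVE = Gauge("agent_active_tasks", "Currently executing tasks", ["agent"])
LATENCY = Histogram("agent_request_duration_seconds", "Request latency by tool",
                    ["agent", "tool"],
                    buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60))

def run_task(agent: str, model: str, tool: str, tokens: int, duration_s: float):
    """Record one task: track concurrency while running, then totals on completion."""
    ACTIVE.labels(agent=agent).inc()
    try:
        LATENCY.labels(agent=agent, tool=tool).observe(duration_s)
        TOKENS.labels(agent=agent, model=model).inc(tokens)
    finally:
        ACTIVE.labels(agent=agent).dec()

# start_http_server(9090)  # serve /metrics on the port the Service scrapes
```

The agent_active_tasks gauge driven this way is the same signal consumed by the HPA and KEDA triggers described in Section 4.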
7.4 Alert Rules
Critical alert rules for agent workloads:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-alerts
  namespace: monitoring
spec:
  groups:
    - name: agent-health
      interval: 30s
      rules:
        - alert: AgentDown
          expr: |
            absent(up{job="agent-system"} == 1)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Agent {{ $labels.agent }} is down"
            description: "Agent {{ $labels.agent }} has been unreachable for 5 minutes."
        - alert: AgentHighErrorRate
          expr: |
            rate(agent_errors_total[5m]) / rate(agent_tokens_processed_total[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} error rate > 5%"
        - alert: AgentHighLatency
          expr: |
            histogram_quantile(0.99, rate(agent_request_duration_seconds_bucket[5m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} p99 latency > 10s"
        - alert: AgentGPUMemoryPressure
          expr: |
            agent_gpu_memory_used_bytes / agent_gpu_memory_total_bytes > 0.95
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Agent {{ $labels.agent }} GPU memory > 95%"
        - alert: AgentQueueBacklog
          expr: |
            agent_queue_depth > 50
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} queue depth > 50 for 10 minutes"
        - alert: AgentScalingMaxed
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} at maximum replicas for 15 minutes"
7.5 Grafana Dashboards
A production agent observability stack includes the following Grafana dashboards:
- Agent Fleet Overview: Total agents, state distribution, cluster resource utilization, error rates, and throughput aggregates.
- Individual Agent Detail: Per-agent metrics including token throughput, latency percentiles, tool invocation breakdown, GPU utilization, and scaling events.
- Workflow Execution: AgentWorkflow DAG visualization, step durations, failure rates, and end-to-end latency.
- Cost and Capacity: GPU utilization efficiency, cost per token, resource waste (requested vs. used), and capacity planning projections.
- Security and Compliance: RBAC audit events, network policy violations, runtime security alerts, and access tier validation.
7.6 Structured Logging with Loki
Agent logs are structured as JSON and shipped to Loki via the OpenTelemetry Collector. Each log entry includes the agent name, model, task ID, and tool invocation context as labels, enabling efficient filtering and correlation.
The log retention policy should account for the volume generated by verbose agent interactions. A single agent processing 100 requests per hour with an average of 10 tool invocations per request generates approximately 1,000 log entries per hour. At an average of 500 bytes per entry, this is 500 KB/hour or ~360 MB/month per agent. For a fleet of 50 agents, total log volume is approximately 18 GB/month before compression (Loki typically achieves 10-15x compression, resulting in approximately 1.2-1.8 GB of stored data).
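The volume estimate generalizes to a small planning function; the 12.5x default compression ratio is simply the midpoint of the 10-15x range cited above:

```python
def monthly_log_volume_gb(n_agents: int, requests_per_hour: int,
                          entries_per_request: int, bytes_per_entry: int = 500,
                          hours_per_month: int = 720,
                          compression_ratio: float = 12.5):
    """Return (raw_gb, stored_gb) of fleet-wide log volume per month."""
    entries = n_agents * requests_per_hour * entries_per_request * hours_per_month
    raw_gb = entries * bytes_per_entry / 1e9
    return raw_gb, raw_gb / compression_ratio

raw, stored = monthly_log_volume_gb(50, 100, 10)  # 18 GB raw, ~1.4 GB stored
```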
8. Multi-Cluster Federation
8.1 Why Multi-Cluster?
Single-cluster deployments are sufficient for many organizations, but multi-cluster federation becomes necessary for several reasons:
- Geographic distribution: Agents that interact with users in multiple regions benefit from reduced latency when deployed closer to the user.
- Regulatory compliance: Data residency requirements (GDPR, CCPA) may mandate that certain agent workloads and their associated data remain within specific geographic boundaries.
- Blast radius reduction: Isolating agent workloads across clusters limits the impact of cluster-level failures.
- Resource specialization: Different clusters can provide different hardware profiles (GPU types, memory configurations) for different agent workloads.
- Scale limits: etcd performance degrades above approximately 10,000 custom resources per cluster; large agent deployments may need to partition across clusters.
8.2 Federation Architecture
Multi-Cluster Federation:
+----------------------------+
| Federation Control |
| Plane |
| |
| +------------------------+ |
| | KubeFed Controller | |
| | Manager | |
| +------------------------+ |
| | Agent Federation | |
| | Scheduler | |
| +------------------------+ |
| | Global Service Mesh | |
| | (Istio Multi-Cluster) | |
| +------------------------+ |
+-----+-------+-------+-----+
| | |
+-----------+ | +-----------+
| | |
+---------v--------+ +-------v--------+ +--------v---------+
| Cluster: US-East | | Cluster: EU | | Cluster: AP |
| | | | | |
| Agents: | | Agents: | | Agents: |
| - code-reviewer | | - gdpr-agent | | - translation |
| - security-scan | | - eu-reviewer | | - ap-reviewer |
| - test-gen | | - compliance | | - sentiment |
| | | | | |
| GPU: 4x A100 | | GPU: 2x A100 | | GPU: 2x A100 |
| Nodes: 15 | | Nodes: 8 | | Nodes: 6 |
+------------------+ +----------------+ +-------------------+
Figure 6: Multi-cluster federation architecture with geographic distribution.
8.3 Federated Agent Resources
KubeFed (Kubernetes Federation v2) enables the propagation of Agent CRs across multiple clusters with placement policies and override mechanisms.
apiVersion: types.kubefed.io/v1beta1
kind: FederatedAgent
metadata:
  name: code-reviewer
  namespace: agent-system
spec:
  template:
    spec:
      runtime:
        model: anthropic/claude-sonnet-4-20250514:latest
        image: registry.gitlab.com/blueflyio/agents/code-reviewer:v2.1.0
      capabilities:
        - name: code-review
          type: skill
          version: "2.1"
      scaling:
        minReplicas: 2
        maxReplicas: 10
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: 1
  placement:
    clusters:
      - name: us-east
      - name: eu-west
    clusterSelector:
      matchLabels:
        gpu-available: "true"
  overrides:
    - clusterName: eu-west
      clusterOverrides:
        - path: "/spec/scaling/minReplicas"
          value: 1
        - path: "/spec/scaling/maxReplicas"
          value: 5
        - path: "/spec/runtime/model"
          value: "anthropic/claude-sonnet-4-20250514:eu-compliant"
8.4 Cross-Cluster Latency Model
Cross-cluster agent communication introduces latency that must be accounted for in workflow design. The total latency for a cross-cluster agent invocation is:
T_cross_cluster = T_serialization + (N_hops * T_per_hop) + T_deserialization + T_processing
Where:
T_serialization = message_size / serialization_throughput
= typically 0.1 - 2 ms for protobuf (gRPC)
N_hops = number of network hops (typically 3-8 for cross-region)
T_per_hop = per-hop latency (0.5 - 5 ms per hop)
T_deserialization = roughly equal to T_serialization
T_processing = agent processing time (highly variable, 100ms - 60s)
Example (US-East to EU-West):
T_cross_cluster = 0.5ms + (6 * 2ms) + 0.5ms + 500ms
= 0.5 + 12 + 0.5 + 500
= 513 ms
Compared to same-cluster:
T_same_cluster = 0.5ms + (2 * 0.1ms) + 0.5ms + 500ms
= 0.5 + 0.2 + 0.5 + 500
= 501.2 ms
The 12 ms network overhead of cross-cluster communication is negligible compared to agent processing time for most workloads. However, for workflows with many sequential agent invocations (e.g., a 10-step pipeline), the cumulative overhead becomes significant: 120 ms for cross-cluster vs. 2 ms for same-cluster, a 60x increase in network latency.
The recommendation is to co-locate agents that form tight interaction loops in the same cluster and use cross-cluster communication only for loosely coupled workflows or geographic routing.
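The latency model is trivial to encode for workflow planning; the per-hop and serialization figures below are the assumed values from the worked example:

```python
def invocation_latency_ms(serialize_ms: float, n_hops: int, per_hop_ms: float,
                          processing_ms: float) -> float:
    """T = T_ser + N_hops * T_per_hop + T_deser + T_proc, with T_deser ~= T_ser."""
    return 2 * serialize_ms + n_hops * per_hop_ms + processing_ms

cross = invocation_latency_ms(0.5, 6, 2.0, 500.0)  # 513 ms (US-East to EU-West)
local = invocation_latency_ms(0.5, 2, 0.1, 500.0)  # 501.2 ms (same cluster)

def pipeline_hop_overhead_ms(steps: int, n_hops: int, per_hop_ms: float) -> float:
    """Cumulative network-hop overhead for a sequential multi-agent pipeline."""
    return steps * n_hops * per_hop_ms
```

For the 10-step pipeline discussed above, pipeline_hop_overhead_ms(10, 6, 2.0) gives the 120 ms cross-cluster figure versus 2 ms same-cluster.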
8.5 Cluster API for Infrastructure Provisioning
For organizations that manage their own Kubernetes infrastructure (rather than using managed services), the Cluster API provides declarative, Kubernetes-style APIs for creating, configuring, and managing clusters. This enables the agent orchestration layer to provision new clusters on demand in response to scaling requirements or geographic expansion.
9. Security Hardening
9.1 Pod Security Standards
Kubernetes Pod Security Standards define three levels of restriction: Privileged (unrestricted), Baseline (prevents known privilege escalations), and Restricted (heavily restricted, following current best practices). Agent workloads should run at the Restricted level with targeted exceptions for GPU access.
apiVersion: v1
kind: Namespace
metadata:
  name: agent-system
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
The Restricted level enforces:
- Pods must run as non-root
- Root filesystem must be read-only
- Privilege escalation must be explicitly disallowed
- Seccomp profile must be set (RuntimeDefault or Localhost)
- Host namespaces (hostNetwork, hostPID, hostIPC) are forbidden
- HostPath volumes are forbidden
For GPU workloads, a RuntimeClass exception is required because the NVIDIA device plugin requires certain capabilities. This is handled through a targeted exemption rather than relaxing the entire namespace.
9.2 RuntimeClass and Sandboxing
For agents that execute untrusted code (e.g., a code execution agent that runs user-submitted programs), gVisor provides an additional layer of isolation beyond standard container boundaries. gVisor intercepts system calls and handles them in user space, preventing the container from directly interacting with the host kernel.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
overhead:
  podFixed:
    cpu: 100m
    memory: 64Mi
scheduling:
  nodeSelector:
    runtime.gvisor.dev/capable: "true"
---
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: code-executor
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-sonnet-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/code-executor:v1.0.0
  security:
    accessTier: tier_3_full_access
    runtimeClass: gvisor
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    networkPolicy:
      allowEgress:
        - host: "*.internal.svc.cluster.local"
          port: 443
      denyIngress: false
The performance overhead of gVisor is approximately 5-15% for CPU-bound workloads and 20-40% for syscall-heavy workloads. For AI inference workloads that are primarily GPU-bound, the overhead is negligible because GPU operations bypass the gVisor syscall interception layer.
9.3 RBAC (Role-Based Access Control)
RBAC for the agent system follows the principle of least privilege. The operator service account has broad permissions within the agent-system namespace, but agent pods themselves have tightly scoped permissions based on their OSSA access tier.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agent-operator
rules:
  - apiGroups: ["ossa.ai"]
    resources: ["agents", "agentpools", "agentworkflows"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["ossa.ai"]
    resources: ["agents/status", "agentpools/status", "agentworkflows/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-tier1-readonly
  namespace: agent-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["agent-api-keys"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-tier3-executor
  namespace: agent-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "create", "update"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["agent-api-keys", "agent-git-credentials"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list"]
9.4 OPA (Open Policy Agent) for Policy Enforcement
OPA Gatekeeper enforces custom policies that go beyond what Kubernetes RBAC and Pod Security Standards can express. For agent workloads, OPA policies enforce constraints such as:
- Agents must not request more GPU resources than their access tier permits.
- Agents in tier_1_read cannot have egress network policies to external endpoints.
- Agent images must be pulled from the approved container registry.
- Agent model references must be from the approved model registry.
- Cross-tier agent communication must follow the role conflict matrix.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: agenttierresourcelimit
spec:
  crd:
    spec:
      names:
        kind: AgentTierResourceLimit
      validation:
        openAPIV3Schema:
          type: object
          properties:
            maxGPU:
              type: object
              additionalProperties:
                type: integer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package agenttierresourcelimit

        violation[{"msg": msg}] {
          input.review.object.apiVersion == "ossa.ai/v1"
          input.review.object.kind == "Agent"
          tier := input.review.object.spec.security.accessTier
          requested_gpu := input.review.object.spec.resources.requests["nvidia.com/gpu"]
          max_gpu := input.parameters.maxGPU[tier]
          requested_gpu > max_gpu
          msg := sprintf(
            "Agent %v in tier %v requests %v GPUs, max allowed: %v",
            [input.review.object.metadata.name, tier, requested_gpu, max_gpu]
          )
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AgentTierResourceLimit
metadata:
  name: agent-gpu-limits-by-tier
spec:
  match:
    kinds:
      - apiGroups: ["ossa.ai"]
        kinds: ["Agent"]
  parameters:
    maxGPU:
      tier_1_read: 0
      tier_2_write_limited: 1
      tier_3_full_access: 4
      tier_4_policy: 0
9.5 Supply Chain Security
Agent container images must pass through a verification pipeline before deployment:
- Image signing: All images are signed with Sigstore/Cosign during CI. The operator validates signatures before allowing pod creation.
- Vulnerability scanning: Trivy scans all images for CVEs. Images with critical vulnerabilities are blocked.
- SBOM generation: Software Bills of Materials are generated for all agent images and stored in the registry alongside the image.
- Admission control: Kyverno or OPA policies enforce that only signed, scanned images from approved registries are admitted to the cluster.
10. Reference Architecture
10.1 50-Agent Production Deployment
The following reference architecture describes a production deployment of 50 agents across a medium-tier Kubernetes cluster. This architecture supports a mix of CPU-only agents (lightweight tools, routing, orchestration) and GPU-accelerated agents (inference, code generation, analysis).
Table 6: Reference Architecture Node Pools
| Node Pool | Instance Type | Count | CPU | Memory | GPU | Purpose |
|---|---|---|---|---|---|---|
| system | m6i.xlarge | 3 | 4 vCPU | 16 GB | None | Control plane, operator, monitoring |
| cpu-agents | m6i.2xlarge | 5 | 8 vCPU | 32 GB | None | CPU-only agents, orchestrators |
| gpu-inference | g5.2xlarge | 4 | 8 vCPU | 32 GB | 1x A10G 24GB | Inference agents, code generation |
| gpu-heavy | p4d.24xlarge | 1 | 96 vCPU | 1152 GB | 8x A100 40GB | Large model inference, training |
| storage | i3.xlarge | 3 | 4 vCPU | 30.5 GB | None | Qdrant, MinIO, PostgreSQL |
Deployment manifest for the complete system:
# Namespace and resource quotas
apiVersion: v1
kind: Namespace
metadata:
  name: agent-system
  labels:
    pod-security.kubernetes.io/enforce: restricted
    istio-injection: enabled
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-system-quota
  namespace: agent-system
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    requests.nvidia.com/gpu: "12"
    limits.cpu: "400"
    limits.memory: 1600Gi
    limits.nvidia.com/gpu: "12"
    pods: "200"
    services: "60"
    persistentvolumeclaims: "50"
---
# Agent Operator deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-operator
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-operator
  template:
    metadata:
      labels:
        app: agent-operator
    spec:
      serviceAccountName: agent-operator
      nodeSelector:
        node-pool: system
      containers:
        - name: operator
          image: registry.gitlab.com/blueflyio/agent-operator:v1.0.0
          args:
            - --leader-elect=true
            - --leader-election-id=agent-operator-leader
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --max-concurrent-reconciles=10
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
---
# Example agents (representative of 50-agent fleet)
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: code-reviewer
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-sonnet-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/code-reviewer:v2.1.0
    maxTokensPerRequest: 8192
    temperature: 0.3
  capabilities:
    - name: code-review
      type: skill
      version: "2.1"
    - name: git-operations
      type: tool
      version: "1.0"
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
      nvidia.com/gpu: 1
    limits:
      cpu: "4"
      memory: 16Gi
      nvidia.com/gpu: 1
  scaling:
    minReplicas: 2
    maxReplicas: 8
    metrics:
      - type: custom
        name: agent_active_tasks
        target:
          type: AverageValue
          averageValue: "3"
      - type: custom
        name: agent_queue_depth
        target:
          type: Value
          value: "10"
    scaleDownStabilization: 300
  security:
    accessTier: tier_3_full_access
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    networkPolicy:
      allowEgress:
        - host: "gitlab.com"
          port: 443
        - host: "api.anthropic.com"
          port: 443
  observability:
    metricsPort: 9090
    metricsPath: /metrics
    tracingEnabled: true
    logLevel: info
---
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: routing-orchestrator
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-haiku-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/router:v1.5.0
    maxTokensPerRequest: 2048
    temperature: 0.1
  capabilities:
    - name: task-routing
      type: skill
      version: "1.5"
    - name: agent-discovery
      type: protocol
      version: "1.0"
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  scaling:
    minReplicas: 3
    maxReplicas: 15
    metrics:
      - type: custom
        name: agent_active_tasks
        target:
          type: AverageValue
          averageValue: "10"
  security:
    accessTier: tier_2_write_limited
    runAsNonRoot: true
    readOnlyRootFilesystem: true
  observability:
    metricsPort: 9090
    tracingEnabled: true
    logLevel: info
10.2 Cost Model
The monthly cost for this reference architecture is calculated as follows:
Table 7: Monthly Cost Breakdown
| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| System nodes (m6i.xlarge) | 3 | $138/mo | $414 |
| CPU agent nodes (m6i.2xlarge) | 5 | $276/mo | $1,380 |
| GPU inference nodes (g5.2xlarge) | 4 | $912/mo | $3,648 |
| GPU heavy node (p4d.24xlarge) | 1 | $23,558/mo | $23,558 |
| Storage nodes (i3.xlarge) | 3 | $225/mo | $675 |
| EBS storage (gp3, 2TB total) | 2,000 GB | $0.08/GB/mo | $160 |
| EBS storage (io2, 500GB) | 500 GB | $0.125/GB/mo | $62.50 |
| Data transfer (egress) | 500 GB | $0.09/GB | $45 |
| EKS control plane | 1 | $73/mo | $73 |
| Total | | | $30,015.50 |
For organizations that do not require the p4d.24xlarge heavy GPU node, the cost drops to approximately $6,457/month, well within the medium tier range. The heavy GPU node is only necessary for organizations running large language models (70B+ parameters) locally rather than using API-based inference.
The cost formula for estimating deployment expenses:
Monthly = (N_system * C_system) + (N_cpu * C_cpu_node) + (N_gpu_small * C_gpu_small)
+ (N_gpu_large * C_gpu_large) + (N_storage * C_storage_node)
+ (S_gp3 * 0.08) + (S_io2 * 0.125) + (E_gb * 0.09) + C_managed
Cost_per_agent = Monthly / N_agents
Cost_per_token = Monthly / (N_agents * avg_tokens_per_agent_per_month)
For this reference architecture:
- Cost per agent: $30,015.50 / 50 = $600.31/month
- Assuming each agent processes 10 million tokens/month: $600.31 / 10,000,000 tokens, or approximately $0.06 per 1,000 tokens of infrastructure cost
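The cost formula above can be sketched as a small calculator; a minimal sketch, with defaults taken from Table 7 (the function and parameter names are illustrative, not part of any published tooling):

```python
# Illustrative cost-model calculator mirroring the Section 10.2 formula.
# Defaults reflect Table 7; substitute your own region's pricing.

def monthly_cost(n_system=3, c_system=138,        # system nodes (m6i.xlarge)
                 n_cpu=5, c_cpu_node=276,         # CPU agent nodes (m6i.2xlarge)
                 n_gpu_small=4, c_gpu_small=912,  # GPU inference nodes (g5.2xlarge)
                 n_gpu_large=1, c_gpu_large=23_558,  # GPU heavy node (p4d.24xlarge)
                 n_storage=3, c_storage_node=225,    # storage nodes (i3.xlarge)
                 s_gp3_gb=2_000, s_io2_gb=500,    # EBS volumes in GB
                 egress_gb=500, c_managed=73):    # egress and EKS control plane
    """Total monthly infrastructure cost in USD."""
    return (n_system * c_system
            + n_cpu * c_cpu_node
            + n_gpu_small * c_gpu_small
            + n_gpu_large * c_gpu_large
            + n_storage * c_storage_node
            + s_gp3_gb * 0.08        # gp3 $/GB/month
            + s_io2_gb * 0.125       # io2 $/GB/month
            + egress_gb * 0.09       # egress $/GB
            + c_managed)

total = monthly_cost()           # 30015.50 for the reference deployment
per_agent = total / 50           # 600.31 per agent per month
no_heavy_gpu = monthly_cost(n_gpu_large=0)  # deployment without the p4d node
```

Dropping the `n_gpu_large` term reproduces the reduced figure quoted above for organizations using API-based inference instead of local 70B+ models.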
10.3 Capacity Planning
Capacity planning for agent workloads requires estimating the peak concurrency, throughput requirements, and resource consumption patterns.
Capacity planning formulas:
Required_CPU_nodes = ceil(sum(agent_cpu_requests) / node_allocatable_cpu)
Required_GPU_nodes = ceil(sum(agent_gpu_requests) / gpus_per_node)
Required_memory = sum(agent_memory_requests) * (1 + headroom_percent)
Throughput_capacity = N_replicas * tokens_per_second_per_replica
Latency_budget = target_p99_latency - network_overhead - queue_wait
Max_concurrent = throughput_capacity * latency_budget (Little's Law: L = lambda * W)
Scaling_headroom = max_replicas / min_replicas (recommended: 3-5x)
For the 50-agent deployment:
- Total CPU requests: ~150 vCPU (3 vCPU average per agent)
- Total GPU requests: 12 GPUs (24% of agents require GPU)
- Total memory requests: ~300 GB (6 GB average per agent)
- Peak throughput: ~5,000 tokens/second aggregate
- P99 latency target: < 5 seconds for inference agents
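The capacity formulas above can be sketched as follows; a minimal illustration, where the function names and the per-node allocatable figures in the usage lines are assumptions, not measurements from the reference cluster:

```python
import math

# Illustrative helpers implementing the Section 10.3 capacity-planning formulas.

def required_cpu_nodes(total_cpu_requests, node_allocatable_cpu):
    """Node count needed to satisfy the sum of agent CPU requests."""
    return math.ceil(total_cpu_requests / node_allocatable_cpu)

def required_gpu_nodes(total_gpu_requests, gpus_per_node):
    """Node count needed to satisfy the sum of agent GPU requests."""
    return math.ceil(total_gpu_requests / gpus_per_node)

def required_memory_gb(total_memory_requests_gb, headroom_percent=0.25):
    """Memory to provision, with headroom for bursts and system overhead."""
    return total_memory_requests_gb * (1 + headroom_percent)

def max_concurrent(throughput_capacity, latency_budget_s):
    """Work in flight via Little's Law: L = lambda * W."""
    return throughput_capacity * latency_budget_s

# Plugging in the 50-agent deployment figures (allocatable values assumed):
cpu_nodes = required_cpu_nodes(150, 30)   # 150 vCPU requests, ~30 allocatable/node
mem_gb = required_memory_gb(300)          # 300 GB requests + 25% headroom
in_flight = max_concurrent(5_000, 4.0)    # 5,000 tok/s, ~4 s effective budget
```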
11. References
- Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). Borg, Omega, and Kubernetes. ACM Queue, 14(1), 70-93. DOI: 10.1145/2898442.2898444
- Cloud Native Computing Foundation. (2025). CNCF Annual Survey 2025: Kubernetes Adoption and Trends. https://www.cncf.io/reports/cncf-annual-survey-2025/
- Kubernetes Authors. (2025). Custom Resources. Kubernetes Documentation. https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
- Kubernetes Authors. (2025). Operator Pattern. Kubernetes Documentation. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
- Operator SDK Authors. (2025). Building Operators with Operator SDK. https://sdk.operatorframework.io/
- Kubernetes Authors. (2025). Horizontal Pod Autoscaler. Kubernetes Documentation. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- KEDA Authors. (2025). KEDA: Kubernetes Event-driven Autoscaling. https://keda.sh/docs/
- Kubernetes Authors. (2025). Vertical Pod Autoscaler. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
- Istio Authors. (2025). Istio Service Mesh Architecture. https://istio.io/latest/docs/ops/deployment/architecture/
- Linkerd Authors. (2025). Linkerd Architecture. https://linkerd.io/2/reference/architecture/
- Kubernetes Authors. (2025). Network Policies. Kubernetes Documentation. https://kubernetes.io/docs/concepts/services-networking/network-policies/
- Kubernetes Authors. (2025). Persistent Volumes. Kubernetes Documentation. https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- Kubernetes Authors. (2025). StatefulSets. Kubernetes Documentation. https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
- OpenTelemetry Authors. (2025). OpenTelemetry Collector. https://opentelemetry.io/docs/collector/
- Prometheus Authors. (2025). Prometheus Monitoring System. https://prometheus.io/docs/
- Grafana Labs. (2025). Grafana Loki: Log Aggregation System. https://grafana.com/docs/loki/latest/
- Kubernetes Federation v2 Authors. (2025). KubeFed: Kubernetes Federation v2. https://github.com/kubernetes-sigs/kubefed
- Cluster API Authors. (2025). Cluster API Documentation. https://cluster-api.sigs.k8s.io/
- Kubernetes Authors. (2025). Pod Security Standards. Kubernetes Documentation. https://kubernetes.io/docs/concepts/security/pod-security-standards/
- gVisor Authors. (2025). gVisor: Application Kernel for Containers. https://gvisor.dev/docs/
- Open Policy Agent Authors. (2025). OPA Gatekeeper. https://open-policy-agent.github.io/gatekeeper/
- NVIDIA. (2025). NVIDIA Device Plugin for Kubernetes. https://github.com/NVIDIA/k8s-device-plugin
- NVIDIA. (2025). Multi-Instance GPU User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
- Sigstore Authors. (2025). Cosign: Container Signing. https://docs.sigstore.dev/cosign/
- Aqua Security. (2025). Trivy: Comprehensive Vulnerability Scanner. https://trivy.dev/
- BlueFly.io. (2026). Open Standard for Sustainable Agents (OSSA) v0.3.3 Specification. https://gitlab.com/blueflyio/openstandardagents
- BlueFly.io. (2026). Agent Platform Technical Documentation. https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/home
- Kubernetes Authors. (2025). Resource Management for Pods and Containers. https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- Kubernetes Authors. (2025). Scheduling, Preemption, and Eviction. https://kubernetes.io/docs/concepts/scheduling-eviction/
- etcd Authors. (2025). etcd Performance and Tuning. https://etcd.io/docs/v3.5/op-guide/performance/
Appendix A: Glossary
| Term | Definition |
|---|---|
| CRD | Custom Resource Definition: extends the Kubernetes API with new resource types |
| HPA | Horizontal Pod Autoscaler: adjusts replica count based on metrics |
| VPA | Vertical Pod Autoscaler: adjusts resource requests/limits per pod |
| KEDA | Kubernetes Event-Driven Autoscaling: scales based on event sources |
| OSSA | Open Standard for Sustainable Agents: BlueFly.io's agent specification |
| mTLS | Mutual TLS: bidirectional certificate-based authentication |
| OPA | Open Policy Agent: policy enforcement engine |
| CSI | Container Storage Interface: standard for storage plugins |
| MIG | Multi-Instance GPU: NVIDIA technology for GPU partitioning |
| RBAC | Role-Based Access Control: Kubernetes authorization mechanism |
| CRI | Container Runtime Interface: standard for container runtimes |
| DAG | Directed Acyclic Graph: used for workflow step ordering |
| PVC | Persistent Volume Claim: storage request in Kubernetes |
| OTLP | OpenTelemetry Protocol: telemetry data transport protocol |
Appendix B: Checklist for Production Readiness
- Agent CRDs deployed and validated with OpenAPI v3 schema
- Agent Operator running with 3 replicas and leader election
- HPA configured with custom metrics (tokens/sec, queue depth)
- KEDA ScaledObjects for event-driven scaling with scale-to-zero
- VPA in Initial mode for GPU memory right-sizing
- Istio service mesh with STRICT mTLS enabled
- Network policies applied to all agent pods
- Pod Security Standards enforced at Restricted level
- gVisor RuntimeClass for code-execution agents
- RBAC roles aligned with OSSA access tiers
- OPA policies for tier-based resource limits
- OpenTelemetry Collector DaemonSet deployed
- Prometheus scraping agent metrics endpoints
- Grafana dashboards for fleet overview and individual agents
- Alert rules for agent health, latency, GPU pressure, and queue backlog
- Loki log aggregation with appropriate retention policies
- Container image signing with Cosign
- Vulnerability scanning with Trivy in CI pipeline
- SBOM generation for all agent images
- Resource quotas applied at namespace level
- PVCs provisioned with appropriate StorageClass (SSD for vector DBs)
- Model weight caching strategy implemented (cluster/node/pod tiers)
- Backup strategy for persistent agent state
- Disaster recovery plan documented and tested
- Capacity planning formulas validated against actual usage
This whitepaper is part of the BlueFly.io Agent Platform Whitepaper Series. For questions, contributions, or errata, please open an issue at https://gitlab.com/blueflyio/agent-platform/technical-docs/-/issues.