Kubernetes-Native Agent Orchestration: Custom Resources, Operators, and Cloud-Native Patterns for AI Agent Deployment
BlueFly.io Agent Platform -- Whitepaper Series #4
Version: 1.0
Date: February 2026
Classification: Technical Reference
Audience: Platform Engineers, SREs, AI Infrastructure Architects
Abstract
The convergence of autonomous AI agents and cloud-native infrastructure presents both an unprecedented opportunity and a formidable engineering challenge. As organizations scale from single-agent prototypes to fleets of hundreds of cooperating agents, the operational complexity of provisioning, scaling, monitoring, and securing these workloads demands a principled orchestration layer. Kubernetes, with its declarative state model, extensible API machinery, and mature ecosystem, provides a natural substrate for this orchestration.
This whitepaper presents a complete architecture for Kubernetes-native AI agent orchestration. We define Custom Resource Definitions (CRDs) that encode agent specifications, pool topologies, and workflow graphs as first-class Kubernetes objects. We detail an Operator pattern that implements a full reconciliation loop with state-machine semantics, leader election for high availability, and graceful degradation under failure. We address the unique scaling requirements of AI workloads through Horizontal Pod Autoscalers driven by custom metrics (tokens per second, queue depth, active tasks), Vertical Pod Autoscalers for right-sizing GPU memory allocations, and KEDA-based event-driven scaling for bursty inference workloads. Networking patterns encompass service mesh integration with mutual TLS, gRPC load balancing, and network policy isolation. Storage strategies cover persistent vector databases via StatefulSets, ephemeral scratch volumes for model weights, and CSI driver selection for throughput-sensitive workloads. Observability is treated as a first-class concern through OpenTelemetry instrumentation, Prometheus metrics pipelines, Grafana dashboards, and structured logging with Loki. We extend the architecture to multi-cluster federation for geographic distribution and regulatory compliance, and harden the deployment with Pod Security Standards, gVisor sandboxing, OPA policy enforcement, and least-privilege RBAC. A reference architecture for a 50-agent production deployment provides concrete manifests, cost models, and capacity planning formulas. Throughout, we ground our recommendations in production experience operating agent fleets at scale and in the broader cloud-native community's best practices as codified by the CNCF.
The architecture described herein is aligned with the Open Standard for Sustainable Agents (OSSA) v0.3.3 specification and the BlueFly.io Agent Platform's separation-of-duties model. All Kubernetes manifests, CRD schemas, and operator pseudocode are provided as actionable reference implementations.
1. Why Kubernetes for AI Agents
1.1 The Operational Gap
The AI agent landscape has evolved rapidly from research prototypes to production systems that must meet enterprise reliability standards. An agent that performs well in a notebook or a single-process deployment quickly encounters operational challenges when deployed at scale: how do you restart it when it crashes? How do you scale it when load increases? How do you roll out a new model version without downtime? How do you enforce resource limits so that a runaway agent does not consume an entire cluster's GPU allocation?
These are precisely the problems that container orchestration platforms were designed to solve. Kubernetes, as the dominant container orchestration platform with 96% of organizations either using or evaluating it according to the CNCF Annual Survey 2025, provides a battle-tested foundation for addressing these operational concerns.
1.2 Declarative Desired State and Agent Specifications
The fundamental insight that makes Kubernetes suitable for agent orchestration is its declarative model. Rather than writing imperative scripts that specify how to deploy an agent (start process A, then configure network B, then attach volume C), operators declare what they want (an agent with these capabilities, this model, these resource limits, and this scaling policy) and the Kubernetes control plane continuously reconciles the actual state of the world with the desired state.
This model maps naturally to agent specifications. An OSSA agent manifest already declares the agent's identity, capabilities, access tier, and resource requirements in a declarative format. A Kubernetes CRD extends this with operational semantics: replica count, health check endpoints, scaling triggers, affinity rules, and upgrade strategies. The resulting object is simultaneously a complete description of what the agent is (its functional specification) and how it should be operated (its operational specification).
Declarative Alignment:
OSSA Manifest Kubernetes CRD
-------------- --------------
agent.name ---> metadata.name
agent.capabilities ---> spec.capabilities[]
agent.model ---> spec.runtime.model
agent.tier ---> spec.security.accessTier
(not specified) ---> spec.scaling (HPA policy)
(not specified) ---> spec.resources (CPU/GPU/memory)
(not specified) ---> spec.networking (service mesh)
(not specified) ---> spec.observability (metrics/tracing)
Figure 1: Declarative alignment between OSSA agent manifests and Kubernetes CRDs.
1.3 Architecture Tiers and Cost Models
Not every organization requires a full multi-cluster federation with GPU scheduling and service mesh. We define three architecture tiers that allow organizations to adopt Kubernetes-native agent orchestration incrementally.
Table 1: Architecture Tiers
| Tier | Monthly Cost | Nodes | Agents | GPU | Features |
|---|---|---|---|---|---|
| Small | $100-500 | 1-3 | 1-10 | None/shared | CRDs, basic operator, HPA, Prometheus |
| Medium | $1,000-5,000 | 5-15 | 10-50 | 1-4 dedicated | Full operator, KEDA, Istio, VPA, OPA |
| Large | $10,000+ | 20-100+ | 50-500+ | 8+ dedicated | Multi-cluster, federation, custom schedulers |
The small tier is achievable on managed Kubernetes services (EKS, GKE, AKS) with a single node pool and provides the foundational CRD and operator patterns. The medium tier adds GPU scheduling, event-driven scaling, and service mesh security. The large tier extends to multi-cluster federation, custom scheduling algorithms, and dedicated GPU pools with preemption policies.
The cost formula for capacity planning at any tier is:
Monthly_Cost = (N_cpu_nodes * C_cpu) + (N_gpu_nodes * C_gpu) + (S_pv * C_storage) + (E_gb * C_egress) + C_managed
Where:
N_cpu_nodes = number of CPU-only nodes
C_cpu = cost per CPU node per month ($50-200 for cloud instances)
N_gpu_nodes = number of GPU nodes
C_gpu = cost per GPU node per month ($500-3000 depending on GPU class)
S_pv = total persistent volume storage in GB
C_storage = cost per GB per month ($0.10-0.30 for SSD, $0.04-0.08 for HDD)
E_gb = monthly egress in GB
C_egress = cost per GB egress ($0.08-0.12 for major cloud providers)
C_managed = managed K8s control plane cost ($0-74/mo depending on provider)
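As a worked example, the formula can be evaluated directly. The node, storage, and egress prices below are illustrative assumptions drawn from the ranges quoted above, not quotes from any particular provider.

```python
def monthly_cost(n_cpu_nodes, c_cpu, n_gpu_nodes, c_gpu,
                 s_pv_gb, c_storage, e_gb, c_egress, c_managed):
    # Monthly_Cost = (N_cpu_nodes * C_cpu) + (N_gpu_nodes * C_gpu)
    #              + (S_pv * C_storage) + (E_gb * C_egress) + C_managed
    return (n_cpu_nodes * c_cpu + n_gpu_nodes * c_gpu
            + s_pv_gb * c_storage + e_gb * c_egress + c_managed)

# Hypothetical medium-tier deployment: 8 CPU nodes at $120, 2 GPU nodes
# at $1,500, 500 GB SSD at $0.15/GB, 200 GB egress at $0.09/GB, and a
# $73/month managed control plane.
cost = monthly_cost(8, 120, 2, 1500, 500, 0.15, 200, 0.09, 73)
print(f"${cost:,.2f}/month")  # $4,126.00/month -- inside the medium tier's band
```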
1.4 Why Not Alternatives?
Before committing to Kubernetes, it is worth considering the alternatives.
Docker Compose: Suitable for single-machine deployments but lacks scheduling, scaling, self-healing, and multi-node support. Cannot handle GPU scheduling or affinity rules.
Nomad: A capable orchestrator with simpler operational characteristics than Kubernetes, but a significantly smaller ecosystem. The lack of CRD-equivalent extensibility means agent specifications must be encoded as job metadata rather than first-class API objects.
Serverless (Lambda/Cloud Run): Attractive for stateless, short-lived inference workloads but fundamentally misaligned with long-running, stateful agents that maintain conversation context, vector store connections, and tool registrations. Cold start latencies of 1-10 seconds are unacceptable for real-time agent interactions.
Custom Orchestration: Building a bespoke orchestration layer is always an option, but it means reimplementing scheduling, health checking, scaling, networking, storage, and observability from scratch. The engineering cost is prohibitive for all but the largest organizations.
Kubernetes occupies the sweet spot: a mature, extensible platform with a vast ecosystem of integrations, supported by every major cloud provider, and designed from the ground up for the kind of declarative, self-healing infrastructure that agent orchestration demands.
2. Custom Resource Definitions
2.1 The Agent CRD
The Agent CRD is the foundational building block of the orchestration layer. It extends the Kubernetes API with a new resource type that encodes everything the operator needs to know about an agent: its runtime configuration, resource requirements, scaling policy, security posture, and observability settings.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: agents.ossa.ai
  annotations:
    api-approved.kubernetes.io: "https://ossa.ai/api-review/agents"
spec:
  group: ossa.ai
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
        scale:
          specReplicasPath: .spec.scaling.minReplicas
          statusReplicasPath: .status.replicas
      additionalPrinterColumns:
        - name: State
          type: string
          jsonPath: .status.state
        - name: Replicas
          type: integer
          jsonPath: .status.replicas
        - name: Model
          type: string
          jsonPath: .spec.runtime.model
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
      schema:
        openAPIV3Schema:
          type: object
          required: [spec]
          properties:
            spec:
              type: object
              required: [runtime, capabilities]
              properties:
                runtime:
                  type: object
                  required: [model, image]
                  properties:
                    model:
                      type: string
                      description: "LLM model identifier"
                      pattern: "^[a-z0-9-]+/[a-z0-9._-]+:[a-z0-9._-]+$"
                    image:
                      type: string
                      description: "Container image for the agent runtime"
                    command:
                      type: array
                      items:
                        type: string
                    env:
                      type: array
                      items:
                        type: object
                        properties:
                          name:
                            type: string
                          value:
                            type: string
                          valueFrom:
                            type: object
                            properties:
                              secretKeyRef:
                                type: object
                                properties:
                                  name:
                                    type: string
                                  key:
                                    type: string
                    providerEndpoint:
                      type: string
                      format: uri
                    maxTokensPerRequest:
                      type: integer
                      minimum: 1
                      maximum: 200000
                      default: 4096
                    temperature:
                      type: number
                      minimum: 0.0
                      maximum: 2.0
                      default: 0.7
                capabilities:
                  type: array
                  minItems: 1
                  items:
                    type: object
                    required: [name, type]
                    properties:
                      name:
                        type: string
                      type:
                        type: string
                        enum: [tool, skill, protocol, sensor]
                      version:
                        type: string
                      config:
                        type: object
                        x-kubernetes-preserve-unknown-fields: true
                resources:
                  type: object
                  properties:
                    requests:
                      type: object
                      properties:
                        cpu:
                          type: string
                          pattern: "^[0-9]+m?$"
                        memory:
                          type: string
                          pattern: "^[0-9]+(Mi|Gi)$"
                        nvidia.com/gpu:
                          type: integer
                          minimum: 0
                    limits:
                      type: object
                      properties:
                        cpu:
                          type: string
                        memory:
                          type: string
                        nvidia.com/gpu:
                          type: integer
                scaling:
                  type: object
                  properties:
                    minReplicas:
                      type: integer
                      minimum: 0
                      default: 1
                    maxReplicas:
                      type: integer
                      minimum: 1
                      default: 10
                    metrics:
                      type: array
                      items:
                        type: object
                        properties:
                          type:
                            type: string
                            enum: [cpu, memory, custom, external]
                          name:
                            type: string
                          target:
                            type: object
                            properties:
                              type:
                                type: string
                                enum: [Utilization, AverageValue, Value]
                              averageValue:
                                type: string
                              averageUtilization:
                                type: integer
                    scaleDownStabilization:
                      type: integer
                      default: 300
                      description: "Seconds to wait before scaling down"
                security:
                  type: object
                  properties:
                    accessTier:
                      type: string
                      enum: [tier_1_read, tier_2_write_limited, tier_3_full_access, tier_4_policy]
                      default: tier_1_read
                    runAsNonRoot:
                      type: boolean
                      default: true
                    readOnlyRootFilesystem:
                      type: boolean
                      default: true
                    runtimeClass:
                      type: string
                      description: "RuntimeClass name (e.g., gvisor for sandboxing)"
                    networkPolicy:
                      type: object
                      properties:
                        allowEgress:
                          type: array
                          items:
                            type: object
                            properties:
                              host:
                                type: string
                              port:
                                type: integer
                        denyIngress:
                          type: boolean
                          default: false
                observability:
                  type: object
                  properties:
                    metricsPort:
                      type: integer
                      default: 9090
                    metricsPath:
                      type: string
                      default: "/metrics"
                    tracingEnabled:
                      type: boolean
                      default: true
                    logLevel:
                      type: string
                      enum: [debug, info, warn, error]
                      default: info
            status:
              type: object
              properties:
                state:
                  type: string
                  enum: [Pending, Initializing, Running, Degraded, Terminating, Failed]
                replicas:
                  type: integer
                readyReplicas:
                  type: integer
                lastTransitionTime:
                  type: string
                  format: date-time
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum: ["True", "False", "Unknown"]
                      reason:
                        type: string
                      message:
                        type: string
                      lastTransitionTime:
                        type: string
                        format: date-time
                metrics:
                  type: object
                  properties:
                    tokensPerSecond:
                      type: number
                    activeTaskCount:
                      type: integer
                    averageLatencyMs:
                      type: number
                    errorRate:
                      type: number
  scope: Namespaced
  names:
    plural: agents
    singular: agent
    kind: Agent
    shortNames:
      - ag
    categories:
      - ossa
      - ai
2.2 The AgentPool CRD
While individual Agent resources describe single agent types, the AgentPool CRD manages a logical group of agents that share infrastructure resources and scaling policies. An AgentPool defines node affinity, GPU allocation strategies, and pool-level resource quotas.
apiVersion: ossa.ai/v1
kind: AgentPool
metadata:
  name: inference-pool
  namespace: agent-system
spec:
  nodeSelector:
    accelerator: nvidia-a100
    topology.kubernetes.io/zone: us-east-1a
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  resourceQuota:
    requests.cpu: "64"
    requests.memory: "256Gi"
    requests.nvidia.com/gpu: "8"
    limits.cpu: "128"
    limits.memory: "512Gi"
    limits.nvidia.com/gpu: "8"
  agents:
    - name: code-reviewer
      replicas: 3
    - name: security-scanner
      replicas: 2
    - name: test-generator
      replicas: 5
  scheduling:
    strategy: BinPacking
    preemptionPolicy: PreemptLowerPriority
    priorityClassName: high-priority-agents
2.3 The AgentWorkflow CRD
The AgentWorkflow CRD encodes multi-agent workflows as directed acyclic graphs (DAGs) with typed edges representing data flow between agents.
apiVersion: ossa.ai/v1
kind: AgentWorkflow
metadata:
  name: code-review-pipeline
  namespace: agent-system
spec:
  entrypoint: analyze
  timeout: 600
  retryPolicy:
    maxRetries: 3
    backoff: exponential
  steps:
    - name: analyze
      agentRef: code-analyzer
      inputs:
        - name: repository
          type: git-url
      outputs:
        - name: analysis-report
          type: json
      next:
        - review
        - security-scan
    - name: review
      agentRef: code-reviewer
      inputs:
        - name: analysis-report
          fromStep: analyze
      outputs:
        - name: review-comments
          type: json
      next:
        - aggregate
    - name: security-scan
      agentRef: security-scanner
      inputs:
        - name: analysis-report
          fromStep: analyze
      outputs:
        - name: security-findings
          type: json
      next:
        - aggregate
    - name: aggregate
      agentRef: report-aggregator
      inputs:
        - name: review-comments
          fromStep: review
        - name: security-findings
          fromStep: security-scan
      outputs:
        - name: final-report
          type: json
  onFailure:
    step: notify-team
    agentRef: notification-agent
2.4 etcd Storage Considerations
Every CRD instance is stored in etcd as a key-value pair. The storage footprint per agent resource can be estimated as:
storage_per_agent = base_overhead + spec_size + status_size
Where:
base_overhead = ~1.5 KB (key prefix, metadata, timestamps, resourceVersion)
spec_size = 0.5 - 3.0 KB (depending on capabilities list and env vars)
status_size = 0.2 - 1.0 KB (conditions array, metrics snapshot)
Total per agent: ~2 - 5.5 KB (typically 2-4 KB)
For a deployment of 500 agents with an average total footprint of 5 KB per agent (a 3 KB spec plus base overhead and status), the total etcd storage is approximately 2.5 MB, well within etcd's recommended maximum database size of 8 GB. The watch event rate is the more significant factor: every status update generates a watch event that all operator replicas must process. At 500 agents updating status every 30 seconds, this produces approximately 17 events per second, which is comfortably within etcd's operating range of several thousand events per second.
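These estimates follow directly from the formula above; the sketch below uses the midpoint figures quoted in the text and is a planning heuristic, not a measurement.

```python
def etcd_footprint_mb(n_agents, base_kb=1.5, spec_kb=3.0, status_kb=0.5):
    # storage_per_agent = base_overhead + spec_size + status_size
    return n_agents * (base_kb + spec_kb + status_kb) / 1024.0

def watch_events_per_second(n_agents, status_interval_s=30):
    # One watch event per agent per status-update interval.
    return n_agents / status_interval_s

print(f"{etcd_footprint_mb(500):.2f} MB")          # 2.44 MB
print(f"{watch_events_per_second(500):.1f} events/s")  # 16.7 events/s
```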
CRD Lifecycle Data Flow:
User/CI API Server etcd Operator Kubelet
| | | | |
|--- apply Agent CR -->| | | |
| |--- store ------->| | |
| |--- watch event ----------------->| |
| | | |--- reconcile |
| | | | observe |
| | | | diff |
| | | | act ------->|
| | | | |--- create pod
| | | | |--- pull image
| | | | |--- start container
| |<-- status update ---------------| |
| |--- store ------->| | |
|<-- event notification| | | |
Figure 2: CRD lifecycle data flow from creation through reconciliation to pod scheduling.
2.5 Versioning and Migration
CRD versioning follows the Kubernetes API versioning convention. When the schema evolves (for example, when a new field is added to the agent specification), a new API version is introduced and the previous one is eventually deprecated (e.g., ossa.ai/v1beta2 is promoted to ossa.ai/v1). A conversion webhook ensures that existing resources are served in whichever version a client requests, without downtime. The operator maintains backward compatibility by supporting reads from all served versions and writing only the storage version.
3. Agent Operator Pattern
3.1 Operator Architecture
The Agent Operator is a Kubernetes controller that watches for changes to Agent, AgentPool, and AgentWorkflow resources and reconciles the cluster state to match the declared specifications. It is built using the Operator SDK framework, which provides scaffolding for controller registration, leader election, metrics exposition, and webhook configuration.
The operator runs as a Deployment with multiple replicas for high availability, but only one replica (the leader) actively processes reconciliation events at any given time. The remaining replicas stand by as hot standbys, ready to assume leadership within seconds if the leader fails.
Operator Architecture:
+------------------------------------------------------------------+
| Agent Operator Deployment |
| |
| +-------------------+ +-------------------+ +---------------+ |
| | Replica 1 | | Replica 2 | | Replica 3 | |
| | (LEADER) | | (STANDBY) | | (STANDBY) | |
| | | | | | | |
| | +-------------+ | | +-------------+ | | +----------+ | |
| | | Reconciler | | | | Reconciler | | | |Reconciler| | |
| | | Loop | | | | (paused) | | | |(paused) | | |
| | +------+------+ | | +-------------+ | | +----------+ | |
| | | | | | | | |
| | +------v------+ | | | | | |
| | | State | | | | | | |
| | | Machine | | | | | | |
| | +------+------+ | | | | | |
| | | | | | | | |
| | +------v------+ | | | | | |
| | | K8s Client | | | | | | |
| | +-------------+ | | | | | |
| +-------------------+ +-------------------+ +---------------+ |
| |
| +-------------------------------------------------------------+ |
| | Leader Election (Lease) | |
| +-------------------------------------------------------------+ |
+------------------------------------------------------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
| Agent CRs | | AgentPool CRs | | AgentWorkflow |
| (watch) | | (watch) | | CRs (watch) |
+------------------+ +------------------+ +------------------+
Figure 3: Operator architecture with leader election and multi-replica standby.
3.2 Reconciliation Loop
The reconciliation loop is the heart of the operator. It follows the standard observe-diff-act pattern, but with agent-specific logic for model loading, capability registration, and health assessment.
Reconciliation Pseudocode:
function reconcile(agent: Agent) -> Result {
// OBSERVE: Gather current state
currentPods = listPods(labelSelector: agent.metadata.name)
currentService = getService(agent.metadata.name)
currentHPA = getHPA(agent.metadata.name)
currentNetworkPolicy = getNetworkPolicy(agent.metadata.name)
// DIFF: Compare desired vs actual
desiredReplicas = agent.spec.scaling.minReplicas
actualReplicas = len(currentPods.filter(phase == Running))
desiredImage = agent.spec.runtime.image
actualImages = currentPods.map(p => p.spec.containers[0].image).unique()
desiredModel = agent.spec.runtime.model
actualModelStatus = currentPods.map(p => p.annotations["ossa.ai/model-loaded"])
// ACT: Apply changes based on diff
// Phase 1: Ensure base resources exist
if currentService == null {
createService(agent)
updateStatus(agent, state: "Initializing", reason: "CreatingService")
return requeue(after: 5s)
}
if currentNetworkPolicy == null && agent.spec.security.networkPolicy != null {
createNetworkPolicy(agent)
}
// Phase 2: Pod management
if actualReplicas < desiredReplicas {
// Scale up: create pods with anti-affinity for spread
deficit = desiredReplicas - actualReplicas
for i in range(deficit) {
pod = buildAgentPod(agent, ordinal: actualReplicas + i)
applySecurityContext(pod, agent.spec.security)
applyResourceLimits(pod, agent.spec.resources)
injectObservabilitySidecar(pod, agent.spec.observability)
createPod(pod)
}
updateStatus(agent, state: "Initializing", reason: "ScalingUp")
return requeue(after: 15s)
}
if actualReplicas > desiredReplicas {
// Scale down: terminate excess pods (newest first)
excess = actualReplicas - desiredReplicas
podsToTerminate = currentPods.sortBy(creationTimestamp, desc).take(excess)
for pod in podsToTerminate {
// Graceful shutdown: drain active tasks first
drainAgent(pod, timeout: 60s)
deletePod(pod)
}
updateStatus(agent, state: "Running", reason: "ScalingDown")
return requeue(after: 30s)
}
// Phase 3: Image/model update (rolling update)
if len(actualImages) > 0 && actualImages[0] != desiredImage {
performRollingUpdate(agent, currentPods, desiredImage)
updateStatus(agent, state: "Initializing", reason: "RollingUpdate")
return requeue(after: 10s)
}
// Phase 4: Health assessment
healthyPods = currentPods.filter(p => p.status.conditions.ready == true)
unhealthyPods = currentPods.filter(p => p.status.conditions.ready == false)
if len(unhealthyPods) > 0 && len(healthyPods) < desiredReplicas {
updateStatus(agent, state: "Degraded",
reason: fmt("{} of {} replicas unhealthy", len(unhealthyPods), desiredReplicas))
// Attempt recovery for pods stuck in CrashLoopBackOff
for pod in unhealthyPods {
if pod.status.containerStatuses[0].restartCount > 5 {
deletePod(pod) // Let the next reconciliation recreate it
}
}
return requeue(after: 30s)
}
// Phase 5: HPA management
if agent.spec.scaling.metrics != null && len(agent.spec.scaling.metrics) > 0 {
if currentHPA == null {
createHPA(agent)
} else {
updateHPA(agent, currentHPA)
}
}
// Phase 6: Steady state
updateStatus(agent, state: "Running",
replicas: len(healthyPods),
readyReplicas: len(healthyPods),
metrics: collectMetrics(healthyPods))
return requeue(after: 60s) // Periodic reconciliation
}
3.3 State Machine
The agent lifecycle is modeled as a finite state machine with well-defined transitions and invariants.
Table 2: Agent State Machine Transitions
| Current State | Event | Next State | Actions |
|---|---|---|---|
| (none) | CR created | Pending | Validate spec, set initial status |
| Pending | Resources available | Initializing | Create Service, Pods, NetworkPolicy |
| Pending | Resources unavailable | Pending | Set condition "ResourcesUnavailable" |
| Initializing | All pods ready | Running | Enable HPA, register in mesh |
| Initializing | Pod failure | Degraded | Log error, attempt restart |
| Initializing | Timeout (5 min) | Failed | Set condition "InitializationTimeout" |
| Running | Health check pass | Running | Update metrics in status |
| Running | Partial pod failure | Degraded | Scale replacement, alert |
| Running | All pods fail | Failed | Attempt full restart |
| Running | Spec change | Initializing | Begin rolling update |
| Running | CR deleted | Terminating | Drain tasks, delete resources |
| Degraded | Recovery | Running | Clear degraded condition |
| Degraded | Persistent failure | Failed | Escalate alert, stop retries |
| Degraded | CR deleted | Terminating | Force delete resources |
| Terminating | All resources deleted | (none) | Remove finalizer |
| Failed | User intervention | Pending | Reset state, retry |
| Failed | CR deleted | Terminating | Cleanup remaining resources |
State Machine Diagram:
+----------+
create --> | Pending |
+----+-----+
|
resources | resources
available | unavailable
| (loop)
+----v-----+
|Initializ-|
| ing |
+----+-----+
/ | \
all pods/ | \timeout
ready / |pod \
/ |failure \
+----v---+ +---v----+ +-v------+
| Running| |Degraded| | Failed |
+----+---+ +---+----+ +---+----+
| | |
spec | recovery | user |
change| | action |
| | |
+-----------+----------+
|
CR deleted
|
+------v------+
| Terminating |
+------+------+
|
resources
cleaned up
|
(removed)
Figure 4: Agent lifecycle state machine with transitions.
3.4 Leader Election
The operator uses Kubernetes Lease objects for leader election. The leader acquires a lease with a configurable duration (default: 15 seconds) and renews it periodically (default: every 10 seconds). If the leader fails to renew the lease, another replica acquires it within the lease duration plus a brief jitter period.
The leader election configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-operator
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-operator
  template:
    metadata:
      labels:
        app: agent-operator
    spec:
      serviceAccountName: agent-operator
      containers:
        - name: operator
          image: registry.gitlab.com/blueflyio/agent-operator:v1.0.0
          args:
            - --leader-elect=true
            - --leader-election-id=agent-operator-leader
            - --leader-election-namespace=agent-system
            - --leader-election-lease-duration=15s
            - --leader-election-renew-deadline=10s
            - --leader-election-retry-period=2s
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
          livenessProbe:
            httpGet:
              path: /healthz
              port: health
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
The maximum failover time can be calculated as:
T_failover = lease_duration + retry_period + reconciliation_backoff
= 15s + 2s + 5s
= 22 seconds (worst case)
In practice, failover typically completes within 10-15 seconds because the standby replica's lease acquisition attempt aligns with the expired lease boundary.
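The worst-case bound can be checked against the leader-election flags in the Deployment above:

```python
def worst_case_failover_s(lease_duration, retry_period, reconcile_backoff):
    # T_failover = lease_duration + retry_period + reconciliation_backoff
    return lease_duration + retry_period + reconcile_backoff

# Values from the operator args: 15s lease, 2s retry period, plus an
# assumed 5s reconciliation backoff on the newly elected leader.
print(worst_case_failover_s(15, 2, 5))  # 22
```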
3.5 Finalizers and Graceful Cleanup
The operator attaches a finalizer (ossa.ai/agent-cleanup) to every Agent resource. When the user deletes an Agent CR, Kubernetes marks it for deletion but does not remove it from etcd until all finalizers are cleared. The operator's reconciliation loop detects the deletion timestamp, transitions the agent to the Terminating state, drains active tasks from all pods, deletes subordinate resources (Pods, Services, HPAs, NetworkPolicies), and finally removes the finalizer, allowing Kubernetes to complete the deletion.
This ensures that active agent tasks are not abruptly terminated and that orphaned resources are not left in the cluster.
4. Agent Scaling
4.1 Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler adjusts the number of agent replicas based on observed metrics. For AI agents, the most relevant metrics are not traditional CPU and memory utilization but rather domain-specific metrics like tokens per second, active task count, and request queue depth.
The HPA scaling formula is:
desiredReplicas = ceil(currentMetricValue / targetMetricValue * currentReplicas)
Stabilization (to prevent replica-count flapping):
scaleUp: min(recommendations over the last scaleUpStabilization seconds)
scaleDown: max(recommendations over the last scaleDownStabilization seconds)
With tolerance band (default 10%):
if abs(1 - currentMetricValue/targetMetricValue) < 0.1:
desiredReplicas = currentReplicas // no change (within tolerance)
For example, if an agent is currently running 3 replicas, each averaging 150 tokens per second against a target of 60 tokens per second per replica:
desiredReplicas = ceil(3 * 150 / 60) = ceil(7.5) = 8
When scaling down, the stabilization window prevents premature scale-down by selecting the highest recommendation in the window:
desiredReplicas = max(recommendations[last 300s])
// If recommendations over the last 5 minutes were [8, 7, 6, 5, 5]:
desiredReplicas = 8 // scale-down deferred until 8 ages out of the window
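The calculation above can be sketched in a few lines. This is a simplification of the real HPA controller; note that the upstream controller's scale-down stabilization selects the highest recommendation in the window (the conservative choice that defers shrinkage).

```python
import math

def desired_replicas(current_replicas, current_avg, target, tolerance=0.1):
    # desiredReplicas = ceil(currentReplicas * currentMetricValue / target),
    # with no change while the ratio sits inside the tolerance band.
    ratio = current_avg / target
    if abs(1.0 - ratio) < tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

def stabilized_scale_down(window_recommendations):
    # Take the largest recommendation in the stabilization window.
    return max(window_recommendations)

print(desired_replicas(3, 150, 60))            # 8
print(desired_replicas(3, 62, 60))             # 3 (within 10% tolerance)
print(stabilized_scale_down([8, 7, 6, 5, 5]))  # 8
```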
4.2 Custom Metrics for Agent Workloads
Standard CPU and memory metrics are insufficient for intelligent agent scaling. The following custom metrics provide the signals needed for responsive, cost-effective scaling.
Table 3: Custom Metrics for Agent Scaling
| Metric | Type | Description | Target Range | Scaling Behavior |
|---|---|---|---|---|
| agent_tokens_per_second | Pods | Token throughput per replica | 50-100 tps | Scale up when throughput saturates |
| agent_active_tasks | Pods | Currently executing tasks per replica | 1-5 tasks | Scale up when concurrency is high |
| agent_queue_depth | External | Pending tasks in message queue | 0-10 items | Scale up proactively before saturation |
| agent_request_latency_p99 | Pods | 99th percentile response latency | < 2000 ms | Scale up when latency degrades |
| agent_error_rate | Pods | Error rate over 5-minute window | < 0.01 (1%) | Scale up if errors stem from overload |
| agent_gpu_utilization | Pods | GPU compute utilization percentage | 60-80% | Scale up for GPU-bound workloads |
| agent_model_cache_hit_rate | Pods | KV cache hit rate for model inference | > 0.90 (90%) | Scale up if cache pressure is high |
4.3 Vertical Pod Autoscaler (VPA)
While HPA adjusts replica count, VPA adjusts the resource requests and limits of individual pods. For AI agents, VPA is particularly valuable for right-sizing GPU memory allocations. A model that was initially allocated 16 GB of GPU memory may only use 11 GB in practice; VPA can reduce the request to 12 GB (with headroom), freeing 4 GB for other workloads on the same GPU node.
VPA operates in three modes:
- Off: VPA generates recommendations but does not apply them. Useful for initial observation.
- Initial: VPA sets resource requests only at pod creation time. No disruptive restarts.
- Auto: VPA evicts and recreates pods when resource requests need significant adjustment.
For agent workloads, the Initial mode is recommended for production because it avoids disruptive pod restarts that would interrupt active agent tasks. The Auto mode is suitable for development and staging environments where task interruption is acceptable.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: code-reviewer-vpa
  namespace: agent-system
spec:
  targetRef:
    apiVersion: ossa.ai/v1
    kind: Agent
    name: code-reviewer
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: agent
        minAllowed:
          cpu: 250m
          memory: 512Mi
        maxAllowed:
          cpu: 4
          memory: 16Gi
          nvidia.com/gpu: 1
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits
4.4 KEDA for Event-Driven Scaling
KEDA (Kubernetes Event-Driven Autoscaling) extends Kubernetes scaling beyond metrics to event sources. For agent workloads, KEDA is invaluable for scaling based on message queue depth, enabling agents to scale from zero when no tasks are pending and scale up rapidly when a burst of tasks arrives.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: code-reviewer-scaler
  namespace: agent-system
spec:
  scaleTargetRef:
    apiVersion: ossa.ai/v1
    kind: Agent
    name: code-reviewer
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 20
  fallback:
    failureThreshold: 3
    replicas: 2
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: agent-tasks-code-review
        mode: QueueLength
        value: "5"
      authenticationRef:
        name: rabbitmq-auth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: agent_active_tasks
        query: |
          sum(agent_active_tasks{agent="code-reviewer"})
        threshold: "10"
    - type: cron
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5
        end: 0 18 * * 1-5
        desiredReplicas: "3"
This configuration enables sophisticated scaling behavior. During business hours (8 AM to 6 PM Eastern Time, Monday through Friday), at least 3 replicas are maintained. Outside business hours, the agent scales to zero if no tasks are pending. When tasks arrive in the RabbitMQ queue, the agent scales up by one replica for every 5 pending tasks. The Prometheus trigger provides an additional signal based on active task concurrency across all replicas.
4.5 GPU Scheduling
GPU scheduling in Kubernetes requires the NVIDIA device plugin, which exposes nvidia.com/gpu as a schedulable resource. GPU allocation is binary at the device level: a pod either gets an entire GPU or none (fractional GPU sharing via MIG or time-slicing requires additional configuration).
Table 4: GPU Scheduling Strategies
| Strategy | Configuration | Use Case | Efficiency |
|---|---|---|---|
| Exclusive | nvidia.com/gpu: 1 | Large models (>10B params) | 40-70% utilization |
| MIG (Multi-Instance GPU) | nvidia.com/mig-3g.20gb: 1 | Medium models, multiple agents | 70-85% utilization |
| Time-Slicing | nvidia.com/gpu: 1 + time-slicing config | Small models, cost-sensitive | 80-95% utilization |
| vGPU | NVIDIA vGPU license | Enterprise, guaranteed SLAs | 60-80% utilization |
For agent workloads, MIG partitioning on A100/H100 GPUs provides the best balance of isolation and efficiency. A single A100 80GB can be partitioned into seven 10 GB instances, each running a separate agent with hardware-level memory isolation.
The GPU utilization efficiency formula:
GPU_efficiency = (sum(agent_gpu_compute_time) / (N_gpus * wall_clock_time)) * 100%
Cost_per_token = (GPU_cost_per_hour / 3600) / tokens_per_second
For example, an A100 at $3.00/hour processing 500 tokens/second:
Cost_per_token = ($3.00 / 3600) / 500 = $0.00000167 per token
Cost_per_million_tokens = $1.67
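The two formulas translate directly into code; a minimal sketch reproducing the example above:

```python
def cost_per_token(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost_per_token = (GPU_cost_per_hour / 3600) / tokens_per_second."""
    return (gpu_cost_per_hour / 3600.0) / tokens_per_second

def gpu_efficiency_percent(agent_gpu_compute_seconds: list,
                           n_gpus: int, wall_clock_seconds: float) -> float:
    """GPU_efficiency = sum(agent compute time) / (N_gpus * wall clock) * 100%."""
    return sum(agent_gpu_compute_seconds) / (n_gpus * wall_clock_seconds) * 100.0

# A100 at $3.00/hour processing 500 tokens/second:
per_million = cost_per_token(3.00, 500) * 1_000_000  # ~= $1.67 per million tokens
```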
5. Networking and Service Mesh
5.1 Kubernetes Service Model for Agents
Each Agent resource is backed by a Kubernetes Service that provides stable DNS-based discovery and load balancing. The operator creates a ClusterIP service for internal communication and optionally a headless service for StatefulSet-based agents that require stable network identities.
apiVersion: v1
kind: Service
metadata:
  name: code-reviewer
  namespace: agent-system
  labels:
    ossa.ai/agent: code-reviewer
    ossa.ai/type: inference
spec:
  selector:
    ossa.ai/agent: code-reviewer
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP
    - name: http
      port: 8080
      targetPort: 8080
      protocol: TCP
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
  type: ClusterIP
5.2 Service Mesh Integration
For production deployments, a service mesh (Istio or Linkerd) provides critical capabilities that are difficult to implement at the application level: mutual TLS for all inter-agent communication, fine-grained traffic management, circuit breaking, and distributed tracing.
Istio's sidecar proxy (Envoy) automatically encrypts all traffic between agent pods using mutual TLS, eliminating the need for agents to manage their own TLS certificates. The mesh also provides L7 load balancing for gRPC, which is essential because Kubernetes' default load balancing operates at the connection (L4) level: a long-lived HTTP/2 connection is pinned to a single backend, so every gRPC call multiplexed over it lands on the same pod. Envoy balances individual HTTP/2 requests, spreading gRPC calls across all replicas.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: agent-mtls
  namespace: agent-system
spec:
  mtls:
    mode: STRICT
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: code-reviewer-lb
  namespace: agent-system
spec:
  host: code-reviewer.agent-system.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 0
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: code-reviewer-routing
  namespace: agent-system
spec:
  hosts:
    - code-reviewer.agent-system.svc.cluster.local
  http:
    - match:
        - headers:
            x-agent-version:
              exact: "v2"
      route:
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v2
          weight: 100
    - route:
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v1
          weight: 90
        - destination:
            host: code-reviewer.agent-system.svc.cluster.local
            subset: v2
          weight: 10
5.3 Network Policies
Network policies implement microsegmentation, ensuring that each agent can only communicate with the services it is authorized to access. The operator generates network policies automatically based on the agent's capability declarations and access tier.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: code-reviewer-netpol
  namespace: agent-system
spec:
  podSelector:
    matchLabels:
      ossa.ai/agent: code-reviewer
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              ossa.ai/type: orchestrator
        - podSelector:
            matchLabels:
              ossa.ai/agent: report-aggregator
      ports:
        - port: 50051
          protocol: TCP
        - port: 8080
          protocol: TCP
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: qdrant
      ports:
        - port: 6334
          protocol: TCP
    - to:
        - namespaceSelector:
            matchLabels:
              name: llm-providers
      ports:
        - port: 443
          protocol: TCP
    - to:
        - podSelector:
            matchLabels:
              app: prometheus
          namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - port: 9090
          protocol: TCP
5.4 Ingress and External Access
External access to agent services is provided through an Ingress controller with TLS termination. For production deployments, we recommend dedicated ingress resources per agent group rather than a single ingress with path-based routing, to provide isolation and independent scaling of ingress capacity.
Agent Networking Data Flow:
External Client
|
| HTTPS (TLS 1.3)
v
+------------------+
| Ingress |
| Controller |
| (nginx/envoy) |
+--------+---------+
|
| HTTP/2 (plaintext, within cluster)
v
+------------------+
| Istio Ingress |
| Gateway |
+--------+---------+
|
| mTLS (Istio-managed certificates)
v
+------------------+ mTLS +------------------+
| Agent Pod A |<------------>| Agent Pod B |
| (code-reviewer) | | (security-scan) |
| +-------------+ | | +-------------+ |
| | Envoy Proxy | | | | Envoy Proxy | |
| +------+------+ | | +------+------+ |
| | | | | |
| +------v------+ | | +------v------+ |
| | Agent | | | | Agent | |
| | Container | | | | Container | |
| +-------------+ | | +-------------+ |
+------------------+ +------------------+
| |
| mTLS | mTLS
v v
+------------------+ +------------------+
| Qdrant | | Prometheus |
| (Vector DB) | | (Monitoring) |
+------------------+ +------------------+
Figure 5: Agent networking data flow with service mesh mTLS.
6. Storage and Persistence
6.1 Storage Requirements for AI Agents
AI agents have diverse storage requirements that span multiple access patterns and performance tiers.
Table 5: Agent Storage Requirements
| Storage Type | Access Pattern | Performance | Persistence | Example Use |
|---|---|---|---|---|
| Model weights | Read-heavy, sequential | High throughput (1+ GB/s) | Ephemeral (cacheable) | LLM model files |
| Vector indices | Read-write, random | High IOPS (3000+) | Persistent | Qdrant/Milvus data |
| Conversation state | Write-heavy, append | Medium IOPS (500+) | Persistent | Agent memory |
| Task queue | Read-write, FIFO | Low latency (< 1ms) | Semi-persistent | Pending tasks |
| Scratch/temp | Write-heavy, sequential | Medium throughput | Ephemeral | Intermediate results |
| Configuration | Read-only | Low | Persistent | Agent config, prompts |
6.2 Persistent Volume Claims
The operator creates PVCs based on the agent's storage declarations. For agents that require persistent state (vector databases, conversation history), the operator uses StorageClass selection to match performance requirements.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: code-reviewer-vector-store
  namespace: agent-system
  labels:
    ossa.ai/agent: code-reviewer
    ossa.ai/storage-type: vector-index
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd-high-iops
  resources:
    requests:
      storage: 50Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-high-iops
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "50"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
6.3 StatefulSets for Stateful Agents
Agents that maintain persistent state (such as vector database instances or agents with local model caches) are deployed as StatefulSets rather than Deployments. StatefulSets provide stable network identities (deterministic pod names like qdrant-0, qdrant-1) and ordered, graceful scaling that ensures data consistency.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant-vector-store
  namespace: agent-system
spec:
  serviceName: qdrant-headless
  replicas: 3
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.12.0
          ports:
            - containerPort: 6333
              name: http
            - containerPort: 6334
              name: grpc
            - containerPort: 6335
              name: internal
          volumeMounts:
            - name: qdrant-data
              mountPath: /qdrant/storage
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 16Gi
  volumeClaimTemplates:
    - metadata:
        name: qdrant-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: ssd-high-iops
        resources:
          requests:
            storage: 100Gi
6.4 CSI Driver Selection
The choice of CSI (Container Storage Interface) driver significantly impacts storage performance. The following performance reference figures, by storage tier, apply to agent workloads:
Storage Performance Reference:
SSD (io2/gp3):
IOPS: 3,000 - 64,000 (provisioned)
Throughput: 125 - 1,000 MB/s
Latency: < 1 ms (p99)
Cost: $0.125/GB/month + $0.065/provisioned-IOPS
HDD (st1/sc1):
IOPS: 250 - 500
Throughput: 20 - 500 MB/s
Latency: 5 - 10 ms (p99)
Cost: $0.025 - $0.045/GB/month
NFS (EFS/Filestore):
IOPS: varies (bursting)
Throughput: 50 - 1,000 MB/s (provisioned)
Latency: 2 - 10 ms (p99)
Cost: $0.30/GB/month (standard), $0.025/GB/month (infrequent access)
Access: ReadWriteMany (shared across pods)
Local NVMe (i3/i4i instances):
IOPS: 100,000 - 3,300,000
Throughput: 1,750 - 8,000 MB/s
Latency: < 0.1 ms (p99)
Cost: included in instance cost (ephemeral)
For vector database workloads that require high random IOPS, provisioned SSD (io2) is recommended. For model weight caching where sequential throughput matters more than IOPS, local NVMe provides the best performance at the lowest cost (since storage is included in the instance price), with the caveat that data is ephemeral and must be re-downloaded if the node is replaced.
6.5 Model Weight Distribution
Large model weights (ranging from 2 GB for 7B-parameter quantized models to 150+ GB for 70B-parameter full-precision models) present a unique storage challenge. Downloading weights from a remote registry (Hugging Face, S3) on every pod startup introduces unacceptable latency. The recommended approach is a tiered caching strategy:
- Cluster-level cache: A shared ReadWriteMany NFS volume mounted at /models/cache on all agent nodes, populated by a DaemonSet that pre-fetches model weights.
- Node-level cache: A hostPath or local PV mounted at /var/cache/models that persists across pod restarts on the same node.
- Pod-level init container: An init container that copies the required model weights from the node cache to the pod's ephemeral volume before the agent container starts.
Model loading time = max(download_time, 0) + copy_time + load_time
Where:
download_time = model_size / download_bandwidth (0 if cached)
copy_time = model_size / local_disk_bandwidth
load_time = model_size / memory_bandwidth + initialization_overhead
Example (13B model, 7.3 GB quantized):
Cold start: 7.3 GB / 100 MB/s + 7.3 GB / 2 GB/s + 7.3 GB / 20 GB/s + 2s
= 73s + 3.65s + 0.365s + 2s = ~79 seconds
Warm start: 0s + 3.65s + 0.365s + 2s = ~6 seconds
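The loading-time model above can be encoded directly; the default bandwidths below are the assumed figures from the worked example, not measured values:

```python
def model_load_seconds(model_gb: float, cached: bool,
                       download_gbps: float = 0.1,  # 100 MB/s from remote registry
                       disk_gbps: float = 2.0,      # local disk copy bandwidth
                       mem_gbps: float = 20.0,      # load into memory
                       init_s: float = 2.0) -> float:
    """Model loading time = download (0 if cached) + copy + load + init."""
    download = 0.0 if cached else model_gb / download_gbps
    copy = model_gb / disk_gbps
    load = model_gb / mem_gbps + init_s
    return download + copy + load

cold = model_load_seconds(7.3, cached=False)  # ~79 s
warm = model_load_seconds(7.3, cached=True)   # ~6 s
```

This makes it easy to test how sensitivity to download bandwidth dominates cold starts: halving download_gbps nearly doubles the cold-start time while leaving warm starts unchanged.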
7. Observability
7.1 The Three Pillars for Agent Workloads
Observability for AI agent workloads extends beyond traditional infrastructure monitoring. In addition to the standard three pillars (metrics, logs, traces), agent observability requires semantic-level understanding: what decisions did the agent make, what tools did it invoke, what was the quality of its output?
7.2 OpenTelemetry Instrumentation
The OpenTelemetry Collector runs as a DaemonSet on every node, receiving telemetry from agent pods via OTLP (OpenTelemetry Protocol) and routing it to the appropriate backends.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: agent-collector
  namespace: monitoring
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      attributes:
        actions:
          - key: agent.name
            from_context: resource
            action: insert
          - key: agent.model
            from_context: resource
            action: insert
    exporters:
      prometheusremotewrite:
        endpoint: http://prometheus.monitoring:9090/api/v1/write
      otlp/tempo:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
      loki:
        endpoint: http://loki.monitoring:3100/loki/api/v1/push
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [otlp/tempo]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
7.3 Prometheus Metrics
The agent operator exposes a comprehensive set of Prometheus metrics that cover both infrastructure health and agent-specific semantics.
Key metrics exposed by the operator:
# Agent lifecycle metrics
agent_operator_reconcile_total{agent, result} # Total reconciliation attempts
agent_operator_reconcile_duration_seconds{agent, quantile} # Reconciliation latency
agent_operator_state_transitions_total{agent, from, to} # State machine transitions
agent_operator_managed_agents_total{state} # Agents by state
# Agent runtime metrics (scraped from agent pods)
agent_tokens_processed_total{agent, model} # Total tokens processed
agent_tokens_per_second{agent, model} # Current throughput
agent_request_duration_seconds{agent, tool, quantile} # Request latency by tool
agent_active_tasks{agent} # Currently executing tasks
agent_queue_depth{agent} # Pending tasks
agent_tool_invocations_total{agent, tool, status} # Tool usage by status
agent_model_inference_duration_seconds{agent, model} # Model inference latency
agent_errors_total{agent, type} # Errors by type
agent_gpu_utilization_percent{agent, gpu_index} # GPU utilization
agent_gpu_memory_used_bytes{agent, gpu_index} # GPU memory usage
agent_context_window_utilization{agent} # Context window fill percentage
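On the agent side, these runtime metrics can be exposed with the standard Prometheus Python client. The sketch below instruments a hypothetical task handler with three of the metrics listed above; the metric and label names follow the listing, while the handler itself (run_task) and its arguments are illustrative:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TOKENS = Counter("agent_tokens_processed_total", "Total tokens processed",
                 ["agent", "model"])
ACTIVE = Gauge("agent_active_tasks", "Currently executing tasks", ["agent"])
LATENCY = Histogram("agent_request_duration_seconds", "Request latency by tool",
                    ["agent", "tool"],
                    buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60))

def run_task(agent: str, model: str, tool: str, tokens: int, duration_s: float):
    """Record one task: track concurrency while running, then totals on completion."""
    ACTIVE.labels(agent=agent).inc()
    try:
        LATENCY.labels(agent=agent, tool=tool).observe(duration_s)
        TOKENS.labels(agent=agent, model=model).inc(tokens)
    finally:
        ACTIVE.labels(agent=agent).dec()

# start_http_server(9090)  # serve /metrics on the port the Service scrapes
```

The agent_active_tasks gauge driven this way is the same signal consumed by the HPA and KEDA triggers described in Section 4.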
7.4 Alert Rules
Critical alert rules for agent workloads:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: agent-alerts
  namespace: monitoring
spec:
  groups:
    - name: agent-health
      interval: 30s
      rules:
        - alert: AgentDown
          expr: |
            absent(up{job="agent-system"} == 1)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Agent {{ $labels.agent }} is down"
            description: "Agent {{ $labels.agent }} has been unreachable for 5 minutes."
        - alert: AgentHighErrorRate
          expr: |
            rate(agent_errors_total[5m]) / rate(agent_tokens_processed_total[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} error rate > 5%"
        - alert: AgentHighLatency
          expr: |
            histogram_quantile(0.99, rate(agent_request_duration_seconds_bucket[5m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} p99 latency > 10s"
        - alert: AgentGPUMemoryPressure
          expr: |
            agent_gpu_memory_used_bytes / agent_gpu_memory_total_bytes > 0.95
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Agent {{ $labels.agent }} GPU memory > 95%"
        - alert: AgentQueueBacklog
          expr: |
            agent_queue_depth > 50
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} queue depth > 50 for 10 minutes"
        - alert: AgentScalingMaxed
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} at maximum replicas for 15 minutes"
7.5 Grafana Dashboards
A production agent observability stack includes the following Grafana dashboards:
- Agent Fleet Overview: Total agents, state distribution, cluster resource utilization, error rates, and throughput aggregates.
- Individual Agent Detail: Per-agent metrics including token throughput, latency percentiles, tool invocation breakdown, GPU utilization, and scaling events.
- Workflow Execution: AgentWorkflow DAG visualization, step durations, failure rates, and end-to-end latency.
- Cost and Capacity: GPU utilization efficiency, cost per token, resource waste (requested vs. used), and capacity planning projections.
- Security and Compliance: RBAC audit events, network policy violations, runtime security alerts, and access tier validation.
7.6 Structured Logging with Loki
Agent logs are structured as JSON and shipped to Loki via the OpenTelemetry Collector. Each log entry includes the agent name, model, task ID, and tool invocation context as labels, enabling efficient filtering and correlation.
The log retention policy should account for the volume generated by verbose agent interactions. A single agent processing 100 requests per hour with an average of 10 tool invocations per request generates approximately 1,000 log entries per hour. At an average of 500 bytes per entry, this is 500 KB/hour or ~360 MB/month per agent. For a fleet of 50 agents, total log volume is approximately 18 GB/month before compression (Loki typically achieves 10-15x compression, resulting in approximately 1.2-1.8 GB of stored data).
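The volume estimate generalizes to a small planning function; the 12.5x default compression ratio is simply the midpoint of the 10-15x range cited above:

```python
def monthly_log_volume_gb(n_agents: int, requests_per_hour: int,
                          entries_per_request: int, bytes_per_entry: int = 500,
                          hours_per_month: int = 720,
                          compression_ratio: float = 12.5):
    """Return (raw_gb, stored_gb) of fleet-wide log volume per month."""
    entries = n_agents * requests_per_hour * entries_per_request * hours_per_month
    raw_gb = entries * bytes_per_entry / 1e9
    return raw_gb, raw_gb / compression_ratio

raw, stored = monthly_log_volume_gb(50, 100, 10)  # 18 GB raw, ~1.4 GB stored
```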
8. Multi-Cluster Federation
8.1 Why Multi-Cluster?
Single-cluster deployments are sufficient for many organizations, but multi-cluster federation becomes necessary for several reasons:
- Geographic distribution: Agents that interact with users in multiple regions benefit from reduced latency when deployed closer to the user.
- Regulatory compliance: Data residency requirements (GDPR, CCPA) may mandate that certain agent workloads and their associated data remain within specific geographic boundaries.
- Blast radius reduction: Isolating agent workloads across clusters limits the impact of cluster-level failures.
- Resource specialization: Different clusters can provide different hardware profiles (GPU types, memory configurations) for different agent workloads.
- Scale limits: etcd performance degrades above approximately 10,000 custom resources per cluster; large agent deployments may need to partition across clusters.
8.2 Federation Architecture
Multi-Cluster Federation:
+----------------------------+
| Federation Control |
| Plane |
| |
| +------------------------+ |
| | KubeFed Controller | |
| | Manager | |
| +------------------------+ |
| | Agent Federation | |
| | Scheduler | |
| +------------------------+ |
| | Global Service Mesh | |
| | (Istio Multi-Cluster) | |
| +------------------------+ |
+-----+-------+-------+-----+
| | |
+-----------+ | +-----------+
| | |
+---------v--------+ +-------v--------+ +--------v---------+
| Cluster: US-East | | Cluster: EU | | Cluster: AP |
| | | | | |
| Agents: | | Agents: | | Agents: |
| - code-reviewer | | - gdpr-agent | | - translation |
| - security-scan | | - eu-reviewer | | - ap-reviewer |
| - test-gen | | - compliance | | - sentiment |
| | | | | |
| GPU: 4x A100 | | GPU: 2x A100 | | GPU: 2x A100 |
| Nodes: 15 | | Nodes: 8 | | Nodes: 6 |
+------------------+ +----------------+ +-------------------+
Figure 6: Multi-cluster federation architecture with geographic distribution.
8.3 Federated Agent Resources
KubeFed (Kubernetes Federation v2) enables the propagation of Agent CRs across multiple clusters with placement policies and override mechanisms.
apiVersion: types.kubefed.io/v1beta1
kind: FederatedAgent
metadata:
  name: code-reviewer
  namespace: agent-system
spec:
  template:
    spec:
      runtime:
        model: anthropic/claude-sonnet-4-20250514:latest
        image: registry.gitlab.com/blueflyio/agents/code-reviewer:v2.1.0
      capabilities:
        - name: code-review
          type: skill
          version: "2.1"
      scaling:
        minReplicas: 2
        maxReplicas: 10
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: 1
  placement:
    clusters:
      - name: us-east
      - name: eu-west
    clusterSelector:
      matchLabels:
        gpu-available: "true"
  overrides:
    - clusterName: eu-west
      clusterOverrides:
        - path: "/spec/scaling/minReplicas"
          value: 1
        - path: "/spec/scaling/maxReplicas"
          value: 5
        - path: "/spec/runtime/model"
          value: "anthropic/claude-sonnet-4-20250514:eu-compliant"
8.4 Cross-Cluster Latency Model
Cross-cluster agent communication introduces latency that must be accounted for in workflow design. The total latency for a cross-cluster agent invocation is:
T_cross_cluster = T_serialization + (N_hops * T_per_hop) + T_deserialization + T_processing
Where:
T_serialization = message_size / serialization_throughput
= typically 0.1 - 2 ms for protobuf (gRPC)
N_hops = number of network hops (typically 3-8 for cross-region)
T_per_hop = per-hop latency (0.5 - 5 ms per hop)
T_deserialization = roughly equal to T_serialization
T_processing = agent processing time (highly variable, 100ms - 60s)
Example (US-East to EU-West):
T_cross_cluster = 0.5ms + (6 * 2ms) + 0.5ms + 500ms
= 0.5 + 12 + 0.5 + 500
= 513 ms
Compared to same-cluster:
T_same_cluster = 0.5ms + (2 * 0.1ms) + 0.5ms + 500ms
= 0.5 + 0.2 + 0.5 + 500
= 501.2 ms
The 12 ms network overhead of cross-cluster communication is negligible compared to agent processing time for most workloads. However, for workflows with many sequential agent invocations (e.g., a 10-step pipeline), the cumulative overhead becomes significant: 120 ms for cross-cluster vs. 2 ms for same-cluster, a 60x increase in network latency.
The recommendation is to co-locate agents that form tight interaction loops in the same cluster and use cross-cluster communication only for loosely coupled workflows or geographic routing.
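The latency model is trivial to encode for workflow planning; the per-hop and serialization figures below are the assumed values from the worked example:

```python
def invocation_latency_ms(serialize_ms: float, n_hops: int, per_hop_ms: float,
                          processing_ms: float) -> float:
    """T = T_ser + N_hops * T_per_hop + T_deser + T_proc, with T_deser ~= T_ser."""
    return 2 * serialize_ms + n_hops * per_hop_ms + processing_ms

cross = invocation_latency_ms(0.5, 6, 2.0, 500.0)  # 513 ms (US-East to EU-West)
local = invocation_latency_ms(0.5, 2, 0.1, 500.0)  # 501.2 ms (same cluster)

def pipeline_hop_overhead_ms(steps: int, n_hops: int, per_hop_ms: float) -> float:
    """Cumulative network-hop overhead for a sequential multi-agent pipeline."""
    return steps * n_hops * per_hop_ms
```

For the 10-step pipeline discussed above, pipeline_hop_overhead_ms(10, 6, 2.0) gives the 120 ms cross-cluster figure versus 2 ms same-cluster.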
8.5 Cluster API for Infrastructure Provisioning
For organizations that manage their own Kubernetes infrastructure (rather than using managed services), the Cluster API provides declarative, Kubernetes-style APIs for creating, configuring, and managing clusters. This enables the agent orchestration layer to provision new clusters on demand in response to scaling requirements or geographic expansion.
9. Security Hardening
9.1 Pod Security Standards
Kubernetes Pod Security Standards define three levels of restriction: Privileged (unrestricted), Baseline (prevents known privilege escalations), and Restricted (heavily restricted, following current best practices). Agent workloads should run at the Restricted level with targeted exceptions for GPU access.
apiVersion: v1
kind: Namespace
metadata:
  name: agent-system
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
The Restricted level enforces:
- Pods must run as non-root
- Root filesystem must be read-only
- Privilege escalation must be explicitly disallowed
- Seccomp profile must be set (RuntimeDefault or Localhost)
- Host namespaces (hostNetwork, hostPID, hostIPC) are forbidden
- HostPath volumes are forbidden
For GPU workloads, a RuntimeClass exception is required because the NVIDIA device plugin requires certain capabilities. This is handled through a targeted exemption rather than relaxing the entire namespace.
9.2 RuntimeClass and Sandboxing
For agents that execute untrusted code (e.g., a code execution agent that runs user-submitted programs), gVisor provides an additional layer of isolation beyond standard container boundaries. gVisor intercepts system calls and handles them in user space, preventing the container from directly interacting with the host kernel.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
overhead:
  podFixed:
    cpu: 100m
    memory: 64Mi
scheduling:
  nodeSelector:
    runtime.gvisor.dev/capable: "true"
---
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: code-executor
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-sonnet-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/code-executor:v1.0.0
  security:
    accessTier: tier_3_full_access
    runtimeClass: gvisor
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    networkPolicy:
      allowEgress:
        - host: "*.internal.svc.cluster.local"
          port: 443
      denyIngress: false
The performance overhead of gVisor is approximately 5-15% for CPU-bound workloads and 20-40% for syscall-heavy workloads. For AI inference workloads that are primarily GPU-bound, the overhead is negligible because GPU operations bypass the gVisor syscall interception layer.
9.3 RBAC (Role-Based Access Control)
RBAC for the agent system follows the principle of least privilege. The operator service account has broad permissions within the agent-system namespace, but agent pods themselves have tightly scoped permissions based on their OSSA access tier.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agent-operator
rules:
  - apiGroups: ["ossa.ai"]
    resources: ["agents", "agentpools", "agentworkflows"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["ossa.ai"]
    resources: ["agents/status", "agentpools/status", "agentworkflows/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-tier1-readonly
  namespace: agent-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["agent-api-keys"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-tier3-executor
  namespace: agent-system
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "create", "update"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["agent-api-keys", "agent-git-credentials"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list"]
9.4 OPA (Open Policy Agent) for Policy Enforcement
OPA Gatekeeper enforces custom policies that go beyond what Kubernetes RBAC and Pod Security Standards can express. For agent workloads, OPA policies enforce constraints such as:
- Agents must not request more GPU resources than their access tier permits.
- Agents in tier_1_read cannot have egress network policies to external endpoints.
- Agent images must be pulled from the approved container registry.
- Agent model references must be from the approved model registry.
- Cross-tier agent communication must follow the role conflict matrix.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: agenttierresourcelimit
spec:
  crd:
    spec:
      names:
        kind: AgentTierResourceLimit
      validation:
        openAPIV3Schema:
          type: object
          properties:
            maxGPU:
              type: object
              additionalProperties:
                type: integer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package agenttierresourcelimit

        violation[{"msg": msg}] {
          input.review.object.apiVersion == "ossa.ai/v1"
          input.review.object.kind == "Agent"
          tier := input.review.object.spec.security.accessTier
          requested_gpu := input.review.object.spec.resources.requests["nvidia.com/gpu"]
          max_gpu := input.parameters.maxGPU[tier]
          requested_gpu > max_gpu
          msg := sprintf(
            "Agent %v in tier %v requests %v GPUs, max allowed: %v",
            [input.review.object.metadata.name, tier, requested_gpu, max_gpu]
          )
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AgentTierResourceLimit
metadata:
  name: agent-gpu-limits-by-tier
spec:
  match:
    kinds:
      - apiGroups: ["ossa.ai"]
        kinds: ["Agent"]
  parameters:
    maxGPU:
      tier_1_read: 0
      tier_2_write_limited: 1
      tier_3_full_access: 4
      tier_4_policy: 0
9.5 Supply Chain Security
Agent container images must pass through a verification pipeline before deployment:
- Image signing: All images are signed with Sigstore/Cosign during CI. The operator validates signatures before allowing pod creation.
- Vulnerability scanning: Trivy scans all images for CVEs. Images with critical vulnerabilities are blocked.
- SBOM generation: Software Bills of Materials are generated for all agent images and stored in the registry alongside the image.
- Admission control: Kyverno or OPA policies enforce that only signed, scanned images from approved registries are admitted to the cluster.
10. Reference Architecture
10.1 50-Agent Production Deployment
The following reference architecture describes a production deployment of 50 agents across a medium-tier Kubernetes cluster. This architecture supports a mix of CPU-only agents (lightweight tools, routing, orchestration) and GPU-accelerated agents (inference, code generation, analysis).
Table 6: Reference Architecture Node Pools
| Node Pool | Instance Type | Count | CPU | Memory | GPU | Purpose |
|---|---|---|---|---|---|---|
| system | m6i.xlarge | 3 | 4 vCPU | 16 GB | None | Control plane, operator, monitoring |
| cpu-agents | m6i.2xlarge | 5 | 8 vCPU | 32 GB | None | CPU-only agents, orchestrators |
| gpu-inference | g5.2xlarge | 4 | 8 vCPU | 32 GB | 1x A10G 24GB | Inference agents, code generation |
| gpu-heavy | p4d.24xlarge | 1 | 96 vCPU | 1152 GB | 8x A100 40GB | Large model inference, training |
| storage | i3.xlarge | 3 | 4 vCPU | 30.5 GB | None | Qdrant, MinIO, PostgreSQL |
Deployment manifest for the complete system:
# Namespace and resource quotas
apiVersion: v1
kind: Namespace
metadata:
  name: agent-system
  labels:
    pod-security.kubernetes.io/enforce: restricted
    istio-injection: enabled
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agent-system-quota
  namespace: agent-system
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    requests.nvidia.com/gpu: "12"
    limits.cpu: "400"
    limits.memory: 1600Gi
    limits.nvidia.com/gpu: "12"
    pods: "200"
    services: "60"
    persistentvolumeclaims: "50"
---
# Agent Operator deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-operator
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-operator
  template:
    metadata:
      labels:
        app: agent-operator
    spec:
      serviceAccountName: agent-operator
      nodeSelector:
        node-pool: system
      containers:
        - name: operator
          image: registry.gitlab.com/blueflyio/agent-operator:v1.0.0
          args:
            - --leader-elect=true
            - --leader-election-id=agent-operator-leader
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --max-concurrent-reconciles=10
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 1Gi
---
# Example agents (representative of 50-agent fleet)
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: code-reviewer
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-sonnet-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/code-reviewer:v2.1.0
    maxTokensPerRequest: 8192
    temperature: 0.3
  capabilities:
    - name: code-review
      type: skill
      version: "2.1"
    - name: git-operations
      type: tool
      version: "1.0"
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
      nvidia.com/gpu: 1
    limits:
      cpu: "4"
      memory: 16Gi
      nvidia.com/gpu: 1
  scaling:
    minReplicas: 2
    maxReplicas: 8
    metrics:
      - type: custom
        name: agent_active_tasks
        target:
          type: AverageValue
          averageValue: "3"
      - type: custom
        name: agent_queue_depth
        target:
          type: Value
          value: "10"
    scaleDownStabilization: 300
  security:
    accessTier: tier_3_full_access
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    networkPolicy:
      allowEgress:
        - host: "gitlab.com"
          port: 443
        - host: "api.anthropic.com"
          port: 443
  observability:
    metricsPort: 9090
    metricsPath: /metrics
    tracingEnabled: true
    logLevel: info
---
apiVersion: ossa.ai/v1
kind: Agent
metadata:
  name: routing-orchestrator
  namespace: agent-system
spec:
  runtime:
    model: anthropic/claude-haiku-4-20250514:latest
    image: registry.gitlab.com/blueflyio/agents/router:v1.5.0
    maxTokensPerRequest: 2048
    temperature: 0.1
  capabilities:
    - name: task-routing
      type: skill
      version: "1.5"
    - name: agent-discovery
      type: protocol
      version: "1.0"
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  scaling:
    minReplicas: 3
    maxReplicas: 15
    metrics:
      - type: custom
        name: agent_active_tasks
        target:
          type: AverageValue
          averageValue: "10"
  security:
    accessTier: tier_2_write_limited
    runAsNonRoot: true
    readOnlyRootFilesystem: true
  observability:
    metricsPort: 9090
    tracingEnabled: true
    logLevel: info
10.2 Cost Model
The monthly cost for this reference architecture is calculated as follows:
Table 7: Monthly Cost Breakdown
| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| System nodes (m6i.xlarge) | 3 | $138/mo | $414 |
| CPU agent nodes (m6i.2xlarge) | 5 | $276/mo | $1,380 |
| GPU inference nodes (g5.2xlarge) | 4 | $912/mo | $3,648 |
| GPU heavy node (p4d.24xlarge) | 1 | $23,558/mo | $23,558 |
| Storage nodes (i3.xlarge) | 3 | $225/mo | $675 |
| EBS storage (gp3, 2TB total) | 2,000 GB | $0.08/GB/mo | $160 |
| EBS storage (io2, 500GB) | 500 GB | $0.125/GB/mo | $62.50 |
| Data transfer (egress) | 500 GB | $0.09/GB | $45 |
| EKS control plane | 1 | $73/mo | $73 |
| Total | | | $30,015.50 |
For organizations that do not require the p4d.24xlarge heavy GPU node, the cost drops to approximately $6,457/month, well within the medium tier range. The heavy GPU node is only necessary for organizations running large language models (70B+ parameters) locally rather than using API-based inference.
The cost formula for estimating deployment expenses:
Monthly = (N_system * C_system) + (N_cpu * C_cpu_node) + (N_gpu_small * C_gpu_small)
+ (N_gpu_large * C_gpu_large) + (N_storage * C_storage_node)
+ (S_gp3 * 0.08) + (S_io2 * 0.125) + (E_gb * 0.09) + C_managed
Cost_per_agent = Monthly / N_agents
Cost_per_token = Monthly / (N_agents * avg_tokens_per_agent_per_month)
For this reference architecture:
- Cost per agent: $30,015.50 / 50 = $600.31/month
- Assuming each agent processes 10 million tokens/month: $600.31 / 10,000,000 tokens, or approximately $0.06 per 1,000 tokens of infrastructure cost
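The cost formula above can be sketched as a small calculator; a minimal sketch, with defaults taken from Table 7 (the function and parameter names are illustrative, not part of any published tooling):

```python
# Illustrative cost-model calculator mirroring the Section 10.2 formula.
# Defaults reflect Table 7; substitute your own region's pricing.

def monthly_cost(n_system=3, c_system=138,        # system nodes (m6i.xlarge)
                 n_cpu=5, c_cpu_node=276,         # CPU agent nodes (m6i.2xlarge)
                 n_gpu_small=4, c_gpu_small=912,  # GPU inference nodes (g5.2xlarge)
                 n_gpu_large=1, c_gpu_large=23_558,  # GPU heavy node (p4d.24xlarge)
                 n_storage=3, c_storage_node=225,    # storage nodes (i3.xlarge)
                 s_gp3_gb=2_000, s_io2_gb=500,    # EBS volumes in GB
                 egress_gb=500, c_managed=73):    # egress and EKS control plane
    """Total monthly infrastructure cost in USD."""
    return (n_system * c_system
            + n_cpu * c_cpu_node
            + n_gpu_small * c_gpu_small
            + n_gpu_large * c_gpu_large
            + n_storage * c_storage_node
            + s_gp3_gb * 0.08        # gp3 $/GB/month
            + s_io2_gb * 0.125       # io2 $/GB/month
            + egress_gb * 0.09       # egress $/GB
            + c_managed)

total = monthly_cost()           # 30015.50 for the reference deployment
per_agent = total / 50           # 600.31 per agent per month
no_heavy_gpu = monthly_cost(n_gpu_large=0)  # deployment without the p4d node
```

Dropping the `n_gpu_large` term reproduces the reduced figure quoted above for organizations using API-based inference instead of local 70B+ models.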
10.3 Capacity Planning
Capacity planning for agent workloads requires estimating the peak concurrency, throughput requirements, and resource consumption patterns.
Capacity planning formulas:
Required_CPU_nodes = ceil(sum(agent_cpu_requests) / node_allocatable_cpu)
Required_GPU_nodes = ceil(sum(agent_gpu_requests) / gpus_per_node)
Required_memory = sum(agent_memory_requests) * (1 + headroom_percent)
Throughput_capacity = N_replicas * tokens_per_second_per_replica
Latency_budget = target_p99_latency - network_overhead - queue_wait
Max_concurrent = throughput_capacity * latency_budget (Little's Law: L = lambda * W)
Scaling_headroom = max_replicas / min_replicas (recommended: 3-5x)
For the 50-agent deployment:
- Total CPU requests: ~150 vCPU (3 vCPU average per agent)
- Total GPU requests: 12 GPUs (24% of agents require GPU)
- Total memory requests: ~300 GB (6 GB average per agent)
- Peak throughput: ~5,000 tokens/second aggregate
- P99 latency target: < 5 seconds for inference agents
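The capacity formulas above can be sketched as follows; a minimal illustration, where the function names and the per-node allocatable figures in the usage lines are assumptions, not measurements from the reference cluster:

```python
import math

# Illustrative helpers implementing the Section 10.3 capacity-planning formulas.

def required_cpu_nodes(total_cpu_requests, node_allocatable_cpu):
    """Node count needed to satisfy the sum of agent CPU requests."""
    return math.ceil(total_cpu_requests / node_allocatable_cpu)

def required_gpu_nodes(total_gpu_requests, gpus_per_node):
    """Node count needed to satisfy the sum of agent GPU requests."""
    return math.ceil(total_gpu_requests / gpus_per_node)

def required_memory_gb(total_memory_requests_gb, headroom_percent=0.25):
    """Memory to provision, with headroom for bursts and system overhead."""
    return total_memory_requests_gb * (1 + headroom_percent)

def max_concurrent(throughput_capacity, latency_budget_s):
    """Work in flight via Little's Law: L = lambda * W."""
    return throughput_capacity * latency_budget_s

# Plugging in the 50-agent deployment figures (allocatable values assumed):
cpu_nodes = required_cpu_nodes(150, 30)   # 150 vCPU requests, ~30 allocatable/node
mem_gb = required_memory_gb(300)          # 300 GB requests + 25% headroom
in_flight = max_concurrent(5_000, 4.0)    # 5,000 tok/s, ~4 s effective budget
```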
11. References
- Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). Borg, Omega, and Kubernetes. ACM Queue, 14(1), 70-93. DOI: 10.1145/2898442.2898444
- Cloud Native Computing Foundation. (2025). CNCF Annual Survey 2025: Kubernetes Adoption and Trends. https://www.cncf.io/reports/cncf-annual-survey-2025/
- Kubernetes Authors. (2025). Custom Resources. Kubernetes Documentation. https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
- Kubernetes Authors. (2025). Operator Pattern. Kubernetes Documentation. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
- Operator SDK Authors. (2025). Building Operators with Operator SDK. https://sdk.operatorframework.io/
- Kubernetes Authors. (2025). Horizontal Pod Autoscaler. Kubernetes Documentation. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- KEDA Authors. (2025). KEDA: Kubernetes Event-driven Autoscaling. https://keda.sh/docs/
- Kubernetes Authors. (2025). Vertical Pod Autoscaler. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
- Istio Authors. (2025). Istio Service Mesh Architecture. https://istio.io/latest/docs/ops/deployment/architecture/
- Linkerd Authors. (2025). Linkerd Architecture. https://linkerd.io/2/reference/architecture/
- Kubernetes Authors. (2025). Network Policies. Kubernetes Documentation. https://kubernetes.io/docs/concepts/services-networking/network-policies/
- Kubernetes Authors. (2025). Persistent Volumes. Kubernetes Documentation. https://kubernetes.io/docs/concepts/storage/persistent-volumes/
- Kubernetes Authors. (2025). StatefulSets. Kubernetes Documentation. https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
- OpenTelemetry Authors. (2025). OpenTelemetry Collector. https://opentelemetry.io/docs/collector/
- Prometheus Authors. (2025). Prometheus Monitoring System. https://prometheus.io/docs/
- Grafana Labs. (2025). Grafana Loki: Log Aggregation System. https://grafana.com/docs/loki/latest/
- Kubernetes Federation v2 Authors. (2025). KubeFed: Kubernetes Federation v2. https://github.com/kubernetes-sigs/kubefed
- Cluster API Authors. (2025). Cluster API Documentation. https://cluster-api.sigs.k8s.io/
- Kubernetes Authors. (2025). Pod Security Standards. Kubernetes Documentation. https://kubernetes.io/docs/concepts/security/pod-security-standards/
- gVisor Authors. (2025). gVisor: Application Kernel for Containers. https://gvisor.dev/docs/
- Open Policy Agent Authors. (2025). OPA Gatekeeper. https://open-policy-agent.github.io/gatekeeper/
- NVIDIA. (2025). NVIDIA Device Plugin for Kubernetes. https://github.com/NVIDIA/k8s-device-plugin
- NVIDIA. (2025). Multi-Instance GPU User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
- Sigstore Authors. (2025). Cosign: Container Signing. https://docs.sigstore.dev/cosign/
- Aqua Security. (2025). Trivy: Comprehensive Vulnerability Scanner. https://trivy.dev/
- BlueFly.io. (2026). Open Standard for Sustainable Agents (OSSA) v0.3.3 Specification. https://gitlab.com/blueflyio/openstandardagents
- BlueFly.io. (2026). Agent Platform Technical Documentation. https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/home
- Kubernetes Authors. (2025). Resource Management for Pods and Containers. https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- Kubernetes Authors. (2025). Scheduling, Preemption, and Eviction. https://kubernetes.io/docs/concepts/scheduling-eviction/
- etcd Authors. (2025). etcd Performance and Tuning. https://etcd.io/docs/v3.5/op-guide/performance/
Appendix A: Glossary
| Term | Definition |
|---|---|
| CRD | Custom Resource Definition: extends the Kubernetes API with new resource types |
| HPA | Horizontal Pod Autoscaler: adjusts replica count based on metrics |
| VPA | Vertical Pod Autoscaler: adjusts resource requests/limits per pod |
| KEDA | Kubernetes Event-Driven Autoscaling: scales based on event sources |
| OSSA | Open Standard for Sustainable Agents: BlueFly.io's agent specification |
| mTLS | Mutual TLS: bidirectional certificate-based authentication |
| OPA | Open Policy Agent: policy enforcement engine |
| CSI | Container Storage Interface: standard for storage plugins |
| MIG | Multi-Instance GPU: NVIDIA technology for GPU partitioning |
| RBAC | Role-Based Access Control: Kubernetes authorization mechanism |
| CRI | Container Runtime Interface: standard for container runtimes |
| DAG | Directed Acyclic Graph: used for workflow step ordering |
| PVC | Persistent Volume Claim: storage request in Kubernetes |
| OTLP | OpenTelemetry Protocol: telemetry data transport protocol |
Appendix B: Checklist for Production Readiness
- Agent CRDs deployed and validated with OpenAPI v3 schema
- Agent Operator running with 3 replicas and leader election
- HPA configured with custom metrics (tokens/sec, queue depth)
- KEDA ScaledObjects for event-driven scaling with scale-to-zero
- VPA in Initial mode for GPU memory right-sizing
- Istio service mesh with STRICT mTLS enabled
- Network policies applied to all agent pods
- Pod Security Standards enforced at Restricted level
- gVisor RuntimeClass for code-execution agents
- RBAC roles aligned with OSSA access tiers
- OPA policies for tier-based resource limits
- OpenTelemetry Collector DaemonSet deployed
- Prometheus scraping agent metrics endpoints
- Grafana dashboards for fleet overview and individual agents
- Alert rules for agent health, latency, GPU pressure, and queue backlog
- Loki log aggregation with appropriate retention policies
- Container image signing with Cosign
- Vulnerability scanning with Trivy in CI pipeline
- SBOM generation for all agent images
- Resource quotas applied at namespace level
- PVCs provisioned with appropriate StorageClass (SSD for vector DBs)
- Model weight caching strategy implemented (cluster/node/pod tiers)
- Backup strategy for persistent agent state
- Disaster recovery plan documented and tested
- Capacity planning formulas validated against actual usage
This whitepaper is part of the BlueFly.io Agent Platform Whitepaper Series. For questions, contributions, or errata, please open an issue at https://gitlab.com/blueflyio/agent-platform/technical-docs/-/issues.