kagent Fix — 2026-02-26

Summary

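Repaired the definitions of all 35 kagent agents on the Oracle k3s cluster. 33 of 35 now show Accepted=True, Ready=True in kagent-ui; the remaining 2 are Helm-provided agents listed under Remaining Issues.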

Root Cause

Helm does not upgrade CRDs that are already installed in the cluster (a known Helm limitation). When kagent was upgraded from an older version to 0.7.17 via helm upgrade, the CRD schemas in the kagent-crds-0.7.17 chart were NOT applied to the cluster. Result:

  1. ModelConfig CRD was stale: it only had the model and apiKeySecretRef fields. It was missing provider (required), apiKeySecret, apiKeySecretKey, and the anthropic and openAI config blocks.
  2. Agent CRD was stale: it only had the flat description, modelConfig, systemMessage, and tools fields. It was missing type (Declarative/BYO) and the declarative.* wrapper that the 0.7.17 controller expects.
  3. All new ModelConfig fields were silently pruned by Kubernetes at creation time: provider, apiKeySecret, etc. were stripped because the CRD didn't define them.
  4. All Agent nested fields were silently pruned: spec.type and spec.declarative.* were stripped for the same reason.
  5. Controller 0.7.17 couldn't reconcile agents: it read the type field, got an empty string, and failed with "unknown agent type: ".
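
A stale CRD of this kind can be detected before any resources are applied by comparing the fields the cluster's CRD schema defines with the fields the manifests use. The following is a minimal sketch, not part of the original fix. It assumes the CRD has been exported as JSON (for example with kubectl get crd <crd-name> -o json) and that the schema lives at the standard spec.versions[].schema.openAPIV3Schema path; the function name and the sample data are hypothetical.

```python
def missing_spec_fields(crd: dict, version: str, expected: list[str]) -> list[str]:
    """Return the expected spec fields that the CRD schema does not define.

    `crd` is the parsed JSON of a CustomResourceDefinition object.
    Fields missing from the schema are the ones Kubernetes would prune.
    """
    for v in crd.get("spec", {}).get("versions", []):
        if v.get("name") != version:
            continue
        schema = v.get("schema", {}).get("openAPIV3Schema", {})
        spec_props = (
            schema.get("properties", {}).get("spec", {}).get("properties", {})
        )
        return [field for field in expected if field not in spec_props]
    # The requested version is not served at all: every field is missing.
    return list(expected)


# Hypothetical sample mimicking the stale ModelConfig CRD described above.
stale_crd = {
    "spec": {
        "versions": [
            {
                "name": "v1alpha2",
                "schema": {
                    "openAPIV3Schema": {
                        "properties": {
                            "spec": {
                                "properties": {"model": {}, "apiKeySecretRef": {}}
                            }
                        }
                    }
                },
            }
        ]
    }
}

print(
    missing_spec_fields(
        stale_crd, "v1alpha2", ["model", "provider", "apiKeySecret", "apiKeySecretKey"]
    )
)
# prints ['provider', 'apiKeySecret', 'apiKeySecretKey']
```

A non-empty result means the CRD must be updated before the manifests are applied, otherwise the listed fields are dropped without an error.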

What Was Fixed

1. Updated CRDs (server-side apply with --force-conflicts)

# ModelConfig CRD: from chart manifest
kubectl apply -f /tmp/kagent-crds-fix.yaml --server-side --force-conflicts
# Agent CRD: custom schema with x-kubernetes-preserve-unknown-fields
kubectl apply -f /tmp/agent-crd-full.yaml --server-side --force-conflicts

ModelConfig CRD v1alpha2 now has: model, provider (required), apiKeySecret, apiKeySecretKey, anthropic, openAI, azureOpenAI, ollama, gemini, geminiVertexAI, anthropicVertexAI, bedrock, tls, apiKeyPassthrough, defaultHeaders

Agent CRD v1alpha2 now has: type (Declarative/BYO), description, declarative (preserve-unknown-fields), byo, allowedNamespaces, modelConfig, systemMessage, tools, a2aConfig, deployment, memory, stream, status

2. Re-applied 5 ModelConfigs with correct fields

# Example: claude-opus
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: claude-opus
  namespace: kagent
spec:
  model: claude-opus-4-6
  provider: Anthropic                  # WAS MISSING
  apiKeySecret: kagent-anthropic       # WAS MISSING
  apiKeySecretKey: ANTHROPIC_API_KEY   # WAS MISSING
  anthropic:
    maxTokens: 8192

| ModelConfig | Provider | Model | Secret |
|---|---|---|---|
| claude-opus | Anthropic | claude-opus-4-6 | kagent-anthropic |
| claude-sonnet | Anthropic | claude-sonnet-4-5-20250929 | kagent-anthropic |
| gpt4o | OpenAI | gpt-4o | kagent-openai |
| gpt4o-mini | OpenAI | gpt-4o-mini | kagent-openai |
| default-model-config | OpenAI | gpt-4.1-mini | kagent-openai |

3. Re-applied 35 agents with correct nested structure

# kagent 0.7.17 expects NESTED spec:
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: bluefly-platform-agent
  namespace: kagent
spec:
  type: Declarative            # REQUIRED by controller
  description: "BlueFly Platform Agent"
  declarative:                 # Everything under declarative.*
    modelConfig: claude-opus
    systemMessage: "..."
    tools:
      - type: McpServer
        mcpServer:
          name: kagent-tool-server
          kind: RemoteMCPServer
          toolNames: [...]

4. Restarted controller

kubectl -n kagent rollout restart deploy/kagent-controller

Result

| Category | Count | Status |
|---|---|---|
| OSSA agents | 25 | ALL Accepted + Ready |
| Custom agents (bluefly-*, compliance, drupal-test) | 4 | ALL Accepted + Ready |
| Helm agents (k8s, helm, cilium, istio, observability, promql) | 4 | Accepted + Ready |
| Failed Helm agents | 2 | argo-rollouts (empty modelConfig), kgateway (acceptance issue) |
| Total | 35 | 33 OK, 2 failed |
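
The counts above can be re-derived from the cluster rather than read off kagent-ui. The sketch below assumes the output of kubectl get agents -n kagent -o json has been parsed into a dict, and that each Agent reports its state as conditions named Accepted and Ready under status.conditions, matching what kagent-ui displays. Those condition names and the helper names are assumptions, not confirmed against the kagent API.

```python
def agent_health(agent: dict) -> tuple[bool, bool]:
    """Return (accepted, ready) for one Agent object, based on its conditions."""
    conditions = {
        c.get("type"): c.get("status")
        for c in agent.get("status", {}).get("conditions", [])
    }
    # Kubernetes condition statuses are the strings "True", "False" or "Unknown".
    return conditions.get("Accepted") == "True", conditions.get("Ready") == "True"


def failing_agents(agent_list: dict) -> list[str]:
    """Return the sorted names of agents that are not both Accepted and Ready."""
    return sorted(
        agent["metadata"]["name"]
        for agent in agent_list.get("items", [])
        if agent_health(agent) != (True, True)
    )


# Hypothetical sample shaped like `kubectl get agents -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "k8s-agent"},
            "status": {
                "conditions": [
                    {"type": "Accepted", "status": "True"},
                    {"type": "Ready", "status": "True"},
                ]
            },
        },
        {
            "metadata": {"name": "kgateway-agent"},
            "status": {"conditions": [{"type": "Accepted", "status": "False"}]},
        },
    ]
}

print(failing_agents(sample))
# prints ['kgateway-agent']
```

An agent with no status at all is treated as failing, which matches how an unreconciled agent should be counted.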

Important: kagent CRD Schema

The correct Agent spec structure for kagent 0.7.17 is NESTED:

spec.type: "Declarative"
spec.description: "..."
spec.declarative.modelConfig: "<modelconfig-name>"
spec.declarative.systemMessage: "..."
spec.declarative.tools: [...]
spec.declarative.a2aConfig: {...}
spec.declarative.deployment: {...}

NOT flat:

# WRONG for 0.7.17:
spec.modelConfig: "..."
spec.systemMessage: "..."
spec.tools: [...]

The generate-kagent-crds.ts in platform-agents MUST use the nested structure. The earlier edit that flattened it was incorrect and needs to be reverted.
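
To illustrate the transformation the generator has to perform, here is a minimal Python sketch that converts a flat Agent spec into the nested form. It is not the code from generate-kagent-crds.ts; the function name and the list of keys moved under declarative are assumptions taken from the structure listed above.

```python
import copy

# Keys that the 0.7.17 controller expects under spec.declarative
# (taken from the nested structure listed above).
DECLARATIVE_KEYS = ("modelConfig", "systemMessage", "tools", "a2aConfig", "deployment")


def nest_agent_spec(flat_spec: dict) -> dict:
    """Return a copy of an Agent spec with declarative fields nested.

    BYO agents are returned unchanged; any other spec is treated as Declarative.
    """
    spec = copy.deepcopy(flat_spec)
    if spec.get("type") == "BYO":
        return spec

    declarative = dict(spec.pop("declarative", None) or {})
    for key in DECLARATIVE_KEYS:
        if key in spec:
            value = spec.pop(key)
            # A value that is already nested wins over a flat duplicate.
            declarative.setdefault(key, value)

    spec.setdefault("type", "Declarative")
    spec["declarative"] = declarative
    return spec


flat = {
    "description": "BlueFly Platform Agent",
    "modelConfig": "claude-opus",
    "systemMessage": "...",
    "tools": [],
}
print(nest_agent_spec(flat))
```

The function is idempotent: running it on an already nested spec returns an equal spec, so it can be applied to a mix of old and new definitions.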

MCP Tool Server Fixes (same day, later session)

Fixed GKG RemoteMCPServer URL

  • Was: https://gkg.bluefly.internal/api/mcp/sse → returned HTML (404)
  • Fixed to: https://gkg.bluefly.internal/mcp/sse → now discovers 8 tools, Accepted

Fixed kagent-grafana-mcp

  • Container had ZERO env vars — couldn't connect to Grafana (https:///api/datasources = no host)
  • Created Grafana service account (kagent-mcp, Admin role) with token glsa_E48d...
  • Created secret kagent-grafana-mcp with GRAFANA_URL and GRAFANA_SERVICE_ACCOUNT_TOKEN
  • Patched deployment with env vars from secret
  • Now discovers 54 tools, Accepted

Registered kagent-querydoc as RemoteMCPServer

  • Pod existed but was NOT registered as RemoteMCPServer
  • Created RemoteMCPServer CR: http://kagent-querydoc.kagent:8080/mcp (Streamable HTTP)
  • Now discovers 1 tool, Accepted

Removed broken gitlab RemoteMCPServer

  • mcp.blueflyagents.com is the agent-protocol service, NOT an MCP server
  • All paths return {"error":"Not found"} — no MCP endpoints
  • Removed the RemoteMCPServer CR
  • For GitLab MCP tools, use GitLab's built-in MCP at https://gitlab.com/api/v4/mcp (HTTP transport, OAuth 2.0)

Switched all 24 Anthropic agents to gpt4o

  • Anthropic API key has no credits (credit balance too low)
  • All agents using claude-opus/claude-sonnet ModelConfigs switched to gpt4o

Final MCP Server Status

| Server | Namespace | Tools | Status |
|---|---|---|---|
| kagent-tool-server | kagent | 113 | Accepted |
| kagent-grafana-mcp | kagent | 54 | Accepted |
| gkg | gitlab-agent | 8 | Accepted |
| kagent-querydoc | kagent | 1 | Accepted |

See MCP Tools Reference for complete tool listing and integration guide.

Dead Pod Cleanup (CRITICAL)

Problem

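6673 dead pods accumulated in 10 days. kagent agents are Deployments; each chat session creates a pod that Completes/Fails, and k3s never cleaned them up. Kubernetes garbage-collects terminated pods only once their count exceeds the kube-controller-manager --terminated-pod-gc-threshold (default 12500), which was never reached. This caused: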

  • Disk usage hit 95% on 96GB volume
  • Node tainted with disk-pressure:NoSchedule
  • New pods couldn't schedule → cascading failures
  • Mass parallel deletion overloaded API server → Oracle went unresponsive

Permanent Fix: CronJob

Deployed to kube-system namespace. Runs every 15 minutes.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-dead-pods
  namespace: kube-system
spec:
  schedule: "*/15 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300
      template:
        spec:
          serviceAccountName: cleanup-sa
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Delete Completed, Failed, Evicted pods across ALL namespaces.
                  # xargs -r (GNU) skips kubectl when there is nothing to delete,
                  # so an empty list does not make the Job fail.
                  kubectl get pods -A --field-selector=status.phase==Succeeded --no-headers | awk '{print "-n", $1, $2}' | xargs -r -L1 kubectl delete pod --grace-period=0
                  kubectl get pods -A --field-selector=status.phase==Failed --no-headers | awk '{print "-n", $1, $2}' | xargs -r -L1 kubectl delete pod --grace-period=0
                  kubectl get pods -A --no-headers | grep Evicted | awk '{print "-n", $1, $2}' | xargs -r -L1 kubectl delete pod --grace-period=0

RBAC: ServiceAccount cleanup-sa with ClusterRole for pods: get, list, delete.

Manual Cleanup (when needed)

# Safe sequential cleanup (won't overload API server)
# xargs -r (GNU) skips kubectl when there is nothing to delete
kubectl get pods -A --field-selector=status.phase==Succeeded --no-headers | awk '{print "-n", $1, $2}' | xargs -r -L1 kubectl delete pod --grace-period=0
kubectl get pods -A --field-selector=status.phase==Failed --no-headers | awk '{print "-n", $1, $2}' | xargs -r -L1 kubectl delete pod --grace-period=0

# Check disk and taint
df -h /
kubectl describe node | grep -A1 Taints

# Docker cleanup
docker image prune -af
docker builder prune -af

# Journal cleanup
sudo journalctl --vacuum-size=100M

# NEVER do mass parallel deletion (xargs -P or &): it overloads the API server
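
The same sequential approach can be scripted when a pause between deletions is wanted. This is a rough sketch under stated assumptions: it parses the text output of kubectl get pods -A --no-headers (columns NAMESPACE, NAME, READY, STATUS, ...), treats the STATUS values Completed, Error and Evicted as dead, and deletes one pod at a time. The status set, the pause length and all function names are illustrative choices, not part of the deployed fix.

```python
import subprocess
import time

# STATUS column values treated as dead; adjust to what the cluster shows.
DEAD_STATUSES = {"Completed", "Error", "Evicted"}


def dead_pods(listing: str) -> list[tuple[str, str]]:
    """Parse `kubectl get pods -A --no-headers` text into (namespace, name) pairs."""
    pods = []
    for line in listing.splitlines():
        cols = line.split()
        if len(cols) >= 4 and cols[3] in DEAD_STATUSES:
            pods.append((cols[0], cols[1]))
    return pods


def delete_commands(pods: list[tuple[str, str]]) -> list[list[str]]:
    """Build one kubectl delete command per pod, in listing order."""
    return [
        ["kubectl", "delete", "pod", "-n", ns, name, "--grace-period=0"]
        for ns, name in pods
    ]


def run_sequentially(commands: list[list[str]], pause_seconds: float = 0.2) -> int:
    """Run commands strictly one after another; return how many failed."""
    failures = 0
    for cmd in commands:
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            failures += 1
        time.sleep(pause_seconds)  # breathing room for the API server
    return failures


# Hypothetical listing used only to show the parsing step.
listing = """\
kagent       bluefly-platform-agent-abc12   0/1   Completed   0   3d
kagent       k8s-agent-7d9f8-xk2lp          1/1   Running     0   2h
kube-system  metrics-server-xyz99           0/1   Evicted     0   5d
"""
for cmd in delete_commands(dead_pods(listing)):
    print(" ".join(cmd))
```

Printing the commands first gives a dry run; passing the same list to run_sequentially performs the deletion without any parallelism.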

End-to-End Proof (verified before Oracle disk crash)

| Test | Tool Called | Result |
|---|---|---|
| k8s_get_resources (namespaces) | tools/call via Streamable HTTP | 10 namespaces returned |
| k8s_get_resources (agents CRD) | tools/call via Streamable HTTP | 35 agents listed |
| helm_list_releases | tools/call via Streamable HTTP | 10 Helm releases returned |
| search_dashboards (Grafana) | tools/call via Streamable HTTP | Dashboards: Alertmanager, CoreDNS, etcd, K8s |
| GKG SSE session | SSE connect | Valid sessionId returned |

Remaining Issues

  1. argo-rollouts-conversion-agent has empty modelConfig reference
  2. kgateway-agent not accepted (Helm default, needs investigation)
  3. Anthropic credits depleted — all agents on OpenAI until recharged
  4. kagent-ui chat returning 502 — needs investigation
  5. Prometheus tools return localhost:9090 error — kagent-tools needs PROMETHEUS_URL configured (env var added but pod restarted into disk pressure)
  6. Oracle node recovery — may need reboot from Oracle Cloud console after disk pressure incident

Prevention

CRD Upgrades

# MUST apply CRDs manually after helm upgrade
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
helm get manifest kagent-crds -n kagent | kubectl apply --server-side --force-conflicts -f -

Disk Monitoring

  • CronJob cleanup-dead-pods runs every 15 min (deployed 2026-02-26)
  • Alert if disk > 80%: check Grafana alerting
  • Manual: df -h / should stay under 85%
  • Never run docker image prune -af and mass pod deletion simultaneously
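
The two thresholds above (alert above 80%, stay under 85%) can be checked from a small script on the node. This is a minimal sketch using only the Python standard library; the function names and the returned labels are hypothetical, and the percentage may differ slightly from the Use% column of df, which excludes reserved blocks.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent: float, alert_above: float = 80.0, limit: float = 85.0) -> str:
    """Map a usage percentage onto the thresholds from the list above."""
    if percent >= limit:
        return "over-limit"
    if percent > alert_above:
        return "alert"
    return "ok"


print(classify(95.0))
# prints over-limit
```

On the node, classify(disk_used_percent("/")) gives the current state; the 95% reached during the incident falls in the over-limit band.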