kagent Fix — 2026-02-26
Summary
Re-applied all 35 kagent agents on Oracle k3s. 33 of 35 now show Accepted=True, Ready=True in kagent-ui; 2 Helm-provided agents still fail (see Result).
Root Cause
Helm doesn't upgrade CRDs; this is a known Kubernetes/Helm limitation. When kagent was upgraded from an older version to 0.7.17 via `helm upgrade`, the CRD schemas in the kagent-crds-0.7.17 chart were NOT applied to the cluster. Result:
- ModelConfig CRD was stale: it only had the `model` and `apiKeySecretRef` fields. It was missing `provider` (required!), `apiKeySecret`, `apiKeySecretKey`, and the `anthropic` and `openAI` config blocks.
- Agent CRD was stale: it only had the flat `description`, `modelConfig`, `systemMessage`, `tools` fields. It was missing `type` (Declarative/BYO) and the `declarative.*` wrapper that the 0.7.17 controller expects.
- The newer ModelConfig fields were silently pruned by K8s at creation time: `provider`, `apiKeySecret`, etc. were stripped because the CRD didn't define them.
- All Agent nested fields were silently pruned: `spec.type` and `spec.declarative.*` were stripped for the same reason.
- Controller 0.7.17 couldn't reconcile agents: it read the `type` field, got an empty string → "unknown agent type: "
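One way to catch this kind of schema drift early is to ask the API server whether it knows the fields the new controller expects. The sketch below is a hypothetical helper (`check_crd_fields` is not part of kagent). It assumes `kubectl explain` can resolve the CRDs under the resource names `modelconfig` and `agent`; check `kubectl api-resources` for the names your cluster actually reports.

```shell
#!/bin/sh
# Hypothetical helper: report schema fields the cluster's CRD does not define.
# KUBECTL can be overridden (for example with a stub) for testing.
KUBECTL="${KUBECTL:-kubectl}"

check_crd_fields() {
  resource="$1"
  shift
  missing=0
  for field in "$@"; do
    # `kubectl explain` exits non-zero when the field is absent from the schema.
    if ! $KUBECTL explain "$resource.$field" >/dev/null 2>&1; then
      echo "MISSING: $resource.$field"
      missing=1
    fi
  done
  return "$missing"
}

# Example invocations (resource names are assumptions, not run here):
#   check_crd_fields modelconfig spec.provider spec.apiKeySecret spec.apiKeySecretKey
#   check_crd_fields agent spec.type spec.declarative
```

A non-zero return means at least one expected field would be pruned on write, which is the failure mode described above.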
What Was Fixed
1. Updated CRDs (server-side apply with --force-conflicts)
# ModelConfig CRD: from chart manifest
kubectl apply -f /tmp/kagent-crds-fix.yaml --server-side --force-conflicts
# Agent CRD: custom schema with x-kubernetes-preserve-unknown-fields
kubectl apply -f /tmp/agent-crd-full.yaml --server-side --force-conflicts
ModelConfig CRD v1alpha2 now has: model, provider (required), apiKeySecret, apiKeySecretKey, anthropic, openAI, azureOpenAI, ollama, gemini, geminiVertexAI, anthropicVertexAI, bedrock, tls, apiKeyPassthrough, defaultHeaders
Agent CRD v1alpha2 now has: type (Declarative/BYO), description, declarative (preserve-unknown-fields), byo, allowedNamespaces, modelConfig, systemMessage, tools, a2aConfig, deployment, memory, stream, status
2. Re-applied 5 ModelConfigs with correct fields
# Example: claude-opus
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: claude-opus
  namespace: kagent
spec:
  model: claude-opus-4-6
  provider: Anthropic                  # WAS MISSING
  apiKeySecret: kagent-anthropic       # WAS MISSING
  apiKeySecretKey: ANTHROPIC_API_KEY   # WAS MISSING
  anthropic:
    maxTokens: 8192
| ModelConfig | Provider | Model | Secret |
|---|---|---|---|
| claude-opus | Anthropic | claude-opus-4-6 | kagent-anthropic |
| claude-sonnet | Anthropic | claude-sonnet-4-5-20250929 | kagent-anthropic |
| gpt4o | OpenAI | gpt-4o | kagent-openai |
| gpt4o-mini | OpenAI | gpt-4o-mini | kagent-openai |
| default-model-config | OpenAI | gpt-4.1-mini | kagent-openai |
3. Re-applied 35 agents with correct nested structure
# kagent 0.7.17 expects NESTED spec:
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: bluefly-platform-agent
  namespace: kagent
spec:
  type: Declarative                    # REQUIRED by controller
  description: "BlueFly Platform Agent"
  declarative:                         # Everything under declarative.*
    modelConfig: claude-opus
    systemMessage: "..."
    tools:
      - type: McpServer
        mcpServer:
          name: kagent-tool-server
          kind: RemoteMCPServer
          toolNames: [...]
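Because pruning is silent, it can help to confirm after re-applying that `spec.type` actually survived on every Agent. The following is a minimal sketch, not something from the original session: `filter_untyped_agents` is a hypothetical helper, and the resource name `agents.kagent.dev` in the usage comment is an assumption. It relies on `kubectl` printing `<none>` in a custom column when the field is absent.

```shell
#!/bin/sh
# Hypothetical helper: read "NAME TYPE" lines on stdin and print the names
# of agents whose TYPE column is missing (<none>) or empty.
filter_untyped_agents() {
  awk 'NF > 0 && ($2 == "<none>" || $2 == "") { print $1 }'
}

# Example (not run here; resource name is an assumption):
#   kubectl -n kagent get agents.kagent.dev --no-headers \
#     -o custom-columns=NAME:.metadata.name,TYPE:.spec.type | filter_untyped_agents
```

Any name printed here is an agent that would still hit the "unknown agent type" error.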
4. Restarted controller
kubectl -n kagent rollout restart deploy/kagent-controller
Result
| Category | Count | Status |
|---|---|---|
| OSSA agents | 25 | ALL Accepted + Ready |
| Custom agents (bluefly-*, compliance, drupal-test) | 4 | ALL Accepted + Ready |
| Helm agents (k8s, helm, cilium, istio, observability, promql) | 4 | Accepted + Ready |
| Failed Helm agents | 2 | argo-rollouts (empty modelConfig), kgateway (acceptance issue) |
| Total | 35 | 33 OK, 2 failed |
Important: kagent CRD Schema
The correct Agent spec structure for kagent 0.7.17 is NESTED:
spec.type: "Declarative"
spec.description: "..."
spec.declarative.modelConfig: "<modelconfig-name>"
spec.declarative.systemMessage: "..."
spec.declarative.tools: [...]
spec.declarative.a2aConfig: {...}
spec.declarative.deployment: {...}
NOT flat:
# WRONG for 0.7.17:
spec.modelConfig: "..."
spec.systemMessage: "..."
spec.tools: [...]
The generate-kagent-crds.ts in platform-agents MUST use the nested structure. The earlier edit that flattened it was incorrect and needs to be reverted.
MCP Tool Server Fixes (same day, later session)
Fixed GKG RemoteMCPServer URL
- Was: `https://gkg.bluefly.internal/api/mcp/sse` → returned HTML (404)
- Fixed to: `https://gkg.bluefly.internal/mcp/sse` → now discovers 8 tools, Accepted
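The wrong path was recognizable because it answered with HTML instead of an event stream. A quick way to make that check repeatable is to classify the `Content-Type` header of the response. This is a sketch under assumptions: `classify_mcp_response` is a hypothetical helper, and it assumes an SSE endpoint answers with `text/event-stream`.

```shell
#!/bin/sh
# Hypothetical helper: read raw HTTP response headers on stdin and classify
# the endpoint by its Content-Type header.
classify_mcp_response() {
  ct="$(tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" { print tolower($2); exit }')"
  case "$ct" in
    text/event-stream*) echo "sse" ;;
    text/html*)         echo "html" ;;
    "")                 echo "unknown" ;;
    *)                  echo "other" ;;
  esac
}

# Example (not run here). -m 5 stops curl after 5 seconds because an SSE
# stream never ends; the headers are still dumped to stdout via -D -.
#   curl -sS -m 5 -H 'Accept: text/event-stream' -D - -o /dev/null \
#     https://gkg.bluefly.internal/mcp/sse | classify_mcp_response
```

An `html` result on a URL that should be an MCP endpoint usually means the path is wrong or a proxy is answering instead.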
Fixed kagent-grafana-mcp
- Container had ZERO env vars, so it couldn't connect to Grafana (`https:///api/datasources` = no host)
- Created Grafana service account (`kagent-mcp`, Admin role) with token `glsa_E48d...`
- Created secret `kagent-grafana-mcp` with `GRAFANA_URL` and `GRAFANA_SERVICE_ACCOUNT_TOKEN`
- Patched deployment with env vars from secret
- Now discovers 54 tools, Accepted
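The exact patch used in the session is not recorded. One plausible way to wire a secret's keys into a deployment is `kubectl set env --from=secret/...`, sketched below. `inject_secret_env` is a hypothetical helper, and the deployment name `kagent-grafana-mcp` in the usage comment is an assumption based on the server name.

```shell
#!/bin/sh
# Hypothetical helper: import every key of a secret as an environment
# variable on a deployment, then list the result.
KUBECTL="${KUBECTL:-kubectl}"

inject_secret_env() {
  ns="$1"
  deploy="$2"
  secret="$3"
  # Each key in the secret becomes an env var sourced from that secret.
  $KUBECTL -n "$ns" set env "deployment/$deploy" --from="secret/$secret" || return 1
  # Listing the env makes an empty container environment obvious.
  $KUBECTL -n "$ns" set env "deployment/$deploy" --list
}

# Example (not run here; deployment name is an assumption):
#   inject_secret_env kagent kagent-grafana-mcp kagent-grafana-mcp
```

Changing the pod template this way triggers a new rollout, so the pod restarts with the variables in place.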
Registered kagent-querydoc as RemoteMCPServer
- Pod existed but was NOT registered as RemoteMCPServer
- Created RemoteMCPServer CR: `http://kagent-querydoc.kagent:8080/mcp` (Streamable HTTP)
- Now discovers 1 tool, Accepted
Removed broken gitlab RemoteMCPServer
- `mcp.blueflyagents.com` is the agent-protocol service, NOT an MCP server
- All paths return `{"error":"Not found"}`; there are no MCP endpoints
- Removed the RemoteMCPServer CR
- For GitLab MCP tools, use GitLab's built-in MCP at `https://gitlab.com/api/v4/mcp` (HTTP transport, OAuth 2.0)
Switched all 24 Anthropic agents to gpt4o
- Anthropic API key has no credits (`credit balance too low`)
- All agents using claude-opus/claude-sonnet ModelConfigs switched to gpt4o
Final MCP Server Status
| Server | Namespace | Tools | Status |
|---|---|---|---|
| kagent-tool-server | kagent | 113 | Accepted |
| kagent-grafana-mcp | kagent | 54 | Accepted |
| gkg | gitlab-agent | 8 | Accepted |
| kagent-querydoc | kagent | 1 | Accepted |
See MCP Tools Reference for complete tool listing and integration guide.
Dead Pod Cleanup (CRITICAL)
Problem
6673 dead pods accumulated in 10 days. kagent agents are Deployments — each chat session creates a pod that Completes/Fails but k3s never auto-cleans them. This caused:
- Disk usage hit 95% on 96GB volume
- Node tainted with `disk-pressure:NoSchedule`
- New pods couldn't schedule → cascading failures
- Mass parallel deletion overloaded API server → Oracle went unresponsive
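Before deleting anything, it is useful to see where the dead pods are piling up. The sketch below counts pods per namespace from `kubectl get pods -A --no-headers` output. `count_per_namespace` is a hypothetical helper, and the field selector in the usage comment is an assumption about how to select finished pods.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods -A --no-headers` lines on stdin
# and print "<count> <namespace>", largest count first.
count_per_namespace() {
  awk 'NF > 0 { n[$1]++ } END { for (ns in n) print n[ns], ns }' | sort -rn
}

# Example (not run here): count pods that are neither Running nor Pending.
#   kubectl get pods -A --no-headers \
#     --field-selector=status.phase!=Running,status.phase!=Pending \
#     | count_per_namespace
```

This is a single read-only list call, so it adds very little API server load compared with the deletions themselves.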
Permanent Fix: CronJob
Deployed to kube-system namespace. Runs every 15 minutes.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-dead-pods
  namespace: kube-system
spec:
  schedule: "*/15 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300
      template:
        spec:
          serviceAccountName: cleanup-sa
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Delete Completed, Failed, Evicted pods across ALL namespaces
                  kubectl get pods -A --field-selector=status.phase==Succeeded --no-headers | awk '{print "-n", $1, $2}' | xargs -L1 kubectl delete pod --grace-period=0
                  kubectl get pods -A --field-selector=status.phase==Failed --no-headers | awk '{print "-n", $1, $2}' | xargs -L1 kubectl delete pod --grace-period=0
                  kubectl get pods -A --no-headers | grep Evicted | awk '{print "-n", $1, $2}' | xargs -L1 kubectl delete pod --grace-period=0
RBAC: ServiceAccount cleanup-sa with ClusterRole for pods: get, list, delete.
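The RBAC objects themselves are not reproduced in these notes. A minimal sketch of what they could look like follows, wrapped in a shell function so it can be piped to `kubectl apply`. The ClusterRole and ClusterRoleBinding name `cleanup-dead-pods` is an assumption; only `cleanup-sa`, the `kube-system` namespace, and the verbs come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: print the RBAC manifests for the cleanup CronJob.
# The ClusterRole/ClusterRoleBinding names are assumptions.
print_cleanup_rbac() {
  cat <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cleanup-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cleanup-dead-pods
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cleanup-dead-pods
subjects:
  - kind: ServiceAccount
    name: cleanup-sa
    namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cleanup-dead-pods
EOF
}

# Example (not run here):
#   print_cleanup_rbac | kubectl apply -f -
```

A ClusterRole (rather than a namespaced Role) is needed because the job lists and deletes pods across all namespaces.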
Manual Cleanup (when needed)
# Safe sequential cleanup (won't overload API server)
kubectl get pods -A --field-selector=status.phase==Succeeded --no-headers | awk '{print "-n", $1, $2}' | xargs -L1 kubectl delete pod --grace-period=0
kubectl get pods -A --field-selector=status.phase==Failed --no-headers | awk '{print "-n", $1, $2}' | xargs -L1 kubectl delete pod --grace-period=0
# Check disk and taint
df -h /
kubectl describe node | grep -A1 Taints
# Docker cleanup
docker image prune -af
docker builder prune -af
# Journal cleanup
sudo journalctl --vacuum-size=100M
# NEVER do mass parallel deletion (xargs -P or &): it overloads the API server
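With thousands of pods to remove, even a sequential loop can be paced further by pausing between batches. The following is a sketch, not what was run during the incident: `delete_pods_throttled` is a hypothetical helper, and the `BATCH` and `PAUSE` defaults are arbitrary assumptions to tune for the cluster.

```shell
#!/bin/sh
# Hypothetical helper: read "NAMESPACE NAME ..." lines on stdin and delete
# the pods one at a time, sleeping PAUSE seconds after every BATCH deletions.
KUBECTL="${KUBECTL:-kubectl}"
BATCH="${BATCH:-50}"   # assumed default, tune as needed
PAUSE="${PAUSE:-2}"    # assumed default, seconds

delete_pods_throttled() {
  count=0
  while read -r ns name _rest; do
    [ -n "$name" ] || continue
    # </dev/null keeps kubectl from consuming the pod list on stdin.
    $KUBECTL -n "$ns" delete pod "$name" --grace-period=0 </dev/null
    count=$((count + 1))
    if [ $((count % BATCH)) -eq 0 ]; then
      sleep "$PAUSE"
    fi
  done
  echo "processed $count pods"
}

# Example (not run here):
#   kubectl get pods -A --field-selector=status.phase==Succeeded --no-headers \
#     | delete_pods_throttled
```

The loop stays strictly sequential, in line with the warning above against `xargs -P` or backgrounded deletes.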
End-to-End Proof (verified before Oracle disk crash)
| Test | Tool Called | Result |
|---|---|---|
| k8s_get_resources (namespaces) | tools/call via Streamable HTTP | 10 namespaces returned |
| k8s_get_resources (agents CRD) | tools/call via Streamable HTTP | 35 agents listed |
| helm_list_releases | tools/call via Streamable HTTP | 10 Helm releases returned |
| search_dashboards (Grafana) | tools/call via Streamable HTTP | Dashboards: Alertmanager, CoreDNS, etcd, K8s |
| GKG SSE session | SSE connect | Valid sessionId returned |
Remaining Issues
- argo-rollouts-conversion-agent has empty modelConfig reference
- kgateway-agent not accepted (Helm default, needs investigation)
- Anthropic credits depleted — all agents on OpenAI until recharged
- kagent-ui chat returning 502 — needs investigation
- Prometheus tools return `localhost:9090` error: kagent-tools needs `PROMETHEUS_URL` configured (env var added but pod restarted into disk pressure)
- Oracle node recovery: may need reboot from Oracle Cloud console after disk pressure incident
Prevention
CRD Upgrades
# MUST apply CRDs manually after helm upgrade
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
helm get manifest kagent-crds -n kagent | kubectl apply --server-side --force-conflicts -f -
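To find out whether the manual apply is needed at all, the chart's manifests can be compared against the live cluster with `kubectl diff`, which exits 0 when nothing differs, 1 when differences exist, and greater than 1 on error. This is a sketch under assumptions: `crd_drift_check` is a hypothetical helper, and it assumes `helm get manifest` for the release returns the CRDs, as the command above implies.

```shell
#!/bin/sh
# Hypothetical helper: compare a Helm release's rendered manifests with the
# live cluster and report "in-sync", "drift", or "error".
HELM="${HELM:-helm}"
KUBECTL="${KUBECTL:-kubectl}"

crd_drift_check() {
  release="$1"
  ns="$2"
  if $HELM get manifest "$release" -n "$ns" | $KUBECTL diff -f - >/dev/null 2>&1; then
    rc=0
  else
    rc=$?
  fi
  case "$rc" in
    0) echo "in-sync" ;;
    1) echo "drift"; return 1 ;;
    *) echo "error"; return 2 ;;
  esac
}

# Example (not run here):
#   crd_drift_check kagent-crds kagent || echo "apply CRDs manually"
```

Running this right after every `helm upgrade` would have flagged the stale ModelConfig and Agent schemas before any agents were re-applied.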
Disk Monitoring
- CronJob `cleanup-dead-pods` runs every 15 min (deployed 2026-02-26)
- Alert if disk > 80%: check Grafana alerting
- Manual: `df -h /` should stay under 85%
- Never run `docker image prune -af` and mass pod deletion simultaneously
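The manual disk check can be scripted against the same thresholds. The sketch below parses `df -P /` and maps the usage percentage to a status. `parse_df_pct` and `disk_status` are hypothetical helpers, and the three status labels are assumptions layered on the 80% and 85% figures above.

```shell
#!/bin/sh
# Hypothetical helpers: extract the usage percentage from `df -P` output and
# map it to a status using the 80% alert and 85% ceiling from these notes.
parse_df_pct() {
  # Line 2 of POSIX df output holds the filesystem row; column 5 is Capacity.
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

disk_status() {
  pct="$1"
  if [ "$pct" -ge 85 ]; then
    echo "critical"
  elif [ "$pct" -gt 80 ]; then
    echo "alert"
  else
    echo "ok"
  fi
}

# Example (not run here):
#   disk_status "$(df -P / | parse_df_pct)"
```

Using `df -P` rather than `df -h` keeps each filesystem on a single line, which makes the column position stable for parsing.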