Kubernetes Troubleshooting

Common Kubernetes issues and solutions for the LLM platform.


Pod Issues

Pod Not Starting

Check pod status:

    kubectl get pods -n agents
    kubectl describe pod <pod-name> -n agents
    kubectl logs <pod-name> -n agents

Common causes:

  1. Image pull error
      # Check which image the pod is trying to pull
      kubectl describe pod <pod-name> -n agents | grep Image
      # Fix: point the deployment at the correct image
      kubectl set image deployment/<name> <container>=<correct-image> -n agents
  2. Insufficient resources
      # Check allocatable and allocated resources on each node
      kubectl describe nodes
      # Fix: adjust resource requests
      kubectl edit deployment <name> -n agents
  3. Failed health checks
      # Check probe configuration
      kubectl describe pod <pod-name> -n agents | grep -A 10 Liveness
      # Fix: adjust probe timings
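
The three causes above usually show up as a recognizable status reason, so a first triage pass can be scripted. The following is a minimal sketch, not part of the platform tooling: the function name `triage_reason` is hypothetical, and the mapping is a heuristic over reason strings Kubernetes commonly reports. Note that `Unschedulable` comes from the pod's `PodScheduled` condition rather than from the container state, and it can also indicate taints or affinity rules rather than a resource shortage.

```shell
# triage_reason: map a pod status reason to a likely cause.
# Hypothetical helper; the mapping is a heuristic, not an exhaustive list.
triage_reason() {
  case "$1" in
    ErrImagePull|ImagePullBackOff|InvalidImageName)
      echo "image pull error" ;;
    Unschedulable)
      echo "scheduling failed: often insufficient resources" ;;
    CrashLoopBackOff)
      echo "container keeps restarting: check logs and health checks" ;;
    *)
      echo "unknown: inspect the events in kubectl describe pod" ;;
  esac
}

# Usage (needs cluster access, so it is left commented out):
# reason=$(kubectl get pod <pod-name> -n agents \
#   -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
# triage_reason "$reason"
#
# For a pod that was never scheduled, read the condition instead:
# kubectl get pod <pod-name> -n agents \
#   -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].reason}'
```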

Service Issues

Service Not Accessible

    # Check service
    kubectl get svc -n agents
    kubectl describe svc <service-name> -n agents
    # Check endpoints (an empty list means no ready pods match the service selector)
    kubectl get endpoints <service-name> -n agents
    # Test from within the cluster: start a temporary pod with a shell
    kubectl run test --rm -it --image=curlimages/curl -- sh
    # Then, inside that shell:
    curl http://<service-name>.<namespace>.svc.cluster.local
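
The endpoints check is often the quickest way to tell a selector or readiness problem from a networking problem. As a rough sketch, the hypothetical helper `endpoints_verdict` below interprets the list of endpoint IPs; it assumes the IPs are passed as one space-separated string, which is what the jsonpath query in the usage comment prints.

```shell
# endpoints_verdict: interpret a space-separated list of endpoint IPs.
# Hypothetical helper for illustration only.
endpoints_verdict() {
  # Word-split the argument on purpose so that $# is the address count.
  set -- $1
  if [ "$#" -eq 0 ]; then
    echo "no ready endpoints: compare the service selector with pod labels"
  else
    echo "$# ready endpoint(s)"
  fi
}

# Usage (needs cluster access, so it is left commented out):
# ips=$(kubectl get endpoints <service-name> -n agents \
#   -o jsonpath='{.subsets[*].addresses[*].ip}')
# endpoints_verdict "$ips"
```

If the helper reports no ready endpoints, the service exists but routes to nothing, so the `curl` test will fail regardless of network configuration.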

Deployment Issues

Deployment Stuck

    # Check deployment status
    kubectl rollout status deployment/<name> -n agents
    # Check events
    kubectl get events -n agents --sort-by='.lastTimestamp'
    # Rollback if needed
    kubectl rollout undo deployment/<name> -n agents
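
The status check and the rollback can be combined so that a rollout that never completes is reverted automatically. This is a sketch under assumptions: the function name `rollout_or_undo` and the default timeout of `120s` are illustrative choices, not platform settings. It relies on `kubectl rollout status --timeout`, which exits non-zero when the rollout does not finish in time.

```shell
# rollout_or_undo: wait for a rollout, and roll back if it does not finish.
# Arguments: <deployment-name> [namespace] [timeout]
# Hypothetical helper; the defaults below are assumptions.
rollout_or_undo() {
  ro_name=$1
  ro_ns=${2:-agents}
  ro_timeout=${3:-120s}
  if kubectl rollout status "deployment/$ro_name" -n "$ro_ns" --timeout="$ro_timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not finish within $ro_timeout, rolling back"
    kubectl rollout undo "deployment/$ro_name" -n "$ro_ns"
  fi
}

# Usage (needs cluster access, so it is left commented out):
# rollout_or_undo <name> agents 300s
```

Before relying on an automatic rollback, it is worth checking the events first, since rolling back hides the failing revision's symptoms.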

Network Issues

Pod to Pod Communication

    # Test connectivity
    kubectl exec -it <pod-1> -n agents -- ping <pod-2-ip>
    # Check network policy
    kubectl get networkpolicies -n agents
    # Check CNI logs
    kubectl logs -n kube-system <cni-pod>

Storage Issues

PVC Not Binding

    # Check PVC status
    kubectl get pvc -n agents
    # Check PV availability
    kubectl get pv
    # Describe PVC for events
    kubectl describe pvc <pvc-name> -n agents
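
A PVC reports one of three phases (`Pending`, `Bound`, or `Lost`), and each points to a different next step. The sketch below maps the phase to a hint; `pvc_hint` is a hypothetical name and the hints are general guidance, not a diagnosis of any specific storage backend.

```shell
# pvc_hint: suggest a next step for a PVC phase.
# Hypothetical helper for illustration only.
pvc_hint() {
  case "$1" in
    Bound)
      echo "bound: storage is attached to a volume" ;;
    Pending)
      echo "pending: check storage class, requested size, and access modes" ;;
    Lost)
      echo "lost: the backing volume no longer exists" ;;
    *)
      echo "unrecognized phase: $1" ;;
  esac
}

# Usage (needs cluster access, so it is left commented out):
# phase=$(kubectl get pvc <pvc-name> -n agents -o jsonpath='{.status.phase}')
# pvc_hint "$phase"
```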

Resource Limits

OOMKilled

Problem: The pod was killed because a container ran out of memory.

    # Check pod events
    kubectl describe pod <pod-name> -n agents
    # Increase memory limits
    kubectl patch deployment <name> -n agents -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
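
Hand-writing the nested patch JSON is error-prone, so it can help to generate it. The sketch below builds the same patch body as the command above from a container name and a memory limit; `build_memory_patch` is a hypothetical helper, and it assumes both arguments contain no characters that need JSON escaping.

```shell
# build_memory_patch: print a patch that sets one container's memory limit.
# Arguments: <container-name> <memory-limit, e.g. 2Gi>
# Hypothetical helper; inputs are assumed to need no JSON escaping.
build_memory_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"limits":{"memory":"%s"}}}]}}}}\n' "$1" "$2"
}

# Usage (needs cluster access, so it is left commented out):
# kubectl patch deployment <name> -n agents -p "$(build_memory_patch <container> 2Gi)"
```

The container name must match an existing container in the deployment, because the patch merges container entries by name; a non-matching name adds a new, incomplete container entry instead of updating the intended one.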