Error Code Reference
Comprehensive reference guide for understanding and resolving platform error codes in edge Kubernetes deployments. This guide provides detailed explanations, root cause analysis, and step-by-step solutions for common error scenarios.
How to Use This Reference
Each error code entry includes:
- Meaning: Clear explanation of what the error indicates
- Root Causes: Common underlying issues that trigger this error
- Immediate Actions: Quick steps to take when encountering the error
- Detailed Solutions: Step-by-step troubleshooting procedures
- Prevention: Best practices to avoid the error in the future
HTTP Error Codes
400 - Bad Request
Meaning: The server cannot process the request due to client-side errors in the request syntax, format, or content.
Root Causes:
- Malformed YAML or JSON in Kubernetes manifests
- Invalid resource specifications or field values
- Missing required fields in API requests
- Incompatible API versions
- Invalid resource names or labels
Immediate Actions:
# Validate manifest syntax before applying
kubectl apply --dry-run=client -f your-manifest.yaml
# Check for YAML syntax errors
yamllint your-manifest.yaml
# Validate against Kubernetes schema
kubectl apply --validate=true --dry-run=client -f your-manifest.yaml
Detailed Solutions:
- Validate Manifest Structure:
# Use kubeval for schema validation
kubeval your-manifest.yaml
# Check API version compatibility
kubectl api-versions | grep apps
- Common Field Validation Issues:
# Verify resource names follow DNS conventions
# Names must be lowercase, contain only alphanumeric characters and hyphens
# Check label and annotation syntax
kubectl explain pod.metadata.labels
- Debug Specific Field Errors:
# Get detailed field explanations
kubectl explain deployment.spec.template.spec.containers
# Validate resource requirements format
kubectl explain pod.spec.containers.resources
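As a concrete illustration, the following sketch (pod name and image are hypothetical) shows how a single misspelled field produces a 400-class validation error during a server-side dry run, before anything is created; with strict field validation (the default on current clusters) the request is rejected outright:
# Pipe a deliberately broken manifest through a server-side dry run
cat <<'EOF' | kubectl apply --dry-run=server -f -
apiVersion: v1
kind: Pod
metadata:
  name: validation-demo
spec:
  containers:
  - name: app
    image: nginx:1.25
    imagePullPolice: Always    # typo: should be imagePullPolicy
EOF
# Expected outcome: the API server rejects the request with an "unknown field" error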
Prevention:
- Use IDE plugins with Kubernetes schema validation
- Implement CI/CD pipeline validation steps (a minimal sketch follows this list)
- Use kubectl apply --dry-run=client before applying changes
- Maintain consistent naming conventions
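To automate the CI/CD validation step mentioned above, a minimal pre-merge check might look like the following sketch (the manifests/ path is an assumption; adjust it to your repository layout):
#!/usr/bin/env bash
set -euo pipefail
# Validate every manifest before it can be merged or applied
for manifest in manifests/*.yaml; do
  yamllint "$manifest"                                            # YAML syntax errors
  kubectl apply --dry-run=client --validate=true -f "$manifest"   # schema/field errors
done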
401 - Unauthorized
Meaning: The request lacks valid authentication credentials or the provided credentials are invalid.
Root Causes:
- Expired authentication tokens or certificates
- Missing or incorrect kubeconfig configuration
- Invalid service account tokens
- Clock skew between client and server
- Revoked or disabled user credentials
Immediate Actions:
# Check current authentication status
kubectl auth whoami
# View current context and credentials
kubectl config current-context
kubectl config view --minify
Detailed Solutions:
- Token Refresh and Validation:
# Re-authenticate with your identity provider; the exact command depends on the configured auth plugin
# For OIDC providers, re-authenticate via the kubelogin (oidc-login) plugin
kubectl oidc-login
# Check token expiration
kubectl config view --raw -o jsonpath='{.users[0].user.auth-provider.config.id-token}' | cut -d. -f2 | base64 -d
- Service Account Authentication:
# Check service account token
kubectl get serviceaccount <sa-name> -o yaml
# Verify token mounting in pods
kubectl describe pod <pod-name> | grep -A 5 "Mounts:"
# Create new service account token if needed
kubectl create token <sa-name>
- Certificate-Based Authentication:
# Check client certificate validity
openssl x509 -in ~/.kube/client.crt -text -noout
# Verify certificate against CA
openssl verify -CAfile ~/.kube/ca.crt ~/.kube/client.crt
Prevention:
- Set up automatic token renewal
- Monitor certificate expiration dates (see the sketch after this list)
- Use short-lived tokens with refresh mechanisms
- Implement proper RBAC policies
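For the certificate-expiration monitoring suggested above, a small sketch (assuming the client certificate is stored at ~/.kube/client.crt; adjust the path to your kubeconfig setup):
# Warn if the kubeconfig client certificate expires within the next 30 days
if ! openssl x509 -checkend $((30*24*3600)) -noout -in ~/.kube/client.crt; then
  echo "WARNING: client certificate expires within 30 days - rotate it now" >&2
fi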
403 - Forbidden
Meaning: The server understood the request but refuses to authorize it due to insufficient permissions.
Root Causes:
- Inadequate RBAC permissions for the requested operation
- Missing role bindings or cluster role bindings
- Incorrect service account assignments
- Namespace-level permission restrictions
- Resource quota or limit range violations
Immediate Actions:
# Check specific permission
kubectl auth can-i <verb> <resource> --namespace <namespace>
# List all permissions for current user
kubectl auth can-i --list --namespace <namespace>
Detailed Solutions:
- RBAC Permission Analysis:
# Check current user's role bindings
kubectl get rolebinding,clusterrolebinding -A | grep $(kubectl config view --minify -o jsonpath='{.contexts[0].context.user}')
# Examine specific role permissions
kubectl describe role <role-name> -n <namespace>
kubectl describe clusterrole <clusterrole-name>
- Service Account Permission Debugging:
# Check service account's role bindings
kubectl get rolebinding,clusterrolebinding -A | grep <service-account-name>
# Test permissions as service account
kubectl auth can-i create pods --as=system:serviceaccount:<namespace>:<sa-name>
- Resource Quota and Limits:
# Check namespace resource quotas
kubectl describe quota -n <namespace>
# Check limit ranges
kubectl describe limitrange -n <namespace>
Prevention:
- Follow the principle of least privilege (see the sketch after this list)
- Regularly audit RBAC permissions
- Use namespace-specific service accounts
- Document required permissions for applications
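As a sketch of a least-privilege setup, the manifest below grants a service account read-only access to pods in a single namespace; all names (edge-apps, app-reader, app-sa) are hypothetical:
cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: edge-apps
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: edge-apps
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: edge-apps
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io
EOF
# Confirm the grant resolves to "yes" and nothing broader
kubectl auth can-i list pods --as=system:serviceaccount:edge-apps:app-sa -n edge-apps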
404 - Not Found
Meaning: The requested resource does not exist on the server.
Root Causes:
- Resource was deleted or never created
- Incorrect resource name or namespace
- API version mismatch
- Custom Resource Definition (CRD) not installed
- Typos in resource references
Immediate Actions:
# List all resources in namespace
kubectl get all -n <namespace>
# Search across all namespaces
kubectl get <resource-type> -A | grep <resource-name>
Detailed Solutions:
- Resource Discovery:
# List all available resource types
kubectl api-resources
# Check specific resource type availability
kubectl api-resources | grep <resource-type>
# Verify CRD installation
kubectl get crd | grep <custom-resource>
- Namespace and Name Verification:
# Check current namespace context
kubectl config view --minify | grep namespace
# List resources with labels
kubectl get <resource-type> --show-labels -A
- API Version Compatibility:
# Check deprecated API versions
kubectl api-versions | sort
# Convert deprecated resources (requires the kubectl-convert plugin)
kubectl convert -f old-manifest.yaml --output-version <new-api-version>
Prevention:
- Use consistent naming conventions
- Implement resource tagging strategies
- Version control for resource manifests
- Regular cluster resource auditing
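A common 404 trigger is applying a custom resource before its CRD exists. A small guard sketch (the CRD and manifest names are hypothetical):
# Only apply the custom resource once its CRD is installed
if kubectl get crd widgets.example.com >/dev/null 2>&1; then
  kubectl apply -f widget-instance.yaml
else
  echo "CRD widgets.example.com not found - install the operator first" >&2
  exit 1
fi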
Kubernetes Pod Error States
ImagePullBackOff / ErrImagePull
Meaning: Kubernetes cannot pull the specified container image from the registry.
Root Causes:
- Image does not exist in the specified registry
- Incorrect image name, tag, or registry URL
- Authentication issues with private registries
- Network connectivity problems to registry
- Registry service unavailable
- Image architecture mismatch (e.g., ARM vs x86)
Immediate Actions:
# Check pod events for specific error details
kubectl describe pod <pod-name>
# Verify image exists manually
docker pull <image-name>:<tag>
Detailed Solutions:
- Image and Registry Verification:
# Test direct image pull
docker pull <registry>/<image>:<tag>
# Check image manifest
docker manifest inspect <image>:<tag>
# Verify image architecture
docker image inspect <image>:<tag> | grep Architecture
- Registry Authentication:
# Check existing image pull secrets
kubectl get secrets | grep docker
# Describe image pull secret
kubectl get secret <secret-name> -o yaml
# Create new registry secret
kubectl create secret docker-registry <secret-name> \
--docker-server=<registry-url> \
--docker-username=<username> \
--docker-password=<password>
# Link secret to service account
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
- Network and Connectivity:
# Test registry connectivity from node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- curl -v https://<registry-url>
# Check DNS resolution
kubectl exec -it <debug-pod> -- nslookup <registry-domain>
# Verify proxy settings if applicable
kubectl get nodes -o yaml | grep -i proxy
Prevention:
- Use specific image tags instead of 'latest'
- Implement image scanning and validation
- Use private registries with proper authentication
- Test image pulls in CI/CD pipeline
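The sketch below ties these prevention points together: a pod that pins an explicit tag and references a pull secret for a private registry (the registry URL, image, and secret name are hypothetical):
cat <<'EOF' | kubectl apply --dry-run=server -f -
apiVersion: v1
kind: Pod
metadata:
  name: registry-demo
spec:
  imagePullSecrets:
  - name: regcred                               # pull secret created earlier
  containers:
  - name: app
    image: registry.example.com/team/app:1.4.2  # pinned tag, never 'latest'
EOF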
CrashLoopBackOff
Meaning: The container repeatedly crashes and Kubernetes keeps restarting it with increasing delays.
Root Causes:
- Application startup failures due to misconfigurations
- Missing dependencies or environment variables
- Resource constraints (memory/CPU limits too low)
- Port binding conflicts
- Database or external service connectivity issues
- Invalid command or entrypoint specifications
Immediate Actions:
# Check current and previous container logs
kubectl logs <pod-name> --previous
kubectl logs <pod-name> --tail=50
# Get detailed pod information
kubectl describe pod <pod-name>
Detailed Solutions:
- Application Log Analysis:
# Follow logs in real-time
kubectl logs <pod-name> -f
# Check logs from all containers in pod
kubectl logs <pod-name> --all-containers=true
# Get logs with timestamps
kubectl logs <pod-name> --timestamps=true --since=10m
- Resource and Configuration Review:
# Check resource limits and requests
kubectl get pod <pod-name> -o yaml | grep -A 10 resources
# Verify environment variables
kubectl get pod <pod-name> -o yaml | grep -A 20 env
# Check mounted volumes and secrets
kubectl describe pod <pod-name> | grep -A 10 "Mounts:"
- Interactive Debugging:
# Run debug container with same image
kubectl run debug-container --image=<same-image> -it --rm -- /bin/sh
# Execute commands in running container (if available)
kubectl exec -it <pod-name> -- /bin/sh
# Debug with different entrypoint
kubectl run debug-pod --image=<image> -it --rm --command -- /bin/sh
Prevention:
- Implement proper health checks and readiness probes (see the sketch after this list)
- Use init containers for dependency checks
- Set appropriate resource requests and limits
- Test containerized applications locally first
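For the health-check prevention point, a sketch of a pod with readiness and liveness probes plus modest resource settings (the image, port, and /healthz path are assumptions about the application):
cat <<'EOF' | kubectl apply --dry-run=server -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: registry.example.com/team/app:1.4.2
    ports:
    - containerPort: 8080
    readinessProbe:              # keep traffic away until the app answers
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:               # restart only after sustained failure
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 256Mi
EOF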
Pending
Meaning: The pod has been accepted by Kubernetes but cannot be scheduled to run on any node.
Root Causes:
- Insufficient CPU or memory resources on available nodes
- Node selector constraints cannot be satisfied
- Pod anti-affinity rules preventing scheduling
- Taints on nodes without corresponding tolerations
- Persistent volume availability issues
- Resource quotas preventing pod creation
Immediate Actions:
# Check pod scheduling events
kubectl describe pod <pod-name>
# Review node resource availability
kubectl top nodes
kubectl describe nodes
Detailed Solutions:
- Resource Availability Analysis:
# Check detailed node capacity
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
# View all node allocatable resources
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU-ALLOCATABLE:.status.allocatable.cpu,MEMORY-ALLOCATABLE:.status.allocatable.memory
# Check resource quotas
kubectl describe quota -n <namespace>
- Scheduling Constraints Review:
# Check node labels for selector matching
kubectl get nodes --show-labels
# Review pod node selector and affinity rules
kubectl get pod <pod-name> -o yaml | grep -A 10 nodeSelector
kubectl get pod <pod-name> -o yaml | grep -A 20 affinity
# Check node taints
kubectl describe node <node-name> | grep Taints
- Storage and Volume Issues:
# Check persistent volume claims
kubectl get pvc -n <namespace>
# Verify storage class availability
kubectl get storageclass
# Check persistent volume status
kubectl get pv | grep Available
Prevention:
- Monitor cluster resource utilization
- Implement cluster autoscaling
- Use resource requests appropriately
- Plan for node maintenance and capacity
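The sketch below shows how tolerations, a node selector, and modest resource requests interact during scheduling on constrained edge nodes (the taint key, architecture label value, and image are assumptions):
cat <<'EOF' | kubectl apply --dry-run=server -f -
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo
spec:
  tolerations:                          # allow placement on tainted edge nodes
  - key: edge.example.com/remote
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    kubernetes.io/arch: arm64           # only target ARM edge nodes
  containers:
  - name: app
    image: registry.example.com/team/app:1.4.2
    resources:
      requests:                         # small requests so the pod fits on small nodes
        cpu: 50m
        memory: 64Mi
EOF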
Edge-Specific Error Codes
EDGE-001: Node Connectivity Lost
Meaning: An edge node has lost network connectivity to the Kubernetes control plane.
Root Causes:
- Network infrastructure failures or instability
- Firewall or security group configuration changes
- DNS resolution issues
- Control plane endpoint changes
- Certificate expiration or rotation issues
- Power or hardware failures at edge location
Immediate Actions:
# Check node status from control plane
kubectl get nodes
# Verify node-specific details
kubectl describe node <edge-node-name>
Detailed Solutions:
- Network Connectivity Testing:
# Test connectivity from edge node to control plane
# (Run these commands on the edge node if accessible)
ping <control-plane-ip>
telnet <control-plane-ip> 6443
# Check DNS resolution
nslookup <control-plane-hostname>
# Verify routing
traceroute <control-plane-ip>
- Certificate and Authentication Verification:
# Check kubelet certificate status
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout
# Verify kubelet configuration
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago"
- Recovery Procedures:
# Restart kubelet service
systemctl restart kubelet
# Reset node if necessary (from node)
kubeadm reset
kubeadm join <control-plane-endpoint> --token <token> --discovery-token-ca-cert-hash <hash>
# Cordon, then drain the node for maintenance (from control plane)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
Prevention:
- Implement redundant network paths
- Monitor network connectivity continuously (see the watchdog sketch after this list)
- Set up automated certificate renewal
- Use node health monitoring and alerting
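A minimal connectivity watchdog sketch that could run on the edge node (for example via cron); the control plane address is a placeholder, and restarting the kubelet on failure is one possible policy, not a required one:
#!/usr/bin/env bash
API_SERVER=10.0.0.10        # replace with your control plane IP or hostname
# Open a raw TCP connection to the API server port; restart kubelet if it fails
if ! timeout 5 bash -c "cat < /dev/null > /dev/tcp/${API_SERVER}/6443" 2>/dev/null; then
  logger -t edge-watchdog "API server ${API_SERVER}:6443 unreachable - restarting kubelet"
  systemctl restart kubelet
fi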
EDGE-002: Resource Quota Exceeded
Meaning: The edge node has exceeded its allocated resource limits, preventing new workload scheduling.
Root Causes:
- Insufficient resource planning for edge node capacity
- Memory leaks in running applications
- Unexpected workload spikes or scaling events
- Background processes consuming resources
- Storage space exhaustion
Immediate Actions:
# Check node resource usage
kubectl top node <edge-node-name>
# List resource-intensive pods
kubectl top pods --sort-by=memory -n <namespace>
kubectl top pods --sort-by=cpu -n <namespace>
Detailed Solutions:
- Resource Usage Analysis:
# Get detailed node resource allocation
kubectl describe node <edge-node-name> | grep -A 20 "Allocated resources"
# Check for resource-intensive pods
kubectl get pods -o custom-columns=NAME:.metadata.name,CPU-REQUEST:.spec.containers[0].resources.requests.cpu,MEMORY-REQUEST:.spec.containers[0].resources.requests.memory --sort-by=.spec.containers[0].resources.requests.memory
# Monitor resource usage over time
watch kubectl top pods -n <namespace>
- Workload Optimization:
# Scale down non-critical workloads
kubectl scale deployment <deployment-name> --replicas=1
# Update resource limits for problematic pods
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"512Mi","cpu":"500m"}}}]}}}}'
# Remove completed or failed pods
kubectl delete pods --field-selector=status.phase=Succeeded
kubectl delete pods --field-selector=status.phase=Failed
- Storage Management:
# Check disk usage on node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- df -h
# Clean up unused volumes
kubectl delete pvc <unused-pvc-name>
# Remove unused images (on node)
docker system prune -a
Prevention:
- Implement resource monitoring and alerting
- Set appropriate resource requests and limits
- Use horizontal pod autoscaling (see the sketch after this list)
- Regular cleanup of unused resources
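For the autoscaling prevention point, a sketch using the built-in horizontal pod autoscaler (deployment and namespace names are hypothetical; requires metrics-server on the cluster):
# Scale between 1 and 4 replicas, targeting 70% average CPU utilization
kubectl autoscale deployment app --min=1 --max=4 --cpu-percent=70 -n edge-apps
# Inspect the autoscaler's current view of load
kubectl get hpa -n edge-apps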
EDGE-003: Storage Limit Reached
Meaning: The edge node has reached its storage capacity limit, affecting application operation and new deployments.
Root Causes:
- Log files growing without rotation or cleanup
- Application data accumulation without management
- Container image buildup without cleanup
- Temporary file accumulation
- Persistent volume space exhaustion
Immediate Actions:
# Check storage usage on node
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- df -h
# Check persistent volume status
kubectl get pv,pvc -A
Detailed Solutions:
- Storage Analysis and Cleanup:
# Analyze disk usage by directory (the node's root filesystem is mounted at /host in the debug pod)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- du -sh /host/var/lib/docker/*
# Clean up Docker resources (requires the docker CLI on the node)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- chroot /host docker system prune -a
# Remove unused container images
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- chroot /host docker image prune -a
- Log Management:
# Check log sizes on the node (host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- du -sh /host/var/log/*
# Rotate logs manually if needed
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- chroot /host logrotate -f /etc/logrotate.conf
# Clear old journal logs
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- chroot /host journalctl --vacuum-time=7d
- Persistent Volume Management:
# Check PV usage
kubectl exec -it <pod-using-pv> -- df -h /mount/path
# Resize persistent volumes if supported
kubectl patch pvc <pvc-name> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# Backup and clean old data
kubectl exec -it <pod-name> -- tar -czf /backup/data.tar.gz /data/old/
kubectl exec -it <pod-name> -- rm -rf /data/old/*
Prevention:
- Implement log rotation and retention policies (see the kubelet sketch after this list)
- Set up storage monitoring and alerting
- Use ephemeral storage for temporary data
- Regular maintenance and cleanup schedules
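For the log-rotation and cleanup prevention points, the kubelet itself can cap container log growth and garbage-collect unused images. The fragment below (values are examples, not recommendations) would be merged into the node's kubelet configuration, typically /var/lib/kubelet/config.yaml, followed by a kubelet restart:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi         # rotate each container log at 10 MiB
containerLogMaxFiles: 3           # keep at most 3 rotated files per container
imageGCHighThresholdPercent: 80   # start deleting unused images at 80% disk usage
imageGCLowThresholdPercent: 60    # stop once usage drops below 60%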
Debugging Commands Quick Reference
Essential Diagnostic Commands
# Get cluster overview
kubectl cluster-info
kubectl get nodes -o wide
kubectl get events --sort-by='.lastTimestamp' -A
# Pod debugging
kubectl get pods -A -o wide
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous --timestamps
# Resource monitoring
kubectl top nodes
kubectl top pods -A --sort-by=memory
kubectl get events --field-selector type=Warning
# Network debugging
kubectl get svc,ep -A
kubectl exec -it <pod-name> -- netstat -tulpn
kubectl run netshoot --image=nicolaka/netshoot -it --rm
# Storage debugging
kubectl get pv,pvc,sc -A
kubectl describe pvc <pvc-name>
# RBAC debugging
kubectl auth can-i --list
kubectl get rolebinding,clusterrolebinding -A
Advanced Debugging Techniques
# Enable verbose output
kubectl apply -f manifest.yaml --v=8
# Use debug containers
kubectl debug <pod-name> -it --image=nicolaka/netshoot
kubectl debug node/<node-name> -it --image=nicolaka/netshoot
# Simulate pod scheduling
kubectl apply --dry-run=server -f pod.yaml
# Force delete stuck resources
kubectl delete pod <pod-name> --force --grace-period=0
# Export resources for analysis
kubectl get pod <pod-name> -o yaml > pod-debug.yaml
kubectl describe pod <pod-name> > pod-description.txt
Next Steps
For additional troubleshooting resources:
- Common Issues & Solutions - Quick fixes for frequent problems
- Debugging Guides - Step-by-step troubleshooting workflows