Debugging Guides
Comprehensive, step-by-step guides for debugging complex issues in edge deployments. This page provides systematic approaches to identifying, diagnosing, and resolving common problems in Kubernetes-based edge environments.
General Debugging Methodology
Before diving into specific scenarios, follow this general approach:
- Identify the Problem: Clearly define symptoms and expected vs. actual behavior
- Gather Information: Collect logs, metrics, and configuration details
- Isolate Components: Determine which layer is causing the issue (application, platform, infrastructure)
- Test Hypotheses: Make changes incrementally and validate results
- Document Solutions: Record findings for future reference
Application Not Starting
When applications fail to start or remain in pending/error states, follow this systematic approach.
Step 1: Check Pod Status and Events
Start by understanding the current state of your application pods:
# Get overall pod status
kubectl get pods -l app=myapp -o wide
# Get detailed pod information and events
kubectl describe pod <pod-name>
What to look for:
- Pending: Usually indicates resource constraints or scheduling issues
- CrashLoopBackOff: Application is starting but immediately crashing
- ImagePullBackOff: Cannot pull the container image
- Error/Failed: General failure state
Common Events and Their Meanings:
- FailedScheduling: No nodes available with required resources
- FailedMount: Issues mounting volumes or secrets
- Failed to pull image: Registry connectivity or authentication issues
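The status-to-action mapping above can be wrapped in a small triage helper. A minimal sketch: the status strings are real `kubectl get pods` output, but the function name and suggested actions are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: map a pod STATUS value (as printed by `kubectl get pods`)
# to the most likely next debugging step.
triage_pod_status() {
  case "$1" in
    Pending)          echo "check node resources and scheduling events (kubectl describe pod)" ;;
    CrashLoopBackOff) echo "check application logs (kubectl logs --previous)" ;;
    ImagePullBackOff|ErrImagePull) echo "check image name, tag, and registry credentials" ;;
    Error|Failed)     echo "check pod events and container exit codes (kubectl describe pod)" ;;
    Running)          echo "pod is running; check readiness probes if it is not Ready" ;;
    *)                echo "unrecognized status; inspect with kubectl describe pod" ;;
  esac
}

# Example: feed in a status seen in `kubectl get pods` output
triage_pod_status CrashLoopBackOff
```

A wrapper like this is handy in runbooks because it encodes the team's agreed first response per failure mode.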
Step 2: Examine Application Logs
Logs provide crucial insights into why applications fail to start:
# Get current logs
kubectl logs <pod-name>
# Get logs from previous container instance (if crashed)
kubectl logs <pod-name> --previous
# Follow logs in real-time
kubectl logs <pod-name> -f
# Get logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>
Log Analysis Tips:
- Look for startup errors, missing dependencies, or configuration issues
- Check for port binding conflicts or permission denied errors
- Identify any stack traces or error codes
- Note any environment variable or secret-related errors
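A first pass over the tips above can be automated with a single pattern scan. A sketch, assuming the common startup error phrases listed here; pipe real logs in via `kubectl logs <pod-name> | scan_startup_errors`:

```shell
#!/bin/sh
# Sketch: flag log lines matching common startup failure patterns.
# The pattern list is illustrative; extend it for your application's log format.
scan_startup_errors() {
  grep -niE 'error|exception|refused|denied|not found|address already in use|no such file'
}

# Example on a sample log excerpt:
scan_startup_errors <<'EOF'
2024-01-01T10:00:00Z starting server
2024-01-01T10:00:01Z ERROR: could not bind to port 8080: address already in use
2024-01-01T10:00:01Z permission denied opening /var/secrets/token
EOF
```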
Step 3: Verify Resource Availability
Ensure the cluster has sufficient resources for your application:
# Check node resource usage
kubectl top nodes
# Get detailed node information
kubectl describe node <node-name>
# Check resource quotas (if applicable)
kubectl get resourcequota
# Verify persistent volume claims
kubectl get pvc
Resource Troubleshooting:
- Compare requested resources in your deployment with available node capacity
- Check if resource quotas are preventing pod scheduling
- Verify that persistent volumes are available and properly bound
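Comparing requested resources against node capacity requires normalizing Kubernetes CPU quantities first. A minimal sketch that handles only the plain ("2") and milli ("500m") forms; a real parser must cover more suffixes:

```shell
#!/bin/sh
# Sketch: normalize Kubernetes CPU quantities to millicores so a pod request
# can be compared against a node's allocatable capacity.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;              # "500m" -> 500
    *)  echo $(( $1 * 1000 )) ;;      # "2"    -> 2000
  esac
}

# Example values; in practice read these from the deployment spec and
# from `kubectl describe node` (Allocatable section).
request=$(to_millicores 500m)
allocatable=$(to_millicores 2)
if [ "$request" -le "$allocatable" ]; then
  echo "request fits on node"
else
  echo "request exceeds node capacity"
fi
```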
Step 4: Validate Configuration
Configuration issues are common causes of startup failures:
# Check ConfigMaps
kubectl get configmap <config-name> -o yaml
# Verify Secrets
kubectl get secret <secret-name> -o yaml
# Validate deployment configuration
kubectl get deployment <deployment-name> -o yaml
# Check service account permissions
kubectl get serviceaccount <sa-name> -o yaml
Configuration Checklist:
- Ensure all required environment variables are set
- Verify secret and configmap references are correct
- Check that image tags exist and are accessible
- Validate service account has necessary permissions
Step 5: Network and DNS Resolution
For applications that depend on external services:
# Test DNS resolution from within a pod
kubectl exec -it <pod-name> -- nslookup kubernetes.default
# Check if external services are reachable
kubectl exec -it <pod-name> -- curl -v http://external-service.com
# Verify internal service connectivity
kubectl exec -it <pod-name> -- telnet <service-name> <port>
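One-off probes like the ones above can fail transiently on edge links, so it helps to retry before concluding anything. A sketch of a generic retry wrapper; the probe command is whatever check you want to repeat, e.g. `retry 5 kubectl exec <pod-name> -- nslookup kubernetes.default`:

```shell
#!/bin/sh
# Sketch: retry a flaky check a few times before declaring failure.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                        # probe succeeded
    fi
    echo "attempt $i/$attempts failed; retrying..." >&2
    i=$((i + 1))
    sleep 1
  done
  return 1                            # all attempts failed
}
```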
Performance Issues
When applications are running but experiencing poor performance, use this systematic approach.
Step 1: Monitor Resource Usage
Start by understanding current resource consumption:
# Check pod resource usage
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu
# Monitor specific pod over time
watch kubectl top pod <pod-name>
# Get detailed resource information
kubectl describe pod <pod-name> | grep -A 20 "Containers:"
Performance Indicators:
- High CPU: May indicate inefficient algorithms, infinite loops, or insufficient resources
- High Memory: Could suggest memory leaks, large datasets, or inadequate limits
- Resource Throttling: Check if pods are hitting their limits
Step 2: Analyze Application Metrics
Access application-specific metrics to understand internal performance:
# Port-forward to access metrics endpoint
kubectl port-forward <pod-name> 9090:9090
# Query Prometheus-style metrics
curl http://localhost:9090/metrics
# For custom applications, check health endpoints
curl http://localhost:8080/health
curl http://localhost:8080/ready
Metrics to Focus On:
- Request latency and throughput
- Database connection pool utilization
- Cache hit/miss ratios
- Queue depths and processing times
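Ratios like cache hit rate can be computed directly from a metrics scrape. A sketch using awk over Prometheus-style counter output; the metric names `cache_hits_total` and `cache_misses_total` are hypothetical, so substitute whatever your application actually exports:

```shell
#!/bin/sh
# Sketch: compute a cache hit ratio from Prometheus-style counters.
hit_ratio() {
  awk '
    /^cache_hits_total/   { hits = $2 }
    /^cache_misses_total/ { misses = $2 }
    END { if (hits + misses > 0) printf "%.2f\n", hits / (hits + misses) }
  '
}

# Example on sample scrape output; normally:
#   curl -s http://localhost:9090/metrics | hit_ratio
hit_ratio <<'EOF'
cache_hits_total 900
cache_misses_total 100
EOF
```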
Step 3: Deep Dive into Logs
Analyze log patterns to identify performance bottlenecks:
# Look for error patterns
kubectl logs <pod-name> | grep ERROR | tail -20
# Search for performance-related issues
kubectl logs <pod-name> | grep -i "timeout\|slow\|performance"
# Check for memory issues
kubectl logs <pod-name> | grep -i "out of memory\|oom"
# Analyze request patterns (for web applications)
kubectl logs <pod-name> | grep "response_time" | tail -50
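Beyond eyeballing the last 50 lines, latency fields can be aggregated in one pass. A sketch that assumes log lines carry a `response_time=<n>ms` field; adjust the awk pattern to your application's actual log layout:

```shell
#!/bin/sh
# Sketch: count requests and average latency from "response_time=<n>ms" fields.
summarize_latency() {
  awk 'match($0, /response_time=[0-9]+/) {
         n++
         sum += substr($0, RSTART + 14, RLENGTH - 14)   # digits after "response_time="
       }
       END { if (n) printf "requests=%d avg_ms=%.1f\n", n, sum / n }'
}

# Example on sample lines; normally:
#   kubectl logs <pod-name> | summarize_latency
summarize_latency <<'EOF'
GET /api/users  response_time=120ms
GET /api/users  response_time=80ms
GET /api/orders response_time=400ms
EOF
```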
Step 4: Check Downstream Dependencies
Performance issues often originate from dependencies:
# Test database connectivity and performance
kubectl exec -it <pod-name> -- pg_isready -h <db-host>
# Check external API response times
kubectl exec -it <pod-name> -- time curl -s http://api.example.com/health
# Verify internal service response times
kubectl exec -it <pod-name> -- time curl -s http://internal-service/health
Step 5: Resource Optimization
Based on findings, optimize resource allocation:
# Update resource requests and limits
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"requests":{"memory":"512Mi","cpu":"500m"},"limits":{"memory":"1Gi","cpu":"1000m"}}}]}}}}'
# Scale horizontally if needed
kubectl scale deployment <deployment-name> --replicas=3
# Check if horizontal pod autoscaler would help
kubectl get hpa
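The one-line patch above is easy to get wrong by hand; building it from variables keeps the JSON valid. A sketch, where the container name and sizes are placeholders to substitute:

```shell
#!/bin/sh
# Sketch: assemble the resource patch from variables instead of hand-editing
# a one-line JSON string. All values below are placeholders.
container="myapp"
mem_request="512Mi"; mem_limit="1Gi"
cpu_request="500m";  cpu_limit="1000m"

patch=$(cat <<EOF
{"spec":{"template":{"spec":{"containers":[{"name":"$container","resources":{"requests":{"memory":"$mem_request","cpu":"$cpu_request"},"limits":{"memory":"$mem_limit","cpu":"$cpu_limit"}}}]}}}}
EOF
)
echo "$patch"
# Review the output, then apply it with:
#   kubectl patch deployment <deployment-name> -p "$patch"
```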
Networking Issues
Network connectivity problems can be complex to diagnose in edge environments.
Step 1: Test Basic Connectivity
Start with fundamental network connectivity tests:
# Test pod-to-pod communication
kubectl exec -it <source-pod> -- ping <target-pod-ip>
# Test service connectivity
kubectl exec -it <pod-name> -- telnet <service-name> <port>
# Check DNS resolution
kubectl exec -it <pod-name> -- nslookup <service-name>
# Test external connectivity
kubectl exec -it <pod-name> -- curl -v http://google.com
Common Connectivity Issues:
- DNS resolution failures
- Service discovery problems
- Network policy restrictions
- Firewall or security group blocks
Step 2: Verify Service Configuration
Ensure services are properly configured and have valid endpoints:
# Check service configuration
kubectl get svc <service-name> -o yaml
# Verify service endpoints
kubectl get endpoints <service-name>
# Check if service selector matches pod labels
kubectl get pods --show-labels -l <service-selector>
# Test service from within cluster
kubectl run test-pod --image=nicolaka/netshoot -it --rm -- curl http://<service-name>:<port>
Service Troubleshooting:
- Ensure service selector matches pod labels exactly
- Verify target ports match container ports
- Check that endpoints list shows healthy pods
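The "selector matches labels exactly" check can be scripted once you have both strings in hand (e.g. from `kubectl get svc -o jsonpath` and `--show-labels`). A sketch over comma-separated `key=value` lists:

```shell
#!/bin/sh
# Sketch: verify every key=value pair in a service selector appears in a
# pod's label set. Inputs are comma-separated "k=v" strings.
selector_matches() {
  selector=$1; labels=$2
  for pair in $(echo "$selector" | tr ',' ' '); do
    case ",$labels," in
      *",$pair,"*) ;;                          # this selector pair is present
      *) echo "missing: $pair"; return 1 ;;
    esac
  done
  echo "selector matches pod labels"
}

# Example with illustrative label sets:
selector_matches "app=myapp,tier=web" "app=myapp,tier=web,pod-template-hash=abc"
```

Note that extra labels on the pod (like `pod-template-hash`) are fine; only the selector's pairs must all be present.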
Step 3: Analyze Network Policies
Network policies can restrict traffic flow:
# List all network policies
kubectl get networkpolicy -A
# Check specific policy details
kubectl describe networkpolicy <policy-name>
# Test connectivity before and after policy application
kubectl exec -it <pod-name> -- curl -v http://<target-service>
Network Policy Debugging:
- Verify ingress/egress rules match your traffic patterns
- Check namespace and pod selectors
- Test connectivity from allowed and denied sources
Step 4: Ingress and Load Balancer Issues
For external traffic routing problems:
# Check ingress configuration
kubectl get ingress <ingress-name> -o yaml
# Verify ingress controller logs
kubectl logs -n ingress-nginx <ingress-controller-pod>
# Test load balancer health
kubectl get svc <loadbalancer-service> -o wide
# Check external DNS resolution
nslookup <your-domain.com>
Step 5: Edge-Specific Network Considerations
Edge environments often have unique networking challenges:
# Check node network configuration
kubectl get nodes -o wide
# Verify cluster networking components
kubectl get pods -n kube-system | grep -E "(calico|flannel|weave)"
# Check for network interface issues on nodes
kubectl debug node/<node-name> -it --image=nicolaka/netshoot
Edge-Specific Debugging Scenarios
Intermittent Connectivity
Edge deployments often experience network instability:
# Monitor connectivity over time
kubectl exec -it <pod-name> -- ping -c 100 <target-host> | grep -E "(loss|min/avg/max)"
# Check for network interface flapping
kubectl exec -it <pod-name> -- ip link show
# Monitor DNS resolution stability
for i in {1..10}; do kubectl exec <pod-name> -- nslookup google.com; sleep 5; done
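To turn repeated probes like the loop above into a number you can track over time, count failures across a fixed number of runs. A sketch; the probe is any command, e.g. `probe_stability 10 kubectl exec <pod-name> -- nslookup google.com`:

```shell
#!/bin/sh
# Sketch: quantify intermittent connectivity as a failure rate over N probes.
probe_stability() {
  runs=$1; shift
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs probes failed"
}

# Example with a trivially succeeding probe:
probe_stability 5 true
```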
Resource Constraints in Edge Environments
Edge nodes often have limited resources:
# Monitor node resources over time
watch kubectl top nodes
# Check for evicted pods
kubectl get pods -A | grep Evicted
# Verify resource quotas and limits
kubectl describe resourcequota
kubectl describe limitrange
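Evicted pods accumulate on resource-constrained edge nodes, so it can help to generate cleanup commands from the pod listing rather than deleting blindly. A sketch that prints the commands for review instead of running them; the sample input stands in for live `kubectl get pods -A` output:

```shell
#!/bin/sh
# Sketch: turn `kubectl get pods -A` output into per-pod delete commands
# for Evicted pods (STATUS is column 4 with -A). Print first, run after review.
evicted_cleanup_cmds() {
  awk '$4 == "Evicted" { printf "kubectl delete pod %s -n %s\n", $2, $1 }'
}

evicted_cleanup_cmds <<'EOF'
NAMESPACE   NAME          READY   STATUS    RESTARTS   AGE
default     web-abc       1/1     Running   0          2d
default     web-xyz       0/1     Evicted   0          1d
edge        worker-123    0/1     Evicted   0          3h
EOF
```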
Geographic Distribution Issues
For multi-site edge deployments:
# Check pod distribution across nodes/zones
kubectl get pods -o wide | sort -k 7
# Verify node affinity rules
kubectl get deployment <deployment-name> -o yaml | grep -A 10 affinity
# Test cross-zone connectivity
kubectl exec -it <pod-in-zone-a> -- ping <pod-in-zone-b>
Debugging Best Practices
1. Systematic Approach
- Always start with the most basic checks before diving into complex analysis
- Document your debugging steps and findings
- Use a consistent methodology across different types of issues
2. Comprehensive Logging
# Enable verbose logging for troubleshooting
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "DEBUG"
  ENABLE_TRACING: "true"
EOF
3. Use Debugging Tools
# Deploy debugging utilities
kubectl run debug-pod --image=nicolaka/netshoot -it --rm
# Use kubectl debug for node-level issues
kubectl debug node/<node-name> -it --image=nicolaka/netshoot
# Employ port-forwarding for direct access
kubectl port-forward <pod-name> 8080:8080
4. Monitor Continuously
- Set up alerts for common failure patterns
- Use observability tools to track trends
- Implement health checks and readiness probes
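The health checks and readiness probes mentioned above typically take the form of probe stanzas on the container spec. A minimal sketch, reusing the `/health` and `/ready` endpoints shown earlier; the paths, port, and timings are placeholders to adapt:

```yaml
# Hypothetical probe configuration for a container spec; adjust paths and ports.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
```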
5. Document and Share
- Maintain a knowledge base of common issues and solutions
- Create runbooks for recurring problems
- Share findings with team members
Advanced Debugging Techniques
Container Runtime Debugging
# Check container runtime status
kubectl get nodes -o wide
# Debug container creation issues
kubectl describe pod <pod-name> | grep -A 10 "Events:"
# Access node-level container information (the node filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- chroot /host crictl ps
Storage and Volume Issues
# Check persistent volume status
kubectl get pv,pvc -A
# Debug volume mount issues
kubectl describe pod <pod-name> | grep -A 10 "Mounts:"
# Verify storage class configuration
kubectl get storageclass -o yaml
Security and RBAC Debugging
# Check service account permissions
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<serviceaccount>
# Verify RBAC bindings
kubectl get rolebinding,clusterrolebinding -A | grep <serviceaccount>
# Debug security context issues
kubectl get pod <pod-name> -o yaml | grep -A 10 securityContext
Escalation Guidelines
When to escalate issues:
- Infrastructure-level failures affecting multiple applications
- Security incidents or suspected breaches
- Resource exhaustion at the cluster level
- Persistent issues that don't respond to standard troubleshooting
- Performance degradation affecting SLA compliance
Next Steps
For additional troubleshooting resources:
- Common Issues & Solutions - Quick reference for frequent problems
- Error Code Reference - Detailed error code explanations