Common Issues & Solutions: A Practical Troubleshooting Guide
This guide helps you quickly diagnose and resolve the most common problems encountered in edge platform deployments. For each issue, you'll find explanations, step-by-step troubleshooting, example commands, and prevention tips.
How to Use This Guide
- Start with the General Troubleshooting Workflow below if you're not sure where to begin.
- Find your issue in the sections that follow for targeted solutions.
- Use the prevention tips to avoid recurring problems.
- Escalate or ask for help if you're stuck (see the end of this guide).
General Troubleshooting Workflow
- Check Cluster Health:
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get events --sort-by='.lastTimestamp'
- Narrow Down the Problem:
- Is it a deployment, networking, or resource issue?
- Check Logs and Events:
kubectl describe pod <pod-name>
kubectl logs <pod-name>
- Use Diagnostic Commands:
- See the "Quick Diagnostic Commands" section below.
- Apply Solutions from this Guide.
- Escalate if Needed:
- See "When to Escalate/Ask for Help."
Deployment Issues
Pod Stuck in Pending State
Why it happens:
- Not enough resources on nodes
- Unschedulable pods due to taints, affinity, or node selectors
Step-by-step solution:
- Check node resources:
kubectl describe nodes
kubectl top nodes
- Look for nodes with insufficient CPU/memory.
- Check pod events:
kubectl describe pod <pod-name>
- Look for messages like "Insufficient memory" or "No nodes available."
- Fix:
- Free up resources or add nodes.
- Adjust pod resource requests/limits.
- Remove taints or adjust node selectors if needed.
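For example, if a taint is blocking scheduling, you can list taints and remove the offending one, or lower the workload's requests so it fits on available nodes. This is a sketch; "dedicated" is a placeholder taint key, and the example values are illustrative:
# See which taints are set on each node
kubectl describe nodes | grep -i taints
# Remove a blocking taint (the trailing "-" removes it)
kubectl taint nodes <node-name> dedicated:NoSchedule-
# Reduce the deployment's resource requests
kubectl set resources deployment <deployment-name> --requests=cpu=100m,memory=128Mi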
Prevention:
- Set realistic resource requests/limits.
- Monitor cluster utilization.
Image Pull Errors (ImagePullBackOff, ErrImagePull)
Why it happens:
- Image does not exist or is misspelled
- Registry authentication issues
Step-by-step solution:
- Check if the image exists:
docker pull <image-name>
- Check image pull secrets:
kubectl get secrets
kubectl describe secret <image-pull-secret>
- Check registry authentication:
kubectl create secret docker-registry myregistrykey \
--docker-server=myregistry.com \
--docker-username=myuser \
--docker-password=mypassword
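Creating the secret alone is not enough; the pod has to reference it. One common approach (a sketch, assuming the workload uses the default service account in its namespace) is to attach the secret to that service account so all pods pick it up automatically:
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "myregistrykey"}]}'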
Prevention:
- Use specific, immutable image tags rather than "latest."
- Store secrets securely and keep them up to date.
Networking Issues
Service Not Reachable
Why it happens:
- Service endpoints not created
- Network policies or firewalls blocking traffic
Step-by-step solution:
- Check service endpoints:
kubectl get endpoints <service-name>
kubectl describe service <service-name>
- Test pod-to-pod connectivity:
kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl exec -it <pod-name> -- curl <service-name>:8080
- Check network policies:
- Review any NetworkPolicy resources that may restrict traffic.
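An empty Endpoints list usually means the Service's selector does not match any pod labels. A quick way to compare the two (the label key/value below are placeholders):
# Show the Service's selector
kubectl get service <service-name> -o jsonpath='{.spec.selector}'
# List pods carrying that label and confirm they are Running and Ready
kubectl get pods -l <key>=<value> -o wide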
Prevention:
- Document service ports and network policies.
- Use readiness/liveness probes.
DNS Resolution Problems
Why it happens:
- CoreDNS not running or misconfigured
- Network issues between pods and DNS
Step-by-step solution:
- Check CoreDNS status:
kubectl get pods -n kube-system -l k8s-app=kube-dns
- Test DNS resolution from a pod:
kubectl run test-pod --image=busybox --rm -it -- nslookup kubernetes.default
- Restart CoreDNS if needed:
kubectl rollout restart deployment coredns -n kube-system
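If pods still cannot resolve names even though CoreDNS looks healthy, check the CoreDNS logs and the resolver configuration injected into an affected pod:
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Resolver config inside an affected pod
kubectl exec -it <pod-name> -- cat /etc/resolv.conf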
Prevention:
- Avoid custom DNS settings unless necessary.
- Monitor CoreDNS health.
Resource Issues
Out of Memory (OOM) Errors
Why it happens:
- Pod exceeds its memory limit
- Node is out of memory
Step-by-step solution:
- Check memory usage:
kubectl top pods --sort-by=memory
kubectl describe pod <pod-name>
- Increase memory limits if needed:
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1Gi"}}}]}}}}'
- Investigate memory leaks in your app.
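To confirm a container was actually OOM-killed (rather than evicted or crashed for another reason), inspect its last terminated state:
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# "OOMKilled" here confirms the container hit its memory limit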
Prevention:
- Set realistic memory requests/limits.
- Monitor pod memory usage over time.
CPU Throttling
Why it happens:
- Pod exceeds its CPU limit
- Node is under heavy load
Step-by-step solution:
- Check CPU usage:
kubectl top pods --sort-by=cpu
- Increase CPU limits if needed:
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"cpu":"1000m"}}}]}}}}'
- Optimize application code for efficiency.
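As an alternative to patching raw JSON, kubectl set resources achieves the same result more readably (the values below are examples, not recommendations):
kubectl set resources deployment <deployment-name> \
  --requests=cpu=500m --limits=cpu=1000m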
Prevention:
- Set appropriate CPU requests/limits.
- Monitor CPU usage and adjust as needed.
Edge-Specific Issues
Intermittent Connectivity
Why it happens:
- Unstable network links to edge nodes
- Node hardware or OS issues
Step-by-step solution:
- Check node status:
kubectl get nodes --show-labels
kubectl describe node <edge-node>
- Verify network connectivity:
ping <node-ip>
traceroute <node-ip>
- Check for hardware or OS errors in node logs.
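If you have SSH access to the edge node, the kubelet journal and the kernel ring buffer are the usual places to look for hardware or OS errors (a sketch; unit and tool names may differ by distribution):
ssh <edge-node>
journalctl -u kubelet --since "1 hour ago"
dmesg | tail -n 50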
Prevention:
- Use redundant network links where possible.
- Monitor node health and connectivity.
Resource Constraints on Edge Nodes
Why it happens:
- Edge nodes have limited CPU/memory/storage
Step-by-step solution:
- Use resource-efficient configurations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-optimized-app
spec:
  selector:
    matchLabels:
      app: edge-optimized-app
  template:
    metadata:
      labels:
        app: edge-optimized-app
    spec:
      containers:
      - name: app
        image: <image-name>
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
- Schedule only essential workloads on edge nodes.
Prevention:
- Profile workloads for resource usage before deploying to edge.
- Use taints/tolerations to control scheduling.
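For example, you can reserve edge nodes for approved workloads by tainting them; only pods that carry a matching toleration will be scheduled there. The key/value edge=true below is a placeholder:
kubectl taint nodes <edge-node> edge=true:NoSchedule
Workloads intended for these nodes then need a corresponding toleration in their pod spec.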
Quick Diagnostic Commands
# General cluster health
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get events --sort-by='.lastTimestamp'
# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Network debugging
kubectl get services --all-namespaces
kubectl get ingress --all-namespaces
# Rook Ceph storage debugging
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd status
Advanced Scenarios
PersistentVolume (PV) and Storage Issues
Why it happens:
- PVs not bound to PVCs
- Rook Ceph StorageClass misconfiguration
- Ceph cluster unhealthy or OSDs down
- Underlying storage unavailable
Step-by-step solution:
- Check PVC and PV status:
kubectl get pvc -A
kubectl get pv
kubectl describe pvc <pvc-name>
- Look for status "Pending" or "Lost".
- Check StorageClass:
kubectl get storageclass
kubectl describe storageclass <name>
- Check Ceph cluster health:
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
- Check events for errors:
kubectl get events --sort-by='.lastTimestamp'
- Fix:
- Ensure the correct Rook Ceph StorageClass is set (rook-ceph-block, rook-cephfs, etc.).
- Check Ceph cluster health and OSD status.
- Restart CSI driver pods if they're failing.
- Delete and recreate stuck PVCs if safe.
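When the Ceph cluster itself is the suspect, the health detail and OSD tree from the toolbox pod usually point at the failing component:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree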
Prevention:
- Monitor Ceph cluster health regularly.
- Use appropriate Rook Ceph storage classes.
- Ensure adequate disk space on Ceph OSDs.
Node Disk Pressure & Pod Eviction
Why it happens:
- Node disk is full or nearly full
- Kubelet evicts pods to free up space
Step-by-step solution:
- Check node conditions:
kubectl describe node <node-name>
- Look for "DiskPressure" in conditions.
- Check evicted pods:
kubectl get pods --all-namespaces | grep Evicted
- Free up disk space:
- Clean up unused images, logs, or data on the node.
- Increase node disk size if possible.
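Two clean-ups that often recover space quickly (the first must be run on the affected node and assumes a reasonably recent crictl; evicted pods show up as Failed and can be deleted in bulk):
# On the node: remove container images not used by any running container
crictl rmi --prune
# From anywhere: delete evicted (Failed) pods cluster-wide
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces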
Prevention:
- Set up node disk monitoring and alerts.
- Use log rotation and clean-up policies.
Network Policy Debugging
Why it happens:
- NetworkPolicy resources block traffic unintentionally
Step-by-step solution:
- List all network policies:
kubectl get networkpolicy -A
- Check policy selectors and rules:
kubectl describe networkpolicy <name>
- Test connectivity with netshoot or busybox:
kubectl run -it --rm netshoot --image=nicolaka/netshoot -- bash
# Try ping/curl to target pods/services
- Temporarily remove or adjust policies to isolate the issue.
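If you do remove a policy to test, back it up first so it can be reapplied unchanged afterwards:
kubectl get networkpolicy <name> -n <namespace> -o yaml > <name>-backup.yaml
kubectl delete networkpolicy <name> -n <namespace>
# ...retest connectivity, then restore:
kubectl apply -f <name>-backup.yaml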
Prevention:
- Document and review all network policies.
- Test policies in staging before production.
API Server Unavailability
Why it happens:
- API server is overloaded, crashed, or unreachable
Step-by-step solution:
- Check API server pod status (if self-hosted):
kubectl get pods -n kube-system | grep apiserver
- Check control plane node health:
- Use cloud provider console or SSH to check node status.
- Check etcd health (if applicable):
kubectl get pods -n kube-system | grep etcd
- Check for network/firewall issues between nodes.
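Whenever kubectl can still connect, the API server's built-in health endpoints show which checks are failing:
kubectl cluster-info
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'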
Prevention:
- Use highly available control plane setups.
- Monitor API server and etcd health.
Cluster Upgrade Failures
Why it happens:
- Incompatible versions or missing prerequisites
- Failed node upgrades
Step-by-step solution:
- Check upgrade logs:
- Use your platform's upgrade logs or cloud provider console.
- Check node versions:
kubectl get nodes -o wide
- Roll back or retry upgrade as per platform documentation.
- Contact support if cluster is stuck.
Prevention:
- Always test upgrades in staging.
- Read release notes and upgrade guides.
Advanced Log Collection & Analysis
Why it happens:
- Need to debug complex, multi-pod or multi-node issues
Step-by-step solution:
- Collect logs from all pods in a namespace:
for pod in $(kubectl get pods -n <ns> -o name); do kubectl logs -n <ns> $pod; done
- Use log aggregation tools (e.g., EFK/ELK, Loki, etc.):
- Query logs across pods, nodes, and time ranges.
- Correlate logs with events and metrics for root cause analysis.
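A slightly extended version of the loop above, saving each pod's timestamped logs to a separate file for offline analysis (a sketch, assuming the namespace is passed as the first argument to the script):
ns="${1:-default}"
mkdir -p "logs/$ns"
for pod in $(kubectl get pods -n "$ns" -o name); do
  # $pod has the form "pod/<name>"; strip the prefix for the filename
  kubectl logs -n "$ns" "$pod" --timestamps --all-containers > "logs/$ns/${pod#pod/}.log"
done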
Prevention:
- Set up centralized logging and monitoring.
- Use structured logging in your applications.
With this guide, you should be able to resolve most common issues in edge platform deployments. For more advanced troubleshooting, see the linked guides or reach out for help.