Common Issues & Solutions: A Practical Troubleshooting Guide

This guide helps you quickly diagnose and resolve the most common problems encountered in edge platform deployments. For each issue, you'll find explanations, step-by-step troubleshooting, example commands, and prevention tips.


How to Use This Guide

  • Start with the General Troubleshooting Workflow below if you're not sure where to begin.
  • Find your issue in the sections that follow for targeted solutions.
  • Use the prevention tips to avoid recurring problems.
  • Escalate or ask for help if you're stuck (see the end of this guide).

General Troubleshooting Workflow

  1. Check Cluster Health:
    kubectl get nodes
    kubectl get pods --all-namespaces
    kubectl get events --sort-by='.lastTimestamp'
  2. Narrow Down the Problem:
    • Is it a deployment, networking, or resource issue?
  3. Check Logs and Events:
    kubectl describe pod <pod-name>
    kubectl logs <pod-name>
  4. Use Diagnostic Commands:
    • See the "Quick Diagnostic Commands" section below.
  5. Apply Solutions from this Guide.
  6. Escalate if Needed:
    • See "When to Escalate/Ask for Help."

Deployment Issues

Pod Stuck in Pending State

Why it happens:

  • Not enough resources on nodes
  • Unschedulable pods due to taints, affinity, or node selectors

Step-by-step solution:

  1. Check node resources:
    kubectl describe nodes
    kubectl top nodes
    • Look for nodes with insufficient CPU/memory.
  2. Check pod events:
    kubectl describe pod <pod-name>
    • Look for messages like "Insufficient memory" or "No nodes available."
  3. Fix:
    • Free up resources or add nodes.
    • Adjust pod resource requests/limits.
    • Remove taints or adjust node selectors if needed.
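
The exact fix depends on your cluster; a minimal sketch of the last two options (all names and values below are placeholders to adapt):

# Remove a taint that is blocking scheduling
kubectl taint nodes <node-name> <taint-key>:NoSchedule-

# Lower the pod's resource requests via its Deployment so it fits on an existing node
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"requests":{"cpu":"100m","memory":"128Mi"}}}]}}}}'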

Prevention:

  • Set realistic resource requests/limits.
  • Monitor cluster utilization.

Image Pull Errors (ImagePullBackOff, ErrImagePull)

Why it happens:

  • Image does not exist or is misspelled
  • Registry authentication issues

Step-by-step solution:

  1. Check if the image exists:
    docker pull <image-name>
  2. Check image pull secrets:
    kubectl get secrets
    kubectl describe secret <image-pull-secret>
  3. Create or update the registry credentials secret if authentication is failing:
    kubectl create secret docker-registry myregistrykey \
    --docker-server=myregistry.com \
    --docker-username=myuser \
    --docker-password=mypassword
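
Creating the secret alone is not enough; the pod (or its service account) must reference it. One option, reusing the secret name from step 3:

# Attach the pull secret to the namespace's default service account so new pods use it automatically
kubectl patch serviceaccount default -p '{"imagePullSecrets":[{"name":"myregistrykey"}]}'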

Prevention:

  • Pin images to specific version tags (or digests) rather than "latest."
  • Store secrets securely and keep them up to date.

Networking Issues

Service Not Reachable

Why it happens:

  • Service endpoints not created
  • Network policies or firewalls blocking traffic

Step-by-step solution:

  1. Check service endpoints:
    kubectl get endpoints <service-name>
    kubectl describe service <service-name>
  2. Test pod-to-pod connectivity:
    kubectl exec -it <pod-name> -- nslookup <service-name>
    kubectl exec -it <pod-name> -- curl <service-name>:8080
  3. Check network policies:
    • Review any NetworkPolicy resources that may restrict traffic.
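
If step 1 shows an empty endpoints list, the usual cause is a Service selector that doesn't match any pod labels. A quick way to compare them (label key/value are placeholders):

# Show the Service's selector, then list pods carrying those labels
kubectl get service <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods -l <key>=<value> --show-labels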

Prevention:

  • Document service ports and network policies.
  • Use readiness/liveness probes.

DNS Resolution Problems

Why it happens:

  • CoreDNS not running or misconfigured
  • Network issues between pods and DNS

Step-by-step solution:

  1. Check CoreDNS status:
    kubectl get pods -n kube-system -l k8s-app=kube-dns
  2. Test DNS resolution from a pod:
    kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default
  3. Restart CoreDNS if needed:
    kubectl rollout restart deployment coredns -n kube-system
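
If CoreDNS is running but resolution still fails, its logs and ConfigMap usually reveal the problem (the ConfigMap is named coredns in most distributions):

# Check CoreDNS logs and configuration for errors such as upstream timeouts
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl get configmap coredns -n kube-system -o yaml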

Prevention:

  • Avoid custom DNS settings unless necessary.
  • Monitor CoreDNS health.

Resource Issues

Out of Memory (OOM) Errors

Why it happens:

  • Pod exceeds its memory limit
  • Node is out of memory

Step-by-step solution:

  1. Check memory usage:
    kubectl top pods --sort-by=memory
    kubectl describe pod <pod-name>
  2. Increase memory limits if needed:
    kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1Gi"}}}]}}}}'
  3. Investigate memory leaks in your app.
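
To confirm that a restart was caused by the OOM killer rather than an application crash, one option is to inspect the container's last terminated state:

# The reason shows "OOMKilled" if the memory limit was exceeded
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'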

Prevention:

  • Set realistic memory requests/limits.
  • Monitor pod memory usage over time.

CPU Throttling

Why it happens:

  • Pod exceeds its CPU limit
  • Node is under heavy load

Step-by-step solution:

  1. Check CPU usage:
    kubectl top pods --sort-by=cpu
  2. Increase CPU limits if needed:
    kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"cpu":"1000m"}}}]}}}}'
  3. Optimize application code for efficiency.
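
To verify throttling directly, you can read the cgroup counters from inside the container; the path below assumes cgroup v2 (on cgroup v1 nodes it is /sys/fs/cgroup/cpu/cpu.stat):

# nr_throttled and throttled_usec increase while the container is being throttled
kubectl exec -it <pod-name> -- cat /sys/fs/cgroup/cpu.stat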

Prevention:

  • Set appropriate CPU requests/limits.
  • Monitor CPU usage and adjust as needed.

Edge-Specific Issues

Intermittent Connectivity

Why it happens:

  • Unstable network links to edge nodes
  • Node hardware or OS issues

Step-by-step solution:

  1. Check node status:
    kubectl get nodes --show-labels
    kubectl describe node <edge-node>
  2. Verify network connectivity:
    ping <node-ip>
    traceroute <node-ip>
  3. Check for hardware or OS errors in node logs.
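
For step 3, if you have SSH access to the edge node and it runs systemd, the kubelet and kernel logs are a good starting point (dmesg may require sudo):

# Review recent kubelet and kernel messages on the node
ssh <edge-node> "journalctl -u kubelet --since '1 hour ago' --no-pager"
ssh <edge-node> "dmesg | tail -n 50"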

Prevention:

  • Use redundant network links where possible.
  • Monitor node health and connectivity.

Resource Constraints on Edge Nodes

Why it happens:

  • Edge nodes have limited CPU/memory/storage

Step-by-step solution:

  1. Use resource-efficient configurations:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: edge-optimized-app
    spec:
      selector:
        matchLabels:
          app: edge-optimized-app
      template:
        metadata:
          labels:
            app: edge-optimized-app
        spec:
          containers:
          - name: app
            image: <app-image>
            resources:
              requests:
                memory: "64Mi"
                cpu: "100m"
              limits:
                memory: "128Mi"
                cpu: "200m"
  2. Schedule only essential workloads on edge nodes.

Prevention:

  • Profile workloads for resource usage before deploying to edge.
  • Use taints/tolerations to control scheduling.
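
A minimal taint/toleration sketch, assuming a taint key of dedicated=edge (adjust to your own naming convention):

# Reserve an edge node for workloads that explicitly tolerate it
kubectl taint nodes <edge-node> dedicated=edge:NoSchedule

# Pods meant for that node then declare a matching toleration in their spec:
#   tolerations:
#   - key: "dedicated"
#     operator: "Equal"
#     value: "edge"
#     effect: "NoSchedule"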

Quick Diagnostic Commands

# General cluster health
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get events --sort-by='.lastTimestamp'

# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Network debugging
kubectl get services --all-namespaces
kubectl get ingress --all-namespaces

# Rook Ceph storage debugging
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd status

Advanced Scenarios

PersistentVolume (PV) and Storage Issues

Why it happens:

  • PVs not bound to PVCs
  • Rook Ceph StorageClass misconfiguration
  • Ceph cluster unhealthy or OSDs down
  • Underlying storage unavailable

Step-by-step solution:

  1. Check PVC and PV status:
    kubectl get pvc -A
    kubectl get pv
    kubectl describe pvc <pvc-name>
    • Look for status "Pending" or "Lost".
  2. Check StorageClass:
    kubectl get storageclass
    kubectl describe storageclass <name>
  3. Check Ceph cluster health:
    kubectl -n rook-ceph get cephcluster
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
  4. Check events for errors:
    kubectl get events --sort-by='.lastTimestamp'
  5. Fix:
    • Ensure the correct Rook Ceph StorageClass is set (rook-ceph-block, rook-cephfs, etc.).
    • Check Ceph cluster health and OSD status.
    • Restart CSI driver pods if they're failing (see the commands below).
    • Delete and recreate stuck PVCs if safe.
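
For the CSI restart, the commands below use the default Rook Ceph deployment names; verify them first with kubectl -n rook-ceph get deploy if your install customizes the operator:

# Restart the Rook Ceph CSI provisioners
kubectl -n rook-ceph rollout restart deploy/csi-rbdplugin-provisioner
kubectl -n rook-ceph rollout restart deploy/csi-cephfsplugin-provisioner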

Prevention:

  • Monitor Ceph cluster health regularly.
  • Use appropriate Rook Ceph storage classes.
  • Ensure adequate disk space on Ceph OSDs.

Node Disk Pressure & Pod Eviction

Why it happens:

  • Node disk is full or nearly full
  • Kubelet evicts pods to free up space

Step-by-step solution:

  1. Check node conditions:
    kubectl describe node <node-name>
    • Look for "DiskPressure" in conditions.
  2. Check evicted pods:
    kubectl get pods --all-namespaces | grep Evicted
  3. Free up disk space:
    • Clean up unused images, logs, or data on the node.
    • Increase node disk size if possible.
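
A hedged example of reclaiming space on the node itself, assuming SSH access, a containerd runtime with crictl installed, and systemd journald:

# Remove unused container images and trim old journal logs on the node
ssh <node-name> "sudo crictl rmi --prune"
ssh <node-name> "sudo journalctl --vacuum-size=200M"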

Prevention:

  • Set up node disk monitoring and alerts.
  • Use log rotation and clean-up policies.

Network Policy Debugging

Why it happens:

  • NetworkPolicy resources block traffic unintentionally

Step-by-step solution:

  1. List all network policies:
    kubectl get networkpolicy -A
  2. Check policy selectors and rules:
    kubectl describe networkpolicy <name>
  3. Test connectivity with netshoot or busybox:
    kubectl run -it --rm netshoot --image=nicolaka/netshoot -- bash
    # Try ping/curl to target pods/services
  4. Temporarily remove or adjust policies to isolate the issue.
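
One low-risk way to do step 4 in a non-production namespace is to apply a temporary allow-all policy, retest, and then delete it (the policy name is just a placeholder; because NetworkPolicies are additive, an allow rule permits traffic that other policies left blocked):

kubectl apply -n <ns> -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - {}
  egress:
  - {}
EOF

# Remove it as soon as the test is finished
kubectl delete networkpolicy debug-allow-all -n <ns>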

Prevention:

  • Document and review all network policies.
  • Test policies in staging before production.

API Server Unavailability

Why it happens:

  • API server is overloaded, crashed, or unreachable

Step-by-step solution:

  1. Check API server pod status (if self-hosted):
    kubectl get pods -n kube-system | grep apiserver
  2. Check control plane node health:
    • Use cloud provider console or SSH to check node status.
  3. Check etcd health (if applicable):
    kubectl get pods -n kube-system | grep etcd
  4. Check for network/firewall issues between nodes.
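
If the API server is at least intermittently reachable, its built-in health endpoints help narrow down which component is failing:

# Query the aggregated health checks exposed by the API server
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'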

Prevention:

  • Use highly available control plane setups.
  • Monitor API server and etcd health.

Cluster Upgrade Failures

Why it happens:

  • Incompatible versions or missing prerequisites
  • Failed node upgrades

Step-by-step solution:

  1. Check upgrade logs:
    • Use your platform's upgrade logs or cloud provider console.
  2. Check node versions:
    kubectl get nodes -o wide
  3. Roll back or retry upgrade as per platform documentation.
  4. Contact support if cluster is stuck.
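
Failed node upgrades often leave nodes cordoned after a drain; a quick way to spot them:

# Nodes still cordoned from the upgrade show SchedulingDisabled in the STATUS column
kubectl get nodes | grep SchedulingDisabled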

Prevention:

  • Always test upgrades in staging.
  • Read release notes and upgrade guides.

Advanced Log Collection & Analysis

When you need it:

  • Debugging complex issues that span multiple pods or nodes

Step-by-step solution:

  1. Collect logs from all pods in a namespace (one file per pod):
    for pod in $(kubectl get pods -n <ns> -o name); do kubectl logs -n <ns> "$pod" --all-containers > "${pod#pod/}.log"; done
  2. Use log aggregation tools (e.g., EFK/ELK, Loki, etc.):
    • Query logs across pods, nodes, and time ranges.
  3. Correlate logs with events and metrics for root cause analysis.
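
A slightly broader capture that pairs the logs with events and pod descriptions for later correlation (assumes write access to the working directory):

# Save events and pod descriptions alongside the logs
kubectl get events -n <ns> --sort-by='.lastTimestamp' > events.txt
kubectl describe pods -n <ns> > pod-descriptions.txt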

Prevention:

  • Set up centralized logging and monitoring.
  • Use structured logging in your applications.

With this guide, you should be able to resolve most common issues in edge platform deployments. For more advanced troubleshooting, see the linked guides or reach out for help.