Common Issues & Solutions: A Practical Troubleshooting Guide
This guide helps you quickly diagnose and resolve the most common problems encountered in edge platform deployments. For each issue, you'll find explanations, step-by-step troubleshooting, example commands, and prevention tips.
How to Use This Guide
- Start with the General Troubleshooting Workflow below if you're not sure where to begin.
- Find your issue in the sections that follow for targeted solutions.
- Use the prevention tips to avoid recurring problems.
- Escalate or ask for help if you're stuck (see the end of this guide).
General Troubleshooting Workflow
- Check Cluster Health:
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get events --sort-by='.lastTimestamp'
- Narrow Down the Problem:
- Is it a deployment, networking, or resource issue?
- Check Logs and Events:
kubectl describe pod <pod-name>
kubectl logs <pod-name>
- Use Diagnostic Commands:
- See the "Quick Diagnostic Commands" section below.
- Apply Solutions from this Guide.
- Escalate if Needed:
- See "When to Escalate/Ask for Help."
Deployment Issues
Pod Stuck in Pending State
Why it happens:
- Not enough resources on nodes
- Unschedulable pods due to taints, affinity, or node selectors
Step-by-step solution:
- Check node resources:
kubectl describe nodes
kubectl top nodes
- Look for nodes with insufficient CPU/memory.
- Check pod events:
kubectl describe pod <pod-name>
- Look for messages like "Insufficient memory" or "No nodes available."
- Fix:
- Free up resources or add nodes.
- Adjust pod resource requests/limits.
- Remove taints or adjust node selectors if needed.
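For example, if a taint is blocking scheduling, you can list taints and remove the offending one, or lower the workload's requests so it fits on available nodes. This is a sketch; "dedicated" is a placeholder taint key, and the example values are illustrative:
# See which taints are set on each node
kubectl describe nodes | grep -i taints
# Remove a blocking taint (the trailing "-" removes it)
kubectl taint nodes <node-name> dedicated:NoSchedule-
# Reduce the deployment's resource requests
kubectl set resources deployment <deployment-name> --requests=cpu=100m,memory=128Mi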
Prevention:
- Set realistic resource requests/limits.
- Monitor cluster utilization.
Image Pull Errors (ImagePullBackOff, ErrImagePull)
Why it happens:
- Image does not exist or is misspelled
- Registry authentication issues
Step-by-step solution:
- Check if the image exists:
docker pull <image-name>
- Check image pull secrets:
kubectl get secrets
kubectl describe secret <image-pull-secret>
- Check registry authentication:
kubectl create secret docker-registry myregistrykey \
--docker-server=myregistry.com \
--docker-username=myuser \
--docker-password=mypassword
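Creating the secret alone is not enough; the pod has to reference it. One common approach (a sketch, assuming the workload uses the default service account in its namespace) is to attach the secret to that service account so all pods pick it up automatically:
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "myregistrykey"}]}'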
Prevention:
- Use specific, immutable image tags rather than "latest."
- Store secrets securely and keep them up to date.
Networking Issues
Service Not Reachable
Why it happens:
- Service endpoints not created
- Network policies or firewalls blocking traffic
Step-by-step solution:
- Check service endpoints:
kubectl get endpoints <service-name>
kubectl describe service <service-name>
- Test pod-to-pod connectivity:
kubectl exec -it <pod-name> -- nslookup <service-name>
kubectl exec -it <pod-name> -- curl <service-name>:8080
- Check network policies:
- Review any NetworkPolicy resources that may restrict traffic.
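An empty Endpoints list usually means the Service's selector does not match any pod labels. A quick way to compare the two (the label key/value below are placeholders):
# Show the Service's selector
kubectl get service <service-name> -o jsonpath='{.spec.selector}'
# List pods carrying that label and confirm they are Running and Ready
kubectl get pods -l <key>=<value> -o wide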
Prevention:
- Document service ports and network policies.
- Use readiness/liveness probes.
DNS Resolution Problems
Why it happens:
- CoreDNS not running or misconfigured
- Network issues between pods and DNS
Step-by-step solution:
- Check CoreDNS status:
kubectl get pods -n kube-system -l k8s-app=kube-dns
- Test DNS resolution from a pod:
kubectl run test-pod --image=busybox --rm -it -- nslookup kubernetes.default
- Restart CoreDNS if needed:
kubectl rollout restart deployment coredns -n kube-system
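If pods still cannot resolve names even though CoreDNS looks healthy, check the CoreDNS logs and the resolver configuration injected into an affected pod:
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Resolver config inside an affected pod
kubectl exec -it <pod-name> -- cat /etc/resolv.conf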
Prevention:
- Avoid custom DNS settings unless necessary.
- Monitor CoreDNS health.
Resource Issues
Out of Memory (OOM) Errors
Why it happens:
- Pod exceeds its memory limit
- Node is out of memory
Step-by-step solution:
- Check memory usage:
kubectl top pods --sort-by=memory
kubectl describe pod <pod-name>
- Increase memory limits if needed:
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1Gi"}}}]}}}}'
- Investigate memory leaks in your app.
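To confirm a container was actually OOM-killed (rather than evicted or crashed for another reason), inspect its last terminated state:
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# "OOMKilled" here confirms the container hit its memory limit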
Prevention:
- Set realistic memory requests/limits.
- Monitor pod memory usage over time.
CPU Throttling
Why it happens:
- Pod exceeds its CPU limit
- Node is under heavy load
Step-by-step solution:
- Check CPU usage:
kubectl top pods --sort-by=cpu
- Increase CPU limits if needed:
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"cpu":"1000m"}}}]}}}}'
- Optimize application code for efficiency.
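As an alternative to patching raw JSON, kubectl set resources achieves the same result more readably (the values below are examples, not recommendations):
kubectl set resources deployment <deployment-name> \
  --requests=cpu=500m --limits=cpu=1000m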
Prevention:
- Set appropriate CPU requests/limits.
- Monitor CPU usage and adjust as needed.
Edge-Specific Issues
Intermittent Connectivity
Why it happens:
- Unstable network links to edge nodes
- Node hardware or OS issues
Step-by-step solution:
- Check node status:
kubectl get nodes --show-labels
kubectl describe node <edge-node>
- Verify network connectivity:
ping <node-ip>
traceroute <node-ip>
- Check for hardware or OS errors in node logs.
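If you have SSH access to the edge node, the kubelet journal and the kernel ring buffer are the usual places to look for hardware or OS errors (a sketch; unit and tool names may differ by distribution):
ssh <edge-node>
journalctl -u kubelet --since "1 hour ago"
dmesg | tail -n 50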
Prevention:
- Use redundant network links where possible.
- Monitor node health and connectivity.
Resource Constraints on Edge Nodes
Why it happens:
- Edge nodes have limited CPU/memory/storage
Step-by-step solution:
- Use resource-efficient configurations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-optimized-app
spec:
  selector:
    matchLabels:
      app: edge-optimized-app
  template:
    metadata:
      labels:
        app: edge-optimized-app
    spec:
      containers:
      - name: app
        image: <image-name>
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
- Schedule only essential workloads on edge nodes.
Prevention:
- Profile workloads for resource usage before deploying to edge.
- Use taints/tolerations to control scheduling.
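For example, you can reserve edge nodes for approved workloads by tainting them; only pods that carry a matching toleration will be scheduled there. The key/value edge=true below is a placeholder:
kubectl taint nodes <edge-node> edge=true:NoSchedule
Workloads intended for these nodes then need a corresponding toleration in their pod spec.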
Quick Diagnostic Commands
# General cluster health
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get events --sort-by='.lastTimestamp'
# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Network debugging
kubectl get services --all-namespaces
kubectl get ingress --all-namespaces
# Rook Ceph storage debugging
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd status
Advanced Scenarios
PersistentVolume (PV) and Storage Issues
Why it happens:
- PVs not bound to PVCs
- Rook Ceph StorageClass misconfiguration
- Ceph cluster unhealthy or OSDs down
- Underlying storage unavailable
Step-by-step solution:
- Check PVC and PV status:
kubectl get pvc -A
kubectl get pv
kubectl describe pvc <pvc-name>
- Look for status "Pending" or "Lost".
- Check StorageClass:
kubectl get storageclass
kubectl describe storageclass <name>
- Check Ceph cluster health:
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
- Check events for errors:
kubectl get events --sort-by='.lastTimestamp'
- Fix:
- Ensure the correct Rook Ceph StorageClass is set (rook-ceph-block, rook-cephfs, etc.).
- Check Ceph cluster health and OSD status.
- Restart CSI driver pods if they're failing.
- Delete and recreate stuck PVCs if safe.
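When the Ceph cluster itself is the suspect, the health detail and OSD tree from the toolbox pod usually point at the failing component:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree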
Prevention:
- Monitor Ceph cluster health regularly.
- Use appropriate Rook Ceph storage classes.
- Ensure adequate disk space on Ceph OSDs.
Node Disk Pressure & Pod Eviction
Why it happens:
- Node disk is full or nearly full
- Kubelet evicts pods to free up space
Step-by-step solution:
- Check node conditions:
kubectl describe node <node-name>
- Look for "DiskPressure" in conditions.
- Check evicted pods:
kubectl get pods --all-namespaces | grep Evicted
- Free up disk space:
- Clean up unused images, logs, or data on the node.
- Increase node disk size if possible.
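Two clean-ups that often recover space quickly (the first must be run on the affected node and assumes a reasonably recent crictl; evicted pods show up as Failed and can be deleted in bulk):
# On the node: remove container images not used by any running container
crictl rmi --prune
# From anywhere: delete evicted (Failed) pods cluster-wide
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces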
Prevention:
- Set up node disk monitoring and alerts.
- Use log rotation and clean-up policies.
Network Policy Debugging
Why it happens:
- NetworkPolicy resources block traffic unintentionally
Step-by-step solution:
- List all network policies:
kubectl get networkpolicy -A
- Check policy selectors and rules:
kubectl describe networkpolicy <name>
- Test connectivity with netshoot or busybox:
kubectl run -it --rm netshoot --image=nicolaka/netshoot -- bash
# Try ping/curl to target pods/services
- Temporarily remove or adjust policies to isolate the issue.
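If you do remove a policy to test, back it up first so it can be reapplied unchanged afterwards:
kubectl get networkpolicy <name> -n <namespace> -o yaml > <name>-backup.yaml
kubectl delete networkpolicy <name> -n <namespace>
# ...retest connectivity, then restore:
kubectl apply -f <name>-backup.yaml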
Prevention:
- Document and review all network policies.
- Test policies in staging before production.
API Server Unavailability
Why it happens:
- API server is overloaded, crashed, or unreachable
Step-by-step solution:
- Check API server pod status (if self-hosted):
kubectl get pods -n kube-system | grep apiserver
- Check control plane node health:
- Use cloud provider console or SSH to check node status.
- Check etcd health (if applicable):
kubectl get pods -n kube-system | grep etcd
- Check for network/firewall issues between nodes.
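Whenever kubectl can still connect, the API server's built-in health endpoints show which checks are failing:
kubectl cluster-info
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'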
Prevention:
- Use highly available control plane setups.
- Monitor API server and etcd health.
Cluster Upgrade Failures
Why it happens:
- Incompatible versions or missing prerequisites
- Failed node upgrades
Step-by-step solution:
- Check upgrade logs:
- Use your platform's upgrade logs or cloud provider console.
- Check node versions:
kubectl get nodes -o wide
- Roll back or retry upgrade as per platform documentation.
- Contact support if cluster is stuck.
Prevention:
- Always test upgrades in staging.
- Read release notes and upgrade guides.
Advanced Log Collection & Analysis
Why it happens:
- Need to debug complex, multi-pod or multi-node issues
Step-by-step solution:
- Collect logs from all pods in a namespace:
for pod in $(kubectl get pods -n <ns> -o name); do kubectl logs -n <ns> $pod; done
- Use log aggregation tools (e.g., EFK/ELK, Loki, etc.):
- Query logs across pods, nodes, and time ranges.
- Correlate logs with events and metrics for root cause analysis.
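A slightly extended version of the loop above, saving each pod's timestamped logs to a separate file for offline analysis (a sketch, assuming the namespace is passed as the first argument to the script):
ns="${1:-default}"
mkdir -p "logs/$ns"
for pod in $(kubectl get pods -n "$ns" -o name); do
  # $pod has the form "pod/<name>"; strip the prefix for the filename
  kubectl logs -n "$ns" "$pod" --timestamps --all-containers > "logs/$ns/${pod#pod/}.log"
done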
Prevention:
- Set up centralized logging and monitoring.
- Use structured logging in your applications.
With this guide, you should be able to resolve most common issues in edge platform deployments. For more advanced troubleshooting, see the linked guides or reach out for help.