Kubernetes Error: CrashLoopBackOff — Pod Restart Loop
NAME                  READY   STATUS             RESTARTS   AGE
api-7d8f6c5b9-abc12   0/1     CrashLoopBackOff   8          12m
# kubectl describe pod api-7d8f6c5b9-abc12
Last State:   Terminated
  Reason:     Error
  Exit Code:  1
  Started:    Fri, 26 Apr 2026 09:14:32 +0100
  Finished:   Fri, 26 Apr 2026 09:14:33 +0100
Events:
  Warning  BackOff  2m (x42 over 11m)  kubelet  Back-off restarting failed container
CrashLoopBackOff is the visible symptom of an invisible loop: your container starts, exits, and starts again — over and over — until the kubelet slows down the restart cadence. It’s not a single error type; it’s “your app died several times in a row and we’re rate-limiting you.” The real error is hiding in the previous container’s logs and the pod’s events.
The diagnostic sequence is fixed: kubectl logs --previous to see what the dying app said, kubectl describe pod to see how it died (Error vs OOMKilled vs probe-failed), then fix the right thing. Don’t bump memory if the issue is a missing env var; don’t disable the probe if the app genuinely isn’t healthy.
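In command form, that sequence is (pod and namespace are placeholders):
# 1. What did the dying container say?
kubectl logs <pod> -n <namespace> --previous
# 2. How did it die (Error, OOMKilled, or probe kill)?
kubectl describe pod <pod> -n <namespace>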
Why this happens
- Application throws on startup. Missing env var, malformed config, code-level exception during init. `kubectl logs --previous` shows the stack trace from the dying container. The app exits with code 1 (or whatever the runtime maps the exception to) and the kubelet restarts it.
- Container OOMKilled. Your container exceeded its memory limit and the kernel SIGKILLed it. Last State shows `Reason: OOMKilled, Exit Code: 137`. Fix is more memory in the resource request/limit, or reducing the working-set size of your app (the latter is usually right).
- Liveness probe failing. App boots fine but the liveness probe gets HTTP 5xx or doesn't reply within `timeoutSeconds`, so the kubelet decides the container is unhealthy and kills it. Common when the app's liveness endpoint depends on a slow downstream — make the probe shallow.
- Wrong command or arg. Container starts but the entrypoint exits immediately. Could be a typo in `command:`, a missing binary in the image, a script that runs `exit 1`, or an init script that depends on a file that wasn't mounted yet.
- Boot-time dependency unavailable. App connects to Postgres/Redis/etcd in its `init()` and exits if the connection fails. On a cold cluster start, the dependency may not be ready yet. The fix is retry-with-backoff in the app or a proper init container that waits (see the sketch after this list).
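For the init-container approach, a minimal sketch (the Service name `postgres` and port 5432 are assumptions; substitute your actual dependency):
spec:
  template:
    spec:
      initContainers:
        - name: wait-for-postgres
          image: busybox:1.36
          # Block the app container until the dependency accepts TCP connections.
          command: ['sh', '-c', 'until nc -z postgres 5432; do sleep 2; done']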
How to fix it
Fixes are ordered by likelihood. Start with the first one that matches your context.
1. Read the previous container's logs first
`--previous` shows logs from the last terminated instance, not the currently-failing one. This is where the actual crash reason lives — stack trace, missing env var, connection error, etc.
# The pod is currently restarting; --previous shows the dying container's logs.
kubectl logs <pod> -n <namespace> --previous
# If the pod has multiple containers, name the one you want:
kubectl logs <pod> -n <namespace> -c <container-name> --previous
# Tail logs as the new instance starts (sometimes catches startup faster):
kubectl logs <pod> -n <namespace> -f
2. Check `kubectl describe` for OOMKilled and probe failures
The Events and Last State sections tell you whether the container exited on its own (Reason: Error, your code), got killed for memory (Reason: OOMKilled, exit 137), or was killed by the kubelet for failing a probe.
kubectl describe pod <pod> -n <namespace>
# Look for these indicators:
# Last State: Terminated
# Reason: OOMKilled ← memory limit
# Reason: Error ← app exited non-zero
# Exit Code: 137 ← SIGKILL (OOM or kubelet)
# Exit Code: 143 ← SIGTERM (graceful shutdown signal)
#
# In Events, look for:
# Liveness probe failed: ...
# Readiness probe failed: ...
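To pull just the termination details without scanning the full describe output, a jsonpath query works; the `[0]` index assumes a single-container pod:
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'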
3. Bump memory limits if OOMKilled is the cause
Don't guess — set both `requests` (what's reserved) and `limits` (the cgroup ceiling). Run with the limit raised, observe actual usage with `kubectl top pod`, then set the limit at ~30% above peak.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
        - name: api
          image: myregistry.example/team/api:v2.3.1
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi          # raise this if OOMKilled
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
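Once the pod stays up, measure what it actually uses before settling on a final limit (requires metrics-server):
# Per-container usage; size the limit ~30% above the observed peak.
kubectl top pod <pod> -n <namespace> --containers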
4. Make the liveness probe shallow and add startupProbe
Liveness should answer "is the process alive and not deadlocked?" — not "is the database reachable?" Use a shallow `/healthz` that doesn't fan out. For slow-starting apps, add a `startupProbe` that gives lots of time before liveness kicks in.
# Shallow liveness — just confirms the HTTP server is up.
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
# Readiness can be deeper — checks downstream deps.
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 3
# Startup gives slow-booting apps up to 5 minutes before liveness fires.
startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30   # 30 * 10s = 5 minutes
5. Verify config (env vars, ConfigMaps, Secrets) actually mounts
Attach a debug container with `kubectl debug` (or `kubectl exec` into a debug pod with the same spec), then use `env` / `cat /etc/...` to confirm the config is where the app expects it. Many CrashLoopBackOff incidents come down to a typo in a Secret name or a missing key.
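One way to run that check is the `--copy-to` pattern, which clones the pod but runs a shell instead of the crashing entrypoint; the config path below is a placeholder for wherever your app reads config:
# Clone the pod, replacing the target container's command with a shell.
kubectl debug <pod> -n <namespace> -it --copy-to=<pod>-debug \
  --container=<container-name> -- sh
# Inside the copy, env vars and mounts match the real pod:
env | sort
cat /etc/app/config.yaml   # placeholder path
# Separately, scan the rendered spec for typos in Secret/ConfigMap references:
kubectl get pod <pod> -n <namespace> -o yaml | grep -iA3 'secretKeyRef\|configMapKeyRef'
# Delete the debug copy when done:
kubectl delete pod <pod>-debug -n <namespace>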
Detection and monitoring in production
Alert on `kube_pod_container_status_restarts_total` rate (Prometheus + kube-state-metrics) — a pod restarting more than ~5 times in 10 minutes is in or near CrashLoopBackOff regardless of the official phase. Pair with an OOMKill alert (`container_oom_events_total`). Track CrashLoopBackOff incidents by deployment to spot bad releases fast — a fresh deploy that immediately CrashLoops should auto-rollback.
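As a Prometheus rule, a sketch of that alert (the metric comes from kube-state-metrics; the threshold, alert name, and labels are illustrative):
groups:
  - name: crashloop
    rules:
      - alert: PodRestartingFrequently
        # More than 5 restarts in 10 minutes: in or near CrashLoopBackOff.
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restart-looping"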
Related errors
- ImagePullBackOff (Kubernetes): The kubelet failed to pull the container image and is now backing off retry attempts. The underlying error is ErrImagePull; ImagePullBackOff is the recovery loop. Causes are almost always a wrong image name/tag, missing registry auth, a registry rate limit, or an image that doesn't exist for the node's architecture.
- heap out of memory (Node.js): V8's old-generation heap filled up and the garbage collector couldn't free enough space, so V8 aborts the process with a fatal allocation failure. Default heap is ~4GB on 64-bit; long-lived references (caches, listeners, closures, big arrays) prevent reclamation.
- ECONNREFUSED (Postgres): Your application tried to open a TCP connection to Postgres and the OS rejected it — Postgres isn't listening on the host:port you specified, or a firewall blocked the connection.
- ModuleNotFoundError (Python): The Python interpreter walked `sys.path` and couldn't find the module you imported. Most common cause: you installed the package in a different environment (different venv, different Python version, system pip vs project pip) than the one running your code.
- FUNCTION_INVOCATION_TIMEOUT (Next.js): A Vercel serverless function (a Next.js API route, server action, or `getServerSideProps`) didn't return a response within the plan's max execution time — 10s on Hobby, 60s on Pro, 900s on Enterprise. Vercel kills the invocation and returns 504.
Frequently asked questions
- What's the back-off interval in CrashLoopBackOff? The kubelet delays each restart with exponential back-off (10s, 20s, 40s, ...) capped at 5 minutes, and resets the delay once the container runs 10 minutes without crashing.
- How do I see logs from a container that's restarting? `kubectl logs <pod> --previous` prints the output of the last terminated instance; see step 1 above.
- My pod CrashLoopBackOffs but `kubectl logs --previous` is empty. The container likely died before writing anything to stdout/stderr (bad entrypoint, missing binary), or it logs somewhere other than stdout; check the events and exit code in `kubectl describe pod` instead.
- What does Exit Code 137 mean? 137 is 128 + 9, i.e. SIGKILL: usually the kernel OOM-killing the container at its memory limit, or the kubelet force-killing it after SIGTERM was ignored.
- Should I disable the liveness probe to stop CrashLoopBackOff? No. If the probe is too deep or too strict, make it shallow (step 4); if the app genuinely isn't healthy, fix the app. Removing the probe only hides deadlocks.
- How is CrashLoopBackOff different from ImagePullBackOff? ImagePullBackOff means the image never pulled, so the container never started; CrashLoopBackOff means the container started and keeps exiting.
- Will adding restartPolicy: Never fix CrashLoopBackOff? No. Deployments only allow restartPolicy: Always, and even where Never is legal (bare Pods, Jobs) it just leaves the pod in Failed instead of fixing the crash.
- My pod CrashLoops on cold cluster start but works after a few minutes. Why? A boot-time dependency (database, cache, etcd) isn't ready yet; add retry-with-backoff in the app or an init container that waits (see "Why this happens").
When to escalate to Kubernetes support
Escalate to platform/SRE only after `kubectl logs --previous` and `kubectl describe pod` confirm the issue is not in your application or config — for example: (a) every container OOMKilled at exactly the same memory regardless of workload (cgroup misconfiguration), (b) liveness probes timing out because the kubelet itself is overloaded (node-level issue), or (c) the same deployment runs fine on some clusters and CrashLoopBackOffs on others (cluster-version or CRI mismatch). For 99% of CrashLoopBackOff cases, the fix lives in your application code or your YAML.
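Before handing off, attach node-level evidence; these read-only checks are safe to run:
# Events scoped to the pod (probe kills, OOM, scheduling issues):
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod>
# Node health: memory/disk pressure, kubelet status:
kubectl describe node <node> | grep -A8 'Conditions:'
# Kubelet and container-runtime versions, for cross-cluster comparisons:
kubectl get nodes -o wide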