Anthropic Error: overloaded_error — API Overloaded
```python
import anthropic

client = anthropic.Anthropic()

try:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello"}],
    )
except anthropic.APIStatusError as e:
    # e.status_code == 529
    # e.body['error']['type'] == 'overloaded_error'
    # e.body['error']['message'] == 'Overloaded'
    if e.status_code == 529:
        # transient: retry with backoff
        ...
```
overloaded_error is Anthropic’s signal that the Claude inference fleet is at capacity for the model you asked for. It returns HTTP 529 — a non-standard status code Anthropic shares with a handful of other services to mean specifically “we’re hot, slow down or come back.” It’s not a rate limit (which is 429) and it’s not a generic server error (which is 500); it’s a capacity signal you should react to with retry and fallback, not reconfiguration.
The right response is mechanical: bounded exponential backoff with jitter, then a fallback to a smaller-fleet model (Haiku) or an alternate provider (Bedrock, Vertex), then a circuit-break that gives up cleanly. Combined with a status-page subscription and per-model 529 metrics, you can keep serving traffic through capacity events that take competitors offline.
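For the circuit-break step, a minimal sketch; the class, names, and thresholds below are an illustration of the pattern, not part of the Anthropic SDK:

```python
# Sketch of a circuit breaker for 529s: opens after `threshold` consecutive
# overloads, then allows a probe request once `cooldown` seconds have passed.
# Class and thresholds are illustrative, not an SDK feature.
import time

class OverloadBreaker:
    def __init__(self, threshold=5, cooldown=60):
        self.threshold = threshold
        self.cooldown = cooldown
        self.consecutive_529s = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown:
            # Half-open: permit a probe; record_success() fully closes the breaker.
            return True
        return False

    def record_success(self):
        self.consecutive_529s = 0
        self.opened_at = None

    def record_overload(self):
        self.consecutive_529s += 1
        if self.consecutive_529s >= self.threshold:
            self.opened_at = time.time()
```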
Why this happens
- Cluster-wide capacity event at Anthropic. The most common cause: Anthropic's GPU/inference fleet for the requested model is saturated. Typical triggers include traffic spikes from popular products built on Claude (e.g. Cursor) or from claude.ai itself, a model release attracting load, or a regional infra incident. status.anthropic.com tracks these as 'partial outage' or 'degraded performance' events.
- Specific model under heavier load than others. Newer flagship models (Opus, the latest Sonnet) get hit hardest after release. Older or smaller models in the same family may have more headroom. Routing some traffic to Haiku or an older Sonnet snapshot during overloads keeps you serving.
- Regional or availability-zone issue. Anthropic runs across multiple regions; a localised hardware fault can drop capacity in one zone while others are healthy. AWS Bedrock and GCP Vertex routes to Claude have their own zone exposure separate from the direct Anthropic API.
- Burst from your own traffic colliding with global load. If you fan out 1,000 concurrent calls just as another large customer also bursts, Anthropic's per-region scheduler can shed load with `overloaded_error` rather than queue indefinitely. Smoothing your own bursts (token bucket, queue) reduces collision probability.
- Free-tier or low-tier deprioritisation during contention. Anthropic prioritises higher-tier paid traffic during capacity events. Free-trial keys and tier-1 paid orgs see `overloaded_error` more often than tier-3+ enterprise traffic when global demand peaks. Upgrading via spend reduces — but doesn't eliminate — the rate.
How to fix it
Fixes are ordered by likelihood. Start with the first one that matches your context.
1. Retry with exponential backoff and jitter
`overloaded_error` is transient; most events resolve within 1-30 seconds. Retry up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s) plus random jitter. The Anthropic SDK already retries some failures automatically (two retries by default); supplement it with your own bounded loop for production resilience.
```python
import time, random

import anthropic

client = anthropic.Anthropic()

def call_with_retry(messages, model="claude-opus-4-5", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.messages.create(
                model=model, max_tokens=1024, messages=messages,
            )
        except anthropic.APIStatusError as e:
            # Re-raise anything that isn't a capacity error, and give up
            # once the retry budget is exhausted.
            if e.status_code != 529 or attempt == max_attempts - 1:
                raise
            backoff = (2 ** attempt) + random.uniform(0, 1)  # exponential + jitter
            time.sleep(min(backoff, 30))
```
2. Fall back to a different model on overload
Maintain a fallback chain: `claude-opus-4-5` → `claude-sonnet-4-5` → `claude-haiku-4-5`. Smaller models have larger fleets and lower contention. For latency-sensitive paths, fail over within 2-3 seconds rather than burning the full backoff budget on the primary model.
```python
FALLBACK = ["claude-opus-4-5", "claude-sonnet-4-5", "claude-haiku-4-5"]

def call_with_fallback(messages, max_tokens=1024):
    last_err = None
    for model in FALLBACK:
        try:
            return client.messages.create(
                model=model, max_tokens=max_tokens, messages=messages,
            )
        except anthropic.APIStatusError as e:
            if e.status_code != 529:
                raise  # non-capacity errors are not helped by switching models
            last_err = e
    raise last_err  # every model in the chain was overloaded
```
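For example, a caller might use the helper like this (the prompt is a stand-in):

```python
# Illustrative usage of call_with_fallback defined above.
msgs = [{"role": "user", "content": "Summarise this support ticket."}]
resp = call_with_fallback(msgs)
print(resp.content[0].text)  # first content block is a text block for plain requests
```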
3. Multi-provider failover (AWS Bedrock or GCP Vertex)
The same Claude models are available via AWS Bedrock and GCP Vertex AI. Their capacity pools are separate from Anthropic's direct API; when one is overloaded, the others often aren't. Set up the alternative SDKs and switch providers at the load-balancer or app layer when 529s spike.
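A minimal sketch of provider failover using the SDK's optional Bedrock client (installed via the `anthropic[bedrock]` extra). The Bedrock model ID and region below are placeholders; substitute the values your account actually exposes:

```python
# Sketch: fail over from the direct Anthropic API to Bedrock on 529s.
# Assumes AWS credentials are configured for the Bedrock client.
# The Bedrock model ID and region are placeholders, not verified values.
import anthropic

PROVIDERS = [
    (anthropic.Anthropic(), "claude-opus-4-5"),
    (anthropic.AnthropicBedrock(aws_region="us-east-1"), "YOUR_BEDROCK_CLAUDE_MODEL_ID"),
]

def call_with_provider_failover(messages, max_tokens=1024):
    last_err = None
    for client, model in PROVIDERS:
        try:
            return client.messages.create(
                model=model, max_tokens=max_tokens, messages=messages,
            )
        except anthropic.APIStatusError as e:
            if e.status_code != 529:
                raise
            last_err = e
    raise last_err
```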
4. Use the Message Batches API for non-realtime work
The Batches API processes up to 100k requests asynchronously and is much less prone to `overloaded_error` because it runs on a separate, queued capacity pool. If your workload tolerates 24h latency, batches are 50% cheaper *and* more reliable.
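A hedged sketch of submitting work as a batch instead of realtime calls (the prompts and custom IDs are placeholders):

```python
# Sketch: queue non-realtime work through the Message Batches API.
# Each request carries a custom_id plus normal Messages API params.
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"ticket-{i}",  # your own correlation ID
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(["Summarise ticket 1", "Summarise ticket 2"])
    ]
)
print(batch.id, batch.processing_status)  # poll later; fetch results once it has ended
```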
5. Smooth your own bursts with a token bucket
Smoothing bursts on your own side reduces the probability of colliding with global contention. A simple in-memory token-bucket limiter that holds you at roughly 80% of your tier's RPM turns spikes into a queue rather than a wave of 529s.
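A minimal sketch of such a limiter; the 50 requests/minute figure is a placeholder for ~80% of whatever your tier's actual RPM cap is:

```python
# Sketch: blocking in-memory token bucket to smooth request bursts.
import threading
import time

class TokenBucket:
    def __init__(self, rate_per_minute=50, burst=10):
        self.rate = rate_per_minute / 60.0   # tokens refilled per second
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)

bucket = TokenBucket()
# bucket.acquire()  # call before each client.messages.create(...)
```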
Detection and monitoring in production
Track `overloaded_error` rate per model as a separate metric from rate-limits. Plot it against status.anthropic.com incidents to correlate. Alert if 529s exceed 5% of requests for a single model over a 5-minute window — that's a signal to flip your traffic over to the fallback model or a different provider. A sustained 0% rate at scale is unrealistic; aim for <1%.
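One way to implement that alert is an in-process sliding window per model (a sketch; metric export and alert routing are left to your observability stack):

```python
# Sketch: per-model sliding-window 529-rate tracker for the 5%-over-5-minutes alert.
import collections
import time

WINDOW_SECONDS = 300
ALERT_THRESHOLD = 0.05

class OverloadMonitor:
    def __init__(self):
        # per-model deque of (timestamp, was_529) tuples
        self.events = collections.defaultdict(collections.deque)

    def record(self, model, status_code):
        now = time.time()
        dq = self.events[model]
        dq.append((now, status_code == 529))
        while dq and now - dq[0][0] > WINDOW_SECONDS:
            dq.popleft()

    def should_fail_over(self, model):
        dq = self.events[model]
        if not dq:
            return False
        overloaded = sum(1 for _, is_529 in dq if is_529)
        return overloaded / len(dq) > ALERT_THRESHOLD
```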
Related errors
- Anthropic `rate_limit_error`: You exceeded one of Anthropic's per-minute caps for the model and tier (RPM, requests/min; ITPM, input tokens/min; or OTPM, output tokens/min). Anthropic enforces all three independently and you can hit any one without breaching the others.
- Anthropic `authentication_error`: The `x-api-key` header you sent doesn't match an active Anthropic API key, usually because the env var isn't loaded, the key was rotated or revoked, you're using a Workspace key in the wrong workspace, or a wrong-provider key (Bedrock or Vertex) was sent to the direct Anthropic API.
- OpenAI `rate_limit_exceeded`: Your account has exceeded its per-minute request (RPM) or per-minute token (TPM) limit for the model you're calling. Limits are tier-based and per-model.
- OpenAI `context_length_exceeded`: The total tokens (prompt + max_tokens for completion) exceeds the model's context window. For example, sending 130,000 input tokens to `gpt-4o` (128k window) or asking for 5,000 completion tokens when the prompt is already 125k.
- OpenAI `insufficient_quota`: Your OpenAI organisation has run out of paid credit, hit its monthly hard limit, or hasn't added a payment method yet. Despite the 429 status, this is a billing problem, not a rate-limit problem, and retrying won't help.
Frequently asked questions
- What does HTTP 529 mean? It's not in the standard HTTP spec.
- Is `overloaded_error` ever caused by something I did wrong?
- How long do `overloaded_error` events typically last?
- Should I retry `overloaded_error` indefinitely?
- Do `overloaded_error` failures count toward my usage or rate limits?
- Are higher-tier paid customers prioritised during overloads?
- Can I tell the difference between `overloaded_error` and `api_error`?
- Does using streaming reduce overload exposure?
When to escalate to Anthropic support
Escalate to Anthropic support only if (a) you're seeing sustained `overloaded_error` rates above 10% for hours with no incident on the status page, suggesting an account-specific routing issue, (b) you're on a Priority Tier (Bedrock/Vertex commitment) and not seeing the contracted availability, or (c) overload is affecting the Batches API or the file storage endpoints, which should be insulated from realtime contention. For routine 529s during traffic spikes, retry + fallback is the answer.