Anthropic Error: rate_limit_error — Too Many Requests
import time

import anthropic

client = anthropic.Anthropic()

try:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello"}],
    )
except anthropic.RateLimitError as e:
    # e.status_code == 429
    # e.body['error']['type'] == 'rate_limit_error'
    # e.response.headers['retry-after'] == '12'  # seconds
    # e.response.headers['anthropic-ratelimit-requests-remaining'] == '0'
    retry_after = int(e.response.headers.get('retry-after', '5'))
    time.sleep(retry_after)  # wait before retrying
rate_limit_error is Anthropic's HTTP 429 with the structured error type rate_limit_error. Unlike OpenAI, Anthropic enforces three independent per-minute dimensions: RPM (requests per minute), ITPM (input tokens per minute), and OTPM (output tokens per minute). You can saturate any one of them without coming close to the others, and the error message and rate-limit headers tell you which dimension you hit.
The right fix depends on that dimension. RPM-bound? Smooth your bursts with a token bucket. ITPM-bound? Add prompt caching on stable prefixes; cache reads are billed at a fraction of the base input rate. OTPM-bound? Cap max_tokens more aggressively and stream summaries instead of full essays. And for any sustained, non-realtime production workload, the Message Batches API is a separate, larger capacity pool that absorbs throughput at half the cost.
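As a first triage step, you can classify a 429 by reading the remaining-quota headers on the error response. A minimal sketch (the header names are Anthropic's documented `anthropic-ratelimit-*` family; the classification helper itself is this guide's own convention, not an SDK function):

```python
def classify_429(headers: dict) -> str:
    """Guess which rate-limit dimension a 429 hit, from its response headers."""
    dimensions = {
        "RPM": "anthropic-ratelimit-requests-remaining",
        "ITPM": "anthropic-ratelimit-input-tokens-remaining",
        "OTPM": "anthropic-ratelimit-output-tokens-remaining",
    }
    # A dimension is exhausted when its remaining quota has reached zero.
    exhausted = [name for name, header in dimensions.items()
                 if int(headers.get(header, "1")) <= 0]
    return "+".join(exhausted) if exhausted else "unknown"

# Example: an ITPM-bound 429 still has requests remaining but zero input tokens.
print(classify_429({
    "anthropic-ratelimit-requests-remaining": "37",
    "anthropic-ratelimit-input-tokens-remaining": "0",
    "anthropic-ratelimit-output-tokens-remaining": "5000",
}))  # → ITPM
```

Route the result into your metrics so each 429 is tagged with the dimension it hit.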
Why this happens
- RPM (requests per minute) ceiling hit. Each tier has a per-minute request cap; Tier 1 paid is typically around 50 RPM for Opus, with higher caps for smaller models like Haiku. Bursts of concurrent calls (a worker pool spinning up, a batch job kicking off) can saturate RPM for the rest of the rolling minute.
- ITPM (input tokens per minute) ceiling hit. Anthropic meters input tokens per minute as its own dimension. Long contexts (100k+ token Claude calls) exhaust ITPM far faster than RPM. Tier 1 Sonnet ITPM is typically around 40,000, which is only four calls of 10k input tokens each per minute.
- OTPM (output tokens per minute) ceiling hit. Output tokens are metered separately from input. Long generations (essays, code synthesis, structured outputs) can exhaust OTPM while ITPM is healthy. The error message tells you which limit you hit.
- Concurrent stream limit on real-time workloads. Streaming responses hold a connection open for the full generation, so many concurrent streams can hit Anthropic's per-org concurrent-stream cap before any of the per-minute caps. The error is the same `rate_limit_error`.
- Cache write/read rate against prompt cache limits. Prompt caching has its own per-minute write caps, and heavy cache-write workloads (lots of unique 100k-token contexts being primed) can trigger 429s with a cache-specific message. Inspect the error body: Anthropic distinguishes cache writes from regular calls.
How to fix it
Fixes are ordered by likelihood. Start with the first one that matches your context.
1. Honour `retry-after` and the `anthropic-ratelimit-*` headers
Anthropic returns these headers on every response (success and 429): `anthropic-ratelimit-requests-remaining`, `anthropic-ratelimit-input-tokens-remaining`, `anthropic-ratelimit-output-tokens-remaining`, plus matching `-reset` timestamps and an explicit `retry-after` on 429s. Use them — don't guess.
import time, random

import anthropic

client = anthropic.Anthropic()

def call_with_retry(messages, model="claude-sonnet-4-5", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.messages.create(
                model=model, max_tokens=1024, messages=messages,
            )
        except anthropic.RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            # Prefer the server's retry-after; fall back to exponential backoff.
            retry_after = int(e.response.headers.get('retry-after', 0))
            backoff = max(retry_after, 2 ** attempt)
            # Add jitter so concurrent workers don't retry in lockstep.
            time.sleep(backoff + random.uniform(0, backoff * 0.25))
2. Add a client-side rate limiter at 80% of your cap
Pre-empt 429s by self-throttling. Use a token bucket sized at ~80% of your tier's RPM/ITPM/OTPM caps. Convert bursts into queues, which gives much better latency p99 than retrying after a 429.
from time import monotonic
from threading import Lock

class TokenBucket:
    def __init__(self, rate_per_min, capacity):
        self.rate = rate_per_min / 60.0
        self.capacity = capacity
        self.tokens = capacity
        self.last = monotonic()
        self.lock = Lock()

    def take(self, n=1):
        with self.lock:
            now = monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False

# Tier-1 Sonnet: 50 RPM → throttle at 40 RPM
bucket = TokenBucket(rate_per_min=40, capacity=10)
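The bucket above only reports whether capacity is available; to gate actual API calls, wrap it in a small blocking helper. A sketch (`DenyTwiceBucket` is a hypothetical stand-in with the same `take()` interface so the snippet runs standalone; in real code you'd pass the TokenBucket instance above):

```python
import time

def acquire(bucket, n=1, poll_interval=0.05):
    """Block until the bucket grants n tokens, then return."""
    while not bucket.take(n):
        time.sleep(poll_interval)

# Hypothetical stand-in: denies the first two take() calls, then grants.
class DenyTwiceBucket:
    def __init__(self):
        self.calls = 0
    def take(self, n=1):
        self.calls += 1
        return self.calls > 2

bucket = DenyTwiceBucket()
acquire(bucket)        # blocks through two denials, returns on the third try
print(bucket.calls)    # → 3
```

In a worker loop, call `acquire(bucket)` immediately before each `client.messages.create(...)` so bursts queue up instead of triggering 429s.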
3. Move large workloads to the Message Batches API
Batches process up to 100k requests asynchronously over 24 hours, are 50% cheaper, and have a separate, much higher rate limit. For embeddings-like workloads, evals, bulk transcription cleanup, or any non-realtime task, batches are the right shape, not the synchronous Messages API.
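A hedged sketch of a batch submission. The `custom_id`/`params` request shape follows the Python SDK's batches interface; `build_batch_requests` is this guide's own helper, and the actual submission call is commented out so the snippet runs without credentials:

```python
def build_batch_requests(prompts, model="claude-sonnet-4-5", max_tokens=1024):
    """Package prompts into the request shape the Batches API expects."""
    return [
        {
            "custom_id": f"req-{i}",  # your key for matching results later
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]

requests = build_batch_requests(["Summarise doc A", "Summarise doc B"])
print(len(requests), requests[0]["custom_id"])  # → 2 req-0

# With an API key configured, submit and poll:
# import anthropic
# client = anthropic.Anthropic()
# batch = client.messages.batches.create(requests=requests)
# print(batch.id, batch.processing_status)
```

Results arrive keyed by `custom_id`, so make it something you can join back to your own records.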
4. Use prompt caching to cut token cost and ITPM pressure
Prompt caching reduces ITPM consumption by 90%+ on repeated long prefixes. Set `cache_control: {"type": "ephemeral"}` on stable system prompts, RAG corpora, or tool definitions. The cached prefix doesn't count fully against ITPM on subsequent reads; cache reads are billed at a much lower rate. For long-context bots, this is the single biggest ITPM relief.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "..."}],
)
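To confirm caching is actually relieving ITPM pressure, check the usage block on each response: `cache_read_input_tokens` counts tokens served from cache. A sketch over a plain dict (the field names follow the API's usage object; `estimate_uncached_inputs` is this guide's own helper name):

```python
def estimate_uncached_inputs(usage: dict) -> int:
    """Input tokens that counted at the full rate on this call."""
    return usage.get("input_tokens", 0) + usage.get("cache_creation_input_tokens", 0)

# Example usage payload from a cache-hit response:
usage = {
    "input_tokens": 120,               # uncached portion (the user turn)
    "cache_creation_input_tokens": 0,  # nothing new written to cache
    "cache_read_input_tokens": 48000,  # long prefix served from cache
}
print(estimate_uncached_inputs(usage))  # → 120
```

If `cache_read_input_tokens` stays at zero in production, your prefix is probably changing between calls (timestamps, session IDs) and the cache never hits.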
5. Request a tier upgrade or move to Priority Tier
Tier upgrades happen automatically as you spend more. For predictable production load, Priority Tier (via Bedrock or Vertex with a commitment) gives reserved capacity that shields you from contention 429s. Anthropic doesn't grant manual rate-limit exceptions on the standard tier ladder.
Detection and monitoring in production
Log all `anthropic-ratelimit-*` headers from every response, not just 429s. Plot remaining requests/input-tokens/output-tokens over time to see how close you're running to the cap. Alert when any dimension drops below 20% of cap for >5 minutes — you're about to hit 429s. Tag 429 errors with which limit they hit (RPM vs ITPM vs OTPM) since each needs a different fix.
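The 20%-headroom alert described above can be sketched as a pure check over the response headers (names from the `anthropic-ratelimit-*` family; the threshold is the one suggested in this section):

```python
def low_headroom(headers: dict, threshold: float = 0.20) -> list:
    """Return the dimensions whose remaining quota is below threshold * limit."""
    alerts = []
    for dim in ("requests", "input-tokens", "output-tokens"):
        limit = int(headers.get(f"anthropic-ratelimit-{dim}-limit", 0))
        remaining = int(headers.get(f"anthropic-ratelimit-{dim}-remaining", 0))
        if limit and remaining < threshold * limit:
            alerts.append(dim)
    return alerts

# ITPM is at 7.5% headroom here; RPM is fine at 60%.
print(low_headroom({
    "anthropic-ratelimit-requests-limit": "50",
    "anthropic-ratelimit-requests-remaining": "30",
    "anthropic-ratelimit-input-tokens-limit": "40000",
    "anthropic-ratelimit-input-tokens-remaining": "3000",
}))  # → ['input-tokens']
```

Run this on every response and fire the alert only when a dimension stays in the returned list for several consecutive minutes, to avoid flapping on momentary bursts.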
Related errors
- anthropic `overloaded_error`: Anthropic's infrastructure is at capacity for the model you requested. This is server-side, not a problem with your code or your account; Claude is experiencing a traffic spike or capacity event and rejecting requests until load eases.
- anthropic `authentication_error`: The `x-api-key` header you sent doesn't match an active Anthropic API key, usually because the env var isn't loaded, the key was rotated or revoked, you're using a Workspace key in the wrong workspace, or a wrong-provider key (Bedrock or Vertex) was sent to the direct Anthropic API.
- openai `rate_limit_exceeded`: Your account has exceeded its per-minute request (RPM) or per-minute token (TPM) limit for the model you're calling. Limits are tier-based and per-model.
- openai `context_length_exceeded`: The total tokens (prompt + max_tokens for completion) exceeds the model's context window. For example, sending 130,000 input tokens to `gpt-4o` (128k window) or asking for 5,000 completion tokens when the prompt is already 125k.
- openai `insufficient_quota`: Your OpenAI organisation has run out of paid credit, hit its monthly hard limit, or hasn't added a payment method yet. Despite the 429 status, this is a billing problem, not a rate-limit problem, and retrying won't help.
Frequently asked questions
What's the difference between `rate_limit_error` and `overloaded_error`? +
Why does Anthropic separate input and output tokens for rate limiting? +
How do I know my current tier and limits? +
Do failed requests count toward my limits? +
Is the `retry-after` header always set on 429? +
Should I share rate-limit pools across multiple API keys? +
Does prompt caching count against ITPM? +
Will Anthropic grant a one-off rate-limit exception? +
When to escalate to Anthropic support
Open a support ticket only if (a) `anthropic-ratelimit-*` headers show plenty of headroom but you still get 429s, indicating a routing or sync bug; (b) you've upgraded paid spend past a tier threshold but limits haven't refreshed after 7+ days; or (c) Priority Tier reservations are not honouring contracted RPM/ITPM. For 'I want a higher limit' on the standard tier, the only lever is increased spend, which advances tiers automatically.
Read more: /guide/handling-rate-limits/