OpenAI Error: rate_limit_exceeded — Too Many Requests
```python
from openai import OpenAI, RateLimitError

client = OpenAI()

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
except RateLimitError as e:
    # e.status_code == 429
    # e.code == 'rate_limit_exceeded'
    # e.message includes 'Rate limit reached for ... in organization ...'
    retry_after = int(e.response.headers.get("Retry-After", "60"))
```
OpenAI’s rate_limit_exceeded is HTTP 429 with the structured error code rate_limit_exceeded. It’s the single most common production error developers hit when integrating OpenAI APIs, especially after a feature launch or traffic spike. The fix is almost never to retry harder — it’s to be smarter about when and how you call the API.
Why this happens
- Burst of concurrent requests. Most rate-limit hits come from sudden bursts — a deployment kicking off worker pods, a batch job retrying simultaneously, or a user-facing feature getting linked on Hacker News. The 60s rolling window means even a 1-second spike can trigger 429s for the rest of the minute.
- Token-per-minute (TPM) ceiling, not request count. OpenAI enforces both RPM (requests/min) and TPM (tokens/min). On `gpt-4o`, tier 1 is 500 RPM but 30,000 TPM. A few long prompts can blow TPM while RPM is still healthy. The error message tells you which limit you hit.
- Streaming requests counting toward concurrent limits. Streaming responses hold a connection open for the full generation. If you spawn many streaming requests in parallel, you can hit per-org concurrent limits before RPM/TPM, with the same `rate_limit_exceeded` code.
- Wrong tier for the model. Limits scale with usage tier (free, tier 1, tier 2, …, tier 5). New accounts on free tier have very low caps for `gpt-4o`. You can only move up tiers by paying and waiting; you can't request an exception.
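Every API response also carries `x-ratelimit-*` headers you can inspect before you ever hit a 429. With the Python SDK you can read them via `client.chat.completions.with_raw_response.create(...)`, which exposes `.headers`. A minimal parser for the documented header names (the plain dict-of-strings input shape here is an assumption; adapt it to your HTTP client):

```python
def parse_rate_limit_headers(headers):
    """Extract OpenAI's x-ratelimit-* headers from a response-headers mapping."""
    def _int(name):
        value = headers.get(name)
        return int(value) if value is not None else None

    return {
        "remaining_requests": _int("x-ratelimit-remaining-requests"),
        "remaining_tokens": _int("x-ratelimit-remaining-tokens"),
        "limit_requests": _int("x-ratelimit-limit-requests"),
        "limit_tokens": _int("x-ratelimit-limit-tokens"),
    }
```

Logging `remaining_tokens` alongside `remaining_requests` tells you immediately whether you are burning down RPM or TPM.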
How to fix it
Fixes are ordered by likelihood. Start with the first one that matches your context.
1. Honour the `Retry-After` header with exponential backoff + jitter
OpenAI returns a `Retry-After` header (seconds) on 429s. Wait at least that long, then retry with exponential backoff and jitter to avoid thundering herds when many clients retry simultaneously.
```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_retry(messages, model="gpt-4o", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model=model, messages=messages
            )
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            # Wait at least Retry-After, growing exponentially, plus jitter
            retry_after = int(e.response.headers.get("Retry-After", 0))
            backoff = max(retry_after, 2 ** attempt)
            jitter = random.uniform(0, backoff * 0.25)
            time.sleep(backoff + jitter)
```
2. Add a client-side rate limiter to smooth bursts
Pre-emptively rate-limit yourself at slightly under your OpenAI quota. This converts spikes into queues instead of 429s, which is much better UX than retrying after a failure. The example below uses the `limits` library's moving-window strategy; a token bucket works just as well.

```python
from limits import parse, storage, strategies

store = storage.MemoryStorage()
limiter = strategies.MovingWindowRateLimiter(store)
rule = parse("450/minute")  # 90% of the tier-1 RPM cap

def allow(key="openai"):
    # True (and the hit is recorded) if we're still under the limit
    return limiter.hit(rule, key)
```
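If you'd rather not add a dependency, a token bucket is a few lines of standard-library Python. This is a sketch, not a drop-in replacement: the `rate` and `capacity` values are illustrative and should be tuned to your tier's caps.

```python
import threading
import time

class TokenBucket:
    """Allows roughly `rate` requests/second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```

Callers that get `False` should queue or sleep briefly rather than call the API and eat a 429.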
3. Switch large batches to the Batch API
For work that doesn't need a real-time response (often the bulk of LLM workloads), the [Batch API](https://platform.openai.com/docs/guides/batch) gives you a 50% discount and a separate, much higher rate limit. You trade up-to-24-hour latency for throughput.
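A batch job starts from a `.jsonl` file with one request per line. A sketch of building that file (the `custom_id` scheme and file path are arbitrary choices); once written, upload it with `client.files.create(file=open(path, "rb"), purpose="batch")` and submit it with `client.batches.create(input_file_id=..., endpoint="/v1/chat/completions", completion_window="24h")`.

```python
import json

def build_batch_file(prompts, model="gpt-4o", path="batch_input.jsonl"):
    # One JSON object per line, in the Batch API input format
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"request-{i}",  # echoed back in the results file
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path
```

The `custom_id` is how you match results back to inputs, since the output file is not guaranteed to preserve order.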
4. Count tokens before sending
Use `tiktoken` to count input + max_output tokens before sending. If you'd push over the per-request or per-minute cap, queue or downgrade to a smaller model. This is faster than catching a 429 and retrying.
```python
import tiktoken

def count_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    # ~4 extra tokens of chat-format overhead per message
    return sum(len(enc.encode(m["content"])) for m in messages) + 4 * len(messages)
```
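Once you can count tokens (e.g. by passing `count_tokens(messages)` in as `prompt_tokens`), the gate itself is simple arithmetic. A sketch, assuming the tier-1 `gpt-4o` TPM figure from above; `headroom` is an illustrative safety margin for concurrent callers:

```python
def within_token_budget(prompt_tokens, max_output_tokens,
                        tpm_budget=30_000, headroom=0.9):
    # True if this single request fits inside a safety fraction of the
    # per-minute token budget; if False, queue it or downgrade the model
    return prompt_tokens + max_output_tokens <= tpm_budget * headroom
```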
5. Move up a usage tier
Most OpenAI rate-limit pain disappears at tier 2 and above ($50+ paid and 7+ days since first payment). For sustained production load, your app should already be on tier 3 or higher. Tier upgrades are automatic; there's no form to fill out and no exception to request.
Detection and monitoring in production
Track 429s as a metric, not just a log. Alarm if rate-limit errors exceed 1% of total requests over a 5-minute window. Add a tag for which model and which limit (RPM vs TPM) — they need different fixes. Send a Slack alert on tier-saturation events so you know to upgrade before users feel it.
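A minimal in-process sketch of that 1%-over-5-minutes rule (the class name and API are illustrative; in production you would feed your metrics stack, e.g. Prometheus or Datadog, instead):

```python
import collections
import time

class RateLimitMonitor:
    """Tracks 429s vs total requests over a rolling window."""

    def __init__(self, window_seconds=300, alert_ratio=0.01):
        self.window = window_seconds
        self.alert_ratio = alert_ratio
        self.events = collections.deque()  # (timestamp, model, limit_type)

    def record(self, model, limit_type=None):
        # limit_type is None for a success, "rpm" or "tpm" for a 429
        self.events.append((time.monotonic(), model, limit_type))

    def should_alert(self):
        # Drop events outside the rolling window, then compute the error rate
        cutoff = time.monotonic() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        total = len(self.events)
        errors = sum(1 for _, _, lt in self.events if lt is not None)
        return total > 0 and errors / total > self.alert_ratio
```

Keeping the model and limit type on each event lets you break alerts down by which cap you are saturating.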
Related errors
- OpenAI `insufficient_quota`: Your OpenAI organisation has run out of paid credit, hit its monthly hard limit, or hasn't added a payment method yet. Despite the 429 status, this is a billing problem, not a rate-limit problem, and retrying won't help.
- OpenAI `context_length_exceeded`: The total tokens (prompt + `max_tokens` for the completion) exceed the model's context window. For example, sending 130,000 input tokens to `gpt-4o` (128k window), or asking for 5,000 completion tokens when the prompt is already 125k.
- Anthropic `rate_limit_error`: You exceeded one of Anthropic's per-minute caps for the model and tier: RPM (requests/min), ITPM (input tokens/min), or OTPM (output tokens/min). Anthropic enforces all three independently, and you can hit any one without breaching the others.
Frequently asked questions
- What's the difference between `rate_limit_exceeded` and `insufficient_quota`?
- How do I see my current OpenAI rate limit?
- Does a `Retry-After` of 0 mean I can retry immediately?
- Why do I get 429 even though my RPM is well under the limit?
- Will OpenAI grant a rate-limit exception for production traffic?
- Should I retry a streaming request that hit 429 mid-stream?
- Do failed requests count toward my rate limit?
- Can I share a rate-limit pool across multiple API keys?
When to escalate to OpenAI support
Open a support ticket only if (a) the error persists for hours with no traffic on your side, (b) `x-ratelimit-remaining-requests` shows headroom but you still get 429, or (c) you suspect a billing/tier sync issue (e.g., you paid but tier didn't update after 7 days). For routine "I want a higher limit," there's nothing support can do — only paying more works.
Read more: /guide/handling-rate-limits/