
OpenAI Error: rate_limit_exceeded — Too Many Requests

openai_call.py python
import openai

try:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
except openai.RateLimitError as e:
    # e.status_code == 429
    # e.code == 'rate_limit_exceeded'
    # e.message includes 'Rate limit reached for ... in organization ...'
    # wait at least this long before retrying (see fix 1 below)
    retry_after = int(e.response.headers.get('Retry-After', '60'))

OpenAI's rate_limit_exceeded is an HTTP 429 response that carries the structured error code `rate_limit_exceeded` in the body. It's one of the most common production errors developers hit when integrating the OpenAI API, especially after a feature launch or traffic spike. The fix is almost never to retry harder; it's to be smarter about when and how you call the API.

Why this happens

  • Burst of concurrent requests. Most rate-limit hits come from sudden bursts — a deployment kicking off worker pods, a batch job retrying simultaneously, or a user-facing feature getting linked on Hacker News. The 60s rolling window means even a 1-second spike can trigger 429s for the rest of the minute.
  • Tokens-per-minute (TPM) ceiling, not request count. OpenAI enforces both RPM (requests/min) and TPM (tokens/min). On `gpt-4o`, tier 1 is 500 RPM but 30,000 TPM. A few long prompts can blow TPM while RPM is still healthy. The error message tells you which limit you hit, and the `x-ratelimit-*` response headers do too (see the header-check sketch after this list).
  • Streaming requests counting toward concurrent limits. Streaming responses hold a connection open for the full generation. If you spawn many streaming requests in parallel, you can hit per-org concurrent limits before RPM/TPM, with the same `rate_limit_exceeded` code.
  • Wrong tier for the model. Limits scale with usage tier (free, tier 1, tier 2, …, tier 5). New accounts on free tier have very low caps for `gpt-4o`. You can only move up tiers by paying and waiting; you can't request an exception.
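
Before a 429 ever fires, you can check how close you are to either ceiling by reading the `x-ratelimit-*` headers that come back on every successful response. A minimal sketch, assuming the official `openai` Python SDK v1+ (the helper name `remaining_headroom` is ours):

headers.py python
from openai import OpenAI

client = OpenAI()

def remaining_headroom():
    # with_raw_response exposes the HTTP headers; raw.parse() would return the usual ChatCompletion
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "ping"}],
    )
    h = raw.headers
    return {
        "requests_remaining": h.get("x-ratelimit-remaining-requests"),
        "requests_limit": h.get("x-ratelimit-limit-requests"),
        "tokens_remaining": h.get("x-ratelimit-remaining-tokens"),
        "tokens_limit": h.get("x-ratelimit-limit-tokens"),
    }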

How to fix it

Fixes are ordered by likelihood. Start with the first one that matches your context.

1. Honour the `Retry-After` header with exponential backoff + jitter

OpenAI returns a `Retry-After` header (seconds) on 429s. Wait at least that long, then retry with exponential backoff and jitter to avoid thundering herds when many clients retry simultaneously.

retry.py python
import time, random
from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_retry(messages, model="gpt-4o", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model=model, messages=messages
            )
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            # honour the server-advised wait if present
            retry_after = int(e.response.headers.get('Retry-After', '0'))
            # exponential backoff floor: 1s, 2s, 4s, 8s, ...
            backoff = max(retry_after, 2 ** attempt)
            # jitter spreads simultaneous client retries apart
            jitter = random.uniform(0, backoff * 0.25)
            time.sleep(backoff + jitter)

2. Add a client-side token bucket to smooth bursts

Pre-emptively rate-limit yourself at slightly under the OpenAI quota using a token bucket. This converts spikes into queues instead of 429s — much better UX than retrying after a failure.

bucket.py python
from limits import storage, strategies, parse
# in-process store; swap in storage.RedisStorage for multi-worker deployments
store = storage.MemoryStorage()
limiter = strategies.MovingWindowRateLimiter(store)
rule = parse("450/minute")  # 90% of tier-1 RPM cap

def allow(key="openai"):
    # False means the local window is exhausted: queue or wait instead of calling OpenAI
    return limiter.hit(rule, key)
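
A sketch of how the bucket slots in front of the retry helper from fix 1 (`guarded_call` and the one-second poll interval are our own choices, not part of any library):

guarded_call.py python
import time

def guarded_call(messages):
    # wait for local capacity instead of spending a request on a near-certain 429
    while not allow():
        time.sleep(1)
    return call_with_retry(messages)  # retry helper from fix 1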

3. Switch large batches to the Batch API

For non-real-time work such as backfills, evaluations, and bulk classification, the [Batch API](https://platform.openai.com/docs/guides/batch) gives you a 50% discount and a separate, much higher rate-limit pool. You trade up-to-24-hour latency for throughput.
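
A minimal submission sketch, assuming you have already written a `requests.jsonl` file in the Batch API's input format (one JSON request object per line):

batch_submit.py python
from openai import OpenAI

client = OpenAI()

# upload the pre-built JSONL of /v1/chat/completions requests
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# submit the batch; results become available within the completion window
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)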

4. Count tokens before sending

Use `tiktoken` to count input + max_output tokens before sending. If you'd push over the per-request or per-minute cap, queue or downgrade to a smaller model. This is faster than catching a 429 and retrying.

count.py python
import tiktoken

def count_tokens(messages, model="gpt-4o"):
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # older tiktoken releases may not know newer model names yet
        enc = tiktoken.get_encoding("o200k_base")
    # the + 4 per message is a rough allowance for per-message framing overhead
    return sum(len(enc.encode(m["content"])) for m in messages) + 4 * len(messages)
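
A sketch of the pre-send gate described above. The 30,000 TPM budget is the tier-1 figure quoted earlier; the fallback model and the 1,000-token output allowance are assumptions to tune for your own setup:

precheck.py python
TPM_BUDGET = 30_000  # tier-1 gpt-4o TPM; set to your tier's value

def choose_model(messages, max_output_tokens=1_000):
    # simplistic check: would this single request alone blow the per-minute budget?
    projected = count_tokens(messages) + max_output_tokens
    if projected > TPM_BUDGET:
        return "gpt-4o-mini"  # downgrade (or queue / reject) instead of risking a 429
    return "gpt-4o"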

5. Move up a usage tier

Most OpenAI rate-limit pain disappears at tier 2 and above ($50+ paid and 7+ days since your first successful payment). For sustained production load, your app should already be on tier 3+. Tier upgrades are automatic — there's no form to fill out.

Detection and monitoring in production

Track 429s as a metric, not just a log. Alarm if rate-limit errors exceed 1% of total requests over a 5-minute window. Add a tag for which model and which limit (RPM vs TPM) — they need different fixes. Send a Slack alert on tier-saturation events so you know to upgrade before users feel it.
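
A minimal sketch using `prometheus_client`; the metric name and the message-based RPM-vs-TPM detection are our own conventions, not anything OpenAI provides:

metrics.py python
from prometheus_client import Counter

OPENAI_429 = Counter(
    "openai_rate_limited_total",
    "OpenAI 429 responses",
    ["model", "limit"],  # limit is 'requests' or 'tokens'
)

def record_429(e, model):
    # the error message names the limit that was hit; default to 'requests'
    limit = "tokens" if "tokens" in (getattr(e, "message", "") or "") else "requests"
    OPENAI_429.labels(model=model, limit=limit).inc()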

Frequently asked questions

What's the difference between `rate_limit_exceeded` and `insufficient_quota`?
`rate_limit_exceeded` (429) means you're sending too fast — wait and retry. `insufficient_quota` (a 429 with a different error code) means you've run out of paid credit or hit your hard usage cap — retrying won't help; you need to add billing or raise your monthly limit in the dashboard.
How do I see my current OpenAI rate limit?
Every successful response includes `x-ratelimit-limit-requests`, `x-ratelimit-remaining-requests`, `x-ratelimit-reset-requests`, and the same trio for tokens. Log these in production to see how close you're running to the ceiling. The dashboard at platform.openai.com/account/limits shows the per-model limits for your tier.
Does retry-after of 0 mean I can retry immediately?
No. A 0 or missing `Retry-After` should be treated as "retry with exponential backoff starting at 1s." Some 429s have no `Retry-After` because the limit is a rolling window — there's no fixed reset time, just "soon."
Why do I get 429 even though my RPM is well under the limit?
You're probably hitting the TPM (tokens per minute) limit, not RPM. On gpt-4o tier 1, RPM is 500 but TPM is only 30k — a single 4k-token prompt every 8 seconds saturates TPM while using fewer than 8 of your 500 requests per minute. Check `x-ratelimit-remaining-tokens` to confirm.
Will OpenAI grant a rate-limit exception for production traffic?
Generally no. Limits are tied to usage tier; there's no manual exception. The fastest path is to spend more (which auto-upgrades your tier after a 7-day cooldown) or to spread load via the Batch API for non-realtime workloads.
Should I retry a streaming request that hit 429 mid-stream?
A 429 mid-stream is rare but possible during multi-region rate-limit propagation. Treat it like any other 429 — wait `Retry-After`, then retry from scratch (you can't resume mid-completion).
Do failed requests count toward my rate limit?
Requests that hit 429 do *not* count against your token usage. But requests that error after partial streaming (e.g., disconnect mid-stream) do consume the tokens generated up to that point. Always tear down streams cleanly on errors to avoid token waste.
Can I share a rate-limit pool across multiple API keys?
Yes — limits are at the *organization* level, not the API key level. All keys in your org share one pool. To isolate teams, create separate organizations under your OpenAI account.

When to escalate to OpenAI support

Open a support ticket only if (a) the error persists for hours with no traffic on your side, (b) `x-ratelimit-remaining-requests` shows headroom but you still get 429, or (c) you suspect a billing/tier sync issue (e.g., you paid but tier didn't update after 7 days). For routine "I want a higher limit," there's nothing support can do — only paying more works.

Read more: /guide/handling-rate-limits/