OpenAI Error: context_length_exceeded — Prompt Too Long
```python
import openai

long_text = open("legal-contract.txt").read()  # ~140k tokens

try:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": long_text}],
        max_tokens=2000,
    )
except openai.BadRequestError as e:
    # e.status_code == 400
    # e.code == 'context_length_exceeded'
    # e.message: "This model's maximum context length is 128000 tokens.
    #   However, your messages resulted in 142537 tokens.
    #   Please reduce the length of the messages."
    ...
```
context_length_exceeded is OpenAI’s hard ceiling on how much text you can fit into a single API call. The model has a fixed context window — 128k tokens for gpt-4o, 1M for gpt-4.1, 200k for o1 — and your prompt plus the requested max_tokens for completion must fit inside that envelope. Cross it by even one token and the request is rejected with HTTP 400 before any generation starts.
The error message is unusually helpful: it tells you the exact model limit and the exact token count it computed for your messages. Use those numbers — log them, alert on them, drive auto-truncation from them. The cure for context_length_exceeded is almost always pre-flight token counting plus one of three strategies: switch to a bigger-window model, summarise older context, or chunk the work into multiple calls.
Why this happens
- Prompt alone exceeds the model's window. The most direct cause — you tried to send more input tokens than the model can hold. `gpt-4o` is 128k tokens (~96k English words). Even a long PDF, a transcript of a 30-min call, or a few thousand lines of code can push past that.
- Prompt + max_tokens combined exceeds the window. OpenAI counts both directions. If your prompt is 120k tokens on a 128k model and you ask for `max_tokens: 10000`, the request fails before generation starts. The total of input + max_tokens must fit, leaving you only 8k for output in this example.
- Long conversation history accumulating. Multi-turn chat apps that append every message to `messages[]` blow past the limit silently. By turn 30 of a deep conversation, you're routinely sending 50k-100k tokens of history just to ask a one-line follow-up.
- Wrong tokeniser assumed (character vs token mismatch). Developers often estimate at 4 characters per token. That's a rough average — code, JSON, URLs, and non-English languages tokenise more densely, sometimes 1-2 characters per token. A `len(text) / 4` estimate is wildly optimistic for code-heavy prompts.
- Embedded images counted as tokens. Vision inputs add tokens based on resolution and detail level. A high-detail 2048x2048 image adds ~765 tokens, and a 2048x4096 image ~1,105. Send 10 images plus a long prompt and you'll easily clear 128k without realising.
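Per-image costs like these can be reproduced from OpenAI's documented high-detail tiling rule: scale the image to fit 2048x2048, rescale so the shortest side is 768 px, then charge 85 base tokens plus 170 per 512-px tile. The constants are model-specific (gpt-4o-mini uses different ones), so treat this as a sketch to verify against current docs:

```python
import math

def high_detail_image_tokens(width: int, height: int) -> int:
    """Estimate the token cost of a high-detail vision input (gpt-4o rules)."""
    # 1. Scale down to fit within a 2048x2048 square, preserving aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # 2. Scale down so the shortest side is 768 px (no upscaling).
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    # 3. Count 512-px tiles: 170 tokens each, plus an 85-token base cost.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85
```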
How to fix it
Fixes are ordered by likelihood. Start with the first one that matches your context.
1. Count tokens with tiktoken before sending
Always pre-flight count input + reserved output. If over budget, either truncate, switch model, or chunk. Pre-flight saves a round-trip and lets you fall back gracefully.
```python
import tiktoken

def count_messages_tokens(messages, model="gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    tokens = 0
    for m in messages:
        tokens += 4  # per-message overhead
        tokens += len(enc.encode(m["content"]))
        tokens += len(enc.encode(m["role"]))
    return tokens + 2  # priming

MODEL_LIMITS = {"gpt-4o": 128_000, "gpt-4.1": 1_047_576, "gpt-3.5-turbo": 16_385}

def fits(messages, model, max_tokens=1000):
    return count_messages_tokens(messages, model) + max_tokens <= MODEL_LIMITS[model]
```
2. Switch to a larger-context model
`gpt-4.1` (1,047,576 tokens) and `gpt-4.1-mini` are the long-context options. `o1` and `o1-mini` are 200k. Move to a bigger model for the long-prompt path; you can keep `gpt-4o-mini` for short prompts to save money. Route by token count.
```python
def route_by_length(messages):
    tokens = count_messages_tokens(messages)
    if tokens < 100_000:
        return "gpt-4o"   # 128k window, ~28k left for output
    if tokens < 180_000:
        return "o1"       # 200k window; leave headroom for the completion
    return "gpt-4.1"      # 1M context fallback
```
3. Trim conversation history with a sliding window or summarisation
For chat apps, keep the system prompt + last N user/assistant pairs + a running summary of older turns. Summarise in a separate cheap call (`gpt-4o-mini`) every K turns. Don't naively trim from the front — that drops context the model needs.
```python
import openai

SUMMARY_EVERY_N_TURNS = 10
KEEP_RECENT_PAIRS = 6

def compress(history, model="gpt-4o-mini"):
    if len(history) <= KEEP_RECENT_PAIRS * 2:
        return history
    old = history[:-KEEP_RECENT_PAIRS * 2]
    recent = history[-KEEP_RECENT_PAIRS * 2:]
    summary = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarise this conversation in <300 words, preserving facts and decisions."},
            *old,
        ],
        max_tokens=400,
    ).choices[0].message.content
    return [{"role": "system", "content": f"Summary so far:\n{summary}"}, *recent]
```
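When a summarisation call is overkill (or the summariser itself is down), a purely local fallback is to drop the oldest user/assistant pairs until the history fits a token budget. A minimal sketch — `count` stands in for any token counter, such as a tiktoken-based one:

```python
def trim_to_budget(history, budget, count):
    """Drop oldest user/assistant pairs until the history fits `budget`.

    `history` is a list of message dicts whose first entry (the system
    prompt) is always kept; `count` is any callable returning a token
    estimate for a list of messages.
    """
    system, turns = history[:1], history[1:]
    while turns and count(system + turns) > budget:
        turns = turns[2:]  # drop the oldest user/assistant pair
    return system + turns
```

Unlike summarisation this loses the dropped turns entirely, so it works best as a guard rail behind the summariser rather than a replacement for it.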
4. Chunk long documents and stitch results
For document analysis (legal, RAG, code review), split by semantic units (paragraphs, sections, functions) at a safe chunk size (~30k tokens), call the model per chunk, then stitch outputs in a final synthesis call. Map-reduce is more reliable than relying on a 1M context window for very long inputs.
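The map step can be sketched as a greedy packer that splits on paragraph boundaries under a per-chunk token budget. The `len(s) // 4` default is a crude stand-in to keep the example self-contained — swap in a tiktoken-based counter in production:

```python
def chunk_by_paragraphs(text, max_tokens=30_000, count=lambda s: len(s) // 4):
    """Greedily pack paragraphs into chunks that stay under a token budget."""
    chunks, current, used = [], [], 0
    for para in text.split("\n\n"):
        cost = count(para)
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk then gets its own model call, and a final call synthesises the per-chunk outputs. Note a single paragraph larger than the budget still emits as an oversized chunk — split those further by sentence if your corpus has them.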
5. Reduce max_tokens to free up budget
If you're close to the boundary, lower `max_tokens` from a default like 4096 to what the response actually needs. Most conversational answers fit in 500-800 tokens. This buys headroom on the input side without changing the prompt.
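A sketch of applying this automatically: clamp the completion budget to whatever the window has left, and fail fast when too little remains (the `floor` threshold and function name are illustrative):

```python
def safe_max_tokens(prompt_tokens, model_limit, desired=4096, floor=256):
    """Clamp the completion budget to what the context window can hold."""
    available = model_limit - prompt_tokens
    if available < floor:
        raise ValueError(
            f"Prompt uses {prompt_tokens} of {model_limit} tokens; "
            f"fewer than {floor} left for the completion - trim or chunk."
        )
    return min(desired, available)
```

This reproduces the arithmetic from the "Why this happens" section: a 120k-token prompt on a 128k model leaves at most 8k for output, however large a `max_tokens` you asked for.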
Detection and monitoring in production
Log token counts on every successful response (`response.usage.prompt_tokens` and `completion_tokens`). Plot the p99 of `prompt_tokens` per endpoint over time — a creeping rise means conversation history isn't being trimmed and you'll hit `context_length_exceeded` soon. Add a pre-send guard that logs (and optionally rejects) any request over 90% of the model's limit.
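A sketch of such a pre-send guard — the 90% threshold, logger name, and `reject` flag are illustrative choices, not a library API:

```python
import logging

logger = logging.getLogger("llm.preflight")

def guard_context(prompt_tokens, model_limit, warn_ratio=0.9, reject=False):
    """Log (and optionally reject) requests nearing the context window."""
    ratio = prompt_tokens / model_limit
    if ratio >= warn_ratio:
        logger.warning(
            "prompt at %.0f%% of %d-token window (%d tokens)",
            ratio * 100, model_limit, prompt_tokens,
        )
        if reject:
            raise ValueError(
                f"prompt too close to context limit: {prompt_tokens}/{model_limit}"
            )
    return prompt_tokens
```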
Related errors
- openai `rate_limit_exceeded`: Your account has exceeded its per-minute request (RPM) or per-minute token (TPM) limit for the model you're calling. Limits are tier-based and per-model.
- openai `insufficient_quota`: Your OpenAI organisation has run out of paid credit, hit its monthly hard limit, or hasn't added a payment method yet. Despite the 429 status, this is a billing problem — not a rate-limit problem — and retrying won't help.
- openai `model_not_found`: You requested a model name that either doesn't exist (typo, deprecated, renamed) or that your organisation doesn't have access to (tier-gated, geography-restricted, deprecated for new orgs).
- anthropic `rate_limit_error`: You exceeded one of Anthropic's per-minute caps for the model and tier — RPM (requests/min), ITPM (input tokens/min), or OTPM (output tokens/min). Anthropic enforces all three independently and you can hit any one without breaching the others.
- anthropic `overloaded_error`: Anthropic's infrastructure is at capacity for the model you requested. This is server-side, not a problem with your code or your account — Claude is experiencing a traffic spike or capacity event and rejecting requests until load eases.
Frequently asked questions
- What's the actual context window for each OpenAI model?
- Does the system prompt count toward the context window?
- How accurate is `len(text) / 4` for token counting?
- If my prompt fits but my output is too long, do I get `context_length_exceeded`?
- Can I send multiple separate calls instead of one giant call?
- Why does my chat hit `context_length_exceeded` after only a few turns?
- Does streaming reduce the token cost or window usage?
- Can I tell OpenAI to truncate the prompt for me?
When to escalate to OpenAI support
This is almost never an OpenAI-side issue — escalation rarely helps. The exception: if you're certain your prompt fits within the documented model limit but the API still rejects with the wrong reported token count, log a support ticket with the request ID. Mismatch between SDK token counts and server-side counts has happened during model releases (e.g., new tokeniser shipped before docs updated).