attention is a budget
Context windows are growing, but attention is still a finite resource. Curation is key.
Context windows have gotten massive. Models ingest entire codebases, books, or video files in a single pass, which makes it feel like retrieval is dead. If you can fit everything in the window, why bother with complex RAG pipelines or chunking strategies?
But there is a trap in the abundance. Even with a 1-million-token context window, attention is a finite resource.
Throwing more tokens at a problem often makes the result worse. This isn’t just a technical limitation of current architectures; it is a fundamental constraint of information processing. If you hand someone a 1-million-word document where only 100 words matter, they still have to find those 100 words.
Context engineering is no longer about stuffing as much as possible into the window. It’s about curation and intentional planning.
Attention as a Finite Resource
In a transformer, every token competes for attention with every other token. When you double the context size, you aren’t just giving the model more information, you are increasing the noise it has to filter.
The model can’t know what’s relevant until it has processed everything. This means that every irrelevant token you include is a tax on the model’s concentration. If the signal is buried in noise, the probability of the model forgetting a crucial constraint or hallucinating a detail increases.
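To make the dilution concrete, here is a toy sketch in numpy: a single query scores one relevant token against a crowd of uniform distractors. This is a drastic simplification of real multi-head attention, but the softmax arithmetic is the same. As the distractor count grows, the weight on the signal collapses.

```python
import numpy as np

def attention_weight_on_signal(n_distractors: int, signal_score: float = 2.0,
                               distractor_score: float = 0.0) -> float:
    """Softmax weight a query assigns to one relevant token when it
    competes with n_distractors irrelevant ones (toy single-query model)."""
    scores = np.array([signal_score] + [distractor_score] * n_distractors)
    weights = np.exp(scores) / np.exp(scores).sum()
    return float(weights[0])

for n in (10, 100, 1_000, 100_000):
    # Every extra token spreads the same fixed budget thinner.
    print(f"{n:>7} distractors -> weight on signal: {attention_weight_on_signal(n):.5f}")
```

With 10 distractors the relevant token gets roughly 42% of the attention weight; with 100,000 it gets less than 0.01%.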
The goal is signal density, not volume.
The 1-Million-Word Document
Imagine you have a question about a specific legal clause in a massive contract.
You could hand a lawyer a 1-million-word document containing every contract the firm has ever signed and ask them to find the answer. They might find it, but it will take them a long time, and they might get distracted by a similar clause in a different contract.
Or you could hand them a single page containing the relevant contract and the specific section you’re interested in.
Which one results in a more accurate answer?
This is what we’re doing when we dump an entire knowledge base into a prompt. We are asking the model to be a librarian, a researcher, and a domain expert all at once. We are trading tokens for processing power, and it’s often a bad trade.
Context Engineering vs. Context Stuffing
“Stuffing” is the naive approach: throw everything possibly relevant into the context and hope the model figures it out.
“Engineering” is the intentional approach: curate what goes in based on the task at hand.
Sunil Pai notes that “context became curated rather than dumped.”1 The shift is from providing a library to providing a briefing.
This is where the mental model of RAG (Retrieval-Augmented Generation) shifts. I used to think of RAG as a way to get around context limits. Now I see it as a way to move a task from Recall to Processing. By retrieving only what’s needed, you free up the model’s attention to process that information more deeply.
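A minimal sketch of that shift, under loud assumptions: `embed` stands in for any text-to-vector model, and `top_k_chunks` and `build_prompt` are hypothetical helpers, not any library’s API. Recall happens outside the model, in cheap vector math; the context window is reserved for processing the few chunks that survive.

```python
import numpy as np

def top_k_chunks(question: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the question and keep the top k.
    `embed` is any text -> vector function; assumed, not a real API."""
    q = embed(question)
    def cosine(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(chunks, key=lambda c: cosine(embed(c)), reverse=True)[:k]

def build_prompt(question: str, chunks: list[str], embed) -> str:
    # Recall is done out here; the model spends its attention on processing.
    reference = "\n\n".join(top_k_chunks(question, chunks, embed))
    return f"<reference>\n{reference}\n</reference>\n\n<task>\n{question}\n</task>"
```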
The Contextual Retrieval Lesson
Anthropic recently shared a powerful lesson from their engineering work on “Contextual Retrieval.”2
The problem with traditional RAG is that individual chunks often lose their meaning when separated from the whole. A chunk might say, “The company’s revenue grew by 3% over the previous quarter.” But without knowing which company or which quarter, that information is useless.
The solution wasn’t to use larger chunks or a larger window. It was to prepend chunk-specific explanatory context before embedding.
By adding a small amount of curated context to each chunk (e.g., “This chunk is from an SEC filing on ACME corp’s performance in Q2 2023”), they reduced retrieval failure rates by 67%.
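A sketch of the pattern, hedged: the `llm` and `embed` callables and the situating prompt below are illustrative stand-ins, not Anthropic’s published code.

```python
def situate_chunk(document: str, chunk: str, llm) -> str:
    """Ask a model for one sentence that situates the chunk in its document.
    `llm` is any prompt -> completion function; assumed for illustration."""
    prompt = (
        f"<document>\n{document}\n</document>\n\n"
        f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n\n"
        "Write one sentence situating this chunk within the document, "
        "to improve search retrieval. Answer with only the sentence."
    )
    return llm(prompt)

def index_chunks(document: str, chunks: list[str], llm, embed) -> list[tuple]:
    indexed = []
    for chunk in chunks:
        context = situate_chunk(document, chunk, llm)
        # Embed the situating sentence together with the chunk: a few added
        # tokens disambiguate "the company's revenue grew by 3%".
        indexed.append((embed(f"{context}\n{chunk}"), chunk))
    return indexed
```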
They didn’t add volume. They added signal.
Practical Patterns for Attention Budgeting
If attention is a budget, how do we spend it wisely?
- Summarize before including: If you need to provide a large document for reference, consider having the model summarize the relevant parts first. Trade tokens for signal density.
- Prepend context to chunks: Follow the Anthropic pattern, or formalize these summaries as agent skills. Don’t just provide a raw chunk of data or code; provide a one-line summary of its role in the system.
- Structure by relevance: Place the most critical information (like the specific task or core constraints) at the beginning or end of the prompt where models tend to pay the most attention.
- Separate reference from task: Use clear delimiters like XML tags or Markdown headers to separate the “background knowledge” from the “immediate goal.” This helps the model’s internal attention mechanism prioritize.
- Use the model to curate: For complex tasks, use a two-step process. First, ask the model to identify which parts of the available context or knowledge base are relevant. Then, provide only those parts in the final prompt (a sketch combining this with the delimiter pattern follows this list).
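Here is a minimal sketch of the last two patterns working together, with a hypothetical `llm` callable (any prompt-to-completion function) standing in for a real API:

```python
def curate_then_answer(task: str, sections: dict[str, str], llm) -> str:
    """Two-pass pattern: a cheap first call chooses the context, so the
    second call spends its attention on processing rather than recall."""
    toc = "\n".join(f"- {name}: {body[:80]}" for name, body in sections.items())
    picked = llm(
        f"<task>\n{task}\n</task>\n\n<sections>\n{toc}\n</sections>\n\n"
        "List the names of the sections needed for this task, one per line."
    )
    # Naive membership check; a real system would parse the reply properly.
    chosen = [name for name in sections if name in picked]

    reference = "\n\n".join(f"## {name}\n{sections[name]}" for name in chosen)
    # Delimiters separate background knowledge from the immediate goal.
    return llm(f"<reference>\n{reference}\n</reference>\n\n<task>\n{task}\n</task>")
```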
The Curation Constraint
Larger context windows are fantastic because they give us room to work that we didn’t have a year ago. But room isn’t a license to fill it.
The people getting the best results from frontier models are the ones treating the context window like a thought-completion surface: curating for clarity, density, and intent.
As context windows continue to grow, the ability to decide what not to include will become just as important as the ability to write the prompt itself.
Attention is a budget. Spend it on the signal.
Footnotes
1. Sunil Pai, “Where Good Ideas Come From”: “Liquid networks aren’t just social. They’re documentary. Agents need the documentary version.” ↩
2. Anthropic, “Contextual Retrieval”: “Individual chunks lack sufficient context… The solution was to prepend chunk-specific explanatory context.” ↩