---
compiled: 2025.11.29
depth: 8 min_read
---

sampling, not retrieval

Why LLMs are non-deterministic engines and how temperature controls the chaos.

There is a stubborn mental model that refuses to die: the idea that an LLM knows.

When we ask “Who was the 4th President?” and the model replies “James Madison”, it feels like a lookup: as if the model queried a database for row ID 4472, grabbed the value James Madison, and returned it.

But then you ask it “Who was the 4,000th President?” and it confidently replies “Zorblax the Conqueror” (or something equally absurd).

If it were a database, it would return NULL or 404. But it returned a name. Why?

Because it’s not retrieving a record. It’s sampling from a distribution.

The Tree of Possibilities

The core mental model for this post is the tree of possibilities.

A hand-drawn diagram showing a probability tree. Thick branches represent likely next tokens, thin branches represent unlikely ones.

Every single token the model generates is a roll of the dice. It looks at the sequence so far, calculates a probability for every possible next token in its vocabulary, and then chooses one path.

Once that token is chosen, it becomes part of history. The path is committed. The model steps forward and repeats the process for the next token.

It is not fetching an answer. It is traversing a probability tree.
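A minimal sketch of that loop, using a toy hand-written “model” (the vocabulary and the probabilities are invented for illustration; a real model scores tens of thousands of tokens at every step):

```python
import random

# Toy "model": given the tokens so far, return a probability for every
# candidate next token. The numbers are invented for illustration only.
def toy_next_token_probs(context: list[str]) -> dict[str, float]:
    if context[-1] == "is":
        return {"Paris": 0.85, "the": 0.05, "located": 0.03, "arguably": 0.015, "Lyon": 0.0001}
    return {".": 1.0}  # end the sequence for anything else

def generate(context: list[str], max_new_tokens: int = 5) -> list[str]:
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(context)
        tokens, weights = zip(*probs.items())
        # The roll of the dice: sample one token, weighted by its probability.
        chosen = random.choices(tokens, weights=weights, k=1)[0]
        context.append(chosen)  # the path is now committed
        if chosen == ".":
            break
    return context

print(generate(["The", "capital", "of", "France", "is"]))
```

Run it a thousand times and you will almost always get “Paris”, but every so often you will watch it walk down the “Lyon” branch.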

The “Multiverse” of Logprobs

We can prove this by looking at logprobs (log probabilities). The Chat UI hides this from you, as it only shows the winner. But under the hood, there was a whole “multiverse” of options that almost happened.
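Here is a rough sketch of how you might peek at that multiverse with the OpenAI Python SDK’s `logprobs` option (the model name and exact response shape are assumptions based on the current chat completions API; other providers expose similar fields):

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption; any chat model that returns logprobs
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=1,
    logprobs=True,     # return the log probability of the chosen token...
    top_logprobs=5,    # ...plus the runners-up that almost happened
)

first_token = response.choices[0].logprobs.content[0]
print(f"Chosen Token: {first_token.token!r}")
print("--- The Multiverse Candidates ---")
for candidate in first_token.top_logprobs:
    # logprobs are natural logs; exponentiate to recover probabilities
    print(f"{candidate.token!r}: {math.exp(candidate.logprob):.2%}")
```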

Running this might show:

```text
Chosen Token: 'Paris'
--- The Multiverse Candidates ---
'Paris': 85.00%
'the': 5.00%
'located': 3.00%
'arguably': 1.50%
'Lyon': 0.01%
```

Even for a fact as hard as “Paris”, there is a non-zero chance it could have said “Lyon”. It’s just unlikely.

Controlling Chaos: Temperature

If the model is just rolling dice, how do we stop it from picking “Lyon”?

We use temperature.

Temperature is a modifier that changes the shape of the probability curve before we roll the dice.

A 3D wireframe comparison of two probability landscapes. Left: Low Temperature showing a sharp, single peak (certainty). Right: High Temperature showing chaotic, rolling hills (randomness).
  • Low Temperature (0.1 - 0.3): Sharpens the peak. It takes the “likely” tokens and makes them overwhelmingly likely. It makes the mountain taller and the plains flatter. Use this for code, JSON, and facts.
  • High Temperature (0.8 - 1.5): Flattens the curve. It gives the “long tail” options (like “Lyon” or “arguably”) a fighting chance. Use this for creative writing or brainstorming.
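Mechanically, temperature is just a divisor applied to the raw scores (logits) before the softmax turns them into probabilities. A minimal sketch with invented logits:

```python
import math

def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    # Divide every logit by the temperature, then renormalize with a softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

# Invented logits for illustration only.
logits = {"Paris": 6.0, "the": 3.2, "located": 2.7, "arguably": 2.0, "Lyon": -2.0}

for temp in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp}: " + ", ".join(f"{tok}={p:.1%}" for tok, p in probs.items()))
```

At T=0.2 the “Paris” peak soaks up essentially all of the probability mass; at T=1.5 the long tail gets its fighting chance.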

The “Blurry JPEG” of the Web

So if it’s just guessing next tokens, how does it know facts at all?

Science fiction writer Ted Chiang famously described ChatGPT as a “Blurry JPEG of the Web”1.

Think about a JPEG image. It doesn’t store every pixel of the original photo. It uses math (a discrete cosine transform) to compress the patterns of the image. When you open the file, it “hallucinates” the pixels back into existence based on those patterns.

If the compression is high, the sharp edges get blurry. Artifacts appear.

Sutskever’s Law: Ilya Sutskever (co-founder of OpenAI) takes this further: “Prediction is compression.”2

To predict the next word in a text accurately, you have to understand the underlying reality that generated the text. The model has “compressed” its entire training corpus into its weights.

When you ask for a specific quote, it isn’t opening a file. It is rendering that quote from the compressed patterns in its neurons.

Hallucination is a Feature

Here is the kicker: Hallucination is not a bug.

The exact same mechanism that allows the model to write a new, original poem (good hallucination) is the one that makes it invent a court case that never happened (bad hallucination).

It is simply completing the pattern.

“LLM hallucination is a feature, not a bug.”

As Andrej Karpathy implies, the data we put in front of the model is its only anchor.

Karpathy describes LLMs as “dream machines”4. We direct the dreams with prompts, but if the prompt is loose, the dream wanders.

If you give it a pattern that looks like a legal brief, it will complete it with things that sound like legal citations. It doesn’t know they are fake. It just knows that in the “legal brief” region of latent space, v. and Sup. Ct. often appear together.

Engineering Takeaway: RAG as Grounding

If we can’t trust the model’s memory (the compressed weights), how do we build reliable apps?

We stop asking it to remember.

Retrieval Augmented Generation (RAG) is the engineering fix for the probabilistic nature of LLMs.

Instead of asking:

Tell me about the Q3 financial results. (Relying on internal weights/memory)

We say:

Here is a PDF of the Q3 results. Using only this context, tell me about the results.

We move the task from Recall to Processing.
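A minimal sketch of that shift, again with the OpenAI SDK (the retrieval function, model name, and prompt wording are all placeholders; the point is that the retrieved text rides along in the prompt):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def retrieve_q3_report() -> str:
    # Placeholder for a real retrieval step: a vector search, a keyword
    # search, or simply reading the extracted text of the PDF from disk.
    return "<paste the extracted text of the Q3 results here>"

context = retrieve_q3_report()

grounded_prompt = (
    "Here is an excerpt from the Q3 results:\n"
    f"<context>\n{context}\n</context>\n\n"
    "Using only the context above, summarize the Q3 results. "
    "If the answer is not in the context, say you don't know."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption; any chat model works
    temperature=0,        # facts, not creativity
    messages=[{"role": "user", "content": grounded_prompt}],
)
print(response.choices[0].message.content)
```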

This is how Google’s Grounding with Search and Anthropic’s Contextual Retrieval work. They inject the “truth” into the prompt before the model starts rolling the dice.

They don’t fix the hallucination engine. They just give it a path where the only logical continuation is the truth.


Next up: We’ve looked at text in and text out. But what about when the model outputs JSON? In Post 6, we’ll deconstruct tool calls and why they are the most fragile part of the stack.

Footnotes

  1. Chiang, Ted. (2023). “ChatGPT Is a Blurry JPEG of the Web”. The New Yorker. A conceptual framework explaining how LLMs reconstruct information through lossy patterns rather than verbatim retrieval.

  2. Deletang, G., et al. (2023). “Language Modeling Is Compression”. DeepMind / OpenAI researchers exploring the equivalence of compression and prediction.

  3. Kaplan, J., et al. (2020). “Scaling Laws for Neural Language Models”. The foundational paper on how model performance scales with size.

  4. Karpathy, Andrej. (2023). Twitter/X Post. “I always struggle a bit with I’m asked about the ‘hallucination problem’ in LLMs…”