This is the second post in the AI Demystified series. The first one covered how we got from 1948 statistics to GPT-4, and why an LLM is “a very, very good autocomplete.” Now we look at what happens inside the model at every step of generation: what tokens and context are, how it picks the next word, and why even a perfect LLM is doomed to confidently lie sometimes.
“Thinks” in scare quotes
When we say “the model thinks,” we’re stretching a human word over a process that doesn’t look like thinking at all. There’s no reasoning inside, no inner monologue, no moment of “let me think this over.” There’s only one operation, repeated: look at all the text in the context window — and produce the next token. Then that token gets appended to the end, and the operation repeats.
This is called autoregression: each next token is predicted based on all the previous ones. There are no “let me pause and think” moments in this loop. The model can’t slow down, can’t step back and look at its own answer, can’t say “wait, I made a mistake five tokens ago, let me start over.” Every step it has to produce exactly one token — and immediately move on.
So everything strange we observe in LLMs follows from this constraint. Why does it “forget” the start of a long dialog? Because the context window is finite. Why does it confidently bullshit? Because the process has no internal truth check. Why does “think step by step” help? Because the model writes intermediate tokens to itself and reads them back as part of its own input.
Let’s go through it in order.
Tokens, not words
The model doesn’t see letters or words. It sees tokens — sub-word chunks usually 3–4 characters long. The word hello in most tokenizers is one token. Hallucinates is two or three. Digits split almost character by character. Emojis sometimes take 2–3 tokens each.
The most famous example of the model not seeing letters: count the Rs in strawberry. For a long time GPT-4 said “two,” because for the model strawberry is two tokens, straw and berry, not a sequence of letters. The letter R doesn’t exist in its representation — only two numbers from the vocabulary do.
A similar story with typos. The word ghost in OpenAI’s tokenizer is a single token, say number 7142. The typo gohst is three tokens g, oh, st. To the model these are two completely different inputs. It learned to fix typos only because there were lots of typos in the training data, and it memorized which mangled token chains correspond to which clean ones.
There’s also a practical consequence: non-English languages tokenize worse than English. English text averages about one token per 4 characters, Russian one per 1.5–2. So the same text in Russian takes 2–3× more tokens in the API. Which means more expensive, slower, and faster to hit the context limit. That’s why many apps internally translate queries to English — it’s cheaper.
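You can poke at all of this yourself with OpenAI’s open-source tiktoken library, which ships the same tokenizers its models use. A minimal sketch (exact splits and counts depend on the tokenizer version, so treat the specific numbers quoted above as illustrative):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by GPT-3.5 / GPT-4

for text in ["hello", "Hallucinates", "strawberry", "ghost", "gohst",
             "1234567", "привет, как дела?"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]      # what each token looks like as text
    print(f"{text!r:22} {len(ids):2d} tokens  {pieces}")
```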
Context: memory that always ends
The context window is everything the model “remembers” right now. That includes your prompt, the system message, the dialog history, and everything the model has already produced in the current response. Window size is measured in tokens: GPT-4 Turbo has 128k, Claude has 200k, Gemini 1.5 Pro goes up to 1–2 million.
128k tokens is roughly 300 pages of text. Sounds like a lot, but in long dialogs it runs out faster than you’d think: the system prompt, RAG context, conversation history, attached documents — all of it eats space. When the context fills up, older messages get pushed out. Hence the classic “the model forgot what I told it at the start.”
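What “pushed out” means in practice is up to the application. A minimal sketch of the simplest policy, assuming the chat history is a list of role/content messages and counting tokens with tiktoken (real chat frontends also count per-message formatting overhead and often summarize old turns instead of just dropping them):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_history(messages, budget_tokens=128_000):
    """Drop the oldest turns until the conversation fits the context window.

    `messages` is a list of {"role": ..., "content": ...} dicts,
    with the system prompt at index 0 (kept no matter what).
    """
    def total(msgs):
        return sum(len(enc.encode(m["content"])) for m in msgs)

    trimmed = list(messages)
    while total(trimmed) > budget_tokens and len(trimmed) > 1:
        trimmed.pop(1)          # evict the oldest user/assistant message
    return trimmed
```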
An important subtlety: inside the context window, attention is also distributed unevenly. Studies show models recall the start and end of the context better, while information in the middle gets “lost” (the effect is known as lost in the middle). So if you have a long document, put important instructions at the start or the end, not buried in the middle.
And one more thing: context is not “memory” in the database sense. Between different chats the model remembers nothing. Every new conversation starts with an empty window, and everything you told it last time doesn’t exist for it. When ChatGPT starts to “remember” you across sessions — that’s an add-on: a special module pulls relevant facts and slips them into the system prompt of the new conversation.
One token at a time
Now the most curious part. You ask the model something, it “thinks” for a second, then writes an answer. What’s happening inside that second?
It’s not actually “writing the answer.” It’s running the same operation over and over again. One pass through the network is not a full answer — it’s one token.
It takes your entire prompt and runs it through every transformer layer (96 of them in the largest GPT-3; the count varies by model). At the output, the model produces a probability distribution over the entire vocabulary: for each of the 50–100 thousand tokens in the vocabulary, a number estimating the probability that this exact token should come next. From that distribution, one token is picked. It gets appended to the end of the prompt. And it all starts again: the new prompt (the old one plus one token) goes through the network, a new distribution comes out, a new token gets picked.
That’s how the entire answer gets wound up, one token at a time. When the model emits a special “end of text” token — generation stops.
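The whole loop fits in a few lines of code. Below is a minimal sketch with a random stub standing in for the real network (the stub and the END_OF_TEXT id are made up for illustration); it simply picks the most likely token every time, which the next section will argue is actually a bad idea:

```python
import numpy as np

VOCAB_SIZE = 50_000
END_OF_TEXT = 0          # hypothetical id of the special "end of text" token

def forward(token_ids):
    """Stand-in for a real transformer: one full pass over the whole context,
    returning a probability distribution over the entire vocabulary."""
    logits = np.random.randn(VOCAB_SIZE)           # a real model computes these from its weights
    return np.exp(logits) / np.exp(logits).sum()   # softmax -> probabilities

def generate(prompt_ids, max_new_tokens=200):
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = forward(context)          # one pass through the network ...
        next_id = int(np.argmax(probs))   # ... yields exactly one token (greedy pick here)
        context.append(next_id)           # append it and run the whole thing again
        if next_id == END_OF_TEXT:        # the model signals it is done
            break
    return context[len(prompt_ids):]
```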
A couple of important things follow from this. First, the model can’t “undo” a token it has already emitted. If on the fifth token of its reply it blurted something wrong — it can’t go back and rewrite. It can only continue from there, pretending everything is going to plan. That’s why you sometimes see answers like “Chopin was born in 1820… or rather, in 1810”: the model realized mid-answer it was wrong, but the first mistake is already in the context, and you can’t erase it.
Second, answer speed is linear in length. A long answer is literally many passes through the whole network, one per token. Streaming in chats isn’t an “accelerated playback” — it’s the actual generation rate.
Sampling: why “always the most likely” is bad
You have to pick one token from the distribution. The most obvious way is to always take the most likely one. This is called greedy decoding, and it works poorly.
If you always take the max, the text becomes flat and quickly loops. Prompt Engineering with LLMs has a textbook example: they asked an old text-curie-001 model to write a list of reasons it likes Star Trek. Here’s what came out:
- The characters are well-developed and interesting.
- The plot is well-constructed and engaging. …
- The franchise has a strong foundation.
- The franchise has a passionate fanbase.
- The franchise has a strong legacy.
- The franchise has a long history. …
- The franchise has a strong legacy.
- The franchise has a strong following.
The model has fallen into a “The franchise has a strong X” pattern and can’t escape: at every step, continuing the pattern is more likely than breaking it. The list never ends because the next list item is always more likely than “okay, that’s enough.”
To prevent this, randomness is added to the choice. The main knob is temperature. At temperature=0 the model is effectively greedy: it always picks the most likely token. At temperature=1 selection follows the “honest” probabilities the model produced. At temperature=2 the distribution gets “flattened”: even unlikely tokens get a chance, and the text becomes wild and creative (and often incoherent).
There are two more subtle settings: top-k (only choose from the k most likely tokens, ignore the rest) and top-p, also known as nucleus sampling (choose from the smallest set of tokens whose cumulative probability ≥ p, e.g. 0.9). Top-p usually beats top-k because it’s adaptive: if the model is confident about one token, top-p will keep just that one; if the distribution is spread out, it’ll keep more options.
In practice, temperature 0.7–0.8 is the sweet spot for most tasks. For code and facts people often use 0.0–0.3 (predictability matters). For creative work — 1.0+.
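Both knobs are only a few lines of numpy. A minimal sketch of temperature plus nucleus (top-p) sampling over a vector of logits, the raw scores a model emits before softmax:

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_p=0.9):
    """Pick one token id from raw logits using temperature + nucleus sampling."""
    if temperature == 0:
        return int(np.argmax(logits))            # temperature 0: plain greedy
    scaled = logits / temperature                # <1 sharpens, >1 flattens the distribution
    probs = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # token ids from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1  # smallest prefix with mass >= top_p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum() # renormalize over the nucleus
    return int(np.random.choice(kept, p=kept_probs))
```

At temperature=0 it degenerates into greedy; at top_p=1.0 it keeps the entire vocabulary and the choice follows the temperature-adjusted probabilities alone.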
Why the model hallucinates
Now the most interesting part. Why does even a well-trained LLM sometimes confidently spout nonsense?
People often say: “because the model doesn’t know what’s true and what’s not.” That’s right, but it’s only half the story. The actual problem is deeper and more interesting.
In September 2025 OpenAI researchers published “Why Language Models Hallucinate,” which makes an unexpected claim: the main cause of hallucinations is how models are trained and evaluated. Modern benchmarks are set up like a multiple-choice test: a correct answer earns a point, a wrong answer earns zero, “I don’t know” also earns zero. In such a system, guessing pays off better than admitting you don’t know.
Imagine a model asked about the birthday of an obscure person. It has two strategies:
- say “I don’t know” — guaranteed zero
- guess any date — 1-in-365 chance of being right, and across thousands of such questions the guessing model on average ends up with more points than the honest one
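The arithmetic behind that second bullet, with made-up but representative numbers:

```python
# A benchmark of 1,000 questions the model genuinely cannot answer,
# scored the usual way: 1 point for a correct answer, 0 for anything else.
n_questions = 1_000
p_lucky_guess = 1 / 365                  # blindly guessing a birthday

score_if_abstaining = 0                  # "I don't know" never earns points
score_if_guessing = n_questions * p_lucky_guess

print(score_if_abstaining, round(score_if_guessing, 1))  # 0 vs ~2.7: guessing "wins"
```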
Benchmarks have rewarded the second strategy for years. And the models learned it. They hallucinate not because they’re “broken,” but because they were specifically, if unintentionally, trained to guess when they don’t know.
The same paper cites fresh numbers from the GPT-5 system card — comparing two models on the SimpleQA benchmark:
| Metric | gpt-5-thinking-mini | OpenAI o4-mini |
|---|---|---|
| Abstention rate (answers “don’t know”) | 52% | 1% |
| Accuracy (correct answers) | 22% | 24% |
| Hallucination rate (wrong answers) | 26% | 75% |
The old o4-mini guesses almost always — and therefore is wrong 75% of the time. The new gpt-5-thinking-mini honestly says “I don’t know” half the time — and only hallucinates in 26%. Accuracy is almost identical. In other words: the difference is in honesty, not in knowledge.
But there’s also a deeper, theoretical part. The same group, in “Calibrated Language Models Must Hallucinate,” proves: even a perfectly calibrated model is doomed to hallucinate on rare facts. The logic goes like this. During pretraining the model only sees correct text — without “true/false” labels. It learns to approximate the overall distribution of language. Patterns like punctuation rules it’ll learn flawlessly, because those have structure. But arbitrary one-off facts — an obscure person’s birthday, the exact title of the third paper in some niche journal — don’t fit any pattern. They’re either memorized or not, depending on how many times they appeared in the training data.
An analogy from the same paper: imagine you have to guess a pet’s birthday from its photo. No algorithm can do this, because there’s no relationship between the photo and the date. And since the model is required to emit a token at every step — it’ll emit something. Sometimes plausible.
It follows that hallucinations can’t be fully eliminated while the architecture stays the same. You can teach the model to say “I don’t know” more often — that’s what new models like GPT-5 Thinking do. You can hook up retrieval: pull real facts from a database into the context. You can add tools — search, calculator, code. But this reduces, not eliminates, the model’s pull toward plausible guessing.
It helps to distinguish three types of hallucinations (per the Huang et al. 2023 survey):
- Factual — the model says something incorrect about the real world. “Chopin was born in 1820.”
- Citational — the model fabricates a source, paper, book, or quote. Especially dangerous in academic and legal work: there have been real cases of US lawyers filing court briefs citing nonexistent court cases generated by ChatGPT.
- Computational — the model confidently makes an arithmetic error. This is natural: it’s not supposed to compute, it’s supposed to predict tokens.
A universal sign that an answer needs double-checking: specific numbers, names, dates, and links. Anything that can’t be guessed from a pattern, the model probably made up.
Why “think step by step” works
There’s a famous trick: if you add “let’s think step by step” to the prompt, answer quality on reasoning tasks jumps sharply. This is called chain-of-thought, and it sounds a bit mystical — like we’re coaxing the model to “try harder.” Actually the mechanism is completely straightforward.
Remember, the model can’t “stop and think.” Every step it has to emit a token. When you hand it a problem like “Vasya has 5 apples, ate 2, how many are left” — the model has to produce the answer right now, on the next token. It has no inner scratchpad. If the answer requires two or three steps of reasoning — there’s no place to do them.
When we write “think step by step,” we’re effectively letting the model write the intermediate steps to the outside world — to its own output. Those written-out steps become part of its context for the next token. So the model uses its own output as working memory.
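Here is what that looks like in practice with the OpenAI chat API (the model name and the apple problem are placeholders; any chat model and any multi-step question will do):

```python
from openai import OpenAI

client = OpenAI()            # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"        # illustrative; substitute whatever model you use

question = (
    "Vasya has 5 apples. He ate 2, then gave half of the rest to Petya. "
    "How many apples does Vasya have left?"
)

# Direct: the model has to land on the answer within its very first tokens.
direct = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": question + " Answer with just a number."}],
)

# Chain-of-thought: the same question, but the model may write the intermediate
# steps out loud. Those tokens become part of its own context and serve as
# working memory for the final answer.
cot = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": question + " Let's think step by step."}],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```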
This is exactly what so-called “reasoning models” (o1, o3, GPT-5 Thinking, DeepSeek R1) do, just at much greater scale: the model first generates a long internal chain of reasoning (often hidden from the user), and only then the final answer. It works because in an LLM “thinking” literally means “writing tokens.” The more tokens it gets for scratch work, the harder the problems it can solve.
What we now know
Putting the picture together. An LLM is an autoregressive token predictor. At every step it gets everything in the context window as input and produces a probability distribution over the vocabulary. From that distribution one token is sampled; temperature and top-p control the balance between predictability and creativity. Then the token is appended to the context and the operation repeats.
All its strangeness follows from this mechanic. The model can’t “stop and think” — it can only write tokens, which is why “think step by step” works. The model can’t tell truth from plausibility — which is why it confidently hallucinates. The model remembers nothing past its context window — which is why it forgets the start of the conversation. The model sees tokens, not letters — which is why it gets confused counting letters.
In the next three posts we’ll dig into all of this in detail. First — tokenization and embeddings: how text actually becomes numbers, and why the vector for “king” minus the vector for “man” plus the vector for “woman” equals the vector for “queen.” Then — attention: how exactly the soft-search mechanism works that lets the model look at any token in the context in a single step. And to wrap up — inside the transformer block: how all these little pieces assemble into a working model that we call GPT.
- “Why Language Models Hallucinate”: the core thesis that models hallucinate not because they’re “dumb,” but because benchmarks penalize “I don’t know” and reward guessing.
- “Calibrated Language Models Must Hallucinate”: theoretical work showing that even a perfectly calibrated model is forced to hallucinate on rare facts.
- “A Survey on Hallucination in Large Language Models”: a detailed taxonomy of hallucination types (factuality, faithfulness, intrinsic vs extrinsic).