This is the fourth post in the «AI without magic» series. In the first one we looked at where LLMs came from. In the second — how a model «thinks» one token at a time and why it hallucinates. In the third — how text becomes vectors. Now let’s unpack the mechanism that made the transformer what it is: attention.

Why tokens need to talk

Last post we stopped at static embeddings. Every word gets its own vector, and words with similar meanings cluster together in space. But there’s a problem: a word has one vector, while it can have many meanings.

The classic example from 3Blue1Brown: take the word «mole». In «American shrew mole» it’s a small burrowing animal. In «one mole of carbon dioxide» — a unit for amount of substance. In «take a biopsy of the mole» — a skin growth. Three completely different meanings, one static embedding.

A simpler one: the word «bank». «Investment bank» — a financial institution. «On the bank of the river» — a riverside. «A piggy bank» — a container for money. The embedding layer we built last post will spit out the same vector for all three.
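
A minimal sketch of that limitation, assuming a standard lookup-table embedding such as torch.nn.Embedding (the token ids below are made up purely for illustration):

```python
import torch

torch.manual_seed(0)
vocab_size, d_model = 50_000, 8
embedding = torch.nn.Embedding(vocab_size, d_model)  # one learned row per token id

BANK_ID = 1234  # hypothetical id of the token «bank»

# «investment bank» vs «bank of the river»: the lookup never sees the neighbours
sentence_1 = torch.tensor([501, BANK_ID])         # made-up ids for «investment bank»
sentence_2 = torch.tensor([BANK_ID, 72, 30, 88])  # made-up ids for «bank of the river»

vec_1 = embedding(sentence_1)[1]  # vector of «bank» in the first sentence
vec_2 = embedding(sentence_2)[0]  # vector of «bank» in the second sentence

print(torch.equal(vec_1, vec_2))  # True: same vector, context ignored
```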

To figure out which meaning is in play right now, the token has to look at its neighbors. «River» nearby — it’s a riverside. «Investment» — finance. «Piggy» — container.

The mechanism that lets tokens «look at each other» and update their vectors based on context is called attention. This post is about how it works.

Soft search: the general idea

In the first post we already mentioned attention briefly — as «soft search». The idea came from machine translation back in 2015: instead of a token hard-picking a single related token from the context, it gets a soft mix of all of them with different weights.

It’s like searching a library catalog, only soft. You come in with a query — say, «something about neural networks». The catalog doesn’t hand you a single book. It hands you a mix: 60% — a machine learning book, 30% — optimization, 10% — statistics. The better a book matches the query, the bigger its share of the mix.

That’s exactly what happens in attention. Each token issues a query. All other tokens show their keys. We compute how well the query matches each key, normalize those matches into percentages, and collect a weighted sum of those tokens’ values.
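
In code, that «mix» is just a softmax over query–key matches followed by a weighted sum of values. A minimal sketch for a single query (the vectors are random stand-ins, not real embeddings):

```python
import torch

torch.manual_seed(1)
d = 4                        # toy dimensionality
query = torch.randn(d)       # «what am I looking for?»
keys = torch.randn(3, d)     # what the three other tokens «offer»
values = torch.randn(3, d)   # what they will pass on if chosen

scores = keys @ query               # how well each key matches the query
weights = torch.softmax(scores, 0)  # normalized into shares that sum to 1
mix = weights @ values              # the weighted mix of values

print(weights, weights.sum())  # three shares, e.g. roughly a 60/30/10 split, summing to 1.0
```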

Query, Key, Value

Each token in attention plays three roles:

Query: «what am I looking for in the context?»
Key: «here’s what I can offer if you ask»
Value: «here’s what I’ll pass on if you agree with me»

[Diagram: the three roles of a token’s input vector x. Query («what am I looking for?»): q = x · W_q. Key («what do I offer?»): k = x · W_k. Value («what do I pass on?»): v = x · W_v. Queries are compared with keys (q · k) and the matches are weighed with softmax; all three projections are computed independently for every token.]
Query, key and value are three different «roles» each token plays. They are just three projections of the input vector through learnable matrices Wq, Wk, Wv. The entire scaled dot-product attention from Vaswani et al. 2017 fits into one formula: Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V.

All three are just different projections of the original embedding through three learnable matrices Wq, Wk, Wv. Those matrices are the «parameters» the model learns during training. Nothing more in attention.

Why three different roles? Why not just compare vectors directly? Because the same word wants different things in different contexts. The word «which» as a query is hunting for a noun antecedent. As a key — it’s not particularly interesting. As a value — it carries «here begins a relative clause». Three projections let the model learn what to look for, how to present itself and what to pass on — independently.
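
A minimal sketch of these projections, with toy sizes (the matrices here are random; in a real model they are learned during training):

```python
import torch

torch.manual_seed(2)
d_model, d_k = 8, 4              # toy sizes; real models use hundreds of dimensions
x = torch.randn(6, d_model)      # 6 token embeddings, one row per token

W_q = torch.randn(d_model, d_k)  # the three learnable matrices
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q = x @ W_q  # queries: «what is each token looking for?»
K = x @ W_k  # keys:    «what does each token offer?»
V = x @ W_v  # values:  «what will each token pass on?»

print(Q.shape, K.shape, V.shape)  # torch.Size([6, 4]) for each
```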

Self-attention step by step

The «self» in self-attention means all queries, keys and values come from the same sequence — the one the model is currently processing. Not like in machine translation, where queries came from the decoder and keys from the encoder. In GPT models, query, key, and value are all different roles played by the same input tokens.

Let’s take a concrete example from Raschka’s textbook — the phrase «Your journey starts with one step». Six tokens, computing self-attention.

Step 1. For each token, compute three vectors: q, k, v (multiplying the embedding by Wq, Wk, Wv).

Step 2. Compute a 6×6 match matrix: for each pair (i, j) take the dot product q_i · k_j. This is the attention score — how well the i-th token’s question «matches» the j-th token’s offering.

Step 3. Divide everything by √d_k (where d_k is the key dimensionality). This keeps scores from blowing up in high dimensions and prevents softmax from saturating.

Step 4. Apply softmax to each row. We get attention weights — normalized weights that sum to 1 along each row.

Step 5. For each token i, compute a weighted sum of all values: z_i = Σ w_ij · v_j. This is the context vector — the updated embedding that already takes context into account.
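
Put together, the five steps are a few lines of matrix algebra. A minimal sketch continuing the toy projections above (random weights, so the numbers will not match the illustration below):

```python
import torch

torch.manual_seed(3)
d_model, d_k = 8, 4
x = torch.randn(6, d_model)                  # 6 token embeddings

W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v          # step 1: the three projections

scores = Q @ K.T                             # step 2: 6×6 match matrix of q_i · k_j
scores = scores / d_k ** 0.5                 # step 3: scale by sqrt(d_k)
weights = torch.softmax(scores, dim=-1)      # step 4: each row now sums to 1
context = weights @ V                        # step 5: z_i = sum_j w_ij · v_j

print(weights.shape, weights.sum(dim=-1))    # (6, 6), every row sums to 1.0
print(context.shape)                         # (6, 4): one context vector per token
```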

attention weights for «Your journey starts with one step» (rows: query token, columns: key token; values illustrative, real weights depend on the trained Wq, Wk)

             Your   journey   starts   with    one   step
Your          .42       .20      .10    .08    .08    .12
journey       .10       .38      .32    .08    .07    .05
starts        .08       .30      .36    .10    .08    .08
with          .06       .18      .16    .32    .18    .10
one           .06       .10      .10    .20    .40    .14
step          .05       .16      .18    .10    .16    .35

The 6×6 attention weights matrix for the phrase «Your journey starts with one step». Each row shows which tokens the row's token attends to (softmax makes the weights in each row sum to 1). High values mean strong attention: «journey» attends most strongly to itself and «starts», «one» — to «step». Every token sees every other one.

The key thing to notice: this matrix is completely full. Any token can pull information from any other in one pass. In an RNN, getting a signal from token one to token ten meant ten recurrence steps — the signal decayed at every one. In attention this happens in a single matrix multiplication.

Causal mask: why GPT can’t see the future

GPT models have a subtlety. When the model trains to predict the next token, it must not peek at the right answer.

Imagine: we feed the model «Your journey starts with one», and it has to predict «step». If, while computing attention for the word «one», the model could see «step», the task would be trivial: look at the next token and emit it. The model would learn nothing, because at inference time the next token doesn’t exist yet — that’s what we’re trying to predict.

The fix is the causal mask. Before softmax, all cells in the upper triangle of the attention scores matrix are replaced with negative infinity. After softmax they become zeros: token i can only look at tokens 1…i, never at i+1, i+2, and so on.

causal mask: upper triangle zeroed out (a token can't look at the future)

             Your   journey   starts   with    one   step
Your         1.00         0        0      0      0      0
journey       .30       .70        0      0      0      0
starts        .15       .40      .45      0      0      0
with          .10       .22      .20    .48      0      0
one           .08       .12      .12    .25    .43      0
step          .05       .16      .18    .10    .16    .35
Causal mask. Same matrix, but the upper triangle is zeroed: «Your» (the first token) sees only itself, «journey» — only itself and «Your», and so on. This is critical during training: the model doesn't peek at the next token, otherwise at inference it would fall apart.
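
A minimal sketch of the masking step, continuing the toy attention above (the negative-infinity trick is exactly what happens before softmax):

```python
import torch

torch.manual_seed(3)
seq_len, d_k = 6, 4
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))  # toy projections

scores = Q @ K.T / d_k ** 0.5                       # raw 6×6 attention scores
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))    # upper triangle becomes -inf

weights = torch.softmax(scores, dim=-1)             # -inf turns into exactly 0
context = weights @ V

print(weights[0])  # first token: [1, 0, 0, 0, 0, 0], it sees only itself
```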

The mask is only applied in decoder models (GPT, LLaMA, Claude). In encoders (BERT, classification models) it’s not needed — the task is different there, and tokens freely look both forward and backward. But GPT is autoregressive, generating left to right, so the mask is required.

Multi-head: why we need several heads

One attention head is one «perspective». It learns to relate tokens in some particular way. But there are many kinds of relations in language: subject ↔ verb, adjective ↔ noun, pronoun ↔ antecedent, article ↔ the word it refers to.

If we had just one head, the model would have to pack all these relation types into a single matrix — almost impossible. The fix: run several heads in parallel. Each has its own Wq, Wk, Wv, and each learns its own kind of relation.

A great example from 3Blue1Brown: «The glass ball fell on the steel table, and it shattered». To understand what «it» refers to and what «shattered» means, you need several different attention links at the same time: «ball» ↔ «glass» (property), «table» ↔ «steel» (property), «shattered» ↔ something fragile. One head won’t cut it — you need several, each with its own pattern of weights.

[Diagram: several heads — several patterns of links. Head 1: diagonal + neighbours. Head 2: first token → all. Head 3: long-range links. Each head sees its own matrix of links; results are concatenated and passed on.]
Multi-head attention. Several heads work in parallel; each has its own attention weights matrix and its own Wq, Wk, Wv. One can capture links to immediate neighbours, another — to the first token (often called «attention sink»), a third — long-range links across the whole sequence. GPT-3 has 96 heads per block.

The original «Attention Is All You Need» paper from 2017 had 8 heads. GPT-2 — 12 in the small version, 25 in the largest. GPT-3 — 96 heads per block. And there are many blocks: 12, 24, 96 depending on model size. Each (block × head) combo learns its own kind of relation.

After all heads compute their context vectors, the results are concatenated into one long vector and passed through one more linear projection. That is multi-head attention. The output is an updated vector for each token, holding compressed information from the whole sequence, viewed from several angles at once.
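
A minimal sketch of the multi-head wiring, with toy sizes (in real models d_model is split evenly across heads, e.g. 768 / 12 = 64 dimensions per head; the causal mask is omitted here for brevity):

```python
import torch

torch.manual_seed(4)
seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads                      # each head works in a smaller subspace

x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
W_o = torch.randn(d_model, d_model)              # final output projection

# project once, then split the last dimension into heads: (n_heads, seq_len, d_head)
def split_heads(t):
    return t.view(seq_len, n_heads, d_head).transpose(0, 1)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

scores = Q @ K.transpose(-2, -1) / d_head ** 0.5  # a 6×6 score matrix per head
weights = torch.softmax(scores, dim=-1)           # each head: its own pattern of links
heads = weights @ V                               # (n_heads, seq_len, d_head)

# concatenate the heads back into one long vector per token, then project
out = heads.transpose(0, 1).reshape(seq_len, d_model) @ W_o

print(weights.shape, out.shape)  # torch.Size([2, 6, 6]) torch.Size([6, 8])
```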

What we now know

Static embeddings give a word a single vector. But the meaning depends on context — tokens have to «talk to each other».

Attention is soft search. Each token issues a query, all others show their keys, we get a weighted mix of values.

Q, K, V are three projections of the input embedding through learnable matrices Wq, Wk, Wv. That’s the whole mechanism — three matrices, dot product, softmax.

Causal mask forbids a token from looking into the future. Without it GPT couldn’t learn next-token prediction.

Multi-head runs several heads in parallel, each with its own Wq/Wk/Wv, each learning a different kind of relation. GPT-3 has 96 heads per block.

Attention turned static embeddings into contextual ones. Each token now «knows» its surroundings, and its vector is nudged in the right direction: «mole» in «shrew mole» is now closer to «rodent», while in «one mole of CO2» — closer to «substance amount».

But attention is only half of a transformer block. After it comes a regular feed-forward network, normalization, residual connection. And there are dozens or hundreds of such blocks stacked. In the next, final post of the series, we’ll put it all together and see how a working GPT emerges from these pieces.

Based on Build a Large Language Model (From Scratch) by Sebastian Raschka · manning.com