This is the fifth post in the «AI without magic» series. The first one covered where LLMs even came from. The second — how a model «thinks» one token at a time and why it hallucinates. The third — how text turns into vectors. The fourth — how tokens talk to each other through attention. Now we put it all together and look at the full transformer block — the brick GPT is built from.
Where we left off
In the previous post we worked through attention. The output of every token is an updated vector that has absorbed context. «mole» in «shrew mole» now sits closer to the small mammal, while in «one mole of CO2» it sits closer to the chemistry concept. Good.
But attention is only half a block. The block has four more parts without which a transformer doesn't work: the FFN, the residual connection, normalization, and, most importantly, the fact that there are many blocks. Dozens or hundreds: GPT-3 has 96.
In this post we assemble the full block, stack them into a tower, and watch how a token’s vector rolls through the stack from embedding to next-word prediction.
FFN: what each token does on its own
Attention is an operation between tokens. Every token looks at the others, gathers a weighted mix of their value vectors, and writes the result back to itself. But if you think about it, attention is essentially a linear operation: all it does is weighted summation. There’s almost no nonlinearity (just softmax inside the weights).
For a model to actually compute things — to derive that «glass» implies «fragile», that «Moscow» is the capital of Russia, that after «sin(0) =» should come «0» — you need nonlinear transforms. That’s what the next part of the block, the feed-forward network (FFN, sometimes called MLP), provides.
FFN is embarrassingly simple: two linear projections with a nonlinearity in the middle.
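In code it looks roughly like this (a minimal NumPy sketch; the 4× expansion and the GELU nonlinearity follow GPT-style models, while LLaMA uses a slightly different SwiGLU variant):

```python
import numpy as np

def gelu(x):
    # smooth nonlinearity used in GPT-style models (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    # x: (d_model,), W1: (d_model, 4*d_model), W2: (4*d_model, d_model)
    h = gelu(x @ W1 + b1)   # expand into the wide hidden layer
    return h @ W2 + b2      # project back down to d_model
```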
The clever bit is the expansion. The first projection multiplies the dimensionality by 4. Why? A wider hidden layer means more compute capacity, more «rooms» to recognize patterns and store facts. In GPT-3 this turns into monstrous numbers: 12288 × 49152 = ~600 million parameters in a single projection in a single block. There are 96 such blocks, each with two such matrices — so FFNs eat about two thirds of all model parameters.
And here’s where it gets interesting. In 2021, Geva et al. showed that FFN is not «just a nonlinearity» as everyone used to think. It’s an associative memory. The columns of the first matrix act as «keys» that activate on specific input patterns: «past tense», «mention of the Eiffel Tower», «concept of fragility». The rows of the second matrix are «values» that, when activated, add specific information to the output.
So when we say «the model knows something» — that Moscow is the capital of Russia, that Jean Valjean stole bread, that chromium has atomic number 24 — that knowledge physically lives in FFN weights. Not in attention. Attention handles connections, FFN handles knowledge. Convenient division of labor: attention mixes tokens with each other, FFN mixes features within a single token.
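The key-value reading is easy to see in the sketch above (this is the same computation as feed_forward, just written out as a lookup; the "key"/"value" labels follow Geva et al.):

```python
def feed_forward_as_memory(x, W1, b1, W2, b2):
    # reuses gelu and the same matrices as feed_forward above
    scores = gelu(x @ W1 + b1)      # how strongly each "key" (a column of W1) matches the input
    out = np.zeros_like(x)
    for i, s in enumerate(scores):
        out += s * W2[i]            # add the i-th "value" (a row of W2), weighted by its key's score
    return out + b2
```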
Residual connection: a highway through the whole network
Now the main question. GPT-3 has 96 blocks. LLaMA-2 70B has 80. Models like GPT-4 probably have around a hundred (OpenAI doesn’t disclose exact numbers). And here a problem appears that isn’t immediately obvious.
Networks of dozens of layers are fundamentally hard to train with gradient descent. The error signal at the output has to make it back to the first layer. If every layer does something to that signal — multiplies by a matrix, runs it through a nonlinearity — then at every step the signal can decay (or explode). After 50 layers it becomes practically zero. The first layers stop learning.
This problem killed deep networks until 2015. It got solved in ResNet with a very simple idea: instead of every layer replacing the vector, let it add a correction.
before: y = sublayer(x)
after: y = x + sublayer(x)
That’s the residual connection (or skip connection). One little plus that changes everything.
What it gives you. First, on the backward pass the gradient gets a direct route back to the start of the network. Through addition it passes unchanged — no matter what horrors are happening inside the sublayer, the signal arrives at the first layer through this «highway» at full strength. Anthropic in their circuits paper call this highway the residual stream — a shared communication channel through the whole network that every layer writes to and every layer reads from.
Second, on the forward pass the model can skip a layer if it’s not needed. If a sublayer outputs something close to zero, the residual connection just passes the vector through unchanged. Layers become optional corrections — some are needed almost always, others fire only in specific situations.
In a transformer block the residual connection appears twice — once around attention and once around FFN:
after attention: x = x + attention(x)
after FFN: x = x + ffn(x)
The same residual stream vector flows through the whole block, getting two corrections along the way.
Normalization: keeping numbers from drifting off
There’s one more subtlety. Every time we add something to the residual stream, its magnitude can grow. After 96 blocks the numbers in the vector can fly off into space — and the network blows up. Or the opposite — if corrections are systematically negative, the vector shrinks to zero.
To prevent that, after every sublayer there’s a normalization — an operation that brings the vector back to a standard scale. The most popular one is LayerNorm: for each vector subtract the mean, divide by the standard deviation, multiply by learned parameters. The output is a vector with predictable magnitude that the next layer can work with stably.
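In code (a sketch; gamma and beta are the learned scale and shift parameters, and eps just protects against division by zero):

```python
def layer_norm(x, gamma, beta, eps=1e-5):
    # x is the d_model-sized vector for one token (a NumPy array)
    return gamma * (x - x.mean()) / (x.std() + eps) + beta
```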
There’s a technical detail here that turns out to matter. In the original 2017 paper Vaswani et al. put the norm after the residual connection:
post-norm (2017): x = LayerNorm(x + sublayer(x))
That seemed logical: first add the correction, then normalize the result. But a few years later it turned out that this scheme trains poorly at depth. Xiong et al. 2020 showed: if you move the norm before the sublayer, training becomes much more stable, you don’t need fancy warmup schedules, you can train networks hundreds of layers deep:
pre-norm (modern): x = x + sublayer(LayerNorm(x))
All modern models — GPT-2, GPT-3, LLaMA, Claude, Gemini — use pre-norm. The intuition is simple: in post-norm the gradient on the backward pass has to pass through every LayerNorm, and that slowly «strangles» it. In pre-norm the gradient has a clean highway through the residual without LayerNorms in the way.
One more upgrade not present in 2017: modern models like LLaMA use RMSNorm instead of LayerNorm. RMSNorm is a stripped-down version: it skips the mean-subtraction step and keeps only the rescaling. A bit faster, a bit simpler, almost no impact on quality. A typical example of how the architecture keeps getting cleaned up after 2017 — drop a redundant operation and become more efficient.
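For comparison, RMSNorm is the same function with the mean-subtraction step (and the beta shift) removed; here's a sketch of the LLaMA-style version:

```python
def rms_norm(x, gamma, eps=1e-6):
    # no mean subtraction, no beta: just rescale by the root mean square
    return gamma * x / np.sqrt(np.mean(x**2) + eps)
```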
One block, end to end
Putting it all together. A full modern-style transformer block is:
```python
def transformer_block(x):
    x = x + multi_head_attention(layer_norm(x))  # sublayer 1
    x = x + feed_forward(layer_norm(x))          # sublayer 2
    return x
```
Two sublayers, each wrapped in pre-norm and residual. Input is a vector of size d_model per token; output is an updated vector of the same size. The dimensionality is preserved end to end because blocks stack on top of each other and have to fit together.
Strip out the residual stream and leave only the sublayers — the network won’t work. Strip out LayerNorm — numbers explode after a dozen blocks. Strip out FFN — the model loses nonlinearity and most of its knowledge. Every brick is needed.
The stack: why there are so many blocks
A single block adds two corrections to the residual stream — one from attention, one from FFN. That’s enough for one round of «refining» the vector. But language is complicated: to understand a sentence you need to deal with syntax, then with phrase meaning, then with style, then with hidden references. One block can’t carry all of that.
So there are many blocks, arranged in a stack: dozens of architecturally identical blocks, each with its own weights. Every block takes the previous one’s output as input and adds its own batch of corrections to the residual stream.
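In code the stack is nothing more than a loop (a sketch; each element of blocks is a transformer_block-style function carrying its own weights):

```python
def transformer_stack(x, blocks):
    for block in blocks:
        x = block(x)   # each block adds its own corrections to the residual stream
    return x
```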
The most beautiful part: block roles are not programmed. Nobody writes «block 1, you handle syntax; block 50, semantics; block 90, abstraction». That distribution emerges on its own, during gradient-descent training. Interpretability research (Anthropic, Geva et al. 2021) confirms it: lower blocks specialize in surface patterns (letters, morphology, n-grams), middle blocks in phrase meaning and facts, upper blocks in abstract concepts and the final token choice.
Depth and width are two independent scaling levers. Here are the sizes of a few well-known models for scale calibration:
| Model | Blocks | Heads | d_model |
|---|---|---|---|
| Original transformer (2017) | 6 | 8 | 512 |
| GPT-2 small | 12 | 12 | 768 |
| GPT-2 XL | 48 | 25 | 1600 |
| GPT-3 (175B) | 96 | 96 | 12288 |
| LLaMA-2 7B | 32 | 32 | 4096 |
| LLaMA-2 70B | 80 | 64 | 8192 |
When people say «model N times bigger», they usually mean parameter count, which is roughly N_blocks × d_model² × ~12. Both levers make the model smarter, but in different ways: more blocks — deeper reasoning, larger d_model — more «rooms» for information per vector.
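A quick sanity check of that estimate against the table (the ×12 is roughly 4·d_model² for the attention projections plus 8·d_model² for the two FFN matrices per block; embeddings and biases are ignored):

```python
def approx_params(n_blocks, d_model):
    # ~4*d^2 (attention) + ~8*d^2 (FFN) per block
    return n_blocks * 12 * d_model**2

print(approx_params(96, 12288) / 1e9)  # ~174, close to GPT-3's advertised 175B
print(approx_params(32, 4096) / 1e9)   # ~6.4, close to LLaMA-2 7B
```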
Closing the loop
Now we have everything to walk through GPT from start to finish in one pass. I’ll lay it out as a sequence — each step covered in one of the posts in the series; a short code sketch of the whole loop follows the list.
1. Tokens. Text is broken into subword chunks via BPE; the vocabulary is usually 50–100K tokens (post 3).
2. Embedding. Each token is replaced with a vector of size d_model via a learned embedding lookup table. Positional encoding is added so that the model knows the order of tokens (post 3).
3. Block stack. Each token’s vector goes through N blocks. Inside each block — multi-head attention (tokens look at each other, post 4) and FFN (each token processed independently), both with pre-norm and residual.
4. Final normalization and unembedding. At the output of the last block, the vector for the last token in the sequence (the one we want to predict after) is normalized and multiplied by the unembedding matrix, typically just the transpose of the embedding matrix. This turns a d_model vector into logits the length of the entire vocabulary.
5. Softmax → distribution. The logits go through softmax, giving probabilities over all 50000 vocabulary tokens: «space» — 0.41, «forward» — 0.18, «moment» — 0.07, and so on (post 2).
6. Sample. We pick one token from the distribution (using temperature, top-k and top-p — that’s a separate story), append it to the context, and go back to step 2 with the extended sequence.
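All of it fits in a dozen lines of pseudocode (a sketch: embed, blocks, final_norm and W_embed stand in for the real learned components, and positional encoding is left out for brevity):

```python
def generate_step(token_ids, embed, blocks, final_norm, W_embed):
    x = embed(token_ids)                      # 2. token vectors, shape (seq_len, d_model)
    for block in blocks:                      # 3. the block stack
        x = block(x)
    logits = final_norm(x[-1]) @ W_embed.T    # 4. unembed the last token's vector
    probs = np.exp(logits - logits.max())     # 5. softmax -> distribution over the vocabulary
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)  # 6. sample the next token
```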
That’s the full life cycle of a single generation step. On a long answer this whole machine spins hundreds of times — once per token. When you see chat output streaming word by word, that’s literally the generation tempo.
There’s no secret component in the architecture. Everything we’ve covered is what Vaswani et al. described in one paper in 2017, plus a handful of technical improvements in the years since (pre-norm, RMSNorm, RoPE, slightly different nonlinearity). The magic isn’t in the architecture. The magic is that this fairly simple thing was trained on a giant text corpus with the task of predicting the next token, and in the process the model on its own developed everything we call «knowledge»: grammar, facts, the ability to reason, style, a sense of humor, the ability to write code. FFN weights filled with patterns, attention heads learned types of relationships, the residual stream learned to carry information between layers.
Further into the series we can go deeper — into training and scaling, into fine-tuning and RLHF (how a «next-token completer» becomes ChatGPT), into new architectural experiments (mixture of experts, mamba, diffusion text models). There’s plenty of room.
Further reading

- Attention Is All You Need. The 2017 paper that first described the transformer block in essentially its current form.
- A Mathematical Framework for Transformer Circuits. Where the «residual stream» framing comes from — a shared communication channel between all layers. Useful vocabulary for talking about interpretability.
- Transformer Feed-Forward Layers Are Key-Value Memories. A very elegant paper showing that FFN inside a block isn't just a «nonlinearity» — it's an associative memory where the model stores facts.
- On Layer Normalization in the Transformer Architecture. Why GPT-2/3, LLaMA and everyone else moved to pre-norm instead of the original post-norm. Short answer — training stability at depth.