This is the first post in the AI Demystified series. The idea is simple: take large language models and AI agents apart and show how they actually work, without hype and without formulas that scare you off the first page. This post covers a short history of how we got from 1948 statistics to GPT-4, and what an LLM actually is once you strip away the words “artificial intelligence.”
Magic that isn’t
When an everyday user first sees ChatGPT answer a question, explain code, or write an email, it looks like magic. So that’s how most users treat LLMs — like magic. Something mysterious inside a black box, sometimes telling the truth, sometimes “hallucinating,” running on rules you can’t possibly understand.
There’s no magic in there. Inside is a fairly straightforward thing with a concrete history, concrete authors, and a concrete moment when it started working. And once you take an LLM apart, you find it isn’t intelligence in the human sense of the word at all. It’s a very, very good autocomplete.
Remember how your iPhone suggests the next word as you’re typing a message? You type “Hi, how” and the iPhone offers “are.” You accept it, it offers the next one. You can keep going for a while and end up with nonsense like “Hi, how are you doing tonight I don’t know what to say.” An LLM does exactly the same thing — only much, much better.
What a language model is
A language model is a function: given a piece of text, it outputs a probability distribution over the next word. Feed it “I arrived in Moscow and went to the…” and it answers with something like { museum: 0.12, metro: 0.09, store: 0.07, … }. Then one word is sampled from that distribution, appended to the text, and the process repeats.
That’s it. No exaggeration — that’s all an LLM does. When you talk to ChatGPT, what’s running under the hood is a loop: “given the current text, compute a distribution → pick one word → append it → repeat.” Every word in the response is chosen separately, based on everything said up to that point.
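Written out in code, that loop really is just a few lines. Here is a minimal sketch in Python; the `model` object and its `next_token_distribution` method are hypothetical stand-ins for the real network:

```python
import random

def generate(model, prompt_tokens, max_new_tokens=50):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # hypothetical call: returns e.g. {"museum": 0.12, "metro": 0.09, ...}
        dist = model.next_token_distribution(tokens)
        words, probs = zip(*dist.items())
        tokens.append(random.choices(words, weights=probs)[0])  # sample one word, append it
    return tokens
```

Everything interesting hides inside `next_token_distribution`; the loop around it never changes.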
The complexity isn’t in the idea, it’s in how well the model can compute that distribution. The whole history of LLMs is the story of teaching a computer to do that one step better.
1948: n-grams and Shannon
The story begins earlier than most people think. In 1948 Claude Shannon published “A Mathematical Theory of Communication,” in which, among other things, he showed that you can build a statistical model of English simply by counting how often certain letters and words follow one another.
The idea is simple. Take a large corpus of text and count: 47% of the time the word “artificial” is followed by “intelligence,” 12% by “satellite,” 4% by “light,” and so on. That’s a bigram: a model where the next word depends only on the one preceding it. A trigram looks at the two previous words, an n-gram at the previous n−1.
An n-gram model is already an LLM in embryo. Input: context. Output: a distribution over the next word. Architecture: a frequency table.
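To make “architecture: a frequency table” concrete, here is a toy bigram model in Python; the ten-word corpus is obviously made up:

```python
from collections import Counter, defaultdict

def train_bigram(words):
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1                               # how often nxt follows prev
    # turn raw counts into probability distributions
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(model["the"])   # {'cat': 0.666..., 'mat': 0.333...}
```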
The problem with n-grams is that they hit a ceiling fast. To use a long context, the model needs tables for every possible 5-tuple, 6-tuple, 7-tuple of words. The number of combinations grows exponentially, and most of them never appear in real text. Trigrams work poorly, 5-grams work slightly better, but past that there’s a wall. NLP researchers spent about sixty years trying to break through that wall with various tricks, and in general, they didn’t.
2014: seq2seq and the bottleneck
The real breakthrough came when neural networks took over from frequency counters. In 2014 several groups almost simultaneously proposed an encoder-decoder architecture on top of recurrent networks (RNNs). The idea: one network (the encoder) reads the input sentence word by word, and at the end produces a single fixed-length vector — the “meaning” of the whole sentence. A second network (the decoder) takes that vector and unrolls it into the output word by word.
It worked for machine translation. Feed in “I love coffee” in English, the encoder compresses it into a vector, the decoder unrolls it into “Я люблю кофе.” A miracle compared to n-grams. But the architecture had a fundamental problem: the bottleneck.
For short sentences — three to five words — the vector copes. But imagine translating a paragraph of War and Peace. Cramming the meaning of an entire paragraph into one vector of 256 numbers is hopeless. Translation quality on long sentences fell off a cliff.
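To see where the bottleneck comes from, here is a toy sketch with random, untrained weights and made-up sizes: however long the input, the encoder’s entire output is one fixed-size vector.

```python
import numpy as np

HIDDEN = 256                                           # size of the "meaning" vector
W_in  = np.random.randn(HIDDEN, HIDDEN) * 0.01         # toy, untrained weights
W_rec = np.random.randn(HIDDEN, HIDDEN) * 0.01

def encode(token_vectors):
    """Toy RNN encoder: read tokens one by one, return only the final state."""
    h = np.zeros(HIDDEN)
    for x in token_vectors:                            # strictly sequential
        h = np.tanh(W_in @ x + W_rec @ h)              # update the single state vector
    return h                                           # 256 numbers, no matter the input length

paragraph = [np.random.randn(HIDDEN) for _ in range(300)]   # a long "paragraph" of 300 tokens
meaning = encode(paragraph)
print(meaning.shape)                                   # (256,): the decoder sees only this
```

Three words or three hundred, the decoder only ever sees those 256 numbers.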
2015: attention as a soft search
In September 2014 Dzmitry Bahdanau, together with Kyunghyun Cho and Yoshua Bengio, posted a paper with a long title: “Neural Machine Translation by Jointly Learning to Align and Translate” (presented at ICLR in 2015, which is why it is usually cited as Bahdanau 2015). In it they proposed the mechanism we now all know as attention.
The idea is brilliantly simple. Instead of forcing the encoder to compress everything into one vector, let’s keep one vector per input word. The decoder, at each step, will decide for itself which input word to “look at.” When the decoder generates “кофе,” it looks mostly at “coffee.” When it generates “люблю,” it looks at “love.”
Technically this works as a soft search. Every input word has a “key” and a “value.” The decoder, at each step, has a “query.” The dot product of the query with each key gives relevances; softmax turns them into a probability distribution; that distribution is then used as weights to average the values. The output is a vector weighted toward the inputs that are most relevant right now.
Attention is a search by keys and values, except instead of finding one result, you get a weighted blend of all of them. Hence the term “soft search.” Unlike a regular lookup, you don’t pick one row from a table — you get a mixture, with proportions determined by how well each row matches the query.
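Here is that soft search in a dozen lines of NumPy. This is the dot-product form later used in transformers; Bahdanau’s original paper scored relevance with a small neural network rather than a dot product, but the shape of the operation is the same. All the sizes are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    """Soft search: weight every value by how well its key matches the query."""
    scores = keys @ query                    # one relevance score per input word
    weights = softmax(scores)                # turn scores into probabilities that sum to 1
    return weights @ values                  # weighted blend of all the values

keys   = np.random.randn(5, 64)              # one key vector per input word
values = np.random.randn(5, 64)              # one value vector per input word
query  = np.random.randn(64)                 # what the decoder is looking for right now
context = attention(query, keys, values)     # shape (64,): a blend, not a single row
print(context.shape)
```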
Machine translation immediately started working on long sentences. But nobody yet realized how important the thing was. Attention was seen as a useful add-on to RNN encoders and decoders.
2017: the transformer, or do we need RNNs at all?
In June 2017 a team at Google Brain published a paper with a provocative title: “Attention Is All You Need.” The authors — Vaswani et al. — proposed dropping RNNs entirely. If attention is so good, let’s build an architecture with no recurrent connections at all, just attention.
The result is the transformer. The input is split into tokens (roughly words), each token becomes a vector. Those vectors then pass through a stack of identical layers, and in each layer every token “looks at” all the others through the same soft search as in Bahdanau 2015. After every attention layer comes a small fully-connected network, then normalization. Stack a few of these blocks in a row: six in the original paper, dozens or hundreds in modern GPTs.
The transformer beat RNNs on two fronts at once. First, it parallelizes well: in an RNN every next state depends on the previous one, so training is fundamentally sequential. In a transformer all input tokens are processed in parallel — a perfect fit for a GPU. Second, it handles long-range dependencies better, because any token can “fetch” information about any other in a single attention layer, instead of having to traverse dozens of RNN steps where the signal decays.
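For a sense of how little is inside one block, here is a stripped-down sketch of a single transformer layer: single-head attention, no positional encoding or causal masking, random untrained weights, made-up sizes.

```python
import numpy as np

D = 64                                                   # embedding size (made up)
Wq, Wk, Wv = (np.random.randn(D, D) * 0.02 for _ in range(3))
W1, W2 = np.random.randn(D, 4 * D) * 0.02, np.random.randn(4 * D, D) * 0.02

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def transformer_block(tokens):                           # tokens: (seq_len, D)
    # 1. self-attention: every token builds a query and looks at all the others
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attended = softmax(q @ k.T / np.sqrt(D)) @ v
    x = layer_norm(tokens + attended)                    # residual connection + normalization
    # 2. a small fully-connected network applied to each token independently
    ff = np.maximum(0, x @ W1) @ W2
    return layer_norm(x + ff)

x = np.random.randn(10, D)                               # 10 token vectors
for _ in range(6):                                       # six blocks, as in the 2017 paper
    x = transformer_block(x)                             # (reusing the same toy weights for brevity)
print(x.shape)                                           # (10, 64): same shape in, same shape out
```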
The transformer’s internals deserve their own post — there’s a lot going on: positional encoding, multi-head attention, causal masking. We’ll come back to that. For now, one thing matters: since 2017, every large language model is a variation of this same architecture.
2018: GPT — let’s drop half of it
The original Vaswani 2017 transformer was designed for machine translation and had two parts: an encoder reads English, a decoder writes French. In June 2018 the OpenAI team led by Alec Radford published “Improving Language Understanding by Generative Pre-Training” (GPT-1 for short), in which they took a radical step: keep only the decoder.
The logic goes like this. If the task is to continue text, you don’t need a separate encoder for the input. The input and the output are the same thing: a long sequence of words. Take the transformer decoder, train it to predict the next token on a huge corpus of text — and that’s it. You get a universal model that can continue any text.
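The training data needs no labels at all: the text itself provides them, because the target is simply the same text shifted one token forward. A sketch of what the (context, target) pairs look like:

```python
text = "I arrived in Moscow and went to the metro".split()

# Every prefix of the text is a training input; the word right after it is the target.
for i in range(1, len(text)):
    context, target = text[:i], text[i]
    print(context, "->", target)
# ['I'] -> arrived
# ['I', 'arrived'] -> in
# ... and so on, all the way to ['I', ..., 'the'] -> metro
```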
GPT-1 had 117 million parameters and was trained on a corpus of around 7,000 books. Even then it showed that one and the same pre-trained model, fine-tuned for different tasks, beat specialized NLP models almost everywhere — from classification to question answering. But it was still research, not a mass-market product.
2019: GPT-2 and scale as magic
In February 2019 OpenAI released the GPT-2 paper. Same architecture as GPT-1, same training regime, but 10× more parameters (1.5 billion) and 10× more data (40 GB of internet text). And here something interesting happened.
Our model, called GPT‑2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model.
— OpenAI, Better Language Models and Their Implications, 2019
Re-read that. A model trained literally only to “guess the next word” on a pile of internet text turned out so good that its authors deemed it dangerous to release publicly. They first put out only a stripped-down 117-million-parameter version, then over nine months gradually rolled out the rest.
It’s worth pausing here. Six years earlier, in 2013, NLP engineers were fighting over an extra BLEU point on machine translation. And now a model nobody had specifically taught to answer questions or write essays did both, purely because all of that was already there in the 40 GB of internet text. It learned those tasks not because it was trained on tasks, but because to predict the next word well in arbitrary text, you have to understand a lot of things.
This is the first hint of emergent abilities — capabilities that show up by themselves when you make the model bigger. There are no new ideas in the architecture between GPT-1 and GPT-2. Just more parameters and more data. And that changed everything.
2020–2023: GPT-3, GPT-4, and the law of scale
GPT-3 in 2020 — 175 billion parameters, 100× bigger than GPT-2. Few-shot and in-context learning showed up: give the model two or three examples of a task in the prompt, and it solves a similar one by analogy — without any fine-tuning. GPT-4 in 2023 — OpenAI didn’t publish the exact size, but rumours put it at a mixture-of-experts of trillions of parameters. Multimodality arrived: the model can see images.
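What “two or three examples in the prompt” means in practice is literally text like this; the English-to-French task here is just an illustration:

```python
# A few-shot prompt: the "training examples" live inside the prompt itself.
prompt = """Translate English to French:
sea otter -> loutre de mer
peppermint -> menthe poivrée
cheese ->"""
# The model simply continues the text, and the most likely continuation is "fromage".
# No weights were updated; the "learning" happened entirely in context.
```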
With each transition — GPT-1 → 2 → 3 → 4 — the model got better at everything you could test it on, and would sometimes suddenly start doing things the previous version couldn’t do at all. Multi-step arithmetic, getting jokes, writing correct Python, translating between languages without explicit training. That’s what emergent abilities are: capabilities you can’t predict in advance, that “surface” at certain scales.
From a research perspective this is even slightly humbling: it turns out the last six or seven years of “AI progress” has been, in large part, the story of taking the 2017 architecture and training it ten times harder. There’s almost no fundamentally new idea in the model itself. The engineering wins, on the other hand, are huge: distributed training across thousands of GPUs, RLHF (reinforcement learning from human feedback) for aligning answers with human preferences, inference optimization.
What we know now
Putting it all together, the picture is this. An LLM is an autocomplete trained on a huge corpus of text to predict the next token. The architecture is the transformer, invented in 2017 for machine translation. From GPT-1 to GPT-4 the architecture barely changed — what changed was scale. And it’s scale that produced what looks like magic from the outside.
That doesn’t mean the LLM is “stupid.” To predict the next token well in an arbitrary chunk of the internet, the model has to learn a lot: grammar, facts about the world, arithmetic rules, programming-language syntax, communication norms. All of it lives implicitly in its weights. But at heart it’s still a very, very good autocomplete — not a mind.
In the next post in this series we’ll take this autocomplete apart at inference time: what tokens and context are, why the model “hallucinates,” and why ChatGPT sometimes asks you to “think step by step.” After that — three deep dives: tokenization, attention in detail, and how all these little pieces assemble into a working transformer.
Further reading
- Attention Is All You Need. The 2017 paper that introduced the transformer.
- Neural Machine Translation by Jointly Learning to Align and Translate. The paper where attention first appeared, still inside an RNN encoder–decoder.
- Improving Language Understanding by Generative Pre-Training. The original GPT, the first thing to bear that acronym.
- Better Language Models and Their Implications. The GPT-2 announcement, the model OpenAI initially refused to release in full.