This is the third article in the «AI without magic» series. The first explained that an LLM is a giant autoregressive token predictor. The second covered what «thinking one token at a time» means and why models hallucinate. Now we go a level deeper: how exactly text turns into numbers the model can work with.

Why we need numbers at all

A neural network can do one thing — multiply huge matrices of numbers and apply nonlinearities. It cannot understand letters or words directly. So the first thing any LLM does with your prompt is convert text into numbers. And not via a simple «letter → ASCII code» mapping — there are two stages.

Stage 1 — tokenizer: splits text into chunks (tokens) and assigns each one an integer. «Hello, how are you?» → [Hello][,][·how][·are][·you][?] → [9906, 11, 1268, 527, 499, 30].

Stage 2 — embedding layer: each integer is replaced with a long vector of real numbers, length 768, 4096, or even 12288. These vectors are the input to the rest of the model.

The path is: text → tokens (integers) → vectors. From there, attention and the rest of the transformer take over. But first — let’s look at both stages separately.
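To make the pipeline concrete, here is a minimal sketch of both stages in Python: tiktoken (OpenAI's tokenizer library) for stage 1, and a random matrix standing in for a trained embedding table in stage 2.

```python
import numpy as np
import tiktoken  # pip install tiktoken

# Stage 1: text -> token IDs
enc = tiktoken.get_encoding("cl100k_base")   # GPT-3.5/GPT-4 vocabulary
ids = enc.encode("Hello, how are you?")
print(ids)                                   # [9906, 11, 1268, 527, 499, 30]

# Stage 2: token IDs -> vectors.
# In a real model this table holds trained weights; a random one
# is used here only to show the lookup mechanics.
dim = 768                                    # GPT-2-sized embedding width
rng = np.random.default_rng()
table = rng.standard_normal((enc.n_vocab, dim), dtype=np.float32)
vectors = table[ids]                         # one row per token
print(vectors.shape)                         # (6, 768)
```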

BPE: how the token vocabulary came to be

The first design question: how do we cut text into chunks? Options:

By letter. Vocabulary is tiny (~30-100 symbols), but sequences become huge. The word «strawberry» — 10 model steps instead of one. Expensive.

By word. Sequences are short, but the vocabulary balloons to millions of entries, and any typo («strawbery») becomes an unknown token.

Something in between. Frequent words whole, rare ones in pieces. This is BPE (Byte Pair Encoding) — the algorithm behind GPT-2, GPT-3, GPT-4, Claude, Llama, almost everything.

The idea is simple: don’t guess in advance, look at a corpus and iteratively merge the most frequent pairs of symbols into single tokens.

corpus: «low low lower lowest»

step 0 — letters: l · o · w · l · o · w · l · o · w · e · r · l · o · w · e · s · t
step 1 — merge «l»+«o» → «lo»: lo · w · lo · w · lo · w · e · r · lo · w · e · s · t
step 2 — merge «lo»+«w» → «low»: low · low · low · e · r · low · e · s · t
step 3 — merge «low»+«e» → «lowe»: low · low · lowe · r · lowe · s · t
step 4 — merge «lowe»+«r» → «lower»: low · low · lower · lowe · s · t

result: «low» and «lower» are tokens, «lowest» = lowe + s + t
BPE starts with letters and merges the most frequent pairs. After a few thousand iterations «low» becomes a single token, «lower» too, and the rare «lowest» is assembled from pieces. Real GPT-2 does ~50,000 such merges.

In practice GPT-2 starts with 256 bytes (any Unicode character can be represented as bytes) and does ~50,000 merges. The result is a vocabulary of ~50,257 tokens where «the», «hello», «def» are whole tokens, while «antidisestablishmentarianism» is assembled from pieces.
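The whole training loop fits in a few lines. Below is a toy character-level version (real tokenizers like GPT-2's operate on bytes and handle tie-breaking and special tokens more carefully); run on our corpus, it reproduces the merge table above:

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int):
    """Toy BPE: start from single characters, repeatedly merge the most frequent adjacent pair."""
    words = [list(w) for w in corpus.split()]    # merges never cross word boundaries
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))          # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # ties break by first occurrence
        merges.append(a + b)
        for w in words:                          # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == (a, b):
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

merges, words = bpe_train("low low lower lowest", num_merges=4)
print(merges)  # ['lo', 'low', 'lowe', 'lower']
print(words)   # [['low'], ['low'], ['lower'], ['lowe', 's', 't']]
```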

What GPT actually sees

Let’s look at concrete numbers. We run the same phrase in two languages through tiktoken (OpenAI’s official library):

«Hello, how are you?» → [Hello][,][·how][·are][·you][?] — 6 tokens
«Привет, как дела?» — 8 tokens
Russian pangram «Съешь же ещё этих мягких французских булок…» — 36 tokens
«The quick brown fox jumps over the lazy dog» — 10 tokens
Same meaning, but Cyrillic gets 1.5–3× more tokens. That's because BPE was trained on a corpus where English was vastly overrepresented. (Measured with the cl100k_base encoding used by GPT-3.5/GPT-4.)
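These numbers are easy to reproduce with tiktoken:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 encoding

for text in ["Hello, how are you?", "Привет, как дела?"]:
    ids = enc.encode(text)
    print(f"{len(ids)} tokens: {[enc.decode([i]) for i in ids]}")
# The English phrase comes out as 6 tokens, the Russian one as 8,
# with the extra tokens coming from Cyrillic words splitting into sub-word pieces.
```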

This explains a few practical things:

Tokens are the unit of billing. The same prompt in Russian costs 1.5-3× more in the API than in English, and hits the context limit faster. That’s why many apps internally translate to English.

Emoji and rare symbols often expand to 3–4 tokens each, so a handful of emoji can «eat» a dozen tokens.

The letter-counting problem (remember «how many R’s in strawberry» from the previous post?) is exactly this: the model doesn’t see the sequence s-t-r-a-w-b-e-r-r-y, it sees three tokens [str][aw][berry]. The letters disappear at tokenization.
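You can see this directly by decoding each token of «strawberry» separately:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode("strawberry")])
# ['str', 'aw', 'berry']: the model gets three opaque chunks,
# and no token boundary reveals how many R's hide inside «berry»
```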

The newer o200k_base encoding (GPT-4o, GPT-5) is twice as efficient for Russian — 19 tokens instead of 36 on our pangram. But even then, Russian text remains roughly half as token-dense as English.
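A side-by-side check of the two encodings (the pangram is spelled out in full here, assuming the standard «…да выпей чаю» ending that the ellipsis above cuts off):

```python
import tiktoken

pangram = "Съешь же ещё этих мягких французских булок, да выпей чаю"
for name in ["cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(pangram)))
# o200k_base needs roughly half as many tokens for the same Russian text
```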

Embedding: token → vector

Okay, the text became a list of integers: [9906, 11, 1268, 527, 499, 30]. What now? Feeding these raw integers into a matrix is a bad idea: the integer 9906 is no «better» or «worse» than 499, and there is no semantic closeness between them. IDs are just addresses.

So the first layer of any LLM is the embedding layer. It’s a giant table:

— rows: vocabulary size (~50,000 for GPT-2, ~200,000 for GPT-4o);
— columns: vector dimensionality (768 for small GPT-2, 12,288 for GPT-3, ~16,000+ for modern models).

Each token ID is replaced with a row from this table — a long vector of real numbers. At the start of training those numbers are random. But during training they get adjusted along with the rest of the model so that semantically similar tokens get similar vectors.
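In PyTorch this layer is literally one lookup table; a minimal sketch with GPT-2's dimensions:

```python
import torch
import torch.nn as nn

vocab_size, dim = 50257, 768         # GPT-2: vocabulary size x embedding width
emb = nn.Embedding(vocab_size, dim)  # one trainable (50257, 768) matrix, randomly initialized

ids = torch.tensor([9906, 11, 1268, 527, 499, 30])  # any IDs below vocab_size
vectors = emb(ids)                   # row lookup, differentiable like any other layer
print(vectors.shape)                 # torch.Size([6, 768])
```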

And here it gets interesting.

The geometry of meaning

If you project trained embeddings down to 2D (via t-SNE or UMAP), it turns out words cluster by meaning. Birds — together. Countries — a separate cluster. Seasons — their own. This figure from Raschka’s textbook is a classic:

[Figure: a 2D projection of word embeddings (axes: first and second dimension). Birds (eagle, duck, goose) form one cluster; country–capital pairs (Germany–Berlin, France–Paris, Italy–Rome) line up along a «capital» direction; long–longer–longest along a «degree» direction.]
Semantically similar words cluster together. And the difference between two words often encodes a meaningful operation: «France → Paris» and «Italy → Rome» are the same direction in space, the «capital» direction.

This is the famous observation about embeddings published by Mikolov et al. in 2013 in the word2vec paper: vector arithmetic on word embeddings has meaning.

king − man + woman ≈ queen

Paris − France + Italy ≈ Rome

walking − walk + swim ≈ swimming

In embedding space there are stable directions: «gender», «capital», «past tense». And they can be added to any word.

It sounds like magic — but it's not. The model was trained on billions of texts where it saw «Paris is the capital of France» and «Rome is the capital of Italy». To learn to predict such phrases, it had to encode the «capital-of» relationship as a reproducible direction in vector space. Otherwise it couldn't have done the prediction task well.

A small honest caveat: the classic king − man + woman ≈ queen example only works because, when finding the nearest vector, the algorithm excludes the source words from the answer. Without that exclusion, the closest vector to king − man + woman is… king itself. queen only comes second. But the underlying idea — that directions have meaning — still holds.
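You can try this yourself with pretrained vectors; a sketch using gensim's downloadable GloVe embeddings (note that most_similar excludes the query words from the results, which is exactly the caveat above):

```python
import gensim.downloader as api  # pip install gensim

# Small pretrained GloVe vectors (~70 MB download on first use)
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman: "positive" vectors are added, "negative" subtracted,
# and the query words themselves are excluded from the answer
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# «queen» typically appears at the top of the list
```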

Contextual embeddings: one token, many vectors

Word2vec (and its descendants: GloVe, fastText) gave one vector per word. The word «bank» gets one vector regardless of meaning: financial institution or river edge. These are static embeddings.

Modern LLMs work differently. The embedding layer gives a starting vector — but then this vector flows through dozens of transformer layers, and at each layer it mixes with the vectors of surrounding tokens. By the time the «bank» token reaches the output, its vector depends on whether the conversation was about money or about the river.

These are contextual embeddings. Same token — different vectors depending on neighbors. That’s why LLMs handle homonyms and irony so well.
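A sketch of the effect using a small BERT from the transformers library (an encoder rather than a GPT-style decoder, but its embeddings are contextual in exactly the same sense):

```python
import torch
from transformers import AutoModel, AutoTokenizer  # pip install transformers

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Contextual vector of the token «bank» in the given sentence."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    pos = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[pos]

money = bank_vector("I deposited cash at the bank.")
loan = bank_vector("The bank approved my loan.")
river = bank_vector("We sat on the grassy bank of the river.")

cos = torch.nn.functional.cosine_similarity
print(cos(money, loan, dim=0))   # two financial «bank»s: high similarity
print(cos(money, river, dim=0))  # financial vs river: noticeably lower
```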

The mechanism that does this mixing is called attention. That’s our next article.

What we now know

Let’s pull it together.

The model doesn't work with letters or words — it works with tokens (integers from a vocabulary of ~50,000–200,000 entries) and embeddings (vectors of ~1,000–15,000 numbers per token). The token vocabulary is built once from a corpus by BPE: frequent substrings get merged into single tokens, rare ones stay split into pieces.

An embedding is just a row in a giant table that’s trained alongside the model. After training, this table reveals beautiful geometry: semantically similar words sit close, relations between words correspond to recurring directions in space. This wasn’t designed in — it’s a side effect of the «predict the next token» objective.

But static embeddings are only the beginning. From here, vectors need to talk to each other so each token can «see» the context. That’s what attention does — and we’ll look at it closely in the next article.

Based on Build a Large Language Model (From Scratch) by Sebastian Raschka (manning.com).