Key Takeaway

  • Text → tokens → embeddings (high-dimensional vectors)
  • Vectors pass through alternating Attention blocks (cross-token communication) and MLP blocks (per-token fact storage)
  • After many layers, the last token’s vector is projected to 50,257 logits → softmax → next-token probabilities

This chapter answers:

What is a transformer, and how does it turn text into predictions at a high level?

  • What is a Transformer?

    • A Transformer is a neural network architecture that processes sequences by letting every element attend to every other element simultaneously, rather than reading them one by one.
  • Introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017).

  • The key innovation: replace recurrence (RNNs) with self-attention, enabling full parallelisation during training.

  • A Transformer takes a sequence of vectors as input, passes them through alternating Attention and MLP blocks, and outputs a transformed sequence of vectors of the same length.

  • Each attention block allows tokens to communicate with each other (context-aware updates).

  • Each MLP block processes each token independently (stores patterns and facts).

  • Information accumulates via residual connections: each block adds to the existing vector rather than replacing it.

In the context of language models:

  • Input = a sequence of token embeddings
  • Output = a richer sequence of context-aware embeddings
  • The final vector is projected to a probability distribution over the next token

What is a GPT?

GPT stands for Generative Pre-trained Transformer:

  • Generative — it generates new text
  • Pre-trained — trained on a massive dataset before being fine-tuned for specific tasks
  • Transformer — based on the transformer neural network architecture

A GPT takes text as input and outputs a probability distribution over all possible next tokens. Text is generated by repeatedly sampling from this distribution and appending the chosen token, then running the model again.
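This generate-by-repeated-sampling loop can be sketched with a toy stand-in for the model. Everything here is illustrative: the 8-token vocabulary, the random "model", and the seed are all made up; a real GPT would run the full transformer at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 8  # toy vocabulary; GPT-3 uses 50,257 tokens


def toy_model(tokens):
    """Stand-in for a transformer: returns one raw score (logit) per
    vocabulary token. A real model would run attention/MLP layers here."""
    return rng.normal(size=VOCAB_SIZE)


def generate(tokens, n_steps):
    for _ in range(n_steps):
        logits = toy_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        next_token = rng.choice(VOCAB_SIZE, p=probs)   # sample from distribution
        tokens = tokens + [int(next_token)]            # append, then run again
    return tokens


out = generate([3, 1], n_steps=5)  # starts from two prompt tokens
```

The key point is the feedback loop: each sampled token is appended to the input before the next forward pass.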

Deep Learning Foundations

What Machine Learning Really Is

  • Traditional programming: humans write explicit rules
  • Machine learning: the model learns rules from data
  • The tunable knobs in a model are called weights (or parameters)
  • GPT-3 has 175 billion weights organized into ~28,000 matrices across 8 matrix categories

The Core Constraint

All operations in a transformer must be expressible as:

  1. Tensor operations — data flows as multi-dimensional arrays of real numbers
  2. Matrix multiplications — the primary computation
  3. Nonlinear functions — sprinkled in between (e.g., softmax, ReLU) to prevent the whole network from collapsing to a single linear transformation

Training uses backpropagation to adjust all weights based on prediction errors.
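The core constraint above can be shown in a few lines: data as arrays, a matrix multiplication, and a nonlinearity in between. The sizes here are arbitrary toy numbers.

```python
import numpy as np

x = np.random.default_rng(1).normal(size=(4, 16))   # 4 tokens, 16-dim vectors
W = np.random.default_rng(2).normal(size=(16, 16))  # a learnable weight matrix

h = x @ W               # matrix multiplication: the primary computation
h = np.maximum(h, 0.0)  # ReLU nonlinearity, applied elementwise
```

Without the nonlinear step, stacking any number of such layers would collapse into a single equivalent matrix multiplication.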


Step 1: Tokenization

  • Input text is broken into small chunks called tokens
  • Tokens may be whole words, sub-word pieces, or punctuation (e.g., "To| date|,| the| cle|ve|rest|...")
  • GPT-3 has a vocabulary of 50,257 tokens
  • The model never works with raw text — only with these token IDs
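A minimal sketch of the idea, using a hand-made seven-entry vocabulary and greedy longest-match splitting. Real tokenizers (e.g. byte-pair encoding) learn tens of thousands of entries and use more sophisticated merge rules; this only illustrates text → token IDs.

```python
# Hypothetical miniature vocabulary; the IDs are arbitrary.
vocab = {"To": 0, " date": 1, ",": 2, " the": 3, " cle": 4, "ve": 5, "rest": 6}


def tokenize(text):
    """Greedy longest-match tokenization: repeatedly take the longest
    vocabulary entry that prefixes the remaining text."""
    ids = []
    while text:
        match = max((t for t in vocab if text.startswith(t)), key=len)
        ids.append(vocab[match])
        text = text[len(match):]
    return ids


ids = tokenize("To date, the cleverest")
# Reproduces the split "To| date|,| the| cle|ve|rest" from the text above.
```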

Step 2: Embedding — Turning Tokens into Vectors

The Embedding Matrix

  • Each token is mapped to a high-dimensional vector via the embedding matrix W_E
  • W_E has one column per vocabulary token
  • GPT-3 dimensions: 50,257 vocabulary items × 12,288 embedding dimensions
  • Parameter count: 50,257 × 12,288 ≈ 617M parameters, just for embeddings
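Embedding lookup is just column selection. The sketch below uses tiny toy dimensions (the GPT-3 sizes appear only in comments and a parameter-count check):

```python
import numpy as np

# Toy dimensions; GPT-3 uses vocab size 50,257 and embedding dimension 12,288.
VOCAB_SIZE, D_MODEL = 10, 4
W_E = np.random.default_rng(0).normal(size=(D_MODEL, VOCAB_SIZE))  # one column per token

token_ids = [7, 2, 2]          # a repeated token gets the same column each time
embeddings = W_E[:, token_ids]  # lookup = selecting columns, shape (D_MODEL, 3)

gpt3_embedding_params = 50_257 * 12_288  # ≈ 617M parameters, embeddings alone
```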

What Embeddings Encode

  • Embeddings are learned during training — not hand-designed
  • Directions in this high-dimensional space correspond to semantic meaning
  • Words with similar meanings cluster nearby in this space

Directional Semantics: Analogy Arithmetic

A famous property of trained embeddings:

The direction “woman minus man” is a learned vector encoding gender: adding it to the embedding of “king” lands near the embedding of “queen”. This kind of relational structure emerges automatically from training on large text corpora.

Another example: subtracting the embedding of a singular word from its plural (e.g., “cats” minus “cat”) yields a plurality direction. This direction correlates with other plurals and with increasing numerical quantities.
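Analogy arithmetic can be demonstrated with hand-made toy vectors (these four 3-dimensional embeddings are invented purely for illustration; real learned embeddings live in thousands of dimensions):

```python
import numpy as np

# Hand-made toy embeddings: dimension 1 plays the role of a "gender" axis.
emb = {
    "man":   np.array([1.0, 0.0, 1.0]),
    "woman": np.array([1.0, 1.0, 1.0]),
    "king":  np.array([5.0, 0.0, 2.0]),
    "queen": np.array([5.0, 1.0, 2.0]),
}

gender_dir = emb["woman"] - emb["man"]  # the "woman minus man" direction
guess = emb["king"] + gender_dir        # king + (woman - man)


def cos(a, b):
    """Cosine similarity: dot product of normalized vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


nearest = max(emb, key=lambda w: cos(emb[w], guess))  # nearest vocabulary word
```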

Dot Products Measure Similarity

  • The dot product of two vectors measures how aligned they are
  • It is large and positive when the vectors point in the same direction, near zero when they are perpendicular, and negative when they point in opposite directions
  • This is the primary tool for comparing embeddings
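The three cases can be checked directly with small example vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.5])
b = np.array([1.1, 1.9, 0.4])   # points in nearly the same direction as a
c = -a                          # points in the opposite direction
d = np.array([2.0, -1.0, 0.0])  # perpendicular to a

aligned = a @ b   # large and positive
opposed = a @ c   # negative
perp = a @ d      # (exactly) zero here
```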

Positional Information

  • The initial embedding encodes what a token is, not where it appears in the sequence
  • Position is encoded separately and added to the embedding vector (positional encoding)
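One concrete scheme is the sinusoidal encoding from “Attention Is All You Need”; GPT models instead learn their positional embedding matrix, but the principle, adding a position-dependent vector to each token embedding, is the same. A sketch:

```python
import numpy as np


def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal positional encodings (Vaswani et al., 2017):
    sin/cos waves of geometrically increasing wavelength per dimension pair."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angles)  # even dimensions
    enc[:, 1::2] = np.cos(angles)  # odd dimensions
    return enc


tok = np.ones((3, 8))                 # 3 identical token embeddings
x = tok + sinusoidal_positions(3, 8)  # now each copy is distinguishable by position
```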

Step 3: The Transformer Layers

The context window of GPT-3 holds 2048 tokens at once. Each token’s embedding vector (12,288-dimensional) flows through the network in parallel.

What “Deep” Means

A large number of stacked layers is what puts the “deep” in deep learning. Each layer refines the embeddings further.

Two Types of Operations Alternate

1. Attention Block

  • Allows all token vectors to communicate with each other
  • Each vector can absorb contextual information from surrounding vectors
  • A vector that started as the embedding of an isolated word gets tugged and refined as it absorbs context
  • Example: the word “mole” has very different meanings in different sentences; only after attention can the model know which meaning is active

2. Feed-Forward Block (MLP — Multi-Layer Perceptron)

  • Each token vector is processed independently (no cross-token communication)
  • Functions like a bank of “questions” asked about each vector
  • Stores patterns and facts learned during training
  • The MLP gives the model extra representational capacity beyond attention alone

These two blocks alternate many times. The data flows through many layers of [Attention → MLP → Attention → MLP → ...].

The Residual Stream View

Each layer adds its output to the existing vector rather than replacing it. This means:

  • Early layers can encode simple features
  • Later layers add progressively more abstract, context-rich information
  • Information from the original embedding persists and is accessible at every layer
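The additive update can be sketched with a toy residual stream (random vectors and a random "block" output; the 0.1 scale and layer count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=12)  # a token's vector entering the residual stream
x0 = x.copy()            # remember the original embedding


def block(v):
    """Stand-in for an attention or MLP block's output: a small update."""
    return 0.1 * rng.normal(size=v.shape)


for _ in range(4):   # each layer ADDS its output to the stream...
    x = x + block(x)  # ...rather than replacing the vector

# The original embedding's direction still dominates after the additions:
overlap = (x @ x0) / (np.linalg.norm(x) * np.linalg.norm(x0))
```

Because each block contributes an increment, the original embedding remains recoverable deep into the network.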

Step 4: The Unembedding Matrix

After many layers, the final vector of the last token in the context encodes everything the model has inferred about what should come next.

This final vector is multiplied by the unembedding matrix W_U:

  • Shape: 50,257 rows × 12,288 columns (the transpose of the embedding matrix’s shape)
  • Output: a list of 50,257 raw scores, one for each vocabulary token

These raw scores are called logits.
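The projection is a single matrix-vector product. Toy sizes below; the real GPT-3 shapes are noted in comments.

```python
import numpy as np

# Toy sizes; GPT-3: vocab 50,257, embedding dimension 12,288.
VOCAB_SIZE, D_MODEL = 10, 4
rng = np.random.default_rng(0)

W_U = rng.normal(size=(VOCAB_SIZE, D_MODEL))  # unembedding: one row per token
final_vec = rng.normal(size=D_MODEL)          # last token's enriched vector

logits = W_U @ final_vec  # one raw score per vocabulary token
```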


Step 5: Softmax — Converting Logits to Probabilities

Raw logits can be any real number (positive, negative, large, small). We need a valid probability distribution (all values between 0 and 1, summing to 1).
Softmax performs this conversion in two steps:

  1. Exponentiate: raise e to the power of each logit
  2. Normalize: divide each result by the sum of all the exponentials

Properties:

  • Larger logits → values closer to 1
  • Smaller logits → values closer to 0
  • All outputs sum to 1
  • It’s a “soft” maximum: all tokens still receive some nonzero probability
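The two steps in code (subtracting the max before exponentiating is a standard numerical-stability trick; it does not change the result):

```python
import numpy as np


def softmax(logits):
    exps = np.exp(logits - logits.max())  # exponentiate (stabilized)
    return exps / exps.sum()              # normalize to sum to 1


probs = softmax(np.array([2.0, 1.0, -1.0, 0.5]))
```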

Temperature

An optional temperature parameter T can be inserted into softmax: each logit is divided by T before exponentiation.

Temperature    Effect
T high         Distribution becomes more uniform — more random/creative outputs
T low          Distribution becomes more peaked — more deterministic/conservative outputs
T → 0          All weight goes to the single highest-scoring token

The name “temperature” is borrowed from thermodynamics, where temperature governs the randomness of particle behavior.
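The effect is easy to see numerically (the logits and temperature values here are arbitrary examples):

```python
import numpy as np


def softmax_with_temperature(logits, T):
    scaled = np.asarray(logits) / T        # divide logits by temperature T
    exps = np.exp(scaled - scaled.max())   # stabilized exponentiation
    return exps / exps.sum()


logits = np.array([3.0, 1.0, 0.0])
hot = softmax_with_temperature(logits, T=5.0)   # flatter: more random sampling
cold = softmax_with_temperature(logits, T=0.2)  # peaked: near-deterministic
```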


Key Terminology Summary

Term            Meaning
Token           A small chunk of text (word, sub-word, punctuation)
Embedding       A high-dimensional vector representing a token’s meaning
Logits          Raw, unnormalized output scores before softmax
Weights         Learnable parameters (the “knobs” of the model)
Context size    Maximum number of tokens the model processes at once
Attention       Mechanism for tokens to share information with each other
MLP             Feed-forward sub-network that processes each token independently

GPT-3 Architecture at a Glance

Property                       Value
Total parameters               ~175 billion
Embedding dimension            12,288
Vocabulary size                50,257 tokens
Context window                 2,048 tokens
Number of layers               96
Attention heads per layer      96
Embedding matrix parameters    ~617M

The Big Picture

Input text
    ↓  tokenize
[token₁, token₂, ..., tokenₙ]
    ↓  embed (W_E)
[vec₁, vec₂, ..., vecₙ]   ← each is 12,288-dimensional
    ↓  [Attention block] × many layers
    ↓  [MLP block]      × many layers
    ↓  ...
[enriched_vec₁, ..., enriched_vecₙ]
    ↓  take last vector, multiply by W_U
[logit₁, logit₂, ..., logit₅₀₂₅₇]
    ↓  softmax
[prob₁, prob₂, ..., prob₅₀₂₅₇]   ← probability distribution over next token
    ↓  sample
next token

What Comes Next

  • DL6: Deep dive into the attention mechanism — how exactly tokens communicate and update each other
  • DL7: Deep dive into MLP blocks — how facts and patterns are stored in the feed-forward layers