Attention is a mechanism in neural networks, especially transformers, that lets one token decide which other tokens in the context are most relevant when building its representation.
1. What it is
- In language, the meaning of a word often depends on the words around it.
- Attention allows each token to look at other tokens and weigh how important they are.
- Higher attention weight means that another token should have more influence on the current token’s representation.
In simple terms:
- not every earlier word matters equally
- attention helps the model focus on the most relevant parts of the context
2. What problem attention solves
Without attention, a model may struggle to decide which earlier words matter most.
Attention solves this by letting the model assign different importance to different tokens in the context.
Typical cases include:
- ambiguous word meaning
- pronoun resolution
- long-range dependencies
- using context more selectively instead of treating every token the same
In practice, this means attention helps a model:
- choose the right meaning of a word from context
- resolve references like `he`, `it`, or `that`
- connect words that are far apart in a sentence
3. Where you see it
Attention is used in:
- large language models such as GPT-style systems
- machine translation
- text summarization
- question answering
- speech and vision transformers
How it shows up in LLM behavior:
- tracking the topic of your prompt
- connecting later words to earlier instructions
- resolving references like `it`, `he`, or `that`
- deciding which earlier context matters when generating the next token
4. How it works internally
Intuition version
- Each token looks at the other tokens.
- It gives each of them a relevance score.
- More relevant tokens get higher weight.
- Then the token mixes information from those other tokens according to those weights.
So attention is basically:
score other tokens → turn scores into weights → take a weighted combination of their information
Q / K / V version
In transformers, each token is turned into three vectors:
- Query (Q): what this token is looking for
- Key (K): what this token offers for matching
- Value (V): the information this token carries
The process is:
- Compare one token’s query with all other tokens’ keys.
- Use those comparisons to get attention scores.
- Normalize the scores into weights.
- Use those weights to combine the value vectors.
So:
- Q decides what to search for
- K decides how well another token matches
- V is the actual content gathered from that token
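The Q/K/V setup can be sketched in a few lines of NumPy. The embeddings and projection matrices below are random placeholders (real models learn W_Q, W_K, and W_V during training); the sizes are toy values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4   # embedding size and Q/K/V size (toy values)
n_tokens = 5          # e.g. a 5-token sentence

# Token embeddings: one row per token (random stand-ins here)
X = rng.normal(size=(n_tokens, d_model))

# Learned projection matrices (random here, for illustration only)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token offers for matching
V = X @ W_V   # the content each token carries

print(Q.shape, K.shape, V.shape)   # each is (5, 4)
```

Each token thus gets its own query, key, and value row, all derived from the same embedding.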
Formula version
The standard attention formula is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Meaning of each part:
- QKᵀ computes similarity scores between queries and keys
- √d_k rescales the scores so they do not grow too large
- softmax turns the scores into attention weights that sum to 1
- V provides the information that will be mixed together
So the formula says:
compare → weight → combine
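The compare → weight → combine pipeline can be written out directly. This is a minimal NumPy sketch of scaled dot-product attention for a single head, using random toy inputs:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # compare queries with keys
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
    return weights @ V                             # weighted combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))   # 5 tokens, dimension 4 (toy values)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))

out = attention(Q, K, V)
print(out.shape)   # (5, 4): one mixed representation per token
```

Each output row is a blend of all value vectors, weighted by how well that token's query matched each key.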
Concrete example
Suppose the sentence is:
“The animal didn’t cross the street because it was tired.”
Now the model is processing the token `it`.
Its task is:
When I interpret `it`, which earlier tokens should I pay most attention to?
Possible relevant tokens include:
- `animal`
- `street`
- `tired`
Step 1: the current token produces a query
For the current token `it`, the model forms a query vector q_it.
You can read this as:
What kind of information am I looking for right now?
Since the model is processing `it`, this query may be useful for finding:
- a possible referent
- something semantically compatible with `it`
Step 2: every token has a key
Other tokens in the sentence each have their own key vectors:
- `The` → k_The
- `animal` → k_animal
- `street` → k_street
- `tired` → k_tired
You can read a key as:
What kind of information do I offer if someone is looking for me?
Step 3: compare the query against all keys
The model compares the query q_it with each key to produce relevance scores.
Conceptually, this is asking:
How relevant is each token to understanding `it`?
Suppose the scores look roughly like this:
- `The` → 0.1
- `animal` → 2.8
- `didn't` → 0.2
- `cross` → 0.1
- `street` → 1.1
- `because` → 0.3
- `tired` → 1.5
This suggests:
- `animal` is highly relevant
- `tired` is also somewhat relevant
- `street` has some relevance, but less than `animal`
Step 4: divide by √d_k
The formula includes a division by √d_k, where d_k is the dimension of the key vectors.
This division mainly rescales the scores so they do not become too large.
This helps keep the softmax step numerically stable and prevents the distribution from becoming too extreme too early.
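A quick way to see why this rescaling matters: the same relative scores, pushed to a larger magnitude, produce a much more extreme softmax. The numbers below are arbitrary toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift by max for numerical stability
    return e / e.sum()

# The same relative score pattern at two magnitudes
scores = np.array([1.0, 4.0, 2.0])
large  = scores * 10   # what unscaled dot products can look like when d_k is big

soft   = softmax(scores)
peaked = softmax(large)

print(soft.round(3))    # fairly spread out across the three tokens
print(peaked.round(3))  # almost all weight lands on a single token
```

Dividing by √d_k keeps the scores in the "spread out" regime, so early in training no token gets locked out of the mix.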
Step 5: softmax turns scores into weights
Next, the model applies the softmax function to the scaled scores.
This converts the raw scores into attention weights that sum to 1.
For example:
- `The` → 0.03
- `animal` → 0.52
- `didn't` → 0.04
- `cross` → 0.03
- `street` → 0.14
- `because` → 0.05
- `tired` → 0.19
Now the model is effectively saying:
- pay about 52% attention to `animal`
- pay about 19% attention to `tired`
- pay about 14% attention to `street`
- pay very little attention to the rest
Step 6: use those weights to combine values
Each token also has a value vector, which contains the information it can contribute.
The final step is to combine the value vectors using the attention weights:

output = Σ_i weight_i · v_i
So the model takes:
- a lot of information from `animal`
- some information from `tired`
- a smaller amount from `street`
and mixes them into a new representation for `it`.
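The mixing step is just a weighted sum. Here is a tiny sketch using made-up 3-dimensional value vectors and only the three largest weights from the example:

```python
import numpy as np

# Hypothetical value vectors for the most relevant tokens (one-hot for clarity)
values = {
    "animal": np.array([1.0, 0.0, 0.0]),
    "tired":  np.array([0.0, 1.0, 0.0]),
    "street": np.array([0.0, 0.0, 1.0]),
}
weights = {"animal": 0.52, "tired": 0.19, "street": 0.14}

# Weighted combination (the remaining ~15% of weight is ignored for simplicity)
new_repr = sum(weights[t] * values[t] for t in values)
print(new_repr)   # [0.52 0.19 0.14] -- mostly "animal", some "tired" and "street"
```

With one-hot values the output literally spells out the weights; in a real model the value vectors are dense, but the blending works the same way.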
What this means in practice
The token `it` is not interpreted in isolation.
Instead, attention lets the model dynamically decide:
- `animal` is the most relevant context
- `tired` also matters
- `street` matters a little
That is the practical role of the attention formula:
the current token selects the most relevant context and uses it to update its own representation
This is important because many words cannot be understood alone, including:
- `it`
- `he`
- `bank`
- `bat`
Attention is what lets the model use the whole sentence to resolve that ambiguity.
5. Background
Attention existed in earlier neural network research, but it became central in modern AI with the transformer architecture introduced in 2017.
The transformer relies heavily on self-attention, where tokens in the same sequence interact with one another.