Attention is a mechanism in neural networks, especially transformers, that lets one token decide which other tokens in the context are most relevant when building its representation.
1. What it is
- In language, the meaning of a word often depends on the words around it.
- Attention allows each token to look at other tokens and weigh how important they are.
- Higher attention weight means that another token should have more influence on the current token’s representation.
In simple terms:
- not every earlier word matters equally
- attention helps the model focus on the most relevant parts of the context
2. What problem attention solves
Without attention, a model may struggle to decide which earlier words matter most.
Attention solves this by letting the model assign different importance to different tokens in the context.
Typical cases include:
- ambiguous word meaning
- pronoun resolution
- long-range dependencies
- using context more selectively instead of treating every token the same
In practice, this means attention helps a model:
- choose the right meaning of a word from context
- resolve references like `he`, `it`, or `that`
- connect words that are far apart in a sentence
3. Where you see it
Attention is used in:
- large language models such as GPT-style systems
- machine translation
- text summarization
- question answering
- speech and vision transformers
How it shows up in LLM behavior:
- tracking the topic of your prompt
- connecting later words to earlier instructions
- resolving references like `it`, `he`, or `that`
- deciding which earlier context matters when generating the next token
4. How it works internally
Intuition version
- Each token looks at the other tokens.
- It gives each of them a relevance score.
- More relevant tokens get higher weight.
- Then the token mixes information from those other tokens according to those weights.
So attention is basically:
score other tokens → turn scores into weights → take a weighted combination of their information
Q / K / V version
In transformers, each token is turned into three vectors:
- Query (Q): what this token is looking for
- Key (K): what this token offers for matching
- Value (V): the information this token carries
The process is:
- Compare one token’s query with all other tokens’ keys.
- Use those comparisons to get attention scores.
- Normalize the scores into weights.
- Use those weights to combine the value vectors.
So:
- Q decides what to search for
- K decides how well another token matches
- V is the actual content gathered from that token
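The Q/K/V setup can be sketched in a few lines of NumPy. The embeddings and projection matrices below are random placeholders (real models learn W_Q, W_K, and W_V during training); the sizes are toy values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4   # embedding size and Q/K/V size (toy values)
n_tokens = 5          # e.g. a 5-token sentence

# Token embeddings: one row per token (random stand-ins here)
X = rng.normal(size=(n_tokens, d_model))

# Learned projection matrices (random here, for illustration only)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token offers for matching
V = X @ W_V   # the content each token carries

print(Q.shape, K.shape, V.shape)   # each is (5, 4)
```

Each token thus gets its own query, key, and value row, all derived from the same embedding.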
Formula version
The standard attention formula is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Meaning of each part:
- QKᵀ computes similarity scores between queries and keys
- √d_k rescales the scores so they do not grow too large
- softmax turns the scores into attention weights that sum to 1
- V provides the information that will be mixed together
So the formula says:
compare → weight → combine
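The compare → weight → combine pipeline can be written out directly. This is a minimal NumPy sketch of scaled dot-product attention for a single head, using random toy inputs:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # compare queries with keys
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax: each row sums to 1
    return weights @ V                             # weighted combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))   # 5 tokens, dimension 4 (toy values)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))

out = attention(Q, K, V)
print(out.shape)   # (5, 4): one mixed representation per token
```

Each output row is a blend of all value vectors, weighted by how well that token's query matched each key.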
Concrete example
Suppose the sentence is:
“The animal didn’t cross the street because it was tired.”
Now the model is processing the token `it`.
Its task is:
When I interpret `it`, which earlier tokens should I pay most attention to?
Possible relevant tokens include:
- `animal`
- `street`
- `tired`
Step 1: the current token produces a query
For the current token `it`, the model forms a query vector q_it.
You can read this as:
What kind of information am I looking for right now?
Since the model is processing `it`, this query may be useful for finding:
- a possible referent
- something semantically compatible with `it`
Step 2: every token has a key
Other tokens in the sentence each have their own key vectors:
- `The` → k_The
- `animal` → k_animal
- `street` → k_street
- `tired` → k_tired
You can read a key as:
What kind of information do I offer if someone is looking for me?
Step 3: compare the query against all keys
The model compares the query q_it with each key to produce relevance scores.
Conceptually, this is asking:
How relevant is each token to understanding `it`?
Suppose the scores look roughly like this:
- `The` → 0.1
- `animal` → 2.8
- `didn't` → 0.2
- `cross` → 0.1
- `street` → 1.1
- `because` → 0.3
- `tired` → 1.5
This suggests:
- `animal` is highly relevant
- `tired` is also somewhat relevant
- `street` has some relevance, but less than `animal`
Step 4: divide by √d_k
The formula includes a division by √d_k, where d_k is the dimension of the key vectors.
This division mainly rescales the scores so they do not become too large.
This helps keep the softmax step numerically stable and prevents the distribution from becoming too extreme too early.
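A quick way to see why this rescaling matters: the same relative scores, pushed to a larger magnitude, produce a much more extreme softmax. The numbers below are arbitrary toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift by max for numerical stability
    return e / e.sum()

# The same relative score pattern at two magnitudes
scores = np.array([1.0, 4.0, 2.0])
large  = scores * 10   # what unscaled dot products can look like when d_k is big

soft   = softmax(scores)
peaked = softmax(large)

print(soft.round(3))    # fairly spread out across the three tokens
print(peaked.round(3))  # almost all weight lands on a single token
```

Dividing by √d_k keeps the scores in the "spread out" regime, so early in training no token gets locked out of the mix.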
Step 5: softmax turns scores into weights
Next, the model applies the softmax function to the scaled scores.
This converts the raw scores into attention weights that sum to 1.
For example:
- `The` → 0.03
- `animal` → 0.52
- `didn't` → 0.04
- `cross` → 0.03
- `street` → 0.14
- `because` → 0.05
- `tired` → 0.19
Now the model is effectively saying:
- pay about 52% attention to `animal`
- pay about 19% attention to `tired`
- pay about 14% attention to `street`
- pay very little attention to the rest
Step 6: use those weights to combine values
Each token also has a value vector, which contains the information it can contribute.
The final step is to combine the value vectors using the attention weights:

output = Σ_i weight_i · v_i
So the model takes:
- a lot of information from `animal`
- some information from `tired`
- a smaller amount from `street`
and mixes them into a new representation for `it`.
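The mixing step is just a weighted sum. Here is a tiny sketch using made-up 3-dimensional value vectors and only the three largest weights from the example:

```python
import numpy as np

# Hypothetical value vectors for the most relevant tokens (one-hot for clarity)
values = {
    "animal": np.array([1.0, 0.0, 0.0]),
    "tired":  np.array([0.0, 1.0, 0.0]),
    "street": np.array([0.0, 0.0, 1.0]),
}
weights = {"animal": 0.52, "tired": 0.19, "street": 0.14}

# Weighted combination (the remaining ~15% of weight is ignored for simplicity)
new_repr = sum(weights[t] * values[t] for t in values)
print(new_repr)   # [0.52 0.19 0.14] -- mostly "animal", some "tired" and "street"
```

With one-hot values the output literally spells out the weights; in a real model the value vectors are dense, but the blending works the same way.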
What this means in practice
The token `it` is not interpreted in isolation.
Instead, attention lets the model dynamically decide:
- `animal` is the most relevant context
- `tired` also matters
- `street` matters a little
That is the practical role of the attention formula:
the current token selects the most relevant context and uses it to update its own representation
This is important because many words cannot be understood alone, including:
- `it`
- `he`
- `bank`
- `bat`
Attention is what lets the model use the whole sentence to resolve that ambiguity.
5. Background
Attention existed in earlier neural network research, but it became central in modern AI with the transformer architecture introduced in 2017.
The transformer relies heavily on self-attention, where tokens in the same sequence interact with one another.