Attention is a mechanism in neural networks, especially transformers, that lets one token decide which other tokens in the context are most relevant when building its representation.

1. What it is

  • In language, the meaning of a word often depends on the words around it.
  • Attention allows each token to look at other tokens and weigh how important they are.
  • Higher attention weight means that another token should have more influence on the current token’s representation.

In simple terms:

  • not every earlier word matters equally
  • attention helps the model focus on the most relevant parts of the context

2. What problem attention solves

Without attention, a model may struggle to decide which earlier words matter most.

Attention solves this by letting the model assign different importance to different tokens in the context.

Typical cases include:

  • ambiguous word meaning
  • pronoun resolution
  • long-range dependencies
  • using context more selectively instead of treating every token the same

In practice, this means attention helps a model:

  • choose the right meaning of a word from context
  • resolve references like he, it, or that
  • connect words that are far apart in a sentence

3. Where you see it

Attention is used in:

  • large language models such as GPT-style systems
  • machine translation
  • text summarization
  • question answering
  • speech and vision transformers

How it shows up in LLM behavior:

  • tracking the topic of your prompt
  • connecting later words to earlier instructions
  • resolving references like it, he, or that
  • deciding which earlier context matters when generating the next token

4. How it works internally

Intuition version

  • Each token looks at the other tokens.
  • It gives each of them a relevance score.
  • More relevant tokens get higher weight.
  • Then the token mixes information from those other tokens according to those weights.

So attention is basically:

score other tokens → turn scores into weights → take a weighted combination of their information
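The three steps above can be sketched in a few lines of plain Python. The tokens, scores, and 2-d "information" vectors here are made-up illustrative numbers, not output from a real model:

```python
import math

# Step 1: each other token gets a relevance score (illustrative numbers).
scores = {"animal": 2.8, "street": 1.1, "tired": 1.5}

# Step 2: turn scores into weights with softmax, so they sum to 1.
total = sum(math.exp(s) for s in scores.values())
weights = {tok: math.exp(s) / total for tok, s in scores.items()}

# Step 3: take a weighted combination of each token's information
# (here each token carries a toy 2-d vector).
values = {"animal": [1.0, 0.0], "street": [0.0, 1.0], "tired": [0.5, 0.5]}
mixed = [sum(weights[t] * values[t][i] for t in values) for i in range(2)]
```

Because "animal" has the highest score, it dominates the mix.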

Q / K / V version

In transformers, each token is turned into three vectors:

  • Query (Q): what this token is looking for
  • Key (K): what this token offers for matching
  • Value (V): the information this token carries

The process is:

  1. Compare one token’s query with all other tokens’ keys.
  2. Use those comparisons to get attention scores.
  3. Normalize the scores into weights.
  4. Use those weights to combine the value vectors.

So:

  • Q decides what to search for
  • K decides how well another token matches
  • V is the actual content gathered from that token
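The four-step Q/K/V process can be written compactly with NumPy. This is a minimal single-head sketch; the shapes and random inputs are assumptions for illustration, and real transformers compute Q, K, and V with learned projection matrices:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # 1-2: compare queries with keys
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # 3: softmax, rows sum to 1
    return weights @ V, weights                     # 4: weighted mix of values

# Example: 4 tokens, vectors of dimension 8 (arbitrary choices).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)   # out: one mixed vector per token
```

Each row of `w` holds one token's attention weights over all four tokens.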

Formula version

The standard attention formula is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Meaning of each part:

  • QKᵀ computes similarity scores between queries and keys
  • dividing by √d_k rescales the scores so they do not grow too large
  • softmax turns the scores into attention weights that sum to 1
  • V provides the information that will be mixed together

So the formula says:

compare → weight → combine

Concrete example

Suppose the sentence is:

“The animal didn’t cross the street because it was tired.”

Now the model is processing the token it.

Its task is:

When I interpret it, which earlier tokens should I pay most attention to?

Possible relevant tokens include:

  • animal
  • street
  • tired

Step 1: the current token produces a query

For the current token it, the model forms a query vector, q.

You can read this as:

What kind of information am I looking for right now?

Since the model is processing it, this query may be useful for finding:

  • a possible referent
  • something semantically compatible with it

Step 2: every token has a key

Other tokens in the sentence each have their own key vectors:

  • The
  • animal
  • street
  • tired

You can read a key as:

What kind of information do I offer if someone is looking for me?

Step 3: compare the query against all keys

The model compares the query with each token’s key to produce relevance scores.

Conceptually, this is asking:

How relevant is each token to understanding it?

Suppose the scores look roughly like this:

  • The 0.1
  • animal 2.8
  • didn't 0.2
  • cross 0.1
  • street 1.1
  • because 0.3
  • tired 1.5

This suggests:

  • animal is highly relevant
  • tired is also somewhat relevant
  • street has some relevance, but less than animal

Step 4: divide by √d_k

The formula includes a division by √d_k, where d_k is the dimension of the key vectors.

This division mainly rescales the scores so they do not become too large.

This helps keep the softmax step numerically stable and prevents the distribution from becoming too extreme too early.
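The effect of the scaling is easy to see numerically. In this sketch the raw dot products and the choice d_k = 64 are made-up illustrative values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

d_k = 64
raw = np.array([8.0, 24.0, 12.0, 16.0])   # hypothetical unscaled dot products

unscaled = softmax(raw)                 # almost all weight on one token
scaled = softmax(raw / np.sqrt(d_k))    # flatter, less extreme distribution
```

The scaled distribution puts a smaller maximum weight on the top token, which is exactly the "not too extreme too early" behavior described above.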

Step 5: softmax turns scores into weights

Next, the model applies the softmax function to the scaled scores.

This converts the raw scores into attention weights that sum to 1.

For example:

  • The 0.03
  • animal 0.52
  • didn't 0.04
  • cross 0.03
  • street 0.14
  • because 0.05
  • tired 0.19

Now the model is effectively saying:

  • pay about 52% attention to animal
  • pay about 19% attention to tired
  • pay about 14% attention to street
  • pay very little attention to the rest
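Applying softmax directly to the rough scores from Step 3 reproduces this picture. The exact weights in the example above are illustrative (they assume the √d_k scaling), so the numbers here differ slightly, but the ordering is the same:

```python
import numpy as np

tokens = ["The", "animal", "didn't", "cross", "street", "because", "tired"]
scores = np.array([0.1, 2.8, 0.2, 0.1, 1.1, 0.3, 1.5])

# Softmax: exponentiate (shifted for stability), then normalize to sum to 1.
e = np.exp(scores - scores.max())
weights = e / e.sum()
```

"animal" receives the largest weight, followed by "tired", then "street".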

Step 6: use those weights to combine values

Each token also has a value vector, which contains the information it can contribute.

The final step is a weighted sum of the value vectors:

output = Σ (attention weight of token × value vector of token)
So the model takes:

  • a lot of information from animal
  • some information from tired
  • a smaller amount from street

and mixes them into a new representation for it.
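This mixing step can be sketched directly with the example weights from Step 5. The 3-d value vectors are invented purely for illustration, and the low-weight tokens are lumped into one "rest" entry:

```python
import numpy as np

# Hypothetical value vectors (made up for illustration).
values = {
    "animal": np.array([1.0, 0.2, 0.0]),
    "tired":  np.array([0.0, 1.0, 0.3]),
    "street": np.array([0.1, 0.0, 1.0]),
    "rest":   np.array([0.0, 0.0, 0.0]),  # remaining tokens, near-zero weight
}
weights = {"animal": 0.52, "tired": 0.19, "street": 0.14, "rest": 0.15}

# New representation for "it": weighted sum of the value vectors.
new_it = sum(weights[t] * values[t] for t in values)
```

The resulting vector is dominated by the information from "animal", with smaller contributions from "tired" and "street".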

What this means in practice

The token it is not interpreted in isolation.

Instead, attention lets the model dynamically decide:

  • animal is the most relevant context
  • tired also matters
  • street matters a little

That is the practical role of the attention formula:

the current token selects the most relevant context and uses it to update its own representation

This is important because many words cannot be understood alone, including:

  • it
  • he
  • bank
  • bat

Attention is what lets the model use the whole sentence to resolve that ambiguity.


5. Background

Attention existed in earlier neural network research, but it became central in modern AI with the transformer architecture introduced in 2017.

The transformer relies heavily on self-attention, where tokens in the same sequence interact with one another.