An embedding is a numerical vector used to represent a word, token, or other object so that a neural network can process it mathematically.
1. What it is
- Neural networks do not work directly with raw words or symbols; they work with numbers.
- An embedding turns a discrete item, such as a token, into a list of continuous values.
- Similar meanings or usages often end up with embeddings that are closer together in vector space.
In simple terms:
- a token is a symbol
- an embedding is the vector the model uses for that symbol
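The symbol-versus-vector distinction can be sketched in a few lines of Python. The vocabulary and vector values below are made up purely for illustration; real embeddings are learned and much higher-dimensional.

```python
# A token is just a symbol (or an integer id);
# its embedding is a vector of continuous values.
vocab = {"cat": 0, "dog": 1, "bank": 2}

# One embedding vector (list of floats) per token id.
embeddings = [
    [0.21, -0.53, 0.77],   # "cat"
    [0.19, -0.48, 0.81],   # "dog" (close to "cat": both animals)
    [-0.62, 0.14, -0.05],  # "bank"
]

token = "dog"
token_id = vocab[token]        # the symbol becomes an id
vector = embeddings[token_id]  # the id selects a vector
```

Note how the hand-picked "cat" and "dog" vectors are numerically close, while "bank" is not.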
2. What problem embeddings solve
Neural networks cannot directly reason over raw words like cat or bank.
Embedding solves this by converting discrete tokens into dense numerical vectors that the model can compute with.
This helps with:
- turning language into numbers
- representing similarity between words or tokens
- giving the model a learnable starting point before attention or other layers
In practice, this means embeddings help a model:
- treat similar tokens more similarly than totally unrelated ones
- pass useful numerical representations into later neural network layers
3. Where you see it
Embeddings are used in:
- large language models
- search and retrieval systems
- recommendation systems
- text classification
- knowledge graph and graph learning systems
How they show up in LLM behavior:
- every input token is first mapped to an embedding vector
- those vectors are then processed by attention and feed-forward layers
- the quality of those initial representations affects what the model can learn later
4. How it works internally
Intuition version
- Start with a token like dog.
- Map it to a vector of numbers.
- Let training gradually adjust that vector so it becomes useful for the model’s task.
So an embedding is basically:
token → vector → learned representation
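"Let training gradually adjust that vector" can be sketched as repeated small updates. This toy example, with a hypothetical target direction and squared-error gradient, only illustrates the idea of nudging an embedding; real training updates embeddings via backpropagation through the whole model.

```python
# Toy sketch: nudge one token's embedding toward a useful direction.
vec = [0.0, 0.0]       # embedding for one token, before training
target = [1.0, -0.5]   # direction training pushes it toward (hypothetical)
lr = 0.1               # learning rate

for _ in range(50):
    # gradient of squared error pulls vec toward target
    vec = [v - lr * 2 * (v - t) for v, t in zip(vec, target)]

# after many small updates, vec has moved close to the target
```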
Matrix version
Suppose the vocabulary has V possible tokens and the model uses embedding dimension d.
Then the model learns an embedding matrix E of shape V × d:
Each row corresponds to one token.
If token id i is selected, the model looks up row E_i, which becomes that token’s embedding vector.
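The row lookup can be sketched directly. The toy sizes and random values below stand in for a learned matrix.

```python
import random

V, d = 5, 4  # vocabulary size and embedding dimension (toy values)
random.seed(0)

# The embedding matrix E: V rows, each a d-dimensional vector.
# In a real model these values are learned; here they are random.
E = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]

def embed(token_id):
    """Selecting token id i just looks up row E_i."""
    return E[token_id]

vec = embed(3)  # the embedding vector for token id 3
```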
What happens next
In a transformer:
- Text is tokenized into token ids.
- Each token id is mapped to an embedding vector.
- Positional information is added so the model knows token order.
- Those vectors are then passed into attention and later layers.
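The four steps above can be sketched end to end. The tokenizer output and embedding table are hypothetical; the positional encoding follows the sinusoidal scheme used in the original transformer.

```python
import math

d = 4  # embedding dimension (toy value)

# Hypothetical tokenizer output and embedding table.
token_ids = [2, 0, 1]
E = {0: [0.1] * d, 1: [0.2] * d, 2: [0.3] * d}

def positional_encoding(pos, d):
    """Sinusoidal positional encoding so the model knows token order."""
    return [
        math.sin(pos / 10000 ** (i / d)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d))
        for i in range(d)
    ]

# Look up each token's embedding, then add positional information.
inputs = [
    [e + p for e, p in zip(E[tid], positional_encoding(pos, d))]
    for pos, tid in enumerate(token_ids)
]
# `inputs` is what the attention layers would receive next.
```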
Concrete example
A model may represent words like:
king, queen, man, woman
as vectors of numbers.
The exact values are learned during training, not manually designed.
The goal is that these vectors capture useful patterns about meaning and usage.
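One common way to measure whether vectors "capture useful patterns" is cosine similarity. The vectors below are hand-picked for illustration, not learned; the point is only that related words can score higher than unrelated pairings.

```python
import math

# Hypothetical 3-dimensional vectors purely for illustration.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.8],
    "man":   [0.7, 0.9, 0.0],
    "woman": [0.7, 0.3, 0.7],
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# With these toy values, "king" is closer to "man" than to "queen".
sim_km = cosine(vectors["king"], vectors["man"])
sim_kq = cosine(vectors["king"], vectors["queen"])
```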
5. Background
Earlier NLP systems often used sparse representations such as one-hot vectors, which do not encode similarity well.
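Why one-hot vectors encode similarity poorly can be shown directly: every pair of distinct one-hot vectors is equally dissimilar, so "cat" looks exactly as unrelated to "dog" as to "bank".

```python
V = 4  # toy vocabulary size

def one_hot(i, V):
    """Sparse representation: all zeros except a single 1."""
    v = [0] * V
    v[i] = 1
    return v

# The dot product between any two *different* one-hot vectors is 0,
# so no notion of "more similar" or "less similar" survives.
a, b = one_hot(0, V), one_hot(1, V)
assert sum(x * y for x, y in zip(a, b)) == 0
```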
Embeddings became important because they allow models to learn richer numerical representations of language.
In transformers, each input token is typically mapped to an embedding vector before attention and other layers process it.