An embedding is a numerical vector used to represent a word, token, or other object so that a neural network can process it mathematically.
1. What it is
- Neural networks do not work directly with raw words or symbols; they work with numbers.
- An embedding turns a discrete item, such as a token, into a list of continuous values.
- Similar meanings or usages often end up with embeddings that are closer together in vector space.
In simple terms:
- a token is a symbol
- an embedding is the vector the model uses for that symbol
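The symbol-versus-vector distinction can be sketched in a few lines of Python. The vocabulary and vector values below are made up purely for illustration; real embeddings are learned and much higher-dimensional.

```python
# A token is just a symbol (or an integer id);
# its embedding is a vector of continuous values.
vocab = {"cat": 0, "dog": 1, "bank": 2}

# One embedding vector (list of floats) per token id.
embeddings = [
    [0.21, -0.53, 0.77],   # "cat"
    [0.19, -0.48, 0.81],   # "dog" (close to "cat": both animals)
    [-0.62, 0.14, -0.05],  # "bank"
]

token = "dog"
token_id = vocab[token]        # the symbol becomes an id
vector = embeddings[token_id]  # the id selects a vector
```

Note how the hand-picked "cat" and "dog" vectors are numerically close, while "bank" is not.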
2. What problem embeddings solve
Neural networks cannot directly reason over raw words like cat or bank.
Embedding solves this by converting discrete tokens into dense numerical vectors that the model can compute with.
This helps with:
- turning language into numbers
- representing similarity between words or tokens
- giving the model a learnable starting point before attention or other layers
In practice, this means embeddings help a model:
- treat similar tokens more similarly than totally unrelated ones
- pass useful numerical representations into later neural network layers
3. Where you see it
Embeddings are used in:
- large language models
- search and retrieval systems
- recommendation systems
- text classification
- knowledge graph and graph learning systems
How they show up in LLM behavior:
- every input token is first mapped to an embedding vector
- those vectors are then processed by attention and feed-forward layers
- the quality of those initial representations affects what the model can learn later
4. How it works internally
Intuition version
- Start with a token like dog.
- Map it to a vector of numbers.
- Let training gradually adjust that vector so it becomes useful for the model’s task.
So an embedding is basically:
token → vector → learned representation
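"Let training gradually adjust that vector" can be sketched as repeated small updates. This toy example, with a hypothetical target direction and squared-error gradient, only illustrates the idea of nudging an embedding; real training updates embeddings via backpropagation through the whole model.

```python
# Toy sketch: nudge one token's embedding toward a useful direction.
vec = [0.0, 0.0]       # embedding for one token, before training
target = [1.0, -0.5]   # direction training pushes it toward (hypothetical)
lr = 0.1               # learning rate

for _ in range(50):
    # gradient of squared error pulls vec toward target
    vec = [v - lr * 2 * (v - t) for v, t in zip(vec, target)]

# after many small updates, vec has moved close to the target
```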
Matrix version
Suppose the vocabulary has V possible tokens and the model uses embedding dimension d.
Then the model learns an embedding matrix E of shape V × d:
Each row corresponds to one token.
If token id i is selected, the model looks up row E_i, which becomes that token’s embedding vector.
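The row lookup can be sketched directly. The toy sizes and random values below stand in for a learned matrix.

```python
import random

V, d = 5, 4  # vocabulary size and embedding dimension (toy values)
random.seed(0)

# The embedding matrix E: V rows, each a d-dimensional vector.
# In a real model these values are learned; here they are random.
E = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]

def embed(token_id):
    """Selecting token id i just looks up row E_i."""
    return E[token_id]

vec = embed(3)  # the embedding vector for token id 3
```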
What happens next
In a transformer:
- Text is tokenized into token ids.
- Each token id is mapped to an embedding vector.
- Positional information is added so the model knows token order.
- Those vectors are then passed into attention and later layers.
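The four steps above can be sketched end to end. The tokenizer output and embedding table are hypothetical; the positional encoding follows the sinusoidal scheme used in the original transformer.

```python
import math

d = 4  # embedding dimension (toy value)

# Hypothetical tokenizer output and embedding table.
token_ids = [2, 0, 1]
E = {0: [0.1] * d, 1: [0.2] * d, 2: [0.3] * d}

def positional_encoding(pos, d):
    """Sinusoidal positional encoding so the model knows token order."""
    return [
        math.sin(pos / 10000 ** (i / d)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d))
        for i in range(d)
    ]

# Look up each token's embedding, then add positional information.
inputs = [
    [e + p for e, p in zip(E[tid], positional_encoding(pos, d))]
    for pos, tid in enumerate(token_ids)
]
# `inputs` is what the attention layers would receive next.
```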
Concrete example
A model may represent words like:
king, queen, man, woman
as vectors of numbers.
The exact values are learned during training, not manually designed.
The goal is that these vectors capture useful patterns about meaning and usage.
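One common way to measure whether vectors "capture useful patterns" is cosine similarity. The vectors below are hand-picked for illustration, not learned; the point is only that related words can score higher than unrelated pairings.

```python
import math

# Hypothetical 3-dimensional vectors purely for illustration.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.8],
    "man":   [0.7, 0.9, 0.0],
    "woman": [0.7, 0.3, 0.7],
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# With these toy values, "king" is closer to "man" than to "queen".
sim_km = cosine(vectors["king"], vectors["man"])
sim_kq = cosine(vectors["king"], vectors["queen"])
```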
5. Background
Earlier NLP systems often used sparse representations such as one-hot vectors, which do not encode similarity well.
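Why one-hot vectors encode similarity poorly can be shown directly: every pair of distinct one-hot vectors is equally dissimilar, so "cat" looks exactly as unrelated to "dog" as to "bank".

```python
V = 4  # toy vocabulary size

def one_hot(i, V):
    """Sparse representation: all zeros except a single 1."""
    v = [0] * V
    v[i] = 1
    return v

# The dot product between any two *different* one-hot vectors is 0,
# so no notion of "more similar" or "less similar" survives.
a, b = one_hot(0, V), one_hot(1, V)
assert sum(x * y for x, y in zip(a, b)) == 0
```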
Embeddings became important because they allow models to learn richer numerical representations of language.
In transformers, each input token is typically mapped to an embedding vector before attention and other layers process it.