AI Document Assistant — RAG Interview Prep

Based on the actual implementation in docs-assistant/backend.


Biggest Challenge: Retrieval Miss

“The answer is in the document, but the system says I don’t know.”

This is the hardest problem in RAG — a Retrieval Miss. If the right chunk never makes it into the top-k, GPT cannot give the correct answer, no matter how good the prompt is.

When it happens:

  • User asks: "How many days for a refund?"
  • Document says: "All payments will be returned within 5–7 business days"
  • Different wording → embedding similarity too low → chunk not in top-8 → GPT says “I don’t know”

Why it’s the hardest:

Retrieval Miss → GPT never sees the right chunk → answer is always wrong
      ↑
 most upstream failure — no prompt tuning can fix this

How to fix (production):

Solution            Approach
Hybrid Search       Merge vector + TF-IDF results (union), not fallback
Reranking           Retrieve top-20, use cross-encoder to re-rank, keep top-8
Semantic chunking   Split on paragraph/heading boundaries, not fixed 1000 chars
Eval pipeline       Build a test set, measure recall@k to know where the system is missing

Interview answer:

“The biggest challenge is retrieval recall — the answer exists in the document but wasn’t retrieved. Fixed 1000-char chunking can split a semantic unit in half, lowering similarity scores. I’d use hybrid search to union vector and keyword results, add a reranker for precision, and switch to paragraph-boundary chunking.”
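The "union, not fallback" idea behind hybrid search can be sketched with Reciprocal Rank Fusion. This is an illustrative sketch, not the project's code; the function name and the RRF constant k=60 (a common default) are assumptions:

```python
def hybrid_merge(vector_ids, keyword_ids, top_k=8, k=60):
    """Merge two ranked lists of chunk ids with Reciprocal Rank Fusion.

    A chunk that appears in both lists collects credit from both ranks,
    so semantic and exact-word matches are combined (union), not chained
    as a fallback. Illustrative sketch, not the project's actual code.
    """
    scores = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# chunk 7 appears in both ranked lists, so it rises to the top
merged = hybrid_merge([3, 7, 1], [7, 9, 3], top_k=3)
```

A chunk missed by the embedding but hit by exact keywords (the refund example above) still makes the merged list — which is exactly the retrieval-miss fix.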


What is RAG?

RAG (Retrieval-Augmented Generation) = don’t send the whole document to the LLM. Instead:

  1. Chunk the document into small passages
  2. Embed each chunk as a vector (semantic meaning → numbers)
  3. At query time: embed the question, find the most similar chunks
  4. Send only those top-k chunks as context to the LLM
  5. LLM answers only from that context — not from its training data
Document
  ↓ chunk (1000 chars, 200 overlap)
[chunk_1, chunk_2, ..., chunk_n]
  ↓ embed (OpenAI text-embedding model)
pgvector (stored in PostgreSQL)

Query
  ↓ embed question
  ↓ cosine similarity search (<=> operator)
top-k chunks
  ↓ build context string
  ↓ GPT: "answer only from context"
Answer + Citations

How is it implemented in this project?

Ingestion (/rag/ingest)

# For each chunk:
emb = await embed_text(chunk_text)           # OpenAI embedding API
save_embeddings_for_document(db, doc_id, user_id, [(chunk_id, emb)])
# Stored in pgvector: chunk_embeddings table

Retrieval (ask.py)

# Prefer semantic (vector) retrieval
question_embedding = await embed_text(request.question)
chunk_ids = vector_retrieve_top_k(db, doc_id, user_id, question_embedding, top_k)
# chunk_ids are resolved into scored_chunks (lookup elided here)

# Fall back to keyword (TF-IDF) if vector retrieval returns nothing
if not scored_chunks:
    scored_chunks = keyword_retrieve_top_k(request.question, chunks, top_k)

pgvector query (embedding_store.py)

SELECT chunk_id FROM chunk_embeddings
WHERE doc_id = :doc_id AND user_id = :user_id
ORDER BY embedding <=> CAST(:emb AS vector)   -- cosine distance
LIMIT :top_k

Hard Questions & Answers


Q: How do you ensure context coherence?

Two levels:

1. Within a single query — chunking overlap

Chunks use 200-character overlap. If an answer spans a chunk boundary, the overlap ensures neither chunk is missing critical context.

chunk_1: "...the policy states that employees must submit..."
                                              ↑ 200-char overlap
chunk_2: "...must submit receipts within 30 days of purchase..."

Without overlap, the answer could be split across two chunks and retrieval might miss half of it.
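The overlapping chunker can be sketched as follows; the sizes match the project's 1000/200 settings, but the function itself is illustrative:

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so a boundary-spanning answer appears whole
    in at least one chunk. Illustrative sketch of the approach."""
    stride = size - overlap  # each chunk starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 50_000)            # roughly a 100-page document
assert chunks[0][-200:] == chunks[1][:200]   # neighbours share 200 chars
```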

2. Across multi-turn conversation — current limitation

The current system treats each question as independent — there’s no conversation history. If the user asks:

  • Q1: “Who wrote this document?”
  • Q2: “What did he say about pricing?”

Q2 doesn’t know “he” refers to the author from Q1.

Production fix: Include the last N turns in the prompt:

history = [
    {"role": "user", "content": "Who wrote this document?"},
    {"role": "assistant", "content": "John Smith."},
]
# Prepend history to current context before calling GPT

This is the planned next step for the project.


Q: How does the RAG architecture work end to end?

Upload → chunk → embed → store in pgvector
                                    ↓
Query  → embed question → cosine search → top-k chunks
                                    ↓
         build context → GPT (answer only from context)
                                    ↓
                         answer + citations [chunk_id=N]

Two-phase design:

  • Ingestion (happens once after upload, async)
  • Retrieval (happens on every query, real-time)

Q: What is Cosine Similarity?

The cosine of the angle between two vectors — measures how similar their directions are, regardless of length.

Angle    cos value   Meaning
0°       1           Same direction → same semantic meaning
90°      0           Perpendicular → unrelated
180°     -1          Opposite direction → opposite meaning

Why cosine instead of Euclidean distance?

Euclidean distance measures straight-line distance — affected by vector length. Cosine only looks at direction, so a long and short version of the same sentence still score 1.0:

chunk_vec    = [0.6, 0.8]   # length 1.0
question_vec = [0.3, 0.4]   # length 0.5, but same direction
 
# Euclidean: 0.5   ← thinks they're different
# Cosine:    1.0   ← correct, direction is identical
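The numbers above can be checked directly in plain Python:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.hypot(*(x - y for x, y in zip(a, b)))

chunk_vec    = [0.6, 0.8]   # length 1.0
question_vec = [0.3, 0.4]   # length 0.5, same direction

# euclidean(chunk_vec, question_vec) ≈ 0.5  -> looks "different"
# cosine(chunk_vec, question_vec)    ≈ 1.0  -> same direction
```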

pgvector uses <=> which is cosine distance = 1 − similarity, so smaller = more similar:

ORDER BY embedding <=> query   -- smaller = more similar
LIMIT 8

Q: Why use vector retrieval instead of keyword search (TF-IDF)?

                                TF-IDF (keyword)                      Vector (semantic)
"What is the refund policy?"    finds chunks with exact words         finds chunks about money-back, returns,
                                "refund", "policy"                    cancellation — even if exact words differ
Speed                           fast, no GPU needed                   slightly slower (embedding API call)
Cost                            free                                  OpenAI API cost per query
Implementation                  custom, debuggable                    requires embedding model + vector DB

This project uses vector as primary, TF-IDF as fallback — best of both.


Q: What is pgvector? Why use it instead of Pinecone?

pgvector is a PostgreSQL extension that adds a vector data type and the <=> cosine distance operator. It lets you do semantic search inside your existing Postgres database.

Why pgvector over Pinecone:

  • Already have PostgreSQL — no extra service to manage
  • Data isolation is simpler — user ownership + doc scoping in same DB
  • For this scale (thousands of chunks, not billions), Postgres is fast enough
  • Pinecone makes more sense at massive scale or if you need managed infra

Q: How do you prevent hallucination?

Three layers:

  1. Grounded prompt — system prompt tells GPT: “Answer ONLY from the provided context. If the answer is not in the context, say ‘I don’t know based on the document.’”

  2. Empty retrieval gate — if scored_chunks is empty, return "I don't know" immediately without calling GPT at all

  3. Citations — answer includes [chunk_id=N] tags; the backend extracts which chunks were actually cited, returns them to the user so they can verify
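Layers 1 and 2 can be sketched together. The handler shape, message string, and names below are illustrative, not the project's actual code:

```python
IDK = "I don't know based on the document."

def answer_question(scored_chunks, call_gpt):
    """Sketch of the anti-hallucination gate; names are illustrative."""
    # Layer 2: nothing retrieved -> refuse immediately, no LLM call spent
    if not scored_chunks:
        return {"answer": IDK, "citations": []}
    # Layer 1: context is labelled with chunk ids so GPT can cite them,
    # and travels with the grounded "answer ONLY from context" prompt
    context = "\n\n".join(f"[chunk_id={cid}] {text}" for cid, text in scored_chunks)
    return call_gpt(context)
```

The gate matters because an LLM given an empty context will still happily improvise; refusing before the call is both safer and cheaper.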


Q: How do citations work?

The LLM is instructed to include [chunk_id=N] in its answer when citing a chunk. After generation:

import re

CITATION_PATTERN = re.compile(r"\[chunk_id=(\d+)\]")

def _build_citations(answer, chunks):
    cited_ids = {int(m) for m in CITATION_PATTERN.findall(answer)}
    # Return preview + full_text for each cited chunk (field names illustrative)
    return [{"preview": c.text[:200], "full_text": c.text} for c in chunks if c.chunk_id in cited_ids]

This makes answers verifiable — the user can see exactly which passage the answer came from.


Q: How do you choose top-k?

Default top_k = 8 (configurable via ASK_TOP_K env var).

Trade-off:

  • Too small → may miss relevant context, incomplete answers
  • Too large → more tokens sent to GPT = higher cost + more noise in context

Production improvement: dynamic top-k based on score distribution — stop adding chunks once similarity score drops below a threshold.
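That score-distribution cut can be sketched as follows; the thresholds are made-up values that would be tuned on the eval set, not project settings:

```python
def dynamic_top_k(scored, max_k=8, min_score=0.75, max_drop=0.15):
    """Keep chunks while similarity stays strong; stop at a sharp drop.

    `scored` is a (chunk_id, similarity) list sorted best-first.
    The thresholds are illustrative, to be tuned on an eval set.
    """
    kept = []
    for i, (chunk_id, score) in enumerate(scored[:max_k]):
        if score < min_score:
            break                              # absolute quality floor
        if kept and scored[i - 1][1] - score > max_drop:
            break                              # score fell off a cliff
        kept.append(chunk_id)
    return kept

# scores collapse after the second chunk, so only two are kept
ids = dynamic_top_k([(1, 0.92), (2, 0.90), (3, 0.70), (4, 0.41)])
```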


Q: What happens if the document is not yet ingested when the user asks?

Status machine check before retrieval:

  • UPLOADED → still processing → return “still processing”
  • PROCESSING → ingestion running → return “still processing”
  • READY → proceed with retrieval
  • FAILED → return error

The Q&A endpoint is gated on READY status — it won’t attempt retrieval until embeddings are fully stored.
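The status machine can be sketched like this; the enum values and reply strings are illustrative stand-ins for whatever the backend's models actually define:

```python
from enum import Enum

class DocStatus(str, Enum):
    UPLOADED = "uploaded"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"

def gate(status):
    """Return an early reply for non-READY docs, or None to proceed."""
    if status in (DocStatus.UPLOADED, DocStatus.PROCESSING):
        return "still processing"
    if status is DocStatus.FAILED:
        return "ingestion failed"
    return None  # READY: safe to run retrieval
```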


Q: How would you scale this for very large documents?

  1. Async ingestion — already implemented via background worker (Celery), upload returns immediately
  2. Streaming answers — /ask/stream endpoint returns SSE deltas so users see partial answers as they’re generated
  3. Limit token budget — cap total context tokens (e.g., 4000 tokens) rather than a fixed chunk count
  4. Hierarchical chunking — small chunks for retrieval, larger parent chunks for context
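Point 3, capping by token budget instead of chunk count, can be sketched as below. The chars/4 counter is a rough stand-in for a real tokenizer (e.g. tiktoken), and all names and numbers are illustrative:

```python
def cap_by_token_budget(scored_chunks, budget=4000,
                        count_tokens=lambda t: len(t) // 4):
    """Add chunks best-first until the token budget is spent.

    The default chars/4 counter is a rough stand-in for a real
    tokenizer (e.g. tiktoken); names and numbers are illustrative.
    """
    kept, used = [], 0
    for chunk_id, text in scored_chunks:
        cost = count_tokens(text)
        if used + cost > budget:
            break                 # budget replaces a fixed chunk count
        kept.append(chunk_id)
        used += cost
    return kept

# three 2000-token chunks against a 4000-token budget -> first two kept
ids = cap_by_token_budget([(1, "a" * 8000), (2, "b" * 8000), (3, "c" * 8000)])
```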

Q: Where is the vector database stored?

Currently (this project): pgvector extension inside PostgreSQL — vectors live in chunk_embeddings table alongside regular business data. One DB for everything, ownership/isolation handled by user_id + doc_id.

PostgreSQL
├── users, documents, chunks   ← regular data
└── chunk_embeddings           ← vector data (1536 floats per row)

When to switch to a dedicated vector DB (Pinecone, Qdrant):

  • Millions of documents / chunks
  • Need sub-millisecond ANN (Approximate Nearest Neighbor) search
  • pgvector does an exact cosine scan by default — slower at massive scale (it offers IVFFlat/HNSW ANN indexes, but dedicated stores are more turnkey)

Interview answer:

“I store vectors in PostgreSQL via pgvector — same DB as business data, simpler ownership isolation. At larger scale I’d migrate to Pinecone or Qdrant which are optimized for ANN search.”


Q: Is embedding word-by-word or the whole chunk?

Whole chunk → one vector. Not word by word.

chunk: "The refund policy states customers may return products within 30 days..."
                    ↓ embed_text(whole chunk)
one vector: [0.023, -0.145, 0.891, ...]   ← 1536 numbers

The entire chunk’s semantic meaning is compressed into 1536 numbers. That’s the power — “退款” and “refund” produce similar vectors even though the characters are completely different.

Numbers for this project:

100-page document → ~50,000 chars
  ↓ chunk (1000 chars, 200 overlap → 800-char stride)
  → ~62 chunks → ~62 vectors stored in pgvector

User question → 1 vector
  ↓ cosine compare against all ~62 vectors
  → top 8 most similar chunks → sent to GPT

Q: What formula/model does the embedding use?

The embedding model (text-embedding-3-small) is a Transformer neural network — not a simple formula. It takes text → outputs 1536 floats. Internally it runs tokenization → Attention + MLP layers → normalize to unit vector.

The similarity formula is cosine similarity:

Result   Meaning
1        Same direction → same semantic meaning
0        Perpendicular → unrelated
-1       Opposite direction → opposite meaning

pgvector uses <=> which is cosine distance = 1 − similarity, so smaller = more similar:

ORDER BY embedding <=> query   -- smaller distance = more similar
LIMIT 8

Interview answer:

“Similarity is measured with cosine similarity. The embedding model itself is OpenAI’s text-embedding-3-small, a Transformer that compresses text semantics into a 1536-dim unit vector.”


One-liner answers (for speed rounds)

  • What is RAG? → Retrieve relevant document passages, send them as context, LLM answers only from context
  • How do you ensure coherence? → 200-char chunk overlap for within-query; multi-turn history (planned) for cross-query
  • Why pgvector? → Already have Postgres, same DB for ownership/isolation, sufficient for this scale
  • Why vector over TF-IDF? → Semantic match — finds relevant chunks even when exact words differ
  • How do you prevent hallucination? → Grounded prompt + empty-retrieval gate + citations
  • What’s the latency bottleneck? → Embedding API call + LLM generation; chunking/retrieval is fast
  • Where are vectors stored? → pgvector in PostgreSQL; migrate to Pinecone/Qdrant at massive scale
  • Is embedding word-by-word? → No — whole chunk → one vector (1536 dims); semantic meaning compressed
  • What similarity formula? → Cosine similarity; pgvector uses <=> (cosine distance, smaller = more similar)