AI Document Assistant — RAG Interview Prep
Based on the actual implementation in docs-assistant/backend.
Biggest Challenge: Retrieval Miss
“The answer is in the document, but the system says I don’t know.”
This is the hardest problem in RAG — called a Retrieval Miss. If the right chunk never makes it into top-k, GPT cannot give the correct answer no matter how good the prompt is.
When it happens:
- User asks: "How many days for a refund?"
- Document says: "All payments will be returned within 5–7 business days"
- Different wording → embedding similarity too low → chunk not in top-8 → GPT says "I don't know"
Why it’s the hardest:
Retrieval Miss → GPT never sees the right chunk → answer is always wrong
↑
most upstream failure — no prompt tuning can fix this
How to fix (production):
| Solution | Approach |
|---|---|
| Hybrid Search | Merge vector + TF-IDF results (union), not fallback |
| Reranking | Retrieve top-20, use cross-encoder to re-rank, keep top-8 |
| Semantic chunking | Split on paragraph/heading boundaries, not fixed 1000 chars |
| Eval pipeline | Build test set, measure recall@k to know where the system is missing |
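The hybrid-search row above can be sketched in a few lines. This is a hypothetical helper (the function name and the interleaving strategy are illustrative, not part of the current codebase); it unions the two ranked ID lists instead of treating keyword search as a fallback:

```python
from itertools import zip_longest

def hybrid_retrieve(vector_ids, keyword_ids, top_k=8):
    """Interleave the two ranked lists so both retrievers contribute
    to the final top-k, dropping duplicates but preserving rank order."""
    merged, seen = [], set()
    for pair in zip_longest(vector_ids, keyword_ids):
        for chunk_id in pair:
            if chunk_id is not None and chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged[:top_k]
```

A production version would merge by normalized score (e.g., reciprocal rank fusion) rather than simple interleaving, but the key point is the union: a chunk found by either retriever can reach the LLM.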
Interview answer:
“The biggest challenge is retrieval recall — the answer exists in the document but wasn’t retrieved. Fixed 1000-char chunking can split a semantic unit in half, lowering similarity scores. I’d use hybrid search to union vector and keyword results, add a reranker for precision, and switch to paragraph-boundary chunking.”
What is RAG?
RAG (Retrieval-Augmented Generation) = don’t send the whole document to the LLM. Instead:
- Chunk the document into small passages
- Embed each chunk as a vector (semantic meaning → numbers)
- At query time: embed the question, find the most similar chunks
- Send only those top-k chunks as context to the LLM
- LLM answers only from that context — not from its training data
Document
↓ chunk (1000 chars, 200 overlap)
[chunk_1, chunk_2, ..., chunk_n]
↓ embed (OpenAI text-embedding model)
pgvector (stored in PostgreSQL)
Query
↓ embed question
↓ cosine similarity search (<=> operator)
top-k chunks
↓ build context string
↓ GPT: "answer only from context"
Answer + Citations
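The chunking step in the diagram can be sketched as follows. This is a simplified stand-in for the real ingestion code, assuming plain character-based splitting with the stated 1000/200 parameters:

```python
def chunk_text(text, size=1000, overlap=200):
    """Fixed-size chunking: each chunk starts (size - overlap) chars
    after the previous one, so neighbours share `overlap` chars."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last chunk reached the end of the text
    return chunks
```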
How is it implemented in this project?
Ingestion (/rag/ingest)
```python
# For each chunk:
emb = await embed_text(chunk_text)  # OpenAI embedding API
save_embeddings_for_document(db, doc_id, user_id, [(chunk_id, emb)])
# Stored in pgvector: chunk_embeddings table
```

Retrieval (ask.py)
```python
# Prefer semantic (vector) retrieval
question_embedding = await embed_text(request.question)
chunk_ids = vector_retrieve_top_k(db, doc_id, user_id, question_embedding, top_k)

# Fall back to keyword (TF-IDF) if vector fails
if not scored_chunks:
    scored_chunks = keyword_retrieve_top_k(request.question, chunks, top_k)
```

pgvector query (embedding_store.py)
```sql
SELECT chunk_id FROM chunk_embeddings
WHERE doc_id = :doc_id AND user_id = :user_id
ORDER BY embedding <=> CAST(:emb AS vector)  -- cosine distance
LIMIT :top_k
```

Hard Questions & Answers
Q: How do you ensure context coherence?
Two levels:
1. Within a single query — chunking overlap
Chunks use 200-character overlap. If an answer spans a chunk boundary, the overlap ensures neither chunk is missing critical context.
chunk_1: "...the policy states that employees must submit..."
↑ 200-char overlap
chunk_2: "...must submit receipts within 30 days of purchase..."
Without overlap, the answer could be split across two chunks and retrieval might miss half of it.
2. Across multi-turn conversation — current limitation
The current system treats each question as independent — there’s no conversation history. If the user asks:
- Q1: “Who wrote this document?”
- Q2: “What did he say about pricing?”
Q2 doesn’t know “he” refers to the author from Q1.
Production fix: Include the last N turns in the prompt:
```python
history = [
    {"role": "user", "content": "Who wrote this document?"},
    {"role": "assistant", "content": "John Smith."},
]
# Prepend history to current context before calling GPT
```

This is the planned next step for the project.
Q: How does the RAG architecture work end to end?
Upload → chunk → embed → store in pgvector
↓
Query → embed question → cosine search → top-k chunks
↓
build context → GPT (answer only from context)
↓
answer + citations [chunk_id=N]
Two-phase design:
- Ingestion (happens once after upload, async)
- Retrieval (happens on every query, real-time)
Q: What is Cosine Similarity?
The cosine of the angle between two vectors — measures how similar their directions are, regardless of length.
| Angle | cos value | Meaning |
|---|---|---|
| 0° | 1 | Same direction → same semantic meaning |
| 90° | 0 | Perpendicular → unrelated |
| 180° | -1 | Opposite direction → opposite meaning |
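The three rows of the table can be checked with a few lines of Python. A minimal implementation of the formula, with the angle cases from the table (the vectors are arbitrary examples):

```python
import math

def cosine_similarity(a, b):
    """cos of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))   # 1.0  -> same direction
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  -> perpendicular
print(cosine_similarity([1, 0], [-2, 0]))  # -1.0 -> opposite
```

Note that [1, 0] and [2, 0] have different lengths but still score 1.0: cosine ignores magnitude, which is exactly why it is preferred here.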
Why cosine instead of Euclidean distance?
Euclidean distance measures straight-line distance — affected by vector length. Cosine only looks at direction, so a long and short version of the same sentence still score 1.0:
```python
chunk_vec = [0.6, 0.8]     # length 1.0
question_vec = [0.3, 0.4]  # length 0.5, but same direction
# Euclidean distance: 0.5 ← thinks they're different
# Cosine similarity: 1.0 ← correct, direction is identical
```

pgvector uses <=> which is cosine distance = 1 − similarity, so smaller = more similar:

```sql
ORDER BY embedding <=> query  -- smaller = more similar
LIMIT 8
```

Q: Why use vector retrieval instead of keyword search (TF-IDF)?
| | TF-IDF (keyword) | Vector (semantic) |
|---|---|---|
| "What is the refund policy?" | finds chunks with exact words "refund", "policy" | finds chunks about money-back, returns, cancellation — even if exact words differ |
| Speed | fast, no GPU needed | slightly slower (embedding API call) |
| Cost | free | OpenAI API cost per query |
| Implementation | custom, debuggable | requires embedding model + vector DB |
This project uses vector as primary, TF-IDF as fallback — best of both.
Q: What is pgvector? Why use it instead of Pinecone?
pgvector is a PostgreSQL extension that adds a vector data type and the <=> cosine distance operator. It lets you do semantic search inside your existing Postgres database.
Why pgvector over Pinecone:
- Already have PostgreSQL — no extra service to manage
- Data isolation is simpler — user ownership + doc scoping in same DB
- For this scale (thousands of chunks, not billions), Postgres is fast enough
- Pinecone makes more sense at massive scale or if you need managed infra
Q: How do you prevent hallucination?
Three layers:
1. Grounded prompt — the system prompt tells GPT: "Answer ONLY from the provided context. If the answer is not in the context, say 'I don't know based on the document.'"
2. Empty-retrieval gate — if `scored_chunks` is empty, return "I don't know" immediately without calling GPT at all
3. Citations — the answer includes `[chunk_id=N]` tags; the backend extracts which chunks were actually cited and returns them to the user so they can verify
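The first two layers can be sketched together. This is illustrative only: the prompt wording follows the description above, and `call_gpt` is a stand-in for the real chat-completion call:

```python
SYSTEM_PROMPT = (
    "Answer ONLY from the provided context. "
    "If the answer is not in the context, say "
    "'I don't know based on the document.'"
)

def answer_question(scored_chunks, question, call_gpt):
    # Layer 2: empty-retrieval gate -- never call GPT with no evidence
    if not scored_chunks:
        return "I don't know based on the document."
    # Tag each chunk so the model can cite it (layer 3)
    context = "\n\n".join(f"[chunk_id={cid}] {text}" for cid, text in scored_chunks)
    # Layer 1: grounded prompt constrains GPT to the retrieved context
    return call_gpt(system=SYSTEM_PROMPT,
                    user=f"Context:\n{context}\n\nQuestion: {question}")
```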
Q: How do citations work?
The LLM is instructed to include [chunk_id=N] in its answer when citing a chunk. After generation:
```python
import re

CITATION_PATTERN = re.compile(r"\[chunk_id=(\d+)\]")

def _build_citations(answer, chunks):
    cited_ids = set(int(m) for m in CITATION_PATTERN.findall(answer))
    # Return preview + full_text for each cited chunk
    # (the "chunk_id" field name here is illustrative)
    return [c for c in chunks if c["chunk_id"] in cited_ids]
```

This makes answers verifiable — the user can see exactly which passage the answer came from.
Q: How do you choose top-k?
Default top_k = 8 (configurable via ASK_TOP_K env var).
Trade-off:
- Too small → may miss relevant context, incomplete answers
- Too large → more tokens sent to GPT = higher cost + more noise in context
Production improvement: dynamic top-k based on score distribution — stop adding chunks once similarity score drops below a threshold.
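A hedged sketch of that improvement (the 0.75 threshold is illustrative; a real system would tune it against an eval set):

```python
def dynamic_top_k(scored_chunks, max_k=8, min_score=0.75):
    """Take chunks in descending similarity order, stopping as soon as
    the score drops below the threshold or max_k is reached."""
    selected = []
    for chunk_id, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        if score < min_score or len(selected) >= max_k:
            break
        selected.append(chunk_id)
    return selected
```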
Q: What happens if the document is not yet ingested when the user asks?
Status machine check before retrieval:
- UPLOADED → still processing → return "still processing"
- PROCESSING → ingestion running → return "still processing"
- READY → proceed with retrieval
- FAILED → return error
The Q&A endpoint is gated on READY status — it won’t attempt retrieval until embeddings are fully stored.
Q: How would you scale this for very large documents?
- Async ingestion — already implemented via background worker (Celery), upload returns immediately
- Streaming answers — the /ask/stream endpoint returns SSE deltas so users see partial answers as they're generated
- Limit token budget — cap total context tokens (e.g., 4000 tokens) rather than a fixed chunk count
- Hierarchical chunking — small chunks for retrieval, larger parent chunks for context
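The token-budget idea can be sketched as follows. The ~4 chars/token heuristic is an assumption standing in for a real tokenizer (e.g., tiktoken):

```python
def build_context(chunks, max_tokens=4000):
    """Add chunks in retrieval order until the token budget is spent,
    instead of always taking a fixed number of chunks."""
    context, used = [], 0
    for text in chunks:
        cost = len(text) // 4 + 1  # crude ~4 chars/token estimate
        if used + cost > max_tokens:
            break  # budget exhausted; remaining chunks are dropped
        context.append(text)
        used += cost
    return "\n\n".join(context)
```

This caps cost and noise for long documents: many small relevant chunks can fit, but a few huge chunks won't blow past the model's context window.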
Q: Where is the vector database stored?
Currently (this project): pgvector extension inside PostgreSQL — vectors live in chunk_embeddings table alongside regular business data. One DB for everything, ownership/isolation handled by user_id + doc_id.
PostgreSQL
├── users, documents, chunks ← regular data
└── chunk_embeddings ← vector data (1536 floats per row)
When to switch to a dedicated vector DB (Pinecone, Qdrant):
- Millions of documents / chunks
- Need sub-millisecond ANN (Approximate Nearest Neighbor) search
- pgvector does exact cosine search — slower at massive scale
Interview answer:
“I store vectors in PostgreSQL via pgvector — same DB as business data, simpler ownership isolation. At larger scale I’d migrate to Pinecone or Qdrant which are optimized for ANN search.”
Q: Is embedding word-by-word or the whole chunk?
Whole chunk → one vector. Not word by word.
chunk: "The refund policy states customers may return products within 30 days..."
↓ embed_text(whole chunk)
one vector: [0.023, -0.145, 0.891, ...] ← 1536 numbers
The entire chunk’s semantic meaning is compressed into 1536 numbers. That’s the power — “退款” and “refund” produce similar vectors even though the characters are completely different.
Numbers for this project:
100-page document → ~50,000 chars
↓ chunk (1000 chars, 200 overlap → stride 800 chars)
→ ~63 chunks → 63 vectors stored in pgvector
User question → 1 vector
↓ cosine compare against all ~63 vectors
→ top 8 most similar chunks → sent to GPT
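The chunk arithmetic above is easy to sanity-check: with a 200-char overlap, each new chunk advances only stride = size − overlap = 800 chars, so a document yields more chunks than plain length ÷ 1000 would suggest:

```python
import math

def chunk_count(total_chars, size=1000, overlap=200):
    """Number of chunks produced by fixed-size chunking with overlap."""
    step = size - overlap  # each chunk advances 800 chars
    if total_chars <= size:
        return 1
    # first chunk covers `size` chars; each further chunk adds `step`
    return math.ceil((total_chars - size) / step) + 1
```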
Q: What formula/model does the embedding use?
The embedding model (text-embedding-3-small) is a Transformer neural network — not a simple formula. It takes text → outputs 1536 floats. Internally it runs tokenization → Attention + MLP layers → normalize to unit vector.
The similarity formula is cosine similarity:
| Result | Meaning |
|---|---|
| 1 | Same direction → same semantic meaning |
| 0 | Perpendicular → unrelated |
| -1 | Opposite direction → opposite meaning |
pgvector uses <=> which is cosine distance = 1 − similarity, so smaller = more similar:
```sql
ORDER BY embedding <=> query  -- smaller distance = more similar
LIMIT 8
```

Interview answer:
“Similarity is measured with cosine similarity. The embedding model itself is OpenAI’s
text-embedding-3-small, a Transformer that compresses text semantics into a 1536-dim unit vector.”
One-liner answers (for speed rounds)
| Question | Answer |
|---|---|
| What is RAG? | Retrieve relevant document passages, send them as context, LLM answers only from context |
| How do you ensure coherence? | 200-char chunk overlap for within-query; multi-turn history (planned) for cross-query |
| Why pgvector? | Already have Postgres, same DB for ownership/isolation, sufficient for this scale |
| Why vector over TF-IDF? | Semantic match — finds relevant chunks even when exact words differ |
| How do you prevent hallucination? | Grounded prompt + empty-retrieval gate + citations |
| What’s the latency bottleneck? | Embedding API call + LLM generation; chunking/retrieval is fast |
| Where are vectors stored? | pgvector in PostgreSQL; migrate to Pinecone/Qdrant at massive scale |
| Is embedding word-by-word? | No — whole chunk → one vector (1536 dims); semantic meaning compressed |
| What similarity formula? | Cosine similarity; pgvector uses <=> (cosine distance, smaller = more similar) |