AI Document Assistant — RAG Interview Prep

Based on the actual implementation in docs-assistant/backend.


Biggest Challenge: Retrieval Miss

“The answer is in the document, but the system says I don’t know.”

This is the hardest problem in RAG — a Retrieval Miss. If the right chunk never makes it into the top-k, GPT cannot give the correct answer, no matter how good the prompt is.

When it happens:

  • User asks: "How many days for a refund?"
  • Document says: "All payments will be returned within 5–7 business days"
  • Different wording → embedding similarity too low → chunk not in top-8 → GPT says “I don’t know”

Why it’s the hardest:

Retrieval Miss → GPT never sees the right chunk → answer is always wrong
      ↑
 most upstream failure — no prompt tuning can fix this

How to fix (production):

Solution            Approach
Hybrid Search       Merge vector + TF-IDF results (union), not fallback
Reranking           Retrieve top-20, use cross-encoder to re-rank, keep top-8
Semantic chunking   Split on paragraph/heading boundaries, not fixed 1000 chars
Eval pipeline       Build a test set, measure recall@k to know where the system is missing

Interview answer:

“The biggest challenge is retrieval recall — the answer exists in the document but wasn’t retrieved. Fixed 1000-char chunking can split a semantic unit in half, lowering similarity scores. I’d use hybrid search to union vector and keyword results, add a reranker for precision, and switch to paragraph-boundary chunking.”
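The "union, not fallback" idea behind hybrid search can be sketched with Reciprocal Rank Fusion. This is an illustrative sketch, not the project's code; the function name and the RRF constant k=60 (a common default) are assumptions:

```python
def hybrid_merge(vector_ids, keyword_ids, top_k=8, k=60):
    """Merge two ranked lists of chunk ids with Reciprocal Rank Fusion.

    A chunk that appears in both lists collects credit from both ranks,
    so semantic and exact-word matches are combined (union), not chained
    as a fallback. Illustrative sketch, not the project's actual code.
    """
    scores = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# chunk 7 appears in both ranked lists, so it rises to the top
merged = hybrid_merge([3, 7, 1], [7, 9, 3], top_k=3)
```

A chunk missed by the embedding but hit by exact keywords (the refund example above) still makes the merged list — which is exactly the retrieval-miss fix.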


What is RAG?

RAG (Retrieval-Augmented Generation) = don’t send the whole document to the LLM. Instead:

  1. Chunk the document into small passages
  2. Embed each chunk as a vector (semantic meaning → numbers)
  3. At query time: embed the question, find the most similar chunks
  4. Send only those top-k chunks as context to the LLM
  5. LLM answers only from that context — not from its training data
Document
  ↓ chunk (1000 chars, 200 overlap)
[chunk_1, chunk_2, ..., chunk_n]
  ↓ embed (OpenAI text-embedding model)
pgvector (stored in PostgreSQL)

Query
  ↓ embed question
  ↓ cosine similarity search (<=> operator)
top-k chunks
  ↓ build context string
  ↓ GPT: "answer only from context"
Answer + Citations

How is it implemented in this project?

Ingestion (/rag/ingest)

# For each chunk:
emb = await embed_text(chunk_text)           # OpenAI embedding API
save_embeddings_for_document(db, doc_id, user_id, [(chunk_id, emb)])
# Stored in pgvector: chunk_embeddings table

Retrieval (ask.py)

# Prefer semantic (vector) retrieval
question_embedding = await embed_text(request.question)
chunk_ids = vector_retrieve_top_k(db, doc_id, user_id, question_embedding, top_k)
# chunk_ids are resolved into scored_chunks (lookup elided here)

# Fall back to keyword (TF-IDF) if vector retrieval returns nothing
if not scored_chunks:
    scored_chunks = keyword_retrieve_top_k(request.question, chunks, top_k)

pgvector query (embedding_store.py)

SELECT chunk_id FROM chunk_embeddings
WHERE doc_id = :doc_id AND user_id = :user_id
ORDER BY embedding <=> CAST(:emb AS vector)   -- cosine distance
LIMIT :top_k

Hard Questions & Answers


Q: How do you ensure context coherence?

Two levels:

1. Within a single query — chunking overlap

Chunks use 200-character overlap. If an answer spans a chunk boundary, the overlap ensures neither chunk is missing critical context.

chunk_1: "...the policy states that employees must submit..."
                                              ↑ 200-char overlap
chunk_2: "...must submit receipts within 30 days of purchase..."

Without overlap, the answer could be split across two chunks and retrieval might miss half of it.
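The overlapping chunker can be sketched as follows; the sizes match the project's 1000/200 settings, but the function itself is illustrative:

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so a boundary-spanning answer appears whole
    in at least one chunk. Illustrative sketch of the approach."""
    stride = size - overlap  # each chunk starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 50_000)            # roughly a 100-page document
assert chunks[0][-200:] == chunks[1][:200]   # neighbours share 200 chars
```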

2. Across multi-turn conversation — current limitation

The current system treats each question as independent — there’s no conversation history. If the user asks:

  • Q1: “Who wrote this document?”
  • Q2: “What did he say about pricing?”

Q2 doesn’t know “he” refers to the author from Q1.

Production fix: Include the last N turns in the prompt:

history = [
    {"role": "user", "content": "Who wrote this document?"},
    {"role": "assistant", "content": "John Smith."},
]
# Prepend history to current context before calling GPT

This is the planned next step for the project.


Q: How does the RAG architecture work end to end?

Upload → chunk → embed → store in pgvector
                                    ↓
Query  → embed question → cosine search → top-k chunks
                                    ↓
         build context → GPT (answer only from context)
                                    ↓
                         answer + citations [chunk_id=N]

Two-phase design:

  • Ingestion (happens once after upload, async)
  • Retrieval (happens on every query, real-time)

Q: What is Cosine Similarity?

The cosine of the angle between two vectors — measures how similar their directions are, regardless of length.

Angle    cos value   Meaning
0°       1           Same direction → same semantic meaning
90°      0           Perpendicular → unrelated
180°     -1          Opposite direction → opposite meaning

Why cosine instead of Euclidean distance?

Euclidean distance measures straight-line distance — affected by vector length. Cosine only looks at direction, so a long and short version of the same sentence still score 1.0:

chunk_vec    = [0.6, 0.8]   # length 1.0
question_vec = [0.3, 0.4]   # length 0.5, but same direction
 
# Euclidean: 0.5   ← thinks they're different
# Cosine:    1.0   ← correct, direction is identical
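The numbers above can be checked directly in plain Python:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.hypot(*(x - y for x, y in zip(a, b)))

chunk_vec    = [0.6, 0.8]   # length 1.0
question_vec = [0.3, 0.4]   # length 0.5, same direction

# euclidean(chunk_vec, question_vec) ≈ 0.5  -> looks "different"
# cosine(chunk_vec, question_vec)    ≈ 1.0  -> same direction
```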

pgvector uses <=> which is cosine distance = 1 − similarity, so smaller = more similar:

ORDER BY embedding <=> query   -- smaller = more similar
LIMIT 8

Q: Why use vector retrieval instead of keyword search (TF-IDF)?

                                TF-IDF (keyword)                      Vector (semantic)
"What is the refund policy?"    finds chunks with exact words         finds chunks about money-back, returns,
                                "refund", "policy"                    cancellation — even if exact words differ
Speed                           fast, no GPU needed                   slightly slower (embedding API call)
Cost                            free                                  OpenAI API cost per query
Implementation                  custom, debuggable                    requires embedding model + vector DB

This project uses vector as primary, TF-IDF as fallback — best of both.


Q: What is pgvector? Why use it instead of Pinecone?

pgvector is a PostgreSQL extension that adds a vector data type and the <=> cosine distance operator. It lets you do semantic search inside your existing Postgres database.

Why pgvector over Pinecone:

  • Already have PostgreSQL — no extra service to manage
  • Data isolation is simpler — user ownership + doc scoping in same DB
  • For this scale (thousands of chunks, not billions), Postgres is fast enough
  • Pinecone makes more sense at massive scale or if you need managed infra

Q: How do you prevent hallucination?

Three layers:

  1. Grounded prompt — system prompt tells GPT: “Answer ONLY from the provided context. If the answer is not in the context, say ‘I don’t know based on the document.’”

  2. Empty retrieval gate — if scored_chunks is empty, return "I don't know" immediately without calling GPT at all

  3. Citations — answer includes [chunk_id=N] tags; the backend extracts which chunks were actually cited, returns them to the user so they can verify
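Layers 1 and 2 can be sketched together. The handler shape, message string, and names below are illustrative, not the project's actual code:

```python
IDK = "I don't know based on the document."

def answer_question(scored_chunks, call_gpt):
    """Sketch of the anti-hallucination gate; names are illustrative."""
    # Layer 2: nothing retrieved -> refuse immediately, no LLM call spent
    if not scored_chunks:
        return {"answer": IDK, "citations": []}
    # Layer 1: context is labelled with chunk ids so GPT can cite them,
    # and travels with the grounded "answer ONLY from context" prompt
    context = "\n\n".join(f"[chunk_id={cid}] {text}" for cid, text in scored_chunks)
    return call_gpt(context)
```

The gate matters because an LLM given an empty context will still happily improvise; refusing before the call is both safer and cheaper.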


Q: How do citations work?

The LLM is instructed to include [chunk_id=N] in its answer when citing a chunk. After generation:

import re

CITATION_PATTERN = re.compile(r"\[chunk_id=(\d+)\]")

def _build_citations(answer, chunks):
    cited_ids = {int(m) for m in CITATION_PATTERN.findall(answer)}
    # Return preview + full_text for each cited chunk (field names illustrative)
    return [{"preview": c.text[:200], "full_text": c.text} for c in chunks if c.chunk_id in cited_ids]

This makes answers verifiable — the user can see exactly which passage the answer came from.


Q: How do you choose top-k?

Default top_k = 8 (configurable via ASK_TOP_K env var).

Trade-off:

  • Too small → may miss relevant context, incomplete answers
  • Too large → more tokens sent to GPT = higher cost + more noise in context

Production improvement: dynamic top-k based on score distribution — stop adding chunks once similarity score drops below a threshold.
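That score-distribution cut can be sketched as follows; the thresholds are made-up values that would be tuned on the eval set, not project settings:

```python
def dynamic_top_k(scored, max_k=8, min_score=0.75, max_drop=0.15):
    """Keep chunks while similarity stays strong; stop at a sharp drop.

    `scored` is a (chunk_id, similarity) list sorted best-first.
    The thresholds are illustrative, to be tuned on an eval set.
    """
    kept = []
    for i, (chunk_id, score) in enumerate(scored[:max_k]):
        if score < min_score:
            break                              # absolute quality floor
        if kept and scored[i - 1][1] - score > max_drop:
            break                              # score fell off a cliff
        kept.append(chunk_id)
    return kept

# scores collapse after the second chunk, so only two are kept
ids = dynamic_top_k([(1, 0.92), (2, 0.90), (3, 0.70), (4, 0.41)])
```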


Q: What happens if the document is not yet ingested when the user asks?

Status machine check before retrieval:

  • UPLOADED → still processing → return “still processing”
  • PROCESSING → ingestion running → return “still processing”
  • READY → proceed with retrieval
  • FAILED → return error

The Q&A endpoint is gated on READY status — it won’t attempt retrieval until embeddings are fully stored.
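The status machine can be sketched like this; the enum values and reply strings are illustrative stand-ins for whatever the backend's models actually define:

```python
from enum import Enum

class DocStatus(str, Enum):
    UPLOADED = "uploaded"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"

def gate(status):
    """Return an early reply for non-READY docs, or None to proceed."""
    if status in (DocStatus.UPLOADED, DocStatus.PROCESSING):
        return "still processing"
    if status is DocStatus.FAILED:
        return "ingestion failed"
    return None  # READY: safe to run retrieval
```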


Q: How would you scale this for very large documents?

  1. Async ingestion — already implemented via background worker (Celery), upload returns immediately
  2. Streaming answers — /ask/stream endpoint returns SSE deltas so users see partial answers as they’re generated
  3. Limit token budget — cap total context tokens (e.g., 4000 tokens) rather than a fixed chunk count
  4. Hierarchical chunking — small chunks for retrieval, larger parent chunks for context
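Point 3, capping by token budget instead of chunk count, can be sketched as below. The chars/4 counter is a rough stand-in for a real tokenizer (e.g. tiktoken), and all names and numbers are illustrative:

```python
def cap_by_token_budget(scored_chunks, budget=4000,
                        count_tokens=lambda t: len(t) // 4):
    """Add chunks best-first until the token budget is spent.

    The default chars/4 counter is a rough stand-in for a real
    tokenizer (e.g. tiktoken); names and numbers are illustrative.
    """
    kept, used = [], 0
    for chunk_id, text in scored_chunks:
        cost = count_tokens(text)
        if used + cost > budget:
            break                 # budget replaces a fixed chunk count
        kept.append(chunk_id)
        used += cost
    return kept

# three 2000-token chunks against a 4000-token budget -> first two kept
ids = cap_by_token_budget([(1, "a" * 8000), (2, "b" * 8000), (3, "c" * 8000)])
```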

Q: Where is the vector database stored?

Currently (this project): pgvector extension inside PostgreSQL — vectors live in chunk_embeddings table alongside regular business data. One DB for everything, ownership/isolation handled by user_id + doc_id.

PostgreSQL
├── users, documents, chunks   ← regular data
└── chunk_embeddings           ← vector data (1536 floats per row)

When to switch to a dedicated vector DB (Pinecone, Qdrant):

  • Millions of documents / chunks
  • Need sub-millisecond ANN (Approximate Nearest Neighbor) search
  • pgvector does an exact cosine scan by default — slower at massive scale (it offers IVFFlat/HNSW ANN indexes, but dedicated stores are more turnkey)

Interview answer:

“I store vectors in PostgreSQL via pgvector — same DB as business data, simpler ownership isolation. At larger scale I’d migrate to Pinecone or Qdrant which are optimized for ANN search.”


Q: Is embedding word-by-word or the whole chunk?

Whole chunk → one vector. Not word by word.

chunk: "The refund policy states customers may return products within 30 days..."
                    ↓ embed_text(whole chunk)
one vector: [0.023, -0.145, 0.891, ...]   ← 1536 numbers

The entire chunk’s semantic meaning is compressed into 1536 numbers. That’s the power — “退款” and “refund” produce similar vectors even though the characters are completely different.

Numbers for this project:

100-page document → ~50,000 chars
  ↓ chunk (1000 chars, 200 overlap → 800-char stride)
  → ~62 chunks → ~62 vectors stored in pgvector

User question → 1 vector
  ↓ cosine compare against all ~62 vectors
  → top 8 most similar chunks → sent to GPT

Q: What formula/model does the embedding use?

The embedding model (text-embedding-3-small) is a Transformer neural network — not a simple formula. It takes text → outputs 1536 floats. Internally it runs tokenization → Attention + MLP layers → normalize to unit vector.

The similarity formula is cosine similarity:

Result   Meaning
1        Same direction → same semantic meaning
0        Perpendicular → unrelated
-1       Opposite direction → opposite meaning

pgvector uses <=> which is cosine distance = 1 − similarity, so smaller = more similar:

ORDER BY embedding <=> query   -- smaller distance = more similar
LIMIT 8

Interview answer:

“Similarity is measured with cosine similarity. The embedding model itself is OpenAI’s text-embedding-3-small, a Transformer that compresses text semantics into a 1536-dim unit vector.”


One-liner answers (for speed rounds)

  • What is RAG? → Retrieve relevant document passages, send them as context, LLM answers only from context
  • How do you ensure coherence? → 200-char chunk overlap for within-query; multi-turn history (planned) for cross-query
  • Why pgvector? → Already have Postgres, same DB for ownership/isolation, sufficient for this scale
  • Why vector over TF-IDF? → Semantic match — finds relevant chunks even when exact words differ
  • How do you prevent hallucination? → Grounded prompt + empty-retrieval gate + citations
  • What’s the latency bottleneck? → Embedding API call + LLM generation; chunking/retrieval is fast
  • Where are vectors stored? → pgvector in PostgreSQL; migrate to Pinecone/Qdrant at massive scale
  • Is embedding word-by-word? → No — whole chunk → one vector (1536 dims); semantic meaning compressed
  • What similarity formula? → Cosine similarity; pgvector uses <=> (cosine distance, smaller = more similar)