AI Document Assistant — System Flow
Two phases: Ingestion (once on upload) · Query (every question)
Two separate OpenAI API calls: the embedding model converts text → vector; GPT generates the answer. They are NOT the same thing.
[Ingestion] User uploads PDF / TXT / MD via the frontend
[Ingestion] Extract Text: PyMuPDF parses the PDF → raw text string
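A sketch of this extraction step. PyMuPDF (imported as `fitz`) is what the flow names for PDFs; treating TXT/MD as plain UTF-8 reads is an assumption.

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Return the raw text of an uploaded file.

    PDFs go through PyMuPDF, as in the flow above; reading TXT/MD
    as plain UTF-8 is an assumption.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            # Concatenate the text of every page into one raw string.
            return "\n".join(page.get_text() for page in doc)
    if suffix in (".txt", ".md"):
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"unsupported file type: {suffix}")
```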
[Query] User types a question: 'What is the refund policy?'
[Ingestion] POST /upload
· Validate type & size
· Generate doc_id
· Save file to S3
· DB: status=UPLOADED
→ return immediately (processing continues async from here)
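A sketch of the synchronous part of this endpoint. The extension whitelist and the 20 MB cap are assumptions (the flow only says 'validate type & size'); the S3 write, DB row, and async hand-off appear as comments because they depend on infrastructure not shown here.

```python
import uuid

# Hypothetical limits -- the flow above only says "validate type & size".
ALLOWED_EXTENSIONS = {".pdf", ".txt", ".md"}
MAX_SIZE_BYTES = 20 * 1024 * 1024  # assumed 20 MB cap

def handle_upload(filename: str, size_bytes: int) -> dict:
    """Validate the upload, mint a doc_id, and return immediately."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported type: {ext or filename}")
    if size_bytes > MAX_SIZE_BYTES:
        raise ValueError("file too large")
    doc_id = str(uuid.uuid4())
    # s3.put_object(Bucket=..., Key=doc_id, Body=...)   (save file to S3)
    # db.insert(doc_id=doc_id, status="UPLOADED")       (DB: status=UPLOADED)
    # queue.enqueue(process_document, doc_id)           (async from here)
    return {"doc_id": doc_id, "status": "UPLOADED"}
```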
[Ingestion] pgvector store: chunk_embeddings table
· doc_id · user_id · chunk_id · embedding (1536 floats)
Once every chunk is embedded and stored: status → READY
[Ingestion] Chunking: 1000 chars per chunk, 200-char overlap → [chunk_1, chunk_2, ...]
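The chunking step above (1000-char windows, 200-char overlap, i.e. a stride of 800) can be sketched as:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split raw text into fixed-size chunks with overlap.

    Matches the parameters above: 1000 chars per chunk, 200-char overlap,
    so each chunk starts 800 chars after the previous one.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```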
[Ingestion] embed_text(chunk): call the OpenAI Embedding API, model text-embedding-3-small → 1536 floats per chunk
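A hedged sketch of `embed_text` for chunks, assuming the `openai` v1 Python client and an `OPENAI_API_KEY` in the environment. Sending 100 chunks per request is an assumed batch size, not something the flow specifies.

```python
def batched(items: list, n: int) -> list[list]:
    """Split a list into batches of at most n items (the Embedding API
    accepts many inputs per request, so chunks are usually batched)."""
    return [items[i:i + n] for i in range(0, len(items), n)]

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed a list of chunks. Sketch only: assumes the `openai` v1
    client; nothing here runs at import time."""
    from openai import OpenAI
    client = OpenAI()
    vectors: list[list[float]] = []
    for batch in batched(texts, 100):  # assumed batch size
        resp = client.embeddings.create(
            model="text-embedding-3-small", input=batch
        )
        vectors.extend(d.embedding for d in resp.data)  # 1536 floats each
    return vectors
```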
[Query] POST /ask
· JWT auth
· Check status=READY (block the request if not READY)
· Load chunks from DB
[Query] embed_text(question): same Embedding API → 1536-dim vector (same space as the chunks, so they are directly comparable)
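Why "same space = comparable": pgvector's `<=>` operator computes cosine distance (1 minus cosine similarity), which is only meaningful when the question vector and the chunk vectors come from the same embedding model. A pure-Python equivalent of the operator:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """pgvector's `<=>`: cosine distance = 1 - cosine similarity.
    0 means same direction, 1 means orthogonal, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```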
[Query] Answer + Citations: 'The policy states... [chunk_id=3]'
Citations returned: · chunk_id · preview text · full_text
[Query] pgvector cosine search:
SELECT chunk_id ORDER BY embedding <=> query LIMIT 8
Fallback: TF-IDF keyword search if the vector index is unavailable
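The two retrieval paths side by side: the pgvector query written out as the SQL the service would run (the `doc_id`/`user_id` filter is an assumed scoping detail, based on the columns listed in the schema), and a minimal TF-IDF scorer as a stand-in for the keyword fallback.

```python
import math
from collections import Counter

# Primary path: parameters are bound by the DB driver at run time.
TOP_K_SQL = """
SELECT chunk_id
FROM chunk_embeddings
WHERE doc_id = %(doc_id)s AND user_id = %(user_id)s
ORDER BY embedding <=> %(query_vec)s
LIMIT 8
"""

def tfidf_top_k(question: str, chunks: dict[int, str], k: int = 8) -> list[int]:
    """Fallback path: a minimal TF-IDF scorer over already-loaded chunks
    (a sketch, not a library implementation)."""
    docs = {cid: text.lower().split() for cid, text in chunks.items()}
    n = len(docs)
    df = Counter()  # document frequency of each word
    for words in docs.values():
        df.update(set(words))
    q_words = set(question.lower().split())
    def score(words: list[str]) -> float:
        tf = Counter(words)
        return sum(
            (tf[w] / len(words)) * math.log(n / df[w])
            for w in q_words if w in tf
        )
    ranked = sorted(docs, key=lambda cid: score(docs[cid]), reverse=True)
    return ranked[:k]
```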
[Query] Top-k Chunks → build context string:
[chunk_id=3] chunk text...
[chunk_id=7] more text...
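Building the context string in the `[chunk_id=N]` format shown above, so GPT can cite each chunk by id:

```python
def build_context(chunks: list[tuple[int, str]]) -> str:
    """Format retrieved (chunk_id, text) pairs into one context string,
    tagging each chunk with its id so the model can cite it."""
    return "\n\n".join(f"[chunk_id={cid}]\n{text}" for cid, text in chunks)
```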
[Query] gpt-3.5-turbo: system prompt 'Answer ONLY from the context below' · temperature 0.2 · cite as [chunk_id=N]
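A sketch of the generation call, again assuming the `openai` v1 client. The exact layout of the user message is an assumption; the flow only fixes the system prompt, the temperature, and the citation format.

```python
SYSTEM_PROMPT = (
    "Answer ONLY from the context below. Cite sources as [chunk_id=N]."
)

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble the chat messages; pure and testable. The user-message
    layout is an assumption not fixed by the flow above."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

def answer(context: str, question: str) -> str:
    """Generation call: assumes the `openai` v1 client and an
    OPENAI_API_KEY in the environment; not run at import time."""
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,  # low temperature keeps answers close to the context
        messages=build_messages(context, question),
    )
    return resp.choices[0].message.content
```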
Two OpenAI API calls
text-embedding-3-small
· converts text → 1536-dim vector
· used for BOTH chunks and questions
· same vector space = comparable by cosine distance
· NOT GPT, does NOT generate text
gpt-3.5-turbo
· reads context + question
· generates the answer text
· never sees raw vectors