AI Document Assistant — System Flow

Two phases:
① Ingestion Phase (async — runs once after upload)
② Query Phase (real-time — on every question)

Two separate OpenAI API calls — the Embedding model converts text → vectors; GPT generates the answer. They are NOT the same thing.

User
Uploads
PDF / TXT / MD
via frontend

Extract Text
PyMuPDF parses PDF
→ raw text string
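The extraction step can be sketched as below. `extract_text` is a hypothetical helper name; the PyMuPDF calls (`fitz.open`, `page.get_text()`) follow that library's standard API, and the TXT/MD branch is an assumption based on the accepted upload types.

```python
# Sketch of the extraction step. `extract_text` is a hypothetical
# helper; PDF parsing uses PyMuPDF (imported as `fitz`).
from pathlib import Path

def extract_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix in (".txt", ".md"):
        # Plain-text formats need no parsing
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            # Concatenate the raw text of every page
            return "\n".join(page.get_text() for page in doc)
    raise ValueError(f"unsupported file type: {suffix}")
```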

User
Types question:
'What is the
refund policy?'

POST /upload
· Validate type & size
· Generate doc_id
· Save file to S3
· DB: status=UPLOADED
→ return immediately
(async from here)
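The validate-and-register logic of POST /upload might look like this minimal sketch; the function name, size limit, and return shape are assumptions, and the S3 write and DB insert are elided.

```python
# Hedged sketch of the /upload validation step; `register_upload`,
# MAX_BYTES, and the return shape are assumptions, not the real handler.
import uuid
from pathlib import Path

ALLOWED = {".pdf", ".txt", ".md"}
MAX_BYTES = 20 * 1024 * 1024  # assumed size limit

def register_upload(filename: str, size: int) -> dict:
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED:
        raise ValueError(f"unsupported type: {suffix}")
    if size > MAX_BYTES:
        raise ValueError("file too large")
    doc_id = uuid.uuid4().hex  # generate doc_id
    # The real handler would save the file to S3 and write a DB row
    # with status=UPLOADED here, then return immediately.
    return {"doc_id": doc_id, "status": "UPLOADED"}
```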

pgvector
chunk_embeddings table
· doc_id
· user_id
· chunk_id
· embedding (1536 floats)
status → READY

Chunking
1000 chars / chunk
200 char overlap

→ [chunk_1,
chunk_2, ...]
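The chunking parameters above (1000-char chunks, 200-char overlap) amount to a sliding window; this is a sketch under those parameters, not the exact code.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Sliding window: consecutive chunks share `overlap` characters,
    # so each new chunk starts size - overlap chars after the last.
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```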

embed_text(chunk)
call OpenAI
Embedding API
model: text-embedding-3-small
→ 1536 floats
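A hedged sketch of `embed_text`, assuming the OpenAI v1 Python client (`client.embeddings.create`); passing the client in explicitly is a choice made here for testability, not necessarily how the real code is wired.

```python
# Sketch of embed_text, assuming the OpenAI v1 Python client.
def embed_text(client, text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return resp.data[0].embedding  # 1536-dim vector
```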

POST /ask
· JWT auth
· Check status=READY
· Load chunks from DB
(block if not READY)
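The readiness gate in POST /ask reduces to a status check; the status values come from the flow above, while the exception type is an assumption.

```python
# Sketch of the readiness gate: block questions until ingestion
# has finished (status UPLOADED → READY). Exception type assumed.
def ensure_ready(status: str) -> None:
    if status != "READY":
        raise RuntimeError(f"document not ready (status={status})")
```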

embed_text(question)
same Embedding API
→ 1536-dim vector

(same space as chunks
→ comparable)
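Because question and chunk vectors come from the same embedding model, they live in the same space and can be compared directly by cosine similarity; a pure-Python illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Both vectors come from the same embedding model, so the angle
    # between them is a meaningful relevance score.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```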

Answer + Citations
'The policy states...
[chunk_id=3]'

Citations returned:
· chunk_id
· preview text
· full_text
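The `[chunk_id=N]` markers in the answer can be pulled out with a regex to build the citation list; `build_citations` and the preview length are assumptions.

```python
import re

def build_citations(answer: str, chunks: dict[int, str]) -> list[dict]:
    # Find every [chunk_id=N] marker the model emitted (assumed format)
    cited_ids = {int(m) for m in re.findall(r"\[chunk_id=(\d+)\]", answer)}
    return [
        {
            "chunk_id": cid,
            "preview": chunks[cid][:80],  # preview length is an assumption
            "full_text": chunks[cid],
        }
        for cid in sorted(cited_ids)
        if cid in chunks
    ]
```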

pgvector cosine search
SELECT chunk_id
FROM chunk_embeddings
WHERE doc_id = :doc_id
ORDER BY embedding <=> :query_vec
LIMIT 8

Fallback: TF-IDF keyword
if vector unavailable
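The TF-IDF keyword fallback can be sketched as a tiny bag-of-words scorer; this is a minimal illustration of the idea, not the production implementation.

```python
import math
from collections import Counter

def tfidf_rank(query: str, chunks: list[str], k: int = 8) -> list[int]:
    # Minimal TF-IDF fallback: score each chunk by the summed
    # tf-idf weight of the query terms it contains.
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    df = Counter()  # document frequency of each term
    for words in docs:
        df.update(set(words))
    scores = []
    for i, words in enumerate(docs):
        tf = Counter(words)
        score = sum(
            (tf[t] / len(words)) * math.log(n / df[t])
            for t in query.lower().split()
            if t in tf
        )
        scores.append((score, i))
    # Return the indices of the top-k scoring chunks
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```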

Top-k Chunks
Build context string:

[chunk_id=3]
chunk text...

[chunk_id=7]
more text...
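Building the context string from the top-k chunks is straightforward: tag each chunk with its id so the model can cite it. The function name is an assumption; the `[chunk_id=N]` format matches the layout above.

```python
def build_context(chunks: dict[int, str]) -> str:
    # Tag each retrieved chunk with its id so the model can cite it
    return "\n\n".join(
        f"[chunk_id={cid}]\n{text}" for cid, text in chunks.items()
    )
```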

gpt-3.5-turbo
System prompt:
'Answer ONLY from
the context below'
Temp: 0.2
Cite: [chunk_id=N]
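The answer-generation call, sketched against the OpenAI v1 Python client (`client.chat.completions.create`); the system prompt paraphrases the lines above, so its exact wording is an assumption.

```python
# Sketch of the generation step, assuming the OpenAI v1 Python client.
# The system prompt paraphrases the diagram; exact wording is assumed.
SYSTEM_PROMPT = (
    "Answer ONLY from the context below. "
    "Cite sources as [chunk_id=N]."
)

def generate_answer(client, context: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,  # low temperature keeps answers grounded
        messages=[
            {"role": "system", "content": f"{SYSTEM_PROMPT}\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```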

Two OpenAI API calls

text-embedding-3-small
· converts text → 1536-dim vector
· used for BOTH chunks and questions
· same vector space = comparable by cosine distance
· NOT GPT, does NOT generate text

gpt-3.5-turbo
· reads context + question
· generates the answer text
· never sees raw vectors