As a disclaimer: I built this project without spending any money on Voyage, Grok or Clerk (for auth), so the whole implementation runs on their free tiers. The principles remain the same no matter what models you use. Find the project here.

I wanted to understand RAG because it is one of those things that sounds simple until you look at the words people use when explaining it.

Embeddings.

Vectors.

Cosine distance.

Vector databases.

Rerankers.

HNSW.

And usually the explanation goes something like this:

Split the document into chunks,
embed them,
store them in a vector database,
then retrieve the relevant context.

Which is not really an explanation.

It is just a list of things you are now supposed to understand.

So this is how I think about RAG now, after building a PDF chat application.

The user uploads a PDF.
The application reads the PDF locally.
The application splits the PDF into smaller pieces locally.
The application sends those smaller pieces to an embeddings API.
The embeddings API returns numbers.
Those numbers are saved in PostgreSQL next to the original text.
Later, when the user asks a question, we turn that question into numbers too.
Then PostgreSQL finds the stored chunks whose numbers are closest to the question numbers.
Those chunks are sent to the LLM.
The LLM writes the answer and we stream it back to the user.

That is RAG.

The whole thing in one picture

Before explaining every part, this is the full mental model:

PDF file
↓
Read PDF bytes locally in Node.js
↓
Extract text locally with unpdf
↓
Split text locally into chunks
↓
Send chunks to Voyage AI
↓
Voyage returns one vector per chunk
↓
Save chunk text + vector in PostgreSQL
↓
User asks a question
↓
Send only the question to Voyage AI
↓
Voyage returns a fresh question vector
↓
pgvector compares that vector with stored chunk vectors
↓
PostgreSQL returns the closest chunks
↓
Voyage reranks the possible chunks
↓
Send the best chunks to the LLM
↓
LLM writes an answer using those chunks

There are three important places where work happens:

Locally in our worker:
Read PDFs
Extract text
Split text into chunks
Save data in PostgreSQL

Externally in Voyage AI:
Create document embeddings
Create question embeddings
Rerank retrieved chunks

Inside PostgreSQL with pgvector:
Store vectors
Compare vectors
Find similar chunks quickly

This distinction helped me a lot.

Chunking is not done by Voyage.

PDF parsing is not done by Voyage.

PostgreSQL does not create embeddings.

Voyage does not store our PDFs.

Each part has one job.

What problem does RAG solve?

Imagine the user uploads a 100-page car manual and asks:

How long is the warranty?

The LLM does not automatically know what is inside that specific PDF.

You could send the entire PDF to the LLM every time the user asks a question.

For a small document, this can be a good idea.

But for large documents, or many documents, it becomes a problem.

Large prompt
↓
More tokens
↓
More cost
↓
Slower answer
↓
More irrelevant information
↓
Harder citations

The user does not need the entire manual to answer a warranty question.

The user probably needs two or three paragraphs.

So RAG does this:

User asks:
"How long is the warranty?"

↓

Find the warranty paragraphs

↓

Give those paragraphs to the LLM

↓

Ask the LLM to answer only from those paragraphs

RAG means Retrieval-Augmented Generation.

Retrieval
Find useful text.

Augmented
Add that useful text to the prompt.

Generation
Let the LLM write the answer.

The important thing is that RAG is not a model.

RAG is a pipeline around a model.

The LLM is still the part that writes words.

RAG is the part that finds the information the LLM should use.

First: a PDF is not searchable text yet

When a user uploads a PDF, we have a file.

At first, it is just bytes.

PDF file
↓
Raw bytes

In the worker, we read those bytes from storage.

const bytes = await readDocumentFile(document.storagePath);
parsed = await parsePdf(bytes);

The first line reads the original PDF file from disk.

const bytes = await readDocumentFile(document.storagePath);

bytes means the raw file contents.

Not text yet.

Not paragraphs yet.

Not chunks yet.

Just the actual PDF file data.

Then we parse it.

parsed = await parsePdf(bytes);

In my project, parsePdf uses unpdf.

unpdf reads the text layer inside the PDF and gives us text page by page.

Conceptually, the result looks like this:

{
  pages: [
    {
      pageNumber: 1,
      text: "Welcome to the product manual..."
    },
    {
      pageNumber: 2,
      text: "Warranty coverage starts on the date..."
    }
  ]
}

This is an important point.

The PDF parser does not know anything about RAG.

It does not know embeddings.

It does not know vectors.

It just answers this question:

What text exists on every page of this PDF?

Why page numbers matter

Keeping page numbers looks like a small detail.

It is not.

Imagine the application gives this answer:

The warranty lasts for two years.

The user should be able to see where that answer came from.

Source: Car manual.pdf, page 14

The page number starts at parsing time.

PDF page
↓
Parsed page
↓
Chunk created from that page
↓
Chunk saved with page number
↓
Citation shown to user

If you lose this information early, citations become much harder later.

That is why the parser returns text per page instead of one giant string.

Not every PDF has readable text

Some PDFs look normal when you open them.

But they are actually just images.

For example, someone prints a contract, signs it, scans it and uploads the scanned version.

You can see words on the page.

But internally the PDF may contain this:

One large image

instead of this:

Selectable text

unpdf can read actual PDF text.

It does not perform OCR.

OCR means Optical Character Recognition.

OCR is the process that reads text from an image.

So, in the current version of the app:

Text PDF
✓ We can extract text
✓ We can chunk it
✓ We can create embeddings

Scanned image-only PDF
✗ No text to extract
✗ No chunks to create
✗ Document fails with "no OCR"

This is not a temporary error.

Retrying will not help.

The PDF needs OCR support, or the user needs to upload a text-based PDF.
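A minimal sketch of how a worker could tell the two cases apart, assuming the parsed shape shown earlier. The names `ParsedPdf`, `hasExtractableText`, `checkForText` and the 20-character threshold are my illustrative assumptions, not the project's actual code.

```typescript
// Hypothetical shape of the parser output: one entry per PDF page.
interface ParsedPage {
  pageNumber: number;
  text: string;
}

interface ParsedPdf {
  pages: ParsedPage[];
}

// Assumed threshold: fewer non-whitespace characters than this across the
// whole document is treated as "there is no text layer".
const MIN_TEXT_CHARS = 20;

function hasExtractableText(parsed: ParsedPdf): boolean {
  let total = 0;
  for (const page of parsed.pages) {
    total += page.text.replace(/\s+/g, "").length;
  }
  return total >= MIN_TEXT_CHARS;
}

// A permanent failure: retrying the job cannot make text appear.
type IngestCheck =
  | { ok: true }
  | { ok: false; reason: "no_ocr"; retryable: false };

function checkForText(parsed: ParsedPdf): IngestCheck {
  if (hasExtractableText(parsed)) {
    return { ok: true };
  }
  return { ok: false, reason: "no_ocr", retryable: false };
}
```

Marking the failure as not retryable is the point of the sketch: a job queue can then skip its retry schedule for this document.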

The first important RAG concept: chunks

A chunk is a small piece of a document.

That is all it is.

A PDF might have 100 pages.

You do not want to treat all 100 pages as one big thing.

Imagine a contract like this:

Page 1
Definitions

Page 2
Payment terms

Page 3
Late payment rules

Page 4
Termination rules

Page 5
Confidentiality

The user asks:

What happens if a payment is late?

The answer is probably in the late payment section.

It is not in the definition section.

It is not in the confidentiality section.

So we split the document into smaller pieces.

Chunk 1
Definitions

Chunk 2
Payment terms

Chunk 3
Late payment rules

Chunk 4
Termination rules

Chunk 5
Confidentiality

Now the application can retrieve chunk 3 instead of sending the entire contract to the LLM.

A chunk is not necessarily one paragraph.

It can contain several paragraphs.

It can contain a heading and paragraphs below it.

It can contain part of a longer section.

The goal is to make each piece small enough to be specific, but large enough to still make sense.

Chunking happens locally

This is important.

Chunking is not an AI call.

Chunking is not done by Voyage.

Chunking happens locally in the worker after parsing the PDF.

PDF pages
↓
Plain text
↓
Local JavaScript code
↓
Chunks

In my project, the chunking function is pure.

const chunks = chunkDocument(parsed);

Pure means:

It receives data.

It returns data.

It does not call an API.

It does not write to the database.

It does not modify something outside itself.

This makes it easier to test.

You can give it a parsed document and check the chunks it returns.

How the chunker decides where to split

You do not want to split text randomly.

This is bad:

Chunk 1
The warranty covers defects in materials and

Chunk 2
workmanship for a period of two years.

Neither chunk is useful by itself.

So the chunker tries to split at useful boundaries.

Paragraph boundary
↓
Line boundary
↓
Sentence boundary
↓
Word boundary
↓
Hard character limit

The hard character limit is the last option.

It only happens when there is no better place to split.

In the current project, a chunk can be around:

4000 characters

This is roughly 1000 tokens.

A token is not exactly a word.

But a rough approximation for English is:

4 characters ≈ 1 token

This is enough for chunking decisions.

The chunker does not need to know the exact token count used by every LLM.

It only needs a reasonable estimate.
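The boundary order and the token estimate can be sketched in a few lines. This is a simplified illustration, not the project's chunker: `findSplitPoint` and `estimateTokens` are hypothetical names, and the sketch takes the last occurrence of the best available boundary even if that leaves a short first piece.

```typescript
// Boundaries from most preferred to least preferred. We cut right after the
// last occurrence of the first separator that appears in the allowed range.
const BOUNDARIES = ["\n\n", "\n", ". ", " "];

// Rough estimate for English text: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns the index at which to cut `text` so that the first piece is at most
// `maxChars` long. Falls back to a hard cut when no boundary is found.
function findSplitPoint(text: string, maxChars: number): number {
  if (text.length <= maxChars) {
    return text.length;
  }
  const head = text.slice(0, maxChars);
  for (const separator of BOUNDARIES) {
    const at = head.lastIndexOf(separator);
    if (at > 0) {
      return at + separator.length;
    }
  }
  return maxChars; // hard character limit, the last resort
}
```

Calling `text.slice(0, findSplitPoint(text, 4000))` would give the first chunk, and the remainder can be processed the same way.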

Why chunks overlap

This is where things get more interesting.

A useful idea can start at the end of one chunk and finish at the beginning of the next one.

Imagine this document text:

The customer must pay every invoice within 30 days.

If payment remains unpaid for more than 30 days after its due date,
the supplier may terminate this agreement.

Termination does not remove the customer's obligation to pay all
amounts already due.

Without overlap, the chunks could look like this:

Chunk 10

The customer must pay every invoice within 30 days.

If payment remains unpaid for more than 30 days after its due date,
the supplier may terminate this agreement.

Chunk 11

Termination does not remove the customer's obligation to pay all
amounts already due.

Chunk 11 is missing context.

It says:

Termination does not remove...

But termination of what?

Why did termination happen?

If chunk 11 gets retrieved on its own, it is weaker.

With overlap, part of chunk 10 is repeated inside chunk 11.

Chunk 10

The customer must pay every invoice within 30 days.

If payment remains unpaid for more than 30 days after its due date,
the supplier may terminate this agreement.

Chunk 11

If payment remains unpaid for more than 30 days after its due date,
the supplier may terminate this agreement.

Termination does not remove the customer's obligation to pay all
amounts already due.

Now the repeated text is the overlap.

Chunk 11 can stand on its own.

It knows what type of termination it is talking about.

In my project, the target overlap is:

600 characters

The exact repeated text is not always exactly 600 characters because the chunker tries to keep safe boundaries like paragraphs and sentences.

The mental model is:

Chunk overlap is repeated context.

It repeats a small amount of text.

It prevents ideas from being cut in half.
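Here is a small sketch of overlap at paragraph level. It assumes the text is already split into paragraphs that each fit inside a chunk, and it carries whole trailing paragraphs into the next chunk. The function name and this paragraph-only strategy are my assumptions for illustration; the real chunker also works with sentences and words.

```typescript
// Packs paragraphs into chunks of at most `maxChars`, and starts each new
// chunk with trailing paragraphs of the previous one (up to `overlapChars`).
function chunkWithOverlap(
  paragraphs: string[],
  maxChars: number,
  overlapChars: number,
): string[] {
  const size = (parts: string[]): number => parts.join("\n\n").length;
  const chunks: string[] = [];
  let current: string[] = [];

  for (const para of paragraphs) {
    if (current.length > 0 && size(current.concat(para)) > maxChars) {
      chunks.push(current.join("\n\n"));

      // Walk backwards and carry over whole paragraphs while they fit
      // inside the overlap budget.
      let carried: string[] = [];
      for (const prev of current.slice().reverse()) {
        if (size([prev].concat(carried)) > overlapChars) break;
        carried = [prev].concat(carried);
      }
      // Drop carried text again if it would push the new chunk over the limit.
      while (carried.length > 0 && size(carried.concat(para)) > maxChars) {
        carried = carried.slice(1);
      }
      current = carried;
    }
    current.push(para);
  }

  if (current.length > 0) {
    chunks.push(current.join("\n\n"));
  }
  return chunks;
}
```

With the three contract paragraphs from the example, a limit of 220 characters and an overlap budget of 130, the second chunk starts with the "If payment remains unpaid" paragraph again.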

Chunk size is a tradeoff

You might think larger chunks are always better.

They contain more context.

But they can also contain too many unrelated topics.

Imagine this chunk:

Introduction
Payment terms
Warranty
Troubleshooting
Technical specifications

The user asks about warranty.

The chunk contains the warranty answer, but it also contains four unrelated subjects.

That makes retrieval less precise.

Very small chunks are also bad.

Chunk 1
The warranty starts

Chunk 2
on the date of purchase

Chunk 3
and lasts for two years

These chunks are too small to be useful.

So chunking is a balance:

Very large chunks
More context
Less precise retrieval

Very small chunks
More precise retrieval
Less context

Medium chunks with overlap
Usually a useful compromise

There is no perfect chunk size that works for every document.

This is why a serious RAG app evaluates chunking changes instead of changing numbers based on vibes.

Page-aware chunks

My chunks stay inside one PDF page.

This means a chunk does not start on page 7 and end on page 8.

Why?

Because citations become simple.

Chunk 14
Contract.pdf
Page 7

Instead of:

Chunk 14
Contract.pdf
Pages 7–8
Maybe page 7
Maybe page 8

Page-aware chunking is not the only valid approach.

But for document citations, it is a very practical one.

The chunk stores:

document_id
owner_id
page_number
chunk_index
token_count

The chunk_index tells us the reading order.

The page number tells us where the user can find the original text.
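As a sketch, page-aware chunking can be as simple as splitting one page at a time and numbering the chunks across the whole document. The camelCase field names, the `buildChunkRecords` helper and the injected `splitPage` function are assumptions for illustration.

```typescript
interface ParsedPage {
  pageNumber: number;
  text: string;
}

// Mirrors the stored fields listed above, plus the chunk text itself.
interface ChunkRecord {
  documentId: string;
  ownerId: string;
  pageNumber: number;
  chunkIndex: number;
  tokenCount: number;
  content: string;
}

function buildChunkRecords(
  pages: ParsedPage[],
  documentId: string,
  ownerId: string,
  splitPage: (text: string) => string[],
): ChunkRecord[] {
  const records: ChunkRecord[] = [];
  for (const page of pages) {
    // Splitting one page at a time means no chunk can span two pages.
    for (const content of splitPage(page.text)) {
      records.push({
        documentId,
        ownerId,
        pageNumber: page.pageNumber,
        chunkIndex: records.length, // reading order across the whole document
        tokenCount: Math.ceil(content.length / 4), // rough 4-chars-per-token estimate
        content,
      });
    }
  }
  return records;
}
```

Because the page number is attached when the chunk is created, nothing downstream has to guess it.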

After chunking, we still only have text

At this point, we have chunks like this:

Chunk 1
"The warranty covers defects in materials and workmanship..."

Chunk 2
"The warranty does not cover accidental damage..."

Chunk 3
"To reset the device, hold the power button for five seconds..."

These are useful for humans.

But PostgreSQL cannot magically understand that this question:

How long am I protected if the product breaks?

is related to this text:

The manufacturer guarantees the product against defects for 24 months.

The words are different.

Question:
protected
breaks

Document:
guarantees
defects
24 months

But the meaning is similar.

This is what embeddings solve.

The second important RAG concept: embeddings

An embedding is a way to turn text into numbers.

For example, this text:

The warranty lasts for two years.

becomes something like this:

[0.18, -0.42, 0.91, 0.03, ...]

This list of numbers is called a vector.

In my project, every embedding contains:

1024 numbers

So every chunk becomes a point in 1024-dimensional space.

You cannot really visualize 1024 dimensions.

But you do not need to.

The important idea is:

Texts with similar meanings should end up near each other.

Texts with different meanings should end up farther apart.

For example:

How long is the warranty?

and:

The product is covered for two years from the date of purchase.

should be close together.

But this:

Hold the power button for five seconds to restart the device.

should be farther away.

Embeddings happen externally in Voyage AI

This is another important separation.

The worker creates chunks locally.

Then it sends those chunks to Voyage AI.

Local worker
↓
Chunk text

↓

Voyage AI
↓
Embedding vectors

The code conceptually looks like this:

const embeddings = await voyageEmbedTextsIntoVectors({
  texts,
  inputType: "document"
});

The worker sends text.

Voyage returns numbers.

Chunk text
↓
Voyage embedding model
↓
1024-number vector

Voyage does not receive the original PDF structure.

It does not know about pages.

It does not know about users.

It just receives text chunks and returns vectors.

That is why we keep page numbers, ownership and document metadata in our own database.
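To make the "text in, numbers out" exchange concrete, this sketch builds the JSON body for Voyage's embeddings endpoint and reads the vectors back out of the response. The model name `voyage-3.5` is an assumption (the post only says the vectors have 1024 numbers), and the helper names are hypothetical.

```typescript
const VOYAGE_EMBEDDINGS_URL = "https://api.voyageai.com/v1/embeddings";
const EMBEDDING_MODEL = "voyage-3.5"; // assumed; any 1024-dimension Voyage model fits the post

type InputType = "document" | "query";

interface EmbeddingRequestBody {
  input: string[];
  model: string;
  input_type: InputType;
}

// Only the part of the response this sketch needs.
interface EmbeddingResponseBody {
  data: { embedding: number[]; index: number }[];
}

function buildEmbeddingRequest(texts: string[], inputType: InputType): EmbeddingRequestBody {
  return { input: texts, model: EMBEDDING_MODEL, input_type: inputType };
}

// Sort by `index` so that vector i is guaranteed to belong to texts[i].
function vectorsInInputOrder(response: EmbeddingResponseBody): number[][] {
  return response.data
    .slice()
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}
```

The body would be sent as JSON in a POST request to `VOYAGE_EMBEDDINGS_URL` with an `Authorization: Bearer <api key>` header. Only chunk text goes over the wire; page numbers and owner IDs stay in our own records.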

An embedding model is not the chat model

This can be confusing at first.

An embedding model does not write answers.

It does not chat.

It does not summarize.

It does one job:

Text
↓
Numbers that represent meaning

The embedding model is useful for search.

The LLM is useful for generating an answer.

Embedding model:
"Which document pieces seem related to this question?"

LLM:
"Using these document pieces, how should I explain the answer?"

Those are different jobs.

What gets stored in the database?

This was one of the questions I had while building this.

Do we save the chunks?

Do we save the vectors too?

Yes.

Both are saved in the same table.

The table is called:

document_chunks

Every row represents one chunk of one document.

Conceptually, it looks like this:

Column	What it contains
`content`	The original readable chunk text
`embedding`	The 1024-number vector from Voyage
`content_tsv`	A PostgreSQL full-text version of the chunk text
`document_id`	The parent PDF
`owner_id`	The user who owns the PDF
`page_number`	The PDF page for citations
`chunk_index`	The chunk’s reading order
`token_count`	Rough size information

One row might look like this:

content:
"The warranty covers defects in materials and workmanship
for two years from the purchase date."

embedding:
[0.18, -0.42, 0.91, 0.03, ...]

page_number:
14

chunk_index:
31

The raw text and the vector belong together.

Raw text
Used later as context for the LLM.

Vector
Used to find that raw text.

The vector is not useful to show to the user.

The raw text is not enough for semantic similarity search.

We need both.

Why not use a separate vector database?

You often see tutorials using:

PostgreSQL for normal data

Pinecone or Weaviate for vectors

You can do that.

But you do not have to.

I use PostgreSQL with pgvector.

pgvector is a PostgreSQL extension.

It adds:

A vector column type

Vector similarity operators

Vector indexes

So the same PostgreSQL database can store:

Documents

Chat messages

Chunks

Embeddings

Citations

There is no separate vector database to keep in sync.

The vectors live next to the data they describe.

PostgreSQL
├── documents
├── document_chunks
│   ├── content
│   ├── embedding
│   ├── page_number
│   └── owner_id
└── chat_messages

This makes the architecture simpler.

When deleting a document, the related chunks can be deleted too.

When searching, we can filter by owner and document in the same query.

The three representations of one chunk

One chunk is stored in three useful forms.

content
The real text.

embedding
The meaning as numbers.

content_tsv
A full-text search representation.

Each one solves a different problem.

content
Used as context for the LLM.
Shown in the debug panel.
Used for citations.

embedding
Used for semantic search.
Finds similar meaning.

content_tsv
Used for exact text search.
Useful for codes, IDs and literal words.

The content_tsv column is generated by PostgreSQL from content.

The application does not manually write it.

You store the text, and PostgreSQL builds the searchable text representation.

That is useful because exact identifiers do not always work well with embeddings.

For example:

INV-2025-0492

SKU-88771

A7F9-KL22

These are not really meanings.

They are exact codes.

For those, literal search is better.
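A schema for these three representations could look roughly like the SQL below, kept in a TypeScript constant so a migration script can run it. The column types for the IDs, the `simple` text-search configuration and the index names are my assumptions; only the column names and the 1024-number vector come from the description above.

```typescript
// A sketch of the table, not the project's actual migration.
const CREATE_DOCUMENT_CHUNKS_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
  id           bigserial PRIMARY KEY,
  document_id  uuid    NOT NULL,   -- assumed type
  owner_id     text    NOT NULL,   -- assumed type
  page_number  integer NOT NULL,
  chunk_index  integer NOT NULL,
  token_count  integer NOT NULL,
  content      text    NOT NULL,
  embedding    vector(1024) NOT NULL,
  -- Generated by PostgreSQL from content; the application never writes it.
  content_tsv  tsvector GENERATED ALWAYS AS (to_tsvector('simple', content)) STORED
);

-- Vector index for cosine distance.
CREATE INDEX document_chunks_embedding_idx
  ON document_chunks USING hnsw (embedding vector_cosine_ops);

-- Full-text index for exact token lookups.
CREATE INDEX document_chunks_tsv_idx
  ON document_chunks USING gin (content_tsv);
`;
```

The generated column is what lets the application insert only `content` and `embedding` while PostgreSQL keeps `content_tsv` in sync.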

Now the PDF is searchable

At upload time, the worker does this:

Read PDF
↓
Parse PDF text
↓
Split text into chunks
↓
Send chunks to Voyage
↓
Receive vectors
↓
Store text + vectors in PostgreSQL

This is done once per uploaded document.

The chunk vectors are saved.

They do not need to be created again every time the user asks a question.

That is why answering a question can be fast enough.

The expensive work is done when the document is ingested.

What happens when the user asks a question?

Imagine the user asks:

What happens when an invoice is overdue?

We do not split the question into chunks.

It is already small.

But we do create a fresh embedding for it.

Question text
↓
Voyage AI
↓
Fresh query vector

Conceptually:

const queryVector = await voyageEmbedQueryIntoVector(question);

This vector is usually not saved in the database.

It is temporary.

It exists for this retrieval request.

Stored for as long as the document exists:
Document chunk vectors

Created fresh per question:
Question vector

The question needs a fresh vector because every question is different.

How long is the warranty?

What does the warranty cover?

Can I transfer the warranty?

Does the warranty cover water damage?

Every question has a different meaning.

Every question needs its own vector.

Why query embeddings and document embeddings are different

The application uses Voyage with two input types.

inputType: "document"

for document chunks.

And:

inputType: "query"

for user questions.

This is called asymmetric retrieval.

A question is usually short.

A document chunk is usually longer and more formal.

For example:

Question:
What happens when an invoice is overdue?

Document chunk:
If payment remains unpaid for more than 30 days after its due date,
the supplier may terminate this agreement.

These texts do not look the same.

But they should match.

Using a query embedding mode and a document embedding mode helps the model place these two kinds of text in compatible positions.

How pgvector finds similar chunks

Now we have:

Question vector
[0.12, -0.40, 0.88, ...]

Stored chunk vector
[0.18, -0.42, 0.91, ...]

We need to decide whether they are close.

One common way is cosine similarity.

Cosine similarity compares the direction of two vectors.

Imagine vectors as arrows.

Question:
→

Late payment chunk:
→

Device reset chunk:
↑

The question and late payment chunk point in a similar direction.

The device reset chunk points in a different direction.

So the late payment chunk is more likely to be relevant.

The formula is:

cosineSimilarity(a, b) =
  dotProduct(a, b) / (length(a) × length(b))

You do not need to calculate this by hand.

The important result is:

High cosine similarity
= vectors point in a similar direction
= text probably has a similar meaning

pgvector, the PostgreSQL extension doing the comparison, works with cosine distance instead.

cosineDistance = 1 - cosineSimilarity

So:

High similarity
= low distance
= good match

Low similarity
= high distance
= bad match

In the database query, this looks like:

ORDER BY embedding <=> queryVector

This means:

Order stored chunk vectors by cosine distance.

Closest vectors first.

The database returns the chunks whose meaning is closest to the question.
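pgvector does this math inside the database, but the formula is short enough to write out. The sketch below uses invented two-number vectors in place of real 1024-number embeddings, purely to show the arithmetic.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  const dotProduct = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const lengthA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const lengthB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dotProduct / (lengthA * lengthB);
}

// The quantity the `<=>` operator orders by: 0 means "same direction".
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

// Toy vectors matching the arrows above (made-up values).
const questionArrow = [1, 0];
const latePaymentArrow = [0.9, 0.1];
const deviceResetArrow = [0, 1];

const ranked = [
  { name: "late payment", distance: cosineDistance(questionArrow, latePaymentArrow) },
  { name: "device reset", distance: cosineDistance(questionArrow, deviceResetArrow) },
].sort((x, y) => x.distance - y.distance);
```

Sorting by distance in ascending order puts the late payment chunk first, which is what `ORDER BY embedding <=> queryVector` does for every stored row.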

What vector search is actually doing

Imagine these chunks exist in the database:

Chunk A
The warranty covers defects for two years.

Chunk B
The customer must pay invoices within 30 days.

Chunk C
If payment remains unpaid for more than 30 days,
the supplier may terminate this agreement.

Chunk D
Hold the power button for five seconds to restart the device.

The user asks:

What happens when an invoice is overdue?

The query embedding should end up closer to chunks B and C.

Question vector
↓
Compare against stored vectors
↓
Closest chunks

1. Chunk C
2. Chunk B
3. Maybe another payment-related chunk
4. Not chunk D

The database returns the raw text for these chunks.

The raw text is what the LLM needs later.

The vector is only the way we found it.
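Put together, the search is one parameterized query. This sketch assumes the `document_chunks` columns described earlier and returns a `{ text, values }` pair, which is the shape node-postgres accepts if that is the client in use. The helper names and the default pool size of 12 used here as a parameter are illustrative.

```typescript
// pgvector accepts a vector as text in the form '[0.12,-0.4,0.88]'.
function toVectorLiteral(vector: number[]): string {
  return `[${vector.join(",")}]`;
}

const VECTOR_SEARCH_SQL = `
SELECT id, content, page_number, document_id,
       embedding <=> $1::vector AS distance
FROM document_chunks
WHERE owner_id = $2
ORDER BY embedding <=> $1::vector
LIMIT $3
`;

interface SqlQuery {
  text: string;
  values: (string | number)[];
}

function buildVectorSearch(
  queryVector: number[],
  ownerId: string,
  poolSize: number = 12,
): SqlQuery {
  return {
    text: VECTOR_SEARCH_SQL,
    values: [toVectorLiteral(queryVector), ownerId, poolSize],
  };
}
```

Note that the query selects `content`, not `embedding`: the vector decides the order, the raw text is what comes back.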

What is HNSW?

At first, you might think PostgreSQL compares the question vector with every chunk vector in the entire database.

It could.

But that gets slow as the number of chunks grows.

Imagine one million chunks.

Question
↓
Compare against 1,000,000 vectors
↓
Slow

HNSW is an index that helps pgvector find nearby vectors quickly.

HNSW means:

Hierarchical Navigable Small World

The name sounds scary.

The idea is easier.

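Imagine all chunk vectors are connected in a graph, where each vector is linked to some of its nearest neighbors.

The graph has several layers.

The top layers are sparse and allow long jumps. The bottom layer contains every vector.

A search starts at the top, keeps moving to whichever neighbor is closest to the question vector, and drops down a layer when it cannot get any closer.

This way it only visits a small part of the graph instead of every vector.

The tradeoff is that HNSW is approximate. It usually finds the nearest vectors, but it is not guaranteed to.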

A small multi-user HNSW problem

There is one interesting detail when users have separate documents.

Imagine the database has chunks from many users.

The vector index first finds globally close chunks.

Then SQL filters by:

owner_id = current user

That can cause a problem.

1. HNSW finds close chunks from all users.

2. PostgreSQL removes chunks that do not belong to the current user.

3. The current user might have no chunks left.

This is especially possible when one user has only a small number of documents.

The solution is iterative scanning.

It tells pgvector:

Keep searching for more candidates until enough chunks pass the owner filter.

The implementation sets this inside the database transaction.

That is a small detail, but it matters for multi-user retrieval quality.
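In pgvector, iterative scanning is controlled by the `hnsw.iterative_scan` setting, available from version 0.8.0. The sketch below only lists the statements to run on one connection; choosing `strict_order` over `relaxed_order` and the `withIterativeScan` helper are my assumptions, not necessarily what the project does.

```typescript
interface SqlStep {
  text: string;
  values: unknown[];
}

// Wraps a vector search so that the setting applies to it and nothing else.
function withIterativeScan(search: SqlStep): SqlStep[] {
  return [
    { text: "BEGIN", values: [] },
    // SET LOCAL only lasts until the end of the current transaction, so other
    // queries on a pooled connection are not affected.
    { text: "SET LOCAL hnsw.iterative_scan = strict_order", values: [] },
    search,
    { text: "COMMIT", values: [] },
  ];
}
```

All four statements have to run on the same connection, which is why the setting lives inside the transaction rather than in a separate call.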

Vector retrieval is fast, but not perfect

The first vector search gives us possible chunks.

In my project, it gets a pool of 12 chunks.

Question
↓
Question embedding
↓
pgvector finds 12 close chunks
↓
Candidate pool

These are not necessarily the final best chunks.

They are good possibilities.

This first stage is about recall.

Recall asks:

Did we include the correct answer somewhere in the candidate pool?

The goal is not perfect ordering yet.

The goal is to avoid missing the answer.

Bi-encoder: the first retrieval stage

The embedding system is called a bi-encoder.

Bi means two.

One side:
Embed the question.

Other side:
Embed the document chunk.

The question and the chunk are embedded separately.

Question
↓
Question vector

Chunk
↓
Chunk vector

Question vector + chunk vector
↓
Cosine distance

The question and chunk do not actually meet inside the embedding model.

They are converted separately into numbers.

Then we compare those numbers.

This is fast because document chunk vectors were already created during upload.

The only new work during chat is:

Create one fresh query vector.

But this speed comes with a limitation.

Vector search can find chunks with similar meaning.

It can be weaker for exact weird-looking values.

Exact IDs are not concepts

Imagine the user asks:

What does invoice INV-2025-0492 say?

The useful part is:

INV-2025-0492

That is not a concept like warranty or payment.

It is an exact identifier.

You do not want a semantically similar invoice.

You want this exact invoice.

So after vector search, the application checks whether the query contains identifier-shaped text.

For example:

INV-2025-0492

SKU-88771

A7F9-KL22

987654321

If it does, the app performs an exact keyword search using content_tsv.

Question includes ID
↓
Exact token search
↓
Find chunks containing that exact ID
↓
Add them to the candidate pool

This is not broad full-text search for every query.

It is a small fix for a known vector-search blind spot.

Vector search
Good at meaning.

Exact token search
Good at exact codes.

Then both kinds of results go into the same candidate pool.
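A sketch of the identifier check and the merge step. The rule for what counts as "identifier-shaped" is an assumption I made up for this example (letters, digits and hyphens, at least one digit, at least 5 characters); the project's real rule may differ.

```typescript
// Assumed rule: runs of letters/digits joined by hyphens that contain a digit.
const ID_TOKEN = /\b(?=[A-Za-z0-9-]*\d)[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\b/g;
const MIN_ID_LENGTH = 5;

function extractIdentifiers(question: string): string[] {
  const matches: string[] = question.match(ID_TOKEN) ?? [];
  // Keep long-enough tokens and drop duplicates, preserving order.
  return matches.filter(
    (token, i) => token.length >= MIN_ID_LENGTH && matches.indexOf(token) === i,
  );
}

interface Candidate {
  id: number;
  content: string;
}

// Vector hits first, then exact-match hits that are not already in the pool.
function mergeCandidates(vectorHits: Candidate[], exactHits: Candidate[]): Candidate[] {
  const seen: Record<number, boolean> = {};
  return vectorHits.concat(exactHits).filter((candidate) => {
    if (seen[candidate.id]) return false;
    seen[candidate.id] = true;
    return true;
  });
}
```

A question like "How long is the warranty?" yields no identifiers, so the exact search is skipped and the pool contains vector hits only.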

The third important RAG concept: reranking

The candidate pool has 12 possible chunks.

Now we need to decide which ones are actually the best answer.

This is where reranking happens.

Reranking is done externally with Voyage too.

Question + 12 candidates
↓
Voyage reranker
↓
Candidates reordered by relevance
↓
Keep the best 8

A reranker does something different from embeddings.

The embedding model sees the question and chunk separately.

The reranker sees them together.

Question:
What happens when an invoice is overdue?

Chunk:
If payment remains unpaid for more than 30 days after its due date, the supplier may terminate this agreement.

The reranker can directly judge:

Does this chunk answer this exact question?

This is called a cross-encoder.

Bi-encoder
Question and chunk are processed separately.
Fast.
Good for finding candidates.

Cross-encoder
Question and chunk are processed together.
Slower.
Better at judging relevance.

You would not rerank one million chunks.

That would be expensive.

But reranking 12 already-good candidates is reasonable.

That is why the pipeline has two stages.

Stage 1: cheap recall
Vector search finds possible chunks.

Stage 2: expensive precision
Reranker decides which possible chunks are best.

This is one of the main RAG ideas.

Use the cheap thing first.

Use the more accurate thing only on a small list.

What happens if reranking fails?

Voyage is an external API.

It can fail.

The network can fail.

The API can be rate-limited.

So the app does not require reranking for chat to work.

Reranking works
↓
Use reranked top 8 chunks

Reranking fails
↓
Use the original vector order

The answer might be less precise.

But it is still grounded in retrieved chunks.

This is better than returning a complete error after retrieval already succeeded.
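The fallback can be sketched as one pure function. It assumes the rerank response is a list of `{ index, relevance_score }` entries pointing back into the submitted candidates, and that the caller passes `null` when the API call failed. The function and type names are hypothetical.

```typescript
interface Candidate {
  id: number;
  content: string;
}

// One entry of a rerank response: position in the submitted list plus a score.
interface RerankResult {
  index: number;
  relevance_score: number;
}

function selectFinalChunks(
  candidates: Candidate[], // in original vector order
  reranked: RerankResult[] | null, // null when the rerank call failed
  keep: number = 8,
): Candidate[] {
  if (reranked === null) {
    // Reranking failed: fall back to the original vector order.
    return candidates.slice(0, keep);
  }
  return reranked
    .slice()
    .sort((a, b) => b.relevance_score - a.relevance_score)
    .map((result) => candidates[result.index])
    .filter((c): c is Candidate => c !== undefined)
    .slice(0, keep);
}
```

The caller would wrap the rerank request in a try/catch and pass `null` from the catch branch, so a rate limit or network error only changes the ordering and never blocks the answer.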

The LLM only receives the final chunks

After vector retrieval, identifier search and reranking, the application has the final chunks.

For example:

[1] Contract.pdf, page 3

If payment remains unpaid for more than 30 days after its due date, the supplier may terminate this agreement.

[2] Contract.pdf, page 3

Termination does not remove the customer's obligation to pay all amounts already due.

Then the application builds the LLM prompt.

Question:
What happens when an invoice is overdue?

Context:
[1] Contract.pdf, page 3
...

[2] Contract.pdf, page 3
...

The system prompt tells the LLM:

Answer only from the provided context.

Cite the document and page.

If the answer is not in the context, say so.

Do not invent information.

This is grounding.

Grounding means that the LLM should use retrieved document text as the source of truth.

The LLM should not answer based on what it generally knows about contracts.

It should answer based on what this contract says.
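A sketch of how the numbered context and the grounding rules could be assembled into chat messages. The message shape with `role` and `content` is the common chat-completions convention and an assumption here, as are the helper names.

```typescript
interface FinalChunk {
  documentName: string;
  pageNumber: number;
  content: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// The grounding rules from the system prompt described above.
const SYSTEM_PROMPT = [
  "Answer only from the provided context.",
  "Cite the document and page.",
  "If the answer is not in the context, say so.",
  "Do not invent information.",
].join("\n");

// Numbers each chunk so the model can cite it as [1], [2], ...
function buildContext(chunks: FinalChunk[]): string {
  return chunks
    .map((c, i) => `[${i + 1}] ${c.documentName}, page ${c.pageNumber}\n${c.content}`)
    .join("\n\n");
}

function buildMessages(question: string, chunks: FinalChunk[]): ChatMessage[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    {
      role: "user",
      content: `Question:\n${question}\n\nContext:\n${buildContext(chunks)}`,
    },
  ];
}
```

The numbers in square brackets are what make citations cheap: the answer can refer to `[1]`, and the app already knows which document and page that is.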

“I don't know” is part of a good RAG app

If no useful chunks are found, the app does not call the LLM.

Instead, it returns something like:

I don't know based on the provided documents.

This is a feature.

A RAG app should not confidently invent an answer just because the user asked a question.

The goal is not:

Always generate something.

The goal is:

Answer from the uploaded documents.

If the documents do not contain the answer, the correct answer is that the system does not know.

Why questions are rewritten before retrieval

Chat questions are often incomplete.

Imagine this conversation:

User:
Tell me about the Mercedes EQS.

Assistant:
The Mercedes EQS is an electric luxury sedan...

User:
What about its warranty?

The last question is not good for vector retrieval.

What about its warranty?

What does "its" refer to?

So the app can rewrite the message using chat history.

Original:
What about its warranty?

Retrieval query:
What is the warranty of the Mercedes EQS?

The rewritten question is used for retrieval.

The original question is still saved in chat.

The user sees what they actually typed.

The retriever gets a complete question.

RAG is not always the correct answer

This is something I think is worth saying.

For a single small PDF, RAG can be worse than just sending the whole document to the LLM.

In my app, if the user selected exactly one small document and it fits under the token limit, the application skips vector search.

One small selected document
↓
Load every chunk in reading order
↓
Send the full document as context

No embedding call.

No vector search.

No reranking.

This is useful because retrieval can accidentally miss the most relevant chunk.

For a small document, there is no need to take that risk.
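The decision itself is a small check. In this sketch the 8000-token budget and the `totalTokens` field (for example, the sum of the stored `token_count` values) are assumptions; the real limit depends on the chat model's context window.

```typescript
interface SelectedDocument {
  id: string;
  totalTokens: number; // e.g. the sum of token_count over the document's chunks
}

// Assumed budget, not the project's actual number.
const FULL_CONTEXT_TOKEN_LIMIT = 8000;

function shouldUseFullContext(
  selected: SelectedDocument[],
  limit: number = FULL_CONTEXT_TOKEN_LIMIT,
): boolean {
  // Only skip retrieval when exactly one document is selected.
  if (selected.length !== 1) {
    return false;
  }
  const [only] = selected;
  return only !== undefined && only.totalTokens <= limit;
}
```

When this returns `true`, the app can load every chunk ordered by `chunk_index` and skip the embedding, search and rerank calls entirely.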

RAG makes more sense when:

Documents are large

There are many documents

Only a small part is relevant

You need efficient search

You need citations

The best RAG system is not the one that uses RAG everywhere.

It is the one that knows when not to use it.

What happens during ingestion, step by step

Here is the local and external work, clearly separated.

1. User uploads a PDF

2. The API validates it
   - signed-in user
   - file size
   - actual PDF bytes

3. The PDF file is saved locally

4. A document row is created in PostgreSQL
   status = uploaded

5. A BullMQ job is added to Redis

6. The worker receives:
   documentId
   ownerId

7. The worker reads the PDF bytes locally

8. unpdf extracts text locally, page by page

9. The worker rejects:
   corrupt PDFs
   too many pages
   scanned PDFs with no text

10. Local JavaScript splits page text into chunks

11. The chunks are sent to Voyage AI

12. Voyage AI returns one 1024-number vector per chunk

13. PostgreSQL stores:
   raw text
   vector
   page number
   document ID
   owner ID
   chunk order

14. The document status becomes ready

At this point, the PDF is searchable.

What happens when a user asks a question, step by step

1. User sends a chat message

2. The API checks:
   user
   message
   rate limit
   LLM configuration

3. The app loads chat history

4. The app rewrites the question when needed

5. The app checks whether one small document can use full context

6. If not:
   send the fresh question to Voyage AI

7. Voyage returns one fresh query vector

8. pgvector searches stored chunk vectors by cosine distance

9. Exact identifier search runs when the question contains a code or ID

10. The candidate pool goes to Voyage reranking

11. Voyage returns the best chunks in better order

12. The app builds context with:
    raw text
    document name
    page number

13. The LLM receives question + context

14. The answer streams back to the browser

15. The answer and citations are saved

The architecture in one final mind map

                         UPLOAD TIME

PDF
│
├── local storage
│
├── Node.js worker
│   │
│   ├── read bytes
│   ├── parse text with unpdf
│   ├── keep page numbers
│   └── split text into overlapping chunks
│
├── Voyage AI
│   │
│   └── turn every chunk into a 1024-number embedding
│
└── PostgreSQL + pgvector
    │
    └── store:
        raw chunk text
        chunk vector
        page number
        document ID
        owner ID
        chunk order


                         QUESTION TIME

User question
│
├── rewrite with chat history when needed
│
├── Voyage AI
│   │
│   └── turn the fresh question into a query vector
│
├── PostgreSQL + pgvector
│   │
│   ├── compare question vector with stored chunk vectors
│   ├── use cosine distance
│   ├── use HNSW for fast search
│   └── return likely relevant chunks
│
├── exact keyword search
│   │
│   └── find exact IDs, SKUs, invoice numbers and codes
│
├── Voyage reranker
│   │
│   └── reorder possible chunks by actual relevance
│
└── LLM
    │
    └── answer only from final chunks and cite the pages

The main idea I want to remember

RAG is a data pipeline.

First:
Turn PDFs into searchable pieces of text.

Then:
Turn those pieces into vectors.

Then:
Store text and vectors together.

Later:
Turn the user question into a fresh vector.

Then:
Find stored vectors with similar meaning.

Then:
Give the original text behind those vectors to the LLM.

The embedding vectors help us find the answer.

The raw chunk text is the actual answer source.

The LLM explains that source to the user.

Ingestion prepares the knowledge.

pgvector finds the knowledge.

The LLM explains the knowledge.

Good for you if you stayed this long ;)
