This series bridges the gap between AI research and software engineering. It moves from the high-level intuition of Large Language Models to the low-level systems engineering required to build deterministic, production-grade RAG pipelines.
- Part 1: Revolutionizing Question-and-Answer Systems (RAG Intuition) – The mental model of the “Student” and the “Librarian.”
- Part 2: Building RAG: All Things Retrieval – The basics of Vector Stores and Search.
- Part 3: Deep Dive: Keyword Search – Understanding BM25 and the math of “exact match.”
- Part 4: RAG Systems Engineering: The Structure-Aware Data Pipeline – Fixing the “Garbage In” problem with logical chunking.
- Part 5: Beyond Vectors: Sparse Embeddings & SPLADE – Solving the “Vocabulary Mismatch” problem.
- Part 6: RAG Architecture: The “Union + Rerank” Pipeline – Orchestrating Hybrid Search for production.
In the previous article, we unpacked TF-IDF and BM25, the math of exact-match keyword search. Now we shift from retrieval to ingestion, the most unglamorous but critical part of the stack: the Data Pipeline. The goal is to structure data with logical chunking, metadata, and multi-resolution indexing so the “Librarian” can actually find signal.
Here is the scenario most engineers face: You build a prototype. You use the standard RecursiveCharacterTextSplitter. You throw a PDF of your company’s documentation into it. You ask a question. The answer is… mediocre.
The trap is thinking this is a model problem. You assume you need GPT-5 or a better prompt. The reality is that your bottleneck isn’t the brain (the LLM); it’s the digestive system (Ingestion). If you feed the model fragmented, context-less sentence soup, no amount of reasoning can save you.
We need to stop treating documents as “strings of text” and start treating them as hierarchical data structures.
The “Garbage In” Reality Check
“Why can’t I just dump the whole document into the prompt? Context windows are 1 million tokens now.”
This is the most dangerous fallacy in RAG today. It assumes that Context Length == Context Understanding.
Research proves otherwise.
The U-Shaped Failure
In the paper “Lost in the Middle”, researchers found that LLMs are great at retrieving information from the beginning of a prompt (instruction bias) and the end (recency bias), but performance falls off a cliff for data buried in the middle.
The Haystack Problem
Greg Kamradt’s “Needle in a Haystack” analysis visually demonstrated this failure mode across almost every major model. As the “haystack” (context) grows, the model’s ability to retrieve specific facts degrades.
The Engineering Takeaway
You cannot lazy-load your data. You must architect a pipeline that retrieves only the relevant signal and places it at the “top” of the context window.
The Prerequisite: Visual vs. Logical Formats
Before we split text, we have to read it. This is where the Impedance Mismatch begins.
The Problem with PDFs
PDFs are designed for printing, not parsing. To a computer, a PDF isn’t a document; it’s a bag of words with XY coordinates.
- The Column Trap: A human sees two columns. A naive parser sees lines of text running horizontally across the page, merging unrelated sentences from Column A and Column B.
- The Table Trap: A table in a PDF is often just a set of floating lines, not a grid object.
The Solution: The Document Object Model (DOM)
You must prioritize formats that possess an inherent Document Object Model (DOM), like Markdown, HTML, or XML.
- Markdown has headers (`#`), lists (`-`), and code blocks (`` ``` ``).
- HTML has `<div>`, `<table>`, and `<h1>`.
These aren’t just styling; they are delimiters. They tell us where one thought ends and another begins. If you are stuck with PDFs, your first step isn’t “splitting”—it’s using tools (like unstructured.io or Azure Document Intelligence) to reconstruct this DOM before you index.
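If you do find yourself reconstructing a DOM from PDFs, a minimal sketch of that step might look like the following. It assumes the open-source `unstructured` library; the filename and the title-based grouping are illustrative, and any layout-aware parser that emits typed elements fits the same pattern.

```python
# Sketch: rebuild a rough logical tree from a PDF before any chunking.
# Assumes `pip install "unstructured[pdf]"`; swap in any layout-aware parser.
from unstructured.partition.auto import partition

elements = partition(filename="company_docs.pdf")  # hypothetical file

# Group body elements under the most recent Title element to approximate a DOM.
sections = []
for el in elements:
    if el.category == "Title":
        sections.append({"heading": el.text, "blocks": []})
    elif sections:
        sections[-1]["blocks"].append({"type": el.category, "text": el.text})

for section in sections[:3]:
    print(section["heading"], "->", len(section["blocks"]), "blocks")
```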
Strategy A: Structure-Aware Splitting
Once we have a DOM, we need to chop it up for the vector store.
The Naive Approach: The Shredded Book
Imagine tearing the pages out of a textbook, shredding them into confetti, and asking someone to study for a test. That is what RecursiveCharacterTextSplitter does. It cuts text based on a character count (e.g., 512 characters).
Failure Mode: It blindly cuts sentences in half, or worse, separates a function signature from its return statement.
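For contrast, the naive baseline is only a few lines, which is exactly why it is so tempting. This sketch assumes LangChain's `langchain-text-splitters` package; the 512-character budget and the filename are arbitrary placeholders.

```python
# Sketch of the structure-blind baseline: split purely on character count.
# Assumes `pip install langchain-text-splitters`.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(open("api_docs_v2.md").read())  # hypothetical file

# Nothing here knows about headers, tables, or code fences; a function
# signature and its return statement can easily land in different chunks.
print(f"{len(chunks)} chunks; first chunk starts with: {chunks[0][:80]!r}")
```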
The Systems Approach: Logical Blocking
Instead of counting characters, we split by boundaries.
- Split by Headers: Every `#` header starts a new chunk (see the sketch after this list).
- Split by Tables: A table is an atomic unit. Never split row 5 from row 6.
- Split by Code: Code blocks are atomic.
  - Rule: Never split a function in the middle.
  - Constraint: If a function is longer than your embedding limit, you must parse the Abstract Syntax Tree (AST) to split by method or class, not by line number.
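Here is a minimal, dependency-free sketch of the header rule referenced above: split Markdown at `#` boundaries while treating fenced code blocks as atomic. The function name and the sample document are illustrative only.

````python
# Sketch: split Markdown at header boundaries, never inside a fenced code block.
def split_by_headers(markdown: str) -> list[str]:
    chunks, current, in_code_fence = [], [], False
    for line in markdown.splitlines():
        if line.strip().startswith("```"):
            in_code_fence = not in_code_fence  # toggle on opening/closing fences
        # A header outside a code fence closes the previous chunk.
        if line.startswith("#") and not in_code_fence and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Auth\nLogin flow.\n```python\n# a comment, not a header\nx = 1\n```\n# Errors\n404 handling."
for chunk in split_by_headers(doc):
    print(repr(chunk))
````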
Strategy B: Metadata Injection (The Context Anchor)
When you slice a document into chunks, you destroy the Global Context.
- The Chunk: “It returns a 404 error if the ID is invalid.”
- The Problem: Which API endpoint? Which version? This chunk is mathematically similar to every error message in your database.
The Fix: Ancestral Metadata
We must enrich every chunk with the context of its “parents.” We don’t just index the text; we index the path.
    {
      "text": "It returns a 404 error...",
      "metadata": {
        "source": "api_docs_v2.md",
        "breadcrumbs": "Authentication > Error Codes > Edge Cases",
        "header_path": "/auth/errors/404",
        "version": "2.1"
      }
    }
Why this matters: This allows you to filter deterministically before you search probabilistically. You can tell the vector store: “Only calculate cosine similarity on chunks where breadcrumbs contains ‘Authentication’.”
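Here is a self-contained sketch of that two-phase query, with a tiny in-memory list standing in for a real vector store and hand-written vectors standing in for embeddings: filter on the breadcrumb metadata first, then score cosine similarity only over the survivors.

```python
import numpy as np

# Stand-in corpus: every chunk carries its ancestral metadata (Strategy B).
chunks = [
    {"text": "It returns a 404 error...", "vec": np.array([0.1, 0.9]),
     "metadata": {"breadcrumbs": "Authentication > Error Codes > Edge Cases"}},
    {"text": "Billing retries the webhook three times.", "vec": np.array([0.8, 0.2]),
     "metadata": {"breadcrumbs": "Billing > Webhooks"}},
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec: np.ndarray, breadcrumb_filter: str) -> list[dict]:
    # 1. Deterministic: keep only chunks under the requested breadcrumb path.
    candidates = [c for c in chunks if breadcrumb_filter in c["metadata"]["breadcrumbs"]]
    # 2. Probabilistic: rank the survivors by cosine similarity.
    return sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)

print(search(np.array([0.2, 0.8]), breadcrumb_filter="Authentication")[0]["text"])
```

Most production vector stores expose the same idea as a metadata filter argument on the query call; the pattern, not any particular API, is the point.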
Strategy C: Multi-Resolution Indexing (Parent-Child)
We face a fundamental trade-off in vector search:
- Small Chunks (128 tokens): High precision. Captures specific facts.
- Large Chunks (1024 tokens): Good context. Captures the “big picture,” but dilutes the vector (too many topics in one vector).
The “Systems” Solution: Parent-Child Indexing
Don’t choose. Do both.
- The Parent: Split your document into large, parent chunks (e.g., full sections).
- The Child: Split those parents into small, granular child chunks (e.g., single paragraphs).
- The Link: Index the Children (for search precision) but store the Parent ID in the metadata.
At Query Time
- Search against the small Child chunks to find the precise match.
- Retrieve the Parent chunk to feed the LLM.
This gives you the best of both worlds: The precision of a needle, but the context of the whole haystack.
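A minimal end-to-end sketch of the pattern follows, with plain dictionaries standing in for a real document store and a toy `embed` function standing in for an embedding model; the IDs, texts, and helper names are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for an embedding model: deterministic vector seeded by the text.
    rng = np.random.default_rng(sum(map(ord, text)))
    return rng.standard_normal(8)

# Ingestion: large parent sections, small child paragraphs linked by parent_id.
parents = {"p1": "## Error Codes\nThe full section text, with surrounding context."}
children = [
    {"id": "c1", "parent_id": "p1", "text": "It returns a 404 error..."},
    {"id": "c2", "parent_id": "p1", "text": "A 401 means the token has expired."},
]
for child in children:
    child["vec"] = embed(child["text"])  # only the children are indexed

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str) -> str:
    q = embed(query)
    # 1. Search the small child chunks for the precise match.
    best_child = max(children, key=lambda c: cosine(q, c["vec"]))
    # 2. Hand the LLM the full parent chunk for context.
    return parents[best_child["parent_id"]]

print(retrieve("What does a 404 mean?"))
```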
Conclusion: The “ETL” Mindset
We need to stop looking for “Magic AI” solutions to data problems. Building a production RAG system is an ETL (Extract, Transform, Load) challenge.
- Extract: Parse the DOM, not just the text.
- Transform: Logical splitting and metadata injection.
- Load: Multi-resolution indexing.
$$\text{Better splitting} + \text{Richer Metadata} = \text{Higher quality vectors} = \text{Less hallucination}$$
Coming Next: Beyond vectors — we tackle sparse embeddings and SPLADE to solve vocabulary mismatch and improve recall.
Read Part 5 here: Beyond Vectors: Sparse Embeddings & SPLADE