Retrieval-Augmented Generation (RAG) is the dominant pattern for building LLM applications over private data. Instead of fine-tuning a model on your documents, you retrieve relevant chunks at query time and inject them into the prompt. Done well, RAG reduces hallucinations, keeps knowledge current, and costs a fraction of what fine-tuning would.
## RAG Pipeline Architecture
```
User Query
    |
    v
+------------------+
| Query Processing |  (rewrite, expand, decompose)
+------------------+
    |
    v
+------------------+     +--------------------+
| Embedding Model  |     | Document Ingestion |
| (query -> vector)|     |                    |
+------------------+     |    Source Docs     |
    |                    |        |           |
    v                    |        v           |
+------------------+     |     Chunking       |
|  Vector Search   |<----|     Embedding      |
|  (ANN lookup)    |     |     Indexing       |
+------------------+     +--------------------+
    |
    v
+------------------+
|    Reranking     |  (cross-encoder, Cohere Rerank, etc.)
+------------------+
    |
    v
+------------------+
| Prompt Assembly  |  (system prompt + retrieved chunks + query)
+------------------+
    |
    v
+------------------+
|  LLM Generation  |  (GPT-4, Claude, Llama, etc.)
+------------------+
    |
    v
+------------------+
| Post-Processing  |  (citation extraction, guardrails, caching)
+------------------+
```
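The query-time half of the pipeline can be sketched end to end in a few lines. Everything here is illustrative: `embed` is a toy character-frequency stub standing in for a real embedding model, and the function names are not any particular library's API.

```python
def embed(text: str) -> list[float]:
    """Stub embedding: normalized character-frequency vector.
    A real system would call an embedding model here."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Rank indexed chunks by similarity to the query and keep the top k."""
    qv = embed(query)
    ranked = sorted(index, key=lambda doc: cosine(qv, doc[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def assemble_prompt(query: str, chunks: list[str]) -> str:
    """Prompt assembly: system instruction + retrieved chunks + user query."""
    context = "\n---\n".join(chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Ingestion happens once, up front: chunk (elided here), embed, index.
chunks = [
    "RAG injects retrieved chunks into the prompt.",
    "Postgres supports vector search via pgvector.",
    "Cross-encoders rerank candidate chunks.",
]
index = [(c, embed(c)) for c in chunks]
prompt = assemble_prompt("What is RAG?", retrieve("What is RAG?", index, k=2))
```

The missing pieces relative to the diagram (query rewriting, reranking, post-processing) slot in as extra functions between `retrieve` and `assemble_prompt`.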
## Chunking Strategy Comparison

| Strategy | Chunk Size | Overlap | Pros | Cons | Best For |
|---|---|---|---|---|---|
| Fixed-size | 256-512 tokens | 10-20% | Simple, predictable | Breaks mid-sentence | Homogeneous documents |
| Sentence-based | 1-5 sentences | 1 sentence | Preserves meaning | Uneven chunk sizes | Narrative text |
| Paragraph-based | 1-3 paragraphs | 0-1 paragraph | Natural boundaries | Large variance in size | Structured articles |
| Semantic | Varies | Adaptive | Groups related content | Slower, more complex | Mixed-format documents |
| Recursive | Target size, split on hierarchy | 10-20% | Respects document structure | Requires format knowledge | Markdown, HTML, code |
| Parent-Child | Small (retrieval) + large (context) | N/A | Precise retrieval, rich context | More complex indexing | Long documents |
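The first row of the table can be sketched as follows. Tokens are approximated here by whitespace-split words, and `chunk_fixed` is an illustrative helper, not a library function; a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int = 256, overlap_frac: float = 0.15) -> list[list[str]]:
    """Fixed-size chunking with fractional overlap between adjacent chunks."""
    # With 15% overlap, each new chunk starts 85% of a chunk further along.
    step = max(1, int(size * (1 - overlap_frac)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks

words = ("lorem " * 1000).split()
chunks = chunk_fixed(words, size=256, overlap_frac=0.15)
```

The "breaks mid-sentence" drawback from the table is visible here: chunk boundaries fall at arbitrary token offsets, which is exactly what sentence- and paragraph-based strategies avoid.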
## Embedding Model Comparison

| Model | Dimensions | Max Tokens | MTEB Score | Speed | Cost | Open/Closed |
|---|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | ~65 | Fast | $0.13/1M tokens | Closed |
| OpenAI text-embedding-3-small | 1536 | 8191 | ~62 | Very Fast | $0.02/1M tokens | Closed |
| Cohere embed-v3 | 1024 | 512 | ~65 | Fast | $0.10/1M tokens | Closed |
| Voyage AI voyage-3 | 1024 | 32000 | ~67 | Fast | $0.06/1M tokens | Closed |
| BGE-large-en-v1.5 | 1024 | 512 | ~64 | Medium | Free (self-host) | Open |
| E5-mistral-7b-instruct | 4096 | 32768 | ~66 | Slow | Free (self-host) | Open |
| nomic-embed-text-v1.5 | 768 | 8192 | ~62 | Fast | Free (self-host) | Open |
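The dimension counts above are not always fixed: models trained with Matryoshka-style representation learning (OpenAI's text-embedding-3 family advertises this) produce vectors whose prefixes remain usable, so you can truncate and renormalize to trade quality for storage. A minimal sketch, assuming the model supports this; `truncate_embedding` is an illustrative helper and this is not safe for arbitrary embedding models:

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and renormalize to unit length,
    so cosine similarity still behaves correctly on the shorter vector."""
    head = vec[:dims]
    norm = math.sqrt(sum(v * v for v in head)) or 1.0
    return [v / norm for v in head]

half = truncate_embedding([3.0, 4.0, 100.0], 2)
```

Halving 3072-dim vectors to 1536 halves index storage and roughly halves ANN search cost, at a modest retrieval-quality penalty.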
## Retrieval Method Matrix

| Method | Precision | Recall | Latency | Complexity | When to Use |
|---|---|---|---|---|---|
| Dense Vector Search | High | Medium | Fast | Low | Default starting point |
| Sparse (BM25/SPLADE) | Medium | High | Fast | Low | Keyword-heavy queries, exact terms |
| Hybrid (Dense + Sparse) | High | High | Medium | Medium | Production systems (best general performance) |
| Multi-Query | High | Very High | Slow | Medium | Complex questions |
| HyDE | High | High | Slow | Medium | Abstract or vague queries |
| Parent-Child Retrieval | Very High | High | Medium | High | Long documents needing context |
| Knowledge Graph + Vector | Very High | High | Slow | High | Structured domain knowledge |
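The hybrid row is commonly implemented by running dense and sparse retrieval separately and merging the ranked lists with reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. A minimal sketch; the document IDs are illustrative, and `k=60` is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists: each list contributes
    1 / (k + rank) per document, and documents are sorted by total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # e.g. vector search results, best first
sparse = ["d1", "d4", "d3"]  # e.g. BM25 results, best first
fused = reciprocal_rank_fusion([dense, sparse])
```

Because only ranks matter, RRF sidesteps the awkward problem of normalizing cosine similarities against BM25 scores.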
## RAGAS Evaluation Framework

| Metric | What It Measures | Range | Target |
|---|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | 0-1 | > 0.85 |
| Answer Relevance | Does the answer address the question? | 0-1 | > 0.80 |
| Context Precision | Are retrieved chunks relevant? (precision) | 0-1 | > 0.75 |
| Context Recall | Were all needed chunks retrieved? | 0-1 | > 0.75 |
| Answer Correctness | Is the answer factually correct? | 0-1 | > 0.80 |
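To make the context-precision row concrete, here is a rank-aware computation in the spirit of RAGAS (a sketch, not the library's implementation): given per-chunk relevance judgments in rank order (in RAGAS these come from an LLM judge), it averages precision@k over the positions of the relevant chunks, so relevant chunks ranked higher score better.

```python
def context_precision(relevance: list[bool]) -> float:
    """relevance[i] is True if the chunk retrieved at rank i+1 was judged
    relevant to the question. Returns mean precision@k over relevant ranks."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3, an irrelevant one at rank 2.
score = context_precision([True, False, True])
```

A perfect retriever that puts every relevant chunk above every irrelevant one scores 1.0 regardless of how many irrelevant chunks trail behind.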
## Vector Database Comparison

| Database | Type | Max Scale | Filtering | Managed Option | Best For |
|---|---|---|---|---|---|
| Pinecone | Purpose-built | Billions | Metadata filters | Yes (only) | Simplicity, serverless |
| Weaviate | Purpose-built | Billions | GraphQL + filters | Yes + self-host | Hybrid search |
| Qdrant | Purpose-built | Billions | Payload filters | Yes + self-host | Performance, Rust-based |
| Milvus/Zilliz | Purpose-built | Trillions | Attribute filters | Yes + self-host | Very large scale |
| pgvector | Extension (Postgres) | Millions | Full SQL | Yes (managed PG) | Small-medium, existing PG |
| ChromaDB | Lightweight | Millions | Metadata filters | No (embedded) | Prototyping, local dev |
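Before committing to any of these, it is worth remembering what they optimize: approximate nearest-neighbor search over an exact brute-force scan. For small corpora (roughly tens of thousands of vectors) the exact scan below is often fast enough for prototyping; `brute_force_search` is an illustrative baseline, not a library API.

```python
import math

def brute_force_search(query_vec: list[float],
                       index: list[tuple[str, list[float]]],
                       k: int = 5) -> list[str]:
    """Exact nearest-neighbor lookup: score every vector by cosine
    similarity and return the IDs of the top k."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = [(cos(query_vec, vec), doc_id) for doc_id, vec in index]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
top = brute_force_search([1.0, 0.1], index, k=2)
```

When this scan becomes the bottleneck, the table above is the menu: the purpose-built options trade exactness for the sublinear ANN lookup shown in the pipeline diagram.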
## Resources