
RAG Architecture: Grounding LLMs in Your Data

#artificial-intelligence #vector-search #rag #llm #architecture

Retrieval-Augmented Generation (RAG) is the dominant pattern for building LLM applications over private data. Instead of fine-tuning a model on your documents, you retrieve relevant chunks at query time and inject them into the prompt. Done well, it reduces hallucinations, keeps knowledge current, and costs a fraction of fine-tuning.

RAG Pipeline Architecture

User Query
    |
    v
+------------------+
| Query Processing |  (rewrite, expand, decompose)
+------------------+
    |
    v
+------------------+     +--------------------+
| Embedding Model  |     | Document Ingestion |
| (query -> vector)|     |                    |
+------------------+     | Source Docs        |
    |                    |   |                 |
    v                    |   v                 |
+------------------+     | Chunking           |
| Vector Search    |<----| Embedding          |
| (ANN lookup)     |     | Indexing           |
+------------------+     +--------------------+
    |
    v
+------------------+
| Reranking        |  (cross-encoder, Cohere Rerank, etc.)
+------------------+
    |
    v
+------------------+
| Prompt Assembly  |  (system prompt + retrieved chunks + query)
+------------------+
    |
    v
+------------------+
| LLM Generation   |  (GPT-4, Claude, Llama, etc.)
+------------------+
    |
    v
+------------------+
| Post-Processing  |  (citation extraction, guardrails, caching)
+------------------+
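The boxes above can be sketched end-to-end in a few dozen lines. Everything here is a deliberate toy stand-in: `embed` hashes words instead of calling a real embedding model, the "index" is a plain list searched exhaustively, and generation stops at the assembled prompt string you would hand to an LLM.

```python
# Toy end-to-end RAG flow mirroring the diagram above.
# embed() is a hashing stand-in for a real embedding model.

def embed(text: str) -> list[float]:
    vec = [0.0] * 16
    for word in text.lower().split():
        vec[hash(word) % 16] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # Vector search step: rank every chunk by similarity to the query vector.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def assemble_prompt(query: str, chunks: list[str]) -> str:
    # Prompt assembly step: retrieved context + the user's question.
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Ingestion side: chunk (trivially, one doc per chunk), embed, index.
docs = [
    "RAG retrieves relevant chunks at query time.",
    "Fine-tuning bakes knowledge into model weights.",
]
index = [(d, embed(d)) for d in docs]

query = "How does RAG get its knowledge?"
prompt = assemble_prompt(query, retrieve(query, index))
# `prompt` is what the LLM generation step would receive.
```

A production system swaps each stub for the real component (embedding API, ANN index, reranker) without changing this overall shape.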

Chunking Strategy Comparison

| Strategy | Chunk Size | Overlap | Pros | Cons | Best For |
|---|---|---|---|---|---|
| Fixed-size | 256-512 tokens | 10-20% | Simple, predictable | Breaks mid-sentence | Homogeneous documents |
| Sentence-based | 1-5 sentences | 1 sentence | Preserves meaning | Uneven chunk sizes | Narrative text |
| Paragraph-based | 1-3 paragraphs | 0-1 paragraph | Natural boundaries | Large variance in size | Structured articles |
| Semantic | Varies | Adaptive | Groups related content | Slower, more complex | Mixed-format documents |
| Recursive | Target size, split on hierarchy | 10-20% | Respects document structure | Requires format knowledge | Markdown, HTML, code |
| Parent-Child | Small (retrieval) + large (context) | N/A | Precise retrieval, rich context | More complex indexing | Long documents |
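The first row (fixed-size with overlap) is simple enough to write from scratch. The sketch below uses whitespace-separated words as a rough stand-in for tokens; a real pipeline would measure size with the embedding model's own tokenizer.

```python
# Fixed-size chunking with overlap. "size" and "overlap" are counted in
# words here as a proxy for tokens.

def chunk_fixed(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    words = text.split()
    step = size - overlap  # each chunk starts `step` words after the last
    return [
        " ".join(words[i:i + size])
        for i in range(0, len(words), step)
        if words[i:i + size]
    ]

chunks = chunk_fixed(" ".join(str(i) for i in range(10)), size=4, overlap=1)
# -> ["0 1 2 3", "3 4 5 6", "6 7 8 9", "9"]
```

The overlap means each boundary word appears in two chunks, which is exactly the "breaks mid-sentence" mitigation the table's Cons column alludes to: a sentence split at a boundary is still intact in at least one chunk most of the time.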

Embedding Model Comparison

| Model | Dimensions | Max Tokens | MTEB Score | Speed | Cost | Open/Closed |
|---|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | ~65 | Fast | $0.13/1M tokens | Closed |
| OpenAI text-embedding-3-small | 1536 | 8191 | ~62 | Very Fast | $0.02/1M tokens | Closed |
| Cohere embed-v3 | 1024 | 512 | ~65 | Fast | $0.10/1M tokens | Closed |
| Voyage AI voyage-3 | 1024 | 32000 | ~67 | Fast | $0.06/1M tokens | Closed |
| BGE-large-en-v1.5 | 1024 | 512 | ~64 | Medium | Free (self-host) | Open |
| E5-mistral-7b-instruct | 4096 | 32768 | ~66 | Slow | Free (self-host) | Open |
| nomic-embed-text-v1.5 | 768 | 8192 | ~62 | Fast | Free (self-host) | Open |
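Dimension counts are not fixed for every model in this table: the text-embedding-3 models (and nomic-embed-text-v1.5) are trained so that a truncated prefix of the vector remains usable, which is what OpenAI's `dimensions` parameter does server-side. For vectors you already have, the equivalent post-processing is truncate-then-renormalize:

```python
# Shorten a Matryoshka-style embedding: keep the first `dims` components
# and re-normalize to unit length so cosine/dot-product scores stay valid.

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    head = vec[:dims]
    norm = sum(x * x for x in head) ** 0.5
    return [x / norm for x in head] if norm else head

short = truncate_embedding([3.0, 4.0, 0.0, 0.0], dims=2)
# -> [0.6, 0.8] (unit length)
```

This trades a little recall for a large index-size and latency win; do not apply it to models that were not trained for truncation.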

Retrieval Method Matrix

| Method | Precision | Recall | Latency | Complexity | When to Use |
|---|---|---|---|---|---|
| Dense Vector Search | High | Medium | Fast | Low | Default starting point |
| Sparse (BM25/SPLADE) | Medium | High | Fast | Low | Keyword-heavy queries, exact terms |
| Hybrid (Dense + Sparse) | High | High | Medium | Medium | Production systems (best general performance) |
| Multi-Query | High | Very High | Slow | Medium | Complex questions |
| HyDE | High | High | Slow | Medium | Abstract or vague queries |
| Parent-Child Retrieval | Very High | High | Medium | High | Long documents needing context |
| Knowledge Graph + Vector | Very High | High | Slow | High | Structured domain knowledge |
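The hybrid row needs a way to merge the dense and sparse result lists, and the usual choice is Reciprocal Rank Fusion (RRF): each document scores the sum of 1/(k + rank) over every list it appears in, so documents ranked well by both retrievers rise to the top.

```python
# Reciprocal Rank Fusion over any number of ranked result lists.
# k=60 is the conventional smoothing constant from the original RRF paper.

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # ranked by vector similarity
sparse = ["d1", "d4", "d3"]  # ranked by BM25
fused = rrf([dense, sparse])
# -> ["d1", "d3", "d4", "d2"]: d1 and d3 appear in both lists, so they lead
```

RRF only needs ranks, not scores, which is why it fuses BM25 and cosine similarity cleanly despite their incomparable score scales.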

RAGAS Evaluation Framework

| Metric | What It Measures | Range | Target |
|---|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | 0-1 | > 0.85 |
| Answer Relevance | Does the answer address the question? | 0-1 | > 0.80 |
| Context Precision | Are retrieved chunks relevant? (precision) | 0-1 | > 0.75 |
| Context Recall | Were all needed chunks retrieved? | 0-1 | > 0.75 |
| Answer Correctness | Is the answer factually correct? | 0-1 | > 0.80 |
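Context precision is the most mechanical of these metrics once relevance labels exist. RAGAS obtains the per-chunk labels with an LLM judge; given those labels as booleans in retrieval order, the score reduces to average precision over the ranks where a relevant chunk appears (so relevant chunks ranked early score higher). A simplified sketch:

```python
# Rank-weighted context precision over boolean relevance labels,
# ordered as the chunks were retrieved.

def context_precision(relevant: list[bool]) -> float:
    hits, total = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at each relevant rank
    return total / hits if hits else 0.0

score = context_precision([True, False, True])
# -> (1/1 + 2/3) / 2 = 0.8333...
```

Note the ordering sensitivity: [True, False, True] scores 0.83 while [False, True, True] scores 0.58, even though both retrievals found two relevant chunks out of three.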

Vector Database Comparison

| Database | Type | Max Scale | Filtering | Managed Option | Best For |
|---|---|---|---|---|---|
| Pinecone | Purpose-built | Billions | Metadata filters | Yes (only) | Simplicity, serverless |
| Weaviate | Purpose-built | Billions | GraphQL + filters | Yes + self-host | Hybrid search |
| Qdrant | Purpose-built | Billions | Payload filters | Yes + self-host | Performance, Rust-based |
| Milvus/Zilliz | Purpose-built | Trillions | Attribute filters | Yes + self-host | Very large scale |
| pgvector | Extension (Postgres) | Millions | Full SQL | Yes (managed PG) | Small-medium, existing PG |
| ChromaDB | Lightweight | Millions | Metadata filters | No (embedded) | Prototyping, local dev |
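At the bottom of this table, "prototyping scale" is smaller than it sounds: exhaustive brute-force search with a metadata filter is exact (no ANN approximation) and perfectly usable up to tens of thousands of vectors. A minimal sketch, with the schema (`"vec"` plus arbitrary metadata keys) and the raw dot-product similarity being illustrative choices:

```python
# Exact top-k vector search over an in-memory list, with an optional
# metadata filter applied before scoring (as real vector DBs do).

def top_k(query, records, k=3, where=None):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    candidates = [r for r in records if where is None or where(r)]
    return sorted(candidates, key=lambda r: dot(query, r["vec"]), reverse=True)[:k]

records = [
    {"id": 1, "vec": [1.0, 0.0], "lang": "en"},
    {"id": 2, "vec": [0.0, 1.0], "lang": "de"},
    {"id": 3, "vec": [0.9, 0.1], "lang": "en"},
]
hits = top_k([1.0, 0.0], records, k=1, where=lambda r: r["lang"] == "en")
# -> the "en" record closest to the query: id 1
```

Only when this becomes the latency bottleneck does the move to pgvector or a purpose-built store pay for its operational cost.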

Resources