RAG — Retrieval-Augmented Generation
The complete RAG pipeline: document ingestion, chunking strategies, embedding models, vector stores, retrieval, and prompt augmentation with practical implementation guidance.
Retrieval-Augmented Generation (RAG) is the most widely deployed pattern for building LLM applications that need access to private, domain-specific, or up-to-date knowledge. Instead of relying solely on what a model learned during training, RAG retrieves relevant context at query time and provides it to the model. The result: an LLM that can answer questions about your data.
Why RAG, not Fine-tuning?
RAG and fine-tuning solve different problems. RAG gives the model access to dynamic, frequently-changing, or large volumes of knowledge. Fine-tuning teaches the model new skills, styles, or behaviours. For most knowledge-base and Q&A applications, RAG is cheaper, faster to iterate, and more maintainable than fine-tuning. See Fine-Tuning for when the calculus shifts.
The RAG Pipeline
Documents → Ingestion → Chunking → Embedding → Vector Store
                                                    ↓
User Query → Query Embedding → Vector Search → Retrieved Chunks
                                                    ↓
                              LLM ← Augmented Prompt (query + chunks)
                                                    ↓
                                                 Answer
Every RAG system has two phases:
- Indexing (offline): Ingest and prepare documents for retrieval
- Querying (online): Retrieve relevant chunks and augment the prompt
Phase 1: Indexing
Document Loading
The first step is getting documents into a processable format:
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { TextLoader } from 'langchain/document_loaders/fs/text';
const loader = new DirectoryLoader('./docs', {
  '.pdf': (path) => new PDFLoader(path),
  '.txt': (path) => new TextLoader(path),
  '.mdx': (path) => new TextLoader(path),
});
const documents = await loader.load();
// Each document has: pageContent (string), metadata (source, page, etc.)
Chunking Strategies
Chunking splits documents into pieces that fit within retrieval windows. This is one of the most impactful decisions in RAG quality.
Fixed-size chunking — Simple, predictable, but may split mid-thought:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,       // characters per chunk
  chunkOverlap: 64,     // overlap between adjacent chunks
  separators: ['\n\n', '\n', '. ', ' ', ''], // split priority
});
const chunks = await splitter.splitDocuments(documents);
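The overlap mechanics above can be sketched in a few lines (illustrative only; the LangChain splitter additionally respects the separator priority list):

```typescript
// Minimal fixed-size chunker with overlap (illustrative; not the LangChain
// implementation). Each chunk carries `overlap` trailing characters of the
// previous chunk, so a thought split at a boundary still appears whole in
// at least one chunk.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap; // step forward, leaving `overlap` chars shared
  }
  return chunks;
}
```

The overlap is what makes fixed-size splitting tolerable: without it, a sentence cut at a chunk boundary is unretrievable as a unit.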
Semantic chunking — Splits on meaning boundaries (paragraph, section, topic change) — better quality, harder to implement:
import { MarkdownTextSplitter } from 'langchain/text_splitter';

// Split on markdown headers — preserves document structure
const splitter = new MarkdownTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
Key trade-offs:
| Chunk Size | Retrieval Precision | Context Quality |
|---|---|---|
| Small (256 tokens) | High — very specific matches | Low — may lack context |
| Medium (512 tokens) | Balanced | Balanced |
| Large (1024+ tokens) | Low — broader matches | High — rich context |
Chunk Size is Task-Dependent
For Q&A on technical docs, smaller chunks (256-512 tokens) with overlap work well. For summarisation tasks where broader context matters, larger chunks (512-1024 tokens) are better. There is no universal optimum — experiment with your specific documents and queries.
Embedding Documents
Embeddings convert text to numeric vectors that capture semantic meaning. Similar texts have similar vectors.
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
const embeddings = new OpenAIEmbeddings({
  model: 'text-embedding-3-small', // 1536 dimensions, fast, cheap
  // model: 'text-embedding-3-large' → better quality, higher cost
});
// Embed all chunks
const vectors = await embeddings.embedDocuments(
  chunks.map(chunk => chunk.pageContent)
);
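"Similar texts have similar vectors" is usually made precise with cosine similarity; a minimal sketch:

```typescript
// Cosine similarity: the dot product of two vectors divided by the product
// of their magnitudes. 1 means same direction, 0 means orthogonal
// (unrelated), and values in between grade semantic closeness.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

OpenAI's embedding vectors are normalised to unit length, so a plain dot product produces the same ranking.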
Embedding model selection:
| Model | Dimensions | Use case |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose (best cost/quality) |
| OpenAI text-embedding-3-large | 3072 | When quality is critical |
| Cohere embed-multilingual-v3 | 1024 | Multilingual documents |
| nomic-embed-text | 768 | Self-hosted (Ollama), privacy |
Storing in a Vector Database
import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';
// Store chunks + their vectors in PostgreSQL with pgvector extension
const vectorStore = await PGVectorStore.fromDocuments(chunks, embeddings, {
  postgresConnectionOptions: { connectionString: process.env.DATABASE_URL },
  tableName: 'documents',
  columns: {
    idColumnName: 'id',
    vectorColumnName: 'embedding',
    contentColumnName: 'content',
    metadataColumnName: 'metadata',
  },
});
Vector database options:
| Database | Type | Best for |
|---|---|---|
| pgvector (PostgreSQL) | Extension | Existing PostgreSQL infra, ACID transactions |
| Pinecone | Managed cloud | Scale, no infrastructure management |
| Qdrant | Open source / cloud | Self-hosted, filtering, high performance |
| Weaviate | Open source / cloud | Hybrid search (vector + keyword) |
| Chroma | Open source | Local development, prototyping |
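Under the hood, every option above does the same core job: store vectors, return the nearest by similarity. A toy in-memory version makes that concrete (for intuition and tests only; all names here are invented):

```typescript
// Toy in-memory vector store. Real stores (pgvector, Qdrant, ...) add
// persistence, approximate-nearest-neighbour indexing (HNSW/IVF), and
// metadata filtering on top of this same core idea.
type Entry = { content: string; vector: number[] };

class InMemoryVectorStore {
  private entries: Entry[] = [];

  add(content: string, vector: number[]): void {
    this.entries.push({ content, vector });
  }

  // Return the k entries most similar to the query vector, best first.
  search(query: number[], k: number): { content: string; score: number }[] {
    const cosine = (a: number[], b: number[]) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    return this.entries
      .map(e => ({ content: e.content, score: cosine(query, e.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}
```

The brute-force scan here is O(n) per query; production stores trade a little recall for sub-linear search via approximate indexes.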
Phase 2: Querying
Retrieval
// Semantic search — find the k most similar chunks to the query
async function retrieve(query: string, k: number = 5): Promise<Document[]> {
  const queryEmbedding = await embeddings.embedQuery(query);
  const results = await vectorStore.similaritySearchVectorWithScore(queryEmbedding, k);
  return results.map(([doc]) => doc); // each result is a [Document, score] pair
}
Retrieval Strategies
Naive retrieval — Return the top-k chunks by embedding similarity. Simple, works well for focused queries.
Hybrid search — Combine vector similarity with keyword (BM25) search. Better for queries with specific terms (product names, error codes):
// Illustrative API — the exact method and options vary by store
// (Weaviate, Qdrant, or pgvector combined with pg_trgm)
const results = await vectorStore.hybridSearch(query, { alpha: 0.5 });
// alpha: 0 = full keyword, 1 = full semantic, 0.5 = balanced
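The alpha blend behind hybrid search can be sketched as a weighted combination of the two scores (illustrative; real engines use BM25 on the keyword side and often fuse by rank, e.g. reciprocal-rank fusion, rather than raw scores):

```typescript
// Blend a keyword score and a semantic score for one document.
// alpha = 0 → pure keyword, alpha = 1 → pure semantic.
// Assumes both scores are already normalised to [0, 1].
function hybridScore(keywordScore: number, semanticScore: number, alpha: number): number {
  return (1 - alpha) * keywordScore + alpha * semanticScore;
}
```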
Re-ranking — After initial retrieval, use a cross-encoder model to re-rank results for accuracy:
import { CohereRerank } from '@langchain/cohere';
const reranker = new CohereRerank({ model: 'rerank-english-v3.0', topN: 3 });
const rerankedDocs = await reranker.compressDocuments(retrievedDocs, query);
Multi-query retrieval — Generate multiple query variants to catch results the original query might miss:
const response = await llm.generate(`
Generate 3 different phrasings of this query to improve retrieval:
"${originalQuery}"
Return as JSON: { "queries": ["...", "...", "..."] }
`);
const { queries } = JSON.parse(response); // parse the model's JSON reply
const allResults = await Promise.all(queries.map((q: string) => retrieve(q)));
const deduplicated = deduplicateByContent(allResults.flat());
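The `deduplicateByContent` helper above is hypothetical; a minimal version keyed on the chunk text:

```typescript
// Hypothetical helper: drop chunks whose pageContent has already been seen,
// keeping the first occurrence and preserving order. Needed because the
// same chunk is often retrieved by several query variants.
type Doc = { pageContent: string };

function deduplicateByContent<T extends Doc>(docs: T[]): T[] {
  const seen = new Set<string>();
  return docs.filter(doc => {
    if (seen.has(doc.pageContent)) return false;
    seen.add(doc.pageContent);
    return true;
  });
}
```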
Prompt Augmentation
async function ragQuery(userQuery: string): Promise<string> {
  // 1. Retrieve relevant chunks
  const relevantDocs = await retrieve(userQuery, 5);
  const context = relevantDocs.map(doc => doc.pageContent).join('\n\n---\n\n');

  // 2. Augment the prompt with retrieved context
  const augmentedPrompt = `
You are a helpful assistant answering questions about Aircury's engineering practices.
Use ONLY the following context to answer. If the answer isn't in the context, say so.

## Context
${context}

## Question
${userQuery}

## Answer
`;

  return llm.complete(augmentedPrompt);
}
The Hallucination Guard
Include an explicit instruction: “If the answer isn’t in the provided context, say you don’t know.” Without this, LLMs will use their training knowledge to fill gaps — which can produce confident-sounding but incorrect answers. RAG’s value is grounding; don’t let the model escape that grounding.
Common RAG Failure Modes
| Failure | Symptom | Fix |
|---|---|---|
| Poor retrieval | Correct info in DB, but not retrieved | Tune chunk size, try hybrid search, multi-query |
| Context too large | Model ignores parts of the context | Reduce k, use re-ranking, smaller chunks |
| Lost in the middle | Middle of context ignored | Put most important chunks first or last |
| Stale embeddings | Answers outdated after doc update | Re-index on document change, use metadata filtering |
| Hallucination bypass | Model uses training knowledge despite instruction | Stronger negative instruction, RAG-specific fine-tuning |
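The "lost in the middle" mitigation from the table can be sketched: given chunks ranked best-first, alternate them between the front and back of the context so the weakest land in the middle (the function name is invented):

```typescript
// Reorder chunks ranked best-first so the strongest sit at the start and
// end of the prompt, where models attend most reliably, and the weakest
// end up in the middle.
function reorderForContext<T>(rankedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedBestFirst.forEach((item, i) => {
    if (i % 2 === 0) front.push(item);      // ranks 1, 3, 5, ... at the front
    else back.unshift(item);                 // ranks 2, 4, 6, ... at the back
  });
  return [...front, ...back];
}
```

With five chunks ranked 1-5, this yields the order 1, 3, 5, 4, 2: the two best chunks sit at the edges of the context.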
Evaluating RAG Quality
RAG requires evaluating three distinct qualities:
| Metric | Question | How to Measure |
|---|---|---|
| Retrieval Recall | Did we retrieve the relevant chunks? | LLM judge: “Is the answer to this question present in these chunks?” |
| Faithfulness | Does the answer stick to the context? | LLM judge: “Is every claim in the answer supported by the context?” |
| Answer Relevance | Does the answer address the query? | LLM judge or embedding similarity between query and answer |
Tools: RAGAS automates all three metrics with LLM judges.
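A faithfulness judge from the table above can be sketched as a prompt plus verdict parsing; `judge` stands in for any chat-completion call, and all names here are invented:

```typescript
// Sketch of an LLM-judge faithfulness check. The prompt and the verdict
// parsing are the substance; `judge` is any function that sends a prompt
// to a model and returns its reply.
type Judge = (prompt: string) => Promise<string>;

function faithfulnessPrompt(context: string, answer: string): string {
  return [
    'Is every claim in the ANSWER supported by the CONTEXT?',
    'Reply with exactly YES or NO.',
    `CONTEXT:\n${context}`,
    `ANSWER:\n${answer}`,
  ].join('\n\n');
}

// Treat any reply beginning with YES as faithful; everything else fails.
function parseVerdict(reply: string): boolean {
  return reply.trim().toUpperCase().startsWith('YES');
}

async function isFaithful(judge: Judge, context: string, answer: string): Promise<boolean> {
  return parseVerdict(await judge(faithfulnessPrompt(context, answer)));
}
```

Constraining the judge to a YES/NO reply keeps parsing deterministic; frameworks like RAGAS wrap this same pattern with calibrated prompts and aggregation across a test set.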