RAG — Retrieval-Augmented Generation
The complete RAG pipeline: document ingestion, chunking strategies, embedding models, vector stores, retrieval, and prompt augmentation with practical implementation guidance.
Retrieval-Augmented Generation (RAG) is the most widely deployed pattern for building LLM applications that need access to private, domain-specific, or up-to-date knowledge. Instead of relying solely on what a model learned during training, RAG retrieves relevant context at query time and provides it to the model. The result: an LLM that can answer questions about your data.
Why RAG, not Fine-tuning?
RAG and fine-tuning solve different problems. RAG gives the model access to dynamic, frequently-changing, or large volumes of knowledge. Fine-tuning teaches the model new skills, styles, or behaviours. For most knowledge-base and Q&A applications, RAG is cheaper, faster to iterate, and more maintainable than fine-tuning. See Fine-Tuning for when the calculus shifts.
The RAG Pipeline
Documents → Ingestion → Chunking → Embedding → Vector Store
                                                    ↓
User Query → Query Embedding → Vector Search → Retrieved Chunks
                                                    ↓
                              LLM ← Augmented Prompt (query + chunks)
                                                    ↓
                                                 Answer
Every RAG system has two phases:
- Indexing (offline): Ingest and prepare documents for retrieval
- Querying (online): Retrieve relevant chunks and augment the prompt
Phase 1: Indexing
Document Loading
The first step is getting documents into a processable format:
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { TextLoader } from 'langchain/document_loaders/fs/text';
const loader = new DirectoryLoader('./docs', {
  '.pdf': (path) => new PDFLoader(path),
  '.txt': (path) => new TextLoader(path),
  '.mdx': (path) => new TextLoader(path),
});
const documents = await loader.load();
// Each document has: pageContent (string), metadata (source, page, etc.)
Chunking Strategies
Chunking splits documents into pieces that fit within retrieval windows. This is one of the most impactful decisions in RAG quality.
Fixed-size chunking — Simple, predictable, but may split mid-thought:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,       // characters per chunk
  chunkOverlap: 64,     // overlap between adjacent chunks
  separators: ['\n\n', '\n', '. ', ' ', ''], // split priority
});
const chunks = await splitter.splitDocuments(documents);
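The overlap mechanics above can be sketched in a few lines (illustrative only; the LangChain splitter additionally respects the separator priority list):

```typescript
// Minimal fixed-size chunker with overlap (illustrative; not the LangChain
// implementation). Each chunk carries `overlap` trailing characters of the
// previous chunk, so a thought split at a boundary still appears whole in
// at least one chunk.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap; // step forward, leaving `overlap` chars shared
  }
  return chunks;
}
```

The overlap is what makes fixed-size splitting tolerable: without it, a sentence cut at a chunk boundary is unretrievable as a unit.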
Semantic chunking — Splits on meaning boundaries (paragraph, section, topic change) — better quality, harder to implement:
import { MarkdownTextSplitter } from 'langchain/text_splitter';

// Split on markdown headers — preserves document structure
const splitter = new MarkdownTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
Key trade-offs:
| Chunk Size | Retrieval Precision | Context Quality |
|---|---|---|
| Small (256 tokens) | High — very specific matches | Low — may lack context |
| Medium (512 tokens) | Balanced | Balanced |
| Large (1024+ tokens) | Low — broader matches | High — rich context |
Chunk Size is Task-Dependent
For Q&A on technical docs, smaller chunks (256-512 tokens) with overlap work well. For summarisation tasks where broader context matters, larger chunks (512-1024 tokens) are better. There is no universal optimum — experiment with your specific documents and queries.
Embedding Documents
Embeddings convert text to numeric vectors that capture semantic meaning. Similar texts have similar vectors.
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
const embeddings = new OpenAIEmbeddings({
  model: 'text-embedding-3-small', // 1536 dimensions, fast, cheap
  // model: 'text-embedding-3-large' → better quality, higher cost
});
// Embed all chunks
const vectors = await embeddings.embedDocuments(
  chunks.map(chunk => chunk.pageContent)
);
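"Similar texts have similar vectors" is usually made precise with cosine similarity; a minimal sketch:

```typescript
// Cosine similarity: the dot product of two vectors divided by the product
// of their magnitudes. 1 means same direction, 0 means orthogonal
// (unrelated), and values in between grade semantic closeness.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

OpenAI's embedding vectors are normalised to unit length, so a plain dot product produces the same ranking.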
Embedding model selection:
| Model | Dimensions | Use case |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose (best cost/quality) |
| OpenAI text-embedding-3-large | 3072 | When quality is critical |
| Cohere embed-multilingual-v3 | 1024 | Multilingual documents |
| nomic-embed-text | 768 | Self-hosted (Ollama), privacy |
Storing in a Vector Database
import { PGVectorStore } from '@langchain/community/vectorstores/pgvector';
// Store chunks + their vectors in PostgreSQL with pgvector extension
const vectorStore = await PGVectorStore.fromDocuments(chunks, embeddings, {
  postgresConnectionOptions: { connectionString: process.env.DATABASE_URL },
  tableName: 'documents',
  columns: {
    idColumnName: 'id',
    vectorColumnName: 'embedding',
    contentColumnName: 'content',
    metadataColumnName: 'metadata',
  },
});
Vector database options:
| Database | Type | Best for |
|---|---|---|
| pgvector (PostgreSQL) | Extension | Existing PostgreSQL infra, ACID transactions |
| Pinecone | Managed cloud | Scale, no infrastructure management |
| Qdrant | Open source / cloud | Self-hosted, filtering, high performance |
| Weaviate | Open source / cloud | Hybrid search (vector + keyword) |
| Chroma | Open source | Local development, prototyping |
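Under the hood, every option above does the same core job: store vectors, return the nearest by similarity. A toy in-memory version makes that concrete (for intuition and tests only; all names here are invented):

```typescript
// Toy in-memory vector store. Real stores (pgvector, Qdrant, ...) add
// persistence, approximate-nearest-neighbour indexing (HNSW/IVF), and
// metadata filtering on top of this same core idea.
type Entry = { content: string; vector: number[] };

class InMemoryVectorStore {
  private entries: Entry[] = [];

  add(content: string, vector: number[]): void {
    this.entries.push({ content, vector });
  }

  // Return the k entries most similar to the query vector, best first.
  search(query: number[], k: number): { content: string; score: number }[] {
    const cosine = (a: number[], b: number[]) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    return this.entries
      .map(e => ({ content: e.content, score: cosine(query, e.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}
```

The brute-force scan here is O(n) per query; production stores trade a little recall for sub-linear search via approximate indexes.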
Phase 2: Querying
Retrieval
// Semantic search — find the k most similar chunks to the query
async function retrieve(query: string, k: number = 5): Promise<Document[]> {
  const queryEmbedding = await embeddings.embedQuery(query);
  const results = await vectorStore.similaritySearchVectorWithScore(queryEmbedding, k);
  return results.map(([doc]) => doc); // each result is a [Document, score] pair
}
Retrieval Strategies
Naive retrieval — Return the top-k chunks by embedding similarity. Simple, works well for focused queries.
Hybrid search — Combine vector similarity with keyword (BM25) search. Better for queries with specific terms (product names, error codes):
// Illustrative API — the exact method and options vary by store
// (Weaviate, Qdrant, or pgvector combined with pg_trgm)
const results = await vectorStore.hybridSearch(query, { alpha: 0.5 });
// alpha: 0 = full keyword, 1 = full semantic, 0.5 = balanced
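The alpha blend behind hybrid search can be sketched as a weighted combination of the two scores (illustrative; real engines use BM25 on the keyword side and often fuse by rank, e.g. reciprocal-rank fusion, rather than raw scores):

```typescript
// Blend a keyword score and a semantic score for one document.
// alpha = 0 → pure keyword, alpha = 1 → pure semantic.
// Assumes both scores are already normalised to [0, 1].
function hybridScore(keywordScore: number, semanticScore: number, alpha: number): number {
  return (1 - alpha) * keywordScore + alpha * semanticScore;
}
```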
Re-ranking — After initial retrieval, use a cross-encoder model to re-rank results for accuracy:
import { CohereRerank } from '@langchain/cohere';
const reranker = new CohereRerank({ model: 'rerank-english-v3.0', topN: 3 });
const rerankedDocs = await reranker.compressDocuments(retrievedDocs, query);
Multi-query retrieval — Generate multiple query variants to catch results the original query might miss:
const response = await llm.generate(`
Generate 3 different phrasings of this query to improve retrieval:
"${originalQuery}"
Return as JSON: { "queries": ["...", "...", "..."] }
`);
const { queries } = JSON.parse(response); // parse the model's JSON reply
const allResults = await Promise.all(queries.map((q: string) => retrieve(q)));
const deduplicated = deduplicateByContent(allResults.flat());
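The `deduplicateByContent` helper above is hypothetical; a minimal version keyed on the chunk text:

```typescript
// Hypothetical helper: drop chunks whose pageContent has already been seen,
// keeping the first occurrence and preserving order. Needed because the
// same chunk is often retrieved by several query variants.
type Doc = { pageContent: string };

function deduplicateByContent<T extends Doc>(docs: T[]): T[] {
  const seen = new Set<string>();
  return docs.filter(doc => {
    if (seen.has(doc.pageContent)) return false;
    seen.add(doc.pageContent);
    return true;
  });
}
```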
Prompt Augmentation
async function ragQuery(userQuery: string): Promise<string> {
  // 1. Retrieve relevant chunks
  const relevantDocs = await retrieve(userQuery, 5);
  const context = relevantDocs.map(doc => doc.pageContent).join('\n\n---\n\n');

  // 2. Augment the prompt with retrieved context
  const augmentedPrompt = `
You are a helpful assistant answering questions about Aircury's engineering practices.
Use ONLY the following context to answer. If the answer isn't in the context, say so.

## Context
${context}

## Question
${userQuery}

## Answer
`;

  return llm.complete(augmentedPrompt);
}
The Hallucination Guard
Include an explicit instruction: “If the answer isn’t in the provided context, say you don’t know.” Without this, LLMs will use their training knowledge to fill gaps — which can produce confident-sounding but incorrect answers. RAG’s value is grounding; don’t let the model escape that grounding.
Common RAG Failure Modes
| Failure | Symptom | Fix |
|---|---|---|
| Poor retrieval | Correct info in DB, but not retrieved | Tune chunk size, try hybrid search, multi-query |
| Context too large | Model ignores parts of the context | Reduce k, use re-ranking, smaller chunks |
| Lost in the middle | Middle of context ignored | Put most important chunks first or last |
| Stale embeddings | Answers outdated after doc update | Re-index on document change, use metadata filtering |
| Hallucination bypass | Model uses training knowledge despite instruction | Stronger negative instruction, RAG-specific fine-tuning |
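The "lost in the middle" mitigation from the table can be sketched: given chunks ranked best-first, alternate them between the front and back of the context so the weakest land in the middle (the function name is invented):

```typescript
// Reorder chunks ranked best-first so the strongest sit at the start and
// end of the prompt, where models attend most reliably, and the weakest
// end up in the middle.
function reorderForContext<T>(rankedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedBestFirst.forEach((item, i) => {
    if (i % 2 === 0) front.push(item);      // ranks 1, 3, 5, ... at the front
    else back.unshift(item);                 // ranks 2, 4, 6, ... at the back
  });
  return [...front, ...back];
}
```

With five chunks ranked 1-5, this yields the order 1, 3, 5, 4, 2: the two best chunks sit at the edges of the context.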
Evaluating RAG Quality
RAG requires evaluating three distinct qualities:
| Metric | Question | How to Measure |
|---|---|---|
| Retrieval Recall | Did we retrieve the relevant chunks? | LLM judge: “Is the answer to this question present in these chunks?” |
| Faithfulness | Does the answer stick to the context? | LLM judge: “Is every claim in the answer supported by the context?” |
| Answer Relevance | Does the answer address the query? | LLM judge or embedding similarity between query and answer |
Tools: RAGAS automates all three metrics with LLM judges.
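A faithfulness judge from the table above can be sketched as a prompt plus verdict parsing; `judge` stands in for any chat-completion call, and all names here are invented:

```typescript
// Sketch of an LLM-judge faithfulness check. The prompt and the verdict
// parsing are the substance; `judge` is any function that sends a prompt
// to a model and returns its reply.
type Judge = (prompt: string) => Promise<string>;

function faithfulnessPrompt(context: string, answer: string): string {
  return [
    'Is every claim in the ANSWER supported by the CONTEXT?',
    'Reply with exactly YES or NO.',
    `CONTEXT:\n${context}`,
    `ANSWER:\n${answer}`,
  ].join('\n\n');
}

// Treat any reply beginning with YES as faithful; everything else fails.
function parseVerdict(reply: string): boolean {
  return reply.trim().toUpperCase().startsWith('YES');
}

async function isFaithful(judge: Judge, context: string, answer: string): Promise<boolean> {
  return parseVerdict(await judge(faithfulnessPrompt(context, answer)));
}
```

Constraining the judge to a YES/NO reply keeps parsing deterministic; frameworks like RAGAS wrap this same pattern with calibrated prompts and aggregation across a test set.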