Retrieval-Augmented Generation (RAG)
RAG combines document retrieval with LLM generation, allowing AI to answer questions grounded in your specific data without fine-tuning.
Retrieval-Augmented Generation (RAG) is a pattern that lets LLMs answer questions using your own documents. Instead of relying solely on the model's training data, RAG retrieves relevant information at query time and includes it in the prompt.
The key insight: LLMs are great at reasoning and synthesis, but their knowledge is frozen at training time. RAG gives them access to current, domain-specific information.
How RAG Works
RAG has two phases:
1. Indexing (One-time)
- Chunk: Split documents into smaller pieces (e.g., 500-1000 tokens)
- Embed: Convert each chunk to a vector using an embedding model
- Store: Save vectors in a vector database (Pinecone, Weaviate, pgvector)
2. Query (Every request)
- Embed query: Convert the user's question to a vector
- Retrieve: Find the most similar document chunks (typically top 3-10)
- Augment: Add retrieved chunks to the prompt as context
- Generate: LLM answers using both its knowledge and the provided context
rag_pipeline.ts
import OpenAI from 'openai';
import { Index } from '@pinecone-database/pinecone';

const openai = new OpenAI();

// Indexing phase (run once per document update)
async function indexDocument(doc: string, docId: string, index: Index) {
  // 1. Chunk the document so no single piece exceeds the context limit.
  //    chunkText is an app-specific helper (see the sketch after this block).
  const chunks = chunkText(doc, { maxTokens: 500, overlap: 50 });

  // 2. Embed each chunk (use the same embedding model for indexing and queries)
  const embeddings = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks
  });

  // 3. Store vectors in the vector database
  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${docId}-${i}`,
      values: embeddings.data[i].embedding,
      metadata: { text: chunk, docId }
    }))
  );
}

// Query phase (every user request)
async function ragQuery(question: string, index: Index): Promise<string> {
  // 1. Embed the question with the same model used at indexing time
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question
  });

  // 2. Retrieve similar chunks (topK controls how many chunks to retrieve)
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  });

  // 3. Build the augmented context from the retrieved chunks
  const context = results.matches
    .map(m => m.metadata?.text)
    .join('\n\n---\n\n');

  // 4. Generate the answer (the context goes in the system prompt for better grounding)
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer based on the provided context. If the context doesn't contain the answer, say so.

Context:
${context}`
      },
      { role: 'user', content: question }
    ]
  });

  return response.choices[0].message.content ?? '';
}
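The chunkText helper above is application code, not a library function. A minimal sketch of one way to write it, using whitespace-separated words as a rough stand-in for tokens (a production version would count real tokens with a tokenizer such as tiktoken):

// Naive chunker: words approximate tokens; the overlap repeats the tail of each
// chunk at the start of the next so facts aren't split across chunk boundaries.
function chunkText(text: string, opts: { maxTokens: number; overlap: number }): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, opts.maxTokens - opts.overlap);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + opts.maxTokens).join(' '));
    if (start + opts.maxTokens >= words.length) break;
  }
  return chunks;
}

Wiring the two phases together might look like the following sketch; the 'docs' index name, the document text, and the question are placeholders, and the Pinecone client assumes PINECONE_API_KEY is set in the environment:

import { Pinecone } from '@pinecone-database/pinecone';

async function main() {
  const pc = new Pinecone();        // reads PINECONE_API_KEY from the environment
  const index = pc.index('docs');   // assumes an index named 'docs' already exists

  await indexDocument('...long report text...', 'q3-report', index);
  const answer = await ragQuery('What were the Q3 revenue drivers?', index);
  console.log(answer);
}

main().catch(console.error);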
When to Use RAG
| Use RAG When | Don't Use RAG When |
|---|---|
| Your data changes frequently | Knowledge is static and universal |
| You need source citations | Speed is critical (<1s) |
| Documents are large/numerous | A few examples suffice (use few-shot) |
| Accuracy matters more than speed | General knowledge questions |
| You want to avoid fine-tuning costs | You need a specific output style |
Common RAG Pitfalls
- Chunk size too large: Retrieval becomes less precise
- Chunk size too small: Loses context, retrieves fragments
- No overlap: Relevant info split across chunk boundaries
- Wrong embedding model: Use the same model for indexing and queries
- Too few/many chunks: Balance between context and noise
- No reranking: First-pass retrieval isn't always best; consider a reranker (sketched below)
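One way to add a reranking pass is to over-retrieve and let a cheap LLM act as the judge, as sketched below. The 0-10 scoring prompt, the gpt-4o-mini model choice, and the keep count are illustrative assumptions, not a standard API; dedicated cross-encoder rerankers are another common option.

import OpenAI from 'openai';

const openai = new OpenAI();

// Score each candidate chunk's relevance to the question and keep the best ones.
// Slower and costlier than vector search alone, but often more precise.
async function rerank(question: string, chunks: string[], keep = 5): Promise<string[]> {
  const scored = await Promise.all(
    chunks.map(async chunk => {
      const res = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{
          role: 'user',
          content: `Rate from 0 to 10 how relevant this passage is to the question. Reply with only the number.\n\nQuestion: ${question}\n\nPassage: ${chunk}`
        }]
      });
      return { chunk, score: Number(res.choices[0].message.content) || 0 };
    })
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map(s => s.chunk);
}

In ragQuery, you would raise topK (to 20, say), pass the retrieved texts through rerank, and build the context only from the chunks that survive.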
RAG vs Fine-Tuning vs Long Context
| Approach | Best For | Tradeoff |
|---|---|---|
| RAG | Dynamic knowledge, citations needed | Retrieval latency, chunking complexity |
| Fine-tuning | Consistent style, specialized behavior | Expensive, data goes stale |
| Long context | Few documents, simple use case | Cost scales with tokens, no citations |
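For contrast, the long-context row amounts to skipping retrieval entirely and inlining the documents in the prompt. A minimal sketch (the function name and prompt format are illustrative): simpler than RAG, but every request pays for every document token and there is no natural way to cite which passage supported the answer.

import OpenAI from 'openai';

const openai = new OpenAI();

// Long-context alternative: no index, no retrieval, just paste everything in.
async function askWithFullDocs(question: string, documents: string[]): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer using these documents:\n\n${documents.join('\n\n---\n\n')}`
      },
      { role: 'user', content: question }
    ]
  });
  return response.choices[0].message.content ?? '';
}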
Key Takeaways
- RAG = Retrieve relevant docs + Augment prompt + Generate response
- Two phases: indexing (one-time) and query (per-request)
- Use the same embedding model for indexing and querying
- Chunk size and overlap significantly affect quality
- RAG enables source citations and works with dynamic data
- Not a silver bullet — retrieval quality limits answer quality
In This Platform
RAG principles inform how we structure sources for retrieval. Each source has tags, key_findings, and quotes that could be embedded and retrieved to support claims automatically.
Relevant Files:
- sources/research_papers.json
- sources/official_docs.json
- sources/blog_posts.json
source_rag.ts (future)
// Future: RAG-powered source retrieval.
// getEmbedding, cosineSimilarity, and the in-memory `sources` array are
// assumed helpers that are not implemented yet.
interface Source {
  id: string;
  key_findings: string[];
  quotes: { text: string; context: string }[];
  tags: string[];
}

// When a claim needs backing, retrieve the most relevant sources
async function findSourcesForClaim(claim: string): Promise<Source[]> {
  const claimEmbedding = await getEmbedding(claim);

  // Score every key finding against the claim
  const scored = await Promise.all(
    sources
      .flatMap(s => s.key_findings.map(f => ({ source: s, finding: f })))
      .map(async ({ source, finding }) => ({
        source,
        similarity: cosineSimilarity(claimEmbedding, await getEmbedding(finding))
      }))
  );

  // Keep the sources behind the top three findings
  return scored
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 3)
    .map(r => r.source);
}