Retrieval-Augmented Generation (RAG)

RAG combines document retrieval with LLM generation, allowing AI to answer questions grounded in your specific data without fine-tuning.

Retrieval-Augmented Generation (RAG) is a pattern that lets LLMs answer questions using your own documents. Instead of relying solely on the model's training data, RAG retrieves relevant information at query time and includes it in the prompt.

The key insight: LLMs are great at reasoning and synthesis, but their knowledge is frozen at training time. RAG gives them access to current, domain-specific information.

How RAG Works

RAG has two phases:

1. Indexing (One-time)

  1. Chunk: Split documents into smaller pieces (e.g., 500-1000 tokens)
  2. Embed: Convert each chunk to a vector using an embedding model
  3. Store: Save vectors in a vector database (Pinecone, Weaviate, pgvector)

2. Query (Every request)

  1. Embed query: Convert the user's question to a vector
  2. Retrieve: Find the most similar document chunks (typically top 3-10)
  3. Augment: Add retrieved chunks to the prompt as context
  4. Generate: LLM answers using both its knowledge and the provided context
rag_pipeline.ts
import OpenAI from 'openai';
import { Index } from '@pinecone-database/pinecone';

const openai = new OpenAI();

// Indexing phase (run once per document update)
async function indexDocument(doc: string, docId: string, index: Index) {
  // 1. Chunk the document
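  // chunkText is an application-specific helper, not an SDK call (a minimal sketch appears below)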
  const chunks = chunkText(doc, { maxTokens: 500, overlap: 50 });
  
  // 2. Embed each chunk
  const embeddings = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks
  });
  
  // 3. Store in vector database
  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${docId}-${i}`,
      values: embeddings.data[i].embedding,
      metadata: { text: chunk, docId }
    }))
  );
}

// Query phase (every user request)
async function ragQuery(question: string, index: Index): Promise<string> {
  // 1. Embed the question
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question
  });
  
  // 2. Retrieve similar chunks
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  });
  
  // 3. Build augmented prompt
  const context = results.matches
    .map(m => m.metadata?.text)
    .join('\n\n---\n\n');
  
  // 4. Generate answer with context
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer based on the provided context. If the context doesn't contain the answer, say so.

Context:
${context}`
      },
      { role: 'user', content: question }
    ]
  });
  
  return response.choices[0].message.content ?? '';
}
Notes on the code above:
  • Chunking prevents exceeding context limits
  • Use the same embedding model for indexing and queries
  • topK controls how many chunks to retrieve
  • The retrieved context goes in the system prompt for better grounding
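
The pipeline above calls a chunkText helper that is not part of any SDK. Below is a minimal sketch, assuming a rough four-characters-per-token approximation rather than a real tokenizer such as tiktoken; the helper name and the numbers are illustrative.
chunk_text.ts (sketch)
// Split text into overlapping chunks sized by an approximate token budget.
// Assumption: ~4 characters per token; swap in a real tokenizer for accuracy.
function chunkText(
  text: string,
  { maxTokens, overlap }: { maxTokens: number; overlap: number }
): string[] {
  const charsPerToken = 4;
  const chunkSize = maxTokens * charsPerToken;
  const stride = (maxTokens - overlap) * charsPerToken;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

In practice, splitting on semantic boundaries (paragraphs, headings, sentences) usually retrieves better than raw character offsets.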

When to Use RAG

| Use RAG When | Don't Use RAG When |
| --- | --- |
| Your data changes frequently | Knowledge is static and universal |
| You need source citations | Speed is critical (<1s) |
| Documents are large/numerous | A few examples suffice (use few-shot) |
| Accuracy matters more than speed | General knowledge questions |
| You want to avoid fine-tuning costs | You need a specific output style |

Common RAG Pitfalls

  1. Chunk size too large: Retrieval becomes less precise
  2. Chunk size too small: Loses context, retrieves fragments
  3. No overlap: Relevant info split across chunk boundaries
  4. Wrong embedding model: Use the same model for indexing and queries
  5. Too few/many chunks: Balance between context and noise
  6. No reranking: First-pass retrieval isn't always the best ordering; consider a reranker (see the sketch after this list)
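
A lightweight way to add reranking is to over-retrieve (for example topK: 20) and let a chat model pick the best few passages before generation. This is a sketch reusing the OpenAI client from rag_pipeline.ts; the rerankChunks name and prompt format are assumptions, not a library API, and dedicated cross-encoder rerankers are another common choice.
rerank.ts (sketch)
// LLM-based reranking: ask the model which of the over-retrieved passages
// actually answer the question, and keep only those.
async function rerankChunks(
  question: string,
  chunks: string[],
  keep: number
): Promise<string[]> {
  const numbered = chunks.map((c, i) => `[${i}] ${c}`).join('\n\n');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Question: ${question}\n\nPassages:\n${numbered}\n\n` +
        `Return only a JSON array with the indices of the ${keep} passages most ` +
        `relevant to the question, e.g. [2, 0, 5].`
    }]
  });
  // Assumes the model returns bare JSON; harden this parsing in real code.
  const indices: number[] = JSON.parse(response.choices[0].message.content ?? '[]');
  return indices.slice(0, keep).map(i => chunks[i]).filter(Boolean);
}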

RAG vs Fine-Tuning vs Long Context

| Approach | Best For | Tradeoff |
| --- | --- | --- |
| RAG | Dynamic knowledge, citations needed | Retrieval latency, chunking complexity |
| Fine-tuning | Consistent style, specialized behavior | Expensive, data goes stale |
| Long context | Few documents, simple use case | Cost scales with tokens, no citations |

Key Takeaways

  • RAG = Retrieve relevant docs + Augment prompt + Generate response
  • Two phases: indexing (one-time) and query (per-request)
  • Use the same embedding model for indexing and querying
  • Chunk size and overlap significantly affect quality
  • RAG enables source citations and works with dynamic data
  • Not a silver bullet — retrieval quality limits answer quality

In This Platform

RAG principles inform how we structure sources for retrieval. Each source has tags, key_findings, and quotes that could be embedded and retrieved to support claims automatically.

Relevant Files:
  • sources/research_papers.json
  • sources/official_docs.json
  • sources/blog_posts.json
source_rag.ts (future)
// Future: RAG-powered source retrieval
interface Source {
  id: string;
  key_findings: string[];
  quotes: { text: string; context: string }[];
  tags: string[];
}

// When a claim needs backing, retrieve the most relevant sources.
// Assumes `sources: Source[]` is already loaded in memory.
async function findSourcesForClaim(claim: string): Promise<Source[]> {
  const claimEmbedding = await getEmbedding(claim);
  // Score every key finding against the claim (embeddings are async, so
  // resolve them with Promise.all rather than awaiting inside .map).
  const scored = await Promise.all(
    sources
      .flatMap(s => s.key_findings.map(finding => ({ source: s, finding })))
      .map(async ({ source, finding }) => ({
        source,
        similarity: cosineSimilarity(claimEmbedding, await getEmbedding(finding))
      }))
  );
  return scored
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 3)
    .map(r => r.source);
}
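
The sketch above assumes getEmbedding and cosineSimilarity helpers that do not exist yet; getEmbedding would wrap whichever embedding API is chosen, and cosine similarity itself is just a normalized dot product:

// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}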

