Retrieval-Augmented Generation (RAG)
RAG combines document retrieval with LLM generation, allowing AI to answer questions grounded in your specific data without fine-tuning.
Retrieval-Augmented Generation (RAG) is a pattern that lets LLMs answer questions using your own documents. Instead of relying solely on the model's training data, RAG retrieves relevant information at query time and includes it in the prompt.
The key insight: LLMs are great at reasoning and synthesis, but their knowledge is frozen at training time. RAG gives them access to current, domain-specific information.
How RAG Works
RAG has two phases:
1. Indexing (One-time)
- Chunk: Split documents into smaller pieces (e.g., 500-1000 tokens)
- Embed: Convert each chunk to a vector using an embedding model
- Store: Save vectors in a vector database (Pinecone, Weaviate, pgvector)
2. Query (Every request)
- Embed query: Convert the user's question to a vector
- Retrieve: Find the most similar document chunks (typically top 3-10)
- Augment: Add retrieved chunks to the prompt as context
- Generate: LLM answers using both its knowledge and the provided context
rag_pipeline.ts
import OpenAI from 'openai';
import { Index } from '@pinecone-database/pinecone';

const openai = new OpenAI();

// Indexing phase (run once per document update)
async function indexDocument(doc: string, docId: string, index: Index) {
  // 1. Chunk the document so no single piece exceeds the context limit.
  //    chunkText is an app-specific helper (see the sketch after this block).
  const chunks = chunkText(doc, { maxTokens: 500, overlap: 50 });

  // 2. Embed each chunk (use the same embedding model for indexing and queries)
  const embeddings = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks
  });

  // 3. Store vectors in the vector database
  await index.upsert(
    chunks.map((chunk, i) => ({
      id: `${docId}-${i}`,
      values: embeddings.data[i].embedding,
      metadata: { text: chunk, docId }
    }))
  );
}

// Query phase (every user request)
async function ragQuery(question: string, index: Index): Promise<string> {
  // 1. Embed the question with the same model used at indexing time
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question
  });

  // 2. Retrieve similar chunks (topK controls how many chunks to retrieve)
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  });

  // 3. Build the augmented context from the retrieved chunks
  const context = results.matches
    .map(m => m.metadata?.text)
    .join('\n\n---\n\n');

  // 4. Generate the answer (the context goes in the system prompt for better grounding)
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer based on the provided context. If the context doesn't contain the answer, say so.

Context:
${context}`
      },
      { role: 'user', content: question }
    ]
  });

  return response.choices[0].message.content ?? '';
}
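The chunkText helper above is application code, not a library function. A minimal sketch of one way to write it, using whitespace-separated words as a rough stand-in for tokens (a production version would count real tokens with a tokenizer such as tiktoken):

// Naive chunker: words approximate tokens; the overlap repeats the tail of each
// chunk at the start of the next so facts aren't split across chunk boundaries.
function chunkText(text: string, opts: { maxTokens: number; overlap: number }): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, opts.maxTokens - opts.overlap);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + opts.maxTokens).join(' '));
    if (start + opts.maxTokens >= words.length) break;
  }
  return chunks;
}

Wiring the two phases together might look like the following sketch; the 'docs' index name, the document text, and the question are placeholders, and the Pinecone client assumes PINECONE_API_KEY is set in the environment:

import { Pinecone } from '@pinecone-database/pinecone';

async function main() {
  const pc = new Pinecone();        // reads PINECONE_API_KEY from the environment
  const index = pc.index('docs');   // assumes an index named 'docs' already exists

  await indexDocument('...long report text...', 'q3-report', index);
  const answer = await ragQuery('What were the Q3 revenue drivers?', index);
  console.log(answer);
}

main().catch(console.error);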
When to Use RAG
| Use RAG When | Don't Use RAG When |
|---|---|
| Your data changes frequently | Knowledge is static and universal |
| You need source citations | Speed is critical (<1s) |
| Documents are large/numerous | A few examples suffice (use few-shot) |
| Accuracy matters more than speed | General knowledge questions |
| You want to avoid fine-tuning costs | You need a specific output style |
Common RAG Pitfalls
- Chunk size too large: Retrieval becomes less precise
- Chunk size too small: Loses context, retrieves fragments
- No overlap: Relevant info split across chunk boundaries
- Wrong embedding model: Use the same model for indexing and queries
- Too few/many chunks: Balance between context and noise
- No reranking: First-pass retrieval isn't always best; consider a reranker (sketched below)
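One way to add a reranking pass is to over-retrieve and let a cheap LLM act as the judge, as sketched below. The 0-10 scoring prompt, the gpt-4o-mini model choice, and the keep count are illustrative assumptions, not a standard API; dedicated cross-encoder rerankers are another common option.

import OpenAI from 'openai';

const openai = new OpenAI();

// Score each candidate chunk's relevance to the question and keep the best ones.
// Slower and costlier than vector search alone, but often more precise.
async function rerank(question: string, chunks: string[], keep = 5): Promise<string[]> {
  const scored = await Promise.all(
    chunks.map(async chunk => {
      const res = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{
          role: 'user',
          content: `Rate from 0 to 10 how relevant this passage is to the question. Reply with only the number.\n\nQuestion: ${question}\n\nPassage: ${chunk}`
        }]
      });
      return { chunk, score: Number(res.choices[0].message.content) || 0 };
    })
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map(s => s.chunk);
}

In ragQuery, you would raise topK (to 20, say), pass the retrieved texts through rerank, and build the context only from the chunks that survive.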
RAG vs Fine-Tuning vs Long Context
| Approach | Best For | Tradeoff |
|---|---|---|
| RAG | Dynamic knowledge, citations needed | Retrieval latency, chunking complexity |
| Fine-tuning | Consistent style, specialized behavior | Expensive, data goes stale |
| Long context | Few documents, simple use case | Cost scales with tokens, no citations |
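For contrast, the long-context row amounts to skipping retrieval entirely and inlining the documents in the prompt. A minimal sketch (the function name and prompt format are illustrative): simpler than RAG, but every request pays for every document token and there is no natural way to cite which passage supported the answer.

import OpenAI from 'openai';

const openai = new OpenAI();

// Long-context alternative: no index, no retrieval, just paste everything in.
async function askWithFullDocs(question: string, documents: string[]): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer using these documents:\n\n${documents.join('\n\n---\n\n')}`
      },
      { role: 'user', content: question }
    ]
  });
  return response.choices[0].message.content ?? '';
}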
Key Takeaways
- RAG = Retrieve relevant docs + Augment prompt + Generate response
- Two phases: indexing (one-time) and query (per-request)
- Use the same embedding model for indexing and querying
- Chunk size and overlap significantly affect quality
- RAG enables source citations and works with dynamic data
- Not a silver bullet — retrieval quality limits answer quality
In This Platform
RAG principles inform how we structure sources for retrieval. Each source has tags, key_findings, and quotes that could be embedded and retrieved to support claims automatically.
Relevant Files:
- sources/research_papers.json
- sources/official_docs.json
- sources/blog_posts.json
source_rag.ts (future)
// Future: RAG-powered source retrieval.
// getEmbedding, cosineSimilarity, and the in-memory `sources` array are
// assumed helpers that are not implemented yet.
interface Source {
  id: string;
  key_findings: string[];
  quotes: { text: string; context: string }[];
  tags: string[];
}

// When a claim needs backing, retrieve the most relevant sources
async function findSourcesForClaim(claim: string): Promise<Source[]> {
  const claimEmbedding = await getEmbedding(claim);

  // Score every key finding against the claim
  const scored = await Promise.all(
    sources
      .flatMap(s => s.key_findings.map(f => ({ source: s, finding: f })))
      .map(async ({ source, finding }) => ({
        source,
        similarity: cosineSimilarity(claimEmbedding, await getEmbedding(finding))
      }))
  );

  // Keep the sources behind the top three findings
  return scored
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 3)
    .map(r => r.source);
}