
AI System Monitoring

Patterns · Intermediate · 12 min read
Sources verified Dec 23

Observability practices for AI systems that track model performance, costs, latency, and output quality in production.

AI system monitoring extends traditional application observability to handle the unique challenges of LLM-based applications: non-deterministic outputs, variable costs, quality degradation, and model drift. Effective monitoring answers: Is the AI working? Is it working well? Is it worth the cost?

Key Metrics for AI Systems

Performance Metrics

  • Latency: Time-to-first-token, total response time, streaming throughput
  • Throughput: Requests per second, concurrent conversations
  • Error rates: API failures, timeout rates, rate limit hits
  • Availability: Model uptime, fallback activation frequency
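
A minimal sketch of capturing the latency metrics above, assuming a generic token iterator from any streaming LLM client (the client call itself is not shown):

```python
import time
from typing import Iterable, Iterator

def measure_stream(token_stream: Iterable[str]) -> Iterator[str]:
    """Wrap any streaming LLM response and record time-to-first-token,
    total latency, and streaming throughput."""
    start = time.perf_counter()
    first_token_at = None
    token_count = 0
    for token in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1
        yield token
    end = time.perf_counter()
    duration = end - start
    metrics = {
        "time_to_first_token_s": round((first_token_at or end) - start, 3),
        "total_latency_s": round(duration, 3),
        "tokens_streamed": token_count,
        "throughput_tok_per_s": round(token_count / duration, 1) if duration else 0.0,
    }
    print(metrics)  # in practice, emit to your metrics backend instead of printing
```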

Quality Metrics

  • Task completion rate: Did the AI accomplish the user's goal?
  • Accuracy: For tasks with ground truth (extraction, classification)
  • Hallucination rate: Frequency of fabricated information
  • User feedback: Thumbs up/down, regeneration requests, abandonment
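
A sketch of deriving quality metrics from logged interaction events; the event fields (completed, feedback, regenerated) are assumptions about what your application records:

```python
# Each dict is one logged interaction; the field names are illustrative.
events = [
    {"completed": True,  "feedback": "up",   "regenerated": False},
    {"completed": True,  "feedback": None,   "regenerated": True},
    {"completed": False, "feedback": "down", "regenerated": True},
]

def quality_summary(events: list[dict]) -> dict:
    n = len(events)
    rated = [e for e in events if e["feedback"] is not None]
    return {
        "task_completion_rate": sum(e["completed"] for e in events) / n,
        "positive_feedback_rate": (
            sum(e["feedback"] == "up" for e in rated) / len(rated) if rated else None
        ),
        "regeneration_rate": sum(e["regenerated"] for e in events) / n,
    }

print(quality_summary(events))
```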

Cost Metrics

  • Token usage: Input/output tokens per request, daily/monthly totals
  • Cost per request: Average and P95 costs
  • Cost per user: Identify heavy users or abuse patterns
  • Model cost comparison: Track spend across different models
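
A sketch of computing cost per request from token counts; the model names and per-1K-token prices below are placeholders, since real pricing varies by provider and changes over time:

```python
# Placeholder pricing table: USD per 1K tokens, split by input/output.
PRICES_PER_1K = {
    "model-small": {"input": 0.00015, "output": 0.0006},
    "model-large": {"input": 0.005,   "output": 0.015},
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

# Example: a request with a large prompt and a short answer.
print(request_cost_usd("model-large", input_tokens=3200, output_tokens=250))
```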

Observability Stack for AI

1. Logging

Capture full request/response pairs (redacting PII):

  • User prompts and system prompts
  • Model responses (full text or summaries)
  • Token counts and costs
  • Latency breakdowns
  • Error details and stack traces
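
A sketch of a structured log entry for one LLM call, with a deliberately simplistic email redaction step standing in for a real PII-detection pass:

```python
import json
import re
import time
import uuid

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    # Simplistic stand-in: production systems typically use a dedicated PII-detection step.
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)

def log_llm_call(prompt: str, response: str, model: str,
                 input_tokens: int, output_tokens: int, latency_s: float) -> None:
    entry = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "prompt": redact(prompt),
        "response": redact(response),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_s": latency_s,
    }
    print(json.dumps(entry))  # or ship to your log aggregator of choice
```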

2. Tracing

Distributed tracing across the AI pipeline:

  • Embedding generation → retrieval → augmentation → generation
  • Tool call chains (which tools, in what order, what results)
  • Multi-turn conversation flows
  • Parent-child relationships for agentic workflows
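
A sketch of per-stage spans for a RAG pipeline using the OpenTelemetry Python API; embed(), search(), and generate() are hypothetical placeholders for your embedding model, vector store, and LLM call:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

# Hypothetical pipeline stages; replace with real embedding, retrieval, and generation calls.
def embed(text: str) -> list[float]: return [0.0] * 8
def search(vector: list[float], top_k: int) -> list[str]: return ["doc-1", "doc-2"]
def generate(question: str, docs: list[str]) -> str: return f"Answer grounded in {len(docs)} documents."

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("rag.request") as root:
        root.set_attribute("question.length", len(question))

        with tracer.start_as_current_span("rag.embedding"):
            query_vector = embed(question)

        with tracer.start_as_current_span("rag.retrieval") as span:
            documents = search(query_vector, top_k=5)
            span.set_attribute("retrieval.num_documents", len(documents))

        with tracer.start_as_current_span("rag.generation") as span:
            answer = generate(question, documents)
            span.set_attribute("generation.output_chars", len(answer))

        return answer
```

With only the OpenTelemetry API and no SDK or exporter configured, these spans are no-ops; wiring them to a backend (or to an AI observability tool that accepts OTel data) makes the parent-child structure visible.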

3. Alerting

Proactive notification of issues:

  • Cost spike alerts (daily spend exceeds threshold)
  • Latency degradation (P95 above SLA)
  • Error rate increases (sudden API failures)
  • Quality drops (user feedback trends negative)
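
A sketch of the threshold checks behind the alerts above; the limits are illustrative and would normally live in your alerting system rather than in application code:

```python
import statistics

# Illustrative thresholds: tune to your budget and SLA.
DAILY_SPEND_LIMIT_USD = 200.0
P95_LATENCY_SLA_S = 4.0
ERROR_RATE_LIMIT = 0.05

def check_alerts(daily_spend_usd: float, latencies_s: list[float], error_rate: float) -> list[str]:
    alerts = []
    if daily_spend_usd > DAILY_SPEND_LIMIT_USD:
        alerts.append(f"Cost spike: ${daily_spend_usd:.2f} exceeds daily limit ${DAILY_SPEND_LIMIT_USD:.2f}")
    if len(latencies_s) >= 2:
        p95 = statistics.quantiles(latencies_s, n=20)[18]  # 95th percentile
        if p95 > P95_LATENCY_SLA_S:
            alerts.append(f"Latency degradation: P95 {p95:.2f}s above SLA {P95_LATENCY_SLA_S:.2f}s")
    if error_rate > ERROR_RATE_LIMIT:
        alerts.append(f"Error rate {error_rate:.1%} above limit {ERROR_RATE_LIMIT:.1%}")
    return alerts
```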

Monitoring Tools

Tool                 Focus                                       Open Source?
LangSmith            LangChain tracing, prompt management        No
Langfuse             Tracing, analytics, prompt versioning       Yes
Helicone             OpenAI proxy with analytics                 Yes
Weights & Biases     Experiment tracking, model monitoring       Partial
Arize                ML observability, drift detection           No
OpenTelemetry        General tracing (AI adapters available)     Yes
Datadog/New Relic    APM with AI integrations                    No

Common Patterns

Prompt Versioning

Track which prompt version produced which outputs. When quality drops, identify which prompt change caused it.
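
One lightweight way to do this is to treat prompts as versioned registry entries and log the version id with every request; the registry below is a hypothetical example:

```python
# Hypothetical prompt registry: every change gets a new version id.
PROMPTS = {
    "summarize/v3": "Summarize the following text in three bullet points:\n\n{text}",
    "summarize/v4": "Summarize the following text in three concise, source-cited bullet points:\n\n{text}",
}
ACTIVE_PROMPT_ID = "summarize/v4"

def build_request(text: str) -> dict:
    return {
        "prompt": PROMPTS[ACTIVE_PROMPT_ID].format(text=text),
        # Logged with the response, so a quality drop can be traced to a specific prompt version.
        "metadata": {"prompt_id": ACTIVE_PROMPT_ID},
    }
```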

A/B Testing

Compare model versions, prompt variants, or retrieval strategies. Measure impact on quality and cost.
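
A sketch of stable variant assignment for an A/B test: hashing the user id gives each user a consistent variant, so quality and cost metrics can be split per variant.

```python
import hashlib

VARIANTS = ["prompt_v3", "prompt_v4"]  # could equally be model names or retrieval strategies

def assign_variant(user_id: str) -> str:
    # Deterministic: the same user always gets the same variant across sessions.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

# Log the assigned variant with every request so dashboards can compare variants.
print(assign_variant("user-42"))
```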

Regression Detection

Automatically run evaluation suites when prompts change. Catch quality regressions before production.
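
A sketch of a pytest-style gate that could run in CI whenever a prompt changes; classify() is a hypothetical stand-in for the prompt-plus-model pipeline under test, and the golden cases and threshold are illustrative:

```python
# Small golden dataset with known-correct labels.
GOLDEN_CASES = [
    {"input": "I want my money back for this order", "expected": "refund_request"},
    {"input": "Where is my package?",                "expected": "shipping_status"},
    {"input": "How do I change my email address?",   "expected": "account_update"},
]
MIN_ACCURACY = 0.9

def test_prompt_has_not_regressed():
    # classify() is the hypothetical pipeline under test (prompt + model + output parsing).
    correct = sum(classify(case["input"]) == case["expected"] for case in GOLDEN_CASES)
    accuracy = correct / len(GOLDEN_CASES)
    assert accuracy >= MIN_ACCURACY, f"Accuracy {accuracy:.0%} fell below {MIN_ACCURACY:.0%}"
```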

Cost Attribution

Tag requests by feature, team, or user. Understand where AI spend is going.
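
A sketch of cost attribution: tag each request with feature and team identifiers and aggregate spend by tag (the tag names are illustrative; the same idea works per user):

```python
from collections import defaultdict

spend_by_tag: dict[tuple[str, str], float] = defaultdict(float)

def record_spend(cost_usd: float, feature: str, team: str) -> None:
    # Aggregate per (feature, team) so dashboards show where AI spend is going.
    spend_by_tag[(feature, team)] += cost_usd

record_spend(0.042, feature="support_chat", team="customer-success")
record_spend(0.018, feature="doc_search",   team="platform")
print(dict(spend_by_tag))
```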

Feedback Loops

Collect user feedback (ratings, corrections) and use it to improve prompts and detect issues.

When to Invest in Monitoring

Invest When                           Defer When
Production traffic                    Prototyping/experimentation
Cost is significant                   Low volume, fixed budget
Quality matters (customer-facing)     Internal tools with tolerance for errors
Agentic workflows (complex chains)    Simple prompt-response
Multi-model routing                   Single model, simple use case

Debugging Agentic Workflows

Observability alone doesn't fix non-deterministic agent failures. An agent that fails 20% of the time at $5/run is expensive to debug. Two complementary approaches:

Statistical Debugging (TDD approach): Run the agent many times and catch failures statistically. Tests anchor behavior: the agent iterates until tests pass, not until it 'feels done'. See the TDD for Agents concept.
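
A sketch of the statistical approach: run the agent repeatedly against the same task and measure a pass rate rather than trusting a single run; run_agent() and check_output() are hypothetical placeholders for your agent invocation and its test.

```python
N_RUNS = 20  # enough runs to estimate a failure rate, balanced against cost per run

def pass_rate(task: str) -> float:
    # run_agent() and check_output() are hypothetical: your agent and its pass/fail check.
    passes = sum(check_output(run_agent(task)) for _ in range(N_RUNS))
    return passes / N_RUNS

# A pass rate of 0.8 quantifies "fails 20% of the time" and gives a baseline
# to compare against after each attempted fix.
```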

Trace-Based Debugging: Use tracing tools to identify where in the chain failure occurs:

  • Was retrieval poisoned (bad context from RAG)?
  • Did a tool call return unexpected data?
  • Did the model misinterpret instructions?

Combine both: tests tell you WHAT failed, traces tell you WHY.

Key Takeaways

  • AI monitoring extends traditional observability with quality, cost, and non-determinism concerns
  • Key metrics: latency, token usage, cost per request, task completion rate, hallucination rate
  • Capture full prompts/responses (redact PII) for debugging and quality analysis
  • Use distributed tracing for RAG pipelines and agentic workflows
  • Alert on cost spikes, latency degradation, and quality drops
  • Tools like Langfuse, LangSmith, and Helicone specialize in AI observability

In This Platform

This platform implements build-time monitoring: validation reports track missing sources, broken cross-references, and content quality metrics. The 'validate' command produces a summary of issues that serves as a quality dashboard for content, demonstrating monitoring principles applied to content rather than runtime AI.

Relevant Files:
  • build.js
