AI System Monitoring
Observability practices that track model performance, cost, latency, and output quality for AI systems in production.
AI system monitoring extends traditional application observability to handle the unique challenges of LLM-based applications: non-deterministic outputs, variable costs, quality degradation, and model drift. Effective monitoring answers three questions: Is the AI working? Is it working well? Is it worth the cost?
Key Metrics for AI Systems
Performance Metrics
- Latency: Time-to-first-token, total response time, streaming throughput
- Throughput: Requests per second, concurrent conversations
- Error rates: API failures, timeout rates, rate limit hits
- Availability: Model uptime, fallback activation frequency
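To make the latency metrics above concrete, here is a minimal sketch of per-request timing around a streaming model call. `callModelStream` is a hypothetical stand-in for a provider's streaming API, not a real library function.

```typescript
// Minimal latency capture around a streaming call.
// callModelStream is a hypothetical wrapper that yields text chunks.
interface LatencyRecord {
  timeToFirstTokenMs: number;
  totalMs: number;
}

async function measureLatency(
  callModelStream: (prompt: string) => AsyncIterable<string>,
  prompt: string
): Promise<LatencyRecord> {
  const start = Date.now();
  let firstTokenAt: number | undefined;

  for await (const _chunk of callModelStream(prompt)) {
    if (firstTokenAt === undefined) firstTokenAt = Date.now(); // first streamed chunk
  }

  const end = Date.now();
  return {
    timeToFirstTokenMs: (firstTokenAt ?? end) - start,
    totalMs: end - start,
  };
}
```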
Quality Metrics
- Task completion rate: Did the AI accomplish the user's goal?
- Accuracy: For tasks with ground truth (extraction, classification)
- Hallucination rate: Frequency of fabricated information
- User feedback: Thumbs up/down, regeneration requests, abandonment
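For the metrics that have ground truth (extraction, classification), a simple accuracy computation over a labeled sample is often enough to start. A minimal sketch, with illustrative types:

```typescript
// Accuracy over a labeled evaluation sample (e.g. a classification task).
interface LabeledExample {
  expected: string;   // ground-truth label
  predicted: string;  // model output, normalized to a label
}

function accuracy(examples: LabeledExample[]): number {
  if (examples.length === 0) return 0;
  const correct = examples.filter((e) => e.predicted === e.expected).length;
  return correct / examples.length;
}
```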
Cost Metrics
- Token usage: Input/output tokens per request, daily/monthly totals
- Cost per request: Average and P95 costs
- Cost per user: Identify heavy users or abuse patterns
- Model cost comparison: Track spend across different models
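A sketch of turning token counts into cost metrics. The per-1K-token prices are placeholder assumptions, not any provider's actual rates.

```typescript
// Cost accounting from token usage. PRICE_PER_1K values are assumed
// examples (USD) and should be replaced with your provider's pricing.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

const PRICE_PER_1K = { input: 0.003, output: 0.015 };

function costOfRequest(u: Usage): number {
  return (u.inputTokens / 1000) * PRICE_PER_1K.input +
         (u.outputTokens / 1000) * PRICE_PER_1K.output;
}

function p95(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
  return sorted[idx];
}
```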
Observability Stack for AI
1. Logging
Capture full request/response pairs (redacting PII):
- User prompts and system prompts
- Model responses (full text or summaries)
- Token counts and costs
- Latency breakdowns
- Error details and stack traces
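A sketch of a structured log record with naive PII redaction applied before the entry leaves the process. The redaction regexes are illustrative, not exhaustive.

```typescript
// Structured log entry for one model call, with basic PII redaction.
interface LlmLogEntry {
  timestamp: string;
  model: string;
  systemPrompt: string;
  userPrompt: string;
  response: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  error?: string;
}

function redactPii(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")            // email addresses
    .replace(/\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/g, "[PHONE]"); // US-style phone numbers
}

function logLlmCall(entry: LlmLogEntry): void {
  const safe: LlmLogEntry = {
    ...entry,
    userPrompt: redactPii(entry.userPrompt),
    response: redactPii(entry.response),
  };
  console.log(JSON.stringify(safe)); // swap for your log shipper
}
```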
2. Tracing
Distributed tracing across the AI pipeline:
- Embedding generation → retrieval → augmentation → generation
- Tool call chains (which tools, in what order, what results)
- Multi-turn conversation flows
- Parent-child relationships for agentic workflows
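A sketch of spanning a RAG pipeline with the OpenTelemetry API (listed under tools below), so retrieval and generation appear as child spans of one request. The pipeline stages are hypothetical stubs, and error handling is omitted.

```typescript
// Tracing a RAG request with OpenTelemetry: retrieval and generation
// become child spans of one root span. The stage functions are stubs.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("rag-pipeline");

async function retrieve(question: string): Promise<string[]> {
  return [`stub document for: ${question}`];            // vector search goes here
}
function buildPrompt(question: string, docs: string[]): string {
  return `${docs.join("\n")}\n\nQuestion: ${question}`; // augmentation
}
async function generate(prompt: string): Promise<string> {
  return `stub answer (${prompt.length} prompt chars)`; // model call goes here
}

async function answerQuestion(question: string): Promise<string> {
  return tracer.startActiveSpan("rag.request", async (root) => {
    const docs = await tracer.startActiveSpan("rag.retrieval", async (span) => {
      const result = await retrieve(question);
      span.setAttribute("rag.documents", result.length);
      span.end();
      return result;
    });

    const answer = await tracer.startActiveSpan("rag.generation", async (span) => {
      const result = await generate(buildPrompt(question, docs));
      span.end();
      return result;
    });

    root.end();
    return answer;
  });
}
```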
3. Alerting
Proactive notification of issues:
- Cost spike alerts (daily spend exceeds threshold)
- Latency degradation (P95 above SLA)
- Error rate increases (sudden API failures)
- Quality drops (user feedback trends negative)
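A minimal sketch of the alert checks above, run on a schedule over aggregated daily stats. The thresholds are illustrative assumptions, and `notify()` stands in for whatever alerting channel you use.

```typescript
// Threshold-based alert checks over daily aggregates.
interface DailyStats {
  spendUsd: number;
  p95LatencyMs: number;
  errorRate: number;    // 0..1
  approvalRate: number; // 0..1, from user feedback
}

const THRESHOLDS = { spendUsd: 500, p95LatencyMs: 4000, errorRate: 0.05, approvalRate: 0.8 };

function checkAlerts(stats: DailyStats, notify: (msg: string) => void): void {
  if (stats.spendUsd > THRESHOLDS.spendUsd)
    notify(`Cost spike: $${stats.spendUsd.toFixed(2)} today`);
  if (stats.p95LatencyMs > THRESHOLDS.p95LatencyMs)
    notify(`Latency degradation: P95 ${stats.p95LatencyMs}ms above SLA`);
  if (stats.errorRate > THRESHOLDS.errorRate)
    notify(`Error rate ${(stats.errorRate * 100).toFixed(1)}% above threshold`);
  if (stats.approvalRate < THRESHOLDS.approvalRate)
    notify(`Quality drop: approval rate ${(stats.approvalRate * 100).toFixed(1)}%`);
}
```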
Monitoring Tools
| Tool | Focus | Open Source? |
|---|---|---|
| LangSmith | LangChain tracing, prompt management | No |
| Langfuse | Tracing, analytics, prompt versioning | Yes |
| Helicone | OpenAI proxy with analytics | Yes |
| Weights & Biases | Experiment tracking, model monitoring | Partial |
| Arize | ML observability, drift detection | No |
| OpenTelemetry | General tracing (AI adapters available) | Yes |
| Datadog/New Relic | APM with AI integrations | No |
Common Patterns
Prompt Versioning
Track which prompt version produced which outputs. When quality drops, identify which prompt change caused it.
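One lightweight approach (an assumption, not a prescribed scheme) is to derive a version identifier by hashing the prompt template and attach it to every log entry and trace span.

```typescript
// Derive a stable prompt version id by hashing the template, then attach
// it to every logged request so outputs can be grouped by version.
import { createHash } from "node:crypto";

function promptVersion(template: string): string {
  return createHash("sha256").update(template).digest("hex").slice(0, 12);
}

const SUMMARIZE_PROMPT =
  "Summarize the following document in three bullet points:\n{document}";

const version = promptVersion(SUMMARIZE_PROMPT);
// Include { promptId: "summarize", promptVersion: version } in each
// log entry and span so quality drops can be tied to a specific change.
```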
A/B Testing
Compare model versions, prompt variants, or retrieval strategies. Measure impact on quality and cost.
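A sketch of deterministic bucketing so the same user always sees the same variant, with the variant id logged next to quality and cost metrics.

```typescript
// Deterministic user bucketing for an A/B test between prompt variants.
import { createHash } from "node:crypto";

function assignVariant(userId: string, variants: string[]): string {
  const firstByte = createHash("sha256").update(userId).digest()[0];
  return variants[firstByte % variants.length]; // stable, roughly uniform split
}

const variant = assignVariant("user-123", ["prompt-v1", "prompt-v2"]);
// Log `variant` with every request so completion rate, feedback, and cost
// can be compared per variant.
```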
Regression Detection
Automatically run evaluation suites when prompts change. Catch quality regressions before production.
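A sketch of a small evaluation suite to run in CI whenever a prompt changes; `callModel` and the per-case checks are hypothetical.

```typescript
// Run a fixed evaluation suite against the current prompt/model and fail
// the build if any case regresses. callModel is a hypothetical wrapper.
interface EvalCase {
  name: string;
  input: string;
  check: (output: string) => boolean; // assertion on the model output
}

async function runEvalSuite(
  callModel: (input: string) => Promise<string>,
  cases: EvalCase[]
): Promise<string[]> {
  const failures: string[] = [];
  for (const c of cases) {
    const output = await callModel(c.input);
    if (!c.check(output)) failures.push(c.name);
  }
  return failures; // non-empty => block the deploy
}
```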
Cost Attribution
Tag requests by feature, team, or user. Understand where AI spend is going.
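A sketch of rolling tagged request costs up by feature or team.

```typescript
// Aggregate per-request costs by an attribution tag (feature or team).
interface TaggedCost {
  feature: string;
  team: string;
  costUsd: number;
}

function spendBy(records: TaggedCost[], key: "feature" | "team"): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r[key], (totals.get(r[key]) ?? 0) + r.costUsd);
  }
  return totals;
}
```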
Feedback Loops
Collect user feedback (ratings, corrections) and use it to improve prompts and detect issues.
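A sketch of turning thumbs up/down events into an approval rate per prompt version, which can feed the quality-drop alerts above.

```typescript
// Approval rate per prompt version from explicit user feedback.
interface FeedbackEvent {
  promptVersion: string;
  rating: "up" | "down";
}

function approvalRate(events: FeedbackEvent[], promptVersion: string): number {
  const relevant = events.filter((e) => e.promptVersion === promptVersion);
  if (relevant.length === 0) return NaN; // no feedback yet for this version
  const ups = relevant.filter((e) => e.rating === "up").length;
  return ups / relevant.length;
}
```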
When to Invest in Monitoring
| Invest When | Defer When |
|---|---|
| Production traffic | Prototyping/experimentation |
| Cost is significant | Low volume, fixed budget |
| Quality matters (customer-facing) | Internal tools with error tolerance |
| Agentic workflows (complex chains) | Simple prompt-response |
| Multi-model routing | Single model, simple use case |
Debugging Agentic Workflows
Observability alone doesn't fix non-deterministic agent failures. An agent that fails 20% of the time at $5/run is expensive to debug. Two complementary approaches:
Statistical Debugging (TDD approach): Run the agent many times and catch failures statistically. Tests anchor behavior - the agent iterates until tests pass, not until it 'feels done'. See the TDD for Agents concept.
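As a sketch of the statistical approach, run the agent repeatedly against the same test harness and report the observed failure rate; `runAgent` is a hypothetical pass/fail wrapper around your agent and its tests.

```typescript
// Run the agent N times against its test suite and measure how often it fails.
// runAgent is a hypothetical wrapper returning true when the run passed.
async function failureRate(
  runAgent: () => Promise<boolean>,
  runs: number
): Promise<number> {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    if (!(await runAgent())) failures++;
  }
  return failures / runs; // e.g. 0.2 = fails 20% of the time
}
```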
Trace-Based Debugging: Use tracing tools to identify where in the chain failure occurs:
- Was retrieval poisoned (bad context from RAG)?
- Did a tool call return unexpected data?
- Did the model misinterpret instructions?
Combine both: tests tell you WHAT failed, traces tell you WHY.
Key Takeaways
- AI monitoring extends traditional observability with quality, cost, and non-determinism concerns
- Key metrics: latency, token usage, cost per request, task completion rate, hallucination rate
- Capture full prompts/responses (redact PII) for debugging and quality analysis
- Use distributed tracing for RAG pipelines and agentic workflows
- Alert on cost spikes, latency degradation, and quality drops
- Tools like Langfuse, LangSmith, and Helicone specialize in AI observability
In This Platform
This platform implements build-time monitoring: validation reports track missing sources, broken cross-references, and content quality metrics. The 'validate' command produces a summary of issues that serves as a quality dashboard for content, demonstrating monitoring principles applied to content rather than runtime AI.
- build.js