AI System Monitoring
Observability practices that track model performance, cost, latency, and output quality for AI systems in production.
AI system monitoring extends traditional application observability to handle the unique challenges of LLM-based applications: non-deterministic outputs, variable costs, quality degradation, and model drift. Effective monitoring answers three questions: Is the AI working? Is it working well? Is it worth the cost?
Key Metrics for AI Systems
Performance Metrics
- Latency: Time-to-first-token, total response time, streaming throughput
- Throughput: Requests per second, concurrent conversations
- Error rates: API failures, timeout rates, rate limit hits
- Availability: Model uptime, fallback activation frequency
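To make the latency metrics above concrete, here is a minimal sketch of per-request timing around a streaming model call. `callModelStream` is a hypothetical stand-in for a provider's streaming API, not a real library function.

```typescript
// Minimal latency capture around a streaming call.
// callModelStream is a hypothetical wrapper that yields text chunks.
interface LatencyRecord {
  timeToFirstTokenMs: number;
  totalMs: number;
}

async function measureLatency(
  callModelStream: (prompt: string) => AsyncIterable<string>,
  prompt: string
): Promise<LatencyRecord> {
  const start = Date.now();
  let firstTokenAt: number | undefined;

  for await (const _chunk of callModelStream(prompt)) {
    if (firstTokenAt === undefined) firstTokenAt = Date.now(); // first streamed chunk
  }

  const end = Date.now();
  return {
    timeToFirstTokenMs: (firstTokenAt ?? end) - start,
    totalMs: end - start,
  };
}
```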
Quality Metrics
- Task completion rate: Did the AI accomplish the user's goal?
- Accuracy: For tasks with ground truth (extraction, classification)
- Hallucination rate: Frequency of fabricated information
- User feedback: Thumbs up/down, regeneration requests, abandonment
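For the metrics that have ground truth (extraction, classification), a simple accuracy computation over a labeled sample is often enough to start. A minimal sketch, with illustrative types:

```typescript
// Accuracy over a labeled evaluation sample (e.g. a classification task).
interface LabeledExample {
  expected: string;   // ground-truth label
  predicted: string;  // model output, normalized to a label
}

function accuracy(examples: LabeledExample[]): number {
  if (examples.length === 0) return 0;
  const correct = examples.filter((e) => e.predicted === e.expected).length;
  return correct / examples.length;
}
```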
Cost Metrics
- Token usage: Input/output tokens per request, daily/monthly totals
- Cost per request: Average and P95 costs
- Cost per user: Identify heavy users or abuse patterns
- Model cost comparison: Track spend across different models
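A sketch of turning token counts into cost metrics. The per-1K-token prices are placeholder assumptions, not any provider's actual rates.

```typescript
// Cost accounting from token usage. PRICE_PER_1K values are assumed
// examples (USD) and should be replaced with your provider's pricing.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

const PRICE_PER_1K = { input: 0.003, output: 0.015 };

function costOfRequest(u: Usage): number {
  return (u.inputTokens / 1000) * PRICE_PER_1K.input +
         (u.outputTokens / 1000) * PRICE_PER_1K.output;
}

function p95(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
  return sorted[idx];
}
```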
Observability Stack for AI
1. Logging
Capture full request/response pairs (redacting PII):
- User prompts and system prompts
- Model responses (full text or summaries)
- Token counts and costs
- Latency breakdowns
- Error details and stack traces
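A sketch of a structured log record with naive PII redaction applied before the entry leaves the process. The redaction regexes are illustrative, not exhaustive.

```typescript
// Structured log entry for one model call, with basic PII redaction.
interface LlmLogEntry {
  timestamp: string;
  model: string;
  systemPrompt: string;
  userPrompt: string;
  response: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  error?: string;
}

function redactPii(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")            // email addresses
    .replace(/\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/g, "[PHONE]"); // US-style phone numbers
}

function logLlmCall(entry: LlmLogEntry): void {
  const safe: LlmLogEntry = {
    ...entry,
    userPrompt: redactPii(entry.userPrompt),
    response: redactPii(entry.response),
  };
  console.log(JSON.stringify(safe)); // swap for your log shipper
}
```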
2. Tracing
Distributed tracing across the AI pipeline:
- Embedding generation → retrieval → augmentation → generation
- Tool call chains (which tools, in what order, what results)
- Multi-turn conversation flows
- Parent-child relationships for agentic workflows
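A sketch of spanning a RAG pipeline with the OpenTelemetry API (listed under tools below), so retrieval and generation appear as child spans of one request. The pipeline stages are hypothetical stubs, and error handling is omitted.

```typescript
// Tracing a RAG request with OpenTelemetry: retrieval and generation
// become child spans of one root span. The stage functions are stubs.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("rag-pipeline");

async function retrieve(question: string): Promise<string[]> {
  return [`stub document for: ${question}`];            // vector search goes here
}
function buildPrompt(question: string, docs: string[]): string {
  return `${docs.join("\n")}\n\nQuestion: ${question}`; // augmentation
}
async function generate(prompt: string): Promise<string> {
  return `stub answer (${prompt.length} prompt chars)`; // model call goes here
}

async function answerQuestion(question: string): Promise<string> {
  return tracer.startActiveSpan("rag.request", async (root) => {
    const docs = await tracer.startActiveSpan("rag.retrieval", async (span) => {
      const result = await retrieve(question);
      span.setAttribute("rag.documents", result.length);
      span.end();
      return result;
    });

    const answer = await tracer.startActiveSpan("rag.generation", async (span) => {
      const result = await generate(buildPrompt(question, docs));
      span.end();
      return result;
    });

    root.end();
    return answer;
  });
}
```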
3. Alerting
Proactive notification of issues:
- Cost spike alerts (daily spend exceeds threshold)
- Latency degradation (P95 above SLA)
- Error rate increases (sudden API failures)
- Quality drops (user feedback trends negative)
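A minimal sketch of the alert checks above, run on a schedule over aggregated daily stats. The thresholds are illustrative assumptions, and `notify()` stands in for whatever alerting channel you use.

```typescript
// Threshold-based alert checks over daily aggregates.
interface DailyStats {
  spendUsd: number;
  p95LatencyMs: number;
  errorRate: number;    // 0..1
  approvalRate: number; // 0..1, from user feedback
}

const THRESHOLDS = { spendUsd: 500, p95LatencyMs: 4000, errorRate: 0.05, approvalRate: 0.8 };

function checkAlerts(stats: DailyStats, notify: (msg: string) => void): void {
  if (stats.spendUsd > THRESHOLDS.spendUsd)
    notify(`Cost spike: $${stats.spendUsd.toFixed(2)} today`);
  if (stats.p95LatencyMs > THRESHOLDS.p95LatencyMs)
    notify(`Latency degradation: P95 ${stats.p95LatencyMs}ms above SLA`);
  if (stats.errorRate > THRESHOLDS.errorRate)
    notify(`Error rate ${(stats.errorRate * 100).toFixed(1)}% above threshold`);
  if (stats.approvalRate < THRESHOLDS.approvalRate)
    notify(`Quality drop: approval rate ${(stats.approvalRate * 100).toFixed(1)}%`);
}
```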
Monitoring Tools
| Tool | Focus | Open Source? |
|---|---|---|
| LangSmith | LangChain tracing, prompt management | No |
| Langfuse | Tracing, analytics, prompt versioning | Yes |
| Helicone | OpenAI proxy with analytics | Yes |
| Weights & Biases | Experiment tracking, model monitoring | Partial |
| Arize | ML observability, drift detection | No |
| OpenTelemetry | General tracing (AI adapters available) | Yes |
| Datadog/New Relic | APM with AI integrations | No |
Common Patterns
Prompt Versioning
Track which prompt version produced which outputs. When quality drops, identify which prompt change caused it.
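One lightweight approach (an assumption, not a prescribed scheme) is to derive a version identifier by hashing the prompt template and attach it to every log entry and trace span.

```typescript
// Derive a stable prompt version id by hashing the template, then attach
// it to every logged request so outputs can be grouped by version.
import { createHash } from "node:crypto";

function promptVersion(template: string): string {
  return createHash("sha256").update(template).digest("hex").slice(0, 12);
}

const SUMMARIZE_PROMPT =
  "Summarize the following document in three bullet points:\n{document}";

const version = promptVersion(SUMMARIZE_PROMPT);
// Include { promptId: "summarize", promptVersion: version } in each
// log entry and span so quality drops can be tied to a specific change.
```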
A/B Testing
Compare model versions, prompt variants, or retrieval strategies. Measure impact on quality and cost.
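A sketch of deterministic bucketing so the same user always sees the same variant, with the variant id logged next to quality and cost metrics.

```typescript
// Deterministic user bucketing for an A/B test between prompt variants.
import { createHash } from "node:crypto";

function assignVariant(userId: string, variants: string[]): string {
  const firstByte = createHash("sha256").update(userId).digest()[0];
  return variants[firstByte % variants.length]; // stable, roughly uniform split
}

const variant = assignVariant("user-123", ["prompt-v1", "prompt-v2"]);
// Log `variant` with every request so completion rate, feedback, and cost
// can be compared per variant.
```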
Regression Detection
Automatically run evaluation suites when prompts change. Catch quality regressions before production.
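A sketch of a small evaluation suite to run in CI whenever a prompt changes; `callModel` and the per-case checks are hypothetical.

```typescript
// Run a fixed evaluation suite against the current prompt/model and fail
// the build if any case regresses. callModel is a hypothetical wrapper.
interface EvalCase {
  name: string;
  input: string;
  check: (output: string) => boolean; // assertion on the model output
}

async function runEvalSuite(
  callModel: (input: string) => Promise<string>,
  cases: EvalCase[]
): Promise<string[]> {
  const failures: string[] = [];
  for (const c of cases) {
    const output = await callModel(c.input);
    if (!c.check(output)) failures.push(c.name);
  }
  return failures; // non-empty => block the deploy
}
```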
Cost Attribution
Tag requests by feature, team, or user. Understand where AI spend is going.
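A sketch of rolling tagged request costs up by feature or team.

```typescript
// Aggregate per-request costs by an attribution tag (feature or team).
interface TaggedCost {
  feature: string;
  team: string;
  costUsd: number;
}

function spendBy(records: TaggedCost[], key: "feature" | "team"): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r[key], (totals.get(r[key]) ?? 0) + r.costUsd);
  }
  return totals;
}
```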
Feedback Loops
Collect user feedback (ratings, corrections) and use it to improve prompts and detect issues.
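A sketch of turning thumbs up/down events into an approval rate per prompt version, which can feed the quality-drop alerts above.

```typescript
// Approval rate per prompt version from explicit user feedback.
interface FeedbackEvent {
  promptVersion: string;
  rating: "up" | "down";
}

function approvalRate(events: FeedbackEvent[], promptVersion: string): number {
  const relevant = events.filter((e) => e.promptVersion === promptVersion);
  if (relevant.length === 0) return NaN; // no feedback yet for this version
  const ups = relevant.filter((e) => e.rating === "up").length;
  return ups / relevant.length;
}
```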
When to Invest in Monitoring
| Invest When | Defer When |
|---|---|
| Production traffic | Prototyping/experimentation |
| Cost is significant | Low volume, fixed budget |
| Quality matters (customer-facing) | Internal tools with error tolerance |
| Agentic workflows (complex chains) | Simple prompt-response |
| Multi-model routing | Single model, simple use case |
Debugging Agentic Workflows
Observability alone doesn't fix non-deterministic agent failures. An agent that fails 20% of the time at $5/run is expensive to debug. Two complementary approaches:
Statistical Debugging (TDD approach): Run the agent many times and catch failures statistically. Tests anchor behavior - the agent iterates until tests pass, not until it 'feels done'. See the TDD for Agents concept.
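As a sketch of the statistical approach, run the agent repeatedly against the same test harness and report the observed failure rate; `runAgent` is a hypothetical pass/fail wrapper around your agent and its tests.

```typescript
// Run the agent N times against its test suite and measure how often it fails.
// runAgent is a hypothetical wrapper returning true when the run passed.
async function failureRate(
  runAgent: () => Promise<boolean>,
  runs: number
): Promise<number> {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    if (!(await runAgent())) failures++;
  }
  return failures / runs; // e.g. 0.2 = fails 20% of the time
}
```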
Trace-Based Debugging: Use tracing tools to identify where in the chain failure occurs:
- Was retrieval poisoned (bad context from RAG)?
- Did a tool call return unexpected data?
- Did the model misinterpret instructions?
Combine both: tests tell you WHAT failed, traces tell you WHY.
Key Takeaways
- AI monitoring extends traditional observability with quality, cost, and non-determinism concerns
- Key metrics: latency, token usage, cost per request, task completion rate, hallucination rate
- Capture full prompts/responses (redact PII) for debugging and quality analysis
- Use distributed tracing for RAG pipelines and agentic workflows
- Alert on cost spikes, latency degradation, and quality drops
- Tools like Langfuse, LangSmith, and Helicone specialize in AI observability
In This Platform
This platform implements build-time monitoring: validation reports track missing sources, broken cross-references, and content quality metrics. The 'validate' command produces a summary of issues that serves as a quality dashboard for content, demonstrating monitoring principles applied to content rather than runtime AI.
- build.js