Calibrated AI Use
A framework for matching AI oversight levels to code stakes—using AI everywhere while calibrating verification to risk.
The question isn't whether to use AI for critical code—it's how much oversight to apply. Research shows AI-generated code has ~1.5-2.7x more security issues than human code. But human code has security issues too—the answer is calibrated oversight (security scanning + expert review), not blanket avoidance.
The Core Principle
The bottleneck is human review capacity, not AI capability.
An authentication expert can use AI for auth code—with appropriate oversight. A security engineer can use AI for security tooling—with proper review. Blanket prohibitions waste expertise and slow teams down.
Instead, match:
- Reviewer expertise → Code domain
- Review depth → Code stakes
- Tooling → Risk level
The Oversight Calibration Framework
| Code Stakes | Examples | Oversight Level | What Changes |
|---|---|---|---|
| HIGH | Auth, crypto, compliance, financial | Maximum | Security scanning (SAST/DAST), domain expert review, mandatory test coverage, audit trail |
| MEDIUM | Business logic, data handling, APIs | Standard | Thorough code review, domain-aware reviewer, integration tests |
| LOW | Boilerplate, formatting, CRUD | Basic | Standard code review, unit tests, linting |
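The framework is simple enough to encode as data that CI can enforce. A minimal TypeScript sketch; all names are hypothetical, and the MEDIUM/LOW coverage thresholds are illustrative (the framework itself only mandates 100% for HIGH-stakes paths):

```ts
// Hypothetical encoding of the oversight framework as data CI can enforce.
type Stakes = "HIGH" | "MEDIUM" | "LOW";

interface OversightPolicy {
  securityScan: boolean;                                       // SAST/DAST before merge
  reviewerExpertise: "domain-expert" | "domain-aware" | "any";
  requiredTests: string[];                                     // suites that must pass
  minBranchCoverage: number;                                   // 0..1
  auditTrail: boolean;                                         // record who reviewed what, when
}

const OVERSIGHT_POLICY: Record<Stakes, OversightPolicy> = {
  HIGH:   { securityScan: true,  reviewerExpertise: "domain-expert",
            requiredTests: ["unit", "integration", "security"],
            minBranchCoverage: 1.0, auditTrail: true },
  MEDIUM: { securityScan: false, reviewerExpertise: "domain-aware",
            requiredTests: ["unit", "integration"],
            minBranchCoverage: 0.8, auditTrail: false },  // illustrative threshold
  LOW:    { securityScan: false, reviewerExpertise: "any",
            requiredTests: ["unit"],
            minBranchCoverage: 0.5, auditTrail: false },  // illustrative threshold
};
```

Keeping the policy as data rather than prose means a change to oversight levels is itself a reviewable diff.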
High-Stakes Code: Maximum Oversight
What qualifies: Authentication, authorization, cryptography, financial calculations, PHI/PII handling, compliance-critical code.
Oversight requirements:
- Security scanning: Run SAST/DAST on ALL code (AI or human)
- Domain expert review: Reviewer must have expertise in the specific domain
- Mandatory test coverage: 100% branch coverage for security-critical paths
- Audit trail: Document who reviewed what and when
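As a sketch of what these four requirements look like when automated, here is a hypothetical merge gate. The PullRequest shape and its field names are assumptions for illustration, not a real CI or forge API:

```ts
// Hypothetical merge gate for HIGH-stakes changes.
interface PullRequest {
  securityScanPassed: boolean;            // SAST/DAST result
  reviewerDomains: string[];              // e.g. ["auth", "payments"]
  branchCoverageOnCriticalPaths: number;  // 0..1
  auditRecorded: boolean;                 // who reviewed what, and when
}

// Returns the unmet requirements; an empty array means OK to merge.
function highStakesGate(pr: PullRequest, requiredDomain: string): string[] {
  const failures: string[] = [];
  if (!pr.securityScanPassed) failures.push("security scan (SAST/DAST) must pass");
  if (!pr.reviewerDomains.includes(requiredDomain))
    failures.push(`review by a ${requiredDomain} domain expert required`);
  if (pr.branchCoverageOnCriticalPaths < 1.0)
    failures.push("100% branch coverage required on security-critical paths");
  if (!pr.auditRecorded) failures.push("audit trail entry missing");
  return failures;
}
```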
Why not just avoid AI? Because human code also has bugs. Veracode's own data puts the headline number in context: 45% of AI-generated code fails security tests (Veracode 2025), but that rate is measured against a human baseline that is also non-zero. The solution is verification, not avoidance.
Medium-Stakes Code: Standard Oversight
What qualifies: Business logic, data transformations, API integrations, non-trivial CRUD operations.
Oversight requirements:
- Thorough code review: Read and understand, don't just approve
- Domain awareness: Reviewer should understand the business context
- Integration tests: Verify behavior at system boundaries
- Edge case coverage: AI often misses domain-specific edge cases
AI advantage here: AI excels at generating correct-looking code for well-defined patterns. The risk is subtle business logic errors, which domain-aware review catches.
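For example, a hypothetical discount calculation shows the kind of domain edge cases (zero totals, full discounts, half-cent rounding) a domain-aware reviewer should insist on seeing tested:

```ts
// Hypothetical business-logic function under medium-stakes review.
function applyDiscount(totalCents: number, percent: number): number {
  if (percent < 0 || percent > 100) throw new RangeError("percent out of range");
  return Math.round(totalCents * (1 - percent / 100)); // half-up to whole cents
}

// Domain-specific edge cases a generated first draft often skips:
console.assert(applyDiscount(0, 50) === 0, "zero total stays zero");
console.assert(applyDiscount(999, 100) === 0, "full discount clears the balance");
console.assert(applyDiscount(101, 50) === 51, "half cents round half-up");
```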
Low-Stakes Code: Basic Oversight
What qualifies: Boilerplate, formatting, simple CRUD, test scaffolding, documentation.
Oversight requirements:
- Standard code review: Quick sanity check
- Unit tests: Verify basic correctness
- Linting: Catch obvious issues automatically
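All three checks can run unattended. A sketch of a basic pre-merge gate, with illustrative commands (substitute whatever the repo already runs):

```ts
// Hypothetical basic gate for low-stakes changes: lint plus unit tests.
import { execSync } from "node:child_process";

for (const cmd of ["npx eslint .", "npm test"]) {
  try {
    execSync(cmd, { stdio: "inherit" }); // throws on a non-zero exit code
  } catch {
    console.error(`Basic oversight failed at: ${cmd}`);
    process.exit(1);
  }
}
```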
AI advantage here: This is where AI delivers the most value with least risk. Boilerplate is tedious for humans but trivial for AI.
Common Pitfalls
| Pitfall | Problem | Better Approach |
|---|---|---|
| Blanket AI bans | Wastes expert capacity on low-stakes work | Calibrate oversight to stakes |
| Blanket AI trust | 45% security flaw rate in AI code | Always verify, scale depth to stakes |
| Expertise mismatch | Reviewer lacks domain knowledge | Match reviewer expertise to code domain |
| No tooling for high-stakes | Manual review misses systematic issues | Add SAST/DAST for security-critical code |
The Quick Decision
Before AI generates code, ask:
What's at stake if this code has bugs?
- Security breach, compliance violation, data loss → HIGH stakes
- Business logic errors, degraded UX → MEDIUM stakes
- Cosmetic issues, slower development → LOW stakes
Do I have the right oversight for this stake level?
- HIGH: Expert reviewer + security scanning available?
- MEDIUM: Domain-aware reviewer available?
- LOW: Basic review process in place?
If oversight is unavailable, reduce scope
- Can't get expert review? Use AI for less critical parts only
- No security scanning? Add it before touching auth code
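This checklist can also run as a script before generation or before merge. A sketch with illustrative path patterns; a real repository would tune them to its own layout:

```ts
// Hypothetical stakes classifier for the quick decision.
type Stakes = "HIGH" | "MEDIUM" | "LOW";

const HIGH_STAKES = [/auth/, /crypto/, /billing/, /compliance/, /pii/];
const MEDIUM_STAKES = [/api/, /service/, /model/];

function classifyStakes(changedPaths: string[]): Stakes {
  if (changedPaths.some(p => HIGH_STAKES.some(rx => rx.test(p)))) return "HIGH";
  if (changedPaths.some(p => MEDIUM_STAKES.some(rx => rx.test(p)))) return "MEDIUM";
  return "LOW";
}

// "If oversight is unavailable, reduce scope": fail closed by blocking or
// narrowing the change, never by silently downgrading the review requirement.
function canProceed(
  stakes: Stakes,
  available: { expertReview: boolean; securityScan: boolean },
): boolean {
  if (stakes === "HIGH") return available.expertReview && available.securityScan;
  if (stakes === "MEDIUM") return available.expertReview; // domain-aware is enough
  return true; // LOW: basic review process assumed in place
}

// Example: touching auth code without a security scanner should not proceed.
console.log(canProceed(classifyStakes(["src/auth/login.ts"]),
                       { expertReview: true, securityScan: false })); // false
```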
Key Takeaways
- 45% of AI code fails security tests, but human code also has vulnerabilities—both need verification (Veracode 2025)
- AI code has 1.57x more security findings, up to 2.74x for XSS (CodeRabbit 2025)
- The bottleneck is review capacity, not AI capability
- Match oversight level to code stakes: maximum for auth/crypto, standard for business logic, basic for boilerplate
- Match reviewer expertise to code domain—an auth expert can use AI for auth code
- Always run security scanning on high-stakes code, regardless of who wrote it
In This Platform
This platform applies calibrated oversight: build-time validation (guardrails), schema enforcement (structured outputs), and source verification (human review of claims). The survey assessment itself helps teams identify their current oversight calibration.
- build.js
- schema/
  - …
- sources/
  - …
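To make that concrete, a minimal sketch of what build-time schema enforcement can look like, assuming hypothetical file names under schema/ and sources/ (this is an illustration, not the platform's actual build.js):

```ts
// Illustration only: a build-time schema gate using Ajv.
// Both file paths below are hypothetical.
import Ajv from "ajv";
import { readFileSync } from "node:fs";

const ajv = new Ajv({ allErrors: true });
const schema = JSON.parse(readFileSync("schema/claims.schema.json", "utf8"));
const data = JSON.parse(readFileSync("sources/claims.json", "utf8"));

const validate = ajv.compile(schema);
if (!validate(data)) {
  // Fail the build so malformed or unverified claims never ship.
  console.error(validate.errors);
  process.exit(1);
}
```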
Sources
| Finding | Source |
|---|---|
| AI code has 1.57x more security findings (2.74x for XSS) | CodeRabbit 2025 |
| 45% of AI code fails security tests; human code also has vulnerabilities | Veracode 2025 |
Both AI and human code need verification. The answer is calibrated oversight, not avoidance.