Calibrated AI Use

Patterns · Beginner · 10 min
Sources verified Dec 25, 2025

A framework for matching AI oversight levels to code stakes—using AI everywhere while calibrating verification to risk.

The question isn't whether to use AI for critical code—it's how much oversight to apply. Research shows AI-generated code has ~1.5-2.7x more security issues than human code. But human code has security issues too—the answer is calibrated oversight (security scanning + expert review), not blanket avoidance.

The Core Principle

The bottleneck is human review capacity, not AI capability.

An authentication expert can use AI for auth code—with appropriate oversight. A security engineer can use AI for security tooling—with proper review. Blanket prohibitions waste expertise and slow teams down.

Instead, match:

  • Reviewer expertise → Code domain
  • Review depth → Code stakes
  • Tooling → Risk level

The Oversight Calibration Framework

| Code Stakes | Examples | Oversight Level | What Changes |
|---|---|---|---|
| HIGH | Auth, crypto, compliance, financial | Maximum | Security scanning (SAST/DAST), domain expert review, mandatory test coverage, audit trail |
| MEDIUM | Business logic, data handling, APIs | Standard | Thorough code review, domain-aware reviewer, integration tests |
| LOW | Boilerplate, formatting, CRUD | Basic | Standard code review, unit tests, linting |
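
This mapping can live directly in code as a review-policy lookup. A minimal sketch in TypeScript; the `OversightPolicy` shape and its field names are illustrative assumptions, not a prescribed schema:

```typescript
// Hypothetical encoding of the calibration table: stakes -> required oversight.
type Stakes = "HIGH" | "MEDIUM" | "LOW";

interface OversightPolicy {
  review: string;           // who must review
  requiredChecks: string[]; // tooling gates that must pass
  tests: string;            // minimum test expectation
}

const OVERSIGHT: Record<Stakes, OversightPolicy> = {
  HIGH: {
    review: "domain expert",
    requiredChecks: ["SAST", "DAST", "audit-trail"],
    tests: "100% branch coverage on security-critical paths",
  },
  MEDIUM: {
    review: "domain-aware reviewer",
    requiredChecks: ["code-review"],
    tests: "integration tests at system boundaries",
  },
  LOW: {
    review: "standard reviewer",
    requiredChecks: ["lint"],
    tests: "unit tests",
  },
};
```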

High-Stakes Code: Maximum Oversight

What qualifies: Authentication, authorization, cryptography, financial calculations, PHI/PII handling, compliance-critical code.

Oversight requirements:

  1. Security scanning: Run SAST/DAST on ALL code (AI or human)
  2. Domain expert review: Reviewer must have expertise in the specific domain
  3. Mandatory test coverage: 100% branch coverage for security-critical paths
  4. Audit trail: Document who reviewed what and when
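
As an illustration, these four requirements could be enforced as a pre-merge gate. A minimal sketch, assuming hypothetical inputs (`sastFindings`, `branchCoverage`, a recorded review) fed from whatever scanner and coverage tooling a team actually runs:

```typescript
// Hypothetical pre-merge gate for HIGH-stakes code (inputs are illustrative).
interface ReviewRecord {
  reviewer: string;
  domain: string;     // e.g. "auth", "crypto" -- the audit trail records who and what
  reviewedAt: Date;
}

function highStakesGate(opts: {
  sastFindings: number;   // open findings from your SAST scanner
  branchCoverage: number; // 0..1, from your coverage report
  review: ReviewRecord | null;
  codeDomain: string;
}): { passed: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (opts.sastFindings > 0) reasons.push("unresolved SAST findings");
  if (opts.branchCoverage < 1.0) reasons.push("branch coverage below 100%");
  if (!opts.review) reasons.push("no recorded review (audit trail)");
  else if (opts.review.domain !== opts.codeDomain)
    reasons.push("reviewer expertise does not match code domain");
  return { passed: reasons.length === 0, reasons };
}
```

The point of returning reasons rather than a bare boolean is the audit trail: the gate's output documents why a change was blocked or allowed.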

Why not just avoid AI? Because human code also has bugs. Veracode's data shows human code carries security issues too: the 45% flaw rate for AI code (Veracode 2025) is measured against a human baseline that is also non-zero. The solution is verification, not avoidance.

Medium-Stakes Code: Standard Oversight

What qualifies: Business logic, data transformations, API integrations, non-trivial CRUD operations.

Oversight requirements:

  1. Thorough code review: Read and understand, don't just approve
  2. Domain awareness: Reviewer should understand the business context
  3. Integration tests: Verify behavior at system boundaries
  4. Edge case coverage: AI often misses domain-specific edge cases

AI advantage here: AI excels at generating correct-looking code for well-defined patterns. The risk is subtle business logic errors, which domain-aware review catches.
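
For example, a domain-aware reviewer can pin down those missed edge cases as explicit tests. A sketch around a hypothetical `applyDiscount` business rule (names and values invented for illustration):

```typescript
// Hypothetical business rule: AI-generated code often handles the happy path
// but misses domain edge cases like the ones asserted below.
function applyDiscount(subtotal: number, percent: number): number {
  if (percent < 0 || percent > 100) throw new RangeError("invalid percent");
  if (subtotal < 0) throw new RangeError("negative subtotal");
  // Round to cents to avoid floating-point drift in financial output.
  return Math.round(subtotal * (1 - percent / 100) * 100) / 100;
}

// Edge cases a domain-aware reviewer would insist on covering.
console.assert(applyDiscount(100, 0) === 100, "0% discount is a no-op");
console.assert(applyDiscount(100, 100) === 0, "100% discount reaches zero");
console.assert(applyDiscount(19.99, 10) === 17.99, "rounds to cents");
```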

Low-Stakes Code: Basic Oversight

What qualifies: Boilerplate, formatting, simple CRUD, test scaffolding, documentation.

Oversight requirements:

  1. Standard code review: Quick sanity check
  2. Unit tests: Verify basic correctness
  3. Linting: Catch obvious issues automatically

AI advantage here: This is where AI delivers the most value with least risk. Boilerplate is tedious for humans but trivial for AI.
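
Even here, a couple of quick unit tests keep the sanity check honest. A minimal sketch for a hypothetical AI-generated `slugify` helper:

```typescript
// Hypothetical AI-generated boilerplate: a URL slug formatter.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-") // collapse non-alphanumeric runs into hyphens
    .replace(/^-+|-+$/g, "");    // strip leading/trailing hyphens
}

// Basic correctness checks: enough oversight for low-stakes code.
console.assert(slugify("Calibrated AI Use") === "calibrated-ai-use");
console.assert(slugify("  Hello, World!  ") === "hello-world");
```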

Common Pitfalls

| Pitfall | Problem | Better Approach |
|---|---|---|
| Blanket AI bans | Wastes expert capacity on low-stakes work | Calibrate oversight to stakes |
| Blanket AI trust | 45% security flaw rate in AI code | Always verify, scale depth to stakes |
| Expertise mismatch | Reviewer lacks domain knowledge | Match reviewer expertise to code domain |
| No tooling for high-stakes | Manual review misses systematic issues | Add SAST/DAST for security-critical code |

The Quick Decision

Before AI generates code, ask:

  1. What's at stake if this code has bugs?

    • Security breach, compliance violation, data loss → HIGH stakes
    • Business logic errors, degraded UX → MEDIUM stakes
    • Cosmetic issues, slower development → LOW stakes
  2. Do I have the right oversight for this stake level?

    • HIGH: Expert reviewer + security scanning available?
    • MEDIUM: Domain-aware reviewer available?
    • LOW: Basic review process in place?
  3. If oversight is unavailable, reduce scope

    • Can't get expert review? Use AI for less critical parts only
    • No security scanning? Add it before touching auth code
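
This checklist can be run mechanically before generation. A minimal sketch of the decision as a function; the `TeamCapacity` fields mirror the questions above and are invented for illustration:

```typescript
// Hypothetical pre-generation check: can we support this stake level right now?
type Stakes = "HIGH" | "MEDIUM" | "LOW";

interface TeamCapacity {
  expertReviewer: boolean;      // domain expert available?
  securityScanning: boolean;    // SAST/DAST wired into CI?
  domainAwareReviewer: boolean; // reviewer who knows the business context?
  basicReviewProcess: boolean;  // standard review + tests + linting in place?
}

function canProceed(stakes: Stakes, team: TeamCapacity): boolean {
  switch (stakes) {
    case "HIGH":
      return team.expertReviewer && team.securityScanning;
    case "MEDIUM":
      return team.domainAwareReviewer;
    case "LOW":
      return team.basicReviewProcess;
  }
}

// If oversight is unavailable, reduce scope rather than skipping verification.
const team: TeamCapacity = {
  expertReviewer: false,
  securityScanning: true,
  domainAwareReviewer: true,
  basicReviewProcess: true,
};
console.log(canProceed("HIGH", team));   // false -> use AI for less critical parts
console.log(canProceed("MEDIUM", team)); // true
```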

Key Takeaways

  • 45% of AI code fails security tests, but human code also has vulnerabilities—both need verification (Veracode 2025)
  • AI code has 1.57x more security findings, up to 2.74x for XSS (CodeRabbit 2025)
  • The bottleneck is review capacity, not AI capability
  • Match oversight level to code stakes: maximum for auth/crypto, standard for business logic, basic for boilerplate
  • Match reviewer expertise to code domain—an auth expert can use AI for auth code
  • Always run security scanning on high-stakes code, regardless of who wrote it

Visual Overview

[Diagram: Match Oversight to Stakes]

In This Platform

This platform applies calibrated oversight: build-time validation (guardrails), schema enforcement (structured outputs), and source verification (human review of claims). The survey assessment itself helps teams identify their current oversight calibration.

Relevant Files:
  • build.js
  • schema/
  • sources/
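
As one illustration of what build-time validation can look like, a schema check can fail the build when a claim lacks a verified source. A minimal sketch using the Ajv library; the schema shape and the `sources/claims.json` path are assumptions, not this platform's actual implementation:

```typescript
// Minimal sketch of build-time schema enforcement (shape and path are assumed).
import Ajv from "ajv";
import { readFileSync } from "node:fs";

const ajv = new Ajv();

// Hypothetical schema: every claim must carry a verifiable source.
const claimSchema = {
  type: "object",
  properties: {
    claim: { type: "string" },
    source: { type: "string" },
    verifiedAt: { type: "string" },
  },
  required: ["claim", "source", "verifiedAt"],
};

const validate = ajv.compile(claimSchema);
const data = JSON.parse(readFileSync("sources/claims.json", "utf8"));

if (!validate(data)) {
  // Fail the build: unverified claims never ship.
  console.error(validate.errors);
  process.exit(1);
}
```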


Sources


| Finding | Source |
|---|---|
| AI code has 1.57x more security findings (2.74x for XSS) | CodeRabbit 2025 |
| 45% of AI code fails security tests; human code also has vulnerabilities | Veracode 2025 |

Both AI and human code need verification. The answer is calibrated oversight, not avoidance.
