Calibrated AI Use
A framework for matching AI oversight levels to code stakes—using AI everywhere while calibrating verification to risk.
The question isn't whether to use AI for critical code—it's how much oversight to apply. Research shows AI-generated code has ~1.5-2.7x more security issues than human code. But human code has security issues too—the answer is calibrated oversight (security scanning + expert review), not blanket avoidance.
The Core Principle
The bottleneck is human review capacity, not AI capability.
An authentication expert can use AI for auth code—with appropriate oversight. A security engineer can use AI for security tooling—with proper review. Blanket prohibitions waste expertise and slow teams down.
Instead, match:
- Reviewer expertise → Code domain
- Review depth → Code stakes
- Tooling → Risk level
The Oversight Calibration Framework
| Code Stakes | Examples | Oversight Level | What Changes |
|---|---|---|---|
| HIGH | Auth, crypto, compliance, financial | Maximum | Security scanning (SAST/DAST), domain expert review, mandatory test coverage, audit trail |
| MEDIUM | Business logic, data handling, APIs | Standard | Thorough code review, domain-aware reviewer, integration tests |
| LOW | Boilerplate, formatting, CRUD | Basic | Standard code review, unit tests, linting |
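The framework is simple enough to encode as data that CI can enforce. A minimal TypeScript sketch; all names are hypothetical, and the MEDIUM/LOW coverage thresholds are illustrative (the framework itself only mandates 100% for HIGH-stakes paths):

```ts
// Hypothetical encoding of the oversight framework as data CI can enforce.
type Stakes = "HIGH" | "MEDIUM" | "LOW";

interface OversightPolicy {
  securityScan: boolean;                                       // SAST/DAST before merge
  reviewerExpertise: "domain-expert" | "domain-aware" | "any";
  requiredTests: string[];                                     // suites that must pass
  minBranchCoverage: number;                                   // 0..1
  auditTrail: boolean;                                         // record who reviewed what, when
}

const OVERSIGHT_POLICY: Record<Stakes, OversightPolicy> = {
  HIGH:   { securityScan: true,  reviewerExpertise: "domain-expert",
            requiredTests: ["unit", "integration", "security"],
            minBranchCoverage: 1.0, auditTrail: true },
  MEDIUM: { securityScan: false, reviewerExpertise: "domain-aware",
            requiredTests: ["unit", "integration"],
            minBranchCoverage: 0.8, auditTrail: false },  // illustrative threshold
  LOW:    { securityScan: false, reviewerExpertise: "any",
            requiredTests: ["unit"],
            minBranchCoverage: 0.5, auditTrail: false },  // illustrative threshold
};
```

Keeping the policy as data rather than prose means a change to oversight levels is itself a reviewable diff.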
High-Stakes Code: Maximum Oversight
What qualifies: Authentication, authorization, cryptography, financial calculations, PHI/PII handling, compliance-critical code.
Oversight requirements:
- Security scanning: Run SAST/DAST on ALL code (AI or human)
- Domain expert review: Reviewer must have expertise in the specific domain
- Mandatory test coverage: 100% branch coverage for security-critical paths
- Audit trail: Document who reviewed what and when
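As a sketch of what these four requirements look like when automated, here is a hypothetical merge gate. The PullRequest shape and its field names are assumptions for illustration, not a real CI or forge API:

```ts
// Hypothetical merge gate for HIGH-stakes changes.
interface PullRequest {
  securityScanPassed: boolean;            // SAST/DAST result
  reviewerDomains: string[];              // e.g. ["auth", "payments"]
  branchCoverageOnCriticalPaths: number;  // 0..1
  auditRecorded: boolean;                 // who reviewed what, and when
}

// Returns the unmet requirements; an empty array means OK to merge.
function highStakesGate(pr: PullRequest, requiredDomain: string): string[] {
  const failures: string[] = [];
  if (!pr.securityScanPassed) failures.push("security scan (SAST/DAST) must pass");
  if (!pr.reviewerDomains.includes(requiredDomain))
    failures.push(`review by a ${requiredDomain} domain expert required`);
  if (pr.branchCoverageOnCriticalPaths < 1.0)
    failures.push("100% branch coverage required on security-critical paths");
  if (!pr.auditRecorded) failures.push("audit trail entry missing");
  return failures;
}
```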
Why not just avoid AI? Because human code also has bugs. Veracode's own data puts the headline number in context: 45% of AI-generated code fails security tests (Veracode 2025), but that rate is measured against a human baseline that is also non-zero. The solution is verification, not avoidance.
Medium-Stakes Code: Standard Oversight
What qualifies: Business logic, data transformations, API integrations, non-trivial CRUD operations.
Oversight requirements:
- Thorough code review: Read and understand, don't just approve
- Domain awareness: Reviewer should understand the business context
- Integration tests: Verify behavior at system boundaries
- Edge case coverage: AI often misses domain-specific edge cases
AI advantage here: AI excels at generating correct-looking code for well-defined patterns. The risk is subtle business logic errors, which domain-aware review catches.
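For example, a hypothetical discount calculation shows the kind of domain edge cases (zero totals, full discounts, half-cent rounding) a domain-aware reviewer should insist on seeing tested:

```ts
// Hypothetical business-logic function under medium-stakes review.
function applyDiscount(totalCents: number, percent: number): number {
  if (percent < 0 || percent > 100) throw new RangeError("percent out of range");
  return Math.round(totalCents * (1 - percent / 100)); // half-up to whole cents
}

// Domain-specific edge cases a generated first draft often skips:
console.assert(applyDiscount(0, 50) === 0, "zero total stays zero");
console.assert(applyDiscount(999, 100) === 0, "full discount clears the balance");
console.assert(applyDiscount(101, 50) === 51, "half cents round half-up");
```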
Low-Stakes Code: Basic Oversight
What qualifies: Boilerplate, formatting, simple CRUD, test scaffolding, documentation.
Oversight requirements:
- Standard code review: Quick sanity check
- Unit tests: Verify basic correctness
- Linting: Catch obvious issues automatically
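All three checks can run unattended. A sketch of a basic pre-merge gate, with illustrative commands (substitute whatever the repo already runs):

```ts
// Hypothetical basic gate for low-stakes changes: lint plus unit tests.
import { execSync } from "node:child_process";

for (const cmd of ["npx eslint .", "npm test"]) {
  try {
    execSync(cmd, { stdio: "inherit" }); // throws on a non-zero exit code
  } catch {
    console.error(`Basic oversight failed at: ${cmd}`);
    process.exit(1);
  }
}
```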
AI advantage here: This is where AI delivers the most value with least risk. Boilerplate is tedious for humans but trivial for AI.
Common Pitfalls
| Pitfall | Problem | Better Approach |
|---|---|---|
| Blanket AI bans | Wastes expert capacity on low-stakes work | Calibrate oversight to stakes |
| Blanket AI trust | 45% security flaw rate in AI code | Always verify, scale depth to stakes |
| Expertise mismatch | Reviewer lacks domain knowledge | Match reviewer expertise to code domain |
| No tooling for high-stakes | Manual review misses systematic issues | Add SAST/DAST for security-critical code |
The Quick Decision
Before AI generates code, ask:
What's at stake if this code has bugs?
- Security breach, compliance violation, data loss → HIGH stakes
- Business logic errors, degraded UX → MEDIUM stakes
- Cosmetic issues, slower development → LOW stakes
Do I have the right oversight for this stake level?
- HIGH: Expert reviewer + security scanning available?
- MEDIUM: Domain-aware reviewer available?
- LOW: Basic review process in place?
If oversight is unavailable, reduce scope
- Can't get expert review? Use AI for less critical parts only
- No security scanning? Add it before touching auth code
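This checklist can also run as a script before generation or before merge. A sketch with illustrative path patterns; a real repository would tune them to its own layout:

```ts
// Hypothetical stakes classifier for the quick decision.
type Stakes = "HIGH" | "MEDIUM" | "LOW";

const HIGH_STAKES = [/auth/, /crypto/, /billing/, /compliance/, /pii/];
const MEDIUM_STAKES = [/api/, /service/, /model/];

function classifyStakes(changedPaths: string[]): Stakes {
  if (changedPaths.some(p => HIGH_STAKES.some(rx => rx.test(p)))) return "HIGH";
  if (changedPaths.some(p => MEDIUM_STAKES.some(rx => rx.test(p)))) return "MEDIUM";
  return "LOW";
}

// "If oversight is unavailable, reduce scope": fail closed by blocking or
// narrowing the change, never by silently downgrading the review requirement.
function canProceed(
  stakes: Stakes,
  available: { expertReview: boolean; securityScan: boolean },
): boolean {
  if (stakes === "HIGH") return available.expertReview && available.securityScan;
  if (stakes === "MEDIUM") return available.expertReview; // domain-aware is enough
  return true; // LOW: basic review process assumed in place
}

// Example: touching auth code without a security scanner should not proceed.
console.log(canProceed(classifyStakes(["src/auth/login.ts"]),
                       { expertReview: true, securityScan: false })); // false
```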
Key Takeaways
- 45% of AI code fails security tests, but human code also has vulnerabilities—both need verification (Veracode 2025)
- AI code has 1.57x more security findings, up to 2.74x for XSS (CodeRabbit 2025)
- The bottleneck is review capacity, not AI capability
- Match oversight level to code stakes: maximum for auth/crypto, standard for business logic, basic for boilerplate
- Match reviewer expertise to code domain—an auth expert can use AI for auth code
- Always run security scanning on high-stakes code, regardless of who wrote it
In This Platform
This platform applies calibrated oversight: build-time validation (guardrails), schema enforcement (structured outputs), and source verification (human review of claims). The survey assessment itself helps teams identify their current oversight calibration.
- build.js
- schema/
  - …
- sources/
  - …
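To make that concrete, a minimal sketch of what build-time schema enforcement can look like, assuming hypothetical file names under schema/ and sources/ (this is an illustration, not the platform's actual build.js):

```ts
// Illustration only: a build-time schema gate using Ajv.
// Both file paths below are hypothetical.
import Ajv from "ajv";
import { readFileSync } from "node:fs";

const ajv = new Ajv({ allErrors: true });
const schema = JSON.parse(readFileSync("schema/claims.schema.json", "utf8"));
const data = JSON.parse(readFileSync("sources/claims.json", "utf8"));

const validate = ajv.compile(schema);
if (!validate(data)) {
  // Fail the build so malformed or unverified claims never ship.
  console.error(validate.errors);
  process.exit(1);
}
```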
Sources
| Finding | Source |
|---|---|
| AI code has 1.57x more security findings (2.74x for XSS) | CodeRabbit 2025 |
| 45% of AI code fails security tests; human code also has vulnerabilities | Veracode 2025 |
Both AI and human code need verification. The answer is calibrated oversight, not avoidance.