Veracode 2025: AI-generated code introduced security flaws in 45% of controlled tests (unreviewed raw output; rates vary by language, with Java at 72% and Python at 38%). Harness 2025: 67% of developers spend more time debugging AI-generated code than they did before. METR 2025: experienced developers perceived a 20% speedup from AI while actually being 19% slower, a dangerous perception gap. Review fatigue is now a critical concern.
October 2025 Update: GenAI Code Security Report
Veracode
Primary source for AI code security statistics: 45% overall failure rate, 72% for Java specifically. The 'bigger models ≠ more secure code' finding is critical for model_routing - security scanning is needed regardless of model. Java's 72% rate makes it the riskiest language for AI-generated code.
Key Findings:
AI-generated code introduced risky security flaws in 45% of tests
Java was the riskiest language with 72% security failure rate
XSS (CWE-80) defense failed in 86% of relevant code samples
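To make the CWE-80 finding concrete, here is a minimal Python sketch of the pattern behind basic XSS and the one-line escaping fix a reviewer or scanner needs to confirm in AI-generated code. The function names and markup are illustrative only (the Veracode tests spanned several languages, with Java the weakest), not taken from the report.

```python
import html

# Pattern commonly seen in unreviewed AI output: untrusted input interpolated
# directly into an HTML response (CWE-80, basic XSS).
def render_greeting_unsafe(username: str) -> str:
    return f"<p>Hello, {username}!</p>"  # a <script> payload would execute in the browser

# Minimal fix: escape untrusted input before it reaches HTML.
def render_greeting_safe(username: str) -> str:
    return f"<p>Hello, {html.escape(username)}!</p>"

if __name__ == "__main__":
    payload = '<script>alert("xss")</script>'
    print(render_greeting_unsafe(payload))  # script tag survives intact
    print(render_greeting_safe(payload))    # rendered inert as &lt;script&gt;...
```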
State of Software Delivery Report 2025: The Role of AI in the SDLC
Harness
Critical counterweight to AI productivity hype. The 67% debugging overhead stat directly challenges simplistic 'AI makes you faster' narratives. The governance gaps (only 48% using approved tools, 60% lacking vulnerability assessment) highlight organizational maturity requirements. The 'blast radius' finding is particularly important for the trust_calibration dimension.
Key Findings:
67% of developers spend more time debugging AI-generated code
68% spend more time resolving AI-related security vulnerabilities
92% report AI increases 'blast radius' from bad deployments
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
METR
This is the most rigorous 2025 study on AI coding productivity. The RCT methodology (16 experienced developers, 246 tasks, $150/hr compensation) makes this highly credible. The 39-44 percentage point gap between perceived and actual productivity is the key insight for our trust_calibration dimension. This directly supports recommendations about not over-trusting AI suggestions and maintaining verification practices.
Key Findings:
Experienced developers were 19% slower with AI
Developers perceived 20% speedup (39-44 percentage point gap)
Self-reported productivity may not reflect reality
What do you typically do before accepting a Copilot suggestion?
[0]Accept immediately if it looks roughly right
[1]Quick glance - a few seconds review
[2]Careful read-through of the code
[3]Read-through plus mental/actual execution trace
[4]Full review including running tests
[5]Security-aware review (OWASP Top 10 check)
The 'Trust, But Verify' Pattern For AI-Assisted Engineering
This article provides the conceptual framework for our trust_calibration dimension. The three principles (Blind Trust is Vulnerability, Copilot Not Autopilot, Human Accountability Remains) directly inform our survey questions. The emphasis on verification over speed aligns with METR findings. Practical guidance includes starting conservatively with AI on low-stakes tasks.
Key Findings:
Blind trust in AI-generated code is a vulnerability
AI tools function as 'Copilot, Not Autopilot'
Human verification is the new development bottleneck
Supporting source: Veracode, October 2025 Update: GenAI Code Security Report (key findings above).
Approximately what percentage of Copilot suggestions do you accept?
[3]0-20%
[4]21-35%
[3]36-50%
[2]51-70%
[0]71-100%
Note: 2025 benchmarks: an acceptance rate of 30-33% is healthy. Healthcare/regulated teams average 50-60%; startups average around 75%, which is too high. Anything above 70% is a red flag.
Research: Quantifying GitHub Copilot's impact in the enterprise with Accenture
GitHub/Accenture
This is the primary source for the 30% acceptance rate benchmark and the 88% code retention statistic. The 95% enjoyment and 90% fulfillment stats are powerful for adoption justification. The 84% increase in successful builds directly supports the claim that AI doesn't sacrifice quality for speed. Published May 2024, so represents mature Copilot usage patterns.
Key Findings:
95% of developers said they enjoyed coding more with GitHub Copilot
90% of developers felt more fulfilled with their jobs when using GitHub Copilot
Developers accepted around 30% of GitHub Copilot's suggestions
Supporting source: Qodo, State of AI Code Quality 2025 (full entry below).
In the past month, how often did you discover an error in AI-generated code AFTER accepting it?
[3]Never
[4]1-2 times
[2]3-5 times
[1]6-10 times
[0]More than 10 times
Note: 1-2 times scores highest—indicates you use AI enough to encounter issues but catch most before acceptance
Supporting sources: Veracode, October 2025 Update: GenAI Code Security Report (key findings above); Qodo, State of AI Code Quality 2025 (full entry below).
Are you aware that AI-generated code introduced security flaws in 45% of coding tests (Veracode 2025)?
[1]No, this is surprising to me
[2]I knew there were some risks but not the extent
[4]Yes, and I've adjusted my review practices
[5]Yes, we have specific security scanning for AI code
Supporting source: Veracode, October 2025 Update: GenAI Code Security Report (key findings above).
CodeRabbit 2025 found AI code has 2.74x more XSS, 8x more I/O performance issues. Do you check for these patterns?
[1]No - I wasn't aware of these specific risk multipliers
[2]I know about them but don't specifically check
[3]I manually check for XSS and performance issues in AI code
[4]We use automated tools that catch these patterns
[5]Automated + manual review with focus on AI code hot spots
Note: CodeRabbit analyzed 470 PRs and found 1.7x more issues overall in AI code, with security (2.74x XSS) and performance (8x I/O issues) as the top concerns
State of AI vs Human Code Generation Report
CodeRabbit
This is the most rigorous empirical comparison of AI vs human code quality to date. The 1.7x issue rate and specific vulnerability multipliers (2.74x XSS, 1.88x password handling) are critical for trust_calibration recommendations. Key insight: AI makes the same kinds of mistakes humans do, just more often at larger scale. The 8x I/O performance issue rate shows AI favors simple patterns over efficiency.
Key Findings:
AI-generated PRs contain 1.7x more issues overall (10.83 vs 6.45 issues per PR)
AI PRs show 1.4-1.7x more critical and major issues
Logic and correctness issues 75% more common in AI PRs
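The 8x I/O figure reflects AI's preference for the simplest pattern that works, typically re-reading or re-querying data inside a loop. Below is a hedged Python sketch of that shape; the key=value config format and function names are made up for illustration and are not taken from the CodeRabbit report.

```python
from pathlib import Path

# Shape often produced by AI assistants: the whole file is re-read and
# re-parsed for every key that is looked up.
def lookup_slow(keys: list[str], config_path: Path) -> dict[str, str]:
    results = {}
    for key in keys:
        for line in config_path.read_text().splitlines():  # full file read per key
            name, _, value = line.partition("=")
            if name.strip() == key:
                results[key] = value.strip()
    return results

# Same behavior with a single read: parse once, answer every lookup from memory.
def lookup_fast(keys: list[str], config_path: Path) -> dict[str, str]:
    parsed = {}
    for line in config_path.read_text().splitlines():
        name, _, value = line.partition("=")
        parsed[name.strip()] = value.strip()
    return {key: parsed[key] for key in keys if key in parsed}
```

The same review heuristic applies to database access: an AI-generated loop that issues one query per item is the N+1 equivalent of lookup_slow.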
GitClear found 8x increase in code duplication from AI tools. Do you monitor for this?
[1]No - I didn't know AI increases duplication
[2]I'm aware but don't actively monitor
[3]I manually review for DRY violations in AI code
[4]We use code quality tools that flag duplication
[5]We actively refactor AI-introduced duplication
Note: GitClear 2025: AI makes it easier to add new code than to reuse existing code (a limited-context effect). Refactoring dropped from 25% to under 10% of changed lines.
AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones
GitClear
GitClear's research on 211M lines of code shows AI is changing how code is written - more duplication, less refactoring. NOTE: Title says '4x Growth' (2025 projection); key finding is '8x increase' (actual 2024 data). We cite the 8x figure as it's measured data. The decline in refactoring suggests AI makes it easier to add new code than reuse existing code (limited context). Critical for understanding long-term maintainability implications.
Key Findings:
8x increase in duplicated code blocks during 2024
Refactoring dropped from 25% of changed lines (2021) to <10% (2024)
Copy/pasted (cloned) code rose from 8.3% to 12.3% (2021-2024)
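One lightweight way to monitor the duplication trend GitClear describes is a clone check in the pipeline. The Python sketch below is illustrative only and is not GitClear's methodology: it hashes normalized sliding windows of lines and reports any block that appears more than once. The 6-line window and the src/ directory are assumptions; a dedicated clone detector or your existing code-quality tooling will be more robust.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

WINDOW = 6  # assumed minimum clone size: flag any 6-line block that appears twice

def normalize(line: str) -> str:
    # Collapse whitespace so formatting differences don't hide clones.
    return " ".join(line.split())

def find_duplicate_blocks(paths: list[Path], window: int = WINDOW) -> dict[str, list[tuple[str, int]]]:
    seen: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for path in paths:
        lines = [normalize(line) for line in path.read_text().splitlines()]
        for start in range(max(0, len(lines) - window + 1)):
            block = "\n".join(lines[start:start + window])
            if block.strip():  # skip windows that are entirely blank
                digest = hashlib.sha1(block.encode()).hexdigest()
                seen[digest].append((str(path), start + 1))
    return {digest: hits for digest, hits in seen.items() if len(hits) > 1}

if __name__ == "__main__":
    clones = find_duplicate_blocks(sorted(Path("src").rglob("*.py")))
    for hits in clones.values():
        print("duplicated block at:", hits)
```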
With agent mode generating larger diffs, how do you manage review fatigue?
[0]I don't use agent mode / N/A
[1]I try to review everything but often skim large diffs
[2]I focus on critical paths and trust the rest
[3]I break large changes into smaller reviewable chunks
[4]I use AI code review tools (CodeRabbit, etc.) + human spot-check
[5]Layered review: AI review + focused human review + automated tests
Note: This is a new critical question for 2025. Layered AI-plus-human review is an emerging best practice; see the sketch below.
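For option [3] and the layered approaches, one concrete enforcement point is a CI gate on diff size. The Python sketch below is an assumption-laden illustration, not a recommendation from any cited report: the 400-line threshold and the origin/main base branch are placeholders. It counts changed lines with git diff --numstat and fails when a change is too large to review attentively, prompting the author to split it.

```python
import subprocess
import sys

MAX_CHANGED_LINES = 400  # assumed team threshold; above this, ask for the change to be split

def changed_lines(base: str = "origin/main", head: str = "HEAD") -> int:
    # `git diff --numstat` prints "added<TAB>deleted<TAB>path" per file.
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...{head}"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added.isdigit() and deleted.isdigit():  # binary files show "-" counts
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    total = changed_lines()
    if total > MAX_CHANGED_LINES:
        print(f"Diff touches {total} lines (limit {MAX_CHANGED_LINES}); split it into reviewable chunks.")
        sys.exit(1)
    print(f"Diff size OK ({total} changed lines).")
```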
State of AI Code Quality 2025
Qodo
This is the most comprehensive 2025 survey on AI code quality (609 developers). The key insight is the 'Confidence Flywheel' - context-rich suggestions reduce hallucinations, which improves quality, which builds trust. The finding that 80% of PRs don't receive human review when AI tools are enabled is critical for our agentic_supervision dimension. NOTE: The previously cited 1.7x issue rate and 41% commit stats were not found in the current report.
Key Findings:
82% of developers use AI coding tools daily or weekly
65% of developers say at least a quarter of each commit is AI-generated
Supporting source: The 'Trust, But Verify' Pattern For AI-Assisted Engineering (see entry above).
How often do you 'Accept All' AI changes without reading the diff? (Karpathy's 'vibe coding' pattern)
[5]Never - I always read diffs before accepting
[4]Rarely - only for trivial/obvious changes
[2]Sometimes - when I'm confident in the AI
[0]Often - I trust the AI's judgment
[0]Always - I fully 'vibe code' (accept without reading)
Note: Karpathy coined 'vibe coding' (Feb 2025) for accepting AI changes without reading them. Fast Company reported a 'vibe coding hangover' (Sep 2025) with 'development hell' consequences. Answering 'Sometimes' or more often (a score of 2 or below) is a critical risk flag.
Vibe Coding Definition (Original Tweet)
Andrej Karpathy
This tweet coined the term 'vibe coding' on February 3, 2025, defining it as a programming style where you 'forget that the code even exists' and 'Accept All' without reading diffs. Critically, Karpathy explicitly limits this to 'throwaway weekend projects' - a nuance often missed in subsequent coverage. The full quote shows he acknowledges the code grows 'beyond my usual comprehension' and he works around bugs rather than fixing them. This is essential context for our trust_calibration dimension: even the person who coined the term warns it's not for production work.
Key Findings:
Coined term 'vibe coding' for accepting AI changes without reading
Fast Company (September 2025)
This article documents the real-world consequences of 'vibe coding' practices going wrong. The Tea App case study is particularly powerful: a dating app built with minimal oversight leaked 72,000 images including driver's licenses due to a misconfigured Firebase bucket - a basic security error that proper review would have caught. PayPal engineer Jack Hays calls AI-generated code 'development hell' to maintain. Stack Overflow data shows declining trust (46% distrust vs 33% trust) and positive sentiment falling from 70% to 60%. This is essential evidence for our trust_calibration and agentic_supervision dimensions.
Key Findings:
Documented 'vibe coding hangover' phenomenon
Teams in 'development hell' from unreviewed AI code
Tea App data breach: 72,000 sensitive images leaked from unsecured Firebase
If you sometimes accept AI changes without reading them: in what contexts?
[2]Prototypes and throwaway code
[1]Test code generation
[2]Documentation and comments
[0]Refactoring existing code
[0]New feature implementation
[0]Production code
Note: Vibe coding on prototypes and docs is lower risk; on production code or new features it is a critical risk, hence the score of 0 for those contexts.
Supporting sources: Andrej Karpathy's original 'vibe coding' tweet and Fast Company's 'vibe coding hangover' coverage (see entries above).