Veracode 2025: 45% of AI-generated code contains security vulnerabilities. Qodo 2025: 25% of developers estimate that 1 in 5 AI suggestions contain errors. Yet METR found that experienced developers felt 20% faster with AI while actually being 19% slower, a dangerous perception gap. Review fatigue is now a critical concern.
October 2025 Update: GenAI Code Security Report
Veracode
This is the primary source for our 45% security vulnerability claim. The October 2025 update confirms that AI code security issues persist even with newer models. The finding that 'bigger models ≠ more secure code' is important for our model_routing dimension - it suggests security scanning is needed regardless of which model is used. The 72% Java-specific rate mentioned in our citations may be from the full PDF report.
Key Findings:
AI-generated code introduced risky security flaws in 45% of tests
100+ LLMs tested across Java, JavaScript, Python, and C#
State of AI Code Quality 2025
Qodo
This is the most comprehensive 2025 survey on AI code quality (609 developers). The key insight is the 'Confidence Flywheel' - context-rich suggestions reduce hallucinations, which improves quality, which builds trust. The finding that 80% of PRs don't receive human review when AI tools are enabled is critical for our agentic_supervision dimension. NOTE: The previously cited 1.7x issue rate and 41% commit stats were not found in the current report.
Key Findings:
82% of developers use AI coding tools daily or weekly
65% of developers say at least a quarter of each commit is AI-generated
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
METR
This is the most rigorous 2025 study on AI coding productivity. The RCT methodology (16 experienced developers, 246 tasks, $150/hr compensation) makes this highly credible. The 39-44 percentage point gap between perceived and actual productivity is the key insight for our trust_calibration dimension. This directly supports recommendations about not over-trusting AI suggestions and maintaining verification practices.
Key Findings:
Experienced developers were 19% slower with AI
Developers perceived 20% speedup (39-44 percentage point gap)
Self-reported productivity may not reflect reality
What do you typically do before accepting a Copilot suggestion?
[0]Accept immediately if it looks roughly right
[1]Quick glance - a few seconds review
[2]Careful read-through of the code
[3]Read-through plus mental/actual execution trace
[4]Full review including running tests
[5]Security-aware review (OWASP Top 10 check)
The 'Trust, But Verify' Pattern For AI-Assisted Engineering
This article provides the conceptual framework for our trust_calibration dimension. The three principles (Blind Trust is Vulnerability, Copilot Not Autopilot, Human Accountability Remains) directly inform our survey questions. The emphasis on verification over speed aligns with METR findings. Practical guidance includes starting conservatively with AI on low-stakes tasks.
Key Findings:
Blind trust in AI-generated code is a vulnerability
AI tools function as 'Copilot, Not Autopilot'
Human verification is the new development bottleneck
October 2025 Update: GenAI Code Security Report, Veracode (cited above).
Approximately what percentage of Copilot suggestions do you accept?
[3]0-20%
[4]21-35%
[3]36-50%
[2]51-70%
[0]71-100%
Note: 2025 benchmarks: 30-33% acceptance is healthy; healthcare and other regulated teams report 50-60%; startups report around 75%, which is too high. An acceptance rate above 70% is a red flag (see the sketch below for one way to track this).
Research: Quantifying GitHub Copilot's impact in the enterprise with Accenture
GitHub/Accenture
This is the primary source for the 30% acceptance rate benchmark and the 88% code retention statistic. The 95% enjoyment and 90% fulfillment stats are powerful for adoption justification. The 84% increase in successful builds directly supports the claim that AI doesn't sacrifice quality for speed. Published May 2024, so represents mature Copilot usage patterns.
Key Findings:
95% of developers said they enjoyed coding more with GitHub Copilot
90% of developers felt more fulfilled with their jobs when using GitHub Copilot
Developers accepted around 30% of GitHub Copilot's suggestions
State of AI Code Quality 2025, Qodo (cited above).
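For teams that want to see where they sit against these benchmarks, here is a minimal Python sketch. The thresholds mirror the note and sources above; the function names, the 'regulated' flag, and the example counts are illustrative assumptions, not part of any Copilot API.

```python
def acceptance_rate(accepted: int, shown: int) -> float:
    """Fraction of Copilot suggestions accepted (0.0-1.0)."""
    return accepted / shown if shown else 0.0


def classify_acceptance(rate: float, regulated: bool = False) -> str:
    """Compare an acceptance rate against the 2025 benchmarks cited above.

    Thresholds mirror the note above (30-33% healthy, 50-60% reported by
    regulated teams, >70% a red flag) and are illustrative, not normative.
    """
    pct = rate * 100
    if pct > 70:
        return "red flag: suggestions are likely being accepted without review"
    if regulated and 50 <= pct <= 60:
        return "within the range reported by regulated teams"
    if 30 <= pct <= 33:
        return "healthy (matches the GitHub/Accenture ~30% benchmark)"
    if pct < 30:
        return "low: suggestions may lack context or quality"
    return "elevated: worth checking review depth"


if __name__ == "__main__":
    # Example: 170 suggestions accepted out of 540 shown -> ~31%, healthy.
    rate = acceptance_rate(accepted=170, shown=540)
    print(f"{rate:.0%}: {classify_acceptance(rate)}")
```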
In the past month, how often did you discover an error in AI-generated code AFTER accepting it?
[3]Never
[4]1-2 times
[2]3-5 times
[1]6-10 times
[0]More than 10 times
Note: '1-2 times' scores highest; it indicates you use AI enough to encounter issues but catch most errors before acceptance.
October 2025 Update: GenAI Code Security Report, Veracode (cited above).
State of AI Code Quality 2025, Qodo (cited above).
Are you aware that 45% of AI-generated code contains security vulnerabilities (Veracode 2025)?
[1]No, this is surprising to me
[2]I knew there were some risks but not the extent
[4]Yes, and I've adjusted my review practices
[5]Yes, we have specific security scanning for AI code
October 2025 Update: GenAI Code Security Report, Veracode (cited above).
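One concrete form the highest-scoring answer above ('specific security scanning for AI code') can take is running an existing scanner over just the files an AI-assisted change touched. A minimal sketch, assuming a Python codebase, a git checkout, and the open-source Bandit scanner installed; the base branch name and the high-severity-only policy are illustrative choices, and the JSON field names follow Bandit's report format.

```python
import json
import subprocess
import sys


def changed_python_files(base: str = "main") -> list[str]:
    """Python files changed relative to the base branch (illustrative default)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return [p for p in out.stdout.splitlines() if p.endswith(".py")]


def bandit_findings(files: list[str]) -> list[dict]:
    """Run Bandit on the given files and return its JSON results list."""
    if not files:
        return []
    result = subprocess.run(
        ["bandit", "-q", "-f", "json", *files],
        capture_output=True, text=True,  # Bandit exits non-zero when it finds issues
    )
    report = json.loads(result.stdout or "{}")
    return report.get("results", [])


if __name__ == "__main__":
    findings = bandit_findings(changed_python_files())
    high = [f for f in findings if f.get("issue_severity") == "HIGH"]
    for f in high:
        print(f"{f['filename']}:{f['line_number']}: {f['issue_text']}")
    sys.exit(1 if high else 0)  # fail the check when high-severity issues remain
```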
CodeRabbit 2025 found AI code has 2.74x more XSS, 8x more I/O performance issues. Do you check for these patterns?
[1]No - I wasn't aware of these specific risk multipliers
[2]I know about them but don't specifically check
[3]I manually check for XSS and performance issues in AI code
[4]We use automated tools that catch these patterns
[5]Automated + manual review with focus on AI code hot spots
Note: CodeRabbit analyzed 470 PRs: AI code has 1.7x overall issues, with security (2.74x XSS) and performance (8x I/O) as top concerns
State of AI vs Human Code Generation Report
CodeRabbit
This is the most rigorous empirical comparison of AI vs human code quality to date. The 1.7x issue rate and specific vulnerability multipliers (2.74x XSS, 1.88x password handling) are critical for trust_calibration recommendations. Key insight: AI makes the same kinds of mistakes humans do, just more often at larger scale. The 8x I/O performance issue rate shows AI favors simple patterns over efficiency.
Key Findings:
AI-generated PRs contain 1.7x more issues overall (10.83 vs 6.45 issues per PR)
AI PRs show 1.4-1.7x more critical and major issues
Logic and correctness issues 75% more common in AI PRs
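To make the hot spots in the question above concrete, here is a minimal text-level check for a few of the patterns CodeRabbit highlights. The regexes and the loop-tracking heuristic are small illustrative assumptions, not CodeRabbit's rule set; a real tool would use AST-level and data-flow analysis.

```python
import re
import sys

# A tiny, illustrative subset of risky patterns; real tools go far deeper.
XSS_SINKS = re.compile(r"innerHTML\s*=|document\.write\(|dangerouslySetInnerHTML")
IO_CALLS = re.compile(r"\brequests\.(get|post)\(|\bopen\(|\bcursor\.execute\(")
LOOP_HEADER = re.compile(r"^(\s*)(for|while)\b.*:\s*$")


def scan_file(path: str) -> list[str]:
    """Flag likely XSS sinks anywhere, and I/O calls inside Python loop bodies."""
    findings = []
    loop_indent = None  # indentation of the most recent loop header still open
    with open(path, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            header = LOOP_HEADER.match(line)
            if header:
                loop_indent = len(header.group(1))
            elif loop_indent is not None and line.strip():
                # A non-blank line indented at or above the header closes the loop.
                if len(line) - len(line.lstrip()) <= loop_indent:
                    loop_indent = None
            if XSS_SINKS.search(line):
                findings.append(f"{path}:{lineno}: possible XSS sink")
            if loop_indent is not None and not header and IO_CALLS.search(line):
                findings.append(f"{path}:{lineno}: I/O call inside a loop")
    return findings


if __name__ == "__main__":
    hits = [hit for path in sys.argv[1:] for hit in scan_file(path)]
    print("\n".join(hits))
    sys.exit(1 if hits else 0)
```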
GitClear found 8x increase in code duplication from AI tools. Do you monitor for this?
[1]No - I didn't know AI increases duplication
[2]I'm aware but don't actively monitor
[3]I manually review for DRY violations in AI code
[4]We use code quality tools that flag duplication
[5]We actively refactor AI-introduced duplication
Note: GitClear 2025: AI makes it easier to add new code than to reuse existing code (limited context). Refactoring dropped from 25% to under 10% of changed lines.
AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones
GitClear
GitClear's research on 211M lines of code shows AI is changing how code is written - more duplication, less refactoring. The 8x increase in code clones and decline in refactoring suggest AI makes it easier to add new code than reuse existing code (limited context). Critical for understanding long-term maintainability implications.
Key Findings:
8x increase in duplicated code blocks during 2024
Refactoring dropped from 25% of changed lines (2021) to <10% (2024)
Copy/pasted (cloned) code rose from 8.3% to 12.3% (2021-2024)
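As a concrete illustration of the kind of monitoring the higher-scoring answers above describe, below is a minimal sketch of windowed clone detection. The six-line window, the normalization, and the reporting format are illustrative choices; dedicated tools (PMD CPD, jscpd, SonarQube duplication metrics) do this far more robustly.

```python
import hashlib
import sys
from collections import defaultdict

WINDOW = 6  # flag any run of 6+ identical normalized lines seen in more than one place


def normalized_lines(path: str) -> list[str]:
    """Whitespace-collapsed, lower-cased lines with blanks and '#' comments dropped."""
    lines = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for raw in fh:
            text = " ".join(raw.split()).lower()
            if text and not text.startswith("#"):
                lines.append(text)
    return lines


def find_clones(paths: list[str]) -> dict[str, list[str]]:
    """Map the hash of each repeated WINDOW-line chunk to the places it occurs."""
    seen: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        lines = normalized_lines(path)
        for i in range(len(lines) - WINDOW + 1):
            chunk = "\n".join(lines[i:i + WINDOW])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            seen[digest].append(f"{path} (around normalized line {i + 1})")
    return {h: locs for h, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    clones = find_clones(sys.argv[1:])
    for locations in clones.values():
        print("duplicated block:", ", ".join(locations))
    sys.exit(1 if clones else 0)
```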
With agent mode generating larger diffs, how do you manage review fatigue?
[0]I don't use agent mode / N/A
[1]I try to review everything but often skim large diffs
[2]I focus on critical paths and trust the rest
[3]I break large changes into smaller reviewable chunks
[4]I use AI code review tools (CodeRabbit, etc.) + human spot-check
[5]Layered review: AI review + focused human review + automated tests
Note: This is a new critical question for 2025. Layered AI-plus-human review is an emerging best practice; one way to encode such a policy is sketched below.
State of AI Code Quality 2025, Qodo (cited above).
The 'Trust, But Verify' Pattern For AI-Assisted Engineering (cited above).
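To make the layered-review answer concrete, here is a minimal sketch of a pre-merge policy expressed as code. The tier names, the 400-line threshold, and the way a change is marked as AI-generated (a PR label or commit trailer) are all assumptions for illustration, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class ChangeStats:
    lines_changed: int
    ai_generated: bool          # e.g. inferred from a PR label or commit trailer
    touches_security_paths: bool
    tests_pass: bool


def required_review(change: ChangeStats) -> str:
    """Map a change to a review tier; thresholds are illustrative policy choices."""
    if not change.tests_pass:
        return "blocked: automated tests must pass before review"
    if change.touches_security_paths:
        return "AI review + focused human security review"
    if change.ai_generated and change.lines_changed > 400:
        return "split the change, or AI review + full human read-through"
    if change.ai_generated:
        return "AI review + human spot-check of critical paths"
    return "standard human review"


if __name__ == "__main__":
    example = ChangeStats(lines_changed=620, ai_generated=True,
                          touches_security_paths=False, tests_pass=True)
    print(required_review(example))  # suggests splitting or a full read-through
```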
How often do you 'Accept All' AI changes without reading the diff? (Karpathy's 'vibe coding' pattern)
[5]Never - I always read diffs before accepting
[4]Rarely - only for trivial/obvious changes
[2]Sometimes - when I'm confident in the AI
[0]Often - I trust the AI's judgment
[0]Always - I fully 'vibe code' (accept without reading)
Note: Karpathy coined 'vibe coding' (Feb 2025) for accepting AI changes without reading them. Fast Company reported a 'vibe coding hangover' (Sep 2025) with 'development hell' consequences. A score of 2 or below ('Sometimes' or more often) is a critical risk flag.
Vibe Coding Definition (Original Tweet)
Andrej Karpathy
This tweet coined the term 'vibe coding' on February 3, 2025, defining it as a programming style where you 'forget that the code even exists' and 'Accept All' without reading diffs. Critically, Karpathy explicitly limits this to 'throwaway weekend projects' - a nuance often missed in subsequent coverage. The full quote shows he acknowledges the code grows 'beyond my usual comprehension' and he works around bugs rather than fixing them. This is essential context for our trust_calibration dimension: even the person who coined the term warns it's not for production work.
Key Findings:
Coined term 'vibe coding' for accepting AI changes without reading
Fast Company coverage of the 'vibe coding hangover' (September 2025)
This article documents the real-world consequences of 'vibe coding' practices going wrong. The Tea App case study is particularly powerful: a dating app built with minimal oversight leaked 72,000 images including driver's licenses due to a misconfigured Firebase bucket - a basic security error that proper review would have caught. PayPal engineer Jack Hays calls AI-generated code 'development hell' to maintain. Stack Overflow data shows declining trust (46% distrust vs 33% trust) and positive sentiment falling from 70% to 60%. This is essential evidence for our trust_calibration and agentic_supervision dimensions.
Key Findings:
Documented 'vibe coding hangover' phenomenon
Teams in 'development hell' from unreviewed AI code
Tea App data breach: 72,000 sensitive images leaked from unsecured Firebase
If you sometimes accept AI changes without reading them: in what contexts?
[2]Prototypes and throwaway code
[1]Test code generation
[2]Documentation and comments
[0]Refactoring existing code
[0]New feature implementation
[0]Production code
Note: Vibe coding on prototypes and documentation is lower risk; on production code or new features it is a critical risk. High-risk contexts score 0.
Vibe Coding Definition (Original Tweet), Andrej Karpathy (cited above).
Fast Company coverage of the 'vibe coding hangover' (cited above).