
Your developers tell you AI makes them faster. Your PR counts are climbing. Your engineers swear they can’t live without their coding assistants.
So why haven't your delivery timelines budged?
This is the challenge engineering leaders are still wrestling with as 2025 comes to a close. According to the 2025 Stack Overflow Developer Survey, 84% of developers are using AI tools. Yet, only 60% actually view AI positively (down from over 70% last year), and nearly half (46%) don’t trust the accuracy of the output.
But here’s the stat that should keep every CTO and VP of Engineering awake at night, courtesy of the 2025 DORA State of AI Report:
AI boosts individual output — 21% more tasks completed and 98% more pull requests merged. But organizational delivery metrics stay flat.
Read that again.
Individual productivity is skyrocketing. Organizational outcomes are stalling out.
If your dashboard is green but your Chief Financial Officer is asking why features aren't shipping faster, you have a measurement problem.
The "Amplifier Effect" (or, why AI can’t fix your process)
The DORA research uncovered a brutal truth: AI acts as an amplifier, not a fixer.
If you have a strong team with solid CI/CD pipelines and healthy review cultures, AI makes them stronger. But if you have a brittle architecture, slow testing cycles, or a bottlenecked review process, AI just makes those problems louder.
Think about it. If your deployment pipeline is slow, an AI-fueled increase in code volume just makes the queue longer. If your code review process is already strained, doubling the number of PRs doesn't speed up delivery; it just burns out your senior engineers.
This is why the "vanity metrics" we used in 2023 don't work anymore:
- PR Count: Great, you have 98% more PRs. Are they stuck in review?
- Lines of Code: AI can generate boilerplate by the bucketload. Is it maintainable?
- Adoption Rate: 91% of your team uses Copilot. Are they using it to solve difficult problems or just to tab-complete variable names?
From skeptic to believer: How we fixed our measurement
Six months ago, I was not an AI evangelist. In fact, I thought the tools were absolute junk. I had just lost six hours of work to a hallucination because I wasn’t committing regularly. I was doing what I now call "waste coding" — throwing prompts at a wall and hoping for magic.
But as a manager, I couldn't just ignore it. I didn't want to mandate AI usage, but I needed to understand it.
We started small. No mandates. Just shared learning. We documented failures in standups. We celebrated when an engineer used AI to cut a reporting feature from a quarter-long project down to one month.
Crucially, we decided to measure everything. We stopped looking at activity (PRs opened) and started looking at flow and outcomes.
Here are the five metrics that actually told us if AI was working, based on six months of data from my own team.
1. Time to merge (the real velocity)
What it is: How long work sits in review before shipping.
Why it matters: AI inflates PR volume. If Time to Merge goes up, you haven't sped anything up; you've just moved the bottleneck to your reviewers.
Our result: We drove this down to 43.8 hours (19.5% improvement). We are now in the top quartile of the industry. We are shipping faster, not just coding faster.
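If you want to sanity-check this number yourself, here is a minimal Python sketch of the calculation. The record layout (`opened_at`, `merged_at`) and the choice of the median are assumptions for illustration; this is not necessarily how Flow or any other tool computes it.

```python
from datetime import datetime
from statistics import median

def time_to_merge_hours(prs):
    """Median hours between a PR being opened and merged.

    `prs` is a list of dicts with hypothetical keys `opened_at` and
    `merged_at` (datetime, or None if the PR has not merged yet).
    """
    durations = [
        (pr["merged_at"] - pr["opened_at"]).total_seconds() / 3600
        for pr in prs
        if pr["merged_at"] is not None  # open PRs have no merge time yet
    ]
    return median(durations) if durations else None

# Made-up sample data, not our team's numbers.
sample_prs = [
    {"opened_at": datetime(2025, 11, 3, 9), "merged_at": datetime(2025, 11, 4, 9)},   # 24 h
    {"opened_at": datetime(2025, 11, 3, 9), "merged_at": datetime(2025, 11, 5, 9)},   # 48 h
    {"opened_at": datetime(2025, 11, 3, 9), "merged_at": datetime(2025, 11, 6, 21)},  # 84 h
    {"opened_at": datetime(2025, 11, 7, 9), "merged_at": None},                       # still open
]
print(time_to_merge_hours(sample_prs))  # 48.0
```

The median is used here so that one PR that sat for two weeks doesn't swamp the picture; a mean or a percentile would be an equally defensible choice.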
2. Rework rate (the "trust" metric)
What it is: The amount of code rewritten within 30 days of being committed.
Why it matters: Remember that 46% of developers don't trust AI output? This is where that shows up. If AI writes buggy code, Rework spikes.
Our result: Our rework time dropped to 18 hours (-26.2%). We aren't just generating code; we're generating code that sticks.
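To make the definition concrete, here is a rough Python sketch that follows the "amount of code rewritten within 30 days" reading and reports it as a share of lines. The input layout (one `(committed_on, rewritten_on)` pair per line of code) is hypothetical, and a real tool would derive it from version-control history and may report rework in hours instead.

```python
from datetime import date, timedelta

REWORK_WINDOW = timedelta(days=30)  # window taken from the definition above

def rework_rate(lines):
    """Share of lines rewritten within 30 days of being committed.

    `lines` is a list of (committed_on, rewritten_on) date pairs, a
    hypothetical layout; `rewritten_on` is None if the line never changed.
    """
    if not lines:
        return 0.0
    reworked = sum(
        1
        for committed_on, rewritten_on in lines
        if rewritten_on is not None
        and rewritten_on - committed_on <= REWORK_WINDOW
    )
    return reworked / len(lines)

# Made-up sample data.
sample_lines = [
    (date(2025, 9, 1), date(2025, 9, 10)),  # rewritten after 9 days: counts as rework
    (date(2025, 9, 1), date(2025, 11, 1)),  # rewritten after 61 days: outside the window
    (date(2025, 9, 1), None),               # never touched again
    (date(2025, 9, 1), None),
]
print(rework_rate(sample_lines))  # 0.25
```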
3. Review quality (% thoroughly reviewed)
What it is: The percentage of PRs that receive substantive comments vs. a rubber stamp.
Why it matters: When volume doubles, reviewer fatigue sets in. If this metric drops, you are trading quality for speed, and you will pay for it later.
Our result: 43.7% thoroughly reviewed (67.3% improvement). Even with more code, our review rigor went up.
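What counts as "substantive" is a judgment call. As a rough sketch, assuming a comment is substantive when it has at least eight words (an arbitrary cutoff chosen for this example, not an industry standard), the percentage could be computed like this in Python:

```python
MIN_WORDS = 8  # assumed cutoff for a "substantive" comment

def is_substantive(comment):
    """Crude heuristic: long enough to be more than 'LGTM' or a nit."""
    return len(comment.split()) >= MIN_WORDS

def thoroughly_reviewed_pct(prs):
    """Percentage of PRs with at least one substantive review comment.

    `prs` maps a PR id to its list of review comment strings
    (a hypothetical layout for this example).
    """
    if not prs:
        return 0.0
    thorough = sum(
        1 for comments in prs.values() if any(is_substantive(c) for c in comments)
    )
    return 100 * thorough / len(prs)

# Made-up sample data.
sample_reviews = {
    101: ["LGTM"],
    102: ["This query runs once per row; can we batch it before the loop?"],
    103: [],
    104: ["nit: typo", "Please add a test for the empty-input case before merging."],
}
print(thoroughly_reviewed_pct(sample_reviews))  # 50.0
```

A word count is easy to game, so treat this as a starting point; the trend over time matters more than the exact threshold.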
4. Impact score (not lines of code)
What it is: A measure of cognitive complexity and the significance of code changes (based on Flow Impact).
Why it matters: A 50-line architectural fix is worth more than 500 lines of generated getters and setters. If Impact is flat while Volume rises, AI is just generating noise.
Our result: Impact score rose to 170.7 (+30.8%). We are doing more meaningful work.
5. Active coding days (sustainability)
What it is: How many days per week engineers are actually committing code.
Why it matters: This is your burnout detector.
Our result: Up to 3.04 days/week (+26.7%). Engineers are spending less time blocked and more time in flow.
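For a single engineer, the calculation can be sketched in a few lines of Python. The function name and the idea of passing the number of weeks explicitly are assumptions for this example; the one point it illustrates is that several commits on the same day count as one active day.

```python
from datetime import date

def active_days_per_week(commit_dates, weeks_in_period):
    """Average number of distinct days per week with at least one commit."""
    if weeks_in_period <= 0:
        raise ValueError("weeks_in_period must be positive")
    return len(set(commit_dates)) / weeks_in_period

# Made-up sample data covering two weeks.
commits = [
    date(2025, 11, 3), date(2025, 11, 3),  # two commits on one day count once
    date(2025, 11, 4),
    date(2025, 11, 6),
    date(2025, 11, 10),
    date(2025, 11, 12),
    date(2025, 11, 13),
]
print(active_days_per_week(commits, weeks_in_period=2))  # 3.0
```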
The honest tradeoff
I want to be transparent about something most AI success stories hide: There is a cost.
While our outcomes improved, our Efficiency metric (the raw percentage of code that doesn't get rewritten) actually dropped by 14%.
Why? Because moving faster means iterating more. We are breaking things slightly more often.
However, because our Rework Time dropped by 26%, the net result is positive. We make mistakes more often, but we fix them much faster.
This is a tradeoff I’ll take to the board any day: "We are trading a small dip in first-pass efficiency for a massive gain in total throughput and impact."
A 90-day plan for your board
If you go to your board today and say "AI is great because our PRs are up," you are failing them. Instead, try this approach:
Month 1: Baseline & benchmark
Stop guessing. Establish your current Time to Merge, Rework Rate, and Impact Score. Don't set targets yet; just find out where you stand against industry benchmarks.
Month 2: Watch the bottlenecks
As adoption rises, watch where the pipe clogs. Does Time to Merge spike? That’s the Amplifier Effect kicking in. Address your review process immediately.
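One simple way to watch for that spike is to compare each week's Time to Merge against the Month 1 baseline. The Python sketch below assumes a 20% tolerance and a weekly granularity; both are illustrative choices, and the numbers are invented.

```python
def flag_merge_time_spikes(baseline_hours, weekly_hours, tolerance=0.20):
    """Return the weeks whose time to merge exceeds baseline by more than `tolerance`.

    `weekly_hours` maps a week label to that week's time to merge in hours.
    The 20% default tolerance is an assumption, not a benchmark.
    """
    threshold = baseline_hours * (1 + tolerance)
    return [week for week, hours in weekly_hours.items() if hours > threshold]

# Made-up sample data: baseline of 50 hours from Month 1.
weekly = {"week 5": 52.0, "week 6": 58.0, "week 7": 71.5, "week 8": 66.0}
print(flag_merge_time_spikes(50.0, weekly))  # ['week 7', 'week 8']
```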
Month 3: Report on outcomes
Present the data honestly. Show the wins (Impact, Speed) alongside the tradeoffs (Efficiency).
The question isn't "Is AI making us productive?" We know it is.
The question is: "Is that productivity leaving the developer's laptop and entering the product?"
If you measure the right things, the answer can be yes.