Bridging the metrics vacuum: how to measure the real impact of AI assistants in software engineering

Agnes Józwiak

May 27, 2026

AI promises to be the ultimate coding sidekick. Nearly 90% of software teams now lean on it, with dashboards boasting 200 hours saved per year.

But your team’s reality may not match the hype.

Developers say they feel faster, but the PR queue is still overflowing, incidents are still lingering, and roadmaps keep slipping. Even though studies show 35% of devs resolve issues faster, the so-called AI productivity boost often gets lost in translation, especially since 68% of engineering teams spend at least four hours reviewing or correcting AI outputs.

Many teams are left wondering why the numbers don’t add up. Turns out, about 18% of organizations don’t have a framework to make sense of their AI results.

With insights from Appfire Flow’s Director of Engineering, Surya Mereddy, you’ll get the cues, checklists, and cultural habits to spot what matters and actually move the needle on delivery.

Common pitfalls in AI

If you’ve plugged AI into your workflow, you’ve seen the fireworks. DevOps metrics like review times dropping by 75% and PR volume spiking 111% look impressive on any dashboard. But those numbers alone won’t tell you if you’re actually shipping value or just creating the appearance of progress.

"The biggest signal isn't in the code. It's in the gap between velocity and understanding."

Surya Mereddy

Rising code churn after delivery

If your team is rewriting big chunks of code right after a feature ships, that’s code rot setting in. Surya has seen it firsthand: six different approaches across three services, all in one sprint.

Why does this happen? AI only knows what you feed it, and every developer brings their own instructions and ideas. Without a shared playbook, consistency breaks down quickly, and your team ends up reworking code just to keep things afloat.

“A healthy team's rework should be declining after delivery. Pair it with the Investment Profile report to see whether your team's effort is shifting from new work toward unplanned maintenance. That shift is the early warning.”

Surya Mereddy
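
To make "declining after delivery" concrete, here is a minimal Python sketch. It assumes you can export, per week since release, how many lines changed and how many of those rewrote recently shipped code. The `WeeklyChurn` record and both function names are hypothetical, not part of Appfire Flow or any other tool, and what counts as "rework" is your team's call.

```python
from dataclasses import dataclass


@dataclass
class WeeklyChurn:
    week: int            # weeks since the feature shipped (0 = release week)
    lines_changed: int   # all lines touched that week
    lines_reworked: int  # lines that rewrite recently shipped code (assumed definition)


def rework_rates(weeks: list[WeeklyChurn]) -> list[float]:
    """Return the share of changed lines that were rework, ordered by week."""
    ordered = sorted(weeks, key=lambda w: w.week)
    return [
        w.lines_reworked / w.lines_changed if w.lines_changed else 0.0
        for w in ordered
    ]


def is_declining(rates: list[float]) -> bool:
    """True when no week's rate rises above the previous one and the last is below the first."""
    if len(rates) < 2:
        return False
    never_rises = all(later <= earlier for earlier, later in zip(rates, rates[1:]))
    return never_rises and rates[-1] < rates[0]
```

A rate that climbs week over week after release is the pattern the quote warns about. A strict "never rises" check is deliberately conservative, so you may prefer a looser trend test for noisy data.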

Developers resist AI adoption

A recent Stack Overflow survey found only 29% of developers actually trust AI tools. This is partially because of the “discernment burden,” meaning engineers have to double-check every line of code to ensure it aligns with your team’s coding standards.

When working in complex, real-world codebases, AI can struggle to keep up with context. For instance, a 2025 METR (Model Evaluation and Threat Research) study found that experienced open-source developers took about 19% longer to complete tasks with AI tools than without, highlighting a gap between perceived and actual productivity.

Think of it like adopting any other tool: the extra time is likely to shrink as developers learn to use AI effectively. In addition, Surya found that the extra time spent reviewing code led to more collaboration and better product delivery.

Just watch out: If your time to merge (TTM) suddenly drops, it could mean reviewers are rubber-stamping code, quietly resisting the AI wave.

“When the ratio shifts toward unreviewed (PRs), you have a review culture problem. Pair that with time to first comment and reaction time to find out whether your review process is real or performative.”

Surya Mereddy
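
The two signals in that quote can be sketched in a few lines of Python. This is an illustration under assumptions, not how any SEI platform computes them: the `PullRequest` record is hypothetical, and "unreviewed" is taken here to mean a merged PR with zero review comments.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: datetime
    first_comment_at: Optional[datetime]  # None when nobody commented before merge
    review_comments: int


def unreviewed_ratio(prs: list[PullRequest]) -> float:
    """Share of merged PRs that received no review comments at all."""
    if not prs:
        return 0.0
    unreviewed = sum(1 for pr in prs if pr.review_comments == 0)
    return unreviewed / len(prs)


def median_time_to_first_comment(prs: list[PullRequest]) -> Optional[timedelta]:
    """Median wait between opening a PR and its first comment; skips uncommented PRs."""
    waits = [
        pr.first_comment_at - pr.opened_at
        for pr in prs
        if pr.first_comment_at is not None
    ]
    return median(waits) if waits else None
```

Read the two numbers together. A falling time to merge alongside a rising unreviewed ratio points toward rubber-stamping rather than a faster review process.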

Deployment frequency diverging from time to recovery

AI-generated code can push features out the door fast. But if your team doesn’t really understand what’s shipping, recovery slows down, and delivery quality takes a hit.

The Ox Security team looked at over 300 AI-assisted, open-source projects and found that most of them had these problems:

  • Comments explaining the code were left everywhere, increasing cognitive load during human review.
  • AI-generated code followed textbook patterns instead of organizational nuances.
  • Code was over-specified and only useful in extreme edge cases.
  • The same bugs appeared throughout the project.

Where human developers constantly improve code structure to make it easier to understand, AI is only concerned with meeting the prompt.

Relying on outdated metrics

Old-school metrics like ticket velocity and lines of code (LOC) just don’t cut it for AI. They sort of made sense when humans wrote every line, but now that engineers are spending more time in QA mode, those numbers miss the mark.

These old metrics ignore the real work: debugging, untangling logs, and reviewing legacy code. You might see LOC and PR counts go up, but cycle time stays flat because engineers are busy making sense of what AI delivers. Rush that process, and you’ll see more change failures and code churn as teams scramble to fix what slipped through.

Unfortunately, there’s no magic metric for code generated versus code understood. The upside? There are frameworks that help you get closer.

The hidden tax of AI-generated pull requests

AI-generated PRs are flooding the pipeline, and not all of them are gems. Some are low-quality AI-generated code — so poor that GitHub is considering limiting who can submit them. For your team, this means time wasted reviewing low-value PRs instead of delivering real impact.

Your senior devs can handle the load, but nobody wants to spend hours on low-value work.

Editing or tossing out low-quality “junk PRs” usually falls on your most experienced (and expensive) people. If reviews keep piling up on just a couple of folks (especially with smaller teams), you’ve got a knowledge silo and a low bus factor: too few people understand the code well enough to keep it moving. Our 20 Patterns Guide breaks down these and other slowdowns to watch for.
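
A rough way to check how concentrated your reviews are is to count how few reviewers account for a given share of them. The Python sketch below assumes you can export one reviewer name per completed review. The function name and the 50% default are illustrative choices, not a standard.

```python
from collections import Counter


def reviewers_covering(reviews: list[str], share: float = 0.5) -> int:
    """Smallest number of reviewers who together handled at least `share` of all reviews."""
    counts = Counter(reviews)
    total = sum(counts.values())
    covered = 0
    # Walk reviewers from busiest to least busy until the target share is reached.
    for rank, (_, handled) in enumerate(counts.most_common(), start=1):
        covered += handled
        if covered >= share * total:
            return rank
    return 0  # no reviews recorded
```

If the answer is 1 on a team of eight, one person is carrying at least half of the review load, which is worth a conversation before it becomes a bottleneck.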

Frameworks that lead to better outcomes

Two frameworks and their software engineering metrics help AI-supported teams improve delivery outcomes: DORA (DevOps Research and Assessment) and SPACE (Satisfaction, Performance, Activity, Communication, and Efficiency).

Use both for a layered approach that keeps your metrics aligned with business goals — not just dashboard vanity.

| DORA metrics | SPACE framework | Business value alignment | What AI changes |
| --- | --- | --- | --- |
| Deployment frequency | Satisfaction and well-being | Developer velocity index (DVI) | Evaluates AI's ability to increase competitive delivery without causing developer burnout. |
| Lead time for changes | Performance | Contribution analysis (CA) | Contrasts time-to-value acceleration with new bottlenecks introduced during code review. |
| Change failure rate | Activity | Talent capacity (TC) | Balances the risk of shipping AI-authored bugs against the successful growth of developer expertise. |
| Mean time to recovery (MTTR) | Communication and collaboration | Inner loop efficiency | Highlights AI's role in improving incident recovery while keeping developers in a productive flow state. |
| | Efficiency and flow | | Tracks the actual improvement in overall time-to-market driven by AI adoption. |

DORA metrics

DORA metrics focus on four delivery signals, with questions you can ask to better understand how AI contributes:

  • Deployment frequency: Does AI reduce cycle time?
  • Lead time for changes: How much QA time has been reduced (or increased) by AI?
  • Change failure rate: Does AI increase change failure rate?
  • Mean time to recovery (MTTR): Does recovery time change meaningfully?

If you’re already using DORA, you can see the before-and-after impact of AI. Just make sure your AI policies are clear and your data house is in order. That way, every tweak you make is trackable and meaningful.
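
If you want to sanity-check a before-and-after comparison yourself, the four signals can be approximated from a list of deployment records. The sketch below is one possible reading of the metrics, not DORA's official tooling. The `Deployment` fields and the `dora_summary` function are hypothetical, and lead time is measured here from first commit to production.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # caused an incident or rollback
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys: list[Deployment], period_days: int) -> dict:
    """Approximate the four DORA signals for one reporting period."""
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    return {
        "deployment_frequency_per_week": len(deploys) * 7 / period_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }
```

Run it once on a period before your AI rollout and once on a period after, using the same definitions for both, and compare the two dictionaries side by side.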

SPACE framework

SPACE metrics provide a more team-focused approach to tracking AI, broken down like this:

  • Satisfaction and well-being: Has AI reduced or increased developer workloads?
  • Performance: Is our increased coding speed resulting in faster customer value?
  • Activity: How has AI changed our new code-to-review ratio? Has review depth collapsed because of this?
  • Communication and collaboration: Are reviewers spending too much time deciphering AI-generated logic?
  • Efficiency and flow: Has AI decreased our TTM, or is work stuck in review longer?

Pair DORA’s big-picture view with SPACE’s team-level lens to see where AI is actually making life easier for your engineers.

Business-value alignment data

Business-value metrics focus on how changes are bringing value to your business. Here’s where AI comes in:

  • Developer Velocity Index (DVI): How have AI integrations made us more competitive?
  • Contribution Analysis (CA): Does AI free our senior talent to focus on complex projects?
  • Talent Capacity (TC): Can we track how fast our novices are trained through AI-driven learning journeys?
  • Inner Loop Efficiency: Does AI help us spend more time on creating the product (coding, building, and unit testing)?

When you track business value, you move past surface-level productivity and zero in on what’s really driving better delivery.

Results with Flow:

KAR Global used visibility into how engineering teams work to modernize onboarding, understand workflows, and help earn a “Best in Tech” award for transformation through data.

Getting results with Software Engineering Intelligence (SEI) platforms 

If you’ve already made these changes, you’re one step ahead. In this case, the problem isn’t a lack of AI tools; it’s fragmented telemetry across the software development lifecycle (SDLC).

To get a full picture of AI adoption that actually works, Surya uses a three-level framework:

  • Level 1 (Adoption): Track tool activation rates, session frequency, and breadth of usage across the team. Emphasize the “how,” tracking how AI adoption correlates with actual delivery patterns.
  • Level 2 (Throughput): Track cycle time, PR volume, and completion rates. Look at trends over quarters, not weeks, to see where AI is truly changing delivery patterns (not just inflating numbers).
  • Level 3 (Reliability): Track rework, backflow rate, TTM, thoroughly reviewed PR percentage, change failure rate, and service restoration times. This helps you track actual results, not just the novelty effect that can occur when throughput spikes immediately after adopting AI.
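
One way to read Levels 2 and 3 together is to compare quarter-over-quarter changes and flag the case where throughput grew while reliability got worse. The Python sketch below is an illustration under assumptions: the function names are hypothetical, each argument is a (previous quarter, current quarter) pair, and "at least one reliability signal worsened" is an arbitrary rule you should tune.

```python
def pct_change(previous: float, current: float) -> float:
    """Relative change from one quarter to the next, as a fraction."""
    if previous == 0:
        raise ValueError("previous quarter value must be non-zero")
    return (current - previous) / previous


def novelty_effect_suspected(
    throughput: tuple[float, float],
    change_failure_rate: tuple[float, float],
    rework_rate: tuple[float, float],
) -> bool:
    """True when throughput rose while at least one reliability signal worsened."""
    throughput_up = pct_change(*throughput) > 0
    reliability_worse = (
        pct_change(*change_failure_rate) > 0 or pct_change(*rework_rate) > 0
    )
    return throughput_up and reliability_worse
```

For example, merged PRs going from 120 to 180 per quarter while change failure rate moves from 10% to 15% would be flagged, because the extra output arrived with more failures.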

Without that level of visibility, teams are left interpreting disconnected signals: speed without context, output without outcomes. Bringing those signals together makes it easier to spot where AI is helping, where it’s adding friction, and where to focus next.

An SEI platform like Appfire Flow helps translate engineering activity into clear, actionable insight, so teams can see how delivery patterns, code quality, and review behavior actually impact delivery outcomes.

“Start strict. Earn trust through evidence. Loosen deliberately. That's the progression.”

Surya Mereddy

How to build a DevOps culture ready for continuous AI improvements

To make AI stick, treat it as a people problem, not just a tech upgrade. Here’s how to get it right:

Follow the golden line

Don’t roll out AI just because everyone else is. Deploy it with purpose, and make sure every tool ties directly to a business KPI you can measure.

The golden line starts with showing how to use your new tools. Surya ran daily stand-up demos for seven months, and his team hit 100% adoption without a single mandate.

“AI mandates and standardization don't work because you need a mindset shift and upskilling with intent, not compliance.”

Surya Mereddy

Give permission to experiment (and fail)

AI makes it cheap to experiment, helping your team create prototypes in hours, not days. But if your culture punishes failure, you’ll end up with a risk-averse team that never pushes the envelope.

This matters most in the first three months, when failure signals are fuzzy. Don’t panic and fall back on vanity metrics like PRs per week. Instead, expect a three-month lag and use your SEI platform to track quarter-over-quarter trends. Your patience will be rewarded with measurable understanding.

Support your developers

Give your teams real autonomy to experiment and get curious with AI. Treat them like cogs at 100% utilization, and you’ll get assembly-line results.

Keep an eye on the architecture-to-toil ratio, the time spent on high-level design versus hours lost to reviewing and fixing regressions. If your senior engineers are stuck reviewing AI code and patching junior mistakes, they’re not sharpening their edge.

Also, let your developers voice their concerns. As AI evolves, even seasoned engineers can feel imposter syndrome creeping in. Remind them that the smartest move is to pause and recharge, and that everyone feels the pressure as the bar keeps moving.

“AI is a genuinely safer space not to know things. And it has raised the stakes on what knowing things even means. Both can be true at the same time, and acknowledging that is leadership.”

Surya Mereddy

Find your invisible champions

Spot the engineers who quietly ship solid work and always have a draft ready before the crowd. Give these silent champions a platform, protected time, and the green light to share their skills and train others.

Your SEI dashboard can help you spot these folks. Appfire Flow’s Work Focus breakdown highlights who’s helping others while getting their own work done. Look for low rework, thoroughly reviewed PRs, and a strong Sharing Index for broad collaboration.

Measure the system, not the individual

Chasing individual AI output turns your incentives upside down, rewarding output over results. Focus on team outcomes, not just PR counts.

Here are some questions you can ask to focus on the team over the individual:

  • Did cycle time actually improve?
  • Did the defect escape rate go down?
  • Are we delivering features customers actually use?
  • How quickly does the team respond to production issues?
  • How often does the team need to redo work?
  • How confident is the team in understanding what they shipped?

Tracking more doesn’t help; understanding better does. 

Encourage uncompromising accountability

Real accountability means building a trust framework for AI workflows. The right guardrails keep AI from running wild and wrecking your codebase, without punishing experimentation.

Surya has three tiers in his trust hierarchy: 

  • Tier 1 (Full human review): With security-critical code, financial logic, and anything touching user data, pay the extra time cost with a complete human review. To save time, treat the AI agents building this code like product service accounts with limited privileges and full audit trails.
  • Tier 2 (Verify by proxy): Excluding the highest-risk issues, trust your system if the tests, the type checks, and the linter pass. Just make sure your verification infrastructure works, as most organizations overestimate their CI reliability.
  • Tier 3 (Trusted provisionally): For low-stakes code and isolated functions you can rip out easily, trust the process. These can ship faster than most, but you should monitor closely in case you need to make quick fixes.
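
The three tiers can be expressed as a small routing function, for example in a CI step that labels incoming changes. The Python below is a sketch under assumptions, not Surya's implementation. The `Change` flags and `assign_tier` are hypothetical, and sending anything that fails its automated checks back to full human review is a choice made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FULL_HUMAN_REVIEW = 1
    VERIFY_BY_PROXY = 2
    TRUSTED_PROVISIONALLY = 3


@dataclass
class Change:
    touches_security: bool = False
    touches_financial_logic: bool = False
    touches_user_data: bool = False
    isolated_and_low_stakes: bool = False  # easy to rip out if it misbehaves
    checks_passed: bool = False            # tests, type checks, and linter all green


def assign_tier(change: Change) -> Tier:
    """Route a change to the strictest tier that applies to it."""
    high_risk = (
        change.touches_security
        or change.touches_financial_logic
        or change.touches_user_data
    )
    if high_risk:
        return Tier.FULL_HUMAN_REVIEW
    if change.isolated_and_low_stakes:
        return Tier.TRUSTED_PROVISIONALLY
    if change.checks_passed:
        return Tier.VERIFY_BY_PROXY
    # Assumption for this sketch: failing checks fall back to the strictest tier.
    return Tier.FULL_HUMAN_REVIEW
```

Checking the high-risk flags first means a change that touches user data always gets a human reviewer, even when every automated check is green.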

Trust frameworks should be grounded in real data: change failure rates, rework trends, review depth. Your SEI platform needs to surface these signals so you can adjust your process with evidence, not guesswork.

But above all, start with the process, as adding another tool to your software engineering stack won’t resolve those issues.

Illuminate your AI productivity metrics with Appfire Flow

To make AI metrics work for you, look beyond raw output and measure patiently over quarters. By combining DORA, SPACE, and business value alignment, you create a system that addresses the most common AI measurement gaps. This approach keeps you tracking sustainable, high-quality delivery and actual team health, not just inflated coding velocity.

But this multi-level tracking shouldn’t require drowning your team in manual reporting. Appfire Flow takes the guesswork out of AI adoption with a single source of truth for data-driven decisions. The platform’s unified dashboard gives engineering leaders a clear view of cycle times, throughput, and delivery metrics. That way, you know you’re shipping what customers want while giving your developers what they need to do their best work.

Ready to stop guessing and start measuring the true impact of your AI adoption? Get Appfire Flow to illuminate your engineering metrics today.

Agnes Józwiak is a Senior Product Marketing Manager at Appfire. With deep roots in Agile and SaaS, she crafts messaging that connects with users, drives adoption, and turns great products into everyday solutions. She’s passionate about the human side of technology and uses storytelling to build community and inspire action.