Why 10x code didn't give us 10x throughput

Artificial intelligence

Engineering leadership

Software engineering intelligence

Why 10x code didn't give us 10x throughput

Surya Mereddy

Feb 16, 2026

I looked at our metrics. Cycle Time: flat. Coding days: flat. Commits per day: flat.

We'd gone AI-first, meaning AI tools were part of daily development work, not side experiments. Great adoption. Everyone using it daily. Seven months of standup demos where I shared what worked and what broke. And the numbers hadn't moved.

We were generating more code and shipping at the same speed. What was going on?

AI is a magnifier, not a multiplier

Here's what I learned: if your processes were designed for human-speed input and you're now feeding them machine-speed code, you don't get 10x throughput. You get 10x bottlenecks. The constraint just moves downstream.

Our engineers could code 10 tickets. But then they had to run the same CI pipeline for all 10. Same regression checklist. Same deployment process. Each PR takes 30 minutes in CI. Deployment takes an hour. This is queue time. The code was written fast but waiting everywhere downstream.

So maybe we got 2x. Not 10x.

The work piled up in review queues, regression tests, and deployment. We wanted to move fast and do it right. By the time we verified it was right, we had already lost most of the gains from writing it quickly.

Systems designed for humans, fed by machines

All of our systems were designed for human input. The speed of a human writing code. The pace of a human creating tickets. The rhythm of a human moving through a sprint.

Now we have machine-speed input. And the systems can't keep up.

My engineers can work on three different tickets simultaneously. But in the AI era, those three tickets could be one ticket with detailed intent and specs. They just go build it.

So what even is a ticket anymore? What does a story mean when you can ship an entire epic at the code level in a day? For engineering leaders, these questions show up in planning sessions, progress reviews, and conversations about where work is actually getting stuck.

I don't have the answers yet. But I know the structure is changing. And if your processes assume human-speed input, you're going to hit walls.

What we built to fix it

We didn't build agents to generate more code. We built agents to unblock the system so code could actually ship.

CS triage agent. Customer support triage was one of the first places work got stuck. Repetitive log investigation for customer support tickets kept routing decisions through me, creating a bottleneck. I built an agent to triage them. It runs at 91% accuracy, but accuracy wasn't the point. The point was removing myself as a constraint so my team could focus on features instead of waiting on me.

PR review bot. Pull request review became the next bottleneck. As throughput increased, reviews started piling up. That’s when Nate — one of Flow’s lead engineers — built a PR review bot to take pressure off that step. Same pattern: He removed himself as a constraint so the team could keep moving. That bot has now completed 628 reviews.

QA agent. Regression testing surfaced as another limit. Two engineers, Briana and Nishtha, identified the bottleneck and started building a QA agent to automate it. Briana was skeptical at first. She spent time in open office hours and watched peers experiment. Her early results weren't good. She refined her approach. Now she's building production agents.

Every agent solved a specific bottleneck. None of them were about generating more code.

I don't want to make it sound like we have it all figured out. We don't. These are active, in-progress efforts, and we're learning as we go.

A coherent way to think about AI impact

If you’re struggling to measure whether AI is actually helping, it’s usually because the system itself is changing. What worked before no longer tells the full story.

We found it useful to think about AI adoption in three stacked levels. Each level has its own question and its own metrics. As your capabilities mature, your focus should move up the stack.

Level 1: Adoption

The question here is simple: “Are engineers using AI at all?”

The metrics are familiar. Tool usage. Daily active users. Sentiment.

This is where most teams start. It’s useful early on. But it only tells you whether the tool is present, not whether it’s making anything better.

Level 2: Throughput

The question shifts to: “Is AI changing how work flows through the system?”

This is where metrics like investment profile (feature work versus maintenance and tech debt), time to merge, review depth, and the current bottleneck start to matter.

This is where real business value begins to show up. But you can’t see it through a single number. You have to look at how the system behaves as a whole.

Level 3: Reliability

The question becomes: “Do agents behave like dependable parts of the system?”

At this level, you measure what you’d expect from any production system. Deployment frequency. Lead time. Change failure rate. Mean time to recovery.

Once agents are unblocking your system, they are part of your system. That means you have to measure them. Secure them. Govern them. Just like anything else running in production.

What the data showed us

This is where it gets interesting. Once we had a framework for measurement, the data started to tell a coherent story. Some metrics looked worse, and some looked better, but the full picture only made sense when we looked at them through the lens of the three levels.

Investment Profile: where the time actually goes. This metric shows how your team splits time between features, bugs, maintenance, and tech debt. Six months ago, we were spending 40-50% of our time on feature work. The rest was bugs, maintenance, tech debt. Now we're at 60-70% feature work. The agents are unblocking us. We're actually getting to work on the things that matter for the business.

Time to Merge went up. Under 24 hours to closer to 48 hours. That sounds bad, right? But here's why: our thoroughly reviewed PRs are way up. Half of our PRs now have additional commits on them. The reviews themselves are getting more complex. We're putting more time into making sure what goes out the door is right.

A lot of those robust review comments are coming from our AI agents. Nate's PR bot alone has 628 reviews. The reviews are more thorough, which takes longer, but the quality going out is better.

Completion Trend is up. Even though traditional metrics look weird, we're completing way more work. Keeping up with tickets just looks different now.

After we fixed the downstream bottlenecks, we saw a 111% increase in PR volume and a 57% reduction in Cycle Time. But only after we addressed what was downstream. The ROI wasn't in code generation. It was in removing the constraints that code generation exposed.

The 3-month lag. There's a critical lag between adoption and visible impact. If you're measuring week over week, you'll think nothing is working. We almost killed initiatives during this dip because we didn't see immediate results. You have to track leading indicators like time to merge and investment profile to bridge the gap to lagging indicators like throughput and cycle time.

The lesson: if you only look at one metric, you'll draw the wrong conclusion. Cycle Time flat? Looks like failure. But investment profile shifting to feature work? Completion trend up? Time to merge longer but reviews more thorough? That's a different story. The system is changing, and you need multiple views to understand what's actually happening.

I'm not entirely sure these are the right metrics yet. But they're the ones that seem to tell a coherent story.

The governance problem nobody talks about

Here's something I didn't anticipate: every agent you build has an identity. And that identity has access.

Our CS triage agent reads customer tickets and queries databases. Our Snyk Agent accesses the codebase and creates PRs. These are powerful capabilities running at machine speed.

The uncomfortable truth: we initially gave agents whatever access they needed to “just work” and figured out permissions later. That's backwards. We learned to treat agent credentials exactly like production service accounts. Least privilege, scoped access, fully auditable.

And because our CS triage agent ingests customer tickets, untrusted external input flows directly into our system. Customer tickets go straight into an agent with database query access. Prompt injection isn't theoretical for us. So we treat evals as security. They run on every prompt change, like unit tests. If the agent hallucinates or reveals internal logic, the build fails.

If your agents move faster than your security reviews, you're creating risk at machine speed.

What I've learned about leading through this

Find your champions. Mandates don't work. I've never seen an executive email change how engineers actually work. What worked for us: I shared my failures and wins in daily standups for seven months. Kyra watched me build the CS triage agent, then built a Snyk Security Bot on her own. Then Briana saw Kyra. Peer observation scales. Executive mandates don't.

This is going to be messy. SDLC might fundamentally change. Your teams need to be able to come to you and say “this isn't working” without fear. I lost 6 hours to compounding AI hallucinations because I was moving fast across multiple repos without committing incrementally. Small drifts compounded. The work was unsalvageable. I shared that failure publicly. When leaders go first, it creates permission for others to be honest.

Fundamentals matter more now. Head on a swivel. Watch your pipelines. Watch your tech stack. Version control matters more now, not less. Small PRs matter more. Frequent commits matter more. AI makes your good habits faster and your bad habits faster too.

Mob programming is back. I pair with my lead engineer every day. We brainstorm, record conversations, use that to generate specs and intent, build, then review together. Two engineers with different prompting styles produce better outcomes than either alone.

Metrics matter more than ever. But here's the thing: the metrics are going to be wrong at first. You don't know what to look at yet. Start with a few. Stick with them. Revisit every 3 months. That's how I'm doing it. Every quarter, I pick five metrics to watch, then reassess whether they told me what I needed to know.

Stop measuring “how many engineers have Copilot” or lines of code generated. Start measuring time to merge, investment profile (feature work vs maintenance), completion trends, and where the bottleneck is today. Because it will move.

Get closer to the customer. Here's the mindset shift: we're not solving code problems as much anymore. We're getting closer to solving business problems. So as engineering leaders, how do we break down traditional barriers? Get closer to what customers need. Cut out as much of the middle as you can. Solve for customers faster.

The mindset shift

I've been drilling this into my team: The change is coming. You’re going to be an AI engineer, whether you like it or not. This is an expectation, and it’s on leaders to support their teams and make the transition secure and measurable.

I've gotten much better feedback framing it that way than when I was just saying “hey, try this tool.” Once people understand this is the direction, not an experiment, they engage differently.

The warning

If you don't fix your bottlenecks now, things will break later. In six or seven months, when AI is everywhere and teams are shipping code at machine speed, the weak points will show up all at once. Pipelines will fail. Teams will get overwhelmed. Fixing problems one at a time won’t be an option anymore.

The time to find your bottlenecks is now, while you can still see them clearly.

The real question isn't “Are you using AI?” It's “How do you know AI is actually helping?”

I'm still not confident we're measuring the right things. Every quarter I pick five metrics to watch. Then I look back and ask whether they told me what I needed to know. So far, the answer has been “sort of.”

But at least we're asking.

I’m still figuring things out. If you want to see how we’re measuring this in Appfire Flow, we’d be happy to walk you through it.

Book your Appfire Flow demo

Surya Mereddy

Surya Mereddy is the Director of Engineering for Appfire’s Flow product, where he leads AI innovation, developer experience, and scalable systems for enterprise teams. He operates at the intersection of product vision and execution, building intelligent tools that make software delivery smarter and more reliable. Prior to Appfire, Surya held engineering leadership roles at Pluralsight (Flow) and served as a principal engineer at Acertara.