<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Team-Structure on Damian Galarza | Software Engineering &amp; AI Consulting</title><link>https://www.damiangalarza.com/tags/team-structure/</link><description>Recent posts from Damian Galarza | Software Engineering &amp; AI Consulting</description><generator>Hugo</generator><language>en-us</language><managingEditor>Damian Galarza</managingEditor><atom:link href="https://www.damiangalarza.com/tags/team-structure/feed.xml" rel="self" type="application/rss+xml"/><item><title>Your AI Team Doesn't Need More People — It Needs Agents</title><link>https://www.damiangalarza.com/posts/2026-05-13-your-ai-team-doesnt-need-more-people-it-needs-agents/</link><pubDate>Wed, 13 May 2026 00:00:00 -0400</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-05-13-your-ai-team-doesnt-need-more-people-it-needs-agents/</guid><description>What 'supported by a fleet of agents' means in practice: which tasks automate, which don't, and where the ROI breaks down. Evidence from Stripe, Coinbase, Ramp, and Shopify.</description><content:encoded><![CDATA[<p>Stripe merges 1,300 agent-written PRs per week. Every one is human-reviewed. None contain human-written code.</p>
<p>That number showed up in <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents-part-2">Stripe&rsquo;s engineering blog</a> in March 2026, and it has been circulating in boardroom conversations ever since. Not because it&rsquo;s impressive as a volume metric, but because of what it implies about the relationship between team size and engineering output. If a fleet of agents can produce that volume of shippable work, and humans are reviewing rather than writing, the old model of scaling output by scaling headcount starts to look like the wrong optimization.</p>
<p>This is already reshaping how engineering leaders think about team structure. Leigh Newsome at <a href="https://hoolahoop.io/articles/cto-coaching/ai-native-team-topology/">HoolaHoop</a> captures the pattern emerging across mid-to-late-stage companies: three to five senior engineers, supported by agents, matching or exceeding what eight to twelve person teams shipped a year ago. It&rsquo;s showing up in the organizations they coach, not just the headline reference companies.</p>
<p>Coinbase is taking it further. Brian Armstrong&rsquo;s <a href="https://x.com/brian_armstrong/status/2051616759145185723">May 2026 restructuring announcement</a> — cutting fourteen percent of the workforce amid a crypto downturn — reorganizes the company around what he calls &ldquo;AI-native pods.&rdquo; Small teams where engineers, designers, and product managers collapse into single roles, managing fleets of agents. No pure managers. Five layers max below CEO/COO. &ldquo;One person teams&rdquo; as an explicit experiment. The framing is blunt: &ldquo;rebuilding Coinbase as an intelligence, with humans around the edge aligning it.&rdquo; You can debate whether the layoff is cost-cutting dressed as innovation or genuine structural change. But the organizational thesis — that small, high-context teams with agent leverage replace larger ones — is the same pattern showing up everywhere else.</p>
<p>The question most engineering leaders are asking right now isn&rsquo;t whether this shift is real. It&rsquo;s what &ldquo;supported by a fleet of agents&rdquo; actually means in practice.</p>
<h2 id="why-the-old-math-broke">Why the old math broke</h2>
<p>For twenty years, engineering scale was synonymous with headcount. Need more output? Hire more engineers. Want to ship faster? Add a team. The org chart grew, the roadmap grew, and the relationship between the two felt roughly linear.</p>
<p>That linear relationship depended on an assumption: execution required human hands. Writing code, running tests, fixing lint errors, updating documentation, opening PRs. Each of these was a unit of labor that only a person could perform. So more labor required more people.</p>
<p>Agents broke that assumption. Not for all tasks, but for enough of them to change the math.</p>
<p>The shift isn&rsquo;t about replacing engineers. It&rsquo;s about changing what makes an engineering team productive. A senior engineer who can spec work clearly, verify agent output efficiently, and maintain the infrastructure that agents run on produces more than a senior engineer writing all the code themselves. Add two more seniors with the same skills, plus a fleet of agents executing the predictable work, and you have a small team with the output of a much larger one.</p>
<p>This is a unit-economics revision, not a layoff strategy. The teams getting this right are redirecting budget from planned mid-level hires toward senior talent and agent infrastructure. The ones getting it wrong are either cutting headcount and hoping agents fill the gap (they won&rsquo;t) or ignoring the shift entirely and wondering why smaller teams elsewhere are shipping faster.</p>
<h2 id="what-plus-agents-means-in-practice">What &ldquo;plus agents&rdquo; means in practice</h2>
<p>The phrase &ldquo;supported by agents&rdquo; is easy to say and hard to make concrete. Here&rsquo;s how I think about it: some tasks automate well with current agent capabilities, and some don&rsquo;t. Knowing the difference is the entire game.</p>
<h3 id="tasks-that-automate-well">Tasks that automate well</h3>
<p><strong>Boilerplate and scaffolding.</strong> Generating new files, wiring up routes, creating database migrations, adding API endpoints that follow existing patterns. Agents handle this reliably because the work is repetitive and the success criteria are structural rather than a matter of judgment.</p>
<p><strong>Test generation.</strong> Writing unit tests for existing code, especially when the code follows clear patterns. Agents are surprisingly good at identifying edge cases when given a well-defined function signature and existing test examples to follow.</p>
<p><strong>Refactoring.</strong> Renaming across a codebase, extracting modules, updating import paths, migrating from one API version to another. Deterministic transformations with verifiable outcomes.</p>
<p><strong>Lint, formatting, and CI fixes.</strong> This is where Stripe&rsquo;s Blueprint pattern shines. They run lint as a deterministic node in their agent state machine before pushing to CI. The agent doesn&rsquo;t need judgment here, just compliance; a sketch of the pattern follows this list.</p>
<p><strong>Documentation updates.</strong> Syncing docs with code changes, updating README files, regenerating API references. The source of truth (the code) already exists; the agent just needs to reflect it.</p>
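<p>Here&rsquo;s a minimal sketch of that deterministic-node idea. It&rsquo;s my illustration of the pattern, not Stripe&rsquo;s actual Blueprint code, and <code>ruff</code> stands in for whatever linter your repo already runs.</p>
<pre><code class="language-python">import subprocess
from enum import Enum, auto

class Node(Enum):
    GENERATE = auto()  # model proposes a change (non-deterministic)
    LINT = auto()      # fixed tooling runs; no model involvement
    PUSH = auto()      # hand off to CI

def lint_node() -> Node:
    """Deterministic node: auto-fix, then verify, before anything hits CI.

    The agent never reasons about style. The tool either brings the
    change into compliance or routes the pipeline back to the model.
    """
    subprocess.run(["ruff", "check", "--fix", "."])  # fix what the tool can
    result = subprocess.run(["ruff", "check", "."])  # verify compliance
    return Node.PUSH if result.returncode == 0 else Node.GENERATE
</code></pre>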
<h3 id="tasks-that-dont-automate-well">Tasks that don&rsquo;t automate well</h3>
<p><strong>Architecture decisions.</strong> Which database to use, how to structure service boundaries, when to split a monolith. These require context that lives in organizational history, business constraints, and operational experience. No amount of context window will replace knowing what broke in production last quarter.</p>
<p><strong>Product judgment.</strong> Deciding what to build, what to cut, what to defer. Agents can execute against a spec, but they can&rsquo;t decide whether the spec solves the right problem.</p>
<p><strong>Spec authorship.</strong> Writing the acceptance criteria that agents execute against. This is the inverse of what you might expect. Vague specs produce vague work, fast. The cost of ambiguity in product input is amplified by agents, not absorbed by mid-level engineers the way it used to be. Spec quality is now directly proportional to output quality.</p>
<p><strong>Verification.</strong> Confirming that agent-generated code actually solves the problem it was supposed to solve. This is the new bottleneck, and it deserves its own section.</p>
<h2 id="where-the-roi-breaks-down">Where the ROI breaks down</h2>
<p>Agent leverage is real, but it compounds unevenly. Three areas determine whether you&rsquo;re getting actual value or just generating volume.</p>
<h3 id="the-verification-bottleneck">The verification bottleneck</h3>
<p>Agent output is fast but inconsistent. A fleet of agents can generate fifty PRs in an afternoon. If each one takes a senior engineer thirty minutes to verify, you&rsquo;ve consumed twenty-five hours of your most expensive engineering time reviewing code instead of designing systems.</p>
<p>This is where new metrics start to matter. <a href="https://optimumpartners.com/insight/engineering-management-2026-how-to-structure-an-ai-native-team/">Optimum Partners</a> proposes three that I think are worth tracking:</p>
<p><strong>MTTV (Mean Time to Verification)</strong> measures how long it takes to confirm that agent output is actually correct and ready to ship. This replaces &ldquo;code complete&rdquo; as the meaningful quality milestone. If MTTV is high, you&rsquo;re not getting leverage even when volume looks impressive.</p>
<p><strong>AI-CFR (AI Change Failure Rate)</strong> tracks the percentage of agent-generated changes that fail in production or require post-merge rework. Track it separately from your human change failure rate: agent-generated code fails in different ways than human-written code, and combining the two metrics hides the problem.</p>
<p><strong>Interaction Churn</strong> measures how many back-and-forths a human has with an agent before reaching a useful result. High churn signals weak specs, poor context engineering, or the wrong agent for the task. It&rsquo;s the leading indicator of whether your AI investment is actually compounding.</p>
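<p>A toy computation makes all three concrete. The record shape below is an assumption on my part, field names included, not any real tool&rsquo;s export format; the point is that each metric needs only data you can capture on every agent-generated PR.</p>
<pre><code class="language-python">from dataclasses import dataclass

@dataclass
class AgentPR:
    opened_at: float          # unix timestamp when the agent opened the PR
    verified_at: float        # when a human confirmed it was correct
    failed_after_merge: bool  # production failure or post-merge rework
    human_turns: int          # back-and-forths before a useful result

def mttv_hours(prs: list[AgentPR]) -> float:
    """Mean Time to Verification, in hours."""
    return sum(p.verified_at - p.opened_at for p in prs) / len(prs) / 3600

def ai_cfr(prs: list[AgentPR]) -> float:
    """AI Change Failure Rate: fraction of agent changes that failed."""
    return sum(p.failed_after_merge for p in prs) / len(prs)

def interaction_churn(prs: list[AgentPR]) -> float:
    """Average human-agent round trips per shipped result."""
    return sum(p.human_turns for p in prs) / len(prs)
</code></pre>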
<p>The uncomfortable implication: if your engineering review next quarter uses none of these metrics, you&rsquo;re running an AI-curious team, not an AI-native one.</p>
<h3 id="infrastructure-quality-compounding">Infrastructure quality compounding</h3>
<p>Here&rsquo;s something that gets missed in most conversations about agent teams. Stripe didn&rsquo;t become good at agents overnight. They spent years investing in developer experience. Their development environments spin up in ten seconds. Their MCP Toolshed contains <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents-part-2">nearly 500 tools</a> in a centralized registry. Their rule files are scoped to subdirectories rather than global, because global rules in large repos waste the agent&rsquo;s context window before the agent starts working.</p>
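<p>That scoping detail is easy to replicate. Here&rsquo;s a sketch of what directory-scoped rule loading might look like; the <code>RULES.md</code> filename is my placeholder, not Stripe&rsquo;s convention.</p>
<pre><code class="language-python">from pathlib import Path

def collect_rule_files(repo_root: Path, working_dir: Path,
                       name: str = "RULES.md") -> list[Path]:
    """Gather only the rule files on the path to the agent's working dir.

    A single global rules file in a large repo burns context window on
    rules that don't apply. Walking root-to-leaf picks up general rules
    first, then increasingly specific ones, and skips everything else.
    """
    rules = []
    current = repo_root
    for part in working_dir.relative_to(repo_root).parts:
        if (current / name).is_file():
            rules.append(current / name)
        current = current / part
    if (current / name).is_file():
        rules.append(current / name)
    return rules
</code></pre>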
<p>The insight from their engineering team is deceptively simple: <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents-part-2">&ldquo;What&rsquo;s good for humans is good for agents.&rdquo;</a> Every investment they made in human developer productivity — fast feedback loops, clean tooling, isolated environments — returned dividends when agents started running the same tooling.</p>
<p>This cuts both ways. Teams that cut corners on developer experience are paying for those shortcuts at ten times the speed, because agents surface them constantly. If your codebase is messy, agents will reproduce that mess faster than any human could. If your CI pipeline is flaky, agents will burn cycles retrying it. If your test coverage is thin, agent-generated code will ship without guardrails.</p>
<p>Infrastructure quality isn&rsquo;t just a nice-to-have in the agent era. It&rsquo;s a force multiplier. The gap between well-maintained and poorly-maintained codebases is widening faster than ever.</p>
<h3 id="spec-precision">Spec precision</h3>
<p>Ramp&rsquo;s story adds a different dimension. Their AI usage grew 6,300% year-over-year. 99.5% of their team is active on AI tools. 84% use coding agents weekly. And here&rsquo;s the number that matters most: non-engineers now account for <a href="https://x.com/geoffintech/status/2042002590758572377">12% of all human-initiated PRs on Ramp&rsquo;s production codebase</a>, thousands per month.</p>
<p>They got the org design wrong before they got it right. The initial instinct was to centralize: one small team builds tools for the whole company. Demand outstripped capacity immediately. Then they swung decentralized. Every team builds its own solutions. Redundant re-learning everywhere.</p>
<p>The answer was hub-and-spoke. A small central team builds the platforms, connectors, and infrastructure. Functional teams build solutions on top and feed requirements back. A risk analyst automated sixteen hours per month of manual modeling. A sales ops lead replaced a spreadsheet-based comp model across three orgs in forty-eight hours. An L&amp;D lead built a training simulator in fifteen minutes. None of them were engineers.</p>
<p>This works because the central team invested in spec precision. Their internal AI workspace, <a href="https://x.com/eglyman/status/2043362828178841860">Glass</a>, auto-configures on install, connecting thirty-plus tools through single sign-on. No setup guide. No IT ticket. When the barrier between &ldquo;I have a problem&rdquo; and &ldquo;I can spec a solution&rdquo; drops to near zero, the bottleneck shifts from execution to specification quality.</p>
<p>For agent fleets specifically, the same principle applies. Centralize the infrastructure, decentralize the application. And invest relentlessly in making specs clear enough that agents can execute against them without ambiguity.</p>
<h2 id="juniors-arent-going-away">Juniors aren&rsquo;t going away</h2>
<p>The surface-level reading of this story says junior engineers are obsolete. The data contradicts it.</p>
<p>Shopify is publicly expanding its junior pipeline, going from <a href="https://coderpad.io/blog/hiring-developers/in-the-ai-era-shopify-is-investing-in-junior-engineers-not-cutting-them/">roughly a hundred interns to over a thousand a year</a>, while reporting around a <a href="https://www.bvp.com/atlas/inside-shopifys-ai-first-engineering-playbook">twenty percent productivity lift</a> from AI tooling. That&rsquo;s not a company that thinks juniors are dead weight.</p>
<p>What&rsquo;s happening is subtler. Junior engineering work is being repointed. Instead of writing boilerplate code that agents now handle, juniors are being directed toward verification, spec authoring, and the kind of detail-oriented review work that AI-native teams need more of. The AI Reliability Engineer role that&rsquo;s emerging in some organizations is essentially a formalization of this shift: someone who validates agent-generated output and designs the verification systems that make AI work safely shippable.</p>
<p>Entry-level engineering isn&rsquo;t disappearing. The definition of entry-level work is changing. The junior who can verify agent output, write precise acceptance criteria, and identify when an agent&rsquo;s solution is structurally correct but semantically wrong is exactly the junior AI-native teams need.</p>
<h2 id="where-to-start">Where to start</h2>
<p>If you&rsquo;re leading an engineering team and this resonates, here&rsquo;s what I&rsquo;d do first.</p>
<p><strong>Audit your developer experience.</strong> Before adding agents to your workflow, look at what they&rsquo;ll inherit. Is your CI fast? Are your test environments isolated? Is your documentation current? Agents amplify whatever they find. Make sure what they find is solid.</p>
<p><strong>Measure verification, not velocity.</strong> Track the three metrics from the verification section: MTTV, AI-CFR, and Interaction Churn. These tell you whether you&rsquo;re getting leverage or just generating volume.</p>
<p><strong>Automate the predictable, keep humans on the judgment.</strong> Map your team&rsquo;s work to the two lists above. If a task is repetitive, structurally verifiable, and follows existing patterns, it&rsquo;s a candidate for agent automation. If it requires organizational context, product judgment, or architectural reasoning, it stays with humans.</p>
<p><strong>Invest in spec quality.</strong> The biggest single improvement most teams can make is writing clearer specs. When agents execute against precise acceptance criteria, the output improves dramatically. When they execute against vague user stories, you get technically-shipped work that doesn&rsquo;t solve the problem. A sketch of what a precise spec can look like follows this list.</p>
<p><strong>Stop hiring against pre-agent plans.</strong> If the headcount plan you presented to the board was built before agents changed the unit economics, revisit it. The next ten engineers you would have hired might be better allocated as three senior engineers and an investment in agent infrastructure. Frame it as a unit-economics revision, not a hiring miss.</p>
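<p>To make the spec point concrete, here&rsquo;s one hypothetical shape for an agent-executable spec. The structure and field names are mine, not any standard; what matters is that every acceptance criterion is mechanically checkable.</p>
<pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    goal: str
    acceptance_criteria: list[str]  # each one mechanically checkable
    out_of_scope: list[str] = field(default_factory=list)

spec = AgentSpec(
    goal="Add cursor pagination to GET /invoices",
    acceptance_criteria=[
        "Response includes next_cursor whenever more rows remain",
        "limit defaults to 50 and is capped at 200",
        "Requests with no pagination params behave exactly as before",
    ],
    out_of_scope=["UI changes", "backfilling historical invoices"],
)
</code></pre>
<p>The vague version of the same request (&ldquo;users should be able to page through invoices&rdquo;) leaves every one of those decisions to the agent, and you pay for each one in interaction churn.</p>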
<p>None of this is theoretical. The companies producing evidence (Stripe&rsquo;s 1,300 PRs per week, Coinbase restructuring around AI-native pods, Ramp&rsquo;s 12% non-engineer PRs, Shopify&rsquo;s expanded junior pipeline) are operating at a different leverage ratio than teams still sizing by headcount alone. The gap is measurable and it&rsquo;s growing.</p>
<h2 id="working-with-me">Working with me</h2>
<p>I help founders and engineering teams navigate the shift where the old headcount math stops making sense. We figure out where agents fit, where humans still need to own judgment, and what your team has to change to ship reliably. If that conversation would be useful, <a href="/services/ai-engineering/">let&rsquo;s talk</a>.</p>
<h2 id="further-reading">Further reading</h2>
<ul>
<li><a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents-part-2">Minions: Stripe&rsquo;s one-shot, end-to-end coding agents, Part 2</a> (Stripe Dev Blog)</li>
<li><a href="https://x.com/geoffintech/status/2042002590758572377">How to get your company AI pilled</a> (Geoff Charles, Ramp)</li>
<li><a href="https://hoolahoop.io/articles/cto-coaching/ai-native-team-topology/">AI-Native Team Topology</a> (HoolaHoop, Leigh Newsome)</li>
<li><a href="https://x.com/brian_armstrong/status/2051616759145185723">Coinbase restructuring memo</a> (Brian Armstrong)</li>
<li><a href="https://x.com/eglyman/status/2043362828178841860">Glass announcement</a> (Eric Glyman, Ramp)</li>
<li><a href="https://www.bvp.com/atlas/inside-shopifys-ai-first-engineering-playbook">Inside Shopify&rsquo;s AI-First Engineering Playbook</a> (Bessemer Venture Partners)</li>
</ul>
]]></content:encoded></item></channel></rss>