Four Dimensions of Agent-Ready Codebase Design


When an AI agent rewrites a file and the result doesn’t match your conventions, the first move is usually to adjust the prompt. Try different instructions. Add more context to the message. Maybe switch models.

The model is rarely the bottleneck. The codebase is.

The same model, pointed at a codebase with strong tests, clear architecture, and good documentation, produces remarkably consistent output. Point it at a codebase with weak coverage, no architecture docs, and no linting, and you get drift. Not because the model is less capable, but because it has less to work with.

I built the Codebase Readiness Assessment to make this measurable. It scores your repo across eight dimensions on a 0-100 scale. But you don’t need to run the assessment to understand what separates high-scoring codebases from low-scoring ones. Four dimensions account for most of the gap.

Test Foundation

Test foundation carries the most weight in the assessment (25%) because it’s the single biggest lever for agent output quality.

What a low score looks like

An agent makes a change. There are no tests covering that area, so it moves on. The change compiles, maybe even runs, but it broke an assumption three modules away. Nobody finds out until a human reviews the PR, or worse, until production.

I’ve seen this repeatedly: teams with 30-40% test coverage ask an agent to refactor a service object. The agent produces clean code that looks right. But there’s no spec for the edge case where a nil association triggers a downstream error. The agent had no way to catch it because there’s no test to fail.

The other failure mode is slow tests. If your suite takes 20 minutes, the agent can’t iterate. It makes a change, waits, discovers the failure, tries again, waits again. In a fast suite, that feedback cycle takes seconds. In a slow one, the agent burns time and money waiting for results.

What a high score looks like

Codebases that score well here share a few characteristics:

  • Coverage above 70% on critical paths. Not 100% everywhere, but thorough coverage on the code that matters: domain logic, service objects, API endpoints. The agent can make changes and get immediate confirmation that nothing broke.
  • Suite runs in under 5 minutes. Fast enough that the agent can run tests after every meaningful change, not just at the end.
  • Deterministic results. No flaky tests. When the suite says green, it means green. Agents can’t distinguish between a flaky failure and a real one, so flaky tests teach agents to ignore failures.
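One common source of flakiness is hidden time dependence. A sketch of the fix (names are illustrative): pass the clock in instead of reading Time.now inside the method, so a spec can pin it exactly:

```ruby
require "time"

# A trial-expiry check that takes its clock as an argument rather than
# calling Time.now internally — the spec can then fix time precisely.
def trial_expired?(started_at, now: Time.now)
  now - started_at > 14 * 24 * 60 * 60 # 14 days, in seconds
end

started = Time.parse("2024-01-01 00:00:00 UTC")
trial_expired?(started, now: Time.parse("2024-01-10 00:00:00 UTC")) # => false
trial_expired?(started, now: Time.parse("2024-01-20 00:00:00 UTC")) # => true
```

The same spec now produces the same answer on every run, which is exactly the property agents need from a green suite.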

Don’t stop at unit tests

Unit tests on service objects and models are the foundation, but they only verify isolated behavior. An agent that passes all unit tests can still break a user-facing workflow that spans multiple components.

End-to-end tests give agents confidence across entire flows. A system spec that signs a user in, submits a form, and checks the result tells the agent whether the feature works, not just whether a method returns the right value. This is especially valuable when agents make changes that touch controllers, views, and services in the same PR.

Here’s a simplified system spec from one of my Rails projects. It covers the core user journey: signing in and submitting a video idea for validation.

# spec/system/idea_submission_spec.rb

RSpec.describe "Idea submission" do
  it "allows a signed-in user to submit a video idea" do
    user = create(:user)

    sign_in_as(user, path: new_idea_path)

    select user.channels.first.name, from: "Channel"
    fill_in "Title", with: "Building a Rails AI Agent from Scratch"
    fill_in "Description", with: "Step-by-step tutorial on building an AI agent"
    fill_in "Category", with: "AI Coding"
    click_button "Validate Idea"

    expect(page).to have_content("Building a Rails AI Agent from Scratch")
  end
end

This test touches authentication, the form UI, the controller, the background job, and the results page. If an agent breaks any part of that chain, this spec catches it.

The tradeoff is speed. End-to-end tests are slower and more brittle than unit tests. You don’t need full E2E coverage, but having system specs on your critical user journeys (signup, checkout, the core action your product is built around) gives agents a safety net that unit tests alone can’t provide.

The smallest change that moves the needle

Add coverage to your critical paths first. Don’t chase a coverage number. Instead, identify the three or four service objects or domain models where bugs would hurt the most, and write specs for those. Then add one or two system specs covering your most important user journeys end-to-end.

If your suite is slow, add parallel test execution. In a Rails app, that might be as simple as adding the parallel_tests gem. A suite that goes from 15 minutes to 4 minutes fundamentally changes how an agent can work with your code. If you’re running multiple agents in parallel, you’ll also need database isolation per worktree to prevent test data collisions.
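For the parallelization step, the wiring is small. A sketch using the parallel_tests gem (the rake tasks below come from its README; your app may use bin/rails instead of rake):

```ruby
# Gemfile — parallel_tests only needs to exist in development and test
group :development, :test do
  gem "parallel_tests"
end

# Then, from a shell:
#   rake parallel:create    # one test database per CPU core
#   rake parallel:prepare   # load the schema into each database
#   rake parallel:spec      # run the RSpec suite across all cores
```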

If you want to accelerate the process, tools like autoresearch apply this pattern as an autonomous loop: give the agent a measurable goal (like a coverage target), and it iterates, verifies, keeps what works, and discards what doesn’t.

Documentation as Code

Documentation carries 15% of the assessment weight, but in practice it’s the dimension where I see the biggest gap between teams that get good agent output and teams that don’t.

What a low score looks like

Without an agent-facing entry point (a CLAUDE.md, AGENTS.md, or equivalent), an agent has to reverse-engineer your conventions from the code itself. It reads your files, infers patterns, and guesses at intent. Sometimes it guesses right. Often it doesn’t.

Here’s a concrete example. A Rails app uses service objects for all business logic. Controllers call a service, the service does the work, and the result gets rendered. There’s nothing enforcing this in the framework. It’s a team convention. An agent that doesn’t know about this convention puts the logic directly in the controller action. The code works. The tests pass. But it breaks the team’s pattern, and now there’s a 50-line controller action that should have been a service object.

The agent wasn’t wrong. It had no way to know.

What a high score looks like

The key insight is that this entry point file should be a map, not a manual. OpenAI’s Harness Engineering team learned this the hard way: they tried a single large instruction file and it failed because “context is a scarce resource” and “too much guidance becomes non-guidance.” When everything is marked important, agents pattern-match locally instead of navigating intentionally.

Their solution: keep the entry file short (roughly 100 lines) and treat it as a table of contents that points to deeper sources of truth in a structured docs/ directory. The entry file gives agents quick commands and a documentation map. The detail lives in dedicated files the agent reads when it needs them. Whether you call it CLAUDE.md, AGENTS.md, or CURSOR.md, the pattern is the same.

Here’s what this looks like in practice from one of my Rails projects:

## Quick Commands

bin/dev                                # Start dev server
bin/rails spec                         # All tests
bin/ci                                 # Full CI: lint + security + tests
bin/rubocop                            # Lint
bin/brakeman                           # Security scan

## Documentation Map

| Topic | Document |
|-------|----------|
| Stack, patterns, domain model | docs/ARCHITECTURE.md |
| Testing patterns and stack | docs/TESTING.md |
| Credentials, env vars, API keys | docs/CONFIGURATION.md |
| Engineering principles | docs/design-docs/core-beliefs.md |
| Architecture decision records | docs/design-docs/ |

The agent gets commands and a map up front. When it needs to understand the domain model or testing conventions, it follows the pointer. This is progressive disclosure: the agent starts with what it needs immediately and loads deeper context on demand.

Here’s a trimmed excerpt from the ARCHITECTURE.md behind that pointer:

## Domain Model

CreatorSignal validates YouTube video ideas. The core flow:

1. User submits a video **Idea**
2. A **Validation** job is enqueued
3. The **ResearchAgent** runs tools against YouTube, Reddit, X, and HN
4. Results are synthesized into a scored **Go / Refine / Kill** verdict

### Key Models

| Model | Responsibility |
|-------|---------------|
| `User` | Authentication, subscription plan |
| `Idea` | A video idea submitted for validation |
| `Validation` | One run of the research agent against an idea |

### Project Structure

app/
├── components/       # ViewComponent components
├── controllers/
├── jobs/             # ActiveJob jobs (async validation)
├── models/
├── services/         # Research agent, tool orchestration
└── views/            # Hotwire (Turbo frames/streams)

An agent reading this knows what an Idea is, that validation is async through a job, and that orchestration logic lives in app/services/. Those are the conventions that prevent drift.

ADRs (Architecture Decision Records) add a layer that documentation alone can’t. An agent that understands why a particular pattern was chosen can make better decisions when extending it. If your ADR says “we chose event sourcing for the billing domain because of auditability requirements,” the agent won’t try to refactor billing into simple CRUD.
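An ADR doesn’t need to be elaborate. Here’s a sketch in the common one-page format (title, status, context, decision, consequences), using the billing example above — the file name and numbering are illustrative:

```
docs/design-docs/0007-event-sourcing-for-billing.md

# 7. Event sourcing for the billing domain

Status: Accepted

## Context
Billing changes must be fully auditable: we need to reconstruct any
customer's balance at any point in time.

## Decision
Billing state is stored as an append-only event log; current state is
a projection over the events.

## Consequences
Reads go through projections. Do not refactor billing into simple CRUD;
the audit requirement rules it out.
```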

The smallest change that moves the needle

Create an AGENTS.md in your project root with two things: commands (build, test, lint) and a documentation map pointing to deeper files. AGENTS.md is an emerging standard supported by Codex, Cursor, Gemini CLI, GitHub Copilot, Windsurf, Devin, and many others. If you’re using Claude Code, symlink CLAUDE.md to it so both resolve to the same file. Then create an ARCHITECTURE.md covering your stack, domain model, and key conventions. This can take an hour and the effect on agent output is immediate. If you want to automate the scaffolding, the agent-ready plugin generates a starting point based on your existing codebase.

Architecture Clarity

Architecture clarity carries 15% of the assessment weight. It measures whether an agent can understand where code belongs and how components relate to each other.

What a low score looks like

Agents replicate patterns they find in the codebase. If your codebase has clear boundaries (controllers handle HTTP, services handle business logic, models handle persistence), the agent follows those boundaries. If your codebase mixes concerns, the agent mixes concerns.

The most common failure I see: a controller that does everything. It validates input, calls the database, sends emails, enqueues jobs. An agent asked to add a new feature looks at the existing controller, sees that’s where logic goes, and adds more logic to the controller. The agent is doing exactly what the codebase taught it to do.

The subtler version is dependency direction. In a well-layered app, dependencies point inward: controllers depend on services, services depend on models. When that direction is inconsistent (models importing from controllers, services reaching into HTTP request objects), agents produce code with the same tangled dependencies.

What a high score looks like

  • Clear layering. Each layer has a single responsibility, and the codebase is consistent about which layer owns what.
  • Domain namespacing. Related functionality is grouped by business domain, not just by technical layer. Instead of a flat app/services/ with 40 files, you have app/services/billing/, app/services/onboarding/, app/services/research/. When an agent needs to add billing logic, the namespace tells it exactly where to look and what patterns to follow.
  • Predictable file organization. A new developer (or agent) can guess where a piece of code lives based on what it does.
  • Dependency direction is consistent. Inner layers don’t reach outward. You don’t see models importing controller concerns.

Domain namespacing is especially powerful for agents because it constrains the search space. An agent working on a billing feature only needs to understand the billing namespace, not the entire codebase. It finds the existing patterns in that namespace and replicates them. Without namespacing, the agent has to scan the whole codebase to figure out where billing logic lives, and it might find three different patterns in three different places.

The smallest change that moves the needle

If you have fat controllers, extract one. Pick your most complex controller action, pull the business logic into a service object, and write a spec for it. The agent will start using that service object pattern for new features. One well-structured example teaches the agent more than any documentation, because it’s a pattern it can directly replicate.

If your codebase has grown past a handful of services, start namespacing by domain. Group related services, jobs, and models under a shared namespace. This compounds quickly: once you have three or four service objects under Billing::, agents start producing new billing code in the same namespace by default. The codebase becomes self-reinforcing.
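As a sketch of both moves (all names are illustrative, and a plain hash stands in for the ActiveRecord model), the extracted logic becomes a namespaced plain-Ruby object with one public method:

```ruby
# app/services/orders/cancel_order.rb — illustrative extraction of a fat
# controller action into a domain-namespaced service object.
module Orders
  class CancelOrder
    Result = Struct.new(:success, :error, keyword_init: true)

    def initialize(order)
      @order = order # a plain hash here; an ActiveRecord model in a real app
    end

    # All the business logic the controller used to own.
    def call
      return Result.new(success: false, error: "already shipped") if @order[:shipped]

      @order[:status] = "cancelled"
      Result.new(success: true)
    end
  end
end

# The controller action shrinks to a call and a render decision:
#   result = Orders::CancelOrder.new(order).call
#   result.success ? redirect_to(order) : render_error(result.error)
```

One spec against call now pins the behavior the controller used to hide, and the Orders:: namespace tells the next agent where new order logic belongs.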

Feedback Loops

Feedback loops carry 10% of the assessment weight, but their impact is multiplicative. Good feedback loops make everything else work better. Poor ones make everything else work worse.

What a low score looks like

Agents learn from the signals they get back. When the only signal is “tests passed,” the agent has no way to know it introduced a style violation, broke a naming convention, or used a deprecated API. It moves on, confident the change is correct.

Two things make feedback loops weak: narrow signals and slow signals.

Narrow signals mean the agent only hears from one source. Tests tell the agent whether the code works. They don’t tell it whether the code follows your conventions, whether it introduced a security vulnerability, or whether the UI actually renders correctly. Each missing signal is a category of problems the agent can’t self-correct.

Slow signals are just as damaging. If the agent has to wait 20 minutes for a CI run to discover a linting error, it’s already moved on. It’s built three more features on top of code that doesn’t pass lint. Now you’re unwinding multiple changes instead of catching the first one. The closer the feedback is to the moment of the change, the cheaper it is to fix.

There’s also a hierarchy to how you enforce conventions. Anything that can be checked deterministically by a linter should be a lint rule, not a line in your CLAUDE.md. A lint rule catches every violation, every time. A documentation rule depends on the agent reading it and choosing to follow it. If your convention is “methods must be under 20 lines” or “always use frozen_string_literal,” encode it in RuboCop, ESLint, or whatever linter your stack uses. Save documentation for the things that can’t be mechanically enforced: architectural decisions, domain context, workflow conventions.
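As an example, the two conventions above map directly onto existing RuboCop cops — a fragment of .rubocop.yml (the limits mirror the article’s examples):

```yaml
# .rubocop.yml — mechanical conventions live here, not in CLAUDE.md
Metrics/MethodLength:
  Max: 20            # "methods must be under 20 lines"

Style/FrozenStringLiteralComment:
  Enabled: true      # "always use frozen_string_literal"
```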

What a high score looks like

  • Pre-commit hooks for immediate feedback. The agent discovers formatting issues, type errors, or lint violations before it even commits.
  • CI that runs in under 10 minutes. Fast enough that the agent can push, get feedback, and iterate without burning excessive context.
  • Rich error messages. Linting output that says “method too long (25 lines, max 20)” is actionable. A generic “style violation” is not.

Here’s what a CI script looks like when it goes beyond just running tests. This is the bin/ci from the same Rails project, trimmed here to the non-test steps:

# config/ci.rb - run with bin/ci

CI.run do
  step "Setup", "bin/setup --skip-server"
  step "Style: Ruby", "bin/rubocop"
  step "Security: Gem audit", "bin/bundler-audit"
  step "Security: Importmap vulnerability audit", "bin/importmap audit"
  step "Security: Brakeman code analysis", "bin/brakeman --quiet --no-pager --exit-on-warn --exit-on-error"
end

Five steps, each giving the agent a different kind of feedback. RuboCop catches style violations. Bundler-audit and the importmap audit catch vulnerable dependencies. Brakeman catches security issues in the code itself. An agent that runs bin/ci gets five signals instead of one.

Browser access as a feedback loop

For web applications, there’s a feedback loop that most teams overlook: giving agents the ability to see what they built.

An agent that can only run tests is working blind on anything visual. It can verify that a controller returns 200, but it can’t tell whether the page actually renders correctly, whether a modal opens, or whether a form submits without errors. Cursor’s team wrote about this: once they gave agents browser access via cloud sandboxes, agents could “iterate until they’ve validated their output rather than handing off the first attempt.” More than 30% of their merged PRs are now created by agents operating autonomously in cloud sandboxes.

You don’t need a full cloud sandbox to get value from this. Claude Code has built-in Chrome support via claude --chrome, and tools like Playwright MCP give agents browser control locally. The agent can navigate to a page, take a snapshot of the DOM, fill in a form, and verify the result. That’s a feedback loop that catches an entire class of issues that unit tests and linters never will.

The smallest change that moves the needle

Add a linter to your CI pipeline. For a Ruby project, that’s RuboCop. For JavaScript/TypeScript, ESLint. For Python, Ruff. One config file, one CI step. The agent immediately starts getting feedback on style and conventions that it wouldn’t otherwise know about.

If you want faster feedback, add pre-commit hooks. The agent runs into the linter before it even pushes, which means it fixes issues in the same context window where it created them. That’s cheaper, faster, and produces cleaner commits.
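A pre-commit hook can be as small as a script that lints only the staged files. A Ruby sketch (it assumes bin/rubocop exists, as in the bin/ci example above; drop it into .git/hooks/pre-commit and make it executable):

```ruby
#!/usr/bin/env ruby
# .git/hooks/pre-commit — minimal sketch. Lints only the staged Ruby
# files so the agent gets feedback before the commit lands.

def staged_ruby_files(diff_output)
  diff_output.split("\n").select { |f| f.end_with?(".rb") }
end

if $PROGRAM_NAME.end_with?("pre-commit")
  files = staged_ruby_files(`git diff --cached --name-only --diff-filter=ACM`)
  exit 0 if files.empty?
  # --force-exclusion keeps files excluded by .rubocop.yml out of the run
  system("bin/rubocop", "--force-exclusion", *files) or exit 1
end
```

After chmod +x .git/hooks/pre-commit, any commit touching a Ruby file with an offense is blocked, and the agent fixes it in the same context window where it was introduced.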

For web projects, consider adding browser access through Playwright MCP or a similar tool. The agent starts verifying its own UI changes instead of relying on you to catch visual issues in review.

Where to Start

If you’re looking at your codebase and wondering where to start, here’s how I think about prioritization:

  1. Fix your test foundation first. Without reliable tests, every other improvement is hard to verify. An agent can’t confidently refactor your architecture if there’s no test suite to catch regressions.
  2. Add an AGENTS.md. This is 30 minutes of work that immediately changes agent behavior. It’s the highest-ROI improvement you can make.
  3. Add a linter to CI. This closes the feedback gap with minimal effort. The agent starts learning your conventions from automated feedback instead of guessing from code patterns.

These three changes don’t require a major initiative. They’re individual tasks that compound. A codebase with strong tests, clear documentation, and fast feedback loops creates a reinforcing cycle: agents produce better code, which maintains the patterns, which makes future agent output even better.

If you want to see where your codebase stands across all eight dimensions, run the Codebase Readiness Assessment. It takes 60 seconds and gives you a score, a per-dimension breakdown, and a prioritized roadmap.

If your team wants hands-on help closing these gaps, that’s what the AI Workflow Enablement program is built for. Or if you just want to talk through your results, book a free intro call.
