<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Human-in-the-Loop on Damian Galarza | Software Engineering &amp; AI Consulting</title><link>https://www.damiangalarza.com/tags/human-in-the-loop/</link><description>Recent posts from Damian Galarza | Software Engineering &amp; AI Consulting</description><generator>Hugo</generator><language>en-us</language><managingEditor>Damian Galarza</managingEditor><atom:link href="https://www.damiangalarza.com/tags/human-in-the-loop/feed.xml" rel="self" type="application/rss+xml"/><item><title>Governing AI Agents Without Killing Them: What Actually Works in Production</title><link>https://www.damiangalarza.com/posts/2026-04-22-governing-ai-agents-without-killing-them/</link><pubDate>Wed, 22 Apr 2026 00:00:00 -0400</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-04-22-governing-ai-agents-without-killing-them/</guid><description>Most AI agent governance advice targets boards, not builders. Three failure patterns, real TypeScript examples, and what a CTO should do Monday morning.</description><content:encoded><![CDATA[<p><a href="https://hoolahoop.io/articles/cto-coaching/agentic-ai-governance/">Agentic AI governance for CTOs</a> argues governance needs to come before deployment, not after. The strategic frame is right about what&rsquo;s at stake: organizational accountability, observability, and tool access. But the solutions assume organizational machinery most early-stage teams don&rsquo;t have. A two-person startup running a multi-agent system doesn&rsquo;t need a RACI. It needs a guardrail processor that fails loudly. Leigh, the piece&rsquo;s author, names the tension directly: overly restrictive governance drives experimentation underground. Governance that lives in code resolves it — lightweight enough for a seed-stage team, enforceable enough for a regulator.</p>
<p>The piece covers six governance gaps. This post is about three where code-level enforcement most obviously beats policy — tool access, observability, and human-in-the-loop. Cost visibility, shadow AI, and accountability chains are real concerns that deserve their own treatment.</p>
<p>I&rsquo;ve spent the last several months building <a href="/posts/2026-03-06-build-personal-ai-assistant/">a multi-agent AI assistant</a> that runs my consulting business: CRM, email, calendar, invoicing, content pipeline, Slack across two workspaces. Before that, years building software in regulated healthcare, including work on 510(k)-cleared medical device software where every system decision needed an audit trail. &ldquo;We&rsquo;ll add logging later&rdquo; was never an acceptable answer when a regulator could ask to reconstruct any action the system took. That mindset shapes how I think about agent governance. The three patterns below are ones I&rsquo;ve either hit in production or narrowly avoided. Each is a place where code-level governance beats policy-level governance for a team that can&rsquo;t afford a review board.</p>
<h2 id="tool-sprawl-widens-your-blast-radius">Tool Sprawl Widens Your Blast Radius</h2>
<p>MCP server sprawl is named in the original frame as a source of expanded blast radius. The same governance principle lives one layer down, inside the agent&rsquo;s tool definition: every tool an agent can access is a tool it could misuse. An agent with access to email, calendar, invoicing, CRM, and file operations has a blast radius that spans your entire business. A single prompt injection or hallucination can reach tools the agent should never touch. The principle is least privilege at the agent level, not the system level. Each agent should have access to exactly the tools it needs for its role, and nothing else.</p>
<p>What makes this worse is that tool sprawl also degrades the agent&rsquo;s ability to do its job. An agent with 40 tools when it regularly uses 8 faces two compounding problems. Every tool definition consumes context window tokens, space the model can&rsquo;t use for reasoning about the actual task. And the model has to select the right tool from a larger set, which increases the odds of misselection. I&rsquo;ve watched agents pick a vaguely similar tool over the correct one because the tool list was too long for the model to evaluate carefully. The governance risk and the performance cost come from the same root cause: too many tools in one agent&rsquo;s definition.</p>
<p>In my system, I run a multi-agent architecture where a supervisor delegates to domain-specific agents. I built the first version with a supervisor that had access to everything — why not let it figure out what to use? It worked in demos. In production, both problems showed up immediately: the supervisor&rsquo;s blast radius spanned the entire system, and the model wasted reasoning capacity navigating tools it didn&rsquo;t need.</p>
<p>Here&rsquo;s how I structure it instead. Each agent gets a scoped tool set:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// Each agent declares only the tools it needs
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">const</span> relayAgent <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#cba6f7">new</span> Agent({
</span></span><span style="display:flex;"><span>  name<span style="color:#89dceb;font-weight:bold">:</span> <span style="color:#a6e3a1">&#34;relay&#34;</span>,
</span></span><span style="display:flex;"><span>  instructions: <span style="color:#f38ba8">relayInstructions</span>,
</span></span><span style="display:flex;"><span>  model: <span style="color:#f38ba8">LOCAL_MODEL_LARGE_THINKING</span>,
</span></span><span style="display:flex;"><span>  tools<span style="color:#89dceb;font-weight:bold">:</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#6c7086;font-style:italic">// Email tools only - no CRM, no calendar, no invoicing
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>    scanInbox,
</span></span><span style="display:flex;"><span>    readEmail,
</span></span><span style="display:flex;"><span>    readEmailThread,
</span></span><span style="display:flex;"><span>    labelEmail,
</span></span><span style="display:flex;"><span>    archiveEmail,
</span></span><span style="display:flex;"><span>    composeEmail,
</span></span><span style="display:flex;"><span>    draftEmail,
</span></span><span style="display:flex;"><span>    replyToEmail,
</span></span><span style="display:flex;"><span>  },
</span></span><span style="display:flex;"><span>});
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">const</span> tempoAgent <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#cba6f7">new</span> Agent({
</span></span><span style="display:flex;"><span>  name<span style="color:#89dceb;font-weight:bold">:</span> <span style="color:#a6e3a1">&#34;tempo&#34;</span>,
</span></span><span style="display:flex;"><span>  instructions: <span style="color:#f38ba8">tempoInstructions</span>,
</span></span><span style="display:flex;"><span>  model: <span style="color:#f38ba8">FAST_MODEL</span>,
</span></span><span style="display:flex;"><span>  tools<span style="color:#89dceb;font-weight:bold">:</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#6c7086;font-style:italic">// Calendar tools only - no email, no CRM, no invoicing
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>    listCalendarEvents,
</span></span><span style="display:flex;"><span>    getCalendarEvent,
</span></span><span style="display:flex;"><span>    createCalendarEvent,
</span></span><span style="display:flex;"><span>    updateCalendarEvent,
</span></span><span style="display:flex;"><span>    deleteCalendarEvent,
</span></span><span style="display:flex;"><span>    findCalendarFreeBusy,
</span></span><span style="display:flex;"><span>  },
</span></span><span style="display:flex;"><span>});
</span></span></code></pre></div><p>The email agent can&rsquo;t touch the calendar. The calendar agent can&rsquo;t read emails. The invoicing agent can&rsquo;t send Slack messages to the shared workspace. These boundaries aren&rsquo;t documentation. They&rsquo;re structural. An agent literally cannot call a tool it doesn&rsquo;t have.</p>
<p>But tool scoping alone isn&rsquo;t enough. Some tools within an agent&rsquo;s set need additional constraints. My email agent has tools for composing and sending emails. The model can hallucinate plausible-looking recipient addresses, fabricate domains, or construct emails to addresses that don&rsquo;t exist. Instructions alone won&rsquo;t prevent this because the model can reason past them.</p>
<p>Rather than trusting the model&rsquo;s judgment, I enforce this at the framework level using <a href="https://mastra.ai/docs/agents/processors">Mastra&rsquo;s output processors</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// A Mastra processor that blocks emails to fabricated addresses
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">import</span> <span style="color:#cba6f7">type</span> { ProcessOutputStepArgs, Processor } <span style="color:#cba6f7">from</span> <span style="color:#a6e3a1">&#34;@mastra/core/processors&#34;</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">const</span> SEND_TOOLS <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#cba6f7">new</span> Set([<span style="color:#a6e3a1">&#34;compose-email&#34;</span>, <span style="color:#a6e3a1">&#34;reply-to-email&#34;</span>]);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">export</span> <span style="color:#cba6f7">class</span> EmailSendGuardrailProcessor <span style="color:#cba6f7">implements</span> Processor<span style="color:#89dceb;font-weight:bold">&lt;</span><span style="color:#a6e3a1">&#34;email-send-guardrail&#34;</span><span style="color:#89dceb;font-weight:bold">&gt;</span> {
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">readonly</span> id <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#a6e3a1">&#34;email-send-guardrail&#34;</span> <span style="color:#cba6f7">as</span> <span style="color:#cba6f7">const</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  processOutputStep({ toolCalls, abort, messages }<span style="color:#89dceb;font-weight:bold">:</span> ProcessOutputStepArgs) {
</span></span><span style="display:flex;"><span>    <span style="color:#cba6f7">if</span> (<span style="color:#89dceb;font-weight:bold">!</span>toolCalls<span style="color:#89dceb;font-weight:bold">?</span>.length) <span style="color:#cba6f7">return</span> messages;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#cba6f7">for</span> (<span style="color:#cba6f7">const</span> tc <span style="color:#cba6f7">of</span> toolCalls) {
</span></span><span style="display:flex;"><span>      <span style="color:#cba6f7">if</span> (<span style="color:#89dceb;font-weight:bold">!</span>SEND_TOOLS.has(tc.toolName)) <span style="color:#cba6f7">continue</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      <span style="color:#cba6f7">const</span> to <span style="color:#89dceb;font-weight:bold">=</span> (tc.args <span style="color:#cba6f7">as</span> { to?: <span style="color:#f38ba8">string</span> })<span style="color:#89dceb;font-weight:bold">?</span>.to;
</span></span><span style="display:flex;"><span>      <span style="color:#cba6f7">if</span> (<span style="color:#89dceb;font-weight:bold">!</span>to) <span style="color:#cba6f7">continue</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      <span style="color:#6c7086;font-style:italic">// Block obviously fabricated or placeholder recipients
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>      <span style="color:#cba6f7">if</span> (<span style="color:#94e2d5">/(@example\.com|@test\.com|@placeholder\.)/</span>.test(to) <span style="color:#89dceb;font-weight:bold">||</span> <span style="color:#89dceb;font-weight:bold">!</span>to.includes(<span style="color:#a6e3a1">&#34;@&#34;</span>)) {
</span></span><span style="display:flex;"><span>        abort(
</span></span><span style="display:flex;"><span>          <span style="color:#a6e3a1">`The recipient &#34;</span><span style="color:#a6e3a1">${</span>to<span style="color:#a6e3a1">}</span><span style="color:#a6e3a1">&#34; looks like a guessed address. Look up the contact in the CRM first. Never fabricate email addresses.`</span>,
</span></span><span style="display:flex;"><span>          { retry: <span style="color:#f38ba8">true</span> },
</span></span><span style="display:flex;"><span>        );
</span></span><span style="display:flex;"><span>        <span style="color:#cba6f7">return</span> messages;
</span></span><span style="display:flex;"><span>      }
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#cba6f7">return</span> messages;
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Processors inspect the step&rsquo;s generated tool calls and can abort execution with a retry hint when something violates a hard rule. If the model hallucinates a recipient address, the guardrail aborts with a message telling the agent to look up the contact in the CRM first. The address never reaches the send tool. No approval card, no prompt-based workaround.</p>
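<p>Wiring the processor into an agent is a one-line change to the agent definition. A sketch of the shape, following <a href="https://mastra.ai/docs/agents/processors">Mastra&rsquo;s processor docs</a>; the fields are the ones from the relay agent above, and the exact option name is worth checking against your Mastra version:</p>

```typescript
import { Agent } from "@mastra/core/agent";

// Attach the guardrail so it runs on every output step of this agent.
// Assumes the EmailSendGuardrailProcessor shown above and the relay
// agent's existing instructions, model, and tool imports.
const relayAgent = new Agent({
  name: "relay",
  instructions: relayInstructions,
  model: LOCAL_MODEL_LARGE_THINKING,
  tools: { scanInbox, readEmail, composeEmail, replyToEmail /* ... */ },
  outputProcessors: [new EmailSendGuardrailProcessor()],
});
```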
<p>The same principle applies to trust boundaries across workspaces. I run two Slack integrations: one for my private workspace, one for a shared community. The community-facing agent has no browser access, no credential vault, no file system. That&rsquo;s not a policy document. It&rsquo;s a different agent with a different tool set, pointed at a different Slack app.</p>
<p><strong>The pattern:</strong> Don&rsquo;t govern tool access with policies that agents might ignore. Remove the tools from the agent&rsquo;s definition entirely. Governance you can&rsquo;t violate is better than governance you promise to follow.</p>
<h2 id="beyond-tracing-structured-decision-logs-for-agent-governance">Beyond Tracing: Structured Decision Logs for Agent Governance</h2>
<p>You cannot govern what you cannot see. Modern tracing tools like Arize Phoenix, Langfuse, and Mastra Studio show you the full request/response cycle: inputs, outputs, tool calls, latency, and the model&rsquo;s reasoning process. I use Arize Phoenix extensively. It&rsquo;s the first place I look when debugging why an agent picked the wrong tool, hallucinated a parameter, or took an unexpected path.</p>
<p>Tracing is essential, but it answers a specific class of questions: <em>what happened inside the model&rsquo;s reasoning</em>. Governance needs a second layer: structured decision logs that answer <em>what the system decided, what confidence it had, and whether the outcome was correct in your domain context</em>.</p>
<p>This is familiar territory if you&rsquo;ve worked in regulated environments. In healthcare software, particularly anything touching the 510(k) pathway for Software as a Medical Device (SaMD), you don&rsquo;t just log that a record was modified. You log who modified it, when, what the previous value was, and what rule authorized the change. Every action must be reconstructable because a regulator will ask. Agent governance has the same shape, even outside healthcare. The stakeholder asking &ldquo;why did the agent do that?&rdquo; isn&rsquo;t debugging model behavior. They&rsquo;re asking whether the outcome was correct given the business rules, and they need a trail that answers that question without ambiguity.</p>
<p>Here&rsquo;s the distinction in practice. When my email triage agent archives a message, I can see in Arize Phoenix exactly what the model received and how it reasoned about the classification. That&rsquo;s useful for debugging why the model chose &ldquo;archive&rdquo; over &ldquo;escalate.&rdquo; But when I need to answer &ldquo;show me every email that was auto-archived from my inbox last week, what confidence level each had, and which ruleset applied,&rdquo; I need structured logs that are queryable independent of the tracing system.</p>
<p>That means capturing the agent&rsquo;s decision context in a structured, queryable format:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// Each triage decision captures full context for audit
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">interface</span> TriageDecision {
</span></span><span style="display:flex;"><span>  messageId: <span style="color:#f38ba8">string</span>;
</span></span><span style="display:flex;"><span>  subject: <span style="color:#f38ba8">string</span>;
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">from</span><span style="color:#89dceb;font-weight:bold">:</span> <span style="color:#f38ba8">string</span>;
</span></span><span style="display:flex;"><span>  classification<span style="color:#89dceb;font-weight:bold">:</span> <span style="color:#a6e3a1">&#34;archive&#34;</span> <span style="color:#89dceb;font-weight:bold">|</span> <span style="color:#a6e3a1">&#34;act&#34;</span> <span style="color:#89dceb;font-weight:bold">|</span> <span style="color:#a6e3a1">&#34;digest&#34;</span> <span style="color:#89dceb;font-weight:bold">|</span> <span style="color:#a6e3a1">&#34;escalate&#34;</span>;
</span></span><span style="display:flex;"><span>  confidence: <span style="color:#f38ba8">number</span>;
</span></span><span style="display:flex;"><span>  mode<span style="color:#89dceb;font-weight:bold">:</span> <span style="color:#a6e3a1">&#34;conservative&#34;</span> <span style="color:#89dceb;font-weight:bold">|</span> <span style="color:#a6e3a1">&#34;full&#34;</span>;   <span style="color:#6c7086;font-style:italic">// Which ruleset applied
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>  reason: <span style="color:#f38ba8">string</span>;                  <span style="color:#6c7086;font-style:italic">// Why the agent chose this
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>  actionTaken: <span style="color:#f38ba8">string</span>;             <span style="color:#6c7086;font-style:italic">// What actually happened
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>  labels: <span style="color:#f38ba8">string</span>[];                <span style="color:#6c7086;font-style:italic">// What labels were applied
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>  timestamp: <span style="color:#f38ba8">string</span>;
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// Persisted to the database, queryable from the dashboard
</span></span></span></code></pre></div><p>The triage system runs two different modes depending on whose inbox it&rsquo;s scanning. My inbox gets conservative mode (only auto-archives high-confidence machine-generated noise like billing receipts and marketing emails). The AI assistant&rsquo;s inbox gets full mode (four classification categories, auto-archives everything after classification). That modal distinction matters for governance because the blast radius is different. Archiving a marketing email from the assistant&rsquo;s inbox is low stakes. Archiving something from my inbox that I hadn&rsquo;t seen yet is a different conversation.</p>
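<p>The payoff is that governance questions become queries instead of log archaeology. A minimal sketch, with an in-memory array standing in for the database table and a trimmed-down version of the <code>TriageDecision</code> interface above:</p>

```typescript
// Illustrative query over persisted triage decisions. In production this
// is a database query; a plain filter shows the shape of the question.
interface TriageDecision {
  messageId: string;
  classification: "archive" | "act" | "digest" | "escalate";
  confidence: number;
  mode: "conservative" | "full";
  timestamp: string; // ISO 8601
}

// "Show me every auto-archive since `since` at or above `minConfidence`"
function autoArchivedSince(
  decisions: TriageDecision[],
  since: Date,
  minConfidence = 0,
): TriageDecision[] {
  return decisions.filter(
    (d) =>
      d.classification === "archive" &&
      d.confidence >= minConfidence &&
      new Date(d.timestamp) >= since,
  );
}
```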
<p>Beyond decision logging, I track errors with fingerprint deduplication. Every catch block writes to a structured error table with module, message, and context. A background health monitor runs every five minutes, detects stale processes, and escalates to the LLM for analysis when rule-based detection isn&rsquo;t enough. The dashboard surfaces all of this: health banners, error pages with filters, and stale-session badges that go amber after 10 minutes and red after 30.</p>
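<p>Fingerprint deduplication is worth a concrete sketch because it&rsquo;s what keeps the error table readable: errors that differ only in volatile details (ids, timings) collapse into one row with a count. The hash and normalization rules below are illustrative, not my exact implementation:</p>

```typescript
// 32-bit FNV-1a hash, rendered as hex. Small and deterministic, which
// is all a fingerprint needs to be.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Strip volatile details (numbers, long hex ids) so retries of the
// same failure dedupe to a single fingerprint.
function errorFingerprint(module: string, message: string): string {
  const normalized = message
    .replace(/\b\d+\b/g, "N")
    .replace(/\b[0-9a-f]{8,}\b/gi, "HEX");
  return fnv1a(`${module}:${normalized}`);
}
```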
<p>None of this came from a governance framework. It came from an earlier system I built called Tracewell AI, where agents generated design inputs from source material in a regulated context. Every derivation had to be auditable: &ldquo;show me every design input, which sources the agent pulled from, and its confidence at the time.&rdquo; No tracing tool could answer that question, and I wasn&rsquo;t running one anyway. I built a structured audit log because compliance required it, not because debugging demanded it. That&rsquo;s where I learned the distinction: traces show what the model reasoned; structured logs show what the system decided, under which rules, with what confidence.</p>
<p><strong>The pattern:</strong> Tracing gives you deep visibility into model behavior. Use it. But for governance, pair it with structured decision logs that capture domain-specific context: what was decided, what confidence level, what ruleset applied, and what action the system took. Make both queryable, and make sure someone is actually reviewing them.</p>
<h2 id="human-in-the-loop-the-checkpoint-that-actually-works">Human-in-the-Loop: The Checkpoint That Actually Works</h2>
<p>The most important insight about human-in-the-loop is the &ldquo;rubber-stamp trap.&rdquo; Adding human review to every agent decision is a common starting point. In practice, reviewers get overwhelmed, start rubber-stamping, and the checkpoint becomes theater.</p>
<p>This isn&rsquo;t just theory. Anthropic recently published <a href="https://www.anthropic.com/engineering/claude-code-auto-mode">research on Claude Code&rsquo;s auto-accept mode</a> that quantifies the problem: users were approving 93% of permission prompts. That&rsquo;s not review. That&rsquo;s muscle memory. Their solution was to replace blanket approval with a tiered system where a model-based classifier evaluates risk and only escalates actions that warrant human attention. The classifier uses a two-stage pipeline (fast filter, then chain-of-thought reasoning) and catches overeager behavior, prompt injection, scope escalation, and honest mistakes while letting routine actions through without friction.</p>
<p>The same principle applies to agent systems. The solution isn&rsquo;t removing human review. It&rsquo;s being precise about <em>where</em> it adds value and <em>what context</em> the reviewer needs to make a real decision.</p>
<p>My system uses a tiered approach. Low-risk actions (reading emails, looking up calendar events, searching the CRM) happen without approval. The agent just does them. High-risk actions go through explicit approval gates using <a href="https://mastra.ai/docs/agents/agent-approval">Mastra&rsquo;s agent approval system</a>. When a tool is tagged with <code>requireApproval: true</code>, Mastra pauses execution at the framework level before the tool runs. The stream emits an approval event with the tool name and arguments, and the tool only executes after an explicit <code>approveToolCall()</code>. This is framework-enforced, not prompt-based, so the model can&rsquo;t reason its way past the gate.</p>
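<p>In code, the gate is a flag on the tool definition, not a line in the prompt. A sketch following <a href="https://mastra.ai/docs/agents/agent-approval">Mastra&rsquo;s approval docs</a>; the schema fields are illustrative, not my exact tool:</p>

```typescript
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

// Execution pauses at the framework level before this tool runs. The
// stream surfaces the tool name and args for review, and nothing sends
// until an explicit approveToolCall().
const replyToEmail = createTool({
  id: "reply-to-email",
  description: "Send a reply in an existing email thread",
  inputSchema: z.object({
    threadId: z.string(),
    to: z.string(),
    body: z.string(),
  }),
  requireApproval: true, // framework-enforced, not prompt-based
  execute: async ({ context }) => {
    // ...hand the reply off to the mail client here...
    return { sent: true, threadId: context.threadId };
  },
});
```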
<p>The key design choice is what &ldquo;approval&rdquo; looks like. A generic &ldquo;Agent wants to perform an action. Approve?&rdquo; dialog is useless. The reviewer has no context, so they either rubber-stamp or block everything out of caution. Both outcomes are governance failures.</p>
<p>Here&rsquo;s what a real approval checkpoint looks like for my coding pipeline:</p>
<pre tabindex="0"><code>planning → risk assessment → low risk? ──yes──→ auto-approved → executing
                                ↓ no
                          plan_review → approved → executing
                                ↑              ↓
                            revise ← request changes
</code></pre><p>The agent generates a plan. A risk assessor (two layers: deterministic heuristics for hard stops like <code>DROP TABLE</code> or <code>.env</code> modifications, plus an LLM classifier for everything else) evaluates the plan. Low-risk plans auto-approve and execute immediately. Medium and high-risk plans go to human review with the full plan visible, not just a yes/no prompt.</p>
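<p>The deterministic layer is the cheapest part of the assessor and the easiest to test. The patterns below are examples of the idea, not my full list; anything matching is escalated before the LLM classifier ever runs:</p>

```typescript
// Layer one of the risk assessor: deterministic hard stops. Cheap
// regex checks run on the plan text before any model is consulted.
const HARD_STOP_PATTERNS: RegExp[] = [
  /\bDROP\s+TABLE\b/i, // destructive schema changes
  /\bDELETE\s+FROM\b/i, // bulk data deletion
  /\.env\b/, // secrets / environment files
  /\brm\s+-rf\b/, // destructive shell commands
];

type RiskTier = "low" | "needs_review";

function deterministicRisk(planText: string): RiskTier {
  return HARD_STOP_PATTERNS.some((p) => p.test(planText))
    ? "needs_review"
    : "low";
}
```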
<p>When I review a plan, I see exactly what the agent intends to do, which files it will touch, and why the risk assessor flagged it. I can approve, request changes (the agent revises and resubmits), or reject entirely. That&rsquo;s a checkpoint with teeth. The reviewer has enough context to make a real judgment call, and the &ldquo;request changes&rdquo; path means the review isn&rsquo;t binary.</p>
<p>For email, the approval is even more specific. When the agent wants to send an email, the approval card shows the full email: recipient, subject, body. I&rsquo;m not approving &ldquo;send an email.&rdquo; I&rsquo;m approving <em>this specific email to this specific person</em>. The context makes the checkpoint real instead of performative.</p>
<p>The less obvious lesson: the approval system itself can break in ways that look like it&rsquo;s working. I discovered that tagging certain tools with <code>requireApproval</code> caused my supervisor agent to avoid delegating to the sub-agent entirely. The supervisor model saw that the delegation path was &ldquo;approval-gated&rdquo; and hallucinated reasons not to use it. The approval mechanism was technically present but functionally disabled because the model routed around it. I only caught this by checking the traces (see: observability matters).</p>
<p><strong>The pattern:</strong> Approval checkpoints work when three conditions are met. The reviewer sees the full context of the action, not just a generic prompt. Low-risk actions bypass review entirely so the reviewer isn&rsquo;t fatigued. And the system is monitored to ensure the approval path is actually being exercised, not silently avoided.</p>
<h2 id="governance-as-code-defense-in-depth">Governance as Code: Defense in Depth</h2>
<p>Each of the previous patterns (tool scoping, guardrail processors, decision logs, approval gates) is a single layer. None of them is sufficient alone. The real value shows up when you stack them.</p>
<p>Take sending email as the running example. You&rsquo;ve already seen the individual layers: the email guardrail processor that blocks fabricated recipients, and the approval gate that pauses execution for human review. Here&rsquo;s how they combine with two additional layers into a defense-in-depth stack:</p>
<ol>
<li><strong>Tool API design</strong> forces an explicit <code>sender</code> parameter (&ldquo;emma&rdquo; | &ldquo;damian&rdquo;) with no default. The caller must deliberately choose which account sends.</li>
<li><strong>Guardrail processor</strong> blocks fabricated or placeholder recipients before the tool executes. Hard abort, no workaround.</li>
<li><strong>Framework-level approval gate</strong> (<code>requireApproval: true</code>) pauses execution and surfaces the full email for review: recipient, subject, body, sender.</li>
<li><strong>Client-level enforcement</strong> in <code>sendEmail()</code> requires an explicit <code>userEmail</code> argument. No fallback, no default. If the parameter is missing, it throws.</li>
</ol>
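<p>Layer 4 in isolation looks like this. The signature and error messages are illustrative; the point is that the lowest-level client refuses to guess a sender, so a malformed tool call fails loudly instead of sending from a default account:</p>

```typescript
// Sketch of client-level enforcement: no fallback, no default sender.
interface SendEmailArgs {
  userEmail: string; // which account sends: required, never defaulted
  to: string;
  subject: string;
  body: string;
}

function sendEmail(args: SendEmailArgs): { queued: true; from: string } {
  // Runtime checks matter because agent tool args arrive untyped
  if (!args.userEmail) {
    throw new Error("sendEmail: userEmail is required, no default sender");
  }
  if (!args.to?.includes("@")) {
    throw new Error(`sendEmail: invalid recipient "${args.to}"`);
  }
  // ...hand off to the mail provider here...
  return { queued: true, from: args.userEmail };
}
```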
<p>Each layer is independent. If the model hallucinates a recipient, Layer 2 catches it. If it tries to send without approval, Layer 3 blocks it. If somehow the tool args are malformed, Layer 4 throws. A bypass at one layer doesn&rsquo;t compromise the others.</p>
<p>That&rsquo;s what governance as code means. The constraints are enforced by the system, verified by tests, and visible in the codebase, not buried in a Confluence page. This is one of the <a href="/posts/2026-03-25-four-patterns-that-separate-agent-ready-codebases/">dimensions that separate agent-ready codebases</a> from ones that break under real workloads. Frameworks like <a href="https://mastra.ai/docs/agents/guardrails">Mastra</a> give you the primitives: guardrail processors for hard rules, approval gates for human review. Your job is to wire them into a layered defense that matches your risk profile.</p>
<h2 id="what-id-tell-a-cto-to-do-monday-morning">What I&rsquo;d Tell a CTO to Do Monday Morning</h2>
<p>If you&rsquo;re leading a team that&rsquo;s deploying agents, here&rsquo;s where to start:</p>
<p><strong>Audit your tool surfaces.</strong> For every agent in production, list the tools it has access to and ask: does this agent need all of these? Every unnecessary tool is expanded blast radius and wasted context window. Scope them down. You&rsquo;ll likely see better tool selection as a side effect.</p>
<p><strong>Add structured decision logging to your highest-stakes agent action.</strong> You probably already have LLM tracing. Pick the one action where you&rsquo;d need to explain &ldquo;why did the agent do that?&rdquo; to a stakeholder, and add structured logs that capture the decision context: inputs, classification, confidence, action taken. Make it queryable from your dashboard, not buried in trace spans.</p>
<p><strong>Pick your highest-risk action and build a real checkpoint.</strong> Not a generic approval dialog. An approval flow that shows the reviewer the full context of the action. Frameworks like <a href="https://mastra.ai/docs/agents/agent-approval">Mastra</a> provide the primitives. One real checkpoint is worth more than twenty rubber-stamp prompts.</p>
<p><strong>Move one governance rule from documentation to code.</strong> Find a constraint that&rsquo;s currently a line in a README or a team agreement. Encode it as a guardrail processor, a test, a structural boundary. Something that fails loudly when violated rather than depending on an agent reading and following instructions.</p>
<p>These are afternoon-sized tasks, not quarterly initiatives. That&rsquo;s the point. The original frame is right that governance has to come before agents ship at scale. But &ldquo;before&rdquo; doesn&rsquo;t mean organizational review boards. It means constraints in code that ship with the agent. Each of these gives you something concrete: a tighter tool surface, a queryable decision log, an approval checkpoint that someone actually uses, or a constraint that enforces itself without depending on an agent&rsquo;s good behavior.</p>
<p>If you want a faster read on where you stand, I built a companion <a href="/agent-governance-scorecard/">Agent Governance Scorecard</a> — 30 yes/no questions across the four dimensions above. It takes about ten minutes and tells you which layer to fix first.</p>
<hr>
<p>These aren&rsquo;t theoretical patterns. They&rsquo;re the same techniques I apply when working with early-stage teams to formalize their agent architecture.</p>
<p><strong>Most early-stage teams I talk to have agents in production and governance that&rsquo;s still catching up.</strong> The gap between &ldquo;it works&rdquo; and &ldquo;I can explain why it did that&rdquo; is where real risk lives — and it&rsquo;s where investors, partners, and your first enterprise customer will start asking questions. If that sounds familiar, <a href="/pages/meet/">book a free 30-minute strategy call</a>. I&rsquo;ll walk through your agent architecture, identify the highest-risk tool surfaces, and give you a prioritized action plan: what to lock down first, what can wait, and which of these patterns fits your system. No slide decks. Just a concrete roadmap you can start executing the same week.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://hoolahoop.io/articles/cto-coaching/agentic-ai-governance/">Agentic AI Governance: What CTOs Need to Know</a> — a solid overview of the organizational and strategic side of agent governance</li>
<li><a href="https://mastra.ai/docs/agents/agent-approval">Mastra Agent Approval</a> and <a href="https://mastra.ai/docs/agents/guardrails">Guardrails</a> — the framework primitives used in the examples above</li>
<li><a href="https://www.anthropic.com/engineering/claude-code-auto-mode">Claude Code Auto Mode: A Safer Way to Skip Permissions</a> on how Anthropic built tiered approval into Claude Code</li>
<li><a href="/posts/2025-11-06-build-efficient-mcp-servers-three-design-principles/">Build Efficient MCP Servers: Three Design Principles</a> on scoping what agents can access at the tool design level</li>
<li><a href="/posts/2026-02-17-how-ai-agents-remember-things/">How AI Agents Remember Things</a> on the memory and context systems that feed agent decisions</li>
<li><a href="/posts/2026-03-25-four-patterns-that-separate-agent-ready-codebases/">Four Dimensions of Agent-Ready Codebase Design</a> on building codebases that support reliable agent output</li>
<li><a href="/posts/2026-02-05-mcps-vs-agent-skills/">MCPs vs Agent Skills</a> on architecture decisions that shape agent capabilities</li>
</ul>
]]></content:encoded></item></channel></rss>