Agent Loops vs. Workflows: The Boundary That Makes AI Reliable

Most agent demos hand one model the entire job. One prompt, one loop, one impressive screen recording.

That works until the output matters. The moment an agent’s decision touches your reputation, your money, or anything that leaves your system, “the model figured it out” stops being a feature. A single loop holding the whole process in its head is exactly the thing you can’t inspect when it gets something wrong.

There’s a better split, and it’s not the one most demos reach for. Use the model for judgment. Use a workflow for control. Use deterministic code for the parts where you don’t want a probability distribution deciding what happens. That boundary, where the model’s judgment ends and deterministic structure takes over, is the subject of this post.

To keep it concrete, I’ll lean on one running example: a system that triages incoming sponsorship emails, evaluates the ones worth taking seriously, and drafts a structured internal brief before any reply goes out. The inbox is just the vehicle. The pattern underneath it applies to any agent doing a known job with a judgment call or two inside it.

Where the Agent Loop Boundary Actually Is

Frameworks keep pushing more of the process inside the agent loop. Sometimes that’s the right call. When the path is genuinely open-ended, when the next step depends on what the last step turned up, a loop that can reason and re-plan is the correct tool.

But a lot of the work we hand to agents isn’t open-ended. It’s a known process with a few judgment calls embedded in it. Triaging an email is mostly known. You normalize it, you classify it, you route it, and somewhere in the middle there’s a real question that needs judgment. Is this sponsorship worth pursuing? The mistake is letting the model own the whole sequence just because one step inside it needs intelligence.

Anthropic draws this same line in their guidance on building agents. Workflows orchestrate models and tools through predefined code paths. Agents let the model dynamically direct its own process. Predictability where the path is known, flexibility where it isn’t. The skill is knowing which parts are which, and being explicit about it rather than collapsing everything into one prompt.

The version that holds in production looks less like a single clever agent and more like a typed pipeline with a few model calls in it. That’s what we’re going to build.

The Decision: Model Call or Deterministic Step

Before writing any code, ask one of two questions about every step in the process:

  1. Does this step require judgment that only a model can provide?
  2. Or is this a step I could write as ordinary code if I just sat down and wrote it?

Normalizing a messy email into a predictable shape is the second kind. Deciding whether an email is a sponsorship inquiry, and how confident you are, is the first. Sending the reply is neither a judgment call nor something you want the model doing on its own. It’s a side effect, and side effects are where deterministic control earns its place.
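One way to make that sorting explicit is to tag each step with its kind before writing any of them. The sketch below is illustrative only: the `StepKind` names, the `PlannedStep` shape, and the step ids `route-by-category` and `send-reply` are my own assumptions, not part of Mastra.

```typescript
// Hypothetical taxonomy for planning; none of these names come from Mastra.
type StepKind = "deterministic" | "judgment" | "side-effect";

interface PlannedStep {
  id: string;
  kind: StepKind;
}

// The triage process from this post, tagged step by step.
const triagePlan: PlannedStep[] = [
  { id: "normalize-email", kind: "deterministic" },
  { id: "classify-email", kind: "judgment" },
  { id: "route-by-category", kind: "deterministic" },
  { id: "send-reply", kind: "side-effect" },
];

// Only judgment steps should ever reach a model.
function stepsThatCallAModel(plan: PlannedStep[]): string[] {
  return plan.filter((s) => s.kind === "judgment").map((s) => s.id);
}

// Side effects are listed separately so each one can sit behind a gate.
function stepsThatNeedAGate(plan: PlannedStep[]): string[] {
  return plan.filter((s) => s.kind === "side-effect").map((s) => s.id);
}
```

If the list of model-calling steps comes back longer than you expected, that is usually the first sign a loop is doing work that code could do.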

This isn’t a framework opinion. It’s a reliability one. Models are excellent at constrained judgment and unreliable at remembering to do the boring, mandatory parts every single time. A workflow lets you put the boring, mandatory parts in code where they run the same way on every execution, and reserve the model for the few places where judgment is the actual job.

I built the example with Mastra, a TypeScript framework for building AI agents and workflows with memory, evals, and observability built in. The patterns here aren’t Mastra-specific, but having typed steps and a workflow primitive makes the boundary concrete instead of conceptual. If you’re building with Mastra, this boundary between agent loops and typed workflows is usually where production reliability starts to matter.

Step One: Normalize Before You Reason

Raw email is inconsistent. Different headers, quoted reply chains, signature noise, HTML and plain text mixed together. Before asking a model to judge anything, make the input shape predictable. This is the least glamorous step and one of the most important, because every downstream step gets easier once the state is structured.

// src/mastra/steps/normalize-email.step.ts
import { createStep } from "@mastra/core/workflows";
import { z } from "zod";
import { parseRawEmail } from "../../lib/email";

export const normalizeEmailStep = createStep({
  id: "normalize-email",
  inputSchema: z.object({ raw: z.string() }),
  outputSchema: z.object({
    from: z.string().email(),
    subject: z.string(),
    body: z.string(),
    links: z.array(z.string().url()),
  }),
  execute: async ({ inputData }) => {
    // Plain software engineering, no model involved
    return parseRawEmail(inputData.raw);
  },
});

There’s no model here, and that’s the point. Not every important step is AI. Putting a model in this step would be slower, more expensive, and less reliable than the parsing code you already know how to write. The workflow is already paying off and we haven’t made a single API call.
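The step above delegates to `parseRawEmail`, which isn’t shown. As a rough idea of what that kind of helper does, here is a minimal sketch under heavy assumptions: it only handles plain-text messages with simple headers, and the name `parseRawEmailSketch` is mine. A real parser also has to deal with MIME parts, encoded headers, and HTML bodies.

```typescript
// Hypothetical stand-in for parseRawEmail. Plain-text messages only.
interface NormalizedEmail {
  from: string;
  subject: string;
  body: string;
  links: string[];
}

function parseRawEmailSketch(raw: string): NormalizedEmail {
  // Headers and body are separated by the first blank line.
  const text = raw.replace(/\r\n/g, "\n");
  const split = text.indexOf("\n\n");
  const headerBlock = split === -1 ? text : text.slice(0, split);
  const rawBody = split === -1 ? "" : text.slice(split + 2);

  // Case-insensitive lookup of a single-line header value.
  const header = (name: string): string => {
    const prefix = name.toLowerCase() + ":";
    const line = headerBlock
      .split("\n")
      .find((l) => l.toLowerCase().startsWith(prefix));
    return line ? line.slice(prefix.length).trim() : "";
  };

  // "Jane Doe <jane@example.com>" becomes "jane@example.com".
  const fromHeader = header("From");
  const angle = fromHeader.match(/<([^>]+)>/);
  const from = (angle ? angle[1] : fromHeader).trim();

  // Drop quoted reply lines and everything after the "-- " signature marker.
  const bodyLines: string[] = [];
  for (const line of rawBody.split("\n")) {
    if (line.trimEnd() === "--") break;
    if (line.startsWith(">")) continue;
    bodyLines.push(line);
  }
  const body = bodyLines.join("\n").trim();

  // Collect unique http(s) links from the cleaned body.
  const found = body.match(/https?:\/\/[^\s<>")]+/g) ?? [];
  const links = found.filter((url, i) => found.indexOf(url) === i);

  return { from, subject: header("Subject"), body, links };
}
```

Everything in it is ordinary string handling, which is exactly why it belongs in code and not in a prompt.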

Step Two: One Narrow Model Call

The first model call has exactly one job. It looks at the normalized email and decides a category, a confidence score, and a reason. Nothing else. It doesn’t research, it doesn’t draft, it doesn’t decide what happens next. It produces a typed decision.

// src/mastra/steps/classify-email.step.ts
import { createStep } from "@mastra/core/workflows";
import { z } from "zod";
import { classifier } from "../agents/classifier";
import { normalizeEmailStep } from "./normalize-email.step";

// Exported so downstream workflows can reuse the same schema.
export const classification = z.object({
  category: z.enum(["sponsorship", "support", "personal", "spam"]),
  confidence: z.number().min(0).max(1),
  reason: z.string(),
});

export const classifyEmailStep = createStep({
  id: "classify-email",
  // Input is exactly what the normalize step produced.
  inputSchema: normalizeEmailStep.outputSchema,
  outputSchema: classification,
  execute: async ({ inputData }) => {
    // One constrained call: the model must answer in the classification shape.
    const result = await classifier.generate(
      `Classify this email:\n${JSON.stringify(inputData)}`,
      { structuredOutput: { schema: classification } },
    );
    return result.object;
  },
});

Constraining the model to a typed output does two things. It forces the judgment into a shape the rest of the system can rely on, and it keeps the model in its lane. This is also a good place for a smaller, cheaper model. Classification into four buckets with a confidence score doesn’t need a frontier model, and different jobs in the same workflow can run on different models. The constrained call is the unit of judgment. The workflow is what decides what to do with it.
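To see what the schema constraint buys you, it helps to write the same check by hand. The sketch below is a dependency-free approximation of what the Zod schema enforces; in the real step the schema and structured output do this for you. The function name `parseClassification` is my own.

```typescript
type Category = "sponsorship" | "support" | "personal" | "spam";

interface Classification {
  category: Category;
  confidence: number;
  reason: string;
}

const CATEGORIES: readonly string[] = ["sponsorship", "support", "personal", "spam"];

// Returns a typed decision, or null when the model's output is unusable.
function parseClassification(rawModelOutput: string): Classification | null {
  let value: unknown;
  try {
    value = JSON.parse(rawModelOutput);
  } catch {
    return null; // Not JSON at all.
  }
  if (typeof value !== "object" || value === null) return null;

  const v = value as Record<string, unknown>;
  const category = v.category;
  const confidence = v.confidence;
  const reason = v.reason;

  // Category must be one of the four known buckets.
  if (typeof category !== "string" || CATEGORIES.indexOf(category) === -1) return null;
  // Confidence must be a real number in [0, 1].
  if (typeof confidence !== "number" || Number.isNaN(confidence)) return null;
  if (confidence < 0 || confidence > 1) return null;
  if (typeof reason !== "string") return null;

  return { category: category as Category, confidence, reason };
}
```

Anything that comes back as `null` never reaches the routing step, so downstream code only ever sees one of four categories and a bounded score.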

Step Three: The Workflow Owns the Routing

Here’s the first real boundary. The model produced a typed decision. It does not get to act on that decision. The workflow reads the decision and routes.

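// src/mastra/workflows/inbox-triage.workflow.ts
import { createWorkflow } from "@mastra/core/workflows";
import { z } from "zod";
import { normalizeEmailStep } from "../steps/normalize-email.step";
import { classifyEmailStep } from "../steps/classify-email.step";
import { sponsorTriageWorkflow } from "./sponsor-triage.workflow";
// reviewRequiredStep and the triageResult schema are defined elsewhere in the
// project; their imports are omitted here.

export const inboxTriageWorkflow = createWorkflow({
  id: "inbox-triage",
  inputSchema: z.object({ raw: z.string() }),
  outputSchema: triageResult,
})
  .then(normalizeEmailStep)
  .then(classifyEmailStep)
  // The workflow, not the model, decides where the typed decision goes.
  .branch([
    [
      async ({ inputData }) => inputData.category === "sponsorship",
      sponsorTriageWorkflow,
    ],
    [
      async ({ inputData }) => inputData.category !== "sponsorship",
      reviewRequiredStep,
    ],
  ])
  .commit();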

Judgment comes from the model. Control comes from the workflow. That separation is the whole idea. When something goes wrong later, you can point to exactly where the decision was made and exactly where it was acted on, because they’re different steps. In a single agent loop, those two things happen in the same opaque generation, and “why did it do that?” has no good answer.
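Because routing is plain code, you can test it like plain code. One property worth checking is that every category matches exactly one branch, so no email falls through and none goes two ways. Here is a minimal sketch with the two conditions copied out as ordinary predicates. The target labels are illustrative, and `review-required` is an assumed id for the review step.

```typescript
type Category = "sponsorship" | "support" | "personal" | "spam";

const ALL_CATEGORIES: Category[] = ["sponsorship", "support", "personal", "spam"];

// The two branch conditions, copied out of the workflow as plain predicates.
const branchConditions: Array<{ target: string; matches: (c: Category) => boolean }> = [
  { target: "sponsor-triage", matches: (c) => c === "sponsorship" },
  { target: "review-required", matches: (c) => c !== "sponsorship" },
];

// Returns every target whose condition matches the category.
function targetsFor(category: Category): string[] {
  return branchConditions.filter((b) => b.matches(category)).map((b) => b.target);
}

// True when each category routes to exactly one target.
function routingIsTotalAndExclusive(): boolean {
  return ALL_CATEGORIES.every((c) => targetsFor(c).length === 1);
}
```

A check like this is cheap to keep in the test suite, and it fails loudly the day someone adds a fifth category without adding a route for it.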

Step Four: Nest the Domain Logic

Notice that the inbox triage workflow branches into another workflow, not a step. The parent workflow shouldn’t know how sponsorship evaluation works. It only knows when to hand off to the part of the system that does.

// src/mastra/workflows/sponsor-triage.workflow.ts
import { createWorkflow } from "@mastra/core/workflows";
// Imports for the classification and sponsorBrief schemas and for the five
// steps below are omitted here.

export const sponsorTriageWorkflow = createWorkflow({
  id: "sponsor-triage",
  inputSchema: classification,
  outputSchema: sponsorBrief,
})
  .then(extractSponsorDetailsStep)
  .then(researchSponsorStep)
  .then(scoreSponsorFitStep)
  .then(applyGuardrailsStep)
  .then(renderBriefStep)
  .commit();

Nested workflows are a clean way to separate broad routing from domain-specific evaluation. The inbox workflow handles “what kind of email is this and where does it go.” The sponsor workflow handles “given that this is a sponsorship inquiry, is it any good.” Each one stays readable because it isn’t carrying the other one’s concerns. If you add a second category worth deep evaluation later, it becomes its own nested workflow without touching the parent’s logic.

Step Five: Where Model Judgment Meets Deterministic Policy

This is the production AI part, and it’s where the boundary does the most work.

The sponsor workflow researches the sender. A sponsor-provided landing page is useful context, but it is not independent proof. So the research step pulls the sponsor’s own page and, separately, searches for outside corroboration. The model then scores fit based on everything it found. That’s a legitimate judgment call, and the model is good at it.
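The distinction between the sponsor’s own page and outside corroboration can itself be computed deterministically. The sketch below shows one possible way to do it, by counting distinct hosts that are not the sponsor’s domain. The `strong` label, the threshold of two hosts, and the function names are my assumptions. The source system only tells us that a `weak` value exists.

```typescript
interface Source {
  url: string;
}

type Corroboration = "weak" | "strong";

// Hostname without a leading "www.", for comparison.
function hostOf(url: string): string {
  return new URL(url).hostname.replace(/^www\./, "");
}

// Assumption: corroboration is "strong" once at least `minIndependent`
// distinct outside hosts mention the sponsor.
function assessCorroboration(
  sponsorUrl: string,
  sources: Source[],
  minIndependent = 2,
): Corroboration {
  const sponsorHost = hostOf(sponsorUrl);
  const independentHosts = new Set<string>();

  for (const source of sources) {
    const host = hostOf(source.url);
    // Pages on the sponsor's own domain or its subdomains are self-reported.
    if (host === sponsorHost || host.endsWith("." + sponsorHost)) continue;
    independentHosts.add(host);
  }

  return independentHosts.size >= minIndependent ? "strong" : "weak";
}
```

Host counting is a crude proxy. It says nothing about whether the outside sources are themselves trustworthy, which is part of why the weak case goes to a human.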

Then comes the part the model does not get to decide.

// src/mastra/steps/apply-guardrails.step.ts
import { createStep } from "@mastra/core/workflows";
import { z } from "zod";
// The scoredSponsor schema is defined elsewhere; its import is omitted here.

export const applyGuardrailsStep = createStep({
  id: "apply-guardrails",
  inputSchema: scoredSponsor,
  outputSchema: scoredSponsor.extend({
    recommendation: z.enum(["pursue", "review", "decline"]),
  }),
  execute: async ({ inputData }) => {
    const { fitScore, externalCorroboration } = inputData;

    // Deterministic policy. The model scored fit; it does not
    // get to override this rule no matter how confident it is.
    if (externalCorroboration === "weak") {
      return { ...inputData, recommendation: "review" };
    }

    // With corroboration in place, the fit score decides.
    const recommendation = fitScore >= 0.7 ? "pursue" : "decline";
    return { ...inputData, recommendation };
  },
});

The anchor rule: if external corroboration is weak, this workflow cannot return pursue. Not “should usually not.” Cannot. The model can be as confident as it likes about a sponsor it only knows from that sponsor’s own marketing page, and the guardrail still routes the decision to human review. This is a policy you don’t want a probability distribution deciding, so it lives in code where it runs identically every time.

This is the difference between a prompt that says “be careful about unverified sponsors” and a system that structurally cannot recommend one. The first is a suggestion the model can reason past. The second is a gate. I’ve written before about why prompt-based rules fail and framework-enforced gates hold, and this is the same principle applied to a workflow step instead of a tool call.
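A gate in code has one more advantage over a prompt: you can prove it holds. The sketch below restates the guardrail as a plain function and sweeps every fit score to confirm that weak corroboration never produces `pursue`. The `ScoredSponsor` shape is trimmed to the two fields the rule reads, and `strong` is an assumed name for the non-weak case.

```typescript
type Recommendation = "pursue" | "review" | "decline";

interface ScoredSponsor {
  fitScore: number;
  externalCorroboration: "weak" | "strong";
}

// Same policy as the workflow step, without the framework around it.
function applyGuardrails(input: ScoredSponsor): Recommendation {
  if (input.externalCorroboration === "weak") return "review";
  return input.fitScore >= 0.7 ? "pursue" : "decline";
}

// Sweeps fit scores from 0.00 to 1.00 and checks the anchor rule.
function weakNeverPursues(): boolean {
  for (let i = 0; i <= 100; i++) {
    const result = applyGuardrails({
      fitScore: i / 100,
      externalCorroboration: "weak",
    });
    if (result === "pursue") return false;
  }
  return true;
}
```

There is no equivalent test for a sentence in a system prompt. You can sample it, but you can’t exhaust it.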

Step Six: Draft From State, Not From the Email

The reply gets drafted from the structured state the workflow built up, not directly from the raw email. By this point the system knows the category, the extracted details, the research findings, the fit score, and the guardrail-approved recommendation. The draft step assembles a reply from that, and the workflow returns two things: machine-friendly JSON for whatever calls it, and a human-readable markdown brief for review.

That dual output is what makes this feel like an internal tool instead of a chatbot. Workflows aren’t only about controlling the process. They also improve the shape and reviewability of what comes out. A reviewer sees “DevFlow AI, fit score 0.82, two independent sources confirm the company exists, recommendation: pursue” instead of a wall of generated prose they have to re-verify from scratch.
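Producing both outputs from the same state is a small amount of code. This is a minimal sketch of that rendering step. The `BriefState` field names and the markdown layout are hypothetical, chosen to match the example a reviewer would see.

```typescript
// Hypothetical slice of the workflow state needed for the brief.
interface BriefState {
  sponsorName: string;
  fitScore: number;
  independentSources: number;
  recommendation: "pursue" | "review" | "decline";
}

// Renders one state object two ways: JSON for callers, markdown for people.
function renderBrief(state: BriefState): { json: string; markdown: string } {
  const markdown = [
    `# Sponsor brief: ${state.sponsorName}`,
    "",
    `- Fit score: ${state.fitScore.toFixed(2)}`,
    `- Independent sources: ${state.independentSources}`,
    `- Recommendation: ${state.recommendation}`,
  ].join("\n");

  return { json: JSON.stringify(state), markdown };
}
```

Both views come from one object, so the brief a human reads can’t drift from the payload a machine consumes.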

Step Seven: A Valid Output Is Not a Good Output

Schema validation tells you the output has the right shape. It says nothing about whether the classification was correct or the extraction captured the right details. Those are quality questions, and they need a different tool.

Scorers evaluate output quality, not just structure, and they attach to the agents whose judgment you depend on. In this system, a scorer on the classifier and a scorer on the extraction agent tell you whether those judgment calls are actually any good over time.

// src/mastra/agents/classifier.ts
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
import { classificationScorer } from "../scorers/classification";

export const classifier = new Agent({
  name: "email-classifier",
  instructions: "Classify inbound email into one category with a reason.",
  // A small model is enough for four-way classification.
  model: openai("gpt-4o-mini"),
  scorers: {
    classificationAccuracy: {
      scorer: classificationScorer,
      // Score roughly one run in five to keep eval cost down.
      sampling: { type: "ratio", rate: 0.2 },
    },
  },
});

Evals belong close to the judgment you depend on. A workflow makes that natural, because the judgment lives in isolated steps backed by specific agents. You know exactly which calls make decisions, so you know exactly where to measure decision quality. The sampling ratio keeps the cost reasonable. You don’t need to score every run to know whether classification is drifting.
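To make the sampling idea concrete, here is a sketch of ratio sampling plus a simple accuracy measure over the sampled runs. This is not how Mastra’s scorers are implemented. It assumes you have a human-labeled `expected` category for each sampled run, and the random draw is injected so the behavior is reproducible.

```typescript
interface ScoredRun {
  predicted: string;
  expected: string; // Assumed to come from human labeling.
}

// Keeps a run when its draw falls below the rate. Passing Math.random
// as `draw` gives ordinary ratio sampling.
function sampleRuns<T>(runs: T[], rate: number, draw: () => number): T[] {
  return runs.filter(() => draw() < rate);
}

// Fraction of sampled runs where the classifier matched the label.
// Returns null when nothing was sampled, so "no data" is not read as 0%.
function accuracy(runs: ScoredRun[]): number | null {
  if (runs.length === 0) return null;
  const correct = runs.filter((r) => r.predicted === r.expected).length;
  return correct / runs.length;
}
```

Tracked over time, a falling number from something like `accuracy(sampleRuns(runs, 0.2, Math.random))` is the drift signal, and it costs a fifth of what scoring every run would.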

Step Eight: A System You Can Actually Inspect

The last piece is the one that separates a clever demo from something production-ready. Every run produces a trace. You can open the parent workflow execution and see each step, the model calls inside it, the scorer results, and the path the branch took. When a sponsor gets the wrong recommendation, you don’t guess. You read the trace.

This is the payoff for drawing the boundary in the first place. Because judgment, control, and side effects are separate steps, the trace shows you which one failed. A single agent loop gives you a transcript and a shrug. A workflow gives you a sequence of typed, inspectable decisions.
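The reason a step-based trace can name the failing step is structural, and a few lines show it. This is a minimal sketch of a runner that records each step’s input and output. It is not Mastra’s tracing, and the `Step` and `TraceEntry` shapes are invented for illustration.

```typescript
interface TraceEntry {
  stepId: string;
  input: unknown;
  output?: unknown;
  error?: string;
}

interface Step {
  id: string;
  run: (input: unknown) => unknown;
}

// Runs steps in order, recording what each one saw and produced.
// Stops at the first failure and reports which step it was.
function runWithTrace(
  steps: Step[],
  initial: unknown,
): { trace: TraceEntry[]; failedAt: string | null } {
  const trace: TraceEntry[] = [];
  let current = initial;

  for (const step of steps) {
    try {
      const output = step.run(current);
      trace.push({ stepId: step.id, input: current, output });
      current = output;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      trace.push({ stepId: step.id, input: current, error: message });
      return { trace, failedAt: step.id };
    }
  }

  return { trace, failedAt: null };
}
```

Each entry carries the exact input the step received, so a bad recommendation can be replayed from the step that produced it without rerunning the whole pipeline.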

When to Reach for a Workflow Instead of a Loop

The boundary isn’t “workflows good, agents bad.” Each is the right tool for a different shape of problem. Before building, the questions worth asking are roughly these:

  • Is the path mostly known? If you can describe the steps in advance, that’s a workflow with a few model calls in it, not an agent loop.
  • Are there irreversible side effects? Sending email, finalizing invoices, publishing anything. The more the output leaves your control, the more you want deterministic gates around it.
  • Does a wrong step need to be explainable? If “why did it do that?” has to have an answer, separate judgment from control so the trace can tell you.

If instead the work is genuinely open-ended, where the next move depends on what the last move uncovered and you can’t enumerate the steps ahead of time, that’s where an agent loop earns its complexity. Most production work I see is a known process with a few judgment calls in it. That shape wants a workflow.

There’s an honest edge here worth naming. “Known path” is not a permanent property. A triage workflow accumulates exceptions: a new sponsor type, an edge case the classifier wasn’t built for, a routing rule that needs context the workflow doesn’t carry. At some point a workflow with enough branches and special cases is telling you that part of the process has become genuinely open-ended and wants a loop after all. I don’t have a clean numeric threshold for that. The signal I watch for is branches that exist to handle the model’s uncertainty rather than the business’s actual rules. When the structure starts fighting the work, it’s time to revisit the boundary.

The goal was never to remove the model. We still wanted its judgment. We just didn’t want one loop holding the whole process in its head, making decisions we couldn’t inspect and performing side effects we couldn’t gate. Explicit steps, explicit routing, deterministic guardrails, and something you can actually read after the fact. That’s what makes an AI system reliable enough to put in front of something that matters.

