The Quality Loop Your AI Agent Is Missing (Evals + Tracing)
Traces tell you what your agent did. Evals tell you whether it did it right. Most AI agent stacks ship with neither connected to the other — and that’s how a “successful” run can quietly include a fabricated action item your unit tests would never flag.
The Agent Quality Loop, end to end
This is Part 3 of the Agent Quality series, closing the loop between Part 1 on the eval framework and Part 2 on agent observability. The loop: code → traces → evals → scores → back to code. Each piece is well-known in isolation. The point of this video is to wire them together on a real agent and watch the flywheel turn.
A custom LLM-as-judge scorer in Mastra Studio
I add a custom groundedness scorer to a Mastra meeting assistant using createScorer, following the preprocess → analyze → generateScore → generateReason pattern. The scorer is attached to the agent so every run gets graded automatically. Mastra Studio shows the trace and the score in one place. First run: 0.83, one action item that didn’t appear in the transcript. I fix the prompt with explicit grounding rules and the score moves to 1.00. No custom infrastructure — observability and evals both ship in Mastra.
Why this fails without the loop
Underspecified prompts produce plausible-looking failures. The agent looks like it worked. The logs are clean. The dashboard is green. The only thing that catches the failure is a scorer that checks the output against the input — and the only way to fix it without guessing is to read the trace.
If you’re shipping AI agents to production, this is the layer that separates “demo works” from “actually working.” I help teams build agent quality into their stack from the start.
Building an AI agent?
I help teams design and ship agentic systems — from architecture to production.
See how I can help
Stop Giving Your Agent Every Tool
Large tool catalogs break agent context. Tool search fixes that by letting agents discover and load only what they need.

Stop Letting AI Agents Run the Whole Workflow
One inbox agent should not classify, research, score, route, and draft replies in one loose loop.

Harness Engineering: 4 Levers to Diagnose Any AI Agent
Most agent failures aren't model failures. They're harness failures. Here's the 4-lever framework I use to diagnose what broke.

Building Approval Gates AI Agents Can't Route Around
How to wire human-in-the-loop on tool calls — and why system prompt instructions like "always ask before sending" don't actually hold.
Get new videos and posts by email
Weekly videos on AI engineering, plus deeper dives in the newsletter.
Occasional emails, no fluff.