AI Agent Evals: The 4 Layers Most Teams Skip
Most teams evaluate AI agents by vibes. Here are the four layers of evals you actually need to ship agents with confidence.
I walk through the eval stack I use on real agent projects — from unit-level prompt checks up through end-to-end trajectory scoring — and explain where each layer catches different classes of failure. If you’re building agents for production and wondering why regressions keep slipping through, this is the framework to borrow.
Building Approval Gates AI Agents Can't Route Around
How to wire human-in-the-loop on tool calls — and why system prompt instructions like "always ask before sending" don't actually hold.

Your AI Assistant Doesn't Need a Bigger Model. It Needs Colleagues
The multi-agent supervisor pattern in Mastra, written in TypeScript: eight specialist agents on one local LLM, one supervisor, and structural trust boundaries.

The Quality Loop Your AI Agent Is Missing (Evals + Tracing)
Add an LLM-as-judge scorer to a Mastra agent, catch a fabricated action item your tests would never flag, and fix the prompt — no custom infra.

The Observability Layer Your AI Agent Is Missing
Logs tell you what happened. Traces tell you why. The three layers of agent observability, and where silent failures actually live.