<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Ai-Agents on Damian Galarza | Software Engineering &amp; AI Consulting</title><link>https://www.damiangalarza.com/tags/ai-agents/</link><description>Recent posts from Damian Galarza | Software Engineering &amp; AI Consulting</description><generator>Hugo</generator><language>en-us</language><managingEditor>Damian Galarza</managingEditor><atom:link href="https://www.damiangalarza.com/tags/ai-agents/feed.xml" rel="self" type="application/rss+xml"/><item><title>Shrinking a Production Prompt by 28% With Autonomous Optimization</title><link>https://www.damiangalarza.com/posts/2026-04-06-autonomous-optimization-loops-with-autoresearch/</link><pubDate>Mon, 06 Apr 2026 00:00:00 -0400</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-04-06-autonomous-optimization-loops-with-autoresearch/</guid><description>How I used autoresearch to run 65 autonomous prompt optimization iterations on a production LLM agent, cutting it 28% while retaining 98% output quality.</description><content:encoded><![CDATA[<p>Every token in a production LLM prompt costs you latency, money, and <a href="/posts/understanding-claude-code-context-window/">context window</a> space. An agent I&rsquo;ve been building takes ~170 input categories and produces a detailed structured matrix as output. The system prompt includes a 421-line reference matrix as a few-shot example gallery so the model knows the expected output patterns.</p>
<p>The question was concrete: how much of this reference data does the model actually need? I used <a href="https://github.com/uditgoenka/autoresearch">uditgoenka/autoresearch</a>, a Claude Code skill based on <a href="https://github.com/karpathy/autoresearch">Andrej Karpathy&rsquo;s autoresearch</a>, to find out. After 65 autonomous iterations, it cut the matrix to 303 lines (28% smaller) while maintaining 98.1% output quality.</p>
<p>Here&rsquo;s the prompt optimization pattern, the results, and what surprised me about how robust LLMs are to reference data reduction.</p>
<h2 id="the-autoresearch-pattern">The Autoresearch Pattern</h2>
<p>Andrej Karpathy&rsquo;s <a href="https://github.com/karpathy/autoresearch">autoresearch</a> introduced the core idea: give an AI agent a metric to optimize and let it loop. Modify, measure, keep or revert, repeat.</p>
<figure class="tweet-screenshot"><a href="https://x.com/karpathy/status/2030371219518931079"><img src="/images/posts/autoresearch/karpathy-tweet.png"
    alt="Andrej Karpathy announcing autoresearch on X"></a>
</figure>

<p>Udit Goenka built a <a href="https://github.com/uditgoenka/autoresearch">Claude Code skill</a> that brings this pattern to arbitrary optimization tasks, adding a dedicated guard command to prevent regressions.</p>
<p>You define six parameters:</p>
<ul>
<li><strong>Goal</strong>: what you want to improve</li>
<li><strong>Scope</strong>: which files the agent can modify</li>
<li><strong>Metric</strong>: a number extracted from a shell command (line count, test coverage, score)</li>
<li><strong>Direction</strong>: whether higher or lower is better</li>
<li><strong>Verify</strong>: the command that produces the metric</li>
<li><strong>Guard</strong>: a safety net command that must always pass</li>
</ul>
<p>Each iteration follows the same cycle: modify, commit to git, verify the metric, run the guard, keep or revert. Every experiment gets committed before verification, so rollbacks are clean. It tracks results in a TSV log and reads its own git history to avoid repeating failed approaches.</p>
<p>The separation between metric and guard is what makes this work. The metric tells autoresearch &ldquo;did we make progress?&rdquo; while the guard tells it &ldquo;did we break anything?&rdquo; Keeping those independent lets the loop optimize aggressively while the guard catches regressions.</p>
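<p>The cycle is easy to picture as code. Here&rsquo;s a minimal sketch of the loop &mdash; the names (<code>Experiment</code>, <code>optimize</code>) are mine, not the skill&rsquo;s, and the real skill commits and reverts through git rather than in-memory callbacks:</p>

```typescript
// Illustrative sketch of the autoresearch loop, not the skill's actual code.
type Experiment = {
  apply: () => void;   // modify the scoped file (the real skill also commits)
  revert: () => void;  // roll back when the change doesn't earn its keep
};

function optimize(
  experiments: Experiment[],
  metric: () => number,   // verify: extract a single number
  guard: () => boolean,   // safety net: must always pass
  lowerIsBetter = true,
): number {
  let best = metric();
  for (const exp of experiments) {
    exp.apply();
    const score = metric();
    const improved = lowerIsBetter ? score < best : score > best;
    if (improved && guard()) {
      best = score;        // keep: progress without regression
    } else {
      exp.revert();        // revert: no progress, or the guard failed
    }
  }
  return best;
}

// Toy run: `lines` stands in for the reference matrix's line count.
let lines = 421;
const finalLines = optimize(
  [
    { apply: () => (lines -= 84), revert: () => (lines += 84) },   // dedupe pass
    { apply: () => (lines -= 300), revert: () => (lines += 300) }, // too aggressive
  ],
  () => lines,
  () => lines >= 300, // stand-in for the quality eval
);
console.log(finalLines); // 337: the aggressive cut was reverted by the guard
```

<p>Note that the guard runs only after the metric improves; a change that helps the metric but fails the guard is rolled back, which is exactly the behavior that makes aggressive optimization safe.</p>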
<h2 id="setting-up-the-prompt-optimization-experiment">Setting Up the Prompt Optimization Experiment</h2>
<h3 id="scope">Scope</h3>
<p>I scoped autoresearch to a single file — the reference matrix itself. I didn&rsquo;t want it touching the agent&rsquo;s prompt instructions, the recommendation library, or the eval infrastructure. Just the example data.</p>
<p>The alternative was to also let it modify the agent prompt, changing how the matrix is described. But I wanted to isolate the variable: same prompt instructions, same recommendation library, just less example data.</p>
<h3 id="metric">Metric</h3>
<p>For the metric, I used line count. It&rsquo;s simple, deterministic, and directly measures what we care about — how much data gets injected into the prompt. The metric doesn&rsquo;t measure quality at all. That&rsquo;s the guard&rsquo;s job.</p>
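<p>The verify command can be as small as <code>wc -l</code> on the matrix file. A TypeScript equivalent, using a temp file as a stand-in for the real matrix path:</p>

```typescript
import { readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Count non-empty lines: the same number `wc -l` would report for a
// newline-terminated file.
function lineCount(path: string): number {
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.length > 0).length;
}

// Stand-in for the real reference-matrix file.
const matrixPath = join(tmpdir(), "reference-matrix.csv");
writeFileSync(matrixPath, "input,output,recommendation\nA,X,r1\nB,Y,r2\n");
console.log(lineCount(matrixPath)); // 3
```
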
<h3 id="guard">Guard</h3>
<p>The quality gate was our existing golden-benchmark eval:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#f5e0dc">EVAL_QUIET</span><span style="color:#89dceb;font-weight:bold">=</span><span style="color:#89dceb">true</span> npx vitest run --config vitest.evals.config.ts <span style="color:#89b4fa">\
</span></span></span><span style="display:flex;"><span><span style="color:#89b4fa"></span>  src/evals/matrix-generation/golden-benchmark.test.ts
</span></span></code></pre></div><p>This eval feeds all 166 input categories into the agent, runs the full matrix generation end-to-end via LLM, and compares the output against golden reference data across four dimensions:</p>
<ol>
<li><strong>Input category coverage</strong> (did it produce rows for every input category?)</li>
<li><strong>Output category accuracy</strong> (correct category assignment?)</li>
<li><strong>Recommendation overlap</strong> (right recommendations from the library?)</li>
<li><strong>Assignment accuracy</strong> (correct responsible party?)</li>
</ol>
<p>The guard must exit 0 (all Vitest assertions pass) for a change to be kept. Each guard run took 5-7 minutes because it made a real LLM API call to generate the full matrix.</p>
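<p>To make the four dimensions concrete, here&rsquo;s a sketch of the kind of comparison the eval performs. The <code>Row</code> shape and the 0.3/0.3/0.2/0.2 weights are illustrative assumptions, not the production eval, which is a Vitest suite making live LLM calls:</p>

```typescript
// Illustrative shape of a matrix row for the golden comparison.
type Row = { input: string; output: string; recs: string[]; owner: string };

// Score `actual` output against `golden` reference data across four dimensions.
// Weights are hypothetical, chosen only to show the mechanics.
function scoreAgainstGolden(actual: Row[], golden: Row[]): number {
  const byInput = new Map(actual.map((r): [string, Row] => [r.input, r]));
  let covered = 0, category = 0, overlap = 0, assignment = 0;
  for (const g of golden) {
    const a = byInput.get(g.input);
    if (!a) continue;                        // 1. input category coverage
    covered++;
    if (a.output === g.output) category++;   // 2. output category accuracy
    const hits = g.recs.filter((r) => a.recs.includes(r)).length;
    overlap += g.recs.length ? hits / g.recs.length : 1; // 3. recommendation overlap
    if (a.owner === g.owner) assignment++;   // 4. assignment accuracy
  }
  const n = golden.length;
  return (covered / n) * 0.3 + (category / n) * 0.3 + (overlap / n) * 0.2 + (assignment / n) * 0.2;
}

const golden: Row[] = [
  { input: "Late payments", output: "Billing", recs: ["r1", "r2"], owner: "vendor" },
  { input: "Expired certs", output: "Security", recs: ["r3"], owner: "client" },
];
const actual: Row[] = [
  { input: "Late payments", output: "Billing", recs: ["r1"], owner: "vendor" },
  { input: "Expired certs", output: "Security", recs: ["r3"], owner: "client" },
];
console.log(scoreAgainstGolden(actual, golden)); // ≈ 0.95: only overlap is penalized
```
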
<h2 id="the-65-iteration-run">The 65-Iteration Run</h2>
<p>I ran three rounds:</p>
<table>
  <thead>
      <tr>
          <th>Round</th>
          <th>Iterations</th>
          <th>Lines</th>
          <th>Focus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>25</td>
          <td>421 → 337</td>
          <td>Easy wins: exact duplicates, severity level consolidation, frequency variants</td>
      </tr>
      <tr>
          <td>2</td>
          <td>25</td>
          <td>337 → 255</td>
          <td>Deeper cuts: multi-row category groups, shared high-severity rows</td>
      </tr>
      <tr>
          <td>3</td>
          <td>15</td>
          <td>255 → 197</td>
          <td>Aggressive: most multi-row groups reduced to 1-2 representatives</td>
      </tr>
  </tbody>
</table>
<p>Each iteration took 6-8 minutes (mostly the guard eval). Total wall time was roughly 7-8 hours across three rounds.</p>
<p>The agent&rsquo;s approach was systematic. In round 1, it found the free wins: 5 exact duplicate rows, frequency variants (7 rows for different weekly frequencies that could collapse to 2), and severity levels where lower severity was always a subset of moderate. In later rounds, it got more aggressive, reducing most multi-row category groups to single representative entries.</p>
<h2 id="results-after-fixing-the-eval-baseline">Results After Fixing the Eval Baseline</h2>
<p>During the run, I discovered that the original eval had a self-referencing bug. Both the agent prompt and the eval&rsquo;s golden comparison data imported from the same <code>REFERENCE_MATRIX_CSV</code> constant. Every time autoresearch shrank the reference matrix, it also shrank what the eval compared against. The eval was proving &ldquo;the model can reproduce a smaller matrix&rdquo; rather than &ldquo;the model handles all real-world input categories correctly.&rdquo;</p>
<p>The fix was straightforward. I split the data into two files:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// src/data/reference-matrix.ts — injected into the agent prompt (optimized)
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">export</span> <span style="color:#cba6f7">const</span> REFERENCE_MATRIX_CSV <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#a6e3a1">`...`</span>; <span style="color:#6c7086;font-style:italic">// 303 lines after optimization
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>
</span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// src/data/golden-reference-matrix.ts — used by eval (immutable)
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">export</span> <span style="color:#cba6f7">const</span> GOLDEN_REFERENCE_MATRIX_CSV <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#a6e3a1">`...`</span>; <span style="color:#6c7086;font-style:italic">// original 421 lines, never changes
</span></span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// src/evals/matrix-generation/golden-benchmark.test.ts
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// Before: import { REFERENCE_MATRIX_CSV } from &#39;../../data/reference-matrix&#39;;
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">import</span> { GOLDEN_REFERENCE_MATRIX_CSV } <span style="color:#cba6f7">from</span> <span style="color:#a6e3a1">&#39;../../data/golden-reference-matrix&#39;</span>;
</span></span></code></pre></div><p>With the fixed eval, I binary-searched through the git history to find the optimal size. Because autoresearch commits every experiment, the full optimization history was available to test against the corrected eval.</p>
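<p>The search itself is simple because quality degrades roughly monotonically as lines are removed: find the last commit that still clears the 98% bar. A sketch, with <code>evalScore</code> standing in for checking out a commit&rsquo;s matrix and re-running the fixed eval (a mock here, since the real check is a 5-7 minute LLM run):</p>

```typescript
// One entry per autoresearch commit, ordered oldest -> newest
// (lines monotonically decreasing).
type Commit = { sha: string; lines: number };

// Binary search for the latest commit whose eval score still meets the bar.
function lastPassing(
  commits: Commit[],
  evalScore: (c: Commit) => number,
  bar = 0.98,
): Commit | null {
  let lo = 0, hi = commits.length - 1;
  let best: Commit | null = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (evalScore(commits[mid]) >= bar) {
      best = commits[mid]; // passes: try a smaller matrix (later commit)
      lo = mid + 1;
    } else {
      hi = mid - 1;        // fails: back off to an earlier, larger matrix
    }
  }
  return best;
}

// Toy history mirroring the sizes from this run; shas are placeholders.
const history: Commit[] = [
  { sha: "a1", lines: 421 }, { sha: "b2", lines: 337 }, { sha: "c3", lines: 308 },
  { sha: "d4", lines: 303 }, { sha: "e5", lines: 297 }, { sha: "f6", lines: 197 },
];
const mockScore = (c: Commit) => (c.lines >= 303 ? 0.981 : 0.977);
console.log(lastPassing(history, mockScore)?.lines); // 303
```

<p>With 65 commits and a 5-7 minute eval, binary search needs about seven runs instead of sixty-five, which is the difference between an afternoon and a re-run of the whole optimization.</p>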
<table>
  <thead>
      <tr>
          <th>Lines</th>
          <th>Reduction</th>
          <th>Overall Score</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>421</td>
          <td>0%</td>
          <td>~99.9%</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>337</td>
          <td>-20%</td>
          <td>99.1%</td>
          <td>Above 98%</td>
      </tr>
      <tr>
          <td>308</td>
          <td>-27%</td>
          <td>98.5%</td>
          <td>Above 98%</td>
      </tr>
      <tr>
          <td>305</td>
          <td>-28%</td>
          <td>98.4%</td>
          <td>Above 98%</td>
      </tr>
      <tr>
          <td><strong>303</strong></td>
          <td><strong>-28%</strong></td>
          <td><strong>98.1%</strong></td>
          <td><strong>Sweet spot</strong></td>
      </tr>
      <tr>
          <td>297</td>
          <td>-29%</td>
          <td>97.7%</td>
          <td>Below 98%</td>
      </tr>
      <tr>
          <td>283</td>
          <td>-33%</td>
          <td>96.9%</td>
          <td>Below 98%</td>
      </tr>
      <tr>
          <td>255</td>
          <td>-39%</td>
          <td>~96%</td>
          <td>Below 98%</td>
      </tr>
      <tr>
          <td>197</td>
          <td>-53%</td>
          <td>95.5%</td>
          <td>Too aggressive</td>
      </tr>
  </tbody>
</table>
<p>The sweet spot is 303 lines: a 28% reduction maintaining 98%+ overall quality. Quality dropped below the 98% bar around iteration 35, when the agent removed shared high-severity rows that contained unique recommendation mappings.</p>
<p>At 303 lines, the score breakdown:</p>
<ul>
<li><strong>Input category coverage:</strong> 100% (all 166 golden categories present)</li>
<li><strong>Output category accuracy:</strong> 100%</li>
<li><strong>Recommendation overlap:</strong> 90.5% (about 32 specific recommendations lost)</li>
<li><strong>Assignment accuracy:</strong> 99.7%</li>
<li><strong>Overall weighted score:</strong> 98.1%</li>
</ul>
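<p>The overall number is a weighted combination of those four dimensions. The weights below are hypothetical (the real weighting isn&rsquo;t published here), but they show how perfect scores on heavily weighted dimensions can absorb the 90.5% overlap:</p>

```typescript
// Hypothetical weights, chosen only to illustrate the weighted-average mechanics.
const scores = { coverage: 100, categoryAccuracy: 100, recOverlap: 90.5, assignment: 99.7 };
const weights = { coverage: 0.3, categoryAccuracy: 0.3, recOverlap: 0.2, assignment: 0.2 };

const overall = (Object.keys(scores) as (keyof typeof scores)[])
  .reduce((sum, dim) => sum + scores[dim] * weights[dim], 0);

console.log(overall.toFixed(1)); // "98.0" under these illustrative weights
```
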
<p>The main quality cost is recommendation overlap. The model still covers all input categories and assigns output categories correctly, but produces slightly fewer recommendation rows per category. For this use case, that&rsquo;s an acceptable tradeoff: 118 fewer lines in every prompt for a 1.8-point quality reduction.</p>
<h2 id="what-this-reveals-about-llms-and-reference-data">What This Reveals About LLMs and Reference Data</h2>
<p>The most useful finding isn&rsquo;t the 28% number. It&rsquo;s the degradation curve.</p>
<p>Even at 197 lines (53% cut), the model still hit 95.5%. It correctly covered all input categories and most output categories. The recommendation library (a separate 337-entry file in the prompt) carries much of the mapping knowledge. The reference matrix turned out to be more &ldquo;example gallery&rdquo; than &ldquo;source of truth.&rdquo; The model uses it to learn output patterns, not to look up specific mappings.</p>
<p>This has implications for any system that injects large reference data into prompts. The model likely doesn&rsquo;t need all of it. But you need a correct eval to find the actual boundary, and the degradation is gradual, not a cliff. Without a quality gate, you won&rsquo;t know where that boundary is until users report problems.</p>
<h2 id="lessons-for-running-autonomous-optimization-loops">Lessons for Running Autonomous Optimization Loops</h2>
<h3 id="the-guard-is-what-makes-it-work">The guard is what makes it work</h3>
<p>Without a quality gate, autoresearch is a deletion loop. The guard is the only thing preventing it from removing everything. This sounds obvious until you see how easy it is to write a guard that doesn&rsquo;t actually guard.</p>
<h3 id="separate-your-optimization-target-from-your-eval-baseline">Separate your optimization target from your eval baseline</h3>
<p>If your golden data is the same data you&rsquo;re optimizing, you&rsquo;ll always pass. This is easy to do when the reference data serves a dual purpose (prompt injection and eval comparison). Split them from the start. The optimization target is mutable. The eval baseline is immutable.</p>
<h3 id="git-as-memory-enables-post-hoc-analysis">Git-as-memory enables post-hoc analysis</h3>
<p>Autoresearch commits every experiment before verification. This is a form of <a href="/posts/how-ai-agents-remember-things/">agent memory</a> that pays off after the run ends: I was able to binary-search through the history after fixing the eval, finding the exact commit where quality degraded. Without that history, I would have had to re-run the entire optimization from scratch.</p>
<h3 id="guard-speed-determines-iteration-budget">Guard speed determines iteration budget</h3>
<p>Fast guards (line count, type checks, unit tests) enable hundreds of iterations overnight. Slow guards (LLM-based evals, end-to-end tests) limit you to 10-15 iterations per hour. Plan your guard complexity based on how many iterations you can afford.</p>
<h2 id="applying-this-pattern-to-other-prompt-components">Applying This Pattern to Other Prompt Components</h2>
<p>The recommendation library (a separate 337-entry reference file also injected into every prompt) is the next candidate for the same treatment. Same loop, same approach, but with the eval separation built in from the start.</p>
<p>The pattern generalizes to any prompt optimization problem: define the metric, build a correct guard, let the agent loop. The constraint is always the guard. A guard that looks correct but measures the wrong thing is worse than no guard at all.</p>
<p>I built a one-page scorecard based on the four layers of agent evaluation — component testing, trajectory visibility, outcome measurement, and production monitoring. It takes two minutes and shows you where your gaps are. <a href="/agent-eval-scorecard/">Get the Agent Eval Scorecard →</a></p>
<p>If you&rsquo;re past the scorecard stage and want hands-on help with eval design or prompt optimization, <a href="/ai-agents/">let&rsquo;s talk</a>.</p>
<h2 id="additional-reading">Additional Reading</h2>
<ul>
<li><a href="https://github.com/uditgoenka/autoresearch">autoresearch Claude Code skill</a> by Udit Goenka</li>
<li><a href="https://github.com/karpathy/autoresearch">autoresearch</a> by Andrej Karpathy</li>
</ul>
]]></content:encoded></item><item><title>Four Dimensions of Agent-Ready Codebase Design</title><link>https://www.damiangalarza.com/posts/2026-03-25-four-patterns-that-separate-agent-ready-codebases/</link><pubDate>Wed, 25 Mar 2026 00:00:00 -0400</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-03-25-four-patterns-that-separate-agent-ready-codebases/</guid><description>AI agents produce better output when the codebase is ready for them. Here are the four dimensions of codebase readiness that account for most of the gap.</description><content:encoded><![CDATA[<p>When an AI agent rewrites a file and the result doesn&rsquo;t match your conventions, the first move is usually to adjust the prompt. Try different instructions. Add more context to the message. Maybe switch models.</p>
<p>The model is rarely the bottleneck. The codebase is.</p>
<p>The same model, pointed at a codebase with strong tests, clear architecture, and good documentation, produces remarkably consistent output. Point it at a codebase with weak coverage, no architecture docs, and no linting, and you get drift. Not because the model is less capable, but because it has less to work with.</p>
<p>I built the <a href="/codebase-readiness/">Codebase Readiness Assessment</a> to make this measurable. It scores your repo across eight dimensions on a 0-100 scale. But you don&rsquo;t need to run the assessment to understand what separates high-scoring codebases from low-scoring ones. Four dimensions account for most of the gap.</p>
<h2 id="test-foundation">Test Foundation</h2>
<p>Test foundation carries the most weight in the assessment (25%) because it&rsquo;s the single biggest lever for agent output quality.</p>
<h3 id="what-a-low-score-looks-like">What a low score looks like</h3>
<p>An agent makes a change. There are no tests covering that area, so it moves on. The change compiles, maybe even runs, but it broke an assumption three modules away. Nobody finds out until a human reviews the PR, or worse, until production.</p>
<p>I&rsquo;ve seen this repeatedly: teams with 30-40% test coverage ask an agent to refactor a service object. The agent produces clean code that looks right. But there&rsquo;s no spec for the edge case where a nil association triggers a downstream error. The agent had no way to catch it because there&rsquo;s no test to fail.</p>
<p>The other failure mode is slow tests. If your suite takes 20 minutes, the agent can&rsquo;t iterate. It makes a change, waits, discovers the failure, tries again, waits again. In a fast suite, that feedback cycle takes seconds. In a slow one, the agent burns time and money waiting for results.</p>
<h3 id="what-a-high-score-looks-like">What a high score looks like</h3>
<p>Codebases that score well here share a few characteristics:</p>
<ul>
<li><strong>Coverage above 70% on critical paths.</strong> Not 100% everywhere, but thorough coverage on the code that matters: domain logic, service objects, API endpoints. The agent can make changes and get immediate confirmation that nothing broke.</li>
<li><strong>Suite runs in under 5 minutes.</strong> Fast enough that the agent can run tests after every meaningful change, not just at the end.</li>
<li><strong>Deterministic results.</strong> No flaky tests. When the suite says green, it means green. Agents can&rsquo;t distinguish between a flaky failure and a real one, so flaky tests teach agents to ignore failures.</li>
</ul>
<h3 id="dont-stop-at-unit-tests">Don&rsquo;t stop at unit tests</h3>
<p>Unit tests on service objects and models are the foundation, but they only verify isolated behavior. An agent that passes all unit tests can still break a user-facing workflow that spans multiple components.</p>
<p>End-to-end tests give agents confidence across entire flows. A system spec that signs a user in, submits a form, and checks the result tells the agent whether the <em>feature</em> works, not just whether a method returns the right value. This is especially valuable when agents make changes that touch controllers, views, and services in the same PR.</p>
<p>Here&rsquo;s a simplified system spec from one of my Rails projects. It covers the core user journey: signing in and submitting a video idea for validation.</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-ruby" data-lang="ruby"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"># spec/system/idea_submission_spec.rb</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f9e2af">RSpec</span><span style="color:#89dceb;font-weight:bold">.</span>describe <span style="color:#a6e3a1">&#34;Idea submission&#34;</span> <span style="color:#cba6f7">do</span>
</span></span><span style="display:flex;"><span>  it <span style="color:#a6e3a1">&#34;allows a signed-in user to submit a video idea&#34;</span> <span style="color:#cba6f7">do</span>
</span></span><span style="display:flex;"><span>    user <span style="color:#89dceb;font-weight:bold">=</span> create(<span style="color:#a6e3a1">:user</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    sign_in_as(user, <span style="color:#a6e3a1">path</span>: new_idea_path)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb">select</span> user<span style="color:#89dceb;font-weight:bold">.</span>channels<span style="color:#89dceb;font-weight:bold">.</span>first<span style="color:#89dceb;font-weight:bold">.</span>name, <span style="color:#a6e3a1">from</span>: <span style="color:#a6e3a1">&#34;Channel&#34;</span>
</span></span><span style="display:flex;"><span>    fill_in <span style="color:#a6e3a1">&#34;Title&#34;</span>, <span style="color:#a6e3a1">with</span>: <span style="color:#a6e3a1">&#34;Building a Rails AI Agent from Scratch&#34;</span>
</span></span><span style="display:flex;"><span>    fill_in <span style="color:#a6e3a1">&#34;Description&#34;</span>, <span style="color:#a6e3a1">with</span>: <span style="color:#a6e3a1">&#34;Step-by-step tutorial on building an AI agent&#34;</span>
</span></span><span style="display:flex;"><span>    fill_in <span style="color:#a6e3a1">&#34;Category&#34;</span>, <span style="color:#a6e3a1">with</span>: <span style="color:#a6e3a1">&#34;AI Coding&#34;</span>
</span></span><span style="display:flex;"><span>    click_button <span style="color:#a6e3a1">&#34;Validate Idea&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    expect(page)<span style="color:#89dceb;font-weight:bold">.</span>to have_content(<span style="color:#a6e3a1">&#34;Building a Rails AI Agent from Scratch&#34;</span>)
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">end</span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">end</span>
</span></span></code></pre></div><p>This test touches authentication, the form UI, the controller, the background job, and the results page. If an agent breaks any part of that chain, this spec catches it.</p>
<p>The tradeoff is speed. End-to-end tests are slower and more brittle than unit tests. You don&rsquo;t need full E2E coverage, but having system specs on your critical user journeys (signup, checkout, the core action your product is built around) gives agents a safety net that unit tests alone can&rsquo;t provide.</p>
<h3 id="the-smallest-change-that-moves-the-needle">The smallest change that moves the needle</h3>
<p>Add coverage to your critical paths first. Don&rsquo;t chase a coverage number. Instead, identify the three or four service objects or domain models where bugs would hurt the most, and write specs for those. Then add one or two system specs covering your most important user journeys end-to-end. If your suite is slow, add parallel test execution. In a Rails app, that might be as simple as adding the <code>parallel_tests</code> gem. A suite that goes from 15 minutes to 4 minutes fundamentally changes how an agent can work with your code. If you&rsquo;re running multiple agents in parallel, you&rsquo;ll also need <a href="/posts/2026-03-10-extending-claude-code-worktrees-for-true-database-isolation/">database isolation per worktree</a> to prevent test data collisions.</p>
<p>If you want to accelerate the process, tools like <a href="https://github.com/uditgoenka/autoresearch">autoresearch</a> apply this pattern as an autonomous loop: give the agent a measurable goal (like a coverage target), and it iterates, verifies, keeps what works, and discards what doesn&rsquo;t.</p>
<h2 id="documentation-as-code">Documentation as Code</h2>
<p>Documentation carries 15% of the assessment weight, but in practice it&rsquo;s the dimension where I see the biggest gap between teams that get good agent output and teams that don&rsquo;t.</p>
<h3 id="what-a-low-score-looks-like-1">What a low score looks like</h3>
<p>Without an agent-facing entry point (a <code>CLAUDE.md</code>, <code>AGENTS.md</code>, or equivalent), an agent has to reverse-engineer your conventions from the code itself. It reads your files, infers patterns, and guesses at intent. Sometimes it guesses right. Often it doesn&rsquo;t.</p>
<p>Here&rsquo;s a concrete example. A Rails app uses service objects for all business logic. Controllers call a service, the service does the work, and the result gets rendered. There&rsquo;s nothing enforcing this in the framework. It&rsquo;s a team convention. An agent that doesn&rsquo;t know about this convention puts the logic directly in the controller action. The code works. The tests pass. But it breaks the team&rsquo;s pattern, and now there&rsquo;s a 50-line controller action that should have been a service object.</p>
<p>The agent wasn&rsquo;t wrong. It had no way to know.</p>
<h3 id="what-a-high-score-looks-like-1">What a high score looks like</h3>
<p>The key insight is that this entry point file should be a map, not a manual. OpenAI&rsquo;s Harness Engineering team <a href="https://openai.com/index/harness-engineering/">learned this the hard way</a>: they tried a single large instruction file and it failed because &ldquo;context is a scarce resource&rdquo; and &ldquo;too much guidance becomes non-guidance.&rdquo; When everything is marked important, agents pattern-match locally instead of navigating intentionally.</p>
<p>Their solution: keep the entry file short (roughly 100 lines) and treat it as a table of contents that points to deeper sources of truth in a structured <code>docs/</code> directory. The entry file gives agents quick commands and a documentation map. The detail lives in dedicated files the agent reads when it needs them. Whether you call it <code>CLAUDE.md</code>, <code>AGENTS.md</code>, or <code>CURSOR.md</code>, the pattern is the same.</p>
<p>Here&rsquo;s what this looks like in practice from one of my Rails projects:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-markdown" data-lang="markdown"><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold">## Quick Commands
</span></span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold"></span>
</span></span><span style="display:flex;"><span>bin/dev                                # Start dev server
</span></span><span style="display:flex;"><span>bin/rails spec                         # All tests
</span></span><span style="display:flex;"><span>bin/ci                                 # Full CI: lint + security + tests
</span></span><span style="display:flex;"><span>bin/rubocop                            # Lint
</span></span><span style="display:flex;"><span>bin/brakeman                           # Security scan
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold">## Documentation Map
</span></span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold"></span>
</span></span><span style="display:flex;"><span>| Topic | Document |
</span></span><span style="display:flex;"><span>|-------|----------|
</span></span><span style="display:flex;"><span>| Stack, patterns, domain model | docs/ARCHITECTURE.md |
</span></span><span style="display:flex;"><span>| Testing patterns and stack | docs/TESTING.md |
</span></span><span style="display:flex;"><span>| Credentials, env vars, API keys | docs/CONFIGURATION.md |
</span></span><span style="display:flex;"><span>| Engineering principles | docs/design-docs/core-beliefs.md |
</span></span><span style="display:flex;"><span>| Architecture decision records | docs/design-docs/ |
</span></span></code></pre></div><p>The agent gets commands and a map up front. When it needs to understand the domain model or testing conventions, it follows the pointer. This is progressive disclosure: the agent starts with what it needs immediately and loads deeper context on demand.</p>
<p>Here&rsquo;s a trimmed excerpt from the <code>ARCHITECTURE.md</code> behind that pointer:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-markdown" data-lang="markdown"><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold">## Domain Model
</span></span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold"></span>
</span></span><span style="display:flex;"><span>CreatorSignal validates YouTube video ideas. The core flow:
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">1.</span> User submits a video <span style="font-weight:bold">**Idea**</span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">2.</span> A <span style="font-weight:bold">**Validation**</span> job is enqueued
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">3.</span> The <span style="font-weight:bold">**ResearchAgent**</span> runs tools against YouTube, Reddit, X, and HN
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">4.</span> Results are synthesized into a scored <span style="font-weight:bold">**Go / Refine / Kill**</span> verdict
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold">### Key Models
</span></span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold"></span>
</span></span><span style="display:flex;"><span>| Model | Responsibility |
</span></span><span style="display:flex;"><span>|-------|---------------|
</span></span><span style="display:flex;"><span>| <span style="color:#a6e3a1">`User`</span> | Authentication, subscription plan |
</span></span><span style="display:flex;"><span>| <span style="color:#a6e3a1">`Idea`</span> | A video idea submitted for validation |
</span></span><span style="display:flex;"><span>| <span style="color:#a6e3a1">`Validation`</span> | One run of the research agent against an idea |
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold">### Project Structure
</span></span></span><span style="display:flex;"><span><span style="color:#fab387;font-weight:bold"></span>
</span></span><span style="display:flex;"><span>app/
</span></span><span style="display:flex;"><span>├── components/       # ViewComponent components
</span></span><span style="display:flex;"><span>├── controllers/
</span></span><span style="display:flex;"><span>├── jobs/             # ActiveJob jobs (async validation)
</span></span><span style="display:flex;"><span>├── models/
</span></span><span style="display:flex;"><span>├── services/         # Research agent, tool orchestration
</span></span><span style="display:flex;"><span>└── views/            # Hotwire (Turbo frames/streams)
</span></span></code></pre></div><p>An agent reading this knows what an <code>Idea</code> is, that validation is async through a job, and that orchestration logic lives in <code>app/services/</code>. Those are the conventions that prevent drift.</p>
<p>ADRs (Architecture Decision Records) add a layer that documentation alone can&rsquo;t. An agent that understands <em>why</em> a particular pattern was chosen can make better decisions when extending it. If your ADR says &ldquo;we chose event sourcing for the billing domain because of auditability requirements,&rdquo; the agent won&rsquo;t try to refactor billing into simple CRUD.</p>
<h3 id="the-smallest-change-that-moves-the-needle-1">The smallest change that moves the needle</h3>
<p>Create an <code>AGENTS.md</code> in your project root with two things: commands (build, test, lint) and a documentation map pointing to deeper files. <a href="https://agents.md/"><code>AGENTS.md</code></a> is an emerging standard supported by Codex, Cursor, Gemini CLI, GitHub Copilot, Windsurf, Devin, and <a href="https://agents.md/">many others</a>. If you&rsquo;re using Claude Code, symlink <code>CLAUDE.md</code> to it so both resolve to the same file. Then create an <code>ARCHITECTURE.md</code> covering your stack, domain model, and key conventions. This can take an hour and the effect on agent output is immediate. If you want to automate the scaffolding, the <a href="https://github.com/dgalarza/claude-code-workflows">agent-ready plugin</a> generates a starting point based on your existing codebase.</p>
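<p>A minimal version might look like this (the commands and paths are illustrative; adapt them to your own stack):</p>

```markdown
# AGENTS.md

## Commands

- Setup: `bin/setup`
- Test: `bin/rspec`
- Lint: `bin/rubocop`
- Full CI: `bin/ci`

## Documentation Map

| Topic | Where to look |
|-------|---------------|
| Stack, domain model, conventions | ARCHITECTURE.md |
| Architecture decision records | docs/design-docs/ |
```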
<h2 id="architecture-clarity">Architecture Clarity</h2>
<p>Architecture clarity carries 15% of the assessment weight. It measures whether an agent can understand where code belongs and how components relate to each other.</p>
<h3 id="what-a-low-score-looks-like-2">What a low score looks like</h3>
<p>Agents replicate patterns they find in the codebase. If your codebase has clear boundaries (controllers handle HTTP, services handle business logic, models handle persistence), the agent follows those boundaries. If your codebase mixes concerns, the agent mixes concerns.</p>
<p>The most common failure I see: a controller that does everything. It validates input, calls the database, sends emails, enqueues jobs. An agent asked to add a new feature looks at the existing controller, sees that&rsquo;s where logic goes, and adds more logic to the controller. The agent is doing exactly what the codebase taught it to do.</p>
<p>The subtler version is dependency direction. In a well-layered app, dependencies point inward: controllers depend on services, services depend on models. When that direction is inconsistent (models importing from controllers, services reaching into HTTP request objects), agents produce code with the same tangled dependencies.</p>
<h3 id="what-a-high-score-looks-like-2">What a high score looks like</h3>
<ul>
<li><strong>Clear layering.</strong> Each layer has a single responsibility, and the codebase is consistent about which layer owns what.</li>
<li><strong>Domain namespacing.</strong> Related functionality is grouped by business domain, not just by technical layer. Instead of a flat <code>app/services/</code> with 40 files, you have <code>app/services/billing/</code>, <code>app/services/onboarding/</code>, <code>app/services/research/</code>. When an agent needs to add billing logic, the namespace tells it exactly where to look and what patterns to follow.</li>
<li><strong>Predictable file organization.</strong> A new developer (or agent) can guess where a piece of code lives based on what it does.</li>
<li><strong>Dependency direction is consistent.</strong> Inner layers don&rsquo;t reach outward. You don&rsquo;t see models importing controller concerns.</li>
</ul>
<p>Domain namespacing is especially powerful for agents because it constrains the search space. An agent working on a billing feature only needs to understand the billing namespace, not the entire codebase. It finds the existing patterns in that namespace and replicates them. Without namespacing, the agent has to scan the whole codebase to figure out where billing logic lives, and it might find three different patterns in three different places.</p>
<h3 id="the-smallest-change-that-moves-the-needle-2">The smallest change that moves the needle</h3>
<p>If you have fat controllers, extract one. Pick your most complex controller action, pull the business logic into a service object, and write a spec for it. The agent will start using that service object pattern for new features. One well-structured example teaches the agent more than any documentation, because it&rsquo;s a pattern it can directly replicate.</p>
<p>If your codebase has grown past a handful of services, start namespacing by domain. Group related services, jobs, and models under a shared namespace. This compounds quickly: once you have three or four service objects under <code>Billing::</code>, agents start producing new billing code in the same namespace by default. The codebase becomes self-reinforcing.</p>
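<p>As a sketch of what that pattern looks like (the <code>Billing::RecordPayment</code> name and its attributes are hypothetical, not taken from the project above):</p>

```ruby
# app/services/billing/record_payment.rb
# A namespaced service object: one public entry point, explicit inputs,
# and no knowledge of HTTP or rendering concerns.
module Billing
  class RecordPayment
    Result = Struct.new(:success, :payment, :error, keyword_init: true) do
      def success?
        success
      end
    end

    def initialize(account:, amount_cents:)
      @account = account
      @amount_cents = amount_cents
    end

    def call
      return Result.new(success: false, error: "amount must be positive") if @amount_cents <= 0

      payment = @account.record!(@amount_cents)
      Result.new(success: true, payment: payment)
    end
  end
end
```

<p>A controller action then shrinks to building the service, calling it, and rendering based on the result, which is exactly the shape an agent will copy for the next feature.</p>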
<h2 id="feedback-loops">Feedback Loops</h2>
<p>Feedback loops carry 10% of the assessment weight, but their impact is multiplicative. Good feedback loops make everything else work better. Poor ones make everything else work worse.</p>
<h3 id="what-a-low-score-looks-like-3">What a low score looks like</h3>
<p>Agents learn from the signals they get back. When the only signal is &ldquo;tests passed,&rdquo; the agent has no way to know it introduced a style violation, broke a naming convention, or used a deprecated API. It moves on, confident the change is correct.</p>
<p>Two things make feedback loops weak: <strong>narrow signals</strong> and <strong>slow signals</strong>.</p>
<p>Narrow signals mean the agent only hears from one source. Tests tell the agent whether the code works. They don&rsquo;t tell it whether the code follows your conventions, whether it introduced a security vulnerability, or whether the UI actually renders correctly. Each missing signal is a category of problems the agent can&rsquo;t self-correct.</p>
<p>Slow signals are just as damaging. If the agent has to wait 20 minutes for a CI run to discover a linting error, it&rsquo;s already moved on. It&rsquo;s built three more features on top of code that doesn&rsquo;t pass lint. Now you&rsquo;re unwinding multiple changes instead of catching the first one. The closer the feedback is to the moment of the change, the cheaper it is to fix.</p>
<p>There&rsquo;s also a hierarchy to how you enforce conventions. Anything that can be checked deterministically by a linter should be a lint rule, not a line in your <code>CLAUDE.md</code>. A lint rule catches every violation, every time. A documentation rule depends on the agent reading it and choosing to follow it. If your convention is &ldquo;methods must be under 20 lines&rdquo; or &ldquo;always use <code>frozen_string_literal</code>,&rdquo; encode it in RuboCop, ESLint, or whatever linter your stack uses. Save documentation for the things that can&rsquo;t be mechanically enforced: architectural decisions, domain context, workflow conventions.</p>
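<p>In RuboCop&rsquo;s case, both of those example conventions come down to a few lines of config:</p>

```yaml
# .rubocop.yml
Metrics/MethodLength:
  Max: 20

Style/FrozenStringLiteralComment:
  Enabled: true
  EnforcedStyle: always
```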
<h3 id="what-a-high-score-looks-like-3">What a high score looks like</h3>
<ul>
<li><strong>Pre-commit hooks for immediate feedback.</strong> The agent discovers formatting issues, type errors, or lint violations before it even commits.</li>
<li><strong>CI that runs in under 10 minutes.</strong> Fast enough that the agent can push, get feedback, and iterate without burning excessive context.</li>
<li><strong>Rich error messages.</strong> Linting output that says &ldquo;method too long (25 lines, max 20)&rdquo; is actionable. A generic &ldquo;style violation&rdquo; is not.</li>
</ul>
<p>Here&rsquo;s what a CI script looks like when it goes beyond just running tests. This is the <code>bin/ci</code> from the same Rails project:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-ruby" data-lang="ruby"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"># config/ci.rb - run with bin/ci</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f9e2af">CI</span><span style="color:#89dceb;font-weight:bold">.</span>run <span style="color:#cba6f7">do</span>
</span></span><span style="display:flex;"><span>  step <span style="color:#a6e3a1">&#34;Setup&#34;</span>, <span style="color:#a6e3a1">&#34;bin/setup --skip-server&#34;</span>
</span></span><span style="display:flex;"><span>  step <span style="color:#a6e3a1">&#34;Style: Ruby&#34;</span>, <span style="color:#a6e3a1">&#34;bin/rubocop&#34;</span>
</span></span><span style="display:flex;"><span>  step <span style="color:#a6e3a1">&#34;Security: Gem audit&#34;</span>, <span style="color:#a6e3a1">&#34;bin/bundler-audit&#34;</span>
</span></span><span style="display:flex;"><span>  step <span style="color:#a6e3a1">&#34;Security: Importmap vulnerability audit&#34;</span>, <span style="color:#a6e3a1">&#34;bin/importmap audit&#34;</span>
</span></span><span style="display:flex;"><span>  step <span style="color:#a6e3a1">&#34;Security: Brakeman code analysis&#34;</span>, <span style="color:#a6e3a1">&#34;bin/brakeman --quiet --no-pager --exit-on-warn --exit-on-error&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">end</span>
</span></span></code></pre></div><p>Five steps, each giving the agent a different kind of feedback. RuboCop catches style violations. Bundler-audit catches vulnerable gems. Brakeman catches security issues in the code itself. An agent that runs <code>bin/ci</code> gets five signals instead of one.</p>
<h3 id="browser-access-as-a-feedback-loop">Browser access as a feedback loop</h3>
<p>For web applications, there&rsquo;s a feedback loop that most teams overlook: giving agents the ability to see what they built.</p>
<p>An agent that can only run tests is working blind on anything visual. It can verify that a controller returns 200, but it can&rsquo;t tell whether the page actually renders correctly, whether a modal opens, or whether a form submits without errors. Cursor&rsquo;s team <a href="https://cursor.com/blog/agent-computer-use">wrote about this</a>: once they gave agents browser access via cloud sandboxes, agents could &ldquo;iterate until they&rsquo;ve validated their output rather than handing off the first attempt.&rdquo; More than 30% of their merged PRs are now created by agents operating autonomously in cloud sandboxes.</p>
<p>You don&rsquo;t need a full cloud sandbox to get value from this. Claude Code has <a href="https://code.claude.com/docs/en/chrome">built-in Chrome support</a> via <code>claude --chrome</code>, and tools like Playwright MCP give agents browser control locally. The agent can navigate to a page, take a snapshot of the DOM, fill in a form, and verify the result. That&rsquo;s a feedback loop that catches an entire class of issues that unit tests and linters never will.</p>
<h3 id="the-smallest-change-that-moves-the-needle-3">The smallest change that moves the needle</h3>
<p>Add a linter to your CI pipeline. For a Ruby project, that&rsquo;s RuboCop. For JavaScript/TypeScript, ESLint. For Python, Ruff. One config file, one CI step. The agent immediately starts getting feedback on style and conventions that it wouldn&rsquo;t otherwise know about.</p>
<p>If you want faster feedback, add pre-commit hooks. The agent runs into the linter before it even pushes, which means it fixes issues in the same context window where it created them. That&rsquo;s cheaper, faster, and produces cleaner commits.</p>
<p>For web projects, consider adding browser access through Playwright MCP or a similar tool. The agent starts verifying its own UI changes instead of relying on you to catch visual issues in review.</p>
<h2 id="where-to-start">Where to Start</h2>
<p>If you&rsquo;re looking at your codebase and wondering where to start, here&rsquo;s how I think about prioritization:</p>
<ol>
<li><strong>Fix your test foundation first.</strong> Without reliable tests, every other improvement is hard to verify. An agent can&rsquo;t confidently refactor your architecture if there&rsquo;s no test suite to catch regressions.</li>
<li><strong>Add an AGENTS.md.</strong> This is 30 minutes of work that immediately changes agent behavior. It&rsquo;s the highest-ROI improvement you can make.</li>
<li><strong>Add a linter to CI.</strong> This closes the feedback gap with minimal effort. The agent starts learning your conventions from automated feedback instead of guessing from code patterns.</li>
</ol>
<p>These three changes don&rsquo;t require a major initiative. They&rsquo;re individual tasks that compound. A codebase with strong tests, clear documentation, and fast feedback loops creates a reinforcing cycle: agents produce better code, which maintains the patterns, which makes future agent output even better.</p>
<p>If you want to see where your codebase stands across all eight dimensions, run the <a href="/codebase-readiness/">Codebase Readiness Assessment</a>. It takes 60 seconds and gives you a score, a per-dimension breakdown, and a prioritized roadmap.</p>
<p>If your team wants hands-on help closing these gaps, that&rsquo;s what the <a href="/services/ai-enablement/">AI Workflow Enablement program</a> is built for. Or if you just want to talk through your results, <a href="/pages/meet/">book a free intro call</a>.</p>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="/codebase-readiness/">Codebase Readiness Assessment</a> - Run the free assessment on your repo</li>
<li><a href="https://openai.com/index/harness-engineering/">Harness Engineering: Leveraging Codex in an Agent-First World</a> - OpenAI&rsquo;s deep dive on building a million-line codebase entirely with agents</li>
<li><a href="https://cursor.com/blog/agent-computer-use">Agent Computer Use</a> - How Cursor gives agents browser access to verify their own work</li>
<li><a href="/posts/2025-11-25-how-i-use-claude-code/">How I Use Claude Code: My Complete Development Workflow</a> - How codebase structure impacts agent output quality</li>
<li><a href="/posts/2026-02-05-mcps-vs-agent-skills/">MCPs vs Agent Skills</a> - Architecture decisions that shape how agents interact with your codebase</li>
</ul>
]]></content:encoded></item><item><title>How AI Agents Remember Things</title><link>https://www.damiangalarza.com/posts/2026-02-17-how-ai-agents-remember-things/</link><pubDate>Tue, 17 Feb 2026 00:00:00 -0500</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-02-17-how-ai-agents-remember-things/</guid><description>AI agents are stateless by default. Here's how memory systems actually work, covering the storage patterns, lifecycle triggers, and architecture behind agents that remember you.</description><content:encoded><![CDATA[<p>Out of the box, AI agents have no memory. Every conversation starts with a blank slate.</p>
<p>Most people assume you need vector databases, complex retrieval pipelines, or specialized memory infrastructure to fix this. But it turns out the storage is the easy part. The hard part is knowing when to write and when to load. Get that right, and the rest is just files.</p>
<div class="not-prose my-8 callout-accent">
  <p class="text-[var(--color-text-secondary)]">
    <span class="font-medium text-[var(--color-accent)]">Prefer video?</span> Watch <a href="https://youtu.be/Seu7nksZ_4k?si=Xx8wnlL6j8nsLYq5" class="text-[var(--color-accent)] underline hover:opacity-80">How AI Agents Remember Things on YouTube →</a>
  </p>
</div>
<p>I&rsquo;ll use OpenClaw as a case study here. Its memory model is one of the clearest real-world implementations I&rsquo;ve seen. But the patterns apply to any agent you build.</p>
<h2 id="why-agents-have-no-memory-by-default">Why Agents Have No Memory By Default</h2>
<p>AI models are inherently stateless. There&rsquo;s no memory between calls. What looks like a conversation is just an increasingly long transcript being passed back into the model&rsquo;s context window on each turn. Every message, every response, every tool call gets appended to the transcript and sent with the next request.</p>
<p>This works fine for a one-off question. It breaks down the moment you want an agent that knows you.</p>
<p>Memory systems handle this by splitting the problem in two: the session, and long-term memory.</p>
<h3 id="sessions">Sessions</h3>
<p>A session is the history of a single conversation with an LLM. While the conversation is active, that history gets passed along with each call, and the model can see everything said so far. But LLMs have finite context windows, and as you approach that limit, something has to give.</p>
<p>That something is compaction. Compaction takes the session&rsquo;s conversation history and condenses it down to the most important information so the conversation can continue. There are three different strategies for triggering it:</p>
<ol>
<li><strong>Count-based</strong>: compact once the conversation exceeds a certain token size or turn count</li>
<li><strong>Time-based</strong>: triggered when the user stops interacting for a period of time, handled in the background</li>
<li><strong>Event-based</strong>: an agent detects that a task or topic has concluded and triggers compaction. The most intelligent approach, but also the hardest to implement accurately</li>
</ol>
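<p>A count-based trigger is the easiest to sketch. Something like the following, where the names and thresholds are illustrative and the summarization step is stubbed out (a real system would make an LLM call there and use the model&rsquo;s actual tokenizer):</p>

```ruby
CONTEXT_LIMIT = 200_000   # tokens the model can hold
COMPACT_AT    = 0.8       # compact at 80% of the window

# Rough token estimate: ~4 characters per token for English text.
def estimate_tokens(messages)
  messages.sum { |m| m[:content].length } / 4
end

def needs_compaction?(messages)
  estimate_tokens(messages) >= CONTEXT_LIMIT * COMPACT_AT
end

# Replace older turns with a summary, keeping the most recent ones verbatim.
def compact(messages, keep_last: 5)
  summary = { role: "system",
              content: "Summary of #{messages.size - keep_last} earlier turns" }
  [summary] + messages.last(keep_last)
end
```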
<p>The shared problem with all three: you can&rsquo;t simply carry entire old conversations forward into a new session. Context windows don&rsquo;t allow it. That&rsquo;s where long-term memory comes in.</p>
<p>Think of it as a desk and a filing cabinet. The session is the messy desk, with notes scattered around and documents open. Memory is the filing cabinet where things are categorized and stored for later. When the session ends, whatever isn&rsquo;t filed is gone.</p>
<h2 id="the-memory-taxonomy">The Memory Taxonomy</h2>
<p>Google published a whitepaper in November 2025 titled <a href="https://www.kaggle.com/whitepaper-context-engineering-sessions-and-memory">&ldquo;Context Engineering: Sessions &amp; Memory&rdquo;</a> that provides a useful framework for thinking about this. It breaks agent memory into three types.</p>
<p><strong>Episodic memory</strong> covers events and interactions. &ldquo;What happened in our last conversation?&rdquo; If you spent a session debugging a webhook integration, episodic memory is what lets the agent recall that context in your next conversation.</p>
<p><strong>Semantic memory</strong> is facts and preferences. &ldquo;What do I know about this user?&rdquo; Tech stack, coding style, project conventions. These are stable facts that don&rsquo;t change much from session to session.</p>
<p><strong>Procedural memory</strong> is workflows and learned routines. &ldquo;How do I accomplish this task?&rdquo; The agent&rsquo;s understanding of your deployment process, your testing patterns, your PR review checklist.</p>
<p>All three work together to form what we&rsquo;d call an agent&rsquo;s memory. The challenge isn&rsquo;t categorizing them. It&rsquo;s extracting them from conversation and keeping them accurate over time.</p>
<h2 id="extraction-and-consolidation">Extraction and Consolidation</h2>
<p>For a memory system to be effective, it needs to extract the right things from a conversation. Not every detail is worth keeping. Targeted filtering is necessary, the same way human memory doesn&rsquo;t retain every word of a conversation. It retains key facts and decisions.</p>
<p>Beyond that, the system needs to consolidate. Consider a user who tells an agent &ldquo;I prefer dark mode&rdquo; in one session, then later says &ldquo;I like dark mode,&rdquo; and in another session mentions &ldquo;I switched to dark mode.&rdquo; Without consolidation, all three entries sit in memory saying essentially the same thing. A good memory system collapses those into a single entry: &ldquo;User prefers dark mode.&rdquo;</p>
<p>It also needs to handle updates. Something true today might not be true tomorrow. If you switch from dark mode to light mode, the memory system needs to overwrite the old entry, not append a contradictory one. Without this, memory becomes noisy and unreliable over time.</p>
<p>Both extraction and consolidation are typically handled by a separate LLM instance that takes a conversation and processes it, deciding what to keep, what to merge, and what to update.</p>
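<p>The overwrite behavior is easiest to see as a keyed store: the extraction step&rsquo;s job is to reduce each fact to a stable key, and consolidation falls out of the upsert. (This is an illustrative sketch, not any particular system&rsquo;s implementation.)</p>

```ruby
# Consolidation as upsert: facts are keyed by subject and attribute,
# so a repeated or updated observation replaces the old value
# instead of appending a contradictory entry.
class FactStore
  def initialize
    @facts = {}
  end

  # key: e.g. "user.theme"; value: e.g. "dark mode"
  def upsert(key, value)
    @facts[key] = value
  end

  def to_memory_lines
    @facts.map { |key, value| "- #{key}: #{value}" }
  end
end
```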
<h2 id="memory-storage">Memory Storage</h2>
<p>Storage itself is relatively straightforward. For local agents, markdown files work well. They&rsquo;re readable, debuggable, and require no infrastructure. For agents that need semantic search across a large history, a vector database is the right tool. The choice depends on the use case.</p>
<p>What matters more than the storage format is the shape of what you store: semantic memory for stable facts, episodic memory for events and recent context, and procedural memory for workflows.</p>
<h2 id="openclaws-memory-model">OpenClaw&rsquo;s Memory Model</h2>
<p>Let me walk through how one system actually implements this.</p>
<p>OpenClaw&rsquo;s memory system has three core components, and all of them are just markdown files.</p>
<p><strong>MEMORY.md</strong> is the semantic memory store. Stable facts, user preferences, identity information. It has a recommended 200-line cap and is organized into structured sections. The key design decision: this file is loaded into every single prompt, not retrieved on demand. The agent starts every conversation already knowing who you are.</p>
<p><strong>Daily logs</strong> are OpenClaw&rsquo;s first implementation of episodic memory. They live at <code>~/.openclaw/workspace/memory/YYYY-MM-DD.md</code> and contain recent context organized by day. They&rsquo;re append-only; new entries get added, nothing is removed. Today&rsquo;s and yesterday&rsquo;s logs are loaded at the start of each session.</p>
<p><strong>Session snapshots</strong> are the second implementation of episodic memory. When you start a new session with <code>/new</code> or <code>/reset</code>, a hook captures the last 15 meaningful messages from your conversation, filtering out tool calls, system messages, and slash commands. It&rsquo;s not a summary; it&rsquo;s the raw conversation text, saved as a markdown file with a descriptive name like <code>~/.openclaw/workspace/memory/2026-02-08-api-design.md</code>.</p>
<p>So at its core, OpenClaw&rsquo;s memory is markdown files. But the files are only half the story. Without something that reads and writes them at the right time, they&rsquo;re just sitting there doing nothing.</p>
<p>The files are the filing cabinet. What comes next are the four mechanisms that move things from the desk to the cabinet at the right moments.</p>
<h2 id="how-it-all-comes-together">How It All Comes Together</h2>
<p><strong>Mechanism 1: Bootstrap loading at session start.</strong></p>
<p>For every new conversation, MEMORY.md is automatically injected into the prompt. The agent always has it. On top of that, the agent&rsquo;s instructions tell it to read today&rsquo;s and yesterday&rsquo;s daily logs for recent context. MEMORY.md is injected by the system; the daily logs are loaded by the agent itself, following its own instructions.</p>
<p>This is the simplest pattern and the most important one. The agent doesn&rsquo;t have to search for context. It&rsquo;s just there.</p>
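<p>The bootstrap step itself is just file reading. A sketch of the pattern (the paths follow OpenClaw&rsquo;s layout, but the code is illustrative):</p>

```ruby
require "date"

# Assemble the always-loaded context for a new session: MEMORY.md
# plus today's and yesterday's daily logs, skipping files that don't exist.
def bootstrap_context(workspace)
  files = ["MEMORY.md"] +
          [Date.today, Date.today - 1].map { |d| "memory/#{d.strftime('%Y-%m-%d')}.md" }

  files.filter_map do |relative|
    path = File.join(workspace, relative)
    "## #{relative}\n\n#{File.read(path)}" if File.exist?(path)
  end.join("\n\n")
end
```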
<p><strong>Mechanism 2: Pre-compaction flush.</strong></p>
<p>OpenClaw takes a count-based approach to compaction. When a session nears the context window limit, OpenClaw injects a silent agentic turn (invisible to the user) with the following instructions:</p>
<blockquote>
<p>&ldquo;Pre-compaction memory flush. Store durable memories now (use memory/YYYY-MM-DD.md; create memory/ if needed). If nothing to store, reply with NO_REPLY.&rdquo;</p></blockquote>
<p>When the agent sees this, it writes anything worth keeping to the daily log, then replies with <code>NO_REPLY</code> so it never surfaces in the conversation.</p>
<p>This turns a destructive operation into a checkpoint. Losing context becomes a save point rather than a loss. It&rsquo;s the write-ahead log pattern: save before you lose, load when you start. The same pattern databases have used for decades, applied to agent memory.</p>
<p><strong>Mechanism 3: Session snapshot on <code>/new</code>.</strong></p>
<p>When you explicitly start a new session, a hook grabs the last chunk of your conversation, filters to meaningful messages only, and saves it with a descriptive filename. It only fires on explicit <code>/new</code> or <code>/reset</code>; closing the browser doesn&rsquo;t trigger it. It&rsquo;s an intentional save point, not an automatic backup.</p>
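<p>The filtering step is the interesting part: keep the conversation, drop the plumbing. A sketch (the message shape and filter rules here are illustrative, not OpenClaw&rsquo;s exact code):</p>

```ruby
# Keep the last N user/assistant messages, dropping tool results,
# system messages, and slash commands like /new or /reset.
def snapshot_messages(messages, keep: 15)
  messages
    .select { |m| %w[user assistant].include?(m[:role]) }
    .reject { |m| m[:content].start_with?("/") }
    .last(keep)
end
```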
<p><strong>Mechanism 4: User says &ldquo;remember this.&rdquo;</strong></p>
<p>The simplest mechanism. If you ask the agent to remember something, it determines whether it belongs in MEMORY.md as semantic memory or the daily log as episodic memory, and writes accordingly. No special hook needed, just file-writing capabilities and instructions for how to categorize.</p>
<h2 id="why-this-matters-beyond-openclaw">Why This Matters Beyond OpenClaw</h2>
<p>Claude Code recently shipped a native memory feature. It also uses markdown files. The pattern is becoming standard.</p>
<p>The agents that feel most useful, the ones that stick as part of your workflow, are the ones that remember you. An agent that asks your tech stack every session doesn&rsquo;t feel like a colleague. An agent that already knows your conventions and what you worked on yesterday does.</p>
<p>The building blocks are the same regardless of what you&rsquo;re building on: file-first storage, lifecycle triggers tied to meaningful session events, and extraction and consolidation to keep memory clean over time.</p>
<h2 id="wrapping-up">Wrapping Up</h2>
<p>OpenClaw&rsquo;s entire memory system comes down to markdown files and knowing when to write to them. Semantic memory in MEMORY.md. Episodic memory in daily logs and session snapshots. And four mechanisms that fire at the right moments in a conversation&rsquo;s lifecycle.</p>
<p>You don&rsquo;t need a complex setup to give an agent memory. You need a clear answer to three questions: what&rsquo;s worth remembering, where does it go, and when does it get written.</p>
<hr>
<h2 id="further-reading">Further Reading</h2>
<ul>
<li><a href="https://youtu.be/Seu7nksZ_4k?si=Xx8wnlL6j8nsLYq5">How AI Agents Remember Things</a>: The video companion to this post</li>
<li><a href="https://www.kaggle.com/whitepaper-context-engineering-sessions-and-memory">Context Engineering: Sessions and Memory</a>: Google&rsquo;s whitepaper on agent memory taxonomy</li>
<li><a href="/posts/2025-12-08-understanding-claude-code-context-window/">Understanding Claude Code&rsquo;s Context Window</a>: How context windows work and how to manage them</li>
<li><a href="/posts/2025-11-25-how-i-use-claude-code/">How I Use Claude Code: My Complete Development Workflow</a>: Practical patterns for AI-assisted development</li>
</ul>
<hr>
<p>Want to go deeper on agent memory and context architecture? I work with engineers and teams on designing agent systems that actually hold up. <a href="/ai-agents/">Learn more</a>.</p>
]]></content:encoded></item></channel></rss>