<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Prompt-Engineering on Damian Galarza | Software Engineering &amp; AI Consulting</title><link>https://www.damiangalarza.com/tags/prompt-engineering/</link><description>Recent posts from Damian Galarza | Software Engineering &amp; AI Consulting</description><generator>Hugo</generator><language>en-us</language><managingEditor>Damian Galarza</managingEditor><atom:link href="https://www.damiangalarza.com/tags/prompt-engineering/feed.xml" rel="self" type="application/rss+xml"/><item><title>Shrinking a Production Prompt by 28% With Autonomous Optimization</title><link>https://www.damiangalarza.com/posts/2026-04-06-autonomous-optimization-loops-with-autoresearch/</link><pubDate>Mon, 06 Apr 2026 00:00:00 -0400</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-04-06-autonomous-optimization-loops-with-autoresearch/</guid><description>How I used autoresearch to run 65 autonomous prompt optimization iterations on a production LLM agent, cutting it 28% while retaining 98% output quality.</description><content:encoded><![CDATA[<p>Every token in a production LLM prompt costs you latency, money, and <a href="/posts/understanding-claude-code-context-window/">context window</a> space. An agent I&rsquo;ve been building takes around 170 input categories and produces a detailed structured matrix as output. The system prompt includes a 421-line reference matrix as a few-shot example gallery so the model knows the expected output patterns.</p>
<p>The question was concrete. How much of this reference data does the model actually need? I used <a href="https://github.com/uditgoenka/autoresearch">uditgoenka/autoresearch</a>, a Claude Code skill based on <a href="https://github.com/karpathy/autoresearch">Andrej Karpathy&rsquo;s autoresearch</a>, to find out. After 65 autonomous iterations, it cut the matrix to 303 lines (28% smaller) while maintaining 98.1% output quality.</p>
<p>Here&rsquo;s the prompt optimization pattern, the results, and what surprised me about how robust LLMs are to reference data reduction.</p>
<h2 id="the-autoresearch-pattern">The Autoresearch Pattern</h2>
<p>Andrej Karpathy&rsquo;s <a href="https://github.com/karpathy/autoresearch">autoresearch</a> introduced the core idea: give an AI agent a metric to optimize and let it loop. Modify, measure, keep or revert, repeat.</p>
<figure class="tweet-screenshot"><a href="https://x.com/karpathy/status/2030371219518931079"><img src="/images/posts/autoresearch/karpathy-tweet.png"
    alt="Andrej Karpathy announcing autoresearch on X"></a>
</figure>

<p>Udit Goenka built a <a href="https://github.com/uditgoenka/autoresearch">Claude Code skill</a> that brings this pattern to arbitrary optimization tasks, adding a dedicated guard command to prevent regressions.</p>
<p>You define six parameters:</p>
<ul>
<li><strong>Goal</strong>: what you want to improve</li>
<li><strong>Scope</strong>: which files the agent can modify</li>
<li><strong>Metric</strong>: a number extracted from a shell command (line count, test coverage, score)</li>
<li><strong>Direction</strong>: whether higher or lower is better</li>
<li><strong>Verify</strong>: the command that produces the metric</li>
<li><strong>Guard</strong>: a safety net command that must always pass</li>
</ul>
<p>Each iteration follows the same cycle: modify, commit to git, verify the metric, run the guard, keep or revert. Every experiment gets committed before verification, so rollbacks are clean. It tracks results in a TSV log and reads its own git history to avoid repeating failed approaches.</p>
<p>The separation between metric and guard is what makes this work. The metric tells autoresearch &ldquo;did we make progress?&rdquo; while the guard tells it &ldquo;did we break anything?&rdquo; Keeping those independent lets the loop optimize aggressively while the guard catches regressions.</p>
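<p>The keep-or-revert decision is simple enough to sketch. The following Python is an illustrative sketch, not autoresearch&rsquo;s actual implementation; the command strings and function names are placeholders.</p>

```python
import subprocess


def run(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command and capture its output."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)


def metric(verify_cmd: str) -> float:
    """Extract a single number from the verify command's stdout."""
    return float(run(verify_cmd).stdout.strip())


def decide(score: float, best: float, guard_ok: bool,
           lower_is_better: bool = True) -> str:
    """Keep the committed experiment only if the metric improved AND the guard passed."""
    improved = score < best if lower_is_better else score > best
    return "keep" if improved and guard_ok else "revert"


# One iteration (the experiment is already committed, so a revert is clean):
#   score = metric("wc -l < path/to/reference-matrix.ts")   # hypothetical path
#   guard_ok = run("npx vitest run ...").returncode == 0
#   if decide(score, best, guard_ok) == "revert":
#       run("git revert --no-edit HEAD")
```

Because the experiment is committed before verification, the revert branch of the decision never has to untangle a dirty working tree.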
<h2 id="setting-up-the-prompt-optimization-experiment">Setting Up the Prompt Optimization Experiment</h2>
<h3 id="scope">Scope</h3>
<p>I scoped autoresearch to a single file — the reference matrix itself. I didn&rsquo;t want it touching the agent&rsquo;s prompt instructions, the recommendation library, or the eval infrastructure. Just the example data.</p>
<p>The alternative was to also let it modify the agent prompt, changing how the matrix is described. But I wanted to isolate the variable: same prompt instructions, same recommendation library, just less example data.</p>
<h3 id="metric">Metric</h3>
<p>For the metric, I used line count. It&rsquo;s simple, deterministic, and directly measures what we care about — how much data gets injected into the prompt. The metric doesn&rsquo;t measure quality at all. That&rsquo;s the guard&rsquo;s job.</p>
<h3 id="guard">Guard</h3>
<p>The quality gate was our existing golden-benchmark eval:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#f5e0dc">EVAL_QUIET</span><span style="color:#89dceb;font-weight:bold">=</span><span style="color:#89dceb">true</span> npx vitest run --config vitest.evals.config.ts <span style="color:#89b4fa">\
</span></span></span><span style="display:flex;"><span><span style="color:#89b4fa"></span>  src/evals/matrix-generation/golden-benchmark.test.ts
</span></span></code></pre></div><p>This eval feeds all 166 input categories into the agent, runs the full matrix generation end-to-end via LLM, and compares the output against golden reference data across four dimensions:</p>
<ol>
<li><strong>Input category coverage</strong> (did it produce rows for every input category?)</li>
<li><strong>Output category accuracy</strong> (correct category assignment?)</li>
<li><strong>Recommendation overlap</strong> (right recommendations from the library?)</li>
<li><strong>Assignment accuracy</strong> (correct responsible party?)</li>
</ol>
<p>The guard must exit 0 (all Vitest assertions pass) for a change to be kept. Each guard run took 5 to 7 minutes because it makes a real LLM API call to generate the full matrix.</p>
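<p>Each dimension reduces to a set comparison between generated rows and golden rows. Here is a minimal sketch of dimensions 1 and 3, with hypothetical field names; the real eval&rsquo;s schema isn&rsquo;t shown in this post.</p>

```python
def coverage_score(golden_categories: set[str], output_rows: list[dict]) -> float:
    """Dimension 1: fraction of golden input categories with at least one output row."""
    produced = {row["input_category"] for row in output_rows}  # hypothetical field name
    return len(golden_categories & produced) / len(golden_categories)


def overlap_score(golden_recs: list[str], output_recs: list[str]) -> float:
    """Dimension 3: fraction of golden recommendations the model reproduced."""
    if not golden_recs:
        return 1.0
    return len(set(golden_recs) & set(output_recs)) / len(golden_recs)
```

Set intersection makes the scores order-independent, so the model is free to emit rows in any order without penalty.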
<h2 id="the-65-iteration-run">The 65-Iteration Run</h2>
<p>I ran three rounds:</p>
<table>
  <thead>
      <tr>
          <th>Round</th>
          <th>Iterations</th>
          <th>Lines</th>
          <th>Focus</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>25</td>
          <td>421 → 337</td>
          <td>Easy wins: exact duplicates, severity level consolidation, frequency variants</td>
      </tr>
      <tr>
          <td>2</td>
          <td>25</td>
          <td>337 → 255</td>
          <td>Deeper cuts: multi-row category groups, shared high-severity rows</td>
      </tr>
      <tr>
          <td>3</td>
          <td>15</td>
          <td>255 → 197</td>
          <td>Aggressive: most multi-row groups reduced to 1-2 representatives</td>
      </tr>
  </tbody>
</table>
<p>Each iteration took 6 to 8 minutes (mostly the guard eval). Total wall time was roughly 7 to 8 hours across three rounds.</p>
<p>The agent&rsquo;s approach was systematic. In round 1, it found the free wins: 5 exact duplicate rows, frequency variants (7 rows for different weekly frequencies that could collapse to 2), and severity levels where lower severity was always a subset of moderate. In later rounds, it got more aggressive, reducing most multi-row category groups to single representative entries.</p>
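<p>The exact-duplicate pass in round 1 is mechanical. The agent edited the file directly, but the idea amounts to something like this sketch:</p>

```python
def exact_duplicate_rows(csv_text: str) -> list[str]:
    """Return data rows that appear more than once, header excluded."""
    header, *rows = csv_text.strip().splitlines()
    seen: set[str] = set()
    dups: list[str] = []
    for row in rows:
        if row in seen and row not in dups:
            dups.append(row)
        seen.add(row)
    return dups
```

The later consolidations (frequency variants, severity subsets) are the harder part: they require judging that two non-identical rows teach the model the same pattern, which is exactly where the guard earns its keep.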
<h2 id="results-after-fixing-the-eval-baseline">Results After Fixing the Eval Baseline</h2>
<p>During the run, I discovered that the original eval had a self-referencing bug. Both the agent prompt and the eval&rsquo;s golden comparison data imported from the same <code>REFERENCE_MATRIX_CSV</code> constant. Every time autoresearch shrank the reference matrix, it also shrank what the eval compared against. The eval was proving &ldquo;the model can reproduce a smaller matrix&rdquo; rather than &ldquo;the model handles all real-world input categories correctly.&rdquo;</p>
<p>The fix was straightforward. I split the data into two files:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// src/data/reference-matrix.ts — injected into the agent prompt (optimized)
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">export</span> <span style="color:#cba6f7">const</span> REFERENCE_MATRIX_CSV <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#a6e3a1">`...`</span>; <span style="color:#6c7086;font-style:italic">// 303 lines after optimization
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span>
</span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// src/data/golden-reference-matrix.ts — used by eval (immutable)
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">export</span> <span style="color:#cba6f7">const</span> GOLDEN_REFERENCE_MATRIX_CSV <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#a6e3a1">`...`</span>; <span style="color:#6c7086;font-style:italic">// original 421 lines, never changes
</span></span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// src/evals/matrix-generation/golden-benchmark.test.ts
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic">// Before: import { REFERENCE_MATRIX_CSV } from &#39;../../data/reference-matrix&#39;;
</span></span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"></span><span style="color:#cba6f7">import</span> { GOLDEN_REFERENCE_MATRIX_CSV } <span style="color:#cba6f7">from</span> <span style="color:#a6e3a1">&#39;../../data/golden-reference-matrix&#39;</span>;
</span></span></code></pre></div><p>With the fixed eval, I binary-searched through the git history to find the optimal size. Because autoresearch commits every experiment, the full optimization history was available to test against the corrected eval.</p>
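<p>Quality only falls as the matrix shrinks, so the score is effectively monotone in line count, and binary search over the commit history is valid. A sketch of that search, where the commit list and score lookup are stand-ins for <code>git log</code> plus one eval run per checkout:</p>

```python
def last_passing(commits, score_of, threshold=98.0):
    """Binary-search commits, ordered from largest matrix to smallest,
    for the last one whose eval score still clears the threshold."""
    lo, hi = 0, len(commits) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if score_of(commits[mid]) >= threshold:
            best = commits[mid]  # still passing: try smaller matrices
            lo = mid + 1
        else:
            hi = mid - 1         # failing: back off toward larger matrices
    return best
```

At 5 to 7 minutes per eval run, the search matters: nine candidate sizes take four eval runs instead of nine.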
<!-- audio-skip -->
<table>
  <thead>
      <tr>
          <th>Lines</th>
          <th>Reduction</th>
          <th>Overall Score</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>421</td>
          <td>0%</td>
          <td>~99.9%</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>337</td>
          <td>-20%</td>
          <td>99.1%</td>
          <td>Above 98%</td>
      </tr>
      <tr>
          <td>308</td>
          <td>-27%</td>
          <td>98.5%</td>
          <td>Above 98%</td>
      </tr>
      <tr>
          <td>305</td>
          <td>-28%</td>
          <td>98.4%</td>
          <td>Above 98%</td>
      </tr>
      <tr>
          <td><strong>303</strong></td>
          <td><strong>-28%</strong></td>
          <td><strong>98.1%</strong></td>
          <td><strong>Sweet spot</strong></td>
      </tr>
      <tr>
          <td>297</td>
          <td>-29%</td>
          <td>97.7%</td>
          <td>Below 98%</td>
      </tr>
      <tr>
          <td>283</td>
          <td>-33%</td>
          <td>96.9%</td>
          <td>Below 98%</td>
      </tr>
      <tr>
          <td>255</td>
          <td>-39%</td>
          <td>~96%</td>
          <td>Below 98%</td>
      </tr>
      <tr>
          <td>197</td>
          <td>-53%</td>
          <td>95.5%</td>
          <td>Too aggressive</td>
      </tr>
  </tbody>
</table>
<p>The sweet spot is 303 lines: a 28% reduction maintaining 98%+ overall quality. The 98% threshold is crossed around iteration 35, where the agent removed shared high-severity rows that contained unique recommendation mappings.</p>
<p>At 303 lines, the score breakdown:</p>
<ul>
<li><strong>Input category coverage:</strong> 100% (all 166 golden categories present)</li>
<li><strong>Output category accuracy:</strong> 100%</li>
<li><strong>Recommendation overlap:</strong> 90.5% (about 32 specific recommendations lost)</li>
<li><strong>Assignment accuracy:</strong> 99.7%</li>
<li><strong>Overall weighted score:</strong> 98.1%</li>
</ul>
<p>The main quality cost is recommendation overlap. The model still covers all input categories and assigns output categories correctly, but produces slightly fewer recommendation rows per category. For this use case, that&rsquo;s an acceptable tradeoff: 118 fewer lines in every prompt for a 1.9% quality reduction.</p>
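<p>The post doesn&rsquo;t state the dimension weights, so the ones below are illustrative, not the real eval&rsquo;s. With weights that favor coverage, the arithmetic looks like this:</p>

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension eval scores; weights must sum to 1."""
    return sum(scores[dim] * weights[dim] for dim in scores)


# Scores from the 303-line run; weights are assumed for illustration only.
scores = {"coverage": 100.0, "category_accuracy": 100.0,
          "recommendation_overlap": 90.5, "assignment_accuracy": 99.7}
weights = {"coverage": 0.4, "category_accuracy": 0.2,
           "recommendation_overlap": 0.2, "assignment_accuracy": 0.2}
# With these assumed weights: 40.0 + 20.0 + 18.1 + 19.94, roughly 98.0
```

The structure of the tradeoff is visible regardless of the exact weights: two dimensions hold at 100%, so a single weak dimension (recommendation overlap) drags the overall score down only a couple of points.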
<h2 id="what-this-reveals-about-llms-and-reference-data">What This Reveals About LLMs and Reference Data</h2>
<p>The most useful finding isn&rsquo;t the 28% number. It&rsquo;s the degradation curve.</p>
<p>Even at 197 lines (53% cut), the model still hit 95.5%. It correctly covered all input categories and most output categories. The recommendation library (a separate 337-entry file in the prompt) carries much of the mapping knowledge. The reference matrix turned out to be more &ldquo;example gallery&rdquo; than &ldquo;source of truth.&rdquo; The model uses it to learn output patterns, not to look up specific mappings.</p>
<p>This has implications for any system that injects large reference data into prompts. The model likely doesn&rsquo;t need all of it. But you need a correct eval to find the actual boundary, and the degradation is gradual, not a cliff. Without a quality gate, you won&rsquo;t know where that boundary is until users report problems.</p>
<h2 id="lessons-for-running-autonomous-optimization-loops">Lessons for Running Autonomous Optimization Loops</h2>
<h3 id="the-guard-is-what-makes-it-work">The guard is what makes it work</h3>
<p>Without a quality gate, autoresearch is a deletion loop. The guard is the only thing preventing it from removing everything. This sounds obvious until you see how easy it is to write a guard that doesn&rsquo;t actually guard.</p>
<h3 id="separate-your-optimization-target-from-your-eval-baseline">Separate your optimization target from your eval baseline</h3>
<p>If your golden data is the same data you&rsquo;re optimizing, you&rsquo;ll always pass. This is easy to do when the reference data serves a dual purpose (prompt injection and eval comparison). Split them from the start. The optimization target is mutable. The eval baseline is immutable.</p>
<h3 id="git-as-memory-enables-post-hoc-analysis">Git-as-memory enables post-hoc analysis</h3>
<p>Autoresearch commits every experiment before verification. This is a form of <a href="/posts/how-ai-agents-remember-things/">agent memory</a> that pays off after the run ends: I was able to binary-search through the history after fixing the eval, finding the exact commit where quality degraded. Without that history, I would have had to re-run the entire optimization from scratch.</p>
<h3 id="guard-speed-determines-iteration-budget">Guard speed determines iteration budget</h3>
<p>Fast guards (line count, type checks, unit tests) enable hundreds of iterations overnight. Slow guards (LLM-based evals, end-to-end tests) limit you to roughly 8 to 10 iterations per hour. Plan your guard complexity based on how many iterations you can afford.</p>
<h2 id="applying-this-pattern-to-other-prompt-components">Applying This Pattern to Other Prompt Components</h2>
<p>The recommendation library (a separate 337-entry reference file also injected into every prompt) is the next candidate for the same treatment. Same loop, same approach, but with the eval separation built in from the start.</p>
<p>The pattern generalizes to any prompt optimization problem: define the metric, build a correct guard, let the agent loop. The constraint is always the guard. A guard that looks correct but measures the wrong thing is worse than no guard at all.</p>
<p>I built a one-page scorecard based on the four layers of agent evaluation — component testing, trajectory visibility, outcome measurement, and production monitoring. It takes two minutes and shows you where your gaps are. <a href="/agent-eval-scorecard/">Get the Agent Eval Scorecard →</a></p>
<p>If you&rsquo;re past the scorecard stage and want hands-on help with eval design or prompt optimization, <a href="/ai-agents/">let&rsquo;s talk</a>.</p>
<h2 id="additional-reading">Additional Reading</h2>
<ul>
<li><a href="https://github.com/uditgoenka/autoresearch">autoresearch Claude Code skill</a> by Udit Goenka</li>
<li><a href="https://github.com/karpathy/autoresearch">autoresearch</a> by Andrej Karpathy</li>
</ul>
]]></content:encoded></item><item><title>How to Fix LLM Date and Time Issues in Production</title><link>https://www.damiangalarza.com/posts/2026-01-07-llm-date-time-context-production/</link><pubDate>Wed, 07 Jan 2026 00:00:00 -0500</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-01-07-llm-date-time-context-production/</guid><description>LLMs don't have access to the current date, causing issues in time-based analysis. Here's how to fix date and time handling in production LLM systems with explicit context.</description><content:encoded><![CDATA[<p>I was recently working on a project to generate summarized reporting using the Anthropic Claude API. What looked good at first eventually revealed some odd behavior in production. This post explains the problem we ran into and how we resolved it.</p>
<h2 id="the-problem">The Problem</h2>
<p>The following is adapted from a real production system but generalized for this post.</p>
<p>Take a theoretical SaaS application. The goal: generate a report of users who have low activity and identify those who are likely to churn. Let&rsquo;s take a look at an example prompt:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-ruby" data-lang="ruby"><span style="display:flex;"><span><span style="color:#89dceb">require</span> <span style="color:#a6e3a1">&#39;anthropic&#39;</span>
</span></span><span style="display:flex;"><span><span style="color:#89dceb">require</span> <span style="color:#a6e3a1">&#39;json&#39;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">class</span> <span style="color:#f9e2af">ChurnRiskAnalyzer</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f9e2af">SYSTEM_PROMPT</span> <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#89dceb;font-weight:bold">&lt;&lt;~</span><span style="color:#f9e2af">PROMPT</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">You</span> are a customer success analyst<span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Your</span> job is to analyze user engagement
</span></span><span style="display:flex;"><span>    data <span style="color:#89dceb;font-weight:bold">and</span> identify customers at risk of churning<span style="color:#89dceb;font-weight:bold">.</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">When</span> analyzing users, identify which users are recently converted <span style="color:#f9e2af">AND</span> at high
</span></span><span style="display:flex;"><span>    risk of churning due to low engagement<span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Pay</span> special attention <span style="color:#a6e3a1">to</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">-</span> <span style="color:#f9e2af">Login</span> frequency relative to their plan type
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">-</span> <span style="color:#f9e2af">Feature</span> adoption breadth
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">-</span> <span style="color:#f9e2af">Time</span> since trial conversion
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">For</span> each user, <span style="color:#a6e3a1">state</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#fab387">1</span><span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Days</span> since conversion
</span></span><span style="display:flex;"><span>    <span style="color:#fab387">2</span><span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Whether</span> they qualify as a <span style="color:#a6e3a1">&#34;recent conversion&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#fab387">3</span><span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Your</span> churn risk assessment <span style="color:#89dceb;font-weight:bold">and</span> reasoning
</span></span><span style="display:flex;"><span>  <span style="color:#f9e2af">PROMPT</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">def</span> <span style="color:#89b4fa">initialize</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f5e0dc">@client</span> <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#f9e2af">Anthropic</span><span style="color:#89dceb;font-weight:bold">::</span><span style="color:#f9e2af">Client</span><span style="color:#89dceb;font-weight:bold">.</span>new
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">end</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">def</span> <span style="color:#89b4fa">analyze</span>(low_engagement_users)
</span></span><span style="display:flex;"><span>    user_prompt <span style="color:#89dceb;font-weight:bold">=</span> build_user_prompt(low_engagement_users)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#f5e0dc">@client</span><span style="color:#89dceb;font-weight:bold">.</span>messages<span style="color:#89dceb;font-weight:bold">.</span>create(
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">model</span>: <span style="color:#a6e3a1">&#39;claude-sonnet-4-20250514&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">max_tokens</span>: <span style="color:#fab387">1024</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#89dceb">system</span>: <span style="color:#f9e2af">SYSTEM_PROMPT</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">messages</span>: <span style="color:#89dceb;font-weight:bold">[</span>
</span></span><span style="display:flex;"><span>        { <span style="color:#a6e3a1">role</span>: <span style="color:#a6e3a1">&#39;user&#39;</span>, <span style="color:#a6e3a1">content</span>: user_prompt }
</span></span><span style="display:flex;"><span>      <span style="color:#89dceb;font-weight:bold">]</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response<span style="color:#89dceb;font-weight:bold">.</span>content<span style="color:#89dceb;font-weight:bold">.</span>first<span style="color:#89dceb;font-weight:bold">.</span>text
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">end</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">private</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">def</span> <span style="color:#89b4fa">build_user_prompt</span>(users)
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">&lt;&lt;~</span><span style="color:#f9e2af">PROMPT</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f9e2af">Analyze</span> the following low<span style="color:#89dceb;font-weight:bold">-</span>engagement users from the past <span style="color:#fab387">30</span> <span style="color:#a6e3a1">days</span>:
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      <span style="color:#6c7086;font-style:italic">#{JSON.pretty_generate(users)}</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">PROMPT</span>
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">end</span>
</span></span><span style="display:flex;"><span><span style="color:#cba6f7">end</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6c7086;font-style:italic"># Example usage</span>
</span></span><span style="display:flex;"><span>analyzer <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#f9e2af">ChurnRiskAnalyzer</span><span style="color:#89dceb;font-weight:bold">.</span>new
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>low_engagement_users <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#89dceb;font-weight:bold">[</span>
</span></span><span style="display:flex;"><span>  {
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">engagement_id</span>: <span style="color:#a6e3a1">&#39;eng_001&#39;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">last_login</span>: <span style="color:#a6e3a1">&#39;2025-12-28&#39;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">logins_past_30_days</span>: <span style="color:#fab387">2</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">features_used</span>: <span style="color:#89dceb;font-weight:bold">[</span><span style="color:#a6e3a1">&#39;dashboard&#39;</span><span style="color:#89dceb;font-weight:bold">]</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">user</span>: {
</span></span><span style="display:flex;"><span>      <span style="color:#89dceb">id</span>: <span style="color:#a6e3a1">&#39;usr_4821&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">email</span>: <span style="color:#a6e3a1">&#39;sarah@acme.co&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">plan</span>: <span style="color:#a6e3a1">&#39;pro&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">trial_converted_at</span>: <span style="color:#a6e3a1">&#39;2025-02-15&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">company</span>: <span style="color:#a6e3a1">&#39;Acme Corp&#39;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  },
</span></span><span style="display:flex;"><span>  {
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">engagement_id</span>: <span style="color:#a6e3a1">&#39;eng_002&#39;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">last_login</span>: <span style="color:#a6e3a1">&#39;2025-12-20&#39;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">logins_past_30_days</span>: <span style="color:#fab387">1</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">features_used</span>: <span style="color:#89dceb;font-weight:bold">[]</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e3a1">user</span>: {
</span></span><span style="display:flex;"><span>      <span style="color:#89dceb">id</span>: <span style="color:#a6e3a1">&#39;usr_9174&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">email</span>: <span style="color:#a6e3a1">&#39;mike@newstartup.io&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">plan</span>: <span style="color:#a6e3a1">&#39;pro&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">trial_converted_at</span>: <span style="color:#a6e3a1">&#39;2025-12-01&#39;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e3a1">company</span>: <span style="color:#a6e3a1">&#39;NewStartup&#39;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span><span style="color:#89dceb;font-weight:bold">]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#89dceb">puts</span> analyzer<span style="color:#89dceb;font-weight:bold">.</span>analyze(low_engagement_users)
</span></span></code></pre></div><p>Let&rsquo;s assume the date is December 29th, 2025.</p>
<p>In this example we have two users with low engagement. <code>sarah@acme.co</code> converted back in February 2025 and has 2 logins in the past 30 days. <code>mike@newstartup.io</code> converted December 1st, 2025 and has just 1 login.</p>
<p>The expected behavior: flag Mike as at risk since he converted recently and has minimal engagement. What actually happened: both Mike and Sarah were flagged. Let&rsquo;s look at why.</p>
<h2 id="lack-of-guidance">Lack of guidance</h2>
<p>In the first version of our system prompt, we mention that we want to include recent conversions—but we never define what &ldquo;recent&rdquo; means. This leaves it up to the model to decide, which leads to non-deterministic and confusing results. The fix is to provide explicit guidance:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-ruby" data-lang="ruby"><span style="display:flex;"><span><span style="color:#cba6f7">class</span> <span style="color:#f9e2af">ChurnRiskAnalyzer</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f9e2af">SYSTEM_PROMPT</span> <span style="color:#89dceb;font-weight:bold">=</span> <span style="color:#89dceb;font-weight:bold">&lt;&lt;~</span><span style="color:#f9e2af">PROMPT</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">You</span> are a customer success analyst<span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Your</span> job is to analyze user engagement
</span></span><span style="display:flex;"><span>    data <span style="color:#89dceb;font-weight:bold">and</span> identify customers at risk of churning<span style="color:#89dceb;font-weight:bold">.</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    A <span style="color:#a6e3a1">&#34;recent conversion&#34;</span> is defined as a user who converted from
</span></span><span style="display:flex;"><span>    trial to paid within the past <span style="color:#fab387">30</span> days<span style="color:#89dceb;font-weight:bold">.</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">When</span> analyzing users, identify which users are recently converted <span style="color:#f9e2af">AND</span> at high
</span></span><span style="display:flex;"><span>    risk of churning due to low engagement<span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Pay</span> special attention <span style="color:#a6e3a1">to</span>:
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">-</span> <span style="color:#f9e2af">Login</span> frequency relative to their plan type
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">-</span> <span style="color:#f9e2af">Feature</span> adoption breadth
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">-</span> <span style="color:#f9e2af">Time</span> since trial conversion
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">For</span> each user, <span style="color:#a6e3a1">state</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#fab387">1</span><span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Days</span> since conversion
</span></span><span style="display:flex;"><span>    <span style="color:#fab387">2</span><span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Whether</span> they qualify as a <span style="color:#a6e3a1">&#34;recent conversion&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#fab387">3</span><span style="color:#89dceb;font-weight:bold">.</span> <span style="color:#f9e2af">Your</span> churn risk assessment <span style="color:#89dceb;font-weight:bold">and</span> reasoning
</span></span><span style="display:flex;"><span><span style="color:#f9e2af">PROMPT</span>
</span></span></code></pre></div><p>Now the model has explicit guidance on what makes a &ldquo;recent conversion.&rdquo; But we can&rsquo;t stop here.</p>
<h2 id="providing-a-reference-date">Providing a reference date</h2>
<p>We&rsquo;ve updated the prompt to provide explicit guidance on what makes a recent conversion, but there&rsquo;s still one problem—the model doesn&rsquo;t know what the current date is. LLMs have no system clock access; they only know what you tell them. One way to resolve this is to provide the date as part of the prompt:</p>
<div class="highlight"><pre tabindex="0" style="color:#cdd6f4;background-color:#1e1e2e;-moz-tab-size:2;-o-tab-size:2;tab-size:2;"><code class="language-ruby" data-lang="ruby"><span style="display:flex;"><span>  <span style="color:#cba6f7">def</span> <span style="color:#89b4fa">build_user_prompt</span>(users)
</span></span><span style="display:flex;"><span>    <span style="color:#89dceb;font-weight:bold">&lt;&lt;~</span><span style="color:#f9e2af">PROMPT</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f9e2af">Today</span><span style="color:#f38ba8">&#39;</span>s date <span style="color:#a6e3a1">is</span>: <span style="color:#6c7086;font-style:italic">#{Date.today.iso8601}.</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      <span style="color:#f9e2af">Analyze</span> the following low<span style="color:#89dceb;font-weight:bold">-</span>engagement users from the past <span style="color:#fab387">30</span> <span style="color:#a6e3a1">days</span>:
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>      <span style="color:#6c7086;font-style:italic">#{JSON.pretty_generate(users)}</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f9e2af">PROMPT</span>
</span></span><span style="display:flex;"><span>  <span style="color:#cba6f7">end</span>
</span></span></code></pre></div><p>Now the model has everything it needs for an accurate assessment. Running this updated version correctly excludes Sarah, who converted months ago, and flags only Mike.</p>
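<p>With the reference date in the prompt, the date arithmetic we're asking the model to perform is easy to verify on our side. A quick sketch of that math (Sarah's exact February conversion date isn't in the sample data, so the 15th is assumed purely for illustration):</p>

```ruby
require "date"

today = Date.new(2025, 12, 29) # the reference date from this example

# mike@newstartup.io converted December 1st, 2025
days_since_mike = (today - Date.new(2025, 12, 1)).to_i
puts days_since_mike   # 28 -> inside the 30-day window, so Mike is flagged

# sarah@acme.co converted in February 2025 (the 15th assumed for illustration)
days_since_sarah = (today - Date.new(2025, 2, 15)).to_i
puts days_since_sarah  # 317 -> well outside the window, so Sarah is excluded
```

<p>This is exactly the calculation the model cannot do without a reference date, no matter how well the window is defined.</p>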
<h2 id="conclusion">Conclusion</h2>
<p>When working with LLMs and time-sensitive data:</p>
<ol>
<li><strong>Be explicit about definitions</strong> - Don&rsquo;t assume the model interprets terms like &ldquo;recent&rdquo; the same way you do.</li>
<li><strong>Always provide the current date</strong> - LLMs have no awareness of real-time; include today&rsquo;s date in your prompt.</li>
<li><strong>Test with edge cases</strong> - Run your prompts with data that spans different time periods to catch these issues early.</li>
</ol>
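<p>That third point is cheap to automate once the reference date is pinned. Here's a minimal sketch of boundary checks, assuming the same 30-day definition from the system prompt (the <code>recent_conversion?</code> helper is a hypothetical deterministic mirror of that rule, not part of the analyzer shown above):</p>

```ruby
require "date"

# Deterministic mirror of the prompt's rule: a "recent conversion" is a
# trial-to-paid conversion within the past 30 days.
def recent_conversion?(converted_on, today:)
  (today - converted_on).to_i <= 30
end

today = Date.new(2025, 12, 29) # pin the date so the checks are reproducible

checks = {
  "29 days ago (inside the window)"  => recent_conversion?(today - 29, today: today) == true,
  "30 days ago (on the boundary)"    => recent_conversion?(today - 30, today: today) == true,
  "31 days ago (outside the window)" => recent_conversion?(today - 31, today: today) == false,
}

checks.each { |label, ok| puts "#{ok ? 'PASS' : 'FAIL'}: #{label}" }
```

<p>Feeding the same boundary fixtures through the actual LLM call and comparing against these expectations is a quick way to catch drift in how the model interprets the rule.</p>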
<p>These might seem like small details, but in production systems where accuracy matters, they make the difference between useful analysis and misleading results. Subtle errors like these erode trust quickly.</p>
<hr>
<p>Building LLM features into a Rails app and running into issues like this? My book <a href="/building-llm-applications/">Building LLM Applications in Rails</a> covers the patterns that hold up in production.</p>
]]></content:encoded></item></channel></rss>