<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Hugo on Damian Galarza | Software Engineering &amp; AI Consulting</title><link>https://www.damiangalarza.com/tags/hugo/</link><description>Recent posts from Damian Galarza | Software Engineering &amp; AI Consulting</description><generator>Hugo</generator><language>en-us</language><managingEditor>Damian Galarza</managingEditor><atom:link href="https://www.damiangalarza.com/tags/hugo/feed.xml" rel="self" type="application/rss+xml"/><item><title>Adding Audio Narration Revealed the Bugs in My Writing</title><link>https://www.damiangalarza.com/posts/2026-04-20-adding-audio-narration-revealed-the-bugs-in-my-writing/</link><pubDate>Mon, 20 Apr 2026 00:00:00 -0400</pubDate><author>Damian Galarza</author><guid>https://www.damiangalarza.com/posts/2026-04-20-adding-audio-narration-revealed-the-bugs-in-my-writing/</guid><description>How I added ElevenLabs TTS audio narration to my Hugo blog, cloned my own voice, and discovered my writing had patterns no voice model could read.</description><content:encoded><![CDATA[<p>Last week I was driving on a stretch of highway replaying a draft post in my head. I&rsquo;d written it the night before, wanted to catch any obvious flaws, and couldn&rsquo;t touch the keyboard. What I wanted was to hear it.</p>
<p>I&rsquo;d seen narrated posts show up on X and on OpenAI&rsquo;s blog recently. A small play button at the top of the article. A consistent voice reading the piece aloud. It&rsquo;s a small affordance and a nice one. I realized I wanted that for my site.</p>
<p>So I messaged Emma.</p>
<h2 id="who-emma-is">Who Emma Is</h2>
<p>Emma is my assistant. An AI agent I keep in my pocket for exactly this kind of request. While I drove, I described what I wanted and asked her to draft a PRD.</p>
<p>By the time I reached my destination, I had one waiting. Service choice (ElevenLabs). Storage (Cloudflare R2). Generation trigger (GitHub Actions on push to main). Opt-in via a frontmatter flag so I could control cost post by post.</p>
<p>I reviewed the PRD, approved it, and Emma kicked off a Claude Code session to scaffold the feature. By the time I pulled into my driveway, the branch was open with a generation script, a Hugo player partial, and the R2 upload wiring.</p>
<p>That&rsquo;s the ambient workflow I&rsquo;ve settled into. One human conversation, two agents, and a draft implementation before I&rsquo;ve gotten out of the car. I still drive the engineering. But I no longer start from zero.</p>
<h2 id="voice-auditions">Voice Auditions</h2>
<p>I pulled the branch. Then I went into ElevenLabs, grabbed an API key, picked a voice from their library, and updated my <code>.env</code> with the key and voice ID. Next I ran the script on one of my longer recent posts.</p>
<p>The first version sounded too slow. Every sentence stretched past its welcome.</p>
<p>I went back to the library and picked another. This one paced better but felt wrong for the content. Too smooth, too polished, the kind of voice you hear in onboarding videos for insurance products. Wrong register for technical writing.</p>
<p>After a few more auditions I found one I liked. Then I paused. The whole point was to put my voice on my posts. Why was I renting someone else&rsquo;s?</p>
<p>ElevenLabs offers voice cloning on their Creator plan. Instant clones take a minute of clean audio. Professional clones train on 30 minutes to several hours of source material and produce noticeably better fidelity. Fortunately, I had a few months&rsquo; worth of YouTube recordings to draw on. I pointed Claude Code at my archives and asked it to extract two hours of audio tracks. Thirty minutes is enough to train a clone, but two hours is closer to the recommended amount for the highest-quality result.</p>
<p>About an hour later ElevenLabs sent me a notification. The clone was ready. I plugged the new voice ID into the script, regenerated the test post, and hit play.</p>
<p>It was my voice, reading my post, in a car I was now sitting in.</p>
<p>The first pass was quite good, but there was room for improvement.</p>
<h2 id="the-writing-was-the-bug">The Writing Was the Bug</h2>
<p>The clone itself sounded fine. The writing was the problem.</p>
<p>My first real-run listen surfaced half a dozen things I&rsquo;d never noticed about my own prose.</p>
<p><code>~170</code> read as &ldquo;tilde one seventy.&rdquo; I&rsquo;d written &ldquo;around 170&rdquo; for readers but let the tilde do the work for listeners.</p>
<p>Colons introducing questions arrived flat. &ldquo;The question was concrete: how much data does the model actually need?&rdquo; The voice couldn&rsquo;t flag the interrogative early enough, so the setup clause sounded deadpan and the question sounded abrupt.</p>
<p>Arrows in tables and inline CTAs like <code>421 → 337</code> or <code>Get the Scorecard →</code> went silent, or worse, were read as the word &ldquo;arrow.&rdquo;</p>
<p>Dense tables came out as comma-joined rows with no column labels. &ldquo;One, twenty-five, four hundred twenty-one, three hundred thirty-seven, easy wins, exact duplicates, severity level consolidation, frequency variants.&rdquo; Meaningless as speech.</p>
<p>Headings landed without enough pause in front of them, so the listener had no section break to hold onto.</p>
<p>Every one of these reads fine on the page. Every one of them is a bug in narration.</p>
<h2 id="fixing-it-through-conversation">Fixing It Through Conversation</h2>
<p>I didn&rsquo;t have a mental model for &ldquo;how to write for a TTS.&rdquo; So I did what I&rsquo;ve been doing for most engineering work lately. I played the audio, wrote down the moments that stumbled, and asked Claude Code to help.</p>
<p>The back-and-forth surfaced a set of patterns I hadn&rsquo;t seen on my own:</p>
<ul>
<li>Split colon-introduced questions into two sentences so the interrogative intonation has somewhere to land.</li>
<li>Replace <code>~</code> with &ldquo;around&rdquo; before narration.</li>
<li>Convert arrows between values to &ldquo;to.&rdquo; Strip decorative trailing arrows from CTAs.</li>
<li>Render tables as labeled rows so the column header gives each cell context.</li>
<li>Inject explicit pause tags at section breaks.</li>
</ul>
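<p>Most of these are editing moves, not code, but the first pattern lends itself to a draft-time check. Here is a hypothetical lint for it; the function name and regex are mine for illustration, not part of any tool mentioned in this post:</p>

```javascript
// Hypothetical draft-time lint: flag sentences where a colon introduces a
// question, so they can be split into two sentences before narration.
function flagColonQuestions(text) {
  // A colon followed by question text, with no sentence boundary in between.
  const pattern = /[^.!?\n]+:\s[^.!?\n]*\?/g;
  return text.match(pattern) ?? [];
}
```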
<p>Most of these patterns are not original to Claude, of course. They&rsquo;re baked into audiobook-production wisdom going back decades. Punctuation as pacing. Questions as standalone sentences. Narration scripts with more punctuation than print prose, not less. <a href="https://theurbanwriters.com/blogs/publishing/pacing-and-flow-how-to-optimize-your-writing-for-audiobook-performance">Urban Writers on audiobook pacing</a> is a fine single-page version of that playbook. I just didn&rsquo;t know the prior art existed until after the fixes were working.</p>
<p>Which is a second thing this post is about. A lot of what &ldquo;feels like figuring something out&rdquo; is actually relaying, through an agent, knowledge that&rsquo;s been in the world for decades. I got to the audiobook playbook via an LLM conversation, not through research. The cross-pollination is the interesting part. Not that technical prose needs narration discipline, but that I arrived at it by talking to Claude about what my ears didn&rsquo;t like.</p>
<p>What Claude couldn&rsquo;t pull from audiobook canon were the things unique to technical writing. Numbers with shorthand like <code>~</code> and <code>k</code> and <code>M</code>. Arrows, ASCII and unicode. Inline code fragments that look fine in monospace and fall apart in speech. Markdown tables as information-dense structures. Those got purpose-built transforms.</p>
<h2 id="the-text-cleaner">The Text Cleaner</h2>
<p>All of these fixes live in one place: a text cleaner at <code>scripts/generate-audio.mjs</code> that runs on the raw markdown before anything reaches ElevenLabs. It does the following:</p>
<ul>
<li>Strips code blocks, frontmatter, HTML, Hugo shortcodes.</li>
<li>Expands <code>~170</code> to &ldquo;around 170.&rdquo;</li>
<li>Converts arrows between tokens to &ldquo;to.&rdquo; <code>421 → 337</code> becomes <code>421 to 337</code>.</li>
<li>Strips decorative trailing arrows from CTAs, so <code>Get the Scorecard →</code> narrates as just <code>Get the Scorecard</code>.</li>
<li>Renders markdown tables as labeled rows. The first row becomes column labels, and each data row becomes a short sentence. <code>Round: 1; Iterations: 25; Lines: 421 to 337; Focus: Easy wins.</code></li>
<li>Injects pause tags around headings (a one-second pause before, a shorter pause after) and between paragraphs.</li>
<li>Honors a <code>&lt;!-- audio-skip --&gt;</code> HTML comment before a table to replace that table with &ldquo;See the written post for the full table.&rdquo; Readers still see it. Listeners get a clean hand-off.</li>
</ul>
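<p>The scalar transforms are straightforward regex substitutions. A minimal sketch of a few of them, in the same spirit as the script &mdash; the function name and regexes here are illustrative rather than lifted from <code>generate-audio.mjs</code>, and the <code>&lt;break&gt;</code> tag follows ElevenLabs&rsquo; documented pause syntax:</p>

```javascript
// Minimal sketch of the scalar narration transforms; illustrative, not the
// actual generate-audio.mjs. <break> is the ElevenLabs pause tag.
function cleanForNarration(text) {
  return text
    // ~170 reads as "tilde one seventy"; expand to "around 170"
    .replace(/~(\d)/g, 'around $1')
    // Strip decorative trailing arrows from CTAs first, so they are not
    // mistaken for value-to-value arrows
    .replace(/[ \t]*(?:→|->)[ \t]*$/gm, '')
    // Arrows between values become "to": 421 → 337 reads "421 to 337"
    .replace(/[ \t](?:→|->)[ \t]/g, ' to ')
    // A one-second pause before each markdown heading, a shorter one after
    .replace(/^#{1,6} (.+)$/gm, '<break time="1.0s" /> $1 <break time="0.3s" />');
}
```

The table renderer and the <code>&lt;!-- audio-skip --&gt;</code> handling are more involved; the real script runs all of it before anything reaches the API.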
<p>I also updated the voice-profile document my editor agent consults when I write new posts. It now has a &ldquo;Writing for Audio Narration&rdquo; section. Three months from now, when I lean into a colon-introduced question out of habit, the agent will catch it at draft time.</p>
<h2 id="architecture">Architecture</h2>
<p>The generation pipeline runs entirely outside the Hugo build. That was the design constraint from the start. Audio generation is expensive and side-effectful. Hugo rebuilds should stay cheap and deterministic.</p>
<pre tabindex="0"><code> ┌─────────────┐
 │ push to main│
 └──────┬──────┘
        │
        ▼
 ┌──────────────────────┐      ┌──────────────┐
 │ generate-audio.yml   │─────▶│ ElevenLabs   │
 │ (GitHub Actions)     │      │ TTS API      │
 └──────┬───────────────┘      └──────┬───────┘
        │                             │
        │                             ▼ MP3
        │                      ┌──────────────┐
        │                      │ Cloudflare R2│
        │                      └──────────────┘
        ▼
 ┌───────────────────────┐
 │ data/audio_hashes.json│  ◀── commit hash manifest back
 └──────┬────────────────┘
        │
        ▼
 ┌──────────────────────┐
 │ Cloudflare Pages     │
 │ rebuilds; player +   │
 │ JSON-LD render when  │
 │ hash entry exists    │
 └──────────────────────┘
</code></pre><p>The moving pieces:</p>
<p><code>scripts/generate-audio.mjs</code> is a Node script. Walks <code>content/posts/</code>, processes each post that has <code>audio: true</code> in its frontmatter, checks a content hash to avoid re-generating unchanged posts, calls ElevenLabs, uploads the MP3 to Cloudflare R2, updates the hash manifest.</p>
<p><code>data/audio_hashes.json</code> is a content-addressable cache. Maps post slug to a SHA of its cleaned text. Lives under Hugo&rsquo;s <code>data/</code> directory so templates can read it as a ready-signal.</p>
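<p>The skip check is the part worth sketching. Assuming Node&rsquo;s built-in <code>crypto</code> module and a flat slug-to-hash map (the real manifest shape may differ, and <code>needsRegeneration</code> is an illustrative name, not the script&rsquo;s actual API):</p>

```javascript
// Sketch of the content-hash gate, assuming a flat { slug: sha } manifest.
import { createHash } from 'node:crypto';

function needsRegeneration(manifest, slug, cleanedText) {
  const hash = createHash('sha256').update(cleanedText).digest('hex');
  return { stale: manifest[slug] !== hash, hash };
}

// Only posts whose cleaned text changed reach ElevenLabs; the new hash is
// then written back so the next run skips them.
const manifest = {};
const { stale, hash } = needsRegeneration(manifest, 'example-post', 'cleaned text');
if (stale) {
  // generateAudio() and uploadToR2() would run here (hypothetical names)
  manifest['example-post'] = hash;
}
```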
<p>The GitHub Actions workflow (<code>generate-audio.yml</code>) triggers on pushes to main that touch <code>content/posts/**.md</code> or the script itself. It runs the generation script and commits the updated hash manifest back to main.</p>
<p>Hugo templates render the audio player only if two conditions hold. <code>audio: true</code> in the post&rsquo;s frontmatter. AND the hash manifest has an entry for the post&rsquo;s slug. That second gate is what prevents a broken player during the window between push and CI completion. If the audio isn&rsquo;t ready yet, the player doesn&rsquo;t appear at all.</p>
<p>Cloudflare R2 stores the MP3s behind <code>assets.damiangalarza.com</code>. Free egress, negligible storage cost, immutable caching. Objects live under <code>audio/posts/&lt;slug&gt;.mp3</code>.</p>
<p>Key property: the Hugo build never touches ElevenLabs. The audio artifact is durable and independent of the generation service. If ElevenLabs changes their API or disappears tomorrow, every existing MP3 keeps playing.</p>
<h2 id="what-the-page-exposes-for-agents">What the Page Exposes for Agents</h2>
<p>LLM crawlers are starting to treat pages as content collections, not just HTML. So once the MP3 is on R2, the post page emits a richer set of semantic signals.</p>
<ul>
<li><code>&lt;link rel=&quot;alternate&quot; type=&quot;audio/mpeg&quot;&gt;</code> sits in the same slot as the RSS alternate link. Feed readers and AI crawlers can auto-discover the audio version.</li>
<li>Open Graph audio tags. <code>og:audio</code>, <code>og:audio:type</code>, <code>og:audio:secure_url</code>. Social platforms that generate link previews can reference the audio directly.</li>
<li>An enriched <code>AudioObject</code> JSON-LD block. <code>contentUrl</code>, <code>encodingFormat</code>, <code>inLanguage</code>, <code>uploadDate</code>, a <code>transcript</code> property pointing back to the post&rsquo;s canonical URL, and a <code>potentialAction</code> of type <code>ListenAction</code>.</li>
</ul>
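<p>For concreteness, here is roughly what that JSON-LD could look like. The URLs and date are illustrative placeholders, and the exact property set on my pages may differ:</p>

```json
{
  "@context": "https://schema.org",
  "@type": "AudioObject",
  "contentUrl": "https://assets.damiangalarza.com/audio/posts/example-post.mp3",
  "encodingFormat": "audio/mpeg",
  "inLanguage": "en-US",
  "uploadDate": "2026-04-20",
  "transcript": "https://www.damiangalarza.com/posts/example-post/",
  "potentialAction": {
    "@type": "ListenAction",
    "target": "https://assets.damiangalarza.com/audio/posts/example-post.mp3"
  }
}
```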
<p>The <code>transcript</code> property is the one I care about most. It tells an LLM that this audio and this page contain the same content. For retrieval and for training, that&rsquo;s the signal that matters.</p>
<h2 id="why-it-matters">Why It Matters</h2>
<p>Accessibility. A real audio version in a consistent voice, not the synthetic monotone of a browser&rsquo;s built-in reader. Readers with dyslexia, low vision, reading fatigue, or simply a preference for listening get the same content without fighting the format. And because the voice is mine, cloned from hours of my own recordings, the audio carries the same presence as the written post. A screen reader substitutes a generic voice for whatever makes a writer recognizable. A pre-generated narration doesn&rsquo;t. Audio becomes a first-class version of the post, not a degraded fallback.</p>
<p>Consumption mode flexibility. I&rsquo;ll be able to review my own drafts on the road now, which is what started this whole thing. Readers who commute or cook or walk get the same affordance.</p>
<p>And the quieter payoff. Narrator failure surfaces reader friction. Every pattern the TTS stumbled over was a pattern a careful human reader would have tripped on too. The voice just made it unignorable.</p>
<h2 id="whats-still-open">What&rsquo;s Still Open</h2>
<p>Cache-busting. When I edit a published post, the MP3 at the same URL changes, but Cloudflare&rsquo;s CDN doesn&rsquo;t invalidate for up to 24 hours. A <code>?v=&lt;hash&gt;</code> query param fixes this. On the list.</p>
<p>SSML <code>&lt;say-as&gt;</code> support. ElevenLabs&rsquo; SSML coverage is narrow. <code>&lt;break&gt;</code> is reliable. <code>&lt;say-as interpret-as=&quot;cardinal&quot;&gt;</code> may or may not be honored on the model I&rsquo;m using. I tested it against <code>~170</code> and ended up preferring the plain-text regex substitution. Worth revisiting as ElevenLabs expands the supported tag set.</p>
<p>The number-pause quirk. The voice still inserts a micro-pause after multi-digit numbers. That&rsquo;s a voice-model trait, not something I can fix in text. Possibly improves with a different model; <code>eleven_flash_v2_5</code> and <code>eleven_multilingual_v2</code> are both on the list to try.</p>
<p>If you&rsquo;re thinking about adding audio to your own site, build it. But read your last three posts aloud first. You&rsquo;ll learn more from that than from any of the transforms above.</p>
<p>If you&rsquo;re working on AI features in a real product and want a second pair of eyes on the architecture, <a href="/services/ai-engineering/">let&rsquo;s talk</a>. Calm, direct conversations about tradeoffs.</p>
<h2 id="additional-reading">Additional Reading</h2>
<ul>
<li><a href="https://theurbanwriters.com/blogs/publishing/pacing-and-flow-how-to-optimize-your-writing-for-audiobook-performance">Audiobook Pacing and Writing for Audiobook Performance</a> — the audiobook playbook this post leans on</li>
<li><a href="https://elevenlabs.io/docs/overview/capabilities/text-to-speech">ElevenLabs Text to Speech</a> — API docs</li>
<li><a href="https://www.enchantingmarketing.com/punctuation-influences-writing-voice/">How Punctuation Influences Your Writing Voice</a> — a useful companion on how punctuation carries tone for readers as well as narrators</li>
</ul>
]]></content:encoded></item></channel></rss>