Watching the Loop Forge Itself

You build a tool. It works. Sooner or later, every toolmaker has the same thought: what happens if I point it at itself?

The tool, in my case, is a small bash script I’ve been calling a Ralph Loop. It drives Claude through a plan-the-work, write-the-code, review-the-code, fix-the-code cycle — one user story at a time, unattended. Each step is a fresh Claude session, so the model never gets confused by a 200-message history.

I published a cost analysis of that loop yesterday — the draft had been with me for a while, but it only went live then. The idea for the runnable demo arrived the night before publishing: a small React app — an Exchange Rates monitoring dashboard — that the loop would assemble story by story while the reader watched. Clone the repo, run one command, go to bed, wake up to a working app and a git log of how the agents got there.

By yesterday afternoon the demo was standing. And once it was standing, I made a choice I had been quietly building toward for months. I’d use the loop to improve the loop, in the same public repo, with the git log open. The loop would refactor its own bash script while another part of the same script was building the demo app. Cloners would run the demo and read the recursion in the commit history at the same time.

Hours in, the loop committed two stories that looked perfect in the git log and were actually empty. The third story tried to build on top, found nothing there, and burned $4 trying to figure out why. The git log said the loop had improved itself. The filesystem said otherwise.

That was the moment the repo became worth writing about.

The interesting part of pointing a coding loop at itself isn’t the recursion. Self-hosting is an old idea — compilers have done it for decades. The interesting part is what you choose when reality complicates the plan: when to clean up the git log and when to leave the scar, when to give the agent more budget and when to admit the spec is wrong, when to add a config knob and when to live with a hardcoded default.

I made a demo repo, pointed the loop at itself, and let the mess remain in the record. What came out of that first chapter was more useful than a clean demo: a two-track repo structure I almost got wrong, a chapter model for public iteration, a prompt-adapter pattern for live-loaded BMAD personas, and three infrastructure bugs that taught me something only because I refused to rebase them away.

The Branch I Almost Took

A clean demo is good for confidence and bad for learning.

The repo needed to serve two audiences. The first is the cloner: they run git clone, follow the README, and watch the dashboard get built one story at a time by a small team of role-based agents — Scrum Master, Developer, Code Reviewer — orchestrated by the loop and using personas from the BMAD Method, an open framework I borrow from. They want the showcase to be stable, reproducible, and easy to inspect. The second audience is the maintainer, which in this case is also the loop. I wanted the same repo to show the loop improving its own orchestration: prompt refactors, review fixes, staging behaviour, chapter workflows, and whatever else reality exposed.

Those two audiences want different things. The cloner wants a clean artifact. The maintainer wants an honest trail.

My first instinct was branches. Demo on main, improvement work somewhere else. I put that on the table and then immediately argued myself out of it: branches hide the thing I most wanted to show. A visitor on main would see a polished demo and have no visibility into the loop improving itself. The recursion — the only reason this is more interesting than yet another Ralph demo — would become a private implementation detail.

So I reframed the proposal to myself: two tracks, but on the same branch, in different folders. I gave Claude Code four constraints to work from:

Keep the demo track at the root, structurally unchanged, so anyone cloning sees the simplest possible mental model.
Tune the canonical ralph-loop.sh’s defaults to the demo track so cloners get a working demo without configuring anything. A thin ralph-loop-system.sh wrapper targets the system track. Cloners never have to think about the dual-purpose design.
Nobody develops in the demo track. It’s a frozen showcase.
/docs/ is the demo’s working folder. System work lives somewhere else entirely.

.
+-- docs/                 # demo track artifacts (frozen)
+-- src/                  # demo app (frozen)
+-- scripts/ralph-loop.sh # shared loop engine
+-- system/               # system track
    +-- chapters/         # loop-improves-loop work

The pressure this creates is the right pressure: if the system track breaks scripts/ralph-loop.sh, the demo track is affected too. A self-improving tool should not get to improve itself in a place where breakage is invisible.

Three lightweight safeguards make the risk tolerable: a --dry-run-prompts flag that assembles each role prompt without calling the model, a per-chapter checkpoint that runs bash -n plus the prompt dry run after every story, and the existing budget caps so a broken story cannot eat the whole sprint.
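As a rough sketch, a per-story checkpoint of this kind could look something like the following. Only `bash -n` and the `--dry-run-prompts` flag come from the description above; the function name `chapter_checkpoint`, the `LOOP_SCRIPT` variable, and the assumption that the dry run exits non-zero on failure are all hypothetical.

```shell
# Hypothetical checkpoint: names and invocation shape are assumptions,
# not the repo's actual implementation.
LOOP_SCRIPT="${LOOP_SCRIPT:-scripts/ralph-loop.sh}"

chapter_checkpoint() {
  # 1. Syntax check only: bash -n parses the script without executing it.
  if ! bash -n "$LOOP_SCRIPT"; then
    echo "checkpoint: syntax error in $LOOP_SCRIPT" >&2
    return 1
  fi

  # 2. Assemble every role prompt without calling the model.
  #    Assumes the loop exits non-zero when a prompt cannot be assembled.
  if ! bash "$LOOP_SCRIPT" --dry-run-prompts >/dev/null; then
    echo "checkpoint: prompt dry run failed" >&2
    return 1
  fi

  echo "checkpoint: ok"
}
```

Both checks are cheap and cost no tokens, which is what makes running them after every story reasonable.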

The pattern is broader than this repo. When a tool produces its own work product and you want to teach from the process, the audience boundary is usually a folder choice, not a branch policy. Folders keep the recursion visible. Branches make it too easy to hide the interesting part. The version of this repo I almost shipped would have been technically cleaner and substantively less useful.

Chapters, Not Loose Decisions

Once the repo had two tracks, it needed a unit of work for the system track.

The original draft of the plan lived at docs/plans/chapter-name.md. After the two-track restructure I told Claude Code: each plan should be a folder, not a file, with its own PRD, epics, stories, and supporting artifacts living together. And rename plans/ to something that names the arc, not the snapshot. Claude Code came back with chapters/. I accepted.

system/chapters/
+-- 2026-05-24-modularize-loop-prompts/
    +-- README.md
    +-- prd.md
    +-- epics/
    +-- stories/
    +-- artifacts/

The shape is deliberate. The chapter folder mirrors the demo track’s root layout: PRD, epics, stories, artifacts. Once a reader understands the demo, they can descend into a chapter and keep their bearings. Symmetry buys recognition — there is only one vocabulary to learn.

The chapter README is the plan. It has to pass a cold-start test: a fresh reader, or a fresh LLM with no project memory, can understand what the chapter is trying to do from the files alone. That means defining project-specific terms, linking related artifacts, naming dependencies, and using absolute dates instead of temporal hand-waving.

Status lives in the header: proposed, accepted, in-progress, complete, or superseded. When the chapter closes, the folder stays. Historical record beats tidiness.
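If the status is meant to be machine-checkable, a small validator is enough. This is a sketch under one loud assumption: that the README header carries a line of the form `Status: <value>`. The helper name `chapter_status` is hypothetical; only the five status values come from the text.

```shell
# Hypothetical helper: the "Status: <value>" header format is an assumption.
chapter_status() {
  local readme="$1/README.md" status
  [ -f "$readme" ] || { echo "missing $readme" >&2; return 1; }

  # Take the first "Status:" line, then strip the label and trailing spaces.
  status=$(sed -n 's/^[Ss]tatus:[[:space:]]*//p' "$readme" |
    head -n 1 | sed 's/[[:space:]]*$//')

  case "$status" in
    proposed|accepted|in-progress|complete|superseded)
      echo "$status" ;;
    *)
      echo "invalid or missing status in $readme: '$status'" >&2
      return 1 ;;
  esac
}
```

Called as `chapter_status system/chapters/2026-05-24-modularize-loop-prompts`, it prints the status or fails loudly when the header drifts from the allowed vocabulary.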

This worked better than scattered Architecture Decision Records (ADRs) for this kind of effort. ADRs are useful when the main thing worth preserving is the decision. A chapter preserves the decision and the work products that flowed from it: the PRD, the epics, the generated stories, the verification scripts, the failed assumptions, the spec rewrites, and the fixes that landed midstream. ADRs make decisions look inevitable in retrospect. Chapters don’t.

That matters more than it sounds. The next three sections only land because the chapter folder forced me to leave the trail intact.

The First Chapter: Stop Forking Prompts

Chapter 1 was about modularising the loop’s prompts — work I’d deliberately held off until the loop’s behaviour underneath stopped moving.

The bash script had around 110 lines of hardcoded agent personas and execution-context heredocs baked into the orchestrator. Three things bothered me about this, in increasing order of importance.

First, maintenance: any change to persona behaviour meant editing a 1200-line shell script.

Second, staleness: BMAD already ships persona files in .claude/skills/ and updates them between versions. The loop carried its own copies and would silently fall behind every upstream improvement.

Third — and the one I cared about most, because of the demo — anyone cloning this repo to adapt it for a different stack would have to edit the bash script directly. Hand-editing a 1200-line shell file to change a few persona rules is exactly the kind of work that introduces bugs in the loop itself. The whole point of publishing the demo was to give people something they could lift and shape. The persona text needed to live somewhere a cloner could safely edit without touching the loop.

The naive fix is simple: read the BMAD persona file and inject it through --append-system-prompt (the Claude Code CLI flag that lets you append text to the system prompt of an invocation).

That fails because BMAD personas are written for interactive use. They contain instructions like:

“Greet the user before proceeding.”
“On Activation: load config, present a menu, await selection.”
“HALT and wait for user input.”

A Ralph Loop is non-interactive by design. Every invocation runs unattended. If you inject an interactive persona verbatim, the model may do exactly what the persona says: greet a user who is not there, present a menu nobody can answer, or halt and wait. That is not a prompt bug. It is a context mismatch.

The right pattern was composition with override:

[Layer 1: Execution Context Override]
[Layer 2: BMAD Persona]
[Layer 3: Project-Specific Rules]

Layer 1 tells the agent how it is being run: non-interactively, with no greeting, no menus, and no waiting for clarification. When something is ambiguous, it should make a reasonable assumption and document it in the output.

Layer 2 is the live BMAD persona, read from .claude/skills/role-name/SKILL.md when the loop starts. No fork. No manual copy. The upstream persona can change and the loop can pick it up.

Layer 3 is the repo-specific contract: stack rules, checkpoint command, review standards, project conventions.

The ordering matters. The override comes first so the model understands the execution context before reading the interactive persona. The project rules come last so the general persona is narrowed to this repo’s actual constraints.
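The three layers can be assembled with plain concatenation. The sketch below is illustrative only: `compose_prompt` and the idea of keeping each layer in its own file are assumptions, not the loop's actual code.

```shell
# Sketch only: compose_prompt and the one-file-per-layer layout are hypothetical.
compose_prompt() {
  local override_file="$1" persona_file="$2" rules_file="$3" f

  # Fail fast if a layer is missing, so the loop never runs on a partial prompt.
  for f in "$override_file" "$persona_file" "$rules_file"; do
    [ -r "$f" ] || { echo "compose_prompt: cannot read $f" >&2; return 1; }
  done

  # Order is the contract: execution context, then persona, then project rules.
  printf '%s\n\n' "$(cat "$override_file")"
  printf '%s\n\n' "$(cat "$persona_file")"
  printf '%s\n' "$(cat "$rules_file")"
}
```

The output could then be handed to `--append-system-prompt`. Because the function does nothing but concatenate files, the same inputs always produce the same bytes.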

This also preserved the cost lesson from the first post. All three layers are stable across invocations inside a run, so the composed prompt stays cache-friendly.

The general lesson: when you integrate a third-party agent persona, manifest, or schema designed for another context, do not fork it unless you have to. Wrap it. A stable override in front and a stable customisation layer behind give you upstream movement without inheriting mismatched activation behaviour.

Three Decisions Worth Keeping

The first chapter surfaced three bugs. Each one landed as a fix(system) commit between the chapter’s story commits. That made the history messier. But the messier history is what made the chapter useful — and the choice to let the history stay messy was itself one of the decisions worth writing about.

Decision 1: Add a config knob, don’t hardcode the second case

The loop’s auto-commit step was built for the demo track. It staged docs/stories/ and src/. That was fine while the loop built the dashboard.

It was wrong when the loop started improving the loop.

The system track’s Dev agent modified scripts/ralph-loop.sh, but the auto-commit step ignored that path. Stories 1.2 and 1.3 produced commits that looked legitimate while leaving the load-bearing code in the working tree. Story 1.4 then had to rediscover and redo previous work, hit the budget cap, and failed.

I had a fork in front of me when I noticed.

The fast option was to widen the hardcoded staging list — add scripts/ and system/ to the default and ship. The system track would work. The demo track would still work. Total change: one line.

I rejected that. The whole reason the bug existed was a hardcoded list that assumed one work surface. Widening the list just hides the same assumption behind a longer list. The right move was to make the staging surface configurable per work surface. So the fix added an EXTRA_STAGE_PATHS environment variable; the system track wrapper sets it to include scripts/, system/, README.md, CLAUDE.md, and TIMELINE.md. The default stays demo-track-correct. The system track and any future third track configure their own scope.
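A configurable staging step along these lines might look like the sketch below. The variable name `EXTRA_STAGE_PATHS` and the path lists come from the text; the function name `stage_story_changes` and the space-separated format of the variable are assumptions.

```shell
# Sketch only. The function name and the space-separated format of
# EXTRA_STAGE_PATHS are assumptions; the variable name and path lists
# come from the text above.
stage_story_changes() {
  # Demo-track defaults: what the loop staged before the fix.
  set -- docs/stories src

  # Append extra paths, split on whitespace by the shell. Paths containing
  # spaces or glob characters are not supported in this sketch.
  # shellcheck disable=SC2086
  set -- "$@" ${EXTRA_STAGE_PATHS:-}

  local path
  for path in "$@"; do
    # Skip paths that do not exist so one missing folder cannot abort staging.
    if [ -e "$path" ]; then
      git add -- "$path"
    fi
  done
}

# A thin system-track wrapper might then do no more than:
#   export EXTRA_STAGE_PATHS="scripts system README.md CLAUDE.md TIMELINE.md"
#   exec scripts/ralph-loop.sh "$@"
```

With the variable unset, the function behaves as the demo track always did, which is the property that keeps cloners from ever having to configure it.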

But there was a second fork, and this is the one that mattered more. Stories 1.2 and 1.3 had already shipped to the git log with their actual code sitting orphaned in the working tree. I had two ways to repair the history: an interactive rebase that would quietly fold the rescued code into the existing 1.2 and 1.3 commits — clean log, no evidence of the bug — or a separate fix(1.2,1.3) rescue commit that leaves the trail visible.

I went with the rescue commit. The clean history would have been a marketing artifact: it would have read as if the loop had worked correctly from the start, which would be a lie. The honest history tells a future reader what actually happens when they try this on their own codebase: sometimes the loop ships docs without the code, you have to notice, and the fix is a config knob plus a separate rescue commit. Linux kernel development has worked this way for decades: published history is not rewritten, and fixes land as follow-up commits that point back at what they repair. AI tooling doesn’t get a pass.

The lesson is twofold. When a tool has more than one work surface, staging has to be configurable per surface — hardcoding the primary case is only safe until the secondary case becomes real. And: when the bug already shipped, don’t rebase it away. A scar that teaches beats a clean history that doesn’t.
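One way to notice this class of bug sooner is to check the working tree right after the auto-commit. The guard below is a hypothetical addition, not something the loop is described as doing; the function name and messages are invented for illustration.

```shell
# Hypothetical guard, not part of the loop as described.
assert_clean_after_commit() {
  local leftovers
  # --porcelain prints one line per modified, staged, or untracked path.
  leftovers=$(git status --porcelain)

  if [ -n "$leftovers" ]; then
    echo "auto-commit left changes behind:" >&2
    printf '%s\n' "$leftovers" >&2
    return 1
  fi
}
```

Had a check like this run after stories 1.2 and 1.3, the orphaned edits to scripts/ralph-loop.sh would have shown up as leftover paths instead of surfacing one story later.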

Decision 2: Strict on format, tolerant on placement

The first post described why review verdicts are files, not chat. This chapter taught me that a file contract can still be too brittle.

The review prompt told the agent to start the file with REVIEW_PASSED or REVIEW_FAILED. The parser read head -1 and checked for REVIEW_PASSED.

The Review agent wrote a markdown title first:

# Story 1.1 Code Review

REVIEW_PASSED

The intent was clear. The file contained a valid marker. But head -1 saw the title, and the loop treated a passing story as failed. It retried a review that had already passed until the retries were exhausted.

The loop reported “Story 1.1 failed code review 3 times. Marking as Manual Review Required.” My instinct was to override and move on. My second instinct, the one I followed, was to ask: why did this fail?

That question turned out to be the difference between a one-time workaround and a fix. The cause wasn’t agent misbehaviour — the agent had followed the spirit of the instruction perfectly. The cause was a parser that punished a totally natural LLM habit (writing a markdown title before a marker) with three wasted retry cycles.

The fix was to search for the first verdict marker line, not blindly trust the first line:

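is_review_passed() {
  local review_file="$1"
  # No review file means no verdict: treat as not passed.
  [[ -f "$review_file" ]] || return 1

  # Take the first line that starts with a verdict marker, wherever it sits.
  # `|| true` keeps a no-match from aborting the script under `set -e`.
  local verdict
  verdict=$(grep -m1 -E '^(REVIEW_PASSED|REVIEW_FAILED)' "$review_file" 2>/dev/null || true)
  [[ "$verdict" == "REVIEW_PASSED" ]]
}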

I also tightened the prompt to say the marker must be the literal first line. Belt and suspenders. The prompt should be unambiguous about the contract; the parser should be lenient where leniency is safe.

The lesson: with LLM-generated structured output, be strict about format and tolerant about placement where tolerance is safe. A markdown title before a marker is normal model behaviour. Punishing it with three retry cycles buys nothing.
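To make the difference concrete, here is a small side-by-side run. The old parser is reconstructed from the description above (read the first line, compare), and the `_old` and `_new` suffixes are hypothetical names used only for this comparison.

```shell
# Reconstruction for illustration; the _old/_new names are hypothetical.
is_review_passed_old() {
  # Trusts the first line blindly.
  [ "$(head -n 1 "$1" 2>/dev/null)" = "REVIEW_PASSED" ]
}

is_review_passed_new() {
  [ -f "$1" ] || return 1
  local verdict
  # First line that starts with a verdict marker, wherever it sits.
  verdict=$(grep -m1 -E '^(REVIEW_PASSED|REVIEW_FAILED)' "$1" 2>/dev/null || true)
  [ "$verdict" = "REVIEW_PASSED" ]
}

# A review file shaped like the one the Review agent wrote.
review_file=$(mktemp)
printf '# Story 1.1 Code Review\n\nREVIEW_PASSED\n' > "$review_file"

if is_review_passed_old "$review_file"; then echo "old: passed"; else echo "old: failed"; fi
if is_review_passed_new "$review_file"; then echo "new: passed"; else echo "new: failed"; fi
# prints:
#   old: failed
#   new: passed
```

Note that the first marker wins: a file that says REVIEW_FAILED and later REVIEW_PASSED still fails, so the tolerance applies to placement and not to the verdict itself.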

Decision 3: Rewrite the spec, not the budget

Story 1.4 had the best failure, because the agent was not wrong. The spec was.

The story asked for a byte-identical diff between the old hardcoded heredoc output and the new layered prompt output. At the same time, the chapter design deliberately changed the prompt order: Layer 1 (Execution Context) first, vs. the old structure with persona first.

So the spec demanded two incompatible things: preserve byte-identical output, and change the structure and ordering of the output.

The Dev agent tried anyway. It spent budget attempting to make an impossible diff empty, retried on a larger model, hit the same wall, exhausted the cap.

I had another fork.

I could double the budget cap, escalate to Opus, and let the agent confirm the impossibility for me. Maybe the bigger model would see a path I couldn’t. Realistically, the wall would hold, but the cost of confirming would only be a few more dollars.

I could acknowledge the spec was wrong. Rewrite the acceptance criterion. Re-run.

I went with the rewrite, without hesitation. The agent wasn’t failing because it lacked capability. It was failing because I’d given it an impossible success condition. More budget would only buy more proof of impossibility. The cheaper, smarter move was to fix the artifact that was wrong — the spec.

The new acceptance criterion: every meaningful claim in the old output is present somewhere in the new output. Ordering, formatting, separator differences — all ignored. Missing rules — caught.

The Dev agent built the verification script in one pass. First try. Story 1.4 closed.

Verification Has to Match the Shape of the Change

The byte-diff bug is the most portable lesson in the chapter.

Some refactors are lifts: same behaviour, different mechanism. Byte comparison can be a good gate there.

Extract a helper but call it from the same places.
Move code into a separate file and re-export it.
Reformat without changing behaviour.

Other refactors are shifts: the structure changes on purpose. Byte comparison is the wrong gate there.

Reorder the layers of a composed prompt.
Replace hardcoded content with loaded content — or insert a stable override layer ahead of an upstream persona.
Swap a synchronous path for an async one.

Chapter 1 was a shift. The whole point was to change prompt assembly while preserving the important claims inside the prompts.

The replacement gate was a content-preservation script. It captures the old and new prompt outputs, extracts significant lines from the old output, and checks that each meaningful rule still exists somewhere in the new output. It ignores ordering, separator text, trailing newlines, and harmless formatting differences. It fails on missing rules.
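A gate of this shape fits in a few dozen lines of shell. What follows is a minimal sketch, not the script the Dev agent wrote: the function names, the normalisation rules, and the definition of a "significant" line (any non-blank line that is not a pure separator) are all assumptions.

```shell
# Sketch only: the normalisation rules and the definition of a
# "significant" line are assumptions, not the repo's actual script.
normalize() {
  # Collapse whitespace runs, trim both ends, then drop blank lines and
  # separator-only lines made of -, =, *, or _.
  sed -e 's/[[:space:]][[:space:]]*/ /g' -e 's/^ //' -e 's/ $//' "$1" |
    grep -v -E '^[-=*_]*$'
}

content_preserved() {
  local old="$1" new="$2" missing=0 line new_norm
  new_norm=$(mktemp)
  normalize "$new" > "$new_norm"

  # Every significant old line must appear as a whole line in the new output.
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    if ! grep -q -x -F -- "$line" "$new_norm"; then
      echo "missing rule: $line" >&2
      missing=1
    fi
  done <<EOF
$(normalize "$old")
EOF

  rm -f "$new_norm"
  return "$missing"
}
```

For a lift, `cmp -s old.txt new.txt` is the whole gate. For a shift, `cmp -s` fails as soon as the layers are reordered, while `content_preserved old.txt new.txt` keeps passing until a rule actually goes missing.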

That is the contract I actually cared about.

The same idea shows up in property-based testing: when implementation can change but the contract must hold, test the contract. The vocabulary is older than LLMs. The application is new because agent prompts are now executable infrastructure.

If your refactor intentionally changes structure, your verification has to compare meaning, not bytes.

The Spec Is the Lever

The deepest lesson was not about bash, prompts, or model choice.

When the PRD and epic were crisp, the SM → Dev → Review cycle worked. When the spec contained a contradiction, the loop faithfully turned that contradiction into wasted spend.

That is the uncomfortable part of agentic coding loops: they can make bad instructions expensive faster than a human team can.

In Ralph Loop terms, PRD and epic quality are the strongest predictors of output quality. Not model routing. Not retry semantics. Not the cleverness of the orchestrator. Those matter, but they are second-order. The agents execute the setup. The setup is the work.

The implication is practical: iterate on the PRD before you iterate on the prompts. Ask whether each acceptance criterion matches the architectural intent. Look specifically for contradictions between the desired structure and the proposed verification gate.

A clear brief is worth more than a larger model.

Human teams already know this. The Ralph Loop just compresses the lesson into dollars per story.

What Travels

The chapter produced a few patterns that are not bound to Ralph, BMAD, or Claude Code.

Public self-improvement needs audience boundaries in the repo structure. Folders make the recursion visible; branches hide it.
Chapters beat scattered notes for visible iteration. ADRs document verdicts; chapters document arcs.
Live-load upstream artifacts, but adapt their context. Wrap third-party personas with an execution-context override instead of forking them into stale copies.
Structured output contracts should be resilient. Strict markers are good. Parsers that collapse on harmless markdown are not.
Verification primitives must match refactor shape. Byte-diff for lifts. Content-preservation gates for shifts.
When the bug ships, don’t rebase it away. Honest history teaches; clean history sells.
More budget can’t fix a contradictory spec. Rewrite the spec before escalating the model.

None of these are really about coding agents. They are about systems that produce systems, where the meta-layer is part of the product.

Months, Not Hours

The chapter in this post ran in roughly forty-eight hours. The readiness to attempt it took months.

I first encountered the Ralph Loop several weeks before any of this. I read Geoffrey Huntley’s original write-up, saw the shape, and recognised something useful. I also recognised the shape was not mine. My day-to-day was BMAD-driven, a workflow I had spent a long time tuning and trusted in a way I did not yet trust this new thing. I refused to disrupt one for the other. The loop went into the back of my head and I kept shipping the usual way while the idea cooked.

Sometime later, long enough that I no longer remember the exact day, a gut feeling arrived. The loop and the BMAD workflow were not in tension. They composed: the loop could drive a small team of agents running BMAD personas through one story at a time. The version worth building was the one shaped by my muscle memory, not someone else’s.

The first stretch of work, spanning a month or two, was about cost. The same work that became Hardening the Ralph Loop: the runtime baseline, the cache discipline, the budget caps, the model-tier routing. Useful work, but recognisably the easy half. The loop got cheaper; it did not get smarter.

Then the harder one: a path back. Teaching the loop to notice when the bug in front of it lives in a story it has already shipped — reach back, repair that story, and resume the current one. This is the capability that made unattended overnight runs stop feeling reckless. The full mechanism, and the rest of what has to be in place for that trust to be earned, is the next post in this series.

Only after the loop’s core behaviour had stopped moving did modularising the prompts become safe to attempt. Pulling hardcoded heredocs into composable layers is a restructuring operation; restructuring something whose semantics are still drifting is how you end up with a prettier version of a broken tool. Chapter 1 in this post is what that work looked like once it was finally on the table. It is the last surface to refactor, not the first.

The reason this matters for anyone cloning the demo: the repo is the artifact of months of iteration that the clone cannot transfer. A cloner gets the file structure, the scripts, the prompt layers, the chapter convention. They do not get the months of small adjustments that taught me which corners would hold weight and which would not. That knowledge is downstream of time spent. There is no version of this tool that hands it over on a Tuesday.

Software engineering compounds with hours invested in the same way every craft does. The repo is a starting line. The work is what comes after.

Why Publish the Mess

A cleaner version of this repo could exist.

It would have neat feat(1.1) through feat(1.6) commits, no fix(system) interruptions, no failed byte-diff story, and a chapter README that looked like I had known from the start that semantic equivalence was the right gate.

That repo would be less useful.

The bugs are not embarrassing. They are the demonstration. A self-improving loop that never exposes an infrastructure bug is either too trivial to learn from or too polished to trust.

Publishing the trail lets a future cloner see what actually happens when they try this on their own codebase: the staging surface is wrong, the parser is too brittle, the acceptance criterion contradicts the design, the budget cap catches the mistake, the chapter gets repaired, and the work continues. They get to see the decisions, not just the destination. Those are the moves that travel. The patterns alone don’t.

That is more useful than a marketing artifact.

The first post ended with the architectural escape hatch: bypass the interactive Claude Code runtime and build Gantry — my planned successor to the bash loop — directly on the Claude Agent SDK. That destination has not changed. But the bash loop taught me what Gantry needs to preserve: two-track public iteration, chapter-based improvement work, live-loaded personas with explicit adapters, semantic gates that match the shape of the change, and a maintainer who refuses to clean up the trail.

The bash loop was supposed to be temporary.

It turned out to be the apprenticeship.

The views expressed here are my own. The Ralph Loop demo repo and its system-track chapters are personal open-source projects, public on GitHub. The earlier post Hardening the Ralph Loop is the cost-economics half of this story. The Claude Agent SDK and BMAD Method are linked above where first used.