Hardening the Ralph Loop

The first version worked exactly as designed and still failed.

It spun up fresh Claude Code agents. It moved planned stories from spec to implementation to review. It produced real commits. It ran while I slept.

It also burned through my Claude Max weekly quota faster than if I had stayed awake and written the code myself.

So I hardened the obvious things first: route cheap models to cheap jobs, cache the static prompt, cap turns, cap spend, make review write a strict verdict file, and put a real build-and-test gate after the AI says a story is done. All of that helped. None of it touched the deepest leak.

The clue came from the smallest possible verification call:

claude -p --output-format json --max-turns 1 "say hi"

Two words. One turn. Twelve cents. Then the cold-cache number that explained the whole mess:

"cache_creation_input_tokens": 68453

Sixty-eight thousand tokens to say hi on the cheapest model Anthropic offers, before my own prompt had done any meaningful work.

That is the real subject of this post. Geoffrey Huntley’s Ralph Loop is the right idea: short-lived agents, fresh context, one bounded step at a time. But running that idea overnight on real code, on a subscription quota, is not only a prompt problem. It is an orchestration problem, a cost problem, a state problem, a verification problem, and eventually a runtime-baseline problem.

This is what it took to turn the loop from a clever bash trick into something I could trust against my own code.

Why I Wanted This

I’m a software architect with a demanding day role and not enough hours in the week. For a long stretch I lived in a tension I suspect a lot of senior engineers will recognise.

My day work earns my full attention. The responsibilities are real, people depend on the work, and I was never going to let any of that slide so I could play with personal projects at the edges.

But if all I did was the day role, I would atrophy. The most interesting era of my craft is happening right now, and AI is what made it so. For the first time in my career, a single engineer has enough leverage to design and ship serious systems alone, in the evenings, with no team around them. I want to be in this era properly, building in it rather than reading about it.

The problem is that none of that fits neatly into a workday that is already full.

Then I found the Ralph Loop. Two paragraphs and a bash one-liner, and it changed the shape of the problem.

The insight is small and exact. Long agentic sessions rot: context fills up, the agent loses the thread, and the whole thing drifts. A fresh agent per step pays a cold-start cost in exchange for a bounded task and a clean head. Compose those steps into a loop that drives stories end-to-end and, in principle, you have a system that can make progress while you sleep.

That was the missing piece. Not a more productive working day; I already have one of those. What I needed was a way to make progress on my own projects without trading hours I did not have.

So I spent a weekend doing what a weekend can hold: brainstorming a project, researching it, writing a PRD, decomposing it into epics and stories, and putting the plan into the filesystem using the BMAD Method. Then I started the loop at night and went to sleep.

The first nights did not go well.

The loop did what I asked: fresh contexts, bounded roles, real edits, real commits. It also consumed quota like an unattended space heater. The point was time and quota back. The loop was returning neither.

The Loop I Was Trying To Trust

The loop is easier to understand if you ignore the hardening first.

A single story moves through three fresh claude invocations:

SM Agent     -> writes the implementation spec from the epic and PRD
Dev Agent    -> reads the spec and ships code
Review Agent -> audits the code against the spec

If review fails, a Fix Agent runs and the Review Agent checks again. If review passes, the loop runs the project’s checkpoint command - build, tests, lint, or whatever the repo treats as truth. Only after that gate passes does the script commit.
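The control flow just described can be sketched in a few lines of bash. This is a minimal illustration, not the actual script: run_agent, review_passed, run_checkpoint, and commit_story are hypothetical hooks, and the stub bodies exist only so the sketch runs on its own.

```bash
#!/usr/bin/env bash
# Sketch only. run_agent, review_passed, run_checkpoint and commit_story are
# hypothetical hooks; replace the stubs with real claude invocations and the
# project's own build/test command.
MAX_REVIEW_RETRIES=${MAX_REVIEW_RETRIES:-2}

run_agent()      { echo "agent:$1 story:$2"; }  # stub: one fresh invocation
review_passed()  { return 0; }                  # stub: inspect the review verdict
run_checkpoint() { return 0; }                  # stub: build + tests
commit_story()   { echo "commit:$1"; }          # stub: git commit

process_story() {
  local story_id=$1 attempt=0
  run_agent sm "$story_id"
  run_agent dev "$story_id"
  run_agent review "$story_id"
  # Fix -> Review until the verdict passes or the retries run out.
  until review_passed "$story_id"; do
    if (( attempt >= MAX_REVIEW_RETRIES )); then
      echo "manual-review:$story_id"
      return 1
    fi
    attempt=$(( attempt + 1 ))
    run_agent fix "$story_id"
    run_agent review "$story_id"
  done
  # The commit happens only after the independent gate passes.
  if run_checkpoint "$story_id"; then
    commit_story "$story_id"
  else
    echo "checkpoint-failed:$story_id"
    return 1
  fi
}
```

Calling process_story 2.3 with the stubs in place walks SM, Dev, and Review once and ends with the commit step; swapping review_passed for a stub that always fails shows the retry bound.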

BMAD (the Breakthrough Method for Agile AI-Driven Development) matters here because it gives the loop contracts. The Scrum Master, Developer, Reviewer, Fix, Upstream Fix, and Test Architect roles are not just different names for the same prompt. They are different responsibilities with different artifacts at each handoff. The Ralph Loop becomes much more useful when those handoffs are explicit enough to mechanise.

The basic pipeline worked. The hard part was making it bounded enough to leave running overnight.

The version I now run on my own work, including a personal project called Affiant, changed less inside the core loop than around it:

a model ledger, so each role spends the minimum useful amount
prompt caching, so static context is not paid for repeatedly
turn caps, so confused agents stop before they spiral
retry semantics, so terminal reasons become signals instead of noise
spend caps, so one hard story cannot eat the whole sprint
file verdicts, so natural language does not decide control flow
an independent checkpoint, so the AI cannot mark its own work done

Those are the parts worth stealing even if you never use Ralph, BMAD, or Claude Code.

Spend Where the Judgment Is

A naive Ralph Loop calls the same model for every role. Mine did too at first. That was wasteful.

The roles have different jobs.

The Scrum Master role is mostly transformation: read an epic section, expand it into a structured implementation spec, preserve the acceptance criteria, and make the work crisp enough for the next agent. There is not much frontier reasoning there. Haiku handles it.

The Dev role is implementation: read files, edit code, run tests, fix compile errors. Sonnet handles that well enough without paying Opus rates for every tool call.

The Review role is where the expensive mistake lives. A weak reviewer that signs off on broken code costs you the next agent run, and the one after that, until a human has to clean it up. Review gets Opus.

BMAD Role	Model	Cost Category	Primary Responsibility
Scrum Master (SM)	Haiku	Low ($)	Structuring plans and specs
Developer (Dev)	Sonnet	Medium ($$)	Writing code, fixing builds
Reviewer / Auditor	Opus	High ($$$)	Strict contract verification

In bash, the important bit is boring:

MODEL_SM="haiku"
MODEL_DEV="sonnet"
MODEL_REVIEW="opus"
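One way to keep the routing in a single place is a small lookup function. This is a sketch: model_for_role is a hypothetical helper, and sending the Fix role to the Dev model is an assumption made for illustration.

```bash
MODEL_SM="haiku"
MODEL_DEV="sonnet"
MODEL_REVIEW="opus"

# model_for_role is a hypothetical helper. Mapping "fix" to the Dev model is
# an assumption, not something the ledger above specifies.
model_for_role() {
  case "$1" in
    sm)      echo "$MODEL_SM" ;;
    dev|fix) echo "$MODEL_DEV" ;;
    review)  echo "$MODEL_REVIEW" ;;
    *)       echo "unknown role: $1" >&2; return 1 ;;
  esac
}
```

An unknown role fails loudly instead of silently falling back to the most expensive model.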

Routing Dev to Sonnet was the decision I resisted longest. I had built up months of intuition filing Opus as the serious model and Sonnet as the quick one, and Dev is where the code actually gets written.

What changed my mind was the architecture I had already built. The whole SM -> Dev -> Review pipeline assumes every individual agent is imperfect. If Sonnet misses something subtle, Opus Review should catch it. If Review misses it, the checkpoint command should catch the build or test failure. The system was already designed around an imperfect Dev agent. I just had not trusted the design enough.

Per Anthropic’s published pricing, Haiku is several times cheaper per token than Sonnet, and Sonnet is several times cheaper than Opus. Routing each role to the cheapest model that can do the job moved a sprint from all-Opus burn to something much closer to survivable.

It was not enough.

The Cache Is the Other Half

Every agent invocation in the loop sends a large static preamble: role instructions, project conventions, layering rules, review standards, artifact contracts. None of that changes between stories. All of it is expensive if the runtime tokenises it from scratch on every call.

Anthropic’s prompt cache rewards a byte-identical prefix within a short TTL: by default, an entry expires after five minutes without a hit. If build and test runs between stories take longer than that, the next invocation pays the full cold-start cost anyway, which makes the gap between invocations a critical metric to watch.
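A rough way to watch that gap from inside the script is to timestamp each invocation and compare against the TTL. This is a sketch under the five-minute assumption above; cache_probably_warm and mark_invocation are hypothetical helpers, and a warm result only means a hit is plausible.

```bash
CACHE_TTL_SECONDS=300   # the five-minute window described above
last_call_epoch=0       # 0 means no invocation has happened yet

# Record when an invocation finished (optional arg: epoch seconds, for testing).
mark_invocation() {
  last_call_epoch=${1:-$(date +%s)}
}

# Succeeds if the previous invocation is recent enough for a cache hit to be plausible.
cache_probably_warm() {
  local now=${1:-$(date +%s)}
  (( last_call_epoch > 0 && now - last_call_epoch < CACHE_TTL_SECONDS ))
}
```

Calling cache_probably_warm before each agent run, and logging a warning when it fails, makes slow checkpoints show up in the run log as likely cold starts.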

The way I get cache hits is to push the static preamble into --append-system-prompt, build it once at the top of the run, and pass the same string to every invocation:

claude -p \
  --model "$current_model" \
  --max-turns "$current_turns" \
  --append-system-prompt "$SYSTEM_PROMPT_DEV" \
  --output-format json \
  "$(cat "$user_prompt_file")"

The user prompt file stays small: story ID, story path, current attempt, and the specific task. Anything shared across stories belongs in the cached system prompt.

Two rules make this work:

No story-specific or invocation-varying content in the system prompt. One changed byte can miss the cache.
Build prompts once, log their byte sizes, and pass the strings around. Reconstructing prompts per invocation invites whitespace drift.
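Both rules can be enforced mechanically. The sketch below assumes the preamble is assembled from a fixed list of files; build_system_prompt and prompt_fingerprint are hypothetical helpers, and the file names in the usage comment are placeholders.

```bash
# Concatenate the preamble files in a fixed order. Command substitution by the
# caller strips trailing newlines, so the result is stable across calls.
build_system_prompt() {
  cat -- "$@"
}

# Fingerprint the exact string that will be sent: "<bytes>:<checksum>".
prompt_fingerprint() {
  local bytes sum
  bytes=$(printf '%s' "$1" | wc -c | tr -d ' ')
  sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo "${bytes}:${sum}"
}

# Usage at the top of a run (file names are placeholders):
#   SYSTEM_PROMPT_DEV=$(build_system_prompt role-dev.md conventions.md)
#   FINGERPRINT_DEV=$(prompt_fingerprint "$SYSTEM_PROMPT_DEV")
# Before each invocation, a mismatch means the prompt drifted:
#   [[ $(prompt_fingerprint "$SYSTEM_PROMPT_DEV") == "$FINGERPRINT_DEV" ]] \
#     || log_error "system prompt changed mid-run"
```

Logging the fingerprint once per run also gives a quick answer when cache-read counts drop unexpectedly between sprints.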

Within a single sprint run, cache-read tokens routinely outnumber fresh input tokens by a wide margin. This was the biggest cost lever inside the bash script.

The bigger lever was outside it.

Turns, Retries, and Budget Are One System

--max-turns is the cheapest insurance policy in the loop. It caps how many tool-use rounds an agent can take before the SDK terminates it. Without it, a confused agent can grind on a stuck test and burn a large chunk of quota for nothing.

I cap each role separately because the work shape differs:

MAX_TURNS_SM=15
MAX_TURNS_DEV=40
MAX_TURNS_REVIEW=25
MAX_TURNS_FIX=30

The Dev cap is the one that needs tuning. A complex story can legitimately need more turns than a simple one. A Sonnet run that hits the cap halfway through implementation is pure waste unless the loop knows how to salvage it.

The first fix is crude but useful: scale the Dev turn budget by spec size.

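scale_dev_turns() {
  # Reads $story_file and $base_turns from the caller's scope.
  # Specs over 500 lines get 1.75x the base cap; over 300 lines, 1.25x.
  local lines
  lines=$(wc -l < "$story_file")
  if   [[ $lines -gt 500 ]]; then echo $(( base_turns * 7 / 4 ))
  elif [[ $lines -gt 300 ]]; then echo $(( base_turns * 5 / 4 ))
  else echo "$base_turns"
  fi
}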

Spec line count is a rough proxy for implementation scope. It is also free, deterministic, and right often enough to avoid some pointless retries.

The second fix is to stop treating every non-success as the same failure.

Exit code 2 is a usage error: bad flag, malformed prompt, missing argument. Retrying with the same inputs just pays for the same error again.

if [[ $rc -eq 2 ]]; then
  log_error "$label: usage error (exit 2) - not retrying"
  break
fi

terminal_reason: max_turns is different. The agent ran out of budget, but the work-in-progress may be real. The Dev agent may already have written working code to disk before the cap hit.

if [[ "$terminal_reason" == "max_turns" ]]; then
  log_warn "$label: max_turns hit ($num_turns turns, session=$session_id)"
  return 3
fi

That special return lets the caller branch:

Salvage the working tree if Dev hit the cap, files changed, and the checkpoint command passes.
Resume the session with --resume <session_id> if Review hit the cap while preparing a verdict.
Escalate only when needed: route a second attempt to Opus and double the turn budget.

if [[ $attempt -gt 0 && "$model" != "$ESCALATION_MODEL" ]]; then
  current_model="$ESCALATION_MODEL"
  current_turns=$(( max_turns * 2 ))
fi
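The three branches above reduce to a small decision function. This is a sketch, not the real caller: next_action_on_max_turns is a hypothetical name, and the two flags are assumed to be computed beforehand, for example from git status and the checkpoint exit code.

```bash
# Decide what to do after an agent hits max_turns (return code 3 above).
# Args: role, files_changed (0/1), checkpoint_ok (0/1). All names are hypothetical.
next_action_on_max_turns() {
  local role=$1 files_changed=$2 checkpoint_ok=$3
  if [[ $role == dev && $files_changed -eq 1 && $checkpoint_ok -eq 1 ]]; then
    echo salvage    # keep the working tree and move on to review
  elif [[ $role == review ]]; then
    echo resume     # continue the same session with --resume <session_id>
  else
    echo escalate   # retry on the escalation model with a larger turn budget
  fi
}
```

Keeping the decision pure, with inputs in and one word out, makes it easy to log and to test without spending a single token.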

Then add hard spend caps, because turns are not the only way a story can spiral:

--budget-per-invocation-usd 2.00
--budget-per-story-usd 8.00

The per-invocation cap passes through to Claude Code’s own --max-budget-usd. The per-story cap is mine: parse total_cost_usd from each JSON result, accumulate it under the story ID, and stop the retry loop when the story exceeds its budget.

On subscription auth, the USD value is still a useful proxy for quota burn. It gives the loop a hard surface to push against.
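The per-story ledger can be as small as a running total. The sketch below assumes one story is processed at a time and that each invocation's cost has already been extracted from the JSON result; add_spend and over_budget are hypothetical helpers, and the jq path in the comment is an assumption about where the field sits.

```bash
BUDGET_PER_STORY_USD=8.00
story_spend_usd=0   # reset to 0 when a new story starts

# Add one invocation's cost. Bash has no float arithmetic, so awk does the sum.
# The cost might come from something like (path is an assumption):
#   cost=$(jq -r '.total_cost_usd' <<<"$result_json")
add_spend() {
  story_spend_usd=$(awk -v a="$story_spend_usd" -v b="$1" \
    'BEGIN { printf "%.6f", a + b }')
}

# Succeeds (exit 0) once the story has spent more than its cap.
over_budget() {
  awk -v s="$story_spend_usd" -v cap="$BUDGET_PER_STORY_USD" \
    'BEGIN { exit !(s > cap) }'
}
```

Inside the retry loop, a check such as "over_budget && break" after each add_spend is enough to stop a story that keeps failing expensively.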

Verdicts Are Files

This is the contract I am strictest about.

An earlier version had the Review agent return its verdict in chat. The orchestrator scanned the prose for approval phrases like “looks good” or “no issues found”. One night, a story shipped because the reviewer wrote something close enough to approval for the parser to count it as a pass.

The build was broken. The reviewer had actually described the problem. The parser had listened to the wrong part of the sentence.

A better parser was not the answer. Natural language was the wrong interface for a yes/no control-flow decision.

Now the Review Agent writes a file. The first line must be exactly one of two strings:

REVIEW_PASSED

or:

REVIEW_FAILED

The orchestrator reads only that first line. It does not care what the agent says in chat. It does not fuzzy-match prose. It does not ask another LLM to interpret what “this looks reasonable” might mean. The file is the contract, and the first line is the branch.
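Reading the contract takes very little code. A minimal sketch, assuming the verdict lives at a path the orchestrator already knows; read_verdict is a hypothetical name, and anything other than an exact first-line match is treated as invalid rather than guessed at.

```bash
# Print PASSED, FAILED or INVALID based only on the first line of the file.
# Exit status: 0 = passed, 1 = failed, 2 = missing or malformed verdict.
read_verdict() {
  local file=$1 first_line=""
  [[ -r $file ]] || { echo INVALID; return 2; }
  # "|| true" tolerates a file whose only line has no trailing newline.
  IFS= read -r first_line < "$file" || true
  case "$first_line" in
    REVIEW_PASSED) echo PASSED ;;
    REVIEW_FAILED) echo FAILED; return 1 ;;
    *)             echo INVALID; return 2 ;;
  esac
}
```

A verdict such as "Looks good, REVIEW_PASSED" comes back INVALID, which is the point: the agent either follows the contract or the story does not advance.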

The same pattern handles cross-story root cause analysis. If the reviewer believes the current bug came from an earlier story, it must emit a structured marker block:

UPSTREAM_FIX_REQUIRED: 2.3
ROOT_CAUSE: <one-line description>
AFFECTED_FILES: <comma-separated paths>
CURRENT_IMPACT: <how it manifests now>

The orchestrator parses the story ID and routes a separate Upstream Fix Agent to the offending story. After that fix lands, a cascade verification runs the checkpoint before the current story gets reviewed again.
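Parsing that block can stay in plain bash. The sketch below assumes story IDs have the epic.story shape shown in the example (such as 2.3) and that each field sits on its own line; upstream_story_id and marker_field are hypothetical helpers.

```bash
# Print the upstream story ID if the marker line is present; fail otherwise.
upstream_story_id() {
  local file=$1 line
  local re='^UPSTREAM_FIX_REQUIRED:[[:space:]]*([0-9]+\.[0-9]+)[[:space:]]*$'
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line =~ $re ]]; then
      echo "${BASH_REMATCH[1]}"
      return 0
    fi
  done < "$file"
  return 1
}

# Print the value of a "KEY: value" line, e.g. marker_field verdict.md ROOT_CAUSE
marker_field() {
  local file=$1 key=$2 line
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == "$key: "* ]]; then
      echo "${line#"$key: "}"
      return 0
    fi
  done < "$file"
  return 1
}
```

If upstream_story_id fails, the orchestrator can treat the verdict as an ordinary REVIEW_FAILED and skip the upstream routing entirely.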

If an LLM decision controls your next step, turn the decision into a structured artifact. Prose is for explanation. Files are for contracts.

The Gate the AI Cannot See

Most automated coding loop demos underplay the gate.

After review passes, my script still does not trust the reviewer. It runs the project’s actual checkpoint command in a separate process: dotnet build and dotnet test, or cargo build and cargo nextest run, or whatever the project treats as truth.

The output is a real exit code from a real toolchain.

chk_rc=0  # reset first so a failure from an earlier story cannot leak in
chk_output=$(run_checkpoint) || chk_rc=$?
if [[ $chk_rc -ne 0 ]]; then
  # The AI said the work was good. The build disagreed.
  # The build wins.
  ...
fi

If the checkpoint disagrees with review, the loop gives the AI one chance to use that evidence. It invokes the Review Agent again, passes the captured failure output, and tells it to overwrite its verdict file with REVIEW_FAILED plus actionable corrections for the Dev Agent. Then the fix loop re-enters.

If that still fails, the story fails hard. No commit. No false green.

The principle is simple: if an AI can mark something done, it can be wrong about it. The independent checkpoint is the one truth source the loop cannot talk its way past. Git commit happens after the gate, not before.

That is also why the loop treats commit messages of the form feat(2.3): <story title> as the source of truth for whether a story is done. If the commit exists, the story is done. If not, nothing else counts.
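Checking that source of truth is a prefix match over commit subjects. This is a sketch assuming the feat(<id>): convention above; subjects_include_story and story_is_done are hypothetical names. The matcher reads all of its input instead of returning early, so the git process is never cut off mid-pipe.

```bash
# Reads commit subjects on stdin; succeeds if one starts with "feat(<id>): ".
subjects_include_story() {
  local prefix="feat($1): " subject found=1
  while IFS= read -r subject; do
    if [[ $subject == "$prefix"* ]]; then
      found=0
    fi
  done
  return "$found"
}

# Succeeds if the current branch already has a commit for the story.
story_is_done() {
  git log --format=%s | subjects_include_story "$1"
}
```

Because the match is on the literal prefix, story 2.3 is not confused with story 2.30, and a restarted sprint can skip anything story_is_done already reports as committed.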

Manual Review Is Not Failure

The loop is designed to fail visibly and keep moving.

Three exits route a story to Manual Review Required:

Review failed MAX_REVIEW_RETRIES times.
Per-story spend exceeded --budget-per-story-usd.
Upstream-fix chain depth exceeded the configured limit.

In each case the loop moves to the next story instead of aborting the whole run. A sprint that commits twelve stories and marks three for human review is much more useful than one that dies at story four because story four was hard.

The progress file at the end of the run shows which stories committed, which need eyes, and what each one cost. That is the file I read first thing in the morning.
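A tab-separated progress file is enough to support that morning read. The sketch below is one possible layout, not the actual file format: the path, the status names, and both helpers are assumptions.

```bash
PROGRESS_FILE=${PROGRESS_FILE:-sprint-progress.tsv}   # hypothetical path

# Append one row. Args: story id, status (COMMITTED | MANUAL_REVIEW),
# reason (may be empty), cost in USD.
record_story() {
  printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$PROGRESS_FILE"
}

# List the stories that need a human, with the reason they were bounced.
needs_eyes() {
  awk -F'\t' '$2 == "MANUAL_REVIEW" { print $1 ": " $3 }' "$PROGRESS_FILE"
}
```

Each of the three exits above would call record_story with MANUAL_REVIEW and its own reason, then continue with the next story.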

The Part My Script Could Not Fix

After all that hardening, I expected the quota math to work. The obvious levers were in place: model routing, cache, turns, retries, spend caps, file verdicts, build gate.

Before kicking off the next sprint, I ran a tiny verification call to make sure my jq paths still matched Claude Code’s JSON output:

claude -p --output-format json --max-turns 1 "say hi"

The response was well-formed. The cost line was not:

"total_cost_usd": 0.11574475,
"cache_creation_input_tokens": 17191,
"cache_read_input_tokens": 16142

Twelve cents to say hi, with seventeen thousand tokens written to the cache and another sixteen thousand read back from it.

Then I tried the same call against a cold cache on the cheapest model:

"cache_creation_input_tokens": 68453

That was the moment the problem changed shape. My script could cache its own prompts perfectly and still lose, because the runtime was loading a huge baseline before my prompt arrived.

The diagnosis was not glamorous:

find . -maxdepth 3 -name ".mcp.json"
jq '.enabledPlugins' ~/.claude/settings.json
ls -A .claude/skills/ | wc -l

Project-scoped MCP servers in adjacent repos. Two user-level plugins enabled. A root CLAUDE.md. Claude Code’s own system prompt and tool schemas. And the real surprise: a project-level .claude/skills/ directory at my monorepo root with eighty-one entries accumulated across months of building tooling for the rest of my ecosystem.

Eighty-one skills, at least partially eligible for auto-discovery, on every fresh claude -p invocation.

The runtime was doing exactly what it was designed to do. In an interactive session, skills, plugins, hooks, MCP servers, and auto-loaded project instructions are helpful. They put the developer’s world inside reach without making the developer wire everything by hand.

The Ralph Loop inverts the economics. Its whole virtue is that every step starts fresh. In a headless overnight sprint, that means dozens of fresh invocations, and every one starts by paying for the interactive runtime baseline.

All the hardening inside the bash script was real. None of it could touch sixty-eight thousand tokens being loaded before the story prompt even arrived.
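The scale is easier to see as arithmetic. The sketch below is a worst-case estimate that assumes every invocation pays the full cold-cache baseline; in practice some invocations would read that baseline from cache at a lower cost. The seven stories and three invocations per story are illustrative inputs, and baseline_tokens is a hypothetical helper.

```bash
# Worst-case baseline tokens for a sprint, before any story prompt is counted.
baseline_tokens() {
  local stories=$1 invocations_per_story=$2 tokens_per_invocation=$3
  echo $(( stories * invocations_per_story * tokens_per_invocation ))
}

# Seven stories, three agents each (SM, Dev, Review), 68,453 tokens per cold start.
baseline_tokens 7 3 68453   # prints 1437513
```

Fix retries and upstream fixes only add invocations, so the real multiplier per story is at least three.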

Claude Code Was Doing Its Job

That discovery made the conclusion less satisfying but more useful: Claude Code was not broken. I was using an interactive assistant runtime as a headless automation substrate.

Those are different jobs.

Wrapping the CLI in bash was the fastest way to get Claude Code’s excellent tool-execution engine (editing files, running commands, and searching) working in a loop without writing hundreds of lines of API boilerplate. But as the abstraction leaked, the CLI wrapper became a bottleneck.

There are partial workarounds. Move the skills directory aside before a run. Disable plugins in user settings. Push project-scoped .mcp.json files out of the working tree. Keep the loop in a narrow repo instead of a large monorepo. All of that helps, and all of it feels crude because bash is the wrong layer to own environment setup and teardown.
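The first workaround can at least be made reversible. This is a sketch of the move-aside approach, assuming a rename is enough to keep the directory out of discovery; hide_dir and restore_dir are hypothetical helpers, and the .ralph-hidden suffix is arbitrary.

```bash
# Rename a directory out of the way for the duration of a run.
hide_dir() {
  local dir=${1%/}                                   # drop a trailing slash
  [[ -d $dir ]] || return 0                          # nothing to hide
  [[ -e $dir.ralph-hidden ]] && return 1             # refuse to clobber a leftover
  mv -- "$dir" "$dir.ralph-hidden"
}

# Put the directory back if hide_dir moved it.
restore_dir() {
  local dir=${1%/}
  [[ -d $dir.ralph-hidden ]] || return 0
  mv -- "$dir.ralph-hidden" "$dir"
}

# Usage inside the loop script:
#   hide_dir .claude/skills
#   trap 'restore_dir .claude/skills' EXIT
#   ... run the sprint ...
```

The EXIT trap covers normal exits and most interruptions; a run that is killed outright leaves the .ralph-hidden directory behind, and restore_dir can put it back by hand.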

The architectural fix is to bypass the interactive runtime for this automation path. Talk to the Claude Agent SDK directly. Put the SM -> Dev -> Review pipeline, file verdicts, checkpoint gate, retry semantics, spend ledger, and upstream-fix cascade into a binary I control.

In that world, system prompts and tool schemas are explicit decisions. BMAD roles are packaged because the binary needs them, not because they happen to sit in a .claude/skills/ directory somewhere above the current working directory. The runtime baseline should sit in the low thousands of tokens, not the tens of thousands.

That binary is the project I am building next, in the open, called Gantry. The success bar I have committed to is concrete: a seven-story sprint at thirty percent or less of the quota a naive Ralph bash script consumes.

What the Loop Gave Back

Even before Gantry exists, the loop has changed how I work.

I start sprints before bed. In the morning there is a progress file: which stories committed, which got bounced to manual review, and what each one cost. Some mornings it is a clean read. Other mornings it is a list of problems I need to inspect. Either way, I did not spend an evening I could not spare.

The less obvious value is what the loop did to my planning.

A story the SM agent cannot write a spec from is a story I have not thought through. The loop is a blunt reviewer of my own ambiguity. BMAD and the Ralph Loop compose well because BMAD insists every artifact has a shape and every handoff has a contract. The loop turns that insistence into a runtime requirement. Vague planning gets caught at the SM step before it burns a Dev run.

The patterns also travel:

File-as-protocol when an LLM decision controls automation.
Independent gates when the AI says work is done.
Terminal reasons as branch signals, not generic failures.
Manual review as a visible state, not a collapsed exception.
Runtime baseline as a cost surface, not background noise.

None of those are bound to Ralph, BMAD, Claude Code, or any one model. They are habits for building systems that share work with an AI without letting the AI define success by itself.

My day role gets them too.

Why I Am Still Building It

I do not know if Gantry will work. I do not know if it will matter. The space moves quickly enough that six months from now the tooling may have shifted and the orchestrator I am building may be redundant before it is finished.

That is not a reason to sit out.

Building directly on top of these models teaches lessons that a standard product cycle rarely does. Real code, real constraints, real failures, real quota, real build gates. Nothing checks my homework except the toolchain.

That is the era I want to be in, with both hands.

And I get my evenings back.

The views expressed here are my own. Affiant and the Gantry orchestrator are personal open-source projects. Claude Code’s CLI flags are documented in Anthropic’s reference; the Claude Agent SDK and BMAD Method are linked above where first used.