May 19, 2026

Closed-Loop Iteration Beats Blind Review (23,760 Benchmarks)

Johnathen Chilcher Senior SRE, TechLoom

Developers iterate. They generate code, run tests, fix failures, and run again. The question isn’t whether to iterate; it’s whether the loop has signal. Blind “review and improve” prompts lower quality (three rounds cost 3.64 points overall), but a closed loop with real test feedback lifts the hardest model-language combinations by up to 19 points.

I ran 23,760 benchmarks testing eleven iteration patterns across four languages. The results: blind iteration is flat at best and clearly worse by the third turn, while closed-loop iteration (tests → code → run tests → fix → pass) works. The catch: front-loading comprehensive instructions in a single turn still wins on efficiency. Closed-loop shines when you’re working with a model-language pair that struggles (Haiku+C#, Haiku+Go).

This is the ninth experiment in our benchmark series, following the context pollution, GSD methodology, and capstone best-practices studies. All data is open source at claude-benchmark.

Two Types of Iteration

Not all multi-turn patterns are equal. The critical distinction:

  • Open-loop (blind): “Review and improve your code.” No test results, no compiler errors, no signal. The model guesses what to change.
  • Closed-loop (evidence-based): Run tests between turns. Feed failures back. The model fixes what’s actually broken.

Open-loop iteration is how most people prompt AI today. Closed-loop iteration is how Claude Code’s tool-use actually works. The gap between them is enormous.
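To make the distinction concrete, here is a minimal sketch of the two kinds of second-turn message. The `CaseResult` type, the helper name, and the prompt wording are illustrative assumptions, not the wording used in the benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical shape, for illustration only)."""
    name: str
    passed: bool
    message: str = ""


# Open-loop: the follow-up carries no information about what is broken.
BLIND_FOLLOW_UP = "Review and improve your code."


def closed_loop_follow_up(results: List[CaseResult]) -> Optional[str]:
    """Build a follow-up prompt from real test results.

    Returns None when every test passed, meaning there is nothing to fix.
    """
    failures = [r for r in results if not r.passed]
    if not failures:
        return None
    lines = [f"{len(failures)} of {len(results)} tests failed. Fix only these failures:"]
    for r in failures:
        lines.append(f"- {r.name}: {r.message}")
    return "\n".join(lines)


results = [
    CaseResult("test_add", True),
    CaseResult("test_div_zero", False, "ZeroDivisionError"),
]
print(closed_loop_follow_up(results))
```

The closed-loop message names only the failing test, and returning `None` on a clean run gives the caller an explicit reason to stop.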

Experiment Design

I tested six variants in v1 (blind iteration) and five in v2 (test-feedback iteration):

V1 — Blind iteration (12,960 runs):

| Variant | Description |
|---|---|
| single-turn-bare | One prompt, one response. No system instructions. |
| single-turn-kitchen-sink | One prompt with comprehensive instructions (the “kitchen-sink” from prior experiments). |
| two-turn-bare | First turn generates code, second turn says “review and improve your code.” |
| two-turn-kitchen-sink | Same as two-turn-bare but with kitchen-sink instructions in the system prompt. |
| two-turn-specific-feedback | First turn generates, second turn gives specific feedback (e.g., “check edge cases, verify error handling”). |
| three-turn-bare | Three rounds of “review and improve.” |

V2 — Closed-loop with test feedback (10,800 runs):

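| Variant | Description |
|---|---|
| two-turn-test-feedback | Generate code, run tests, feed pass/fail results back, fix. |
| two-turn-test-feedback-kitchen-sink | Same as above with comprehensive system instructions. |
| two-turn-blind-review | Control: generate then “review and improve” (no test results). |

The remaining two v2 variants are the single-turn-bare and single-turn-kitchen-sink baselines, which appear in the v2 results below.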

Each variant was run across:

  • 3 models: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus
  • 48 cross-language tasks: Python, Go, JavaScript, C#
  • 15 repetitions per model-task-variant combination
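The run counts quoted in this post follow from that grid. A quick sanity check (the variable names are mine, not from the benchmark tool):

```python
# Sanity check of the run counts quoted in this post.
MODELS = 3
TASKS = 48
REPETITIONS = 15

runs_per_variant = MODELS * TASKS * REPETITIONS  # 2,160 runs for each variant

v1_runs = runs_per_variant * 6  # six blind-iteration variants
v2_runs = runs_per_variant * 5  # five test-feedback variants
total_runs = v1_runs + v2_runs

print(v1_runs, v2_runs, total_runs)  # 12960 10800 23760
```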

The Scoreboard: Blind Iteration Fails

V1 results — open-loop variants:

| Variant | Python | Go | JavaScript | C# | Overall | Avg Tokens |
|---|---|---|---|---|---|---|
| single-turn-bare | 88.21 | 68.87 | 67.17 | 63.17 | 71.86 | 1,393 |
| single-turn-kitchen-sink | 88.19 | 81.55 | 72.40 | 77.00 | 79.79 | 2,391 |
| two-turn-bare | 86.66 | 69.75 | 67.86 | 62.69 | 71.75 | 3,187 |
| two-turn-kitchen-sink | 87.49 | 81.39 | 72.33 | 76.47 | 79.44 | 4,969 |
| two-turn-specific-feedback | 85.57 | 67.09 | 68.53 | 64.75 | 71.49 | 3,240 |
| three-turn-bare | 81.20 | 65.30 | 65.40 | 61.13 | 68.22 | 5,433 |

Three-turn blind review scores 68.22, the worst of all variants and 3.64 points below single-turn-bare (71.86). The model seems compelled to change something on every turn, which introduces regressions in already-correct code. Without test signal, iteration is just noise.
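To see how the regression is spread across languages, the per-language deltas can be recomputed from the scoreboard. This is a small sketch with scores copied from the table above; the helper name is my own.

```python
# Per-language scores copied from the V1 scoreboard (bare variants only).
LANGUAGES = ["Python", "Go", "JavaScript", "C#"]
V1_SCORES = {
    "single-turn-bare": [88.21, 68.87, 67.17, 63.17],
    "two-turn-bare": [86.66, 69.75, 67.86, 62.69],
    "two-turn-specific-feedback": [85.57, 67.09, 68.53, 64.75],
    "three-turn-bare": [81.20, 65.30, 65.40, 61.13],
}


def delta_vs_bare(variant: str) -> dict:
    """Score change per language relative to single-turn-bare."""
    baseline = V1_SCORES["single-turn-bare"]
    return {
        lang: round(score - base, 2)
        for lang, score, base in zip(LANGUAGES, V1_SCORES[variant], baseline)
    }


for variant in ("two-turn-bare", "two-turn-specific-feedback", "three-turn-bare"):
    print(variant, delta_vs_bare(variant))
```

For three-turn-bare the delta is negative in all four languages (Python -7.01, Go -3.57, JavaScript -1.77, C# -2.04), while the two-turn variants are mixed.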

Why Blind Review Fails

The three-turn pattern is intuitive: generate code, review it, improve it, review again, polish. It’s how humans work. But LLMs don’t have the metacognitive ability to recognize when code is already correct. They can’t say “this is fine, don’t change it.” Every turn becomes an opportunity to break something that worked.

The C# data shows the pattern on the language that’s already hardest for Claude: three-turn-bare drops to 61.13, down from 63.17 in single-turn-bare, a 2.04-point regression. Python, the strongest language, falls even further (88.21 to 81.20), which fits the picture of blind edits breaking code that was already correct.

Even “specific feedback” doesn’t save blind review. Two-turn-specific-feedback (71.49) still regresses below single-turn-bare (71.86). Telling the model “check edge cases” doesn’t help when it can’t actually run the code to see if the edge cases pass.

Closed-Loop Changes Everything

V2 results — test feedback between turns (10,800 runs):

| Variant | Score | Delta vs single-turn-bare |
|---|---|---|
| single-turn-kitchen-sink | 76.77 | +4.47 |
| two-turn-test-feedback-kitchen-sink | 75.24 | +2.94 |
| two-turn-test-feedback | 73.92 | +1.62 |
| two-turn-blind-review | 72.90 | +0.60 |
| single-turn-bare | 72.30 | (baseline) |

Closed-loop test feedback (+1.62) beats blind review (+0.60). Real signal, in the form of pass/fail results from actual test execution, gives the model something to act on. The loop has information. Blind review has none.

But the bigger story is where closed-loop shines:

Low-Baseline Combos: Where Closed-Loop Dominates

The overall averages hide the most important finding. When we break results by model-language pair, closed-loop iteration shows massive gains for combinations where Claude struggles:

| Model + Language | Test-Feedback Lift |
|---|---|
| Haiku + C# | +19.04 |
| Haiku + Go | +14.71 |
| Haiku + JavaScript | +8.33 |
| Opus + C# | +4.12 |
| Sonnet + Python | +0.21 |

Haiku + C# gains 19.04 points from test feedback. When the model’s baseline is low, giving it real error signal between turns is transformative. The loop provides scaffolding the model needs but hasn’t internalized.

This maps directly to the low-baseline rule we’ve seen across experiments: techniques that help when Claude is struggling (weak model + hard language) often don’t help — or actively hurt — when it’s already performing well (Sonnet + Python).

The Efficiency Tradeoff

Front-loading still wins on pure efficiency:

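| Variant | Score | Avg Tokens | Quality/Token |
|---|---|---|---|
| single-turn-kitchen-sink | 79.79 | 2,391 | 0.0334 |
| two-turn-test-feedback | 73.92 | 3,800 | 0.0195 |
| two-turn-bare | 71.75 | 3,187 | 0.0225 |
| three-turn-bare | 68.22 | 5,433 | 0.0126 |

The two-turn-test-feedback row uses the v2 score; the other rows use v1 scores, so comparisons against that row are approximate.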

Single-turn-kitchen-sink: 79.79 at 2,391 tokens. Three-turn-bare: 68.22 at 5,433 tokens. That’s 2.3x the cost for 11.57 fewer quality points. If you’re paying per token, front-loading is the obvious choice for high-baseline combinations.
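The Quality/Token column and the 2.3x figure can be reproduced from the score and token columns. A short sketch, with values copied from the efficiency table and a helper name of my own:

```python
# (score, average tokens) copied from the efficiency table.
ROWS = {
    "single-turn-kitchen-sink": (79.79, 2391),
    "two-turn-test-feedback": (73.92, 3800),
    "two-turn-bare": (71.75, 3187),
    "three-turn-bare": (68.22, 5433),
}


def quality_per_token(score: float, tokens: int) -> float:
    """Score points earned per token spent."""
    return score / tokens


for name, (score, tokens) in ROWS.items():
    print(f"{name}: {quality_per_token(score, tokens):.4f}")

best_score, best_tokens = ROWS["single-turn-kitchen-sink"]
worst_score, worst_tokens = ROWS["three-turn-bare"]
cost_ratio = worst_tokens / best_tokens  # roughly 2.3x the tokens
score_gap = best_score - worst_score     # roughly 11.57 points
print(f"{cost_ratio:.1f}x tokens for {score_gap:.2f} fewer points")
```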

But for low-baseline combos (Haiku+C#, Haiku+Go), the closed-loop approach is worth the extra tokens. A +19 point lift justifies the cost.

The Decision Framework

Use closed-loop iteration when:

  • Model baseline is low (Haiku on Go/C#/JS)
  • You have real test feedback to inject
  • The task has clear pass/fail criteria
  • Quality matters more than token cost

Front-load instead when:

  • Model baseline is high (Sonnet/Opus on Python)
  • You’d be iterating blindly without test signal
  • Token efficiency matters
  • You can invest in a comprehensive first prompt
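The two lists above can be folded into a small helper. This is a sketch of one reading of the framework, not a rule derived from the data: the function name, the boolean inputs, and the tie-breaking order are all assumptions.

```python
def choose_strategy(
    baseline_is_low: bool,
    has_test_signal: bool,
    quality_over_tokens: bool = True,
) -> str:
    """Return "closed-loop" or "front-load" for one model-language pair."""
    if not has_test_signal:
        # No test results, compiler errors, or linter output: do not iterate.
        return "front-load"
    if baseline_is_low and quality_over_tokens:
        # Weak model-language pair with real feedback: the loop earns its tokens.
        return "closed-loop"
    # High baseline, or a tight token budget: invest in the first prompt.
    return "front-load"


print(choose_strategy(baseline_is_low=True, has_test_signal=True))   # closed-loop
print(choose_strategy(baseline_is_low=False, has_test_signal=True))  # front-load
```

Checking for test signal first mirrors the strongest result here: without signal, extra turns did not help whatever the baseline was.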

Practical Advice

1. Never iterate without signal

In v1, no blind “review and improve” variant beat its single-turn baseline overall, and three rounds cost 3.64 points. If you don’t have test results, compiler errors, or linter output to feed back, don’t ask the model to try again.

Evidence: 12,960 v1 runs; the three-turn regression appears in all 4 languages

2. Build a closed loop: tests → code → run tests → fix → pass

When you do iterate, inject real signal between turns. Test-feedback iteration (+1.62 overall, up to +19.04 for Haiku+C#) beats blind review (+0.60) because the model has actual errors to fix rather than imagining problems.

Evidence: 10,800 v2 runs with test-feedback injection
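A minimal sketch of such a loop, assuming two hypothetical hooks: `generate` stands in for a model call and `run_tests` for a real test runner. Neither is part of claude-benchmark, and the prompt wording is illustrative.

```python
from typing import Callable, List, Tuple


def closed_loop(
    task: str,
    generate: Callable[[str], str],
    run_tests: Callable[[str], List[str]],
    max_turns: int = 2,
) -> Tuple[str, List[str]]:
    """Generate code, run tests, and feed failures back.

    `run_tests` returns a list of failure messages, empty when all pass.
    Returns the last code produced and any failures still outstanding.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    prompt = task
    code, failures = "", []
    for _ in range(max_turns):
        code = generate(prompt)
        failures = run_tests(code)
        if not failures:
            break  # tests pass: stop here rather than "improving" further
        # The next turn sees the previous attempt plus the real failures.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{code}\n\n"
            "These tests failed:\n" + "\n".join(f"- {f}" for f in failures)
        )
    return code, failures


# Stub demo: the fake model only gets it right once it sees a failure report.
def fake_generate(prompt: str) -> str:
    return "v2" if "These tests failed" in prompt else "v1"


def fake_run_tests(code: str) -> List[str]:
    return [] if code == "v2" else ["test_divide_by_zero: ZeroDivisionError"]


final_code, remaining = closed_loop("Write divide(a, b).", fake_generate, fake_run_tests)
print(final_code, remaining)  # v2 []
```

The early `break` is the important design choice: once the tests pass, the loop stops instead of inviting another round of changes.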

3. Front-load when baselines are high

Single-turn-kitchen-sink (79.79, 2,391 tokens) beats all multi-turn variants on efficiency. For strong model-language pairs (Sonnet+Python, Opus+Python), pack comprehensive instructions up front and accept the first output.

Evidence: +8.04 quality points vs two-turn-bare, 796 fewer tokens

4. Reserve closed-loop for low-baseline combos

Haiku+C# (+19.04), Haiku+Go (+14.71): these are the combinations where closed-loop iteration is worth the extra tokens. The model needs the test scaffold because it hasn’t internalized the language patterns.

Evidence: Per-model-language breakdowns from 10,800 v2 runs

Limitations

  • Simplified test feedback. Our v2 experiment injected pass/fail results. Real Claude Code tool-use workflows have richer feedback (compiler errors, linter output, stack traces). The real-world gains from closed-loop may be larger than +19pts for hard cases.
  • Two turns only. We didn’t test 3+ turn closed loops. Real agentic workflows may iterate more times with diminishing returns per turn.
  • Claude models only. Other model families may handle iteration differently, though the pattern of “signal helps, blind iteration hurts” likely generalizes.
  • Single-file tasks. Multi-turn may add more value for architectural work, planning, or multi-file coordination where test feedback spans integration boundaries.

Try It Yourself

The multi-turn conversation experiment configuration and all data are open source:

pip install claude-benchmark

# Run the multi-turn experiment
claude-benchmark experiment run experiments/multi-turn-conversation.toml

# Generate the report
claude-benchmark report

# See the results (open is the macOS command; on Linux, xdg-open is the usual equivalent)
open results/*/report.html

You can also define your own turn patterns in the experiment TOML and test them against your specific task types. If you find a closed-loop strategy that beats front-loading for high-baseline combinations, I’d like to see the data.

Full experiment configuration: experiments/multi-turn-conversation.toml

For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.