May 19, 2026

Closed-Loop Iteration Beats Blind Review (23,760 Benchmarks)

Johnathen Chilcher, Senior SRE, TechLoom

Developers iterate. They generate code, run tests, fix failures, and run again. The question isn’t whether to iterate; it’s whether the loop has signal. Blind “review and improve” prompts lower quality: three rounds cost 3.64 points overall. But a closed loop with real test feedback lifts the hardest model-language combinations by up to 19 points.

I ran 23,760 benchmarks testing eleven iteration patterns across four languages. The results: blind iteration makes things worse, but closed-loop iteration — tests → code → run tests → fix → pass — works. The catch: front-loading comprehensive instructions in a single turn still wins on efficiency. Closed-loop shines when you’re working with a model-language pair that struggles (Haiku+C#, Haiku+Go).

This is the eighteenth experiment in our benchmark series, following the context pollution, GSD methodology, and verification instructions studies. All data is open source at claude-benchmark.

Two Types of Iteration

Not all multi-turn patterns are equal. The critical distinction:

Open-loop (blind): “Review and improve your code.” No test results, no compiler errors, no signal. The model guesses what to change.
Closed-loop (evidence-based): Run tests between turns. Feed failures back. The model fixes what’s actually broken.

Open-loop iteration is how most people prompt AI today. Closed-loop iteration is how Claude Code’s tool-use actually works. The gap between them is enormous.
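The difference between the two patterns can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: `generate` and `run_tests` are hypothetical callables standing in for a model call and a test runner, and the prompt wording is assumed.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not part of claude-benchmark):
#   generate(prompt) -> code string, standing in for a model call
#   run_tests(code)  -> list of failure messages; an empty list means all passed
Generate = Callable[[str], str]
RunTests = Callable[[str], List[str]]


def closed_loop(task: str, generate: Generate, run_tests: RunTests,
                max_fix_turns: int = 1) -> Tuple[str, List[str]]:
    """Generate code, run the tests, and feed real failures back."""
    code = generate(task)
    failures = run_tests(code)
    for _ in range(max_fix_turns):
        if not failures:
            break  # Tests pass: stop instead of asking for more changes.
        feedback = "These tests failed:\n" + "\n".join(failures)
        code = generate(
            f"{task}\n\nPrevious attempt:\n{code}\n\n{feedback}\n"
            "Fix only these failures."
        )
        failures = run_tests(code)
    return code, failures


def open_loop(task: str, generate: Generate, review_turns: int = 1) -> str:
    """Blind variant: each review turn asks for changes with no evidence."""
    code = generate(task)
    for _ in range(review_turns):
        code = generate(
            f"{task}\n\nPrevious attempt:\n{code}\n\n"
            "Review and improve your code."
        )
    return code
```

The closed loop has a stopping condition (no failures) and a concrete target for each fix turn. The open loop has neither, so every extra turn is another chance to change working code.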

Experiment Design

I tested six variants in v1 (blind iteration) and five in v2 (test-feedback iteration):

V1 — Blind iteration (12,960 runs):

Variant	Description
single-turn-bare	One prompt, one response. No system instructions.
single-turn-kitchen-sink	One prompt with comprehensive instructions (the “kitchen-sink” from prior experiments).
two-turn-bare	First turn generates code, second turn says “review and improve your code.”
two-turn-kitchen-sink	Same as two-turn-bare but with kitchen-sink instructions in the system prompt.
two-turn-specific-feedback	First turn generates, second turn gives specific feedback (e.g., “check edge cases, verify error handling”).
three-turn-bare	Three rounds of “review and improve.”

V2 — Closed-loop with test feedback (10,800 runs):

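Variant	Description
single-turn-bare	Baseline, re-run for v2: one prompt, one response. No system instructions.
single-turn-kitchen-sink	Baseline, re-run for v2: one prompt with comprehensive instructions.
two-turn-test-feedback	Generate code, run tests, feed pass/fail results back, fix.
two-turn-test-feedback-kitchen-sink	Same as above with comprehensive system instructions.
two-turn-blind-review	Control: generate then “review and improve” (no test results).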

Each variant was run across:

3 models: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus
48 cross-language tasks: Python, Go, JavaScript, C#
15 repetitions per model-task-variant combination
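The run counts follow directly from this grid. A quick check, using only the figures stated above (six v1 variants, five v2 variants):

```python
models, tasks, repetitions = 3, 48, 15
runs_per_variant = models * tasks * repetitions

v1_runs = 6 * runs_per_variant  # six blind-iteration variants
v2_runs = 5 * runs_per_variant  # five test-feedback variants
total_runs = v1_runs + v2_runs

print(runs_per_variant, v1_runs, v2_runs, total_runs)
# prints: 2160 12960 10800 23760
```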

The Scoreboard: Blind Iteration Fails

V1 results — open-loop variants:

Variant	Python	Go	JavaScript	C#	Overall	Avg Tokens
single-turn-bare	88.21	68.87	67.17	63.17	71.86	1,393
single-turn-kitchen-sink	88.19	81.55	72.40	77.00	79.79	2,391
two-turn-bare	86.66	69.75	67.86	62.69	71.75	3,187
two-turn-kitchen-sink	87.49	81.39	72.33	76.47	79.44	4,969
two-turn-specific-feedback	85.57	67.09	68.53	64.75	71.49	3,240
three-turn-bare	81.20	65.30	65.40	61.13	68.22	5,433

Three-turn blind review scores 68.22, the worst of all variants and 3.64 points below single-turn-bare (71.86). The likely cause: the model feels obligated to change something each turn, which introduces regressions in already-correct code. Without test signal, iteration is just noise.
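To see how the regression breaks down by language, the per-language columns of the v1 table can be compared directly. A small sketch using only the numbers in the table:

```python
# Per-language scores copied from the v1 table above.
LANGS = ("Python", "Go", "JavaScript", "C#")
single_turn_bare = (88.21, 68.87, 67.17, 63.17)
three_turn_bare = (81.20, 65.30, 65.40, 61.13)

# Change in score after three rounds of blind review, per language.
deltas = {
    lang: round(after - before, 2)
    for lang, before, after in zip(LANGS, single_turn_bare, three_turn_bare)
}
for lang, delta in deltas.items():
    print(f"{lang}: {delta:+.2f}")

regressed_everywhere = all(d < 0 for d in deltas.values())
print("Regressed in every language:", regressed_everywhere)
```

The deltas are -7.01 (Python), -3.57 (Go), -1.77 (JavaScript), and -2.04 (C#): three-turn-bare is below single-turn-bare in all four languages.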

Why Blind Review Fails

The three-turn pattern is intuitive: generate code, review it, improve it, review again, polish. It’s how humans work. But LLMs don’t have the metacognitive ability to recognize when code is already correct. They can’t say “this is fine, don’t change it.” Every turn becomes an opportunity to break something that worked.

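The regression shows up in every language. C#, already the hardest language for Claude, drops from 63.17 in single-turn-bare to 61.13 in three-turn-bare, a 2.04-point regression. Python, the strongest language, drops furthest: 88.21 to 81.20, a 7.01-point regression, which is consistent with extra turns breaking code that was already correct.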

Even “specific feedback” doesn’t save blind review. Two-turn-specific-feedback (71.49) still regresses below single-turn-bare (71.86). Telling the model “check edge cases” doesn’t help when it can’t actually run the code to see if the edge cases pass.

Closed-Loop Changes Everything

V2 results — test feedback between turns (10,800 runs):

Variant	Score	Delta vs Bare
single-turn-kitchen-sink	76.77	+4.47
two-turn-test-feedback-kitchen-sink	75.24	+2.94
two-turn-test-feedback	73.92	+1.62
two-turn-blind-review	72.90	+0.60
single-turn-bare	72.30	—

Closed-loop test feedback (+1.62) beats blind review (+0.60). Real signal — pass/fail results from actual test execution — gives the model something to act on. The loop has information. Blind review has none.

But the bigger story is where closed-loop shines:

Low-Baseline Combos: Where Closed-Loop Dominates

The overall averages hide the most important finding. When we break results by model-language pair, closed-loop iteration shows massive gains for combinations where Claude struggles:

Model + Language	Test-Feedback Lift
Haiku + C#	+19.04
Haiku + Go	+14.71
Haiku + JavaScript	+8.33
Opus + C#	+4.12
Sonnet + Python	+0.21

Haiku + C# gains +19.04 points from test feedback. When the model’s baseline is low, giving it real error signal between turns is transformative. The loop provides scaffolding the model needs but hasn’t internalized.

This maps directly to the low-baseline rule we’ve seen across experiments: techniques that help when Claude is struggling (weak model + hard language) often don’t help — or actively hurt — when it’s already performing well (Sonnet + Python).

The Efficiency Tradeoff

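Front-loading still wins on pure efficiency. Note that the two-turn-test-feedback row uses its v2 score, while the other three rows use v1 scores: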

Variant	Score	Avg Tokens	Quality/Token
single-turn-kitchen-sink	79.79	2,391	0.0334
two-turn-test-feedback	73.92	3,800	0.0195
two-turn-bare	71.75	3,187	0.0225
three-turn-bare	68.22	5,433	0.0126

Single-turn-kitchen-sink: 79.79 at 2,391 tokens. Three-turn-bare: 68.22 at 5,433 tokens. That’s 2.3x the cost for 11.57 fewer quality points. If you’re paying per token, front-loading is the obvious choice for high-baseline combinations.
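The Quality/Token column and the 2.3x figure are simple ratios. A short sketch that recomputes them from the table's Score and Avg Tokens columns:

```python
# (score, average tokens) copied from the efficiency table above.
variants = {
    "single-turn-kitchen-sink": (79.79, 2391),
    "two-turn-test-feedback": (73.92, 3800),
    "two-turn-bare": (71.75, 3187),
    "three-turn-bare": (68.22, 5433),
}

# Quality points per token, rounded as in the table.
quality_per_token = {
    name: round(score / tokens, 4) for name, (score, tokens) in variants.items()
}

best_score, best_tokens = variants["single-turn-kitchen-sink"]
worst_score, worst_tokens = variants["three-turn-bare"]
cost_ratio = worst_tokens / best_tokens          # roughly 2.3x
score_gap = round(best_score - worst_score, 2)   # 11.57 points

for name, qpt in quality_per_token.items():
    print(f"{name}: {qpt:.4f}")
print(f"three-turn-bare costs {cost_ratio:.1f}x the tokens for {score_gap} fewer points")
```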

But for low-baseline combos (Haiku+C#, Haiku+Go), the closed-loop approach is worth the extra tokens. A +19 point lift justifies the cost.

The Decision Framework

Use closed-loop iteration when:

Model baseline is low (Haiku on Go/C#/JS)
You have real test feedback to inject
The task has clear pass/fail criteria
Quality matters more than token cost

Front-load instead when:

Model baseline is high (Sonnet/Opus on Python)
You’d be iterating blindly without test signal
Token efficiency matters
You can invest in a comprehensive first prompt
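The two lists above can be read as a small decision function. This is one possible encoding, not a rule taken from the benchmark: the function name and parameters are hypothetical, and treating the four closed-loop conditions as a strict AND is an assumption.

```python
def choose_strategy(baseline_is_low: bool,
                    has_test_signal: bool,
                    clear_pass_fail: bool = True,
                    quality_over_tokens: bool = True) -> str:
    """Return "closed-loop" or "front-load" (hypothetical helper).

    Assumption: closed-loop is chosen only when every condition in the
    "use closed-loop iteration when" list holds; otherwise front-load.
    """
    if not has_test_signal:
        # Iterating without signal is blind review, which the data argues against.
        return "front-load"
    if baseline_is_low and clear_pass_fail and quality_over_tokens:
        return "closed-loop"
    return "front-load"


# Example: Haiku on C# with a test suite available.
print(choose_strategy(baseline_is_low=True, has_test_signal=True))
# Example: Sonnet on Python, where the baseline is already high.
print(choose_strategy(baseline_is_low=False, has_test_signal=True))
```

The first call returns "closed-loop" and the second returns "front-load".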

Practical Advice

1. Never iterate without signal

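Blind “review and improve” prompts don’t reliably help, and more rounds make things worse: two-turn-bare is roughly flat against single-turn-bare (-0.11 in v1), while three rounds cost 3.64 points and lower the score in every language. If you don’t have test results, compiler errors, or linter output to feed back, don’t ask the model to try again.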

Evidence: 12,960 runs, consistent across 4 languages

2. Build a closed loop: tests → code → run tests → fix → pass

When you do iterate, inject real signal between turns. Test-feedback iteration (+1.62 overall, up to +19.04 for Haiku+C#) beats blind review (+0.60) because the model has actual errors to fix rather than imagining problems.

Evidence: 10,800 v2 runs with test-feedback injection

3. Front-load when baselines are high

Single-turn-kitchen-sink (79.79, 2,391 tokens) beats all multi-turn variants on efficiency. For strong model-language pairs (Sonnet+Python, Opus+Python), pack comprehensive instructions up front and accept the first output.

Evidence: +8.04 quality points vs two-turn-bare, 796 fewer tokens

4. Reserve closed-loop for low-baseline combos

Haiku+C# (+19.04), Haiku+Go (+14.71): these are the combinations where closed-loop iteration is worth the extra tokens. The model needs the test scaffold because it hasn’t internalized the language patterns.

Evidence: Per-model-language breakdowns from 10,800 v2 runs

Limitations

Simplified test feedback. Our v2 experiment injected only pass/fail results. Real Claude Code tool-use workflows have richer feedback (compiler errors, linter output, stack traces), so the real-world gains from closed-loop iteration may be larger than +19 points for hard cases.
Two turns only. We didn’t test 3+ turn closed loops. Real agentic workflows may iterate more times with diminishing returns per turn.
Claude models only. Other model families may handle iteration differently, though the pattern of “signal helps, blind iteration hurts” likely generalizes.
Single-file tasks. Multi-turn may add more value for architectural work, planning, or multi-file coordination where test feedback spans integration boundaries.
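On the first limitation: richer feedback is cheap to collect. The sketch below is an illustration only, not how the v2 experiment or Claude Code gathers feedback. It runs an arbitrary test command with Python's standard `subprocess` module and keeps the tail of its output, where stack traces and compiler errors usually appear. The function name, the 40-line cutoff, and the message wording are all assumptions.

```python
import subprocess
import sys
from typing import List, Tuple


def collect_feedback(cmd: List[str], timeout: int = 60,
                     max_lines: int = 40) -> Tuple[bool, str]:
    """Run a test command and return (passed, feedback text for the next turn)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    passed = proc.returncode == 0
    if passed:
        return True, "All tests passed."
    # Keep only the last lines: tracebacks and error summaries tend to end there.
    output = (proc.stdout + proc.stderr).strip()
    tail = "\n".join(output.splitlines()[-max_lines:])
    return False, f"Tests failed (exit code {proc.returncode}):\n{tail}"


# Example: a failing one-line "test" run with the current Python interpreter.
ok, feedback = collect_feedback(
    [sys.executable, "-c", "assert 1 + 1 == 3, 'math is broken'"]
)
print(ok)
print(feedback)
```

In a real loop, `cmd` would be the project's own test command, and `feedback` would be appended to the next turn's prompt.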

Try It Yourself

The multi-turn conversation experiment configuration and all data are open source:

pip install claude-benchmark

# Run the multi-turn experiment
claude-benchmark experiment run experiments/multi-turn-conversation.toml

# Generate the report
claude-benchmark report

# See the results
open results/*/report.html

You can also define your own turn patterns in the experiment TOML and test them against your specific task types. If you find a closed-loop strategy that beats front-loading for high-baseline combinations, I’d like to see the data.

Full experiment configuration: experiments/multi-turn-conversation.toml

For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.