Closed-Loop Iteration Beats Blind Review (23,760 Benchmarks)
Developers iterate. They generate code, run tests, fix failures, run again. The question isn’t whether to iterate; it’s whether the loop has signal. Blind “review and improve” prompts lower quality, by up to 3.64 points after three turns. But a closed loop with real test feedback lifts the hardest model-language combinations by up to 19 points.
I ran 23,760 benchmarks testing eleven iteration patterns across four languages. The results: blind iteration makes things worse, but closed-loop iteration — tests → code → run tests → fix → pass — works. The catch: front-loading comprehensive instructions in a single turn still wins on efficiency. Closed-loop shines when you’re working with a model-language pair that struggles (Haiku+C#, Haiku+Go).
This is the ninth experiment in our benchmark series, following the context pollution, GSD methodology, and capstone best-practices studies. All data is open source at claude-benchmark.
Two Types of Iteration
Not all multi-turn patterns are equal. The critical distinction:
- Open-loop (blind): “Review and improve your code.” No test results, no compiler errors, no signal. The model guesses what to change.
- Closed-loop (evidence-based): Run tests between turns. Feed failures back. The model fixes what’s actually broken.
Open-loop iteration is how most people prompt AI today. Closed-loop iteration is how Claude Code’s tool-use actually works. The gap between them is enormous.
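The difference between the two patterns can be sketched in a few lines. This is a minimal illustration, not the benchmark harness: `generate` and `run_tests` are hypothetical callables standing in for a model call and a test runner, the prompt wording is invented, and stopping early when tests pass is my assumption rather than something the experiment is documented to do.

```python
from typing import Callable, Tuple

# Hypothetical stand-ins: `Generate` wraps a model call, `RunTests` wraps a test runner.
Generate = Callable[[str], str]
RunTests = Callable[[str], Tuple[bool, str]]


def open_loop(generate: Generate, task: str, turns: int = 2) -> str:
    """Blind iteration: every follow-up turn repeats the same generic request."""
    code = generate(task)
    for _ in range(turns - 1):
        code = generate(f"{task}\n\nCurrent code:\n{code}\n\nReview and improve your code.")
    return code


def closed_loop(generate: Generate, run_tests: RunTests, task: str, turns: int = 2) -> str:
    """Evidence-based iteration: run tests between turns and feed failures back."""
    code = generate(task)
    for _ in range(turns - 1):
        passed, report = run_tests(code)
        if passed:
            break  # assumption: nothing is broken, so do not ask for changes
        code = generate(
            f"{task}\n\nCurrent code:\n{code}\n\nTest results:\n{report}\n\nFix the failures."
        )
    return code
```

The only structural difference is what goes into the follow-up prompt: the open loop sends a fixed instruction, while the closed loop sends whatever the test run reported.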
Experiment Design
I tested six variants in v1 (blind iteration) and five in v2 (test-feedback iteration):
V1 — Blind iteration (12,960 runs):
| Variant | Description |
|---|---|
| single-turn-bare | One prompt, one response. No system instructions. |
| single-turn-kitchen-sink | One prompt with comprehensive instructions (the “kitchen-sink” from prior experiments). |
| two-turn-bare | First turn generates code, second turn says “review and improve your code.” |
| two-turn-kitchen-sink | Same as two-turn-bare but with kitchen-sink instructions in the system prompt. |
| two-turn-specific-feedback | First turn generates, second turn gives specific feedback (e.g., “check edge cases, verify error handling”). |
| three-turn-bare | Three rounds of “review and improve.” |
V2 — Closed-loop with test feedback (10,800 runs):
| Variant | Description |
|---|---|
| two-turn-test-feedback | Generate code, run tests, feed pass/fail results back, fix. |
| two-turn-test-feedback-kitchen-sink | Same as above with comprehensive system instructions. |
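| two-turn-blind-review | Control: generate then “review and improve” (no test results). |
| single-turn-bare | Baseline, as in v1: one prompt, one response. |
| single-turn-kitchen-sink | Baseline, as in v1: one prompt with comprehensive instructions. |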
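To make "feed pass/fail results back" concrete, here is one way the second-turn message could be assembled. The `CaseResult` type, the `format_feedback` helper, and the message wording are all hypothetical; the actual prompt used in the experiment is not shown in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical structure)."""
    name: str
    passed: bool


def format_feedback(results: List[CaseResult]) -> str:
    """Render pass/fail results as a second-turn message (illustrative wording)."""
    failed = [r.name for r in results if not r.passed]
    lines = [f"{len(results) - len(failed)}/{len(results)} tests passed."]
    if failed:
        lines.append("Failing tests:")
        lines.extend(f"- {name}" for name in failed)
        lines.append("Fix the code so these tests pass. Leave passing behavior unchanged.")
    else:
        lines.append("All tests pass. No changes are needed.")
    return "\n".join(lines)
```

For example, `format_feedback([CaseResult("test_empty", True), CaseResult("test_overflow", False)])` produces a message that starts with `1/2 tests passed.` and names only `test_overflow`.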
Each variant was run across:
- 3 models: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus
- 48 cross-language tasks: Python, Go, JavaScript, C#
- 15 repetitions per model-task-variant combination
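The run counts follow directly from this design. A quick check, using only the numbers stated above (six v1 variants, five v2 variants):

```python
MODELS = 3
TASKS = 48
REPETITIONS = 15


def total_runs(variants: int) -> int:
    """Runs for one experiment: every variant on every model, task, and repetition."""
    return variants * MODELS * TASKS * REPETITIONS


v1_runs = total_runs(6)
v2_runs = total_runs(5)
print(v1_runs, v2_runs, v1_runs + v2_runs)  # 12960 10800 23760
```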
The Scoreboard: Blind Iteration Fails
V1 results — open-loop variants:
| Variant | Python | Go | JavaScript | C# | Overall | Avg Tokens |
|---|---|---|---|---|---|---|
| single-turn-bare | 88.21 | 68.87 | 67.17 | 63.17 | 71.86 | 1,393 |
| single-turn-kitchen-sink | 88.19 | 81.55 | 72.40 | 77.00 | 79.79 | 2,391 |
| two-turn-bare | 86.66 | 69.75 | 67.86 | 62.69 | 71.75 | 3,187 |
| two-turn-kitchen-sink | 87.49 | 81.39 | 72.33 | 76.47 | 79.44 | 4,969 |
| two-turn-specific-feedback | 85.57 | 67.09 | 68.53 | 64.75 | 71.49 | 3,240 |
| three-turn-bare | 81.20 | 65.30 | 65.40 | 61.13 | 68.22 | 5,433 |
Why Blind Review Fails
The three-turn pattern is intuitive: generate code, review it, improve it, review again, polish. It’s how humans work. But LLMs don’t have the metacognitive ability to recognize when code is already correct. They can’t say “this is fine, don’t change it.” Every turn becomes an opportunity to break something that worked.
The regression shows up in every language. In C#, already the hardest language for Claude, three-turn-bare drops to 61.13 from 63.17 in single-turn-bare, a 2.04-point regression. Python, the strongest language, falls furthest: 88.21 to 81.20, a 7.01-point drop. More turns mean more chances to break code that already worked.
Even “specific feedback” doesn’t save blind review. Two-turn-specific-feedback (71.49) still regresses below single-turn-bare (71.86). Telling the model “check edge cases” doesn’t help when it can’t actually run the code to see if the edge cases pass.
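The regressions are easier to see as deltas against the matching baseline. The sketch below uses the Overall column from the v1 scoreboard; `delta_vs` is just a helper name chosen for this example.

```python
# Overall scores copied from the v1 scoreboard.
V1_OVERALL = {
    "single-turn-bare": 71.86,
    "single-turn-kitchen-sink": 79.79,
    "two-turn-bare": 71.75,
    "two-turn-kitchen-sink": 79.44,
    "two-turn-specific-feedback": 71.49,
    "three-turn-bare": 68.22,
}


def delta_vs(baseline: str, scores: dict) -> dict:
    """Score difference of every other variant relative to `baseline`."""
    base = scores[baseline]
    return {
        name: round(score - base, 2)
        for name, score in scores.items()
        if name != baseline
    }


bare_deltas = delta_vs("single-turn-bare", V1_OVERALL)
for name in ("two-turn-bare", "two-turn-specific-feedback", "three-turn-bare"):
    print(f"{name:28s} {bare_deltas[name]:+.2f}")
# two-turn-bare                -0.11
# two-turn-specific-feedback   -0.37
# three-turn-bare              -3.64

# The kitchen-sink pair shows the same direction against its own baseline.
print(delta_vs("single-turn-kitchen-sink", V1_OVERALL)["two-turn-kitchen-sink"])  # -0.35
```

Every blind follow-up variant lands below its single-turn counterpart.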
Closed-Loop Changes Everything
V2 results — test feedback between turns (10,800 runs):
| Variant | Score | Delta vs Bare |
|---|---|---|
| single-turn-kitchen-sink | 76.77 | +4.47 |
| two-turn-test-feedback-kitchen-sink | 75.24 | +2.94 |
| two-turn-test-feedback | 73.92 | +1.62 |
| two-turn-blind-review | 72.90 | +0.60 |
| single-turn-bare | 72.30 | — |
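Test feedback beats blind review by 1.02 points overall, though single-turn-kitchen-sink still tops the table. But the bigger story is where closed-loop shines: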
Low-Baseline Combos: Where Closed-Loop Dominates
The overall averages hide the most important finding. When we break results by model-language pair, closed-loop iteration shows massive gains for combinations where Claude struggles:
| Model + Language | Test-Feedback Lift |
|---|---|
| Haiku + C# | +19.04 |
| Haiku + Go | +14.71 |
| Haiku + JavaScript | +8.33 |
| Opus + C# | +4.12 |
| Sonnet + Python | +0.21 |
This maps directly to the low-baseline rule we’ve seen across experiments: techniques that help when Claude is struggling (weak model + hard language) often don’t help — or actively hurt — when it’s already performing well (Sonnet + Python).
The Efficiency Tradeoff
Front-loading still wins on pure efficiency (the two-turn-test-feedback row uses its v2 score; the other rows come from v1):
| Variant | Score | Avg Tokens | Quality/Token |
|---|---|---|---|
| single-turn-kitchen-sink | 79.79 | 2,391 | 0.0334 |
| two-turn-test-feedback | 73.92 | 3,800 | 0.0195 |
| two-turn-bare | 71.75 | 3,187 | 0.0225 |
| three-turn-bare | 68.22 | 5,433 | 0.0126 |
Single-turn-kitchen-sink: 79.79 at 2,391 tokens. Three-turn-bare: 68.22 at 5,433 tokens. That’s 2.3x the cost for 11.57 fewer quality points. If you’re paying per token, front-loading is the obvious choice for high-baseline combinations.
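The Quality/Token column and the cost comparison can be reproduced from the table. The rows below are copied from it; `quality_per_token` is simply score divided by average tokens.

```python
# (variant, score, average tokens) copied from the efficiency table.
ROWS = [
    ("single-turn-kitchen-sink", 79.79, 2391),
    ("two-turn-test-feedback", 73.92, 3800),
    ("two-turn-bare", 71.75, 3187),
    ("three-turn-bare", 68.22, 5433),
]


def quality_per_token(score: float, tokens: int) -> float:
    """Score points obtained per token spent, rounded as in the table."""
    return round(score / tokens, 4)


for name, score, tokens in ROWS:
    print(f"{name:26s} {quality_per_token(score, tokens):.4f}")

# Three-turn-bare versus single-turn-kitchen-sink.
cost_ratio = round(5433 / 2391, 1)      # 2.3
score_gap = round(79.79 - 68.22, 2)     # 11.57
print(cost_ratio, score_gap)
```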
But for low-baseline combos (Haiku+C#, Haiku+Go), the closed-loop approach is worth the extra tokens. A +19 point lift justifies the cost.
The Decision Framework
Use closed-loop iteration when:
- Model baseline is low (Haiku on Go/C#/JS)
- You have real test feedback to inject
- The task has clear pass/fail criteria
- Quality matters more than token cost
Front-load instead when:
- Model baseline is high (Sonnet/Opus on Python)
- You’d be iterating blindly without test signal
- Token efficiency matters
- You can invest in a comprehensive first prompt
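The two checklists above can be collapsed into a small rule. This is one possible reading, not a tuned policy: the function name and boolean inputs are hypothetical, "clear pass/fail criteria" is folded into `has_test_signal`, and what counts as a low baseline is left to the caller.

```python
def choose_strategy(low_baseline: bool, has_test_signal: bool, token_sensitive: bool) -> str:
    """Pick between closed-loop iteration and a front-loaded single turn."""
    if not has_test_signal:
        # Without test results, a second turn is just blind review.
        return "front-load"
    if low_baseline and not token_sensitive:
        # Weak model-language pair, real feedback, and quality outranks cost.
        return "closed-loop"
    return "front-load"


print(choose_strategy(low_baseline=True, has_test_signal=True, token_sensitive=False))
# closed-loop
print(choose_strategy(low_baseline=False, has_test_signal=True, token_sensitive=False))
# front-load
```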
Practical Advice
1. Never iterate without signal
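Blind “review and improve” prompts never paid for themselves. In v1 they cost 0.11 points with two-turn-bare and 3.64 with three-turn-bare, and the v2 blind-review control gained only 0.60. If you don’t have test results, compiler errors, or linter output to feed back, don’t ask the model to try again.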
2. Build a closed loop: tests → code → run tests → fix → pass
When you do iterate, inject real signal between turns. Test-feedback iteration (+1.62 overall, up to +19.04 for Haiku+C#) beats blind review (+0.60) because the model has actual errors to fix rather than imagining problems.
3. Front-load when baselines are high
Single-turn-kitchen-sink (79.79, 2,391 tokens) beats all multi-turn variants on efficiency. For strong model-language pairs (Sonnet+Python, Opus+Python), pack comprehensive instructions up front and accept the first output.
4. Reserve closed-loop for low-baseline combos
Haiku+C# (+19.04), Haiku+Go (+14.71): these are the combinations where closed-loop iteration is worth the extra tokens. The model needs the test scaffold because it hasn’t internalized the language patterns.
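For the "inject real signal between turns" step, the signal has to come from actually running something. Below is a minimal sketch of capturing a test command's result with the standard library. `run_and_capture` is a hypothetical helper, the timeout and truncation limits are arbitrary, and it assumes the test command signals failure with a non-zero exit code.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_capture(cmd: List[str], timeout: float = 60.0, max_chars: int = 4000) -> Tuple[bool, str]:
    """Run a test command and return (passed, combined output) for the next turn."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, f"Command timed out after {timeout} seconds."
    output = (proc.stdout + proc.stderr).strip()
    # Keep the tail of long output: failure summaries usually come last.
    return proc.returncode == 0, output[-max_chars:]


# Example call, assuming pytest is installed in the current environment:
# passed, report = run_and_capture([sys.executable, "-m", "pytest", "-q"])
```

The returned `report` string is what would be pasted into the follow-up prompt in place of a generic "review and improve" request.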
Limitations
- Simplified test feedback. Our v2 experiment injected pass/fail results. Real Claude Code tool-use workflows have richer feedback (compiler errors, linter output, stack traces), so real-world gains from closed-loop may be larger than the +19 points measured here for hard cases.
- Two turns only. We didn’t test 3+ turn closed loops. Real agentic workflows may iterate more times with diminishing returns per turn.
- Claude models only. Other model families may handle iteration differently, though the pattern of “signal helps, blind iteration hurts” likely generalizes.
- Single-file tasks. Multi-turn may add more value for architectural work, planning, or multi-file coordination where test feedback spans integration boundaries.
Try It Yourself
The multi-turn conversation experiment configuration and all data are open source:
```bash
pip install claude-benchmark
# Run the multi-turn experiment
claude-benchmark experiment run experiments/multi-turn-conversation.toml
# Generate the report
claude-benchmark report
# See the results
open results/*/report.html
```
You can also define your own turn patterns in the experiment TOML and test them against your specific task types. If you find a closed-loop strategy that beats front-loading for high-baseline combinations, I’d like to see the data.
Full experiment configuration: experiments/multi-turn-conversation.toml
For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.