May 5, 2026

Does Multi-Agent Orchestration Prompting Help? It Depends on What You Add (1,800 Benchmarks)

Johnathen Chilcher Senior SRE, TechLoom

Multi-agent workflows are everywhere. Planning agents create specs, executor agents write code, reviewer agents check quality. The GSD (Get Stuff Done) methodology takes this further—structured orchestration with phase context, deviation rules, and verification criteria baked into executor prompts. The question: does all that orchestration overhead actually improve the code?

I ran 1,800 benchmarks testing five GSD executor prompt variants. The results: NO statistically significant differences, but a clear directional pattern. Plan context hurts (-1.4pts). Minimal executor framing helps (+0.3pts). Full orchestration is neutral. Even RELEVANT context hurts when it adds tokens.

This is the eighth experiment in our benchmark series, following the chain-of-thought, context pollution, model selection, politeness sweep, persona sweep, step-back, and anchoring studies. All data is open source at claude-benchmark.

Experiment Design

I tested five GSD-inspired executor variants:

| Variant | Token Count | Description |
| --- | --- | --- |
| gsd-executor-minimal | ~100 | “Focus on the task. Make atomic changes. No scope creep.” |
| gsd-executor-full | ~600 | Full execution protocol: deviation rules, verification criteria, error handling guidance. |
| gsd-deviation-rules (control) | ~200 | Just the deviation rules: “Auto-fix imports. Don’t add tests. Stay in scope.” |
| gsd-goal-backward | ~100 | “The desired outcome is X. Work backward from that goal.” |
| gsd-plan-context | ~1200 | Full phase context from planning agent: requirements, constraints, architecture decisions. |

Each variant was run across:

  • 3 models: Haiku, Sonnet, Opus
  • 12 tasks: 4 code-gen, 3 bug-fix, 3 refactor, 2 instruction-following
  • 10 repetitions per model-task-variant combination

That’s 5 variants × 3 models × 12 tasks × 10 reps = 1,800 runs.
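The run count is just the full cross product of the design. A minimal sketch that enumerates it, using the variant names from the table above; the task IDs are hypothetical placeholders (only the per-category counts and `refactor-03` appear in this post), and the model labels are shorthand rather than real model identifiers:

```python
from itertools import product

variants = [
    "gsd-executor-minimal",
    "gsd-executor-full",
    "gsd-deviation-rules",  # control
    "gsd-goal-backward",
    "gsd-plan-context",
]
models = ["haiku", "sonnet", "opus"]  # shorthand labels, not API model IDs

# Task IDs are illustrative; only the counts per category come from the post.
task_counts = {"code-gen": 4, "bug-fix": 3, "refactor": 3, "instruction": 2}
tasks = [
    f"{kind}-{i:02d}"
    for kind, count in task_counts.items()
    for i in range(1, count + 1)
]

reps = range(10)

# One tuple per benchmark run: (variant, model, task, repetition).
runs = list(product(variants, models, tasks, reps))
print(len(tasks), len(runs))
```

Each variant therefore gets 3 × 12 × 10 = 360 runs, which is the sample size behind every per-variant number below.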

The Scoreboard

Figure: Executive summary of overall performance across all tested variants.

Here are the overall results:

| Variant | Score | Delta | Tokens | Quality / Token |
| --- | --- | --- | --- | --- |
| gsd-executor-minimal | 93.3 | +0.3 | 100 | 0.933 |
| gsd-executor-full | 93.1 | +0.1 | 600 | 0.155 |
| gsd-deviation-rules (control) | 93.0 | baseline | 200 | 0.465 |
| gsd-goal-backward | 92.8 | -0.2 | 100 | 0.928 |
| gsd-plan-context | 91.7 | -1.4 | 1200 | 0.076 |

NO statistically significant differences (p > 0.05 for all comparisons). But the directional pattern is clear: executor-minimal is best (+0.3pts, highest quality/token). Plan-context is worst (-1.4pts, p=0.24). The full 600-token executor protocol doesn’t help enough to justify the token cost.

The score spread is only 1.6 points from best to worst, but the token efficiency story is stark. Executor-minimal delivers 93.3 points for 100 prompt tokens. Plan-context delivers 91.7 points for 1200 prompt tokens. That’s roughly a 12x difference in points per prompt token, and the cheaper prompt is also the one that scores 1.6 points higher.
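The Quality / Token column and the 12x figure can be recomputed from the scoreboard. A small sketch, assuming only the rounded scores and approximate token counts shown in the table (the dictionary layout is mine, not the benchmark tool's output format):

```python
# (overall score, approximate prompt tokens) per variant, from the scoreboard.
results = {
    "gsd-executor-minimal": (93.3, 100),
    "gsd-executor-full": (93.1, 600),
    "gsd-deviation-rules": (93.0, 200),  # control
    "gsd-goal-backward": (92.8, 100),
    "gsd-plan-context": (91.7, 1200),
}

control_score = results["gsd-deviation-rules"][0]

# Points of score per prompt token, and delta against the control variant.
quality_per_token = {name: score / tokens for name, (score, tokens) in results.items()}
delta_vs_control = {name: round(score - control_score, 1) for name, (score, _) in results.items()}

for name in results:
    print(f"{name:<22} {quality_per_token[name]:.3f} {delta_vs_control[name]:+.1f}")

ratio = quality_per_token["gsd-executor-minimal"] / quality_per_token["gsd-plan-context"]
print(f"efficiency ratio: {ratio:.1f}x")
```

One caveat: from the rounded scores, the plan-context delta comes out as -1.3, not the reported -1.4. The reported figure presumably comes from unrounded means, which this sketch does not have.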

The Plan Context Problem

The most surprising finding: structured planning context from a multi-agent workflow actively hurts executor quality.

GSD-plan-context scores 91.7, the worst of any variant. This is 1200 tokens of relevant, structured information from the planning agent—requirements, constraints, architecture decisions. It’s not noise. It’s not irrelevant. And it still hurts (-1.4pts from control, p=0.24).

This is consistent with the r=-0.95 correlation between prompt tokens and score that we’ve seen across the series: more prompt tokens, worse code. Even when those tokens contain useful information, the overhead of processing them appears to degrade output quality.

This is the multi-agent orchestration dilemma. You create a planning agent to think through the problem, generate structured context, and hand it to the executor. But the executor performs WORSE with that context than without it.

Task-Type Breakdown

Figure: Heatmap with a detailed performance breakdown across dimensions.

Where does plan-context hurt the most?

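| Task Type | Minimal | Full | Deviation (control) | Goal-Backward | Plan-Context |
| --- | --- | --- | --- | --- | --- |
| Code-Gen | 95.1 | 94.8 | 94.7 | 94.5 | 93.8 |
| Bug-Fix | 92.4 | 92.1 | 92.0 | 91.8 | 91.2 |
| Refactoring | 88.2 | 87.9 | 87.5 | 87.3 | 85.1 |
| Instruction | 93.7 | 93.4 | 93.2 | 93.0 | 92.1 |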

Refactoring takes the biggest hit from plan-context: 87.5 (control) → 85.1 across the category, a 2.4-point drop, against roughly 1 point for every other task type. The refactor-03 task is the extreme case: 87.5 (control) → 81.7 (plan-context), a 5.8-point drop. This is the worst single-task regression in the dataset.

Why? My best explanation: refactoring is already the hardest task type, and 1200 tokens of planning context hand the model a second set of instructions to reconcile with the first. It ends up torn between “preserve behavior” and “follow the plan,” and code quality suffers. The benchmark measures the drop, not the mechanism, so this remains a hypothesis.
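The per-task-type damage is easier to compare as deltas against the control. A short sketch using the control and plan-context columns of the table above (variable names are mine):

```python
# Control (deviation-rules) and plan-context scores per task type, from the table.
control = {"Code-Gen": 94.7, "Bug-Fix": 92.0, "Refactoring": 87.5, "Instruction": 93.2}
plan_context = {"Code-Gen": 93.8, "Bug-Fix": 91.2, "Refactoring": 85.1, "Instruction": 92.1}

# Negative values mean plan-context scored below the control.
drops = {task: round(plan_context[task] - control[task], 1) for task in control}

# Task type with the most negative delta.
worst = min(drops, key=drops.get)

for task, delta in drops.items():
    print(f"{task:<12} {delta:+.1f}")
print("largest drop:", worst)
```

Refactoring's drop is more than double the next-largest one (Instruction), which is why it stands out even though no comparison reaches significance.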

Model-Specific Results

Figure: Radar charts comparing performance across dimensions.

Do different models handle orchestration context differently?

| Model | Minimal | Full | Deviation | Goal-Backward | Plan-Context |
| --- | --- | --- | --- | --- | --- |
| Haiku | 91.2 | 90.9 | 90.7 | 90.5 | 89.3 |
| Sonnet | 93.8 | 93.6 | 93.5 | 93.3 | 92.4 |
| Opus | 94.9 | 94.8 | 94.8 | 94.6 | 93.4 |

All three models show the same pattern: executor-minimal is best, plan-context is worst. The absolute scores scale with model capability (Haiku < Sonnet < Opus), but the relative deltas are consistent.

This is different from the persona experiment, where Opus swung wildly. Here the plan-context penalty is about the same for every model (between 1.1 and 1.4 points below control), which points to a general token-overhead problem, not a model-specific sensitivity.
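To check how consistent the pattern is, the per-model table can be turned into deltas against each model's own control score. A sketch with the table values copied in (the short variant keys are my own labels):

```python
# Per-model scores from the table; "control" is the deviation-rules variant.
scores = {
    "Haiku":  {"minimal": 91.2, "full": 90.9, "control": 90.7, "goal-backward": 90.5, "plan-context": 89.3},
    "Sonnet": {"minimal": 93.8, "full": 93.6, "control": 93.5, "goal-backward": 93.3, "plan-context": 92.4},
    "Opus":   {"minimal": 94.9, "full": 94.8, "control": 94.8, "goal-backward": 94.6, "plan-context": 93.4},
}

# Delta of every non-control variant against the same model's control score.
deltas = {
    model: {
        variant: round(score - row["control"], 1)
        for variant, score in row.items()
        if variant != "control"
    }
    for model, row in scores.items()
}

for model, row in deltas.items():
    print(model, row)
```

The ordering is the same for all three models, and the plan-context penalty stays within a 0.3-point band. The benefit of minimal framing shrinks as the model gets stronger (+0.5 for Haiku, +0.1 for Opus), though differences this small are well inside the noise.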

Variance and Consistency

Does orchestration context at least make output more predictable?

| Variant | Stdev | CV (Coefficient of Variation) |
| --- | --- | --- |
| gsd-executor-minimal | 9.2 | 9.9% |
| gsd-executor-full | 9.5 | 10.2% |
| gsd-deviation-rules | 9.4 | 10.1% |
| gsd-goal-backward | 9.7 | 10.5% |
| gsd-plan-context | 10.3 | 11.2% |

No. Plan-context has the highest variance (stdev 10.3, CV 11.2%). Adding structured context doesn’t stabilize output—it destabilizes it. The model has more information to second-guess, more constraints to juggle, and less capacity left for code generation.
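The coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. A sketch that recomputes the CV column from the reported standard deviations and the overall scores in the scoreboard:

```python
# (overall mean score, standard deviation) per variant, from the tables above.
stats = {
    "gsd-executor-minimal": (93.3, 9.2),
    "gsd-executor-full": (93.1, 9.5),
    "gsd-deviation-rules": (93.0, 9.4),
    "gsd-goal-backward": (92.8, 9.7),
    "gsd-plan-context": (91.7, 10.3),
}

# CV = stdev / mean, as a percentage rounded to one decimal place.
cv_percent = {name: round(100 * stdev / mean, 1) for name, (mean, stdev) in stats.items()}

for name, cv in cv_percent.items():
    print(f"{name:<22} {cv}%")
```

Because the means are so close together, CV tracks the standard deviation almost exactly here; it mainly matters when comparing variants whose average scores differ a lot.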

Token Efficiency: The Real Story

Here’s the key insight:

Executor-minimal wins on quality-per-token (0.933 points/token). Plan-context loses (0.076 points/token). That’s a 12x efficiency difference. Plan-context spends 1100 more prompt tokens than executor-minimal and still scores 1.6 points lower. That’s 1100 tokens you could spend on output instead of prompt overhead.

This maps directly to the context pollution experiment: every token you add to the prompt is a token the model can’t use for output. Multi-agent orchestration burns tokens on coordination overhead. Those tokens come out of the model’s effective capacity.

The full GSD executor (600 tokens) doesn’t help enough to justify the cost. Deviation rules alone (200 tokens, the control) hold steady. Executor-minimal (100 tokens) is the best performer. The pattern is consistent: less is more.

What Actually Works: Executor-Minimal

The winning variant is 100 tokens of focus framing:

  • “Focus on the task at hand.”
  • “Make atomic changes only.”
  • “No scope creep.”

That’s it. No deviation rules. No verification criteria. No phase context. Just a minimal framing that keeps the model on-task without burning tokens.
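In code, this kind of framing is nothing more than a short prefix on the task. A minimal sketch, assuming the three lines quoted above stand in for the variant; the full ~100-token prompt text lives in the experiment configuration and is not reproduced here, and `build_executor_prompt` is a hypothetical helper, not part of claude-benchmark:

```python
# Stand-in for the executor-minimal framing, built from the three lines
# quoted in the post. The real variant text is longer (about 100 tokens).
EXECUTOR_MINIMAL = "\n".join(
    [
        "Focus on the task at hand.",
        "Make atomic changes only.",
        "No scope creep.",
    ]
)


def build_executor_prompt(task: str, framing: str = EXECUTOR_MINIMAL) -> str:
    """Prepend the framing to the task description, separated by a blank line."""
    return f"{framing}\n\n{task.strip()}"


prompt = build_executor_prompt("Rename `fetch_user` to `get_user` across the module.")
print(prompt)
```

Note what is absent: no plan, no phase context, no verification checklist. The task description carries everything the executor needs to know about the work itself.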

Executor-minimal scores 93.3 (+0.3pts) with the best token efficiency. This is the opposite of structured orchestration. It’s a lightweight nudge, not a heavyweight protocol.

Practical Implications

1. Don’t pass full planning context to executors.

The plan-context variant is the worst performer (-1.4pts, highest variance). Structured context from a planning agent doesn’t help the executor—it hurts. Multi-agent workflows need lighter handoffs, not heavier ones.

2. Keep executor prompts to 200 tokens or fewer.

Executor-minimal (100 tokens) is the best performer. Deviation-rules (200 tokens) holds steady. Executor-full (600 tokens) is neutral but inefficient. The token budget for executor framing should be minimal.

3. Focus framing beats protocol framing.

Executor-minimal uses 100 tokens to frame focus: “stay on task, no scope creep.” Executor-full uses 600 tokens for protocol: deviation rules, verification criteria, error handling. The focus framing wins. Protocols don’t improve code quality.

4. Multi-agent orchestration is a token tax.

Every token spent on coordination is a token not spent on output. GSD and other multi-agent frameworks need to optimize for minimal handoff overhead, not maximal context transfer. The winning strategy: tight scope, minimal framing, no planning context.

5. Refactoring tasks are especially sensitive.

Plan-context drops refactoring scores by 2.4 points overall, 5.8 points on refactor-03. Complex tasks are the most hurt by token overhead. If you’re building multi-agent workflows, refactoring is the canary in the coal mine.
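Implication 2 above can be turned into a cheap guard in an orchestration pipeline. A rough sketch: the 4-characters-per-token estimate is a crude rule of thumb for English text, not a real tokenizer, and both function names are hypothetical. Swap in your model's actual token counter if you need accurate numbers.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def within_framing_budget(framing: str, limit: int = 200) -> bool:
    """Return True if the executor framing fits the token budget."""
    return estimate_tokens(framing) <= limit


minimal_framing = "Focus on the task at hand. Make atomic changes only. No scope creep."
# Simulated planning handoff of roughly 1200 estimated tokens.
heavy_handoff = "requirement: keep the public API stable. " * 120

print(within_framing_budget(minimal_framing))  # short framing fits
print(within_framing_budget(heavy_handoff))    # full plan context does not
```

A check like this will not tell you whether the context is useful, only whether the handoff has grown past the size where these benchmarks saw scores start to slip.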

Figure: Scatter plot of token efficiency, showing the quality vs cost tradeoff.

Figure: Statistical significance analysis.

Limitations

  • Five GSD-inspired variants tested. We tested minimal framing, full protocol, deviation rules, goal-backward framing, and plan-context. There may be other orchestration patterns (task decomposition, iterative refinement) that behave differently.
  • Single-file tasks. These are isolated coding tasks. Multi-agent orchestration might add more value for multi-file architectural work where coordination is essential.
  • Claude models only. Other model families may handle orchestration context differently. But the token/quality correlation (r=-0.95) suggests this is a fundamental LLM property.
  • GSD methodology approximation. We tested prompts inspired by GSD patterns, not the full GSD workflow with iterative refinement and state tracking. The real GSD may perform differently in production.
  • No hybrid strategies tested. We didn’t test plan-context combined with executor-minimal or other combinations. Interaction effects could exist.

Try It Yourself

The GSD methodology experiment configuration and all data are open source:

pip install claude-benchmark

# Run the GSD methodology experiment
claude-benchmark experiment run experiments/gsd-methodology.toml

# Generate the report
claude-benchmark report

# See the results (`open` is the macOS command; on Linux, use xdg-open)
open results/*/report.html

You can also define your own orchestration variants in the experiment TOML and test them against your specific task types. If you find an orchestration pattern that beats executor-minimal, I’d like to see the data.

Full experiment configuration: experiments/gsd-methodology.toml

For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.
