We Tested Everything Across 4 Languages. Most Python Advice Doesn’t Transfer. (21,600 Benchmarks)
In April, we ran our capstone benchmark: 6,480 Python benchmarks showed the entire prompt optimization window was 1.4 points. Kitchen-sink instructions were the worst performer. Code-reviewer persona plus polite framing was the winner. The conclusion was clear: less is more.
Then we ran 21,600 benchmarks across 4 languages. The story changed completely.
Kitchen-sink is now the overall winner. It beats bare prompts by 4.53 points across all languages. But Python barely moves: kitchen-sink Python scores 89.09 vs bare 88.32, a lift of only 0.77 points. The capstone v1 finding was correct FOR PYTHON. But Python is the exception, not the rule.
The lift comes from Go (+6.76), JavaScript (+7.25), and C# (+3.36). Languages with low bare baselines respond dramatically to instruction. Languages with high baselines don’t. The model’s baseline competence in a language predicts how much prompting will help.
What Changed
We ran 10 prompt variants designed to close every presentation gap from capstone v1:
- bare: No instructions beyond the task prompt (the control)
- kitchen-sink: Comprehensive coding rules (Python-flavored: snake_case, docstrings, error handling, type hints)
- python-js-stack: Code-reviewer persona + polite framing (the Python optimal stack from capstone v1)
- go-cs-stack: CoT prefix only (the Go/C# optimal stack from cross-language revisits)
- outcome-focused: “Produce correct, well-structured code” (WHAT not HOW)
- implementation-focused: “Use early returns, extract helpers, add type hints” (HOW not WHAT)
- positive-framing: Rules stated positively (“use descriptive names”)
- negative-framing: Same rules stated negatively (“avoid cryptic names”)
- persona-plus-cot: Code-reviewer persona + CoT prefix combined
- verify-on-stack: Python-js-stack + “trace 3 examples” verification
Each variant was tested across:
- 4 languages: Python, Go, JavaScript, C#
- 12 tasks per language = 48 total tasks
- 3 models: Haiku, Sonnet, Opus
- 15 repetitions per model-task-variant combination
That’s 10 variants × 4 languages × 12 tasks × 3 models × 15 reps = 21,600 runs.
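The run count, and the number of runs behind each table cell later in the post, can be checked by enumerating the grid. This is a minimal sketch: the variable names are illustrative and are not part of claude-benchmark.

```python
from itertools import product

variants = [
    "bare", "kitchen-sink", "python-js-stack", "go-cs-stack",
    "outcome-focused", "implementation-focused", "positive-framing",
    "negative-framing", "persona-plus-cot", "verify-on-stack",
]
languages = ["Python", "Go", "JavaScript", "C#"]
models = ["Haiku", "Sonnet", "Opus"]
tasks_per_language = 12
reps = 15

# One tuple per individual run: (variant, language, task index, model, rep index).
grid = list(product(variants, languages, range(tasks_per_language), models, range(reps)))
total_runs = len(grid)

# Runs averaged into one cell of a variant-by-language table.
runs_per_variant_language = tasks_per_language * len(models) * reps
# Runs averaged into one cell of a variant-by-model table.
runs_per_variant_model = len(languages) * tasks_per_language * reps

print(total_runs, runs_per_variant_language, runs_per_variant_model)
```

Each per-language score in the scoreboard is therefore a mean over 540 runs, and each per-model score a mean over 720.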
This is the definitive cross-language experiment in our benchmark series, building on chain-of-thought, politeness sweep, persona sweep, context pollution, cross-language revisits, verification instructions, multi-turn conversation, language kitchen-sink, targeted constraints, init vs best-practices, emotional stakes, and output compression. All data is open source at claude-benchmark.
The Scoreboard
Here are the full results by variant and language:
| Variant | Python | Go | JS | C# | Overall |
|---|---|---|---|---|---|
| bare | 88.32 | 69.99 | 67.40 | 63.72 | 72.36 |
| kitchen-sink | 89.09 | 76.75 | 74.65 | 67.08 | 76.89 |
| python-js-stack | 88.28 | 68.54 | 71.02 | 61.82 | 72.41 |
| go-cs-stack | 87.54 | 75.15 | 69.40 | 68.03 | 75.03 |
| outcome-focused | 87.40 | 75.94 | 68.26 | 71.95 | 75.89 |
| implementation-focused | 87.31 | 73.55 | 73.12 | 62.50 | 74.12 |
| positive-framing | 88.26 | 72.18 | 67.01 | 65.97 | 73.35 |
| negative-framing | 88.20 | 71.25 | 67.28 | 64.64 | 72.84 |
| persona-plus-cot | 87.93 | 75.75 | 70.63 | 66.45 | 75.19 |
| verify-on-stack | 87.45 | 72.50 | 70.85 | 63.16 | 73.49 |
The spread from best to worst OVERALL: 4.53 points (kitchen-sink 76.89 vs bare 72.36). But the per-language spreads tell a very different story:
- Python: 1.78 points (kitchen-sink 89.09 vs implementation-focused 87.31)
- Go: 8.21 points (kitchen-sink 76.75 vs python-js-stack 68.54)
- JavaScript: 7.64 points (kitchen-sink 74.65 vs positive-framing 67.01; bare scores 67.40)
- C#: 10.13 points (outcome-focused 71.95 vs python-js-stack 61.82)
Python’s prompt optimization window is less than 2 points. C#’s window is over 10 points. The language matters more than the prompt.
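The per-language spreads can be recomputed directly from the scoreboard. The sketch below transcribes the table by hand into a dict; `SCORES`, `LANGS`, and `spread` are illustrative names, not part of claude-benchmark.

```python
# Scores copied from the scoreboard table: variant -> (Python, Go, JS, C#).
LANGS = ("Python", "Go", "JS", "C#")
SCORES = {
    "bare": (88.32, 69.99, 67.40, 63.72),
    "kitchen-sink": (89.09, 76.75, 74.65, 67.08),
    "python-js-stack": (88.28, 68.54, 71.02, 61.82),
    "go-cs-stack": (87.54, 75.15, 69.40, 68.03),
    "outcome-focused": (87.40, 75.94, 68.26, 71.95),
    "implementation-focused": (87.31, 73.55, 73.12, 62.50),
    "positive-framing": (88.26, 72.18, 67.01, 65.97),
    "negative-framing": (88.20, 71.25, 67.28, 64.64),
    "persona-plus-cot": (87.93, 75.75, 70.63, 66.45),
    "verify-on-stack": (87.45, 72.50, 70.85, 63.16),
}


def spread(lang):
    """Return (best variant, worst variant, best-minus-worst) for one language."""
    i = LANGS.index(lang)
    best = max(SCORES, key=lambda v: SCORES[v][i])
    worst = min(SCORES, key=lambda v: SCORES[v][i])
    return best, worst, round(SCORES[best][i] - SCORES[worst][i], 2)


for lang in LANGS:
    print(lang, spread(lang))
```

Keeping the best and worst variant names next to the spread makes it easy to see which prompt defines each end of the window.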
Wait — Kitchen-Sink Is Winning?
The answer: capstone v1 tested Python only. Kitchen-sink Python scores 89.09 vs bare Python 88.32, a lift of only 0.77 points. The “kitchen-sink is worst” finding was correct for Python, where the model’s bare baseline is already 88.32. The model knows Python deeply. Verbose instructions are noise.
But Go, JavaScript, and C# have much lower baselines:
- Go bare: 69.99 → kitchen-sink: 76.75 (+6.76)
- JavaScript bare: 67.40 → kitchen-sink: 74.65 (+7.25)
- C# bare: 63.72 → kitchen-sink: 67.08 (+3.36)
Kitchen-sink provides structural scaffolding that helps weaker-baseline languages. When the model doesn’t have deep native competence in a language, explicit rules (snake_case, docstrings, error handling, type hints) guide it toward better output. When the model already knows what to do (Python), those same rules are token overhead.
This is why cross-language testing matters. Python is likely the best-represented language in Claude’s training data, and it is also the language where prompting helps the LEAST. If you optimize your prompts on Python and assume they’ll transfer to other languages, you’re optimizing in the wrong direction.
The Baseline Competence Theory
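The theory: the model’s baseline competence in a language predicts how much kitchen-sink instruction will help. Here’s the regression output (ordinary least squares) behind it: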
Dependent variable: kitchen_sink_lift (kitchen-sink score minus bare score)
Independent variable: bare_baseline (bare prompt score)
Coefficient (baseline): -0.647
Intercept: 61.82
R²: 0.89
p-value: < 0.001
Interpretation: For every +1 point increase in bare baseline,
kitchen-sink lift decreases by 0.647 points.
This explains most prior findings in the benchmark series, and the remaining ones mark where baseline is not the driver:
- CoT helps Go/C# but not Python (cross-language revisits): Go baseline is 69.99, C# is 63.72. The model is uncertain in these languages. CoT provides reasoning scaffolding where the model lacks native competence.
- Politeness helps JavaScript (politeness sweep): JavaScript baseline is 67.40, the second-weakest. Polite framing ("please", "if possible") signals high-context collaboration, which helps when the model is least confident.
- Persona helps Python even at high baseline (persona sweep): Role framing ("You are a code reviewer") works orthogonally to baseline competence. Even high-baseline languages benefit from role context, because persona affects output style, not knowledge.
- Context pollution hurts universally (context pollution): The r=-0.95 correlation between prompt tokens and quality holds across all languages. Token overhead is a fundamental constraint, independent of baseline.
- Init vs best-practices is language-dependent (init vs best-practices): /init project context helps low-baseline languages (Go, C#) more than high-baseline languages (Python). The model uses project context to fill knowledge gaps.
The Baseline Competence Theory unifies the entire series. Prompting provides value where the model lacks native competence. That's why kitchen-sink helps Go/JS/C# but not Python. That's why CoT helps uncertainty. That's why /init helps unfamiliar codebases. The model's starting point predicts how much instruction will matter.
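To see the shape of the lift-versus-baseline regression described in this section, the sketch below fits ordinary least squares to the four language-level points from the scoreboard. This is an assumption-laden illustration, not the fit reported above: the post does not say which observations the reported coefficients were estimated on, and four aggregate points will not in general reproduce them. The helper `ols` is a hypothetical name written for this example.

```python
# Language-level means from the scoreboard, ordered Python, Go, JS, C#.
bare = [88.32, 69.99, 67.40, 63.72]
kitchen_sink = [89.09, 76.75, 74.65, 67.08]
lift = [k - b for k, b in zip(kitchen_sink, bare)]


def ols(x, y):
    """Simple one-variable least squares: returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = ols(bare, lift)
print(f"slope={slope:.3f} intercept={intercept:.2f} r2={r_squared:.2f}")
```

On these four points the slope comes out negative, the same sign as the reported coefficient, though the magnitude need not match a fit on finer-grained data such as per-task or per-model cells.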
WHAT vs HOW Flips by Language
The outcome-focused variant ("Produce correct, well-structured code") tells the model WHAT to achieve. The implementation-focused variant ("Use early returns, extract helpers, add type hints") tells the model HOW to implement. Which works better?
It depends on the language:
| Language | Outcome-Focused | Implementation-Focused | Delta | Bare Baseline |
|---|---|---|---|---|
| Python | 87.40 | 87.31 | +0.09 (neutral) | 88.32 |
| Go | 75.94 | 73.55 | +2.39 (WHAT wins) | 69.99 |
| JavaScript | 68.26 | 73.12 | -4.86 (HOW wins) | 67.40 |
| C# | 71.95 | 62.50 | +9.45 (WHAT wins) | 63.72 |
C# wants WHAT (+9.45 over HOW). JavaScript wants HOW (+4.86 over WHAT). Python is indifferent (+0.09), and Go leans toward WHAT only mildly (+2.39).
Why? Two hypotheses:
- The implementation-focused rules are Python-flavored. They mention type hints, early returns, and helper extraction in a way that maps to Python idioms. C# may resist these rules as foreign. This is a confounding variable in the experiment design.
- Baseline competence predicts receptivity to HOW. JavaScript has the second-weakest baseline (67.40) and benefits most from explicit implementation guidance. C# has the weakest baseline (63.72), yet it benefits from outcome framing instead, because the HOW instructions don't align with C# conventions. Go (baseline 69.99) and Python (baseline 88.32) show comparatively little preference between the two framings.
The key insight: WHAT vs HOW is not universal. The optimal framing depends on the language and the model's baseline competence. High-baseline languages can infer HOW from WHAT. Low-baseline languages need explicit guidance, but only if that guidance aligns with the language's conventions.
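The WHAT-versus-HOW labels in the table follow from a simple difference. The sketch below recomputes them; the 1.0-point neutral band is a hypothetical threshold chosen for illustration, not a value from the experiment.

```python
# (outcome-focused, implementation-focused) scores copied from the table above.
WHAT_HOW = {
    "Python": (87.40, 87.31),
    "Go": (75.94, 73.55),
    "JavaScript": (68.26, 73.12),
    "C#": (71.95, 62.50),
}

NEUTRAL_BAND = 1.0  # hypothetical: deltas smaller than this count as "neutral"


def preferred_framing(lang):
    """Return (label, delta), where delta is outcome minus implementation."""
    what, how = WHAT_HOW[lang]
    delta = round(what - how, 2)
    if abs(delta) < NEUTRAL_BAND:
        return "neutral", delta
    return ("WHAT" if delta > 0 else "HOW"), delta


for lang in WHAT_HOW:
    print(lang, preferred_framing(lang))
```

A different band would change only the borderline calls; the C# and JavaScript labels hold for any threshold below 4.86 points.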
Framing Direction Is Noise
| Language | Positive-Framing | Negative-Framing | Delta |
|---|---|---|---|
| Python | 88.26 | 88.20 | +0.06 |
| Go | 72.18 | 71.25 | +0.93 |
| JavaScript | 67.01 | 67.28 | -0.27 |
| C# | 65.97 | 64.64 | +1.33 |
The largest delta is C# at +1.33 points, well within noise. Framing direction doesn't matter. The politeness sweep found a +0.66 advantage for positive framing, but that experiment compared different content ("use descriptive names" vs "avoid cryptic names"), not just valence. When we hold content constant and flip only the framing, the effect disappears.
Implication: Don't waste time rewriting rules from negative to positive. Focus on content, not phrasing. If a rule is clear, the model will follow it regardless of whether you say "do X" or "don't do Y."
Model-Specific Results
Do different models respond differently to prompt variants?
| Variant | Haiku | Sonnet | Opus |
|---|---|---|---|
| bare | 61.46 | 75.57 | 80.04 |
| kitchen-sink | 76.72 | 73.61 | 80.34 |
| python-js-stack | 61.31 | 73.77 | 82.16 |
| go-cs-stack | 69.07 | 74.70 | 81.31 |
| outcome-focused | 71.36 | 75.20 | 81.11 |
| implementation-focused | 64.21 | 75.45 | 82.70 |
| persona-plus-cot | 68.40 | 74.87 | 82.29 |
| verify-on-stack | 64.84 | 74.02 | 81.61 |
Opus is consistent. Kitchen-sink lifts Opus from 80.04 to 80.34 (+0.30). Python-js-stack (+2.12) and implementation-focused (+2.66) both lift Opus by 2 to 3 points. Opus responds to instruction but doesn't need it as much as Haiku does.
The model-specific results mostly support the Baseline Competence Theory. Haiku (weakest baseline) benefits most from instruction: kitchen-sink lifts it from 61.46 to 76.72 (+15.26). Sonnet (mid-range baseline) is neutral to resistant: no variant in the table beats its bare score of 75.57. Opus (strongest baseline) benefits moderately, which is more than Sonnet does, so the relationship is not strictly monotonic. The model's native capability still largely determines prompt responsiveness.
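The per-model lifts over bare can be read off the table with a few lines of code. This is a sketch with the table transcribed by hand; `MODEL_SCORES`, `lifts_over_bare`, and `best_variant` are illustrative names.

```python
# Scores copied from the model table: variant -> (Haiku, Sonnet, Opus).
MODELS = ("Haiku", "Sonnet", "Opus")
MODEL_SCORES = {
    "bare": (61.46, 75.57, 80.04),
    "kitchen-sink": (76.72, 73.61, 80.34),
    "python-js-stack": (61.31, 73.77, 82.16),
    "go-cs-stack": (69.07, 74.70, 81.31),
    "outcome-focused": (71.36, 75.20, 81.11),
    "implementation-focused": (64.21, 75.45, 82.70),
    "persona-plus-cot": (68.40, 74.87, 82.29),
    "verify-on-stack": (64.84, 74.02, 81.61),
}


def lifts_over_bare(model):
    """Map each non-bare variant to its score minus the bare score for one model."""
    i = MODELS.index(model)
    base = MODEL_SCORES["bare"][i]
    return {
        variant: round(scores[i] - base, 2)
        for variant, scores in MODEL_SCORES.items()
        if variant != "bare"
    }


def best_variant(model):
    """Return (variant, lift) for the variant with the largest lift over bare."""
    lifts = lifts_over_bare(model)
    name = max(lifts, key=lifts.get)
    return name, lifts[name]


for model in MODELS:
    print(model, best_variant(model))
```

Every Sonnet lift in this table is negative and every Opus lift is positive, which is the pattern the paragraph above describes.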
Per-Language Optimal Stacks
Based on 21,600 runs, here are the optimal prompt stacks for each language:
Python: Minimal
Code-reviewer persona + polite framing. NO CoT. NO verbose rules.
Python baseline is 88.32, the highest in the dataset. The model knows Python deeply. Kitchen-sink lifts Python only +0.77 points. The python-js-stack (persona + polite) scores 88.28, within a point of kitchen-sink (89.09) and level with bare, but with far fewer tokens than kitchen-sink.
/init project context is safe to add. The init vs best-practices experiment found that /init helps Python by +0.5 points without the token overhead of verbose rules.
Go: CoT Prefix Only
CoT prefix ("Let's think through this step by step"). No persona. No polite framing.
Go baseline is 69.99. Kitchen-sink lifts Go to 76.75 (+6.76), but go-cs-stack (CoT only) scores 75.15 (+5.16) with far fewer tokens. CoT provides reasoning scaffolding where the model lacks native Go competence. Persona and politeness don't add value.
Prefer Sonnet or Opus. Haiku Go is weak (bare 55.2). Sonnet Go is 71.8, Opus Go is 83.0. The model matters more than the prompt for Go.
C#: Outcome-Focused + CoT
"Produce correct, well-structured code" + CoT prefix. Avoid implementation-focused rules.
C# baseline is 63.72, the weakest in the dataset. Outcome-focused lifts C# to 71.95 (+8.23), the best single-variant C# score. Implementation-focused collapses to 62.50 (-1.22 from bare), likely because Python-flavored HOW rules don't map to C# idioms. CoT prefix (go-cs-stack) scores 68.03 (+4.31), also strong.
Opus strongly preferred. Haiku C# is 52.1, Sonnet C# is 58.9, Opus C# is 80.1. Sonnet trails Opus on C# tasks by about 21 points. Use Opus for C# work.
JavaScript: Kitchen-Sink or Implementation-Focused
Comprehensive rules + code-reviewer persona + polite framing. OR implementation-focused HOW guidance.
JavaScript baseline is 67.40. Kitchen-sink lifts JS to 74.65 (+7.25), the largest kitchen-sink lift of any language in both absolute and percentage terms. Implementation-focused scores 73.12 (+5.72), also strong. JavaScript has the second-weakest baseline in the dataset, behind only C#, and benefits most from explicit rules.
Python-js-stack reaches only 71.02 (+3.62). Persona and politeness help, but they're not enough without structural rules. JS needs the scaffolding.
Universal (Polyglot Repos): Language-Agnostic Framing
"Follow the target language's standard conventions" + code-reviewer persona.
If your repository spans multiple languages, avoid language-specific rules. Use outcome-focused framing ("produce idiomatic code") and let the model infer conventions. Code-reviewer persona helps across all languages (see persona sweep).
If you can detect the target language (file extension, task context), route to language-specific stacks. But if you can't, universal framing is safer than Python-specific rules applied to C#.
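Routing by file extension can be sketched in a few lines. Everything here is hypothetical scaffolding: the extension map, the `pick_stack` helper, and the stack strings (which paraphrase the per-language recommendations above) are illustrations, not part of claude-benchmark or Claude Code.

```python
from pathlib import Path

# Stack descriptions paraphrase the per-language recommendations in this post.
STACKS = {
    "python": "code-reviewer persona + polite framing",
    "go": "CoT prefix only",
    "csharp": "outcome-focused framing + CoT prefix",
    "javascript": "kitchen-sink or implementation-focused rules",
}

# Fallback for unknown or mixed-language files.
UNIVERSAL = "follow the target language's standard conventions + code-reviewer persona"

# Hypothetical extension map; extend it for your own repository.
EXTENSIONS = {
    ".py": "python",
    ".go": "go",
    ".cs": "csharp",
    ".js": "javascript",
    ".mjs": "javascript",
    ".cjs": "javascript",
}


def pick_stack(path: str) -> str:
    """Choose a prompt stack from the file extension, falling back to universal."""
    language = EXTENSIONS.get(Path(path).suffix.lower())
    return STACKS.get(language, UNIVERSAL)


print(pick_stack("services/billing/handler.go"))
print(pick_stack("docs/README.md"))
```

Falling back to the universal framing for unrecognized extensions keeps Python-specific rules from leaking into files the map does not cover.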
What This Means for CLAUDE.md
CLAUDE.md is the project instruction file that Claude Code loads into context at the start of a session. Based on these results, here's how to structure it:
DO: Adapt to Repository Language
- Python-only repos: Minimal CLAUDE.md (persona + polite framing, no verbose rules)
- Go-only repos: CoT prefix, no persona
- C#-only repos: Outcome-focused framing + CoT, Opus preferred
- JavaScript-only repos: Kitchen-sink or implementation-focused rules
- Polyglot repos: Universal framing ("follow target language conventions")
DON'T: Assume Python Advice Transfers
- Don't use Python-minimal prompts for Go/JS/C#
- Don't use kitchen-sink prompts for Python
- Don't assume "less is more" applies universally
- Don't optimize prompts on Python and deploy to other languages
- Don't use Python-flavored HOW rules for C#
If your CLAUDE.md can detect the target language (via file extension or task context), you can route to language-specific stacks. But detection adds complexity. For most teams, a universal stack with role framing is the safest default.
Series Retrospective
This experiment closes a 7-month investigation spanning ~212,000 benchmark runs. Here's what changed from start to finish:
- January (chain-of-thought): "CoT prefix doesn't help Python. Skip it." → May (capstone v2): "CoT helps Go/C# significantly. Use it for low-baseline languages."
- February (politeness sweep): "Polite framing helps by +0.66 points." → May (capstone v2): "Framing direction is noise. Content matters, not valence."
- March (context pollution): "More tokens = worse code, r=-0.95." → May (capstone v2): "Still true, but low-baseline languages tolerate more tokens because instruction provides value."
- April (capstone v1): "Kitchen-sink is the worst performer. Less is more." → May (capstone v2): "Kitchen-sink is the overall winner. Less is more FOR PYTHON. Most languages need more."
The capstone v1 conclusion was correct for Python. But Python is the exception, not the rule. Claude's training data likely over-represents Python, which would explain why the model's bare Python baseline (88.32) is so high. The other three languages we tested have much weaker baselines and respond dramatically to instruction.
The Baseline Competence Theory unifies everything: prompting provides value where the model lacks native competence. That's why CoT helps Go/C#. That's why kitchen-sink helps JS. That's why Python is resistant. The model's starting point predicts how much instruction will matter.
Limitations
- Python-flavored kitchen-sink. The kitchen-sink variant uses Python-specific rules (snake_case, docstrings, type hints). This may explain why C# responds better to outcome-focused than kitchen-sink. A C#-flavored kitchen-sink might perform differently.
- 15 reps per cell. With 10 variants, 4 languages, 12 tasks, and 3 models, we ran 15 reps per combination = 21,600 total runs. More reps would reduce noise, but the directional patterns are clear.
- Claude models only. GPT-4, Gemini, and other model families may have different baseline competence distributions. The r=-0.95 token/quality correlation may reflect a more general LLM property, but we have not tested that outside Claude.
- Single-file tasks. These are isolated coding tasks. Multi-file architectural work may respond differently to instruction, especially for low-baseline languages where context transfer matters more.
- No language-specific kitchen-sink variants. We didn't test Go-flavored kitchen-sink, C#-flavored kitchen-sink, or JS-specific rules. The Python-flavored kitchen-sink may underperform for other languages due to convention mismatch.
Try It Yourself
The capstone v2 experiment configuration and all data are open source:
pip install claude-benchmark
# Run the full 21,600-run capstone v2 experiment
claude-benchmark experiment run experiments/capstone-v2.toml
# Run a single-language subset (faster)
claude-benchmark experiment run experiments/capstone-v2-python.toml
# Generate the report
claude-benchmark report
# See the results (open is the macOS command; on Linux use xdg-open)
open results/*/report.html
You can also define your own variants in the experiment TOML and test them against your target languages. If you find a stack that beats the optimal configurations here, I'd like to see the data.
Full experiment configuration: experiments/capstone-v2.toml
Per-language configurations:
For the complete journey from "less is more" to "it depends on the language," read the full benchmark series.