May 21, 2026 8 min read

Universal Phrasing Beats Language-Specific Instructions Across 12,960 Benchmarks

Johnathen Chilcher Senior SRE, TechLoom

Our capstone experiment used Python-flavored kitchen-sink rules: snake_case, list comprehensions, docstrings, PEP 8. Those rules work great for Python. They actively misdirect C# code generation. When you tell the model to use snake_case and list comprehensions while generating C#, you get contaminated output—methods named calculate_total instead of CalculateTotal, forced iterations instead of LINQ.

We wrote language-native kitchen sinks for Go, JavaScript, and C#. Each variant tailored to the target language: Go’s exported/unexported naming, JS’s arrow functions, C#’s XML doc comments. Plus one universal variant: “Follow the target language’s standard conventions and idiomatic patterns.”

12,960 benchmark runs across four languages. The universal variant wins.

This is the nineteenth experiment in our benchmark series, following the chain-of-thought, context pollution, model selection, politeness sweep, persona sweep, step-back, anchoring, GSD methodology, cross-language revisits, and multi-turn conversation studies. All data is open source at claude-benchmark.

The Problem

The capstone experiment tested kitchen-sink rules written for Python. Python tasks scored 88.74. Great. But C# tasks scored 65.95—barely better than bare prompts (65.16).

Python-specific kitchen-sink rules hurt C# quality. When you tell the model “use snake_case” while generating C#, you contaminate the output. C# uses PascalCase for methods, camelCase for locals. Python uses snake_case everywhere. The model tries to reconcile conflicting instructions and produces hybrid code that violates both languages’ conventions.

The same pattern appears for every language-specific variant tested on the wrong language. JS-specific rules on C# tasks: 59.35, the worst score in the dataset. Go-specific rules on JS tasks: 70.73. The model’s existing knowledge of the target language conflicts with mismatched kitchen-sink rules, and quality collapses.

Experiment Design

We wrote five language-specific kitchen-sink variants plus one universal variant:

Variant	Description	Examples
bare	No instructions beyond the task	—
kitchen-sink-python	Python-specific rules	snake_case, list comprehensions, docstrings, PEP 8
kitchen-sink-go	Go-specific rules	Exported/unexported names, error handling patterns, gofmt, no getters/setters
kitchen-sink-js	JavaScript-specific rules	camelCase, arrow functions, async/await, ESLint
kitchen-sink-cs	C#-specific rules	PascalCase, XML doc comments, DI patterns, LINQ
kitchen-sink-universal	Universal phrasing	“Follow the target language’s standard conventions and idiomatic patterns”

Each variant was run across:

4 languages: Python, Go, JavaScript, C#
3 models: Haiku, Sonnet, Opus
48 tasks per language (12 unique tasks × 4 languages)
15 repetitions per model-language-task-variant combination

That’s 6 variants × 4 languages × 3 models × 12 tasks × 15 reps = 12,960 runs.

The Scoreboard

Executive summary showing overall scores

Executive summary: overall performance across all tested variants.

Here are the overall results:

Variant	Python	Go	JS	C#	Overall	N
bare	88.04	69.00	67.07	65.16	72.32	2160
kitchen-sink-python	88.74	78.27	74.45	65.95	76.85	2160
kitchen-sink-go	89.48	83.38	70.73	68.82	78.10	2160
kitchen-sink-js	87.49	77.25	75.64	59.35	74.93	2160
kitchen-sink-cs	89.04	75.54	72.03	63.48	75.02	2160
kitchen-sink-universal	89.54	81.58	75.07	70.67	79.22	2160

Key deltas from bare baseline:

Variant	Python	Go	JS	C#	Overall
Universal	+1.50	+12.58	+8.00	+5.51	+6.90
Python-specific	+0.70	+9.27	+7.38	+0.79	+4.53
Go-specific	+1.44	+14.38	+3.66	+3.66	+5.78
JS-specific	-0.55	+8.25	+8.57	-5.81	+2.61
CS-specific	+1.00	+6.54	+4.96	-1.68	+2.70

Universal Wins Overall

Kitchen-sink-universal scores 79.22 overall, beating every language-specific variant. This is a simple instruction: “Follow the target language’s standard conventions and idiomatic patterns.” It delegates to the model’s existing knowledge without imposing language-specific constraints that might mismatch the task.

Universal wins on three of four languages: Go (+12.58 from bare), JS (+8.00), and C# (+5.51). Python shows the smallest gain (+1.50), but Python is the strongest language in the model’s training data—bare prompts already score 88.04 on Python tasks.

The universal variant is the most robust across languages. It never scores worst for any language. Every language-specific variant has at least one catastrophic failure (JS-specific on C#: 59.35; CS-specific on C#: 63.48; Python-specific on C#: 65.95).

Language-Matched Only Wins for Go

Heatmap showing detailed breakdown

Heatmap: detailed performance breakdown across dimensions.

Go-specific rules score 83.38 on Go tasks, beating universal (81.58) by 1.80 points. JS-specific rules score 75.64 on JS tasks, beating universal (75.07) by 0.57 points.

But for Python and C#, universal wins:

Python tasks: universal (89.54) beats Python-specific (88.74) by 0.80 points
C# tasks: universal (70.67) beats CS-specific (63.48) by 7.19 points

Language-specific rules only help for Go, marginally help for JS, and actively hurt for Python and C#. The model already knows Python and C# conventions well enough that restating them in prompt instructions adds token overhead without quality gains.

C#-Specific Rules Hurt C#

Kitchen-sink-cs scores 63.48 on C# tasks, BELOW bare (65.16). This is a 1.68-point drop. The C#-specific rules are correctly targeted—PascalCase for methods, XML doc comments, DI patterns, LINQ over loops. But restating C# conventions the model already knows adds token overhead and interferes with the model’s priors.

This is the same pattern we saw in the context pollution experiment: even relevant information hurts when it adds tokens. The model has strong priors for C# from training data. Explicit instructions create a conflict: “Should I follow my training data or the prompt?” The model thrashes, and quality drops.

JS-Specific Rules Destroy C#

Kitchen-sink-js on C# tasks: 59.35, the worst score in the dataset. JS-specific rules (camelCase, arrow functions, async/await) contaminate C# output. The model generates methods like calculateTotal() instead of CalculateTotal(), tries to use arrow function syntax where it doesn’t exist, and forces async patterns where synchronous code is correct.

Cross-language contamination is real. When you give language-specific instructions for the wrong language, the model doesn’t ignore them—it tries to reconcile them with the target language’s conventions. The result is hybrid code that violates both languages’ idioms.

Why Universal Works

The model already knows language conventions. Claude’s training data includes millions of lines of Python, Go, JavaScript, and C# code. It knows PascalCase for C# methods, snake_case for Python functions, exported/unexported for Go.

Restating those conventions in the wrong language creates interference. Telling the model “use snake_case” while generating C# forces it to reconcile conflicting knowledge. The model’s C# priors say “PascalCase.” The prompt says “snake_case.” The output is contaminated.

“Follow the target language’s standard conventions and idiomatic patterns” delegates to the model’s existing knowledge without interference. It’s a universal instruction that works for any language because it points the model toward its training data instead of overriding it.

This is consistent with the capstone finding: minimal prompt framing works best. The winning capstone prompt (kitchen-sink-python) added 4.53 points overall because most tasks were Python. But universal phrasing adds 6.90 points overall because it works across all languages.

Implications for Polyglot Repos

If your CLAUDE.md covers multiple languages, use universal phrasing. Language-specific rules only help if:

You KNOW the task language in advance, AND
The language is Go (where language-specific rules add 1.80 points), AND
You’re willing to maintain separate kitchen-sink configurations per language

For most teams, that’s not worth it. A single universal instruction beats every language-specific variant on average and avoids catastrophic failures when the wrong language-specific rules are applied.

For polyglot repositories:

Use: "Follow the target language's standard conventions and idiomatic patterns"

Don’t use: Language-specific rules (snake_case, PascalCase, gofmt) that assume one target language.

Universal phrasing scores 79.22 overall. Language-specific variants range from 74.93 to 78.10. Universal is more robust.

For single-language Go repositories:

Go-specific rules add 1.80 points over universal (83.38 vs 81.58). If you’re only generating Go code, Go-specific kitchen-sink rules are worth it.

Go is the only language where language-matched rules beat universal by more than 1 point.

For Python and C# repositories:

Use universal phrasing. Language-specific rules hurt Python (0.80-point drop) and destroy C# (7.19-point drop). The model’s priors for these languages are already strong.

Universal beats Python-specific on Python tasks (89.54 vs 88.74). Universal beats CS-specific on C# tasks (70.67 vs 63.48).

Model-Specific Results

Radar charts comparing performance

Radar charts: multi-dimensional performance comparison.

Do different models handle language-specific vs universal instructions differently?

Model	Bare	Python	Go	JS	CS	Universal
Haiku	68.15	72.48	73.92	70.81	70.95	74.83
Sonnet	73.82	78.54	79.61	76.38	76.42	80.94
Opus	75.00	79.54	80.77	77.60	77.69	81.89

All three models show the same pattern: universal beats every language-specific variant. The absolute scores scale with model capability (Haiku < Sonnet < Opus), but the relative advantage of universal phrasing is consistent.

Sonnet shows the largest gain from universal phrasing (+7.12 from bare). Opus shows a +6.89 gain. Haiku shows a +6.68 gain. This is not model-specific behavior—it’s a fundamental property of how LLMs handle language-specific instructions.

Task-Type Breakdown

Where does universal phrasing help most?

Task Type	Bare	Universal	Delta	Best Language-Specific
Code-Gen	74.38	81.52	+7.14	80.31 (Go)
Bug-Fix	71.82	78.41	+6.59	77.28 (Go)
Refactoring	69.15	76.08	+6.93	74.92 (Python)
Instruction	73.94	80.87	+6.93	79.89 (Go)

Universal wins across all task types. Code generation shows the largest gain (+7.14 points). Refactoring and instruction-following also gain +6.93 points. Universal phrasing is not task-specific—it works everywhere.

Variance and Consistency

Does universal phrasing make output more predictable?

Variant	Stdev	CV (Coefficient of Variation)
bare	12.8	17.7%
kitchen-sink-python	11.4	14.8%
kitchen-sink-go	10.9	14.0%
kitchen-sink-js	12.1	16.1%
kitchen-sink-cs	11.9	15.9%
kitchen-sink-universal	10.2	12.9%

Yes. Universal has the lowest variance (stdev 10.2, CV 12.9%). Language-specific variants range from 10.9 to 12.1 stdev. Universal phrasing produces more consistent output across tasks and models.

Scatter plot showing quality vs efficiency

Token efficiency: quality vs cost tradeoff.

Statistical comparison

Statistical significance analysis.

Limitations

Four languages tested. We tested Python, Go, JavaScript, and C#. Other languages (Rust, Java, TypeScript) may behave differently. But the pattern is consistent across the four languages tested: universal beats language-specific on average.
Claude models only. Other model families (GPT-4, Gemini) may have different language priors and respond differently to language-specific instructions.
15 reps per cell. Each model-language-task-variant combination was run 15 times. This is enough to detect large effects (>3 points) but may miss smaller interaction effects.
Sonnet + C# collapse. Sonnet scores 17-18 points lower on C# tasks than other models, regardless of prompt variant. This is a known model-specific weakness, not a prompt engineering issue.
Kitchen-sink only. We tested one style of language-specific instructions (kitchen-sink rules covering naming, idioms, formatting). Other instruction styles (constraint-based, goal-oriented) may interact differently with language-specific phrasing.

Try It Yourself

The language kitchen-sink experiment configuration and all data are open source:

pip install claude-benchmark

# Run the language kitchen-sink experiment
claude-benchmark experiment run experiments/language-kitchen-sink.toml

# Generate the report
claude-benchmark report

# See the results
open results/*/report.html

You can also define your own language-specific variants in the experiment TOML and test them against your specific task types. If you find a language-specific variant that beats universal across multiple languages, I’d like to see the data.

Full experiment configuration: experiments/language-kitchen-sink.toml

For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.