June 9, 2026

What an Effective AI Coding Prompt Actually Looks Like (Updated: 212,000+ Benchmarks Across 4 Languages)

Johnathen Chilcher Senior SRE, TechLoom

Updated June 2026. This guide was originally based on 1,458 Python benchmarks. Since then, we’ve run 212,000+ benchmarks across Python, Go, JavaScript, and C#—testing chain-of-thought, politeness, personas, context pollution, model selection, verification instructions, multi-turn iteration, language-specific rules, /init optimization, emotional stakes, output compression, and more. The conclusions have evolved significantly—especially across languages. See the capstone v2 synthesis for the full cross-language picture.

This post distills all of that data into a practical guide. Not opinions. Not “best practices” passed around on social media. Numbers from controlled experiments with statistical analysis, run across Claude Haiku, Sonnet, and Opus on 4 programming languages.

If you want the data, everything is open source at claude-benchmark. If you want the conclusions, keep reading.

Cross-Language Update: The original seven rules below were derived from Python-only testing. Cross-language experiments revealed that several findings flip by language. Where this matters, each rule now includes language-specific caveats. The biggest shift is the Baseline Competence Theory: Python’s high baseline means most advice is noise, but Go, JS, and C# benefit from instructions the model has not already internalized.

The Uncomfortable Baseline

Let’s start with the finding that frames everything else: across 1,458 runs, 13 prompt configurations, and 3 models, no configuration consistently beat an empty prompt.

Configuration	Tokens	Composite Score	Delta from Empty
empty (no instructions)	0	92.15	—
micro-quality	74	91.61	-0.54
refactor-aware	105	91.34	-0.81
workflow	152	91.10	-1.04
anti-pattern	125	90.95	-1.19
judge-aligned	148	90.70	-1.44
typical-readable	753	91.08	-1.07
typical-compressed	~650	90.80	-1.35
large-readable	~6,800	91.29	-0.86
large-compressed	~5,900	90.57	-1.58
bare + cot-prefix	~8	94.7*	-0.56*
bare + cot-detailed	~25	94.6*	-0.73*

* CoT experiment used a different task subset (3 tasks vs. 12), so absolute scores aren’t directly comparable. The relative delta from their own bare baseline is the meaningful number.

The correlation between instruction token count and composite quality score across the Phase 2 profiles was r = -0.95, a nearly perfect negative correlation. On average, the more instruction tokens a profile carried, the lower its output quality.
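As a rough check, the correlation can be recomputed from the table above. This is a minimal sketch that assumes the six rows with exact token counts (empty through judge-aligned) are the Phase 2 profiles; the `pearson_r` helper is written out here for illustration and is not part of the benchmark tool.

```python
import math

# (instruction tokens, composite score) copied from the table above.
# Assumption: these six rows are the Phase 2 profiles.
PROFILES = {
    "empty": (0, 92.15),
    "micro-quality": (74, 91.61),
    "refactor-aware": (105, 91.34),
    "workflow": (152, 91.10),
    "anti-pattern": (125, 90.95),
    "judge-aligned": (148, 90.70),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


tokens = [t for t, _ in PROFILES.values()]
scores = [s for _, s in PROFILES.values()]
r = pearson_r(tokens, scores)
print(f"r = {r:.2f}")
```

With only six data points, the coefficient is a coarse summary of the trend rather than a per-token effect size.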

The most common prompt engineering advice is actively counterproductive for coding tasks. Adding style rules, chain-of-thought instructions, persona definitions, or formatting guidelines to your system prompt does not help Claude write better code on generic tasks. It adds noise to a signal the model already has.

That’s the uncomfortable starting point. Now let’s talk about when and how instructions do help, because the story is more nuanced than “never use a system prompt.”

The Seven Rules

From 1,458 runs and 13 configurations, seven principles emerged. Each is backed by specific data points. I’ve organized them from highest impact to lowest.

Rule 1: Don’t Teach Claude How to Code

Remove any instruction that restates general programming knowledge.

Claude’s training already encodes Python best practices, clean code principles, naming conventions, design patterns, and edge-case awareness. Instructions like “use snake_case for variables,” “write clear docstrings,” or “handle edge cases” are redundant. They don’t make the model try harder—they add tokens that dilute the actual task.

The micro-quality profile tested this directly. It contained just four bullet points of universal coding advice:

Write clear, self-documenting code with descriptive names
Handle edge cases explicitly with informative error messages
Follow task instructions precisely
Keep functions focused and short

These are principles so universal they seem impossible to disagree with. They still reduced quality by 0.54 points. The instructions create a scorecard that can only subtract points, never add them.

Evidence: micro-quality (74 tokens, -0.54), typical-readable (753 tokens with style/naming/formatting rules, -1.07), large-readable (6,800 tokens of comprehensive guidelines, -0.86). All negative. Phase 2, 648 runs.

Don’t include

“Use snake_case for variables”
“Write docstrings for public functions”
“Handle edge cases”
“Follow SOLID principles”
“Use descriptive variable names”
“Keep functions short”

Do include

“Use Pydantic v2 model_validator, not v1 validator”
“Our API returns snake_case JSON, not camelCase”
“Error responses must include a trace_id field”
Things Claude cannot know from training

Rule 2: Don’t Tell Claude How to Think

Skip chain-of-thought instructions for code generation.

“Think step by step” is the most widely repeated prompt engineering advice on the internet. It works on math and logic puzzles. It does not work for coding. In our 270-run experiment, every CoT variant scored lower than the bare baseline on every model.

More explicit reasoning instructions produced worse results than lighter ones. cot-detailed (“analyze requirements step by step, consider edge cases, then implement”) scored lower than cot-prefix (“think step by step”), which scored lower than no instruction at all. The more structure you impose on the model’s reasoning, the more you constrain its natural problem-solving.

Evidence: cot-prefix averaged -0.56 from bare; cot-detailed averaged -0.73. Opus suffered the most (-1.14 for cot-detailed). On refactor-01, Opus dropped 3.5 points (86.8 → 83.3). CoT experiment, 270 runs.

The chess grandmaster analogy holds: giving a grandmaster a “first consider all captures, then consider all checks” checklist doesn’t help them—it fragments the holistic pattern recognition they’re better at. Claude already reasons through implementation. Explicit CoT instructions fragment that reasoning.

Cross-language update: 2,880 cross-language runs revealed CoT helps Go (+5.3) and C# (+7.7) while hurting Python (-0.5). The model’s baseline competence in Python makes CoT redundant, but for languages where the model is less confident, structured reasoning provides scaffolding. If you write Go or C#, add a CoT prefix. If you write Python, skip it.

Rule 3: Tell Claude What It Doesn’t Know

Instructions should encode project-specific knowledge, not programming knowledge.

The one category of information that consistently helped was knowledge Claude cannot have from training: your project’s architecture, your team’s conventions, your deployment constraints, your domain’s terminology.

The typical-readable profile contained build commands (make build, make test), project structure (src/ for code, tests/ mirrors source), and workflow conventions (commit message format, branch naming). These are exactly the kind of project-specific facts that help Claude navigate a real codebase. The profile as a whole still showed a negative delta (-1.07), but it bundled these facts with generic style rules, so the benchmark cannot separate the project context from the rules that likely diluted it.

Evidence: Build commands and project structure are the type of knowledge that helps in multi-file, real-world sessions (not captured in this benchmark’s single-file tasks). The negative result on generic coding tasks confirms that generic rules hurt; it doesn’t invalidate project-specific context. See Caveats.

The ideal CLAUDE.md is a project knowledge base, not a coding style guide. Think of it as onboarding documentation for a new team member who already knows how to code but doesn’t know your codebase.

Rule 4: Use Positive Framing, Never Prohibitions

Say “do X” not “don’t do Y.”

The anti-pattern profile framed guidance as prohibitions: “do not leave dead code,” “do not use overly clever one-liners,” “do not create deeply nested code.” The micro-quality profile framed equivalent guidance as positive directives: “write clear, self-documenting code,” “keep functions focused and short.”

Positive framing outperformed negative framing by 0.66 points (91.61 vs 90.95). Negative instructions appear to prime the model toward the failure mode being described rather than away from it. When you say “do not write deeply nested code,” the model’s attention is drawn to deeply nested code as a concept—and that attention leak shows up in the output.

Evidence: micro-quality (positive, 91.61) vs. anti-pattern (negative, 90.95). Similar size (74 and 125 tokens respectively), same intent, different framing. Positive won by 0.66. Phase 2, 648 runs.

Cross-language update: The capstone v2 experiment (20,499 scored runs across 4 languages) tested positive vs. negative framing as an isolated variable. The difference was noise—no statistically significant advantage for either framing direction. The original 0.66-point gap was confounded by content differences between the micro-quality and anti-pattern profiles, not by framing valence itself. Positive framing is still fine default practice, but don’t stress about rewording existing negative rules.

Emotional stakes update: The emotional stakes experiment (15,120 runs) tested a related axis: does adding emotional weight to your framing help? Growth-mindset framing wins (+1.88 vs. neutral), for example “This is a great opportunity to demonstrate clean code.” But high-pressure framing backfires: life-or-death urgency cost 3.33 points. The takeaway: positive framing matters less in valence (do/don’t) and more in emotional register (encouraging vs. threatening).

Negative (worse)

“Do not leave dead code”
“Do not use overly clever one-liners”
“Do not create deeply nested code”
“Do not ignore edge cases”
“Avoid abbreviations”

Positive (better)

“Remove unused code before finishing”
“Prefer clear multi-line logic over one-liners”
“Use guard clauses and early returns”
“Handle every edge case mentioned in the requirements”
“Use descriptive names”

Rule 5: Target Specific Weaknesses, Not General Quality

Instructions hurt on easy tasks and help on hard ones. Write them for the hard parts.

The overall averages hide the most actionable finding in the data. Instructions consistently hurt on tasks Claude already aces (bug fixes, code generation) and helped on tasks where it struggles (refactoring, instruction-following).

The largest improvement across all 1,188 Phase 1+2 runs: the workflow profile lifted Opus’s instruction-following score by +5.80 points. A structured “read the task, plan your approach, verify your solution” checklist helped Opus dramatically on tasks requiring precise specification compliance. And it didn’t just raise the average—it raised the floor from 61.4 to 83.5, eliminating catastrophic low-scoring runs.

Evidence: workflow +5.80 on Opus/instruction tasks. refactor-aware +2.54 on Haiku/refactor. micro-quality +4.41 on Opus/instruction. Meanwhile, these same profiles all showed negative deltas on bug-fix and code-gen tasks. Phase 2, 648 runs.

This means the value of instructions is highly context-dependent. A CLAUDE.md that helps on refactoring tasks may hurt on bug-fix tasks. The right strategy is to use targeted instructions only for the task types where you observe weakness, and omit them otherwise.

Rule 6: Keep the Formatting

Markdown headers, bullet points, and whitespace help smaller models parse instructions.

I originally advocated stripping markdown formatting from CLAUDE.md files to save tokens. The benchmark showed this was wrong for two of three models.

The readable version of the typical profile outperformed the compressed version on Haiku (by 1.56 points) and Sonnet (by 0.86 points). Only Opus was indifferent, scoring 0.19 points higher with the compressed version. For the large profile, the gap widened: Sonnet scored 2.81 points higher with readable formatting. The large-compressed profile triggered the only statistically significant regression in the entire Phase 1 study: an 18.8% drop on bug-fix-02 (p=0.018).

The “decoration” of markdown headers, bold text, and whitespace serves as parsing landmarks. Smaller models use these structural cues to segment and prioritize information. Stripping them doesn’t save meaningful tokens (5-13% of total API tokens, not the 60-70% of characters) and introduces ambiguity risk.

Evidence: typical-compressed trailed typical-readable on Haiku (-1.56) and Sonnet (-0.86). large-compressed trailed large-readable on Sonnet (-2.81). large-compressed triggered an 18.8% regression on bug-fix-02 (p=0.018). Phase 1, 540 runs.

Compression update: The output compression experiment (15,120 runs) tested how aggressively you can compress CLAUDE.md instructions. Compressed-rules (~80 tokens) is the sweet spot, scoring 73.72 overall and beating the verbose baseline. But ultra-compression (“caveman speak”: “code good. test all. no break.”) backfires, especially for Go (+8-12 pt gain from proper compression, but caveman loses it all). Keep markdown headers and bullet structure; strip only filler words and redundant phrasing.

Rule 7: Match Instructions to the Model

Different models respond differently to the same instructions. One-size-fits-all is suboptimal.

Across both studies, models showed consistent but different preferences:

Haiku benefits from structural guidance. It preferred refactor-aware (+0.48 overall) and responded well to readable formatting. It also showed the largest quality gap between readable and compressed formats. Haiku needs the landmarks.
Sonnet is best left alone. Every instruction profile scored lower than empty. Even CoT only cost it 0.23-0.28 points—it shrugged off bad instructions better than the others, but still performed best with none.
Opus responds to lightweight quality nudges. micro-quality gave it +0.82 overall, and workflow gave it +5.80 on instruction tasks specifically. But it was also the most harmed by CoT (-1.14) and showed the largest refactoring penalty. Opus has strong opinions—instructions either align with them or fight them.

Evidence: Haiku preferred typical-readable (92.3). Sonnet preferred empty (92.5). Opus preferred typical-compressed (90.1) in Phase 1, micro-quality (+0.82) in Phase 2. CoT penalty: Opus -1.14, Sonnet -0.28. All phases, 1,458 runs.

If you use model routing (fast model for simple tasks, capable model for complex ones), your system prompt should vary by model. At minimum, don’t add instructions to Sonnet sessions, and consider a lightweight workflow prompt for Opus on complex tasks.
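One way to picture model-conditional prompting is a small selector function. This is a hedged sketch, not code from the benchmark: the model labels, task-type labels, and the `select_system_prompt` name are hypothetical, and the prompt strings are shortened placeholders for whatever snippets you have validated on your own tasks.

```python
# Placeholder prompt texts; the wording is illustrative only.
WORKFLOW_PROMPT = (
    "# Before You Write Code\n"
    "- Read the task and plan your approach\n"
    "# After Writing Code\n"
    "- Verify your solution addresses every requirement\n"
)
REFACTOR_PROMPT = (
    "# When Refactoring\n"
    "- Use guard clauses and early returns to flatten nested logic\n"
)


def select_system_prompt(model: str, task_type: str) -> str:
    """Pick targeted instructions for a (model, task type) pair.

    The labels "haiku", "sonnet", "opus", "instruction-following", and
    "refactor" are hypothetical identifiers chosen for this sketch.
    """
    model = model.lower()
    if model == "sonnet":
        # Sonnet scored best with no instructions at all.
        return ""
    if model == "opus" and task_type == "instruction-following":
        return WORKFLOW_PROMPT
    if model == "haiku" and task_type == "refactor":
        return REFACTOR_PROMPT
    # Default to empty rather than adding untargeted guidance.
    return ""


print(repr(select_system_prompt("opus", "instruction-following")))
```

Defaulting to an empty string keeps the selector conservative: a pairing gets instructions only when there is a measured reason for them.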

What an Effective Prompt Actually Looks Like

Here’s the synthesis. Based on 1,458 runs, if I were building a production CLAUDE.md from scratch, it would look like this:

# Project Context

- Build: `make build`; test: `make test`; lint: `make lint`
- Source in src/, tests mirror in tests/, config in config/
- Database: PostgreSQL 16 with pgvector extension
- API framework: FastAPI with Pydantic v2 models
- Auth: JWT tokens issued by Auth0, validated in middleware

# Our Conventions

- API responses: {"data": ..., "meta": {...}, "errors": [...]}
- All timestamps UTC ISO 8601; stored as timestamptz
- IDs are ULIDs, not UUIDs (use python-ulid library)
- Error responses include trace_id from request context
- Background jobs use Celery with Redis broker

# Current Sprint Context

- Migrating from sync SQLAlchemy to async (sqlalchemy[asyncio])
- Payment service uses Stripe API v2024-12 — not the latest
- Feature flag: ENABLE_NEW_CHECKOUT gates the new flow

That’s it. No style rules. No “use descriptive variable names.” No “think step by step.” No “you are a senior engineer.” Just the things Claude cannot know from training: your project’s stack, your team’s conventions, and your current context.

Notice what’s missing from that example compared to the typical community CLAUDE.md:

Commonly Included	Why It’s Missing	Evidence
Naming conventions (snake_case, PascalCase)	Claude already follows Python naming conventions. Restating them is redundant.	All profiles with naming rules scored ≤ empty
Formatting rules (indent, line length)	Claude follows standard formatting. Your linter catches the rest.	typical-readable (-1.07), large-readable (-0.86)
Design principles (SOLID, DRY, composition over inheritance)	Encoded in training. Restating them doesn’t improve output.	large-readable contains all of these; still -0.86
“Think step by step” / CoT	Hurt Python code quality on every model tested; the cross-language runs showed gains for Go and C#.	CoT experiment: -0.56 to -1.14
Persona (“you are a senior engineer”)	large-readable includes a persona; scored below empty.	large-readable (-0.86), large-compressed (-1.58)
Comment/docstring rules	Claude writes appropriate documentation by default.	micro-quality includes “write clear code”; still -0.54
Error handling guidelines	Claude handles errors well on its own; generic guidelines add noise.	All profiles with error handling rules ≤ empty
Git conventions	These affect workflow, not code quality. Keep them if useful for your workflow, but know they don’t improve the code itself.	Not directly measured in code quality scores

The Exception: When Instructions Are Worth It

The data doesn’t say “never use instructions.” It says “generic instructions hurt; specific instructions help on specific tasks.” Here’s when to break the minimalist rule:

1. Opus on Complex Tasks

When using Opus for instruction-following or compliance-heavy work (API contract implementation, specification-driven development), a lightweight workflow prompt is the single most impactful intervention in the dataset:

# Before You Write Code
- Read the entire task description twice
- Identify all explicit requirements and edge cases
- Plan your approach before typing

# After Writing Code
- Verify your solution addresses every requirement
- Remove unused imports or dead code

This raised Opus’s instruction-following score by +5.80 points and its worst-case score from 61.4 to 83.5. That’s the difference between a solution that occasionally misses requirements entirely and one that consistently covers them all.

2. Haiku on Refactoring

When using Haiku for code restructuring, targeted refactoring guidance helps:

# When Refactoring
- Use guard clauses and early returns to flatten nested logic
- Extract repeated patterns into well-named helper functions
- Preserve all existing public interfaces unless told otherwise

This lifted Haiku’s refactoring score by +2.54. Haiku benefits from structural guidance because its internal model of code architecture is less developed than Sonnet’s or Opus’s.

3. Consistency-Critical Workloads

If your use case values consistency over peak performance (CI pipelines, automated code generation, production workflows), instructions compress the variance. The workflow profile reduced Opus’s standard deviation on refactoring from 3.2 to 2.2 in the CoT study. More importantly, targeted instructions eliminated catastrophic low-scoring runs. Even when the average barely moves, the floor rises significantly.

Instructions act as guardrails, not boosters. They don’t make the best runs better. They prevent the worst runs from happening. For automated workflows where a bad outlier is more costly than a slightly lower average, that tradeoff is worth it.
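The guardrail effect is easier to see when you summarize runs by floor and spread as well as by mean. The sketch below assumes you have per-run scores for two configurations; the score lists are invented purely for illustration and are not benchmark data, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(scores):
    """Return the mean, sample standard deviation, and worst score of a run set."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "floor": min(scores),
    }


# Hypothetical scores, made up for this example (NOT benchmark results).
without_instructions = [96.0, 95.0, 94.0, 93.0, 62.0]
with_instructions = [90.0, 89.0, 88.0, 87.0, 84.0]

baseline = summarize(without_instructions)
guarded = summarize(with_instructions)

for name, stats in (("without", baseline), ("with", guarded)):
    print(
        f"{name:>8}: mean={stats['mean']:.1f} "
        f"stdev={stats['stdev']:.1f} floor={stats['floor']:.1f}"
    )
```

In this made-up data the means are nearly identical, the best run is lower with instructions, and the worst run is far higher. That is the shape of tradeoff to look for before adopting instructions in an automated pipeline.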

A Decision Framework

Rather than prescribing a single prompt, here’s how to decide what to include:

Ask yourself	If yes	If no
Does Claude already know this from training?	Remove it.	Keep it.
Is this a general coding principle?	Remove it.	Keep it.
Is this specific to my project/team/domain?	Keep it.	Remove it.
Am I framing this as a prohibition?	Rewrite as positive directive.	Good.
Am I using Sonnet?	Minimize all instructions.	See model-specific guidance.
Does this address a specific task-type weakness?	Keep it; consider making it conditional.	Remove it.
Would this survive a “does Claude need to be told this?” test?	Keep it.	Remove it.
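The table can also be read as a small triage procedure. The sketch below is one possible encoding, under the assumption that a measured task-type weakness takes priority over the "generic" check (as with the workflow checklist for Opus). The `Instruction` fields and the `triage` function are hypothetical names, and the flags are judgments you supply by hand.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    text: str
    generic: bool                # Claude already knows it / general coding principle
    project_specific: bool       # specific to your project, team, or domain
    targets_task_weakness: bool  # addresses a weakness you measured on a task type
    is_prohibition: bool         # phrased as "do not ..."


def triage(instr: Instruction) -> str:
    """Return a verdict for one candidate CLAUDE.md line."""
    if instr.targets_task_weakness:
        verdict = "keep, conditional on task type"
    elif instr.project_specific and not instr.generic:
        verdict = "keep"
    else:
        return "remove"
    if instr.is_prohibition:
        verdict += "; rewrite as a positive directive"
    return verdict


examples = [
    Instruction("Use descriptive variable names", True, False, False, False),
    Instruction("IDs are ULIDs, not UUIDs", False, True, False, False),
    Instruction("Do not change public interfaces when refactoring", True, False, True, True),
]
for instr in examples:
    print(f"{triage(instr):<60} {instr.text}")
```

The Sonnet row of the table is left out on purpose: it is a property of the session rather than of an individual instruction.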

What I Got Wrong

I’ve published three posts in this series. The first two contained advice I now know is wrong, or at best overstated. In the interest of intellectual honesty:

What I Said	What the Data Shows
“Compress your CLAUDE.md” — strip markdown formatting to save 60-70% token bloat	Compression saves 5-13% of total tokens (not 60-70%), and it degrades quality for Haiku and Sonnet. Keep the formatting.
“Check for error handling, documentation, and security” — a checklist for evaluating AI code	Claude already handles most items on that checklist by default. The value is in project-specific checks (does it use our auth middleware? does it follow our API contract?), not generic code quality criteria.
Implied: more detailed instructions = better output	r = -0.95 between instruction tokens and quality. The large-readable profile (649 lines, 6,800 tokens of painstakingly crafted guidelines) scored below an empty file.

I’m not embarrassed by the earlier posts—they led to building the benchmark that proved them wrong. That’s how this is supposed to work. You form a hypothesis, test it, and update when the data disagrees. The problem is publishing the hypothesis as advice before running the test.

Caveats

This data is real but limited. Before you delete your CLAUDE.md entirely:

Single-file tasks only. All benchmark tasks involve modifying one file. Real-world coding involves navigating codebases, understanding dependencies, and following project-specific conventions—exactly the scenarios where CLAUDE.md project context matters most. The benchmark can’t measure this.
Generic coding tasks. These are domain-agnostic problems (fix a bug, implement a function, refactor code). Tasks with domain-specific requirements, complex business logic, or unfamiliar frameworks may respond differently to instructions.
No multi-turn conversations. Each benchmark run is a single prompt → response. In extended coding sessions, a lean system prompt preserves more context window for conversation history.
Workflow instructions aren’t measured. Rules like “always run tests before committing” or “use conventional commits” affect development workflow, not code quality scores. They may be valuable even if they don’t show up in this benchmark.
Claude models only. These results are for Claude Haiku 4.5, Sonnet 4.6, and Opus 4.6. Other model families (GPT, Gemini, open-source) may respond differently to instruction patterns.

Cross-Language Optimal Stacks

After 212,000+ benchmarks across 4 languages, the optimal prompt strategy is language-dependent. The capstone v2, language kitchen-sink, emotional stakes, and output compression experiments established these stacks:

Language	Optimal Stack	What to Avoid
Python	Minimal: code-reviewer persona + polite framing + /init project context. No CoT. No verbose rules.	Kitchen-sink rules, CoT prefix, generic style instructions
Go	CoT prefix + no persona + no polite framing. Sonnet or Opus.	Polite framing (-2.4), persona stacking
C#	Outcome-focused (“produce correct, well-structured code”) + CoT prefix. Opus strongly preferred.	Python-flavored rules (snake_case, list comprehensions), Sonnet (collapses 17-18 pts)
JavaScript	Kitchen-sink or implementation-focused + code-reviewer persona + polite framing + “trace 3 examples”.	Persona + CoT combined (dilutive, -2.4 interaction)
Polyglot	“Follow the target language’s standard conventions” + code-reviewer persona.	Language-specific rules that contaminate other languages
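A language-conditional stack can be expressed as a lookup from language to prompt components. This is a minimal sketch under stated assumptions: the component names mirror the table, but the exact fragment wording for the persona and polite framing is hypothetical, the /init project context and kitchen-sink rules are omitted, and falling back to the polyglot stack for unlisted languages is a choice made for this example.

```python
# Fragment wording is illustrative; only the quoted phrases from the
# article ("think step by step", "trace 3 examples", and so on) come from it.
FRAGMENTS = {
    "cot_prefix": "Think step by step.",
    "persona": "You are a careful code reviewer.",        # hypothetical wording
    "polite": "Please take your time with this task.",   # hypothetical wording
    "outcome": "Produce correct, well-structured code.",
    "trace": "Trace 3 examples through your solution.",
    "conventions": "Follow the target language's standard conventions.",
}

# Components per language, taken from the "Optimal Stack" column.
STACKS = {
    "python": ["persona", "polite"],
    "go": ["cot_prefix"],
    "csharp": ["outcome", "cot_prefix"],
    "javascript": ["persona", "polite", "trace"],
    "polyglot": ["conventions", "persona"],
}


def build_prefix(language: str) -> str:
    """Join the prompt fragments for a language, one per line."""
    components = STACKS.get(language.lower(), STACKS["polyglot"])
    return "\n".join(FRAGMENTS[name] for name in components)


print(build_prefix("go"))
print("---")
print(build_prefix("javascript"))
```

Keeping the stacks in data rather than in branching code makes it straightforward to rerun the benchmark after changing a single component.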

The Bottom Line

After 212,000+ benchmark runs across 4 languages, the most effective AI coding prompt depends on the language. The data points to a refined framework:

Start empty for Python. Claude’s Python baseline is 88/100. Instructions are noise. For Go, JS, or C#, start with language-appropriate guidance instead.
Add project context. Build commands, project structure, stack details, current sprint context. /init + persona is the strongest variant we’ve measured.
Match the language. CoT helps Go/C#. Politeness helps JS. Both hurt Python. Use language-conditional stacks.
Target specific weaknesses. “Brief docstrings” saves 13% of tokens with zero quality cost. Narrow corrections beat broad rules.
Front-load, don’t iterate. Multi-turn blind review hurts quality. Invest in your prompt, not your turn count.
Use encouragement, not pressure. Growth-mindset framing helps (+1.88). High-pressure urgency backfires (-3.33). Tone matters.
Compress smartly. ~80 tokens of compressed rules beats verbose instructions. But don’t strip structure—caveman compression destroys quality.
Match the model. Haiku is the canary (biggest swings). Sonnet resists prompting. Opus responds to lightweight nudges.
Measure the output. Intuition about what “should” help is frequently wrong. Run the benchmark.

Every token in your system prompt should earn its place. If it doesn’t encode knowledge Claude lacks, it’s not paying rent.

Try It Yourself

The benchmark tool, all 13 profiles, and the experiment configurations are open source:

pip install claude-benchmark

# Test your current CLAUDE.md
claude-benchmark run --profile path/to/your/CLAUDE.md

# Compare against empty
claude-benchmark report

# Run the CoT experiment
claude-benchmark experiment run experiments/cot.toml

# See what the data says
open results/*/report.html

If your CLAUDE.md beats empty on your tasks, you’ve found something genuinely useful. If it doesn’t, you’ve just freed up context window for the conversation that actually matters.

All data and methodology: github.com/jchilcher/claude-benchmark