What an Effective AI Coding Prompt Actually Looks Like (Updated: 212,000+ Benchmarks Across 4 Languages)
Updated June 2026. This guide was originally based on 1,458 Python benchmarks. Since then, we’ve run 212,000+ benchmarks across Python, Go, JavaScript, and C#—testing chain-of-thought, politeness, personas, context pollution, model selection, verification instructions, multi-turn iteration, language-specific rules, /init optimization, emotional stakes, output compression, and more. The conclusions have evolved significantly—especially across languages. See the capstone v2 synthesis for the full cross-language picture.
This post distills all of that data into a practical guide. Not opinions. Not “best practices” passed around on social media. Numbers from controlled experiments with statistical analysis, run across Claude Haiku, Sonnet, and Opus on 4 programming languages.
If you want the data, everything is open source at claude-benchmark. If you want the conclusions, keep reading.
The Uncomfortable Baseline
Let’s start with the finding that frames everything else: across 1,458 runs, 13 prompt configurations, and 3 models, no configuration consistently beat an empty prompt.
| Configuration | Tokens | Composite Score | Delta from Empty |
|---|---|---|---|
| empty (no instructions) | 0 | 92.15 | — |
| micro-quality | 74 | 91.61 | -0.54 |
| refactor-aware | 105 | 91.34 | -0.81 |
| workflow | 152 | 91.10 | -1.04 |
| anti-pattern | 125 | 90.95 | -1.19 |
| judge-aligned | 148 | 90.70 | -1.44 |
| typical-readable | 753 | 91.08 | -1.07 |
| typical-compressed | ~650 | 90.80 | -1.35 |
| large-readable | ~6,800 | 91.29 | -0.86 |
| large-compressed | ~5,900 | 90.57 | -1.58 |
| bare + cot-prefix | ~8 | 94.7* | -0.56* |
| bare + cot-detailed | ~25 | 94.6* | -0.73* |
* CoT experiment used a different task subset (3 tasks vs. 12), so absolute scores aren’t directly comparable. The relative delta from their own bare baseline is the meaningful number.
The correlation between instruction token count and composite quality score across the Phase 2 profiles was r = -0.95. Nearly perfect negative correlation. Every additional token of instruction marginally decreased output quality.
That’s the uncomfortable starting point. Now let’s talk about when and how instructions do help, because the story is more nuanced than “never use a system prompt.”
The Seven Rules
From 1,458 runs and 13 configurations, seven principles emerged. Each is backed by specific data points. I’ve organized them from highest impact to lowest.
Rule 1: Don’t Teach Claude How to Code
Remove any instruction that restates general programming knowledge.
Claude’s training already encodes Python best practices, clean code principles, naming conventions, design patterns, and edge-case awareness. Instructions like “use snake_case for variables,” “write clear docstrings,” or “handle edge cases” are redundant. They don’t make the model try harder—they add tokens that dilute the actual task.
The micro-quality profile tested this directly. It contained just four bullet points of universal coding advice:
- Write clear, self-documenting code with descriptive names
- Handle edge cases explicitly with informative error messages
- Follow task instructions precisely
- Keep functions focused and short
These are principles so universal they seem impossible to disagree with. They still reduced quality by 0.54 points. The instructions create a scorecard that can only subtract points, never add them.
Don’t include
- “Use snake_case for variables”
- “Write docstrings for public functions”
- “Handle edge cases”
- “Follow SOLID principles”
- “Use descriptive variable names”
- “Keep functions short”
Do include
- “Use Pydantic v2 model_validator, not v1 validator”
- “Our API returns snake_case JSON, not camelCase”
- “Error responses must include a trace_id field”
- Things Claude cannot know from training
Rule 2: Don’t Tell Claude How to Think
Skip chain-of-thought instructions for code generation.
“Think step by step” is the most widely repeated prompt engineering advice on the internet. It works on math and logic puzzles. It does not work for coding. In our 270-run experiment, every CoT variant scored lower than the bare baseline on every model.
More explicit reasoning instructions produced worse results than lighter ones. cot-detailed (“analyze requirements step by step, consider edge cases, then implement”) scored lower than cot-prefix (“think step by step”), which scored lower than no instruction at all. The more structure you impose on the model’s reasoning, the more you constrain its natural problem-solving.
The chess grandmaster analogy holds: giving a grandmaster a “first consider all captures, then consider all checks” checklist doesn’t help them—it fragments the holistic pattern recognition they’re better at. Claude already reasons through implementation. Explicit CoT instructions fragment that reasoning.
Rule 3: Tell Claude What It Doesn’t Know
Instructions should encode project-specific knowledge, not programming knowledge.
The one category of information that consistently helped was knowledge Claude cannot have from training: your project’s architecture, your team’s conventions, your deployment constraints, your domain’s terminology.
The typical-readable profile contained build commands (make build, make test), project structure (src/ for code, tests/ mirrors source), and workflow conventions (commit message format, branch naming). These are exactly the kind of project-specific facts that help Claude navigate a real codebase. On their own they still showed an overall negative delta (-1.07), but this is because they were bundled with generic style rules that diluted the signal.
The ideal CLAUDE.md is a project knowledge base, not a coding style guide. Think of it as onboarding documentation for a new team member who already knows how to code but doesn’t know your codebase.
Rule 4: Use Positive Framing, Never Prohibitions
Say “do X” not “don’t do Y.”
The anti-pattern profile framed guidance as prohibitions: “do not leave dead code,” “do not use overly clever one-liners,” “do not create deeply nested code.” The micro-quality profile framed equivalent guidance as positive directives: “write clear, self-documenting code,” “keep functions focused and short.”
Positive framing outperformed negative framing by 0.66 points (91.61 vs 90.95). Negative instructions appear to prime the model toward the failure mode being described rather than away from it. When you say “do not write deeply nested code,” the model’s attention is drawn to deeply nested code as a concept—and that attention leak shows up in the output.
Negative (worse)
- “Do not leave dead code”
- “Do not use overly clever one-liners”
- “Do not create deeply nested code”
- “Do not ignore edge cases”
- “Avoid abbreviations”
Positive (better)
- “Remove unused code before finishing”
- “Prefer clear multi-line logic over one-liners”
- “Use guard clauses and early returns”
- “Handle every edge case mentioned in the requirements”
- “Use descriptive names”
Rule 5: Target Specific Weaknesses, Not General Quality
Instructions hurt on easy tasks and help on hard ones. Write them for the hard parts.
The overall averages hide the most actionable finding in the data. Instructions consistently hurt on tasks Claude already aces (bug fixes, code generation) and helped on tasks where it struggles (refactoring, instruction-following).
The largest improvement across all 1,188 Phase 1+2 runs: the workflow profile lifted Opus’s instruction-following score by +5.80 points. A structured “read the task, plan your approach, verify your solution” checklist helped Opus dramatically on tasks requiring precise specification compliance. And it didn’t just raise the average—it raised the floor from 61.4 to 83.5, eliminating catastrophic low-scoring runs.
This means the value of instructions is highly context-dependent. A CLAUDE.md that helps on refactoring tasks may hurt on bug-fix tasks. The right strategy is to use targeted instructions only for the task types where you observe weakness, and omit them otherwise.
Rule 6: Keep the Formatting
Markdown headers, bullet points, and whitespace help smaller models parse instructions.
I originally advocated stripping markdown formatting from CLAUDE.md files to save tokens. The benchmark showed this was wrong for two of three models.
The readable version of the typical profile outperformed the compressed version on Haiku (-1.56 delta) and Sonnet (-0.86 delta). Only Opus was indifferent (+0.19). For the large profile, the gap widened: Sonnet scored 2.81 points higher with readable formatting. The large-compressed profile triggered the only statistically significant regression in the entire Phase 1 study: an 18.8% drop on bug-fix-02 (p=0.018).
The “decoration” of markdown headers, bold text, and whitespace serves as parsing landmarks. Smaller models use these structural cues to segment and prioritize information. Stripping them doesn’t save meaningful tokens (5-13% of total API tokens, not the 60-70% of characters) and introduces ambiguity risk.
Rule 7: Match Instructions to the Model
Different models respond differently to the same instructions. One-size-fits-all is suboptimal.
Across both studies, models showed consistent but different preferences:
- Haiku benefits from structural guidance. It preferred refactor-aware (+0.48 overall) and responded well to readable formatting. It also showed the largest quality gap between readable and compressed formats. Haiku needs the landmarks.
- Sonnet is best left alone. Every instruction profile scored lower than empty. Even CoT only cost it 0.23-0.28 points—it shrugged off bad instructions better than the others, but still performed best with none.
- Opus responds to lightweight quality nudges. micro-quality gave it +0.82 overall, and workflow gave it +5.80 on instruction tasks specifically. But it was also the most harmed by CoT (-1.14) and showed the largest refactoring penalty. Opus has strong opinions—instructions either align with them or fight them.
If you use model routing (fast model for simple tasks, capable model for complex ones), your system prompt should vary by model. At minimum, don’t add instructions to Sonnet sessions, and consider a lightweight workflow prompt for Opus on complex tasks.
What an Effective Prompt Actually Looks Like
Here’s the synthesis. Based on 1,458 runs, if I were building a production CLAUDE.md from scratch, it would look like this:
# Project Context
- Build: `make build`; test: `make test`; lint: `make lint`
- Source in src/, tests mirror in tests/, config in config/
- Database: PostgreSQL 16 with pgvector extension
- API framework: FastAPI with Pydantic v2 models
- Auth: JWT tokens issued by Auth0, validated in middleware
# Our Conventions
- API responses: {"data": ..., "meta": {...}, "errors": [...]}
- All timestamps UTC ISO 8601; stored as timestamptz
- IDs are ULIDs, not UUIDs (use python-ulid library)
- Error responses include trace_id from request context
- Background jobs use Celery with Redis broker
# Current Sprint Context
- Migrating from sync SQLAlchemy to async (sqlalchemy[asyncio])
- Payment service uses Stripe API v2024-12 — not the latest
- Feature flag: ENABLE_NEW_CHECKOUT gates the new flow
That’s it. No style rules. No “use descriptive variable names.” No “think step by step.” No “you are a senior engineer.” Just the things Claude cannot know from training: your project’s stack, your team’s conventions, and your current context.
Notice what’s missing from that example compared to the typical community CLAUDE.md:
| Commonly Included | Why It’s Missing | Evidence |
|---|---|---|
| Naming conventions (snake_case, PascalCase) | Claude already follows Python naming conventions. Restating them is redundant. | All profiles with naming rules scored ≤ empty |
| Formatting rules (indent, line length) | Claude follows standard formatting. Your linter catches the rest. | typical-readable (-1.07), large-readable (-0.86) |
| Design principles (SOLID, DRY, composition over inheritance) | Encoded in training. Restating them doesn’t improve output. | large-readable contains all of these; still -0.86 |
| “Think step by step” / CoT | Hurts code quality across all models. | CoT experiment: -0.56 to -1.14 |
| Persona (“you are a senior engineer”) | large-readable includes a persona; scored below empty. | large-readable (-0.86), large-compressed (-1.58) |
| Comment/docstring rules | Claude writes appropriate documentation by default. | micro-quality includes “write clear code”; still -0.54 |
| Error handling guidelines | Claude handles errors well on its own; generic guidelines add noise. | All profiles with error handling rules ≤ empty |
| Git conventions | These affect workflow, not code quality. Keep them if useful for your workflow, but know they don’t improve the code itself. | Not directly measured in code quality scores |
The Exception: When Instructions Are Worth It
The data doesn’t say “never use instructions.” It says “generic instructions hurt; specific instructions help on specific tasks.” Here’s when to break the minimalist rule:
1. Opus on Complex Tasks
When using Opus for instruction-following or compliance-heavy work (API contract implementation, specification-driven development), a lightweight workflow prompt is the single most impactful intervention in the dataset:
# Before You Write Code
- Read the entire task description twice
- Identify all explicit requirements and edge cases
- Plan your approach before typing
# After Writing Code
- Verify your solution addresses every requirement
- Remove unused imports or dead code
This raised Opus’s instruction-following score by +5.80 points and its worst-case score from 61.4 to 83.5. That’s the difference between a solution that occasionally misses requirements entirely and one that consistently covers them all.
2. Haiku on Refactoring
When using Haiku for code restructuring, targeted refactoring guidance helps:
# When Refactoring
- Use guard clauses and early returns to flatten nested logic
- Extract repeated patterns into well-named helper functions
- Preserve all existing public interfaces unless told otherwise
This lifted Haiku’s refactoring score by +2.54. Haiku benefits from structural guidance because its internal model of code architecture is less developed than Sonnet’s or Opus’s.
3. Consistency-Critical Workloads
If your use case values consistency over peak performance (CI pipelines, automated code generation, production workflows), instructions compress the variance. The workflow profile reduced Opus’s standard deviation on refactoring from 3.2 to 2.2 in the CoT study. More importantly, targeted instructions eliminated catastrophic low-scoring runs. Even when the average barely moves, the floor rises significantly.
A Decision Framework
Rather than prescribing a single prompt, here’s how to decide what to include:
| Ask yourself | If yes | If no |
|---|---|---|
| Does Claude already know this from training? | Remove it. | Keep it. |
| Is this a general coding principle? | Remove it. | Keep it. |
| Is this specific to my project/team/domain? | Keep it. | Remove it. |
| Am I framing this as a prohibition? | Rewrite as positive directive. | Good. |
| Am I using Sonnet? | Minimize all instructions. | See model-specific guidance. |
| Does this address a specific task-type weakness? | Keep it; consider making it conditional. | Remove it. |
| Would this survive a “does Claude need to be told this?” test? | Keep it. | Remove it. |
What I Got Wrong
I’ve published three posts in this series. The first two contained advice I now know is wrong, or at best overstated. In the interest of intellectual honesty:
| What I Said | What the Data Shows |
|---|---|
| “Compress your CLAUDE.md” — strip markdown formatting to save 60-70% token bloat | Compression saves 5-13% of total tokens (not 60-70%), and it degrades quality for Haiku and Sonnet. Keep the formatting. |
| “Check for error handling, documentation, and security” — a checklist for evaluating AI code | Claude already handles most items on that checklist by default. The value is in project-specific checks (does it use our auth middleware? does it follow our API contract?), not generic code quality criteria. |
| Implied: more detailed instructions = better output | r = -0.95 between instruction tokens and quality. The large-readable profile (649 lines, 6,800 tokens of painstakingly crafted guidelines) scored below an empty file. |
I’m not embarrassed by the earlier posts—they led to building the benchmark that proved them wrong. That’s how this is supposed to work. You form a hypothesis, test it, and update when the data disagrees. The problem is publishing the hypothesis as advice before running the test.
Caveats
This data is real but limited. Before you delete your CLAUDE.md entirely:
- Single-file tasks only. All benchmark tasks involve modifying one file. Real-world coding involves navigating codebases, understanding dependencies, and following project-specific conventions—exactly the scenarios where CLAUDE.md project context matters most. The benchmark can’t measure this.
- Generic coding tasks. These are domain-agnostic problems (fix a bug, implement a function, refactor code). Tasks with domain-specific requirements, complex business logic, or unfamiliar frameworks may respond differently to instructions.
- No multi-turn conversations. Each benchmark run is a single prompt → response. In extended coding sessions, a lean system prompt preserves more context window for conversation history.
- Workflow instructions aren’t measured. Rules like “always run tests before committing” or “use conventional commits” affect development workflow, not code quality scores. They may be valuable even if they don’t show up in this benchmark.
- Claude models only. These results are for Claude Haiku 4.5, Sonnet 4.6, and Opus 4.6. Other model families (GPT, Gemini, open-source) may respond differently to instruction patterns.
Cross-Language Optimal Stacks
After 212,000+ benchmarks across 4 languages, the optimal prompt strategy is language-dependent. The capstone v2, language kitchen-sink, emotional stakes, and output compression experiments established these stacks:
| Language | Optimal Stack | What to Avoid |
|---|---|---|
| Python | Minimal: code-reviewer persona + polite framing + /init project context. No CoT. No verbose rules. | Kitchen-sink rules, CoT prefix, generic style instructions |
| Go | CoT prefix + no persona + no polite framing. Sonnet or Opus. | Polite framing (-2.4), persona stacking |
| C# | Outcome-focused (“produce correct, well-structured code”) + CoT prefix. Opus strongly preferred. | Python-flavored rules (snake_case, list comprehensions), Sonnet (collapses 17-18 pts) |
| JavaScript | Kitchen-sink or implementation-focused + code-reviewer persona + polite framing + “trace 3 examples”. | Persona + CoT combined (dilutive, -2.4 interaction) |
| Polyglot | “Follow the target language’s standard conventions” + code-reviewer persona. | Language-specific rules that contaminate other languages |
The Bottom Line
After 212,000+ benchmark runs across 4 languages, the most effective AI coding prompt depends on the language. The data points to a refined framework:
- Start empty for Python. Claude’s Python baseline is 88/100. Instructions are noise. For Go, JS, or C#, start with language-appropriate guidance instead.
- Add project context. Build commands, project structure, stack details, current sprint context. /init + persona is the strongest variant we’ve measured.
- Match the language. CoT helps Go/C#. Politeness helps JS. Both hurt Python. Use language-conditional stacks.
- Target specific weaknesses. “Brief docstrings” saves 13% of tokens with zero quality cost. Narrow corrections beat broad rules.
- Front-load, don’t iterate. Multi-turn blind review hurts quality. Invest in your prompt, not your turn count.
- Use encouragement, not pressure. Growth-mindset framing helps (+1.88). High-pressure urgency backfires (-3.33). Tone matters.
- Compress smartly. ~80 tokens of compressed rules beats verbose instructions. But don’t strip structure—caveman compression destroys quality.
- Match the model. Haiku is the canary (biggest swings). Sonnet resists prompting. Opus responds to lightweight nudges.
- Measure the output. Intuition about what “should” help is frequently wrong. Run the benchmark.
Every token in your system prompt should earn its place. If it doesn’t encode knowledge Claude lacks, it’s not paying rent.
Try It Yourself
The benchmark tool, all 13 profiles, and the experiment configurations are open source:
pip install claude-benchmark
# Test your current CLAUDE.md
claude-benchmark run --profile path/to/your/CLAUDE.md
# Compare against empty
claude-benchmark report
# Run the CoT experiment
claude-benchmark experiment run experiments/cot.toml
# See what the data says
open results/*/report.html
If your CLAUDE.md beats empty on your tasks, you’ve found something genuinely useful. If it doesn’t, you’ve just freed up context window for the conversation that actually matters.
All data and methodology: github.com/jchilcher/claude-benchmark