May 12, 2026

/init + Persona Is the Strongest CLAUDE.md Strategy We’ve Measured (4,320 Benchmarks)

Johnathen Chilcher Senior SRE, TechLoom

Claude Code’s /init command generates a CLAUDE.md file automatically. It scans your codebase, extracts build commands, documents architecture patterns, and lists data models. It’s the fastest way to give Claude project-specific context. The question: does layering our empirically proven best practices on top of /init output improve code quality?

I ran 4,320 benchmarks testing four variants. The answer: yes. /init output combined with the code-reviewer persona, polite framing, and temperature 1.0 beats everything we’ve tested, scoring 89.66 points, the highest score in the entire benchmark series.

This is the ninth experiment in our benchmark series, following the chain-of-thought, context pollution, model selection, politeness sweep, persona sweep, step-back, anchoring, and GSD methodology studies. All data is open source at claude-benchmark.

The Four Variants

I tested four CLAUDE.md configurations:

Variant	Description
init-plus-practices	Full `/init` output (~600 tokens: build commands, architecture, data models) + code-reviewer persona + polite framing + temperature 1.0
trimmed-plus-practices	Trimmed `/init` (~200 tokens: just build commands + file structure) + code-reviewer persona + polite framing + temperature 1.0
init-raw	Full `/init` output only, no additional techniques
init-trimmed	Trimmed `/init` output only, no additional techniques

Each variant was run across:

3 models: Haiku, Sonnet, Opus
12 Python tasks: 4 code-gen, 3 bug-fix, 3 refactor, 2 instruction-following
30 repetitions per model-task-variant combination

That’s 4 variants × 3 models × 12 tasks × 30 reps = 4,320 runs (1,080 per variant).
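As a sanity check on the run count, here is a minimal sketch that enumerates the experiment grid. The variant and model names come from this post; the repetition count is derived from the N = 1,080 per variant reported in the scoreboard rather than hard-coded, and the helper names are hypothetical (they are not part of claude-benchmark).

```python
from itertools import product

VARIANTS = ["init-plus-practices", "trimmed-plus-practices", "init-raw", "init-trimmed"]
MODELS = ["Haiku", "Sonnet", "Opus"]
N_TASKS = 12  # 4 code-gen + 3 bug-fix + 3 refactor + 2 instruction-following


def reps_from_n(runs_per_variant):
    """Repetitions per model-task cell implied by the per-variant run count."""
    return runs_per_variant // (len(MODELS) * N_TASKS)


def enumerate_runs(reps):
    """Yield one (variant, model, task_index, rep_index) tuple per benchmark run."""
    return product(VARIANTS, MODELS, range(N_TASKS), range(reps))


reps = reps_from_n(1080)  # N per variant, as reported in the scoreboard
total_runs = sum(1 for _ in enumerate_runs(reps))
print(reps, total_runs)
```

With N = 1,080 per variant, the grid works out to 30 repetitions per cell and 4,320 runs in total.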

The Scoreboard

Figure: Executive summary of overall performance across all tested variants.

Here are the overall results:

Variant	Score	N
init-plus-practices	89.66	1080
trimmed-plus-practices	88.95	1080
init-raw	88.03	1080
init-trimmed	87.21	1080
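The gaps between variants can be recomputed directly from the scoreboard. The sketch below uses only the four rounded scores in the table; the function names are illustrative.

```python
SCORES = {
    "init-plus-practices": 89.66,
    "trimmed-plus-practices": 88.95,
    "init-raw": 88.03,
    "init-trimmed": 87.21,
}


def delta(a, b):
    """Score difference a - b, rounded to the table's two decimals."""
    return round(SCORES[a] - SCORES[b], 2)


def ranked():
    """Variant names ordered from best to worst score."""
    return sorted(SCORES, key=SCORES.get, reverse=True)


for name in ranked():
    print(f"{name:24s} {SCORES[name]:.2f}  vs init-raw: {delta(name, 'init-raw'):+.2f}")
```

Because the inputs are already rounded, these differences can be off by 0.01 relative to differences computed from the raw per-run data.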

For reference, the “bare” baseline from previous experiments is ~88.0. The best-practices stack (code-reviewer persona + polite framing + temperature 1.0) from the capstone averaged 89.16 across all experiments.

Init-plus-practices beats both. 89.66 is the highest score we’ve measured in any experiment.

The /init + Persona Synergy

Init-plus-practices scores 89.66, beating init-raw by +1.63 points (p=0.0001). This is the strongest single variant we’ve ever measured. The combination of project-specific context from /init and behavioral framing from the code-reviewer persona creates measurable synergy.
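This post does not say which statistical test produced the p-value. For readers who want to check a difference like this on their own per-run scores, one option is a two-sided permutation test on the difference in means. The sketch below is an assumption about how such a check could look, not the method used in the benchmark.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means of samples a and b.

    Returns the share of random relabelings whose absolute mean difference is
    at least as large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(_mean(perm_a) - _mean(perm_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

Pass it the per-run scores of two variants (for example, the 1,080 scores for each of two variants from the open-source results). The smallest p-value it can report is 1 / (n_permutations + 1), so resolving p = 0.0001 needs at least 10,000 permutations.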

The gains are roughly additive rather than multiplicative. Init-raw (88.03) is on par with the bare baseline (~88.0), so the ~600 tokens of /init context cost nothing in quality, and layering the best-practices stack on top adds another +1.63 points.

The persona sweep showed that code-reviewer persona alone adds ~1.2 points. The politeness sweep showed that polite framing adds ~0.5 points. Here, we see those gains stack cleanly on top of /init output without interference.
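The stacking claim can be checked with simple arithmetic: compare the sum of the individual gains quoted above with the gain observed here. The individual gains are the approximate figures from the earlier sweeps; the separate effect of temperature 1.0 is not quoted in this post, so it is left out of this sketch.

```python
PERSONA_GAIN = 1.2  # approximate, from the persona sweep as quoted above
POLITE_GAIN = 0.5   # approximate, from the politeness sweep as quoted above
INIT_RAW = 88.03
INIT_PLUS_PRACTICES = 89.66

expected_additive = PERSONA_GAIN + POLITE_GAIN  # gain if the effects simply add
observed = INIT_PLUS_PRACTICES - INIT_RAW       # gain measured in this experiment
shortfall = expected_additive - observed

print(f"expected {expected_additive:.2f}, observed {observed:.2f}, shortfall {shortfall:.2f}")
```

The observed gain (about 1.63) lands within 0.07 points of the additive expectation (about 1.7), which is what stacking without interference would look like.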

Why Trimming Hurts

The worst variant is init-trimmed (87.21), which uses only the trimmed /init output (build commands + file structure) with no additional techniques.

Removing architecture context from /init output actively hurts quality. Init-trimmed (87.21) is 0.82 points worse than init-raw (88.03). Trimming removes useful signal. This challenges the “less is more” finding from previous experiments.

The key insight: /init output is project-specific context, NOT generic style rules.

The r=-0.95 correlation we’ve seen across every experiment (more prompt tokens = worse code) applies to generic instructions—style guidelines, process rules, tone framing. Those tokens create cognitive load without adding task-specific information.

But /init output is different. It contains:

Build commands specific to your project
Architecture patterns actually used in your codebase
Data models the code needs to interact with
File paths and directory structure

This is factual, task-relevant context. It doesn’t create cognitive load—it reduces it by giving the model the information it needs to write correct code on the first try.

Project Context vs Generic Rules

The Token Budget Distinction

/init adds ~600 tokens of project context. Those tokens don’t interfere with code quality because they contain factual information the model needs.

Generic style rules (verbose instructions, process guidelines, tone framing) add similar token counts but actively degrade quality because they create cognitive overhead without adding task-specific signal.

600 tokens of project facts help. 600 tokens of generic rules hurt. The distinction matters.

Evidence: init-raw (88.03) matches or slightly edges out the bare baseline (~88.0) despite adding ~600 tokens, while init-trimmed (87.21) falls below it. Keeping the full /init context holds quality. Trimming it hurts.

This reconciles the apparent contradiction between “less is more” and “add /init context.” The principle is: minimize generic instructions, maximize project-specific facts.
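To apply this principle to an existing CLAUDE.md, it helps to know roughly how many tokens go to project facts versus generic rules. The sketch below uses the rough rule of thumb of about 4 characters per token for English text; that ratio is an assumption, not a real tokenizer, and the function names are hypothetical.

```python
def estimate_tokens(text):
    """Rough token estimate, assuming about 4 characters per token."""
    if not text:
        return 0
    return max(1, round(len(text) / 4))


def budget_report(project_context, generic_rules):
    """Approximate token counts for the two kinds of CLAUDE.md content."""
    ctx = estimate_tokens(project_context)
    rules = estimate_tokens(generic_rules)
    total = ctx + rules
    return {
        "project_context_tokens": ctx,
        "generic_rule_tokens": rules,
        "generic_share": rules / total if total else 0.0,
    }
```

A high `generic_share` suggests trimming style and process rules first, while leaving the /init-derived facts alone.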

The compression experiment showed that aggressive token reduction (2000 tokens → 300 tokens) improves quality when you’re cutting generic fluff. But cutting project context does the opposite.

Practical Template

Here’s the recommended CLAUDE.md structure combining /init output with one-line persona framing:

# CLAUDE.md

## Project Context

[Full /init output here: build commands, architecture, data models, file structure]

## Agent Behavior

You are a meticulous code reviewer. Focus on correctness, edge cases, and maintainability in every line you write.

That’s it. Full /init output + one-line persona. No verbose style guidelines. No process rules. No tone framing beyond the persona.
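The template above can be assembled with a few lines of Python. This is a minimal sketch that assumes the /init output has been saved to a separate file; the input path and function names are hypothetical, and the persona line is copied from the template.

```python
from pathlib import Path

PERSONA = (
    "You are a meticulous code reviewer. Focus on correctness, edge cases, "
    "and maintainability in every line you write."
)


def build_claude_md(init_output):
    """Wrap /init output and the one-line persona in the template structure."""
    body = init_output.strip()
    return (
        "# CLAUDE.md\n\n"
        "## Project Context\n\n"
        f"{body}\n\n"
        "## Agent Behavior\n\n"
        f"{PERSONA}\n"
    )


def write_claude_md(init_path, out_path="CLAUDE.md"):
    """Read saved /init output from init_path and write the combined file."""
    text = build_claude_md(Path(init_path).read_text(encoding="utf-8"))
    Path(out_path).write_text(text, encoding="utf-8")
    return text
```

If the /init output already begins with its own top-level heading, strip that heading before passing the text in, so the result does not contain two titles.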

This structure gives Claude:

The project-specific facts it needs (build commands, architecture)
The behavioral framing that improves quality (code-reviewer mindset)
Minimal cognitive overhead (no unnecessary instructions)

For the complete best-practices stack (polite framing + temperature 1.0), see the capstone experiment.

Figure: Token efficiency, quality vs cost tradeoff (scatter plot).

Figure: Detailed breakdown (heatmap).

Figure: Multi-dimensional view (radar charts).

Figure: Statistical significance analysis.

Limitations

Python-only. These benchmarks tested Python tasks exclusively. Cross-language validation (JavaScript, Go, Rust) is pending. The /init output varies significantly by language ecosystem, so these results may not generalize.
12 tasks tested. The task suite covers code-gen, bug-fix, refactor, and instruction-following, but it’s not exhaustive. Edge cases (complex refactoring, multi-file changes, architectural decisions) may behave differently.
Single /init version. We tested the current /init implementation. Future versions may generate different output with different token budgets.
No hybrid trimming tested. We tested full /init vs heavily-trimmed /init. There may be an optimal middle ground (e.g., keep architecture + data models, trim file listings) that we didn’t explore.
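For anyone who wants to explore the untested middle ground, here is a minimal sketch of section-level trimming. It assumes the /init output is markdown organized under level-2 headings, and the section names in the usage note are hypothetical; real /init output may be structured differently.

```python
import re


def split_sections(markdown):
    """Split markdown into a preamble and (heading, body) pairs at '## ' headings."""
    parts = re.split(r"(?m)^## +(.+)$", markdown)
    preamble = parts[0]
    sections = [(parts[i].strip(), parts[i + 1]) for i in range(1, len(parts), 2)]
    return preamble, sections


def keep_sections(markdown, keep):
    """Return the markdown with only the level-2 sections named in keep."""
    preamble, sections = split_sections(markdown)
    wanted = {name.lower() for name in keep}
    kept = [f"## {heading}{body}" for heading, body in sections if heading.lower() in wanted]
    return preamble + "".join(kept)
```

For example, `keep_sections(init_text, ["Architecture", "Data Models"])` would drop a file-listing section while keeping the rest. Deeper headings (`###`) stay with their parent section.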

Try It Yourself

The /init experiment configuration and all data are open source:

pip install claude-benchmark

# Run the /init experiment
claude-benchmark experiment run experiments/init-vs-best-practices.toml

# Generate the report
claude-benchmark report

# See the results
open results/*/report.html

You can also test this on your own codebase. Run /init in Claude Code, copy the output to your CLAUDE.md, add the one-line code-reviewer persona, and measure the difference in output quality for your specific tasks.

Full experiment configuration: experiments/init-vs-best-practices.toml

For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.
