June 4, 2026 7 min read

Compressed CLAUDE.md Rules Beat Verbose Instructions and Caveman Encoding (15,120 Benchmarks)

Johnathen Chilcher Senior SRE, TechLoom

Should you write detailed CLAUDE.md instructions or compress them down to key phrases? The conventional wisdom says “be clear and explicit.” The Brussee CLAUDE.md toolkit says “compress everything, remove vowels, abbreviate ruthlessly.” What does the data say?

I ran 15,120 benchmarks to test seven compression strategies ranging from verbose instructions to ultra-compressed caveman encoding. The winner is structured compression at around 80 tokens per section. The loser is caveman-style compression, which barely beats having no instructions at all. The compression curve is non-linear, and token savings don’t translate to quality improvements once you sacrifice readability.

This is the twenty-third experiment in our benchmark series, and it’s particularly relevant to the targeted constraints experiment, which showed that focused, specific instructions beat generic advice. Both experiments point to the same conclusion: structure and precision matter more than volume.

Experiment Design

I tested seven CLAUDE.md variants with different compression strategies:

Variant	Tokens	Example
bare	0	No CLAUDE.md instructions. Pure baseline.
terse-generic	~50	“Write clean code. Add tests. Handle errors. Document functions.”
compressed-rules	~80	“Functions: single responsibility. Tests: edge cases + happy path. Errors: explicit handling, no silent failures.”
code-adapted	~90	Compressed prose + preserved code examples. “Error handling: `if err != nil { return err }`“
verbose-rules	~200	“Each function should have a single, well-defined responsibility. Write comprehensive tests covering both edge cases and the happy path. Handle all errors explicitly with proper context, never silently failing.”
caveman-full	~40	“wrt cln cd. add tsts. hndl errs. doc fncs.”
caveman-ultra	~25	“cln cd+tst+err hndl→prod rdy”

Each variant was run across:

3 models: Haiku, Sonnet, Opus
48 tasks: 12 per language (Python, Go, TypeScript, Rust)
15 repetitions per model-task-variant combination

That’s 7 variants × 3 models × 48 tasks × 15 reps = 15,120 runs.

The Scoreboard

Executive summary showing overall scores

Executive summary: overall performance across all tested variants.

Here are the overall results:

Variant	Score	Delta vs Bare
compressed-rules	73.72	+3.36
code-adapted	73.08	+2.72
terse-generic	72.65	+2.29
verbose-rules	72.52	+2.16
caveman-ultra	70.67	+0.31
caveman-full	70.59	+0.23
bare	70.36	—

Compressed-rules wins by 3.36 points over bare and beats verbose-rules by 1.2 points. The caveman variants barely outperform having no instructions at all.

The Compression Sweet Spot

The data reveals a non-linear compression curve:

The quality peak is around 80 tokens of structured, readable instructions. Go shorter and you lose clarity. Go longer and you introduce noise. Doubling from 80 to 200 tokens makes quality slightly worse, not better. The model extracts signal more effectively from concise, well-structured text than from verbose explanations.

The curve looks like this:

0 tokens (bare): 70.36
~25-40 tokens (caveman): 70.59-70.67 (+0.23 to +0.31)
~50 tokens (terse-generic): 72.65 (+2.29)
~80 tokens (compressed-rules): 73.72 (+3.36) ← PEAK
~200 tokens (verbose-rules): 72.52 (+2.16)

There’s a jump from caveman to terse-generic (+2 points), another jump to compressed-rules (+1 point), and then a slight drop to verbose-rules. The 80-token sweet spot is real.

Caveman Encoding Backfires

Brussee-style compression barely beats having no instructions. caveman-full (70.59) and caveman-ultra (70.67) are +0.23 and +0.31 above bare (70.36). That’s within measurement noise. The model can’t reliably parse heavily abbreviated text. Removing vowels, articles, and prepositions destroys semantic clarity faster than it saves tokens.

The Brussee CLAUDE.md toolkit recommends aggressive compression: “wrt cln cd”, “hndl errs”, “doc fncs”. The philosophy is that tokens are scarce and the model can infer meaning from context. The data says otherwise. Caveman encoding gains you nothing in quality and costs you maintainability.

The worst performer is caveman-full at 70.59, which is barely distinguishable from bare at 70.36. If you’re going to write instructions, make them readable.

Model-Specific Results

Radar charts comparing performance

Radar charts: multi-dimensional performance comparison.

Heatmap showing detailed breakdown

Heatmap: detailed performance breakdown across dimensions.

Different models have different compression preferences:

Model	Bare	Caveman-Full	Caveman-Ultra	Terse-Generic	Compressed-Rules	Code-Adapted	Verbose-Rules
Haiku	60.58	59.90	61.67	63.51	63.68	62.14	64.40
Sonnet	72.55	72.65	71.31	74.88	76.07	75.38	73.07
Opus	77.95	79.23	79.04	79.56	81.42	81.72	80.09

The pattern is clear:

Haiku (low-tier model): verbose-rules wins (64.40). The smallest model benefits from explicit, detailed instructions. Compression hurts it.
Sonnet (mid-tier model): compressed-rules wins clearly (76.07). This is where the compression sweet spot shines. Structured, concise instructions outperform both verbose and caveman.
Opus (high-tier model): code-adapted wins (81.72). The most capable model benefits from preserving code examples while compressing prose. It can infer context from technical artifacts better than from natural language.

If you’re optimizing for Haiku, keep instructions verbose. If you’re optimizing for Sonnet, compress to structured rules. If you’re optimizing for Opus, preserve code examples and compress the prose around them.

The Code-Adapted Strategy

Code-adapted is the consistent runner-up at 73.08 overall, and it wins on Opus at 81.72. What is it?

Code-adapted means: compress prose instructions, but preserve code examples and technical standards verbatim. Don’t abbreviate function signatures, command-line flags, or error-handling patterns. Those are information-dense and hard to abbreviate without losing precision.

Code-Adapted (Good)

Error handling: always return errors with context.

if err != nil {
    return fmt.Errorf("operation failed: %w", err)
}

Caveman Compression (Bad)

err hndl: rtrn w/ cntxt.

if err!=nil{rtrn fmt.Errorf("op fail:%w",err)}

Code-Adapted (Good)

Testing: cover edge cases.

func TestParseInput(t *testing.T) {
    t.Run("empty string", func(t *testing.T) { ... })
    t.Run("invalid format", func(t *testing.T) { ... })
}

Verbose Compression (Bad)

Testing: Write comprehensive test coverage for all functions. Each test should cover edge cases like empty strings, invalid formats, nil pointers, and boundary conditions. Use table-driven tests where appropriate to reduce duplication and improve maintainability.

The code-adapted strategy works because code examples are unambiguous. “if err != nil” is clearer than any prose explanation of error handling. The model can pattern-match against concrete examples more reliably than it can parse natural language descriptions.

Language-Specific Breakdown

Do different languages benefit from different compression strategies?

Language	Bare	Compressed-Rules	Verbose-Rules	Code-Adapted
Python	72.14	74.89	74.22	75.01
Go	68.23	76.41	74.55	75.82
TypeScript	71.02	73.15	72.08	72.89
Rust	70.05	70.43	69.23	68.60

Go sees dramatic gains from instruction compression. The bare Go score is 68.23. Compressed-rules jumps to 76.41 (+8.18), and code-adapted hits 75.82 (+7.59). No other language shows gains this large. Go’s explicitness and idiom-heavy culture make structured instructions particularly valuable.

Rust is the opposite: bare (70.05) vs compressed-rules (70.43) is only +0.38. Rust tasks are hard, and the model struggles regardless of instruction quality. Code-adapted actually makes Rust worse (68.60), which suggests that Rust examples in CLAUDE.md files confuse the model more than they help.

Practical Implications

1. Aim for ~80 tokens of structured rules per section.

This is the compression sweet spot. Write clear, concise instructions using key phrases instead of full sentences. Structure them with labels (“Error handling:”, “Testing:”, “Functions:”). Avoid fluff words and generic advice.

Evidence: compressed-rules wins overall at 73.72, beating verbose-rules (72.52) and terse-generic (72.65).

2. Don’t use caveman or Brussee-style compression.

Removing vowels, articles, and prepositions destroys semantic clarity without improving quality. Caveman-full (70.59) and caveman-ultra (70.67) barely beat bare (70.36). The token savings are not worth the readability loss.

Evidence: caveman variants are +0.23 to +0.31 above bare, within measurement noise.

3. For Haiku: keep instructions verbose and explicit.

Low-tier models benefit from detailed instructions. Haiku’s best score is 64.40 with verbose-rules, compared to 63.68 with compressed-rules. The difference is small but consistent. If you’re optimizing for Haiku, don’t compress aggressively.

Evidence: verbose-rules wins on Haiku at 64.40 vs compressed-rules at 63.68.

4. Preserve code examples and technical standards verbatim.

Code examples are information-dense and unambiguous. Don’t abbreviate function signatures, command-line flags, or error-handling patterns. Compress the prose around them, but leave the code intact. This is the code-adapted strategy, and it wins on Opus (81.72) and is the runner-up overall (73.08).

Evidence: code-adapted wins on Opus at 81.72, beating compressed-rules (81.42) and verbose-rules (80.09).

Scatter plot showing quality vs efficiency

Token efficiency: quality vs cost tradeoff.

Statistical comparison

Statistical significance analysis.

Limitations

Token counts are approximate. We estimated token counts using whitespace heuristics rather than exact tokenization. The ~80 token sweet spot could be off by 10-20 tokens in either direction.
Claude models only. Other model families (GPT-4, Gemini, etc.) may have different compression preferences. But the pattern across Haiku, Sonnet, and Opus suggests this is a general property of transformer-based LLMs.
Single-file tasks. These are isolated coding tasks. CLAUDE.md instructions for multi-file projects, architectural standards, or deployment procedures may have different optimal compression levels.
No interaction with other strategies tested. We didn’t test compression combined with chain-of-thought, personas, or targeted constraints. Interaction effects could exist.
Compression strategies could be varied further. We tested seven variants, but there’s a large space of possible compression approaches. Structured bullets, numbered lists, markdown formatting, and other organizational strategies could perform differently.

Try It Yourself

The output compression experiment configuration and all data are open source:

pip install claude-benchmark

# Run the output compression experiment
claude-benchmark experiment run experiments/output-compression.toml

# Generate the report
claude-benchmark report

# See the results
open results/*/report.html

You can also define your own compression variants in the experiment TOML and test them against your specific codebase conventions. If you find a compression strategy that beats compressed-rules, I’d like to see the data.

Full experiment configuration: experiments/output-compression.toml

For the complete picture of what works in CLAUDE.md optimization, see the init vs best-practices experiment and the targeted constraints experiment. For the synthesis of all benchmark results, see the capstone-v2 experiment.

Experiment Design

The Scoreboard

The Compression Sweet Spot

Caveman Encoding Backfires

Model-Specific Results

The Code-Adapted Strategy

Code-Adapted (Good)

Caveman Compression (Bad)

Code-Adapted (Good)

Verbose Compression (Bad)

Language-Specific Breakdown

Practical Implications

1. Aim for ~80 tokens of structured rules per section.

2. Don’t use caveman or Brussee-style compression.

3. For Haiku: keep instructions verbose and explicit.

4. Preserve code examples and technical standards verbatim.

Limitations

Try It Yourself

Related reading

What an Effective AI Coding Prompt Actually Looks Like (Updated: 212,000+ Benchmarks Across 4 Languages)

Growth-Mindset Framing Beats Threats in AI Code Generation (15,120 Benchmarks)

We Tested Everything Across 4 Languages. Most Python Advice Doesn’t Transfer. (21,600 Benchmarks)