May 26, 2026

“Brief Docstrings” Saves 13% of Tokens Without Hurting Quality (4,320 Benchmarks)

Johnathen Chilcher, Senior SRE, TechLoom

Claude’s default docstrings are verbose. Functions get multi-line descriptions with Args, Returns, and Raises sections even when the code is self-explanatory. The question: is this verbosity helping, or just burning tokens?

I ran 4,320 benchmarks testing four docstring variants. The results: “keep docstrings brief” is quality-neutral (-0.32 points, not statistically significant) while saving ~13% of output tokens. But “no docstrings” goes too far: it drops quality by 2.27 points. Docstrings appear to serve as a reasoning scaffold. Remove them entirely and the model struggles.

This continues our benchmark series exploring what actually works in AI coding prompts. All data is open source at claude-benchmark.

Experiment Design

I tested four docstring variants:

Variant	Instruction
bare	No docstring instructions (model’s default behavior)
brief-docstrings	“Keep docstrings brief—one line for simple functions, at most a short description + args for complex ones”
no-docstrings	“Do not write docstrings”
verbose-docstrings	“Write comprehensive docstrings with descriptions, arguments, returns, raises, and examples”
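To make the variants concrete, here is a hand-written illustration of how the same function might look under the brief and verbose instructions. It is not benchmark output: the `clamp_brief` and `clamp_verbose` functions and both docstrings are invented for this example, and character count is used only as a rough stand-in for tokens.

```python
def clamp_brief(value: float, low: float, high: float) -> float:
    """Clamp value to the range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


def clamp_verbose(value: float, low: float, high: float) -> float:
    """Clamp a value so that it lies within a closed interval.

    Args:
        value: The number to clamp.
        low: The lower bound of the interval.
        high: The upper bound of the interval.

    Returns:
        ``low`` if ``value`` is below it, ``high`` if ``value`` is above
        it, otherwise ``value`` unchanged.

    Raises:
        ValueError: If ``low`` is greater than ``high``.

    Examples:
        >>> clamp_verbose(15, 0, 10)
        10
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


# Identical behavior; only the docstring size differs.
print(len(clamp_brief.__doc__), len(clamp_verbose.__doc__))
```

The two bodies are deliberately identical, so any difference in output length comes from the docstring alone.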

Each variant was run across:

3 models: Haiku, Sonnet, Opus
12 Python tasks: code generation, bug fixes, refactoring
30 repetitions per model-task-variant combination

That’s 4 variants × 3 models × 12 tasks × 30 reps = 4,320 runs.
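As a quick sanity check on that arithmetic, the full design grid can be enumerated in a few lines. This is only a sketch: the task names are placeholders (the 12 real tasks live in the benchmark repository), and the actual runner may schedule runs differently.

```python
from itertools import product

variants = ["bare", "brief-docstrings", "no-docstrings", "verbose-docstrings"]
models = ["Haiku", "Sonnet", "Opus"]
tasks = [f"task-{i:02d}" for i in range(1, 13)]  # placeholder task names
reps = range(30)

# One tuple per (variant, model, task, repetition) combination.
runs = list(product(variants, models, tasks, reps))
print(len(runs))  # 4320
```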

The Scoreboard

Executive summary: overall performance across all tested variants.

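Here are the overall results. N is the number of scored runs per variant; verbose-docstrings has 1,079, one short of the planned 1,080: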

Variant	Score	N	Delta from Bare
verbose-docstrings	85.20	1079	+0.36
bare	84.84	1080	—
brief-docstrings	84.52	1080	-0.32
no-docstrings	82.57	1080	-2.27
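The “Delta from Bare” column is each variant’s score minus the bare score. A minimal sketch that reproduces it from the scores in the table (the variable names are mine, not part of the benchmark tooling):

```python
scores = {
    "verbose-docstrings": 85.20,
    "bare": 84.84,
    "brief-docstrings": 84.52,
    "no-docstrings": 82.57,
}

baseline = scores["bare"]
# Round to two decimals to match the table and hide float noise.
deltas = {
    name: round(score - baseline, 2)
    for name, score in scores.items()
    if name != "bare"
}

for name, delta in deltas.items():
    print(f"{name:<20}{delta:+.2f}")
```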

Brief Docstrings: The Free Win

Brief docstrings score 84.52, just 0.32 points below bare. That difference is not statistically significant and sits well within noise, so quality is preserved. The token savings are real, though: brief docstrings produce ~19% less output than bare, and ~29% less than verbose docstrings. That’s a 13% net token savings with no measurable quality cost.

This is the rare free optimization. You’re explicitly instructing the model to constrain a specific verbose behavior (multi-line docstrings), and the model complies without degrading output quality. The docstrings are still there and still serve as reasoning scaffolds; they’re just shorter.

Compare this to generic “write clean code” rules, which hurt quality (r=-0.95 correlation between prompt tokens and code quality). Targeted behavioral constraints work. Generic coding rules don’t.

No Docstrings Goes Too Far

The no-docstrings variant scores 82.57, a 2.27-point drop from bare that is statistically significant (p < 0.05). Opus specifically drops 4.3 points without docstrings. Removing docstrings entirely degrades output quality.
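This post doesn’t say which statistical test produced the p-value. If you want to check a difference like this on your own per-run scores, one simple option is a two-sided permutation test. The sketch below is an assumption about how that could be done, not the benchmark’s method; `permutation_p_value` is a hypothetical helper and the score lists are toy numbers, not benchmark data.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in mean score."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign scores to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Toy numbers for illustration only; these are NOT benchmark results.
baseline = [85, 84, 86, 85, 83, 86, 84, 85, 86, 84]
no_docs = [80, 79, 81, 80, 78, 81, 79, 80, 81, 79]
print(permutation_p_value(baseline, no_docs) < 0.05)
```

A permutation test makes no normality assumption, which is convenient when scores are bounded or skewed, but it needs the raw per-run scores rather than the summary means shown in the table.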

Why? The likely explanation is that docstrings aren’t just documentation for humans; they’re reasoning scaffolds for the model. Writing a docstring forces the model to articulate what a function does before implementing it. That articulation improves the implementation.

Remove docstrings, and the model jumps straight to code without that intermediate reasoning step. The result: more bugs, less coherent logic, lower quality.

Verbose Docstrings: Highest Quality, Highest Cost

Verbose docstrings score 85.20, the highest of any variant (+0.36 from bare). But this comes at 29% more output tokens than bare, and 60% more than brief docstrings. If token cost doesn’t matter, verbose docstrings slightly help. But the quality gain is marginal.

The verbose-docstrings variant explicitly requests “comprehensive docstrings with descriptions, arguments, returns, raises, and examples.” The model complies, generating Python docstrings with full Sphinx-style sections. This adds reasoning overhead that slightly improves quality.

But the cost is real. Verbose docstrings burn tokens on output you may not need. If you’re paying per output token, or if long outputs slow down your workflow, the 0.36-point quality gain isn’t worth it.
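To see what those percentages mean in dollars, here is a rough sketch that scales a baseline output volume by the reported changes relative to bare (~19% less for brief, 29% more for verbose). The workload size and per-token price are hypothetical placeholders; substitute your own.

```python
# Output-token volume relative to bare, from the percentages reported above.
RELATIVE_OUTPUT = {
    "bare": 1.00,
    "brief-docstrings": 1.00 - 0.19,
    "verbose-docstrings": 1.00 + 0.29,
}


def output_cost(variant, baseline_tokens, price_per_million):
    """Estimated output cost in dollars for one variant."""
    tokens = baseline_tokens * RELATIVE_OUTPUT[variant]
    return tokens * price_per_million / 1_000_000


# Hypothetical workload and price, for illustration only.
BASELINE_TOKENS = 10_000_000
PRICE_PER_MILLION = 15.00

for variant in RELATIVE_OUTPUT:
    cost = output_cost(variant, BASELINE_TOKENS, PRICE_PER_MILLION)
    print(f"{variant:<20}${cost:,.2f}")
```

Because the reported savings are averages across tasks, treat the output as an order-of-magnitude estimate rather than a forecast.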

The Principle: Targeted Behavioral Constraints

This experiment demonstrates a key principle: targeted behavioral constraints work when they correct a specific observed behavior.

“Keep docstrings brief” is not a generic coding rule. It’s a targeted correction of Claude’s tendency to write verbose docstrings. You’re not saying “write clean code” (which the model already knows). You’re saying “this specific thing you do by default—stop doing it.”

This is different from the best-practices experiment, where generic coding rules (DRY, YAGNI, descriptive names) hurt quality. Those rules restate what the model already knows. Targeted constraints correct specific behaviors the model doesn’t know you dislike.

The compress-claude-md guide explores this principle in depth: replace verbose rules with one-line behavioral nudges that target specific model behaviors.

When to Use Targeted Constraints

1. Identify a specific Claude behavior you want to change

Watch for patterns: verbose docstrings, excessive error handling, over-engineered solutions, redundant comments. If Claude consistently does something you don’t want, that’s a candidate for a targeted constraint.

2. Write a one-line constraint addressing THAT behavior

Be specific. “Keep docstrings brief” is better than “write concise code.” The first targets a specific behavior. The second is a generic rule.

3. Don’t generalize into a list of coding rules

The temptation is to build a CLAUDE.md full of coding standards. Resist. Each rule you add is another token in the prompt. Generic rules hurt more than they help (r=-0.95).

4. Test the constraint—not all behavioral corrections are free

“Brief docstrings” is quality-neutral. “No docstrings” is not. Some constraints have trade-offs. Test before deploying.

Token efficiency: quality vs cost tradeoff.

Heatmap: detailed breakdown.

Radar charts: multi-dimensional view.

Statistical significance analysis.

Limitations

Python-only. This experiment tested Python tasks exclusively. Docstrings are language-specific. Other languages (JavaScript with JSDoc, TypeScript with TSDoc, Go with godoc) may behave differently.
Output token savings vary by task type. The 13% token savings is an average. Code-heavy tasks with fewer docstrings see smaller savings. Documentation-heavy tasks see larger savings.
No measurement of human comprehension. We measured code quality (correctness, maintainability) but not how well humans understand code with different docstring styles. Brief docstrings may be harder for junior developers to parse.
Single constraint tested. We didn’t test combinations (e.g., “brief docstrings + no type hints”) or other behavioral constraints. Interaction effects could exist.

Try It Yourself

The targeted constraints experiment configuration and all data are open source:

pip install claude-benchmark

# Run the targeted constraints experiment
claude-benchmark experiment run experiments/targeted-constraints.toml

# Generate the report
claude-benchmark report

# See the results
open results/*/report.html

You can also test your own targeted constraints. Add a variant to the experiment TOML with your specific behavioral instruction, run the benchmark, and see if it’s quality-neutral.

Full experiment configuration: experiments/targeted-constraints.toml

For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.