March 17, 2026 AI Tooling

Saying “Please” to Claude Produces Better Code. 810 Benchmarks Exposed the Gap.

Johnathen Chilcher Senior SRE, TechLoom

There’s a running joke in the AI community about people who say “please” and “thank you” to their LLMs. The implication is that it’s sentimental, anthropomorphizing—treating a statistical model like a colleague. The pragmatic developer just fires off terse commands. “Refactor this.” “Fix the bug.” No pleasantries, no wasted tokens.

I was in the pragmatic camp. Then I ran 810 benchmarks.

It turns out the polite people were onto something. Not for the reasons they think, and not in the way you’d expect, but the data is clear: how you frame a prompt affects code quality more than what you put in your CLAUDE.md. The surprise? It’s not about saying “please” specifically—it’s about warmth. Effusive and polite prompts scored identically (p=0.67). The real divide is between warm framing and terse commands.

Experiment Design

This is the third experiment in the claude-benchmark series. The first study tested CLAUDE.md instruction content across 1,188 runs. The second tested chain-of-thought prompting across 270 runs. This one holds instruction content constant and isolates a variable nobody seems to measure: tone.

The Three Variants

Every variant uses the same task prompt. Only the framing changes:

Variant	What It Sounds Like
bare-imperative	Terse, command-style. “Refactor this function.” No context, no courtesy. The way most developers actually prompt.
polite	Standard professional tone. “Could you please refactor this function?” Adds courtesy and framing without changing the ask.
effusive	Enthusiastic, encouraging. “You’re excellent at writing clean, well-structured code. I’d love your help refactoring this function.” Positive reinforcement and confidence framing.

The Test Matrix

Dimension	Values
Models	Claude Haiku 4.5, Claude Sonnet 4.6, Claude Opus 4.6
Profile	empty (no CLAUDE.md)
Tasks	bug-fix-01, code-gen-01, refactor-01
Variants	bare-imperative, polite, effusive
Repetitions	30 per combination
Total runs	810

Same structure as the CoT experiment but with 3× the repetitions: 30 runs per combination, empty profile to isolate the variable, same three tasks. All that changed is the wrapper around the prompt. The higher rep count gives us enough statistical power to distinguish real effects from noise.

Results: Tone Moves the Needle

Overall composite scores, averaged across all models and tasks:

Executive summary showing effusive as best variant at 95.2 composite score, bare-imperative control at 93.7, a +1.5 point improvement

Executive summary from the benchmark report: effusive leads at 95.2, bare-imperative trails at 93.7.

Variant	Composite Score	Std Dev	Min	Max
effusive	95.21	6.45	54.3	98.5
polite	94.99	6.10	62.4	98.5
bare-imperative	93.68	9.29	41.1	98.5

Warm framing beat terse commands by ~1.5 points. But here’s the nuance 810 runs revealed: effusive and polite are statistically indistinguishable (p=0.67, Welch’s t-test). The real divide isn’t between “please” and enthusiastic praise—it’s between any warm framing and bare commands. Effusive vs. bare-imperative: p=0.027. Polite vs. bare-imperative: p=0.051.

The average difference is smaller than the original 270-run subset suggested, but the pattern is more reliable at 810 runs. Look at the standard deviations. Bare-imperative runs had a spread of 9.29—roughly 50% wider than polite’s 6.10. And the minimum scores tell the real story: bare-imperative’s worst run scored 41.1. Polite’s worst was 62.4. Effusive’s worst was 54.3.

All three variants hit the same ceiling (98.5). The difference is entirely in the floor. And notably, polite framing has the best floor of all—62.4 vs. effusive’s 54.3. If you care about worst-case reliability, simple courtesy beats enthusiastic praise.

Model-Level Breakdown

Radar charts comparing variant scores across metrics for Haiku, Opus, and Sonnet models

Per-model radar charts: Haiku is nearly identical across variants, Sonnet shows the clearest separation, Opus has the most variance.

Model	bare-imperative	polite	effusive
Haiku	96.40	96.40	96.40
Sonnet	94.19	95.29	96.11
Opus	91.00	93.28	93.13

Three very different stories here.

Haiku doesn’t care. It scored ~96.4 regardless of whether you barked orders at it or showered it with praise. Essentially zero spread across all three variants. Haiku is the honey badger of Claude models—it just does its thing.

Sonnet shows the cleanest signal. A 1.92-point spread from bare to effusive, with low variance across runs. Sonnet reliably produces better code when prompted warmly. This is the most actionable finding in the experiment.

Opus confirms the warm-vs-terse divide. A 2.28-point gap between bare-imperative (91.00) and polite (93.28), but notice that effusive (93.13) and polite are virtually identical. For Opus, just being warm is enough—cranking up the enthusiasm doesn’t buy extra quality. Opus also had the highest variance across runs, so the signal is noisier at the individual-run level.

Task-Level Breakdown

Heatmap showing variant scores broken down by task, with refactor-01 showing the largest differences

Variant x Task heatmap: refactor-01 is the only task where tone meaningfully moves the score.

Variant	bug-fix-01	code-gen-01	refactor-01
effusive	98.50	97.80	89.34
polite	98.40	97.10	89.49
bare-imperative	97.90	97.05	86.09

Same pattern as every experiment in this series: bug-fix and code-gen are nearly saturated. The models nail these tasks regardless of how you ask. The differentiation happens on refactor-01—the only task complex enough to be sensitive to prompting strategy.

On refactoring, warm framing beat bare-imperative by ~3.3 points (effusive 89.34, polite 89.49 vs. bare 86.09). Again, effusive and polite are virtually tied. That’s the largest single-variable effect I’ve measured across three experiments and over 1,300 total benchmark runs. Tone matters most when the task requires judgment, not just execution.

Why Tone Affects Code Quality

This isn’t about the model having feelings. It’s about what different framings activate in the training distribution.

Think about where these models learned to code. The training data includes millions of interactions: Stack Overflow answers, code reviews, pair programming transcripts, tutorial exchanges, mentorship conversations. The quality of code in those interactions correlates with the tone of the surrounding context.

When someone writes “just fix it,” the associated code in the training data tends to be quick patches, workarounds, drive-by fixes. When someone writes “I’d really appreciate a clean, well-structured approach to this,” the associated code tends to be thoughtful, reviewed, production-grade. The model isn’t responding to politeness. It’s pattern-matching on the contextual signal that predicts code quality.

The Floor Effect

The most important finding isn’t the 2-point average difference. It’s the floor. Look at the minimums again:

Variant	Worst Run	Std Dev
bare-imperative	41.1	9.29
effusive	54.3	6.45
polite	62.4	6.10

Bare-imperative’s worst run was a 41.1. That’s a catastrophic failure—code that barely functions. With warm framing, the worst runs were 13–21 points higher. And here’s an interesting twist: polite had the best floor (62.4) and the tightest spread (6.10). Effusive’s floor was actually lower (54.3). If you’re optimizing for worst-case reliability rather than average score, simple professional courtesy beats enthusiastic praise. The ceiling was the same across all three. Tone doesn’t make good code better. It makes bad code less bad. It’s a guardrail.

This is the same pattern the CLAUDE.md study found with instructions: they raise the floor more than the average. And the same pattern the CoT study found with reasoning scaffolds (though CoT reduced variance by pulling down the ceiling, which is worse). Tone is the first variable I’ve tested that raises the floor without lowering the ceiling.

Tone is the only prompting variable in this series that improves consistency without sacrificing peak performance. Instructions and CoT both introduced tradeoffs. Warm framing is a free lunch—or as close to one as I’ve found.

Statistical Significance

With 810 runs (270 per variant), we have enough power to draw real conclusions. Welch’s t-test results:

Statistical comparison table showing p-values and significance for all variant pairs

Full statistical comparison from the benchmark report, including effect sizes and confidence intervals.

Comparison	Mean Difference	p-value	Significant?
effusive vs. bare-imperative	+1.53	0.027	Yes (p < 0.05)
polite vs. bare-imperative	+1.31	0.051	Marginal
effusive vs. polite	+0.22	0.67	No

The effusive-vs-polite comparison is the key finding that only emerged with 810 runs. At 270 runs, the gap looked meaningful. At 810, it’s noise. The real story is warm framing vs. terse commands—not the specific flavor of warmth.

How This Fits the Bigger Picture

Three experiments, 1,350 runs, one consistent hierarchy:

Variable	Effect Size	Direction	Source
Model selection (best vs. worst)	~2.5 points	Positive	CLAUDE.md study
Prompt tone (warm vs. bare)	~1.3–1.5 points	Positive	This study (810 runs)
CLAUDE.md instructions (best vs. worst)	~1.4 points	Mixed	CLAUDE.md study
Chain-of-thought (bare vs. CoT)	~0.5–1.1 points	Negative	CoT study

Scatter plot showing token efficiency across variants, with effusive winning on both quality and token usage

Token efficiency scatter: warm framing doesn’t cost extra tokens relative to its quality gains.

Prompt tone is the second most impactful variable I’ve measured, behind only model selection. It has a comparable effect to the content of your CLAUDE.md instructions, and unlike chain-of-thought prompting, it actually helps instead of hurting.

If you’re spending time curating your CLAUDE.md but firing off terse commands in your prompts, you’re optimizing the wrong thing. And the good news from 810 runs: you don’t need to write effusive praise. A simple “Could you please…” gets you the same result.

What This Means for Your Prompts

Add warmth to your prompts, especially for complex tasks. “Could you please restructure this for readability?” will outperform “Refactor this” more often than not—and it performs just as well as effusive praise. The effect is largest on tasks that require judgment (refactoring, architecture decisions) and negligible on mechanical tasks (bug fixes, simple code generation).
Don’t bother with elaborate praise. 810 runs proved that effusive and polite prompts are statistically identical (p=0.67). You don’t need to write “You’re excellent at writing clean, well-structured code.” Just say “please.” The warmth signal is binary: present or absent.
This matters most for Sonnet. If you’re routing between models, Sonnet showed the cleanest, most reliable improvement from warm framing. Haiku doesn’t care. Opus benefits but with more variance.
Think of tone as a consistency tool, not a quality booster. You’re not making the model smarter. You’re preventing the worst-case runs where it phones it in. Polite framing has the best floor (62.4 min) and tightest spread (stdev 6.10)—if reliability matters, simple courtesy is the best bet.
Combine with the other findings. The optimal prompt from this entire series of experiments is: no CLAUDE.md (or minimal, targeted instructions), no chain-of-thought scaffolding, warm and specific framing. Let the model’s training do the work. Just set the right tone—and “please” is enough.

Limitations

Three tasks, same saturation problem. Two of the three tasks are nearly saturated—the models ace them regardless. The real signal comes from refactor-01. A broader task set with more challenging problems would give a fuller picture.
Tone is hard to parameterize precisely. “Effusive” and “polite” are fuzzy categories. The specific wording of each variant matters, and different phrasings within the same tone category might produce different results. This tests three specific framings, not the entire space of possible tones.
Single-turn only. Real development involves multi-turn conversations where tone accumulates over an exchange. The effect of sustained warmth (or sustained terseness) over a long session could be larger or smaller than what single-turn benchmarks capture.
No interaction with instructions. All runs used the empty profile. Tone might interact with CLAUDE.md content—warm framing could amplify the effect of good instructions or compensate for bad ones. That’s a separate experiment.
Opus variance is high. The Opus results are directionally consistent but noisy. Even with 30 reps per combination, Opus had the widest standard deviations. The aggregate trend is clear, but any single Opus run is unpredictable.

Try It Yourself

The experiment configuration and all 810 runs are available in the claude-benchmark repository. Run the politeness sweep against your own tasks:

git clone https://github.com/jchilcher/claude-benchmark.git
cd claude-benchmark
pip install -e .

# Run the politeness sweep experiment
claude-benchmark experiment run experiments/politeness-sweep.yaml

# Generate the report
claude-benchmark experiment report results/experiment-politeness-sweep-*/

Or define your own variants. Test different tones, different phrasings, different levels of enthusiasm. If you find a framing that beats effusive, or a task where terse commands win, I want to see the data. That boundary is where the interesting questions live.

Want to benchmark your own AI coding workflows?

I help teams measure and optimize their AI-assisted development processes—from prompt engineering to infrastructure automation. Get a free assessment to find where data-driven decisions can improve your output quality.

Get Your Free Assessment