Does Temperature Actually Matter? 720 Runs Say Not Really
The Next Knob to Test
After I published “I Ran 540 Benchmarks to Prove CLAUDE.md Compression Works”, I got a recurring question: what about temperature?
Fair question. If compressing your CLAUDE.md doesn’t measurably hurt code quality, and verbose instruction profiles barely move the needle for generic coding tasks, then temperature is the next obvious variable to isolate. It’s the setting people actually fiddle with. Forum posts recommend setting it to 0 for “deterministic” coding. API wrappers default to all sorts of values. Everyone has an opinion; almost nobody has data.
So I ran the experiment. Same benchmark tool, same scoring pipeline, new variable.
Experiment Design
I used claude-benchmark to run a full temperature sweep with the following setup:
- Temperatures: 0.0, 0.3, 0.7, 1.0
- Model: Claude Sonnet 4.6
- Profile: Empty (no CLAUDE.md instructions)
- Tasks: 6 total — 2 bug-fix, 2 code-gen, 2 refactor
- Repetitions: 30 per task per temperature
- Total runs: 720
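The full run matrix can be sketched as a cross product of the settings above. The task names here are hypothetical placeholders, and claude-benchmark's actual invocation API may differ; this only illustrates where the 720 figure comes from.

```python
from itertools import product

# Experiment settings from the list above.
TEMPERATURES = [0.0, 0.3, 0.7, 1.0]
TASKS = ["bugfix-1", "bugfix-2", "codegen-1", "codegen-2",
         "refactor-1", "refactor-2"]  # hypothetical task ids
REPS = 30

# One entry per benchmark run: 4 temperatures x 6 tasks x 30 reps.
runs = [
    {"temperature": t, "task": task, "rep": r}
    for t, task, r in product(TEMPERATURES, TASKS, range(REPS))
]
print(len(runs))  # 720
```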
I used the empty profile deliberately. The previous benchmark showed that instruction content has minimal impact on generic task quality, and an empty profile actually scored highest across the board. Using it here isolates temperature as the only variable — no confounding effects from prompt content.
Scoring works the same way as before: automated test execution (pass/fail), lint checks, and an LLM quality judge, blended into a composite score on a 0–100 scale. Each of the 720 runs gets scored independently.
The Results: 0.9 Points of Nothing
Here are the overall composite scores, aggregated over 180 runs per temperature (6 tasks × 30 reps):
| Temperature | Mean | Median | Std Dev |
|---|---|---|---|
| 0.0 | 92.71 | 97.50 | 11.17 |
| 0.3 | 92.44 | 97.17 | 9.70 |
| 0.7 | 92.71 | 97.17 | 9.77 |
| 1.0 | 93.34 | 97.17 | 8.91 |
Bottom line: The difference between the best and worst mean score is 0.9 points across 180 runs per temperature. The medians are nearly identical. This is well within noise.
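One quick way to quantify "within noise" is a standardized effect size. Using the means and standard deviations from the table for the worst (temp 0.3) and best (temp 1.0) settings, a pooled-SD Cohen's d comes out around 0.1, conventionally a negligible-to-small effect:

```python
import math

def cohens_d(mean_a: float, sd_a: float,
             mean_b: float, sd_b: float) -> float:
    # Pooled-SD effect size for two equal-sized groups (n = 180 each).
    pooled = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled

# Worst vs. best mean from the table above (temp 0.3 vs. temp 1.0).
d = cohens_d(92.44, 9.70, 93.34, 8.91)
print(round(d, 2))  # 0.1
```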
Temperature 1.0 — Claude’s default — comes out marginally on top with the highest mean (93.34) and, counterintuitively, the lowest standard deviation (8.91). The “deterministic” setting at 0.0 doesn’t produce more consistent results. If anything, it’s slightly less consistent.
Task-Type Breakdown
The overall numbers are boring by design — that’s the finding. But the per-task-type breakdown has some texture worth examining.
Bug-Fix Tasks
All four temperatures score between 98 and 99 on bug-fix tasks. The model either finds the bug and fixes it, or it doesn’t, and temperature doesn’t change that outcome. Temp 1.0 has slightly more variance (stdev 4.65 vs ~1.5 for the others), meaning it occasionally whiffs on a fix that the deterministic setting nails, but the mean is effectively identical.
Bug-fix verdict: Temperature is irrelevant. All settings produce near-perfect scores.
Code Generation Tasks
This is where temp 0.0 has its best showing. Mean score at temp 0.0 is 98.02 compared to 95.53 at temp 1.0. For tasks with clear specifications and a single correct implementation path, deterministic sampling has a slight edge — it doesn’t explore alternatives that might introduce unnecessary complexity.
Code-gen nuance: Temp 0.0 edges ahead by ~2.5 points for straightforward generation tasks. This is the strongest per-category difference in the entire sweep, and it’s still small.
Refactoring Tasks
Here’s the interesting reversal. Temp 1.0 scores 86.52 on refactoring tasks versus 81.27 at temp 0.0 — a 5-point gap, the largest single difference in the experiment. Higher temperature produces better refactoring output.
This makes sense when you think about what refactoring involves. There’s no single “correct” refactoring — it’s about improving structure, readability, and design. These are inherently subjective qualities. The LLM quality judge rewards cleaner abstractions and more thoughtful restructuring, and higher temperature lets the model explore a wider solution space before committing to an approach.
Refactor verdict: Temp 1.0 wins by the widest margin of any task type. Creativity helps where there’s no single right answer.
Sub-Component Analysis
The composite score blends three signals. Breaking them apart tells you exactly where temperature does and doesn’t matter.
Test pass rate: 100% across the board. Every temperature, every task type, every run. The model produces code that passes its test suite regardless of sampling temperature. This is the most important finding for anyone worried about correctness — temperature doesn’t introduce bugs.
Lint scores barely move. The range is 96.92 (temp 1.0) to 97.86 (temp 0.0). Yes, slightly higher temperature produces marginally more lint warnings, but we’re talking about less than a single percentage point. Not actionable.
LLM quality judge prefers higher temperature. This is the driver behind the overall scores. The quality judge — which evaluates code structure, readability, and design quality — gives temp 1.0 the highest mean (91.60) with the lowest variance (stdev 16.90 vs 22.13 at temp 0.0). Higher temperature produces output that a quality-focused evaluator consistently rates better, and rates more consistently.
The variance surprise: Conventional wisdom says lower temperature = more consistent output. The data says the opposite for code quality scores. Temp 0.0 has the highest standard deviation overall (11.17) while temp 1.0 has the lowest (8.91).
Why This Makes Sense
Modern LLMs are robust to temperature for structured output tasks, and code is about as structured as output gets. The model has learned strong priors about what correct code looks like — syntax, patterns, idioms — and those priors dominate regardless of how much randomness you inject into the sampling.
Think of it this way: if you ask someone who deeply understands Python to write a function that reverses a linked list, they’ll write essentially the same code whether they’re feeling “creative” or “methodical.” The solution space for correct code is narrow enough that temperature doesn’t push the model outside it.
Where temperature does matter is in the subjective quality channel. Refactoring has many valid approaches. Higher temperature lets the model consider more of them before settling on one, and the result is often a more thoughtful restructuring. The “creativity” knob isn’t making the code more creative — it’s making the search over possible solutions broader, which helps when the quality landscape has multiple peaks.
Updated Recommendations
- Stick with the default (1.0). The data shows no reason to override Claude’s default temperature for coding tasks. It ties or wins in every category except narrow code generation, where the advantage of 0.0 is offset by worse refactoring scores.
- Don’t waste time tuning temperature for coding. The 0.9-point overall spread across 720 runs means temperature tuning has near-zero ROI. Spend that time on better task decomposition or clearer specifications instead.
- If you must pick a non-default value, avoid 0.0 for refactoring. Deterministic sampling costs you ~5 points on refactoring quality. If your workflow is refactor-heavy, the default is strictly better.
- Don’t trust the “set temp to 0 for coding” advice. It’s intuitive but unsupported. Deterministic sampling doesn’t produce more consistent results — it actually produces slightly less consistent ones across the full task spectrum.
- Correctness isn’t at stake. Tests pass at 100% regardless of temperature. This isn’t a safety-critical parameter.
The Bigger Picture
This is the third experiment in a series, and the findings keep converging on the same theme.
In the first post, I showed that compressing CLAUDE.md saves tokens without hurting quality. In the second, I found that instruction profile content barely moves the needle for generic coding tasks — the empty profile actually scored highest. Now, temperature joins the list of things that don’t meaningfully affect code output quality.
The pattern is clear: for standard coding tasks, the model’s training dominates the parameter space. CLAUDE.md content, instruction verbosity, sampling temperature — none of these knobs produce more than a point or two of movement on a 100-point scale. The model knows how to write code, and it writes similar-quality code regardless of how you configure these settings.
What actually matters: Task design, problem decomposition, and clear specifications move the needle far more than parameter tuning. Focus your optimization effort there.
That said, “generic coding tasks” is the key qualifier. These benchmarks test bug fixes, code generation, and refactoring on well-specified problems. If your use case involves ambiguous requirements, domain-specific conventions, or tasks where the “right” answer depends heavily on context, instruction content and temperature might matter more. I haven’t tested that yet.
If you want to run your own experiments — on your tasks, your prompts, your workflow — the claude-benchmark tool is open source. Define your tasks, pick your variables, and let the data tell you what matters for your specific use case.