“Trace 3 Examples” Is the Best Verification Instruction — But Only for JS and C# (10,800 Benchmarks)
Anthropic’s prompt engineering documentation calls verification “the single highest-leverage thing you can add to improve accuracy.” But WHAT kind of verification? We tested four verification strategies against a bare control across 10,800 benchmark runs, of which the 7,200 in Python, JavaScript, and C# are analyzed here. Only one stands out, and its benefit is language-specific.
“Trace through 3 concrete examples with your code before responding” delivers +3.2 points for JavaScript and +5.5 points for C#. But for Python? Neutral to slightly negative (-0.7pts). Meanwhile, the vague instruction “double-check your work” is the weakest non-bare variant overall (73.86, against 75.33 for the best), although it is not last in every individual language. Verification works best when you tell the model exactly WHAT to do.
This is the ninth experiment in our benchmark series, following the chain-of-thought, context pollution, model selection, politeness sweep, persona sweep, step-back, anchoring, and cross-language revisits studies. All data is open source at claude-benchmark.
Experiment Design
We tested five variants: a bare control plus four verification instructions.
| Variant | Description |
|---|---|
| bare | No verification instructions (control baseline) |
| verify-vague | “Double-check your work before responding” |
| verify-run-tests | “Run the tests mentally and verify your code passes” |
| verify-with-examples | “Trace through 3 concrete examples with your code before responding” |
| verify-fix-loop | “Check your output, identify any issues, and fix them before responding” |
Each variant was run across:
- 3 models: Haiku, Sonnet, Opus
- 48 tasks across Python, JavaScript, and C# (Go scoring was sparse and is excluded from analysis)
- 10 repetitions per model-task-variant combination
That’s 5 variants × 3 models × 48 tasks × 10 reps = 7,200 runs across the three main languages, with an additional 3,600 runs in Go that showed insufficient scoring coverage.*
* Go tasks had sparse test coverage, making score deltas unreliable. We focus on Python, JS, and C# where scoring is comprehensive.
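The run counts above can be reproduced with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are ours, and the implied Go task count is an inference that assumes the Go runs used the same variants × models × reps grid, which the write-up does not state.

```python
# Run-count arithmetic from the experiment design above.
variants = 5   # bare control + four verification instructions
models = 3     # Haiku, Sonnet, Opus
tasks = 48     # tasks across Python, JavaScript, and C#
reps = 10      # repetitions per model-task-variant combination

main_runs = variants * models * tasks * reps
go_runs = 3_600                      # reported separately, excluded from analysis
total_runs = main_runs + go_runs

# Assumption (not stated in the write-up): Go used the same grid,
# which would imply this many Go tasks.
implied_go_tasks = go_runs // (variants * models * reps)

print(main_runs, total_runs, implied_go_tasks)  # prints: 7200 10800 24
```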
The Scoreboard

[Figure: Executive summary of overall performance across all tested variants.]
Here are the overall results across Python, JavaScript, and C#:
| Variant | Python | JavaScript | C# | Overall | N |
|---|---|---|---|---|---|
| bare (control) | 87.34 | 66.94 | 63.81 | 72.70 | 1669 |
| verify-vague | 86.83 | 68.91 | 65.83 | 73.86 | 1659 |
| verify-run-tests | 87.88 | 68.63 | 67.65 | 74.72 | 1652 |
| verify-with-examples | 86.62 | 70.10 | 69.28 | 75.33 | 1650 |
| verify-fix-loop | 87.41 | 69.07 | 65.37 | 73.95 | 1651 |
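One detail worth knowing when reading the Overall column: it appears to be the unweighted mean of the three language scores, so each language counts equally regardless of N. The sketch below checks that reading against the published numbers; the dictionary layout and function name are ours, not part of claude-benchmark.

```python
from statistics import mean

# variant: (Python, JavaScript, C#, published Overall), copied from the scoreboard.
SCOREBOARD = {
    "bare":                 (87.34, 66.94, 63.81, 72.70),
    "verify-vague":         (86.83, 68.91, 65.83, 73.86),
    "verify-run-tests":     (87.88, 68.63, 67.65, 74.72),
    "verify-with-examples": (86.62, 70.10, 69.28, 75.33),
    "verify-fix-loop":      (87.41, 69.07, 65.37, 73.95),
}

def unweighted_overall(variant: str) -> float:
    """Mean of the three per-language scores, rounded like the table."""
    *per_language, _published = SCOREBOARD[variant]
    return round(mean(per_language), 2)

for name, row in SCOREBOARD.items():
    print(f"{name:22s} computed={unweighted_overall(name):.2f} published={row[3]:.2f}")
```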
And here are the deltas from the bare baseline:
| Variant | Python | JavaScript | C# |
|---|---|---|---|
| verify-vague | -0.51 | +1.97 | +2.02 |
| verify-run-tests | +0.54 | +1.69 | +3.84 |
| verify-with-examples | -0.72 | +3.15 | +5.46 |
| verify-fix-loop | +0.07 | +2.13 | +1.56 |
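The delta table can be recomputed directly from the scoreboard. A minimal sketch follows, with names of our own choosing. Two cells come out 0.01 higher than published when computed from the rounded scores (+3.16 and +5.47 for verify-with-examples), presumably because the published deltas were taken from unrounded data.

```python
LANGUAGES = ("Python", "JavaScript", "C#")

# Per-language scores copied from the scoreboard, in LANGUAGES order.
SCORES = {
    "bare":                 (87.34, 66.94, 63.81),
    "verify-vague":         (86.83, 68.91, 65.83),
    "verify-run-tests":     (87.88, 68.63, 67.65),
    "verify-with-examples": (86.62, 70.10, 69.28),
    "verify-fix-loop":      (87.41, 69.07, 65.37),
}

def deltas_from_bare(variant: str) -> tuple:
    """Score minus the bare baseline for each language, rounded to 2 places."""
    return tuple(
        round(score - base, 2)
        for score, base in zip(SCORES[variant], SCORES["bare"])
    )

for name in SCORES:
    if name != "bare":
        cells = ", ".join(
            f"{lang} {d:+.2f}" for lang, d in zip(LANGUAGES, deltas_from_bare(name))
        )
        print(f"{name}: {cells}")
```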
verify-with-examples Wins — For JS and C#
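verify-with-examples posts the largest gains of any variant for JavaScript (+3.15) and C# (+5.46). But look at Python: it scores 86.62 against a bare baseline of 87.34, a delta of -0.72 and the lowest Python score of any variant.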
This is a language-specific effect. Python is Claude’s strongest language across all our experiments. The baseline score is 20+ points higher than JS or C#. When the model is already highly accurate, verification adds overhead without corresponding benefit. For JS and C#, where baseline accuracy is lower, verification helps the model catch mistakes it would otherwise miss.
Why Vague Verification Doesn’t Work
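“Double-check your work before responding” is the weakest non-bare variant overall (73.86). It delivers +1.97pts for JS and +2.02pts for C#: better than nothing, but well short of verify-with-examples (+3.15 and +5.46). It is not last in every column, though: verify-run-tests is lower for JS (+1.69) and verify-fix-loop is lower for C# (+1.56).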
This maps to the broader pattern we’ve seen across experiments: specificity matters. The model performs best when you tell it exactly what to do. “Trace through 3 concrete examples” is specific. “Double-check your work” is not.
The Specificity Gradient
There’s a gradient from vague to specific verification. It is clearest for C#; for JS, verify-run-tests (+1.69) lands slightly below verify-vague (+1.97):
| Variant | Specificity | JS Delta | C# Delta |
|---|---|---|---|
| verify-vague | Low (“double-check”) | +1.97 | +2.02 |
| verify-run-tests | Medium (“run tests mentally”) | +1.69 | +3.84 |
| verify-with-examples | High (“trace 3 examples”) | +3.15 | +5.46 |
| verify-fix-loop | Very High (“check, identify, fix”) | +2.13 | +1.56 |
More specific is better—up to a point. verify-fix-loop is the most complex instruction (“Check your output, identify any issues, and fix them before responding”), but it underperforms verify-with-examples. Why?
verify-fix-loop asks the model to do three things: check, identify, fix. That’s three cognitive steps, each with potential failure modes. verify-with-examples asks for one thing: trace examples. One specific instruction beats three stacked instructions.
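To make the shape of the gradient concrete, the sketch below takes the published JS and C# deltas in the table's specificity order and reports where each language peaks, which variant is weakest, and whether the gains rise strictly with specificity. The function names are ours, and the specificity ordering is the table's own ranking rather than something measured.

```python
# Variants ordered from least to most specific, as in the table above.
ORDER = ["verify-vague", "verify-run-tests", "verify-with-examples", "verify-fix-loop"]

# Published deltas vs. bare, in ORDER.
DELTAS = {
    "JavaScript": [1.97, 1.69, 3.15, 2.13],
    "C#":         [2.02, 3.84, 5.46, 1.56],
}

def peak(values: list) -> str:
    """Variant with the largest gain."""
    return ORDER[values.index(max(values))]

def weakest(values: list) -> str:
    """Variant with the smallest gain."""
    return ORDER[values.index(min(values))]

def strictly_increasing(values: list) -> bool:
    """True only if every step up in specificity also raises the delta."""
    return all(a < b for a, b in zip(values, values[1:]))

for language, values in DELTAS.items():
    print(
        f"{language}: peak={peak(values)}, weakest={weakest(values)}, "
        f"strictly increasing={strictly_increasing(values)}"
    )
```

Both languages peak at verify-with-examples and neither is strictly increasing, which is the "up to a point" pattern described above.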
Connection to Multi-Turn and Persona
Verification is single-turn iteration. Instead of generating output, checking it, and then regenerating in a second turn, verification asks the model to do that loop internally before responding. This is cheaper (one API call instead of two) but less reliable (no external feedback loop).
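The cost difference is easy to see in code. The sketch below contrasts the two shapes using a stub in place of a real model client. Everything here is illustrative: `single_turn`, `two_turn`, and `CountingStub` are hypothetical names, the wording of the review prompt is our assumption, and none of it is taken from claude-benchmark.

```python
from typing import Callable

# The winning instruction from this experiment.
VERIFY = "Trace through 3 concrete examples with your code before responding."

Model = Callable[[str], str]  # any function that maps a prompt to a response

def single_turn(model: Model, task: str) -> str:
    """One call: the verification instruction rides along with the task."""
    return model(f"{task}\n\n{VERIFY}")

def two_turn(model: Model, task: str) -> str:
    """Two calls: generate a draft, then ask for a checked revision."""
    draft = model(task)
    review_prompt = (
        f"{task}\n\nDraft solution:\n{draft}\n\n"
        f"{VERIFY} Then return the corrected code."
    )
    return model(review_prompt)

class CountingStub:
    """Stand-in for a real model client; it only counts how often it is called."""
    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"<response {self.calls}>"

if __name__ == "__main__":
    one, two = CountingStub(), CountingStub()
    single_turn(one, "Write slugify(title).")
    two_turn(two, "Write slugify(title).")
    print(one.calls, two.calls)  # prints: 1 2
```

In a real two-turn setup the second prompt could also carry external feedback such as test output, which is the feedback loop the single-turn version lacks.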
Our persona sweep experiment found that the “meticulous code reviewer” persona delivered the largest quality gains. That persona primes the model to check its work before responding—essentially baked-in verification. And in our capstone v2 experiment, we found that verification is redundant with the code-reviewer persona. They capture the same benefit.
Practical Advice
For JavaScript and C#: Add “trace through 3 concrete examples” to your prompts
verify-with-examples delivers +3.2pts for JS and +5.5pts for C#. Of the four verification instructions we tested, it has the highest leverage. Add it to your CLAUDE.md or system prompt for JS/C# projects.
For Python: Skip verification (or use the code-reviewer persona)
Python’s baseline accuracy is already high (87.34). verify-with-examples adds overhead without benefit (-0.7pts). If you want verification behavior, use the code-reviewer persona instead—it’s language-agnostic and doesn’t require language-specific tuning.
Don’t use vague verification instructions
“Double-check your work” is the weakest verification variant overall. It trails verify-with-examples by 1.2 points for JS and 3.4 points for C#. If you’re going to add verification, be specific: “trace 3 examples” or “run tests mentally,” not “check your work.”
If you use the code-reviewer persona, skip explicit verification
The code-reviewer persona and verification capture the same benefit. From the capstone v2 experiment: persona alone delivers the same quality as persona + verification. Don’t stack them.
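The advice above boils down to a small decision rule. Here is one way it might look as a prompt-building helper. The function names and the `uses_reviewer_persona` flag are hypothetical, and returning no instruction for languages outside this experiment is our conservative assumption, not a finding.

```python
from typing import Optional

TRACE_EXAMPLES = "Trace through 3 concrete examples with your code before responding."

# Languages where verify-with-examples improved scores in this experiment.
HELPED_LANGUAGES = {"javascript", "c#"}

def verification_instruction(
    language: str, uses_reviewer_persona: bool = False
) -> Optional[str]:
    """Return the instruction to append, or None when the advice says to skip it."""
    if uses_reviewer_persona:
        return None  # the persona already captures the same benefit
    if language.strip().lower() in HELPED_LANGUAGES:
        return TRACE_EXAMPLES
    return None  # Python showed no benefit; other languages are untested (assumption)

def build_system_prompt(
    base: str, language: str, uses_reviewer_persona: bool = False
) -> str:
    """Append the verification instruction to a base system prompt when warranted."""
    extra = verification_instruction(language, uses_reviewer_persona)
    return base if extra is None else f"{base}\n\n{extra}"

print(build_system_prompt("You write production C# code.", "C#"))
```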

[Figure: Token efficiency, quality vs. cost tradeoff.]
[Figure: Heatmap, detailed breakdown.]
[Figure: Radar charts, multi-dimensional view.]
[Figure: Statistical significance analysis.]
Limitations
- Go scoring sparse. Go tasks had insufficient test coverage for reliable scoring. We excluded Go from analysis and focused on Python, JS, and C# where scoring is comprehensive.
- Python-centric variant design. The original benchmark suite was designed for Python. Task complexity and scoring may not be perfectly balanced across languages, though our cross-language revisits experiment showed consistent patterns.
- Claude models only. Other model families may show different verification sensitivity. But the language-specific effect (Python high baseline, JS/C# lower baseline) suggests this is about task difficulty, not model architecture.
- Only four verification strategies tested. We tested vague, run-tests, trace-examples, and fix-loop against a bare control. There may be other verification patterns (review common pitfalls, check edge cases) that behave differently.
- No interaction with other prompt patterns in this experiment. Here we didn’t test verification combined with chain-of-thought, personas, or other patterns; the persona overlap described above comes from the separate capstone v2 experiment. Other interaction effects could exist.
Try It Yourself
The verification instructions experiment configuration and all data are open source:
```bash
pip install claude-benchmark

# Run the verification instructions experiment
claude-benchmark experiment run experiments/verification-instructions.toml

# Generate the report
claude-benchmark report

# See the results
open results/*/report.html
```
You can also define your own verification variants in the experiment TOML and test them against your specific task types and languages. If you find a verification pattern that beats trace-examples for Python, I’d like to see the data.
Full experiment configuration: experiments/verification-instructions.toml
For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.