May 14, 2026

“Trace 3 Examples” Is the Best Verification Instruction — But Only for JS and C# (10,800 Benchmarks)

Johnathen Chilcher Senior SRE, TechLoom

Anthropic’s prompt engineering documentation calls verification “the single highest-leverage thing you can add to improve accuracy.” But WHAT kind of verification? We tested four strategies across 10,800 benchmark runs and three languages. Only one consistently helps—and it’s language-specific.

“Trace through 3 concrete examples with your code before responding” delivers +3.2 points for JavaScript and +5.5 points for C#. But for Python? Neutral to slightly negative (-0.7pts). Meanwhile, the vague instruction “double-check your work” has the lowest overall score of the four verification variants (73.86, against 75.33 for tracing examples), even though it is not last in any single language. Verification works best when you tell the model exactly WHAT to do.

This is the seventeenth experiment in our benchmark series, following the chain-of-thought, context pollution, model selection, politeness sweep, persona sweep, step-back, anchoring, and cross-language revisits studies. All data is open source at claude-benchmark.

Experiment Design

I tested five variants: a bare control and four verification instructions.

Variant	Description
bare	No verification instructions (control baseline)
verify-vague	“Double-check your work before responding”
verify-run-tests	“Run the tests mentally and verify your code passes”
verify-with-examples	“Trace through 3 concrete examples with your code before responding”
verify-fix-loop	“Check your output, identify any issues, and fix them before responding”

Each variant was run across:

3 models: Haiku, Sonnet, Opus
48 tasks across Python, JavaScript, and C# (Go scoring was sparse and is excluded from analysis)
10 repetitions per model-task-variant combination

That’s 5 variants × 3 models × 48 tasks × 10 reps = 7,200 runs across the three main languages, with an additional 3,600 runs in Go that showed insufficient scoring coverage.*

* Go tasks had sparse test coverage, making score deltas unreliable. We focus on Python, JS, and C# where scoring is comprehensive.
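The run counts can be checked with a few lines of arithmetic. This is only a sketch of the multiplication stated above; the variable names are illustrative and not taken from the benchmark code.

```python
# Run-count arithmetic from the experiment design above.
variants = 5   # bare control + four verification instructions
models = 3     # Haiku, Sonnet, Opus
tasks = 48     # tasks across Python, JavaScript, and C#
reps = 10      # repetitions per model-task-variant combination

main_runs = variants * models * tasks * reps
go_runs = 3_600  # reported separately and excluded from analysis
total_runs = main_runs + go_runs

print(main_runs)   # 7200
print(total_runs)  # 10800
```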

The Scoreboard

Executive summary: overall performance across all tested variants.

Here are the overall results across Python, JavaScript, and C#:

Variant	Python	JavaScript	C#	Overall	N
bare (control)	87.34	66.94	63.81	72.70	1669
verify-vague	86.83	68.91	65.83	73.86	1659
verify-run-tests	87.88	68.63	67.65	74.72	1652
verify-with-examples	86.62	70.10	69.28	75.33	1650
verify-fix-loop	87.41	69.07	65.37	73.95	1651

And here are the deltas from the bare baseline:

Variant	Python	JavaScript	C#
verify-vague	-0.51	+1.97	+2.02
verify-run-tests	+0.54	+1.69	+3.84
verify-with-examples	-0.72	+3.15	+5.46
verify-fix-loop	+0.07	+2.13	+1.56
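The deltas can be recomputed from the scoreboard as a sanity check. The sketch below copies the two-decimal means from the table and subtracts the bare row. Because it starts from rounded means, verify-with-examples comes out at +3.16 (JS) and +5.47 (C#), 0.01 away from the published +3.15 and +5.46; presumably the published deltas were computed from unrounded means. The Overall column is consistent with an unweighted mean of the three language scores, which is an assumption here rather than something the report states. Function names are illustrative.

```python
# Scores copied from the scoreboard table, in (Python, JavaScript, C#) order.
SCORES = {
    "bare": (87.34, 66.94, 63.81),
    "verify-vague": (86.83, 68.91, 65.83),
    "verify-run-tests": (87.88, 68.63, 67.65),
    "verify-with-examples": (86.62, 70.10, 69.28),
    "verify-fix-loop": (87.41, 69.07, 65.37),
}
LANGUAGES = ("Python", "JavaScript", "C#")


def deltas_from_bare(scores):
    """Return {variant: {language: delta}} relative to the bare control."""
    base = scores["bare"]
    return {
        name: {lang: round(s - b, 2) for lang, s, b in zip(LANGUAGES, row, base)}
        for name, row in scores.items()
        if name != "bare"
    }


def overall(row):
    """Unweighted mean of the language scores (assumed derivation of Overall)."""
    return round(sum(row) / len(row), 2)


DELTAS = deltas_from_bare(SCORES)
for name, row in DELTAS.items():
    print(name, row, overall(SCORES[name]))
```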

verify-with-examples Wins — For JS and C#

verify-with-examples is the clear winner for JavaScript (+3.2pts) and C# (+5.5pts). “Trace through 3 concrete examples with your code before responding” gives the model a concrete verification task. It works.

But look at Python:

verify-with-examples is slightly negative for Python (-0.7pts). This is the same instruction that delivers +5.5pts for C#. For Python, the model doesn’t benefit from tracing examples—it already gets Python right more often (87.34 baseline vs 66.94 for JS, 63.81 for C#).

This is a language-specific effect. Python is Claude’s strongest language across all our experiments. The baseline score is 20+ points higher than JS or C#. When the model is already highly accurate, verification adds overhead without corresponding benefit. For JS and C#, where baseline accuracy is lower, verification helps the model catch mistakes it would otherwise miss.

Why Vague Verification Doesn’t Work

“Double-check your work before responding” has the lowest overall score of the four verification variants (73.86). It delivers +1.97pts for JS and +2.02pts for C#: better than nothing, but well short of verify-with-examples (+3.15 and +5.46). It is not last in every column, though. verify-run-tests scores lower on JS (+1.69) and verify-fix-loop scores lower on C# (+1.56).

A likely explanation is that vague verification gives the model no concrete action. “Double-check your work” doesn’t tell the model WHAT to check or HOW to verify. The model doesn’t know if it should trace examples, run tests mentally, or review for common pitfalls. So it does… something. And that something is less effective than the best specific instruction.

This maps to the broader pattern we’ve seen across experiments: specificity matters. The model performs best when you tell it exactly what to do. “Trace through 3 concrete examples” is specific. “Double-check your work” is not.

The Specificity Gradient

There’s a rough gradient from vague to specific verification, though it is not monotonic:

Variant	Specificity	JS Delta	C# Delta
verify-vague	Low (“double-check”)	+1.97	+2.02
verify-run-tests	Medium (“run tests mentally”)	+1.69	+3.84
verify-with-examples	High (“trace 3 examples”)	+3.15	+5.46
verify-fix-loop	Very High (“check, identify, fix”)	+2.13	+1.56

More specific is better—up to a point. verify-fix-loop is the most complex instruction (“Check your output, identify any issues, and fix them before responding”), but it underperforms verify-with-examples. Why?

verify-fix-loop asks the model to do three things: check, identify, fix. That’s three cognitive steps, each with potential failure modes. verify-with-examples asks for one thing: trace examples. One specific instruction beats three stacked instructions.
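Ranking the variants per language makes the shape of the gradient easier to see. The sketch below uses the published deltas from the tables above. The `rank` helper is illustrative, not part of claude-benchmark.

```python
# Deltas from the bare baseline, copied from the tables above.
DELTAS = {
    "verify-vague": {"Python": -0.51, "JavaScript": 1.97, "C#": 2.02},
    "verify-run-tests": {"Python": 0.54, "JavaScript": 1.69, "C#": 3.84},
    "verify-with-examples": {"Python": -0.72, "JavaScript": 3.15, "C#": 5.46},
    "verify-fix-loop": {"Python": 0.07, "JavaScript": 2.13, "C#": 1.56},
}


def rank(language):
    """Variants ordered from largest to smallest delta for one language."""
    return sorted(DELTAS, key=lambda v: DELTAS[v][language], reverse=True)


for language in ("Python", "JavaScript", "C#"):
    print(language, rank(language))
```

verify-with-examples ranks first for JavaScript and C#, while the order of the remaining variants differs by language.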

Connection to Multi-Turn and Persona

Verification is single-turn iteration. Instead of generating output, checking it, and then regenerating in a second turn, verification asks the model to do that loop internally before responding. This is cheaper (one API call instead of two) but less reliable (no external feedback loop).
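The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: `ask` stands for any function that sends a prompt to a model and returns text, `fake_ask` is a hypothetical stand-in that only records prompts, and the wording of the second-turn prompt is invented for the example.

```python
from typing import Callable

VERIFY = "Trace through 3 concrete examples with your code before responding"


def single_turn(ask: Callable[[str], str], task: str) -> str:
    """One call: the verification instruction rides along with the task."""
    return ask(f"{task}\n\n{VERIFY}")


def two_turn(ask: Callable[[str], str], task: str) -> str:
    """Two calls: generate a draft, then ask for a checked revision."""
    draft = ask(task)
    review = f"{task}\n\nDraft:\n{draft}\n\nCheck this draft and return a corrected version."
    return ask(review)


# Hypothetical stand-in for a real model client: records each prompt it receives.
calls = []


def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    return "draft"


single_turn(fake_ask, "Write a function that reverses a string.")
single_calls = len(calls)
calls.clear()
two_turn(fake_ask, "Write a function that reverses a string.")
two_calls = len(calls)
print(single_calls, two_calls)  # 1 2
```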

Our persona sweep experiment found that the “meticulous code reviewer” persona delivered the largest quality gains. That persona primes the model to check its work before responding—essentially baked-in verification. And in our capstone v2 experiment, we found that verification is redundant with the code-reviewer persona. They capture the same benefit.

If you already use the code-reviewer persona, adding explicit verification adds no marginal value. The persona already primes the model to verify. Stacking both doesn’t help further—it just adds tokens.

Practical Advice

For JavaScript and C#: Add “trace through 3 concrete examples” to your prompts

verify-with-examples delivers +3.2pts for JS and +5.5pts for C#. This is the highest-leverage verification instruction we tested for these two languages. Add it to your CLAUDE.md or system prompt for JS/C# projects.

Evidence: 1650 runs, +3.15pts JS, +5.46pts C#

For Python: Skip verification (or use the code-reviewer persona)

Python’s baseline accuracy is already high (87.34). verify-with-examples adds overhead without benefit (-0.7pts). If you want verification behavior, use the code-reviewer persona instead—it’s language-agnostic and doesn’t require language-specific tuning.

Evidence: 1650 runs, -0.72pts Python

Don’t use vague verification instructions

“Double-check your work” has the lowest overall score of the four verification variants. It trails verify-with-examples by about 1.2 points on JS and 3.4 points on C#. If you’re going to add verification, be specific: “trace 3 examples” or “run tests mentally,” not “check your work.”

Evidence: verify-vague +1.97pts JS, +2.02pts C# vs verify-with-examples +3.15pts JS, +5.46pts C#

If you use the code-reviewer persona, skip explicit verification

The code-reviewer persona and verification capture the same benefit. From the capstone v2 experiment: persona alone delivers the same quality as persona + verification. Don’t stack them.

Evidence: Capstone v2 experiment, persona-only vs persona+verification
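The advice above can be folded into a small prompt-building helper. This is a hedged sketch: `verification_instruction` and `build_prompt` are hypothetical names, the language aliases are assumptions, and languages outside the three measured ones simply get no instruction because this experiment has no data for them.

```python
from typing import Optional

TRACE_EXAMPLES = "Trace through 3 concrete examples with your code before responding"

# Languages where verify-with-examples showed a gain in this experiment.
# The aliases are illustrative assumptions.
BENEFITS_FROM_TRACING = {"javascript", "js", "c#", "csharp"}


def verification_instruction(language: str, uses_reviewer_persona: bool = False) -> Optional[str]:
    """Pick a verification line following the advice above (hypothetical helper)."""
    if uses_reviewer_persona:
        return None  # the persona already captures the verification benefit
    if language.lower() in BENEFITS_FROM_TRACING:
        return TRACE_EXAMPLES
    return None  # Python: skip; other languages were not measured here


def build_prompt(task: str, language: str, uses_reviewer_persona: bool = False) -> str:
    """Append the verification line to the task only when one applies."""
    extra = verification_instruction(language, uses_reviewer_persona)
    return task if extra is None else f"{task}\n\n{extra}"


print(build_prompt("Implement a debounce function.", "JavaScript"))
```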

Token efficiency: quality vs cost tradeoff.

Heatmap: detailed breakdown.

Radar charts: multi-dimensional view.

Statistical significance analysis.

Limitations

Go scoring sparse. Go tasks had insufficient test coverage for reliable scoring. We excluded Go from analysis and focused on Python, JS, and C# where scoring is comprehensive.
Python-centric variant design. The original benchmark suite was designed for Python. Task complexity and scoring may not be perfectly balanced across languages, though our cross-language revisits experiment showed consistent patterns.
Claude models only. Other model families may show different verification sensitivity. But the language-specific effect (Python high baseline, JS/C# lower baseline) suggests this is about task difficulty, not model architecture.
Four verification strategies tested. We tested vague, run-tests, trace-examples, and fix-loop against a bare control. There may be other verification patterns (review common pitfalls, check edge cases) that behave differently.
No interaction with other prompt patterns in this experiment. We didn’t test verification combined with chain-of-thought, personas, or other patterns here; the persona overlap described earlier comes from the separate capstone v2 experiment. Other interaction effects could exist.

Try It Yourself

The verification instructions experiment configuration and all data are open source:

pip install claude-benchmark

# Run the verification instructions experiment
claude-benchmark experiment run experiments/verification-instructions.toml

# Generate the report
claude-benchmark report

# See the results
open results/*/report.html

You can also define your own verification variants in the experiment TOML and test them against your specific task types and languages. If you find a verification pattern that beats trace-examples for Python, I’d like to see the data.

Full experiment configuration: experiments/verification-instructions.toml

For the complete picture of what works and what doesn’t in AI prompt engineering, see the capstone best-practices experiment where we combine the winning strategies from all experiments into a single optimized configuration.