Audit verifiers.
Verifier quality is part of the product. A package should expose how the grader was tested, where it can fail, and what loopholes are known.
Audit protocol
verifier audit:
known_good: human success should pass
known_bad: wrong final state should fail
shortcut: invalid path should fail
edge_case: equivalent valid state should pass
ui_change: harmless UI shift should not break scoring
loophole: score optimization without task completion should fail
Error types
false positive
The model failed the real task, but the verifier accepted the attempt.
false negative
The model completed the task, but the verifier rejected the attempt.
loophole
The model can satisfy the score without satisfying the intended workflow.
brittleness
A harmless UI, DOM, file, or copy change breaks grading even though the task is still valid.
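The audit protocol and the four error types can be tied together in a small harness. What follows is a minimal sketch, not a reference implementation. It assumes a verifier is any callable that takes an attempt and returns a truthy value when it accepts. The names `AuditCase`, `EXPECTED_ACCEPT`, `classify`, `run_audit`, and `toy_verifier`, and the shape of the attempt dictionaries, are hypothetical. The mapping from a wrong verdict to an error type is one reasonable reading of the definitions above, not the only one.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Expected verdict per audit case kind, taken from the protocol list:
# True means the verifier should accept, False means it should reject.
EXPECTED_ACCEPT = {
    "known_good": True,
    "known_bad": False,
    "shortcut": False,
    "edge_case": True,
    "ui_change": True,
    "loophole": False,
}


@dataclass
class AuditCase:
    kind: str      # one of the keys in EXPECTED_ACCEPT
    attempt: Any   # whatever the verifier consumes (hypothetical shape)


def classify(kind: str, accepted: bool) -> str:
    """Label one verdict as "ok" or as one of the four error types."""
    if accepted == EXPECTED_ACCEPT[kind]:
        return "ok"
    if accepted:
        # Wrongly accepted. Assumption: shortcut and loophole cases count
        # as loopholes, any other wrong accept as a false positive.
        return "loophole" if kind in ("shortcut", "loophole") else "false positive"
    # Wrongly rejected. Assumption: a rejected harmless UI change is
    # brittleness, any other wrong reject is a false negative.
    return "brittleness" if kind == "ui_change" else "false negative"


def run_audit(verifier: Callable[[Any], bool],
              cases: List[AuditCase]) -> List[Tuple[str, str]]:
    """Run every case through the verifier and return (kind, label) pairs."""
    return [(c.kind, classify(c.kind, bool(verifier(c.attempt)))) for c in cases]


# Toy verifier for illustration only: it checks a single string field.
def toy_verifier(attempt: dict) -> bool:
    return attempt.get("final_state") == "done"


cases = [
    AuditCase("known_good", {"final_state": "done"}),
    AuditCase("known_bad", {"final_state": "wrong"}),
    # Equivalent valid state with different casing.
    AuditCase("edge_case", {"final_state": "Done"}),
    # Score satisfied although no workflow steps were taken.
    AuditCase("loophole", {"final_state": "done", "steps_taken": 0}),
]

for kind, label in run_audit(toy_verifier, cases):
    print(f"{kind}: {label}")
```

With this toy verifier the edge case is labeled a false negative and the loophole case is labeled a loophole, because the check looks only at an exact string and ignores how the state was reached.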
What to publish
Publish enough evidence for a reviewer to estimate how far to trust the verifier: sample size, known-good attempts that were accepted, known-bad attempts that were rejected, edge cases tested, and remaining limitations. A perfect-looking score without audit notes is weak evidence.
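One way to assemble those published figures is a small summary function. This is a sketch under stated assumptions: audit results are available as (case kind, accepted) pairs, the kind labels follow the audit protocol list, and `summarize_audit` and its output field names are hypothetical rather than a standard report format.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_audit(results: List[Tuple[str, bool]],
                    limitations: List[str]) -> Dict[str, object]:
    """Condense (case kind, verifier accepted) pairs into publishable counts."""
    accepted = Counter(kind for kind, ok in results if ok)
    rejected = Counter(kind for kind, ok in results if not ok)
    return {
        "sample_size": len(results),
        "good_accepted": accepted["known_good"],
        "good_total": accepted["known_good"] + rejected["known_good"],
        "bad_rejected": rejected["known_bad"],
        "bad_total": accepted["known_bad"] + rejected["known_bad"],
        "edge_cases_accepted": accepted["edge_case"],
        "edge_cases_total": accepted["edge_case"] + rejected["edge_case"],
        # Free-text notes on known loopholes and untested conditions.
        "limitations": list(limitations),
    }


# Illustrative input only; these are not real audit results.
example_results = [
    ("known_good", True),
    ("known_good", True),
    ("known_bad", False),
    ("edge_case", False),
]
summary = summarize_audit(example_results, ["casing variants not handled"])
for field, value in summary.items():
    print(f"{field}: {value}")
```

Reporting totals next to the accepted and rejected counts lets a reviewer see both the sample size and the miss rate, which a single aggregate score hides.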