Write a grader.
The grader is the trust boundary. If it accepts shortcuts or rejects real successes, the package teaches the wrong behavior.
Grader shape
grader:
  final_state_checks: required state or artifact
  process_checks: required events or observations
  violation_checks: forbidden mutations or shortcuts
  known_good: human completion accepted
  known_bad: shortcut or wrong state rejected
  limitations: edge cases and loopholes
Checks
Final state
Prefer inspectable app state, file state, DOM state, database state, or exported artifact content.
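A final-state check over an exported artifact might look like the sketch below. It assumes the task leaves behind a JSON file whose top-level fields can be compared to expected values; the function name check_final_state and the field names are hypothetical, not part of any existing grader API.

```python
import json
from pathlib import Path


def check_final_state(artifact_path: Path, expected: dict) -> list[str]:
    """Return failure messages; an empty list means the check passed.

    Hypothetical helper: assumes the artifact is a JSON object on disk.
    """
    if not artifact_path.exists():
        return [f"missing artifact: {artifact_path}"]
    try:
        actual = json.loads(artifact_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"artifact is not valid JSON: {exc}"]
    if not isinstance(actual, dict):
        return ["artifact root is not a JSON object"]

    failures = []
    # Compare only the fields the task requires; other fields are ignored here.
    for key, want in expected.items():
        got = actual.get(key)
        if got != want:
            failures.append(f"{key}: expected {want!r}, got {got!r}")
    return failures
```

Returning messages rather than a bare boolean keeps the reason for a rejection inspectable, which helps when a known-good attempt fails unexpectedly.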
Process evidence
Add required events only when the process itself matters, either for correctness or to prevent reward hacking.
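One way to express a process check is as an ordered-subsequence test over an event log. The sketch below assumes the environment records events as a flat list of names; the event names and the function check_required_events are illustrative assumptions.

```python
def check_required_events(event_log: list[str], required: list[str]) -> list[str]:
    """Return the required events that do not appear, in order, in event_log.

    Hypothetical helper: each required event must occur after the
    previously matched one. Other events may be interleaved freely.
    """
    position = 0
    missing = []
    for name in required:
        try:
            # Search only the part of the log after the last matched event.
            position = event_log.index(name, position) + 1
        except ValueError:
            missing.append(name)
    return missing


# Example: an edit followed by a save is required; a save before the edit does not count.
log = ["open_document", "save_document", "edit_field"]
print(check_required_events(log, ["edit_field", "save_document"]))
```

Here the save happened before the edit, so the save is reported as missing even though the event name appears in the log.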
Violations
Reject wrong-target edits, skipped saves, unrelated mutations, and states reached by invalid shortcuts.
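Unrelated mutations can often be caught by diffing a state snapshot taken before the attempt against one taken after it. The following is a minimal sketch that assumes state can be captured as a flat dictionary; find_unrelated_mutations and the key names are hypothetical.

```python
_MISSING = object()  # sentinel so added or removed keys also count as changes


def find_unrelated_mutations(before: dict, after: dict, allowed_keys: set[str]) -> list[str]:
    """Return the sorted keys that changed but were not allowed to change."""
    changed = {
        key
        for key in before.keys() | after.keys()
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    }
    return sorted(changed - allowed_keys)


# Example: the task permits changing only "title", but "owner" was also edited.
before = {"title": "Draft", "owner": "ana", "status": "open"}
after = {"title": "Final", "owner": "bob", "status": "open"}
print(find_unrelated_mutations(before, after, allowed_keys={"title"}))  # ['owner']
```

An allow-list is used instead of a deny-list so that mutations nobody anticipated are rejected by default.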
Audit before running models
Run the grader on at least one known-good attempt and one known-bad attempt before any model runs. It must accept the known-good attempt and reject the known-bad one. A grader that cannot pass this audit should not be used for pass@k or training decisions.
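The audit can be automated as a small harness that runs the grader over recorded attempts. This sketch assumes the grader is a callable that takes an attempt and returns True on acceptance; audit_grader, the attempt dictionaries, and the toy graders are all hypothetical.

```python
from typing import Callable


def audit_grader(
    grade: Callable[[dict], bool],
    known_good: list[dict],
    known_bad: list[dict],
) -> list[str]:
    """Return audit problems; an empty list means the grader passed the audit."""
    problems = []
    if not known_good or not known_bad:
        problems.append("audit needs at least one known-good and one known-bad attempt")
    for i, attempt in enumerate(known_good):
        if not grade(attempt):
            problems.append(f"known_good[{i}] was rejected")
    for i, attempt in enumerate(known_bad):
        if grade(attempt):
            problems.append(f"known_bad[{i}] was accepted")
    return problems


# Toy graders: the strict one requires a saved state, the lax one accepts anything.
def strict(attempt: dict) -> bool:
    return attempt.get("status") == "saved"


def lax(attempt: dict) -> bool:
    return True


good = [{"status": "saved"}]
bad = [{"status": "unsaved"}]
print(audit_grader(strict, good, bad))  # []
print(audit_grader(lax, good, bad))     # ['known_bad[0] was accepted']
```

Gating model runs on an empty problem list makes the audit a precondition instead of an optional step.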