Publish an eval.

Publishing takes a locally authored package to a public page where a researcher can inspect the prompt, env, grader, run evidence, and quality controls.

Checklist

1. Author env

Define reset state, observation space, action space, source workflow, and version.
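
As a rough illustration of what those five pieces can look like together, here is a minimal sketch. The `EnvSpec` and `TicketEnv` classes, their field names, and the ticket-triage workflow are all hypothetical; they are not a published package format. The point is only that the reset state is derived from a seed, so the same seed yields the same starting observation.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class EnvSpec:
    """Static description of an environment (all field names are hypothetical)."""
    name: str
    version: str
    source_workflow: str
    observation_space: tuple[str, ...]
    action_space: tuple[str, ...]


class TicketEnv:
    """Toy environment whose reset state depends only on the seed."""

    def __init__(self, spec: EnvSpec) -> None:
        self.spec = spec
        self.state: dict = {}

    def reset(self, seed: int) -> dict:
        rng = random.Random(seed)  # local RNG, so global random state is untouched
        self.state = {
            "open_tickets": [rng.randint(1, 999) for _ in range(3)],
            "resolved": [],
        }
        # Return only the keys declared in the observation space.
        return {key: list(self.state[key]) for key in self.spec.observation_space}


# Hypothetical example values, for illustration only.
SPEC = EnvSpec(
    name="ticket-triage",
    version="0.1.0",
    source_workflow="support ticket triage",
    observation_space=("open_tickets", "resolved"),
    action_space=("resolve", "escalate"),
)
```

Calling `TicketEnv(SPEC).reset(seed=7)` twice should return equal observations, which is the property the readiness list later relies on.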

2. Create tasks

Write prompts, constraints, expected outcomes, and seeds for each task.
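
One way to keep those four fields together is a small task record with a sanity check. The `Task` class, its field names, and `validate_task` are assumptions made for this sketch, not a required schema.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One task in a package (hypothetical schema)."""
    task_id: str
    prompt: str
    constraints: tuple[str, ...]
    expected_outcome: dict
    seed: int


def validate_task(task: Task) -> list[str]:
    """Return a list of problems; an empty list means the task looks complete."""
    problems = []
    if not task.prompt.strip():
        problems.append("empty prompt")
    if not task.expected_outcome:
        problems.append("no expected outcome")
    if task.seed < 0:
        problems.append("negative seed")
    return problems


# Hypothetical example task.
example_task = Task(
    task_id="triage-001",
    prompt="Resolve the oldest open ticket without escalating.",
    constraints=("do not escalate",),
    expected_outcome={"resolved_count": 1},
    seed=7,
)
```

Running `validate_task` over every task before attaching graders catches incomplete entries early.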

3. Attach graders

Add final-state checks, process checks, violation checks, and audit notes.
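
The four kinds of check can be combined in a single grading function. The sketch below assumes a run is summarized as a final-state dict plus a list of action names; `GradeResult`, `grade`, and their parameters are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GradeResult:
    passed: bool
    final_state_ok: bool
    process_ok: bool
    violations: list[str]
    audit_notes: list[str] = field(default_factory=list)


def grade(final_state: dict, actions: list[str], expected: dict,
          required_actions: list[str], forbidden_actions: set[str]) -> GradeResult:
    # Final-state check: every expected key must hold the expected value.
    final_state_ok = all(final_state.get(k) == v for k, v in expected.items())
    # Process check: every required action must appear somewhere in the trace.
    process_ok = all(a in actions for a in required_actions)
    # Violation check: record each forbidden action that was taken.
    violations = [a for a in actions if a in forbidden_actions]
    # Audit note: flag runs that reach the goal while skipping required steps.
    notes = []
    if final_state_ok and not process_ok:
        notes.append("final state reached without required steps; review for shortcuts")
    passed = final_state_ok and process_ok and not violations
    return GradeResult(passed, final_state_ok, process_ok, violations, notes)
```

Keeping the three checks as separate fields, rather than a single boolean, makes it easier to explain on a public page why a run failed.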

4. Run models

Collect traces, pass@k summaries, scores, rewards, and failure modes.
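
This page does not say which pass@k estimator to report. A common choice is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of sampled attempts per task and c is the number that passed. The sketch below assumes that estimator; the function names and the per-task `(n, c)` input shape are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c passes."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(results: dict[str, tuple[int, int]], k: int) -> float:
    """Mean pass@k over tasks; results maps task id to (n, c)."""
    scores = [pass_at_k(n, c, k) for n, c in results.values()]
    return sum(scores) / len(scores)
```

For example, a task with 10 attempts and 3 passes has pass@1 equal to the raw pass rate, 3/10, while pass@5 for the same task is 1 - 21/252.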

5. Publish evidence

Export the manifest, attach artifacts, and render public pages for evals and docs.
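
The manifest format is not specified here. As a sketch under that caveat, a manifest could be a JSON document that lists each attached artifact with a content hash, so a reader can confirm that a linked trace is the one that was scored. The `build_manifest` function and its field names are hypothetical.

```python
import hashlib
import json


def build_manifest(package: str, version: str, artifacts: dict[str, bytes]) -> str:
    """Return a JSON manifest listing each artifact with its SHA-256 and size."""
    entries = [
        {
            "name": name,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }
        for name, data in sorted(artifacts.items())  # sorted for stable output
    ]
    manifest = {"package": package, "version": version, "artifacts": entries}
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Sorting artifacts and keys keeps the output byte-stable across exports, which makes manifest diffs between versions readable.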

Minimum evidence

Public packages should include at least one task demo, one grader contract, one model run, one pass@k summary, verifier false-positive/false-negative (FP/FN) notes, and known failure modes. Without these, the page reads like a dataset claim rather than evidence.
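
Verifier FP/FN notes are easier to trust when they come with counts. The sketch below assumes the convention that a false positive is the verifier accepting an attempt a human judged incorrect, and a false negative is the verifier rejecting one judged correct. The function name and input shape are illustrative.

```python
def verifier_error_counts(labelled: list[tuple[bool, bool]]) -> dict:
    """labelled holds (truly_correct, verifier_accepted) pairs from a human audit."""
    fp = sum(1 for truth, accepted in labelled if accepted and not truth)
    fn = sum(1 for truth, accepted in labelled if truth and not accepted)
    bad = sum(1 for truth, _ in labelled if not truth)
    good = sum(1 for truth, _ in labelled if truth)
    return {
        "false_positives": fp,
        "false_negatives": fn,
        # Rates are None when the audit has no attempts of that kind.
        "fp_rate": fp / bad if bad else None,
        "fn_rate": fn / good if good else None,
    }
```

Reporting the denominators alongside the rates matters: a zero FP rate over two audited bad attempts says much less than the same rate over two hundred.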

eval ready when:
  - reset succeeds from a known seed
  - human can solve the task
  - verifier rejects at least one known bad attempt
  - verifier accepts at least one known good attempt
  - model run traces are linked to scores
  - contamination notes are written
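
The readiness list above can be turned into a simple gate that reports which conditions are still unmet. The check names below are hypothetical identifiers that mirror the six items; how each boolean is produced is left to the package author.

```python
READINESS_CHECKS = (
    "reset_succeeds_from_known_seed",
    "human_can_solve_task",
    "verifier_rejects_known_bad_attempt",
    "verifier_accepts_known_good_attempt",
    "traces_linked_to_scores",
    "contamination_notes_written",
)


def unmet_checks(results: dict[str, bool]) -> list[str]:
    """Return the readiness checks that are missing or False, in list order."""
    return [name for name in READINESS_CHECKS if not results.get(name, False)]


def is_ready(results: dict[str, bool]) -> bool:
    return not unmet_checks(results)
```

Treating a missing key as unmet, rather than as passing, means a check that was never run cannot silently mark an eval as ready.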

Share preview

Share the most specific URL for what you want reviewed: an environment page for package-level context, a task page for prompt/env/grader review, a run page for failure evidence, or a model page for comparison.