Run a model.

A model run is one attempt against one task in one environment version. Runs make the package useful for active testing, eval reports, and training comparisons.

Run record

run record:
  model: provider, adapter, checkpoint, or endpoint
  environment_version: resettable package version
  task: prompt + seed
  attempt: k index, trace, screenshots, artifacts
  score: grader output and verdict
  failure_mode: where and why the attempt failed
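The fields above could be carried as a small typed structure. This is a minimal sketch only: the class names (RunRecord, Attempt, Score), the field types, the "pass"/"fail" verdict strings, and all example values are assumptions for illustration, not the package's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Attempt:
    k_index: int                      # which of the k repeated attempts this is
    trace: str                        # path or ID of the recorded trace (assumed to be a string)
    screenshots: list[str] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)


@dataclass
class Score:
    grader_output: dict               # raw grader output, shape not specified by the source
    verdict: str                      # assumed values: "pass" or "fail"


@dataclass
class RunRecord:
    model: str                        # provider, adapter, checkpoint, or endpoint
    environment_version: str          # resettable package version
    task_prompt: str
    task_seed: int
    attempt: Attempt
    score: Score
    failure_mode: Optional[str] = None  # where and why the attempt failed, if it did

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so the result is plain JSON.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values.
record = RunRecord(
    model="example-endpoint",
    environment_version="0.1.0",
    task_prompt="Example task prompt",
    task_seed=7,
    attempt=Attempt(k_index=0, trace="traces/run-0.json"),
    score=Score(grader_output={"checks_passed": 2, "checks_total": 3}, verdict="fail"),
    failure_mode="stopped before the final step",
)
print(record.to_json())
```

Keeping the prompt and seed as separate fields makes it easier to group attempts on the same task when comparing models.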

Runner boundary

Local

Use local runs for smoke tests, grader debugging, and trace inspection.

RunPod or AWS

Use remote runners for repeatable pass@k sweeps, hosted models, and larger model comparisons.
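For reference, one common way to summarize a pass@k sweep is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of attempts on a task and c is the number that passed. The sketch below assumes that definition; this page does not say which pass@k definition the package uses, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled attempts passes.

    n: total attempts recorded for the task
    c: attempts that passed
    k: sample size being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any sample of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 3 passes out of 10 attempts, evaluated at k = 1 and k = 5.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Recording more attempts than the largest k of interest keeps the estimate from collapsing to 0 or 1 on small samples.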

Customer infrastructure

Use customer-side runners when data, credentials, or app state cannot leave their boundary.
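The three boundaries above can be read as a simple decision rule. The sketch below is one possible encoding, assuming the customer boundary takes priority over everything else; the Runner enum, the purpose strings, and the choose_runner function are hypothetical names, not part of any package API.

```python
from enum import Enum


class Runner(Enum):
    LOCAL = "local"
    REMOTE = "remote"        # RunPod or AWS
    CUSTOMER = "customer"    # customer infrastructure


# Purpose labels are assumptions chosen to mirror the text above.
LOCAL_PURPOSES = {"smoke_test", "grader_debugging", "trace_inspection"}
REMOTE_PURPOSES = {"pass_at_k_sweep", "hosted_model", "model_comparison"}


def choose_runner(purpose: str, *, must_stay_in_customer_boundary: bool = False) -> Runner:
    # Data, credentials, or app state that cannot leave the customer's
    # boundary override every other consideration.
    if must_stay_in_customer_boundary:
        return Runner.CUSTOMER
    if purpose in LOCAL_PURPOSES:
        return Runner.LOCAL
    if purpose in REMOTE_PURPOSES:
        return Runner.REMOTE
    raise ValueError(f"unknown purpose: {purpose!r}")


print(choose_runner("smoke_test"))
print(choose_runner("pass_at_k_sweep", must_stay_in_customer_boundary=True))
```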

Minimum useful run set

For early packages, collect at least one human-good run, one known-bad run, and model attempts from at least one baseline model and one candidate model. The useful output is the trace evidence, not just the score.
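A small check can flag what is still missing from a run set. This is a sketch under assumed conventions: each run is a dict with a "kind" key ("human_good", "known_bad", or "model"), model runs carry a "role" key ("baseline" or "candidate"), and every run carries a "trace" key. None of these key names come from the package itself.

```python
def missing_from_minimum_set(runs: list[dict]) -> list[str]:
    """Return a description of each requirement the run set does not yet meet."""
    kinds = {run["kind"] for run in runs}
    model_roles = {run.get("role") for run in runs if run["kind"] == "model"}

    missing = []
    if "human_good" not in kinds:
        missing.append("human-good run")
    if "known_bad" not in kinds:
        missing.append("known-bad run")
    if "baseline" not in model_roles:
        missing.append("baseline model attempt")
    if "candidate" not in model_roles:
        missing.append("candidate model attempt")
    # A score without a trace is not enough evidence, so flag runs lacking one.
    if any(not run.get("trace") for run in runs):
        missing.append("trace for every run")
    return missing


# Hypothetical run set that still lacks a candidate model attempt.
runs = [
    {"kind": "human_good", "trace": "traces/human.json"},
    {"kind": "known_bad", "trace": "traces/bad.json"},
    {"kind": "model", "role": "baseline", "trace": "traces/baseline-0.json"},
]
print(missing_from_minimum_set(runs))
```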

Next

export package: Bundle runs with the manifest and audit notes.

pass@k evaluation: Use repeated attempts to calibrate difficulty.