Build your first package.

Start with one narrow workflow. The goal is not volume. The goal is a package that another person can inspect to decide whether it is worth training on.

Before you start

Pick one real workflow

Use a task a person actually performs, not a synthetic click sequence.

Know the final state

The grader needs something inspectable: app state, file state, DOM state, or artifact output.

Keep the scope small

One package can start with one task and one grader if the evidence is clean.

Record provenance

Track the source, version, and privacy treatment of the workflow, and whether it overlaps with public benchmarks.
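
One way to keep these four facts together is a small record that travels with the package. The sketch below is only an illustration: the `Provenance` class, its field names, and the `provenance_gaps` helper are assumptions made for this example, not part of Workbench or any fixed package format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record; field names are illustrative."""
    source: str                      # where the workflow came from
    version: str                     # version of the app or data that was captured
    privacy_treatment: str           # e.g. "redacted" or "synthetic accounts"
    overlaps_public_benchmark: bool  # True if the task also appears in a public benchmark


def provenance_gaps(record: Provenance) -> list[str]:
    """Return the names of text fields that were left empty."""
    return [
        name
        for name, value in asdict(record).items()
        if isinstance(value, str) and not value.strip()
    ]
```

Calling `provenance_gaps` before export gives a quick list of fields that still need to be filled in.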

Path

1. Capture

Record the workflow in Workbench: screenshots, actions, intermediate states, artifacts, and outcome.

2. Normalize

Remove noisy events and keep the actions that explain how the work was completed.

3. Create task

Write the prompt, reset seed, constraints, allowed observations, and expected final state.

4. Write grader

Score final state first, then add process checks and violation checks where they matter.

5. Run checks

Run one known-good attempt, one known-bad attempt, and at least one model attempt.

6. Export

Package the manifest, traces, screenshots, grader output, audit notes, and provenance.
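
Steps 4 and 5 can be sketched together: a grader that scores final state first, then process and violation checks, and an audit that runs it against a known-good and a known-bad attempt. Everything here is an assumption made for illustration, including the `Attempt` and `GradeResult` types, the action names, and the `invoice_status` key. None of it is a Workbench API.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    """One recorded run: the actions taken and the state left behind."""
    actions: list[str]
    final_state: dict[str, str]


@dataclass
class GradeResult:
    passed: bool
    reasons: list[str] = field(default_factory=list)


def grade(attempt, expected_state, required_actions=(), forbidden_actions=()):
    """Score final state first, then process checks, then violation checks."""
    reasons = []
    # 1. Final state: every expected key must hold the expected value.
    for key, want in expected_state.items():
        got = attempt.final_state.get(key)
        if got != want:
            reasons.append(f"final state: {key}={got!r}, expected {want!r}")
    # 2. Process checks: actions that must appear somewhere in the trace.
    for action in required_actions:
        if action not in attempt.actions:
            reasons.append(f"process: missing action {action!r}")
    # 3. Violation checks: actions that must never appear in the trace.
    for action in forbidden_actions:
        if action in attempt.actions:
            reasons.append(f"violation: forbidden action {action!r}")
    return GradeResult(passed=not reasons, reasons=reasons)


def audit(grader_kwargs, known_good, known_bad):
    """True only if the grader passes known-good and fails known-bad."""
    good = grade(known_good, **grader_kwargs)
    bad = grade(known_bad, **grader_kwargs)
    return good.passed and not bad.passed


# Hypothetical example: the known-bad attempt reaches the same final state
# through a shortcut, so a final-state-only grader cannot tell them apart.
expected = {"invoice_status": "paid"}
good = Attempt(["open_invoice", "record_payment"], {"invoice_status": "paid"})
bad = Attempt(["edit_db_row"], {"invoice_status": "paid"})

weak = {"expected_state": expected}
strict = {"expected_state": expected, "forbidden_actions": ("edit_db_row",)}
```

In this example `audit(weak, good, bad)` fails because both attempts leave the same final state, while `audit(strict, good, bad)` passes once the shortcut is listed as a violation. That is the case where a violation check matters.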

Target output

first package:
  source workflow: one real completion path
  environment: resettable seed or workflow twin
  task: prompt + constraints + expected state
  grader: final-state check + known-good/known-bad audit
  run: at least one model attempt with trace and score
  export: manifest + artifacts + provenance
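
The outline above could be serialized as a manifest and checked for completeness before export. This is a minimal sketch under stated assumptions: the section keys mirror the outline, but the JSON layout, the `REQUIRED_SECTIONS` tuple, and the `build_manifest` function are hypothetical and do not describe a defined manifest schema.

```python
import json

# Section names taken from the outline above; the key spelling is an assumption.
REQUIRED_SECTIONS = (
    "source_workflow",
    "environment",
    "task",
    "grader",
    "run",
    "export",
)


def build_manifest(sections: dict[str, str]) -> str:
    """Return a JSON manifest, or raise if any section is missing or empty."""
    missing = [name for name in REQUIRED_SECTIONS if not sections.get(name)]
    if missing:
        raise ValueError(f"manifest is missing sections: {', '.join(missing)}")
    body = {name: sections[name] for name in REQUIRED_SECTIONS}
    return json.dumps({"first_package": body}, indent=2)
```

Failing on a missing section keeps an incomplete package from being exported by accident.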

Stop after one package if the grader cannot distinguish a real success from a shortcut. Fix the evidence path before collecting more trajectories.