evaluate

Control contamination.

A good score means little if the task was leaked, duplicated, or mixed across train and eval splits. Contamination notes record why the evidence can be trusted.

Record

contamination record:
  source: where the workflow pattern came from
  version: environment, task, grader, and artifact versions
  split: train, eval, holdout, or customer-private
  overlap: public benchmark and prior package checks
  isolation: customer, account, credential, and artifact boundaries
  redaction: sensitive fields removed or transformed
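
One way to make the record checkable is to hold it in a small typed structure. The sketch below is only an illustration: the field names mirror the record above, but the class name `ContaminationRecord`, the helper `missing_fields`, and the choice of Python types are assumptions, not part of any existing tool.

```python
from dataclasses import dataclass, fields

# Split names taken from the record above.
ALLOWED_SPLITS = {"train", "eval", "holdout", "customer-private"}


@dataclass(frozen=True)
class ContaminationRecord:
    """Hypothetical container for one contamination record."""
    source: str      # where the workflow pattern came from
    version: dict    # environment, task, grader, and artifact versions
    split: str       # one of ALLOWED_SPLITS
    overlap: list    # public benchmark and prior package checks performed
    isolation: str   # customer, account, credential, and artifact boundary
    redaction: list  # sensitive fields removed or transformed

    def __post_init__(self):
        # Reject split names the record does not define.
        if self.split not in ALLOWED_SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")


def missing_fields(record):
    """Return the names of fields that were left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A reviewer could call `missing_fields(record)` before accepting a package and ask the author to fill in anything it returns.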

Controls

Source tracking

Record whether the workflow came from a real operator, mock app, public benchmark, or synthetic draft.

Split discipline

Keep train, eval, holdout, and customer-private packages separate by stable IDs and artifact paths.
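
A minimal sketch of a split check, assuming each package is listed as a `(package_id, split)` pair. The function name `find_split_collisions` and the input shape are hypothetical; a real pipeline might also compare artifact paths.

```python
from collections import defaultdict


def find_split_collisions(packages):
    """Return {package_id: [splits]} for every ID that appears in more than one split.

    `packages` is an iterable of (package_id, split) pairs (assumed shape).
    """
    seen = defaultdict(set)
    for package_id, split in packages:
        seen[package_id].add(split)
    # Keep only IDs that were assigned to two or more splits.
    return {pid: sorted(splits) for pid, splits in seen.items() if len(splits) > 1}
```

An empty result means no stable ID is shared across splits. Any entry in the result names a package to move or drop before scores are reported.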

Benchmark overlap

Check whether the task resembles public benchmark tasks, copied examples, or previously published packages.
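
Resemblance can be approximated in many ways. The sketch below uses Jaccard similarity over word 3-grams as one simple stand-in, not as the prescribed method. The function names, the 3-gram size, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def shingles(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_overlaps(task_text, references, threshold=0.5):
    """Return (name, score) pairs for references at or above `threshold`.

    `references` maps a reference name to its text (assumed shape).
    Results are sorted from most to least similar.
    """
    task = shingles(task_text)
    flagged = []
    for name, ref_text in references.items():
        score = jaccard(task, shingles(ref_text))
        if score >= threshold:
            flagged.append((name, round(score, 2)))
    return sorted(flagged, key=lambda item: -item[1])
```

A lexical check like this catches copied or lightly edited tasks. It will miss paraphrases, so a flagged list that comes back empty is weak evidence rather than proof of no overlap.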

Customer isolation

Prevent cross-customer artifact reuse unless the package is explicitly sanitized and licensed for reuse.
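
The reuse rule can be expressed as a small gate. This is a sketch under assumptions: the artifact is a plain dict, and the keys `customer`, `sanitized`, and `licensed_for_reuse` are hypothetical names for the flags the rule implies.

```python
def may_reuse(artifact, target_customer):
    """Return True if `artifact` may be used for `target_customer`.

    Same-customer use is always allowed. Cross-customer use requires the
    artifact to be marked both sanitized and licensed for reuse.
    """
    if artifact["customer"] == target_customer:
        return True
    # Missing flags are treated as False, so reuse is denied by default.
    return bool(artifact.get("sanitized")) and bool(artifact.get("licensed_for_reuse"))
```

Defaulting missing flags to `False` means an artifact with incomplete metadata is never shared across customers by accident.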

Review story

The contamination story should be short and inspectable. A reviewer should know what the package can be used for, what it must not be mixed with, and what source assumptions remain.