Tasks.

A task is the unit the model attempts. It should be concrete enough for a human to solve, constrained enough for a grader to score, and situated inside an environment with a known reset state.

Task contract

prompt

The instruction the model receives. It should name the user goal, not the internal verifier.

env

The environment slug and seed that define the starting state. In the example, these are the environment and seed fields.

constraints

Things the model must not mutate, skip, or infer incorrectly.

outcome

The final state the grader must be able to inspect. In the example, this is the expected_state field.

Example

{
  "slug": "product-price-fix-001",
  "prompt": "Open the product editor, change the listed price to 29,900 KRW, save the product, and confirm the listing is published.",
  "environment": "korean-commerce-admin",
  "seed": "kca-seed-042",
  "constraints": ["preserve_inventory", "save_required", "publish_required"],
  "expected_state": {
    "price": 29900,
    "status": "published"
  }
}
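
One way to see how the contract and the example fit together is to check a task definition for the expected fields and then compare an observed final state against expected_state. The following is a minimal sketch, not the actual grader: the helper names missing_fields and state_mismatches, the list of required fields, and the observed-state dictionaries are all assumptions made for illustration.

```python
import json

# The example task from above, embedded as a string so the sketch is self-contained.
TASK_JSON = """
{
  "slug": "product-price-fix-001",
  "prompt": "Open the product editor, change the listed price to 29,900 KRW, save the product, and confirm the listing is published.",
  "environment": "korean-commerce-admin",
  "seed": "kca-seed-042",
  "constraints": ["preserve_inventory", "save_required", "publish_required"],
  "expected_state": {
    "price": 29900,
    "status": "published"
  }
}
"""

# Assumption: these are the fields a task must carry, inferred from the example.
REQUIRED_FIELDS = ("slug", "prompt", "environment", "seed", "constraints", "expected_state")


def missing_fields(task: dict) -> list[str]:
    """Return the required field names that are absent from the task."""
    return [name for name in REQUIRED_FIELDS if name not in task]


def state_mismatches(expected: dict, observed: dict) -> dict:
    """Return {key: (expected, observed)} for every expected key whose value differs."""
    return {
        key: (value, observed.get(key))
        for key, value in expected.items()
        if observed.get(key) != value
    }


task = json.loads(TASK_JSON)

# Hypothetical final states, as an environment might report them after a run.
good_run = {"price": 29900, "status": "published", "inventory": 12}
bad_run = {"price": 29900, "status": "draft", "inventory": 12}

print(missing_fields(task))
print(state_mismatches(task["expected_state"], good_run))
print(state_mismatches(task["expected_state"], bad_run))
```

An empty mismatch dictionary means the observed state satisfies expected_state. Constraints such as preserve_inventory would need their own checks against the seed state, which this sketch does not attempt.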

Difficulty calibration

Tasks should not be sorted only by human intuition. UseDesktop treats difficulty as a measured distribution: pass@1, pass@3, and pass@5 across multiple models, plus notes about common failure modes.
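
The text does not say how pass@k is computed. A common choice is the unbiased estimator that, given n sampled attempts of which c passed, returns the probability that at least one of k attempts drawn without replacement passes. The sketch below assumes that estimator; the function name pass_at_k and the sample counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts with c passes.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that
    a random subset of k attempts contains no passing attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every subset of size k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 attempts on one task by one model, 3 of them passed.
for k in (1, 3, 5):
    print(k, round(pass_at_k(10, 3, k), 4))
```

Repeating this per model and per task gives the distribution that the calibration bands below refer to.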

Too easy

High pass@1 across models. Useful for smoke tests, weak as training signal.

Trainable

Models fail in repeated patterns but recover with better workflow behavior.

Too sparse

Models rarely reach meaningful intermediate states, so reward signal is weak.
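
The three bands can be sketched as a simple classifier over per-model measurements. Everything numeric here is an assumption: the source gives no cut-offs, so EASY_PASS1 and SPARSE_PROGRESS are placeholder thresholds, and progress_by_model is a hypothetical metric (the fraction of runs that reach any meaningful intermediate state) standing in for whatever signal the graders actually expose.

```python
from statistics import mean

# Hypothetical thresholds; the source does not define numeric cut-offs.
EASY_PASS1 = 0.9        # every model at or above this pass@1 -> too easy
SPARSE_PROGRESS = 0.1   # mean intermediate progress at or below this -> too sparse


def calibrate(pass1_by_model: dict[str, float], progress_by_model: dict[str, float]) -> str:
    """Place a task in one of the three difficulty bands.

    pass1_by_model: pass@1 per model, each in [0, 1].
    progress_by_model: hypothetical fraction of runs per model that reach
        a meaningful intermediate state, each in [0, 1].
    """
    if min(pass1_by_model.values()) >= EASY_PASS1:
        return "too_easy"
    if mean(progress_by_model.values()) <= SPARSE_PROGRESS:
        return "too_sparse"
    return "trainable"


# Hypothetical measurements for three tasks across two models.
print(calibrate({"model_a": 0.95, "model_b": 0.92}, {"model_a": 1.0, "model_b": 1.0}))
print(calibrate({"model_a": 0.30, "model_b": 0.60}, {"model_a": 0.7, "model_b": 0.8}))
print(calibrate({"model_a": 0.00, "model_b": 0.05}, {"model_a": 0.02, "model_b": 0.10}))
```

Using the minimum pass@1 rather than the mean keeps a task out of the too-easy band if even one model still struggles with it, which matches the "across models" wording above. The notes on common failure modes mentioned earlier are not captured by this numeric sketch.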
