Tasks
A task is the unit the model attempts. It should be concrete enough for a human to solve, constrained enough for a grader to score, and situated inside an environment with a known reset state.
Task contract
Prompt
The instruction the model receives. It should name the user goal, not the internal verifier.
Environment and seed
The environment slug and seed that define the starting state.
Constraints
Things the model must not mutate, skip, or infer incorrectly.
Expected state
The final state the grader must be able to inspect.
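The contract parts above can be sketched as a typed record with two small checks. This is an illustrative sketch, not UseDesktop's actual schema or grader: the `Task` class and the helpers `contract_problems` and `state_matches` are hypothetical names, and the field names are borrowed from the example on this page.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Task:
    """Hypothetical in-memory form of a task contract."""
    slug: str
    prompt: str        # instruction shown to the model
    environment: str   # environment slug
    seed: str          # seed that fixes the starting state
    constraints: list[str] = field(default_factory=list)
    expected_state: dict[str, Any] = field(default_factory=dict)


def contract_problems(task: Task) -> list[str]:
    """Return readable problems; an empty list means the contract is complete."""
    problems = []
    if not task.prompt.strip():
        problems.append("prompt is empty")
    if not task.environment or not task.seed:
        problems.append("environment and seed must both be set")
    if not task.expected_state:
        problems.append("expected_state is empty, so a grader has nothing to inspect")
    return problems


def state_matches(task: Task, final_state: dict[str, Any]) -> bool:
    """Assumed grader check: every expected key is present with an equal value."""
    return all(
        key in final_state and final_state[key] == value
        for key, value in task.expected_state.items()
    )
```

Note that `state_matches` only compares the final state. Constraints such as preserving inventory would also need the starting state, so they are left out of this sketch.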
Example
{
  "slug": "product-price-fix-001",
  "prompt": "Open the product editor, change the listed price to 29,900 KRW, save the product, and confirm the listing is published.",
  "environment": "korean-commerce-admin",
  "seed": "kca-seed-042",
  "constraints": ["preserve_inventory", "save_required", "publish_required"],
  "expected_state": {
    "price": 29900,
    "status": "published"
  }
}
Difficulty calibration
Tasks should not be sorted only by human intuition. UseDesktop treats difficulty as a measured distribution: pass@1, pass@3, and pass@5 across multiple models, plus notes about common failure modes.
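As a rough illustration of how pass@1, pass@3, and pass@5 could be computed from repeated attempts, the sketch below uses the common combinatorial estimator 1 - C(n - c, k) / C(n, k). The page does not say which estimator UseDesktop uses, so treat this choice, the function name `pass_at_k`, and the sample numbers as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n attempts contains at least one pass.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results: 10 attempts per model on a single task.
passes_per_model = {"model-a": 6, "model-b": 1, "model-c": 0}
for name, passed in passes_per_model.items():
    scores = {k: round(pass_at_k(10, passed, k), 3) for k in (1, 3, 5)}
    print(name, scores)
```

Sampling more attempts than the largest k (here 10 attempts for k up to 5) keeps the estimate less noisy than running exactly k attempts once.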
Too easy
High pass@1 across models. Useful for smoke tests, weak as training signal.
Trainable
Models fail in repeated patterns but recover with better workflow behavior.
Too sparse
Models rarely reach meaningful intermediate states, so reward signal is weak.
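One possible way to make a first pass at these bands from measured pass rates is sketched below. The thresholds `easy_floor` and `sparse_ceiling`, the band names, and the function `difficulty_band` are all hypothetical; the page gives no numeric cutoffs.

```python
def difficulty_band(
    pass_at_1: list[float],
    pass_at_5: list[float],
    easy_floor: float = 0.9,       # assumed cutoff, not from the source
    sparse_ceiling: float = 0.05,  # assumed cutoff, not from the source
) -> str:
    """Place a task in a band using per-model pass rates."""
    if not pass_at_1 or not pass_at_5:
        raise ValueError("need at least one model's scores")
    if min(pass_at_1) >= easy_floor:
        return "too_easy"    # every model usually succeeds on the first try
    if max(pass_at_5) <= sparse_ceiling:
        return "too_sparse"  # even five attempts almost never succeed
    return "trainable"
```

Pass rates alone cannot show whether failures follow repeated patterns or whether models reach meaningful intermediate states, so a band assigned this way would still need the failure-mode notes to confirm it.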