Introduction

UseDesktop is infrastructure for computer-use agents (CUA). Use it to create verifiable RL environments, run evals, train models, compare model runs, and export evidence-backed workflow packages.

What you can do

create

Build resettable RL environments and task packages from real computer-use workflows.

evaluate

Run CUA models against tasks under grader contracts, and collect traces, scores, and pass@k results.

train

Use verified workflow packages as SFT/RL data and compare model improvements.

export

Package environments, tasks, graders, model runs, audits, and provenance for review.

quickstart, capture, env spec, task schema, grader contract, run evidence

Start here

Quickstart

Build the first reviewable package: one workflow, one task, one grader, and one model run.

Concepts in 5 minutes

Understand environment, task, grader, run, audit, and export without reading every page.

Artifact schema

Use the portable manifest that connects capture, evals, training, and export.

Build path

Capture workflow

Record the source trace, screenshots, artifacts, decisions, and final outcome.

Create task

Convert a workflow into a prompt, seed, constraints, expected state, and difficulty target.

Write grader

Score final state, process evidence, violations, and known reward-hacking paths.

Run model

Collect traces, scores, rewards, pass@k results, and failure evidence across models.

Export package

Package manifests, artifacts, audits, and provenance for evals or customer review.

Publish eval

Turn a package into a public evidence page with inspectable grader and run records.

Package shape

The public eval pages are the human-readable layer. The export is the machine-readable contract, and it should run locally or on RunPod, AWS, or customer infrastructure.

{
  "environment": {
    "id": "korean-commerce-admin",
    "reset": "seeded_state_v1",
    "action_space": ["click", "type", "key", "scroll"]
  },
  "task": {
    "prompt": "Change the listed price to 29,900 KRW and publish.",
    "constraints": ["do_not_change_inventory", "save_required"]
  },
  "grader": {
    "type": "state_and_process_v1",
    "success": ["price_field_equals_29900", "publish_state_true"]
  },
  "evidence": {
    "runs": ["pass@1", "pass@3", "pass@5"],
    "audits": ["verifier_fp", "verifier_fn", "known_loopholes"]
  }
}
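As a minimal sketch of how a consumer might check an export before running it, the snippet below loads the example manifest and reports any missing fields. The required-field list is an assumption inferred only from the example above, not a published schema, and `missing_fields` is a hypothetical helper name.

```python
import json

# Assumption: required keys are inferred from the example manifest only.
REQUIRED = {
    "environment": ["id", "reset", "action_space"],
    "task": ["prompt", "constraints"],
    "grader": ["type", "success"],
    "evidence": ["runs", "audits"],
}


def missing_fields(manifest: dict) -> list[str]:
    """Return dotted paths of required fields that are absent."""
    missing = []
    for section, keys in REQUIRED.items():
        block = manifest.get(section)
        if not isinstance(block, dict):
            # The whole section is absent or has the wrong type.
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


EXAMPLE = json.loads("""
{
  "environment": {
    "id": "korean-commerce-admin",
    "reset": "seeded_state_v1",
    "action_space": ["click", "type", "key", "scroll"]
  },
  "task": {
    "prompt": "Change the listed price to 29,900 KRW and publish.",
    "constraints": ["do_not_change_inventory", "save_required"]
  },
  "grader": {
    "type": "state_and_process_v1",
    "success": ["price_field_equals_29900", "publish_state_true"]
  },
  "evidence": {
    "runs": ["pass@1", "pass@3", "pass@5"],
    "audits": ["verifier_fp", "verifier_fn", "known_loopholes"]
  }
}
""")

print(missing_fields(EXAMPLE))
```

A check like this is cheap to run at import time in any target infrastructure, which is one way to keep the export behaving as a contract rather than a loose bundle of files.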

Quality story

A workflow package should not only look realistic. It should carry evidence: task solvability, ambiguity checks, verifier false-positive and false-negative audits, model pass@k distributions, failure traces, and contamination notes.
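Two of these evidence items reduce to simple arithmetic. The sketch below assumes pass@k is computed with the common unbiased estimator, 1 - C(n-c, k) / C(n, k) for n attempts with c passes, and that a verifier audit is a list of (grader verdict, human verdict) pairs. Both the estimator choice and the audit format are assumptions for illustration; the page does not specify how UseDesktop computes them.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled attempts passes.

    n: total attempts recorded, c: attempts that passed, k: sample size.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def verifier_error_rates(audit: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute grader error rates from (grader_passed, human_says_solved) pairs."""
    false_pos = sum(1 for grader, human in audit if grader and not human)
    false_neg = sum(1 for grader, human in audit if not grader and human)
    unsolved = sum(1 for _, human in audit if not human)
    solved = sum(1 for _, human in audit if human)
    return {
        # Share of truly unsolved attempts that the grader passed anyway.
        "verifier_fp": false_pos / unsolved if unsolved else 0.0,
        # Share of truly solved attempts that the grader rejected.
        "verifier_fn": false_neg / solved if solved else 0.0,
    }
```

For example, one pass in five attempts gives `pass_at_k(5, 1, 1)` of 0.2, and an audit with one wrongly passed attempt out of two unsolved ones gives a `verifier_fp` of 0.5.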

env

The app state and runtime boundary the model is placed inside.

task

The goal, constraints, start condition, and expected outcome.

grader

The scoring function that turns a model attempt into quantitative evidence.

run

A model attempt with trace, score, reward, verdict, and failure notes.
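The four terms above can be sketched together as a small grader that turns an attempt into a run record. This is an illustration only: the `Run` class, the `grade` function, and the state keys (`price`, `published`, `inventory`, `saved`) are hypothetical names, and the checks simply mirror the success and constraint identifiers from the example manifest rather than any real grader implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A model attempt: trace plus the evidence a grader attaches to it."""
    trace: list[str]
    score: float = 0.0
    reward: float = 0.0
    verdict: str = "ungraded"
    failure_notes: list[str] = field(default_factory=list)


def grade(run: Run, start_state: dict, final_state: dict) -> Run:
    """Score final state against success checks and constraint violations."""
    # Success conditions, named after the example manifest's grader block.
    checks = {
        "price_field_equals_29900": final_state.get("price") == 29900,
        "publish_state_true": final_state.get("published") is True,
    }
    # Constraints, named after the example manifest's task block.
    violations = []
    if final_state.get("inventory") != start_state.get("inventory"):
        violations.append("do_not_change_inventory")
    if final_state.get("saved") is not True:
        violations.append("save_required")

    passed = all(checks.values()) and not violations
    run.score = sum(checks.values()) / len(checks)  # fraction of checks met
    run.reward = 1.0 if passed else 0.0             # binary reward for RL
    run.verdict = "pass" if passed else "fail"
    run.failure_notes = [name for name, ok in checks.items() if not ok] + violations
    return run
```

Keeping score and reward separate means an attempt that reaches the right final state while breaking a constraint still earns no reward, which is one simple guard against reward hacking.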

Start the quickstart

Build one reviewable package before expanding the dataset.

Capture a workflow

Begin with the source signal, not a synthetic task list.