Desktop control lets an agent drive native apps: Finder, Photos, System Settings, third-party apps with no APIs. It's the most powerful, and the least mature, layer of agent tooling. Use it carefully.
The agent takes a screenshot, reasons about what's on screen, and emits actions: click at (x, y), type text, press keys, scroll. It's pixel-based vision, not DOM-aware.
Anthropic's Computer Use (and similar systems) has tiered permissions:
By default, browsers are tier "read" (the agent can observe, but actual control goes through the Chrome extension), terminals and IDEs are tier "click" (it can click buttons but can't type into the editor), and everything else is tier "full."
This prevents common mistakes like: the agent reads an email, sees a malicious link, and clicks it via desktop control instead of opening it in Chrome, where link-safety rules apply.
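A minimal sketch of how such a tier gate might work. The tier names ("read", "click", "full") follow the scheme above; the app-to-tier table, function names, and matching logic are assumptions for illustration, not a real Computer Use API.

```python
# Hypothetical permission gate: each app category gets a tier, each action
# type requires a minimum tier, and an action is allowed only if the app's
# tier meets the requirement.

TIERS = {"read": 0, "click": 1, "full": 2}

# Minimum tier per action type: observing needs "read", clicking needs
# "click", typing and key presses need "full".
ACTION_TIER = {"screenshot": "read", "click": "click", "type": "full", "key": "full"}

APP_TIERS = {
    "Chrome": "read",      # browsers: observe only; real control via extension
    "Terminal": "click",   # terminals/IDEs: buttons yes, typing no
}

def allowed(app: str, action: str) -> bool:
    """Return True if `action` is permitted in `app` under its tier."""
    tier = APP_TIERS.get(app, "full")  # everything else defaults to "full"
    return TIERS[tier] >= TIERS[ACTION_TIER[action]]
```

Under this gate, `allowed("Chrome", "click")` is false — the malicious-link scenario above is blocked at the permission layer, not left to the model's judgment.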
The core loop:

1. Take a screenshot
2. Reason: what is on screen? What's the next step toward the goal?
3. Emit an action (click, type, key)
4. Wait briefly for UI to update
5. Take screenshot again
6. Repeat
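The loop above can be sketched as follows. The model client and action executor are placeholders (their method names are assumptions); only the control flow, including the max-turns budget discussed below, is the point.

```python
# Minimal sketch of the screenshot -> reason -> act loop.

def run_task(goal: str, model, executor, max_turns: int = 40) -> bool:
    """Drive the loop until the model reports done or the budget runs out."""
    for _ in range(max_turns):
        screenshot = executor.screenshot()            # 1. capture the screen
        step = model.next_action(goal, screenshot)    # 2. reason about next step
        if step.done:
            return True
        executor.perform(step.action)                 # 3. click / type / key
        executor.wait(0.5)                            # 4. let the UI settle
    return False                                      # budget exhausted; stop
```

Returning a boolean instead of raising keeps the caller in control: a `False` means "replan or escalate," not "crash."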
Each screenshot + reason cycle is slow. If you know the next 5 actions, batch them. Computer Use supports a computer_batch tool that executes a sequence without screenshotting between each.
Example: open Spotlight, type "Maps," press Enter, wait. That's 4 actions as one batch.
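The Spotlight example might look like this as one batched payload. The text names the tool (`computer_batch`) but not its schema, so the action shapes below are assumptions:

```python
# Sketch: the four Spotlight actions expressed as a single batch, executed
# without a screenshot between each step. Field names are illustrative.

batch = [
    {"action": "key", "keys": ["cmd", "space"]},   # open Spotlight
    {"action": "type", "text": "Maps"},            # type the app name
    {"action": "key", "keys": ["enter"]},          # launch
    {"action": "wait", "ms": 1000},                # let the app open
]
```

One batch means one round of reasoning instead of four, which is where the speedup comes from.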
Don't click (450, 230). Click "the search icon near the top-right." The model re-locates the element on every step, which makes the action robust to window movement.
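A sketch of targeting by description rather than by coordinates. `locate()` stands in for a vision-model call that finds the element in the current screenshot; it and the click helper are hypothetical names:

```python
# Sketch: resolve a semantic description to fresh coordinates on every call,
# so a moved or resized window doesn't break the action.

def click_element(model, executor, description: str) -> None:
    """Re-locate the described element, then click it."""
    screenshot = executor.screenshot()
    x, y = model.locate(description, screenshot)  # fresh coordinates each time
    executor.click(x, y)
```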
After each action, screenshot and verify. "Did the dialog close? Did the text appear? Did the state change?" If not, retry or replan.
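The act-then-verify pattern can be sketched like this. `verify` stands in for a check against the new screenshot (a diff, or a question to the model); the helper names are assumptions:

```python
# Sketch: perform an action, confirm the UI changed as expected, retry once.

def act_with_check(executor, action, verify, retries: int = 1) -> bool:
    """Act, verify the expected change happened, retry if it didn't."""
    for _ in range(retries + 1):
        executor.perform(action)
        executor.wait(0.5)
        if verify(executor.screenshot()):  # e.g. "did the dialog close?"
            return True
    return False  # still failing: caller should replan, not keep hammering
```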
Apps can hang. Set a max-turns budget for any desktop task. If the goal isn't reached, stop.
Screenshots are images. Images are expensive (both in tokens and in inference time). A 50-step desktop task can cost 10× a comparable DOM-based task.
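A back-of-envelope sketch of why the images dominate. The per-screenshot token count below is an illustrative assumption (on the order of a thousand tokens for a downscaled screenshot); real numbers vary by model and resolution:

```python
# Rough image-token cost of a desktop task: one screenshot per step,
# reasoning text excluded for simplicity. 1200 tokens/image is an assumption.

def image_tokens_total(steps: int, tokens_per_screenshot: int = 1200) -> int:
    """Total image tokens for a task that screenshots once per step."""
    return steps * tokens_per_screenshot

# A 50-step desktop task: 50 * 1200 = 60,000 image tokens before any text
# is counted. A comparable DOM-based task sends zero images.
```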
Mitigations (all drawn from the practices above):

- Batch actions to cut the number of screenshot-and-reason cycles.
- Cap every task with a max-turns budget so a stuck task can't burn tokens indefinitely.
- Prefer cheaper layers (API, CLI, DOM-based browser control) whenever they can do the job.
Desktop control is the agent tool of last resort. Exhaust the alternatives first:

1. An official API or SDK, if the app has one.
2. A CLI or scripting interface.
3. DOM-based browser control, if the target is a web app.
4. Only then, desktop control.
The reason: every tier above is faster, cheaper, and more reliable. Desktop is the "can do anything" option, which is also the "can break anything" option.