Desktop control

Desktop control is the "go anywhere" level of agent tooling. If you can do it with a mouse and keyboard, the agent can do it - Finder, Photos, System Settings, third-party apps with no APIs, everything. It's powerful and it's the least mature layer of the stack. Use it when nothing else works, and know the failure modes before you turn it on.

How it actually works.

The agent does the same thing a human would: it looks at the screen, decides what to do, and does it. Concretely, each cycle is:

~ desktop control loop ~

Unlike browser automation, this is purely pixel-based. The agent doesn't have DOM access. It sees an image; it reasons about what's in the image. Slower, more expensive, and more error-prone than DOM work, but it works on ANYTHING visible on your screen.
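One cycle of that loop can be sketched in a few lines. This is a minimal illustration, not any real SDK's API: the `Action` shape and the three callables (`take_screenshot`, `decide`, `perform`) are hypothetical stand-ins for your screenshot tool, your model call, and your input driver.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """One step the agent wants to take (hypothetical shape)."""
    kind: str          # e.g. "click", "type", "key"
    target: str = ""   # plain-language description of what to act on
    text: str = ""     # text to type, if any


def run_cycle(
    take_screenshot: Callable[[], bytes],
    decide: Callable[[bytes, str], Optional[Action]],
    perform: Callable[[Action], None],
    goal: str,
) -> Optional[Action]:
    """Run one observe -> decide -> act cycle.

    Returns the action that was performed, or None if the model
    decided there is nothing left to do.
    """
    image = take_screenshot()        # observe: pixels only, no DOM
    action = decide(image, goal)     # reason: the model call goes here
    if action is not None:
        perform(action)              # act: mouse or keyboard input
    return action
```

Passing the three steps in as functions keeps the sketch testable with fakes, and it makes the key constraint visible: `decide` gets an image and a goal, nothing else.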

When to reach for desktop control, and when not to.

~ reach for it / don't reach for it ~

The tiered permission model.

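Anthropic's Computer Use (and similar systems) uses tiered permissions per application, because the damage an agent can do varies wildly by app. There are three tiers: "read" (the agent can observe the app but not act in it), "click" (it can click but not type), and "full" (no restrictions).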

The default tiers matter more than you'd think. Browsers default to "read" - the agent can observe what's in your browser but can't click things there. That's on purpose: for actual browser control, you should be using the Chrome extension (DOM-aware) rather than blindly clicking pixels. Terminals and IDEs default to "click" - the agent can click buttons in your editor but can't type code directly. For code, use the Edit tool or shell commands. Everything else defaults to "full."

These defaults prevent a common failure: the agent reads a suspicious email, sees a scary link, and clicks it via pixel-based desktop control - bypassing all the link-safety rules that would have kicked in if it used the Chrome extension instead. The tier system catches this automatically.
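To make the defaults concrete, here is a rough sketch of a tier check. It assumes the tiers are strictly ordered (read < click < full) and that each action kind needs a minimum tier. The category names, the action names, and the `is_allowed` helper are all made up for illustration; a real system's configuration will look different.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ = 1   # observe only
    CLICK = 2  # observe and click, no typing
    FULL = 3   # observe, click, and type


# Minimum tier each action kind needs (assumed mapping).
REQUIRED_TIER = {
    "screenshot": Tier.READ,
    "click": Tier.CLICK,
    "type": Tier.FULL,
}

# Defaults from the text: browsers read-only, terminals and IDEs click-only.
DEFAULT_TIER = {
    "browser": Tier.READ,
    "terminal": Tier.CLICK,
    "ide": Tier.CLICK,
}


def tier_for(category: str) -> Tier:
    """Everything not listed explicitly defaults to full access."""
    return DEFAULT_TIER.get(category, Tier.FULL)


def is_allowed(category: str, action: str) -> bool:
    """Allow an action only if the app's tier covers it; unknown actions are denied."""
    required = REQUIRED_TIER.get(action)
    return required is not None and tier_for(category) >= required
```

With these defaults, `is_allowed("browser", "click")` is false, which is exactly the suspicious-link case: the agent can see the email in the browser, but the pixel-level click never happens.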

Reliability patterns.

Visual anchors, not raw coordinates.

Never hard-code "click at (450, 230)." Click "the search icon near the top-right." The model re-locates the icon on each screenshot, which means the agent survives window-position changes, resolution changes, theme changes, all the things that would break coordinate-based automation.
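In code, the difference is where the coordinates come from. The sketch below takes a fresh screenshot, looks up the element by description, and only then clicks. The `locate` callable is a hypothetical stand-in for whatever does the visual grounding (usually the model itself); nothing here is a real library call.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def click_anchor(
    take_screenshot: Callable[[], bytes],
    locate: Callable[[bytes, str], Optional[Point]],
    click: Callable[[Point], None],
    description: str,
) -> Point:
    """Find the described element on a fresh screenshot, then click it."""
    image = take_screenshot()             # always look at the current screen
    point = locate(image, description)    # coordinates come from the image, not a constant
    if point is None:
        # Fail loudly instead of clicking a stale or guessed position.
        raise LookupError(f"could not find on screen: {description}")
    click(point)
    return point
```

Because the position is recomputed on every call, moving the window between two calls changes where the click lands without any change to the calling code.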

State verification after each action.

After every action, take a screenshot and verify that it worked. "Did the dialog close? Did the text appear? Did the state change?" If not, retry or replan. If you plow forward assuming the click succeeded, you get agents that hallucinate whole multi-step sequences that never actually happened.
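A minimal version of that check, assuming you can express "it worked" as a function that inspects a fresh screenshot and returns true or false. Both callables are hypothetical placeholders for your own action and your own check.

```python
from typing import Callable


def act_and_verify(
    perform: Callable[[], None],
    verify: Callable[[], bool],
    retries: int = 2,
) -> bool:
    """Perform an action, check the result, and retry a bounded number of times.

    Returns True as soon as verify() passes, False if every attempt fails.
    """
    for _ in range(retries + 1):
        perform()
        if verify():      # e.g. screenshot again and ask "did the dialog close?"
            return True
    return False          # caller should replan rather than assume success
```

Blind retries are only safe for actions that are harmless to repeat, like reopening a menu. For anything with side effects (sending, deleting, paying), a failed check should go straight to replanning.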

Batched actions for known sequences.

Each screenshot+reason cycle is expensive (maybe 2-4 seconds). If you know the next 5 actions, batch them. Computer Use supports a computer_batch that runs a sequence without screenshotting between each step. "Open Spotlight, type 'Maps,' press Enter, wait" is 4 actions in one batch. Big speedup.
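The arithmetic behind the speedup is simple enough to write down. This back-of-the-envelope sketch counts only the screenshot+reason overhead and ignores the time the actions themselves take; the 3-second figure is just the midpoint of the range above, not a measurement.

```python
import math


def cycle_overhead_seconds(actions: int, batch_size: int, seconds_per_cycle: float) -> float:
    """Total screenshot+reason overhead when actions are grouped into batches."""
    if actions < 0 or batch_size < 1:
        raise ValueError("actions must be >= 0 and batch_size >= 1")
    cycles = math.ceil(actions / batch_size)   # one cycle per batch
    return cycles * seconds_per_cycle


# The Spotlight example: 4 actions, assumed 3 seconds per cycle.
unbatched = cycle_overhead_seconds(4, batch_size=1, seconds_per_cycle=3.0)  # 4 cycles
batched = cycle_overhead_seconds(4, batch_size=4, seconds_per_cycle=3.0)    # 1 cycle
```

Under those assumptions the overhead drops from 12 seconds to 3. The trade-off is that nothing is verified between the steps of a batch, so keep batches to sequences you already know are reliable.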

Hard timeouts.

Desktop apps can hang, modals can appear unexpectedly, things freeze. Set a max-turns budget for any desktop task. If the goal isn't reached in N steps, stop. Better to fail loudly than to spin forever.
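One way to enforce the budget is to wrap the task in a loop that raises when it runs out. In this sketch `step` is a hypothetical callable that runs one turn and returns true once the goal is reached; `BudgetExceeded` is a made-up exception name.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a desktop task uses up its turn budget."""


def run_with_budget(step: Callable[[], bool], max_turns: int) -> int:
    """Call step() until it reports success; raise after max_turns attempts.

    Returns the number of turns used on success.
    """
    for turn in range(1, max_turns + 1):
        if step():
            return turn
    # Fail loudly so the caller sees the task did not finish.
    raise BudgetExceeded(f"goal not reached after {max_turns} turns")
```

Raising an exception, rather than quietly returning, makes it hard for the calling code to mistake a stalled task for a finished one.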

Safety. Non-negotiable things.

~ desktop control never-list ~

Cost. Desktop control is expensive.

Every screenshot is an image. Images cost a lot more tokens than text (roughly 10-20× per "turn" of equivalent information). A 50-step desktop task can easily cost 10× a comparable DOM-based task. Budget accordingly.
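To see how the numbers stack up, here is the same claim as arithmetic. Every constant is an assumption chosen for illustration: 500 text tokens per DOM step is invented, and the multiplier is the low end of the 10-20× range above. Plug in your own measurements.

```python
def task_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens for a task with a fixed per-step cost."""
    return steps * tokens_per_step


TEXT_TOKENS_PER_STEP = 500   # assumed cost of one DOM-based step
IMAGE_MULTIPLIER = 10        # low end of the 10-20x range in the text
STEPS = 50

dom_cost = task_tokens(STEPS, TEXT_TOKENS_PER_STEP)
desktop_cost = task_tokens(STEPS, TEXT_TOKENS_PER_STEP * IMAGE_MULTIPLIER)
ratio = desktop_cost / dom_cost
```

With these made-up inputs, the 50-step task goes from 25,000 tokens to 250,000. The per-step multiplier carries straight through to the total, which is why cutting the number of screenshots matters more than anything else.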

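Mitigations that help: batch known sequences so you take fewer screenshots, cap the number of turns per task, and switch to a cheaper tool whenever one can do the job.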

The last-resort principle.

Desktop control is the agent tool of last resort. Before reaching for it, exhaust every alternative:

~ try these first, in order ~

The reason: every tier above desktop control is faster, cheaper, and more reliable. Desktop control is the "can do anything" option, which also makes it the "can break anything" option. Use it when the power is necessary - not as a default.

That said: when you do need it, nothing else can do what it does. An agent that can drive any app on your desktop, combine apps across workflows, operate on whatever is in front of it right now - that's a real unlock. Just reach for it with eyes open.