Browser automation

Sometimes the agent's work lives in a browser. Filling forms, clicking through apps, scraping data. Browser automation is how agents reach into the web beyond APIs.

Two fundamental approaches

1. DOM-aware (headless or extension)

The agent reads the page's HTML/DOM. It sees structured elements (buttons, inputs, links with their attributes), picks one, and acts. Faster, more reliable, less visually flashy.

Examples: Playwright, Puppeteer, Claude-in-Chrome extension.
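The "reads structured elements" idea can be sketched with nothing but the standard library: walk the HTML and collect the actionable elements with their attributes. (A real DOM-aware agent would use Playwright or the browser's accessibility tree; this is just the shape of the data it sees.)

```python
from html.parser import HTMLParser

class ActionableElements(HTMLParser):
    """Collect the elements an agent can act on: buttons, inputs, links."""
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in ("button", "input", "a", "select", "textarea"):
            self.elements.append({"tag": tag, **dict(attrs)})

html = '<form><input name="q" type="text"><button id="go">Search</button></form>'
parser = ActionableElements()
parser.feed(html)
print(parser.elements)
# [{'tag': 'input', 'name': 'q', 'type': 'text'}, {'tag': 'button', 'id': 'go'}]
```

The agent picks from this structured list ("the button with id `go`") rather than guessing at pixels.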

2. Pixel-based (vision)

The agent sees a screenshot. The model reasons about what's on screen and returns coordinates to click. Works on anything visible, but slower and more expensive (images take tokens).

Examples: Anthropic's Computer Use, OpenAI's Operator.
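The pixel-based loop is the same three steps every iteration: capture, reason, click. A minimal sketch, with `screenshot()`, `vision_model()`, and `click()` as hypothetical stand-ins for a real capture library and vision-model API:

```python
def screenshot() -> bytes:
    return b"...png bytes..."   # stand-in: capture the screen

def vision_model(image: bytes, goal: str) -> tuple[int, int]:
    return (640, 360)           # stand-in: model looks at the image, returns coordinates

clicks = []

def click(x: int, y: int) -> None:
    clicks.append((x, y))       # stand-in: dispatch an OS-level click

def act(goal: str) -> None:
    # One iteration: look at the screen, ask the model where to click, click there.
    image = screenshot()
    x, y = vision_model(image, goal)
    click(x, y)

act("press the Submit button")
print(clicks)  # [(640, 360)]
```

Every iteration ships a full screenshot to the model, which is where the token cost comes from.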

Which to pick

Default to DOM-aware: it is faster, cheaper, and more robust whenever the page exposes a usable DOM. Fall back to pixel-based when it does not, e.g. canvas-heavy apps, embedded documents, or remote desktops where there is no structure to read.

Claude-in-Chrome pattern

Install a Chrome extension. Claude Code or Claude.ai connects to the extension. The extension exposes tools: navigate, read page, find element, click, type, run JavaScript.
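The tool surface can be sketched as a dispatch table. The tool names come from the list above; the handlers here are hypothetical stubs, not the extension's actual implementation:

```python
def navigate(url): return f"navigated to {url}"
def read_page(): return "<html>...</html>"
def find_element(description): return {"ref": "el-1", "description": description}
def click(ref): return f"clicked {ref}"
def type_text(ref, text): return f"typed {text!r} into {ref}"
def run_js(code): return "(result of evaluating code in the page)"

TOOLS = {
    "navigate": navigate,
    "read_page": read_page,
    "find_element": find_element,
    "click": click,
    "type": type_text,
    "run_javascript": run_js,
}

def handle_tool_call(name, **kwargs):
    # The model emits a tool name plus arguments; the extension executes it
    # in the live browser and returns the result.
    return TOOLS[name](**kwargs)

print(handle_tool_call("find_element", description="the Submit button"))
```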

Why it wins for many use cases:

- It drives the user's real browser, so existing logins, cookies, and extensions just work.
- It gets full DOM access without standing up a separate headless environment.
- The user can watch the agent work, intervene, and correct it mid-task.

Limits:

- It needs Chrome open on a user's machine, so it doesn't fit headless, server-side runs.
- It acts with the user's real session and privileges, so mistakes are real mistakes.
- Pages the agent reads are untrusted input and can attempt prompt injection.

Design patterns

Find-then-act

Don't click coordinates. Find an element by natural-language description ("the Submit button"), get back a reference, then act on the reference. More robust to layout changes.
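A minimal sketch of the indirection, over a fake page. `find()` resolves a natural-language description to a stable reference; `act`-style calls take the reference, never coordinates. All names here are illustrative, not a real library API:

```python
PAGE = {
    "el-1": {"role": "button", "label": "Submit"},
    "el-2": {"role": "textbox", "label": "Email"},
}

def find(description: str) -> str:
    """Return a reference to the best-matching element. A real agent would
    let the model do this matching against the DOM or accessibility tree."""
    for ref, el in PAGE.items():
        if el["label"].lower() in description.lower():
            return ref
    raise LookupError(f"no element matching {description!r}")

def click(ref: str) -> str:
    return f"clicked {PAGE[ref]['label']}"

ref = find("the Submit button")   # layout can move; the reference still resolves
print(click(ref))                 # clicked Submit
```

If the button moves 40 pixels down in a redesign, coordinates break; "the Submit button" does not.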

Wait-for-change

After an action, wait for the page to update. Don't plow forward assuming the click succeeded. Verify state changed (URL, DOM, network).
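One way to sketch this is a polling helper: snapshot some observable state (URL, a DOM hash, in-flight request count), act, then wait until the snapshot changes or a timeout fires. The helper below is an illustration, not a specific library's API:

```python
import time

def wait_for_change(snapshot, timeout=10.0, interval=0.25):
    """Poll snapshot() until it returns something different from its
    initial value; raise TimeoutError if nothing changes in time."""
    before = snapshot()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        time.sleep(interval)
        now = snapshot()
        if now != before:
            return now
    raise TimeoutError("page state did not change")

# Illustrative use, with a fake URL snapshot that changes on the third poll:
states = iter(["/login", "/login", "/login", "/dashboard"])
print(wait_for_change(lambda: next(states), timeout=2.0, interval=0.01))
# /dashboard
```

A timeout is the honest failure mode: it tells the agent the click probably didn't land, instead of letting it act on a stale page.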

Reuse the session

Log in once; reuse the session for all subsequent tasks. Logging in every time is slow, error-prone, and may trigger bot detection.
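The shape of the pattern, sketched as cookie caching (with Playwright the equivalent is saving and restoring `storage_state`). `login()` here is a hypothetical stub for the slow, detection-prone flow you want to run only once:

```python
import json
import pathlib
import tempfile

SESSION_FILE = pathlib.Path(tempfile.gettempdir()) / "agent-session.json"

def login() -> dict:
    return {"session": "abc123"}   # stand-in: the expensive interactive login

def get_session() -> dict:
    if SESSION_FILE.exists():
        return json.loads(SESSION_FILE.read_text())   # reuse: no login at all
    cookies = login()                                  # first run only
    SESSION_FILE.write_text(json.dumps(cookies))
    return cookies

first = get_session()    # logs in and caches
second = get_session()   # served from cache
print(first == second)   # True
```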

Record-replay

For flows you'll run often, record the sequence once (in a test), replay with parameters. Faster and more reliable than "figure it out each time."
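A recorded flow can be as simple as a parameterized step list: record once, substitute the variables, replay. The step and selector names below are illustrative:

```python
RECORDED_FLOW = [
    ("navigate", "https://example.com/login"),
    ("type", "#email", "{email}"),
    ("type", "#password", "{password}"),
    ("click", "#submit"),
]

def replay(flow, params, execute):
    """Fill params into the recorded steps, then hand each step to
    execute() (which would drive the real browser)."""
    for step in flow:
        filled = tuple(part.format(**params) for part in step)
        execute(filled)

log = []
replay(RECORDED_FLOW, {"email": "a@b.c", "password": "hunter2"}, log.append)
print(log[1])  # ('type', '#email', 'a@b.c')
```

The model only has to "figure out" the flow the first time; after that, every run is deterministic.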

Common problems

- Flaky selectors: layouts change and hard-coded selectors break (find-then-act helps).
- Timing races: acting before the page finishes updating (wait-for-change helps).
- Bot detection: repeated logins and robotic interaction patterns get flagged.
- Hidden structure: iframes, shadow DOM, and popups that a naive DOM walk misses.

Safety for browser automation

The browser is an attack surface. Page content is untrusted input: a malicious page can try to prompt-inject the agent into leaking data or taking actions on its behalf. Scope the agent to an allowlist of domains, and require human confirmation before destructive or irreversible actions (purchases, deletions, sending messages).

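Two safety gates are cheap to build and worth having: a domain allowlist checked before every navigation, and a confirmation requirement for destructive actions. A minimal sketch with illustrative names:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "app.example.com"}   # assumption: your targets
DESTRUCTIVE = {"delete", "purchase", "send"}           # actions needing a human OK

def check_navigation(url: str) -> bool:
    """Refuse to navigate anywhere off the allowlist."""
    return urlparse(url).hostname in ALLOWED_DOMAINS

def check_action(action: str, confirmed: bool = False) -> bool:
    """Destructive actions require explicit human confirmation."""
    if action in DESTRUCTIVE and not confirmed:
        return False   # pause and ask the human first
    return True

print(check_navigation("https://example.com/settings"))  # True
print(check_navigation("https://evil.example.net/"))     # False
print(check_action("delete"))                            # False
print(check_action("delete", confirmed=True))            # True
```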
When not to use browser automation

If the target has an API, use the API. Browser automation is slower, more fragile, and more attack-prone. Reach for it only when the API doesn't exist or doesn't cover what you need.