Browser automation
5 min read · Updated 2026-04-18
Sometimes the agent's work lives in a browser. Filling forms, clicking through apps, scraping data. Browser automation is how agents reach into the web beyond APIs.
Two fundamental approaches
1. DOM-aware (headless or extension)
The agent reads the page's HTML/DOM. It sees structured elements (buttons, inputs, links with their attributes), picks one, and acts. Faster, more reliable, less visually flashy.
Examples: Playwright, Puppeteer, Claude-in-Chrome extension.
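To make the DOM-aware idea concrete, here is a minimal sketch using only Python's standard-library HTML parser: the agent gets structured elements (tag, attributes, text) and picks one by matching text or `aria-label`, never by pixels. The element shape and function names are illustrative, not any real library's API.

```python
from html.parser import HTMLParser

class ClickableCollector(HTMLParser):
    """Collect clickable elements with their attributes and visible text."""
    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("button", "a", "input"):
            self._current = {"tag": tag, "attrs": dict(attrs), "text": ""}
            self.elements.append(self._current)

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("button", "a"):
            self._current = None

def find_clickable(html: str, label: str):
    """Return the first clickable element whose text or aria-label matches."""
    collector = ClickableCollector()
    collector.feed(html)
    for el in collector.elements:
        haystack = (el["text"] + " " + el["attrs"].get("aria-label", "")).lower()
        if label.lower() in haystack:
            return el
    return None
```

A real DOM-aware stack (Playwright, Puppeteer) does this against the live DOM with far richer selectors, but the principle is the same: act on structure, not on pixels.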
2. Pixel-based (vision)
The agent sees a screenshot. The model reasons about what's on screen and returns coordinates to click. Works on anything visible, but slower and more expensive (images take tokens).
Examples: Anthropic's Computer Use, OpenAI's Operator.
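In the pixel-based loop, the model looks at a screenshot and proposes a coordinate to click; the agent should validate that proposal before dispatching it. A sketch of that validation step, assuming a made-up `{"coordinate": [x, y]}` input shape (not any vendor's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ClickAction:
    x: int
    y: int

def parse_click(tool_input: dict, width: int, height: int) -> ClickAction:
    """Validate a model-proposed click against the screenshot bounds.

    Vision models occasionally hallucinate coordinates, so bounds-check
    before sending the click to the OS or browser.
    """
    x, y = tool_input["coordinate"]
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"click ({x}, {y}) outside {width}x{height} screenshot")
    return ClickAction(x, y)
```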
Which to pick
- DOM-aware if the target is a standard web app and you care about reliability
- Pixel-based if the target uses heavy canvas (Figma, games) or if you need to work across multiple apps without writing selectors
- Hybrid for production work. DOM by default, fall back to vision when DOM fails
Claude-in-Chrome pattern
Install a Chrome extension. Claude Code or Claude.ai connects to the extension. The extension exposes tools: navigate, read page, find element, click, type, run JavaScript.
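The tool surface might look something like the following. This is an illustrative sketch of the shape of such a tool list, not the extension's actual schema:

```python
# Hypothetical tool definitions a browser extension could expose to the model.
BROWSER_TOOLS = [
    {"name": "navigate",       "input": {"url": "string"}},
    {"name": "read_page",      "input": {}},
    {"name": "find_element",   "input": {"description": "string"}},
    {"name": "click",          "input": {"element_ref": "string"}},
    {"name": "type",           "input": {"element_ref": "string", "text": "string"}},
    {"name": "run_javascript", "input": {"code": "string"}},
]

def tool_names() -> list[str]:
    return [t["name"] for t in BROWSER_TOOLS]
```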
Why it wins for many use cases:
- Uses the user's logged-in sessions, no OAuth dance
- DOM-aware, so selection is structured
- Can execute arbitrary JavaScript in the page context
- No separate browser process, just another tab
Limits:
- Tied to Chrome (no Firefox/Safari)
- Extension has to be running
- Some sites detect and block automation
Design patterns
Find-then-act
Don't click coordinates. Find an element by natural-language description ("the Submit button"), get back a reference, then act on the reference. More robust to layout changes.
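A minimal sketch of the two-step pattern, with an in-memory element store standing in for the live page (the store, refs, and functions are all illustrative):

```python
# Resolve a description to a stable reference, then act on the reference.
PAGE = {
    "ref_1": {"tag": "button", "text": "Submit order"},
    "ref_2": {"tag": "a", "text": "Cancel"},
}

def find_element(description: str) -> str:
    """Map a natural-language description to an element reference."""
    words = description.lower().split()
    for ref, el in PAGE.items():
        if all(w in el["text"].lower() for w in words):
            return ref
    raise LookupError(description)

def click(ref: str) -> str:
    assert ref in PAGE  # act on the reference, never on raw coordinates
    return f"clicked {PAGE[ref]['tag']}:{ref}"
```

Because the reference is resolved fresh each run, the flow survives layout changes that would break a hard-coded coordinate or a generated class name.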
Wait-for-change
After an action, wait for the page to update. Don't plow forward assuming the click succeeded. Verify state changed (URL, DOM, network).
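The wait can be expressed as a small polling loop over any observable state (URL, a DOM hash, a network-idle flag). A stdlib-only sketch; `get_state` is a caller-supplied probe:

```python
import time

def wait_for_change(get_state, old_state, timeout: float = 5.0,
                    interval: float = 0.05):
    """Poll until the observed page state differs from old_state.

    Raises TimeoutError if nothing changes, which usually means the
    action silently failed and the agent should not plow forward.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state != old_state:
            return state
        time.sleep(interval)
    raise TimeoutError("page state did not change; the action may have failed")
```

Real frameworks offer richer primitives (Playwright's auto-waiting, for example), but the verify-before-continue discipline is the point.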
Reuse the session
Log in once; reuse the session for all subsequent tasks. Don't log in every time: it's slow, error-prone, and may trigger bot detection.
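Concretely, this usually means persisting cookies (or a full storage state, as Playwright's `storage_state` does) to disk after the first login and loading it on later runs. A minimal file-backed sketch with illustrative function names:

```python
import json
import os
from typing import Optional

def save_session(cookies: list, path: str) -> None:
    """Persist session cookies after a successful (human-driven) login."""
    with open(path, "w") as f:
        json.dump({"cookies": cookies}, f)

def load_session(path: str) -> Optional[list]:
    """Reuse a saved session instead of logging in again."""
    if not os.path.exists(path):
        return None  # first run: have the user log in, then save
    with open(path) as f:
        return json.load(f)["cookies"]
```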
Record-replay
For flows you'll run often, record the sequence once (in a test), replay with parameters. Faster and more reliable than "figure it out each time."
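A recorded flow can be as simple as a list of steps with placeholders that get filled in at replay time. The step format, refs, and URL here are illustrative:

```python
RECORDED_FLOW = [  # captured once, e.g. while stepping through a test
    {"tool": "navigate", "url": "https://example.com/invoices/new"},
    {"tool": "type", "ref": "amount_field", "text": "{amount}"},
    {"tool": "click", "ref": "submit_button"},
]

def replay(flow: list, params: dict, execute):
    """Replay a recorded flow, substituting parameters into each step."""
    results = []
    for step in flow:
        concrete = {k: (v.format(**params) if isinstance(v, str) else v)
                    for k, v in step.items()}
        results.append(execute(concrete))
    return results
```

`execute` would dispatch each concrete step to the browser tools; during development it can just be a logger.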
Common problems
- Dynamic class names. "button_3xY9z" changes on every deploy. Prefer semantic selectors (role, aria-label, text).
- Iframes. Content in iframes needs explicit frame-switch. Easy to miss.
- Login walls + captchas. If the site requires captcha, stop. Don't defeat captchas. Get human help.
- Rate limiting. Automation gets flagged. Slow down. Use real browser fingerprints.
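The dynamic-class-name problem suggests a preference order when building selectors: test IDs and ARIA attributes first, visible text next, generated class names only as a last resort. A sketch (the `text="…"` form mirrors Playwright's text selector; the element dict is illustrative):

```python
def best_selector(el: dict) -> str:
    """Prefer stable, semantic selectors over generated class names."""
    attrs = el.get("attrs", {})
    tid = attrs.get("data-testid")
    if tid:
        return f'[data-testid="{tid}"]'
    label = attrs.get("aria-label")
    if label:
        return f'[aria-label="{label}"]'
    text = el.get("text")
    if text:
        return f'text="{text}"'  # Playwright-style text selector
    # Last resort: brittle generated class like button_3xY9z.
    return "." + attrs.get("class", "selector-unknown").split()[0]
```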
Safety for browser automation
- Never click links in emails or messages with computer use. Open them in the browser via explicit navigation to the full URL; this avoids link-based injection attacks.
- Always show the full URL before navigating. Visible text can be misleading.
- Never enter credentials. The user logs in; the agent reuses the session.
- Never enter credit cards. Full stop.
- Watch for prompt injection. A page the agent reads can include hidden text saying "Ignore previous instructions." Claude's safety rules resist this, but don't rely solely on them.
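The navigation rules above can be enforced mechanically: surface the full URL and refuse surprising schemes or hosts before the agent navigates. A sketch with a hypothetical allowlist:

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"app.example.com"}  # illustrative allowlist

def check_navigation(url: str) -> str:
    """Return the full URL only if its scheme and host pass basic checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"refusing non-web scheme: {url}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host {parsed.hostname!r} not on the allowlist: {url}")
    return url  # log/show this full URL to the user before navigating
```

This catches `javascript:` links and lookalike domains that visible link text would hide.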
When not to use browser automation
If the target has an API, use the API. Browser automation is slower, more fragile, and more attack-prone. Reach for it only when the API doesn't exist or doesn't cover what you need.