Browser agent
Updated 2026-04-19
A browser agent drives a real browser: it navigates to URLs, clicks, types, scrolls, and reads rendered content. Useful for tasks that a site doesn't expose through an API. Powerful. Slow. Fragile.
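The action vocabulary above (navigate, click, type, scroll, read) can be sketched as a small dispatcher. This is a minimal illustration, not a full agent: `Action` and `run_action` are hypothetical names, and the `page` argument is assumed to behave like a Playwright sync `Page` (`goto`, `click`, `fill`, `mouse.wheel`, `inner_text`).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Action:
    """One step chosen by the agent (hypothetical schema for illustration)."""
    kind: str          # "navigate" | "click" | "type" | "scroll" | "read"
    target: str = ""   # URL for "navigate", CSS selector otherwise
    text: str = ""     # text to enter, only used by "type"
    delta_y: int = 0   # pixels to scroll vertically, only used by "scroll"


def run_action(page: Any, action: Action) -> Optional[str]:
    """Execute one action on a Playwright-like page; return text for "read"."""
    if action.kind == "navigate":
        page.goto(action.target)
    elif action.kind == "click":
        page.click(action.target)
    elif action.kind == "type":
        page.fill(action.target, action.text)
    elif action.kind == "scroll":
        page.mouse.wheel(0, action.delta_y)
    elif action.kind == "read":
        # Default to the whole page body when no selector is given.
        return page.inner_text(action.target or "body")
    else:
        raise ValueError(f"unknown action kind: {action.kind!r}")
    return None
```

With Playwright installed, `page` would typically come from `sync_playwright()`, then `chromium.launch()` and `new_page()`. In a real agent, a model would choose the next `Action` after each observation.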
Use cases
- Scraping data from sites without APIs
- Filling out forms
- Booking appointments and reservations
- Research requiring dynamic content
- Testing web applications
Tooling
- Playwright / Puppeteer: browser automation libraries, commonly run headless
- Claude Computer Use: Anthropic's screenshot-based computer control, usable for driving a browser
- OpenAI Operator: OpenAI's agent that operates a browser
- browser-use, Skyvern, Browserbase: higher-level frameworks and services
Vision vs DOM
Two control paradigms:
- Vision: screenshot the page and let a vision LLM decide where to click, by pixel coordinates. Flexible, slow.
- DOM: extract the accessibility tree or DOM so the agent works with structured elements. Faster, but more fragile when the page is dynamic.
Hybrid systems use both.
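One way a hybrid might be arranged is DOM-first with a vision fallback: try to resolve the target among structured elements, and only pay for a screenshot call when that fails. A sketch under stated assumptions: `Element`, `find_in_dom`, `locate`, and the `vision_locate` callback are all hypothetical, and a real system would match on roles and richer attributes rather than exact labels.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]


@dataclass
class Element:
    """Simplified accessibility-tree node (hypothetical shape)."""
    role: str
    name: str
    center: Point


def find_in_dom(elements: Sequence[Element], label: str) -> Optional[Point]:
    """DOM path: case-insensitive match on the accessible name."""
    wanted = label.strip().lower()
    for el in elements:
        if el.name.strip().lower() == wanted:
            return el.center
    return None


def locate(
    elements: Sequence[Element],
    label: str,
    vision_locate: Callable[[str], Optional[Point]],
) -> Tuple[Optional[Point], str]:
    """Return the click point and which path produced it."""
    point = find_in_dom(elements, label)
    if point is not None:
        return point, "dom"
    # Fallback: ask a vision model (stubbed as a callback) for pixel coordinates.
    point = vision_locate(label)
    if point is not None:
        return point, "vision"
    return None, "none"
```

Returning the path alongside the point makes it easy to log how often the slower vision fallback is actually needed.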
Challenges
Page reliability
Many modern sites are single-page applications (SPAs) that load content asynchronously, and many add anti-bot protections and CAPTCHAs. Agents break often.
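Because an element on an asynchronously loading page may not exist yet when the agent acts, a common mitigation is to retry a step with backoff instead of failing on the first miss. A minimal sketch: `with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary placeholders. Retries do nothing about CAPTCHAs or anti-bot blocks.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    step: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return step()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting.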
Latency
Each action takes seconds. Agent sessions are measured in minutes, not seconds.
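A back-of-envelope estimate shows how per-step seconds add up to minutes. The figures are illustrative assumptions, not measurements, and `session_seconds` is a hypothetical helper.

```python
def session_seconds(steps: int, model_s: float, browser_s: float) -> float:
    """Total wall-clock time when each step waits on the model, then the page."""
    return steps * (model_s + browser_s)


# Assumed: 40 steps, 3 s of model inference and 1.5 s of page settling per step.
total = session_seconds(steps=40, model_s=3.0, browser_s=1.5)
print(f"{total:.0f} s = {total / 60:.1f} min")  # prints "180 s = 3.0 min"
```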
Cost
Vision LLM calls on screenshots are expensive, and each screenshot adds heavily to the context the model must process.
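A rough estimate makes the cost concrete. Every number here is a placeholder assumption: tokens per screenshot and prices vary by model and image size, so substitute your provider's actual figures. `session_cost_usd` is a hypothetical helper, and it simplifies by assuming the text context stays constant per step, whereas in practice it tends to grow as history accumulates.

```python
def session_cost_usd(
    steps: int,
    tokens_per_screenshot: int,
    context_tokens: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Input-token cost when every step sends one screenshot plus text context."""
    tokens_per_step = tokens_per_screenshot + context_tokens
    return steps * tokens_per_step * usd_per_million_input_tokens / 1_000_000


# Assumed: 40 steps, 1,500 tokens per screenshot, 3,500 tokens of context,
# and a made-up price of $3 per million input tokens.
cost = session_cost_usd(40, 1_500, 3_500, 3.0)
print(f"${cost:.2f} per session")  # prints "$0.60 per session"
```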
Authentication
Tasks behind a login need credential handling, which creates a large attack surface.
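One way to shrink that surface is to keep secrets out of the model's context entirely: the agent plans with placeholders, and the executor substitutes real values only at the moment of typing. A sketch with hypothetical names throughout (`resolve_secrets`, `redact`, the `{{NAME}}` placeholder syntax), assuming secrets are supplied through environment variables.

```python
import os
import re
from typing import List, Mapping

_PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def resolve_secrets(text: str, env: Mapping[str, str] = os.environ) -> str:
    """Replace {{NAME}} with env[NAME]; fail loudly if a secret is missing."""
    def _sub(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"secret {name} is not set")
        return env[name]
    return _PLACEHOLDER.sub(_sub, text)


def redact(text: str, env: Mapping[str, str], names: List[str]) -> str:
    """Mask secret values before page text or logs go back to the model."""
    for name in names:
        value = env.get(name)
        if value:
            text = text.replace(value, "{{" + name + "}}")
    return text
```

For example, the model might emit "type {{SITE_PASSWORD}} into #password"; the executor calls `resolve_secrets` just before typing, and `redact` is applied to anything observed on the page before it re-enters the prompt.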
When to use
- No API exists for the site
- Task is high-value enough to justify latency and cost
- You have a sandboxed environment to run the browser
When to skip
- An API exists (always prefer the API)
- Task is frequent and needs low latency
- Target site has strong anti-automation measures
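The two checklists can be folded into a single gate. This is an illustrative encoding only: `TaskProfile` and `should_use_browser_agent` are hypothetical names, and treating each skip condition as a hard veto is an assumption. A real decision may weigh them instead.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    api_available: bool
    high_value: bool
    sandbox_available: bool
    frequent_and_latency_sensitive: bool
    strong_anti_automation: bool


def should_use_browser_agent(t: TaskProfile) -> bool:
    # Any "skip" condition vetoes the browser agent.
    if t.api_available or t.frequent_and_latency_sensitive or t.strong_anti_automation:
        return False
    # Otherwise the remaining "use" conditions must both hold.
    return t.high_value and t.sandbox_available
```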