Data analyst agent
📖 3 min read · Updated 2026-04-19
A data analyst agent answers questions by querying data, running analyses, and interpreting results. The hard part isn't the SQL; modern LLMs write solid SQL. The hard part is making sure the agent understands the schema, writes a query that actually answers the question, interprets results correctly, and doesn't confidently state garbage when the data is weird. Done well, it's a huge productivity win for analytics teams. Done poorly, it's a confident bullshit generator in front of the board.
The loop
Core tools
list_tables() - all available tables with short descriptions.
describe_table(table) - columns + types + sample rows.
run_query(sql) - execute (read-only) and return results.
explain_plan(sql) - for large queries, cost-check before running.
create_chart(data, type) - render visualizations when appropriate.
note_finding(text) - scratch pad for the synthesis step.
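The tools above compose into a simple loop: the model picks a tool, the harness executes it, and the result goes back into context until the model produces a final answer. A minimal sketch, assuming a hypothetical `llm_call` that returns either a tool call or a final answer:

```python
# Minimal agent loop over the tools listed above. `llm_call` and the tool
# bodies are stand-ins; a real harness would dispatch to the warehouse.

TOOLS = {}

def tool(fn):
    """Register a function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def list_tables():
    # Illustrative stub: real implementation reads the metadata catalog.
    return [("orders", "one row per order"), ("customers", "one row per customer")]

def run_agent(question, llm_call, max_steps=10):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm_call(messages, tools=list(TOOLS))
        if action["type"] == "final_answer":
            return action["content"]
        # Execute the requested tool and feed the result back into context.
        result = TOOLS[action["tool"]](**action.get("args", {}))
        messages.append({"role": "tool", "name": action["tool"], "content": str(result)})
    return "Step limit reached without an answer."
```

The `max_steps` cap matters: an agent that keeps re-querying without converging should stop and say so, not burn warehouse credits.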
Schema context is everything
An LLM can't write correct SQL without knowing the schema. A big enterprise schema might have 500 tables; you can't dump them all into context. This is a RAG problem over your metadata: given the user's question, retrieve the relevant tables and their descriptions first. Only then write SQL.
Signals that schema context is the bottleneck: agent writes syntactically valid SQL that queries the wrong table, or confuses two similarly-named columns. Fix with better metadata + retrieval, not better prompts.
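A minimal sketch of the retrieval step, assuming a hypothetical metadata catalog keyed by table name. Real systems score with embeddings; keyword overlap keeps the sketch dependency-free:

```python
# Metadata RAG sketch: rank table descriptions against the question and
# keep only the top-k for the SQL-writing prompt.
import re

def retrieve_tables(question, table_docs, k=5):
    q_words = set(re.findall(r"[a-z]+", question.lower()))
    def score(doc):
        return len(q_words & set(re.findall(r"[a-z]+", doc.lower())))
    ranked = sorted(table_docs.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

TABLE_DOCS = {  # hypothetical catalog; descriptions do the heavy lifting
    "orders": "one row per order with revenue amount and order date",
    "customers": "customer accounts with segment and region",
    "marketing_attribution": "maps customers to marketing channel and campaign",
    "inventory_snapshots": "daily warehouse stock levels per sku",
}

retrieve_tables("Which marketing channel drove the most revenue?", TABLE_DOCS, k=2)
# → ['marketing_attribution', 'orders']
```

Note that retrieval quality is bounded by description quality; a table named `fct_rev_v3` with no description can't be retrieved no matter how good the scorer is.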
Safety: read-only by default
- Read-only DB user. Agent can't accidentally drop a table.
- Query timeouts. A runaway query shouldn't take down the warehouse.
- Row-level security enforced by the DB, not the agent. The agent inherits the user's permissions, nothing more.
- Cost caps on expensive operations (full table scans, cross joins).
- Prod separation. Agent should hit a read replica or separate analytics DB, not production transactional DB.
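The DB-level controls above are the real guarantee; a pre-flight check on the agent's SQL just fails fast with a clearer error. A defense-in-depth sketch (keyword list is illustrative, not a complete denylist):

```python
# Reject anything that isn't a single plain SELECT before it reaches the
# warehouse. This supplements — never replaces — the read-only DB role.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)

def check_query(sql):
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"^\s*(select|with)\b", stripped, re.IGNORECASE):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(stripped):
        raise ValueError("query contains a forbidden keyword")
    return stripped
```

Word boundaries keep column names like `created_at` or `deleted_flag` from tripping the check, while a bare `DROP` still does.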
Interpretation: numbers need narrative
A table of numbers isn't an answer. The agent should:
- State findings in plain language ("Revenue was $240K in March, up 12% from February").
- Point out anything surprising ("Interesting: the spike is driven entirely by one enterprise customer").
- Flag data quality issues ("Note: 3% of rows have null order_date and were excluded").
- Offer follow-up analyses ("Want me to break this down by region?").
Without interpretation, the agent is just a slow SQL IDE.
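The first bullet above is mechanical enough to sketch; formatting choices (currency, rounding) are illustrative:

```python
# Turning a number into narrative: plain-language change description.

def describe_change(metric, current, previous, period, prior_period):
    pct = (current - previous) / previous
    direction = "up" if pct >= 0 else "down"
    return (f"{metric} was ${current / 1000:.0f}K in {period}, "
            f"{direction} {abs(pct):.0%} from {prior_period}")

describe_change("Revenue", 240_000, 214_000, "March", "February")
# → 'Revenue was $240K in March, up 12% from February'
```

The surprising-finding and data-quality bullets are harder to template; those come from the model reasoning over the result, not from string formatting.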
Visualize or table?
- Charts: trends, distributions, comparisons, anything where the shape matters.
- Tables: exact numbers, specific records, precise counts.
Agent picks based on the question. "How has X trended?" → line chart. "Which 5 customers bought most last month?" → table.
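In practice the model itself makes this call, but the heuristic is simple enough to sketch; the keyword lists here are illustrative, not exhaustive:

```python
# Rough chart-vs-table heuristic keyed off the question phrasing.

TREND_WORDS = {"trend", "over time", "growth", "distribution", "compare"}

def pick_presentation(question):
    q = question.lower()
    if any(w in q for w in TREND_WORDS):
        return "chart"  # shape of the data matters
    return "table"      # exact numbers are the safer default
```

Defaulting to a table is deliberate: a wrong chart hides precision, while a table of exact numbers is never misleading, just less vivid.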
A worked example
User: "Which marketing channel drove the most revenue last quarter?"
- List tables → finds orders, customers, marketing_attribution.
- Describe marketing_attribution → discovers channel, customer_id, campaign_id.
- Write SQL: join orders to marketing_attribution on customer_id, group by channel, sum revenue, filter to last quarter.
- Execute: returns 6 channels, revenue amounts.
- Sanity check: total across channels matches known quarterly revenue. Good.
- Note surprise: "Paid search was #1 ($480K). But referrals were #2 ($230K), doubled from the prior quarter."
- Offer follow-up: "Want to see the referral growth broken down by referring source?"
Good output is a narrative with numbers, not a data dump.
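The query from the walkthrough, runnable against an in-memory SQLite database with toy data. Table and column names follow the example; the revenue column name (`amount`) and date handling are assumptions:

```python
# Worked-example SQL: join orders to attribution, group by channel,
# sum revenue, filter to the quarter. Toy data, illustrative schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer_id INT, amount REAL, order_date TEXT);
    CREATE TABLE marketing_attribution (customer_id INT, channel TEXT, campaign_id INT);
    INSERT INTO orders VALUES (1, 300, '2024-02-10'), (2, 150, '2024-03-05'), (3, 90, '2024-01-20');
    INSERT INTO marketing_attribution VALUES (1, 'paid_search', 10), (2, 'referral', 11), (3, 'paid_search', 12);
""")

QUERY = """
    SELECT m.channel, SUM(o.amount) AS revenue
    FROM orders o
    JOIN marketing_attribution m ON m.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01' AND o.order_date < '2024-04-01'
    GROUP BY m.channel
    ORDER BY revenue DESC;
"""
rows = con.execute(QUERY).fetchall()
# rows → [('paid_search', 390.0), ('referral', 150.0)]
```

One subtlety the agent has to get right: if a customer has multiple attribution rows, this join double-counts their revenue — exactly the kind of thing the sanity check below should catch.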
Common failure modes
Sanity-check patterns
Before reporting a number, agent should:
- Check the grand total is plausible vs known benchmarks.
- Test the query on a small sample first when feasible.
- Cross-check with a second, different query formulation.
- Flag anything anomalous rather than smoothing it over.
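The first and last bullets can be sketched together: compare the grand total against a known benchmark and flag a mismatch instead of smoothing it over. Benchmark value and tolerance here are illustrative:

```python
# Sanity-check sketch: grand total vs known benchmark, flag if off.

def sanity_check_total(rows, benchmark, tolerance=0.05):
    total = sum(value for _, value in rows)
    if abs(total - benchmark) / benchmark > tolerance:
        return (f"WARNING: total {total:,.0f} is off benchmark "
                f"{benchmark:,.0f} by more than {tolerance:.0%}")
    return f"OK: total {total:,.0f} within {tolerance:.0%} of benchmark"

sanity_check_total([("paid_search", 480_000), ("referral", 230_000)], benchmark=900_000)
```

The key design choice is what happens on failure: the agent should surface the warning in its answer, not quietly rerun variations until something matches.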
What to do with this
- Start with a narrow scope: one or two high-value questions on a clean dataset. Prove accuracy before expanding.
- Build a schema-retrieval layer if your DB is larger than ~50 tables. Metadata RAG is the quiet multiplier.
- Read the RAG section for the retrieval patterns underneath all of this.
- Read safety + guardrails for the DB access controls.