Data analyst agent
📖 3 min read · Updated 2026-04-19
A data analyst agent answers questions by querying data, running analyses, and interpreting results. The hard part isn't the SQL; modern LLMs write solid SQL. The hard part is making sure the agent understands the schema, writes a query that actually answers the question, interprets results correctly, and doesn't confidently state garbage when the data is weird. Done well, it's a huge productivity win for analytics teams. Done poorly, it's a confident bullshit generator in front of the board.
The loop
Core tools
list_tables() - all available tables with short descriptions.
describe_table(table) - columns + types + sample rows.
run_query(sql) - execute (read-only) and return results.
explain_plan(sql) - for large queries, cost-check before running.
create_chart(data, type) - render visualizations when appropriate.
note_finding(text) - scratch pad for the synthesis step.
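The tools above compose into a simple loop: the model picks a tool, the harness executes it, and the result goes back into context until the model produces a final answer. A minimal sketch, assuming a hypothetical `llm_call` that returns either a tool call or a final answer:

```python
# Minimal agent loop over the tools listed above. `llm_call` and the tool
# bodies are stand-ins; a real harness would dispatch to the warehouse.

TOOLS = {}

def tool(fn):
    """Register a function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def list_tables():
    # Illustrative stub: real implementation reads the metadata catalog.
    return [("orders", "one row per order"), ("customers", "one row per customer")]

def run_agent(question, llm_call, max_steps=10):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm_call(messages, tools=list(TOOLS))
        if action["type"] == "final_answer":
            return action["content"]
        # Execute the requested tool and feed the result back into context.
        result = TOOLS[action["tool"]](**action.get("args", {}))
        messages.append({"role": "tool", "name": action["tool"], "content": str(result)})
    return "Step limit reached without an answer."
```

The `max_steps` cap matters: an agent that keeps re-querying without converging should stop and say so, not burn warehouse credits.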
Schema context is everything
An LLM can't write correct SQL without knowing the schema. A big enterprise schema might have 500 tables; you can't dump them all into context. This is a RAG problem over your metadata: given the user's question, retrieve the relevant tables and their descriptions first. Only then write SQL.
Signals that schema context is the bottleneck: agent writes syntactically valid SQL that queries the wrong table, or confuses two similarly-named columns. Fix with better metadata + retrieval, not better prompts.
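A minimal sketch of the retrieval step, assuming a hypothetical metadata catalog keyed by table name. Real systems score with embeddings; keyword overlap keeps the sketch dependency-free:

```python
# Metadata RAG sketch: rank table descriptions against the question and
# keep only the top-k for the SQL-writing prompt.
import re

def retrieve_tables(question, table_docs, k=5):
    q_words = set(re.findall(r"[a-z]+", question.lower()))
    def score(doc):
        return len(q_words & set(re.findall(r"[a-z]+", doc.lower())))
    ranked = sorted(table_docs.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

TABLE_DOCS = {  # hypothetical catalog; descriptions do the heavy lifting
    "orders": "one row per order with revenue amount and order date",
    "customers": "customer accounts with segment and region",
    "marketing_attribution": "maps customers to marketing channel and campaign",
    "inventory_snapshots": "daily warehouse stock levels per sku",
}

retrieve_tables("Which marketing channel drove the most revenue?", TABLE_DOCS, k=2)
# → ['marketing_attribution', 'orders']
```

Note that retrieval quality is bounded by description quality; a table named `fct_rev_v3` with no description can't be retrieved no matter how good the scorer is.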
Safety: read-only by default
- Read-only DB user. Agent can't accidentally drop a table.
- Query timeouts. A runaway query shouldn't take down the warehouse.
- Row-level security enforced by the DB, not the agent. The agent inherits the user's permissions, nothing more.
- Cost caps on expensive operations (full table scans, cross joins).
- Prod separation. Agent should hit a read replica or separate analytics DB, not production transactional DB.
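The DB-level controls above are the real guarantee; a pre-flight check on the agent's SQL just fails fast with a clearer error. A defense-in-depth sketch (keyword list is illustrative, not a complete denylist):

```python
# Reject anything that isn't a single plain SELECT before it reaches the
# warehouse. This supplements — never replaces — the read-only DB role.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)

def check_query(sql):
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"^\s*(select|with)\b", stripped, re.IGNORECASE):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(stripped):
        raise ValueError("query contains a forbidden keyword")
    return stripped
```

Word boundaries keep column names like `created_at` or `deleted_flag` from tripping the check, while a bare `DROP` still does.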
Interpretation: numbers need narrative
A table of numbers isn't an answer. The agent should:
- State findings in plain language ("Revenue was $240K in March, up 12% from February").
- Point out anything surprising ("Interesting: the spike is driven entirely by one enterprise customer").
- Flag data quality issues ("Note: 3% of rows have null order_date and were excluded").
- Offer follow-up analyses ("Want me to break this down by region?").
Without interpretation, the agent is just a slow SQL IDE.
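The first bullet above is mechanical enough to sketch; formatting choices (currency, rounding) are illustrative:

```python
# Turning a number into narrative: plain-language change description.

def describe_change(metric, current, previous, period, prior_period):
    pct = (current - previous) / previous
    direction = "up" if pct >= 0 else "down"
    return (f"{metric} was ${current / 1000:.0f}K in {period}, "
            f"{direction} {abs(pct):.0%} from {prior_period}")

describe_change("Revenue", 240_000, 214_000, "March", "February")
# → 'Revenue was $240K in March, up 12% from February'
```

The surprising-finding and data-quality bullets are harder to template; those come from the model reasoning over the result, not from string formatting.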
Visualize or table?
- Charts: trends, distributions, comparisons, anything where the shape matters.
- Tables: exact numbers, specific records, precise counts.
Agent picks based on the question. "How has X trended?" → line chart. "Which 5 customers bought most last month?" → table.
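In practice the model itself makes this call, but the heuristic is simple enough to sketch; the keyword lists here are illustrative, not exhaustive:

```python
# Rough chart-vs-table heuristic keyed off the question phrasing.

TREND_WORDS = {"trend", "over time", "growth", "distribution", "compare"}

def pick_presentation(question):
    q = question.lower()
    if any(w in q for w in TREND_WORDS):
        return "chart"  # shape of the data matters
    return "table"      # exact numbers are the safer default
```

Defaulting to a table is deliberate: a wrong chart hides precision, while a table of exact numbers is never misleading, just less vivid.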
A worked example
User: "Which marketing channel drove the most revenue last quarter?"
- List tables → finds orders, customers, marketing_attribution.
- Describe marketing_attribution → discovers channel, customer_id, campaign_id.
- Write SQL: join orders to marketing_attribution on customer_id, group by channel, sum revenue, filter to last quarter.
- Execute: returns 6 channels, revenue amounts.
- Sanity check: total across channels matches known quarterly revenue. Good.
- Note surprise: "Paid search was #1 ($480K). But referrals were #2 ($230K), doubled from the prior quarter."
- Offer follow-up: "Want to see the referral growth broken down by referring source?"
Good output is a narrative with numbers, not a data dump.
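The query from the walkthrough, runnable against an in-memory SQLite database with toy data. Table and column names follow the example; the revenue column name (`amount`) and date handling are assumptions:

```python
# Worked-example SQL: join orders to attribution, group by channel,
# sum revenue, filter to the quarter. Toy data, illustrative schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer_id INT, amount REAL, order_date TEXT);
    CREATE TABLE marketing_attribution (customer_id INT, channel TEXT, campaign_id INT);
    INSERT INTO orders VALUES (1, 300, '2024-02-10'), (2, 150, '2024-03-05'), (3, 90, '2024-01-20');
    INSERT INTO marketing_attribution VALUES (1, 'paid_search', 10), (2, 'referral', 11), (3, 'paid_search', 12);
""")

QUERY = """
    SELECT m.channel, SUM(o.amount) AS revenue
    FROM orders o
    JOIN marketing_attribution m ON m.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01' AND o.order_date < '2024-04-01'
    GROUP BY m.channel
    ORDER BY revenue DESC;
"""
rows = con.execute(QUERY).fetchall()
# rows → [('paid_search', 390.0), ('referral', 150.0)]
```

One subtlety the agent has to get right: if a customer has multiple attribution rows, this join double-counts their revenue — exactly the kind of thing the sanity check below should catch.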
Common failure modes
Sanity-check patterns
Before reporting a number, agent should:
- Check the grand total is plausible vs known benchmarks.
- Test the query on a small sample first when feasible.
- Cross-check with a second, different query formulation.
- Flag anything anomalous rather than smoothing it over.
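The first and last bullets can be sketched together: compare the grand total against a known benchmark and flag a mismatch instead of smoothing it over. Benchmark value and tolerance here are illustrative:

```python
# Sanity-check sketch: grand total vs known benchmark, flag if off.

def sanity_check_total(rows, benchmark, tolerance=0.05):
    total = sum(value for _, value in rows)
    if abs(total - benchmark) / benchmark > tolerance:
        return (f"WARNING: total {total:,.0f} is off benchmark "
                f"{benchmark:,.0f} by more than {tolerance:.0%}")
    return f"OK: total {total:,.0f} within {tolerance:.0%} of benchmark"

sanity_check_total([("paid_search", 480_000), ("referral", 230_000)], benchmark=900_000)
```

The key design choice is what happens on failure: the agent should surface the warning in its answer, not quietly rerun variations until something matches.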
What to do with this
- Start with a narrow scope: one or two high-value questions on a clean dataset. Prove accuracy before expanding.
- Build a schema-retrieval layer if your DB is larger than ~50 tables. Metadata RAG is the quiet multiplier.
- Read the RAG section for the retrieval patterns underneath all of this.
- Read safety + guardrails for the DB access controls.