Data analyst agent

A data analyst agent answers questions by querying data, running analyses, and interpreting results. The hard part isn't the SQL; modern LLMs write solid SQL. The hard part is making sure the agent understands the schema, writes a query that actually answers the question, interprets results correctly, and doesn't confidently state garbage when the data is weird. Done well, it's a huge productivity win for analytics teams. Done poorly, it's a confident bullshit generator in front of the board.

The loop

Core tools

Schema context is everything

An LLM can't write correct SQL without knowing the schema. A big enterprise schema might have 500 tables; you can't dump them all into context. This is a RAG problem over your metadata: given the user's question, retrieve the relevant tables and their descriptions first. Only then write SQL.

Signals that schema context is the bottleneck: agent writes syntactically valid SQL that queries the wrong table, or confuses two similarly-named columns. Fix with better metadata + retrieval, not better prompts.

Safety: read-only by default

Interpretation: numbers need narrative

A table of numbers isn't an answer. The agent should:

Without interpretation, the agent is just a slow SQL IDE.

Visualize or table?

Agent picks based on the question. "How has X trended?" → line chart. "Which 5 customers bought most last month?" → table.

A worked example

User: "Which marketing channel drove the most revenue last quarter?"

  1. List tables → finds orders, customers, marketing_attribution.
  2. Describe marketing_attribution → discovers channel, customer_id, campaign_id.
  3. Write SQL: join orders to marketing_attribution on customer_id, group by channel, sum revenue, filter to last quarter.
  4. Execute: returns 6 channels, revenue amounts.
  5. Sanity check: total across channels matches known quarterly revenue. Good.
  6. Note surprise: "Paid search was #1 ($480K). But referrals were #2 ($230K), doubled from the prior quarter."
  7. Offer follow-up: "Want to see the referral growth broken down by referring source?"

Good output is a narrative with numbers, not a data dump.

Common failure modes

Sanity-check patterns

Before reporting a number, agent should:

What to do with this