Safety + guardrails

Agents that can send emails, write to databases, call external APIs, or move money can cause real harm. Safety isn't about LLM content filters, it's about what tools you expose and what happens when things go wrong.

The threat model

Principle: least privilege

Give the agent the smallest set of tools needed. A Q&A agent doesn't need to send emails. A research agent doesn't need DB write access. Split agents by trust level and give each exactly what it needs.

Confirmation for destructive actions

Any action that can't be undone or affects shared state should require explicit human confirmation:

Pattern: agent proposes the action, UI shows it to the user, user approves, system executes.

Prompt injection defenses

No defense is bulletproof. Defense in depth is the only way.

Output content filters

Scan agent output before showing to user:

The blast radius

For each agent, ask: what's the worst thing this agent could do? If it's bad enough, either restrict tools or require human approval for those actions. For a customer-support agent, the worst outcome might be sending a wrong answer, annoying but bounded. For a billing agent, it's charging the wrong card, require confirmation.

Audit logs

Every destructive action logged, with: who (user), what agent, what action, what parameters, approved by (human or agent), timestamp. Required for compliance and for investigating incidents.