Agents that can send emails, write to databases, call external APIs, or move money can cause real harm. Safety isn't about LLM content filters; it's about which tools you expose and what happens when things go wrong.
Give the agent the smallest set of tools needed. A Q&A agent doesn't need to send emails. A research agent doesn't need DB write access. Split agents by trust level and give each exactly what it needs.
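A minimal sketch of per-agent tool allowlists. The agent names, tool names, and the `AGENT_TOOLS` mapping are illustrative assumptions, not from any particular framework:

```python
# Illustrative per-agent allowlists: each agent gets exactly the
# tools its job requires, and nothing else.
SEARCH_DOCS = "search_docs"
SEND_EMAIL = "send_email"
DB_READ = "db_read"
DB_WRITE = "db_write"

AGENT_TOOLS = {
    "qa_agent": {SEARCH_DOCS, DB_READ},        # answers questions only
    "research_agent": {SEARCH_DOCS, DB_READ},  # read-only access
    "billing_agent": {DB_READ, DB_WRITE, SEND_EMAIL},
}

def tools_for(agent_name: str) -> set[str]:
    """Return the allowlist; unknown agents get no tools at all."""
    return AGENT_TOOLS.get(agent_name, set())

def check_tool_call(agent_name: str, tool: str) -> None:
    """Deny by default: raise unless the tool is explicitly granted."""
    if tool not in tools_for(agent_name):
        raise PermissionError(f"{agent_name} may not call {tool}")
```

Denying by default (unknown agent means empty set) keeps a misconfigured or newly added agent powerless rather than over-privileged.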
Any action that can't be undone or affects shared state should require explicit human confirmation:
Pattern: agent proposes the action, UI shows it to the user, user approves, system executes.
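The propose/approve/execute pattern can be sketched as follows. The `ProposedAction` and `ApprovalGate` names are hypothetical; the point is that execution is structurally impossible without an approval step:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class ProposedAction:
    tool: str
    params: dict
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    approved: bool = False

class ApprovalGate:
    """Holds proposed actions until a human approves them."""
    def __init__(self) -> None:
        self.pending: dict[str, ProposedAction] = {}

    def propose(self, tool: str, params: dict) -> ProposedAction:
        """Agent calls this; the UI renders the returned action for review."""
        action = ProposedAction(tool, params)
        self.pending[action.id] = action
        return action

    def approve(self, action_id: str) -> ProposedAction:
        """Called only from the human-facing approval UI."""
        action = self.pending.pop(action_id)
        action.approved = True
        return action

def execute(action: ProposedAction) -> str:
    """The executor refuses anything that never passed the gate."""
    if not action.approved:
        raise PermissionError("action was never approved by a human")
    # ... dispatch to the real tool here ...
    return f"executed {action.tool}"
```

The agent never holds a reference to the real tool, only to `propose`; the executor is the single place that checks `approved`, so there is no code path from agent output straight to a side effect.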
No single defense is bulletproof. Defense in depth, layering tool restrictions, approval gates, and output scanning, is the only way.
Scan agent output before it reaches the user: model output is untrusted, and it can carry leaked secrets or injected content.
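A sketch of an output scanner, assuming two common checks: credential-shaped strings and markdown images, which can exfiltrate data via the image URL when the client auto-fetches it. The patterns are illustrative; real scanners use broader detection:

```python
import re

# Illustrative patterns only; production scanners are more thorough.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # API-key-shaped string
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\((https?://[^)]+)\)")

def scan_output(text: str) -> list[str]:
    """Return a list of findings; an empty list means safe to display."""
    findings = []
    for pat in SECRET_PATTERNS:
        if pat.search(text):
            findings.append("possible credential in output")
    for match in MARKDOWN_IMAGE.finditer(text):
        findings.append(f"markdown image could exfiltrate data: {match.group(1)}")
    return findings
```

On any finding, block or redact the output rather than rendering it, and log the event for review.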
For each agent, ask: what's the worst thing this agent could do? If the answer is bad enough, either restrict its tools or require human approval for those actions. For a customer-support agent, the worst outcome might be sending a wrong answer: annoying, but bounded. For a billing agent, it's charging the wrong card, so require confirmation.
Log every destructive action with: who (the user), which agent, what action, what parameters, who approved it (human or agent), and a timestamp. This is required for compliance and for investigating incidents.
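The log entry above can be sketched as a JSON line containing every required field. The field names and the `approved_by` convention are assumptions for illustration:

```python
import json
import time

def audit_entry(user: str, agent: str, action: str,
                params: dict, approved_by: str) -> str:
    """Serialize one destructive action as a JSON line.

    approved_by is e.g. "human:<user_id>" or "agent:auto" (assumed
    convention), so incident review can see who signed off.
    """
    entry = {
        "user": user,
        "agent": agent,
        "action": action,
        "params": params,
        "approved_by": approved_by,
        "timestamp": time.time(),
    }
    return json.dumps(entry, sort_keys=True)
```

In production the line would be appended to write-once storage, so an agent (or an attacker driving one) can't rewrite its own history.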