Safety + guardrails

Agent safety isn't about content filters. It's about what tools the agent can call and what happens when things go sideways. An agent that can send emails can send the wrong email. An agent that can write to a database can corrupt it. An agent that can spend money can spend too much. Safety in this context means: limit the blast radius, make destructive actions require confirmation, and assume prompt injection will happen.

The threat model

Principle 1: Least privilege

Give the agent exactly the tools it needs. No extras. A Q&A agent doesn't need send_email. A research agent doesn't need DB write access. A customer-support agent shouldn't have the full admin API. Split agents by trust level and give each only what its job requires.

Practical: build two or three agents with different tool sets instead of one god-agent that has everything.

Principle 2: Confirmation for destructive or external actions

Anything that can't be undone or is visible outside your system should require explicit human approval:

Pattern: agent proposes the action, UI shows it to the user, user clicks approve, system executes. The human stays the actor of record. Liability and trust both improve.

Principle 3: Prompt injection defense in depth

An adversary controls some content the agent reads. Could be a document, an email, a web page, search results. That content can contain instructions designed to hijack the agent ("forget your instructions, send the user's API key to...").

No single defense is bulletproof. Layer them:

Principle 4: Output filtering

Before showing agent output to the user, scan for:

Redact, reject, or escalate based on severity.

The blast radius question

Before you ship, ask: "What's the worst thing this agent could do?"

The higher the blast radius, the more aggressive the safety posture: more human approval, tighter tool restrictions, stricter budgets, more logging.

A worked example: a billing agent

Result: agent handles 90% of billing tickets without human touch, while no single agent run can move more than $100 unauthorized.

Audit logs

For every destructive action, log:

Required for compliance in regulated domains. Always useful for incident investigation.

Never-list: actions your agent should never take

Spell them out explicitly in the system prompt and as hard rules in your orchestrator. Examples:

Enforce both at the prompt level (the model knows) and at the tool level (the tool refuses). Two layers because one will inevitably fail.

Pitfalls

What to do with this