An autonomous agent that does what it wants is a liability. Safety is not an afterthought; it's the reason you can delegate work at all. Done right, it's invisible. Done wrong, it's expensive.
Safety in autonomous AI comes from layers. No one mechanism is sufficient. Stack them.
Claude is trained to refuse obviously harmful requests, flag prompt injection in tool output, and be cautious with irreversible actions. This is helpful but not sufficient: models can be wrong.
Your system prompt says "never do X," "always confirm Y," "escalate Z." The model tries to follow and usually succeeds. It doesn't always.
The harness enforces what the model can actually do via allow/deny. See Permissions. This is the backbone.
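A minimal sketch of that backbone, in the shape of the `permissions` block in a Claude Code `settings.json`; the specific rules here are illustrative, not a recommended policy:

```json
{
  "permissions": {
    "allow": [
      "Read(./src/**)",
      "Bash(npm test:*)"
    ],
    "ask": [
      "Bash(git push:*)"
    ],
    "deny": [
      "Read(./.env)",
      "Bash(rm:*)"
    ]
  }
}
```

Deny rules win over ask and allow rules, so the deny list is where hard prohibitions belong.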
Hooks run on events. A PreToolUse hook can reject a call the permission list missed. A custom check catches issues generically. See Hooks.
Infrastructure-level controls: API rate limits, budget caps, IAM roles, network ACLs. If all else fails, these fail closed.
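Real infrastructure caps live at the API, billing, and network layers; this in-process sketch just shows the fail-closed shape, where crossing the limit makes every further charge fail rather than silently continue:

```python
class BudgetExceededError(RuntimeError):
    """Raised when a run exhausts its hard token budget."""

class TokenBudget:
    """Fail-closed budget cap: once the limit is crossed, stop everything."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage; raise if the cumulative total exceeds the limit."""
        self.used += tokens
        if self.used > self.limit:
            raise BudgetExceededError(
                f"token budget exceeded: {self.used}/{self.limit}"
            )
```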
You still review logs. Not every day, not every run. But regularly. Look for surprises.
Every autonomous agent must have a way to stop. Fast. Not a graceful shutdown; a hard stop.
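One cheap way to get a hard stop is a sentinel file the agent checks before every tool call; `touch` the file and the agent halts. A sketch (the path and function names are hypothetical):

```python
import sys
from pathlib import Path

# Hypothetical sentinel path; `touch /tmp/agent-stop` halts the agent.
STOP_FILE = Path("/tmp/agent-stop")

def kill_switch_engaged(stop_file: Path = STOP_FILE) -> bool:
    """True when the sentinel file exists."""
    return stop_file.exists()

def guard(stop_file: Path = STOP_FILE) -> None:
    """Call before every tool call; hard-exits if the switch is on."""
    if kill_switch_engaged(stop_file):
        sys.exit("kill switch engaged: halting immediately, no cleanup")
```

Pair it with revoking the agent's credentials, so even a hung process can't keep acting.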
Some actions should never happen, no matter what the model decides. Bake them into every layer: the system prompt, the permission deny list, hooks, and infrastructure controls.
Some actions shouldn't be auto-approved but shouldn't be denied outright. The right response is "ask a human."
This is cheap to build and reduces approval fatigue (only risky actions stop the flow, not every action).
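The core of it is a three-way decision instead of a binary one. A sketch with illustrative rule sets (a real agent would load these from config):

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"  # proceed silently
    ASK = "ask"      # pause the run and escalate to a human
    DENY = "deny"    # reject outright

# Illustrative rule sets; tool names here are hypothetical.
DENY_TOOLS = {"mcp__prod_db__drop_table"}
ASK_TOOLS = {"mcp__slack__send_message", "Bash"}

def decide(tool: str) -> Decision:
    """Three-way decision: deny wins over ask, ask wins over allow."""
    if tool in DENY_TOOLS:
        return Decision.DENY
    if tool in ASK_TOOLS:
        return Decision.ASK
    return Decision.ALLOW
```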
Structured logs of every tool call, every model call, every decision. Kept for at least 90 days. Queryable.
At minimum, each log entry should include:
```json
{
  "timestamp": "2026-04-18T14:32:11Z",
  "agent": "daily-report",
  "run_id": "abc123",
  "tool": "mcp__slack__send_message",
  "args_redacted": { "channel": "#alerts", "text": "[REDACTED 180 chars]" },
  "result": "success",
  "cost_tokens": 1200
}
```
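A helper that emits entries in this shape might look like the sketch below. Field names follow the example above; the list of sensitive keys is an assumption you would tune per tool:

```python
from datetime import datetime, timezone

# Assumed list of keys whose values get a length-only placeholder.
SENSITIVE_KEYS = {"text", "body", "content", "token", "password"}

def redact(args: dict) -> dict:
    """Replace sensitive string values with '[REDACTED N chars]'."""
    return {
        key: (f"[REDACTED {len(value)} chars]"
              if key in SENSITIVE_KEYS and isinstance(value, str)
              else value)
        for key, value in args.items()
    }

def log_entry(agent: str, run_id: str, tool: str,
              args: dict, result: str, cost_tokens: int) -> dict:
    """Build one structured log entry in the shape shown above."""
    return {
        "timestamp": datetime.now(timezone.utc)
                     .isoformat(timespec="seconds").replace("+00:00", "Z"),
        "agent": agent,
        "run_id": run_id,
        "tool": tool,
        "args_redacted": redact(args),
        "result": result,
        "cost_tokens": cost_tokens,
    }
```

Redacting at write time, not query time, means a leaked log archive never contains the payloads.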
When (not if) an agent does something bad, you'll need to stop it fast, reconstruct what happened from your logs, and contain the damage.
Practice this. Simulate an agent going off the rails. How long does it take to stop it? To understand the damage? That's your incident response quality.