Safety boundaries

An autonomous agent that does what it wants is a liability. Safety is not an afterthought; it's the reason you can delegate work at all. Done right, it's invisible. Done wrong, it's expensive.

The layered model

Safety in autonomous AI comes from layers. No one mechanism is sufficient. Stack them.

Layer 1: Model-level safety

Claude is trained to refuse obviously harmful requests, flag prompt injection in tool output, and be cautious with irreversible actions. This is helpful but not sufficient; models can be wrong.

Layer 2: Prompt-level rules

Your system prompt says "never do X," "always confirm Y," "escalate Z." The model tries to follow these rules and usually succeeds. It doesn't always.

Layer 3: Harness permissions

The harness enforces what the model can actually do via allow/deny. See Permissions. This is the backbone.

Layer 4: Hooks

Hooks run on harness events. A PreToolUse hook can reject a call the permission list missed, and a custom check can catch whole classes of problems that static allow/deny rules can't express. See Hooks.
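As a sketch of the idea, here is the harness-agnostic core of such a check. The pattern list and function names are hypothetical, and the event shape (a dict with `tool_name` and `tool_input`) is an assumption about the hook contract:

```python
import json

# Hypothetical deny patterns; extend for your environment.
BLOCKED_PATTERNS = ["rm -rf", "DROP TABLE"]

def pre_tool_use_check(event: dict) -> tuple[bool, str]:
    """Return (allow, reason) for an event like
    {"tool_name": ..., "tool_input": {...}}."""
    payload = json.dumps(event.get("tool_input", {}))
    for pattern in BLOCKED_PATTERNS:
        if pattern in payload:
            return False, f"blocked: matched {pattern!r}"
    return True, "ok"
```

In Claude Code specifically, a PreToolUse hook is an external command that reads the event from stdin and signals a block through its exit code; the function above is the part you'd wire into that contract.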

Layer 5: External circuit breakers

Infrastructure-level controls: API rate limits, budget caps, IAM roles, network ACLs. If all else fails, these fail closed.
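Most of these live outside your code (IAM, network policy), but a budget cap is easy to sketch in-process too. A minimal fail-closed token budget, with hypothetical names:

```python
class BudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    """Fail-closed spend cap: once tripped, every further call
    raises until a human resets it."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0
        self.tripped = False

    def record(self, tokens: int) -> None:
        if self.tripped:
            raise BudgetExceeded("breaker already tripped")
        self.used += tokens
        if self.used > self.max_tokens:
            self.tripped = True  # stays open until manual reset
            raise BudgetExceeded(f"{self.used} > {self.max_tokens} tokens")
```

The important property is that it stays open: a tripped breaker keeps failing rather than silently resuming on the next run.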

Layer 6: Human audit

You still review logs. Not every day, not every run. But regularly. Look for surprises.

Kill switches

Every autonomous agent must have a way to stop, fast. Not a graceful shutdown: a hard stop.
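One common shape is a sentinel the agent loop checks before every step. The file path and helper names here are hypothetical:

```python
import os
from pathlib import Path

def kill_requested(kill_file: Path) -> bool:
    """True once an operator has created the kill-switch sentinel."""
    return kill_file.exists()

def agent_step(kill_file: Path) -> None:
    if kill_requested(kill_file):
        # Hard stop: os._exit skips cleanup handlers and atexit
        # hooks entirely -- no graceful shutdown.
        os._exit(1)
    # ... otherwise proceed with the next tool call ...
```

An operator stops the agent by touching the file (`touch /path/to/kill`); no coordination with the running process is needed.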

Never do X

Some actions should never happen, no matter what the model decides. Bake the prohibition into every layer: the prompt, the permission deny list, the hooks, and the infrastructure controls.

Approval workflows for risky-but-not-forbidden actions

Some actions shouldn't be auto-approved but shouldn't be denied outright. The right response is "ask a human."

This is cheap to build and reduces approval fatigue: only risky actions stop the flow, not every action.
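The routing itself is a few lines. A sketch with hypothetical tool names and tiers:

```python
AUTO_ALLOW = {"read_file", "search_web"}   # hypothetical safe tools
ALWAYS_DENY = {"delete_database"}          # forbidden at every layer

def route(tool_name: str) -> str:
    """Map a tool call to 'allow', 'deny', or 'ask' (pause for a human)."""
    if tool_name in ALWAYS_DENY:
        return "deny"
    if tool_name in AUTO_ALLOW:
        return "allow"
    return "ask"  # risky-but-not-forbidden: a human decides
```

The default matters: anything not explicitly allowed falls through to "ask", so a new tool is paused rather than silently auto-approved.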

Audit logs

Structured logs of every tool call, every model call, every decision. Kept for at least 90 days. Queryable.

At minimum, each log entry should contain:

{
  "timestamp": "2026-04-18T14:32:11Z",
  "agent": "daily-report",
  "run_id": "abc123",
  "tool": "mcp__slack__send_message",
  "args_redacted": { "channel": "#alerts", "text": "[REDACTED 180 chars]" },
  "result": "success",
  "cost_tokens": 1200
}
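A writer that produces entries of this shape, redacting free-text fields before anything hits disk, might look like this (function and field-handling choices are illustrative):

```python
import json
from datetime import datetime, timezone

def redact(value: str) -> str:
    return f"[REDACTED {len(value)} chars]"

def audit_entry(agent: str, run_id: str, tool: str,
                args: dict, result: str, cost_tokens: int) -> str:
    """Serialize one structured, redacted audit-log line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "run_id": run_id,
        "tool": tool,
        # Redact free-text fields; keep routing fields queryable.
        "args_redacted": {
            k: redact(v) if k == "text" else v for k, v in args.items()
        },
        "result": result,
        "cost_tokens": cost_tokens,
    }
    return json.dumps(entry)
```

Redacting at write time, rather than at query time, means a leaked log file never contains message contents in the first place.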

Drill the response

When (not if) an agent does something bad, you'll need to:

  1. Stop it fast
  2. Understand what it did (from audit logs)
  3. Undo if possible
  4. Fix the layer that failed
  5. Add a test case so it doesn't happen again

Practice this. Simulate an agent going off the rails. How long does it take to stop it? To understand the damage? Those numbers are your incident response quality.

Don't confuse "it hasn't happened yet" with "it won't." Autonomous agents with insufficient safety are fine until they're a disaster. Build the layers before you need them.