AI systems are not all the same. Some wait for you to type. Some take actions. Some run on their own, all day, with nobody watching. The autonomy spectrum is the ladder between "it just answers" and "it just runs." Knowing which rung your system is on tells you what it can do, what it will break, and what to build next.
There are five useful rungs on the ladder. Each one adds something the one below it doesn't have.
Level 1: Chatbot
What it is: You type, it types back. That's it. No memory of past conversations. No tools. No actions in the real world. A very fast, very well-read writing partner.
What it CAN do: Draft an email. Answer a question. Explain a concept. Write a poem.
What it CAN'T do: Send the email. Check if the answer is actually correct by looking it up. Remember what you talked about last week.
Example: Early ChatGPT (before tool use was added). You ask. It answers. Conversation ends, memory gone.
Level 2: Assistant
What it is: A chatbot that got a notebook and a couple of tools. It can remember (to a degree). It can search the web, read a file you gave it, or look at a picture.
What it CAN do: Everything Level 1 can, plus pull in facts from the outside world before answering. Continue a conversation.
What it CAN'T do: Take multi-step actions without you. Every new task still starts with you typing.
Example: Modern Claude or ChatGPT with web search turned on. You ask "what's the weather in Paris?" and it actually looks it up instead of guessing.
Level 3: HITL (Human-in-the-Loop) Agent
What it is: Now it can string actions together. It can write code, run that code, see the error, fix it, and try again. But at every risky step (running a shell command, sending a message, spending money) it stops and asks: "Okay?" You say yes or no.
What it CAN do: Long, multi-step tasks. Complex research. Coding projects. Careful customer support.
What it CAN'T do: Run while you're asleep. Every decision still needs you at the keyboard to green-light it.
Example: Claude Code in default mode. You ask it to build a feature. It writes files, tries to run tests, and every time it wants to execute a shell command, it pops up asking for permission.
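The approval gate at Level 3 can be sketched in a few lines. This is a minimal illustration, not how Claude Code or any real agent implements it: the `Step` type, the `run_with_approval` function, and the `approve` callback are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    risky: bool                # e.g. shell command, outbound message, spending money
    run: Callable[[], str]     # performs the action and returns a short result


def run_with_approval(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order, asking `approve` before every risky one."""
    log = []
    for step in steps:
        if step.risky and not approve(step):
            # The human said no: skip this step and move on.
            log.append(f"skipped: {step.description}")
            continue
        log.append(f"done: {step.description} -> {step.run()}")
    return log


steps = [
    Step("write test file", risky=False, run=lambda: "ok"),
    Step("run shell command: pytest", risky=True, run=lambda: "3 passed"),
    Step("send status message", risky=True, run=lambda: "sent"),
]

# Stand-in for a real prompt such as input("Okay? [y/n] "):
# this fake human approves pytest and nothing else.
log = run_with_approval(steps, approve=lambda s: "pytest" in s.description)
print(log)
```

Safe steps run straight through. Risky ones wait on the callback, which is why nothing at this level moves while you're away from the keyboard.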
Level 4: Auto-Mode Agent
What it is: You've pre-approved a playbook. Inside that playbook, the agent acts without asking. Outside it, the agent pauses and asks. Think of giving an employee a budget and a scope: within that, they decide; beyond it, they call you.
What it CAN do: Finish a whole project while you get coffee. Run for an hour or more without check-ins. Handle branching tasks.
What it CAN'T do: Operate fully unsupervised over days. Recover from big surprises without you.
Example: Claude Code with auto mode on and a deny list blocking destructive commands like rm -rf. It builds, tests, commits, and files a PR, only stopping if it wants to do something outside the playbook. This is where most real productivity lives in 2026.
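A playbook can be pictured as three outcomes for any proposed command: allow, ask, or deny. The sketch below is an assumption-heavy illustration. The `Decision` enum, the `decide` function, and both rule lists are invented here and do not mirror any real tool's configuration format. Matching on raw substrings is also far too crude for production use.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"   # inside the playbook: act without asking
    ASK = "ask"       # outside the playbook: pause and ask a human
    DENY = "deny"     # on the deny list: never run


# Hypothetical playbook, for illustration only.
ALLOWED_PREFIXES = ("git status", "git commit", "pytest", "npm test")
DENIED_SUBSTRINGS = ("rm -rf", "git push --force")


def decide(command: str) -> Decision:
    cmd = command.strip()
    # The deny list is checked first so it always wins over an allow rule.
    if any(bad in cmd for bad in DENIED_SUBSTRINGS):
        return Decision.DENY
    if cmd.startswith(ALLOWED_PREFIXES):
        return Decision.ALLOW
    # Anything the playbook doesn't mention goes back to the human.
    return Decision.ASK


print(decide("pytest -q"))
print(decide("pytest && rm -rf /"))
print(decide("curl https://example.com"))
```

Checking the deny list first matters: a command that starts with an approved prefix but smuggles in `rm -rf` is still blocked.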
Level 5: Fully Autonomous
What it is: Nobody is watching. The agent runs on a schedule (every hour, every night) or on an event (a new email arrives, a metric crosses a threshold). It does the job. If it gets stuck, it pings a human, tags the work as needs-review, and keeps going on the rest.
What it CAN do: Monitor. React. Produce reports. Make decisions on rails you defined. Work 24/7 without a human session.
What it CAN'T do: Save you if the rails are wrong. A confident, wrong agent at Level 5 is the most expensive kind of bug.
Example: A scheduled research agent that pulls competitor pricing every morning at 6am, writes a one-page summary, emails it to the team, and flags anything that changed by more than 10%. Nobody hits "run" on this. Ever.
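The "flag anything that changed by more than 10%" rule is simple enough to sketch. The function name, the data shapes, and the sample prices below are all made up for illustration. The scheduling and emailing parts are left out. Items that can't be compared are tagged for review instead of stopping the run, which mirrors the "keeps going on the rest" behaviour described above.

```python
def flag_price_changes(previous, current, threshold=0.10):
    """Compare two {item: price} snapshots.

    Returns (flagged, needs_review):
      flagged      - items whose price moved by more than `threshold`,
                     mapped to the relative change
      needs_review - items with no usable previous price to compare against
    """
    flagged, needs_review = {}, []
    for item, new_price in current.items():
        old_price = previous.get(item)
        if not old_price:
            # Missing or zero baseline: tag it and keep going.
            needs_review.append(item)
            continue
        change = (new_price - old_price) / old_price
        if abs(change) > threshold:
            flagged[item] = round(change, 4)
    return flagged, needs_review


# Made-up sample data.
previous = {"basic": 10.0, "pro": 50.0, "team": 200.0}
current = {"basic": 10.5, "pro": 60.0, "team": 170.0, "enterprise": 900.0}

flagged, needs_review = flag_price_changes(previous, current)
print(flagged)
print(needs_review)
```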
Pick the AI system you're working with (or thinking about building). Walk through these four questions in order, checking each answer against the table below. Stop at the first level whose row your system does not yet match.
1. Who presses "go"?
2. Can it chain actions (use tools, read results, do another thing)?
3. How often does a human have to approve an action?
4. What happens when it hits an error it didn't expect?
The first level where the answer is "no, not yet" is the next rung to build; the level just below it is your current rung. Everything up to your current rung is what you already have. Everything above is what you'd need to build next.
| Level | Human starts it? | Chains actions? | Approvals needed? | Self-recovers? |
|---|---|---|---|---|
| 1. Chatbot | Always | No | N/A | No |
| 2. Assistant | Always | One or two steps | N/A | No |
| 3. HITL Agent | Always | Many | Every risky step | No |
| 4. Auto-Mode Agent | Usually | Many | Only outside scope | For common errors |
| 5. Fully Autonomous | Schedule or event | Unlimited within scope | Rare escalations | Yes, by design |
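One way to read the table is as a ladder of yes/no capabilities, where each rung requires everything below it. The sketch below encodes that reading. The function and its four parameter names are assumptions chosen for this example, and real systems will have fuzzier answers than a boolean.

```python
def autonomy_level(
    uses_tools: bool,
    chains_many_steps: bool,
    acts_without_approval_in_scope: bool,
    runs_on_schedule_or_event: bool,
) -> int:
    """Return the rung (1-5) by climbing until the first capability that is missing."""
    level = 1  # everything that answers at all is at least a chatbot
    for capability in (
        uses_tools,                      # Level 2: Assistant
        chains_many_steps,               # Level 3: HITL Agent
        acts_without_approval_in_scope,  # Level 4: Auto-Mode Agent
        runs_on_schedule_or_event,       # Level 5: Fully Autonomous
    ):
        if not capability:
            break  # stop at the first "no"
        level += 1
    return level


print(autonomy_level(True, True, False, False))
```

Because the loop stops at the first missing capability, a system that runs on a schedule but cannot chain actions still lands on Level 2.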
Level 5 sounds glamorous, but it's brittle. A fully autonomous agent that silently goes off the rails can cause real damage before anyone notices. You don't want to find out your pricing-update agent has been setting everything to $1 for three days.
Build up in stages. Ship at Level 3. Watch it run. When it handles a category of task a hundred times without a mistake, promote that specific category to Level 4. Not all tasks, just that one. Repeat for each category. Eventually the surface area of Level 4 is big enough that running on a schedule (Level 5) is a small, safe extension of what already works.
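Per-category promotion can be tracked with a couple of counters. This sketch makes two assumptions the text does not spell out: "a hundred times without a mistake" is read as 100 consecutive clean runs, and any mistake sends the category back to Level 3. The class, the constant, and the category names are hypothetical.

```python
from collections import defaultdict

PROMOTION_STREAK = 100  # assumed reading of "a hundred times without a mistake"


class CategoryTracker:
    def __init__(self):
        self.streak = defaultdict(int)        # consecutive clean runs per category
        self.level = defaultdict(lambda: 3)   # every category starts at Level 3

    def record(self, category: str, success: bool) -> int:
        """Log one run and return the category's current level."""
        if success:
            self.streak[category] += 1
            if self.streak[category] >= PROMOTION_STREAK:
                self.level[category] = 4      # promote this category only
        else:
            # Assumption: one mistake resets the streak and demotes the category.
            self.streak[category] = 0
            self.level[category] = 3
        return self.level[category]


tracker = CategoryTracker()
for _ in range(100):
    tracker.record("refund-lookup", success=True)
for _ in range(99):
    tracker.record("price-update", success=True)
tracker.record("price-update", success=False)

print(tracker.level["refund-lookup"], tracker.level["price-update"])
```

Each category climbs on its own record, so one unreliable task type never drags the others up with it.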
The mistake most teams make is flipping a working Level 3 system to Level 5 because "it's been working." Working on 1,000 tasks does not mean working on 10,000. A failure that shows up once in every few thousand tasks can stay hidden through the first 1,000 and is likely to surface by the time you reach 10,000.
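The arithmetic behind that warning can be made concrete. Assume, as a simplification, that tasks are independent and share one failure rate p. The chance of seeing zero failures in n tasks is then (1 - p)^n, and solving for p gives the highest failure rate a clean run still cannot rule out. The function name is made up for this sketch.

```python
def failure_rate_upper_bound(clean_runs: int, confidence: float = 0.95) -> float:
    """Highest per-task failure rate still consistent with zero failures in `clean_runs` tasks.

    Assumes independent tasks with a single shared failure rate p:
    P(zero failures) = (1 - p) ** clean_runs. Setting that equal to
    1 - confidence and solving for p gives the bound.
    """
    return 1 - (1 - confidence) ** (1 / clean_runs)


bound = failure_rate_upper_bound(1000)
expected_failures_at_10k = bound * 10_000

print(f"upper bound on failure rate: {bound:.4%}")
print(f"failures to plan for across 10,000 tasks: up to about {expected_failures_at_10k:.0f}")
```

For 1,000 clean runs the bound is roughly 0.3%, close to the familiar "rule of three" estimate of 3/n. At that rate, 10,000 tasks could bring around 30 failures, even though the first 1,000 showed none.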
Autonomy should follow observed reliability, measured per task category, over time. Graduate one category at a time. Log everything. Set rails. Build the "ping the human" path early, and the rest of the ladder is just more tasks moving up it.