Your autonomous agent is running somewhere without you. How do you actually know it's still doing its job? The answer is that it tells you - and the way it tells you is the difference between "the agent works" and "the agent worked until that Thursday four weeks ago and nobody noticed." Self-monitoring is what turns an agent from a demo into a reliable piece of infrastructure.
All six matter. You can skip any one and it'll be fine - until one day it won't. The agents that run reliably for a year are the ones where all six signals are tracked from day one.
Think of it as three checks, in order of how cheap and how noisy they are.
Is the agent running at all? Every successful run pings a dead-man's-switch service. If the pings stop, the service alerts you. Cheap to set up (one curl call), saves you from the worst failure mode (total silence).
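# At the end of a successful run.
# -fsS: fail on HTTP errors, stay quiet otherwise; -m 10: give up after 10 seconds;
# --retry 5: retry transient failures so a network blip doesn't look like a dead agent.
curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/your-check-uuid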
Services: Healthchecks.io, Better Uptime, Dead Man's Snitch. Any of them takes five minutes to wire up. No excuse not to.
Did the agent actually do its job? For a reporting agent: did the report get posted to the right Slack channel? For an email agent: did the draft appear in drafts? Define a concrete check, run it after each run, log the result.
The check itself can be as simple as "does the file exist at this path and was it updated in the last hour?" Not hard. Still needs to be written and wired in.
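Here is a minimal sketch of that file check in Python. The function name `output_is_fresh` and the extra "file is not empty" condition are assumptions added for illustration; the one-hour window comes from the text above.

```python
import os
import time


def output_is_fresh(path: str, max_age_seconds: float = 3600.0) -> bool:
    """Return True if `path` exists, is non-empty, and was modified recently."""
    try:
        stat = os.stat(path)
    except FileNotFoundError:
        return False
    age_seconds = time.time() - stat.st_mtime
    # The non-empty check is an assumption: an empty report is treated as a failed run.
    return stat.st_size > 0 and age_seconds <= max_age_seconds
```

Call it right after the run and log the boolean alongside the run's other details, so a missing output shows up even when the agent itself reported success.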
Harder. The output exists, it went to the right place, but is it GOOD? Is the report actually accurate? Is the summary actually useful? Two ways to check:
Both. Neither one alone is enough.
At the end of every day, the agent summarizes what it did: how many runs, what was unusual, any failures, the most interesting outputs. Gets sent to Slack or email. Five-minute read for you. Good for anything that runs frequently.
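A digest like this can be built from whatever per-run records the agent already keeps. The sketch below assumes each record is a dict with hypothetical `run_id` and `status` fields; it only produces the text, and posting it to Slack or email is left out.

```python
from collections import Counter


def build_digest(records: list[dict]) -> str:
    """Summarize one day's run records as a short plain-text digest."""
    by_status = Counter(r["status"] for r in records)
    # Anything that did not finish cleanly gets listed by run id.
    failures = [r for r in records if r["status"] != "completed"]
    lines = [f"Runs today: {len(records)}"]
    for status, count in sorted(by_status.items()):
        lines.append(f"  {status}: {count}")
    if failures:
        lines.append("Needs a look:")
        lines.extend(f"  {r['run_id']}: {r['status']}" for r in failures)
    return "\n".join(lines)
```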
The agent compares each run to a baseline (computed from the last 30 days): if this run took 3× longer, made 5× more tool calls, or produced an output that's way off typical size, it stops and pings you. The baseline auto-adjusts as normal behavior evolves. Catches "something's weird" cases without you having to watch.
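One way to sketch that comparison, assuming each run is recorded as a dict with hypothetical `duration_s`, `tool_calls`, and `output_bytes` fields. The 3× and 5× thresholds come from the text; the text doesn't quantify "way off typical size," so the 4× band used here is an assumption, as is using the median as the baseline.

```python
from statistics import median


def find_anomalies(run: dict, baseline: list[dict]) -> list[str]:
    """Compare one run against the median of recent runs and list what looks off."""
    if not baseline:
        return []  # no history yet, nothing to compare against
    typical_duration = median(r["duration_s"] for r in baseline)
    typical_calls = median(r["tool_calls"] for r in baseline)
    typical_size = median(r["output_bytes"] for r in baseline)
    reasons = []
    if run["duration_s"] > 3 * typical_duration:
        reasons.append("duration over 3x baseline")
    if run["tool_calls"] > 5 * typical_calls:
        reasons.append("tool calls over 5x baseline")
    # Assumed band: flag outputs more than 4x larger or smaller than typical.
    if typical_size > 0 and not (typical_size / 4 <= run["output_bytes"] <= typical_size * 4):
        reasons.append("output size far from baseline")
    return reasons
```

Passing only the last 30 days of records as `baseline` gives the rolling behavior described above: as normal runs drift, the medians drift with them.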
The agent rates its own confidence in each output. High confidence → ship it. Low confidence → flag for human review before shipping. You set the threshold based on how much risk you're willing to accept. Beautiful compromise between full auto and total babysitting.
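The gate itself is tiny. This sketch assumes confidence is reported as a number between 0 and 1 and uses a hypothetical default threshold of 0.8; how the agent produces that number is out of scope here.

```python
def route_output(confidence: float, threshold: float = 0.8) -> str:
    """Return "ship" for confident outputs and "review" for everything else."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "ship" if confidence >= threshold else "review"
```

Raising `threshold` sends more outputs to human review; lowering it ships more automatically. That single number is the risk dial.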
When the agent hits a wall, it shouldn't just stop. It should walk down a ladder of fallbacks, in order. Each rung is cheaper than the one below but sometimes fails:
The agent tries rung 1 first. If that fails, rung 2. And so on. By the time it's paging you (rung 4), the other rungs have been tried. Your pages are rare, and each one matters.
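The ladder can be sketched as an ordered list of named callables, tried one at a time. The helper name `run_ladder` and the rung names in the usage example are hypothetical; the actual rungs depend on your agent.

```python
from typing import Callable, Sequence


def run_ladder(rungs: Sequence[tuple[str, Callable[[], object]]]) -> tuple[str, object]:
    """Try each rung in order; return (name, result) from the first that succeeds."""
    errors = []
    for name, attempt in rungs:
        try:
            return name, attempt()
        except Exception as exc:
            # Remember why this rung failed, then fall through to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all rungs failed: " + "; ".join(errors))
```

A call might look like `run_ladder([("retry", retry_step), ("use_cache", cached_step), ("skip", skip_step), ("page_human", page_me)])`, with the paging rung last so it only fires after everything cheaper has been tried.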
Each run emits a structured log record. At minimum:
completed, max-turns, error, escalated
Store as JSON lines in a log file (or a log service). You can grep for patterns later, which is how you find the slow creep that would otherwise take you three months to notice.
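A minimal sketch of a JSON-lines logger. The four status values are the ones named above; the other field names (`ts`, `run_id`, `duration_s`, `tool_calls`) and the function name are assumptions for illustration.

```python
import json
import time

VALID_STATUSES = {"completed", "max-turns", "error", "escalated"}


def append_run_record(log_path: str, run_id: str, status: str,
                      duration_s: float, tool_calls: int) -> dict:
    """Append one run record to a JSON-lines file and return the record."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    record = {
        "ts": time.time(),  # Unix timestamp of when the record was written
        "run_id": run_id,
        "status": status,
        "duration_s": duration_s,
        "tool_calls": tool_calls,
    }
    # One JSON object per line keeps the file greppable and easy to append to.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because every record is a single line, something like `grep '"status": "max-turns"' runs.jsonl` is enough to pull out every run that ran out of turns.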