Self-correction

Self-correction is when the agent detects its own errors and fixes them without a human in the loop. It needs a verifier: something outside the model that can say "this is wrong, and here's why." Run code → fail test → fix code → pass test. Draft claim → fact-check against source → rewrite if unsupported. The pattern is simple but powerful: it is the difference between an agent that gives up at the first stumble and one that actually gets things done.

The essential ingredient: a verifier

Self-correction is not the same as reflection. Reflection asks the model to critique itself, which is subjective. Self-correction runs an objective check: does the code compile? Do the tests pass? Is the claim backed by the source? Without a verifier outside the model, the model doesn't know it's wrong and can't correct itself.

Verifiers are the secret weapon of good agents. If you can find or build one for your task, you can self-correct. If you can't, you're stuck with reflection or human review.
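A minimal sketch of what an objective verifier looks like, using a syntax check as the example: it runs the candidate code through Python's compiler instead of asking the model whether the output "looks right". The name `verify_python` is illustrative, not from any library.

```python
def verify_python(code: str) -> tuple[bool, str]:
    """Return (passed, reason). The reason string is the signal
    the agent reads when deciding what to fix."""
    try:
        # An objective check outside the model: either it compiles
        # or it doesn't. No opinion involved.
        compile(code, "<agent_output>", "exec")
        return True, "compiles"
    except SyntaxError as e:
        return False, f"SyntaxError: {e}"
```

The same shape works for any verifier: return a boolean verdict plus a machine-readable reason the agent can act on.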

The correction loop

The triggers

A worked example: coding agent

  1. Produce: agent writes def divide(a, b): return a / b.
  2. Verify: run the test suite. Test divide(10, 0) fails with ZeroDivisionError.
  3. Agent reads error: "The test expected the function to return None when b is 0, but it raised an exception. I need to handle b == 0."
  4. Fix: def divide(a, b): return None if b == 0 else a / b.
  5. Verify again: all tests pass. Done.

This is why coding agents work: the test suite is a near-perfect verifier, and the error message is a rich signal that tells the agent exactly what to change. Take away the tests and you take away the feedback loop; the agent starts fumbling.
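The five steps can be sketched end to end. The test suite here stands in for a real verifier; `first_draft` and `fixed_draft` play the roles of the agent's two attempts.

```python
def run_tests(divide) -> tuple[bool, str]:
    """The verifier: objective, outside the model."""
    try:
        assert divide(10, 2) == 5
        assert divide(10, 0) is None, "expected None when b == 0"
        return True, "all tests pass"
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

def first_draft(a, b):           # step 1: produce
    return a / b

def fixed_draft(a, b):           # step 4: fix, guided by the error
    return None if b == 0 else a / b

ok, msg = run_tests(first_draft)   # step 2: verify -> fails
print(ok, msg)                     # False ZeroDivisionError: division by zero
ok, msg = run_tests(fixed_draft)   # step 5: verify again -> passes
print(ok, msg)                     # True all tests pass
```

Note that the error message alone (step 3) is enough to point the fix at the right branch: handle `b == 0`.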

Self-correction by task type

Stop conditions

Self-correction can spiral: a model that can't fix a problem will iterate on the same wrong answer forever. Set hard limits: cap the number of attempts, cap the spend, and stop early when the same error repeats with no change in the output.
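A minimal sketch of hard limits wrapped around the loop. `produce` and `verify` are hypothetical callables standing in for the model call and the verifier; the names and the attempt cap are assumptions for illustration.

```python
MAX_ATTEMPTS = 3

def correct_with_limits(produce, verify):
    """Run produce -> verify -> fix, but with hard stop conditions:
    a cap on attempts, and an early exit when the same error repeats."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        output = produce(last_error)      # last_error guides the fix
        ok, msg = verify(output)
        if ok:
            return output
        if msg == last_error:
            # Same error twice in a row: no progress, stop spiraling.
            raise RuntimeError(f"stuck on: {msg}")
        last_error = msg
    raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts: {last_error}")
```

The repeat check matters as much as the attempt cap: two identical failures in a row usually mean the model is looping, not converging.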

The quality ceiling

Self-correction only works when the model is capable of producing the right answer but drifted from it. For problems beyond the model's capability (novel research, niche domain expertise the model doesn't have), self-correction doesn't help. The model just repeats variations of the wrong answer. You'll see this in traces: the test keeps failing, the fixes keep missing the real issue, and you're burning cost. Have a stop rule, escalate to a stronger model, or break the task into smaller pieces the current model can handle.
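One way to act on the ceiling is to escalate automatically: give the current model a fixed budget, and if it can't produce a passing answer, hand the task to a stronger model before falling back to a human. `models` and `verify` are hypothetical callables; the budget of 3 is an assumption.

```python
def solve_with_escalation(task, models, verify, budget=3):
    """Try each model in order (weakest first, strongest last),
    giving each a fixed number of attempts before escalating."""
    for model in models:
        for _ in range(budget):
            output = model(task)
            ok, _ = verify(output)
            if ok:
                return output
        # Budget exhausted: this model has hit its ceiling on the task.
    raise RuntimeError("all models exhausted their budget; escalate to a human")
```

This keeps the cheap model on the happy path while giving hard tasks a way out of the loop instead of burning cost on variations of the same wrong answer.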

Pitfalls

What to do with this