Your Agent Needs to Know What It Doesn't Know

agents · uncertainty · research

AI agents today are like someone navigating a city blindfolded. They can follow directions, but they have no idea when they're about to walk into a wall. They don't know what they don't know. And for anything beyond simple tasks, that's a fatal flaw.

This article ties together three sets of experiments I ran, all circling the same question: what happens when you give a model the ability to recognize its own uncertainty, and what can you do with that information?

Problem 1: Uncertainty is expensive to measure

The standard way to check if a neural network is uncertain is to run it many times and see if the answers disagree. Train 10 separate models (an "ensemble") or sample from a single model's weight distributions 10 times (a "Bayesian" approach). If they all agree, the prediction is probably right. If they disagree, the model is unsure.
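The disagreement signal can be sketched in a few lines. This is a toy numpy example with a simple spread measure (summed per-class variance across members); the shapes and the specific measure are illustrative, not taken from the experiments:

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    # member_probs: (n_members, n_classes), each row a softmax
    # output for the same input from one ensemble member.
    # Disagreement = per-class variance across members, summed.
    return float(member_probs.var(axis=0).sum())

# Three members that agree vs. three that disagree.
agree = np.array([[0.90, 0.10], [0.88, 0.12], [0.91, 0.09]])
disagree = np.array([[0.90, 0.10], [0.20, 0.80], [0.55, 0.45]])
```

When the members agree, the spread is near zero; when they split, it is large. The cost problem is visible in the shape: you need `n_members` forward passes per input before you can compute anything.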

This works, but it costs 10x the compute. That's fine for a research paper, but not for an agent that needs to make decisions quickly in real time.

What I found: You can skip all of that. Instead, look at what's happening inside the model's "brain" (its hidden layers) during a single forward pass. Fit a simple statistical model on what those internal states looked like during training. At test time, check if the new internal state looks familiar or strange.
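A minimal sketch of that idea: fit a single Gaussian to hidden activations collected during training, then score test inputs by Mahalanobis distance. The class name, the single-Gaussian choice, and which layer to read from are all assumptions here, not the article's exact setup:

```python
import numpy as np

class LatentDensityScorer:
    """Fit a Gaussian to training-time hidden states; score test-time
    states by Mahalanobis distance. Larger distance = less familiar."""

    def fit(self, H):
        # H: (n_samples, d) hidden activations from the training set
        self.mu = H.mean(axis=0)
        cov = np.cov(H, rowvar=False) + 1e-6 * np.eye(H.shape[1])
        self.prec = np.linalg.inv(cov)  # precision matrix
        return self

    def score(self, h):
        # One cheap quadratic form per input, no extra forward passes
        d = h - self.mu
        return float(d @ self.prec @ d)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))      # stand-in activations
scorer = LatentDensityScorer().fit(train)
in_dist = scorer.score(rng.normal(0.0, 1.0, size=8))  # familiar
ood = scorer.score(np.full(8, 6.0))                   # far from training
```

The whole test-time cost is one matrix-vector product on top of the forward pass you were already doing.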

On an out-of-distribution detection task (handwritten digits as the familiar data, fashion items as the unfamiliar data), this approach scored 0.991 AUROC. The 10-model ensemble scored 0.979. Better results, one forward pass instead of ten, roughly 4x cheaper overall. (Full details in Using Latent Distance as a Proxy for Uncertainty.)

Problem 2: Knowing you're uncertain is useless if you can't act on it

An agent that knows it's confused but keeps going anyway is no better than one that doesn't know. The uncertainty signal has to change behavior.

I tested this on a language model (Qwen 0.5B) solving arithmetic word problems. The base model gets 80% right with greedy decoding (just taking its best guess at each step). When the model is uncertain about a step, what if it explored alternatives instead of committing to its first guess?

What I found: Both entropy-based branching (checking if the model's output distribution is spread out) and density-based branching (checking if the model's internal state looks unfamiliar) recovered almost every error. 80% jumped to 98%, and the total compute actually went down because the model found correct answers faster instead of generating wrong tokens. (Full details in Uncertainty-Guided Search Beats Beam Search.)
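The entropy-based gate is easy to sketch (the density-based variant swaps the entropy check for a familiarity score like the one above). The threshold and branch width below are illustrative values, not the tuned ones from the experiments:

```python
import math

def entropy(probs):
    # Shannon entropy of a discrete output distribution (nats)
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_step(probs, threshold=1.0, k=3):
    """Greedy when confident, branch when not.
    Returns the candidate token indices to pursue at this step."""
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    if entropy(probs) < threshold:
        return ranked[:1]   # easy step: commit to the best guess
    return ranked[:k]       # hard step: explore the top-k alternatives

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy -> no branching
uncertain = [0.30, 0.28, 0.22, 0.20]   # high entropy -> branch
```

Because the gate only fires at hard steps, most of the sequence is still decoded greedily, which is why total compute can go down even though some steps branch.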

On a separate continuous prediction task, standard beam search (exploring everywhere equally) made things worse because the random exploration added noise that compounded over steps. Uncertainty-guided search avoided this by only exploring at the hard steps and committing to the clean best-guess at easy steps.

Problem 3: Step-by-step reasoning breaks at the edges

Chain-of-thought ("think step by step") is how most current agents handle complex reasoning. It works brilliantly within the training distribution. But I found a cliff: models that score 100% at reasoning lengths they trained on drop to 0% one step beyond.

What I found: Replacing text-based intermediate steps with continuous "thought vectors" (the model thinks in numbers, not words, between steps) eliminates the cliff. The accuracy is lower overall (the learning problem is harder), but it degrades gracefully instead of collapsing. 4% at length 5, 3% at length 50. No sudden death. (Full details in The Cliff Problem with Chain-of-Thought.)
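The text-versus-vector difference can be caricatured with a toy chain. Rounding the state at every step stands in for forcing each intermediate thought through a discrete token vocabulary; keeping the raw vector stands in for thought vectors. This is an analogy for the information bottleneck, not the article's trained setup:

```python
import numpy as np

def run_chain(x, n_steps, quantize=False):
    # Each step applies a fixed linear map (a stand-in for a
    # learned transition). quantize=True rounds the state to one
    # decimal per step: a lossy "decode to text, re-encode" cycle.
    W = np.array([[0.9, -0.2], [0.2, 0.9]])
    h = np.array(x, dtype=float)
    for _ in range(n_steps):
        h = W @ h
        if quantize:
            h = np.round(h, 1)
    return h

exact = run_chain([1.0, 0.5], 30)
cont = run_chain([1.0, 0.5], 30, quantize=False)   # matches exact
disc = run_chain([1.0, 0.5], 30, quantize=True)    # drifts off course
```

The continuous chain carries the state forward losslessly; the quantized chain compounds rounding error at every step, which is the flavor of failure that shows up as a cliff when the discretization is a token vocabulary.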

Where this is all going

The thread connecting all three results: agents need internal self-awareness to handle hard, long tasks. They need to:

  1. Know when they're in unfamiliar territory without spending 10x the compute to find out
  2. Change their behavior when uncertain by exploring alternatives at hard steps instead of committing blindly
  3. Maintain internal state that doesn't break at boundaries so their reasoning can extend beyond what they trained on

Right now I'm working on bolting a small "thought adapter" onto a frozen language model. The adapter sits between the model's internal layers, only activates when the hidden state looks unfamiliar, and learns to inject corrective signals. The idea is that you don't retrain the whole model. You just teach it when and how to think more carefully. Results pending.
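As a sketch, the gating logic might look like the following. Everything here is a placeholder for learned components: the Gaussian familiarity gate, the linear correction `W`, and the threshold are illustrative stand-ins, since (as noted) results are pending:

```python
import numpy as np

class ThoughtAdapter:
    """Gated adapter between a frozen model's layers: pass hidden
    states through untouched when they look familiar, inject a
    correction when they don't."""

    def __init__(self, mu, prec, W, threshold):
        self.mu, self.prec = mu, prec   # familiarity model fit on training activations
        self.W = W                      # corrective map (learned in the real setup)
        self.threshold = threshold

    def __call__(self, h):
        d = h - self.mu
        if d @ self.prec @ d < self.threshold:
            return h                    # familiar territory: do nothing
        return h + self.W @ d           # unfamiliar: think more carefully

adapter = ThoughtAdapter(mu=np.zeros(2), prec=np.eye(2),
                         W=-0.5 * np.eye(2), threshold=1.0)
familiar = adapter(np.array([0.1, 0.1]))     # passed through unchanged
unfamiliar = adapter(np.array([3.0, 3.0]))   # pulled back toward familiar space
```

The appeal of this shape is that the frozen model's behavior is provably unchanged on familiar inputs: the adapter is a strict no-op below the threshold.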

All code is in two repos: