The Cliff Problem with Chain-of-Thought
When you ask a language model to "think step by step," it works surprisingly well. This technique, called chain-of-thought (CoT), has become one of the most popular ways to get better answers out of AI models. But there's a hidden failure mode that nobody talks about.
What is chain-of-thought?
Normally, a model takes a question and jumps straight to an answer. Chain-of-thought adds intermediate steps. Instead of "What is 3 + 5 * 2? Answer: 13," the model writes out "3 + 5 * 2. First, 5 * 2 = 10. Then, 3 + 10 = 13. Answer: 13."
Those intermediate steps help the model keep track of complex reasoning. It's like showing your work on a math test.
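The worked example above can be sketched as a tiny function that records each intermediate step the way a CoT trace would. This is illustrative only; `chain_of_thought` is a hypothetical helper, not anything from the experiment code.

```python
def chain_of_thought(a, b, c):
    """Evaluate "a + b * c" while writing down each intermediate
    step, like showing your work (illustrative sketch only)."""
    product = b * c                                # multiplication binds first
    steps = [f"First, {b} * {c} = {product}"]
    total = a + product
    steps.append(f"Then, {a} + {product} = {total}")
    return total, steps

answer, steps = chain_of_thought(3, 5, 2)
# answer == 13, and steps spell out the same work as the example above
```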
The experiment
I trained small transformer models on a multi-step arithmetic task. The model gets a starting number and a chain of operations: add 7, multiply by 3, subtract 2, and so on. It needs to track the running total through each step and predict the final value.
I compared three approaches:
Standard transformer: takes the full sequence of operations and predicts the answer directly. No intermediate work shown. The model has to figure out how to track state internally.
Token chain-of-thought: after each operation, the model writes out the intermediate answer as actual text. "3 + 7 = 10, * 3 = 30, - 2 = 28." Each step is explicitly supervised during training.
Latent thought vectors: instead of writing answers between steps as text, the model produces a continuous vector (a list of numbers, not a word) that captures its "thinking" and feeds it into the next step. These aren't human-readable; they're the model's internal notes. The whole pipeline is trained end-to-end: the model learns on its own what information to put in these vectors.
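To make the three setups concrete, here is a sketch of the supervision target each one gets for the same chain. This is my reconstruction of the formats described above, not the repo's exact encoding; `"<vector>"` is a placeholder for the unsupervised latent.

```python
def targets(start, ops):
    """Supervision signal for one chain under each approach
    (formats are reconstructions, not the repo's exact encoding)."""
    total, intermediates = start, []
    for op, operand in ops:
        total = {"+": total + operand,
                 "-": total - operand,
                 "*": total * operand}[op]
        intermediates.append(total)

    # Standard transformer: only the final answer is supervised.
    standard = str(total)

    # Token CoT: every intermediate total is written out as text
    # and explicitly supervised during training.
    token_cot = ", ".join(
        f"{op} {operand} = {value}"
        for (op, operand), value in zip(ops, intermediates)
    )

    # Latent thoughts: no text between steps; an unsupervised vector
    # is passed forward and only the final answer is checked.
    latent = ("<vector>", str(total))
    return standard, token_cot, latent

std, cot, lat = targets(3, [("+", 7), ("*", 3), ("-", 2)])
# cot == "+ 7 = 10, * 3 = 30, - 2 = 28"
```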
The training setup
All three models trained on chains of length 3 to 25 (meaning 3 to 25 operations in a row). Then I tested them at lengths 5, 10, 15, 20 (inside the training range) and 30, 40, 50 (longer than anything they'd seen).
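The evaluation protocol amounts to measuring exact-match accuracy at each test length. A self-contained sketch (with a ground-truth evaluator standing in for a trained model, so every name here is hypothetical):

```python
import random

def run_chain(start, ops):
    """Ground-truth evaluator for a left-to-right operation chain."""
    total = start
    for op, operand in ops:
        total = {"+": total + operand,
                 "-": total - operand,
                 "*": total * operand}[op]
    return total

def accuracy_at_length(model, length, n_examples, rng):
    """Fraction of random chains of one length answered exactly."""
    correct = 0
    for _ in range(n_examples):
        start = rng.randint(0, 9)
        ops = [(rng.choice("+-*"), rng.randint(1, 9)) for _ in range(length)]
        if model(start, ops) == run_chain(start, ops):
            correct += 1
    return correct / n_examples

rng = random.Random(0)
# Lengths up to 20 are inside the training range; 30-50 probe beyond it.
curve = {n: accuracy_at_length(run_chain, n, 20, rng) for n in (5, 20, 50)}
```

Plugging in a perfect evaluator as the "model" yields 1.0 everywhere; the interesting question is the shape of this curve for a real trained model.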
The key question: what happens when you ask the model to handle a longer chain than anything it trained on?
The cliff
The standard transformer learned almost nothing. 1-3% accuracy across all lengths. Without any intermediate reasoning, it just can't track a number through a long sequence of operations.
Token chain-of-thought was perfect inside the training range: 100% at length 20. Then it fell off a cliff: 0.2% at length 30, 0% at length 50. The model memorized the pattern of writing intermediate answers at the lengths it saw during training. The moment you push past that range, it completely breaks. Not a gradual decline. A cliff.
Latent thought vectors got lower absolute accuracy (3-5%) but the behavior was completely different. It scored about the same at length 5 as it did at length 50. Basically flat. No cliff. It generalized to lengths it never trained on, even though nobody told it how.
Why the cliff happens
Token CoT encodes its reasoning in the same format as its training data: text tokens at specific positions. The model learns "at position 20, I should be producing an answer." At position 30? It's never done that. The positional pattern breaks.
Latent thought vectors don't have this problem because continuous vectors don't have hard positional boundaries. The model learns a general strategy ("compress my running state into this vector and pass it forward"), and that strategy doesn't know or care how many times it's been applied. It just keeps going.
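The length-independence of that strategy can be illustrated with a toy recurrence (plain-Python sketch with made-up weights, not the trained model): the same update function is reused at every step, so nothing in it depends on how many steps came before. Chain length 5 and chain length 50 exercise exactly the same code path.

```python
import math

def latent_step(state, op, w_state, w_op):
    """One recurrent update of a latent state vector. The same weights
    are applied at every step, so the function has no notion of
    position in the chain (toy weights, illustration only)."""
    return [math.tanh(ws * s + wo * op)
            for s, ws, wo in zip(state, w_state, w_op)]

w_state, w_op = [0.5, -0.3, 0.8, 0.1], [0.2, 0.7, -0.4, 0.9]
state = [0.0] * 4
for _ in range(50):          # 5, 50, or 500 iterations: identical code path
    state = latent_step(state, op=1.0, w_state=w_state, w_op=w_op)
```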
Why this matters
The token-CoT result looks amazing on a benchmark. 100% accuracy. You'd ship it with confidence. But the moment a real-world problem requires even slightly longer reasoning than what was in the training data, it collapses completely. And you won't know until it happens.
The latent thought approach has lower peak accuracy because learning to represent numbers as geometry in a continuous space is genuinely harder than predicting the next text token. But its failure mode is graceful degradation instead of a cliff.
If you care about reliability, the cliff is the thing to worry about. A system that scores 4% consistently is predictable. A system that scores 100% in-distribution and 0% out-of-distribution is dangerous.
Code
The experiment is in uncertainty-driven-memory (latent_thought_experiment.py and latent_thought_v2.py).