Tool Use Reduces Depth-Induced Collapse
Force a model through many out-of-distribution reasoning steps, block the usual shortcuts, and standalone systems lose reliable next-step accuracy as the chain grows. Let them write, run, and revise code, and the collapse largely disappears — even small models begin to rival frontier ones.
Introduction
A lot of optimism about where AI is heading rests on one assumption: that a language model which has learned relationships on in-distribution data can recombine them to solve genuinely new, out-of-distribution problems. The trouble is that this ability is hard to measure cleanly. On many benchmarks, a model can reach the right final answer by leaning on memorized subproblems, redundant cues, or a small slice of the available evidence. That is still interpolation. Real generalization means using all the relevant evidence, step after step, when no shortcut is available.
There is an elegant way to frame the stakes. The Diligent Learner framework (Shalev-Shwartz et al. 2025) models reasoning as search over partial solutions, where everything hinges on one number: $\gamma$, the probability that the model proposes a useful next step. If $\gamma$ stays bounded away from zero, search can scale to long horizons. If $\gamma$ decays with depth, even a powerful search procedure becomes brittle. That turns a broad question about generalization into a concrete, testable one:
To answer it, you need a benchmark where each step has exactly one correct continuation, and that continuation is recoverable only by combining the revealed history with new evidence. This paper builds one from Boolean reconstruction over GF(2), wraps it in a prefix-conditioned sampling oracle that blocks partial-information shortcuts, and measures $\gamma_g$ as a function of depth. The story has four parts:
Part I — Why benchmarks reward interpolation
Picture training as a cloud of datapoints in an idea space, linked by the relationships a model learns. At inference, the model walks those relationships from a query to an answer. If the answer sits close to the training cloud, the walk is short and familiar — that is in-distribution reasoning. If it sits far away, the model has to extend learned relationships into territory it has never seen — out-of-distribution reasoning. Many benchmarks do not force that long walk.
Fig. 1 — In-distribution reasoning stays inside the dashed neighborhood of the training cloud; out-of-distribution reasoning must reach a target far outside it. A synthesized tool (gray arrow) can extend how far the model travels reliably.
The fix is a benchmark in which ignoring either the history or the new data guarantees failure, so a high score can only mean sustained stepwise reasoning. The catch is that unrestricted Boolean-circuit recovery is the wrong object off the shelf: in its usual form, one tries to recover all terms at once from an enormous example set, and even validating arbitrary intermediate circuits can be intractable. The paper therefore restricts to a structured subclass where every stage has a unique correct continuation and every candidate next step can be checked efficiently.
Part II — A shortcut-proof benchmark
The targets are Boolean functions written in algebraic normal form over GF(2) — XORs of monomials. Each input splits into address bits $a$ and payload bits $v$. The function is a sum of terms, where each address bit gates one fixed-degree payload conjunction:
An instance fixes an ordered list of supports $(S_1,\dots,S_n)$. Reasoning becomes an iterative decoding game: at step $g$, the model is handed the already-recovered prefix $P_g=(t_1,\dots,t_g)$ plus a fresh batch of labeled examples $S_g$, and must name the next monomial $t_{g+1}$. Because the instance commits to an order, there is exactly one correct continuation, so step-success is a clean probability:
This is the empirical stand-in for the Diligent Learner's $\gamma$. To make sure it measures reasoning rather than shortcut-spotting, the paper separates solvers by what they can see — the full-information diligent solver (prefix + data), a data-only solver, a history-only solver, and a partial solver — then engineers the oracle so only the diligent one can succeed reliably:
The oracle: history as a key
Here is the mechanism that blocks the shortcuts. At step $g$, the oracle turns on the target address bit, zeros out all later address bits, flips the prefix address bits like fair coins, and draws a nearly balanced payload. The resulting label splits cleanly in two:
The mask is computable from the prefix, so a diligent solver subtracts it and reads the signal. To a solver without the prefix, the same mask behaves like a high-entropy scrambler. Its edge over a coin flip from any single example decays geometrically in the number of active prefix bits $m$, which is about $g/2$ at depth $g$:
Three guarantees fall out of this design, and together they pin the difficulty to depth. Supports are drawn independently, so the prefix says nothing about the next support — no history-only shortcut. The Hamming weight is tuned so each monomial fires with probability near one-half — no statistical-leakage shortcut. And the mask drives the data-only Bayes advantage toward zero as depth grows — no data-only shortcut. The slider below puts that last guarantee in motion.
Fig. 2. The data-only Bayes advantage is $\tfrac12|1-2\rho|^{m}$ with $m\approx g/2$ active mask bits. As depth climbs — or as $\rho$ approaches the balanced $\tfrac12$ — this advantage collapses toward zero, so new evidence alone becomes uninformative. The diligent solver subtracts the prefix mask and recovers the next term from a small batch, independent of depth. Once unmasked, each step is a single Boolean-conjunction recovery: polynomial-time, and not intrinsically harder at larger $g$.
Part III — Depth-induced collapse
Run real models on this benchmark and a sharp pattern appears: as depth grows, next-step accuracy falls, even though a polynomial-time recovery procedure exists at every step. The information is present; the difficulty is carrying out the prefix-conditioned computation reliably across a long chain. The estimator simulations confirm the benchmark behaves as designed — only the full-information solver sustains high $\gamma_g$, while data-only and partial solvers decay and history-only stays near chance. Then the language models begin to follow the same script.
Fig. 3. Illustrative curves — the shapes match the paper's measured figures (log-scale $\gamma_g$ versus depth), not exact values. Standalone small models fade toward the chance line $\gamma_{\mathrm{triv}}=1/\binom{12}{3}\approx0.45\%$ by moderate depth; frontier no-tool models delay the fall but still degrade; tool-enabled models stay much flatter across the whole range.
How much of the prefix does a model actually use?
There is a clean way to quantify the collapse. Take the partial-information solver and give it only the first $k$ prefix terms; ask which $k$ would reproduce a given model's observed accuracy at depth $g$. That $k^\star(g)$ is the effective prefix — how much history the model behaves as if it integrated. Full integration is the line $k^\star=g$. The paper's linear fits tell the story: without tools, the slopes usually sit below one, so the gap to full integration widens with depth; with tools, the slopes move back toward one and the error rate becomes far less depth-dependent.
Fig. 4. Exact linear fits $k^\star\approx\max(0,\,f g - a)$ from the paper. Without tools, the 30B variants have $f=0.71$–$0.85$, so the shaded gap to the $k^\star=g$ line opens with depth and accuracy drifts toward the random-guess boundary. With tools, slopes land at $0.96$–$1.08$ and the gap stays nearly flat. (4B-Thinking's no-tool slope of $1.04$ is misleading: partial integration scores by luck at small $g$, then accuracy falls off a cliff once full integration becomes necessary.)
Frontier systems — GPT-5.2 with extended thinking, Claude Opus 4.5 with maximum thinking, and Gemini 3 Pro — change the magnitude but not the qualitative verdict. They start higher and hold on longer, so where a small standalone model is already guessing, a frontier model can still retain real accuracy. But the valid no-tool frontier runs still degrade as depth grows. Scale delays the collapse; it does not remove the depth dependence.
Part IV — Tools change the computational structure
Each step of the game demands two different things: inferring the constraint from the prefix and the new examples, then carrying out the GF(2) computation. A standalone model has to do both token by token. As depth grows, the prefix-conditioned computation gets longer and more fragile, and $\gamma_g$ decays. Tool use splits the two jobs apart: the model specifies the computation in code, hands execution to an interpreter, inspects the result, and revises the procedure when it fails.
The effect is dramatic. Given a sandbox to generate, run, and revise code over up to ten iterations, the small Qwen3-2507 models sustain high accuracy out to depths of $g=127$ — approaching the full-information estimator and, in places, rivaling frontier models. And the iteration is what matters: a 4B model handed a single, unrevised code attempt still fails at large depth; its recovery depends on proposing code, observing how it behaves, and repairing it. Externalizing the deterministic computation is the whole point.
The same lift appears for frontier models: once they can run and revise code, $\gamma_g$ remains high even at the deepest settings, far above the $0.45\%$ chance line. By moving exact symbolic execution out of the model and into an executable artifact, tool-enabled systems maintain a much more stable step-success probability over long horizons.
What this does and does not say
The benchmark is deliberately narrow. Symbolic GF(2) reconstruction was chosen precisely because shortcuts can be controlled and the next step is exactly verifiable — which makes the failure mode easy to measure, but does not by itself establish how often the same collapse appears in messier natural-language or interactive tasks where the carried-forward state is less algebraic. The tool-use results were collected in a sandboxed code-execution setting, so the gains come bundled with practical costs in latency, execution infrastructure, and interface reliability. And the frontier evaluation is a smaller diagnostic — a handful of queries per model, depth, and tool setting — rather than the full protocol used for the small models.
What survives those caveats is a specific, measurable claim. The Diligent Learner framework assumes a non-vanishing next-step probability; on a benchmark built to demand exactly that, standalone models do not maintain it uniformly as depth grows. Small models drift toward the partial-information regime, and frontier models still show depth dependence. Tool synthesis largely restores the missing stability. That points somewhere slightly against the grain: scalable reasoning may depend less on ever-deeper search inside the model, and more on giving models reliable interfaces for building, executing, and revising their own tools.
Takeaway
On a benchmark engineered so that every step needs all the evidence and no partial-information shortcut survives, standalone language models suffer depth-induced collapse: next-step accuracy degrades as the reasoning chain grows.
The information was never missing. A polynomial-time recovery procedure exists at every depth; the failure is in carrying a lengthening, prefix-conditioned computation reliably through token-by-token reasoning.
Scale delays, tools restructure. Frontier models postpone the collapse but still degrade. Models that generate, execute, and iteratively revise code sustain high step-success over long horizons — and small models start to approach frontier-level behavior.
The role of tools is sharper than "more capabilities." By externalizing deterministic computation into an artifact the model can inspect and repair, tool use lowers the effective reasoning depth carried internally — and that may be the key difference between brittle extrapolation and stable out-of-distribution reasoning.
Comments