Reasoning Models · Inference-Time Compute · Boosting Theory

Agentic Systems as Boosting

Boosting turns weak learners into strong ones by combining many imperfect signals. Can a committee of weak reasoning-model calls do the same at inference time — reaching the level of much stronger models? And if a correct answer is already hiding in the pool, what does it take to actually pick it out?

By Tomer Galanti · May 13, 2026 · Paper: Sunkaraneni, Beneventano, Neumarker, Poggio, Galanti (arXiv, 2026)

Introduction

Classical boosting takes a weak predictor — one that is only slightly better than chance — and, by repeatedly combining imperfect but useful signals, builds a strong predictor. Modern language-model systems lean on a related idea at inference time: sample several candidates, check or compare them, search over partial states, and select a final output.

But reasoning is not ordinary supervised boosting. In supervised prediction, each weak learner returns a label you can score against training examples. In reasoning, the system has to generate an intermediate move, decide whether that move is useful, and keep small local errors from accumulating into a wrong final answer. The natural question: can a committee of weak reasoning-model calls reach the performance of much stronger models?

On SWE-bench Verified, a single call to GPT-5.4 nano resolves 67.0% of tasks. Wrapping that same nano model in a critic–comparator committee lifts it to 76.4% — matching standalone Gemini 3 Pro and Claude Opus 4.5 Thinking. And an oracle that could always pick the best of 8 nano proposals would reach 79.0%. So most of the correct patches are already in the weak model's pool. The hard part is selecting them.

The central reframing
Agentic systems are inference-time boosting for reasoning models. The lever is not “more agents help.” Sampling exposes latent correct solutions; critics and comparators must then recover those solutions without access to the hidden verifier. Generation and recognition are different capabilities, and the gap between them is where most of the design lives.

The analysis separates four quantities. The rest of the post is organized around them:

1. Coverage: does a good move appear? Whether some proposal in the pool is a progressing-sound next action. Amplified by sampling more candidates.
2. Identifiability: can the system recognize it? Whether critics and comparators can pick the good move out of the pool. Needs an extra local soundness signal.
3. Progress: do local choices compose? A rank function ensures each sound action gets strictly closer to a solution, so good steps chain into a terminal answer.
4. Diversity: do more calls escape shared failures? More calls reduce sampling noise, but cannot fix blind spots that every proposer shares.
Based on: V. Sunkaraneni, P. Beneventano, R. Neumarker, T. Poggio, T. Galanti. “Agentic Systems as Boosting Weak Reasoning Models”, arXiv:2605.14163, 2026.

Part I — Boosting, moved to inference time

The setting is verifier-backed reasoning: code repair, theorem proving, program synthesis — domains where tests, type checkers, execution, proof checkers, or constraint solvers can supply a local soundness signal. The system is modeled as bounded-depth search over partial reasoning states with local progress.

Formally, each task induces a state space with a set of valid states (those from which a correct completion is still possible), terminal states accepted by a verifier, and a rank function $d_x(s)$ measuring distance to a solution, equal to $0$ at terminal states. Writing $s^{*}_a$ for the state reached by taking action $a$ at state $s$, the action is progressing-sound if it keeps the state valid and strictly decreases the rank:

Definition · progressing-sound action
$$s^{*}_a \in \mathrm{Valid}_x \qquad\text{and}\qquad d_x(s^{*}_a) < d_x(s)$$

In the running example — SWE-bench Verified — the state holds the current repo worktree, the issue, and the visible tests; success is judged by hidden tests. A progressing-sound action is a code edit that preserves some hidden-test-passing patch while reducing the remaining work. Visible tests, types, and linters reject many unsound edits, but they don't certify correctness.
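The definition reads as a two-part predicate: the successor must stay valid, and the rank must strictly drop. Below is a minimal Python sketch on a toy state space. The names `is_progressing_sound`, `toy_step`, `toy_valid`, and `toy_rank` are hypothetical, and the toy itself (a counter of remaining edits) is an assumption made for illustration, not the SWE-bench setup.

```python
from typing import Callable, TypeVar

S = TypeVar("S")
A = TypeVar("A")


def is_progressing_sound(
    state: S,
    action: A,
    step: Callable[[S, A], S],
    is_valid: Callable[[S], bool],
    rank: Callable[[S], int],
) -> bool:
    """True if the successor state is valid and has strictly smaller rank."""
    successor = step(state, action)
    return is_valid(successor) and rank(successor) < rank(state)


# Toy state space (assumption): the state is the number of edits still needed.
def toy_step(state: int, action: int) -> int:
    return state - action          # an action removes `action` units of work


def toy_valid(state: int) -> bool:
    return state >= 0              # overshooting makes the state invalid


def toy_rank(state: int) -> int:
    return state                   # rank 0 means a terminal (solved) state


if __name__ == "__main__":
    for a in (1, 0, 5):
        print(a, is_progressing_sound(3, a, toy_step, toy_valid, toy_rank))
```

In this toy, an action of size 0 is valid but makes no progress, and an action of size 5 decreases the counter but leaves the valid set; both fail the predicate for different reasons.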

The committee protocol

At each non-terminal state, the committee protocol $\Pi_{k,m,r}$ runs three roles in sequence, then advances and repeats.

Committee protocol $\Pi_{k,m,r}$: one step, repeated up to $L$ times
1. Propose ($k$). Proposers sample $k$ candidate moves.
2. Critique ($m$). Critics filter out candidates with refutable errors.
3. Compare ($r$). Comparators rank the survivors.
4. Advance. Apply the selected move, go to the next state, and repeat.

Two assumptions name the resources this architecture needs. Coverage (Assumption 1) says some proposer in a polynomial-size portfolio outputs a sound action with probability at least $\alpha_0 > 0$. Identifiability (Assumption 2) says there exist efficient critics and comparators with an edge: a critic never rejects a sound move but rejects an unsound one with probability $\ge \beta_0$, and a comparator prefers a sound move over an unsound one with probability $\ge \tfrac{1}{2}+\sigma_0$. These are different capabilities, and the rest of the analysis turns on that difference.
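One step of the propose, critique, compare loop can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: the function name `committee_step`, the callback signatures, the rule that a candidate survives only if all $m$ critic calls accept it, and the all-pairs majority vote used for comparison are choices made here. The toy proposers, critic, and comparator are likewise hypothetical.

```python
import random
from typing import Callable, Optional, Sequence

Proposer = Callable[[int, random.Random], int]
Critic = Callable[[int, int, random.Random], bool]            # True = accept
Comparator = Callable[[int, int, int, random.Random], bool]   # True = prefers first


def committee_step(state: int, proposers: Sequence[Proposer], critic: Critic,
                   comparator: Comparator, k: int, m: int, r: int,
                   rng: random.Random) -> Optional[int]:
    """Run one propose/critique/compare step; return the chosen action or None."""
    # Propose: k calls, cycling round-robin through the portfolio.
    candidates = [proposers[i % len(proposers)](state, rng) for i in range(k)]
    # Critique: keep a candidate only if all m critic calls accept it.
    survivors = [a for a in candidates
                 if all(critic(state, a, rng) for _ in range(m))]
    if not survivors:
        return None
    # Compare: every pair gets r votes; the pair's majority winner scores a point.
    wins = [0] * len(survivors)
    for i in range(len(survivors)):
        for j in range(i + 1, len(survivors)):
            votes_i = sum(comparator(state, survivors[i], survivors[j], rng)
                          for _ in range(r))
            if 2 * votes_i > r:
                wins[i] += 1
            else:                      # ties (possible for even r) go to j
                wins[j] += 1
    return survivors[max(range(len(survivors)), key=wins.__getitem__)]


# Toy roles (assumptions): action 1 is the only sound move.
def good_proposer(state: int, rng: random.Random) -> int:
    return 1


def bad_proposer(state: int, rng: random.Random) -> int:
    return 0


def strict_critic(state: int, action: int, rng: random.Random) -> bool:
    return action == 1                 # one-sided critic with edge beta = 1


def noisy_comparator(state: int, a: int, b: int, rng: random.Random) -> bool:
    if (a == 1) == (b == 1):
        return rng.random() < 0.5      # no signal between equally sound moves
    prefers_sound = rng.random() < 0.7  # edge sigma = 0.2
    return prefers_sound if a == 1 else not prefers_sound


if __name__ == "__main__":
    rng = random.Random(0)
    print(committee_step(3, [bad_proposer, good_proposer], strict_critic,
                         noisy_comparator, k=4, m=2, r=3, rng=rng))
```

The all-pairs comparison costs on the order of $k^2 r$ comparator calls; a knockout tournament would be cheaper, at the price of a different error analysis.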


Part II — Coverage is not identifiability

It is tempting to think that if good moves appear often enough, the system can simply learn to recognize them from the candidates. The paper proves this is false in general.

Proposition 1 · a black-box separation
There is a one-step task family where the proposer is uniform over $M$ actions in every world, the sound set is “everything except the hidden bad action $\theta$,” and no procedure that observes only candidate actions and polynomially many proposer samples has any uniform critic or comparator edge across worlds. Coverage holds; identifiability is impossible.

The intuition: if every world looks statistically identical from the proposal distribution alone, watching proposals tells you nothing about which move is the bad one. Sampling more candidates raises coverage, but it cannot manufacture a critic out of thin air. Recognition has to come from somewhere else — an accessible soundness signal.
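The indistinguishability argument can be checked with a short Bayesian calculation. The sketch below assumes a uniform prior over the hidden bad action $\theta$ (an assumption added here for illustration) and uses the proposition's uniform proposer, so the likelihood of any sample sequence is the same in every world. The function name `posterior_over_hidden_action` is hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence


def posterior_over_hidden_action(samples: Sequence[int], M: int) -> List[Fraction]:
    """Posterior over which of M actions is the hidden bad one, given proposer samples."""
    prior = Fraction(1, M)  # assumed uniform prior over theta
    # In every world theta the proposer is uniform over the M actions, so the
    # likelihood of the observed samples is (1/M)^n and does not depend on theta.
    likelihood = [Fraction(1, M) ** len(samples) for _theta in range(M)]
    evidence = sum(prior * lk for lk in likelihood)
    return [prior * lk / evidence for lk in likelihood]


if __name__ == "__main__":
    print(posterior_over_hidden_action([0, 3, 3, 1, 2], M=5))
```

Whatever samples are observed, the posterior equals the prior: proposer samples alone carry no information about $\theta$, which is the content of the separation.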

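[Figure 1: two panels. Left, coverage (does a good move appear?): from state $s_t$, candidates are sound or unsound, and more samples raise the chance that one is sound. Right, identifiability (can we recognize it?): from the candidates alone the sound one cannot be told apart; a verifier (tests, types, execution, proofs) recovers it.]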

Fig. 1 — Two distinct capabilities. Coverage (left) makes a sound move appear in the pool; identifiability (right) recovers it. Without a soundness signal the candidates are indistinguishable; a one-sided verifier supplies the critic and comparator edges.

When a one-sided local verifier is available, meaning one that always accepts sound moves and rejects each unsound move with probability $\ge 1-\nu$, it directly supplies the identifiability edges, with $\beta_0 = 1-\nu$ and $\sigma_0 = (1-\nu)/2$. This is a stronger requirement than final-answer verification: the task decomposition has to expose useful local checks at intermediate states, not only at the end.
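To see where $\sigma_0 = (1-\nu)/2$ can come from, consider one possible comparator built from the verifier. This construction is an assumption made for illustration and may differ from the paper's: run the verifier once on each candidate, prefer an accepted candidate over a rejected one, and break ties with a fair coin. Write $\rho \ge 1-\nu$ for the probability that the unsound candidate is rejected. The sound candidate is always accepted, so:

$$\Pr[\text{sound preferred}] \;=\; \underbrace{\rho}_{\text{unsound rejected}} \;+\; \underbrace{\tfrac{1}{2}(1-\rho)}_{\text{both accepted, coin flip}} \;=\; \tfrac{1}{2} + \tfrac{\rho}{2} \;\ge\; \tfrac{1}{2} + \tfrac{1-\nu}{2}$$

This matches the comparator edge $\sigma_0 = (1-\nu)/2$. The critic edge is direct: using the verifier itself as the critic gives $\beta_0 = 1-\nu$.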

“Sampling more candidates raises the chance a good move appears. It does not, by itself, tell you which one it is.”

Part III — Composing local steps into a trajectory

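Given coverage and identifiability, the bridge theorem shows the committee amplifies them. Cycling round-robin through the proposer portfolio $P_N$ gives every proposer at least $\lfloor k/|P_N| \rfloor$ of the $k$ calls, including the one that Assumption 1 guarantees is sound with probability at least $\alpha_0$. With enough calls, the chance that a sound move appears at a state can be made as high as desired: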

Theorem 1 · coverage amplification
$$\alpha_{\mathrm{committee}}(s) \;\ge\; 1-(1-\alpha_0)^{\lfloor k/|P_N|\rfloor} \;\ge\; 1-\delta_{\mathrm{prop}}$$
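Inverting the bound gives the number of proposer calls that suffices for a target miss probability $\delta_{\mathrm{prop}}$. The sketch below solves $(1-\alpha_0)^{\lfloor k/|P_N|\rfloor} \le \delta_{\mathrm{prop}}$ for $k$; the function names `proposals_needed` and `coverage_lower_bound` are hypothetical, and the numbers in the demo are illustrative rather than taken from the paper.

```python
import math


def coverage_lower_bound(alpha0: float, k: int, portfolio_size: int) -> float:
    """Lower bound 1 - (1 - alpha0)^floor(k / |P_N|) on committee coverage."""
    return 1.0 - (1.0 - alpha0) ** (k // portfolio_size)


def proposals_needed(alpha0: float, delta_prop: float, portfolio_size: int) -> int:
    """Smallest multiple of the portfolio size whose bound reaches 1 - delta_prop."""
    if not (0.0 < alpha0 <= 1.0 and 0.0 < delta_prop < 1.0 and portfolio_size >= 1):
        raise ValueError("need 0 < alpha0 <= 1, 0 < delta_prop < 1, portfolio_size >= 1")
    if alpha0 == 1.0:
        rounds = 1
    else:
        # Rounds t with (1 - alpha0)^t <= delta_prop, i.e. t >= ln(delta) / ln(1 - alpha0).
        rounds = max(1, math.ceil(math.log(delta_prop) / math.log(1.0 - alpha0)))
    return rounds * portfolio_size


if __name__ == "__main__":
    k = proposals_needed(alpha0=0.3, delta_prop=0.05, portfolio_size=4)
    print(k, coverage_lower_bound(0.3, k, 4))
```

The portfolio size multiplies the cost because the theorem does not say which proposer carries the edge, so every proposer has to be called the same number of times.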

The local committee error then splits cleanly into two failure modes — there was no good proposal, or a bad proposal survived the critics and won the comparison:

Theorem 2 · local error decomposition
$$\varepsilon_{\mathrm{loc}}(s) \;\le\; \underbrace{\varepsilon_{\mathrm{prop}}(k;s)}_{\text{no good proposal}} \;+\; \underbrace{k^{2}\,e^{-\beta m - 2r\sigma^{2}}}_{\text{bad proposal survives \& wins}}$$

The proposal term shrinks geometrically in the number of proposals, $\varepsilon_{\mathrm{prop}}(k;s) \le (1-\alpha)^k \le e^{-\alpha k}$. The identification term shrinks exponentially in the critic budget $m$ and the comparator budget $r$, but its $k^2$ prefactor grows with the number of proposals, so the combined budget $\beta m + 2r\sigma^2$ has to grow like $2\ln k$ to keep the term small. Because each progressing-sound action strictly decreases the rank, a trajectory has at most $L_x$ steps, and by a union bound the per-step errors add up along it. Combined with the blind-spot split from Part IV, the global failure bound is:

$$\mathrm{err}_x(k,m,r) \;\le\; L_x\Big[\,\underbrace{B}_{\text{blind spots}} + \underbrace{R_k}_{\text{finite sampling}} + \underbrace{k^{2}e^{-\beta m - 2r\sigma^{2}}}_{\text{identifiability}}\,\Big]$$
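The identifiability term can be turned into a budget rule: $k^{2}e^{-\beta m - 2r\sigma^{2}} \le \delta$ holds exactly when $\beta m + 2r\sigma^{2} \ge \ln(k^{2}/\delta)$. The sketch below computes the critic budget that suffices for a fixed comparator budget, and evaluates the global bound. The function names are hypothetical, the example values are illustrative, and $\beta$ and $\sigma$ are read here as the per-call critic and comparator edges, which is an assumption about the notation.

```python
import math


def identification_term(k: int, m: int, r: int, beta: float, sigma: float) -> float:
    """The bound k^2 * exp(-beta*m - 2*r*sigma^2); may exceed 1 (vacuous)."""
    return k ** 2 * math.exp(-beta * m - 2 * r * sigma ** 2)


def critics_needed(k: int, r: int, beta: float, sigma: float, delta: float) -> int:
    """Smallest m such that the identification term is at most delta, for fixed r."""
    shortfall = math.log(k ** 2 / delta) - 2 * r * sigma ** 2
    return max(0, math.ceil(shortfall / beta))


def global_failure_bound(L: int, B: float, R_k: float, ident: float) -> float:
    """L * (B + R_k + ident), capped at 1 since it bounds a probability."""
    return min(1.0, L * (B + R_k + ident))


if __name__ == "__main__":
    k, r, beta, sigma, delta = 8, 0, 0.5, 0.2, 0.01
    m = critics_needed(k, r, beta, sigma, delta)
    print(m, identification_term(k, m, r, beta, sigma))
```

Doubling $k$ adds $2\ln 2 \approx 1.39$ to the required exponent, so the extra critic and comparator votes needed grow only logarithmically with the number of proposals.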

Three knobs, three terms. The simulator below lets you turn them. Notice what happens: with enough proposals, critic calls, and comparator votes, the sampling and identifiability terms collapse — and the only thing left is the blind-spot floor $B$, which none of these knobs can touch.

[Interactive figure: amplification simulator. It plots the three error terms ($B$ for blind spots, $R_k$ for finite sampling, and the identifiability error) as the committee budget changes.]

Fig. 2. Illustrative model: the sampling residual is taken as $R_k=(1-B)(1-\alpha)^k$ and the local error as the sum of the three terms (each clamped to $[0,1]$); global success is shown as the per-step product $(1-\varepsilon_{\mathrm{loc}})^{L}$ with $L=3$. The identifiability term is an upper bound and can exceed $1$ when $m$ or $r$ are too small — that is the bound going vacuous, signalling you need more critic or comparator votes.
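The illustrative model in the caption is small enough to reproduce directly. The Python sketch below follows the caption's formulas; the function name `simulate`, the dictionary keys, and the choice to clamp the summed local error to $[0,1]$ as well as each term are assumptions made here, and the parameter values are illustrative.

```python
import math
from typing import Dict


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def simulate(B: float, alpha: float, beta: float, sigma: float,
             k: int, m: int, r: int, L: int = 3) -> Dict[str, float]:
    """Evaluate the three error terms and the per-step-product success rate."""
    floor = clamp01(B)                                     # blind-spot floor
    sampling = clamp01((1.0 - B) * (1.0 - alpha) ** k)     # R_k = (1-B)(1-alpha)^k
    ident = clamp01(k ** 2 * math.exp(-beta * m - 2 * r * sigma ** 2))
    eps_loc = clamp01(floor + sampling + ident)            # local error (clamped sum)
    return {
        "B": floor,
        "R_k": sampling,
        "ident": ident,
        "eps_loc": eps_loc,
        "success": (1.0 - eps_loc) ** L,                   # (1 - eps_loc)^L
    }


if __name__ == "__main__":
    small = simulate(B=0.1, alpha=0.3, beta=0.5, sigma=0.2, k=8, m=1, r=1)
    large = simulate(B=0.1, alpha=0.3, beta=0.5, sigma=0.2, k=8, m=40, r=40)
    print("small budget:", small)
    print("large budget:", large)
```

With a small critic and comparator budget the identifiability bound is vacuous and clamps to 1; with a large one it is negligible, and the local error is dominated by $B$ plus the sampling residual.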


Part IV — The blind-spot ceiling

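The proposal failure term hides something important. Suppose that, conditional on a latent slice $Z$ of the task, each proposal is independently sound with probability $q_s(Z)$. Then the chance that none of $k$ proposals is sound is $\mathbb{E}[(1-q_s(Z))^k]$, which splits into a part that sampling cannot reduce and a part that it can: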

Lemma 2 · oracle miss splits into a floor plus a residual
$$\varepsilon_{\mathrm{prop}}(k;s) = \mathbb{E}\big[(1-q_s(Z))^k\big] = \underbrace{B_s}_{\;\mathbb{P}(q_s(Z)=0)\;} + \;R_k(s)$$

Here $R_k(s) = \mathbb{E}\big[(1-q_s(Z))^k\,\mathbf{1}\{q_s(Z)>0\}\big]$ collects the slices where a sound move has positive probability. As you sample more ($k\to\infty$), the residual $R_k(s)$ vanishes, but the floor $B_s$, the mass of task slices where the proposer assigns zero probability to every sound move, does not. No amount of sampling, critiquing, or comparing can recover a move that was never proposed. This is the formal boostable-capability ceiling: with a perfect oracle selector, best-of-$k$ converges only to $1-B$.
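For a discrete latent slice the split can be computed exactly. The sketch below represents the distribution of $Z$ as a list of (probability mass, $q$) pairs; the function name `oracle_miss` and the three-slice example are hypothetical and chosen only to make the floor visible.

```python
from typing import List, Tuple


def oracle_miss(slices: List[Tuple[float, float]], k: int) -> Tuple[float, float, float]:
    """Return (floor B, residual R_k, total miss probability) for k proposals.

    Each slice is (mass, q): the slice's probability and the per-proposal
    chance of a sound move inside it. Masses are assumed to sum to 1.
    """
    floor = sum(mass for mass, q in slices if q == 0)
    residual = sum(mass * (1.0 - q) ** k for mass, q in slices if q > 0)
    return floor, residual, floor + residual


if __name__ == "__main__":
    # Hypothetical task mix: 20% blind spots, 50% hard slices, 30% easy slices.
    slices = [(0.2, 0.0), (0.5, 0.3), (0.3, 0.9)]
    for k in (1, 8, 64):
        print(k, oracle_miss(slices, k))
```

In this example the average per-proposal coverage is $0.42$, yet the miss probability never drops below $0.2$ however large $k$ becomes.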

Why this changes evaluation
pass@1 measures one-shot capability. Oracle best-of-$k$ measures the capability latent in the proposal pool under perfect selection. The gap between them diagnoses selection; the gap between oracle best-of-$k$ and a stronger model reflects generation and shared blind spots. A single accuracy number conflates all three.

This motivates a recovery metric: of the gap that oracle selection exposes, how much does the real harness actually recover?

Oracle-gap recovery
$$\mathrm{Rec}(k,m,r;P) = \frac{p_{\mathrm{system}} - p_1}{p_{\mathrm{oracle}} - p_1}$$
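The metric is a single ratio, shown here with the SWE-bench Verified figures quoted in this post (67.0% for one call, 76.4% for the orchestrated system, 79.0% for oracle best-of-8). The function name `oracle_gap_recovery` is hypothetical.

```python
def oracle_gap_recovery(p_system: float, p_one: float, p_oracle: float) -> float:
    """Fraction of the oracle gap (p_oracle - p_one) that the real system recovers."""
    gap = p_oracle - p_one
    if gap <= 0:
        raise ValueError("oracle best-of-k must exceed pass@1 for the ratio to be defined")
    return (p_system - p_one) / gap


if __name__ == "__main__":
    rec = oracle_gap_recovery(p_system=0.764, p_one=0.670, p_oracle=0.790)
    print(f"{rec:.3f}")
```

A value near 1 says the selector is close to the oracle and further gains must come from the proposer; a value near 0 says correct candidates are present but are not being picked.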

An average coverage condition can quietly hide blind spots: you can have $\mathbb{E}[q_s(Z)] > 0$ overall while $q_s(Z) = 0$ on a whole subpopulation. Lowering the floor is not a sampling problem — it requires changing the proposal system itself: the model, the prompts, the tools, retrieval, decomposition, or genuine proposer diversity.


Part V — Weak to frontier, on SWE-bench Verified

The empirical centerpiece puts the theory to work. Using a single weak model, GPT-5.4 nano, the critic–comparator orchestration climbs from a weak one-shot baseline to the level of substantially stronger standalone models — and lands just short of the oracle ceiling.

SWE-bench Verified resolve rate ($k = 8$ proposals)
GPT-5.4 nano (1 proposal): 67.0%
nano + orchestration: 76.4%
oracle best-of-8 (ceiling): 79.0%

Fig. 3. The orchestrated nano committee at 76.4% matches standalone Gemini 3 Pro and Claude Opus 4.5 Thinking, and exceeds GPT-5.4 mini. The oracle best-of-8 result at 79.0% shows correct patches are usually already in the nano pool; the remaining 2.6 points are a pure selection gap.

+12.0 points of latent capability exposed by 8 proposals (67.0 → 79.0)
+9.4 points the harness actually recovers (67.0 → 76.4)
≈78% of the oracle gap recovered by the real selector

The ablations trace the mechanism term by term. Increasing proposer diversity exposes latent correct patches (coverage). Critics filter flawed candidates (the $\beta$ edge). Comparators rank the plausible survivors (the $\sigma$ edge). And the failures that remain are mostly proposal-coverage failures — shared blind spots where no nano proposal contained a hidden-test-passing patch, exactly the irreducible floor $B$ that stronger selection alone cannot close.

| Quantity | What it measures | On SWE-bench |
| --- | --- | --- |
| pass@1 | one-shot capability of the proposer | 67.0% |
| System (k=8) | what the real critic–comparator harness recovers | 76.4% |
| Oracle best-of-8 | latent capability under perfect selection | 79.0% |
| Selection gap | oracle − system (recoverable by better selection) | 2.6 pts |
| Blind-spot region | $1 -$ oracle (needs a better proposer) | 21.0% |

What this does and does not say

The claim is mechanistic, not magical. A committee of weak calls reaches strong-model performance when two separate resources are present: coverage, so a good move appears, and identifiability, so the system can recognize it. The bridge between them is a local soundness signal — tests, execution, types, proofs, constraints, or a learned reviewer. Take that signal away and Proposition 1 says no amount of sampling rebuilds it.

The ceiling is equally real. Oracle best-of-$k$ converges to $1-B$, so once the harness is recovering most of the oracle gap, the binding constraint is the proposer's blind spots, not the selector. At that point the productive move is not more votes — it is a more diverse or more capable proposal system. The analysis also relies on a rank function and conditional-independence assumptions for the local decomposition; real trajectories are messier, and the bounds are upper bounds that can be loose.

Generation and recognition are different.
Coverage is amplified by sampling; identifiability must come from a soundness signal. Conflating them is the core mistake the framework corrects.
Evaluate by role.
pass@1, oracle best-of-$k$, and the recovery fraction $\mathrm{Rec}$ tell you whether to fix the proposer or the selector — a single accuracy number cannot.

Takeaway

Agentic systems are inference-time boosting for reasoning models: weak proposals supply breadth, local checks supply recognition, and a rank function lets sound steps compose into a solution.

The local error has three sources: a blind-spot floor $B$, a finite-sampling residual $R_k$, and an identifiability term $k^2 e^{-\beta m - 2r\sigma^2}$. More proposals shrink the second; more critic and comparator votes shrink the third; no committee budget shrinks the first, which only a different proposal system can lower.

Coverage does not imply identifiability. Reliable amplification needs an additional soundness signal — execution, tests, types, proofs, or constraints — without which selection cannot recover what sampling exposes.

On SWE-bench Verified, a committee of weak GPT-5.4 nano calls reaches 76.4%, matching frontier standalone models and recovering about 78% of the 67.0→79.0 oracle gap. The correct patches were mostly already there; the win was learning to pick them — and the remaining failures point at the proposer's blind spots, not the selector.

