Agentic Systems as Boosting
Boosting turns weak learners into strong ones by combining many imperfect signals. Can a committee of weak reasoning-model calls do the same at inference time — reaching the level of much stronger models? And if a correct answer is already hiding in the pool, what does it take to actually pick it out?
Introduction
Classical boosting takes a weak predictor — one that is only slightly better than chance — and, by repeatedly combining imperfect but useful signals, builds a strong predictor. Modern language-model systems lean on a related idea at inference time: sample several candidates, check or compare them, search over partial states, and select a final output.
But reasoning is not ordinary supervised boosting. In supervised prediction, each weak learner returns a label you can score against training examples. In reasoning, the system has to generate an intermediate move, decide whether that move is useful, and keep small local errors from accumulating into a wrong final answer. The natural question: can a committee of weak reasoning-model calls reach the performance of much stronger models?
On SWE-bench Verified, a single call to GPT-5.4 nano resolves 67.0% of tasks. Wrapping that same nano model in a critic–comparator committee lifts it to 76.4% — matching standalone Gemini 3 Pro and Claude Opus 4.5 Thinking. And an oracle that could always pick the best of 8 nano proposals would reach 79.0%. So most of the correct patches are already in the weak model's pool. The hard part is selecting them.
The analysis separates four quantities, and the rest of the post is organized around them.
Part I — Boosting, moved to inference time
The setting is verifier-backed reasoning: code repair, theorem proving, program synthesis — domains where tests, type checkers, execution, proof checkers, or constraint solvers can supply a local soundness signal. The system is modeled as bounded-depth search over partial reasoning states with local progress.
Formally, each task induces a state space with a set of valid states (those from which a correct completion is still possible), terminal states accepted by a verifier, and a rank function $d_x(s)$ measuring distance to a solution, equal to $0$ at terminal states. An action is progressing-sound if it keeps the state valid and strictly decreases the rank:
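In symbols, one way to write this condition is the following. The names $V_x$ for the set of valid states and $T(s,a)$ for the state reached by applying action $a$ at state $s$ are notation assumed here, not taken from the source:

$$
a \text{ is progressing-sound at } s \quad\Longleftrightarrow\quad T(s,a) \in V_x \ \text{ and } \ d_x\big(T(s,a)\big) < d_x(s).
$$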
In the running example — SWE-bench Verified — the state holds the current repo worktree, the issue, and the visible tests; success is judged by hidden tests. A progressing-sound action is a code edit that preserves some hidden-test-passing patch while reducing the remaining work. Visible tests, types, and linters reject many unsound edits, but they don't certify correctness.
The committee protocol
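At each non-terminal state, the committee protocol $\Pi_{k,m,r}$ runs three roles in sequence: proposers draw $k$ candidate actions, critics screen the candidates using a budget of $m$ calls, and comparators rank the survivors using $r$ votes. The system then advances to the selected state and repeats.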
Two assumptions name the resources this architecture needs. Coverage (Assumption 1) says some proposer in a polynomial-size portfolio outputs a sound action with probability at least $\alpha_0 > 0$. Identifiability (Assumption 2) says there exist efficient critics and comparators with an edge: a critic never rejects a sound move but rejects an unsound one with probability $\ge \beta_0$, and a comparator prefers sound over unsound with probability $\ge \tfrac{1}{2}+\sigma_0$. These are different capabilities — and that difference is the whole story.
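One step of the protocol can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `committee_step` and `make_toy` are hypothetical, the $m$ critic calls are applied per candidate, the $r$ comparator votes are applied per pairwise match, and the toy "actions" are just booleans (`True` meaning sound) with edges $\alpha$, $\beta$, $\sigma$ set by hand.

```python
import random
from typing import Callable, Optional, TypeVar

A = TypeVar("A")


def committee_step(
    propose: Callable[[], A],
    critic: Callable[[A], bool],
    compare: Callable[[A, A], bool],
    k: int,
    m: int,
    r: int,
) -> Optional[A]:
    """One hypothetical step of Pi_{k,m,r}: propose, filter, compare."""
    # 1. Proposers: draw k candidate actions.
    candidates = [propose() for _ in range(k)]
    # 2. Critics: keep a candidate only if all m critic calls accept it.
    survivors = [a for a in candidates if all(critic(a) for _ in range(m))]
    if not survivors:
        return None  # nothing survived the critics
    # 3. Comparators: the current champion meets each challenger
    #    in a match decided by a majority of r votes.
    best = survivors[0]
    for challenger in survivors[1:]:
        votes = sum(compare(challenger, best) for _ in range(r))
        if 2 * votes > r:
            best = challenger
    return best


def make_toy(alpha: float, beta: float, sigma: float, rng: random.Random):
    """Toy roles (assumed for illustration): an action is True iff it is sound."""

    def propose() -> bool:
        return rng.random() < alpha  # sound with probability alpha

    def critic(a: bool) -> bool:
        # One-sided: always accepts sound, rejects unsound with probability beta.
        return a or rng.random() >= beta

    def compare(a: bool, b: bool) -> bool:
        # Returns True when `a` is preferred over `b`.
        if a == b:
            return rng.random() < 0.5  # no signal: fair coin
        p_a = 0.5 + sigma if a else 0.5 - sigma
        return rng.random() < p_a

    return propose, critic, compare


rng = random.Random(0)
propose, critic, compare = make_toy(alpha=0.3, beta=0.6, sigma=0.2, rng=rng)
trials = 2000
wins = sum(
    committee_step(propose, critic, compare, k=8, m=4, r=9) is True
    for _ in range(trials)
)
success_rate = wins / trials
print(f"fraction of steps that selected a sound action: {success_rate:.3f}")
```

Raising `k` makes it more likely that a sound action is in the pool at all, while raising `m` and `r` makes it more likely that the sound action is the one that survives. The sketch keeps those two effects in separate code paths for that reason.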
Part II — Coverage is not identifiability
It is tempting to think that if good moves appear often enough, the system can simply learn to recognize them from the candidates. The paper proves this is false in general (Proposition 1): coverage alone does not supply identifiability.
The intuition: if every world looks statistically identical from the proposal distribution alone, watching proposals tells you nothing about which move is the bad one. Sampling more candidates raises coverage, but it cannot manufacture a critic out of thin air. Recognition has to come from somewhere else — an accessible soundness signal.
Fig. 1 — Two distinct capabilities. Coverage (left) makes a sound move appear in the pool; identifiability (right) recovers it. Without a soundness signal the candidates are indistinguishable; a one-sided verifier supplies the critic and comparator edges.
When a one-sided local verifier is available — one that always accepts sound moves and rejects unsound ones with probability $\ge 1-\nu$ — it directly supplies the identifiability edges, with $\beta_0 = 1-\nu$ and $\sigma_0 = (1-\nu)/2$. This is stronger than final-answer verification: the task decomposition has to expose useful local checks.
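The comparator edge follows from a short calculation. As a sketch, assume the comparator runs the verifier on both moves, prefers an accepted move over a rejected one, and breaks ties with a fair coin (the tie-breaking rule is an assumption made here). The sound move is always accepted, so with $p = \Pr[\text{unsound move rejected}] \ge 1-\nu$:

$$
\Pr[\text{sound preferred}] \;=\; p \cdot 1 + (1-p)\cdot \tfrac{1}{2} \;=\; \tfrac{1}{2} + \tfrac{p}{2} \;\ge\; \tfrac{1}{2} + \tfrac{1-\nu}{2}.
$$

This matches $\sigma_0 = (1-\nu)/2$. The critic edge $\beta_0 = 1-\nu$ is read off directly from the verifier's rejection probability.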
Part III — Composing local steps into a trajectory
Given coverage and identifiability, the bridge theorem shows the committee amplifies them. Round-robin over the proposer portfolio with enough calls makes the chance a sound move appears at a state as high as you like:
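A hedged reconstruction of the bound this paragraph points to. It assumes a portfolio of $N$ proposers (the symbol $N$ is introduced here), independent calls, and $k$ calls dealt round-robin so that the proposer guaranteed by Assumption 1 is invoked at least $\lfloor k/N \rfloor$ times:

$$
\Pr[\text{no sound move among the } k \text{ proposals at } s] \;\le\; (1-\alpha_0)^{\lfloor k/N \rfloor} \;\le\; e^{-\alpha_0 \lfloor k/N \rfloor}.
$$

Under these assumptions, choosing $k \ge N \lceil \ln(1/\delta)/\alpha_0 \rceil$ pushes the failure probability below any target $\delta \in (0,1)$.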
The local committee error then splits cleanly into two failure modes — there was no good proposal, or a bad proposal survived the critics and won the comparison:
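Written out as a union bound over the two events, the split might read as follows. The label $\varepsilon_{\mathrm{id}}$ for the identification failure is introduced here as an assumption; $\varepsilon_{\mathrm{loc}}$ and $\varepsilon_{\mathrm{prop}}$ follow the post's own notation:

$$
\varepsilon_{\mathrm{loc}}(s) \;\le\; \underbrace{\varepsilon_{\mathrm{prop}}(k;s)}_{\text{no sound proposal}} \;+\; \underbrace{\varepsilon_{\mathrm{id}}(k,m,r;s)}_{\text{unsound proposal selected}}.
$$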
The proposal term shrinks geometrically in the number of proposals, $\varepsilon_{\mathrm{prop}}(k;s) \le (1-\alpha)^k \le e^{-\alpha k}$. The identification term shrinks exponentially in the critic budget $m$ and comparator budget $r$. Because each progressing-sound action strictly decreases the rank, a trajectory has at most $L_x$ steps, and the errors simply add up along it. Combined with the blind-spot split from Part IV, the global failure bound is:
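A sketch of the resulting bound, reconstructed from the three terms the post names ($B$, $R_k$, and $k^2 e^{-\beta m - 2r\sigma^2}$) and a union bound over at most $L_x$ steps. Treating $B$ and $R_k$ as uniform over the states on the trajectory is a simplifying assumption made here, and the source's exact statement may differ in constants:

$$
\Pr[\text{committee fails on } x] \;\le\; L_x \left( B + R_k + k^2 e^{-\beta m - 2 r \sigma^2} \right).
$$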
Three knobs, three terms. With enough proposals, critic calls, and comparator votes, the sampling and identifiability terms collapse, and the only thing left is the blind-spot floor $B$, which none of these knobs can touch.
Fig. 2. Illustrative model: the sampling residual is taken as $R_k=(1-B)(1-\alpha)^k$ and the local error as the sum of the three terms (each clamped to $[0,1]$); global success is shown as the per-step product $(1-\varepsilon_{\mathrm{loc}})^{L}$ with $L=3$. The identifiability term is an upper bound and can exceed $1$ when $m$ or $r$ are too small — that is the bound going vacuous, signalling you need more critic or comparator votes.
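The illustrative model in the caption is small enough to reproduce directly. The sketch below follows the caption's formulas; the function names, the parameter values, and the extra clamp on the summed local error are assumptions added here so the per-step product stays in $[0,1]$.

```python
import math


def clamp01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def local_error(B: float, alpha: float, beta: float, sigma: float,
                k: int, m: int, r: int) -> float:
    """Sum of the three local error terms, each clamped to [0, 1]."""
    blind_spot = clamp01(B)
    residual = clamp01((1.0 - B) * (1.0 - alpha) ** k)       # R_k
    identifiability = clamp01(k ** 2 * math.exp(-beta * m - 2.0 * r * sigma ** 2))
    # Clamping the sum as well is an added assumption, not stated in the caption.
    return clamp01(blind_spot + residual + identifiability)


def global_success(eps_loc: float, L: int = 3) -> float:
    """Per-step product (1 - eps_loc)^L, with L = 3 as in the caption."""
    return (1.0 - eps_loc) ** L


# Hypothetical parameter values, chosen only to show the two regimes.
params = dict(B=0.05, alpha=0.3, beta=0.5, sigma=0.2)

eps_small = local_error(**params, k=2, m=2, r=2)      # tiny budgets: bound is vacuous
eps_large = local_error(**params, k=40, m=40, r=200)  # large budgets: only B remains

print(f"small budgets: eps_loc={eps_small:.4f}, success={global_success(eps_small):.4f}")
print(f"large budgets: eps_loc={eps_large:.4f}, success={global_success(eps_large):.4f}")
```

With the small budgets the identifiability term alone exceeds 1, so the clamped local error is 1 and the bound says nothing useful. With the large budgets the local error settles at the floor $B = 0.05$, and no further increase in `k`, `m`, or `r` moves it.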
Part IV — The blind-spot ceiling
The proposal failure term hides something important. Suppose that, conditional on a latent slice $Z$ of the task, each proposal is sound with probability $q_s(Z)$. Then the chance no proposal is sound factorizes:
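Spelled out under that conditional-independence assumption, with the definitions of $B_s$ and $R_k(s)$ inferred from how the post uses them (the indicator notation is introduced here):

$$
\Pr[\text{no sound proposal among } k] \;=\; \mathbb{E}\big[(1-q_s(Z))^k\big] \;=\; \underbrace{\Pr[q_s(Z)=0]}_{B_s} \;+\; \underbrace{\mathbb{E}\big[(1-q_s(Z))^k \,\mathbf{1}\{q_s(Z)>0\}\big]}_{R_k(s)}.
$$

In the special case where $q_s(Z) = \alpha$ on every non-blind slice, the second term reduces to $R_k = (1-B)(1-\alpha)^k$, the form used in the Fig. 2 model.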
As you sample more ($k\to\infty$), the residual $R_k(s)$ vanishes, but the floor $B_s$ — the mass of task slices where the proposer assigns zero probability to any sound move — does not. No amount of sampling, critiquing, or comparing can recover a move that was never proposed. This is the formal boostable-capability ceiling: with a perfect oracle selector, best-of-$k$ converges only to $1-B$.
This motivates a recovery metric: of the gap that oracle selection exposes, how much does the real harness actually recover?
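One natural way to write this metric is as the fraction of the oracle gap that the real system closes. The exact form below is an assumption, chosen to be consistent with the recovery figure the post reports in its takeaway:

$$
\text{recovery}(k) \;=\; \frac{\text{system}(k) - \text{pass@1}}{\text{oracle best-of-}k - \text{pass@1}}.
$$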
An average coverage condition can quietly hide blind spots: you can have $\mathbb{E}[q_s(Z)] > 0$ overall while $q_s(Z) = 0$ on a whole subpopulation. Lowering the floor is not a sampling problem — it requires changing the proposal system itself: the model, the prompts, the tools, retrieval, decomposition, or genuine proposer diversity.
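A two-slice toy example makes the point concrete. The slice masses and per-slice probabilities below are hypothetical numbers picked for illustration, and `oracle_best_of_k` is a name introduced here; the calculation assumes proposals are independent given the slice.

```python
from typing import List, Tuple


def oracle_best_of_k(slices: List[Tuple[float, float]], k: int) -> float:
    """Probability that at least one of k proposals is sound.

    `slices` is a list of (mass, q) pairs: the probability of the task slice
    and the per-proposal chance of a sound move on that slice.
    """
    return sum(mass * (1.0 - (1.0 - q) ** k) for mass, q in slices)


# Hypothetical task mix: 20% of tasks are a blind spot (q = 0),
# 80% are tasks where each proposal is sound with probability 0.5.
slices = [(0.2, 0.0), (0.8, 0.5)]

mean_q = sum(mass * q for mass, q in slices)  # average coverage looks healthy
blind_spot_floor = sum(mass for mass, q in slices if q == 0.0)

print(f"E[q] = {mean_q:.2f}, blind-spot floor B = {blind_spot_floor:.2f}")
for k in (1, 8, 64):
    print(f"k={k:>2}: oracle best-of-k = {oracle_best_of_k(slices, k):.4f}")
```

Average coverage is 0.4, yet the oracle curve saturates at $1 - B = 0.8$: the blind slice contributes nothing at any `k`.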
Part V — Weak to frontier, on SWE-bench Verified
The empirical centerpiece puts the theory to work. Using a single weak model, GPT-5.4 nano, the critic–comparator orchestration climbs from a weak one-shot baseline to the level of substantially stronger standalone models — and lands just short of the oracle ceiling.
Fig. 3. The orchestrated nano committee at 76.4% matches standalone Gemini 3 Pro and Claude Opus 4.5 Thinking, and exceeds GPT-5.4 mini. The oracle best-of-8 curve at 79.0% shows correct patches are usually already in the nano pool; the remaining 2.6 points is a pure selection gap.
Latent gap exposed by 8 proposals: 12.0 points (67.0 → 79.0). Recovered by the real selector: 9.4 points (67.0 → 76.4), about 78% of that gap.
The ablations trace the mechanism term by term. Increasing proposer diversity exposes latent correct patches (coverage). Critics filter flawed candidates (the $\beta$ edge). Comparators rank the plausible survivors (the $\sigma$ edge). And the failures that remain are mostly proposal-coverage failures — shared blind spots where no nano proposal contained a hidden-test-passing patch, exactly the irreducible floor $B$ that stronger selection alone cannot close.
| Quantity | What it measures | On SWE-bench Verified |
|---|---|---|
| pass@1 | one-shot capability of the proposer | 67.0% |
| System (k=8) | what the real critic–comparator harness recovers | 76.4% |
| Oracle best-of-8 | latent capability under perfect selection | 79.0% |
| Selection gap | oracle − system (recoverable by better selection) | 2.6 pts |
| Blind-spot region | $1 -$ oracle (needs a better proposer) | 21.0% |
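The derived rows follow from the three measured numbers by simple arithmetic. The short script below recomputes them; the variable names are introduced here, and the recovery ratio is taken to mean the share of the pass@1-to-oracle gap closed by the real system.

```python
# Measured values from the table, in percent of SWE-bench Verified tasks.
pass_at_1 = 67.0   # one-shot nano
system = 76.4      # critic-comparator harness, k = 8
oracle = 79.0      # oracle best-of-8

oracle_gap = oracle - pass_at_1       # what perfect selection would add
recovered = system - pass_at_1        # what the real harness adds
selection_gap = oracle - system       # left on the table by the selector
not_in_pool = 100.0 - oracle          # tasks with no passing patch among 8
recovery = recovered / oracle_gap     # share of the oracle gap recovered

print(f"oracle gap:     {oracle_gap:.1f} pts")
print(f"recovered:      {recovered:.1f} pts")
print(f"selection gap:  {selection_gap:.1f} pts")
print(f"not in pool:    {not_in_pool:.1f}%")
print(f"recovery ratio: {recovery:.1%}")
```

Note that `not_in_pool` is measured at $k=8$, so it includes any finite-sampling residual in addition to the true floor $B$.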
What this does and does not say
The claim is mechanistic, not magical. A committee of weak calls reaches strong-model performance when two separate resources are present: coverage, so a good move appears, and identifiability, so the system can recognize it. The bridge between them is a local soundness signal — tests, execution, types, proofs, constraints, or a learned reviewer. Take that signal away and Proposition 1 says no amount of sampling rebuilds it.
The ceiling is equally real. Oracle best-of-$k$ converges to $1-B$, so once the harness is recovering most of the oracle gap, the binding constraint is the proposer's blind spots, not the selector. At that point the productive move is not more votes — it is a more diverse or more capable proposal system. The analysis also relies on a rank function and conditional-independence assumptions for the local decomposition; real trajectories are messier, and the bounds are upper bounds that can be loose.
Takeaway
Agentic systems are inference-time boosting for reasoning models: weak proposals supply breadth, local checks supply recognition, and a rank function lets sound steps compose into a solution.
The local error has three sources — a blind-spot floor $B$, a finite-sampling residual $R_k$, and an identifiability term $k^2 e^{-\beta m - 2r\sigma^2}$. More proposals shrink the second; more critic and comparator votes shrink the third; nothing shrinks the first.
Coverage does not imply identifiability. Reliable amplification needs an additional soundness signal — execution, tests, types, proofs, or constraints — without which selection cannot recover what sampling exposes.
On SWE-bench Verified, a committee of weak GPT-5.4 nano calls reaches 76.4%, matching frontier standalone models and recovering about 78% of the 67.0→79.0 oracle gap. The correct patches were mostly already there; the win was learning to pick them — and the remaining failures point at the proposer's blind spots, not the selector.