Learning Theory · Program Synthesis · LLMs

Revisiting ERM in the LLM Era

How pretrained LLMs can serve as search priors for empirical risk minimization over programs, recovering exact algorithmic rules from just a handful of examples — while SGD-trained models remain at chance.

By Tomer Galanti · March 20, 2026 · 14 min read · ◆ Singhal, Mishra, Malach, Galanti — arXiv 2025

Introduction

Deep learning works extraordinarily well in vision, language, and audio. But outside these regimes there are many simple tasks that standard training methods still do not handle well: primality testing, parity, Boolean circuits, graph connectivity, and other rule-based problems. One can cast them as supervised learning tasks and train a large model on labeled examples. In practice, the model often fits the training data and still fails to recover the underlying rule.

A natural alternative is to use the fact that many such tasks admit short exact implementations — often just a few lines of Python. Instead of training a model to imitate the input-output map, we can search directly for a program that explains the data. This is empirical risk minimization (ERM) over a discrete class of programs, and classical learning theory tells us that such an approach can generalize from surprisingly little data.

The difficulty is computational. Exhaustive search over all Python programs is clearly infeasible. The number of candidates grows exponentially with program length, so even when the correct hypothesis is very short, brute-force ERM is usually not practical.

The central claim Pretrained LLMs can serve as search priors for empirical risk minimization over short programs — making classical program-based ERM practically searchable while preserving its strong sample-efficiency advantages.

The story has four parts:

SGD is the wrong search.

SGD-trained networks fit the training data and still fail to recover algorithmic rules. The failure is fundamental on SQ-hard tasks, not a tuning issue.

Program ERM is sample-efficient.

PAC-style guarantees scale with description length $L$, not input dimension $n$. A compact parity program generalizes from $O(k \log n)$ examples.

III

LLM-PV makes ERM feasible.

A pretrained LLM replaces uniform search with a structured proposal distribution. Verification still comes from held-out data — the ERM principle is preserved.

Exact rules from few examples.

LLM-PV recovers perfect accuracy on 10 of 13 algorithmic tasks from only 200 examples, while all neural baselines remain near chance.

Based on: S. Singhal, P. Mishra, E. Malach, T. Galanti. “LLM Priors for ERM over Programs”, arXiv, 2025.

Part I — Deep learning can be the wrong search procedure

Where our usual intuition breaks down

Most of our intuitions about sample efficiency come from settings where deep learning is well matched to the structure of the problem. Algorithmic tasks are fundamentally different — here the target hypothesis is compact as code, but opaque to gradient descent.

Fig. 1 — Tasks with short program descriptions are compact as code but gradient-opaque. Across 1B-scale models, test accuracy remains near chance even after perfect training-set fit.

The failure mode

Across several 1B-scale language models, training on a few hundred labeled examples yields near-perfect training accuracy on tasks such as parity, Dyck-2, and cellular automata parity. Test accuracy, however, often remains near chance — and this persists even when the amount of training data is substantially increased.

Why SGD struggles

The statistical query perspective gives a clean explanation. For planted $k$-parity over $n$ variables, mini-batch coordinate SGD requires at least:

$$q = \Omega\!\left(\frac{\sqrt{\binom{n}{k}}}{2^b}\right)$$

fresh examples to achieve nontrivial population error. The target rule is simple as a program, but SGD is the wrong search procedure for this hypothesis class. The difficulty is statistical and fundamental, not a matter of model capacity or hyperparameter tuning.

Part II — ERM over short programs is statistically attractive

Think of the target as code

Suppose the target rule can be expressed as a program of length $L$ over an alphabet $\Sigma$. The relevant hypothesis class is the set of valid programs of length at most $L$, with $|\mathcal{H}_L| \le |\Sigma|^L$. This immediately yields a classical PAC-style guarantee:

$$\text{test error} \;\le\; \frac{L \log|\Sigma| + \log(2L^2/\delta)}{m}$$

The sample complexity depends on description length, not on raw input dimension. A compact parity program can therefore generalize from surprisingly few examples — the number of samples needed grows with $k \log n$, not with $n^{k/2}$.

The central tension

	SGD	Length-first ERM
Samples for low error	$\Omega(n^{k/2})$ — exponential	$O(k \log n)$ — compact
Per-sample compute	$O(1)$ — cheap	$\Omega(n^{\Theta(k)})$ — intractable

SGD is computationally cheap but statistically wasteful on these tasks. Program ERM is statistically efficient but computationally intractable under uniform search. The real question is whether one can keep ERM's statistical advantages while replacing uniform search with something more structured.

Part III — LLM-PV makes ERM searchable

From uniform search to proposal priors

The real bottleneck in program ERM is not verification — given a candidate program, checking whether it fits the data is easy. The difficult part is proposing promising candidates. Uniform search over programs of length $L$ assigns probability roughly $|\Sigma|^{-L}$ to any fixed target. A pretrained LLM changes this: conditioned on labeled examples, it places probability mass on short, structured, plausible rules suggested by the data.

“The LLM does not replace ERM — it makes a previously impractical ERM procedure searchable.”

The LLM-PV pipeline

LLM-PV — propose and verify — click each stage for details

Step 1

Prompt

Format examples as input-output pairs

→

Step 2

Sample

Draw $k$ candidate programs from LLM

→

Step 3

Execute

Compile and run in sandbox

→

Step 4

Select

Pick lowest validation error

Click a step above to see details.

The formal guarantee

The guarantee cleanly separates the statistical role of validation from the search role of the LLM. Let $p_\epsilon(S_{\text{tr}})$ denote the probability that a fresh LLM proposal has population error at most $\epsilon$. Then after $k$ trials:

$$\text{err}_D(h^\star) \;\le\; \epsilon \;+\; 2\sqrt{\frac{\log(4k/\delta)}{2m_{\text{val}}}}$$

The LLM enters only through $p_\epsilon$ — the probability that it proposes a good program. It does not need to be trusted as the final predictor. The effective search size becomes roughly $1/p_\epsilon(S_{\text{tr}})$ instead of $|\Sigma|^L$.

Watching candidates evolve: reasoning trace simulator

Reasoning trace simulator — candidates sharpen from heuristics to correct solutions

Task:

Round 0 / 2

Press Play or Next round to begin

Best validation accuracy over rounds

Fig. 2. Simulated reasoning trace for LLM-PV. On IsPrime, proposals evolve from "check if odd" (58%) to trial division with √n bound (100%) in just two rounds. On parity, the correct solution appears in round 1. The LLM does not need to be perfect — it only needs to assign enough mass to the right program for ERM verification to find it.

Part IV — Results

Recovering exact rules from a handful of examples

Across 13 algorithmic tasks at input length $n = 100$, LLM-PV recovers exact or near-exact rules from only 200 labeled examples:

Test accuracy (%) — 200 training examples, input length 100

Fig. 3. LLM-PV achieves perfect accuracy on 10 of 13 tasks. On SHA-256 parity — intentionally pseudorandom with no exploitable short structure — it fails, confirming the method's dependence on compressible hypotheses.

The full comparison with baselines:

Task	XGBoost	SGD (1B)	Fine-tune	ICL (30B)	TabPFN	LLM-PV
Full parity	49.9	49.3	50.4	53.0	50.1	100
First-half parity	50.0	50.6	51.3	54.0	50.5	100
Pattern 1	50.6	51.5	61.5	56.0	49.2	100
Pattern 2	52.3	53.6	57.6	51.0	51.2	100
3-parity	49.9	49.7	51.0	50.0	49.3	100
10-parity	49.3	50.3	50.5	50.0	50.5	100
IsPalindrome	83.6	49.2	49.6	47.0	49.4	100
Dyck-2	49.7	51.4	52.7	48.0	50.5	100
CA parity	50.4	49.4	50.7	49.0	48.0	100
IsPrime	61.8	58.8	57.2	50.0	57.1	100
IsPrime+47	59.3	53.6	58.7	58.0	65.4	80.6
CycleCount	51.1	53.5	65.7	46.0	51.3	96.9
SHA-256	50.3	49.8	50.8	54.0	50.6	50.1

More data does not rescue SGD

Even with up to 100,000 training examples, SGD-trained models can remain near chance on parity-style tasks while fitting the training data perfectly. These tasks are not difficult because they require huge datasets — they are difficult because the search procedure is poorly matched to the hypothesis class.

Length generalization emerges naturally

A particularly striking property of LLM-PV is that it often synthesizes dimension-invariant programs. When trained on short inputs ($n = 10$) and tested on much longer ones ($n = 100$), it can maintain perfect accuracy. This follows naturally from representing the hypothesis as code — a loop or invariant-checking procedure extends cleanly beyond the training input size, whereas a fixed-dimensional parametric map cannot.

Auditable output

The output is executable code and the search trace is logged. The learned hypothesis can be read, tested, modified, and verified directly — a major conceptual difference from opaque neural predictors.

Beyond algorithmic tasks

On standard tabular datasets with only 100 training samples, LLM-PV is competitive with XGBoost, Random Forest, and TabPFN — the propose-and-verify view extends beyond exact program recovery.

What this does and does not say

This argument does not say that LLM-PV replaces deep learning on standard vision, language, or audio tasks. In those domains, the usual pipeline remains extremely effective. The picture is narrower and cleaner.

Classical ERM over programs is statistically attractive but computationally inaccessible under uniform search. A pretrained LLM changes the proposal distribution — making some short, structured hypotheses much easier to find. Verification and final selection still come from the data. If the target does not admit a short exploitable program, or if the LLM assigns negligible mass to useful candidates, the method fails. The SHA-256 result confirms exactly this regime.

✓

The LLM proposes.

Conditioned on examples, it generates candidate programs from a structured distribution over code — vastly more efficient than uniform enumeration.

✓

The data verifies.

Held-out validation selects the best candidate. The LLM is not trusted as the final predictor — it is a search engine. ERM is the selection rule.

Takeaway

Pretrained LLMs can make program-based ERM practical by replacing brute-force search with guided search. The mechanism has four parts.

Short programs generalize. If the target has a length-$L$ description, ERM needs only $O(L \log|\Sigma|)$ samples — independent of raw input dimension.

SGD can still fail badly. On SQ-hard families such as parity, gradient-based learners may require exponentially many samples even when the target is simple as code.

Brute-force ERM is computationally intractable. Uniform search over programs scales like $|\Sigma|^L$ — impractical even for short targets.

LLM-PV changes the search distribution. A pretrained LLM proposes plausible candidates; held-out verification selects the best one. The ERM principle is preserved; only the proposal mechanism changes.

The resulting paradigm recovers exact algorithmic rules from a handful of examples, generalizes beyond the training distribution, and produces executable hypotheses that can be inspected and verified directly.

Revisiting ERM in the LLM Era

Introduction

Part I — Deep learning can be the wrong search procedure

Where our usual intuition breaks down

The failure mode

Why SGD struggles

Part II — ERM over short programs is statistically attractive

Think of the target as code

The central tension

Part III — LLM-PV makes ERM searchable

From uniform search to proposal priors

The LLM-PV pipeline

The formal guarantee

Watching candidates evolve: reasoning trace simulator

Part IV — Results

Recovering exact rules from a handful of examples

More data does not rescue SGD

Length generalization emerges naturally

What this does and does not say

Takeaway

Comments