Learning Theory · Program Synthesis · LLMs

Revisiting ERM in the LLM Era

Given only labeled examples and no task description, can we recover the hidden rule as executable code? LLM-PV uses pretrained LLMs as proposal priors for ERM over programs: the model guesses code, the data verifies it, and short algorithmic rules become searchable.

By Tomer Galanti · March 20, 2026 · 14 min read · ◆ Singhal*, Mishra*, Malach, Galanti — ICML 2026

Introduction

Imagine being handed a tiny table of input-output examples. There is no task description, no grammar, no hint that the labels came from parity, primality, a bracket parser, a graph invariant, or something stranger. The only question is the oldest question in supervised learning: what rule generated these labels, and will it keep working on new inputs?

Deep learning is spectacular when the right representation can be learned smoothly from data. But many algorithmic tasks have a different flavor. The correct rule may be a very short program — a parity loop, a stack check, a primality test — while the input-output map looks almost random to a gradient-based learner. A neural classifier can fit the 200 training examples and still learn the wrong thing: a brittle surface predictor rather than the underlying computation.

The classical alternative is beautifully simple: search over programs and select one that fits the data. That is empirical risk minimization over a discrete hypothesis class. If the true rule has a short description, the sample complexity can be tiny. The catch is computational: length-first enumeration over programs grows like $|\Sigma|^L$, so the cute statistical story becomes an impossible search problem almost immediately.

The central claim A pretrained LLM can be used as a proposal prior for ERM over programs. It does not replace validation, and it is not trusted as the final classifier. It simply changes the search distribution: instead of enumerating all code, sample plausible programs, execute them, and let held-out data choose.

The story has four parts:

I
SGD is the wrong search.
Gradient-trained networks can fit small algorithmic datasets while failing to recover the rule. On SQ-hard families such as parity, this is a structural issue, not a hyperparameter accident.
II
Program ERM is statistically lovely.
If the rule has a length-$L$ implementation, Occam-style bounds scale with description length rather than raw input dimension. Short code is a strong prior.
III
LLM-PV makes the search finite.
The LLM proposes candidate programs; a sandbox executes them; held-out validation performs ERM-style selection. The data is the referee.
IV
Rules become inspectable objects.
Successful outputs are executable hypotheses — loops, tests, parsers, sieves — that can be read, debugged, stress-tested, and reused beyond the training length.
Based on: S. Singhal*, P. Mishra*, E. Malach, T. Galanti. “LLM Priors for ERM over Programs”, ICML, 2026.

Part I — Deep learning can be the wrong search procedure

Where our usual intuition breaks down

Most of our intuitions about sample efficiency come from settings where the optimization path and the target structure are aligned. Algorithmic tasks violate this. The target may be compact in a programming language, yet invisible to the local statistics that gradient descent extracts from a small sample.

This is the uncomfortable regime: the rule is not complicated, but the search interface is wrong. A parity function over ten hidden coordinates is one line once the support is known; before the support is known, its low-order correlations vanish. A primality test is short code; as a classifier on decimal strings, it looks like a patchwork of modular facts. The same object can be simple for a programmer and hostile to SGD.

Parity XOR of k chosen bits sum(x) % 2 — 1 line Primality Is integer prime? trial_div(x) — 3 lines Dyck-2 Balanced brackets? stack_check(x) — 5 lines Pattern match Motif in string? any(match) — 1 line SGD: fits training data perfectly → test accuracy ≈ 50% (near chance)

Fig. 1 — Tasks with short program descriptions are compact as code but gradient-opaque. Across 1B-scale models, test accuracy remains near chance even after perfect training-set fit.

The failure mode

Across several 1B-scale language models, training on a few hundred labeled examples often yields near-perfect training accuracy on tasks such as parity, Dyck-2, and cellular-automata parity. The test accuracy, however, remains near chance. The model has memorized a finite table, not discovered the program that produced it.

The same pattern survives larger data sweeps: more examples can make the training fit cleaner without making the learned predictor more algorithmic. That is the little tragedy of this setting — the data contains a compact rule, but the learner is not searching in the language where the rule is compact.

Why SGD struggles

The statistical query perspective gives a clean explanation for the canonical case. For planted $k$-parity over $n$ variables, finite-precision mini-batch coordinate SGD needs on the order of

$$q = \Omega\!\left(\frac{\sqrt{\binom{n}{k}}}{2^b}\right)$$

fresh examples to achieve nontrivial population error. The factor $2^b$ reflects numerical precision, and the combinatorial term reflects the hidden support. The theorem is not saying parity is impossible; it is saying that the wrong learner pays for the wrong coordinates. As code, the target is tiny. As a gradient signal, it is nearly silent.


Part II — ERM over short programs is statistically attractive

Think of the target as code

Now switch languages. Suppose the target rule can be expressed as a program of length $L$ over an alphabet $\Sigma$. The relevant hypothesis class is the set of valid programs of length at most $L$, with $|\mathcal{H}_L| \le |\Sigma|^L$. In the realizable case, an ERM solution over this finite class obeys an Occam-style PAC guarantee:

$$\text{test error} \;\le\; \frac{L \log|\Sigma| + \log(2L^2/\delta)}{m}$$

The sample complexity depends on description length, not raw input dimension. If the target is a short piece of code, the data only has to identify which short code it is. For planted parity, that means the statistical price scales like the support description, roughly $k\log n$, rather than the SQ-style $n^{k/2}$ barrier paid by the wrong optimization interface.

The central tension

SGDLength-first ERM
Samples for low error $\Omega(n^{k/2})$ — exponential $O(k \log n)$ — compact
Per-sample compute $O(1)$ — cheap $\Omega(n^{\Theta(k)})$ — intractable

This is the central tension. SGD is computationally cheap per update but statistically wasteful on the wrong families. Program ERM is statistically elegant but computationally hopeless under uniform search. The missing ingredient is neither a new loss nor a bigger classifier. It is a better way to propose candidate programs.


Part III — LLM-PV makes ERM searchable

From uniform search to proposal priors

The real bottleneck in program ERM is not verification. Given a candidate program, we can compile it, run it, and score it on examples. The hard part is proposal: where should the search look first?

Uniform search treats every token string of length $L$ as equally plausible, so the target receives probability roughly $|\Sigma|^{-L}$. A pretrained code model is not uniform. Conditioned on examples, it has learned to place mass on programs that look like human-written explanations: loops for length-invariant rules, stacks for brackets, modular arithmetic for primality, Gaussian elimination for parity. LLM-PV asks whether that prior can be harvested without trusting the LLM's answer blindly.

“The LLM does not replace ERM — it makes a previously impractical ERM procedure searchable.”

The LLM-PV pipeline

LLM-PV — propose and verify — click each stage for details
Step 1
Prompt
Format examples as input-output pairs
Step 2
Sample
Draw $k$ candidate programs from LLM
Step 3
Execute
Compile and run in sandbox
Step 4
Select
Pick lowest validation error
Click a step above to see details.

The formal guarantee

The guarantee cleanly separates the statistical role of validation from the search role of the LLM. Let $p_\epsilon(S_{\text{tr}})$ denote the probability that a fresh LLM proposal has population error at most $\epsilon$. Then after $k$ trials:

$$\text{err}_D(h^\star) \;\le\; \epsilon \;+\; 2\sqrt{\frac{\log(4k/\delta)}{2m_{\text{val}}}}$$

The LLM enters only through $p_\epsilon$ — the probability that it proposes a good program. It does not need to be trusted as the final predictor. The effective search size becomes roughly $1/p_\epsilon(S_{\text{tr}})$ instead of $|\Sigma|^L$.

Watching candidates evolve: reasoning trace simulator

Reasoning trace simulator — candidates sharpen from heuristics to correct solutions
Task:
Round 0 / 2
Press Play or Next round to begin
Best validation accuracy over rounds

Fig. 2. Simulated reasoning trace for LLM-PV. On IsPrime, proposals evolve from "check if odd" (58%) to trial division with √n bound (100%) in just two rounds. On parity, the correct solution appears in round 1. The LLM does not need to be perfect — it only needs to assign enough mass to the right program for ERM verification to find it.


Part IV — Results

What was compared?

The experiment is intentionally small-data: $m=200$ labeled examples for the algorithmic tasks, split into training and validation, then evaluated on a large independent test set. The baselines are not strawmen; they cover the usual ways one might attack the problem without program ERM.

FamilyWhat it triesWhy it matters
Classic MLXGBoost, Random Forest, SVM, genetic programsStrong feature-based learners and symbolic-search baselines with validation tuning.
Tabular foundation modelTabPFNA pretrained model specialized for small tabular datasets.
SGD from scratch1B-scale LMs trained as classifiersTests whether ordinary gradient training can recover the rule from examples.
Fine-tuningQwen, Llama, DeepSeek-Coder variantsTests whether pretrained representations plus supervised adaptation solve the task.
Direct LLM / agentsICL with large models, tool-use ICL, MLAgentBench, AIDE-MLTests whether prompting or generic agentic ML loops suffice without ERM-style program selection.
LLM-PVLLM proposes executable programs; validation selectsThe proposed route: use the LLM as a search prior, not as the final predictor.

Recovering exact rules from a handful of examples

Across 13 algorithmic tasks at input length $n=100$, LLM-PV often recovers the actual rule, not merely a high-accuracy surrogate. The cute thing is that the winning hypothesis usually looks like what you would hope to find: an XOR over the right coordinates, a stack parser, a palindrome check, a primality test, a graph component counter.

Test accuracy (%) — 200 labeled examples, input length 100

Fig. 3. Teal bars show LLM-PV. Gray bars show the strongest non-PV baseline reported for that task, including classic ML, TabPFN, SGD/fine-tuning, direct ICL, and generic agent baselines. LLM-PV reaches 100% on 10 of 13 tasks. SHA-256 parity stays near chance, which is exactly the failure mode one should expect: the target is pseudorandom rather than compressible by a simple program the proposer can find.

A compact task-by-task view:

Task Strongest non-PV Who got it? LLM-PV
Full parity99.0direct GPT-5 + tools100
First-half parity88.0direct GPT-5 + tools100
Pattern 183.0direct Gemini + tools100
Pattern 260.9MLAgentBench100
3-parity75.0direct GPT-5 + tools100
10-parity79.0direct GPT-5 + tools100
IsPalindrome100.0direct Gemini + tools100
Dyck-259.0direct GPT-5 + tools100
CA parity51.0direct Gemini + tools100
IsPrime62.0direct GPT-5 + tools100
IsPrime+4765.4TabPFN80.6
CycleCount65.7fine-tuned Llama-3.2-1B96.9
SHA-25654.0no-tool ICL50.1

There are two lessons hidden in this table. First, direct tool use is not the same as LLM-PV. A tool-using LLM can sometimes reason its way to a useful answer, but it is not systematically maintaining a pool of executable hypotheses and selecting by validation error. Second, the success cases are not just high accuracies — they are program recoveries. The output is a rule one can read.

More data does not rescue SGD

Increasing the number of examples does not automatically fix the gradient route. Even with up to 100,000 training examples, SGD-trained models can fit the training set while remaining near chance on random 10-parity and cellular-automata parity. That is the empirical face of the theory: the target is not statistically absent; it is hidden from the learner's search geometry.

Length generalization emerges naturally

A particularly striking property of LLM-PV is that it often synthesizes dimension-invariant programs. Train on short inputs and test on much longer ones, and a loop still loops. A stack still parses. A primality test still tests. This is the quiet superpower of code as a hypothesis class: a good program does not merely interpolate the observed length; it encodes the invariant.

Auditable output
The output is executable code and the search trace is logged. The learned hypothesis can be read, tested, modified, and verified directly. The model does not hand you a mysterious vector; it hands you a little machine.

Beyond toy algorithms: tabular datasets

The paper also checks whether propose-and-verify is only a synthetic-algorithm trick. On standard tabular datasets, each with only 100 fixed training samples and 100 validation samples, LLM-PV is competitive with strong tabular learners. The inputs are randomly transformed so the LLM cannot simply recognize famous datasets; it must propose useful feature transformations and decision rules from the examples.

Model Adult Income Mushroom CDC Diabetes HRTU2 Chess
SVM52.868.169.191.589.4
Genetic Algorithm75.264.769.090.692.5
Decision Tree67.869.062.589.194.1
Random Forest74.270.872.791.488.1
XGBoost71.466.168.691.493.1
TabPFN77.769.972.193.193.1
LLM-PV75.872.271.192.392.3

Table 2. Tabular benchmark accuracy (%) with 100 train and 100 validation samples. LLM-PV is not uniformly best — nor should it be. The point is that propose-and-verify remains competitive with specialized tabular learners, suggesting that the method is not limited to exact symbolic recovery.

A useful distinction
On algorithmic tasks, the win often comes from finding the exact hidden program. On tabular tasks, the win is softer: LLM-PV proposes small predictive pipelines and lets validation choose. Same philosophy, different kind of hypothesis.

What this does and does not say

This argument does not say that LLM-PV replaces deep learning on vision, language, or audio. In those domains, learned representations and gradient training are often exactly the right hammer. The claim is narrower and cleaner: when the target is better described as a short executable rule than as a smooth function to be fit, the hypothesis space should reflect that.

Nor does the method magically solve arbitrary program synthesis. Its success depends on proposal mass. If the target has no short exploitable structure, or if the LLM assigns negligible probability to useful candidates, validation has nothing good to select. SHA-256 parity is the sanity check: pseudorandom labels remain pseudorandom, and LLM-PV stays near chance.

The conceptual shift is about the division of labor. The LLM supplies a biased search distribution over code; the sandbox supplies execution; the validation set supplies selection. The cute slogan is: the LLM dreams, the data decides.

The LLM proposes.
Conditioned on examples, it generates candidate programs from a structured distribution over code — vastly more efficient than uniform enumeration.
The data verifies.
Held-out validation selects the best candidate. The LLM is not trusted as the final predictor — it is a search engine. ERM is the selection rule.

Takeaway

LLM-PV revives an old learning-theory dream: learn simple programs from few examples, but avoid brute-force enumeration.

Short programs generalize. If the target has a length-$L$ description, ERM needs roughly $O(L\log|\Sigma|)$ samples. The statistical story is beautiful.

Uniform search kills the beauty. Enumerating programs costs $|\Sigma|^L$, so classical ERM is usually computationally impossible.

SGD is not a universal search engine. On SQ-hard families such as parity, gradient-trained predictors can fit the data and still miss the rule.

The LLM supplies a prior, not a verdict. It proposes executable hypotheses; the sandbox runs them; validation selects. The ERM principle is preserved, but the candidate distribution is no longer uniform.

The result is a tiny but powerful change in perspective: use pretrained models not only to predict labels, but to search over verifiable programs. When the hidden rule is really code, this turns a few examples into an inspectable little machine.


Comments