Revisiting ERM in the LLM Era
Given only labeled examples and no task description, can we recover the hidden rule as executable code? LLM-PV uses pretrained LLMs as proposal priors for ERM over programs: the model guesses code, the data verifies it, and short algorithmic rules become searchable.
Introduction
Imagine being handed a tiny table of input-output examples. There is no task description, no grammar, no hint that the labels came from parity, primality, a bracket parser, a graph invariant, or something stranger. The only question is the oldest question in supervised learning: what rule generated these labels, and will it keep working on new inputs?
Deep learning is spectacular when the right representation can be learned smoothly from data. But many algorithmic tasks have a different flavor. The correct rule may be a very short program — a parity loop, a stack check, a primality test — while the input-output map looks almost random to a gradient-based learner. A neural classifier can fit the 200 training examples and still learn the wrong thing: a brittle surface predictor rather than the underlying computation.
The classical alternative is beautifully simple: search over programs and select one that fits the data. That is empirical risk minimization over a discrete hypothesis class. If the true rule has a short description, the sample complexity can be tiny. The catch is computational: length-first enumeration over programs grows like $|\Sigma|^L$, so the cute statistical story becomes an impossible search problem almost immediately.
The story has four parts:
Part I — Deep learning can be the wrong search procedure
Where our usual intuition breaks down
Most of our intuitions about sample efficiency come from settings where the optimization path and the target structure are aligned. Algorithmic tasks violate this. The target may be compact in a programming language, yet invisible to the local statistics that gradient descent extracts from a small sample.
This is the uncomfortable regime: the rule is not complicated, but the search interface is wrong. A parity function over ten hidden coordinates is one line once the support is known; before the support is known, its low-order correlations vanish. A primality test is short code; as a classifier on decimal strings, it looks like a patchwork of modular facts. The same object can be simple for a programmer and hostile to SGD.
Fig. 1 — Tasks with short program descriptions are compact as code but gradient-opaque. Across 1B-scale models, test accuracy remains near chance even after perfect training-set fit.
The failure mode
Across several 1B-scale language models, training on a few hundred labeled examples often yields near-perfect training accuracy on tasks such as parity, Dyck-2, and cellular-automata parity. The test accuracy, however, remains near chance. The model has memorized a finite table, not discovered the program that produced it.
The same pattern survives larger data sweeps: more examples can make the training fit cleaner without making the learned predictor more algorithmic. That is the little tragedy of this setting — the data contains a compact rule, but the learner is not searching in the language where the rule is compact.
Why SGD struggles
The statistical query perspective gives a clean explanation for the canonical case. For planted $k$-parity over $n$ variables, finite-precision mini-batch coordinate SGD needs on the order of
fresh examples to achieve nontrivial population error. The factor $2^b$ reflects numerical precision, and the combinatorial term reflects the hidden support. The theorem is not saying parity is impossible; it is saying that the wrong learner pays for the wrong coordinates. As code, the target is tiny. As a gradient signal, it is nearly silent.
Part II — ERM over short programs is statistically attractive
Think of the target as code
Now switch languages. Suppose the target rule can be expressed as a program of length $L$ over an alphabet $\Sigma$. The relevant hypothesis class is the set of valid programs of length at most $L$, with $|\mathcal{H}_L| \le |\Sigma|^L$. In the realizable case, an ERM solution over this finite class obeys an Occam-style PAC guarantee:
The sample complexity depends on description length, not raw input dimension. If the target is a short piece of code, the data only has to identify which short code it is. For planted parity, that means the statistical price scales like the support description, roughly $k\log n$, rather than the SQ-style $n^{k/2}$ barrier paid by the wrong optimization interface.
The central tension
| SGD | Length-first ERM | |
|---|---|---|
| Samples for low error | $\Omega(n^{k/2})$ — exponential | $O(k \log n)$ — compact |
| Per-sample compute | $O(1)$ — cheap | $\Omega(n^{\Theta(k)})$ — intractable |
This is the central tension. SGD is computationally cheap per update but statistically wasteful on the wrong families. Program ERM is statistically elegant but computationally hopeless under uniform search. The missing ingredient is neither a new loss nor a bigger classifier. It is a better way to propose candidate programs.
Part III — LLM-PV makes ERM searchable
From uniform search to proposal priors
The real bottleneck in program ERM is not verification. Given a candidate program, we can compile it, run it, and score it on examples. The hard part is proposal: where should the search look first?
Uniform search treats every token string of length $L$ as equally plausible, so the target receives probability roughly $|\Sigma|^{-L}$. A pretrained code model is not uniform. Conditioned on examples, it has learned to place mass on programs that look like human-written explanations: loops for length-invariant rules, stacks for brackets, modular arithmetic for primality, Gaussian elimination for parity. LLM-PV asks whether that prior can be harvested without trusting the LLM's answer blindly.
The LLM-PV pipeline
The formal guarantee
The guarantee cleanly separates the statistical role of validation from the search role of the LLM. Let $p_\epsilon(S_{\text{tr}})$ denote the probability that a fresh LLM proposal has population error at most $\epsilon$. Then after $k$ trials:
The LLM enters only through $p_\epsilon$ — the probability that it proposes a good program. It does not need to be trusted as the final predictor. The effective search size becomes roughly $1/p_\epsilon(S_{\text{tr}})$ instead of $|\Sigma|^L$.
Watching candidates evolve: reasoning trace simulator
Fig. 2. Simulated reasoning trace for LLM-PV. On IsPrime, proposals evolve from "check if odd" (58%) to trial division with √n bound (100%) in just two rounds. On parity, the correct solution appears in round 1. The LLM does not need to be perfect — it only needs to assign enough mass to the right program for ERM verification to find it.
Part IV — Results
What was compared?
The experiment is intentionally small-data: $m=200$ labeled examples for the algorithmic tasks, split into training and validation, then evaluated on a large independent test set. The baselines are not strawmen; they cover the usual ways one might attack the problem without program ERM.
| Family | What it tries | Why it matters |
|---|---|---|
| Classic ML | XGBoost, Random Forest, SVM, genetic programs | Strong feature-based learners and symbolic-search baselines with validation tuning. |
| Tabular foundation model | TabPFN | A pretrained model specialized for small tabular datasets. |
| SGD from scratch | 1B-scale LMs trained as classifiers | Tests whether ordinary gradient training can recover the rule from examples. |
| Fine-tuning | Qwen, Llama, DeepSeek-Coder variants | Tests whether pretrained representations plus supervised adaptation solve the task. |
| Direct LLM / agents | ICL with large models, tool-use ICL, MLAgentBench, AIDE-ML | Tests whether prompting or generic agentic ML loops suffice without ERM-style program selection. |
| LLM-PV | LLM proposes executable programs; validation selects | The proposed route: use the LLM as a search prior, not as the final predictor. |
Recovering exact rules from a handful of examples
Across 13 algorithmic tasks at input length $n=100$, LLM-PV often recovers the actual rule, not merely a high-accuracy surrogate. The cute thing is that the winning hypothesis usually looks like what you would hope to find: an XOR over the right coordinates, a stack parser, a palindrome check, a primality test, a graph component counter.
Fig. 3. Teal bars show LLM-PV. Gray bars show the strongest non-PV baseline reported for that task, including classic ML, TabPFN, SGD/fine-tuning, direct ICL, and generic agent baselines. LLM-PV reaches 100% on 10 of 13 tasks. SHA-256 parity stays near chance, which is exactly the failure mode one should expect: the target is pseudorandom rather than compressible by a simple program the proposer can find.
A compact task-by-task view:
| Task | Strongest non-PV | Who got it? | LLM-PV |
|---|---|---|---|
| Full parity | 99.0 | direct GPT-5 + tools | 100 |
| First-half parity | 88.0 | direct GPT-5 + tools | 100 |
| Pattern 1 | 83.0 | direct Gemini + tools | 100 |
| Pattern 2 | 60.9 | MLAgentBench | 100 |
| 3-parity | 75.0 | direct GPT-5 + tools | 100 |
| 10-parity | 79.0 | direct GPT-5 + tools | 100 |
| IsPalindrome | 100.0 | direct Gemini + tools | 100 |
| Dyck-2 | 59.0 | direct GPT-5 + tools | 100 |
| CA parity | 51.0 | direct Gemini + tools | 100 |
| IsPrime | 62.0 | direct GPT-5 + tools | 100 |
| IsPrime+47 | 65.4 | TabPFN | 80.6 |
| CycleCount | 65.7 | fine-tuned Llama-3.2-1B | 96.9 |
| SHA-256 | 54.0 | no-tool ICL | 50.1 |
There are two lessons hidden in this table. First, direct tool use is not the same as LLM-PV. A tool-using LLM can sometimes reason its way to a useful answer, but it is not systematically maintaining a pool of executable hypotheses and selecting by validation error. Second, the success cases are not just high accuracies — they are program recoveries. The output is a rule one can read.
More data does not rescue SGD
Increasing the number of examples does not automatically fix the gradient route. Even with up to 100,000 training examples, SGD-trained models can fit the training set while remaining near chance on random 10-parity and cellular-automata parity. That is the empirical face of the theory: the target is not statistically absent; it is hidden from the learner's search geometry.
Length generalization emerges naturally
A particularly striking property of LLM-PV is that it often synthesizes dimension-invariant programs. Train on short inputs and test on much longer ones, and a loop still loops. A stack still parses. A primality test still tests. This is the quiet superpower of code as a hypothesis class: a good program does not merely interpolate the observed length; it encodes the invariant.
Beyond toy algorithms: tabular datasets
The paper also checks whether propose-and-verify is only a synthetic-algorithm trick. On standard tabular datasets, each with only 100 fixed training samples and 100 validation samples, LLM-PV is competitive with strong tabular learners. The inputs are randomly transformed so the LLM cannot simply recognize famous datasets; it must propose useful feature transformations and decision rules from the examples.
| Model | Adult Income | Mushroom | CDC Diabetes | HRTU2 | Chess |
|---|---|---|---|---|---|
| SVM | 52.8 | 68.1 | 69.1 | 91.5 | 89.4 |
| Genetic Algorithm | 75.2 | 64.7 | 69.0 | 90.6 | 92.5 |
| Decision Tree | 67.8 | 69.0 | 62.5 | 89.1 | 94.1 |
| Random Forest | 74.2 | 70.8 | 72.7 | 91.4 | 88.1 |
| XGBoost | 71.4 | 66.1 | 68.6 | 91.4 | 93.1 |
| TabPFN | 77.7 | 69.9 | 72.1 | 93.1 | 93.1 |
| LLM-PV | 75.8 | 72.2 | 71.1 | 92.3 | 92.3 |
Table 2. Tabular benchmark accuracy (%) with 100 train and 100 validation samples. LLM-PV is not uniformly best — nor should it be. The point is that propose-and-verify remains competitive with specialized tabular learners, suggesting that the method is not limited to exact symbolic recovery.
What this does and does not say
This argument does not say that LLM-PV replaces deep learning on vision, language, or audio. In those domains, learned representations and gradient training are often exactly the right hammer. The claim is narrower and cleaner: when the target is better described as a short executable rule than as a smooth function to be fit, the hypothesis space should reflect that.
Nor does the method magically solve arbitrary program synthesis. Its success depends on proposal mass. If the target has no short exploitable structure, or if the LLM assigns negligible probability to useful candidates, validation has nothing good to select. SHA-256 parity is the sanity check: pseudorandom labels remain pseudorandom, and LLM-PV stays near chance.
The conceptual shift is about the division of labor. The LLM supplies a biased search distribution over code; the sandbox supplies execution; the validation set supplies selection. The cute slogan is: the LLM dreams, the data decides.
Takeaway
LLM-PV revives an old learning-theory dream: learn simple programs from few examples, but avoid brute-force enumeration.
Short programs generalize. If the target has a length-$L$ description, ERM needs roughly $O(L\log|\Sigma|)$ samples. The statistical story is beautiful.
Uniform search kills the beauty. Enumerating programs costs $|\Sigma|^L$, so classical ERM is usually computationally impossible.
SGD is not a universal search engine. On SQ-hard families such as parity, gradient-trained predictors can fit the data and still miss the rule.
The LLM supplies a prior, not a verdict. It proposes executable hypotheses; the sandbox runs them; validation selects. The ERM principle is preserved, but the candidate distribution is no longer uniform.
The result is a tiny but powerful change in perspective: use pretrained models not only to predict labels, but to search over verifiable programs. When the hidden rule is really code, this turns a few examples into an inspectable little machine.
Comments