This post is based on: S. Singhal, P. Mishra, E. Malach, T. Galanti. “LLM Priors for ERM over Programs”, arXiv, 2025.
Deep learning is extraordinarily effective across vision, language, and audio. In some sense, it feels like we have the ultimate learning algorithm. But step outside these standard domains, and there are plenty of tasks where this story breaks down: learning the primality function, learning a Boolean circuit or a parity function, deciding whether a graph is connected. One can cast these as supervised learning problems and throw a transformer at them. In practice, training will fit the data perfectly, yet fail to generalize at all.
But is there a way around it? The answer is yes — with a caveat. These tasks admit short programs: a few lines of Python. So instead of training a model to mimic the function, we can search for the exact program that fits the data. This is empirical risk minimization (ERM) over a discrete program class, and classical learning theory tells us this approach generalizes from remarkably little data. The catch is the search.
The main message of this post is:
Pretrained LLMs can serve as search priors for ERM over discrete program classes. This recovers the sample efficiency of classical learning theory while making the search tractable — learning exact algorithmic rules from as few as 200 labeled examples.
The argument develops in four parts:
- Deep learning fails on program-like tasks — SGD-trained networks overfit without recovering the underlying rule.
- ERM over short programs is sample-efficient — classical PAC bounds guarantee generalization from $O(L \log |\Sigma|)$ examples if the target has description length $L$.
- But brute-force ERM is computationally infeasible — enumerating $|\Sigma|^L$ candidate programs is exponential.
- LLM-PV bridges the gap — a pretrained LLM replaces uniform enumeration with a data-guided proposal distribution, making ERM-style selection practical.
Part I: Deep learning fails on algorithmic tasks
Where our sample-efficiency intuition comes from
Most of our intuition about how many examples a learner needs comes from standard benchmarks. In vision, language, and audio, the deep-learning stack — a suitable architecture plus SGD plus a standard loss — is well-matched to the task. Networks generalize from modest amounts of data, and the story is clean: drive the training error to zero, the generalization gap stays small.
But consider a different class of problems:
- Parity: decide whether $\langle s, x \rangle \bmod 2 = 1$ for an unknown sparse vector $s$.
- Primality: decide whether a 100-digit number is prime.
- Pattern matching: does a fixed motif appear anywhere in a binary string?
- Balanced parentheses: does a string of brackets form a valid Dyck-2 word?
These are not exotic edge cases. They are natural computational problems with short, clean implementations. Yet when we train a transformer on labeled examples of these tasks, something goes systematically wrong.
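To make "short, clean implementations" concrete, here are few-line Python sketches of four of the targets above (function names and implementation details are our own, not the paper's):

```python
def parity(x, s):
    """Parity of the bits of x selected by the (unknown) support s."""
    return sum(x[i] for i in s) % 2

def has_motif(x, motif):
    """Does the fixed motif appear anywhere in the binary string x?"""
    return motif in x

def is_dyck2(s):
    """Is s a balanced string over two bracket types (a Dyck-2 word)?"""
    stack, pairs = [], {')': '(', ']': '['}
    for c in s:
        if c in '([':
            stack.append(c)
        elif not stack or stack.pop() != pairs[c]:
            return False
    return not stack

def is_prime(n):
    """Trial-division primality check. Fine for illustration; for
    100-digit inputs a Miller-Rabin test would be used instead."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True
```

Each of these is well under a hundred characters of logic, which is exactly what makes the hypothesis class so small.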
The failure mode
We trained 1B-scale language models (Qwen3-1.7B, Llama-3.2-1B, Deepseek-Coder-1.3B) on 200 labeled examples of each task. Every model achieved near-perfect training accuracy. But test accuracy? Near chance — around 50% — across parity variants, Dyck-2, and cellular automata parity, even as we scaled training data to 100,000 examples.
This is not an architecture issue — we tested multiple backbones. It is not a hyperparameter issue — we swept learning rates across seven orders of magnitude and batch sizes from 10 to 200. It is not a pre-training issue — fine-tuning pre-trained models and in-context learning with 30B+ instruction-tuned models also failed.
Why SGD struggles: the SQ perspective
The statistical query (SQ) framework explains this failure cleanly. Many gradient-based methods can be modeled as algorithms that access the data distribution only through bounded statistics — essentially, noisy estimates of expectations. For families like $k$-parity, the SQ dimension is exponentially large: any learner operating through such a bottleneck requires exponentially many queries (and hence samples) to achieve nontrivial error.
Concretely, for planted $k$-parity over $n$ variables, finite-precision mini-batch coordinate SGD with batch size $B$ and $b$-bit quantization requires at least
\[q = \Omega\left(\frac{\sqrt{\binom{n}{k}}}{2^b}\right)\]
fresh examples to reach population error below $1/4$. For $n = 100$ and $k = 10$, this is astronomically large. The target function is short — a few lines of code — but SGD cannot find it efficiently.
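To get a sense of the binomial term in this bound, a quick calculation (the sweep over $k$ is our illustration; the $2^b$ denominator is omitted here):

```python
from math import comb, isqrt

# Scale of sqrt(C(n, k)) in the lower bound, before dividing by 2^b.
n = 100
for k in (4, 6, 8, 10):
    print(f"k={k:2d}  sqrt(C({n},{k})) ~ {isqrt(comb(n, k)):,}")
```

The term grows like $n^{k/2}$, so every increase in the parity size $k$ multiplies the sample requirement by roughly another factor of $n^{1/2}$ per coordinate.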
Part II: ERM over short programs is sample-efficient
Think of the target as code
The key insight is to step back from neural networks and think about the hypothesis class differently. If the target function can be written as a Python program of length $L$ over a token alphabet $\Sigma$, then the hypothesis class is finite:
\[|\mathcal{H}_L| \le |\Sigma|^L.\]
This is the set of all strings of length at most $L$ that compile, halt, and produce valid outputs. Classical PAC learning theory gives a clean guarantee: if the shortest consistent program has length $L$, then with probability at least $1 - \delta$ over the training sample,
\[\textnormal{test error} \le \frac{L \log |\Sigma| + \log(2L^2/\delta)}{m}.\]
The sample complexity scales with the program length, not the input dimension. For a 10-parity function over 100 variables, a compact Python implementation is roughly 30–50 characters. With $|\Sigma| = 128$ (ASCII) and $m = 100{,}000$ training samples, the PAC bound guarantees test accuracy above 99%. SGD on the same data stays at chance.
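Plugging in the numbers from the text makes the bound tangible. The sketch below assumes a natural-log reading of the finite-class bound, with $L = 50$, $|\Sigma| = 128$, $m = 100{,}000$, and a confidence level $\delta = 0.01$ of our choosing:

```python
from math import log

def pac_bound(L, sigma, m, delta=0.01):
    """Finite-class bound from the text: (L log|Sigma| + log(2L^2/delta)) / m."""
    return (L * log(sigma) + log(2 * L**2 / delta)) / m

# 10-parity over 100 variables: a ~50-character program over ASCII.
err = pac_bound(L=50, sigma=128, m=100_000)
print(f"test error bound: {err:.4f}")  # ~0.0026, i.e. accuracy above 99%
```

Note that the input dimension $n = 100$ never appears: only the description length of the program matters.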
Length-first enumeration
The direct implementation of ERM over programs is length-first enumeration: try all programs of length 1, then length 2, and so on, keeping the first one that fits the training data. Simple, elegant, and provably sample-efficient.
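A minimal sketch of length-first enumeration, with a toy predicate standing in for real compile-and-run verification:

```python
from itertools import product

def length_first_erm(alphabet, max_len, fits):
    """Enumerate candidate programs shortest-first and return the first
    one that fits the training data. In a real system `fits` would
    compile and execute the string in a sandbox; here it is a stub."""
    for L in range(1, max_len + 1):
        for chars in product(alphabet, repeat=L):
            candidate = "".join(chars)
            if fits(candidate):
                return candidate
    return None

# Toy run over a 2-letter alphabet; with |alphabet| = 128 and L = 20
# the inner loop would never terminate in practice.
print(length_first_erm("ab", 3, lambda p: p == "ba"))
```

Because lengths are tried in increasing order, the procedure returns a shortest consistent program, which is exactly what the PAC bound rewards.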
But the search space is enormous. With $|\Sigma| = 128$ and $L = 20$, there are $128^{20} \approx 10^{42}$ candidate strings. Even with fast verification, this is completely infeasible.
The tension
This gives us a clean tension:
|  | Online SGD | Length-First ERM |
|---|---|---|
| Samples for low error | $\Omega(n^{k/2})$ | $O(k \log n)$ |
| Per-sample compute | $O(1)$ | $\Omega(n^{\Theta(k)})$ |
SGD is cheap per step but needs exponentially many samples. ERM needs few samples but exponential search. Both become infeasible as the problem grows. We need a way to get ERM’s sample efficiency without paying for its search cost.
Part III: LLM-PV bridges the gap
From uniform search to data-guided sampling
The bottleneck in ERM is not verification — it is the proposal distribution. Uniform sampling over programs of length $L$ hits any fixed target with probability $|\Sigma|^{-L}$, so the expected number of trials matches brute-force enumeration.
But what if the proposal distribution were smarter? What if, conditioned on the training data, it concentrated mass on plausible programs?
A pretrained LLM provides exactly this. Prompted with labeled examples, it generates code concentrated on recognizable algorithmic templates and concise rules suggested by the data, rather than spreading mass uniformly over the vast program space.
The algorithm
LLM-PV (propose-and-verify) is simple:
- Build a prompt from the training set $S_{\text{tr}}$.
- Sample $k$ candidate programs from the LLM.
- Compile and execute each candidate in a sandbox.
- Select the program with the lowest validation error on $S_{\text{val}}$.
No gradient updates. No adaptive sampling. The LLM is used purely as a proposal prior, and selection is by held-out error — exactly the ERM objective.
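A minimal sketch of the loop, with a stub in place of the prompted LLM (the candidate pool and helper names here are ours):

```python
def llm_pv(propose, S_tr, S_val, k=5):
    """Propose-and-verify: draw k candidate programs from a proposal
    distribution and keep the one with lowest held-out error.
    `propose(S_tr)` stands in for a prompted LLM call."""
    best, best_err = None, float("inf")
    for _ in range(k):
        src = propose(S_tr)
        try:
            ns = {}
            exec(src, ns)  # real systems sandbox this step; omitted here
            h = ns["h"]
            err = sum(h(x) != y for x, y in S_val) / len(S_val)
        except Exception:
            continue       # candidates that fail to compile or run are discarded
        if err < best_err:
            best, best_err = src, err
    return best, best_err

# Toy "prior": a fixed pool of two candidates instead of a real LLM.
pool = iter(["def h(x): return 0", "def h(x): return sum(x) % 2"])
S = [([1, 0, 1], 0), ([1, 1, 1], 1)]
best, err = llm_pv(lambda _: next(pool), S, S, k=2)
print(best, err)  # the parity candidate wins with validation error 0.0
```

The LLM appears only inside `propose`; selection itself is ordinary held-out error minimization over the sampled candidates.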
The guarantee
The formal guarantee cleanly separates the two roles. Let $p_\epsilon(S_{\text{tr}})$ be the probability that a fresh proposal has population error at most $\epsilon$. Then after $k$ trials, with probability at least $1 - \delta$:
\[\text{err}_D(h^\star) \le \epsilon + 2\sqrt{\frac{\log(4k/\delta)}{2m_{\text{val}}}}.\]
The LLM enters only through $p_\epsilon$ — the probability it places on good programs. The validation penalty is the standard finite-class bound. The LLM does not need to be trusted as a predictor; it only needs to put enough mass on correct programs so that one appears within $k$ trials.
The effective search size is $1/p_\epsilon(S_{\text{tr}})$ instead of $|\Sigma|^L$. Whenever the LLM’s proposal quality satisfies $p_\epsilon \gg |\Sigma|^{-L}$, the search becomes dramatically cheaper than enumeration.
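The validation penalty is easy to evaluate numerically; the split sizes below are illustrative choices, not the paper's:

```python
from math import log, sqrt

def validation_penalty(k, m_val, delta=0.05):
    """Second term of the guarantee: 2 * sqrt(log(4k/delta) / (2 * m_val))."""
    return 2 * sqrt(log(4 * k / delta) / (2 * m_val))

# With k = 5 trials, the penalty shrinks as the validation set grows:
for m_val in (100, 1_000, 10_000):
    print(m_val, round(validation_penalty(k=5, m_val=m_val), 3))
```

Because the penalty grows only logarithmically in $k$, sampling more candidates is nearly free statistically; the cost is the compute spent verifying them.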
Part IV: Results
LLM-PV recovers exact rules from 200 examples
We evaluated LLM-PV (using GPT-5-Thinking with $k = 5$ trials) across 13 algorithmic tasks at input length $n = 100$, comparing it against classical ML baselines (XGBoost, Random Forest, SVM), SGD-trained 1B-scale transformers, fine-tuning, in-context learning with 30B+ models, and even TabPFN — a strong pre-trained tabular foundation model.
The results are stark: LLM-PV achieves perfect or near-perfect test accuracy on all tasks except SHA-256 parity — a cryptographic hash deliberately designed to be pseudorandom.
The full comparison table (all methods, all 13 tasks):
| Task | XGBoost | SGD (1B) | Fine-tune | ICL (30B) | TabPFN | LLM-PV |
|---|---|---|---|---|---|---|
| Full Parity | 49.9 | 49.3 | 50.4 | 53.0 | 50.1 | 100 |
| First-Half Parity | 50.0 | 50.6 | 51.3 | 54.0 | 50.5 | 100 |
| Pattern1 | 50.6 | 51.5 | 61.5 | 56.0 | 49.2 | 100 |
| Pattern2 | 52.3 | 53.6 | 57.6 | 51.0 | 51.2 | 100 |
| 3-Parity | 49.9 | 49.7 | 51.0 | 50.0 | 49.3 | 100 |
| 10-Parity | 49.3 | 50.3 | 50.5 | 50.0 | 50.5 | 100 |
| IsPalindrome | 83.6 | 49.2 | 49.6 | 47.0 | 49.4 | 100 |
| Dyck-2 | 49.7 | 51.4 | 52.7 | 48.0 | 50.5 | 100 |
| CA Parity | 50.4 | 49.4 | 50.7 | 49.0 | 48.0 | 100 |
| IsPrime | 61.8 | 58.8 | 57.2 | 50.0 | 57.1 | 100 |
| IsPrime+47 | 59.3 | 53.6 | 58.7 | 58.0 | 65.4 | 80.6 |
| CycleCount | 51.1 | 53.5 | 65.7 | 46.0 | 51.3 | 96.9 |
| SHA-256 | 50.3 | 49.8 | 50.8 | 54.0 | 50.6 | 50.1 |
Notice that even TabPFN — which excels on standard tabular data — cannot break past chance on parity, Dyck-2, or cellular automata parity. These tasks require algorithmic reasoning, not feature-based pattern matching.
The IsPrime+47 result is particularly telling. This task applies a fixed offset ($x + 47$) before checking primality, breaking superficial digit-level correlations. That LLM-PV still reaches 80.6% suggests it is discovering transferable computational structure, not memorizing patterns.
More data does not fix SGD
Is the failure a small-data artifact? We trained BLOOM-75M from scratch on 10 to 100,000 examples per task. Perfect training accuracy throughout. Test accuracy? Flat at chance on 10-Parity and CA Parity, even at 100k samples. The PAC bound guarantees >99% accuracy at this scale — if you do ERM over short programs. SGD doesn’t.
This is the core paradox. The PAC bound says these tasks are easy — a few hundred examples should suffice for any learner that performs ERM over short programs. SGD’s failure is not about the task’s intrinsic difficulty. It is about the mismatch between the learning algorithm and the hypothesis structure.
Length generalization
Perhaps the most striking property of LLM-PV is that it often synthesizes dimension-invariant programs. When trained on 200 examples at input length $n = 10$ and tested at lengths up to $n = 100$, LLM-PV maintains perfect accuracy on pattern matching and 3-parity. SGD-trained models collapse to chance as $n$ grows.
This happens because LLM-PV outputs actual code — a for loop that iterates over the input, a recursive function that checks an invariant — while SGD learns a fixed-dimensional mapping locked to the training distribution. The output is a program, not a weight matrix, and programs naturally generalize across input sizes.
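As a hypothetical instance of such a dimension-invariant program (the chosen support is ours), the same function is correct at $n = 10$ and $n = 100$ without modification:

```python
def three_parity(x, idx=(0, 1, 2)):
    """Parity of three fixed positions. Nothing here depends on len(x),
    so the identical program works at any input length."""
    return sum(x[i] for i in idx) % 2

x10 = [1, 0, 1] + [0] * 7     # input length n = 10
x100 = [1, 0, 1] + [0] * 97   # input length n = 100
print(three_parity(x10), three_parity(x100))  # same answer at both lengths
```

A weight matrix trained at $n = 10$ has no analogous notion of "the same rule at a longer input"; extrapolation is built into the program's form, not learned.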
The reasoning trace is interpretable
Because the output is executable code and the search process is logged, LLM-PV is auditable by design. We can inspect exactly how the model arrives at its solution. The trace below shows how GPT-5-Thinking discovers the primality rule from 100 examples of 100-digit numbers:
Each step is linked to the errors it aims to fix. The progression is natural: start with the simplest possible hypotheses, observe where they fail, escalate to richer function classes. When the residue-class predictor hits 98%, the model recognizes the sieve-like structure and hypothesizes primality directly. The final Miller–Rabin implementation is a correct, efficient primality test.
Contrast this with a neural network trained on the same data. Even if it achieved perfect accuracy, we would have no way to verify why it works. The program output of LLM-PV can be unit-tested, stress-tested out of distribution, and edited by hand.
Tabular benchmarks: LLM-PV is competitive beyond algorithmic tasks
To check whether LLM-PV is useful only for clean algorithmic tasks, we also evaluated it on standard tabular datasets (Adult Income, Secondary Mushroom, CDC Diabetes, Chess, HRTU2). To prevent the LLM from recognizing well-known datasets, we applied a random linear transformation to all inputs.
| Model | Adult Income | Sec. Mushroom | CDC Diabetes | HRTU2 | Chess |
|---|---|---|---|---|---|
| SVM | 52.8 | 68.1 | 69.1 | 91.5 | 89.4 |
| Random Forest | 74.2 | 70.8 | 72.7 | 91.4 | 88.1 |
| XGBoost | 71.4 | 66.1 | 68.6 | 91.4 | 93.1 |
| TabPFN | 77.7 | 69.9 | 72.1 | 93.1 | 93.1 |
| LLM-PV | 75.8 | 72.2 | 71.1 | 92.3 | 92.3 |
LLM-PV is competitive with the best tabular methods using only 100 training samples — not dominant, but solidly in the mix. This suggests the propose-and-verify paradigm has broad applicability, even when the target isn’t a clean algorithmic rule.
What this does and does not say
This argument does not claim that LLM-PV replaces deep learning for standard tasks like image classification or language modeling. In those domains, the deep-learning stack is well-matched and effective.
What it does give is a concrete demonstration that ERM over programs — a classical idea from learning theory — becomes practical when a pretrained LLM provides the search prior. The LLM does not change the learning objective. It does not replace verification. It serves as an inductive bias that makes the search tractable.
This also clarifies why the method fails on SHA-256 parity. The target is a cryptographic hash, deliberately designed so that no compact description helps predict it. When the target truly has no short-program structure, no amount of search-bias helps. This is not a limitation of LLM-PV specifically — it is a fundamental information-theoretic constraint.
The interpretability angle deserves emphasis. In an era of increasingly opaque models, LLM-PV produces outputs you can read. The learned hypothesis is a Python function. The search process is a logged sequence of proposals and errors. When the model fails, you can see why and where. When it succeeds, you can verify the result independently.
Takeaway
The gap between sample efficiency and computational efficiency in program learning can be bridged by using pretrained LLMs as proposal priors. The mechanism has four parts:
- Short programs generalize. If the target admits a length-$L$ description, ERM needs only $O(L \log |\Sigma|)$ samples — independent of input dimension.
- SGD cannot find short programs. For SQ-hard families like parity, gradient-based methods require exponentially many samples, even though the target is simple.
- Brute-force ERM is too expensive. Enumerating $|\Sigma|^L$ candidates is infeasible for even moderate $L$.
- LLM-PV makes ERM tractable. A pretrained LLM concentrates proposals on plausible programs, replacing exponential search with a small number of guided trials. Selection is by held-out error, preserving the generalization guarantees of ERM.
The result is a learning algorithm that recovers exact algorithmic rules from a handful of examples, generalizes beyond the training distribution, and produces interpretable, executable code. It suggests a new paradigm: learning via LLM-guided search with verification — where the LLM provides the hypothesis, but the data provides the judgment.