Distribution-Aware Programming
A program is usually one fixed recipe meant to survive every possible input. But real workloads are not worst-case: they repeat, they carry hidden structure, and that structure can be learned. What if samples did not merely improve predictions, but improved the computation itself?
Introduction
Classical learning theory is organized around correctness. A hypothesis is good if it predicts well on fresh draws from the distribution, and generalization is measured by accuracy. But when the thing being learned is an executable procedure — an algorithm, a solver, a piece of code — correctness is only the entry ticket.
Two solvers can both return valid answers on every instance you will ever see, and still be profoundly different. One spends milliseconds; the other spends seconds. One searches the full ambient space; the other has discovered the few degrees of freedom that matter. By an accuracy-only yardstick they are tied. Computationally, one has learned much more.
This is the question behind distribution-aware program learning: from samples of an unknown deployment distribution, a learner returns solver code that is evaluated on fresh instances by both solution quality and runtime. The goal is not to defeat the worst case for SAT, coloring, packing, or TSP. The goal is to learn an algorithm whose execution is specialized to the regime you actually operate in.
The paper's conceptual move is to put runtime inside the learning problem. A sample does not merely choose among labels; it chooses among computations. In this view, a successful learned solver is not just a model with good average quality. It is a small piece of amortized algorithm design: pay once to infer the hidden structure, then reuse that structure cheaply on every future instance.
The story has four parts:
Part I — When you learn code, runtime is part of generalization
“Correct on every input” means worst-case
Asking a program to be correct on all inputs forces us to design for the hardest one — an instance we may never actually encounter. We engineer the procedure for inputs we never see. Yet real workloads are not adversarial: they are drawn from some distribution with exploitable structure. The instances we run on are usually far easier than the worst case, and a procedure tuned to them can be dramatically cheaper.
Two students, same answer, different learning
A small analogy makes the point. Ask two students to compute $17 \times 24$. Student A computes it as repeated addition — twenty-four copies of seventeen — and reaches $408$. Student B decomposes by place value, $(10+7)(20+4) = 200+40+140+28 = 408$, and reaches the same answer. Both are correct. But only one learned how to compute: repeated addition scales with the magnitude of the numbers, while place-value multiplication scales with the number of digits.
The objective, made precise
Let $\mathrm{V}(x,z)\in\{0,1\}$ say whether $z$ is a valid solution for instance $x$, and let a solver $c$ run in time $\mathrm{T}(c,x)$. We evaluate a solver by its deployment error $\mathrm{Err}_D(c)=\Pr_{x\sim D}[\mathrm{V}(x,c(x))=0]$ and its expected deployment runtime $\mathrm{Run}_D(c)=\mathbb{E}_{x\sim D}[\mathrm{T}(c,x)]$. Correctness is a feasibility constraint: among solvers with $\mathrm{Err}_D(c)=0$, runtime can still vary arbitrarily. The quantity we actually want is the best runtime achievable by a correct solver in the class:
Unlike worst-case runtime, this depends on $D$. The role of the sample is to discover which fast, correct solver the present $D$ admits.
Part II — Three ways to design against a distribution
This places the problem cleanly between two classical traditions. Worst-case algorithm design assumes no distributional information at all. Average-case complexity assumes an analytic specification of $D$ before analysis can begin. In practice, deployment distributions rarely come with a closed form — but they do come with examples. We study that intermediate, sample-access regime.
Fig. 1 — Three access models for designing a solver against a distribution $D$. We study the sample-access regime: infer a hint from $S\sim D^n$, then compile it into a deployed solver. The effective search space becomes the structured subfamily $\mathrm{Comp}(\mathcal{H})\subseteq\mathcal{C}$.
Part III — Solver hints: sample → hint → solver
What a hint is
The central abstraction is a solver hint: reusable structure inferred from samples and compiled into solver code. The sample-to-solver map factors as
A hint might be a backdoor set in SAT, a latent decomposition in a graph problem, an active-resource pattern in packing, or geometric structure in routing. Crucially, correctness need not be learned: the compiled solver can always fall back to a generic complete routine. What the sample learns is which shortcut to compile so that future instances are solved faster.
The concrete case: hidden SAT backdoors
SAT illustrates the mechanism cleanly because correctness never requires learning — a complete solver is always available. The value of learning is purely computational. Suppose there is an unknown backdoor set $B$ of $k$ variables: once you fix the values of those $k$ variables, every clause becomes easy (e.g., Horn), solvable by fast propagation with no further branching. Brute force tries $2^{d}$ assignments over all $d$ variables; if you know $B$, you only enumerate $2^{k}$ settings and let the rest follow.
The hint is which variables form $B$. If membership in $B$ is identifiable from a bounded salience statistic with margin $\gamma$, then a handful of samples recovers it — and the search collapses. Drag the sliders below to see the effect.
Fig. 2. Fixing the $k$ backdoor variables leaves an easy residual problem, so the solver searches $2^{k}$ cases instead of $2^{d}$. The sample is only needed to identify the backdoor — a cost logarithmic in the number of variables — after which the per-instance speedup is exponential.
The guarantees
Two results formalize when sample-conditioned design generalizes. First, when the solver library is fixed, runtime-aware empirical risk minimization works: keep the sample-consistent solvers and pick the fastest. With description length $L(c)$, runtime cap $B$, and $m$ samples, the empirically fastest sample-consistent solver $\widehat c$ satisfies, with high probability,
So runtime-aware ERM is not merely picking a solver that looked fast on the sample — it estimates the best correct distribution-specialized solver available in the class. The guarantee is class-relative: it does not promise the right specialization is present, only that if it is, samples can find it.
Second, when the useful specialization is not enumerated in advance, the learner must recover a hint. If the hints induce score functions separated by a margin $\gamma$, then identifying the true hint among $N$ candidates needs only
— logarithmic in the number of candidate hints. For the SAT backdoor over $d$ variables, this specializes to $m \ge 8\gamma^{-2}\log(2d/\delta)$ samples to recover $B$ exactly, after which the compiled solver runs in $O(2^{k}\,\mathrm{poly}(|F|))$ on the deployment distribution.
Part IV — Synthesizing solvers with an LLM agent
The synthesis procedure implements the sample → hint → solver factorization directly. Each candidate is not a one-shot guess at a solver; it is a triple — a distributional hypothesis, a train-time analysis program that estimates the hint from samples, and a deployment solver conditioned on the recovered hint. Click through the stages:
The key discipline is the separation between train-time inference and deployment. The analysis program may spend computation once on the public sample to estimate the hint $a_c = A_c(S_{\mathrm{tr}})$; the deployed solver then uses that cheap summary on each new instance. This is amortization: pay a one-time synthesis cost against the sample, then deploy a solver whose per-instance cost is much lower.
The pipeline, run for real
The diagram above is the clean abstraction; the candidate archive shows the loop with all the mess left in. Across 21 targets, the run tries 189 candidate hypotheses — three slots over three rounds for each target. Those attempts produce 157 train-time analysis programs, 131 deployment solvers, and 110 successfully validation-evaluated solvers; the 21 selected winners are then run on test. So this is not one-shot code generation. It is a small experimental process: propose a structural story, write analysis code to measure it, compile a solver, let validation kill weak ideas, and keep the best solver found across all rounds.
Fig. 3. Representative evolution traces from the exported candidate archive. Scores and runtimes are validation metrics, not hidden test feedback. The important pattern is qualitative: some targets are solved immediately, some improve through refinement, and some later rounds get faster but lose enough quality that validation keeps an earlier solver.
Part V — Results
High quality, far lower runtime
The benchmark spans 21 structured combinatorial-optimization target distributions across seven problem classes — coloring, MAXSAT, maximum independent set, minimum dominating set, packing LP, multidimensional knapsack, and TSP. Each target pairs a familiar hard optimization problem with a hidden distribution family. The learner sees public samples and the scoring rule, but not the family identity, hidden rule, optimum solutions, optimum objective values, or test performance. The exported traces make the evaluation especially interpretable: for every target we can inspect the hypotheses tried, the analysis code written, the deployment solvers produced, and which candidate actually survived validation.
The aggregate result is simple: the synthesized programs are high quality, much faster, and interpretable enough to inspect. The main heuristic baseline below is a strong heuristic rather than a deliberately slow quality-maximizing heuristic; the precise definition is kept in the figure caption so the story can stay focused on computation.
(1.0 = optimal)
strong heuristic
strong heuristic
(10s budget)
The synthesized solvers also improve by $+0.224$ over the average heuristic pool and by $+0.023$ over the learned ML baseline in mean normalized quality. Their geometric-mean per-instance runtime is $12.7$ ms. Against exact-style methods, they are $16.9\times$ faster than the selected time-limited exact backend. The ML comparison is more nuanced: the synthesized solvers are slightly higher quality on average, but ML inference is often faster on CPU or GPU. So the claim is not “LLMs dominate every baseline”; the claim is that the LLM-generated programs recover a surprisingly strong quality--runtime frontier.
Fig. 4. Per-family results. The synthesized solver (teal) is compared with a strong heuristic baseline (muted). Here “strong heuristic” means $h_{\mathrm{fast}}^{95}$: the fastest heuristic whose quality is at least 95% of the best heuristic quality for that family. Speedup bars use a log scale; values above $1\times$ mean the synthesized solver is faster.
The full aggregate table:
| Family | Quality | ΔQ vs avg | ΔQ vs strong heur. | ΔQ vs ML | Runtime | vs strong heur. | vs Gurobi | vs exact |
|---|---|---|---|---|---|---|---|---|
| Coloring | 0.868 | +0.217 | +0.121 | −0.127 | 2.7 ms | 1326.3× | 2285.6× | 23.1× |
| MAXSAT | 1.000 | +0.122 | +0.083 | 0.000 | 17.1 ms | 212.2× | 328.8× | 1.2× |
| MIS | 0.992 | +0.218 | +0.113 | −0.002 | 18.8 ms | 1904.2× | 155.8× | 41.2× |
| MDS | 0.973 | +0.148 | +0.124 | +0.021 | 13.3 ms | 1626.8× | 443.0× | 21.6× |
| Packing LP | 0.994 | +0.301 | +0.317 | +0.186 | 3.3 ms | 13483.7× | 2829.1× | 44.1× |
| MDKP | 0.973 | +0.215 | +0.009 | 0.000 | 94.0 ms | 46.4× | 33.8× | 9.6× |
| TSP | 0.993 | +0.348 | −0.007 | +0.084 | 14.6 ms | 33.7× | 117.5× | 37.2× |
| All (21) | 0.971 | +0.224 | +0.109 | +0.023 | 12.7 ms | 564.9× | 345.1× | 16.9× |
The exceptions are just as useful as the wins. On TSP, synthesis is slightly below the strong heuristic in solution quality, though it is much faster. On Coloring, the learned ML baseline attains higher quality. Across several families, ML inference can also be faster than the synthesized code. The result is therefore not a universal dominance statement; it is a distribution-aware algorithm-design result. The agent often finds a specialized computation that is high quality, very fast, and interpretable enough to inspect.
What kinds of algorithms it creates
The payoff is computational. In almost every family, the deployed solver does not merely tune the generic method; it replaces a large ambient search with a smaller distribution-specific routine inferred from samples. The hint changes the scale of the computation.
| Problem | The generic computation | What the agent built instead |
|---|---|---|
| Graph coloring | search over possible colorings | learn a stable global color-role template, verify it, and locally repair rare conflicts |
| MAX-SAT | assignment search or broad local search from scratch | seed each variable by per-instance polarity majority, then run a short incremental WalkSAT cleanup |
| Independent set | branch-and-bound over many subsets | use backbone/order statistics to build structure-aligned greedy candidates, then improve locally or solve tiny residuals |
| Dominating set | set-cover search over all vertices | force leaf-neighbor centers, cover greedily, and solve only small leftover kernels exactly |
| Packing LP | full LP solve over all resources and variables | identify active/bottleneck resources, sort by learned density, and repair feasibility |
| Multidimensional knapsack | integer program over item subsets | collapse constraints into adaptive resource prices, pack by priced density, then repair and add back |
| TSP | Held--Karp-style dynamic programming or broad tour search | detect latent geometry, build structured tours, and run bounded candidate-restricted 2-opt/or-opt |
Fig. 5. Representative deployed-solver patterns across problem classes. The expensive generic computation is avoided; any exact search that remains is pushed onto a small residual piece.
An external stress test: PACE 2025 Dominating Set
To test the procedure outside its own benchmark, the authors evaluate on the released PACE 2025 Dominating Set instances, against highly engineered competition solvers. The agent uses the public PACE set as its development distribution, then reports on the released private set. The synthesized solver is valid on all 100 private graphs and runs roughly two orders of magnitude faster than the released PACE solvers, while returning moderately larger dominating sets.
| Solver | Valid | Size vs LLM | Runtime | LLM speedup |
|---|---|---|---|---|
| LLM (synthesized) | 100/100 | 1.000 | 2.89 s | 1.0× |
| Fontan–Verger | 100/100 | 1.033 | 286.24 s | 98.9× |
| Root | 100/100 | 1.033 | 360.42 s | 124.5× |
| Shadoks | 100/100 | 1.032 | 316.07 s | 109.2× |
| AEG Heidelberg | 100/100 | 1.034 | 350.14 s | 121.0× |
| Greeduce | 100/100 | 1.031 | 300.86 s | 104.0× |
| Swats | 75/100 | 1.028† | 287.61 s† | 130.3׆ |
Fig. 6. “Size vs LLM” is the matched total-size ratio; values above $1$ mean the synthesized solver returns a larger (worse) dominating set. The gap is about $3.3\%$ against the top released solvers, while runtime is roughly $99\times$–$109\times$ lower. †Computed on the 75 instances Swats solved validly. Not an official PACE score.
What this does and does not say
This is not a claim of new worst-case algorithms for SAT, coloring, knapsack, or TSP. Several of the generated solvers are bounded heuristics with fallback. The faithful claim is distributional and mechanistic: on these deployment distributions, the synthesized programs often replace generic optimization over a large ambient space with a much smaller, distribution-specific computation — and that is exactly the kind of win that accuracy-only generalization cannot see.
The same lens names the limitations. A one-time synthesis cost only pays off when amortized over enough future instances. The resulting solver is specialized to the sampled regime, so its advantage can degrade under distribution shift; even presentation changes such as relabeling can move quality on some targets. And because the agent searches a rich program space, different runs may recover different hints: some genuine structural summaries, some useful proxies, and some brittle shortcuts.
Takeaway
When the learned object is code, generalization has two axes — quality and computation. Distribution-aware programming targets both.
Runtime is part of the objective. Among correct solvers, the sample's job is to find the one whose computation is specialized to the distribution you actually deploy on.
A solver hint is the unit of reuse. Identifiable structure can be recovered from $O(\log N / \gamma^2)$ samples and compiled into a fast solver, with a complete fallback keeping every answer correct.
LLM agents can discover hints. Across 21 structured distributions, synthesized solvers reach $0.971$ mean quality, improve by $+0.109$ over a strong heuristic baseline, and run $564.9\times$ faster than that baseline, $345.1\times$ faster than Gurobi, and $16.9\times$ faster than selected time-limited exact backends.
The learned object ends up closer to a specialized algorithm for the deployment regime than to a tuned heuristic for isolated instances.
Comments