Algorithm Design · Program Synthesis · LLM Agents

Distribution-Aware Programming

A program is usually one fixed recipe meant to survive every possible input. But real workloads are not worst-case: they repeat, they carry hidden structure, and that structure can be learned. What if samples did not merely improve predictions, but improved the computation itself?

By Tomer Galanti · June 8, 2026 · 15 min read · ◆ Koganti, Mishra, Beneventano, Galanti — 2026

Introduction

Classical learning theory is organized around correctness. A hypothesis is good if it predicts well on fresh draws from the distribution, and generalization is measured by accuracy. But when the thing being learned is an executable procedure — an algorithm, a solver, a piece of code — correctness is only the entry ticket.

Two solvers can both return valid answers on every instance you will ever see, and still be profoundly different. One spends milliseconds; the other spends seconds. One searches the full ambient space; the other has discovered the few degrees of freedom that matter. By an accuracy-only yardstick they are tied. Computationally, one has learned much more.

This is the question behind distribution-aware program learning: from samples of an unknown deployment distribution, a learner returns solver code that is evaluated on fresh instances by both solution quality and runtime. The goal is not to defeat the worst case for SAT, coloring, packing, or TSP. The goal is to learn an algorithm whose execution is specialized to the regime you actually operate in.

The central claim Samples can improve computation, not just accuracy. The reusable object is a solver hint: structure inferred from the sample, compressed once, and compiled into specialized solver code. A pretrained LLM agent can search over these hints by proposing hypotheses, writing analysis programs, and producing deployment solvers.

The paper's conceptual move is to put runtime inside the learning problem. A sample does not merely choose among labels; it chooses among computations. In this view, a successful learned solver is not just a model with good average quality. It is a small piece of amortized algorithm design: pay once to infer the hidden structure, then reuse that structure cheaply on every future instance.

The story has four parts:

I
Runtime is part of generalization.
When the learned object is code, two correct solvers can differ arbitrarily in runtime. Correctness is a feasibility constraint; the objective is fast, correct computation on the deployment distribution.
II
There are three access models for a distribution.
Worst-case design assumes no useful information about $D$. Average-case complexity assumes $D$ is specified analytically. Here $D$ is unknown but sampled, so the solver must infer the useful regularities directly from examples.
III
Solver hints make samples actionable.
The sample-to-solver map factors through a hint: $S \mapsto \widehat h_S \mapsto \widehat c_S = \mathrm{Comp}(\widehat h_S)$. Identifiable hints are recoverable from polynomially many samples and can be compiled into fast solvers, with fallback preserving correctness.
IV
LLM agents synthesize them.
Across 21 structured optimization distributions, synthesized solvers reach mean quality 0.971 and are 564.9× faster than a strong heuristic baseline, 345.1× faster than Gurobi, and 16.9× faster than selected time-limited exact backends.
Based on: S. Koganti, P. Mishra, P. Beneventano, T. Galanti. “Distribution-Aware Algorithm Design with LLM Agents”, 2026.

Part I — When you learn code, runtime is part of generalization

“Correct on every input” means worst-case

Asking a program to be correct on all inputs forces us to design for the hardest one — an instance we may never actually encounter. We engineer the procedure for inputs we never see. Yet real workloads are not adversarial: they are drawn from some distribution with exploitable structure. The instances we run on are usually far easier than the worst case, and a procedure tuned to them can be dramatically cheaper.

Two students, same answer, different learning

A small analogy makes the point. Ask two students to compute $17 \times 24$. Student A computes it as repeated addition — twenty-four copies of seventeen — and reaches $408$. Student B decomposes by place value, $(10+7)(20+4) = 200+40+140+28 = 408$, and reaches the same answer. Both are correct. But only one learned how to compute: repeated addition scales with the magnitude of the numbers, while place-value multiplication scales with the number of digits.

The point of the analogy
Both students learned correctness. Only one learned a better computation. If correctness is the only thing we measure, we cannot tell them apart — which is exactly the blind spot of accuracy-only generalization for learned code.

The objective, made precise

Let $\mathrm{V}(x,z)\in\{0,1\}$ say whether $z$ is a valid solution for instance $x$, and let a solver $c$ run in time $\mathrm{T}(c,x)$. We evaluate a solver by its deployment error $\mathrm{Err}_D(c)=\Pr_{x\sim D}[\mathrm{V}(x,c(x))=0]$ and its expected deployment runtime $\mathrm{Run}_D(c)=\mathbb{E}_{x\sim D}[\mathrm{T}(c,x)]$. Correctness is a feasibility constraint: among solvers with $\mathrm{Err}_D(c)=0$, runtime can still vary arbitrarily. The quantity we actually want is the best runtime achievable by a correct solver in the class:

$$\mathrm{Run}_D^\star(\mathcal{C}) \;:=\; \inf_{c\in\mathcal{C}\,:\,\mathrm{Err}_D(c)=0}\; \mathrm{Run}_D(c)$$

Unlike worst-case runtime, this depends on $D$. The role of the sample is to discover which fast, correct solver the present $D$ admits.


Part II — Three ways to design against a distribution

This places the problem cleanly between two classical traditions. Worst-case algorithm design assumes no distributional information at all. Average-case complexity assumes an analytic specification of $D$ before analysis can begin. In practice, deployment distributions rarely come with a closed form — but they do come with examples. We study that intermediate, sample-access regime.

access to D learned rep. deployed solver Worst-case Average-case This paper none — worst over D algorithm with worst-case guarantee D specified analytically algorithm tuned to D sample S ~ Dⁿ hint ĥ_S ∈ H solver ĉ_S = Comp(ĥ_S)

Fig. 1 — Three access models for designing a solver against a distribution $D$. We study the sample-access regime: infer a hint from $S\sim D^n$, then compile it into a deployed solver. The effective search space becomes the structured subfamily $\mathrm{Comp}(\mathcal{H})\subseteq\mathcal{C}$.


Part III — Solver hints: sample → hint → solver

What a hint is

The central abstraction is a solver hint: reusable structure inferred from samples and compiled into solver code. The sample-to-solver map factors as

$$S \;\longmapsto\; \widehat h_S \;\longmapsto\; \widehat c_S = \mathrm{Comp}(\widehat h_S)$$

A hint might be a backdoor set in SAT, a latent decomposition in a graph problem, an active-resource pattern in packing, or geometric structure in routing. Crucially, correctness need not be learned: the compiled solver can always fall back to a generic complete routine. What the sample learns is which shortcut to compile so that future instances are solved faster.

“A solver hint is not a solution to one instance — it is information that changes the algorithm used for many.”

The concrete case: hidden SAT backdoors

SAT illustrates the mechanism cleanly because correctness never requires learning — a complete solver is always available. The value of learning is purely computational. Suppose there is an unknown backdoor set $B$ of $k$ variables: once you fix the values of those $k$ variables, every clause becomes easy (e.g., Horn), solvable by fast propagation with no further branching. Brute force tries $2^{d}$ assignments over all $d$ variables; if you know $B$, you only enumerate $2^{k}$ settings and let the rest follow.

The hint is which variables form $B$. If membership in $B$ is identifiable from a bounded salience statistic with margin $\gamma$, then a handful of samples recovers it — and the search collapses. Drag the sliders below to see the effect.

SAT backdoor — search collapse — fix the hint, shrink the search
brute force 2d
with hint 2k

Fig. 2. Fixing the $k$ backdoor variables leaves an easy residual problem, so the solver searches $2^{k}$ cases instead of $2^{d}$. The sample is only needed to identify the backdoor — a cost logarithmic in the number of variables — after which the per-instance speedup is exponential.

The guarantees

Two results formalize when sample-conditioned design generalizes. First, when the solver library is fixed, runtime-aware empirical risk minimization works: keep the sample-consistent solvers and pick the fastest. With description length $L(c)$, runtime cap $B$, and $m$ samples, the empirically fastest sample-consistent solver $\widehat c$ satisfies, with high probability,

Correctness · Occam bound
$$\mathrm{Err}_D(\widehat c) \;\le\; \frac{L(\widehat c) + \log(2/\delta)}{m}$$
Runtime · competitive with any correct solver
$$\mathrm{Run}_D(\widehat c) \;\le\; \mathrm{Run}_D(c) \;+\; 2B\sqrt{\frac{L(c)+L(\widehat c)+\log(4/\delta)}{2m}}$$

So runtime-aware ERM is not merely picking a solver that looked fast on the sample — it estimates the best correct distribution-specialized solver available in the class. The guarantee is class-relative: it does not promise the right specialization is present, only that if it is, samples can find it.

Second, when the useful specialization is not enumerated in advance, the learner must recover a hint. If the hints induce score functions separated by a margin $\gamma$, then identifying the true hint among $N$ candidates needs only

$$n \;\ge\; \frac{2}{\gamma^2}\,\log\frac{2N}{\delta}\qquad\text{samples}$$

— logarithmic in the number of candidate hints. For the SAT backdoor over $d$ variables, this specializes to $m \ge 8\gamma^{-2}\log(2d/\delta)$ samples to recover $B$ exactly, after which the compiled solver runs in $O(2^{k}\,\mathrm{poly}(|F|))$ on the deployment distribution.

The honest caveat The hard part in practice is not estimating a hint for a known score family. It is discovering what the hint should be, what statistic reveals it, and how to compile it into code. That discovery problem is exactly what the LLM agent approximates.

Part IV — Synthesizing solvers with an LLM agent

The synthesis procedure implements the sample → hint → solver factorization directly. Each candidate is not a one-shot guess at a solver; it is a triple — a distributional hypothesis, a train-time analysis program that estimates the hint from samples, and a deployment solver conditioned on the recovered hint. Click through the stages:

Synthesis pipeline — propose, compile, verify — click each stage for details
Stage 1
Hypothesize
Guess the reusable structure
Stage 2
Analyze → hint
Measure it once over the sample
Stage 3
Compile
Write a solver using the hint
Stage 4
Verify & select
Rank on held-out data
Click a stage above to see details.

The key discipline is the separation between train-time inference and deployment. The analysis program may spend computation once on the public sample to estimate the hint $a_c = A_c(S_{\mathrm{tr}})$; the deployed solver then uses that cheap summary on each new instance. This is amortization: pay a one-time synthesis cost against the sample, then deploy a solver whose per-instance cost is much lower.

The leakage boundary
The agent sees only public instances, the problem specification, and the scoring rule. It never sees the hidden distribution-family identity, the planted rule, optimum solutions, optimum objective values, or any test performance. This models genuine deployment: you know the problem and the metric, not the generative mechanism.

The pipeline, run for real

The diagram above is the clean abstraction; the candidate archive shows the loop with all the mess left in. Across 21 targets, the run tries 189 candidate hypotheses — three slots over three rounds for each target. Those attempts produce 157 train-time analysis programs, 131 deployment solvers, and 110 successfully validation-evaluated solvers; the 21 selected winners are then run on test. So this is not one-shot code generation. It is a small experimental process: propose a structural story, write analysis code to measure it, compile a solver, let validation kill weak ideas, and keep the best solver found across all rounds.

Each tab shows one representative target. Cards are the best validation-evaluated solver in each round; the gold card is the selected solver for that target.
Representative target · MAX-SAT / community parity · climbs from a wrong guess to near-optimal
Round 0
0.0%25.9 ms
Header-keyed templates
Round 1
99.9%28.1 ms
Header bits expose the assignment
Round 2selected
99.99%33.6 ms
Per-instance polarity backbone + local search
What it found: Clause signs are biased toward a strong per-instance assignment signal.
What the solver does: Initialize each variable by its polarity majority, then run incremental min-break WalkSAT with small restarts and a safe anchor-clause post-fix.
Net effect: instead of starting with uninformed assignment search, the solver begins near the planted backbone and spends runtime only on the residual unsatisfied clauses.
Representative target · Coloring / cluster ring · finds the reusable template immediately
Round 0selected
100.0%0.35 ms
One global color template
Round 1
100.0%8.4 ms
Global 4-color role partition
Round 2
100.0%0.44 ms
Compressed global template
What it found: The vertices share a stable color-role assignment across instances.
What the solver does: Return the learned template, check for edge conflicts, run a tiny min-conflicts repair if needed, and fall back to greedy DSATUR only if the template is missing or mismatched.
Net effect: instead of searching over colorings, inference is essentially template lookup plus a quick verifier.
Representative target · Independent set / clique-path mix · recovers after a failed refinement
Round 0
97.5%127.0 ms
Clique attachment / low-treewidth mixture
Round 1
0.0%91.9 ms
Backbone path plus chord noise
Round 2selected
99.2%15.1 ms
Fixed backbone with small-world chords
What it found: Vertex IDs carry a stable backbone order, with distance-biased chord structure visible in offset statistics.
What the solver does: Use those statistics to seed deterministic mod-order greedy constructions, then run a short local-improvement search.
Net effect: instead of branching over all subsets, the solver searches around a small set of structure-aligned independent sets.
Representative target · Packing LP / latent active basis · jumps once it prices the active resources
Round 0
88.7%13.7 ms
Two-tier tightness, weighted greedy
Round 1selected
98.0%90.9 ms
Few active bottlenecks, priced tradeoffs
Round 2
98.0%39.7 ms
Sparse active bottleneck in a fixed prefix
What it found: A small critical set of resources controls the objective-feasibility tradeoff.
What the solver does: Price items by tightness over that critical set, run fractional greedy fills, and keep the best feasible portfolio variant.
Net effect: instead of calling a full LP backend, most work becomes density sorting under a learned active-resource model.
Representative target · Knapsack / latent classes · improves the surrogate each round
Round 0
93.3%14.0 ms
Near-1D surrogate plus repair
Round 1
93.7%15.6 ms
Risk-adjusted one-D surrogate
Round 2selected
96.8%120.3 ms
Adaptive resource-price surrogate
What it found: The many constraints can be compressed into instance-specific resource prices.
What the solver does: Learn prices by a few overload-driven multiplicative updates, pack by value over priced load, then repair feasibility and add back greedily.
Net effect: instead of integer search over item subsets, the solver uses a priced one-dimensional surrogate plus bounded repair.
Representative target · TSP / clustered Euclidean · peaks in the middle and keeps that solver
Round 0
98.4%74.2 ms
Ring of clusters
Round 1selected
98.8%67.8 ms
Angular-sector ring, polar order
Round 2
93.9%4.7 ms
Portal ring with gate cities
What it found: Cities form angular communities around a common ring-like geometry.
What the solver does: Build polar-angle and nearest-neighbor tours, then improve them with candidate-restricted 2-opt and a small or-opt pass.
Net effect: instead of Held--Karp-style dynamic programming, the solver generates a structured tour and spends local-search budget only on nearby candidate swaps.
Representative target · Dominating set / star cluster · two strong rounds, second selected
Round 0
100.0%18.0 ms
Majority-template hubs
Round 1selected
100.0%10.1 ms
Leaf-forced core plus connectors
What it found: Leaf neighbors are effectively forced, leaving only a tiny residual kernel.
What the solver does: Select all forced leaf neighbors, solve small kernel components exactly with bitmask set-cover DP, and fall back to greedy completion if the kernel is unexpectedly large.
Net effect: instead of set-cover search over the whole graph, exact search is confined to small leftover kernels.

Fig. 3. Representative evolution traces from the exported candidate archive. Scores and runtimes are validation metrics, not hidden test feedback. The important pattern is qualitative: some targets are solved immediately, some improve through refinement, and some later rounds get faster but lose enough quality that validation keeps an earlier solver.

What the archive shows
The right mental model is beam search over explanations, not one-shot code generation. The LLM proposes possible hidden structures, but execution supplies the discipline: run the analysis, compile the solver, measure held-out quality and runtime, then refine, fork, replace, or stop. In the exported run, the selected solvers come from all three rounds — 5 from round 0, 7 from round 1, and 9 from round 2.

Part V — Results

High quality, far lower runtime

The benchmark spans 21 structured combinatorial-optimization target distributions across seven problem classes — coloring, MAXSAT, maximum independent set, minimum dominating set, packing LP, multidimensional knapsack, and TSP. Each target pairs a familiar hard optimization problem with a hidden distribution family. The learner sees public samples and the scoring rule, but not the family identity, hidden rule, optimum solutions, optimum objective values, or test performance. The exported traces make the evaluation especially interpretable: for every target we can inspect the hypotheses tried, the analysis code written, the deployment solvers produced, and which candidate actually survived validation.

The aggregate result is simple: the synthesized programs are high quality, much faster, and interpretable enough to inspect. The main heuristic baseline below is a strong heuristic rather than a deliberately slow quality-maximizing heuristic; the precise definition is kept in the figure caption so the story can stay focused on computation.

0.971
mean normalized quality
(1.0 = optimal)
+0.109
quality over a
strong heuristic
564.9×
faster than a
strong heuristic
345.1×
faster than Gurobi
(10s budget)

The synthesized solvers also improve by $+0.224$ over the average heuristic pool and by $+0.023$ over the learned ML baseline in mean normalized quality. Their geometric-mean per-instance runtime is $12.7$ ms. Against exact-style methods, they are $16.9\times$ faster than the selected time-limited exact backend. The ML comparison is more nuanced: the synthesized solvers are slightly higher quality on average, but ML inference is often faster on CPU or GPU. So the claim is not “LLMs dominate every baseline”; the claim is that the LLM-generated programs recover a surprisingly strong quality--runtime frontier.

Solution quality (normalized, higher is better)

Fig. 4. Per-family results. The synthesized solver (teal) is compared with a strong heuristic baseline (muted). Here “strong heuristic” means $h_{\mathrm{fast}}^{95}$: the fastest heuristic whose quality is at least 95% of the best heuristic quality for that family. Speedup bars use a log scale; values above $1\times$ mean the synthesized solver is faster.

The full aggregate table:

Family Quality ΔQ vs avg ΔQ vs strong heur. ΔQ vs ML Runtime vs strong heur. vs Gurobi vs exact
Coloring0.868+0.217+0.121−0.1272.7 ms1326.3×2285.6×23.1×
MAXSAT1.000+0.122+0.0830.00017.1 ms212.2×328.8×1.2×
MIS0.992+0.218+0.113−0.00218.8 ms1904.2×155.8×41.2×
MDS0.973+0.148+0.124+0.02113.3 ms1626.8×443.0×21.6×
Packing LP0.994+0.301+0.317+0.1863.3 ms13483.7×2829.1×44.1×
MDKP0.973+0.215+0.0090.00094.0 ms46.4×33.8×9.6×
TSP0.993+0.348−0.007+0.08414.6 ms33.7×117.5×37.2×
All (21)0.971+0.224+0.109+0.02312.7 ms564.9×345.1×16.9×

The exceptions are just as useful as the wins. On TSP, synthesis is slightly below the strong heuristic in solution quality, though it is much faster. On Coloring, the learned ML baseline attains higher quality. Across several families, ML inference can also be faster than the synthesized code. The result is therefore not a universal dominance statement; it is a distribution-aware algorithm-design result. The agent often finds a specialized computation that is high quality, very fast, and interpretable enough to inspect.

What kinds of algorithms it creates

The payoff is computational. In almost every family, the deployed solver does not merely tune the generic method; it replaces a large ambient search with a smaller distribution-specific routine inferred from samples. The hint changes the scale of the computation.

ProblemThe generic computationWhat the agent built instead
Graph coloringsearch over possible coloringslearn a stable global color-role template, verify it, and locally repair rare conflicts
MAX-SATassignment search or broad local search from scratchseed each variable by per-instance polarity majority, then run a short incremental WalkSAT cleanup
Independent setbranch-and-bound over many subsetsuse backbone/order statistics to build structure-aligned greedy candidates, then improve locally or solve tiny residuals
Dominating setset-cover search over all verticesforce leaf-neighbor centers, cover greedily, and solve only small leftover kernels exactly
Packing LPfull LP solve over all resources and variablesidentify active/bottleneck resources, sort by learned density, and repair feasibility
Multidimensional knapsackinteger program over item subsetscollapse constraints into adaptive resource prices, pack by priced density, then repair and add back
TSPHeld--Karp-style dynamic programming or broad tour searchdetect latent geometry, build structured tours, and run bounded candidate-restricted 2-opt/or-opt

Fig. 5. Representative deployed-solver patterns across problem classes. The expensive generic computation is avoided; any exact search that remains is pushed onto a small residual piece.

The mechanism, confirmed
Diagnostic traces show the synthesized solvers take a learned fast path on $86.9\%$ of instances on average, with a generic fallback firing on only $3.0\%$. The speedups therefore come from distribution-specific computation, not from quietly calling a heavy backend.

An external stress test: PACE 2025 Dominating Set

To test the procedure outside its own benchmark, the authors evaluate on the released PACE 2025 Dominating Set instances, against highly engineered competition solvers. The agent uses the public PACE set as its development distribution, then reports on the released private set. The synthesized solver is valid on all 100 private graphs and runs roughly two orders of magnitude faster than the released PACE solvers, while returning moderately larger dominating sets.

SolverValidSize vs LLMRuntimeLLM speedup
LLM (synthesized)100/1001.0002.89 s1.0×
Fontan–Verger100/1001.033286.24 s98.9×
Root100/1001.033360.42 s124.5×
Shadoks100/1001.032316.07 s109.2×
AEG Heidelberg100/1001.034350.14 s121.0×
Greeduce100/1001.031300.86 s104.0×
Swats75/1001.028†287.61 s†130.3׆

Fig. 6. “Size vs LLM” is the matched total-size ratio; values above $1$ mean the synthesized solver returns a larger (worse) dominating set. The gap is about $3.3\%$ against the top released solvers, while runtime is roughly $99\times$–$109\times$ lower. †Computed on the 75 instances Swats solved validly. Not an official PACE score.


What this does and does not say

This is not a claim of new worst-case algorithms for SAT, coloring, knapsack, or TSP. Several of the generated solvers are bounded heuristics with fallback. The faithful claim is distributional and mechanistic: on these deployment distributions, the synthesized programs often replace generic optimization over a large ambient space with a much smaller, distribution-specific computation — and that is exactly the kind of win that accuracy-only generalization cannot see.

The same lens names the limitations. A one-time synthesis cost only pays off when amortized over enough future instances. The resulting solver is specialized to the sampled regime, so its advantage can degrade under distribution shift; even presentation changes such as relabeling can move quality on some targets. And because the agent searches a rich program space, different runs may recover different hints: some genuine structural summaries, some useful proxies, and some brittle shortcuts.

The hint encodes the shortcut.
An analysis program estimates reusable structure from the sample once, producing a compact summary the solver can reuse cheaply per instance.
The data verifies.
Held-out validation selects among candidate hints and solvers by quality, optimality, and runtime. The LLM proposes; execution and validation decide.

Takeaway

When the learned object is code, generalization has two axes — quality and computation. Distribution-aware programming targets both.

Runtime is part of the objective. Among correct solvers, the sample's job is to find the one whose computation is specialized to the distribution you actually deploy on.

A solver hint is the unit of reuse. Identifiable structure can be recovered from $O(\log N / \gamma^2)$ samples and compiled into a fast solver, with a complete fallback keeping every answer correct.

LLM agents can discover hints. Across 21 structured distributions, synthesized solvers reach $0.971$ mean quality, improve by $+0.109$ over a strong heuristic baseline, and run $564.9\times$ faster than that baseline, $345.1\times$ faster than Gurobi, and $16.9\times$ faster than selected time-limited exact backends.

The learned object ends up closer to a specialized algorithm for the deployment regime than to a tuned heuristic for isolated instances.


Comments