1

Training data

Split m=200 labeled examples into S_tr (100) and S_val (100)

2

Build prompt from S_tr

Format I/O examples into a code-generation prompt for the LLM

3

LLM proposes k=5 candidate programs

Sample the programs independently, with no gradient updates and no adaptive feedback

4

Execute and score on S_val

Compile and run each candidate in a sandbox, then compute its error on S_val

5

Select best program by validation error

Return h*, the candidate with the lowest validation error on S_val (standard ERM selection over the k candidates)
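
The five steps above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the original implementation: the toy task (label whether x is divisible by 3), the prompt wording, and the helper names `split`, `build_prompt`, `propose_programs`, `validation_error`, and `select_best` are all hypothetical. In particular, `propose_programs` is a stub that returns fixed strings in place of real LLM samples, and plain `exec` is used only for brevity; it is not a sandbox.

```python
import random

def split(examples, n_train, seed=0):
    """Step 1: shuffle the labeled examples and split into S_tr and S_val."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def build_prompt(s_tr):
    """Step 2: format the I/O pairs in S_tr into a code-generation prompt."""
    lines = ["Write a Python function f(x) consistent with these examples:"]
    lines += [f"f({x!r}) -> {y!r}" for x, y in s_tr]
    return "\n".join(lines)

def propose_programs(prompt, k):
    """Step 3 (stub): stands in for k independent LLM samples.
    A real version would send `prompt` to a model; these strings are placeholders."""
    pool = [
        "def f(x):\n    return x % 2 == 0",
        "def f(x):\n    return x % 3 == 0",
        "def f(x):\n    return x > 50",
        "def f(x):\n    return True",
        "def f(x)\n    return x",  # deliberately malformed: fails to compile
    ]
    return pool[:k]

def validation_error(source, s_val):
    """Step 4: compile one candidate and return its 0-1 error on S_val.
    WARNING: exec() is NOT a sandbox; untrusted code needs real isolation."""
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        f = namespace["f"]
    except Exception:
        return 1.0  # a candidate that does not compile gets the worst score
    mistakes = 0
    for x, y in s_val:
        try:
            mistakes += int(f(x) != y)
        except Exception:
            mistakes += 1  # a runtime failure counts as an error
    return mistakes / len(s_val)

def select_best(candidates, s_val):
    """Step 5: return the candidate with the lowest validation error."""
    errors = [validation_error(c, s_val) for c in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    return candidates[best], errors

# Toy data: m=200 examples, split 100/100 as in step 1.
examples = [(x, x % 3 == 0) for x in range(200)]
s_tr, s_val = split(examples, n_train=100)
prompt = build_prompt(s_tr)
candidates = propose_programs(prompt, k=5)
h_star, errors = select_best(candidates, s_val)
```

Because the candidates are generated without ever looking at S_val, the validation set is touched only in steps 4 and 5, once per candidate. That separation is what makes the final selection a plain minimum over a fixed, finite set of k programs.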