Revisiting ERM in the LLM Era
How pretrained LLMs can serve as search priors for empirical risk minimization over programs, recovering exact algorithmic rules from a handful of examples.
Why SGD Prefers Low-Rank Neural Networks
Why do trained neural networks so often end up effectively low-rank? Mini-batch SGD and weight decay together create a built-in pressure toward compressible layers.
Why Pretrained Classifiers Work So Well in Few-Shot Learning
A geometric explanation for why ordinary supervised pretraining can transfer remarkably well to new classes with only a few labeled examples.
Self-Supervised Learning ≈ Supervised Learning
Contrastive learning is often much closer to supervised contrastive learning than it first appears, both in its objective and in the geometry of the representations it learns.