Why Pretrained Classifiers Work So Well in Few-Shot Learning
A geometric explanation for why ordinary supervised pretraining transfers remarkably well to new classes with only a few labeled examples — and the precise quantity that controls when this works.
Introduction
Deep networks trained on large classification benchmarks such as ImageNet often transfer surprisingly well to new tasks. One trains on a large source dataset, freezes the learned representation, fits a simple head on only a handful of labeled examples from new classes, and the resulting classifier is often remarkably accurate.
From a theoretical perspective, however, this is not obvious. A classifier trained on ImageNet is optimized to separate the ImageNet classes. Why should the same representation make it easy to separate new classes that never appeared during training? And why should only a few labeled examples be enough?
The argument has three parts:

1. A formal model in which classes themselves are random draws from a common population, so new target classes are statistically like the source classes (Part I).
2. A single geometric quantity, the class-distance normalized variance (CDNV), which compares within-class spread to between-class separation in the learned feature space (Part II).
3. A transfer bound: small CDNV on the source classes, together with two generalization steps, controls the few-shot error of a nearest-center classifier on unseen classes (Part III).
Interactive — Neural collapse and transfer in 3D
The visualization below shows neural collapse unfolding in a 3D feature space. Drag the epoch slider to watch source-class features (filled dots) tighten into clusters while target-class features (rings) inherit the same geometry. Drag the viewport to rotate.
Fig. 1. 3D feature space during training. Source classes (filled dots, 4 classes × 18 samples) collapse toward an equiangular tight frame as the epoch slider advances. Target classes (rings, 3 classes × 5 samples) are unseen during training but inherit the same favorable geometry. CDNV drops as within-class variance shrinks and class means separate. At epoch ~170+, nearest-center classification achieves 100% on target classes.
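For readers who prefer code to the animation, here is a minimal NumPy stand-in for the same setup: class means placed on separated directions in 3D, within-class noise that shrinks as a proxy for training progress, and a 5-shot nearest-center classifier evaluated on target classes that played no role in choosing the geometry. The noise schedule and all names are illustrative assumptions, not the code behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_directions(k, dim):
    """Sample k unit vectors to act as class means in feature space."""
    m = rng.normal(size=(k, dim))
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def sample_features(means, n, noise):
    """Features = class mean + isotropic noise; smaller noise mimics collapse."""
    k, dim = means.shape
    x = means[:, None, :] + noise * rng.normal(size=(k, n, dim))
    return x.reshape(k * n, dim), np.repeat(np.arange(k), n)

def nearest_center_accuracy(means, n_support, n_query, noise):
    """Fit empirical centers on n_support shots per class, test on fresh queries."""
    support, sy = sample_features(means, n_support, noise)
    query, qy = sample_features(means, n_query, noise)
    centers = np.stack([support[sy == c].mean(axis=0) for c in range(len(means))])
    pred = np.linalg.norm(query[:, None, :] - centers[None], axis=-1).argmin(axis=1)
    return (pred == qy).mean()

target_means = random_directions(3, 3)           # 3 unseen classes in 3D, as in Fig. 1
for noise in [0.8, 0.4, 0.2, 0.05]:              # proxy for epochs: collapse shrinks the noise
    acc = nearest_center_accuracy(target_means, n_support=5, n_query=200, noise=noise)
    print(f"noise={noise:.2f}  5-shot nearest-center accuracy={acc:.2f}")
```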
Part I — A formal model for transfer to new classes
Classes as random objects
To reason about transfer to new classes, we need a model relating source and target tasks. The key step is to treat classes themselves as random objects. There is an unknown distribution $\mathcal{D}$ over a collection $\mathcal{E}$ of class-conditional distributions. Source classes are i.i.d. draws $\tilde{P}_1, \dots, \tilde{P}_\ell \sim \mathcal{D}$, and target classes are new, independent draws $P_1, \dots, P_k \sim \mathcal{D}$.
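One way to make "classes as random objects" concrete is to code the class population $\mathcal{D}$ as a prior over class-conditional distributions and draw both source and target classes from it. The Gaussian prior and the specific dimensions below are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 64  # input dimension, chosen arbitrarily for the sketch

def sample_class():
    """One draw from the class population D: a class-conditional distribution,
    represented here by a random mean and a fixed isotropic spread."""
    return {"mean": rng.normal(scale=2.0, size=DIM), "spread": 1.0}

def sample_from_class(cls, n):
    """Draw n inputs x ~ P from one class-conditional distribution."""
    return cls["mean"] + cls["spread"] * rng.normal(size=(n, DIM))

# Source classes and target classes are i.i.d. draws from the same population D;
# the target classes never appear during pretraining.
source_classes = [sample_class() for _ in range(50)]   # l = 50 source classes
target_classes = [sample_class() for _ in range(5)]    # k = 5 unseen classes
```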
Feature means and within-class variance
Let $f: \mathcal{X} \to \mathbb{R}^p$ be the learned feature map. For any class-conditional distribution $Q$, write the feature mean and the within-class feature variance as

$$\mu_f(Q) = \mathbb{E}_{x \sim Q}[f(x)], \qquad \mathrm{Var}_f(Q) = \mathbb{E}_{x \sim Q}\big[\|f(x) - \mu_f(Q)\|^2\big].$$
At transfer time, we freeze $f$. For each new target class $P_c$, we observe only a few labeled examples $S_c = \{x_{c,1}, \dots, x_{c,n}\} \sim P_c^n$, form the empirical class center $\hat{\mu}_c = \frac{1}{n}\sum_{i=1}^{n} f(x_{c,i})$, and classify by nearest empirical class center:

$$h(x) = \operatorname*{arg\,min}_{c \in [k]} \, \|f(x) - \hat{\mu}_c\|.$$
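In code, the transfer-time learner is just this: push the few labeled target examples through the frozen feature map, average per class, and assign queries to the nearest center. A minimal sketch, with `feature_map` standing in for whatever pretrained network you freeze:

```python
import numpy as np

def nearest_center_classifier(feature_map, support_sets):
    """support_sets: one array per target class, each of shape (n, input_dim).
    Returns a function mapping a batch of inputs to predicted class indices."""
    centers = np.stack([feature_map(s).mean(axis=0) for s in support_sets])  # mu_hat_c

    def predict(x):
        z = feature_map(np.atleast_2d(x))                                    # (m, p) features
        dists = np.linalg.norm(z[:, None, :] - centers[None], axis=-1)       # (m, k)
        return dists.argmin(axis=1)

    return predict

# Toy usage with an identity "feature map"; in practice f is the frozen pretrained network.
rng = np.random.default_rng(4)
f = lambda x: np.asarray(x, dtype=float)
support = [rng.normal(size=(5, 8)) + c for c in range(3)]   # 3 target classes, 5 shots each
predict = nearest_center_classifier(f, support)
print(predict(rng.normal(size=(4, 8)) + 2))                 # queries drawn near class 2
```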
Part II — The geometric quantity that controls transfer
Class-distance normalized variance (CDNV)
For two class-conditionals $Q_i$ and $Q_j$, define the class-distance normalized variance

$$V_f(Q_i, Q_j) = \frac{\mathrm{Var}_f(Q_i) + \mathrm{Var}_f(Q_j)}{2\,\|\mu_f(Q_i) - \mu_f(Q_j)\|^2}.$$
This compares within-class spread to between-class separation in a single scale-free number. Small CDNV is the favorable regime — same-class samples stay concentrated while different class means stay well separated.
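Estimated from finite feature samples, the definition is only a few lines. This is a sketch with names of my choosing; the favorable regime shows up as a small value.

```python
import numpy as np

def cdnv(feats_i, feats_j):
    """Class-distance normalized variance between two classes, estimated from
    feature arrays of shape (n_i, p) and (n_j, p)."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    var_i = ((feats_i - mu_i) ** 2).sum(axis=1).mean()   # estimate of E||f(x) - mu_i||^2
    var_j = ((feats_j - mu_j) ** 2).sum(axis=1).mean()
    return (var_i + var_j) / (2.0 * np.sum((mu_i - mu_j) ** 2))

# Tight, well-separated clusters give a small CDNV (the favorable regime).
rng = np.random.default_rng(2)
a = rng.normal(loc=0.0, scale=0.1, size=(200, 16))
b = rng.normal(loc=1.0, scale=0.1, size=(200, 16))
print(cdnv(a, b))   # roughly 0.01 for these parameters
```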
The double generalization
The stepper below illustrates how CDNV drives the two-stage generalization argument:
Fig. 2. The double generalization in 2D. Step 1: Source training points cluster around empirical class centers. Step 2: With enough samples, this reflects the true class-conditional distributions (confidence ellipses). Step 3: Because classes are i.i.d. from the same population, unseen target classes inherit the same favorable geometry — and a few labeled examples suffice for nearest-center classification.
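The same two steps can be checked numerically in a toy version of the class-population model from Part I: the average CDNV estimated from finite source samples tracks the CDNV of the true source-class distributions, and both track the CDNV of fresh classes drawn from the same population. The Gaussian population and the identity feature map are illustrative assumptions, not part of the theory.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
DIM, SPREAD = 16, 0.3

def cdnv(fi, fj):
    mi, mj = fi.mean(axis=0), fj.mean(axis=0)
    vi = ((fi - mi) ** 2).sum(axis=1).mean()
    vj = ((fj - mj) ** 2).sum(axis=1).mean()
    return (vi + vj) / (2 * np.sum((mi - mj) ** 2))

def sample_class_mean():
    return rng.normal(scale=2.0, size=DIM)                 # one draw from the class population

def draw_features(mean, n):
    return mean + SPREAD * rng.normal(size=(n, DIM))       # identity feature map for simplicity

def avg_pairwise_cdnv(class_means, n_per_class):
    feats = [draw_features(m, n_per_class) for m in class_means]
    return np.mean([cdnv(feats[i], feats[j])
                    for i, j in combinations(range(len(feats)), 2)])

source_means = [sample_class_mean() for _ in range(20)]    # source classes
target_means = [sample_class_mean() for _ in range(20)]    # unseen classes, same population

print("source classes, few samples each :", avg_pairwise_cdnv(source_means, 10))
print("source classes, many samples each:", avg_pairwise_cdnv(source_means, 2000))
print("unseen target classes            :", avg_pairwise_cdnv(target_means, 2000))
```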
Part III — The transfer bound
The main result
The transfer error of a pretrained feature map $f$ is controlled by the average empirical CDNV on source classes, together with terms quantifying the two generalization steps.
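The precise constants and rates belong to the original analysis; as a hedged sketch, a bound of this kind has the following shape. Here $m$ is the number of samples per source class, $\ell$ the number of source classes, $n$ the number of shots per target class, $\hat{P}_i$ the empirical source-class distributions, and each $\varepsilon(\cdot)$ is a placeholder term that shrinks as its argument grows.

$$
\mathbb{E}_{P_1, \dots, P_k \sim \mathcal{D}}\big[\mathrm{err}_{\text{target}}(f)\big]
\;\lesssim\;
\underbrace{\frac{1}{\ell(\ell-1)} \sum_{i \neq j} V_f(\hat{P}_i, \hat{P}_j)}_{\text{avg. empirical CDNV on source classes}}
\;+\;
\underbrace{\varepsilon_{\text{samples}}(m)}_{\text{source samples} \,\to\, \text{source classes}}
\;+\;
\underbrace{\varepsilon_{\text{classes}}(\ell)}_{\text{source} \,\to\, \text{new classes}}
\;+\;
\underbrace{\varepsilon_{\text{shots}}(n)}_{\text{few-shot center estimation}}
$$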
What each term means
The key point is that the bound remains informative even when $n$ is small. The downstream learner doesn't need to discover complicated structure from limited target data. That structure was already built into the feature space during pretraining.
What this does and does not claim
This theory does not claim that every pretrained classifier transfers well to every target task, that exact neural collapse must hold, or that the guarantee applies when source and target classes come from unrelated populations.
What it does say is more precise: if the learned representation exhibits small CDNV on many source training classes, then this geometry first generalizes to the underlying source-class distributions and then, because classes are i.i.d. draws from a common population, to unseen target classes.
Takeaway
Few-shot transfer works because pretraining learns a feature geometry that generalizes twice.
From source samples to source classes. If source training points cluster tightly and class means are well separated, then with enough samples per class this reflects the true geometry of the source-class distributions.
From source classes to unseen classes. Because classes are i.i.d. draws from a common population, favorable geometry on many source classes extends to new target classes.
From geometry to few-shot learning. Once unseen classes also form tight, well-separated clusters, a nearest-center classifier can learn them from only a handful of labeled examples.
The hard part of few-shot learning is therefore not done at transfer time. It is done during pretraining, by shaping a representation whose clustering geometry survives both kinds of generalization.