
Why Pretrained Classifiers Work So Well in Few-Shot Learning

A deeper geometric explanation for why supervised pretraining can make new classes learnable from only a few labels: neural collapse shrinks class clouds, spreads their centers, and leaves behind a feature space where nearest-center classification becomes statistically cheap.

By Tomer Galanti · March 11, 2026 · 14 min read · ◆ Galanti, György, Hutter — JMLR 2026

Introduction

Few-shot transfer has a slightly magical flavor. We train a classifier on a large source problem — say, many ImageNet-like classes — throw away the last layer, freeze the representation, and then classify entirely new classes from only one or five labeled examples. In practice, this simple recipe is often shockingly strong.

The puzzle is mathematical. The source classifier was never asked to separate the target classes. It only saw its own labels. So why should its penultimate layer arrange unseen classes in a way that a nearest-center classifier can recover from a handful of samples?

The answer in the paper is not that pretraining learns every future task. It is subtler and prettier: supervised pretraining can learn a geometry of classes. If features from the same class concentrate tightly, and class means are far apart, then the class identity is almost encoded by the location of its center. Few-shot learning then becomes less like learning a new classifier from scratch and more like placing a few pins on an already organized map.

The main idea: a pretrained classifier transfers well when its feature map makes classes look like small, well-separated clouds. The key quantity is class-distance normalized variance (CDNV): within-class spread divided by squared distance to another class mean. If this quantity is small on many source classes, and those classes are representative of a larger population, then nearest-center classification can succeed on new classes with very few examples.

The mathematical story has four moving parts:

1. Treat classes as random objects. Source and target classes are modeled as independent draws from a common population $\mathcal{D}$ of class-conditional distributions.
2. Use nearest centers as the test-time rule. At transfer time, freeze the representation and estimate each new class center from only $n$ examples.
3. Relate few-shot error to CDNV. A soft-margin NCC loss turns geometric collapse into a controllable classification error.
4. Generalize twice. The source geometry must generalize from finite samples to source distributions, and then from source classes to unseen target classes.
Based on: T. Galanti, A. György, M. Hutter. “Generalization Bounds for Few-Shot Transfer Learning with Pretrained Classifiers”, JMLR 2026 / arXiv 2022.

Interactive — Neural collapse and transfer in 3D

The visualization below is a toy version of the paper's geometry. Drag the epoch slider. The filled source points tighten into clusters, while the class means spread apart. The target points, shown as rings, were never used during pretraining — but in this cartoon they inherit the same geometry. Drag the viewport to rotate.

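[Interactive figure: a rotatable 3D scatter plot of four source classes (src 1 to src 4) and three target classes (tgt A to tgt C), with an epoch slider from 0 to 200 and live readouts of source CDNV, target CDNV, NCC accuracy, and minimum class separation.]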

Fig. 1. 3D feature space during training. As the epoch slider advances, source classes (filled dots, 4 classes × 18 samples) tighten around their class means, and the means spread toward a simplex equiangular tight frame. Target classes (rings, 3 classes × 5 samples) are unseen during training but inherit the same favorable geometry. CDNV drops because within-class variance shrinks while class means separate. From roughly epoch 170 onward, nearest-center classification succeeds from only a few target examples.

Watch the numerator shrink.
Early in training, points from the same class scatter broadly. As NC1 emerges, each class becomes a tight little cloud around its feature mean.
Watch the denominator grow.
Later, class means spread toward a maximally separated configuration. Transfer improves when both effects happen together: small clouds, large gaps.

Part I — A formal model for transfer to new classes

Classes as random objects

To make transfer a theorem rather than a slogan, the paper models classes themselves as random objects. There is an unknown distribution $\mathcal{D}$ over class-conditional distributions. A source task is formed by drawing $\tilde P_1,\dots,\tilde P_\ell \sim \mathcal{D}$. A target task is formed later by drawing fresh classes $P_1,\dots,P_k \sim \mathcal{D}$, independently of the source draw.

Source and target classes
$$\tilde P_1,\dots,\tilde P_\ell \sim \mathcal{D}, \qquad P_1,\dots,P_k \sim \mathcal{D}$$

This assumption is doing real work. Without some shared population of classes, there is no reason for a representation learned on the source task to help on the target task. But the assumption is also modest: $\mathcal{D}$ is not known to the algorithm and does not need a parametric form. It is a mathematical way to say that the source labels are a sample from the kind of classes we hope to see again.

What gets learned, and what gets adapted

Pretraining learns a feature map $f:\mathcal{X}\to\mathbb{R}^p$. At transfer time, $f$ is frozen. For each new class $P_c$, we only get $n$ examples $S_c\sim P_c^n$, and we classify by the nearest empirical class center:

$$h_{f,S}^{\mathrm{NCC}}(x)=\arg\min_{c\in[k]}\|f(x)-\mu_f(S_c)\|$$
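The rule above is short enough to write out. The following is a minimal NumPy sketch, assuming the frozen features have already been computed; the helper names `class_centers` and `ncc_predict` and the toy Gaussian data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def class_centers(support_feats):
    """Mean feature vector of each class; support_feats is a list of (n, p) arrays."""
    return np.stack([s.mean(axis=0) for s in support_feats])  # shape (k, p)

def ncc_predict(query_feats, centers):
    """Assign each row of a (q, p) feature array to the index of its nearest center."""
    # Pairwise Euclidean distances between queries and centers, shape (q, k).
    dists = np.linalg.norm(query_feats[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy stand-in for frozen features: k = 3 classes, p = 2 dimensions, n = 5 shots each.
rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
support = [m + 0.1 * rng.standard_normal((5, 2)) for m in true_means]

centers = class_centers(support)                          # few-shot estimates of the centers
queries = true_means + 0.1 * rng.standard_normal((3, 2))  # one query per class
preds = ncc_predict(queries, centers)
```

Nothing is trained at transfer time: the only target-dependent quantities are the $k$ empirical centers.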

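The object of interest is not the training error on one lucky target task. It is the expected few-shot error over a random future task and over the small random support set used to estimate the centers. Below, $L_P(h)$ denotes the misclassification rate of a classifier $h$ on the $k$-class target task built from $P_1,\dots,P_k$: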

Transfer risk
$$\mathcal{L}_\mathcal{D}(f)= \mathbb{E}_{P_1,\dots,P_k\sim\mathcal{D}} \mathbb{E}_{S_c\sim P_c^n} \big[L_P(h_{f,S}^{\mathrm{NCC}})\big]$$

This is the right quantity for foundation models. The feature map is trained before the target classes are known, and the target sample size $n$ may remain tiny. The theorem asks whether $\mathcal{L}_\mathcal{D}(f)$ can still go to zero as the source problem becomes rich: many source classes $\ell$, many samples per source class $m$, and increasingly collapsed feature geometry.
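To make the two nested expectations concrete, here is a hedged Monte Carlo sketch of the transfer risk. It assumes a toy class population that is not from the paper: each class is an isotropic Gaussian in feature space with mean drawn from $N(0, I_p)$ and within-class standard deviation `sigma`, and the feature map is taken to be the identity. The function name `estimate_transfer_risk` and all default values are illustrative.

```python
import numpy as np

def estimate_transfer_risk(sigma, k=5, n=5, p=16, n_tasks=200, n_query=20, seed=0):
    """Monte Carlo estimate of the expected k-way, n-shot NCC error.

    Assumption (toy model): classes are isotropic Gaussians whose means are
    drawn from N(0, I_p); `sigma` is the within-class standard deviation.
    """
    rng = np.random.default_rng(seed)
    labels = np.arange(k)[:, None]
    errors = []
    for _ in range(n_tasks):                       # outer expectation: random target task
        means = rng.standard_normal((k, p))
        # Inner expectation: random n-shot support set for each class.
        support = means[:, None, :] + sigma * rng.standard_normal((k, n, p))
        centers = support.mean(axis=1)             # (k, p) few-shot centers
        queries = means[:, None, :] + sigma * rng.standard_normal((k, n_query, p))
        # Distance from every query to every center, shape (k, n_query, k).
        dists = np.linalg.norm(
            queries[:, :, None, :] - centers[None, None, :, :], axis=-1
        )
        preds = dists.argmin(axis=-1)
        errors.append((preds != labels).mean())
    return float(np.mean(errors))

risk_scattered = estimate_transfer_risk(sigma=2.0)   # wide class clouds
risk_collapsed = estimate_transfer_risk(sigma=0.1)   # tight class clouds
```

In this toy model the number of shots `n` is held fixed at 5; only the within-class spread changes between the two calls, which is the regime the theorem is about.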


Part II — The geometric quantity that controls transfer

Class-distance normalized variance (CDNV)

For a class-conditional distribution $Q$, write its feature mean and feature variance as

Feature mean and feature variance
$$\mu_f(Q)=\mathbb{E}_{x\sim Q}[f(x)], \qquad \operatorname{Var}_f(Q)=\mathbb{E}_{x\sim Q}\|f(x)-\mu_f(Q)\|^2.$$

For two classes $Q_i$ and $Q_j$, the paper's basic geometric quantity is

$$V_f(Q_i,Q_j)= \frac{\operatorname{Var}_f(Q_i)}{\|\mu_f(Q_i)-\mu_f(Q_j)\|^2}.$$

CDNV is deliberately local and pairwise. The numerator asks: how fuzzy is class $i$ in feature space? The denominator asks: how far is its center from class $j$? The ratio is scale-free, which is important because multiplying all features by a constant should not make transfer magically easier.
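As a small illustration, the empirical version of this ratio can be computed from two arrays of features. This is a sketch under stated assumptions: the function name `cdnv` is hypothetical, the variance uses the plain $1/N$ average, and the two Gaussian clouds are made-up data chosen to show the asymmetry and the scale invariance.

```python
import numpy as np

def cdnv(feats_i, feats_j):
    """Empirical V_f(S_i, S_j): variance of class i over squared distance between means."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    # Mean squared distance of class-i features to their own mean (1/N normalization).
    var_i = np.mean(np.sum((feats_i - mu_i) ** 2, axis=1))
    return var_i / np.sum((mu_i - mu_j) ** 2)

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 8))              # wide cloud near the origin
b = 5.0 + 0.2 * rng.standard_normal((200, 8))  # tight cloud far away

v_ab, v_ba = cdnv(a, b), cdnv(b, a)       # asymmetric: the numerator uses the first class only
v_ab_scaled = cdnv(3.0 * a, 3.0 * b)      # rescaling all features leaves the ratio unchanged
```

Here `v_ab` is much larger than `v_ba` because class `a` is the fuzzier cloud, while `v_ab_scaled` matches `v_ab` up to floating-point error.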

“Neural collapse helps twice: NC1 shrinks the clouds; NC2 pulls the class means apart.”

The ratio is asymmetric because it measures mistakes made by samples from $Q_i$ when competing against the center of $Q_j$. In the final bound, the average is taken over ordered pairs, so both directions matter. This asymmetry is not a cosmetic detail — it is what lets the proof control the pairwise nearest-center error more directly.

The bridge: a soft-margin nearest-center loss

To connect geometry to classification, compare the distance from a point $x\sim Q_i$ to its own center with the distance to the competing center. Define the nearest-center margin

Pairwise NCC margin
$$r_{ij}(x)=\|f(x)-\mu_f(Q_i)\|-\|f(x)-\mu_f(Q_j)\|.$$

If $r_{ij}(x)<0$, the point is closer to its own class center. If $r_{ij}(x)>0$, it is closer to the wrong center. The paper uses a softened version of this mistake indicator, with margin parameter $\Delta>0$:

Soft-margin loss
$$\ell_\Delta(r)= \begin{cases} 0, & r<-\Delta,\\ 1+r/\Delta, & -\Delta\le r\le 0,\\ 1, & r>0. \end{cases}$$

This little loss is the proof's workhorse. It upper-bounds the actual nearest-center mistake, lower-bounds a stricter margin mistake, and is $1/\Delta$-Lipschitz. That last property is what allows concentration tools to compare finite source samples with the underlying class distributions.

$$\mathbf{1}\{r>0\}\le \ell_\Delta(r)\le \mathbf{1}\{r>-\Delta\}, \qquad \ell_\Delta \text{ is } \frac1\Delta\text{-Lipschitz}.$$
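The sandwich and the Lipschitz constant can be checked numerically. The sketch below is only a sanity check on a grid of margin values, with an arbitrary choice $\Delta = 0.5$; the function name `soft_margin_loss` is an illustrative assumption.

```python
import numpy as np

def soft_margin_loss(r, delta):
    """Piecewise-linear ramp: 0 below -delta, 1 above 0, and 1 + r/delta in between."""
    return np.clip(1.0 + np.asarray(r, dtype=float) / delta, 0.0, 1.0)

delta = 0.5
r = np.linspace(-2.0, 2.0, 401)
loss = soft_margin_loss(r, delta)

hard_mistake = (r > 0).astype(float)          # indicator 1{r > 0}
margin_mistake = (r > -delta).astype(float)   # indicator 1{r > -delta}

# Finite-difference slopes; none should exceed the Lipschitz constant 1/delta.
slopes = np.abs(np.diff(loss) / np.diff(r))
```

Clipping the linear function $1 + r/\Delta$ to $[0, 1]$ reproduces the three cases of the definition in one line.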

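Now CDNV enters through a clean geometric implication: when the clouds are small relative to their separation, the soft-margin nearest-center loss is small. Write $\ell_\Delta(f;Q_i,Q_j)$ for the expected soft-margin loss of a point $x\sim Q_i$ against the competing class $Q_j$, with both centers estimated from $n$ samples per class. Very roughly, the paper proves bounds of the form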

$$\ell_\Delta(f;Q_i,Q_j) \;\lesssim\; \left(1+\frac1n\right)V_f(Q_i,Q_j) +\frac1n\,V_f(Q_j,Q_i),$$

provided the margin scale $\Delta$ is chosen below a constant fraction of the class-mean distance. The exact statement has constants and a symmetric center-estimation term, but the intuition is simple: a few-shot center is noisy, yet if the class cloud is tiny compared to the gap between class means, even a noisy center is enough.
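A simplified version of this implication can be worked out in two lines. The sketch below is not the paper's proof: it assumes the true centers are known (so the $1/n$ terms disappear) and uses the hard mistake indicator instead of the soft-margin loss. Write $d=\|\mu_f(Q_i)-\mu_f(Q_j)\|$. If $\|f(x)-\mu_f(Q_i)\|<d/2$, the triangle inequality gives $\|f(x)-\mu_f(Q_j)\|\ge d-\|f(x)-\mu_f(Q_i)\|>d/2$, so $r_{ij}(x)<0$ and $x$ is classified correctly. A mistake therefore forces $\|f(x)-\mu_f(Q_i)\|\ge d/2$, and Markov's inequality applied to the squared distance yields

$$\Pr_{x\sim Q_i}\big[r_{ij}(x)\ge 0\big] \;\le\; \Pr_{x\sim Q_i}\!\left[\|f(x)-\mu_f(Q_i)\|^2\ge \frac{d^2}{4}\right] \;\le\; \frac{4\operatorname{Var}_f(Q_i)}{d^2} \;=\; 4\,V_f(Q_i,Q_j).$$

The constant 4 belongs to this simplified argument only; the paper's statement handles estimated centers and the margin $\Delta$ and carries its own constants.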

The double generalization

The stepper below illustrates the two-level argument. First, the empirical source geometry must reflect the true geometry of the source class-conditionals. Second, because the source classes are themselves random draws from $\mathcal{D}$, the average geometry of many source classes estimates the average geometry of new target classes.

Fig. 2. The double generalization in 2D. Step 1: empirical source points cluster around empirical centers. Step 2: with enough samples per source class, this reflects the source class-conditional distributions. Step 3: with enough source classes, the average pairwise geometry reflects the population $\mathcal{D}$, so unseen target classes inherit the same favorable nearest-center behavior.


Part III — The transfer bound

The theorem in one line

Suppressing universal constants and logarithmic factors, the CDNV version of the transfer bound says that with high probability over the source classes and source samples,

$$\mathcal{L}_\mathcal{D}(f) \;\lesssim\; (k-1)\operatorname{Avg}_{i\ne j}V_f(\tilde S_i,\tilde S_j) + \frac{(k-1)\alpha_f B}{\Lambda} \,\widetilde{O}\!\left(\frac{n^2}{\sqrt m}+\frac1{\sqrt\ell}\right).$$

Here $k$ is the number of target classes, $n$ is the number of target examples per class, $m$ is the number of source examples per class, and $\ell$ is the number of source classes. The quantity $\alpha_f$ is a norm/complexity proxy for the network, $B$ bounds the input radius, and

Minimum empirical source separation
$$\Lambda=\min_{i\ne j}\|\mu_f(\tilde S_i)-\mu_f(\tilde S_j)\|.$$

This display is not the full theorem — the paper keeps the constants, the confidence parameter, the margin scale, and several logarithms. But this is the mathematical shape that matters for intuition.
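The two observable quantities in the display, the average empirical CDNV over ordered source pairs and the minimum separation $\Lambda$, can be computed directly from source features. The following is a minimal sketch, assuming per-class feature arrays are available; the function name `source_geometry`, the $1/N$ variance, and the toy classes on the corners of a square are illustrative assumptions. It does not evaluate the full bound, which also needs $\alpha_f$, $B$, and the paper's constants.

```python
import numpy as np
from itertools import combinations, permutations

def source_geometry(class_feats):
    """Return (average CDNV over ordered pairs i != j, minimum distance between class means)."""
    means = [c.mean(axis=0) for c in class_feats]
    variances = [
        np.mean(np.sum((c - mu) ** 2, axis=1)) for c, mu in zip(class_feats, means)
    ]
    idx = range(len(class_feats))
    # Ordered pairs: V(i, j) and V(j, i) are both included, since CDNV is asymmetric.
    avg_cdnv = np.mean([
        variances[i] / np.sum((means[i] - means[j]) ** 2)
        for i, j in permutations(idx, 2)
    ])
    # Lambda: smallest distance between any two empirical class means.
    min_sep = min(np.linalg.norm(means[i] - means[j]) for i, j in combinations(idx, 2))
    return float(avg_cdnv), float(min_sep)

rng = np.random.default_rng(0)
# Four toy source classes centered at the corners of a square with side 4.
corners = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
loose = [c + 1.0 * rng.standard_normal((500, 2)) for c in corners]
tight = [c + 0.1 * rng.standard_normal((500, 2)) for c in corners]

cdnv_loose, sep_loose = source_geometry(loose)
cdnv_tight, sep_tight = source_geometry(tight)
```

Shrinking the clouds lowers the average CDNV sharply while leaving the minimum separation near the side length of the square, so the two quantities track different aspects of the geometry.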

What each term means

Empirical CDNV — the observable geometric term
This is the term pretraining directly improves. Small source CDNV says that source training classes are already arranged as tight, separated clouds. In the neural-collapse picture, this is the measurable certificate that the representation has become few-shot-friendly.
$1/\sqrt{m}$ — samples → distributions
More examples per source class make empirical class means and variances reliable estimates of the true source class-conditionals. This is the usual sample-level concentration step.
$1/\sqrt{\ell}$ — source classes → target classes
More source classes improve the estimate of the class population $\mathcal{D}$. This term cannot be reduced by adding more samples per class: if each class consisted only of duplicates of the same images, increasing $m$ would teach us nothing about new classes.
$\Lambda$ — the price of small margins
If two empirical class means are too close, nearest-center classification is fragile. Neural collapse helps by pushing class means toward a well-spread configuration, effectively increasing the margin scale available to the theorem.
$\alpha_f\,B$ — complexity and scale
The representation cannot wiggle arbitrarily. The bound pays for the network norm and the input radius because concentration for composed neural features depends on the size of the function class.

Why this is genuinely few-shot

The striking part is the role of $n$. Ordinary learning guarantees usually need the number of target samples to grow. Here, $n$ can remain a small constant. If the average empirical CDNV goes to zero during pretraining, and if $m$ and $\ell$ grow so the two generalization terms vanish, the transfer risk can converge to zero even though the target task still supplies only a few examples per class.

“The target examples do not learn the representation. They only locate the new class centers inside a representation that was already organized.”

The somewhat awkward $n^2$ dependence in the simplified bound is not the philosophical point. The paper explicitly treats it as a proof artifact and focuses on the small-$n$ regime. The important message is that few-shot transfer is possible because the hard statistical work has moved upstream into source pretraining.

What the proof is doing, without the proof

At a high level, the proof is a chain of reductions:

I. Reduce $k$-class error to pairwise errors. A target mistake means some wrong class center beats the correct one, so a union bound introduces the factor $(k-1)$.
II. Compare target pairs to source pairs. Because source and target classes are i.i.d. from $\mathcal{D}$, average pairwise behavior over many source classes estimates future pairwise behavior.
III. Estimate source-pair losses from finite samples. The soft-margin loss is Lipschitz, so Rademacher-style concentration controls the gap between empirical and population pairwise losses.
IV. Upper-bound soft-margin loss by CDNV. This final geometric step converts neural collapse into a concrete few-shot transfer guarantee.
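The first reduction can be written out explicitly. As a sketch in the notation of the NCC rule above, with ties counted as mistakes: if a point $x$ from class $c$ is misclassified, then at least one competing center is at least as close as the correct one, so

$$\mathbf{1}\big\{h_{f,S}^{\mathrm{NCC}}(x)\ne c\big\} \;\le\; \sum_{j\ne c}\mathbf{1}\big\{\|f(x)-\mu_f(S_c)\|\ge\|f(x)-\mu_f(S_j)\|\big\}.$$

Taking expectations bounds the $k$-class error by a sum of $k-1$ pairwise errors, that is, by $(k-1)$ times their average, which is where the factor in front of the bound comes from.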

That is the whole mechanism. The theorem does not require the target learner to discover a complicated classifier. It only requires the target support set to estimate a few class centers in a geometry where centers already carry the class identity.


What this does and does not claim

This theory does not say that every pretrained classifier transfers to every possible target task. The source and target classes must be meaningfully modeled as draws from the same class population. If the target classes come from a different world, the theorem has no reason to apply.

It also does not say that exact neural collapse is necessary, or that NCC is always the best downstream classifier. The paper uses NCC because it makes the geometry transparent and because NC3/NC4 suggest that trained linear heads become closely related to nearest-center rules. In practice, linear probes, ridge regression, or logistic regression may do better. The theorem is explaining why a very simple rule can already work.

Finally, the bound is a generalization bound, not a sharp prediction of test accuracy. It has constants, logarithms, norm factors, and worst-case margins. Its value is conceptual: it identifies a route by which a single supervised classifier can produce a representation that transfers in the few-shot regime.

Takeaway

Few-shot transfer works when pretraining learns a class geometry that survives two kinds of randomness.

Geometry: neural collapse makes each class a small cloud and pushes class means apart. CDNV measures exactly this ratio: cloud size divided by squared inter-center distance.

Statistics: the geometry observed on finite source samples must generalize to the true source classes, and the average over source classes must generalize to new classes drawn from the same population.

Few-shot adaptation: once unseen classes inherit this geometry, a few labeled examples are enough to estimate class centers. The target learner is not discovering the representation; it is placing names on an already structured feature space.

The cute picture is this: pretraining turns the feature space into a constellation. Few-shot learning only has to identify which little cluster is which.

