Pre-Training · Few-Shot Learning · Neural Collapse

Why Pretrained Classifiers Work So Well in Few-Shot Learning

A geometric explanation for why ordinary supervised pretraining transfers remarkably well to new classes with only a few labeled examples — and the precise quantity that controls when this works.

By Tomer Galanti · March 11, 2026 · 14 min read · Galanti, György, Hutter — arXiv 2022

Introduction

Deep networks trained on large classification benchmarks such as ImageNet often transfer surprisingly well to new tasks. One trains on a large source dataset, freezes the learned representation, fits a simple head on only a handful of labeled examples from new classes, and the result often works remarkably well.

From a theoretical perspective, however, this is not obvious. A classifier trained on ImageNet is optimized to separate the ImageNet classes. Why should the same representation make it easy to separate new classes that never appeared during training? And why should only a few labeled examples be enough?

The main idea: few-shot transfer is not mysterious once the right geometric quantity is identified. What matters is not just that source classes are separated, but that their within-class variability is small relative to the distance between class means. If this geometry generalizes from source samples to source classes, and from source classes to unseen target classes, then few-shot learning becomes possible.

The argument has three parts:

1. Classes as random draws. Model source and target classes as i.i.d. draws from a common population $\mathcal{D}$, so transfer to unseen classes becomes a well-posed question.
2. Measure CDNV. The class-distance normalized variance compares within-class spread to between-class separation — the single quantity that controls transfer.
3. Double generalization. Small source CDNV generalizes twice: from samples to source distributions, then from source classes to unseen target classes.
Based on: T. Galanti, A. György, M. Hutter. “Generalization Bounds for Few-Shot Transfer Learning with Pretrained Classifiers”, arXiv 2022.

Interactive — Neural collapse and transfer in 3D

The visualization below shows neural collapse unfolding in a 3D feature space. Drag the epoch slider to watch source-class features (filled dots) tighten into clusters while target-class features (rings) inherit the same geometry. Drag the viewport to rotate.


Fig. 1. 3D feature space during training. Source classes (filled dots, 4 classes × 18 samples) collapse toward an equiangular tight frame as the epoch slider advances. Target classes (rings, 3 classes × 5 samples) are unseen during training but inherit the same favorable geometry. CDNV drops as within-class variance shrinks and class means separate. At epoch ~170+, nearest-center classification achieves 100% on target classes.

Watch the collapse.
Drag the epoch slider from 0 to ~120. Source features tighten into distinct clusters (NC1). Then from ~120 to 200, class means spread toward the vertices of a regular tetrahedron (NC2).
Watch the transfer.
Target classes (rings) were never seen during training — yet they cluster and separate in lockstep with source classes. By epoch ~170, nearest-center classification hits 100% from just 5 examples per target class.

Part I — A formal model for transfer to new classes

Classes as random objects

To reason about transfer to new classes, we need a model relating source and target tasks. The key step is to treat classes themselves as random objects. There is an unknown distribution $\mathcal{D}$ over a collection $\mathcal{E}$ of class-conditional distributions. Source classes are i.i.d. draws $\tilde{P}_1, \dots, \tilde{P}_\ell \sim \mathcal{D}$, and target classes are new, independent draws $P_1, \dots, P_k \sim \mathcal{D}$.
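This model can be made concrete with a small simulation. The sketch below is a hypothetical instantiation (not from the paper): the population $\mathcal{D}$ is "draw a mean vector, attach a unit-variance Gaussian around it", so sampling a class means sampling a mean, and sampling from a class means sampling points around that mean. The helper names `sample_class` and `sample_from_class` are my own.

```python
import numpy as np

# Hypothetical sketch: model the population D over class-conditionals as
# "draw a random mean mu, attach the Gaussian class-conditional N(mu, I)".
rng = np.random.default_rng(0)

def sample_class(p=3, spread=5.0):
    """Draw one class from D; a class is identified with the mean
    of its Gaussian class-conditional N(mu, I)."""
    return rng.normal(scale=spread, size=p)

def sample_from_class(mu, n):
    """Draw n i.i.d. examples from the class-conditional N(mu, I)."""
    return mu + rng.normal(size=(n, mu.shape[0]))

# Source and target classes are independent draws from the same D,
# which is exactly what makes transfer a well-posed question.
source_means = [sample_class() for _ in range(4)]   # l = 4 source classes
target_means = [sample_class() for _ in range(3)]   # k = 3 unseen target classes
```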

Feature means and within-class variance

Let $f: \mathcal{X} \to \mathbb{R}^p$ be the learned feature map. For any class-conditional distribution $Q$:

Population feature mean and variance
$$\mu_f(Q) := \mathbb{E}_{x \sim Q}[f(x)], \qquad \text{Var}_f(Q) := \mathbb{E}_{x \sim Q}\|f(x) - \mu_f(Q)\|^2$$
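Both population quantities have direct empirical analogues computed from a finite feature matrix. A minimal NumPy sketch (the helper names `feature_mean` and `feature_var` are illustrative, not from the paper):

```python
import numpy as np

def feature_mean(feats):
    """Empirical estimate of mu_f(Q) from a (n, p) array of features f(x), x ~ Q."""
    return feats.mean(axis=0)

def feature_var(feats):
    """Empirical estimate of Var_f(Q) = E ||f(x) - mu_f(Q)||^2:
    the mean squared distance of features to their empirical center."""
    mu = feats.mean(axis=0)
    return np.mean(np.sum((feats - mu) ** 2, axis=1))
```

Note that this is the total variance (trace of the feature covariance), a scalar, matching the squared-norm definition above.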

At transfer time, we freeze $f$. For each new target class $P_c$, we observe only a few labeled examples $S_c \sim P_c^n$ and classify by nearest empirical class center:

$$h(x) = \arg\min_{c \in [k]} \|f(x) - \mu_f(S_c)\|$$
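The nearest-class-center rule above is simple enough to sketch end to end. The snippet below (illustrative helper names, features assumed already extracted by a frozen $f$) builds $h$ from the few-shot support sets and classifies queries by distance to the nearest empirical center:

```python
import numpy as np

def nearest_center_classifier(support_feats):
    """support_feats: list with one (n_c, p) feature array per target class.
    Returns h with h(feats) = argmin_c ||f(x) - mu_f(S_c)|| for each query row."""
    centers = np.stack([S.mean(axis=0) for S in support_feats])  # (k, p)

    def h(feats):
        # Pairwise distances from each query feature to each class center.
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        return d.argmin(axis=1)  # predicted class index per query
    return h
```

No parameters are trained at transfer time; the only "learning" is averaging $n$ support features per class.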

Part II — The geometric quantity that controls transfer

Class-distance normalized variance (CDNV)

For two class-conditionals $Q_i$ and $Q_j$, define:

$$V_f(Q_i, Q_j) = \frac{\text{Var}_f(Q_i) + \text{Var}_f(Q_j)}{2\,\|\mu_f(Q_i) - \mu_f(Q_j)\|^2}$$

This compares the average within-class spread to the between-class separation in a single scale-free number. Small CDNV is the favorable regime: same-class samples stay concentrated while different class means stay well separated.
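Estimated from finite feature arrays, the CDNV is a few lines of NumPy. This sketch uses the symmetric form that averages the two within-class variances (the function name `cdnv` is my own):

```python
import numpy as np

def cdnv(feats_i, feats_j):
    """Empirical class-distance normalized variance between two classes,
    each given as an (n, p) array of features under the frozen map f."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    var_i = np.mean(np.sum((feats_i - mu_i) ** 2, axis=1))
    var_j = np.mean(np.sum((feats_j - mu_j) ** 2, axis=1))
    # Average within-class variance over squared between-mean distance.
    return (var_i + var_j) / (2.0 * np.sum((mu_i - mu_j) ** 2))
```

Because numerator and denominator are both squared lengths in feature space, rescaling all features by a constant leaves the value unchanged, which is what makes the quantity scale-free.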

“CDNV is improved by neural collapse from both sides: NC1 shrinks the numerator, NC2 expands the denominator.”

The double generalization

The stepper below illustrates how CDNV drives the two-stage generalization argument:

Fig. 2. The double generalization in 2D. Step 1: Source training points cluster around empirical class centers. Step 2: With enough samples, this reflects the true class-conditional distributions (confidence ellipses). Step 3: Because classes are i.i.d. from the same population, unseen target classes inherit the same favorable geometry — and a few labeled examples suffice for nearest-center classification.


Part III — The transfer bound

The main result

The transfer error of a pretrained feature map $f$ is controlled by the average empirical CDNV on source classes, together with terms quantifying the two generalization steps:

$$\mathcal{L}_{\mathcal{D}}(f) \;\lesssim\; k\,\text{Avg}_{i \neq j}\, V_f(\tilde{S}_i, \tilde{S}_j) \;+\; \frac{k\,\mathcal{C}(f)}{\min_{i \neq j}\|\mu_f(\tilde{S}_i) - \mu_f(\tilde{S}_j)\|}\left(\frac{n^2}{\sqrt{m}} + \frac{1}{\sqrt{\ell}}\right)$$

What each term means

Geometric term — $k \cdot \text{Avg}\, V_f$
Small when source training classes are tightly clustered relative to their separation. This is the observable signature of favorable few-shot transfer — visible in the 3D visualization as CDNV drops.
$1/\sqrt{m}$ — first generalization: samples → distributions
More samples per source class means empirical clustering better reflects the true geometry of the underlying source-class distributions.
$1/\sqrt{\ell}$ — second generalization: source → target classes
More source classes means the average geometry better reflects the full population $\mathcal{D}$, letting the argument extend to unseen target classes.
Min class distance
The minimum pairwise distance between source class means captures worst-case separation. Neural collapse pushes this distance up: among configurations of unit-norm means, the equiangular tight frame maximizes it.
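Both empirical quantities appearing in the bound can be read off directly from per-class feature arrays. A diagnostic sketch (helper name `transfer_diagnostics` is my own; it uses the symmetric CDNV averaging both within-class variances):

```python
import numpy as np
from itertools import combinations

def transfer_diagnostics(class_feats):
    """class_feats: list with one (n_c, p) feature array per source class.
    Returns the two empirical quantities in the bound:
    the average pairwise CDNV and the minimum distance between class means."""
    mus = [F.mean(axis=0) for F in class_feats]
    varis = [np.mean(np.sum((F - m) ** 2, axis=1))
             for F, m in zip(class_feats, mus)]
    cdnvs, dists = [], []
    for i, j in combinations(range(len(class_feats)), 2):
        d2 = np.sum((mus[i] - mus[j]) ** 2)
        dists.append(np.sqrt(d2))
        cdnvs.append((varis[i] + varis[j]) / (2.0 * d2))
    return float(np.mean(cdnvs)), float(min(dists))
```

Tracking these two numbers during pretraining is a cheap way to see whether the representation is entering the favorable regime: the bound tightens as the first shrinks and the second grows.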

The key point is that the bound remains informative even when $n$ is small. The downstream learner doesn't need to discover complicated structure from limited target data. That structure was already built into the feature space during pretraining.


What this does and does not claim

This theory does not say that every pretrained classifier transfers well to every target task, nor that exact neural collapse must hold perfectly. Nor does it apply when source and target classes come from unrelated populations: the assumption that both are i.i.d. draws from a common population $\mathcal{D}$ is what licenses the second generalization step.

What it does say is more precise: if the learned representation exhibits small CDNV on many source training classes, then this geometry first generalizes to the underlying source-class distributions and then, because classes are i.i.d. draws from a common population, to unseen target classes.

Takeaway

Few-shot transfer works because pretraining learns a feature geometry that generalizes twice.

From source samples to source classes. If source training points cluster tightly and class means are well separated, then with enough samples per class this reflects the true geometry of the source-class distributions.

From source classes to unseen classes. Because classes are i.i.d. draws from a common population, favorable geometry on many source classes extends to new target classes.

From geometry to few-shot learning. Once unseen classes also form tight, well-separated clusters, a nearest-center classifier can learn them from only a handful of labeled examples.

The hard part of few-shot learning is therefore not done at transfer time. It is done during pretraining, by shaping a representation whose clustering geometry survives both kinds of generalization.

