Self-Supervised Learning · Neural Collapse · Few-Shot Theory

Directional Neural Collapse

Frozen self-supervised features can transfer from a handful of labels, even when their class clusters look wide, anisotropic, and nothing like classical neural collapse. The trick is beautifully geometric: decisions only care about variance along one line, and SSL quietly suppresses exactly that direction.

By Tomer Galanti · March 3, 2026 · 14 min read · ◆ Luthra, Salunkhe, Galanti — ICML 2026

Introduction

A frozen self-supervised encoder is a strange little miracle. It was never told what a dog, chair, texture, or shape is. Yet after pretraining, attach a nearest-centroid classifier or a tiny linear probe on top, give it only a few labeled examples per class, and it often behaves as if the relevant categories were already drawn in the feature space.

For supervised classifiers, we have a satisfying story: neural collapse. Late in training, examples from the same class concentrate near their class mean, different class means spread apart in a nearly symmetric pattern, and the last-layer classifier aligns with those means. This is exactly the geometry a nearest-class-center rule wants. But self-supervised learning (SSL) is not trained with those labels. It has no explicit reason to compress every semantic class into a tiny ball. Measured by the usual global clustering statistics, SSL features remain broad and highly anisotropic — and still transfer beautifully. So the classical ruler is reading the wrong thing.

The resolution in one line. For two classes, a nearest-centroid decision only depends on the projection onto the line joining their means. SSL need not collapse a whole class cloud; it only has to shrink the variance in this decision direction. The rest of the cloud can stay fluffy.

The paper turns this observation into a compact theory built around one quantity: directional CDNV, the class-distance-normalized variance after projecting onto the decision axis. It explains two phenomena that otherwise look unrelated:

I. The anisotropy puzzle. Classical CDNV measures total within-class scatter, so it punishes harmless nuisance directions and badly underestimates SSL transfer geometry.

II. Only one direction matters. The nearest-class-centroid (NCC) decision is a one-dimensional margin test after projection onto the line between class means. Directional CDNV measures exactly that margin noise.

III. A sharp few-shot bound. The few-shot NCC error is controlled by $4\tilde V$ plus explicit centroid-estimation and fourth-moment corrections; the leading constant cannot be improved without distributional assumptions.

IV. Many tasks, forced orthogonal. Small directional CDNV across independent labelings forces their decision axes apart, giving a geometric reason multitask transfer can coexist inside one frozen encoder.

Based on: A. Luthra, Y. Salunkhe, T. Galanti. “Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning”, ICML 2026.

Part I — The anisotropy puzzle

Start with the classical quantity used to certify few-shot transfer: class-distance-normalized variance (CDNV). For two classes, it compares how much the feature clouds spread inside each class to how far apart their means are:

Classical CDNV $$V_{ij} = \frac{v_i + v_j}{\|\mu_i-\mu_j\|^2}, \qquad v_c = \mathbb{E}_{x\sim D_c}\,\|f(x)-\mu_c\|^2$$

Small $V_{ij}$ says that the two clouds are tight compared with the gap $d_{ij}=\|\mu_i-\mu_j\|$. In supervised training, this is the familiar neural-collapse picture: $v_c$ shrinks, the class means remain separated, and a nearest-class-centroid (NCC) classifier becomes almost inevitable. A few labeled examples are enough because the representation has already done the hard geometric work.
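To make the definition concrete, here is a minimal NumPy sketch that estimates classical CDNV from two arrays of features. The function name `classical_cdnv` and the toy Gaussian clouds are illustrative assumptions, not code from the paper.

```python
import numpy as np

def classical_cdnv(feats_i, feats_j):
    """Estimate V_ij = (v_i + v_j) / ||mu_i - mu_j||^2 from two (n, dim) arrays."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    # v_c: mean squared distance of class-c features to their class mean
    v_i = np.mean(np.sum((feats_i - mu_i) ** 2, axis=1))
    v_j = np.mean(np.sum((feats_j - mu_j) ** 2, axis=1))
    return (v_i + v_j) / np.sum((mu_i - mu_j) ** 2)

# Toy data (assumed): two isotropic unit-variance clouds in 2-D, means 3 apart.
rng = np.random.default_rng(0)
a = rng.normal(size=(5000, 2))
b = rng.normal(size=(5000, 2)) + np.array([3.0, 0.0])

# Population value for this toy setup is (2 + 2) / 9.
print(f"estimated classical CDNV: {classical_cdnv(a, b):.3f}")
```

Every coordinate contributes to $v_c$ here, which is exactly why the statistic grows with nuisance variance regardless of where the decision boundary sits.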

SSL breaks this story in the most interesting way. Without labels, there is no obvious force making every semantic class collapse into a small Euclidean ball. Instead, SSL representations are anisotropic: they often keep large variance in directions corresponding to style, pose, background, augmentation, or other nuisance factors. Classical CDNV adds all of those directions into $v_c$, so it can remain large even when the representation is perfectly organized for the downstream decision. The transfer succeeds; the scalar summary is too blunt.

“A class cloud can be huge and still easy to classify — provided it is thin in the one direction where the boundary lives.”

Part II — Only one direction matters

Fix two classes $i$ and $j$. The NCC rule compares the two squared distances, but after expanding the algebra almost everything cancels. A point from class $i$ is misclassified as $j$ exactly when its projection along the line from $\mu_i$ to $\mu_j$ crosses the midpoint. All orthogonal coordinates vanish from the decision.

NCC is a one-dimensional margin test $$\|f(x)-\mu_j\|^2 \le \|f(x)-\mu_i\|^2 \quad\Longleftrightarrow\quad u_{ij}^{\top}\!\big(f(x)-\mu_i\big) \ge \frac{\|\mu_i-\mu_j\|}{2}$$

This identity tells us the right variance to measure. Writing $\Sigma_i$ for the within-class covariance of class $i$ (so that $v_i=\mathrm{tr}\,\Sigma_i$), the decision axis is the unit vector pointing from one mean to the other, and directional CDNV is the variance of class $i$ after projecting onto that axis, normalized by the squared gap:

Decision axis & directional CDNV $$u_{ij} = \frac{\mu_j-\mu_i}{\|\mu_j-\mu_i\|}, \qquad \tilde V_{ij} = \frac{u_{ij}^\top \Sigma_i\, u_{ij}}{\|\mu_i-\mu_j\|^2}$$

So $\tilde V_{ij}$ is not merely another clustering score; it is the second moment of the actual margin variable. Variance orthogonal to $u_{ij}$ can make the cluster visually enormous, but it cannot move the point through the NCC boundary. That is the geometry the figure below makes concrete — and the explorer after it lets you feel it.

Fig. 1 — Anisotropic clusters. The clouds are tall, so classical CDNV sees a lot of variance. But the boundary is vertical and the decision axis is horizontal; only horizontal spread can cause mistakes. Directional CDNV reads that narrow projection and ignores the harmless height.
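The picture in Fig. 1 can be reproduced numerically. The sketch below is a toy check under assumed settings (2-D Gaussian clouds with standard deviation 0.5 along the axis and 2.0 across it, gap 3.0); the helper `pair_geometry` is a hypothetical name. It computes both statistics and confirms that the NCC rule and the one-dimensional margin test make the same decisions.

```python
import numpy as np

def pair_geometry(feats_i, feats_j):
    """Return (classical CDNV, directional CDNV of class i, axis u_ij, gap d_ij)."""
    mu_i, mu_j = feats_i.mean(axis=0), feats_j.mean(axis=0)
    gap = np.linalg.norm(mu_j - mu_i)
    u_ij = (mu_j - mu_i) / gap                      # unit decision axis
    v_i = np.mean(np.sum((feats_i - mu_i) ** 2, axis=1))
    v_j = np.mean(np.sum((feats_j - mu_j) ** 2, axis=1))
    proj_i = (feats_i - mu_i) @ u_ij                # margin variable of class i
    v_classical = (v_i + v_j) / gap**2
    v_directional = np.mean(proj_i**2) / gap**2     # u^T Sigma_i u / d^2
    return v_classical, v_directional, u_ij, gap

# Assumed toy setup: thin along the axis (x), tall in the nuisance direction (y).
rng = np.random.default_rng(0)
n = 20000
scales = np.array([0.5, 2.0])
class_i = rng.normal(size=(n, 2)) * scales
class_j = rng.normal(size=(n, 2)) * scales + np.array([3.0, 0.0])

v_cls, v_dir, u_ij, gap = pair_geometry(class_i, class_j)

# NCC decision on class-i points versus the equivalent margin test.
mu_i, mu_j = class_i.mean(axis=0), class_j.mean(axis=0)
ncc_picks_j = (np.sum((class_i - mu_j) ** 2, axis=1)
               <= np.sum((class_i - mu_i) ** 2, axis=1))
margin_crossed = (class_i - mu_i) @ u_ij >= gap / 2

print(f"classical CDNV   = {v_cls:.3f}")
print(f"directional CDNV = {v_dir:.4f}")
print(f"decisions agree on {np.mean(ncc_picks_j == margin_crossed):.4%} of points")
print(f"NCC error on class i = {ncc_picks_j.mean():.4f}, 4 * directional = {4 * v_dir:.3f}")
```

In this setup the classical statistic is close to 1 while the directional one is a few hundredths, even though both describe the same two clouds.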

[Interactive figure: decision-axis explorer. Controls: gap between means d (default 3.0), variance along the axis σ∥ (default 0.50), nuisance variance σ⊥ (default 2.0). Readouts: classical CDNV (all directions), directional CDNV (along axis), true NCC error (known centroids), and the bound 4·directional CDNV.]

Fig. 2. Drag nuisance variance up: the clusters balloon vertically, classical CDNV climbs, but the true error and directional CDNV do not budge. Drag variance along axis up: the clusters overlap at the boundary and the error rises in lockstep with directional CDNV. The error is a Gaussian tail along the axis; the $4\tilde V$ value is the distribution-free Cantelli bound.
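The behavior described in the caption can be tabulated without the widget. This sketch assumes a 2-D Gaussian model in which both classes share the same shape and the slider values are standard deviations (the page does not say whether they are variances or standard deviations); the function names are illustrative.

```python
import math

def gaussian_ncc_error(d, sigma_par):
    """P(N(0, sigma_par^2) >= d/2): known-centroid NCC error for a Gaussian margin."""
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0) * sigma_par))

def cdnv_pair(d, sigma_par, sigma_perp):
    """(classical, directional) CDNV for two identical 2-D Gaussian clouds."""
    v_classical = 2.0 * (sigma_par**2 + sigma_perp**2) / d**2
    v_directional = sigma_par**2 / d**2
    return v_classical, v_directional

d = 3.0
print("sigma_par sigma_perp  classical  directional  true_err  4*directional")
for sigma_par, sigma_perp in [(0.5, 0.5), (0.5, 2.0), (0.5, 8.0), (1.0, 2.0), (1.5, 2.0)]:
    v_cls, v_dir = cdnv_pair(d, sigma_par, sigma_perp)
    err = gaussian_ncc_error(d, sigma_par)
    print(f"{sigma_par:9.2f} {sigma_perp:10.2f} {v_cls:10.3f} {v_dir:12.4f} "
          f"{err:9.5f} {4 * v_dir:14.4f}")
```

In the first three rows only the nuisance spread changes, so the classical column grows while the error and directional columns stay fixed. In the last two rows the on-axis spread grows and the error rises with it.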

Part III — A sharp few-shot bound

The main theorem turns the one-dimensional margin picture into a finite-shot guarantee. Suppose each target class is represented by only $m$ labeled examples, so the class means used by NCC are empirical centroids rather than population means. Then the average few-shot NCC error over $C'$ classes splits into two pieces: the intrinsic decision-axis overlap, and the extra noise introduced by estimating the centroids from few samples.

Theorem 4.1 — finite-shot NCC bound $$\mathrm{err}^{\mathrm{NCC}}_{m,\mathcal{C}}(f) \;\le\; \underbrace{\frac{1}{C'}\sum_{i}\sum_{j\ne i}\frac{4\,\tilde V_{ij}}{\big(1+\tfrac{v_j-v_i}{m\,d_{ij}^2}\big)^2}}_{\text{decision-axis term}} \;+\; \underbrace{\frac{1}{C'}\sum_{i}\sum_{j\ne i}\frac{\big(\sqrt{E^1_{ij}}+\sqrt{E^2_{ij}}+\sqrt{E^3_{ij}}\,\big)^2}{\big(1+\tfrac{v_j-v_i}{m\,d_{ij}^2}\big)^2}}_{\text{finite-shot remainder}}$$

The theorem keeps the correction terms explicit, which is useful because each one corresponds to a real statistical cost. Let $\Theta_{ij}=\big(M_{4,i}+M_{4,j}\big)/d_{ij}^4$ be the normalized fourth moment, with $M_{4,i}=\mathbb{E}\|f(x)-\mu_i\|^4$. Then the finite-shot price decomposes as:

$$E^{1}_{ij}=\frac{4}{m}\Big(V_{ij}^2+\tfrac14 V_{ij}\Big),\qquad E^{2}_{ij}=\frac{V_{ij}}{m},\qquad E^{3}_{ij}=\frac{\Theta_{ij}+2(m-1)V_{ij}^2}{m^3}$$

Here is the intuition. $E^{2}_{ij}\asymp V_{ij}/m$ is the ordinary centroid-estimation cost: with few shots, the empirical class mean wiggles. $E^{1}_{ij}$ is a quadratic correction coming from the interaction between class spread and the random centroid. $E^{3}_{ij}$ is the higher-moment term that protects the theorem from heavy tails. The important asymmetry is that the leading term is directional CDNV $\tilde V_{ij}$, while the finite-shot corrections depend on the coarser global CDNV $V_{ij}$ but shrink with $m$. In the many-shot limit, all that remains is the sharp directional certificate.
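To see how the pieces scale, the following sketch evaluates one pairwise summand of Theorem 4.1 from the formulas above. Two simplifications are assumptions of this example: the class variances are taken equal ($v_i=v_j$, so the denominator is 1 by default), and the normalized fourth moment $\Theta_{ij}=3.0$ is a hypothetical value chosen only for illustration.

```python
import math

def finite_shot_pair_terms(v_dir, v_cls, theta, m, var_gap_ratio=0.0):
    """One (i, j) summand of the bound: returns (decision-axis term, remainder).

    var_gap_ratio stands for (v_j - v_i) / d_ij^2; 0.0 assumes equal class variances.
    """
    e1 = (4.0 / m) * (v_cls**2 + 0.25 * v_cls)
    e2 = v_cls / m
    e3 = (theta + 2.0 * (m - 1) * v_cls**2) / m**3
    denom = (1.0 + var_gap_ratio / m) ** 2
    leading = 4.0 * v_dir / denom
    remainder = (math.sqrt(e1) + math.sqrt(e2) + math.sqrt(e3)) ** 2 / denom
    return leading, remainder

# Directional CDNV 0.05 and classical CDNV 1.0, with a hypothetical theta.
for m in (1, 5, 20, 100, 1000):
    leading, remainder = finite_shot_pair_terms(0.05, 1.0, theta=3.0, m=m)
    print(f"m={m:5d}  leading={leading:.3f}  remainder={remainder:.4f}")
```

With these inputs the leading term stays at 0.2 for every $m$, while the remainder is about 0.63 at $m=20$ and keeps shrinking as shots are added. When classical CDNV is large, the corrections can dominate at small $m$ even though the directional floor is low.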

The constant 4 is optimal

In the known-centroid limit, the pairwise NCC error is a one-sided tail event for the scalar random variable $u_{ij}^{\top}(f(x)-\mu_i)$. Its variance is $d_{ij}^2\tilde V_{ij}$ and the mistake threshold is $d_{ij}/2$. Cantelli's inequality gives $p^{\mathrm{NCC}}_{i\to j}\le \tfrac{4\tilde V_{ij}}{1+4\tilde V_{ij}}\le 4\tilde V_{ij}$, and a two-point construction shows that no distribution-free bound using only second moments can improve the leading factor. The $4$ is not proof slack; it is the geometry of one-sided deviation.
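The tightness claim can be checked with the textbook extremal distribution for Cantelli's inequality. This is a sketch under the assumption that a standard two-point law is representative; the paper's own construction may differ in detail. The margin variable takes the value $d/2$ with probability $p$ and a small negative value otherwise, with mean 0 and variance $d^2\tilde V$.

```python
import numpy as np

def cantelli_bound(v_dir):
    """Cantelli bound on the pairwise error: 4V / (1 + 4V)."""
    return 4.0 * v_dir / (1.0 + 4.0 * v_dir)

def two_point_margin(v_dir, d=1.0):
    """Two-point law with mean 0 and variance d^2 * v_dir that attains the bound."""
    t = d / 2.0                       # mistake threshold
    var = d**2 * v_dir                # variance of the margin variable
    p = var / (var + t**2)            # mass placed exactly at the threshold
    values = np.array([t, -var / t])
    probs = np.array([p, 1.0 - p])
    return values, probs, t

v_dir = 0.05
values, probs, t = two_point_margin(v_dir)
mean = probs @ values
variance = probs @ values**2 - mean**2
error = probs[values >= t].sum()      # probability of crossing the midpoint

print(f"mean={mean:.2e}  variance={variance:.4f}  error={error:.4f}")
print(f"Cantelli bound={cantelli_bound(v_dir):.4f}  4*V={4 * v_dir:.4f}")

# The ratio error / (4V) equals 1 / (1 + 4V), which tends to 1 as V shrinks.
for v in (0.1, 0.01, 0.001):
    vals, pr, thr = two_point_margin(v)
    print(f"V={v:<6} error/(4V) = {pr[vals >= thr].sum() / (4 * v):.4f}")
```

Because this law matches the prescribed mean and variance and its error equals the Cantelli value, any bound that depends only on those two moments must be at least $4\tilde V/(1+4\tilde V)$, so the factor 4 cannot be lowered.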

The explorer below decomposes the bound. Its moral is simple: more shots help estimate centroids, but they do not repair a representation whose class clouds overlap along the decision axis. The asymptotic floor is $4\tilde V$; the rest is the price of learning the centers from a tiny support set.

[Interactive figure: few-shot bound decomposition. Controls: directional CDNV Ṽ (default 0.05), classical CDNV V (default 1.00), shots per class m (default 20). Components shown: 4·Ṽ (decision axis), centroid estimation, tail / higher order. The leading term is a floor; the corrections melt with shots.]

Fig. 3. Illustrative decomposition: leading term $4\tilde V$, a centroid-estimation term shown as $\propto \sqrt{V}/\sqrt{m}$, and a tail term shown as $\propto V/m$ (constants are schematic for intuition; the paper's exact corrections are the $V/m$, $V^2/m$ and $\Theta/m^3$ terms). Raise the shots $m$ and the corrections collapse toward the directional-CDNV floor.

Part IV — Many tasks at once, forced orthogonal

Directional CDNV also explains why one frozen SSL encoder can support many downstream labelings at once. Real images have several semantic factors — object identity, color, texture, pose, shape, size — and a good representation should let us read out many of them without rewriting the whole space. The clean model is a factor model: $M$ independent binary tasks, each carried by its own orthonormal direction $v_\ell$, with the embedding

Orthogonal factor model (§4.3) $$z \;=\; \sum_{\ell=1}^{M}\frac{\Delta_\ell}{2}\,t^{(\ell)}\,v_\ell \;+\; \eta \;+\; \xi$$

Each task label $t^{(\ell)}\in\{\pm1\}$ shifts $z$ by $\pm\tfrac{\Delta_\ell}{2}$ along its own axis. Thus the task-$\ell$ centroid gap is $\mu^{(\ell)}_+ - \mu^{(\ell)}_- = \Delta_\ell v_\ell$, so $v_\ell$ is literally the decision axis for that task. The term $\xi$ is small on-axis noise; $\eta$ is nuisance variation in the orthogonal complement. Plugging this into the definitions gives the whole phenomenon in one line: directional CDNV can be tiny while classical CDNV can be arbitrarily large,

$$\tilde V^{(\ell)} = \frac{v_\ell^\top \mathrm{Cov}(\xi)\,v_\ell}{\Delta_\ell^2}\ \text{(small)}, \qquad V^{(\ell)} \ge \frac{2\,\mathrm{tr}\!\big(\mathrm{Cov}(\eta)\big)}{\Delta_\ell^2}\ \text{(arbitrarily large)}$$
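A small simulation of the factor model shows both statistics side by side. All numbers here are assumptions made for illustration: three tasks with gaps 2, 3, 4, an 8-dimensional embedding, on-axis noise with standard deviation 0.16, and nuisance noise with standard deviation 2.0 in the five remaining coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, M = 40000, 8, 3
deltas = np.array([2.0, 3.0, 4.0])        # hypothetical gaps Delta_l
task_axes = np.eye(dim)[:M]               # rows are the orthonormal v_l

t = rng.choice([-1.0, 1.0], size=(n, M))  # independent balanced binary labels
xi = 0.16 * rng.normal(size=(n, M)) @ task_axes          # small on-axis noise
eta = np.zeros((n, dim))
eta[:, M:] = 2.0 * rng.normal(size=(n, dim - M))         # nuisance, orthogonal to all v_l
z = (t * deltas / 2.0) @ task_axes + xi + eta

def cdnv_for_task(pos, neg):
    """Classical CDNV, directional CDNV (class '-'), and estimated axis for one task."""
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    gap = np.linalg.norm(mu_p - mu_n)
    u = (mu_p - mu_n) / gap
    v_dir = np.var((neg - mu_n) @ u) / gap**2
    v_cls = (np.var(pos, axis=0).sum() + np.var(neg, axis=0).sum()) / gap**2
    return v_cls, v_dir, u

results = []
for l in range(M):
    v_cls, v_dir, u = cdnv_for_task(z[t[:, l] > 0], z[t[:, l] < 0])
    alignment = abs(u @ task_axes[l])     # cosine between estimated and true axis
    results.append((v_cls, v_dir, alignment))
    print(f"task {l}: classical={v_cls:7.3f}  directional={v_dir:.4f}  "
          f"|cos(u, v_l)|={alignment:.4f}")
```

The within-class variance of each task picks up both the nuisance term and the spread caused by the other tasks' labels, yet neither reaches the directional statistic because both live orthogonal to $v_\ell$.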

With three tasks, the picture is a box. The eight combinations of $(t^{(1)},t^{(2)},t^{(3)})$ become eight granular class centers at the corners of a hyperrectangle. Each edge direction is a task axis; each task cuts the box in half. The large nuisance variance $\eta$ lives outside this displayed subspace, which is exactly why a low-dimensional picture of SSL can look deceptively tidy while the full feature cloud remains broad. Drag to rotate:

[Interactive figure: three tasks, one representation. A rotatable 3-D view with 3 orthogonal decision axes and 8 granular centers. Controls: on-axis noise ξ (default 0.16); sample clouds shown.]

Fig. 4. The factor model with $M=3$. Eight granular centers (◆) sit at the corners of a hyperrectangle with unequal gaps $\Delta_1,\Delta_2,\Delta_3$. The three colored double-arrows are the decision axes: each task splits the box across one axis (blue = task A, green = task B, purple = task C). Sample clouds hug the corners along the axes; turning up $\xi$ loosens them and raises directional CDNV. Axes are orthogonal by construction — the geometry Proposition 4.2 forces from small directional CDNV alone.

The orthogonalization theorem, stated

The factor model is only a cartoon; the structural theorem does not assume such a clean generative story. Take two independent balanced labelings $y^{(1)}\in[K_1]$ and $y^{(2)}\in[K_2]$ of the same representation. Let $u^{(1)}_{aa'}$ be the decision axis between two classes of task 1, with gap $d^{(1)}_{aa'}$, and let $\tilde V^{(1)}_{aa'}$ be the worst directional CDNV along that axis; define the analogous quantities for task 2.

Proposition 4.2 — near-orthogonality from small directional CDNV

For any pair $a\ne a'$ in task 1 and $b\ne b'$ in task 2,

$$\Big|\,(u^{(1)}_{aa'})^{\top} u^{(2)}_{bb'}\,\Big| \;\le\; \min\!\left\{\, \frac{d^{(1)}_{aa'}}{d^{(2)}_{bb'}}\sqrt{2K_2\,\tilde V^{(1)}_{aa'}}\;,\;\; \frac{d^{(2)}_{bb'}}{d^{(1)}_{aa'}}\sqrt{2K_1\,\tilde V^{(2)}_{bb'}} \,\right\}$$

Read the inequality as an interference budget. The left side is the cosine similarity between a decision axis for task 1 and a decision axis for task 2. The right side shrinks like $\sqrt{\tilde V}$: if either task has very small directional CDNV, the two decision-axis families are forced toward orthogonality. Intuitively, a single direction cannot simultaneously be a low-variance separator for two independent labelings; if it tried, one labeling would inject variance into the other. For balanced binary tasks with equal gaps, the clean form is $|\cos\theta|\le 2\sqrt{\tilde V}$ — the relation the slider below traces.
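The right-hand side of Proposition 4.2 is easy to evaluate directly. In this sketch the function name and the clipping at 1 are additions for convenience (a cosine can never exceed 1, so a larger value carries no information); the formula itself is the one stated above.

```python
import math

def axis_cosine_bound(d1, d2, k1, k2, v_dir1, v_dir2):
    """Upper bound on |cos| between a task-1 axis and a task-2 axis (Prop. 4.2)."""
    bound_from_task1 = (d1 / d2) * math.sqrt(2.0 * k2 * v_dir1)
    bound_from_task2 = (d2 / d1) * math.sqrt(2.0 * k1 * v_dir2)
    return min(bound_from_task1, bound_from_task2, 1.0)  # clip: |cos| <= 1 always

# Balanced binary tasks with equal gaps reduce to |cos(theta)| <= 2 * sqrt(V).
for v in (0.2, 0.1, 0.03, 0.01, 0.001):
    bound = axis_cosine_bound(1.0, 1.0, 2, 2, v, v)
    min_angle = math.degrees(math.acos(bound))
    print(f"V={v:<6}  |cos| <= {bound:.4f}   angle >= {min_angle:5.1f} deg")
```

At $\tilde V=0.03$ the guaranteed angle between the two axes is already roughly $70^\circ$, and it approaches $90^\circ$ as $\tilde V$ shrinks. Unequal gaps enter only through the ratio $d^{(1)}/d^{(2)}$, so the task with the smaller gap relative to its directional CDNV supplies the tighter of the two bounds.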

[Interactive figure: two binary tasks, the bound in action. Control: directional CDNV per task Ṽ (default 0.03). Smaller directional CDNV forces the axes apart.]

Fig. 5. The binary case of Proposition 4.2 with equal gaps: the guaranteed alignment ceiling is $|\cos\theta|\le 2\sqrt{\tilde V}$. As directional CDNV shrinks the worst-case angle is pushed toward $90^\circ$, so task B's separation barely projects onto task A's axis and adapting one leaves the other untouched.

Part V — What SSL actually does

The experiments are deliberately broad: contrastive learning (SimCLR), masked prediction (MAE), distillation-style pretraining (DINO-v2), redundancy reduction (VICReg), and multimodal pretraining (CLIP, SigLIP). These objectives are very different, but the same geometric signature appears:

Ṽ ↓: directional CDNV collapses during pretraining, across every objective.

V →: classical CDNV stays large; the anisotropy is real and pervasive.

Tracks: the directional bound follows few-shot error at practical shot sizes.

The important observation is not merely that $\tilde V$ is smaller than $V$. It is that training moves them differently. Variability along downstream decision axes falls steadily, while substantial orthogonal variance persists. SSL is not producing classical class collapse; it is producing a subtler, task-compatible collapse of margins. Different objectives arrive at the same geometry, suggesting that directional collapse is a broad consequence of learning invariances rather than an artifact of one loss.

Fig. 6 — Schematic of the reported qualitative finding (not measured values). During SSL pretraining, directional CDNV collapses toward zero while classical CDNV stays high — the gap is the anisotropy that classical clustering measures misread.

The multitask prediction also survives the empirical check. On controlled synthetic data with independent visual factors — shape, size, color, pattern — SSL encoders map distinct factors to approximately orthogonal directions. The median absolute cosine similarity between decision axes from different labelings decays toward zero over training, staying within the qualitative envelope predicted by the theorem. In plain language: the representation learns a little coordinate system for independent factors.

What this does and does not say

The claim is geometric and specific. It does not say SSL induces full neural collapse; it says something more delicate and more useful. The total class cloud may remain large, but the projection that controls the decision becomes small. The theorem also does not promise a precise accuracy number for every encoder and every dataset. It identifies the right control variable, proves a sharp dependence in a distribution-free setting, and separates representation error from finite-shot centroid-estimation error.

The scope matters. The analysis is centered on NCC and linear-probe-style downstream rules, so it is a theory of geometry seen by simple heads, not a universal theorem about every possible adaptation procedure. The finite-shot terms are honest about the fact that support-set centroids must be estimated. And the orthogonality theorem assumes independent balanced labelings, a clean abstraction of real visual factors. Still, the message survives these caveats: stop averaging over directions the decision ignores, and SSL transfer stops looking mysterious.

✓ Measure variance where decisions happen. Directional CDNV reads only the spread along the class-separating axis, the only variance that can change an NCC or linear-probe label.

✓ One representation, many tasks. Small directional CDNV across independent labelings forces near-orthogonal decision axes, the geometric basis for low-interference multitask transfer.

Takeaway

Frozen self-supervised features transfer from a few labels because the variance that matters for decisions collapses, even when the full class cloud remains broad and anisotropic.

Directional CDNV is the right ruler. Project within-class variance onto the line between two class means. That projection is the NCC margin noise; everything orthogonal is mostly scenery.

The bound is sharp. In the known-centroid limit, few-shot NCC error is controlled by $4\tilde V$, and no distribution-free second-moment argument can improve that leading constant.

Orthogonality comes for free. When directional CDNV is small across independent tasks, the decision axes are forced apart, letting one frozen encoder serve many tasks at once. Across SimCLR, MAE, DINO-v2, VICReg, CLIP, and SigLIP, this directional collapse shows up while classical CDNV stays large — anisotropy is the rule, and it is exactly what makes SSL transfer.
