Directional Neural Collapse
Frozen self-supervised features can transfer from a handful of labels, even when their class clusters look wide, anisotropic, and nothing like classical neural collapse. The trick is beautifully geometric: decisions only care about variance along one line, and SSL quietly suppresses exactly that direction.
Introduction
A frozen self-supervised encoder is a strange little miracle. It was never told what a dog, chair, texture, or shape is. Yet after pretraining, attach a nearest-centroid classifier or a tiny linear probe on top, give it only a few labeled examples per class, and it often behaves as if the relevant categories were already drawn in the feature space.
For supervised classifiers, we have a satisfying story: neural collapse. Late in training, examples from the same class concentrate near their class mean, different class means spread apart in a nearly symmetric pattern, and the last-layer classifier aligns with those means. This is exactly the geometry a nearest-class-center rule wants. But self-supervised learning (SSL) is not trained with those labels. It has no explicit reason to compress every semantic class into a tiny ball. Measured by the usual global clustering statistics, SSL features remain broad and highly anisotropic — and still transfer beautifully. So the classical ruler is reading the wrong thing.
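The paper turns this observation into a compact theory built around one quantity: directional CDNV, the class-distance-normalized variance after projecting onto the decision axis. It explains two phenomena that otherwise look unrelated: why few-shot transfer succeeds even when class clusters stay broad and anisotropic, and why a single frozen encoder can serve many independent downstream tasks at once.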
Part I — The anisotropy puzzle
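Start with the classical quantity used to certify few-shot transfer: class-distance-normalized variance (CDNV). For two classes $i$ and $j$ with means $\mu_i$ and $\mu_j$, it compares how much the feature clouds spread inside each class, measured by the within-class variance $v_c=\mathbb{E}\|f(x)-\mu_c\|^2$, to the squared distance between their means. Call this ratio $V_{ij}$.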
Small $V_{ij}$ says that the two clouds are tight compared with the gap $d_{ij}=\|\mu_i-\mu_j\|$. In supervised training, this is the familiar neural-collapse picture: $v_c$ shrinks, the class means remain separated, and a nearest-class-centroid (NCC) classifier becomes almost inevitable. A few labeled examples are enough because the representation has already done the hard geometric work.
SSL breaks this story in the most interesting way. Without labels, there is no obvious force making every semantic class collapse into a small Euclidean ball. Instead, SSL representations are anisotropic: they often keep large variance in directions corresponding to style, pose, background, augmentation, or other nuisance factors. Classical CDNV adds all of those directions into $v_c$, so it can remain large even when the representation is perfectly organized for the downstream decision. The transfer succeeds; the scalar summary is too blunt.
Part II — Only one direction matters
Fix two classes $i$ and $j$. The NCC rule compares the two squared distances, but after expanding the algebra almost everything cancels. A point from class $i$ is misclassified as $j$ exactly when its projection along the line from $\mu_i$ to $\mu_j$ crosses the midpoint. All orthogonal coordinates vanish from the decision.
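The cancellation can be written out in a few lines. This is a sketch of the standard expansion, assuming population means $\mu_i,\mu_j$, an embedding $z=f(x)$, the gap $d_{ij}=\|\mu_i-\mu_j\|$, and the sign convention $u_{ij}=(\mu_j-\mu_i)/d_{ij}$:

```latex
\begin{aligned}
\|z-\mu_i\|^2-\|z-\mu_j\|^2
  &= 2\langle z,\ \mu_j-\mu_i\rangle-\big(\|\mu_j\|^2-\|\mu_i\|^2\big) \\
  &= 2\Big\langle z-\tfrac{\mu_i+\mu_j}{2},\ \mu_j-\mu_i\Big\rangle \\
  &= 2\,d_{ij}\Big(\langle z-\mu_i,\ u_{ij}\rangle-\tfrac{d_{ij}}{2}\Big).
\end{aligned}
```

NCC prefers class $j$ over class $i$ exactly when the left side is positive, that is, when $\langle z-\mu_i,u_{ij}\rangle>d_{ij}/2$. Any component of $z-\mu_i$ orthogonal to $u_{ij}$ drops out of the inner product.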
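This identity tells us the right variance to measure. The decision axis is the unit vector pointing from one mean to the other, and directional CDNV is the variance of class $i$ after projecting onto that axis, normalized by the squared gap:
$$u_{ij}=\frac{\mu_j-\mu_i}{d_{ij}},\qquad \tilde V_{ij}=\frac{\operatorname{Var}_{x\sim i}\big(\langle f(x)-\mu_i,\ u_{ij}\rangle\big)}{d_{ij}^2}.$$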
So $\tilde V_{ij}$ is not merely another clustering score; it is the second moment of the actual margin variable. Variance orthogonal to $u_{ij}$ can make the cluster visually enormous, but it cannot move the point through the NCC boundary. That is the geometry the figure below makes concrete — and the explorer after it lets you feel it.
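To make the contrast concrete, here is a minimal sketch in plain Python that computes both quantities on toy 2-D Gaussian clouds. Several things are assumptions of this sketch rather than the paper's definitions: the one-sided normalization (class $i$'s variance over $d_{ij}^2$), the use of known population means, and the helper names `cdnv_pair` and `toy_cloud`.

```python
import random

def cdnv_pair(class_i, mu_i, mu_j):
    """Return (classical, directional) CDNV of class i against class j.

    Assumed one-sided forms (the normalization is this sketch's choice):
      classical   V_ij  = E||f(x) - mu_i||^2 / d_ij^2
      directional V~_ij = E<f(x) - mu_i, u_ij>^2 / d_ij^2
    """
    gap = [b - a for a, b in zip(mu_i, mu_j)]
    d2 = sum(g * g for g in gap)
    u = [g / d2 ** 0.5 for g in gap]  # unit decision axis from mu_i to mu_j
    total = along = 0.0
    for point in class_i:
        centered = [x - m for x, m in zip(point, mu_i)]
        total += sum(c * c for c in centered)                    # all directions
        along += sum(c * uk for c, uk in zip(centered, u)) ** 2  # axis only
    n = len(class_i)
    return total / (n * d2), along / (n * d2)

def toy_cloud(center, axis_std, nuisance_std, n=2000, seed=0):
    """2-D Gaussian cloud: coordinate 0 is the decision axis, 1 is nuisance."""
    rng = random.Random(seed)
    return [[center[0] + rng.gauss(0, axis_std),
             center[1] + rng.gauss(0, nuisance_std)] for _ in range(n)]

mu_i, mu_j = [-0.5, 0.0], [0.5, 0.0]
for nuisance_std in (0.1, 1.0, 3.0):
    cloud = toy_cloud(mu_i, axis_std=0.1, nuisance_std=nuisance_std)
    classical, directional = cdnv_pair(cloud, mu_i, mu_j)
    print(f"nuisance std {nuisance_std}: "
          f"classical={classical:.3f}, directional={directional:.4f}")
```

Raising `nuisance_std` inflates the classical value while the directional value stays near `axis_std**2 / d**2`. If the means were estimated from a small sample instead of given, the axis itself would tilt slightly into the nuisance direction, which is the finite-shot cost treated in Part III.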
Fig. 1 — Anisotropic clusters. The clouds are tall, so classical CDNV sees a lot of variance. But the boundary is vertical and the decision axis is horizontal; only horizontal spread can cause mistakes. Directional CDNV reads that narrow projection and ignores the harmless height.
[Interactive explorer. Readout labels: CDNV (all directions), (along axis), (known centroids).]
Fig. 2. Drag nuisance variance up: the clusters balloon vertically, classical CDNV climbs, but the true error and directional CDNV do not budge. Drag variance along axis up: the clusters overlap at the boundary and the error rises in lockstep with directional CDNV. The error is a Gaussian tail along the axis; the $4\tilde V$ value is the distribution-free Cantelli bound.
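The relation between the Gaussian tail and the $4\tilde V$ value can be checked numerically. The sketch below assumes a Gaussian projection with variance $\sigma^2$ along the axis and a boundary at distance $d/2$, so that $\tilde V=\sigma^2/d^2$; the function names are illustrative. Cantelli's one-sided inequality gives $\sigma^2/(\sigma^2+d^2/4)=4\tilde V/(1+4\tilde V)$, which is at most $4\tilde V$.

```python
import math

def gaussian_ncc_error(v_dir):
    """P(N(0, sigma^2) > d/2) when sigma^2 / d^2 = v_dir (Gaussian assumption)."""
    return 0.5 * math.erfc(1.0 / (2.0 * math.sqrt(2.0 * v_dir)))

def cantelli_bound(v_dir):
    """Cantelli: P(X - E[X] >= t) <= sigma^2 / (sigma^2 + t^2) with t = d/2."""
    return 4.0 * v_dir / (1.0 + 4.0 * v_dir)

def linear_bound(v_dir):
    """The looser, simpler form 4 * V~ used as the leading term."""
    return 4.0 * v_dir

for v in (0.01, 0.05, 0.1, 0.25):
    print(f"V~={v}: gaussian={gaussian_ncc_error(v):.4f}, "
          f"cantelli={cantelli_bound(v):.4f}, 4V~={linear_bound(v):.4f}")
```

The Gaussian error sits well below both bounds because the bounds use only the second moment and must hold for every distribution with that variance.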
Part III — A sharp few-shot bound
The main theorem turns the one-dimensional margin picture into a finite-shot guarantee. Suppose each target class is represented by only $m$ labeled examples, so the class means used by NCC are empirical centroids rather than population means. Then the average few-shot NCC error over $C'$ classes splits into two pieces: the intrinsic decision-axis overlap, and the extra noise introduced by estimating the centroids from few samples.
The theorem keeps the correction terms explicit, which is useful because each one corresponds to a real statistical cost. Let $\Theta_{ij}=\big(M_{4,i}+M_{4,j}\big)/d_{ij}^4$ be the normalized fourth moment, with $M_{4,i}=\mathbb{E}\|f(x)-\mu_i\|^4$. Then the finite-shot price decomposes into three terms, $E^{1}_{ij}$, $E^{2}_{ij}$, and $E^{3}_{ij}$.
Here is the intuition. $E^{2}_{ij}\asymp V_{ij}/m$ is the ordinary centroid-estimation cost: with few shots, the empirical class mean wiggles. $E^{1}_{ij}$ is a quadratic correction coming from the interaction between class spread and the random centroid. $E^{3}_{ij}$ is the higher-moment term that protects the theorem from heavy tails. The important asymmetry is that the leading term is directional CDNV $\tilde V_{ij}$, while the finite-shot corrections depend on the coarser global CDNV $V_{ij}$ but shrink with $m$. In the many-shot limit, all that remains is the sharp directional certificate.
The explorer below decomposes the bound. Its moral is simple: more shots help estimate centroids, but they do not repair a representation whose class clouds overlap along the decision axis. The asymptotic floor is $4\tilde V$; the rest is the price of learning the centers from a tiny support set.
Fig. 3. Illustrative decomposition: leading term $4\tilde V$, a centroid-estimation term shown as $\propto \sqrt{V}/\sqrt{m}$, and a tail term shown as $\propto V/m$ (constants are schematic for intuition; the paper's exact corrections are the $V/m$, $V^2/m$ and $\Theta/m^3$ terms). Raise the shots $m$ and the corrections collapse toward the directional-CDNV floor.
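The same qualitative behaviour shows up in a small Monte Carlo experiment. This is a toy sketch, not the paper's experiment: it assumes Gaussian class clouds with one decision-axis coordinate and a few nuisance coordinates, and the function name and default parameters are illustrative. Because the clouds are Gaussian, the centroid of $m$ shots is itself Gaussian with covariance shrunk by $1/m$, so the code samples it directly.

```python
import random

def few_shot_ncc_error(m, axis_std=0.2, nuis_std=1.0, nuis_dims=3,
                       gap=1.0, trials=5000, seed=0):
    """Estimate the two-class NCC error when centroids come from m shots."""
    rng = random.Random(seed)
    stds = [axis_std] + [nuis_std] * nuis_dims
    mu_i = [-gap / 2] + [0.0] * nuis_dims
    mu_j = [+gap / 2] + [0.0] * nuis_dims
    shrink = m ** -0.5  # std of an m-sample mean is std / sqrt(m)
    errors = 0
    for _ in range(trials):
        # Empirical centroids of the two support sets (sampled directly).
        c_i = [mu + rng.gauss(0, s * shrink) for mu, s in zip(mu_i, stds)]
        c_j = [mu + rng.gauss(0, s * shrink) for mu, s in zip(mu_j, stds)]
        # One query point from class i.
        z = [mu + rng.gauss(0, s) for mu, s in zip(mu_i, stds)]
        dist_i = sum((a - b) ** 2 for a, b in zip(z, c_i))
        dist_j = sum((a - b) ** 2 for a, b in zip(z, c_j))
        if dist_j < dist_i:
            errors += 1
    return errors / trials

for m in (1, 10, 100, 1000):
    print(f"m={m}: estimated error {few_shot_ncc_error(m):.4f}")
```

With these defaults $\tilde V=0.04$ and classical CDNV is roughly 3, so the error only approaches its floor once $m$ is large compared with the classical value, while the floor itself is set by the spread along the axis.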
Part IV — Many tasks at once, forced orthogonal
Directional CDNV also explains why one frozen SSL encoder can support many downstream labelings at once. Real images have several semantic factors (object identity, color, texture, pose, shape, size), and a good representation should let us read out many of them without rewriting the whole space. The clean model is a factor model: $M$ independent binary tasks, each carried by its own orthonormal direction $v_\ell$, with the embedding
$$z=\sum_{\ell=1}^{M} t^{(\ell)}\,\frac{\Delta_\ell}{2}\,v_\ell+\xi+\eta.$$
Each task label $t^{(\ell)}\in\{\pm1\}$ shifts $z$ by $\pm\tfrac{\Delta_\ell}{2}$ along its own axis. Thus the task-$\ell$ centroid gap is $\mu^{(\ell)}_+ - \mu^{(\ell)}_- = \Delta_\ell v_\ell$, so $v_\ell$ is literally the decision axis for that task. The term $\xi$ is small on-axis noise; $\eta$ is nuisance variation in the orthogonal complement. Plugging this into the definitions gives the whole phenomenon in one line: directional CDNV can be tiny while classical CDNV can be arbitrarily large, because within a task-$\ell$ class only $\xi$ contributes variance along $v_\ell$, whereas $\eta$ and the other tasks' shifts inflate the total variance alone.
With three tasks, the picture is a box. The eight combinations of $(t^{(1)},t^{(2)},t^{(3)})$ become eight granular class centers at the corners of a hyperrectangle. Each edge direction is a task axis; each task cuts the box in half. The large nuisance variance $\eta$ lives outside this displayed subspace, which is exactly why a low-dimensional picture of SSL can look deceptively tidy while the full feature cloud remains broad. Drag to rotate:
Fig. 4. The factor model with $M=3$. Eight granular centers (◆) sit at the corners of a hyperrectangle with unequal gaps $\Delta_1,\Delta_2,\Delta_3$. The three colored double-arrows are the decision axes: each task splits the box across one axis (blue = task A, green = task B, purple = task C). Sample clouds hug the corners along the axes; turning up $\xi$ loosens them and raises directional CDNV. Axes are orthogonal by construction — the geometry Proposition 4.2 forces from small directional CDNV alone.
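A small simulation of the factor model shows the gap between the two quantities. This is a sketch under stated assumptions: the additive form $z=\sum_\ell t^{(\ell)}\tfrac{\Delta_\ell}{2}v_\ell+\xi+\eta$ implied by the description above, task axes taken to be the first coordinate axes, independent Gaussian $\xi$ and $\eta$, and known axes and gaps rather than estimated centroids. The function names and parameter values are illustrative.

```python
import random

def sample_factor_model(n, gaps, xi_std, eta_std, eta_dims, seed=0):
    """Draw (labels, z) pairs; task l lives on coordinate l, nuisance after."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = [rng.choice((-1, 1)) for _ in gaps]
        # Shift by +-gap/2 along each task axis, plus small on-axis noise xi.
        on_axis = [tl * g / 2 + rng.gauss(0, xi_std) for tl, g in zip(t, gaps)]
        # Nuisance eta in the orthogonal complement of the task axes.
        nuisance = [rng.gauss(0, eta_std) for _ in range(eta_dims)]
        data.append((t, on_axis + nuisance))
    return data

def task_cdnv(data, task, gaps):
    """(classical, directional) CDNV of the +1 class of one task."""
    gap = gaps[task]
    dim = len(data[0][1])
    mu_plus = [0.0] * dim
    mu_plus[task] = gap / 2  # population centroid of the +1 class
    pos = [z for t, z in data if t[task] == 1]
    total = sum(sum((a - b) ** 2 for a, b in zip(z, mu_plus)) for z in pos)
    along = sum((z[task] - gap / 2) ** 2 for z in pos)  # projection on v_task
    return total / (len(pos) * gap ** 2), along / (len(pos) * gap ** 2)

gaps = (1.0, 1.5, 2.0)
data = sample_factor_model(4000, gaps, xi_std=0.05, eta_std=2.0, eta_dims=5)
for task in range(len(gaps)):
    classical, directional = task_cdnv(data, task, gaps)
    print(f"task {task}: classical={classical:.2f}, directional={directional:.5f}")
```

Increasing `eta_std` or `eta_dims` grows the classical value without bound while the directional value stays near `xi_std**2 / gap**2`.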
The orthogonalization theorem, stated
The factor model is only a cartoon; the structural theorem does not assume such a clean generative story. Take two independent balanced labelings $y^{(1)}\in[K_1]$ and $y^{(2)}\in[K_2]$ of the same representation. Let $u^{(1)}_{aa'}$ be the decision axis between two classes of task 1, with gap $d^{(1)}_{aa'}$, and let $\tilde V^{(1)}_{aa'}$ be the worst directional CDNV along that axis; define the analogous quantities for task 2.
Read the inequality as an interference budget. The left side is the cosine similarity between a decision axis for task 1 and a decision axis for task 2. The right side shrinks like $\sqrt{\tilde V}$: if either task has very small directional CDNV, the two decision-axis families are forced toward orthogonality. Intuitively, a single direction cannot simultaneously be a low-variance separator for two independent labelings; if it tried, one labeling would inject variance into the other. For balanced binary tasks with equal gaps, the clean form is $|\cos\theta|\le 2\sqrt{\tilde V}$ — the relation the slider below traces.
Fig. 5. The binary case of Proposition 4.2 with equal gaps: the guaranteed alignment ceiling is $|\cos\theta|\le 2\sqrt{\tilde V}$. As directional CDNV shrinks the worst-case angle is pushed toward $90^\circ$, so task B's separation barely projects onto task A's axis and adapting one leaves the other untouched.
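The binary equal-gap relation is easy to tabulate. The sketch below only evaluates the stated ceiling $|\cos\theta|\le 2\sqrt{\tilde V}$ and converts it to a smallest guaranteed angle; the function names are illustrative, and the ceiling is clipped at 1 because a cosine cannot exceed it.

```python
import math

def alignment_ceiling(v_dir):
    """Upper bound on |cos(theta)| between the two decision axes."""
    return min(1.0, 2.0 * math.sqrt(v_dir))

def min_angle_deg(v_dir):
    """Smallest angle (degrees) the bound still allows between the axes."""
    return math.degrees(math.acos(alignment_ceiling(v_dir)))

for v in (0.25, 0.0625, 0.01, 0.001):
    print(f"V~={v}: |cos| <= {alignment_ceiling(v):.3f}, "
          f"angle >= {min_angle_deg(v):.1f} deg")
```

The bound is vacuous at $\tilde V\ge 1/4$ and tightens toward $90^\circ$ only as a square root, so a hundredfold drop in $\tilde V$ buys a tenfold drop in the allowed cosine.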
Part V — What SSL actually does
The experiments are deliberately broad: contrastive learning (SimCLR), masked prediction (MAE), distillation-style pretraining (DINO-v2), redundancy reduction (VICReg), and multimodal pretraining (CLIP, SigLIP). These objectives are very different, but the same geometric signature appears: directional CDNV $\tilde V$ sits well below classical CDNV $V$.
The important observation is not merely that $\tilde V$ is smaller than $V$. It is that training moves them differently. Variability along downstream decision axes falls steadily, while substantial orthogonal variance persists. SSL is not producing classical class collapse; it is producing a subtler, task-compatible collapse of margins. Different objectives arrive at the same geometry, suggesting that directional collapse is a broad consequence of learning invariances rather than an artifact of one loss.
Fig. 6 — Schematic of the reported qualitative finding (not measured values). During SSL pretraining, directional CDNV collapses toward zero while classical CDNV stays high — the gap is the anisotropy that classical clustering measures misread.
The multitask prediction also survives the empirical check. On controlled synthetic data with independent visual factors — shape, size, color, pattern — SSL encoders map distinct factors to approximately orthogonal directions. The median absolute cosine similarity between decision axes from different labelings decays toward zero over training, staying within the qualitative envelope predicted by the theorem. In plain language: the representation learns a little coordinate system for independent factors.
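One way to compute such a statistic is sketched below in plain Python. It is an illustration of the general recipe (class centroids per labeling, unit axes between centroid pairs, cosines across labelings), not the paper's evaluation code; the function names and the toy data are assumptions of this sketch.

```python
import itertools
import random
import statistics

def centroids(features, labels):
    """Mean feature vector per class label."""
    groups = {}
    for z, y in zip(features, labels):
        groups.setdefault(y, []).append(z)
    return {y: [sum(col) / len(zs) for col in zip(*zs)]
            for y, zs in groups.items()}

def decision_axes(features, labels):
    """Unit vectors between every pair of class centroids of one labeling."""
    cents = centroids(features, labels)
    axes = []
    for a, b in itertools.combinations(sorted(cents), 2):
        diff = [q - p for p, q in zip(cents[a], cents[b])]
        norm = sum(x * x for x in diff) ** 0.5
        axes.append([x / norm for x in diff])
    return axes

def median_abs_cosine(features, labels_1, labels_2):
    """Median |cos| between axes of labeling 1 and axes of labeling 2."""
    axes_1 = decision_axes(features, labels_1)
    axes_2 = decision_axes(features, labels_2)
    return statistics.median(
        abs(sum(p * q for p, q in zip(u, w))) for u in axes_1 for w in axes_2
    )

# Toy data: two independent binary factors on separate coordinates,
# plus three high-variance nuisance coordinates.
rng = random.Random(0)
features, y1, y2 = [], [], []
for _ in range(2000):
    a, b = rng.choice((-1, 1)), rng.choice((-1, 1))
    z = [0.5 * a + rng.gauss(0, 0.05), 0.5 * b + rng.gauss(0, 0.05)]
    z += [rng.gauss(0, 1.0) for _ in range(3)]
    features.append(z)
    y1.append(a)
    y2.append(b)

print(f"median |cos| = {median_abs_cosine(features, y1, y2):.3f}")
```

On real encoders the features would be frozen embeddings and the labelings would be the independent factor annotations; the residual cosine in the toy case comes only from centroid estimation noise.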
What this does and does not say
The claim is geometric and specific. It does not say SSL induces full neural collapse; it says something more delicate and more useful. The total class cloud may remain large, but the projection that controls the decision becomes small. The theorem also does not promise a precise accuracy number for every encoder and every dataset. It identifies the right control variable, proves a sharp dependence in a distribution-free setting, and separates representation error from finite-shot centroid-estimation error.
The scope matters. The analysis is centered on NCC and linear-probe-style downstream rules, so it is a theory of geometry seen by simple heads, not a universal theorem about every possible adaptation procedure. The finite-shot terms are honest about the fact that support-set centroids must be estimated. And the orthogonality theorem assumes independent balanced labelings, a clean abstraction of real visual factors. Still, the message survives these caveats: stop averaging over directions the decision ignores, and SSL transfer stops looking mysterious.
Takeaway
Frozen self-supervised features transfer from a few labels because the variance that matters for decisions collapses, even when the full class cloud remains broad and anisotropic.
Directional CDNV is the right ruler. Project within-class variance onto the line between two class means. That projection is the NCC margin noise; everything orthogonal is mostly scenery.
The bound is sharp. In the known-centroid limit, few-shot NCC error is controlled by $4\tilde V$, and no distribution-free second-moment argument can improve that leading constant.
Orthogonality comes for free. When directional CDNV is small across independent tasks, the decision axes are forced apart, letting one frozen encoder serve many tasks at once. Across SimCLR, MAE, DINO-v2, VICReg, CLIP, and SigLIP, this directional collapse shows up while classical CDNV stays large — anisotropy is the rule, and it is exactly what makes SSL transfer.