Directional Neural Collapse Explains Few-Shot Transfer in Self-Supervised Learning

Achleshwar Luthra* · Yash Salunkhe* · Tomer Galanti*

Texas A&M University

Under Review


Overview

Self-supervised learning (SSL) can learn representations that transfer well to downstream tasks using only a few labeled examples. But why does it work so well?

In the supervised learning world, there's a well-documented phenomenon called Neural Collapse that describes what a "perfect" model looks like. In the neural collapse regime, at the end of training, the learned embeddings of samples from the same class concentrate into a single point. Imagine taking hundreds of diverse dog images and projecting them all down to a single, perfectly tight point in the embedding space.

Neural Collapse Illustration

Illustration showing diverse dog images converging into a single, perfectly tight cluster point in a 3D embedding space.

From a classification perspective, this global collapse is a desirable geometric property: the representations become invariant to nuisance factors, which helps the model generalize well to new classification tasks. That said, such class-wise collapse generally requires access to labels during training, since labels specify which samples belong together. Without labels, as in SSL, you can't expect the model to exhibit this global neural collapse.

This raises the following question:

What aspects of the learned embedding geometry in self-supervised learning explain effective few-shot generalization?

Problem Setup

Let's start with a standard metric that measures the degree of neural collapse: the Class-Distance-Normalized Variance, or CDNV (check out our previous NeurIPS blog for a deep dive into CDNV visuals). It measures how tight the clusters are relative to how far apart the class means are:

$$V_{ij} := \frac{v_i + v_j}{d_{ij}^2}$$
  • $v_i, v_j$: Variance within class $i$ and class $j$.
  • $d_{ij}$: Distance between the centers (means) of class $i$ and class $j$.
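As a concrete reference, here is a minimal sketch of how this quantity can be estimated from embeddings. The function and variable names (`cdnv`, `emb_i`, `emb_j`) are ours, not from the paper's code:

```python
import numpy as np

def cdnv(emb_i: np.ndarray, emb_j: np.ndarray) -> float:
    """Class-Distance-Normalized Variance between two classes of embeddings."""
    mu_i, mu_j = emb_i.mean(axis=0), emb_j.mean(axis=0)
    # Within-class variance: mean squared distance to the class mean.
    v_i = np.mean(np.sum((emb_i - mu_i) ** 2, axis=1))
    v_j = np.mean(np.sum((emb_j - mu_j) ** 2, axis=1))
    d2 = np.sum((mu_i - mu_j) ** 2)  # squared distance between class means
    return (v_i + v_j) / d2

rng = np.random.default_rng(0)
# Tight clusters give a small CDNV; loose clusters give a large one.
tight = cdnv(rng.normal([0, 0], 0.1, (500, 2)), rng.normal([5, 0], 0.1, (500, 2)))
loose = cdnv(rng.normal([0, 0], 2.0, (500, 2)), rng.normal([5, 0], 2.0, (500, 2)))
print(tight, loose)
```

The two toy cases differ only in within-class spread, which is exactly the numerator of the metric.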

Recent theory gives the following generalization bound based on this metric:

$$\text{err}_{m,D}(f) \lesssim \left(1 + \frac{1}{m}\right) V_f(D_1, D_2)$$
  • $\text{err}_{m,D}(f)$: Generalization error for the downstream classifier.
  • $m$: Number of labeled examples (shots) per class.
  • $V_f(D_1, D_2)$: CDNV between the two class distributions.
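To get a feel for the shot dependence, here is a toy evaluation of the bound's shape. The constants hidden by $\lesssim$ are dropped, so the numbers are relative trends, not certified error rates; `err_bound` is our name:

```python
def err_bound(cdnv_value: float, m: int) -> float:
    """Leading-term shape of the bound: (1 + 1/m) * CDNV (constants dropped)."""
    return (1 + 1 / m) * cdnv_value

for m in (1, 5, 25):
    print(f"m={m:2d}  bound ~ {err_bound(0.05, m):.3f}")
# The (1 + 1/m) factor shrinks quickly with m, so beyond a few shots
# the bound is dominated by the CDNV itself.
```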

This theoretical bound predicts low error if the CDNV is low. But what happens when CDNV is high? Let's look at a standard example:

High CDNV (Overlapping) Bad High CDNV Plot

In this scenario, high CDNV correctly predicts a problem. The clusters are completely overlapping, and a classifier would struggle to separate them. So far, the theory holds up—high CDNV is bad.

Let's consider another case...

High CDNV (Separable) Baguette CDNV Plot

Look at these elongated, "baguette-like" clusters. Because standard CDNV averages variance across all directions, it sees the massive variance along the stretched axis and outputs a very high penalty. It predicts terrible generalization.

But isn’t this kind of clustering already sufficient? A linear probe can learn a separating hyperplane easily and achieve perfect accuracy.
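This failure mode is easy to reproduce. The sketch below builds synthetic "baguette" clusters, confirms that the global CDNV is large, and checks that a trivial linear probe (a threshold on the separating coordinate) still classifies perfectly; all names here are illustrative, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Elongated along x (std 10), tight along y (std 0.1); classes differ only in y.
cls0 = np.column_stack([rng.normal(0, 10, n), rng.normal(-1, 0.1, n)])
cls1 = np.column_stack([rng.normal(0, 10, n), rng.normal(+1, 0.1, n)])

mu0, mu1 = cls0.mean(0), cls1.mean(0)
v0 = np.mean(np.sum((cls0 - mu0) ** 2, axis=1))
v1 = np.mean(np.sum((cls1 - mu1) ** 2, axis=1))
cdnv = (v0 + v1) / np.sum((mu0 - mu1) ** 2)
print(f"global CDNV ~ {cdnv:.1f}")  # large (order 50)

# A trivial "linear probe" thresholding y classifies perfectly.
acc = (np.mean(cls0[:, 1] < 0) + np.mean(cls1[:, 1] > 0)) / 2
print(f"linear probe accuracy = {acc:.3f}")  # 1.000
```

The massive variance along x inflates the CDNV, yet it is irrelevant to the classification task.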

This prompts a natural question:

Do SSL representations naturally stretch along task-irrelevant dimensions?
If so, that would neatly explain why they can exhibit high global CDNV yet still achieve competitive downstream performance.

This is exactly why we need Directional CDNV. Instead of looking at the whole baguette, it measures only the variance that actually moves points across the decision boundary:

$$\tilde{V}_{ij} := \frac{u_{ij}^{\top}\Sigma_i u_{ij}}{d_{ij}^2}$$
  • $u_{ij}$: The unit vector along the decision axis between classes $i$ and $j$.
  • $\Sigma_i$: The covariance matrix within class $i$.
  • $d_{ij}$: Distance between the centers (means) of class $i$ and class $j$.
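Here is a minimal sketch of this definition, evaluated on the same kind of elongated clusters described above; variable names are ours. The global CDNV is huge for such data, yet the directional CDNV comes out tiny:

```python
import numpy as np

def directional_cdnv(emb_i: np.ndarray, emb_j: np.ndarray) -> float:
    """Variance of class i projected onto the decision axis, normalized by d^2."""
    mu_i, mu_j = emb_i.mean(axis=0), emb_j.mean(axis=0)
    diff = mu_j - mu_i
    d2 = diff @ diff
    u = diff / np.sqrt(d2)                 # unit vector along the decision axis
    sigma_i = np.cov(emb_i, rowvar=False)  # within-class covariance of class i
    return (u @ sigma_i @ u) / d2          # only the variance that crosses the boundary

# "Baguette" clusters: stretched along x, separated along y.
rng = np.random.default_rng(0)
cls0 = np.column_stack([rng.normal(0, 10, 1000), rng.normal(-1, 0.1, 1000)])
cls1 = np.column_stack([rng.normal(0, 10, 1000), rng.normal(+1, 0.1, 1000)])
print(directional_cdnv(cls0, cls1))  # tiny (roughly 0.1**2 / 2**2 = 0.0025)
```

The projection onto $u_{ij}$ discards the stretched axis entirely, which is the whole point of the metric.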

Let's dive into our key observations that explain why this specific geometric property is the true engine behind few-shot transfer.

Key Observations

1 · Pervasive Decision-Axis Collapse

Directional neural collapse is a universal phenomenon in SSL. When we train models across various SSL paradigms (SimCLR, VICReg, MAE, and DINO-v2), the standard CDNV barely drops, but the directional CDNV plummets. These models actively suppress variance along the specific directions that separate classes, while leaving the orthogonal "nuisance" directions untouched.

Decision-axis collapse emerges during SSL training. Directional CDNV decreases dramatically over training (from roughly 2⁻¹ down to 2⁻⁵), while standard CDNV decreases only modestly or even increases. This indicates that SSL models primarily tighten class geometry along separating directions even when overall within-class variability is large.

MAE (ViT-B/16) MAE CDNV Plot
SimCLR (ResNet-50) SimCLR CDNV Plot
DINO-v2 (ViT-B/16) DINO-v2 CDNV Plot
VICReg (ResNet-50) VICReg CDNV Plot

2 · Reliable Certificates via Tighter Error Bounds

Directional variance provides mathematically sound and practically tight guarantees for how well a model will perform. We provide non-asymptotic multiclass generalization bounds where the leading term is purely the directional CDNV:

$$\text{err}_{m,D}(f) \lesssim \tilde{V}_f + \mathcal{O}\left(\frac{1}{\text{poly}(m)}\right) V_f$$
  • $\tilde{V}_f$: Directional CDNV (Decision-axis variance).
  • $V_f$: Standard CDNV (Global within-class variance).
  • $m$: Number of labeled examples (shots) per class.

A similar directional bound was proposed in [Luthra et al., 2025], but by cleanly separating the intrinsic decision-axis variability from finite-sample centroid estimation errors, we provide a reliable and significantly tighter bound that remains non-vacuous at practical shot sizes.
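As a toy sanity check of the leading term, the sketch below estimates few-shot NCC error on synthetic Gaussian classes and compares it against the directional CDNV computed from the true geometry. Constants and the $\mathcal{O}(1/\mathrm{poly}(m))$ term are dropped, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_test = 10, 20000
mu0, mu1, sigma = np.array([-1.0, 0.0]), np.array([1.0, 0.0]), 0.5

train0 = rng.normal(mu0, sigma, (m, 2))
train1 = rng.normal(mu1, sigma, (m, 2))
test0 = rng.normal(mu0, sigma, (n_test, 2))
test1 = rng.normal(mu1, sigma, (n_test, 2))

c0, c1 = train0.mean(0), train1.mean(0)  # centroids estimated from m shots

def ncc_err(x: np.ndarray, own: np.ndarray, other: np.ndarray) -> float:
    """Fraction of points closer to the wrong centroid."""
    return np.mean(np.linalg.norm(x - own, axis=1) > np.linalg.norm(x - other, axis=1))

err = (ncc_err(test0, c0, c1) + ncc_err(test1, c1, c0)) / 2

# Directional CDNV from the true geometry: u is the x-axis, so u'Σu = σ², d² = 4.
dir_cdnv = sigma**2 / np.sum((mu1 - mu0) ** 2)
# The measured NCC error typically lands below this leading bound term.
print(f"NCC error ~ {err:.3f}, directional CDNV = {dir_cdnv:.4f}")
```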

Decision-axis variance yields informative few-shot certificates. Few-shot NCC test error versus the number of shots per class ($m$). Our finite-$m$ certificates are highly informative at practical shot counts and closely track the observed few-shot error, dropping well below the random chance threshold.

MAE (ViT-B/16) MAE Bounds C=2
CLIP (ViT-B/16) CLIP Bounds C=2
I-JEPA (ViT-B/16) I-JEPA Bounds C=2
SigLIP (ViT-B/16) SigLIP Bounds C=2
SimCLR (ResNet-50) SimCLR Bounds C=2
VICReg (ResNet-50) VICReg Bounds C=2

3 · Multitask Orthogonalization

A single anisotropic representation can support many different downstream tasks by naturally assigning them nearly orthogonal decision spaces.

If directional CDNV is small for multiple independent tasks (for example, color and shape classification), the underlying geometry mathematically forces the decision axes for those tasks to be nearly orthogonal, as shown in the illustration.

Orthogonal Subspaces Diagram

We verified this on synthetic data with distinct visual factors (color, shape, size, style). As training progresses, the cosine similarity between decision axes for different labelings rapidly decays to zero. The model does not collapse to a single discriminative direction; it allocates orthogonal subspaces to different semantic tasks.
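The effect is easy to mimic with synthetic embeddings in which two independent factors live on separate coordinates: the decision axis recovered for each labeling comes out nearly orthogonal to the other. This is a hand-built illustration, not the paper's experiment, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
color = rng.integers(0, 2, n)  # task A labels (binary)
shape = rng.integers(0, 2, n)  # task B labels (binary, independent of A)
# Embeddings encode each factor on its own axis, plus isotropic noise.
emb = np.column_stack([2.0 * color, 2.0 * shape]) + rng.normal(0, 0.2, (n, 2))

def decision_axis(emb: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Unit vector between the two class means for a binary labeling."""
    diff = emb[labels == 1].mean(0) - emb[labels == 0].mean(0)
    return diff / np.linalg.norm(diff)

u_color = decision_axis(emb, color)
u_shape = decision_axis(emb, shape)
print(abs(u_color @ u_shape))  # near 0: the two tasks occupy orthogonal axes
```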

SimCLR (ResNet-18) SimCLR Multitask Orthogonalization Plot
VICReg (ResNet-50) VICReg Multitask Orthogonalization Plot

Final Remarks: Why Anisotropy Is a Feature, Not a Bug

Let's zoom in on the multitask orthogonality experiment. In supervised learning, achieving perfect Neural Collapse (low global CDNV) means that all variance is collapsed. The representations become incredibly rigid and specialized for the exact task they were trained on. But if we collapse everything, we destroy the feature variance needed for other tasks. A perfectly collapsed model cannot easily generalize to new, unseen labelings!

This is exactly why SSL is arguably a superior pretraining method. By collapsing only along the necessary decision axes, SSL naturally preserves independent features (like color, shape, or size) in orthogonal subspaces. Anisotropy is not a flaw in SSL; it is precisely what enables strong multitask transfer.

BibTeX

      @misc{2026directionalcollapse,
          title={Directional Neural Collapse in Self-Supervised Visual Representation Learning},
          author={Achleshwar Luthra and Yash Salunkhe and Tomer Galanti},
          year={2026},
          publisher={arXiv},
      }