Self‑Supervised Contrastive Learning
is Approximately Supervised Contrastive Learning
Overview
Self-supervised contrastive learning (CL) is everywhere these days. It’s become a go-to tool for learning useful representations without needing labeled data — and it works remarkably well. Models trained with CL often show properties that look a lot like those of supervised models: tight clusters, clean separation between classes, and features that transfer across tasks. But despite how widely it’s used, we still don’t fully understand why it works so well. After all, these models are trained without any labels at all.

Figure 1: Self-supervised CL (top) forms semantic clusters without label supervision, while supervised CL (bottom) yields tighter, more separable clusters.
This behavior raises the following question: why does CL, trained without any labels, produce representations that look so much like those of supervised models?
To address this question, we study the relationship between CL and supervised contrastive learning (SCL). We show that the popular CL objective implicitly optimizes a supervised contrastive loss.
Key Observations
1 · CL ≈ NSCL
Our first result shows that the global contrastive loss (NT-Xent) used in SSL is closely related to a supervised loss, which we refer to as the Negatives-only Supervised Contrastive Loss (NSCL). Specifically, the gap between the two losses is at most $\log\left(1 + \frac{e^2}{C - 1}\right) \leq \frac{e^2}{C - 1} = O(1/C)$, as shown below.
$$0 \leq \left| \textcolor{blue}{\mathcal{L}^{\text{CL}}} - \textcolor{red}{\mathcal{L}^{\text{NSCL}}} \right| \leq \log\left(1 + \frac{e^2}{C - 1}\right)$$
Theorem (1)
- $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity
- $C$: Total number of semantic classes
- $N$: Total number of training samples
- $K$: Total number of augmented versions of each sample
- $z^k_i = f(\alpha_k(x_i))$, where $x_i$ is an input image and $\alpha_k$ is its $k^{\text{th}}$ augmentation.
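To make the relationship concrete, here is a minimal PyTorch sketch (our own, not the paper's reference implementation) of the two losses on a single batch. The batch layout (rows $2i$ and $2i{+}1$ are two views of the same image), the temperature `tau`, and the exact normalization are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cl_and_nscl_losses(z, labels, tau=0.5):
    """Compare the self-supervised CL (NT-Xent-style) loss with the
    negatives-only supervised contrastive loss (NSCL) on one batch.

    z:      (B, d) embeddings; rows 2i and 2i+1 are assumed to be two
            augmented views of the same image (so B is even).
    labels: (B,) class labels; used only by NSCL, never by CL.
    tau:    temperature (illustrative value).
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                      # pairwise cosine similarities
    B = z.size(0)
    idx = torch.arange(B, device=z.device)
    eye = torch.eye(B, dtype=torch.bool, device=z.device)

    pos_idx = idx ^ 1                          # the other view: 0<->1, 2<->3, ...
    pos = sim[idx, pos_idx]                    # similarity to the positive

    # CL: denominator ranges over every other sample in the batch.
    cl_logits = sim.masked_fill(eye, float('-inf'))
    cl_loss = (torch.logsumexp(cl_logits, dim=1) - pos).mean()

    # NSCL: additionally drop same-class samples (other than the positive view)
    # from the denominator -- this is the only place labels enter.
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    drop = same_class & ~eye
    drop[idx, pos_idx] = False                 # keep the designated positive
    nscl_logits = sim.masked_fill(eye | drop, float('-inf'))
    nscl_loss = (torch.logsumexp(nscl_logits, dim=1) - pos).mean()

    return cl_loss, nscl_loss
```

In words, the only difference between the two objectives is the denominator: NSCL removes same-class samples (other than the designated positive), and Thm. (1) bounds the effect of that removal.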
To validate Thm. (1), we track both losses during training on CIFAR-100 and mini‑ImageNet for 2k epochs. Throughout training, the empirical gap between the two losses stays small and CL remains tightly bounded by NSCL $+ \log\left(1 + \frac{e^2}{C - 1}\right)$.

Figure 2: Validating Thm. (1) as a function of training epochs.
To further verify Thm. (1), we randomly sample $C$ classes from each dataset and train a new SSL model on the subset corresponding to those classes. We observe that the empirical gap between the two losses decays rapidly as the number of classes increases and is highly correlated with $\log\left(1 + \frac{e^2}{C - 1}\right)$.

Figure 3: Validating Thm. (1) as a function of semantic classes ($C$).
Great! We have established that the CL objective is approximately equal to NSCL, but ...
2 · Why care about NSCL?
NSCL minimizers yield collapse + simplex ETF geometry. At the global minimum of the NSCL loss, representations exhibit the following properties (a numerical check is sketched below):
- (i) Augmentation Collapse: All augmented views of a sample map to the same point.
- (ii) Within-Class Collapse: All samples from the same class share an identical embedding.
- (iii) Simplex ETF Geometry: Class means form a symmetric equiangular tight frame in feature space.
Theorem (2)
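These properties are easy to probe empirically. Below is a minimal NumPy sketch (our own, with made-up inputs rather than the paper's models) that reports the average within-class variance, which is zero under within-class collapse, and the pairwise cosines of the centered class means, which all equal $-1/(C-1)$ under a simplex ETF. Augmentation collapse can be checked the same way by grouping the views of each image instead of the classes.

```python
import numpy as np

def collapse_and_etf_stats(features, labels):
    """features: (N, d) embeddings, labels: (N,) integer class ids.

    Returns the average within-class variance (zero under within-class
    collapse) and the pairwise cosines between centered class means
    (all equal to -1/(C-1) under a simplex ETF).
    """
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    within_var = np.mean([features[labels == c].var(axis=0).sum() for c in classes])

    centered = means - means.mean(axis=0)                 # center the class means
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    cos = centered @ centered.T                           # pairwise cosines
    off_diag = cos[~np.eye(len(classes), dtype=bool)]
    return within_var, off_diag

# Example with random (non-collapsed) features: the cosines will be far from -1/(C-1).
rng = np.random.default_rng(0)
feats, labs = rng.normal(size=(500, 128)), rng.integers(0, 10, size=500)
var, cosines = collapse_and_etf_stats(feats, labs)
print(var, cosines.mean(), -1 / (10 - 1))
```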
To evaluate the quality of representations learned with the $\mathrm{NSCL}$ objective, we report the nearest class-center classifier (NCCC) error and linear-probing error on few-shot classification tasks (a sketch of the NCCC evaluation is given below). 1-shot linear-probing accuracy exceeds 70% on CIFAR-100 and mini‑ImageNet, while 1-shot NCCC accuracy exceeds 90% on CIFAR-10.

Figure 4: Downstream performance of a model $f$ trained with the NSCL objective on few-shot classification tasks.
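For concreteness, here is a minimal NumPy sketch of the NCCC few-shot evaluation: estimate each class center from the $m$ labeled support samples and assign every query to the nearest center. The array shapes and variable names are our own; the paper's evaluation pipeline may differ in its details.

```python
import numpy as np

def nccc_error(support_x, support_y, query_x, query_y):
    """Nearest class-center classification error.

    support_x: (m*C, d) embeddings of the labeled few-shot samples
    support_y: (m*C,)   their labels
    query_x:   (Q, d)   embeddings of the test samples
    query_y:   (Q,)     their labels
    """
    classes = np.unique(support_y)
    centers = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    # squared distance from each query to each class center
    d2 = ((query_x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[d2.argmin(axis=1)]
    return (preds != query_y).mean()
```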
In short, NSCL-training performs well!
Theorem 2 implies that NSCL yields perfectly clustered representations,
which in turn allow us to easily infer the labels from the learned model.
This raises the following question: how accurately can class labels be recovered from the learned representations?
Estimating the downstream error of NSCL-trained models is fairly simple.
Suppose we have a downstream task with two classes (e.g., cats and dogs) with class-conditional distributions $D_1$ and $D_2$,
and a pre-trained model $f$. We measure the downstream performance $\mathrm{err}_{m,D}(f)$ as the test error
of a linear probe trained on top of $f$ using $m$ random samples per class, as sketched below.
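A minimal sketch of this linear-probing protocol, assuming features have already been extracted with a frozen $f$; the scikit-learn probe and its settings are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_error(train_x, train_y, test_x, test_y):
    """Train a linear classifier on frozen features and report its test error."""
    probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)
    return 1.0 - probe.score(test_x, test_y)

def sample_m_per_class(x, y, m, rng):
    """m-shot setup: draw m labeled examples per class for the probe's training set."""
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=m, replace=False)
                          for c in np.unique(y)])
    return x[idx], y[idx]
```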
One approach to bounding this error is to use the class-distance-normalized variance (CDNV).
For a model $f$, and two class-conditional distributions $D_1$ and $D_2$, the CDNV is defined as:
$$V_{f}(D_1, D_2) = \frac{\text{Intra-class variance}}{\left(\text{Inter-class distance}\right)^2}$$

Intra-class variance:
Intra-class variance quantifies how tightly the samples are packed around their class-center. Smaller intra-class variance implies more compact representations of a given class.
Inter-class distance:
Inter-class distance measures how far apart the class-centers are in the embedding space.
What CDNV captures:
CDNV balances these two forces—compactness vs. separation—and is minimized when clusters are tight and well-separated.
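Concretely, here is a minimal NumPy sketch of the CDNV between two classes, following the definitions above; the variable names are ours, and a symmetrized variant that averages the two classes' variances is also common.

```python
import numpy as np

def cdnv(feats_1, feats_2):
    """Class-distance-normalized variance between two classes.

    feats_1, feats_2: (n_i, d) arrays of embeddings from class 1 and class 2.
    """
    mu1, mu2 = feats_1.mean(axis=0), feats_2.mean(axis=0)
    # intra-class variance: mean squared distance of class-1 samples to their center
    var1 = ((feats_1 - mu1) ** 2).sum(axis=1).mean()
    # squared inter-class distance between the two centers
    dist2 = ((mu1 - mu2) ** 2).sum()
    return var1 / dist2
```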
A low CDNV value is associated with better downstream performance. For instance, the few-shot classification error $\mathrm{err}_{m,D}(f)$ of a probe trained on a data distribution $D$ with $m$ labeled samples per class can be bounded as follows:
$$ \mathrm{err}_{m,D}(f) \lesssim (1 + \tfrac{1}{m}) \, V_{f}(D_1,D_2) $$
Eqn. (1)
Eqn. (1) explains the strong few-shot performance of NSCL (shown in Figure 4): Thm. (2) implies that NSCL minimizes the CDNV term, which in turn drives down the few-shot classification error.
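To get a feel for the numbers, suppose (hypothetically) that a model attains a CDNV of $0.05$ for a pair of classes. Eqn. (1) then gives, for 1-shot probing,
$$\mathrm{err}_{1,D}(f) \lesssim \left(1 + \tfrac{1}{1}\right)\cdot 0.05 = 0.1,$$
i.e., roughly 90% accuracy, and the bound tightens toward $0.05$ as $m$ grows.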
But wait, is CDNV good enough to explain transferability in SSL? Not quite. CDNV works well if you assume near-perfect clustering, with each class forming a tight, spherical blob. But that's a pretty strong assumption, especially in SSL where no labels guide the geometry. So instead, we introduce a weaker notion of clustering: the directional class-distance-normalized variance (dir-CDNV).
For a model $f$, and two class-conditional distributions $D_1$ and $D_2$, the dir-CDNV is defined as: $$\tilde{V}_f(D_1,D_2) = \frac{\text{Projected variance}}{\left(\text{Inter-class distance}\right)^2}$$

Projected variance:
Projected variance (or directional variance) measures variance only along the direction between the class centers. Instead of using the full isotropic distance from a point to its class center, we project each point onto the unit direction connecting the class centers.
What dir-CDNV captures:
dir-CDNV captures how spread out the points are in the discriminative direction.
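A minimal NumPy sketch of the dir-CDNV, mirroring the CDNV sketch above (again with our own variable names):

```python
import numpy as np

def dir_cdnv(feats_1, feats_2):
    """Directional CDNV: variance of class 1 along the unit direction
    between the two class centers, normalized by the squared distance
    between the centers."""
    mu1, mu2 = feats_1.mean(axis=0), feats_2.mean(axis=0)
    diff = mu1 - mu2
    u = diff / np.linalg.norm(diff)          # unit direction between the centers
    proj = (feats_1 - mu1) @ u               # signed projections onto that direction
    return (proj ** 2).mean() / (diff ** 2).sum()
```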
We analyze both variance terms for CL and NSCL models on CIFAR-100 and mini‑ImageNet. During training, the CDNV decreases significantly for NSCL but remains relatively high for CL. However, the dir-CDNV decreases for both CL and NSCL models, indicating that both learn to reduce the spread of points in the discriminative direction.

Figure 5: Analysis of CDNV and dir-CDNV during training.
So why go through all this trouble defining the directional CDNV? Because, as we show in the next section, it is the term that actually controls how well we can recover labels from representations.
3 · Recoverability of Labels
Our key insight is that few-shot classification error is governed by two geometric variance terms, with the dir‑CDNV playing the dominant role.
Formally,
$$ \mathrm{err}_{m,D}(f) \lesssim \textcolor{purple}{\tilde{V}_f(D_1,D_2)} + \tfrac{1}{m}\,\textcolor{orange}{V_{f}(D_1,D_2)} $$
Proposition (1)
As the number of labeled samples per class $m$ increases, the influence of the standard CDNV diminishes, leaving the dir‑CDNV as the main driver of performance.
Low dir‑CDNV ⇒ strong transfer from unlabeled to labeled data.
Definitions of the variance terms used in Proposition (1):
$$V_f(D_1, D_2) = \frac{\sigma_1^2}{\|\mu_1 - \mu_2\|^2}$$
$$\sigma_{1}^{2} = \operatorname{Var}_{x \sim D_1} [ f(x) ]$$
$$\tilde{V}_f(D_1, D_2) = \frac{\sigma_{12}^2}{\|\mu_1 - \mu_2\|^2}$$
$$\sigma_{12}^{2} = \operatorname{Var}_{x \sim D_1} [ \langle f(x) - \mu_1, u_{12} \rangle ]$$
$$u_{12} = \frac{\mu_1 - \mu_2}{\|\mu_1 - \mu_2\|}$$
Other terms:
- $D_i$: Class-conditional distribution for class $i$
- $m$: Number of labeled samples per class (i.e., number of shots)
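Putting the definitions together, here is a minimal sketch (ours) that estimates the right-hand side of Proposition (1) for a pair of classes directly from sample embeddings; as the commented loop illustrates, the $\tfrac{1}{m}$ term fades and the directional term dominates for large $m$.

```python
import numpy as np

def prop1_bound(feats_1, feats_2, m):
    """Estimate the Proposition (1) bound dir-CDNV + CDNV / m from samples.

    feats_1, feats_2: (n_i, d) embeddings of the two classes; m: shots per class.
    """
    mu1, mu2 = feats_1.mean(axis=0), feats_2.mean(axis=0)
    diff = mu1 - mu2
    dist2 = (diff ** 2).sum()                                # squared inter-class distance
    centered = feats_1 - mu1
    cdnv = (centered ** 2).sum(axis=1).mean() / dist2        # full variance term
    proj = centered @ (diff / np.sqrt(dist2))
    dir_cdnv = (proj ** 2).mean() / dist2                    # directional variance term
    return dir_cdnv + cdnv / m

# for m in (1, 5, 25, 100):
#     print(m, prop1_bound(f1, f2, m))   # f1, f2: embeddings of two classes
```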
We validate Proposition (1) by tracking the few-shot performance of linear probes and NCCC on CIFAR-10, CIFAR-100, and mini‑ImageNet. The error bound goes down to 0.25 for CIFAR-10 at very large $m$ (shown in red below).
Final Remarks
Our work takes several steps toward a better understanding of self-supervised learning, both theoretically and empirically. We provide a principled explanation for CL’s success in few-shot learning and identify the geometric structure that underlies high-performing representations.
Interested in the details?
BibTeX
@misc{luthra2025selfsupervisedcontrastivelearningapproximately,
  title={Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning},
  author={Achleshwar Luthra and Tianbao Yang and Tomer Galanti},
  year={2025},
  url={https://arxiv.org/abs/2506.04411},
}