On the Alignment Between Supervised and Self-Supervised Contrastive Learning

Achleshwar Luthra · Priyadarsi Mishra · Tomer Galanti

Texas A&M University

Paper under review


Overview

Self-supervised contrastive learning (CL) has achieved remarkable success, producing representations that rival supervised pre-training. Recent work has shown that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL), as the number of classes grows. But this similarity at the level of the loss leaves a crucial question unresolved.

Do contrastive and supervised models remain aligned throughout training, not just at the level of their objectives?

Understanding this dynamic relationship is crucial. A close alignment throughout training would suggest that CL's learning process inherently mimics a supervised signal, providing a stronger foundation for its empirical success and the transferability of its learned features. In this study, we analyze the alignment between CL and NSCL and find a notable divergence in their parameter-space trajectories. Despite this, we demonstrate that their learned representations remain remarkably aligned. We provide theoretical guarantees for this representation alignment and characterize how it depends on the number of classes, the temperature, and the batch size.

Methodology

To investigate representation alignment, we conduct a controlled study focusing on the self-supervised contrastive loss, $L^{CL}$, and its supervised counterpart, $L^{NSCL}$. We train two models under identical conditions—shared initialization, mini-batches, augmentations, and hyperparameters—to isolate the effect of the loss function itself.
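As an illustrative sketch of this controlled setup (not the authors' released code; `make_model` and the batch format are hypothetical), the key ingredients are reseeding before each model's construction so both start from identical weights, and feeding both models the exact same augmented mini-batches:

```python
import torch

def make_shared_pair(make_model, seed=0):
    """Instantiate the CL and NSCL models with identical initial weights
    by reseeding before each construction."""
    torch.manual_seed(seed)
    model_cl = make_model()
    torch.manual_seed(seed)  # same seed -> same initialization
    model_nscl = make_model()
    return model_cl, model_nscl

def train_step(model, opt, views, loss_fn, *extra):
    """One optimizer update on K shared augmented views of shape (K, N, ...).
    Both models receive exactly the same `views`, so the only difference
    between the two runs is the loss function itself."""
    opt.zero_grad()
    z = torch.stack([model(v) for v in views])  # (K, N, d) embeddings
    loss = loss_fn(z, *extra)  # CL: loss_fn(z); NSCL: loss_fn(z, labels)
    loss.backward()
    opt.step()
    return loss.item()
```

With this scaffolding, any residual difference between the two trained models is attributable to the objective, not to randomness.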

Self-Supervised Contrastive Loss (CL)

$$\small \mathcal{L}^{\mathrm{CL}}(f) = -\frac{1}{K^2N}\sum^{K}_{l_1,l_2=1}\sum_{i=1}^N \log \left( \frac{\exp(\mathrm{sim}(z^{l_1}_i, z^{l_2}_i)/\tau)}{ \sum^{K}_{l_3=1} \textcolor{red}{\sum_{j\in [N]\setminus \{i\}}} \exp (\mathrm{sim}(z^{l_1}_i, z^{l_3}_j)/\tau) } \right) $$
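A direct, unoptimized PyTorch transcription of this loss may help fix the indexing (a sketch, not the authors' implementation; `z` is assumed to hold the K views stacked as a `(K, N, d)` tensor):

```python
import torch
import torch.nn.functional as F

def cl_loss(z, tau=0.5):
    """Self-supervised contrastive loss over K views of N samples.
    z: (K, N, d). Positives are views of the same index i; the
    denominator sums over all views of every other index j != i."""
    K, N, _ = z.shape
    z = F.normalize(z, dim=-1)
    # sim[l1, i, l3, j] = cos(z^{l1}_i, z^{l3}_j) / tau
    sim = torch.einsum('kid,ljd->kilj', z, z) / tau
    neg = ~torch.eye(N, dtype=torch.bool, device=z.device)       # j != i
    denom = (sim.exp() * neg[None, :, None, :]).sum(dim=(2, 3))  # (K, N)
    idx = torch.arange(N, device=z.device)
    total = 0.0
    for l1 in range(K):
        for l2 in range(K):
            # log of the ratio: sim(z^{l1}_i, z^{l2}_i) - log(denominator)
            total = total - (sim[l1, idx, l2, idx] - denom[l1].log()).sum()
    return total / (K * K * N)
```

Note that the denominator depends only on the anchor view $l_1$ and index $i$, which is why it can be computed once per $(l_1, i)$ and reused across all $l_2$.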

Negatives-Only Supervised Contrastive Loss (NSCL)

$$\small \mathcal{L}^{\mathrm{NSCL}}(f) = -\frac{1}{K^2N}\sum^{K}_{l_1,l_2=1}\sum_{i=1}^N \log \left( \frac{\exp(\mathrm{sim}(z^{l_1}_i, z^{l_2}_i)/\tau)}{ \sum^{K}_{l_3=1} \textcolor{blue}{\sum_{j: y_j \neq y_i}} \exp (\mathrm{sim}(z^{l_1}_i, z^{l_3}_j)/\tau) } \right) $$
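The NSCL loss differs from the self-supervised version only in its denominator mask: instead of excluding the single index $i$, it excludes every sample sharing $i$'s label. A sketch under the same `(K, N, d)` tensor convention (again hypothetical naming, not the authors' code):

```python
import torch
import torch.nn.functional as F

def nscl_loss(z, y, tau=0.5):
    """Negatives-only supervised contrastive loss. Same numerator as CL;
    the denominator keeps only samples from other classes (y_j != y_i).
    Assumes every sample has at least one class negative in the batch,
    otherwise the log of an empty sum diverges."""
    K, N, _ = z.shape
    z = F.normalize(z, dim=-1)
    sim = torch.einsum('kid,ljd->kilj', z, z) / tau              # (K, N, K, N)
    neg = y[:, None] != y[None, :]                               # (N, N) class negatives
    denom = (sim.exp() * neg[None, :, None, :]).sum(dim=(2, 3))  # (K, N)
    idx = torch.arange(N, device=z.device)
    total = 0.0
    for l1 in range(K):
        for l2 in range(K):
            total = total - (sim[l1, idx, l2, idx] - denom[l1].log()).sum()
    return total / (K * K * N)
```

As the number of classes grows, a random batch contains ever fewer same-class collisions, so the two masks, and hence the two losses, converge.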

Key Observations

1 · Divergence in Weights vs. Alignment in Representations

Our central finding is a tale of two spaces. When trained with shared randomness (initialization, batches, augmentations), CL and NSCL models take different paths in parameter space, leading to a significant weight gap. However, the representational geometry they induce remains remarkably similar.

(a) Weight Space: model parameters diverge significantly as training progresses.
(b) Representation Space: learned features remain closely aligned, maintaining high similarity.

Corresponds to Figure 1 in the paper. See Appendix A for additional details.

The divergence in weight space suggests that directly comparing model parameters isn't the best approach. What truly matters for downstream performance is the geometry of the learned representations. To better quantify this, we analyze the alignment in "similarity space". Instead of tracking millions of parameters, we track the $N \times N$ pairwise similarity matrix, $\Sigma$, whose entries $\Sigma_{ij}$ represent the cosine similarity between the embeddings of two inputs, $x_i$ and $x_j$. This perspective allows us to directly measure how the geometric structure of the representation space evolves via the following bound.
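Computing this similarity-space quantity is straightforward; a minimal sketch (function names are ours) of $\Sigma$ and of the Frobenius gap that the bound controls:

```python
import torch
import torch.nn.functional as F

def similarity_matrix(emb):
    """N x N matrix Sigma with Sigma_ij = cosine similarity between the
    embeddings of inputs x_i and x_j; emb has shape (N, d)."""
    z = F.normalize(emb, dim=-1)
    return z @ z.T

def similarity_gap(emb_a, emb_b):
    """Frobenius distance ||Sigma_a - Sigma_b||_F between the similarity
    matrices induced by two models' embeddings of the same N inputs."""
    diff = similarity_matrix(emb_a) - similarity_matrix(emb_b)
    return torch.linalg.matrix_norm(diff, ord='fro')
```

Tracking this single $N \times N$ object over training is far cheaper than comparing millions of parameters, and it is invariant to rotations of the embedding space that leave the geometry unchanged.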

$$||\Sigma_{T}^{\textcolor{red}{\text{CL}}}-\Sigma_{T}^{\textcolor{blue}{\text{NSCL}}}||_{F} \le \exp\left(\frac{1}{2\tau^{2}B}\sum_{t=0}^{T-1}\eta_{t}\right)\frac{1}{\tau\sqrt{B}}\left(\sum_{t=0}^{T-1}\eta_{t}\right) \cdot \mathcal{O}\left(\frac{e^{2/\tau}}{C}\right)$$

Theorem (1)

Note on Metrics: Theorem (1) provides a direct bound on the difference between the two similarity matrices. In our experiments, we use standard, widely accepted metrics—Centered Kernel Alignment (CKA) and Representational Similarity Analysis (RSA)—to empirically measure alignment. These metrics are fundamentally dependent on the underlying similarity matrices; a small distance between the matrices directly implies high CKA and RSA scores. For the formal derivation connecting our bound to these specific metrics, we refer the interested reader to Corollaries 1 and 2 in our paper.
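For concreteness, here is one standard way to compute these two metrics (a sketch: linear CKA on column-centered features, and RSA as the Spearman rank correlation of off-diagonal similarity entries, which is one common convention):

```python
import numpy as np
from scipy.stats import spearmanr

def linear_cka(X, Y):
    """Linear CKA between feature matrices X, Y of shape (N, d).
    Center the columns, then compare the induced Gram structures."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return cross / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

def rsa_score(sim_a, sim_b):
    """RSA as the rank correlation of off-diagonal entries of two
    N x N similarity matrices."""
    off = ~np.eye(sim_a.shape[0], dtype=bool)
    rho, _ = spearmanr(sim_a[off], sim_b[off])
    return rho
```

Both scores are bounded by 1, attained when the two representations induce the same similarity structure, which is why a small Frobenius gap between the matrices translates into scores near 1.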

2 · Factors Influencing Alignment

Our theory predicts how alignment should behave under different conditions. We empirically confirm these predictions across multiple datasets by varying key hyperparameters.

Finding: Alignment gets stronger as the number of classes increases. This holds consistently across different datasets.

[Figure: CKA heatmaps for the class-count experiment on Mini-ImageNet, Tiny-ImageNet, and ImageNet-1k, with train and test panels for each dataset.]

Corresponds to Figure 3 in the paper. The heatmaps show the linear CKA between CL and NSCL models on both train (top, green) and test (bottom, purple) datasets.

So far, we've demonstrated that the alignment between CL and NSCL is not just theoretical but holds up empirically, influenced by factors like class count, temperature, and batch size. But this raises a natural question: why focus specifically on NSCL as the supervised benchmark? Is it truly the best proxy for understanding self-supervised CL, compared to other supervised methods?

How does the CL-NSCL alignment compare to other supervised objectives?

3 · NSCL is the Best Supervised Proxy

We find that NSCL is a much better proxy for CL than both standard Supervised Contrastive Learning (SCL) and Cross-Entropy (CE). This positions NSCL as a principled bridge for understanding self-supervised learning.

[Figure: RSA and CKA curves for the temperature experiment on CIFAR-100, Tiny-ImageNet, and Mini-ImageNet.]

Corresponds to Figure 2 in the paper. We train CL, NSCL, SCL and CE models with ResNet-50 on CIFAR-100, Tiny-ImageNet and Mini-ImageNet datasets. The plots show RSA (top) and CKA (bottom) between CL and the three supervised objectives. NSCL consistently achieves the highest alignment with CL.

Final Remarks

Our results highlight that the implicit supervised signal in CL is not confined to its loss function but extends throughout the entire optimization trajectory. By showing that CL and NSCL representations co-evolve in a stable and coupled manner, we provide a stronger theoretical bridge between supervised and self-supervised learning.

BibTeX

@misc{clnscl2025alignment,
  title={On the Alignment Between Supervised and Self-Supervised Contrastive Learning},
  author={Achleshwar Luthra and Priyadarsi Mishra and Tomer Galanti},
  year={2025},
  publisher={arXiv},
}