Low-Rank Bias · SGD · Compression

The Secret Low-Rank Bias of Regularized SGD

Why do trained neural networks often end up low rank? Mini-batch SGD and weight decay together create a built-in pressure toward compressible layers. The mechanism is simple, general, and explains why post-training compression works so well.

By Tomer Galanti · March 11, 2026 · 12 min read · Galanti et al., CPAL 2025

Introduction

Deep networks are heavily overparameterized, yet the solutions found in practice are far from arbitrary. Even when many parameter settings can fit the training data, stochastic gradient methods often converge to highly structured models. One particularly striking form of structure is low rank: across many architectures, trained weight matrices are far more compressible than their full dimension would suggest.

Much of the existing theory explains low-rank behavior only in cleaner settings — linear models, specialized losses, exact symmetries, or global optimality arguments. What is still missing is a structural explanation for the regime practitioners actually use: training practical neural networks with mini-batch SGD and weight decay.

The central claim: Mini-batch SGD with weight decay creates a strong built-in pressure toward low-rank layers. This pressure becomes stronger with smaller batch size $B$, larger learning rate $\mu$, and stronger weight decay $\lambda$.

The mechanism operates in three steps:

I. Low-rank updates. A single-example gradient is rank 1. A mini-batch gradient has rank at most $B$. Every SGD step writes only a low-rank correction to the weight matrix.

II. Finite memory. Weight decay exponentially suppresses old updates. Only a short window of past gradients contributes meaningfully to the current weight matrix.

III. Low-rank layers. The current matrix is dominated by a short history of low-rank corrections, yielding an effective rank of roughly $B / (\mu\lambda)$.
Based on: T. Galanti, Z. Siegel, A. Gupte, T. Poggio. “SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network”, CPAL 2025.

Interactive explorer

Here is a live simulation of the mechanism. An $8 \times 8$ weight matrix is built step by step from rank-$B$ stochastic gradient updates under weight decay. Adjust the sliders to see how batch size, learning rate, and weight decay each affect the resulting rank structure.

[Interactive simulator — SGD + weight decay builds low-rank weight matrices. Panels: weight matrix $W_T$, singular values, effective rank, decay factor, memory window, step counter $t$, gradient memory timeline.]

Fig. 1. Gradient slabs enter from the right; weight decay fades older contributions. The singular value bars expose the resulting rank structure. Try $B=1$, $\mu=0.10$, $\lambda=0.10$ for extreme low rank, or $B=8$, $\mu=0.02$, $\lambda=0.01$ for higher rank.
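If you prefer code to sliders, here is a minimal NumPy sketch of the same toy model (the helper names and parameter values are illustrative, not taken from the paper). The gradients are random rank-$B$ matrices rather than real backprop signals; in real training, where gradient directions are correlated across steps, the effect is typically stronger.

```python
import numpy as np

def effective_rank(W, eps=0.05):
    """Count singular values above eps times the largest one."""
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

def simulate(d=8, B=1, mu=0.10, lam=0.10, steps=2000, seed=0):
    """Iterate W <- (1 - 2*mu*lam) W - mu*G with random rank-B 'gradients' G."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, d))
    for _ in range(steps):
        delta = rng.standard_normal((d, B))      # stand-ins for the backprop signals delta_i
        feats = rng.standard_normal((d, B))      # stand-ins for the layer inputs f_i
        G = delta @ feats.T / B                  # mini-batch gradient, rank <= B
        W = (1 - 2 * mu * lam) * W - mu * G
    return W

W = simulate(B=1, mu=0.10, lam=0.10)
print(np.round(np.linalg.svd(W, compute_uv=False), 3))   # the "singular values" panel
print(effective_rank(W))                                  # the "effective rank" panel
```

Raising $\mu\lambda$ or lowering $B$ shortens the memory window and should depress the tail of the printed singular values.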


Part I — Stochastic gradients are low rank

A local view of one layer

Fix all parameters except one trainable matrix $W$. Locally around that layer, the network can be written as $h(x) = g(Wf(x))$, where $f(x)$ is the representation entering the layer and $g$ collects everything afterward. Under mini-batch SGD with weight decay, the update is:

$$W_{t+1} = (1 - 2\mu\lambda)\,W_t \;-\; \mu\, G_t$$

The previous matrix is shrunk by the factor $1 - 2\mu\lambda$, and a fresh stochastic gradient $G_t$ is added. The key question is: what kind of matrix is $G_t$?

One example gives a rank-1 gradient

For a single training example $x$, the chain rule gives an outer product — the simplest possible matrix structure:

Single-example gradient
$$\nabla_W \ell(h(x)) = \delta(x)\, f(x)^\top \qquad \Rightarrow \qquad \mathrm{rank} = 1$$

where $\delta(x) := J_g(Wf(x))^\top \nabla_h \ell(h(x))$. One left direction times one right direction — and nothing else.

A mini-batch gives rank at most $B$

The mini-batch gradient averages $B$ rank-1 terms:

$$G_t = \frac{1}{B}\sum_{i=1}^B \delta_i\, f_i^\top, \qquad \text{rank}(G_t) \le \min(d_{\text{out}},\, d_{\text{in}},\, B)$$
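As a quick sanity check of this bound, here is a small PyTorch sketch: a toy network with a designated middle matrix $W$, a mini-batch of $B$ examples, and a look at the rank of the resulting gradient. The layer sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
B, d_in, d, d_out = 4, 32, 64, 16

f = torch.nn.Linear(d_in, d)                    # everything before the layer: x -> f(x)
W = torch.nn.Linear(d, d, bias=False)           # the trainable matrix we inspect
g = torch.nn.Linear(d, d_out)                   # everything after the layer

x, y = torch.randn(B, d_in), torch.randn(B, d_out)
loss = torch.nn.functional.mse_loss(g(torch.tanh(W(torch.tanh(f(x))))), y)
loss.backward()

# The gradient w.r.t. W is an average of B outer products delta_i f_i^T,
# so its rank cannot exceed B (elementwise nonlinearities are absorbed into delta_i).
print(W.weight.grad.shape)                      # torch.Size([64, 64])
print(torch.linalg.matrix_rank(W.weight.grad))  # at most B = 4 (typically exactly 4 here)
```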

Every SGD step writes only a rank-$B$ correction. The diagram below illustrates what this looks like for a layer of dimension 8.

[Diagram: $(1-2\mu\lambda)\,W_t$ (the shrunk 8 × 8 current matrix) minus $\mu G_t$ (a rank-$\le B$ gradient step) gives $W_{t+1}$. Each update writes only $B$ new directions into the matrix. Alone, this is not enough: repeated rank-$B$ additions could eventually fill the matrix. Weight decay is the missing piece.]

Fig. 2 — Each SGD step writes a rank-$B$ correction. With $B=2$ on an 8×8 matrix, each step adds at most 2 new singular directions out of the 8 the matrix can hold.

Low-rank updates alone are not enough — if we accumulate them forever with no forgetting, their sum can eventually become full rank. The second ingredient is weight decay.


Part II — Weight decay limits memory

Unrolling the recursion $W_{t+1} = (1 - 2\mu\lambda)\,W_t - \mu\, G_t$ for $n$ steps gives the identity at the heart of the argument:

$$W_T = \underbrace{(1 - 2\mu\lambda)^n\, W_{T-n}}_{\text{decayed past}} \;-\; \mu \underbrace{\sum_{j=1}^{n}(1 - 2\mu\lambda)^{j-1}\, G_{T-j}}_{\text{recent low-rank updates}}$$

The first term shrinks exponentially in $n$. The second is a weighted sum of recent mini-batch gradients, each of rank at most $B$. After enough training, the current matrix is well approximated by a short moving memory of low-rank corrections.
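The identity is easy to verify numerically. Below is a short NumPy check, again with random rank-$B$ matrices standing in for real gradients: run the recursion forward for $n$ steps, rebuild the final matrix from the unrolled formula, and compare.

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, mu, lam, n = 8, 2, 0.05, 0.1, 50
q = 1 - 2 * mu * lam

W_past = rng.standard_normal((d, d))            # plays the role of W_{T-n}
W, grads = W_past.copy(), []

# Run n SGD + weight-decay steps with random rank-B "gradients".
for _ in range(n):
    G = rng.standard_normal((d, B)) @ rng.standard_normal((B, d)) / B
    grads.append(G)
    W = q * W - mu * G

# Rebuild W_T from the unrolled identity:
#   W_T = q^n W_{T-n} - mu * sum_{j=1..n} q^(j-1) G_{T-j},  with G_{T-j} = grads[n - j]
W_unrolled = q**n * W_past - mu * sum(q**(j - 1) * grads[n - j] for j in range(1, n + 1))

print(np.max(np.abs(W - W_unrolled)))           # essentially zero (floating-point error)
```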

[Diagram: gradient memory timeline. The current gradient $G_T$ enters at full strength; older gradients $G_{T-1}, G_{T-2}, \dots$ are scaled by increasing powers of $(1-2\mu\lambda)$, and contributions older than the effective memory window of $\approx \log(1/\varepsilon)/(\mu\lambda)$ steps are essentially forgotten. Only recent rank-$B$ updates matter.]

Fig. 3 — Weight decay exponentially suppresses old gradient contributions. Only a window of $\approx \log(1/\varepsilon)/(\mu\lambda)$ recent steps matters for the current weight matrix.

A simple effective-rank bound

Choose $n$ so the decayed-past term is negligible: $(1 - 2\mu\lambda)^n \approx e^{-2\mu\lambda n} \le \varepsilon$, giving $n \approx \log(1/\varepsilon) / (2\mu\lambda)$, which is $\log(1/\varepsilon)/(\mu\lambda)$ up to a constant factor. The recent-updates term is a sum of $n$ gradients of rank at most $B$, so:

$$\text{rank}_\varepsilon(W_T) \;\lesssim\; \frac{B\,\log(1/\varepsilon)}{\mu\lambda}$$

This captures the correct qualitative dependencies: smaller batch size $B$, larger learning rate $\mu$, or larger weight decay $\lambda$ all shorten the effective memory window and produce lower effective rank.

This bound is a structural heuristic, not a worst-case guarantee. In practice the rank is often even lower, because gradient directions are not independent — training on structured data concentrates updates in a small number of directions.
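To see the qualitative dependence, one can sweep the same toy recursion over a few hyperparameter settings. The settings below are illustrative, not results from the paper, and random uncorrelated updates if anything understate the effect.

```python
import numpy as np

def effective_rank(W, eps=0.05):
    """Count singular values above eps times the largest one."""
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

d, steps = 64, 4000
rng = np.random.default_rng(0)

# Smaller B and larger mu*lambda shorten the memory window and should lower the rank.
for B, mu, lam in [(1, 0.2, 0.5), (2, 0.1, 0.3), (16, 0.02, 0.01)]:
    W = rng.standard_normal((d, d))
    for _ in range(steps):
        G = rng.standard_normal((d, B)) @ rng.standard_normal((B, d)) / B
        W = (1 - 2 * mu * lam) * W - mu * G
    print(f"B={B:2d}  mu={mu:.2f}  lambda={lam:.2f}:  effective rank ~ {effective_rank(W)} / {d}")
```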

Part III — Shared operators

The rank-1 statement changes when the same matrix $W$ is reused multiple times within a single example — as happens in convolutions (same kernel applied at many spatial locations) and self-attention projections ($W_Q$, $W_K$, $W_V$ applied to many tokens).

In that case, $h(x) = g(Wf_1(x), \dots, Wf_R(x))$, and the chain rule gives:

Shared operator gradient
$$\nabla_W \ell(h(x)) = \sum_{r=1}^R \delta_r(x)\, f_r(x)^\top, \qquad \text{rank}\bigl(\nabla_W \ell(h(x))\bigr) \le R$$

For a mini-batch, $\text{rank}(G_t) \le \min(d_{\text{out}}, d_{\text{in}}, BR)$. The rest of the argument is unchanged — weight decay still exponentially suppresses old updates, so the current matrix remains close to a weighted sum of recent low-rank gradients. The one-use setting $R = 1$ is simply the cleanest case.
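A convolution makes the $BR$ bound easy to check numerically. In the PyTorch sketch below (shapes chosen so that $BR$ is the binding term in the minimum, and otherwise arbitrary), the flattened kernel gradient for a single example has rank at most the number of spatial positions $R$.

```python
import torch

torch.manual_seed(0)
B = 1                                           # one example
conv = torch.nn.Conv2d(3, 16, kernel_size=3, bias=False)

x = torch.randn(B, 3, 4, 4)                     # 4x4 input, 3x3 kernel -> 2x2 output: R = 4 positions
loss = conv(x).pow(2).sum()
loss.backward()

# View the kernel gradient as a (d_out, d_in) matrix: (16, 3*3*3) = (16, 27).
G = conv.weight.grad.reshape(16, -1)
print(torch.linalg.matrix_rank(G))              # at most B*R = 4, well below min(16, 27)
```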

Why the local view is natural

The decomposition $h(x) = g(Wf(x))$ is not an artificial simplification. It is the natural local view of any layer: fix all other parameters, isolate the point where $W$ acts, and absorb everything before it into $f$ and everything after it into $g$. For fully connected layers this is immediate. For residual blocks, the dependence on $W$ still enters through $Wf(x)$, so the outer-product structure of the gradient is preserved.


What this does and does not say

This argument does not imply that every trained layer must be exactly low rank. Nor does it eliminate the influence of architecture, normalization, or data geometry. What it provides is a broad structural reason that low-rank behavior should often appear in practice — and two direct consequences.

Why LoRA works
If SGD already pushes layers toward low rank, fine-tuning with an explicit low-rank parameterization ($W = W_0 + AB$) is not imposing an alien constraint — it is matching the structure that the original training would have produced anyway.
Why post-training compression works
SVD truncation after training is often surprisingly effective. This result explains why: compression is not inventing structure. It is extracting structure that training had already encouraged into the weight matrices.
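As a concrete illustration of the compression side (a generic SVD-truncation sketch, not the paper's procedure): when a matrix has the kind of decaying spectrum that these dynamics encourage, truncating it to a small rank discards very little.

```python
import numpy as np

def svd_truncate(W, k):
    """Best rank-k approximation of W (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# A synthetic 64x64 matrix with geometrically decaying singular values,
# standing in for a weight matrix shaped by SGD + weight decay.
rng = np.random.default_rng(0)
d = 64
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
W = (U * 0.8 ** np.arange(d)) @ V.T

W8 = svd_truncate(W, k=8)
rel_err = np.linalg.norm(W - W8) / np.linalg.norm(W)
print(f"relative Frobenius error at rank 8: {rel_err:.2f}")   # ~0.17 for this spectrum
```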
“Low rank is not an accident. It is built into the training dynamics through the interaction of stochastic gradients and weight decay.”

Takeaway

SGD with weight decay does more than optimize the loss. It quietly pushes layers toward low-rank structure through three interlocking effects.

Low-rank updates. Each stochastic gradient is low rank: rank 1 per example, at most $B$ per mini-batch, and at most $BR$ for shared operators like convolutions and attention projections.

Finite memory. Weight decay exponentially forgets old updates, limiting the effective memory to roughly $\log(1/\varepsilon)/(\mu\lambda)$ steps.

Low-rank layers. The current weight matrix is dominated by a short history of low-rank corrections, giving an effective rank bounded by $B\log(1/\varepsilon)/(\mu\lambda)$.

Low-rank structure in trained neural networks is not an empirical curiosity. It is a natural consequence of how SGD with weight decay writes directions in and forgets them over time.

