The Secret Low-Rank Bias of Regularized SGD
Why do trained neural networks often end up low rank? Mini-batch SGD and weight decay together create a built-in pressure toward compressible layers. The mechanism is simple and general, and it helps explain why post-training compression works so well.
Introduction
Deep networks are heavily overparameterized, yet the solutions found in practice are far from arbitrary. Even when many parameter settings can fit the training data, stochastic gradient methods often converge to highly structured models. One particularly striking form of structure is low rank: across many architectures, trained weight matrices are far more compressible than their full dimension would suggest.
Much of the existing theory explains low-rank behavior only in cleaner settings — linear models, specialized losses, exact symmetries, or global optimality arguments. What is still missing is a structural explanation for the regime practitioners actually use: training practical neural networks with mini-batch SGD and weight decay.
The mechanism operates in three steps: each mini-batch gradient is a low-rank matrix; weight decay exponentially forgets old updates; and so the current weight matrix is dominated by a short history of low-rank corrections.
Interactive explorer
Here is a live simulation of the mechanism. An $8 \times 8$ weight matrix is built step by step from rank-$B$ stochastic gradient updates under weight decay. Adjust the sliders to see how batch size, learning rate, and weight decay each affect the resulting rank structure.
Fig. 1. Gradient slabs enter from the right; weight decay fades older contributions. The singular value bars expose the resulting rank structure. Try $B=1$, $\mu=0.10$, $\lambda=0.10$ for extreme low rank, or $B=8$, $\mu=0.02$, $\lambda=0.01$ for higher rank.
Part I — Stochastic gradients are low rank
A local view of one layer
Fix all parameters except one trainable matrix $W$. Locally around that layer, the network can be written as $h(x) = g(Wf(x))$, where $f(x)$ is the representation entering the layer and $g$ collects everything afterward. Under mini-batch SGD with weight decay, the update is:

$$W_{t+1} = (1 - 2\mu\lambda)\,W_t - \mu\, G_t.$$
The previous matrix is shrunk by the factor $1 - 2\mu\lambda$, and a fresh stochastic gradient $G_t$ is added. The key question is: what kind of matrix is $G_t$?
One example gives a rank-1 gradient
For a single training example $x$, the chain rule gives an outer product — the simplest possible matrix structure:

$$G_t = \delta(x)\, f(x)^\top,$$
where $\delta(x) := J_g(Wf(x))^\top \nabla_h \ell(h(x))$. One left direction times one right direction — and nothing else.
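This is easy to check numerically. The sketch below builds a toy local view with a linear readout $g$ and a squared-error loss (all concrete choices here are illustrative, not from the text), forms the per-example gradient via the chain rule, and confirms it has rank 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 8

# Toy local view h(x) = g(W f(x)) with a linear readout g and
# squared-error loss; every concrete choice here is a stand-in.
f_x = rng.normal(size=d_in)           # f(x): representation entering the layer
W = rng.normal(size=(d_out, d_in))
g = rng.normal(size=d_out)            # linear stand-in for the rest of the net
target = 1.0

pre = W @ f_x                         # W f(x)
loss_grad = 2 * (g @ pre - target)    # d loss / d output (a scalar here)
delta = loss_grad * g                 # delta(x) = J_g^T grad_h loss

G = np.outer(delta, f_x)              # gradient of the loss w.r.t. W
print(np.linalg.matrix_rank(G))       # -> 1: a single outer product
```

For this quadratic loss the analytic gradient is exactly $2\,(g^\top Wf(x) - \text{target})\, g\, f(x)^\top$, which is what the code assembles: one left direction times one right direction.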
A mini-batch gives rank at most $B$
The mini-batch gradient averages $B$ rank-1 terms:

$$G_t = \frac{1}{B} \sum_{i=1}^{B} \delta(x_i)\, f(x_i)^\top.$$
Every SGD step writes only a rank-$B$ correction. The diagram below illustrates what this looks like for a layer of dimension 8.
Fig. 2 — Each SGD step writes a rank-$B$ correction. With $B=2$ on an 8×8 matrix, each step touches just 2 of the 8 available singular directions.
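The same check extends to a mini-batch. A minimal NumPy sketch, with random vectors standing in for the per-example factors $\delta(x_i)$ and $f(x_i)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, B = 8, 2

# Average of B rank-1 per-example terms delta(x_i) f(x_i)^T,
# with random vectors standing in for the true factors.
deltas = rng.normal(size=(B, d))
feats = rng.normal(size=(B, d))
G = sum(np.outer(deltas[i], feats[i]) for i in range(B)) / B

print(np.linalg.matrix_rank(G))  # at most B; generically exactly 2 here
```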
Low-rank updates alone are not enough — if we accumulate them forever with no forgetting, their sum can eventually become full rank. The second ingredient is weight decay.
Part II — Weight decay limits memory
Unrolling the recursion $W_{t+1} = (1 - 2\mu\lambda)\,W_t - \mu\, G_t$ for $n$ steps gives the identity at the heart of the argument:

$$W_{t+n} \;=\; (1 - 2\mu\lambda)^n\, W_t \;-\; \mu \sum_{k=0}^{n-1} (1 - 2\mu\lambda)^{\,n-1-k}\, G_{t+k}.$$
The first term shrinks exponentially in $n$. The second is a weighted sum of recent mini-batch gradients, each of rank at most $B$. After enough training, the current matrix is well approximated by a short moving memory of low-rank corrections.
Fig. 3 — Weight decay exponentially suppresses old gradient contributions. Only a window of $\approx \log(1/\varepsilon)/(2\mu\lambda)$ recent steps matters for the current weight matrix.
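The truncation claim can be verified directly: run the recursion for many steps, rebuild the matrix from only the last $n$ gradients, and compare. A minimal NumPy sketch, with random rank-1 matrices standing in for true gradients and assumed hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
mu, lam, eps = 0.05, 0.1, 1e-3
decay = 1 - 2 * mu * lam

# A long run of the recursion W <- decay * W - mu * G, with random
# rank-1 stand-ins for the stochastic gradients.
T = 2000
grads = [np.outer(rng.normal(size=d), rng.normal(size=d)) for _ in range(T)]
W = rng.normal(size=(d, d))  # arbitrary initialization
for G in grads:
    W = decay * W - mu * G

# Rebuild from only the last n steps; everything older carries a
# weight of at most decay^n <= eps and should be negligible.
n = int(np.ceil(np.log(1 / eps) / (2 * mu * lam)))
W_recent = np.zeros((d, d))
for G in grads[-n:]:
    W_recent = decay * W_recent - mu * G

rel_err = np.linalg.norm(W - W_recent) / np.linalg.norm(W)
print(n, rel_err)  # n = 691; relative error on the order of eps
```

The reconstruction from the last $n$ steps agrees with the full run up to an $\varepsilon$-sized remainder, exactly as the unrolled identity predicts.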
A simple effective-rank bound
Choose $n$ so the decayed-past term is negligible: $(1 - 2\mu\lambda)^n \approx e^{-2\mu\lambda n} \le \varepsilon$, giving $n \approx \log(1/\varepsilon) / (2\mu\lambda)$. The recent-updates term is a sum of $n$ gradients of rank at most $B$, so up to the $\varepsilon$-sized remainder:

$$\operatorname{rank}_\varepsilon(W_{t+n}) \;\le\; nB \;\approx\; \frac{B \log(1/\varepsilon)}{2\mu\lambda}.$$
This captures the correct qualitative dependencies: a smaller batch size $B$ lowers the rank of each update, while a larger learning rate $\mu$ or larger weight decay $\lambda$ shortens the effective memory window — and all three push the effective rank down.
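These dependencies show up in a small simulation. The sketch below (assumed hyperparameters, random rank-$B$ matrices in place of real gradients) trains a matrix under the update rule and counts singular values above a small fraction of the largest as a crude effective rank:

```python
import numpy as np

def effective_rank(W, thresh=0.05):
    """Number of singular values above thresh * largest."""
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > thresh * s[0]))

def train(d, B, mu, lam, steps, seed=0):
    """Run W <- (1 - 2*mu*lam) W - mu * G with random rank-B gradients."""
    rng = np.random.default_rng(seed)
    W = np.zeros((d, d))
    for _ in range(steps):
        # Rank-B mini-batch stand-in: average of B random outer products.
        G = sum(np.outer(rng.normal(size=d), rng.normal(size=d))
                for _ in range(B)) / B
        W = (1 - 2 * mu * lam) * W - mu * G
    return W

d = 32
r_short = effective_rank(train(d, B=1, mu=0.5, lam=0.5, steps=500))
r_long = effective_rank(train(d, B=8, mu=0.02, lam=0.01, steps=500))
print(r_short, r_long)  # short memory gives a much lower effective rank
```

With a short memory window (small $B$, large $\mu\lambda$) only a handful of singular directions survive; with a long window (large $B$, small $\mu\lambda$) the spectrum fills out toward full rank.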
Part III — Shared operators
The rank-1 statement changes when the same matrix $W$ is reused multiple times within a single example — as happens in convolutions (same kernel applied at many spatial locations) and self-attention projections ($W_Q$, $W_K$, $W_V$ applied to many tokens).
In that case, $h(x) = g(Wf_1(x), \dots, Wf_R(x))$, and the chain rule gives, for one example, a sum of $R$ outer products:

$$\nabla_W \ell(x) = \sum_{r=1}^{R} \delta_r(x)\, f_r(x)^\top,$$

where $\delta_r(x)$ is the error signal backpropagated to the $r$-th use of $W$.
For a mini-batch, $\text{rank}(G_t) \le \min(d_{\text{out}}, d_{\text{in}}, BR)$. The rest of the argument is unchanged — weight decay still exponentially suppresses old updates, so the current matrix remains close to a weighted sum of recent low-rank gradients. The one-use setting $R = 1$ is simply the cleanest case.
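A quick numerical check of the shared-operator bound, with random outer products standing in for the $BR$ per-example, per-site terms:

```python
import numpy as np

rng = np.random.default_rng(3)
d, B, R = 16, 2, 3

# A mini-batch of B examples, each reusing W at R sites:
# the gradient is an average over B sums of R outer products.
G = np.zeros((d, d))
for _ in range(B * R):
    G += np.outer(rng.normal(size=d), rng.normal(size=d))
G /= B

print(np.linalg.matrix_rank(G))  # at most B*R = 6, well below d = 16
```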
Why the local view is natural
The decomposition $h(x) = g(Wf(x))$ is not an artificial simplification. It is the natural local view of any layer: fix all other parameters, isolate the point where $W$ acts, and absorb everything before it into $f$ and everything after it into $g$. For fully connected layers this is immediate. For residual blocks, the dependence on $W$ still enters through $Wf(x)$, so the outer-product structure of the gradient is preserved.
What this does and does not say
This argument does not imply that every trained layer must be exactly low rank. Nor does it eliminate the influence of architecture, normalization, or data geometry. What it provides is a broad structural reason that low-rank behavior should often appear in practice.
Takeaway
SGD with weight decay does more than optimize the loss. It quietly pushes layers toward low-rank structure through three interlocking effects.
Low-rank updates. Each stochastic gradient is low rank — rank 1 per example, rank $B$ per mini-batch, rank $BR$ for shared operators like convolutions and attention projections.
Finite memory. Weight decay exponentially forgets old updates, limiting the effective memory to roughly $\log(1/\varepsilon)/(2\mu\lambda)$ steps.
Low-rank layers. The current weight matrix is dominated by a short history of low-rank corrections, giving an effective rank bounded by $B\log(1/\varepsilon)/(2\mu\lambda)$.
Low-rank structure in trained neural networks is not an empirical curiosity. It is a natural consequence of how SGD with weight decay writes directions in and forgets them over time.