The Secret Low-Rank Bias of Regularized SGD
Why do trained neural networks so often become compressible? The cute answer is that SGD writes each layer in thin strokes, while weight decay keeps erasing the old ones. What remains is not arbitrary: it is a short memory of low-rank updates.
Introduction
Neural networks are enormous, but the solutions found by training are often surprisingly small. Not small in parameter count — the matrices are still huge — but small in effective dimension. Look at many trained layers and their singular values decay quickly: the layer behaves as though most of its action lives in a much lower-dimensional subspace.
This is one reason post-training compression works. It is also why low-rank fine-tuning methods such as LoRA feel so natural: they are not fighting the geometry of trained networks. They are exploiting it. The question is where this geometry comes from. Why should ordinary training, with no explicit rank penalty, keep producing compressible matrices?
The paper’s answer is beautifully mechanical. A mini-batch gradient does not write a full matrix. It writes a low-rank slab. Weight decay then fades old slabs exponentially. So a trained layer is not an arbitrary accumulation of everything that ever happened; it is closer to a short moving window of recent low-rank updates.
The whole mechanism can be read as a tiny accounting identity:
Interactive explorer
The explorer below turns the rank accounting into a little toy machine. An $8 \times 8$ matrix is built by repeated rank-$B$ stochastic updates, while weight decay dims older contributions. The heatmap shows the current matrix; the bars show its singular spectrum; the timeline shows which gradients are still remembered.
Fig. 1. Low-rank memory in motion. Gradient slabs enter from the right; weight decay fades older slabs; the singular values reveal how much of the matrix is really being used. Try $B=1$, $\mu=0.10$, $\lambda=0.10$ for an aggressively compressed layer, or $B=8$, $\mu=0.02$, $\lambda=0.01$ for a longer, fuller memory.
Part I — Stochastic gradients are low rank
A local view of one layer
Pick one trainable matrix $W$ inside a neural network and freeze everything else. Locally, the network can always be written as $h(x)=g(Wf(x))$: the map $f$ produces the vector entering the layer, $W$ acts on it, and $g$ contains the rest of the network. Under mini-batch SGD with weight decay, the layer evolves by
This equation has two forces. The old matrix is shrunk by $1-2\mu\lambda$, and the new mini-batch gradient $G_t$ is written into it. The low-rank story begins by asking what shape that new writing has.
One example gives a rank-1 gradient
For a single example, the chain rule gives an outer product:
That is one left vector times one right vector. No matter how wide the layer is, one example can only update one matrix direction. Its rank is at most one.
A mini-batch gives rank at most $B$
A mini-batch simply averages $B$ such outer products:
So each SGD step writes at most $B$ new directions into the layer. This is already a strong bottleneck: a $4096\times4096$ matrix has millions of degrees of freedom, but a batch of 32 writes a correction living in a 32-dimensional column/row span. The diagram below shows the cartoon version.
Fig. 2 — Each SGD step writes a rank-$B$ correction. With $B=2$ on an 8×8 matrix, each step contributes only two fresh directions; with modern large matrices the contrast is much more dramatic.
But low-rank updates alone do not force a low-rank final matrix. Add enough rank-$B$ slabs forever and their span can eventually fill the whole space. The second ingredient — the part that makes the story work — is forgetting.
Part II — Weight decay limits memory
Unroll the update equation for $n$ steps. The current matrix decomposes into two pieces: an old matrix, exponentially faded, plus a weighted sum of recent stochastic gradients:
This is the central identity. Weight decay turns the training trajectory into a fading memory. Old gradients are not gone, but their coefficients shrink like $(1-2\mu\lambda)^j\approx e^{-2\mu\lambda j}$. The layer remembers recent low-rank updates sharply and older updates faintly.
Fig. 3 — Weight decay exponentially suppresses old gradient contributions. Only a window of $\approx \log(1/\varepsilon)/(\mu\lambda)$ recent steps matters for the current weight matrix.
A simple effective-rank bound
Choose a tolerance $\varepsilon$ and ask: how many recent updates do we need before the older past is smaller than that tolerance? Since $(1-2\mu\lambda)^n\approx e^{-2\mu\lambda n}$, the effective memory window is roughly
Within that window, each gradient has rank at most $B$. A sum of $n_\varepsilon$ rank-$B$ matrices has rank at most $B n_\varepsilon$. Hence the effective-rank budget is
This is not saying the numerical rank is exactly this quantity. It is saying something more structural: the part of the layer that survives above tolerance can be represented by a controlled number of recent low-rank updates. The dependencies are the real message.
| Knob | Role in the dynamics | Effect on rank bias |
|---|---|---|
| $B$ | Number of examples in a mini-batch; each step writes at most $B$ directions. | Smaller $B$ ⇒ thinner updates ⇒ stronger low-rank pressure. |
| $\mu$ | Learning rate; controls both the update size and the decay-per-step timescale. | Larger $\mu$ ⇒ shorter memory window ⇒ lower effective rank. |
| $\lambda$ | Weight decay strength; exponentially suppresses old gradient directions. | Larger $\lambda$ ⇒ faster forgetting ⇒ lower effective rank. |
| $\varepsilon$ | Compression tolerance; decides how much of the decayed past we ignore. | Smaller $\varepsilon$ ⇒ more remembered history ⇒ larger rank budget. |
Part III — Shared operators
The cleanest derivation assumes the matrix $W$ is used once per example. Real neural networks often reuse the same matrix many times. A convolutional kernel is applied at many spatial locations; a self-attention projection is applied to many tokens; the same operator appears repeatedly across a structured computation.
In that case the local form becomes $h(x)=g(Wf_1(x),\dots,Wf_R(x))$, where $R$ is the number of uses of the shared operator. The chain rule now gives a sum of $R$ outer products for a single example:
For a mini-batch, this becomes $\mathrm{rank}(G_t)\le \min(d_{\mathrm{out}},d_{\mathrm{in}},BR)$. The story is the same, but the step is wider: each update can write up to $BR$ directions rather than $B$.
That extra factor $R$ is important. It explains why the clean rank-1-per-example picture is not literally true for every parameter tensor, while preserving the core mechanism: shared operators still receive low-rank updates relative to their ambient matrix dimension, and weight decay still limits how many such updates remain visible.
Why the local view is natural
The decomposition $h(x)=g(Wf(x))$ is not a toy assumption. It is the local view of a layer: freeze every other parameter, isolate where $W$ acts, and absorb everything before it into $f$ and everything after it into $g$. For residual networks, attention blocks, and MLP layers, the same principle applies. The surrounding architecture changes the vectors $f$ and $\delta$; it does not remove the outer-product structure of the gradient.
What this does and does not say
This argument does not say every trained layer is exactly low rank, or that optimization ignores the data. It also does not say rank is the only kind of simplicity found by SGD. Architecture, normalization, data geometry, and loss shape still matter. The point is narrower and cleaner: the standard training recipe already contains a mechanism that favors compressible matrices.
Takeaway
SGD with weight decay does more than minimize the loss. It gives neural networks a quiet bias toward low-rank, compressible layers.
SGD writes low-rank updates. A single example contributes an outer product. A mini-batch contributes at most $B$ such directions; a shared operator contributes at most $BR$.
Weight decay creates finite memory. Old updates are multiplied by powers of $1-2\mu\lambda$, so the layer is dominated by a recent window of gradients of length about $\log(1/\varepsilon)/(\mu\lambda)$.
The rank budget follows. Up to tolerance $\varepsilon$, the effective rank scales like $B\log(1/\varepsilon)/(\mu\lambda)$ in the one-use case, and like $BR\log(1/\varepsilon)/(\mu\lambda)$ for shared operators.
The practical moral is simple. Low-rank structure is not a post-hoc miracle. It is written into the training dynamics — one thin stochastic update at a time, with weight decay quietly erasing the past.
Comments