Low-Rank Bias · SGD · Compression

The Secret Low-Rank Bias of Regularized SGD

Why do trained neural networks so often become compressible? The cute answer is that SGD writes each layer in thin strokes, while weight decay keeps erasing the old ones. What remains is not arbitrary: it is a short memory of low-rank updates.

By Tomer Galanti · March 11, 2026 · 12 min read · ◆ Galanti, Siegel, Gupte, Poggio - CPAL 2025

Introduction

Neural networks are enormous, but the solutions found by training are often surprisingly small. Not small in parameter count — the matrices are still huge — but small in effective dimension. Look at many trained layers and their singular values decay quickly: the layer behaves as though most of its action lives in a much lower-dimensional subspace.

This is one reason post-training compression works. It is also why low-rank fine-tuning methods such as LoRA feel so natural: they are not fighting the geometry of trained networks. They are exploiting it. The question is where this geometry comes from. Why should ordinary training, with no explicit rank penalty, keep producing compressible matrices?

The paper’s answer is beautifully mechanical. A mini-batch gradient does not write a full matrix. It writes a low-rank slab. Weight decay then fades old slabs exponentially. So a trained layer is not an arbitrary accumulation of everything that ever happened; it is closer to a short moving window of recent low-rank updates.

The central claim Mini-batch SGD with weight decay creates an implicit low-rank bias. Smaller batch size $B$ writes fewer directions per step; larger learning rate $\mu$ and stronger weight decay $\lambda$ shorten the memory of past directions. Together they make the layer compressible.

The whole mechanism can be read as a tiny accounting identity:

I
Each step writes only a few directions.
For one example, the gradient of a matrix layer is an outer product: rank one. For a mini-batch, it has rank at most $B$.
II
Weight decay makes the past fade.
The update multiplies the old matrix by $1-2\mu\lambda$. After many steps, far-back gradients are exponentially suppressed.
III
The layer becomes a short low-rank memory.
The current matrix is dominated by about $1/(\mu\lambda)$ recent rank-$B$ updates, giving an effective rank budget on the order of $B/(\mu\lambda)$.
“SGD writes in low-rank strokes; weight decay decides how many strokes remain visible.”
Based on: T. Galanti, Z. Siegel, A. Gupte, T. Poggio. “SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network”, CPAL 2025.

Interactive explorer

The explorer below turns the rank accounting into a little toy machine. An $8 \times 8$ matrix is built by repeated rank-$B$ stochastic updates, while weight decay dims older contributions. The heatmap shows the current matrix; the bars show its singular spectrum; the timeline shows which gradients are still remembered.

Low-rank bias simulator — SGD + weight decay builds low-rank weight matrices
Weight matrix $W_T$
Singular values
Effective rank
Decay factor
Memory window
t = 0
Gradient memory timeline

Fig. 1. Low-rank memory in motion. Gradient slabs enter from the right; weight decay fades older slabs; the singular values reveal how much of the matrix is really being used. Try $B=1$, $\mu=0.10$, $\lambda=0.10$ for an aggressively compressed layer, or $B=8$, $\mu=0.02$, $\lambda=0.01$ for a longer, fuller memory.


Part I — Stochastic gradients are low rank

A local view of one layer

Pick one trainable matrix $W$ inside a neural network and freeze everything else. Locally, the network can always be written as $h(x)=g(Wf(x))$: the map $f$ produces the vector entering the layer, $W$ acts on it, and $g$ contains the rest of the network. Under mini-batch SGD with weight decay, the layer evolves by

$$W_{t+1} = (1 - 2\mu\lambda)\,W_t \;-\; \mu\, G_t$$

This equation has two forces. The old matrix is shrunk by $1-2\mu\lambda$, and the new mini-batch gradient $G_t$ is written into it. The low-rank story begins by asking what shape that new writing has.

One example gives a rank-1 gradient

For a single example, the chain rule gives an outer product:

Single-example gradient
$$\nabla_W \ell(h(x)) = \delta(x)\, f(x)^\top, \qquad \delta(x) := J_g(Wf(x))^\top \nabla_h\ell(h(x))$$

That is one left vector times one right vector. No matter how wide the layer is, one example can only update one matrix direction. Its rank is at most one.

A mini-batch gives rank at most $B$

A mini-batch simply averages $B$ such outer products:

$$G_t = \frac{1}{B}\sum_{i=1}^B \delta_i\, f_i^\top, \qquad \mathrm{rank}(G_t) \le \min(d_{\mathrm{out}},\,d_{\mathrm{in}},\,B)$$

So each SGD step writes at most $B$ new directions into the layer. This is already a strong bottleneck: a $4096\times4096$ matrix has millions of degrees of freedom, but a batch of 32 writes a correction living in a 32-dimensional column/row span. The diagram below shows the cartoon version.

Wt 8 × 8 current ×(1−2μλ) Wt shrunk decayed μ Gt rank ≤ B gradient step = Wt+1 updated next step Each update writes only B new directions into an 8×8 matrix (just B/64 of capacity) Alone, this is not enough — repeated rank-B additions could eventually fill the matrix. Weight decay is the missing piece.

Fig. 2 — Each SGD step writes a rank-$B$ correction. With $B=2$ on an 8×8 matrix, each step contributes only two fresh directions; with modern large matrices the contrast is much more dramatic.

But low-rank updates alone do not force a low-rank final matrix. Add enough rank-$B$ slabs forever and their span can eventually fill the whole space. The second ingredient — the part that makes the story work — is forgetting.


Part II — Weight decay limits memory

Unroll the update equation for $n$ steps. The current matrix decomposes into two pieces: an old matrix, exponentially faded, plus a weighted sum of recent stochastic gradients:

$$W_T = \underbrace{(1 - 2\mu\lambda)^n\, W_{T-n}}_{\text{decayed past}} \;-\; \mu \underbrace{\sum_{j=1}^{n}(1 - 2\mu\lambda)^{j-1}\, G_{T-j}}_{\text{recent low-rank updates}}$$

This is the central identity. Weight decay turns the training trajectory into a fading memory. Old gradients are not gone, but their coefficients shrink like $(1-2\mu\lambda)^j\approx e^{-2\mu\lambda j}$. The layer remembers recent low-rank updates sharply and older updates faintly.

older ← → most recent GT-8 ~forgotten GT-6 GT-4 GT-3 GT-2 GT-1 rank ≤ B GT current ×(1−2μλ)8 ×(1−2μλ)6 ×(1−2μλ)4 ×(1−2μλ)3 ×(1−2μλ)2 ×(1−2μλ)1 ×1 effective memory window ≈ log(1/ε) / (μλ) Older updates fade. Only recent rank-B matter.

Fig. 3 — Weight decay exponentially suppresses old gradient contributions. Only a window of $\approx \log(1/\varepsilon)/(\mu\lambda)$ recent steps matters for the current weight matrix.

A simple effective-rank bound

Choose a tolerance $\varepsilon$ and ask: how many recent updates do we need before the older past is smaller than that tolerance? Since $(1-2\mu\lambda)^n\approx e^{-2\mu\lambda n}$, the effective memory window is roughly

Memory length
$$n_\varepsilon \;\approx\; \frac{\log(1/\varepsilon)}{\mu\lambda}$$

Within that window, each gradient has rank at most $B$. A sum of $n_\varepsilon$ rank-$B$ matrices has rank at most $B n_\varepsilon$. Hence the effective-rank budget is

$$\mathrm{rank}_\varepsilon(W_T) \;\lesssim\; \frac{B\,\log(1/\varepsilon)}{\mu\lambda}$$

This is not saying the numerical rank is exactly this quantity. It is saying something more structural: the part of the layer that survives above tolerance can be represented by a controlled number of recent low-rank updates. The dependencies are the real message.

Knob Role in the dynamics Effect on rank bias
$B$ Number of examples in a mini-batch; each step writes at most $B$ directions. Smaller $B$ ⇒ thinner updates ⇒ stronger low-rank pressure.
$\mu$ Learning rate; controls both the update size and the decay-per-step timescale. Larger $\mu$ ⇒ shorter memory window ⇒ lower effective rank.
$\lambda$ Weight decay strength; exponentially suppresses old gradient directions. Larger $\lambda$ ⇒ faster forgetting ⇒ lower effective rank.
$\varepsilon$ Compression tolerance; decides how much of the decayed past we ignore. Smaller $\varepsilon$ ⇒ more remembered history ⇒ larger rank budget.
The bound is best read as a structural compression certificate. It does not require the loss to be convex or the layer to be at a global optimum; it follows from the algebra of SGD updates and weight decay. In real networks the observed rank can be even lower, because gradient directions are correlated rather than adversarially independent.

Part III — Shared operators

The cleanest derivation assumes the matrix $W$ is used once per example. Real neural networks often reuse the same matrix many times. A convolutional kernel is applied at many spatial locations; a self-attention projection is applied to many tokens; the same operator appears repeatedly across a structured computation.

In that case the local form becomes $h(x)=g(Wf_1(x),\dots,Wf_R(x))$, where $R$ is the number of uses of the shared operator. The chain rule now gives a sum of $R$ outer products for a single example:

Shared operator gradient
$$\nabla_W \ell(h(x)) = \sum_{r=1}^R \delta_r(x)\, f_r(x)^\top, \qquad \mathrm{rank}\bigl(\nabla_W \ell(h(x))\bigr) \le R$$

For a mini-batch, this becomes $\mathrm{rank}(G_t)\le \min(d_{\mathrm{out}},d_{\mathrm{in}},BR)$. The story is the same, but the step is wider: each update can write up to $BR$ directions rather than $B$.

$$\mathrm{rank}_\varepsilon(W_T) \;\lesssim\; \frac{BR\,\log(1/\varepsilon)}{\mu\lambda}$$

That extra factor $R$ is important. It explains why the clean rank-1-per-example picture is not literally true for every parameter tensor, while preserving the core mechanism: shared operators still receive low-rank updates relative to their ambient matrix dimension, and weight decay still limits how many such updates remain visible.

Why the local view is natural

The decomposition $h(x)=g(Wf(x))$ is not a toy assumption. It is the local view of a layer: freeze every other parameter, isolate where $W$ acts, and absorb everything before it into $f$ and everything after it into $g$. For residual networks, attention blocks, and MLP layers, the same principle applies. The surrounding architecture changes the vectors $f$ and $\delta$; it does not remove the outer-product structure of the gradient.


What this does and does not say

This argument does not say every trained layer is exactly low rank, or that optimization ignores the data. It also does not say rank is the only kind of simplicity found by SGD. Architecture, normalization, data geometry, and loss shape still matter. The point is narrower and cleaner: the standard training recipe already contains a mechanism that favors compressible matrices.

Why LoRA feels natural
Low-rank fine-tuning writes an update of the form $AB$ on top of a frozen weight matrix. If the original training dynamics already tend to express useful changes through low-dimensional update subspaces, LoRA is not an arbitrary restriction; it is aligned with the geometry SGD often discovers.
Why post-training compression works
SVD truncation after training often preserves accuracy because the small singular directions are not carrying much of the learned computation. Compression is not inventing structure after the fact. It is harvesting structure that training dynamics helped create.
Why hyperparameters matter
Batch size, learning rate, and weight decay are not only optimization knobs. They also control the rank budget: $B$ controls how many directions are written per step, while $\mu\lambda$ controls how long those directions remain in memory.
“A trained layer is less like a carved stone tablet and more like a chalkboard: SGD writes, weight decay fades, and only the persistent directions remain.”

Takeaway

SGD with weight decay does more than minimize the loss. It gives neural networks a quiet bias toward low-rank, compressible layers.

SGD writes low-rank updates. A single example contributes an outer product. A mini-batch contributes at most $B$ such directions; a shared operator contributes at most $BR$.

Weight decay creates finite memory. Old updates are multiplied by powers of $1-2\mu\lambda$, so the layer is dominated by a recent window of gradients of length about $\log(1/\varepsilon)/(\mu\lambda)$.

The rank budget follows. Up to tolerance $\varepsilon$, the effective rank scales like $B\log(1/\varepsilon)/(\mu\lambda)$ in the one-use case, and like $BR\log(1/\varepsilon)/(\mu\lambda)$ for shared operators.

The practical moral is simple. Low-rank structure is not a post-hoc miracle. It is written into the training dynamics — one thin stochastic update at a time, with weight decay quietly erasing the past.


Comments