The Forward-Forward Algorithm: Two Forward Passes Instead of a Backward One

The Forward-Forward algorithm (FF) is Geoffrey Hinton's proposal for training neural networks without backpropagation ("The Forward-Forward Algorithm: Some Preliminary Investigations", 2022). Backprop learns by running data forward through the network and then propagating exact error derivatives backward through every layer. FF removes the backward pass entirely and asks the question this note answers: what has to change so that a network can learn from forward passes only — and what does that buy, and cost, compared to backprop?

One-sentence summary. Forward-Forward replaces backprop's forward-then-backward sweep with two forward passes — one on real ("positive") data and one on fabricated ("negative") data — and every layer learns locally and immediately by pushing a scalar "goodness" (the sum of its squared activities) above a threshold for real data and below it for fake data; layer normalization between layers keeps the trick from collapsing.

🔁 What backprop actually requires

Backprop's chain rule is exact, which is its strength — and the source of all its structural requirements. To compute ∂loss/∂w for a weight deep in the network, backprop must:

Store every layer's activations during the forward pass, because the backward pass needs them to compute gradients.
Know the derivative of every operation on the forward path. A black-box or non-differentiable component anywhere breaks the chain.
Wait for the round trip. The first layer cannot update until the signal has gone all the way up and the error has come all the way back down ("update locking").
Reuse the same weights backward (transposed) to route the error. This is the "weight transport" problem, and it is one reason backprop is considered biologically implausible: there is no convincing evidence that the cortex propagates error derivatives this way.

Forward-Forward drops all four requirements at once. Watch one training step of each, side by side — the thing to track is when each layer gets to update its weights:

[Interactive animation: one training step of Backpropagation (left) and Forward-Forward (right), side by side]

On the left, weight updates (⚡) happen only on the backward sweep, in reverse order, after the loss is known, and every layer has to keep its activations (the "a" badges) in memory the whole time. On the right there is no backward sweep at all: each layer updates the moment the pass flows through it, once to make real data score high (green pass) and once to make fake data score low (red pass). Nothing is stored, nothing waits.

⚡ Goodness: a local objective for every layer

Backprop gives every weight a share of one global loss. FF instead gives every layer its own private objective. A layer's goodness on an input is simply the sum of its squared activities:

G = Σ_j y_j²

p(positive) = σ(G − θ)

Each layer is trained as a tiny logistic classifier: activities should be energetic (goodness above the threshold θ) when the input is real, and quiet (goodness below θ) when the input is fake. The gradient of that objective with respect to each activity is local — it only needs values the layer already has, so the weight update needs no information from any other layer.
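The locality claim can be checked numerically. The sketch below assumes the logistic loss −log p for positive inputs and −log(1 − p) for negative ones, which gives the per-activity gradient (p − target) · 2 · y_j. The function names and the toy activity vector are illustrative, not taken from the paper.

```python
import math

def goodness(y):
    """Sum of squared activities of one layer."""
    return sum(v * v for v in y)

def p_positive(y, theta):
    """Layer's probability that the input is real: sigma(G - theta)."""
    return 1.0 / (1.0 + math.exp(-(goodness(y) - theta)))

def loss(y, target, theta):
    """Logistic loss; target is 1 for positive data, 0 for negative data."""
    p = p_positive(y, theta)
    return -math.log(p) if target == 1 else -math.log(1.0 - p)

def local_grad(y, target, theta):
    """dLoss/dy_j = (p - target) * 2 * y_j, built only from this layer's own values."""
    p = p_positive(y, theta)
    return [(p - target) * 2.0 * v for v in y]

def numeric_grad(y, target, theta, eps=1e-6):
    """Central finite differences, used only to check the formula above."""
    grads = []
    for j in range(len(y)):
        up = list(y)
        up[j] += eps
        down = list(y)
        down[j] -= eps
        grads.append((loss(up, target, theta) - loss(down, target, theta)) / (2 * eps))
    return grads

y = [0.5, 1.0, 0.0, 1.5]  # toy activities (assumed values), G = 3.5
theta = 4.0
analytic = local_grad(y, 1, theta)
numeric = numeric_grad(y, 1, theta)
print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places. Nothing in `local_grad` refers to another layer, which is the point of the objective.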

Below is one layer of 8 units. Choose whether the current input is being treated as positive or negative, then apply update steps and watch the layer push its own goodness to the correct side of θ. Note the update pressure |σ(G−θ) − target|: it fades as the layer becomes confident, so learning naturally stops once the sample is well classified.

[Interactive widget: one 8-unit layer with threshold θ = 4.0; readouts for goodness G, σ(G−θ), update pressure, and the layer's verdict]
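The same experiment can be approximated offline. This is a minimal sketch, assuming a bias-free ReLU layer, plain gradient descent on the logistic loss, a learning rate of 0.1, and hand-picked starting weights (`make_layer` is a hypothetical helper). Only the 8 units and θ = 4.0 come from the text above.

```python
import math

THETA = 4.0  # threshold, matching the widget described above

def forward(W, x):
    """ReLU layer without biases: y_j = max(0, sum_i W[j][i] * x[i])."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def goodness(y):
    return sum(v * v for v in y)

def update(W, x, target, lr=0.1):
    """One local gradient step; returns (goodness, pressure) measured before the step."""
    y = forward(W, x)
    g = goodness(y)
    p = 1.0 / (1.0 + math.exp(-(g - THETA)))
    for j, row in enumerate(W):
        for i in range(len(row)):
            # dLoss/dW[j][i] = (p - target) * 2 * y_j * x_i; zero for inactive units
            row[i] -= lr * (p - target) * 2.0 * y[j] * x[i]
    return g, abs(p - target)

def make_layer():
    # Assumed fixed weights: 8 units; the odd-indexed ones start inactive for x below.
    return [[(0.1 if j % 2 == 0 else -0.1) * (j + 1),
             (0.05 if j % 2 == 0 else -0.05) * (j + 1)] for j in range(8)]

x = [0.6, 0.8]  # unit-length toy input

W_pos = make_layer()
pos_trace = [update(W_pos, x, target=1) for _ in range(30)]

W_neg = make_layer()
neg_trace = [update(W_neg, x, target=0) for _ in range(30)]

print("positive: G %.2f -> %.2f, pressure %.3f -> %.3f"
      % (pos_trace[0][0], pos_trace[-1][0], pos_trace[0][1], pos_trace[-1][1]))
print("negative: G %.2f -> %.2f, pressure %.3f -> %.3f"
      % (neg_trace[0][0], neg_trace[-1][0], neg_trace[0][1], neg_trace[-1][1]))
```

Treated as positive, the input's goodness climbs past θ and the pressure shrinks; treated as negative, the goodness is pushed further down. Units that start inactive receive no gradient, which is a known property of ReLU rather than something specific to FF.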

📏 The load-bearing trick: normalize between layers

There is an obvious way for this scheme to collapse. If layer 1 learns to make its activity vector long for real data, layer 2 can get perfect goodness by just copying its input — the vector's length already answers "real or fake?", so no layer after the first would learn anything new about the data.

FF's fix: before a layer's activities are fed to the next layer, normalize the vector to unit length. The length of the activity vector is the goodness (up to the square root), so normalizing strips exactly the quantity the previous layer was optimizing. What survives is the vector's orientation — the relative pattern of which units are active. Each layer is forced to find new evidence in that pattern.

h = y / (‖y‖ + ε) — length (goodness) removed, direction (information) kept
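A short sketch of that formula follows. The value of ε and the two example vectors are assumptions chosen for illustration.

```python
import math

def normalize(y, eps=1e-8):
    """h = y / (||y|| + eps): removes the vector's length, keeps its direction."""
    length = math.sqrt(sum(v * v for v in y))
    return [v / (length + eps) for v in y]

quiet = [0.3, 0.4]    # goodness 0.25
loud = [30.0, 40.0]   # goodness 2500, same direction

h_quiet = normalize(quiet)
h_loud = normalize(loud)
print(h_quiet, h_loud)
```

Both vectors come out as approximately [0.6, 0.8], so the next layer cannot tell from its input how much goodness the previous layer reported.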

Drag the activity vector of a toy 2-unit layer. Its length — and therefore its goodness — can be anything; the next layer only ever sees the point on the unit circle.

[Interactive widget: a draggable 2-unit activity vector, showing this layer's goodness G = y₁² + y₂² and the unit-length vector the next layer sees]

Every point along the dashed ray has different goodness but is identical to the next layer.

🏷️ Negative data, and how FF does supervised learning

FF is contrastive: layers only learn because positive and negative data pull goodness in opposite directions. Good negative data should have the same low-level statistics as real data and differ only in structure — otherwise layers can win by spotting trivia. In the paper's unsupervised experiments the negatives are hybrid images: two digits blended through a large-blob mask, so every patch looks locally like a digit but the whole is wrong.
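A hybrid-image generator can be sketched as follows. The paper describes building the mask by repeatedly blurring a random bit image with a [1/4, 1/2, 1/4] filter in both directions and then thresholding at 0.5. The number of blur steps, the clamped edge handling, and all function names here are assumptions.

```python
import random

def blur_once(grid):
    """Apply the [1/4, 1/2, 1/4] filter horizontally, then vertically (edges clamped)."""
    n = len(grid)  # assumes a square grid

    def at(g, r, c):
        return g[min(max(r, 0), n - 1)][min(max(c, 0), n - 1)]

    horiz = [[0.25 * at(grid, r, c - 1) + 0.5 * grid[r][c] + 0.25 * at(grid, r, c + 1)
              for c in range(n)] for r in range(n)]
    return [[0.25 * at(horiz, r - 1, c) + 0.5 * horiz[r][c] + 0.25 * at(horiz, r + 1, c)
             for c in range(n)] for r in range(n)]

def blob_mask(size, blur_steps=6, rng=random):
    """Random bits, blurred repeatedly, then thresholded at 0.5 into a 0/1 mask."""
    grid = [[float(rng.randint(0, 1)) for _ in range(size)] for _ in range(size)]
    for _ in range(blur_steps):
        grid = blur_once(grid)
    return [[1 if v > 0.5 else 0 for v in row] for row in grid]

def hybrid(img_a, img_b, mask):
    """Take img_a where the mask is 1 and img_b where it is 0."""
    return [[a if m else b for a, b, m in zip(row_a, row_b, row_m)]
            for row_a, row_b, row_m in zip(img_a, img_b, mask)]

mask = blob_mask(28, rng=random.Random(0))
for row in mask[:6]:
    print("".join("#" if v else "." for v in row))
```

More blur steps should produce larger blobs, so the negative image keeps realistic local patches while breaking the global shape.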

The supervised recipe is neater, and it is the one used in the demo below: put the label into the input. For MNIST, the first 10 pixels of the image are overwritten with a one-hot encoding of the label. A positive example is an image with its correct label baked in; a negative example is the same image with a wrong label. The only way a layer can score them differently is to learn the relationship between the label and the image content — which is exactly classification.

Positive: a 7, labeled "7" → goodness up

Negative: the same 7, labeled "3" → goodness down
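The label-in-the-input recipe is small enough to sketch directly. This assumes a flattened 784-pixel image and a wrong label drawn uniformly at random; `embed_label` and `make_pair` are illustrative names, and the constant-valued image is a stand-in for real data.

```python
import random

NUM_CLASSES = 10

def embed_label(pixels, label):
    """Return a copy of a flat image whose first 10 pixels one-hot encode `label`."""
    out = list(pixels)
    for k in range(NUM_CLASSES):
        out[k] = 1.0 if k == label else 0.0
    return out

def make_pair(pixels, true_label, rng=random):
    """Positive = image with its correct label; negative = same image, wrong label."""
    wrong = rng.choice([k for k in range(NUM_CLASSES) if k != true_label])
    return embed_label(pixels, true_label), embed_label(pixels, wrong)

image = [0.5] * 784  # stand-in for a flattened 28x28 MNIST digit
positive, negative = make_pair(image, true_label=7, rng=random.Random(0))
print(positive[:10])
print(negative[:10])
```

The two examples differ only in their first 10 entries, so any difference in goodness has to come from how the label interacts with the remaining pixels.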

Inference follows directly: run one forward pass per candidate label and pick the label whose pass accumulates the most goodness across layers. (In the paper's deeper nets the first hidden layer's goodness is excluded from this sum; in the shallow 2-layer demo below, summing all layers works better, so that is what it does.) The paper also describes a cheaper one-pass variant — train a small softmax readout on the hidden activities — but the try-every-label procedure is the purely forward-forward one.
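The try-every-label procedure might look like the sketch below. It assumes bias-free ReLU layers, unit-length normalization between layers, and a label appended to the features as a one-hot vector. The `skip_first` flag mirrors the option of excluding the first hidden layer. The hand-built `toy_layers` network exists only to make the behavior easy to follow.

```python
import math

def normalize(y, eps=1e-8):
    length = math.sqrt(sum(v * v for v in y))
    return [v / (length + eps) for v in y]

def layer_goodnesses(layers, inp):
    """Forward pass; returns each layer's goodness (sum of squared ReLU activities)."""
    h, out = inp, []
    for W in layers:
        y = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in W]
        out.append(sum(v * v for v in y))
        h = normalize(y)  # the next layer sees direction only
    return out

def predict(layers, features, num_labels, skip_first=False):
    """Try every candidate label; return (best_label, total goodness per label)."""
    totals = []
    for label in range(num_labels):
        one_hot = [1.0 if k == label else 0.0 for k in range(num_labels)]
        g = layer_goodnesses(layers, list(features) + one_hot)
        totals.append(sum(g[1:] if skip_first else g))
    best = max(range(num_labels), key=lambda k: totals[k])
    return best, totals

# Hand-built one-layer network over [feature, label0, label1], for illustration only.
toy_layers = [[[1.0, 1.0, 0.0],     # unit 0: most active for feature > 0 with label 0
               [-1.0, 0.0, 1.0]]]   # unit 1: most active for feature < 0 with label 1

print(predict(toy_layers, [1.0], 2))
print(predict(toy_layers, [-1.0], 2))
```

The cost is one forward pass per candidate label, which is why the one-pass softmax readout is attractive when there are many classes.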

🥊 Forward-Forward vs backprop, live

Everything above, assembled and trained in the browser. Both learners see the same 300-point dataset — an inner disk (● class A) surrounded by a ring (○ class B), not linearly separable. The backprop net is a standard 2→12→12→2 MLP with softmax cross-entropy. The FF net has the same two hidden layers of 12 ReLU units, but its input is [x, y, label as one-hot], it trains with the local goodness rule (θ = 2, positives = true label, negatives = wrong label), normalizes between layers, and classifies by running both candidate labels and comparing summed goodness. Press train.
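One training step of an FF learner like the one described above could be sketched as follows. This is not the demo's actual code: the per-sample gradient step, the learning rate, the bias-free layers, the tiny layer sizes, and the starting weights are all assumptions. The input layout [x, y, one-hot label], θ = 2, and the positive/negative labeling come from the description.

```python
import math

THETA = 2.0  # threshold used by the demo described above

def normalize(y, eps=1e-8):
    length = math.sqrt(sum(v * v for v in y))
    return [v / (length + eps) for v in y]

def ff_pass(layers, inp, target, lr=0.05):
    """One forward pass; every layer updates itself as the pass reaches it.

    target is 1 for a positive input and 0 for a negative one.
    Returns the goodness each layer had before its update.
    """
    h, seen = inp, []
    for W in layers:
        y = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in W]
        g = sum(v * v for v in y)
        p = 1.0 / (1.0 + math.exp(-(g - THETA)))
        for j, row in enumerate(W):
            for i in range(len(row)):
                # Local rule: uses only this layer's input h, output y and goodness.
                row[i] -= lr * (p - target) * 2.0 * y[j] * h[i]
        seen.append(g)
        h = normalize(y)  # the next layer sees direction, not length
    return seen

def ff_train_step(layers, point, label, num_labels=2, lr=0.05):
    """Positive pass with the true label, then negative pass with a wrong one."""
    def with_label(k):
        return list(point) + [1.0 if c == k else 0.0 for c in range(num_labels)]
    wrong = (label + 1) % num_labels
    pos = ff_pass(layers, with_label(label), 1, lr)
    neg = ff_pass(layers, with_label(wrong), 0, lr)
    return pos, neg

# Tiny 4 -> 3 -> 2 network with assumed starting weights (the demo uses 12 units per layer).
layers = [
    [[0.5, 0.2, 0.6, -0.4], [-0.3, 0.4, -0.2, 0.7], [0.1, -0.5, 0.3, 0.3]],
    [[0.4, 0.2, 0.1], [-0.3, 0.1, 0.2]],
]
pos, neg = ff_train_step(layers, (0.2, -0.1), label=0)
print("goodness per layer, positive pass:", pos)
print("goodness per layer, negative pass:", neg)
```

No layer waits for any other: each weight update in `ff_pass` happens before the next layer has even computed its activities.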

[Interactive demo: 900 training steps, with live accuracy readouts for Forward-Forward and Backprop]

Backprop converges in fewer steps — exact global gradients are hard to beat, and that echoes the paper's own findings. But FF gets to the same boundary using only information each layer already had. After training, click anywhere on the Forward-Forward canvas to see inference happen: the point is run once with each candidate label, and the goodness readouts below decide the class.

⚖️ The honest scorecard

FF was published as a preliminary investigation, and the paper is upfront about where it stands:

Aspect	Backpropagation	Forward-Forward
Learning signal	Exact global error derivatives, chained backward through every layer	A local per-layer objective: goodness high on real data, low on fake
Memory during training	Must store all activations for the backward pass	Nothing stored — a layer updates and is done
Differentiability	Every forward operation must have a known derivative	A black-box stage between layers is fine — no derivative ever crosses it
Timing	Update locking: layer 1 waits for the full round trip	Layers update during the pass; learning can be pipelined through streaming data
Biological plausibility	Weight transport + backward error channel: no cortical evidence	Forward-only and local — closer to something neurons could do
Hardware	Wants exact digital arithmetic	Tolerates imprecise, unit-to-unit-variable analog hardware — Hinton's "mortal computation"
Empirical results	State of the art everywhere it's been tried at scale	1.36% MNIST test error (4 hidden layers of 2000 ReLUs, 60 epochs), comparable to a ~1.4% backprop MLP baseline; slightly behind on CIFAR-10 (≈41% test error for FF vs ≈37% for backprop); unproven at scale
Speed	Faster (reaches FF's MNIST accuracy in ~20 epochs)	Somewhat slower (~60 epochs for the MNIST result), and generalizes slightly worse on several of the toy problems tested

So FF is not a backprop replacement today. It is an existence proof: networks can learn useful multi-layer representations from purely local, forward-only signals — which matters if the goal is to understand how the cortex might learn, or to train networks on low-power analog hardware where backprop is not an option.

Takeaway. Backprop answers "how should this weight change?" with one exact, global, backward-flowing derivative. Forward-Forward answers it with two cheap local questions — "did real data excite you?" (be more excitable) and "did fake data excite you?" (be less) — asked during two ordinary forward passes, with layer normalization forcing every layer to add new information rather than echo the last one's verdict.

Source: G. Hinton, The Forward-Forward Algorithm: Some Preliminary Investigations, 2022 — arXiv:2212.13345.

Written on July 5, 2026