A hands-on tour through the operations that let machines see — convolutions, kernels, strides, padding, and pooling, made visible. Drag, click, and step through every operation.
A convolution slides a small grid — the kernel — across a larger grid (your image), multiplying overlapping cells and summing the result. Each sum becomes one pixel of the output. Press Step to watch the kernel walk.
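With stride 1 and no padding, a 5×5 input and a 3×3 kernel produce a 3×3 output, because the kernel fits in only 5 − 3 + 1 = 3 positions along each axis. The output shrinks, which is where padding comes in.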
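The sliding, multiplying, and summing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind this page: the function name conv2d_valid and the sample 5×5 input are assumptions made for the example. Like deep-learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    # The kernel fits in (input - kernel + 1) positions along each axis.
    out_h, out_w = ih - kh + 1, iw - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]   # cells under the kernel
            out[r, c] = np.sum(window * kernel)  # multiply, then sum
    return out

# Hypothetical 5x5 input holding the values 0..24, and a 3x3 averaging kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0

result = conv2d_valid(image, kernel)
print(result.shape)  # (3, 3)
```

Each output cell here is the mean of the nine input cells under the kernel, so the top-left output is the mean of 0, 1, 2, 5, 6, 7, 10, 11, 12.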
The numbers inside the kernel decide what the convolution does: detect edges, blur, sharpen, or emboss. Edit the 3×3 grid below or pick a preset to see how a real image of a cat transforms under different kernels.
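To make the effect of the kernel values concrete, here is a small sketch with a few commonly used 3×3 kernels. The values below are textbook choices and may differ from the presets on this page; the names KERNELS and apply_kernel, and the synthetic edge image, are assumptions made for the example.

```python
import numpy as np

# Commonly used 3x3 kernels (assumed values, not necessarily this page's presets).
KERNELS = {
    "identity": np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float),
    "box_blur": np.ones((3, 3)) / 9.0,
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "sobel_x": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "emboss": np.array([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]], dtype=float),
}

def apply_kernel(image, kernel):
    """Stride-1, unpadded sliding window: multiply overlapping cells and sum."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Synthetic 5x5 image: dark on the left, bright on the right (a vertical edge).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

edges = apply_kernel(image, KERNELS["sobel_x"])     # large where brightness jumps
blurred = apply_kernel(image, KERNELS["box_blur"])  # edge smeared into a ramp
unchanged = apply_kernel(image, KERNELS["identity"])
print(edges[0])    # [0. 4. 4.]
```

The horizontal-gradient kernel responds only where the window straddles the dark-to-bright boundary, while the blur kernel turns the hard step into a gradual ramp.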
In a real CNN, these weights aren't hand-designed — they're learned from data through gradient descent.
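As a toy illustration of learning kernel weights, the sketch below recovers a 3×3 kernel by plain gradient descent on a mean-squared error. Everything in it is an assumption made for the example: a single random 16×16 image, a hidden target kernel, a fixed learning rate, and a hand-derived gradient. A real CNN trains many kernels at once with an autodiff framework and a task loss.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1, unpadded sliding-window sum of products."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))  # hypothetical training image

# The "unknown" kernel we want to recover, used only to generate targets.
target_kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
target = conv2d_valid(image, target_kernel)

kernel = np.zeros((3, 3))  # start with no knowledge of the weights
learning_rate = 0.1        # assumed value, chosen for this toy problem
losses = []
for step in range(200):
    error = conv2d_valid(image, kernel) - target
    losses.append(np.mean(error ** 2))
    out_h, out_w = error.shape
    grad = np.empty_like(kernel)
    for i in range(3):
        for j in range(3):
            # d(loss)/d(kernel[i, j]) = 2 * mean(error * input cells seen by that weight)
            grad[i, j] = 2.0 * np.mean(error * image[i:i + out_h, j:j + out_w])
    kernel -= learning_rate * grad  # step downhill

print(np.round(kernel, 2))
```

Because the output is linear in the kernel weights, this particular problem is a least-squares fit, which is why simple gradient descent is enough here.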
Stride is how many cells the kernel jumps at each step; bigger strides skip positions and shrink the output faster. Padding wraps the input in a border of zeros so the kernel can be centered on edge pixels. With stride 1 and a 3×3 kernel, one ring of zeros keeps the output the same size as the input.
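The effect of stride and padding on output size follows a standard formula: for input size n, kernel size k, padding p, and stride s, the output size along one axis is floor((n + 2p − k) / s) + 1. The sketch below applies it; the names output_size and conv2d and the sample input are assumptions made for the example.

```python
import numpy as np

def output_size(n, k, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

def conv2d(image, kernel, stride=1, padding=0):
    """Sliding-window sum of products with zero padding and a stride."""
    # Surround the input with `padding` rings of zeros.
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = output_size(image.shape[0], kh, stride, padding)
    out_w = output_size(image.shape[1], kw, stride, padding)
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride  # the kernel jumps `stride` cells
            out[r, c] = np.sum(padded[top:top + kh, left:left + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5x5 input
kernel = np.ones((3, 3))

same = conv2d(image, kernel, stride=1, padding=1)     # padding keeps 5x5
strided = conv2d(image, kernel, stride=2, padding=0)  # stride 2 shrinks to 2x2
print(same.shape, strided.shape)  # (5, 5) (2, 2)
```

With a 3×3 kernel, padding of 1 cancels the shrinkage at stride 1, while stride 2 without padding leaves only two kernel positions per axis on a 5×5 input.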
After convolution, feature maps are still large. Pooling shrinks them by replacing each window with a single number: either the max (preserve the strongest signal) or the average (preserve the overall level of activation). It also makes the network slightly translation-invariant, since a small shift in the input often leaves the pooled value unchanged.
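Max and average pooling can be sketched as follows. The function name pool2d, the 2×2 window with stride 2, and the sample feature map are assumptions made for the example.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Replace each size x size window with its max or its mean."""
    if mode == "max":
        reduce = np.max
    elif mode == "mean":
        reduce = np.mean
    else:
        raise ValueError("mode must be 'max' or 'mean'")
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            out[r, c] = reduce(feature_map[top:top + size, left:left + size])
    return out

# Hypothetical 4x4 feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 9, 5],
                 [1, 1, 7, 8]], dtype=float)

max_pooled = pool2d(fmap, mode="max")   # [[6, 2], [2, 9]]
avg_pooled = pool2d(fmap, mode="mean")  # [[3.5, 1.0], [1.0, 7.25]]
print(max_pooled)
print(avg_pooled)
```

Both variants cut the 4×4 map to 2×2. Max pooling keeps only the largest value in each window, while average pooling blends all four values.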