CNN Field Guide — Fundamentals

§ 01 — Foundation

What is a convolution, really?

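A convolution slides a small grid, the kernel, across a larger grid (your image), multiplying overlapping cells and summing the result. Each sum becomes one pixel of the output. (Strictly, sliding the kernel without flipping it is cross-correlation; deep-learning libraries and this guide call it convolution.) Press Step to watch the kernel walk.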

Demo 01 · Sliding Window

Legend: Input, Kernel position, Output cell

Panels: Input · Kernel · Output

output[i,j] = Σ input[i+k, j+l] · kernel[k,l], summed over every kernel row k and column l


A 5×5 input with a 3×3 kernel produces a 3×3 output, because the kernel fits in only 5 − 3 + 1 = 3 positions along each axis. The shape shrinks, and that's where padding comes in.
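The sliding-window sum above can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's actual code: the function name `conv2d_valid` and the test image are assumptions made for the example, and it covers only stride 1 with no padding.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1  # number of vertical kernel positions
    out_w = image.shape[1] - kw + 1  # number of horizontal kernel positions
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]  # cells under the kernel
            out[i, j] = np.sum(window * kernel)
    return out

# Hypothetical 5x5 input holding 0..24, with an all-ones 3x3 kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
result = conv2d_valid(image, kernel)
print(result.shape)  # (3, 3)
print(result[0, 0])  # 54.0, the sum of the top-left 3x3 block
```

The double loop mirrors the Step button: each iteration is one kernel position and writes one output cell.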

§ 02 — The Operator

Kernels are tiny instruction sets.

The numbers inside the kernel decide what the convolution does: detect edges, blur, sharpen, or emboss. Edit the 3×3 grid below or pick a preset to see how a real image of a cat transforms under different operators.

Demo 02 · Kernel Workshop

Panels: Original · Filtered output

Control: Edit kernel weights (preset: Identity)

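Edge detection uses a kernel whose center is positive and whose surroundings are negative, typically with weights that sum to zero. On a flat region the terms cancel and the result is zero; when neighbors disagree (an edge!), the result is large in magnitude. Blur averages neighbors. Sharpen amplifies the center against its surroundings.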
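To make the three behaviors concrete, here is a small sketch that applies common textbook kernels to a tiny image with one vertical edge. The weights in `KERNELS` are widely used defaults chosen for illustration; the page's own presets are not listed, so treat these values and the helper `apply_kernel` as assumptions.

```python
import numpy as np

# Common textbook 3x3 kernels (assumed values, not taken from the demo).
KERNELS = {
    "identity": np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float),
    "edge": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
    "blur": np.full((3, 3), 1.0 / 9.0),
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
}

def apply_kernel(image, kernel):
    """Stride-1, no-padding sliding window: multiply and sum at each position."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark (0) on the left three columns, bright (1) on the right three.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

edge = apply_kernel(image, KERNELS["edge"])
blurred = apply_kernel(image, KERNELS["blur"])
sharpened = apply_kernel(image, KERNELS["sharpen"])

print(edge[0])       # zero on flat regions, nonzero only next to the edge
print(blurred[0])    # a gradual ramp from 0 to 1
print(sharpened[0])  # undershoot and overshoot on either side of the edge
```

On this input, each row of the edge output is 0, −3, 3, 0: the flat regions cancel to zero and only the two positions straddling the edge respond.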

In a real CNN, these weights aren't hand-designed — they're learned from data through gradient descent.
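As a rough illustration of what "learned through gradient descent" means, the sketch below starts from an all-zero kernel and nudges it until its output matches that of a known edge kernel on one random image. Everything here is an assumption made for the example: the single-image setup, the mean-squared-error loss, the learning rate, and the step count. A real CNN trains many kernels at once on large datasets with automatic differentiation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1, no-padding sliding window: multiply and sum at each position."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))  # toy training input

# The "unknown" kernel we want to recover, used only to produce targets.
target_kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
target = conv2d_valid(image, target_kernel)

kernel = np.zeros((3, 3))  # start with no knowledge
learning_rate = 0.05       # assumed value, small enough for this toy problem
out_h, out_w = target.shape

for step in range(2000):
    error = conv2d_valid(image, kernel) - target
    # Gradient of the mean squared error with respect to each kernel weight:
    # weight (k, l) touches input[i + k, j + l] at every output position (i, j).
    grad = np.zeros_like(kernel)
    for k in range(3):
        for l in range(3):
            grad[k, l] = 2.0 * np.mean(error * image[k:k + out_h, l:l + out_w])
    kernel -= learning_rate * grad

print(np.round(kernel, 2))  # should be close to target_kernel
```

The loop never sees `target_kernel` directly; it only sees the error between its output and the target output, which is the same signal a network gets from its loss.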

§ 03 — Geometry

Stride controls speed. Padding controls shape.

Stride is how many cells the kernel jumps each step: bigger strides skip positions and shrink the output faster. Padding wraps the input in zeros so the kernel can reach edge pixels. With stride 1 and P = (K − 1) / 2 (for odd K), the output stays the same size as the input.
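A minimal sketch of both knobs, assuming square zero padding applied with `np.pad` and the same stride on both axes. The function name `conv2d` and its parameters are illustrative, not the demo's implementation.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Sliding-window sum with zero padding and a configurable stride."""
    # Wrap the input in `padding` rings of zeros on every side.
    image = np.pad(np.asarray(image, dtype=float), padding, mode="constant")
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of this window
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.ones((8, 8))
kernel = np.ones((3, 3))
print(conv2d(image, kernel).shape)                       # (6, 6)
print(conv2d(image, kernel, padding=1).shape)            # (8, 8)
print(conv2d(image, kernel, stride=2).shape)             # (3, 3)
print(conv2d(image, kernel, stride=2, padding=1).shape)  # (4, 4)
```

With `padding=1` the corner output of this all-ones example is 4.0 rather than 9.0, because five of the nine cells under the kernel are padded zeros.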

Demo 03 · Geometry Lab

Legend: Input, Padding (zeros), Kernel, Output

Panels: Input (padded) · Output

Controls: Input size N = 8×8 (range 5 to 10) · Kernel size K = 3×3 (options 1, 3, 5, 7) · Stride S = 1 (range 1 to 4) · Padding P = 0 (range 0 to 4)

output = ⌊(N + 2P − K) / S⌋ + 1

N=8 K=3 P=0 S=1 → output = 6×6
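The formula translates directly into integer arithmetic. The helper names below are hypothetical, and the "same" padding helper assumes stride 1 and an odd kernel size.

```python
def conv_output_size(n, k, s=1, p=0):
    """Output side length: floor((N + 2P - K) / S) + 1."""
    if n + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (n + 2 * p - k) // s + 1

def same_padding(k):
    """Padding that keeps output size equal to input size (stride 1, odd K)."""
    if k % 2 == 0:
        raise ValueError("this helper assumes an odd kernel size")
    return (k - 1) // 2

print(conv_output_size(8, 3))            # 6, the demo's default settings
print(conv_output_size(8, 3, s=2))       # 3
print(conv_output_size(8, 3, p=1))       # 8
print(conv_output_size(10, 7, s=4, p=4)) # 3
print(same_padding(5))                   # 2
```

The floor matters whenever the stride does not divide N + 2P − K evenly: the kernel stops at the last position where it still fits, and any leftover rows and columns are ignored.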

§ 04 — Downsampling

Pooling compresses what matters.

After convolution, feature maps are still large. Pooling shrinks them by replacing each window with a single number: either the max (preserve the strongest signal) or the average (preserve the mean activation). It also makes the network slightly translation-invariant, because a feature that shifts by a pixel but stays inside the same window produces the same pooled value.

Demo 04 · Pooling Comparison

Panels: Input · Max-pooled output

Controls: Window size = 2×2 (range 2 to 4) · Pool stride = 2 (range 1 to 4)

max-pool: output[i,j] = max of input[i·S + k, j·S + l] over the window (k, l = 0 … W−1, with window size W and pool stride S)


Max pooling is the favorite for sharp features — it keeps the strongest activation and discards the rest. Average pooling smooths and is gentler on small variations.
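Both pooling modes can be sketched with the same windowing loop, swapping only the reduction. The function `pool2d`, its `mode` argument, and the sample arrays are assumptions made for this example.

```python
import numpy as np

def pool2d(x, window=2, stride=2, mode="max"):
    """Replace each window with its max or its mean."""
    x = np.asarray(x, dtype=float)
    reduce_fn = {"max": np.max, "avg": np.mean}[mode]
    out_h = (x.shape[0] - window) // stride + 1
    out_w = (x.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of this window
            out[i, j] = reduce_fn(x[r:r + window, c:c + window])
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]], dtype=float)

max_pooled = pool2d(x, mode="max")  # values: [[4, 2], [2, 8]]
avg_pooled = pool2d(x, mode="avg")  # values: [[2.5, 1.0], [0.75, 6.5]]
print(max_pooled)
print(avg_pooled)

# Small-shift invariance: a single activation moved one pixel to the right
# stays inside the same 2x2 window, so the pooled output is unchanged.
spike = np.zeros((4, 4))
spike[0, 0] = 1.0
shifted = np.roll(spike, 1, axis=1)  # activation now at (0, 1)
print(np.array_equal(pool2d(spike), pool2d(shifted)))  # True
```

The invariance is only slight: shifting the same activation by two pixels moves it into the neighboring window, and the pooled output changes.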