CNNs & Computer Vision · AI Engineer

5. CNNs & Computer Vision

What makes images different?

So far, our neural networks have taken a list of numbers as input. Height, weight, age. Three features, three input neurons. Done.

But an image isn't a tidy list. A small 28x28 grayscale image (like a handwritten digit) has 784 pixels. A 1080p frame (1920x1080) has about 2 million pixels, which comes to over 6 million numbers once you count the three color channels. And every pixel matters in relation to its neighbors.

If you flattened all those pixels into one long list and fed them to a plain neural network, something important would get lost. The network wouldn't know that pixel [10, 10] is next to pixel [10, 11]. It wouldn't know that two pixels being close together means something.
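To make the loss of neighborhood concrete, here is a small sketch that assumes the image is flattened row by row (the usual convention, though not the only one). The helper name flat_index is made up for illustration.

```python
WIDTH = 28  # width of the 28x28 example image

def flat_index(row, col, width=WIDTH):
    """Position of pixel [row, col] after flattening row by row."""
    return row * width + col

center = flat_index(10, 10)
right = flat_index(10, 11)   # horizontal neighbor
below = flat_index(11, 10)   # vertical neighbor

print(right - center)  # 1: still side by side in the list
print(below - center)  # 28: a direct neighbor, now 28 positions away
```

A fully connected layer treats every input position the same way, so nothing in the flattened list tells it that positions 290 and 318 were touching in the original image.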

Spatial structure matters. And that's exactly what Convolutional Neural Networks (CNNs) are designed to handle.

The core idea: a sliding filter

Think about how you'd scan a document for a specific word. You don't jump around randomly — you slide your eyes across the page, row by row, checking each position.

CNNs do something similar with images. They use a small filter (also called a kernel) that slides across the image, one step at a time, looking for a particular pattern at each position.

A filter is just a tiny grid of numbers — usually 3x3 or 5x5. At each position, the filter multiplies its values with the corresponding pixels in the image, sums everything up, and produces a single output number. This operation is called a convolution.

Filter (3x3):           Image patch (3x3):
  1  0  -1               100  120   80
  1  0  -1        ×       110  100   90
  1  0  -1                90  110   85

Result = (1×100)+(0×120)+(-1×80)
       + (1×110)+(0×100)+(-1×90)
       + (1×90) +(0×110)+(-1×85)
       = 45
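The same arithmetic can be checked in a few lines of plain Python. This is only a sketch of one filter position; the function name apply_filter is an illustrative choice, not a library call.

```python
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
patch = [
    [100, 120, 80],
    [110, 100, 90],
    [90, 110, 85],
]

def apply_filter(kernel, patch):
    """Multiply matching cells and add everything up."""
    return sum(
        k * p
        for k_row, p_row in zip(kernel, patch)
        for k, p in zip(k_row, p_row)
    )

print(apply_filter(kernel, patch))  # 45
```

The left column is counted positively and the right column negatively, so the filter responds when the left side of the patch is brighter than the right side.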

The filter slides across every position in the image — left to right, top to bottom. The result is a new grid of numbers called a feature map. Each number in the feature map represents "how strongly this filter pattern was detected at this location."
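Here is a minimal sketch of the sliding step itself, assuming a stride of 1 and no padding (so the feature map is slightly smaller than the image). Like most deep learning libraries, it does not flip the kernel, which strictly speaking makes it a cross-correlation. The function name and the toy image are made up for illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy 5x5 image: bright on the left, dark on the right.
image = [[9, 9, 9, 0, 0] for _ in range(5)]
vertical_edge = [[1, 0, -1] for _ in range(3)]

for row in convolve2d(image, vertical_edge):
    print(row)  # each row is [0, 27, 27]
```

The output is 0 where the window sees a flat region and 27 where it straddles the bright-to-dark boundary, which is the "how strongly was this pattern detected here" reading of a feature map.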

One filter detects one type of pattern. But in practice, you apply dozens or hundreds of filters to the same image. Each one learns to detect something different — edges, curves, textures, colors.

What do filters actually learn?

In the early days, engineers hand-crafted filters for specific tasks — edge detection, sharpening, blurring. Those filters still exist (you use them in Photoshop without knowing it).

But in deep learning, the filters aren't hand-designed. The network learns them from data.

Just like a plain neural network learns weights, a CNN learns what values to put in its filters. Start with random numbers. Apply backpropagation. Update the filter values. After thousands of training examples, the filters converge on patterns that are actually useful for the task.
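The loop below is a deliberately tiny sketch of that idea, not how a real CNN is trained. It assumes a single 1D filter, a made-up task (recover the filter that produced the targets), and a hand-written gradient of the mean squared error in place of full backpropagation. All names and numbers are illustrative.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def slide(signal, kernel):
    """Slide a 1D kernel along a signal (stride 1, no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hypothetical task: recover the filter that produced the targets.
target_kernel = [1.0, 0.0, -1.0]
signals = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(50)]
targets = [slide(s, target_kernel) for s in signals]

kernel = [random.uniform(-0.5, 0.5) for _ in range(3)]  # start with random numbers
learning_rate = 0.05

for epoch in range(200):
    for signal, target in zip(signals, targets):
        output = slide(signal, kernel)
        # Gradient of the mean squared error with respect to each weight.
        grads = [0.0, 0.0, 0.0]
        for i, (out, want) in enumerate(zip(output, target)):
            for j in range(3):
                grads[j] += 2 * (out - want) * signal[i + j] / len(output)
        # Update the filter values a small step against the gradient.
        kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

print([round(w, 3) for w in kernel])  # close to the target [1, 0, -1]
```

Nobody wrote the values 1, 0, -1 into the learned filter. They emerge because those are the values that make the error small, which is the same mechanism at work in a full network.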

For an image classifier:

Layer	What filters learn
Layer 1	Edges (horizontal, vertical, diagonal)
Layer 2	Corners, textures, simple shapes
Layer 3	Eyes, noses, wheels, leaves — object parts
Layer 4	Entire faces, cars, animals

This emergent structure is not programmed

Nobody told the network "look for eyes in layer 3." It discovered that edges → shapes → parts → objects is a useful way to decompose visual information. CNNs trained on natural images tend to arrive at a similar hierarchy on their own.

Pooling — making things smaller

After a convolution layer, you usually have a large feature map. You don't need all of it. You want to keep the important parts and throw away the rest.

Pooling does this. The most common type is max pooling — you divide the feature map into small regions and keep only the maximum value from each one.

Feature map (4x4):           After 2x2 max pooling:
  1   3   2   4
  5   6   1   2       →           6   4
  3   2   8   1                   4   8  ← max of bottom-right 2x2
  4   1   4   7
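A short sketch of 2x2 max pooling applied to the 4x4 feature map above, assuming non-overlapping windows (the window size equals the stride). The function name max_pool2d is an illustrative choice.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 8, 1],
    [4, 1, 4, 7],
]

print(max_pool2d(feature_map))  # [[6, 4], [4, 8]]
```

Sixteen values go in and four come out: one maximum per 2x2 block.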

Max pooling does three things:

Reduces size — fewer values to process in the next layer
Adds translation invariance — if a pattern shifts slightly, the max pooling still catches it
Keeps the strongest activations — the "did I detect this pattern?" signal survives

Translation invariance — why it matters

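If a cat sits in the top-left corner rather than the center of the image, a plain network sees a completely different set of inputs. In a CNN, the same filters scan every position, so the cat is detected wherever it appears, and max pooling then absorbs small shifts by keeping the strongest activation in each region. Together, these help the network recognize that "there is a cat here" regardless of exactly where "here" is.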
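A small sketch of what pooling does and does not absorb, using made-up 4x4 feature maps with a single strong activation. It assumes 2x2 non-overlapping windows; the names are illustrative.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[top + i][left + j]
                for i in range(size) for j in range(size))
            for left in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for top in range(0, len(feature_map) - size + 1, size)
    ]

def single_activation(row, col, value=9, size=4):
    """A size x size map that is zero except for one detection."""
    grid = [[0] * size for _ in range(size)]
    grid[row][col] = value
    return grid

original = single_activation(0, 0)
small_shift = single_activation(1, 1)   # moved, but inside the same 2x2 window
larger_shift = single_activation(1, 2)  # crossed into the next window

print(max_pool2d(original))      # [[9, 0], [0, 0]]
print(max_pool2d(small_shift))   # [[9, 0], [0, 0]]
print(max_pool2d(larger_shift))  # [[0, 9], [0, 0]]
```

The small shift disappears after pooling, while the larger one still shows up in the output. A single pooling layer only buys tolerance to small shifts; tolerance to bigger moves builds up across several layers.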

The full CNN architecture

A typical CNN stacks these operations in a sequence:

Convolutional layer → Activation (ReLU) → Pooling → repeat → Fully connected layers → Output

The early layers (convolution + pooling) act as a feature extractor — they transform the raw image into a compact set of features. The later layers (fully connected) act as a classifier — they take those features and decide what the image is.

You can think of the CNN as two networks bolted together. A spatial understanding machine feeding into a decision-making machine.
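To see the two halves working together, here is a sketch of one forward pass in plain Python. Everything in it is an assumption made for illustration: a 6x6 toy image, one hand-picked filter, and hand-picked (untrained) weights for a two-class output. A real network would learn these values and use a library such as PyTorch or TensorFlow.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(kernel[i][j] * image[top + i][left + j]
                for i in range(k_h) for j in range(k_w))
            for left in range(len(image[0]) - k_w + 1)
        ]
        for top in range(len(image) - k_h + 1)
    ]

def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping window."""
    return [
        [
            max(feature_map[top + i][left + j]
                for i in range(size) for j in range(size))
            for left in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for top in range(0, len(feature_map) - size + 1, size)
    ]

def flatten(feature_map):
    return [v for row in feature_map for v in row]

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(ws, inputs)) + b
            for ws, b in zip(weights, biases)]

image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]   # 6x6, edge down the middle
kernel = [[1, 0, -1] for _ in range(3)]          # vertical edge filter

# Feature extractor: 6x6 image -> 4x4 feature map -> 2x2 pooled -> 4 features
features = flatten(max_pool2d(relu(conv2d(image, kernel))))

# Classifier: hypothetical weights for the classes "edge" and "no edge"
weights = [[0.25, 0.25, 0.25, 0.25], [-0.25, -0.25, -0.25, -0.25]]
biases = [0.0, 1.0]
scores = dense(features, weights, biases)

print(features)  # [3, 3, 3, 3]
print(scores)    # [3.0, -2.0], so "edge" wins
```

The first three calls never look at class labels, and the last one never looks at pixels. That split is what "feature extractor plus classifier" means in practice.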

Why not just use a regular network?

Fair question. Let's say you have a 28x28 image. That's 784 inputs. If you connect those to a hidden layer of 512 neurons, you have 784 × 512 = 401,408 weights — just for the first layer.

Now try a 224x224 color photo. That's 224 × 224 × 3 = 150,528 inputs. Connect that to 1000 neurons: 150 million weights in one layer.

CNNs solve this by weight sharing. One filter — say, 3x3x3 (27 values) — slides across the entire image. Every position uses the same 27 weights. The filter at the top-left is identical to the filter at the bottom-right.

This is a brilliant trick. Instead of learning a different detector for "edge in the top-left corner" and "edge in the bottom-right corner" separately, you learn one edge detector that works everywhere.

Approach	Weights in one layer (224x224 color image)
Fully connected (1000 neurons)	~150 million
Conv layer (32 filters, 3x3x3)	32 × 3 × 3 × 3 = 864

That's not a typo. A convolutional layer can capture spatial patterns with a tiny fraction of the parameters.
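The numbers in this section can be reproduced with two one-line formulas. This sketch counts weights only and leaves out biases (one per output neuron or per filter), which is how the figures above are counted. The function names are illustrative.

```python
def dense_weights(n_inputs, n_neurons):
    """Every input connects to every neuron."""
    return n_inputs * n_neurons

def conv_weights(n_filters, kernel_h, kernel_w, in_channels):
    """Each filter has one weight per kernel cell per input channel."""
    return n_filters * kernel_h * kernel_w * in_channels

print(dense_weights(28 * 28, 512))         # 401408
print(dense_weights(224 * 224 * 3, 1000))  # 150528000
print(conv_weights(32, 3, 3, 3))           # 864
```

Note that the convolutional count does not depend on the image size at all. The same 864 weights would cover a 28x28 image or a 4000x3000 one.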

Real CNNs that changed everything

A few architectures worth knowing:

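LeNet (1989 to 1998): the original. Yann LeCun and colleagues built the first version in 1989 to read handwritten ZIP codes from US Postal Service data, and LeNet-5, the best-known version, followed in 1998. It was the proof of concept.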

AlexNet (2012) — won the ImageNet competition by a huge margin. Used GPUs, deep stacking, and ReLU. Started the modern deep learning era.

VGG (2014) — showed that depth matters more than complicated architectures. Just stack 3x3 conv layers, deeper and deeper.

ResNet (2015) — introduced skip connections (shortcuts that let gradients flow past layers). Made it possible to train networks 100+ layers deep without them breaking down.
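A rough way to see why the shortcut helps, under heavy simplifying assumptions: treat each layer as a single scalar weight w, so a plain layer computes w * x and a residual layer computes x + w * x. This is a toy model for intuition, not the ResNet architecture itself.

```python
def plain_gradient(weights):
    """d(output)/d(input) for stacked layers y = w * x."""
    gradient = 1.0
    for w in weights:
        gradient *= w
    return gradient

def residual_gradient(weights):
    """d(output)/d(input) for stacked layers y = x + w * x."""
    gradient = 1.0
    for w in weights:
        gradient *= 1.0 + w  # the "+ 1" comes from the skip connection
    return gradient

weights = [0.01] * 100  # 100 layers with small weights

print(plain_gradient(weights))     # vanishingly small, about 1e-200
print(residual_gradient(weights))  # about 2.7, still a usable signal
```

In the plain stack, the gradient reaching the first layer is a product of 100 small numbers and effectively vanishes. With the shortcut, every factor stays near 1, so the early layers still receive a learning signal.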

CNNs are everywhere now

The camera app on your phone uses a CNN to detect faces. Google Photos uses CNNs to search your pictures by content. Medical imaging systems use CNNs to spot tumors in X-rays. Tesla's autopilot uses CNNs to identify road markings and pedestrians in real time.

Beyond images

CNNs were invented for images, but the same idea — sliding a filter over structured data — works for other things too.

1D convolutions work on sequences of numbers (audio signals, time-series data, even text). The filter slides along the time axis instead of across a grid.
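As a sketch of the 1D case, the filter below slides along a made-up signal and responds wherever the value changes. Stride 1 and no padding are assumed, and the names are illustrative.

```python
def convolve1d(signal, kernel):
    """Slide a 1D kernel along the signal (stride 1, no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 0, 5, 5, 5, 0, 0]  # e.g. a sensor reading over time
change_detector = [-1, 1]          # responds to a step between neighbors

print(convolve1d(signal, change_detector))  # [0, 0, 5, 0, 0, -5, 0]
```

The positive spike marks where the signal jumps up and the negative spike marks where it drops back. It is the 1D counterpart of an edge detector in an image.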

3D convolutions work on video (the third dimension is time) or medical scans like MRI (the third dimension is depth slices).

The key insight is: if your data has local structure — where neighboring values are related to each other — convolutions are probably a good tool.

What's next?

CNNs handle spatial data beautifully. But what about data where order matters in a different way? What if you're reading a sentence and the word "not" completely flips the meaning of everything that comes after it?

That's the problem with sequences. And it's messier than it sounds.

Next up: RNNs and Sequential Data. We'll look at how recurrent networks handle time, memory, and the challenge of learning from sequences — plus why they struggle with long dependencies and what LSTM and GRU do about it.