Convolutional Neural Networks for Image Classification

Computer vision is one of the most transformative corners of modern AI, and convolutional neural networks are the engine powering almost all of it. Before CNNs arrived on the scene, teaching a machine to recognize a cat required hand-crafting features: you'd painstakingly write rules about whisker shapes, ear geometry, and fur texture. Researchers spent years on this. The results were fragile and domain-specific. Then convolutional networks changed everything.
The shift happened because CNNs don't need you to tell them what to look for. You hand them labeled images, and they figure out the relevant features on their own, from raw pixels to polished predictions. Today, this paradigm powers self-driving cars scanning the road ahead, medical imaging systems spotting tumors in X-rays, satellite imagery analysis detecting deforestation, smartphone cameras identifying faces in real time, and manufacturing quality-control systems catching defects on assembly lines at superhuman speed. The applications are endless because images are everywhere, and CNNs have gotten extraordinarily good at understanding them.
What makes image classification hard for traditional neural networks isn't just the scale of the problem; it's the nature of image data itself. Spatial relationships matter. The pixel at position (100, 100) doesn't mean anything in isolation; it only has meaning in relation to its neighbors. An edge is an edge whether it appears on the left side of the image or the right. A texture is a texture whether it's in the foreground or background. Traditional dense layers miss all of this. They treat pixels as interchangeable numbers with no spatial context whatsoever.
CNNs were specifically designed to exploit the structure that images naturally carry. They understand locality, they share weights across positions, and they build up representations hierarchically, from edges to shapes to objects, layer by layer. This article breaks down exactly how that works, walks you through building and training a CNN from scratch in PyTorch, and gives you the mental models you need to design and debug your own architectures. We're going practical and specific, so by the end you'll have working code and genuine understanding, not just surface familiarity.
You've trained dense neural networks. You've tuned learning rates. But here's the thing about images: they have structure. Pixels next to each other matter. That's what convolutional neural networks (CNNs) exploit.
In this article, we're going deep into how CNNs work, from the mechanics of convolution itself, through pooling and skip connections, all the way to training a competitive image classifier. You'll see why a simple "look at every pixel independently" approach fails on images, and why convolution is so elegant.
Let's build something real.
Table of Contents
- What's Wrong With Dense Layers on Images?
- How Convolutions See Images
- The Convolution Operation: Kernels, Strides, Padding
- Pooling and Feature Hierarchies
- Building a CNN from Scratch
- Architecture Design Principles
- Why Batch Normalization Matters
- Classic Architectures: LeNet, AlexNet, VGG
- ResNet and Skip Connections: Solving Vanishing Gradients
- Data Preparation: Transforms and Augmentation
- Training and Achieving >90% on CIFAR-10
- Common CNN Mistakes
- Feature Map Visualization: See What Your Network Learns
- Summary
What's Wrong With Dense Layers on Images?
Before we talk convolutions, let's be honest about the problem. A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. That's 150,528 inputs to your first dense layer. If the next hidden layer has even 1,024 neurons, you're looking at roughly 154 million parameters in one layer. That's computational hell, and most of those parameters learn nothing useful.
Here's why: a dense layer treats every pixel the same. It doesn't care that pixel (10, 10) and pixel (11, 10) are neighbors. It doesn't understand that edges, textures, and shapes repeat across the image. You're throwing away the structure of the data.
Convolution changes that. Instead of connecting every pixel to every neuron, you slide small filters (kernels) across the image. Each filter learns to detect something local: an edge, a curve, a blob of color. You reuse that filter everywhere in the image. Fewer parameters. More meaning.
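To put numbers on that, here's a quick sketch comparing the parameter count of a first dense layer against a small conv layer; the layer sizes are just the ones from the paragraphs above:

```python
import torch.nn as nn

# First layer of a dense network on a 224x224 RGB image (150,528 values)
dense = nn.Linear(224 * 224 * 3, 1024)

# A typical first conv layer: 64 filters of size 3x3 over 3 channels
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print(f"Dense layer parameters: {num_params(dense):,}")  # 154,141,696
print(f"Conv layer parameters:  {num_params(conv):,}")   # 1,792
```

Five orders of magnitude fewer parameters, and the conv layer still covers the entire image because its filters slide across it.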
How Convolutions See Images
To really understand CNNs, you need to think about what a filter actually represents. A 3×3 filter is a tiny 3×3 grid of learned numbers. When you slide it over an image, you're asking: "How much does this 3×3 region of the image match this pattern?" High response means the pattern is present. Low or negative response means it isn't. Each filter is, in essence, a tiny pattern detector.
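As a hypothetical illustration (real CNN filters are learned, not hand-written), here's a Sobel-style vertical-edge kernel applied to a synthetic half-dark, half-bright image:

```python
import torch
import torch.nn.functional as F

# Hand-set Sobel-style vertical-edge kernel. In a real CNN these nine
# numbers are learned; writing them by hand just shows what a filter IS.
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

# Synthetic 6x6 image: dark left half, bright right half
image = torch.zeros(1, 1, 6, 6)
image[:, :, :, 3:] = 1.0

response = F.conv2d(image, kernel)
print(response[0, 0])
# Strong response where dark meets bright, zero in the flat regions
```

The output is high exactly at the dark-to-bright boundary and zero everywhere else: the filter is a vertical-edge detector.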
In the early layers of a CNN, filters learn to detect low-level primitives: horizontal edges, vertical edges, diagonal edges, gradients of color, corners where two edges meet. You can actually visualize the filters from the first layer of a trained network, and they look surprisingly like the basic building blocks of visual structure: simple oriented bars and color blobs. Nobody predetermined these; the network discovers them because they're genuinely useful for distinguishing one class of image from another.
The critical insight is weight sharing. A filter applied to detect a horizontal edge in the top-left corner of an image uses the exact same weights when it's applied to the bottom-right corner. This is a mathematically elegant way to encode the intuition that a horizontal edge is a horizontal edge regardless of where it appears in the image. It's also why CNNs need far fewer parameters than dense networks: instead of learning a unique set of weights for every position, you learn one set of weights (one filter) and apply it everywhere. A single 3×3 filter has just 9 weights plus a bias. Even with 64 such filters in a first conv layer, that's only 640 parameters to cover the entire spatial extent of the input image.
The depth of the filter matters too. For an RGB image with 3 color channels, a 3×3 filter is actually a 3×3×3 volume: it spans all three channels simultaneously. This lets filters detect color-channel combinations, not just spatial patterns. A filter might learn to respond to "red blob next to green blob" as a single coherent detector. Each output feature map is produced by one filter sliding across all input channels, which is why the number of output channels equals the number of filters you specify.
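You can confirm this directly by inspecting a conv layer's weight tensor; the shapes follow PyTorch's (out_channels, in_channels, height, width) layout:

```python
import torch.nn as nn

# Conv weights have shape (out_channels, in_channels, kernel_h, kernel_w)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

print(tuple(conv.weight.shape))  # (64, 3, 3, 3): 64 filters, each a 3x3x3 volume
print(tuple(conv.bias.shape))    # (64,): one bias per filter
```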
The Convolution Operation: Kernels, Strides, Padding
Let's start simple. A convolution is a sliding window operation.
The kernel is a small matrix, say, 3×3. You place it at the top-left of your input, multiply each kernel value by the image pixel underneath, sum it all up, and write that sum as the output. Then you slide right. When you hit the edge, you drop down and slide left again. That's convolution.
Here's a concrete example to make this tangible. We'll create a minimal conv layer, pass a tiny synthetic image through it, and observe how the spatial dimensions change:
```python
import torch
import torch.nn as nn

# 1 input channel, 1 output channel, 3x3 kernel
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

# Create a simple input: 1 sample, 1 channel, 5x5 image
x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
print("Input:")
print(x[0, 0])

# Apply convolution
y = conv(x)
print(f"\nOutput shape: {y.shape}")
print("Output:")
print(y[0, 0])
```

The output shows how the 5×5 input becomes a 3×3 output. Why 3×3? Because the 3×3 kernel fits in a 5×5 image exactly 3 times horizontally and 3 times vertically. Each output value is the dot product between the kernel weights and a 3×3 patch of the input, one scalar value summarizing how well that patch matches the learned pattern.
The formula for output size is:
output_size = floor((input_size - kernel_size + 2*padding) / stride) + 1
Stride controls how far you slide the kernel. Stride=1 (the default) means one pixel over. Stride=2 means two pixels over, faster, but you "miss" pixels in between.
Padding adds zeros (or other values) around the edges. Without padding, your output shrinks with each convolution. With padding=1 on a 3×3 kernel, you preserve the input size. That's why you'll often see padding=1 and kernel_size=3 together.
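The formula is easy to wrap in a small helper, a sketch like this, so you can sanity-check layer shapes before building a network:

```python
import math

def conv_output_size(input_size: int, kernel_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of one conv layer, per the formula above."""
    return math.floor((input_size - kernel_size + 2 * padding) / stride) + 1

print(conv_output_size(5, 3))              # 5x5 input, 3x3 kernel -> 3
print(conv_output_size(8, 3, padding=1))   # padding=1 preserves the size -> 8
print(conv_output_size(8, 3, stride=2))    # stride=2 roughly halves it -> 3
```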
Run this and you'll immediately see the practical impact of these hyperparameters on feature map sizes, something that trips up almost everyone when they first build CNN architectures:
```python
# Stride and padding example
conv_stride2 = nn.Conv2d(1, 1, kernel_size=3, stride=2)
conv_padded = nn.Conv2d(1, 1, kernel_size=3, padding=1)

x = torch.randn(1, 1, 8, 8)
print(f"Input: {x.shape}")
print(f"With stride=2: {conv_stride2(x).shape}")
print(f"With padding=1: {conv_padded(x).shape}")
```

When you stack multiple conv layers, the spatial dimensions shrink (unless you use padding). That's intentional: early layers detect fine details, later layers work with abstract features over larger receptive fields. The receptive field of a neuron in a deep layer can span hundreds of pixels in the original image even though each individual filter is only 3×3, because each layer builds on the output of the previous one.
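One standard way to compute the receptive field of a stack of layers is the recurrence rf += (kernel - 1) * jump, where jump is the product of strides so far. This small sketch applies it to a hypothetical conv/pool stack:

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # stride compounds the step size
    return rf

# Two stacked 3x3 convs cover the same 5x5 region as one 5x5 conv
print(receptive_field([(3, 1), (3, 1)]))  # 5

# A hypothetical stack: conv3x3, pool2x2, conv3x3, pool2x2, conv3x3
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # 18
```

Even this modest five-layer stack sees an 18×18 window of the input; deep networks with many pooling stages quickly reach receptive fields spanning the whole image.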
Pooling and Feature Hierarchies
Convolution detects features, but you end up with a lot of them. Pooling layers reduce spatial dimensions while keeping the important information. But pooling does something more subtle than just downsizing: it builds the hierarchical structure that makes CNNs so powerful.
Think about what happens as you alternate conv layers with pooling layers. After the first pool, each feature map value summarizes a 2×2 region of the original image. After two pools, each value summarizes a 4×4 region. After three pools, an 8×8 region. Neurons deeper in the network have a much larger window onto the original image, which means they can detect patterns that span larger spatial extents. Early layers detect edges and corners. Middle layers combine those into textures and object parts like wheels or windows. Deep layers recognize whole objects: cars, dogs, faces. This hierarchy isn't programmed; it emerges naturally from the combination of convolution, nonlinear activation, and pooling.
Max pooling slides a window (say, 2×2) over your feature maps and takes the maximum value in each window. Why the max? It's like asking "was this feature detected anywhere in this region?" The most activated neuron in that region matters most.
Average pooling takes the mean instead: smoother and sometimes more stable, but max pooling is the standard for CNNs.
```python
import torch
import torch.nn as nn

# Max pooling
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 3, 8, 8)
y = maxpool(x)
print(f"Input: {x.shape}")
print(f"After MaxPool2d(2, 2): {y.shape}")  # (1, 3, 4, 4)

# Average pooling
avgpool = nn.AvgPool2d(kernel_size=2, stride=2)
y_avg = avgpool(x)
print(f"After AvgPool2d(2, 2): {y_avg.shape}")  # (1, 3, 4, 4)

# Adaptive pooling: guarantee output size
adaptive_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
y_adaptive = adaptive_pool(x)
print(f"After adaptive pooling: {y_adaptive.shape}")  # (1, 3, 1, 1)
```

Adaptive pooling is clever: you specify the output size you want, and PyTorch figures out the pooling window automatically. This matters when you're combining networks trained on different image sizes, or when you want to guarantee a fixed size before a dense layer. Pooling also gives you some translation invariance for free: a feature detected anywhere in the pooled region still contributes positively to the output, making the network robust to small shifts in object position.
Building a CNN from Scratch
Let's piece this together. A typical CNN block is: Conv → Batch Norm → ReLU → Pool. This ordering matters; BatchNorm before activation tends to work better than after in practice, though you'll see both in real codebases.
```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        # Block 1
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2, 2)
        # Block 2
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2, 2)
        # Block 3
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool3 = nn.MaxPool2d(2, 2)
        # After 3 pooling operations, a 32x32 image becomes 4x4
        self.fc1 = nn.Linear(128 * 4 * 4, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        # Block 1: 32x32 -> 16x16
        x = self.pool1(torch.relu(self.bn1(self.conv1(x))))
        # Block 2: 16x16 -> 8x8
        x = self.pool2(torch.relu(self.bn2(self.conv2(x))))
        # Block 3: 8x8 -> 4x4
        x = self.pool3(torch.relu(self.bn3(self.conv3(x))))
        # Flatten and classify
        x = x.view(x.size(0), -1)
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Test with CIFAR-10 input (32x32 RGB)
model = SimpleCNN(num_classes=10)
x = torch.randn(4, 3, 32, 32)
output = model(x)
print(f"Input: {x.shape}")
print(f"Output: {output.shape}")  # (4, 10) - 4 samples, 10 classes
```

Notice the progression: 32 → 64 → 128 channels. Early layers extract simple features (edges, colors). Later layers build on those to recognize more complex patterns. The spatial dimensions shrink (32→16→8→4) while channel depth grows. That's the CNN sweet spot: you're trading spatial resolution for semantic richness as you go deeper. The final x.view(x.size(0), -1) flattens the spatial feature maps into a 1D vector that the dense classifier can work with.
Architecture Design Principles
Designing a CNN architecture isn't magic; there are clear principles that separate good designs from mediocre ones. Once you internalize these, you can look at any architecture like ResNet or EfficientNet and immediately understand the reasoning behind its choices.
The first principle is the channel-spatial tradeoff. As spatial dimensions shrink through pooling, channel count should grow. This keeps the total information capacity roughly constant across layers. If you halve the spatial dimensions (from 16×16 to 8×8), you should roughly double the channels (from 64 to 128). Violating this, say, shrinking spatial dimensions while keeping channels constant, creates an information bottleneck where you lose representational capacity.
The second principle is receptive field growth. Each conv layer only sees a local neighborhood, but by stacking layers, each neuron in a deeper layer effectively sees a larger region of the original image. Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution, but require fewer parameters and introduce more nonlinearity. This is why modern architectures overwhelmingly prefer stacks of small 3×3 kernels over single large kernels: you get the same coverage with more expressiveness and less compute.
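A quick parameter count makes the claim concrete; the channel width of 64 here is just an illustrative choice:

```python
import torch.nn as nn

C = 64  # channel width; an illustrative choice

# Same 5x5 receptive field, two different ways
one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
two_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print(f"One 5x5 conv:  {num_params(one_5x5):,}")  # 102,464
print(f"Two 3x3 convs: {num_params(two_3x3):,}")  # 73,856
```

The 3×3 stack is roughly 28% cheaper and includes an extra ReLU, so it's also more expressive.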
The third principle is depth before width. Adding more layers (depth) generally outperforms adding more filters per layer (width) for a given parameter budget. Deeper networks learn more abstract hierarchies. However, very deep networks suffer from vanishing gradients, which is exactly why skip connections were invented and why you should use them in any network with more than ~10 layers.
The fourth principle is normalization placement. BatchNorm stabilizes training and should appear after convolutions and before activations. Without it, deep networks are notoriously finicky to train. In very deep modern architectures, you'll sometimes see LayerNorm or GroupNorm instead of BatchNorm, but the idea is the same: normalize activations to keep them in a well-behaved range.
Why Batch Normalization Matters
Batch normalization (BatchNorm2d for CNNs) normalizes activations within each batch, stabilizing training and allowing higher learning rates. Without it, early layers can cause activations to blow up or vanish, especially in deep networks.
```python
import torch
import torch.nn as nn

# BatchNorm2d normalizes across the batch dimension
# Input: (batch_size, channels, height, width)
# It normalizes each channel independently
bn = nn.BatchNorm2d(64)
x = torch.randn(32, 64, 8, 8)  # 32 samples, 64 channels, 8x8 spatial
y = bn(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")  # Same shape, but normalized
print(f"Output mean (should be ~0): {y.mean().item():.6f}")
print(f"Output std (should be ~1): {y.std().item():.6f}")
```

During training, BatchNorm estimates mean and variance from the batch. During inference, it uses running statistics accumulated during training. Always call model.eval() at test time; that switches BatchNorm to use those running stats. Forgetting this is a common source of mysterious performance degradation at inference time: training accuracy looks great but test accuracy is inexplicably lower.
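Here's a minimal sketch of that train/eval difference; the synthetic data (mean 5, std 3) is just an assumption chosen so the running statistics visibly diverge from a single batch's statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(4)  # modules start in train mode

# Simulate a few training batches; each forward pass updates the
# running statistics. Data with mean 5 and std 3 is synthetic.
for _ in range(10):
    bn(torch.randn(16, 4, 8, 8) * 3 + 5)

x = torch.randn(16, 4, 8, 8) * 3 + 5

bn.train()
y_train = bn(x)  # normalized with THIS batch's mean/var

bn.eval()
y_eval = bn(x)   # normalized with the accumulated running stats

# Same input, different outputs: that's the train/eval difference
print(torch.allclose(y_train, y_eval))  # False
```

The outputs differ because the running statistics haven't fully converged to the data's true mean and variance, which is exactly the mismatch that silently hurts accuracy when you forget model.eval().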
Classic Architectures: LeNet, AlexNet, VGG
You don't have to design your own CNN from scratch. Learning from proven architectures is smart, each one introduced innovations that moved the field forward, and understanding why each design choice was made gives you tools you can apply in your own work.
LeNet-5 (1998): The grandfather of deep learning. 5 layers, ~60K parameters. Designed for digit recognition. Simple, but the core ideas are still sound.
AlexNet (2012): Won ImageNet in 2012, shocking everyone. 8 layers, 60M parameters. Key innovations: ReLU activation (way faster than tanh), dropout (combat overfitting), and GPU training (made deep learning practical).
VGG (2014): Obsessively simple architecture. Repeated blocks of 3×3 convolutions, 2×2 pooling. Deeper than AlexNet (16-19 layers), and somehow it worked better. Proof that depth matters.
PyTorch's torchvision library gives you pretrained versions of all these architectures with a single line of code. The weights parameter loads weights already trained on ImageNet, which you can then fine-tune on your own dataset, a strategy called transfer learning that dramatically reduces the training time and data you need:
```python
# You don't need to write these yourself, PyTorch has them
from torchvision import models

# Pretrained on ImageNet
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
print(vgg16)  # Shows the full architecture

# VGG16 has a fixed structure:
# - 5 convolutional blocks (doubling channels: 64 -> 512)
# - Each block has 2-3 conv layers + maxpool
# - Followed by 3 dense layers (4096 -> 4096 -> 1000)
```

All three use the same recipe: stack conv+activation, periodically pool, eventually flatten and classify with dense layers. The differences are in depth, filter sizes, and regularization. What AlexNet and VGG demonstrated is that you can keep stacking these simple building blocks and the network just keeps getting better, up to a point. That point is where vanishing gradients bite, and it's why the next architectural leap mattered so much.
ResNet and Skip Connections: Solving Vanishing Gradients
Here's a problem: very deep networks are hard to train. Gradients vanish as they backpropagate through dozens of layers. ResNet (2015) solved this with skip connections.
The idea is stupidly elegant: instead of learning y = F(x), learn y = F(x) + x. The + x part is the skip connection. It lets gradients flow directly to earlier layers, bypassing the learned layers. During backpropagation, the gradient doesn't have to pass through every transformation; it can travel directly along the skip path. This keeps gradients healthy even in networks with hundreds of layers.
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # If dimensions change, project the skip connection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        # Main path
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection
        identity = self.shortcut(x)
        # Add them
        out = out + identity
        out = torch.relu(out)
        return out

# Usage
block = ResidualBlock(64, 64)
x = torch.randn(1, 64, 32, 32)
y = block(x)
print(f"Input and output shapes match: {x.shape} == {y.shape}")
```

Notice the shortcut layer: when channels or spatial dimensions change, you can't just add F(x) + x. You need to project x to match. That's what the 1×1 convolution does: it changes channels without affecting spatial dimensions. This projection is cheap: a 1×1 conv is just a weighted sum across channels at each spatial location, with no local neighborhood involved.
Skip connections enabled networks like ResNet-50 and ResNet-152, networks deep enough to capture incredibly nuanced patterns. Without them, 152 layers would be untrainable. They also introduced an interesting interpretation: the network can learn to be shallower than its nominal depth by setting the conv layers to near-zero, effectively using only the skip connections when the extra transformation isn't helpful.
Data Preparation: Transforms and Augmentation
Your CNN is only as good as your data. Let's talk preprocessing. Raw pixel values vary wildly in scale across images and datasets. Before your network can learn efficiently, you need to normalize, and that requires knowing the mean and standard deviation of your dataset's pixels.
```python
import torch.utils.data as data
from torchvision import transforms, datasets

# Standard preprocessing pipeline
transform = transforms.Compose([
    transforms.ToTensor(),
    # Normalize CIFAR-10 (using known mean/std)
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

# Load CIFAR-10
train_dataset = datasets.CIFAR10(root='./data', train=True,
                                 download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False,
                                download=True, transform=transform)

train_loader = data.DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = data.DataLoader(test_dataset, batch_size=128, shuffle=False)

print(f"Training batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")
```

Normalization is critical: subtract the mean and divide by the std. This centers your data and scales it to a consistent range. Without it, training is slower and less stable because the optimizer has to deal with wildly different gradient magnitudes across the input dimensions. The CIFAR-10 mean and std values above were precomputed across the entire training set; use dataset-specific statistics whenever possible.
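As a sketch, here's how you might compute those per-channel statistics yourself; train_images is a stand-in tensor, and for a real dataset you'd accumulate over a DataLoader instead of materializing everything in one tensor:

```python
import torch

# Stand-in training data: shape (N, 3, H, W), values in [0, 1].
torch.manual_seed(0)
train_images = torch.rand(1000, 3, 32, 32)

mean = train_images.mean(dim=(0, 2, 3))  # one value per channel
std = train_images.std(dim=(0, 2, 3))

print(f"Per-channel mean: {mean}")
print(f"Per-channel std:  {std}")
# Apply these SAME values to normalize both the train and test splits.
```

The crucial rule: compute the statistics from training data only, then reuse them unchanged for the test split.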
Now, augmentation. Your training set is finite. Augmentation artificially expands it by applying realistic transformations that don't change the class label (a horizontally flipped cat is still a cat; a slightly rotated car is still a car):
```python
# Training transform with augmentation
train_transform = transforms.Compose([
    # Geometric augmentation
    transforms.RandomHorizontalFlip(p=0.5),  # Random flip left-right
    transforms.RandomCrop(32, padding=4),    # Random crop with padding
    transforms.RandomRotation(degrees=15),   # Random rotation
    # Color augmentation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),
    # Normalization
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

# Test transform: only normalization (no augmentation)
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

train_dataset_aug = datasets.CIFAR10(root='./data', train=True,
                                     download=True, transform=train_transform)
test_dataset_clean = datasets.CIFAR10(root='./data', train=False,
                                      download=True, transform=test_transform)

train_loader = data.DataLoader(train_dataset_aug, batch_size=128, shuffle=True)
test_loader = data.DataLoader(test_dataset_clean, batch_size=128, shuffle=False)
```

Why separate augmentation for train and test? At test time, you want the true image, not a flipped or rotated version. You're evaluating real performance, not the model's ability to recognize augmented data. Applying augmentation to the test set would make your accuracy numbers meaningless: you'd be measuring performance on a different distribution than real users would see.
Training and Achieving >90% on CIFAR-10
Now the payoff. Let's train a CNN to beat 90% accuracy on CIFAR-10. The architecture below uses double conv blocks (two conv layers before each pool), which give better feature extraction than single conv blocks without becoming hard to train. Combined with proper augmentation and a decaying learning rate, this gets you to the 90% threshold reliably.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Model (VGG-style double-conv blocks, kept simple for clarity)
class CIFAR10CNN(nn.Module):
    def __init__(self):
        super(CIFAR10CNN, self).__init__()
        self.features = nn.Sequential(
            # Block 1: 3 -> 64
            nn.Conv2d(3, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 2: 64 -> 128
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 3: 128 -> 256
            nn.Conv2d(128, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

model = CIFAR10CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

# Training
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for images, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    scheduler.step()

    # Validation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    avg_loss = train_loss / len(train_loader)
    print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Accuracy={accuracy:.2f}%")

    if accuracy > 90:
        print("Reached >90% accuracy! Stopping early.")
        break

print(f"Final test accuracy: {accuracy:.2f}%")
```

With proper augmentation, batch normalization, and learning rate scheduling, you'll hit 90%+ in 20-30 epochs. The key insights:
- Momentum: Helps escape local minima. SGD with momentum=0.9 is standard.
- Weight decay: L2 regularization. Penalizes large weights, reduces overfitting.
- Learning rate scheduling: Drop the LR periodically. Start at 0.01, drop to 0.001 after 20 epochs, then 0.0001.
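You can verify the schedule by stepping the scheduler against a dummy parameter; this small sketch uses the same StepLR settings as the training loop above:

```python
import torch
import torch.optim as optim

# A dummy parameter stands in for the model; only the schedule matters here
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.SGD([param], lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

lrs = []
for epoch in range(50):
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()  # in real training this follows the epoch's optimizer steps

print(lrs[0], lrs[20], lrs[40])  # 0.01 -> 0.001 -> 0.0001
```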
Common CNN Mistakes
Every practitioner makes these mistakes at least once. Knowing them in advance saves you days of debugging.
The most common mistake is forgetting model.eval() during inference. When you leave your model in training mode, BatchNorm uses batch statistics instead of the accumulated running statistics, and Dropout randomly zeroes out activations. Your test accuracy will be mysteriously lower than expected, and it'll vary run to run because of the randomness in Dropout. Always call model.eval() before evaluating, and model.train() before your training loop restarts.
The second mistake is using the wrong normalization statistics. If you compute normalization parameters on the test set (or both sets combined) instead of only the training set, you've contaminated your evaluation. The test set must remain completely unseen. Always compute mean and std from training data only, then apply those same values to the test set.
The third mistake is mismatching spatial dimensions in the fully connected layer. If you change image input size, or add or remove pooling layers, the spatial dimensions going into nn.Linear change. Getting this wrong causes a runtime error with a cryptic dimension mismatch message. The safest fix is to use nn.AdaptiveAvgPool2d((1,1)) before your classifier, which always produces a 1×1 spatial output regardless of input size; then your linear layer size depends only on the number of channels, not the spatial dimensions.
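A minimal sketch of that fix (the tiny network here is hypothetical, just enough to show the shape behavior):

```python
import torch
import torch.nn as nn

# Adaptive pooling makes the classifier input size depend only on
# channel count, never on the image resolution.
net = nn.Sequential(
    nn.Conv2d(3, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d((1, 1)),  # (N, 128, H, W) -> (N, 128, 1, 1)
    nn.Flatten(),
    nn.Linear(128, 10),            # always 128 inputs
)

for size in (32, 64, 224):
    out = net(torch.randn(2, 3, size, size))
    print(size, tuple(out.shape))  # (2, 10) for every input size
```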
The fourth mistake is applying augmentation to the test set. It seems obvious stated directly, but in the rush of debugging it's easy to accidentally use the training transform pipeline for both datasets. Your test accuracy will be artificially worse than your true performance (or just noisier), because you're evaluating on randomly modified versions of the test images instead of the originals.
The fifth mistake is ignoring class imbalance. If your dataset has ten times more "dog" images than "cat" images, your network will learn to predict "dog" very often because that's what minimizes cross-entropy on a majority-class dataset. Check your class distribution before training, and use weighted sampling or a weighted loss function if the classes are skewed. The standard nn.CrossEntropyLoss accepts a weight parameter that lets you penalize errors on minority classes more heavily.
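Here's a sketch of the weighted-loss approach; the class counts are made up, and the inverse-frequency weighting shown is one common heuristic, not the only option:

```python
import torch
import torch.nn as nn

# Hypothetical 3-class problem where class 0 dominates the training set
class_counts = torch.tensor([9000., 500., 500.])

# Inverse-frequency weighting: rarer classes get larger weights
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)          # stand-in model outputs
labels = torch.randint(0, 3, (8,))  # stand-in labels
loss = criterion(logits, labels)

print(f"Class weights: {weights}")  # minority classes weighted ~18x more
print(f"Weighted loss: {loss.item():.4f}")
```

Errors on the minority classes now cost the optimizer far more, which pushes the network away from the "always predict the majority class" shortcut.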
Feature Map Visualization: See What Your Network Learns
This is the magic moment. Let's visualize what each convolutional layer actually learns. PyTorch's hook system lets you intercept the output of any layer during a forward pass without modifying the model itself: a clean way to inspect internals.
```python
import torch
import matplotlib.pyplot as plt

# Hook into a conv layer to capture its output
activations = {}

def get_activation(name):
    def hook(model, input, output):
        activations[name] = output.detach()
    return hook

# Register hooks on the first conv layer of each block
model.features[0].register_forward_hook(get_activation('conv1'))
model.features[7].register_forward_hook(get_activation('conv2'))
model.features[14].register_forward_hook(get_activation('conv3'))

# Forward pass on a test image
test_image, _ = test_dataset_clean[0]
test_image = test_image.unsqueeze(0).to(device)
model.eval()
with torch.no_grad():
    model(test_image)

# Visualize feature maps from each layer
fig, axes = plt.subplots(3, 8, figsize=(15, 6))

# Layer 1: 64 channels (show first 8)
for i in range(8):
    ax = axes[0, i]
    ax.imshow(activations['conv1'][0, i, :, :].cpu().numpy(), cmap='gray')
    ax.set_title(f'Filter {i}')
    ax.axis('off')

# Layer 2: 128 channels (show first 8)
for i in range(8):
    ax = axes[1, i]
    ax.imshow(activations['conv2'][0, i, :, :].cpu().numpy(), cmap='gray')
    ax.set_title(f'Filter {i}')
    ax.axis('off')

# Layer 3: 256 channels (show first 8)
for i in range(8):
    ax = axes[2, i]
    ax.imshow(activations['conv3'][0, i, :, :].cpu().numpy(), cmap='gray')
    ax.set_title(f'Filter {i}')
    ax.axis('off')

plt.tight_layout()
plt.savefig('feature_maps.png', dpi=150)
plt.show()
```

Early layers learn simple features: edges, corners, color blobs. Middle layers combine those into textures and shapes. Deep layers recognize objects and their parts. This hierarchical feature learning is why CNNs work so well for images. When you run this on a trained network and look at the resulting plots, the progression from crisp edge detection in layer 1 to blurry, high-level blobs in layer 3 is immediately visible. The network has genuinely learned to see.
Summary
Convolutional neural networks are elegant because they respect the structure of images. Convolution finds features, pooling reduces noise, and stacking builds abstraction. Everything else (BatchNorm, skip connections, augmentation, learning rate scheduling) exists to make that fundamental process work reliably at scale.
You now understand:
- Convolution mechanics: kernels, strides, padding, output size.
- How convolutions see: weight sharing, local receptive fields, filter depth.
- Pooling: max and average pooling, adaptive pooling, and how pooling builds feature hierarchies.
- CNN building blocks: Conv→BatchNorm→ReLU→Pool.
- Architecture design principles: channel-spatial tradeoff, receptive field growth, depth before width.
- Classic architectures: LeNet, AlexNet, VGG, ResNet, and why skip connections changed everything.
- Data augmentation: Why it matters, how to apply it correctly, and why train and test transforms must differ.
- Training: Momentum, weight decay, learning rate scheduling, hitting 90%+.
- Common mistakes: eval mode, normalization statistics, dimension mismatches, class imbalance.
- Feature visualization: Seeing what layers actually learn.
The real power of CNNs isn't any single component; it's how they compose. Each piece reinforces the others. BatchNorm makes training stable enough to go deep. Skip connections let you go deeper without vanishing gradients. Augmentation makes the learned features robust to real-world variation. Weight sharing makes the whole thing computationally feasible. Together they produce systems that genuinely understand images.
From here, you're ready for transfer learning (using pretrained models), custom architectures for specific problems, and diving into even deeper networks. The foundations you've built in this article are the foundations every modern vision system is built on, and they'll carry you as far as you want to go.
The next article: Recurrent Neural Networks and LSTMs for Sequence Data. Images have spatial structure; sequences have temporal structure. RNNs are built for that.