Convolutional Neural Networks: From LeNet to Modern Architectures
Convolutional Neural Networks (CNNs) remain the backbone of computer vision. Understanding their evolution from LeNet-5 (1998) to modern architectures like ConvNeXt reveals fundamental principles of deep learning design.
Core Building Blocks
A CNN stacks three types of layers:
Convolutional layers apply learnable filters (kernels) that slide across the input, detecting local patterns. A k×k kernel with stride 1 on an n×n input produces an (n−k+1)×(n−k+1) feature map; for example, a 3×3 kernel on a 224×224 image yields a 222×222 map when no padding is used. Each filter learns to detect a specific pattern — edges, textures, shapes — with increasing abstraction at deeper layers.
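The sliding-window operation can be sketched in a few lines of NumPy. This is a minimal single-channel "valid" cross-correlation (which is what deep learning frameworks actually compute under the name "convolution"); the helper name `conv2d` is illustrative, not from any library.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid cross-correlation of a square image with a square kernel."""
    n, k = image.shape[0], kernel.shape[0]
    out = (n - k) // stride + 1          # output size: (n - k) / stride + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            fmap[i, j] = np.sum(patch * kernel)  # dot product with the filter
    return fmap

# A 3x3 horizontal-edge kernel (Sobel) on a 32x32 image -> 30x30 feature map
image = np.random.rand(32, 32)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
fmap = conv2d(image, sobel_x)
print(fmap.shape)  # (30, 30)
```

Real layers add padding, multiple input/output channels, and a bias term, but the output-size arithmetic is the same.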
Pooling layers reduce spatial dimensions. Max pooling takes the maximum value in each window, halving width and height. This provides translation invariance and reduces computation. Average pooling and global average pooling (GAP) are common alternatives.
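Max pooling with a 2×2 window and stride 2 can be written as a single reshape in NumPy; a small sketch (the helper name `max_pool2x2` is illustrative):

```python
import numpy as np

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2: halves height and width."""
    h, w = fmap.shape
    # Group into 2x2 blocks, take the max of each block
    return fmap[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2x2(fmap)
print(pooled)  # [[ 5.  7.]
               #  [13. 15.]]
```

Each output value survives even if the activating pattern shifts by a pixel inside its window, which is where the (local) translation invariance comes from.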
Fully connected layers at the end flatten the feature maps and map them to output classes. Modern architectures often replace FC layers with GAP to reduce parameters.
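The parameter savings from replacing a flatten-then-FC head with GAP are easy to see with back-of-the-envelope arithmetic (weights only, biases ignored; the [512 × 7 × 7] feature volume is a typical ResNet/VGG final stage):

```python
# Cost of classifying a [512 x 7 x 7] feature volume into 1000 classes
channels, h, w, classes = 512, 7, 7, 1000

flatten_fc = channels * h * w * classes   # FC on the flattened maps
gap_fc = channels * classes               # FC after global average pooling

print(flatten_fc)            # 25088000
print(gap_fc)                # 512000
print(flatten_fc // gap_fc)  # 49x fewer weights
```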
CNN Architecture (simplified)
==============================
Input Image [3 x 224 x 224]
│
▼
┌─────────────────┐
│ Conv 3x3, 64 │──► ReLU ──► BatchNorm
│ Conv 3x3, 64 │──► ReLU ──► BatchNorm
└────────┬────────┘
▼
MaxPool 2x2 [64 x 112 x 112]
│
▼
┌─────────────────┐
│ Conv 3x3, 128 │──► ReLU ──► BatchNorm
│ Conv 3x3, 128 │──► ReLU ──► BatchNorm
└────────┬────────┘
▼
MaxPool 2x2 [128 x 56 x 56]
│
▼
... (deeper blocks)
│
▼
Global Avg Pool [512 x 1 x 1]
│
▼
FC ──► Softmax [num_classes]
Architecture Evolution
| Architecture | Year | Key Innovation | Top-5 Error (ImageNet) | Params |
|---|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN | — | 60K |
| AlexNet | 2012 | GPU training, ReLU, dropout | 15.3% | 61M |
| VGGNet | 2014 | Small 3x3 filters, depth | 7.3% | 138M |
| GoogLeNet | 2014 | Inception modules, 1x1 convs | 6.7% | 6.8M |
| ResNet | 2015 | Skip connections, 152 layers | 3.6% | 60M |
| DenseNet | 2017 | Dense connections between layers | 3.5% | 20M |
| EfficientNet | 2019 | Compound scaling (depth/width/resolution) | 2.9% | 66M |
| ConvNeXt | 2022 | Modernized ResNet with Transformer tricks | 2.7% | 89M |
The ResNet Revolution
ResNet's skip connections solved the degradation problem — deeper plain networks trained worse, not primarily because of vanishing gradients, but because learning identity mappings is hard for stacked nonlinear layers. The residual formulation y = F(x) + x makes identity the default: if the optimal mapping is close to identity, the stacked layers need only drive the residual F(x) toward zero.
Residual Block
===============
Input x ──────────────────────────┐
│ │
▼ │
Conv 3x3 ──► BN ──► ReLU │ (skip / shortcut)
│ │
▼ │
Conv 3x3 ──► BN │
│ │
▼ │
(+) ◄───────────────────────────┘
│
▼
ReLU
│
▼
Output: F(x) + x
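The block above can be sketched in NumPy. This is a deliberately simplified version: the convolutions are replaced by dense matrix multiplies on a flat vector and BatchNorm is omitted, so it shows only the residual wiring, not a real ResNet layer.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Residual wiring: two transforms, then add the input back."""
    out = relu(W1 @ x)     # stands in for Conv 3x3 -> BN -> ReLU
    out = W2 @ out         # stands in for Conv 3x3 -> BN
    return relu(out + x)   # add the skip connection, then ReLU

# With zero-initialized weights the block is exactly the identity:
# F(x) = 0, so the output is relu(0 + x) = x for non-negative activations.
x = np.array([1.0, 2.0, 3.0])
W1 = W2 = np.zeros((3, 3))
print(residual_block(x, W1, W2))  # [1. 2. 3.]
```

This is the "identity is the default" property in action: an untrained (or zeroed-out) residual block passes its input through unchanged, so adding more blocks can never make the representable function worse.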
Modern Techniques
Depthwise separable convolutions (MobileNet): factor a standard convolution into a depthwise (per-channel) and pointwise (1x1) convolution. This cuts the multiply count by roughly a factor of k² for k×k kernels (about 8-9x for 3x3) with minimal accuracy loss. Essential for mobile and edge deployment.
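The savings follow directly from counting multiplies. For a k×k convolution mapping C_in channels to C_out over an H×W map (the layer sizes below are illustrative):

```python
# Multiply counts: standard conv vs depthwise separable conv
k, c_in, c_out, h, w = 3, 128, 128, 56, 56

standard = k * k * c_in * c_out * h * w
depthwise = k * k * c_in * h * w     # one k x k filter per input channel
pointwise = c_in * c_out * h * w     # 1x1 conv mixes the channels
separable = depthwise + pointwise

print(standard / separable)  # ≈ 8.41
# The ratio equals 1 / (1/c_out + 1/k^2): dominated by k^2 when c_out is large
```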
Attention in CNNs: Squeeze-and-Excitation (SE) blocks recalibrate channel responses. CBAM adds spatial attention. These mechanisms help the network focus on relevant features.
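A minimal NumPy sketch of the SE mechanism — squeeze via global average pooling, excitation via a two-layer bottleneck, then per-channel rescaling. The weight shapes and the reduction ratio `r` follow the SE paper's structure, but the tiny sizes here are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(features, W1, W2):
    """Squeeze-and-Excitation over a [C, H, W] feature tensor."""
    z = features.mean(axis=(1, 2))            # squeeze: GAP -> [C]
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0))   # excitation: FC -> ReLU -> FC -> sigmoid
    return features * s[:, None, None]        # rescale each channel map

# Toy example: 4 channels, reduction ratio r=2 (the paper typically uses r=16)
C, r = 4, 2
rng = np.random.default_rng(0)
features = rng.random((C, 8, 8))
W1, W2 = rng.random((C // r, C)), rng.random((C, C // r))
out = se_block(features, W1, W2)
print(out.shape)  # (4, 8, 8)
```

Because the gates come from a sigmoid, each channel is scaled by a factor in (0, 1): informative channels are preserved, uninformative ones are suppressed.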
Data augmentation at scale: RandAugment, CutMix, MixUp, and AugMax have become standard. They act as regularizers and can improve top-1 accuracy by 1-3% without architectural changes.
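MixUp is the simplest of these to write down: draw a mixing coefficient from a Beta distribution and take the same convex combination of two inputs and their one-hot labels. A small sketch (the `alpha=0.2` default is a commonly used value, not prescribed by the text above):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: convex combination of two examples and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two fake [3 x 32 x 32] images with one-hot labels for a 3-class problem
rng = np.random.default_rng(0)
x1, x2 = rng.random((3, 32, 32)), rng.random((3, 32, 32))
y1, y2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
x_mix, y_mix = mixup(x1, y1, x2, y2, rng=rng)
print(y_mix.sum())  # ≈ 1.0: the mixed label is still a valid distribution
```

Training on these soft, interpolated targets discourages the network from being overconfident between classes, which is why it behaves like a regularizer.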
Knowledge distillation: train a small "student" CNN to mimic a large "teacher" model's output distribution. The soft targets carry more information than hard labels, enabling compact models that retain most of the teacher's performance.
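The core of the distillation objective is a KL divergence between temperature-softened output distributions; a minimal NumPy sketch (logits and the temperature `T=4.0` are illustrative values):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T                 # temperature T > 1 softens the distribution
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([8.0, 2.0, 1.0])
good_student = np.array([7.5, 2.2, 0.9])   # mimics the teacher's ranking
bad_student = np.array([1.0, 8.0, 2.0])    # confidently wrong
print(distill_loss(good_student, teacher) < distill_loss(bad_student, teacher))  # True
```

In practice this term (scaled by T² so gradients stay comparable across temperatures) is combined with an ordinary cross-entropy loss on the hard labels.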
CNNs vs Vision Transformers
Vision Transformers (ViT) process images as sequences of patches using self-attention. They outperform CNNs on large datasets but require more data and compute. The landscape in 2026:
| Aspect | CNNs | Vision Transformers |
|---|---|---|
| Inductive bias | Translation equivariance built-in | Must learn spatial structure |
| Data efficiency | Strong with limited data | Needs large datasets or pretraining |
| Scalability | Plateaus at extreme depth | Scales well with data and compute |
| Inference speed | Fast, hardware-optimized | Slower (quadratic attention) |
| Edge deployment | Mature tooling (TFLite, CoreML) | Improving but heavier |
| Best use | Mobile, real-time, limited data | Large-scale, high accuracy |
The trend is convergence: ConvNeXt brings Transformer design principles to CNNs, while models like CoAtNet and EfficientFormer combine both.