
Convolutional Neural Networks: From LeNet to Modern Architectures

#deep-learning #computer-vision #neural-networks #cnn

Convolutional Neural Networks (CNNs) remain the backbone of computer vision. Understanding their evolution from LeNet-5 (1998) to modern architectures like ConvNeXt reveals fundamental principles of deep learning design.

Core Building Blocks

A CNN stacks three types of layers:

Convolutional layers apply learnable filters (kernels) that slide across the input, detecting local patterns. A 3×3 kernel with stride 1 on a 28×28 image produces a 26×26 feature map. Each filter learns to detect a specific pattern — edges, textures, shapes — with increasing abstraction at deeper layers.
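
The output-size arithmetic can be checked with a small helper (a sketch; `conv_out` is an illustrative name, not a library function):

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A 3x3 kernel, stride 1, no padding, on a 28x28 input:
print(conv_out(28, 3))             # 26 -> the 26x26 feature map above
# "Same" padding of 1 preserves the input size:
print(conv_out(28, 3, padding=1))  # 28
```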

Pooling layers reduce spatial dimensions. Max pooling takes the maximum value in each 2×2 window, halving width and height. This provides translation invariance and reduces computation. Average pooling and global average pooling (GAP) are common alternatives.
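
A minimal pure-Python sketch of 2×2 max pooling on a small feature map (list-of-lists; `max_pool_2x2` is an illustrative name):

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a square feature map (list of lists)."""
    n = len(fmap)
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, n, 2)]
            for i in range(0, n, 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool_2x2(fmap))  # [[4, 2], [2, 8]] — a 4x4 map halved to 2x2
```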

Fully connected layers at the end flatten the feature maps and map them to output classes. Modern architectures often replace FC layers with GAP to reduce parameters.
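
The parameter saving from swapping flatten + FC for GAP can be estimated with back-of-the-envelope arithmetic (hypothetical 512×7×7 final feature map and 1000 classes, not from the text above):

```python
# Rough parameter count for the final classifier stage.
channels, h, w, classes = 512, 7, 7, 1000

fc_params = channels * h * w * classes  # flatten -> FC on 25,088 features
gap_params = channels * classes         # GAP -> FC on 512 features

print(fc_params)                # 25088000
print(gap_params)               # 512000
print(fc_params // gap_params)  # 49 -> ~49x fewer classifier parameters
```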

CNN Architecture (simplified)
==============================

Input Image [3 x 224 x 224]
     │
     ▼
┌─────────────────┐
│ Conv 3x3, 64    │──► ReLU ──► BatchNorm
│ Conv 3x3, 64    │──► ReLU ──► BatchNorm
└────────┬────────┘
         ▼
    MaxPool 2x2         [64 x 112 x 112]
         │
         ▼
┌─────────────────┐
│ Conv 3x3, 128   │──► ReLU ──► BatchNorm
│ Conv 3x3, 128   │──► ReLU ──► BatchNorm
└────────┬────────┘
         ▼
    MaxPool 2x2         [128 x 56 x 56]
         │
         ▼
    ... (deeper blocks)
         │
         ▼
    Global Avg Pool     [512 x 1 x 1]
         │
         ▼
    FC ──► Softmax      [num_classes]
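
The bracketed shapes in the diagram can be traced with a small sketch, assuming the 3×3 convolutions use padding 1 so only the pooling steps change spatial size (an assumption consistent with the shapes shown):

```python
def trace_shapes(in_shape, blocks):
    """Follow (C, H, W) through conv blocks; each block is (out_channels, pooled).
    Convs are assumed 3x3 with padding 1, so only 2x2 max pooling halves H and W."""
    c, h, w = in_shape
    shapes = [(c, h, w)]
    for out_c, pooled in blocks:
        c = out_c
        if pooled:
            h, w = h // 2, w // 2
        shapes.append((c, h, w))
    return shapes

# The two blocks drawn above: 64- and 128-channel stages, each followed by a pool.
print(trace_shapes((3, 224, 224), [(64, True), (128, True)]))
# [(3, 224, 224), (64, 112, 112), (128, 56, 56)]
```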

Architecture Evolution

| Architecture | Year | Key Innovation | Top-5 Error (ImageNet) | Params |
|---|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN | — | 60K |
| AlexNet | 2012 | GPU training, ReLU, dropout | 15.3% | 61M |
| VGGNet | 2014 | Small 3x3 filters, depth | 7.3% | 138M |
| GoogLeNet | 2014 | Inception modules, 1x1 convs | 6.7% | 6.8M |
| ResNet | 2015 | Skip connections, 152 layers | 3.6% | 60M |
| DenseNet | 2017 | Dense connections between layers | 3.5% | 20M |
| EfficientNet | 2019 | Compound scaling (depth/width/resolution) | 2.9% | 66M |
| ConvNeXt | 2022 | Modernized ResNet with Transformer tricks | 2.7% | 89M |

The ResNet Revolution

ResNet's skip connections solved the degradation problem — deeper networks were harder to train, not because of vanishing gradients, but because learning identity mappings is hard for stacked nonlinear layers. The residual formulation F(x) + x makes identity the default, letting additional layers learn only the residual.

Residual Block
===============

Input x ──────────────────────────┐
    │                             │
    ▼                             │
 Conv 3x3 ──► BN ──► ReLU         │ (skip / shortcut)
    │                             │
    ▼                             │
 Conv 3x3 ──► BN                  │
    │                             │
    ▼                             │
  (+) ◄───────────────────────────┘
    │
    ▼
  ReLU
    │
    ▼
 Output: F(x) + x
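
The "identity by default" property is easy to see numerically: if the residual branch F is initialized near zero, the block passes its input through unchanged. A pure-Python sketch over a toy feature vector (illustrative names and values):

```python
def residual_block(x, residual_fn):
    """y = F(x) + x, elementwise over a feature vector."""
    fx = residual_fn(x)
    return [a + b for a, b in zip(fx, x)]

x = [0.5, -1.2, 3.0]

# Residual branch at (near-)zero initialization: the block is the identity.
zero_branch = lambda v: [0.0 for _ in v]
print(residual_block(x, zero_branch))   # [0.5, -1.2, 3.0] == x

# A trained branch only needs to learn a *correction* to the identity.
small_branch = lambda v: [0.1 * a for a in v]
print(residual_block(x, small_branch))  # each feature nudged by 10%
```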

Modern Techniques

Depthwise separable convolutions (MobileNet): factor a standard convolution into a depthwise (per-channel) and pointwise (1x1) convolution. Reduces computation by roughly 8-9× with minimal accuracy loss. Essential for mobile and edge deployment.
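
The 8-9× figure falls out of a multiply count per output pixel (example channel counts are illustrative):

```python
# Per-output-pixel multiplies for a k x k convolution,
# c_in input channels -> c_out output channels.
k, c_in, c_out = 3, 256, 256

standard = k * k * c_in * c_out  # one fused convolution
depthwise = k * k * c_in         # one k x k filter per input channel
pointwise = c_in * c_out         # 1x1 conv to mix channels
separable = depthwise + pointwise

print(standard / separable)  # ~8.7x cheaper, matching the 8-9x figure
```

The ratio is approximately 1/c_out + 1/k², so with 3×3 kernels the saving approaches 9× as the channel count grows.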

Attention in CNNs: Squeeze-and-Excitation (SE) blocks recalibrate channel responses. CBAM adds spatial attention. These mechanisms help the network focus on relevant features.
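
A minimal pure-Python sketch of an SE block's squeeze → excitation → scale pipeline, with toy weights and shapes (all names and values illustrative, not from any library):

```python
import math

def se_block(fmap, w1, w2):
    """Squeeze-and-Excitation over a list of C feature maps (each H x W).
    w1: C x C/r reduction weights, w2: C/r x C expansion weights (toy values)."""
    # Squeeze: global average pool each channel to a single number.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(z[i] * w1[i][j] for i in range(len(z))))
              for j in range(len(w1[0]))]
    gates = [1 / (1 + math.exp(-sum(hidden[i] * w2[i][j] for i in range(len(hidden)))))
             for j in range(len(w2[0]))]
    # Scale: recalibrate each channel by its gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(fmap, gates)]

# Toy example: 2 channels of a 2x2 feature map, reduced to 1 hidden unit.
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0], [1.0]]  # 2 -> 1 squeeze weights
w2 = [[0.5, -0.5]]   # 1 -> 2 excitation weights
scaled = se_block(fmap, w1, w2)
# Channel 1's gate falls below 0.5, so its activations are suppressed.
```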

Data augmentation at scale: RandAugment, CutMix, MixUp, and AugMix have become standard. They act as regularizers and can improve top-1 accuracy by 1-3% without architectural changes.
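
MixUp is simple enough to sketch in a few lines: convex-combine two inputs and their one-hot labels with a Beta-sampled weight (the function name is illustrative):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """MixUp: convex-combine two examples and their one-hot labels."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
print(lam)  # Beta(0.2, 0.2): usually close to 0 or 1, keeping examples mostly intact
```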

Knowledge distillation: train a small "student" CNN to mimic a large "teacher" model's output distribution. The soft targets carry more information than hard labels, enabling compact models that retain most of the teacher's performance.
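
Why soft targets carry more information is visible in a temperature-scaled softmax: raising the temperature reveals how the teacher ranks the non-target classes (logit values here are made up for illustration):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T -> softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 2.0, 1.0]
print(softmax(teacher_logits))       # hard: nearly all mass on class 0
print(softmax(teacher_logits, 4.0))  # soft targets expose class similarities
```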

CNNs vs Vision Transformers

Vision Transformers (ViT) process images as sequences of patches using self-attention. They outperform CNNs on large datasets but require more data and compute. The landscape in 2026:

| Aspect | CNNs | Vision Transformers |
|---|---|---|
| Inductive bias | Translation equivariance built-in | Must learn spatial structure |
| Data efficiency | Strong with limited data | Needs large datasets or pretraining |
| Scalability | Plateaus at extreme depth | Scales well with data and compute |
| Inference speed | Fast, hardware-optimized | Slower (quadratic attention) |
| Edge deployment | Mature tooling (TFLite, CoreML) | Improving but heavier |
| Best use | Mobile, real-time, limited data | Large-scale, high accuracy |

The trend is convergence: ConvNeXt brings Transformer design principles to CNNs, while models like CoAtNet and EfficientFormer combine both.