Recurrent Neural Networks: Sequence Modeling Before Transformers
Recurrent Neural Networks (RNNs) were the dominant architecture for sequence modeling before Transformers took over. Understanding RNNs, LSTMs, and GRUs remains essential — they are still used in edge devices, real-time systems, and time-series forecasting where Transformer overhead is too high.
The Recurrence Principle
A feedforward network processes each input independently. An RNN maintains a hidden state that carries information from previous timesteps:
RNN Unrolled Through Time
===========================

 x_1       x_2       x_3       x_4
  │         │         │         │
  ▼         ▼         ▼         ▼
┌───┐     ┌───┐     ┌───┐     ┌───┐
│ h │────►│ h │────►│ h │────►│ h │
└─┬─┘     └─┬─┘     └─┬─┘     └─┬─┘
  │         │         │         │
  ▼         ▼         ▼         ▼
 y_1       y_2       y_3       y_4
At each timestep the hidden state is updated as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). The same weights W_xh and W_hh are shared across all timesteps — this is what makes RNNs parameter-efficient for sequences of any length.
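A minimal NumPy sketch of this recurrence, unrolled over four inputs as in the diagram. The dimensions and random initialization are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, chosen for the sketch only)
input_dim, hidden_dim = 3, 4

# Shared weights, reused at every timestep
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One timestep: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll over a sequence of 4 inputs, as in the diagram above
xs = rng.normal(size=(4, input_dim))
h = np.zeros(hidden_dim)
states = []
for x_t in xs:
    h = rnn_step(x_t, h)
    states.append(h)
```

Note that the loop carries only `h` forward — the same two weight matrices process every position.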
The Vanishing Gradient Problem
Training RNNs via backpropagation through time (BPTT) requires computing gradients through many timesteps. The gradient of the loss w.r.t. an early hidden state involves repeated multiplication by the recurrent weight matrix: each step back in time contributes a factor of W_hh^T, scaled by the tanh derivative (which is at most 1).
If the eigenvalues of W_hh have magnitude < 1, gradients shrink exponentially with distance (vanishing). If > 1, they explode. In practice, vanilla RNNs struggle to learn dependencies beyond ~10-20 timesteps.
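The shrinkage is easy to demonstrate numerically. The sketch below builds a hypothetical recurrent matrix rescaled so its largest eigenvalue magnitude is 0.9, then applies the repeated W^T multiplication that BPTT performs (ignoring the tanh-derivative factor, which would only shrink the gradient further):

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical recurrent weight matrix, rescaled so its
# spectral radius (largest |eigenvalue|) is 0.9
W = rng.normal(size=(8, 8))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# BPTT multiplies the gradient by W^T once per timestep
g = np.ones(8)
norms = []
for _ in range(50):
    g = W.T @ g
    norms.append(np.linalg.norm(g))
```

After 50 steps the gradient norm has collapsed toward zero — the signal from early timesteps is effectively lost. With a spectral radius above 1, the same loop diverges instead.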
LSTM: Long Short-Term Memory
LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem with a cell state that flows through time with minimal transformation, plus three gates that control information flow:
| Gate | Formula | Role |
|---|---|---|
| Forget | f_t = σ(W_f · [h_{t-1}, x_t] + b_f) | What to erase from cell state |
| Input | i_t = σ(W_i · [h_{t-1}, x_t] + b_i) | What new info to store |
| Output | o_t = σ(W_o · [h_{t-1}, x_t] + b_o) | What to output from cell state |
The cell state update:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,   h_t = o_t ⊙ tanh(c_t)

where c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c) is the candidate state.
The forget gate is the key innovation: when f_t ≈ 1 and i_t ≈ 0, c_t ≈ c_{t-1} and gradients flow through the cell state unchanged, enabling learning over hundreds of timesteps.
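The gate equations and cell update can be sketched as a single NumPy step function. Dimensions and initialization are illustrative assumptions; the gate math follows the formulas above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
d = hidden_dim + input_dim   # gates read the concatenation [h_{t-1}, x_t]

# One weight matrix and bias per gate, plus the candidate (hypothetical init)
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(hidden_dim, d)) for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(hidden_dim) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to erase
    i_t = sigmoid(W_i @ z + b_i)        # input gate: what to store
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to expose
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # additive cell-state update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c)
```

The additive form of `c_t` is the point: when `f_t` is near 1, `c_prev` passes through with no squashing nonlinearity in the way.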
GRU: Gated Recurrent Unit
GRUs (Cho et al., 2014) simplify LSTMs by merging the cell and hidden state into one, using only two gates:
| Gate | Role |
|---|---|
| Reset (r_t) | Controls how much past state enters the candidate h̃_t |
| Update (z_t) | Interpolates between old and new state: h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t (replaces forget + input gates) |
GRUs have fewer parameters than LSTMs and train faster, with comparable performance on most tasks. They are preferred when compute is constrained.
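The two-gate structure can be sketched in the same style as the LSTM step. Again the dimensions and initialization are illustrative assumptions (biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
d = hidden_dim + input_dim   # gates read the concatenation [h_{t-1}, x_t]

# One weight matrix per gate plus the candidate (hypothetical init)
W_r, W_z, W_h = (rng.normal(scale=0.1, size=(hidden_dim, d)) for _ in range(3))

def gru_step(x_t, h_prev):
    zx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ zx)   # reset gate: gates the past inside the candidate
    z_t = sigmoid(W_z @ zx)   # update gate: interpolation weight
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde  # blend old and new state

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h)
```

With three weight matrices instead of the LSTM's four, and no separate cell state to carry, the parameter and memory savings are direct.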
Architecture Variants
| Architecture | Use Case | Key Feature |
|---|---|---|
| Vanilla RNN | Short sequences, simple patterns | Minimal, fast |
| LSTM | Long-range dependencies | Cell state + 3 gates |
| GRU | Resource-constrained long sequences | 2 gates, fewer params |
| Bidirectional RNN | Full context needed (NER, translation) | Forward + backward pass |
| Encoder-Decoder | Seq-to-seq (translation, summarization) | Compress → generate |
| Attention + RNN | Long sequences with selective focus | Bahdanau/Luong attention |
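Of the variants above, the bidirectional RNN is simple enough to sketch directly: run one cell left-to-right, an independent cell right-to-left, and concatenate the two states at each position. All names and dimensions here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4

def make_cell():
    """A vanilla-RNN cell with its own (hypothetical) weights."""
    W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    def step(x_t, h_prev):
        return np.tanh(W_xh @ x_t + W_hh @ h_prev)
    return step

def run(step, xs):
    h = np.zeros(hidden_dim)
    out = []
    for x_t in xs:
        h = step(x_t, h)
        out.append(h)
    return out

fwd, bwd = make_cell(), make_cell()
xs = rng.normal(size=(5, input_dim))
h_fwd = run(fwd, xs)                 # left-to-right pass
h_bwd = run(bwd, xs[::-1])[::-1]     # right-to-left pass, realigned
# Each position now sees both left and right context
h_bi = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```

This is why bidirectional models suit tasks like NER, where the label for a token depends on words on both sides — and why they cannot be used for streaming, since the backward pass needs the full sequence first.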
RNNs vs Transformers
| Aspect | RNNs/LSTMs | Transformers |
|---|---|---|
| Sequence processing | Sequential (O(n) steps) | Parallel (O(1) depth) |
| Long-range dependencies | Degrades with distance | Self-attention captures any distance |
| Training speed | Slow (sequential) | Fast (parallelizable) |
| Memory footprint | O(1) per step | O(n²) attention matrix |
| Inference (streaming) | Natural — process token by token | Needs full sequence or KV cache |
| Best for today | Edge, real-time, time-series | NLP, vision, general sequence |
Where RNNs Still Win
Real-time streaming: RNNs process one token at a time with constant memory. For live audio, sensor data, or financial ticks, this is a natural fit.
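The constant-memory property looks like this in practice: between samples, the only state retained is the hidden vector itself. The readout and the synthetic sine-wave input are hypothetical stand-ins for a trained model and a live sensor feed:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 4
W_xh = rng.normal(scale=0.1, size=(hidden_dim, 1))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
w_out = rng.normal(scale=0.1, size=hidden_dim)

h = np.zeros(hidden_dim)   # the ONLY state kept between samples

def consume(sample):
    """Process one incoming sample; memory use stays constant."""
    global h
    h = np.tanh(W_xh @ np.array([sample]) + W_hh @ h)
    return float(w_out @ h)   # e.g. a one-step-ahead readout

# Simulated stream: one prediction per arriving sample, no buffering
preds = [consume(s) for s in np.sin(np.linspace(0.0, 3.0, 100))]
```

A Transformer serving the same stream must either re-attend over a growing window or maintain a KV cache that grows with sequence length; the RNN's footprint never changes.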
Edge deployment: an LSTM with 500K parameters can run on a microcontroller. A Transformer of similar capability needs 10-100x more parameters.
Time-series forecasting: for univariate or low-dimensional time series, LSTMs remain competitive with Transformers and are simpler to train and deploy.
State Space Models: the recent S4/Mamba family combines RNN-like sequential processing with Transformer-like performance, suggesting recurrence is not dead — just evolving.