
Recurrent Neural Networks: Sequence Modeling Before Transformers

#deep-learning#neural-networks#rnn#sequence-modeling#time-series

Recurrent Neural Networks (RNNs) were the dominant architecture for sequence modeling before Transformers took over. Understanding RNNs, LSTMs, and GRUs remains essential — they are still used in edge devices, real-time systems, and time-series forecasting where Transformer overhead is too high.

The Recurrence Principle

A feedforward network processes each input independently. An RNN maintains a hidden state h_t that carries information from previous timesteps:

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)

y_t = W_{hy} h_t + b_y

RNN Unrolled Through Time
===========================

x_1       x_2       x_3       x_4
 │         │         │         │
 ▼         ▼         ▼         ▼
┌───┐     ┌───┐     ┌───┐     ┌───┐
│ h │────►│ h │────►│ h │────►│ h │
└─┬─┘     └─┬─┘     └─┬─┘     └─┬─┘
  │         │         │         │
  ▼         ▼         ▼         ▼
 y_1       y_2       y_3       y_4

The same weights W_{hh}, W_{xh} are shared across all timesteps — this is what makes RNNs parameter-efficient for sequences of any length.
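The recurrence can be sketched directly in numpy. This is a minimal forward pass, not a trained model: the weight shapes and small random initialization are illustrative assumptions, but the update rule is exactly the tanh recurrence above, with the same W_{hh}, W_{xh}, W_{hy} reused at every step.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence, reusing the same weights each step.

    xs: sequence of input vectors, each of shape (input_dim,)
    Returns one output per timestep and the final hidden state.
    """
    h = np.zeros(W_hh.shape[0])  # h_0 = 0
    ys = []
    for x in xs:  # one step per timestep -- inherently sequential
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        ys.append(W_hy @ h + b_y)
    return ys, h

# Toy dimensions: 3-dim input, 4-dim hidden, 2-dim output
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4)) * 0.1
W_xh = rng.normal(size=(4, 3)) * 0.1
W_hy = rng.normal(size=(2, 4)) * 0.1
b_h, b_y = np.zeros(4), np.zeros(2)

xs = [rng.normal(size=3) for _ in range(5)]
ys, h_final = rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y)
print(len(ys), h_final.shape)  # one y per timestep; hidden state keeps shape (4,)
```

Note that the loop body never changes: a 5-step and a 5000-step sequence use the identical parameter set, which is the parameter-efficiency claim above.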

The Vanishing Gradient Problem

Training RNNs via backpropagation through time (BPTT) requires computing gradients through many timesteps. The gradient of the loss w.r.t. early hidden states involves repeated multiplication by W_{hh}:

\frac{\partial h_t}{\partial h_1} = \prod_{i=2}^{t} \frac{\partial h_i}{\partial h_{i-1}}

If the eigenvalues of W_{hh} are < 1, gradients shrink exponentially (vanishing). If > 1, they explode. In practice, vanilla RNNs struggle to learn dependencies beyond ~10-20 timesteps.
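A scalar stand-in makes the exponential behavior concrete. The product of Jacobians above scales roughly like a recurrent weight raised to the number of timesteps, so a single number repeatedly multiplied by itself captures the effect (the specific values 0.9 and 1.1 are illustrative, not from the text):

```python
# The BPTT gradient through t steps scales like a product of t Jacobian factors.
# Scalar stand-in: repeatedly multiply by a "recurrent weight" w.
def gradient_magnitude(w, steps):
    return abs(w) ** steps

for w in (0.9, 1.1):
    print(f"w={w}: gradient scale after 50 steps = {gradient_magnitude(w, 50):.2e}")
# 0.9 shrinks toward zero (vanishing); 1.1 blows up (exploding)
```

Even a weight close to 1 on either side diverges quickly once sequences reach tens of timesteps, which is why the practical limit quoted above is so short.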

LSTM: Long Short-Term Memory

LSTMs (Hochreiter & Schmidhuber, 1997) solve the vanishing gradient problem with a cell state c_t that flows through time with minimal transformation, plus three gates that control information flow:

| Gate | Formula | Role |
|---|---|---|
| Forget | f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) | What to erase from cell state |
| Input | i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) | What new info to store |
| Output | o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) | What to output from cell state |

The cell state update:

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

where \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) is the candidate state. The hidden state is then read out through the output gate: h_t = o_t \odot \tanh(c_t).

The forget gate is the key innovation: when f_t \approx 1, gradients flow through unchanged, enabling learning over hundreds of timesteps.
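The gate equations can be sketched as a single numpy step. As a common implementation convenience (an assumption here, not something the text prescribes), the four weight matrices are stacked into one matrix W acting on the concatenated [h_{t-1}, x_t]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x] to the four stacked pre-activations
    (forget, input, output, candidate), each of hidden size H."""
    hx = np.concatenate([h_prev, x])
    z = W @ hx + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])            # forget gate: what to erase
    i = sigmoid(z[H:2*H])          # input gate: what new info to store
    o = sigmoid(z[2*H:3*H])        # output gate: what to expose
    c_tilde = np.tanh(z[3*H:4*H])  # candidate state
    c = f * c_prev + i * c_tilde   # c_t = f ⊙ c_{t-1} + i ⊙ c~_t
    h = o * np.tanh(c)             # h_t = o ⊙ tanh(c_t)
    return h, c

H, D = 4, 3  # toy hidden and input sizes
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(h.shape, c.shape)
```

The gradient-friendly path is visible in the `c` update: it is an elementwise blend, with no matrix multiply or squashing nonlinearity applied to c_{t-1} itself, so a forget gate near 1 passes the cell state (and its gradient) through almost unchanged.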

GRU: Gated Recurrent Unit

GRUs (Cho et al., 2014) simplify LSTMs by merging the cell and hidden state into one, using only two gates:

| Gate | Role |
|---|---|
| Reset (r_t) | Controls how much past state to forget |
| Update (z_t) | Balances between old and new state (replaces forget + input gates) |

GRUs have fewer parameters than LSTMs and train faster, with comparable performance on most tasks. They are preferred when compute is constrained.
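A matching sketch of one GRU step, under the same toy-dimension assumptions as above. Three weight matrices instead of the LSTM's four, and no separate cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step: two gates, hidden state doubles as memory."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx + bz)  # update gate: how much new vs. old
    r = sigmoid(Wr @ hx + br)  # reset gate: how much past to use in the candidate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)
    return (1 - z) * h_prev + z * h_tilde  # interpolate old state and candidate

H, D = 4, 3  # toy hidden and input sizes
rng = np.random.default_rng(2)
Wz, Wr, Wh = (rng.normal(size=(H, H + D)) * 0.1 for _ in range(3))
bz = br = bh = np.zeros(H)
h = np.zeros(H)
for _ in range(5):
    h = gru_step(rng.normal(size=D), h, Wz, Wr, Wh, bz, br, bh)
print(h.shape)
```

The parameter saving is direct: for hidden size H and input size D, this GRU has 3·H·(H+D) gate/candidate weights versus 4·H·(H+D) for the stacked LSTM above.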

Architecture Variants

| Architecture | Use Case | Key Feature |
|---|---|---|
| Vanilla RNN | Short sequences, simple patterns | Minimal, fast |
| LSTM | Long-range dependencies | Cell state + 3 gates |
| GRU | Resource-constrained long sequences | 2 gates, fewer params |
| Bidirectional RNN | Full context needed (NER, translation) | Forward + backward pass |
| Encoder-Decoder | Seq-to-seq (translation, summarization) | Compress → generate |
| Attention + RNN | Long sequences with selective focus | Bahdanau/Luong attention |

RNNs vs Transformers

| Aspect | RNNs/LSTMs | Transformers |
|---|---|---|
| Sequence processing | Sequential (O(n) steps) | Parallel (O(1) depth) |
| Long-range dependencies | Degrades with distance | Self-attention captures any distance |
| Training speed | Slow (sequential) | Fast (parallelizable) |
| Memory footprint | O(1) per step | O(n²) attention matrix |
| Inference (streaming) | Natural — process token by token | Needs full sequence or KV cache |
| Best for today | Edge, real-time, time-series | NLP, vision, general sequence |

Where RNNs Still Win

Real-time streaming: RNNs process one token at a time with constant memory. For live audio, sensor data, or financial ticks, this is a natural fit.

Edge deployment: an LSTM with 500K parameters can run on a microcontroller. A Transformer of similar capability needs 10-100x more parameters.

Time-series forecasting: for univariate or low-dimensional time series, LSTMs remain competitive with Transformers and are simpler to train and deploy.

State Space Models: the recent S4/Mamba family combines RNN-like sequential processing with Transformer-like performance, suggesting recurrence is not dead — just evolving.