Order matters — a lot
"The dog bit the man."
"The man bit the dog."
Same words. Completely different meaning.
This is the fundamental challenge of sequential data. A plain feed-forward network takes a fixed-size vector of values and has no built-in notion that one value came before another; whatever it knows about order, it has to learn from fixed input positions. Yet order is often where the meaning lives. Shuffle the pixels of an image? You've broken the image. Shuffle the words in a sentence? You've broken the meaning.
Sequences are everywhere — spoken language, written text, stock prices, sensor readings, music, DNA. In all of these, position and order carry critical information that you can't just ignore.
The problem with plain networks
Imagine you're trying to predict the next word in a sentence.
"The cat sat on the ___"
A plain neural network takes a fixed-size input. You'd have to decide: "I'll look at the last 5 words." But what if the key context was 15 words ago? Or 50? And different sentences are different lengths — how do you handle that with fixed-size inputs?
You can't. Not cleanly.
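To make the fixed-window problem concrete, here is a minimal sketch. The helper `last_n_words` and the `<pad>` token are made up for illustration; they stand in for whatever preprocessing a fixed-size network would need.

```python
def last_n_words(sentence, n=5):
    """Return exactly n words: the last n of the sentence, left-padded if it is shorter."""
    words = sentence.split()
    window = words[-n:]
    # Pad on the left so the network always receives the same input size.
    return ["<pad>"] * (n - len(window)) + window

short_sentence = "The cat sat"
long_sentence = "The keys that I left on the kitchen table this morning before work"

print(last_n_words(short_sentence))  # padded up to 5 slots
print(last_n_words(long_sentence))   # everything before the last 5 words is dropped
```

In the long sentence, the subject ("keys") falls outside the window, so a network that only sees these five words has no way to use it. A bigger window just moves the cutoff.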
What you need is a network that can:
- Process inputs one at a time, in order
- Remember what it saw earlier
- Use that memory when processing each new input
Recurrent Neural Networks (RNNs) do exactly this.
So how does an RNN actually work?
An RNN processes a sequence step by step. At each step, it takes two inputs:
- The current element in the sequence (e.g., the current word)
- A hidden state — a vector that summarizes everything the network has seen so far
It combines those two inputs, produces an output, and updates the hidden state. Then moves to the next step.
At each time step t:
hidden_state = activation(W × input_t + U × hidden_state_prev + bias)
output_t = V × hidden_state
The hidden state is the memory. It carries information from the past into the future.
The same RNN cell — with the same weights — is applied at every step. That's what "recurrent" means. It's one cell, reused over and over, processing the sequence one element at a time.
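The two update equations above can be sketched in a few lines of NumPy. This is an illustrative forward pass only, assuming `tanh` as the activation and small random weights; the sizes and the function name `rnn_forward` are arbitrary choices, and there is no training here.

```python
import numpy as np

def rnn_forward(inputs, W, U, V, b, h0=None):
    """Run one RNN cell over a sequence, reusing the same weights at every step."""
    h = np.zeros(U.shape[0]) if h0 is None else h0
    hidden_states, outputs = [], []
    for x_t in inputs:                      # one element at a time, in order
        h = np.tanh(W @ x_t + U @ h + b)    # new hidden state from input + old state
        hidden_states.append(h)
        outputs.append(V @ h)               # output at this step
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 4, 8, 3, 6
W = rng.normal(scale=0.1, size=(hidden_size, input_size))
U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))
b = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))
hidden_states, outputs = rnn_forward(inputs, W, U, V, b)
print(hidden_states.shape, outputs.shape)
```

Notice that nothing in `rnn_forward` depends on the sequence length. The same `W`, `U`, and `V` handle 6 steps or 600, which is how an RNN sidesteps the fixed-size input problem.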
And because the hidden state carries forward, the prediction at step 10 theoretically has access to information from step 1.
Theoretically.
The vanishing gradient problem
Here's where things get messy.
Remember backpropagation? Gradients flow backward through the network. In an RNN, when you unroll it over time, the gradient has to travel backward through many, many copies of that same cell.
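At each step, the gradient gets multiplied by the same recurrent weight matrix (and by the derivative of the activation function). If that repeated multiplication scales the gradient by a factor slightly less than 1, the gradient shrinks with every step. A factor of 0.9 applied 50 times leaves about 0.5% of the original signal (0.9^50 ≈ 0.005). By the time it reaches 50 steps back, it's essentially zero.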
This is the vanishing gradient problem. The network can't learn long-range dependencies because the signal from distant past inputs disappears before it can update the early weights.
When gradients vanish, the network stops learning from the distant past. Words from 20 steps ago have no influence on the current update. The network develops amnesia — it can only really use recent context.
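You can watch this happen numerically. The sketch below is a toy setup, not a real training run: it assumes a `tanh` RNN whose recurrent matrix `U` is an orthogonal matrix scaled by 0.9 (so every singular value is exactly 0.9), feeds in random inputs, and tracks how sensitive the current hidden state is to the very first one.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 16, 50

# Orthogonal matrix scaled by 0.9: a deliberately simple "slightly less than 1" case.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
U = 0.9 * Q

h = np.zeros(hidden_size)
grad = np.eye(hidden_size)   # d h_t / d h_0, starts as the identity
norms = []
for _ in range(steps):
    x_t = rng.normal(size=hidden_size)     # stands in for the projected input W x_t
    h = np.tanh(U @ h + x_t)
    jacobian = np.diag(1.0 - h**2) @ U     # d h_t / d h_{t-1} for a tanh cell
    grad = jacobian @ grad                 # chain rule across one more step
    norms.append(np.linalg.norm(grad, 2))  # largest singular value of the product

for t in (1, 10, 25, 50):
    print(f"step {t:2d}: gradient norm = {norms[t - 1]:.2e}")
```

Each step multiplies in another factor no larger than 0.9, so after 50 steps the norm can be at most 0.9^50 (about 0.005), and the `tanh` derivative usually pushes it far lower.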
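The opposite problem exists too: if the repeated multiplication scales the gradient by a factor slightly greater than 1, the gradient grows exponentially instead. Numbers become uncontrollably large, the weight updates overshoot, and training diverges (often ending in NaN values). This is the exploding gradient problem.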
(Exploding gradients have a pragmatic fix: gradient clipping, which caps the gradient's magnitude at a maximum value, typically by rescaling it when its norm exceeds a threshold. Vanishing gradients are harder.)
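One common form of clipping rescales all gradients together whenever their combined norm exceeds a threshold. Here is a minimal NumPy sketch; the function name `clip_by_global_norm` and the threshold of 5.0 are illustrative choices, not values from any particular library.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1.0 when the norm is already small enough, so small gradients pass through.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

exploding = [np.array([30.0, 40.0]), np.array([[0.0, 0.0]])]
clipped, norm_before = clip_by_global_norm(exploding, max_norm=5.0)
print(norm_before, clipped[0])

small = [np.array([0.3, 0.4])]
untouched, _ = clip_by_global_norm(small, max_norm=5.0)
print(untouched[0])
```

Rescaling by the norm keeps the gradient's direction intact and only shortens the step, which is why it tends to be preferred over clamping each value independently.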
LSTM — memory with gates
Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory (LSTM) network in 1997 to solve this exact problem.
The key idea: give the network an explicit memory mechanism with gates that can learn what to remember and what to forget.
An LSTM cell has three gates:
- Forget gate: looks at the current input and the previous hidden state, and decides: "how much of the previous memory should we keep?"
- Input gate: decides what new information to write into memory.
- Output gate: decides what part of the memory to expose as the hidden state at this step.
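The cell state is a separate memory track that runs straight through the network with very little interference: at each step it is only scaled by the forget gate and added to by the input gate. When the forget gate stays close to 1, gradients can flow along this highway for many steps without vanishing. That's the trick.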
An LSTM can learn: "this piece of information is important — keep it for 50 steps." Or: "now that we've seen X, forget what we stored earlier."
It's not magic — it's just learned gates. The network figures out, through training, when to open and close each gate.
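Here is what a single LSTM step might look like in NumPy, using the standard formulation with sigmoid gates and a `tanh` candidate. The parameter names (`Wf`, `bf`, and so on) and sizes are illustrative. The second half pins the gates by hand, rather than learning them, just to show memory surviving 50 steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state and the new cell state."""
    z = np.concatenate([h_prev, x_t])          # gates see previous hidden state + input
    f = sigmoid(p["Wf"] @ z + p["bf"])         # forget gate: how much old memory to keep
    i = sigmoid(p["Wi"] @ z + p["bi"])         # input gate: how much new info to write
    o = sigmoid(p["Wo"] @ z + p["bo"])         # output gate: how much memory to expose
    c_tilde = np.tanh(p["Wc"] @ z + p["bc"])   # candidate values to write
    c = f * c_prev + i * c_tilde               # cell state: scaled old + gated new
    h = o * np.tanh(c)                         # hidden state exposed at this step
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
params = {}
for name in ("f", "i", "o", "c"):
    params["W" + name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["b" + name] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)

# Hand-set gates (not learned): forget gate pinned open, input gate pinned shut.
frozen = {k: np.zeros_like(v) for k, v in params.items()}
frozen["bf"] += 20.0
frozen["bi"] -= 20.0
c_start = rng.normal(size=hidden_size)
h_f, c_f = np.zeros(hidden_size), c_start.copy()
for x_t in rng.normal(size=(50, input_size)):
    h_f, c_f = lstm_step(x_t, h_f, c_f, frozen)
print(np.allclose(c_f, c_start))  # memory is essentially unchanged after 50 steps
```

The key line is `c = f * c_prev + i * c_tilde`. Because the old memory is scaled and added to rather than pushed through a weight matrix and a squashing function, information (and gradient) can pass through nearly untouched when the forget gate is open.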
GRU — the simpler cousin
In 2014, Cho et al. introduced the Gated Recurrent Unit (GRU).
Same idea as LSTM: use gates to control what flows through. But GRU simplifies the design — it merges the forget and input gates into a single "update gate" and eliminates the separate cell state.
The result is a simpler model with fewer parameters. Trains faster. And in many tasks, works just as well as LSTM.
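For comparison, a single GRU step can be sketched as follows. Parameter names and sizes are illustrative, and note one assumption: papers and libraries differ on whether the update gate `z` weights the old state or the new candidate. This sketch uses `z` for the old state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: a single hidden state, no separate cell state."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(p["Wz"] @ hx + p["bz"])        # update gate: keep old state vs. take new
    r = sigmoid(p["Wr"] @ hx + p["br"])        # reset gate: how much past to use in candidate
    h_tilde = np.tanh(p["Wh"] @ np.concatenate([r * h_prev, x_t]) + p["bh"])
    return z * h_prev + (1.0 - z) * h_tilde    # blend old state and candidate

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
gru_params = {}
for name in ("z", "r", "h"):
    gru_params["W" + name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    gru_params["b" + name] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h = gru_step(x_t, h, gru_params)
print(h.shape)
```

The single update gate does double duty: whatever fraction of the old state it keeps, the remainder is filled by the new candidate. That coupling is what replaces the LSTM's separate forget and input gates.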
| | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| Memory | Separate cell state + hidden | Single hidden state |
| Parameters | More | Fewer |
| Training speed | Slower | Faster |
| Performance | Slightly better on some tasks | Competitive on most |
Neither is universally better. In practice, GRU is a good default — faster and simpler. Try LSTM when you need the extra expressiveness.
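The "fewer parameters" claim is easy to check with a back-of-the-envelope count. The sketch below assumes the textbook formulation with one weight matrix and one bias vector per block: an LSTM has four blocks (three gates plus the candidate), a GRU has three (two gates plus the candidate). The sizes 128 and 256 are arbitrary, and some libraries, PyTorch for example, store two bias vectors per block, so their totals come out slightly higher.

```python
def gated_rnn_param_count(input_size, hidden_size, num_blocks):
    """Parameters for one layer: each block has a weight matrix over [h, x] plus a bias."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 128, 256
lstm_params = gated_rnn_param_count(input_size, hidden_size, num_blocks=4)
gru_params_count = gated_rnn_param_count(input_size, hidden_size, num_blocks=3)

print("LSTM:", lstm_params)
print("GRU: ", gru_params_count)
print("ratio:", gru_params_count / lstm_params)
```

Under these assumptions the GRU always has three quarters of the LSTM's parameters for the same layer sizes, regardless of the sizes you pick.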
Where do RNNs actually show up?
Early versions of Gmail's Smart Compose used RNN-based models to predict the next word as you typed. That's language modeling: the network learns the statistical shape of language well enough to guess what comes next.
Machine translation was another big win. You feed an entire sentence through an RNN encoder, it compresses the meaning into a hidden state, then a decoder RNN generates the translation word by word. The "encoder-decoder" architecture was a genuine breakthrough.
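The encoder-decoder idea can be sketched with two plain RNN cells. Everything here is a toy: the weights are random and untrained, so the "translation" is meaningless, and the vocabulary sizes, the `BOS`/`EOS` token ids, and the greedy decoding loop are all illustrative assumptions. The point is only the data flow: a whole sentence in, one vector across, tokens out.

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab, tgt_vocab, embed_size, hidden_size = 10, 12, 8, 16
BOS, EOS = 0, 1   # hypothetical start/end token ids in the target vocabulary

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

encoder = {"E": init(src_vocab, embed_size), "W": init(hidden_size, embed_size),
           "U": init(hidden_size, hidden_size), "b": np.zeros(hidden_size)}
decoder = {"E": init(tgt_vocab, embed_size), "W": init(hidden_size, embed_size),
           "U": init(hidden_size, hidden_size), "b": np.zeros(hidden_size),
           "V": init(tgt_vocab, hidden_size)}

def rnn_step(token_id, h, p):
    x = p["E"][token_id]                        # look up the token's embedding
    return np.tanh(p["W"] @ x + p["U"] @ h + p["b"])

def encode(src_ids):
    """Read the whole source sentence; the final hidden state is the summary."""
    h = np.zeros(hidden_size)
    for tok in src_ids:
        h = rnn_step(tok, h, encoder)
    return h

def decode(context, max_len=10):
    """Generate target tokens one at a time, starting from the encoder's summary."""
    h, tok, output = context, BOS, []
    for _ in range(max_len):
        h = rnn_step(tok, h, decoder)
        tok = int(np.argmax(decoder["V"] @ h))  # greedy: pick the highest-scoring token
        if tok == EOS:
            break
        output.append(tok)
    return output

context = encode([3, 7, 2, 9, 4])
print(context.shape)
print(decode(context))
```

Whatever the source length, `encode` returns a vector of the same size, and that single vector is all the decoder ever sees of the input.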
Speech recognition works the same way — audio is just a time sequence of sound measurements. Feed it through an RNN, get text out. And time-series forecasting (stock prices, weather, energy usage) is a natural fit too — historical values in, future prediction out.
But RNNs have a problem...
Even with LSTM, there's a fundamental bottleneck. All information from the past is compressed into a single hidden state vector. That's a narrow pipe.
If you're translating a long sentence, by the time you're generating the last word of the output, all the context from the beginning of the input sentence has been squeezed through a long chain of hidden-state updates. Important things get lost.
And RNNs are slow to train. Because each step depends on the previous one, you can't process the time steps of a sequence in parallel. You have to go one step at a time, so long sequences mean long training times, and parallel hardware like GPUs sits underused.
This bottleneck is exactly what the next major breakthrough in AI came to solve.
In 2017, a team at Google published a paper called "Attention Is All You Need." It introduced the Transformer architecture: no recurrence at all, just a mechanism called self-attention that lets every position in the sequence look directly at every other position. That greatly eased the long-range dependency problem and made it possible to train on all positions in parallel. This is the architecture behind GPT, BERT, and essentially every large language model today.
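As a small preview, the core of self-attention fits in a few lines. This is a minimal single-head sketch of scaled dot-product attention with random, untrained projection matrices; the sizes are arbitrary, and it leaves out everything else a Transformer adds on top (multiple heads, positional information, feed-forward layers).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every position scored against every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, model_size = 5, 8
X = rng.normal(size=(seq_len, model_size))    # one vector per position
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(model_size, model_size)) for _ in range(3))

attended, weights = self_attention(X, Wq, Wk, Wv)
print(attended.shape, weights.shape)
```

There is no loop over time steps here. All positions are handled in a handful of matrix multiplications, and position 5 reaches position 1 in a single hop instead of through four hidden-state updates.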
What's next?
Now we move into the modern era. The next series is all about Transformers and LLMs — the architecture that changed everything and enabled GPT, LLaMA, and the AI systems you use daily.
First up: Attention Is All You Need — the mechanism that replaced recurrence and became the backbone of modern AI.