Start here
This is Chapter 2 of the Essence of Calculus series. Chapter 1 set up the spirit of the subject: hard problems become easy when you slice them into tiny pieces and add the pieces back up. This chapter takes the most famous tool in calculus, the derivative, and confronts the strange phrase everyone uses for it: "instantaneous rate of change."
Sit with that phrase for a second. It sounds reasonable until you poke it, and then it falls apart.
This series follows 3Blue1Brown's "Essence of Calculus". Watch Chapter 2 here: The paradox of the derivative
The paradox
A rate of change is a change in something divided by the time it took. Miles per hour. Dollars per month. You measure where you started, where you ended, and how long it took. Change needs two moments to even exist.
Now glue the word "instantaneous" onto it. An instant is a single moment. No gap, no interval, no elapsed time. In a single frozen instant, nothing moves. The change is 0 and the time is 0.
So "instantaneous rate of change" is asking for 0 / 0. That is the paradox. The phrase seems to demand a rate where, by definition, nothing is changing and no time is passing.
Do not paper over the contradiction. The whole payoff of this chapter is watching it resolve cleanly. The derivative is not a rate at a single instant. It is something subtler that we have nicknamed that way.
The car that exposes it
Picture a car driving away from your house. Let s(t) be the distance it has traveled by time t. Its velocity is how fast that distance is growing, and we write it ds/dt.
Here is the trap. Imagine a video of the car, and you pause on a single frame. In that frozen frame, can you tell how fast the car is going? You cannot. The car moves 0 distance over 0 seconds in a frozen frame. A single instant carries no information about speed.
distance s(t)
  ^
  |                      ___/
  |                 ___/
  |            ___/   <- how "steep" is the curve
  |         __/          right *here*?
  |      __/
  |   __/
  +-------------------------> time t
               t
Yet your speedometer happily shows a number at every instant. So the speedometer is not really reporting an instantaneous change. It is reporting the best constant-rate approximation of your motion around that moment: if the car kept going at exactly this rate, it would cover this many miles in an hour.
Rise over run, shrunk to nothing
Forget the single instant. Take two moments: time t and a slightly later time t + dt, where dt is a tiny nudge in time. Now there is a real interval, so a real rate exists:
rate = (change in distance) / (change in time) = (s(t + dt) - s(t)) / dt
This is honest. There is a gap, so there is a rate. Then comes the move that defines calculus: make dt smaller and smaller. Not zero, just shrinking toward zero. As you do, that ratio does not blow up and it does not collapse to nonsense. It settles down and approaches one definite number.
That number it approaches is the derivative.
dt = 1.0   -> ratio = 7
dt = 0.1   -> ratio = 3.31
dt = 0.01  -> ratio = 3.0301
dt = 0.001 -> ratio = 3.003001
                |
                v approaches 3 (example: s(t) = t³ at t = 1)
ds/dt is not 0/0. It is what the ratio of "small change in s" over "small change in t" approaches as dt shrinks. You never plug in dt = 0. You ask where the trend is heading. That single idea, the limit, is how the paradox dissolves.
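The shrinking ratio is easy to check numerically. Here is a minimal sketch, assuming the car follows s(t) = t³ (the example worked in the next section) and that we look at t = 1. The names `s` and `average_rate` are illustrative choices, not anything from the video.

```python
def s(t):
    """Distance traveled by time t; s(t) = t**3 is an assumed example."""
    return t ** 3


def average_rate(f, t, dt):
    """Rise over run across the real interval from t to t + dt."""
    return (f(t + dt) - f(t)) / dt


if __name__ == "__main__":
    t = 1.0
    for dt in (1.0, 0.1, 0.01, 0.001):
        # dt shrinks toward zero but is never set to zero.
        print(f"dt = {dt:<6} -> ratio = {average_rate(s, t, dt):.6f}")
```

Each call uses a real, nonzero interval, so every printed ratio is an honest rate. The derivative is the value those ratios settle toward.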
Watch it work: s(t) = t³
Let the car follow s(t) = t³. We want ds/dt. Start with the honest ratio and expand (t + dt)³:
(t + dt)³ = t³ + 3t²·dt + 3t·dt² + dt³
The change in distance, ds = s(t + dt) - s(t), is that minus the original t³:
ds = 3t²·dt + 3t·dt² + dt³
Now divide by dt to get the rate over the tiny interval:
ds/dt = (3t²·dt + 3t·dt² + dt³) / dt = 3t² + 3t·dt + dt²
Look closely at the three terms on the right. The first, 3t², has no dt in it at all. The other two, 3t·dt and dt², each still carry a factor of dt. As dt shrinks toward zero, those two terms shrink toward zero with it and simply vanish. What is left is clean:
ds/dt = 3t²
We are not sloppily setting dt = 0 (that would also kill the dt we divided by, and the whole thing would be 0/0 again). We are taking a limit: asking what 3t² + 3t·dt + dt² approaches as dt gets arbitrarily small. The 3t² stays put; the other two terms can be made as tiny as we like. So the value the expression approaches is exactly 3t². Discarding those terms is not an approximation; it is the precise answer to "where is this heading?"
The terms with dt² and higher matter for a finite step, but they die in the limit. That is the recurring rhythm of derivatives: keep the part that survives, drop the part that vanishes.
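The algebra above can be double-checked with exact arithmetic. This is a small sketch using Python's `fractions.Fraction` so that rounding does not muddy the comparison. The helper names `ratio` and `expanded` are made up for illustration.

```python
from fractions import Fraction


def s(t):
    return t ** 3


def ratio(t, dt):
    """The honest ratio (s(t + dt) - s(t)) / dt, computed exactly."""
    return (s(t + dt) - s(t)) / dt


def expanded(t, dt):
    """The simplified form from the text: 3t^2 + 3t*dt + dt^2."""
    return 3 * t**2 + 3 * t * dt + dt**2


if __name__ == "__main__":
    t = Fraction(2)
    for k in (1, 3, 6):
        dt = Fraction(1, 10**k)
        # The two forms agree exactly for any nonzero dt.
        assert ratio(t, dt) == expanded(t, dt)
        # What remains after removing 3t^2 is the part that still carries dt.
        leftover = ratio(t, dt) - 3 * t**2
        print(f"dt = {float(dt):g}: leftover = {float(leftover):.8f}")
```

The leftover is exactly 3t·dt + dt², so it can be made as small as you like by shrinking dt, while the 3t² part does not move.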
The picture behind the number
Geometrically, (s(t + dt) - s(t)) / dt is the slope of the line through two nearby points on the curve. Shrink dt and those two points slide together, and the line they define stops being a chord and becomes the tangent, the line that just kisses the curve at t.
So the derivative at a point is the slope of the tangent line there. Steep tangent, fast change. Flat tangent, no change. That is what your speedometer is really reading off: the steepness of the distance curve right where you are.
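One way to see the tangent as the best constant-rate approximation is to compare the curve with the straight-line prediction s(t) + slope·dt. This is a rough sketch, assuming s(t) = t³ and its derivative 3t² from the worked example. The function names are illustrative only.

```python
def s(t):
    return t ** 3


def ds_dt(t):
    """Derivative of t**3, as derived in the text."""
    return 3 * t ** 2


def chord_slope(t, dt):
    """Slope of the line through (t, s(t)) and (t + dt, s(t + dt))."""
    return (s(t + dt) - s(t)) / dt


def tangent_prediction(t, dt):
    """Where the car would be if it held the rate ds/dt constant for dt."""
    return s(t) + ds_dt(t) * dt


if __name__ == "__main__":
    t = 1.0
    for dt in (0.5, 0.05, 0.005):
        gap = s(t + dt) - tangent_prediction(t, dt)
        print(f"dt = {dt}: chord slope = {chord_slope(t, dt):.4f}, "
              f"tangent slope = {ds_dt(t):.4f}, prediction error = {gap:.6f}")
```

As dt shrinks, the chord slope closes in on the tangent slope, and the gap between the curve and the tangent-line prediction shrinks even faster than dt does.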
Why an AI engineer should care
This is not a museum piece. The derivative is the single most-run computation in modern machine learning.
Training a neural network means pushing a number called the loss (how wrong the model is) as low as possible. The knobs you turn are the model's weights, and there are millions of them. For each weight, you ask: if I nudge this weight a hair, does the loss go up or down, and how steeply? That question is exactly a derivative, d(loss)/d(weight).
The answer is a direction. Every single parameter update is, at heart:
weight = weight - learning_rate * d(loss)/d(weight)
The derivative is the signal that says which way is downhill. Its sign tells you which way the loss rises, so you step the opposite way; its size tells you how steep the slope is. Multiply by a small learning_rate (your dt, basically) and you take one cautious step downhill. Do it a few billion times and the model learns.
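Here is a toy version of that update rule, as a sketch only. It assumes a single weight and a made-up loss, loss(w) = (w - 3)², and it estimates the derivative with a small finite nudge rather than the exact methods real training frameworks use. Every name in it is hypothetical.

```python
def loss(weight):
    """Made-up loss with its minimum at weight = 3 (an assumption for illustration)."""
    return (weight - 3.0) ** 2


def d_loss_d_weight(weight, nudge=1e-6):
    """Estimate d(loss)/d(weight) by nudging the weight a hair."""
    return (loss(weight + nudge) - loss(weight)) / nudge


def train(weight, learning_rate=0.1, steps=50):
    for _ in range(steps):
        # Step against the derivative: that is the downhill direction.
        weight = weight - learning_rate * d_loss_d_weight(weight)
    return weight


if __name__ == "__main__":
    final = train(weight=0.0)
    print(f"final weight = {final:.4f}, loss = {loss(final):.8f}")
```

Starting from weight = 0, the derivative is negative, so subtracting it pushes the weight up toward 3. The steps shrink as the slope flattens near the minimum.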
A derivative is not a rate frozen in one instant. It is the value a shrinking rise-over-run approaches. Geometrically it is the slope of the tangent. In ML it is the compass needle pointing downhill on the loss surface, and every training step follows it.
Quick gotchas
dt is tiny but never zero. The instant you set it to zero you are back to 0/0. The derivative lives in the limit, the trend, not at the endpoint.
A derivative is a function, not a single number. s(t) = t³ gives ds/dt = 3t². That is a whole new function, telling you the slope at every t. Plug in a specific t to get a specific rate.
Terms that still carry dt vanish; the term without dt does not. After you divide by dt, dt² shrinks faster than dt, and a term with no dt in it does not shrink at all. When taking a derivative, only the term with no leftover dt survives. Knowing what survives is half the skill.
Slope of the tangent, not the chord. A chord through two separated points only approximates the rate. The derivative is the exact slope you reach as the gap closes.
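The first two gotchas show up directly in code. This is a hedged sketch: `approx_derivative` is a hypothetical helper that turns a function into an approximate slope function using a small but nonzero dt.

```python
def approx_derivative(f, dt=1e-6):
    """Return a new function that approximates the slope of f at any t."""
    def slope_at(t):
        return (f(t + dt) - f(t)) / dt  # raises ZeroDivisionError if dt == 0
    return slope_at


def s(t):
    return t ** 3


if __name__ == "__main__":
    ds_dt = approx_derivative(s)  # a function, not a single number
    for t in (0.0, 1.0, 2.0):
        print(f"t = {t}: slope is about {ds_dt(t):.4f} (3t^2 = {3 * t ** 2:.4f})")

    try:
        approx_derivative(s, dt=0.0)(1.0)  # plugging in dt = 0 asks for 0/0
    except ZeroDivisionError:
        print("dt = 0 leaves no ratio to compute")
```

The helper only approximates the derivative, because its dt is small rather than taken to the limit. The exact answer, 3t², comes from the algebra, not from picking a tiny number.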
What you walked away with
- "Instantaneous rate of change" is a paradox only if you read it literally; it is shorthand for a best constant-rate approximation around a point.
- The honest object is (s(t + dt) - s(t)) / dt, and the derivative is what that ratio approaches as dt shrinks, never 0/0.
- For s(t) = t³, expanding and dividing leaves 3t² + 3t·dt + dt², and the dt terms vanish in the limit, giving ds/dt = 3t².
- Geometrically the derivative is the slope of the tangent line; in machine learning it is the downhill signal behind every weight update.
Next up, Chapter 3: derivatives of the everyday building blocks, powers, sums, and products, through pictures rather than memorized rules. Once you can see why d(x²)/dx = 2x as a growing square, the rules stop being arbitrary. See you there.