Calculus is essentially the study of differentiation and integration. In deep learning, we only really need to use differentiation. Differentiation is the process of getting a derivative of a function.
We already introduced one notion of derivative: the ratio of an output interval to the input interval. It tells you how much the output space is stretched or squished. Importantly, these intervals are oriented intervals so they can be negative or positive, and thus the ratio can be negative or positive.
For example, consider the function $f(x) = x^2$. Take a point x and its neighborhood (x – ε, x + ε), where ε is some arbitrarily small value, and we get an interval around x. To be concrete, let x = 3 and ε = 0.1; the interval around x = 3 is (2.9, 3.1). The size (and orientation) of this interval is 3.1 – 2.9 = +0.2, and its endpoints get mapped to f(2.9) = 8.41 and f(3.1) = 9.61. The output interval is (8.41, 9.61) and its size is 9.61 – 8.41 = +1.2. As you can see, the output interval is still positive, so the ratio is 1.2 / 0.2 = 6, which is the derivative of the function f at x = 3.
We denote the derivative of a function, f, with respect to an input variable, x, as df/dx, but this is not to be thought of as a literal fraction; it’s just a notation. We don’t need to take an interval on both sides of the point; an interval on one side will do as long as it’s small, i.e., we can define an interval as (x,x + ε) and the size of the interval is just ε, whereas the size of the output interval is f(x + ε) – f(x).
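The two interval choices can be compared numerically. The following is only an illustrative sketch; the helper names `symmetric_ratio` and `one_sided_ratio` are made up for this example and do not come from any library.

```python
def f(x):
    return x ** 2

def symmetric_ratio(f, x, eps):
    # Size of the output interval over the size of the input interval (x - eps, x + eps)
    return (f(x + eps) - f(x - eps)) / ((x + eps) - (x - eps))

def one_sided_ratio(f, x, eps):
    # Size of the output interval over the size of the input interval (x, x + eps)
    return (f(x + eps) - f(x)) / eps

x, eps = 3.0, 0.1
print(symmetric_ratio(f, x, eps))  # approximately 6
print(one_sided_ratio(f, x, eps))  # approximately 6.1
```

For this particular function the one-sided ratio is off by about ε, which is why the interval needs to be small.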
Using concrete values as we did yields only approximations in general; to get exact values we’d need to use infinitely small intervals. We can do this symbolically by imagining that ε is an infinitely small number: bigger than 0 but smaller than any other positive number in our number system. Now differentiation becomes an algebra problem:
$$\frac{f(x + \varepsilon) - f(x)}{\varepsilon} = \frac{(x + \varepsilon)^2 - x^2}{\varepsilon} = \frac{2x\varepsilon + \varepsilon^2}{\varepsilon} = 2x + \varepsilon$$
Here we simply take the ratio of the output interval to the input interval, both of which are infinitely small because ε is an infinitesimal number. We can algebraically reduce the expression to 2x + ε, and since ε is infinitesimal, 2x + ε is infinitely close to 2x, which we take as the true derivative of the original function $f(x) = x^2$. Remember, we’re taking ratios of oriented intervals that can be positive or negative. We want to know not only how much a function stretches (or squeezes) the input, but also whether it changes the direction of the interval. There is a lot of advanced mathematics justifying all of this (see nonstandard analysis or smooth infinitesimal analysis), but this process works just fine for practical purposes.
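A computer cannot represent a true infinitesimal, but a rough sketch with ordinary floating-point numbers shows the same pattern: as ε shrinks, the ratio for $f(x) = x^2$ stays at 2x + ε and so approaches 2x. The choice x = 3 and the particular ε values are arbitrary assumptions.

```python
def f(x):
    return x ** 2

x = 3.0
eps_values = [1e-1, 1e-2, 1e-3, 1e-4]
ratios = []
for eps in eps_values:
    # Ratio of output interval to input interval; algebraically equal to 2x + eps
    ratio = (f(x + eps) - f(x)) / eps
    ratios.append(ratio)
    print(eps, ratio)
```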
Why is differentiation a useful concept in deep learning? Well, in machine learning we are trying to optimize a function, which means finding the input points to the function such that the output of the function is a maximum or minimum over all possible inputs. That is, given some function, f(x), we want to find an x such that f(x) is smaller than it is for any other choice of x; we generally denote this x as argmin f(x). Usually we have a loss function (or cost or error function) that takes some input vector, a target vector, and a parameter vector and returns the degree of error between the predicted output and the true output, and our goal is to find the set of parameters that minimizes this error function. There are many possible ways to minimize this function, not all of which depend on using derivatives, but in most cases the most effective and efficient way to optimize loss functions in machine learning is to use derivative information.
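To make the shape of a loss function concrete, here is a minimal sketch. The model (a dot product of input and parameters), the squared-error measure, the scalar target, and the three candidate parameter vectors are all assumptions chosen for illustration; real deep learning losses are far more elaborate.

```python
import numpy as np

def loss(x, target, theta):
    # Hypothetical model: the prediction is the dot product of input and parameters
    prediction = np.dot(x, theta)
    # Squared error between the predicted output and the true output
    return (prediction - target) ** 2

x = np.array([1.0, 2.0])
target = 5.0
candidates = [np.array([1.0, 1.0]), np.array([1.0, 2.0]), np.array([2.0, 2.0])]

errors = [loss(x, target, theta) for theta in candidates]
best = candidates[int(np.argmin(errors))]  # candidate parameters with the smallest error
print(errors, best)
```

Trying a handful of candidates like this does not scale to millions of parameters, which is where derivative information comes in.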
Since deep learning models are nonlinear (i.e., they do not preserve addition and scalar multiplication), their derivatives are not constant as they are for linear transformations. The amount and direction of squishing or stretching that happens from input to output varies from point to point. In another sense, the derivative tells us which way the function slopes at each point, so we can follow the curve downward to the lowest point. Multivariable functions like deep learning models don’t have just a single derivative but a set of partial derivatives that describe the slope of the function with respect to each individual input component. This way we can figure out which sets of parameters for a deep neural network lead to the smallest error.
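The contrast between constant and varying derivatives can be seen with a small numerical sketch. The functions `linear` and `nonlinear` and the helper `slope` are hypothetical examples, not part of any model.

```python
def slope(f, x, eps=1e-6):
    # One-sided ratio of the output interval to the input interval
    return (f(x + eps) - f(x)) / eps

def linear(x):
    return 3 * x

def nonlinear(x):
    return x ** 2

for x in [-2.0, 0.0, 2.0]:
    # The linear slope is the same everywhere; the nonlinear slope changes size and sign
    print(x, slope(linear, x), slope(nonlinear, x))
```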
The simplest way to see how derivative information helps minimize a function is to work through a simple compositional function. The function whose minimum we will try to find is:
$$f(x) = \log(x^4 + x^3 + 2)$$
The graph is shown in figure A.1. You can see that the minimum of this function appears to be around –1. This is a compositional function because it contains a polynomial expression “wrapped” in a logarithm, so we need to use the chain rule from calculus to compute the derivative. We want the derivative of this function with respect to x. This function has only one “valley,” so it has only one minimum; however, deep learning models are high-dimensional and highly compositional and tend to have many minima. Ideally, we’d like to find the global minimum, which is the lowest point of the function. At every minimum, global or local, the slope (i.e., the derivative) is 0. For some functions, like this simple example, we can compute the minimum analytically, using algebra. Deep learning models are generally too complex for algebraic calculations, and we must use iterative techniques.
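Before doing any calculus, the location of the valley can be estimated by brute force. The sketch below evaluates the function on a grid and picks the grid point with the smallest output; the grid range and spacing are arbitrary assumptions, and the log is taken to be the natural log.

```python
import numpy as np

def f(x):
    return np.log(np.power(x, 4) + np.power(x, 3) + 2)

xs = np.linspace(-3, 3, 6001)   # grid of candidate inputs with spacing 0.001
x_min = xs[np.argmin(f(xs))]    # grid point with the smallest output
print(x_min)
```

This gives a sharper estimate than reading the graph by eye, but a grid search becomes impractical as the number of inputs grows.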
The chain rule in calculus gives us a way of computing derivatives of compositional functions by decomposing them into pieces. If you’ve heard of backpropagation, it’s basically just the chain rule applied to neural networks with some tricks to make it more efficient. For our example case, let’s rewrite the previous function as two functions:
$$h(x) = x^4 + x^3 + 2$$
$$f(x) = \log(h(x))$$
We first compute the derivative of the “outer” function, f(x) = log(h(x)), with respect to its input h, but this gives us only df/dh, and what we really want is df/dx. You may have learned that the derivative of the natural log is
$$\frac{df}{dh} = \frac{d}{dh}\log h = \frac{1}{h}$$
And the derivative of the inner function h(x) is
$$\frac{dh}{dx} = 4x^3 + 3x^2$$
To get the full derivative of the compositional function, we notice that
$$\frac{df}{dx} = \frac{df}{dh} \cdot \frac{dh}{dx} = \frac{4x^3 + 3x^2}{x^4 + x^3 + 2}$$
That is, the derivative we want, df/dx, is obtained by multiplying the derivative of the outer function with respect to its input by the derivative of the inner function (the polynomial) with respect to x.
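The chain rule result can be sanity-checked against a small-interval ratio. This is a sketch only: the test points and the interval size `eps` are arbitrary choices, and the log is taken to be the natural log.

```python
import numpy as np

def h(x):
    return x**4 + x**3 + 2

def f(x):
    return np.log(h(x))

def dfdx(x):
    df_dh = 1.0 / h(x)           # derivative of the outer log with respect to h
    dh_dx = 4 * x**3 + 3 * x**2  # derivative of the inner polynomial with respect to x
    return df_dh * dh_dx         # chain rule: multiply the two pieces

eps = 1e-6
for x in [-2.0, -0.5, 1.0]:
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)  # ratio over the interval (x - eps, x + eps)
    print(x, dfdx(x), numeric)
```

The two columns should agree to several decimal places at every test point.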
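You can set this derivative to 0 to find the stationary points algebraically. The denominator is always positive, so we only need the numerator to vanish: $4x^3 + 3x^2 = x^2(4x + 3) = 0$. This gives two stationary points, x = 0 and x = –3/4 = –0.75. Only x = –0.75 is a minimum, and it is the global minimum: f(–0.75) = 0.638971, whereas f(0) = 0.693147 is slightly larger. The derivative is positive on both sides of x = 0, so the function merely flattens out there.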
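Here is a quick numerical check of the stationary points, assuming the derivative $(4x^3 + 3x^2)/(x^4 + x^3 + 2)$ obtained from the chain rule. It finds where the numerator vanishes, compares the function values, and looks at the sign of the slope on either side of each point. The probe offsets are arbitrary.

```python
import numpy as np

def f(x):
    return np.log(x**4 + x**3 + 2)

def dfdx(x):
    return (4 * x**3 + 3 * x**2) / (x**4 + x**3 + 2)

# Roots of the numerator 4x^3 + 3x^2 (coefficients listed from highest degree down)
stationary = np.roots([4, 3, 0, 0])
print(stationary)

print(f(-0.75), f(0.0))        # function values at the two stationary points
print(dfdx(-0.8), dfdx(-0.7))  # slope just left and right of x = -0.75
print(dfdx(-0.1), dfdx(0.1))   # slope just left and right of x = 0
```

A stationary point is a minimum only if the slope changes from negative to positive as x passes through it.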
Let’s see how we can solve this using gradient descent, which is an iterative algorithm to find the minima of a function. The idea is that we start with a random x as a starting point. We then compute the derivative of the function at this point, which tells us the magnitude and direction of the slope there. We then choose a new x based on the old x, its derivative, and a step-size parameter α that controls how fast we move. That is,
$$x_{\text{new}} = x_{\text{old}} - \alpha \frac{df}{dx}$$
Let’s see how to do this in code.
```python
import numpy as np

def f(x):
    # The original function: log of the polynomial h(x) = x^4 + x^3 + 2
    return np.log(np.power(x, 4) + np.power(x, 3) + 2)

def dfdx(x):
    # The derivative function from the chain rule: h'(x) / h(x)
    return (4 * np.power(x, 3) + 3 * np.power(x, 2)) / (np.power(x, 4) + np.power(x, 3) + 2)

x = -9.41      # Random starting point
lr = 0.01      # Learning rate (step size)
epochs = 5000  # Number of iterations to optimize over
for i in range(epochs):
    deriv = dfdx(x)     # Calculates derivative at the current point
    x = x - lr * deriv  # Updates current point
print(x)
```
If you run this gradient descent algorithm, x should converge to approximately –0.75, which matches what we calculated algebraically. This simple process is the same one we use when training deep neural networks, except that deep neural networks are multivariable compositional functions, so we use partial derivatives. A partial derivative is no more complex than a normal derivative.
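Rather than fixing the number of iterations in advance, one variation is to stop once the slope is nearly zero. The sketch below assumes the chain rule derivative $(4x^3 + 3x^2)/(x^4 + x^3 + 2)$; the helper `gradient_descent`, its default step size, its tolerance, and its step limit are hypothetical choices for illustration.

```python
def dfdx(x):
    return (4 * x**3 + 3 * x**2) / (x**4 + x**3 + 2)

def gradient_descent(dfdx, x0, lr=0.01, max_steps=100_000, tol=1e-10):
    # Hypothetical helper: repeat the update until the slope is nearly zero
    x = x0
    for step in range(max_steps):
        deriv = dfdx(x)
        if abs(deriv) < tol:
            break               # slope is essentially flat, so stop early
        x = x - lr * deriv      # same update rule as in the loop above
    return x, step

x_min, steps = gradient_descent(dfdx, x0=-9.41)
print(x_min, steps)
```

A slope-based stopping rule cannot tell a minimum from any other flat spot, so it is worth checking the result against the function values.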
Consider the multivariable function $f(x, y) = x^4 + y^2$. There is no longer a single derivative of this function since it has two input variables. We can take the derivative with respect to x or y or both. When we take the derivative of a multivariable function with respect to each of its inputs and package the results into a vector, we call it the gradient, which is denoted by the nabla symbol ∇, i.e., ∇f(x, y) = [∂f/∂x, ∂f/∂y]. To compute the partial derivative of f with respect to x, i.e., ∂f/∂x, we simply treat the other variable y as a constant and differentiate as usual. In this case, $\partial f/\partial x = 4x^3$ and $\partial f/\partial y = 2y$. So the gradient is $\nabla f(x, y) = [4x^3, 2y]$, which is the vector of partial derivatives. Then we can run gradient descent as usual, except now we are looking for the vector of parameters associated with the lowest point of the deep neural network’s error function.
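As a minimal sketch, the same update rule can be applied to this two-variable function by replacing the single derivative with the gradient vector. The starting point, learning rate, and iteration count are arbitrary assumptions.

```python
import numpy as np

def f(v):
    x, y = v
    return x**4 + y**2

def grad_f(v):
    x, y = v
    # Vector of partial derivatives with respect to x and y
    return np.array([4 * x**3, 2 * y])

v = np.array([1.0, 2.0])  # arbitrary starting point
lr = 0.01                 # learning rate (step size)
for i in range(5000):
    v = v - lr * grad_f(v)  # same update rule, applied to the whole vector at once
print(v, f(v))
```

Both components head toward the minimum at (0, 0), but x gets there much more slowly than y because its partial derivative $4x^3$ becomes tiny near 0.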