A.1. Linear algebra
Linear algebra is the study of linear transformations. A linear transformation is a transformation (e.g., a function) in which the sum of the transformation of two inputs separately, such as T(a) and T(b), is the same as summing the two inputs and transforming them together, i.e., T(a + b) = T(a) + T(b). A linear transformation also has the property that, for any scalar c, T(c × a) = c × T(a). Linear transformations are said to preserve the operations of addition and scalar multiplication, since you can apply these operations either before or after the linear transformation and the result is the same.
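We can check both properties numerically. The sketch below uses a hypothetical transformation T, represented as multiplication by an arbitrary fixed matrix M (the specific values are illustrative, not from the text):

```python
import numpy as np

# A hypothetical linear transformation T, represented as
# multiplication by a fixed matrix (values chosen arbitrarily).
M = np.array([[2.0, 1.0],
              [0.0, 3.0]])

def T(v):
    return M @ v

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
c = 5.0

# Additivity: T(a + b) equals T(a) + T(b)
print(np.allclose(T(a + b), T(a) + T(b)))  # True
# Scalar multiplication: T(c * a) equals c * T(a)
print(np.allclose(T(c * a), c * T(a)))     # True
```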
One informal way to think of this is that linear transformations do not have “economies of scale.” For example, think of a linear transformation as converting money as the input into some other resource, like gold, so that T($100) = 1 unit of gold. The unit price of gold will be constant no matter how much money you put in. In contrast, nonlinear transformations might give you a “bulk discount,” so that if you buy 1,000 units of gold or more, the price would be less on a per unit basis than if you bought less than 1,000 units.
Another way to think of linear transformations is to make a connection to calculus (which we’ll review in more detail shortly). A function or transformation takes some input value, x, and maps it to some output value, y. A particular output y may be a larger or smaller value than the input x, or more generally a neighborhood around an input x will be mapped to a larger or smaller neighborhood around the output y. Here a neighborhood refers to the set of points arbitrarily close to x or y. For a single-variable function like f(x) = 2x + 1, a neighborhood is actually an interval. For example, the neighborhood around an input point x = 2 would be all the points arbitrarily close to 2, such as 2.000001 and 1.99999999.
One way to think of the derivative of a function at a point is as the ratio of the size of the output interval around that point to the size of the input interval around the input point. A linear transformation has the same constant ratio of output to input intervals at every point, whereas a nonlinear transformation has a ratio that varies from point to point.
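This interval-ratio view of the derivative is easy to test numerically. The sketch below approximates the ratio with a small symmetric interval of width 2h around each point (the helper name and step size are illustrative):

```python
# Approximate the output/input interval ratio (the derivative)
# at a point x, using a small interval of half-width h.
def interval_ratio(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

linear = lambda x: 2 * x + 1   # constant ratio everywhere
nonlinear = lambda x: x ** 2   # ratio varies with x

for x in (1.0, 2.0, 3.0):
    print(interval_ratio(linear, x), interval_ratio(nonlinear, x))
# the linear function gives 2.0 at every point; x**2 gives roughly 2x
```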
Linear transformations are often represented as matrices, which are rectangular grids of numbers. Matrices encode the coefficients for multivariable linear functions, such as
f^{x}(x,y) = Ax + By
f^{y}(x,y) = Cx + Dy
While this appears to be two functions, this is really a single function that maps a 2-dimensional point (x,y) to a new 2-dimensional point (x′,y′) using the coefficients A,B,C,D. To find x′, you use the f^{x} function, and to find y′ you use the f^{y} function. We could have written this as a single line:
f(x,y) = (Ax + By, Cx + Dy)
This makes it clearer that the output is a 2-tuple, or 2-dimensional vector. In any case, it is useful to think of this function in two separate pieces, since the computations for the x′ and y′ components are independent.
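The two-component function above can be written directly in code. The coefficient values below are arbitrary placeholders:

```python
# The 2D linear transformation f(x, y) = (Ax + By, Cx + Dy),
# with illustrative coefficient values.
A, B, C, D = 1.0, 2.0, 3.0, 4.0

def f(x, y):
    x_new = A * x + B * y  # the f^x component
    y_new = C * x + D * y  # the f^y component
    return (x_new, y_new)

print(f(1.0, 1.0))  # (3.0, 7.0)
```

Note that x_new depends only on the first line of coefficients and y_new only on the second, matching the two-piece view of the function.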
While the mathematical notion of a vector is very general and abstract, for machine learning a vector is just a 1-dimensional array of numbers. This linear transformation takes a 2-vector (one that has 2 elements) and turns it into another 2-vector, and to do this it requires four separate pieces of data, the four coefficients. There is a difference between a linear transformation like Ax + By and something like Ax + By + C, which adds a constant; the latter is called an affine transformation. In practice, we use affine transformations in machine learning, but for this discussion we will stick with just linear transformations.
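One quick way to see that an affine transformation is not linear is to check additivity: the constant gets counted twice when you transform a sum. A small sketch, with arbitrary illustrative values for A, B, and C:

```python
# An affine map g(x, y) = Ax + By + C; illustrative values.
A, B, C = 2.0, 3.0, 1.0
g = lambda x, y: A * x + B * y + C

a, b = (1.0, 0.0), (0.0, 1.0)
lhs = g(a[0] + b[0], a[1] + b[1])  # g(a + b): C appears once
rhs = g(*a) + g(*b)                # g(a) + g(b): C appears twice
print(lhs, rhs)  # 6.0 vs 7.0, so additivity fails
```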
Matrices are a convenient way to store these coefficients. We can package the data into a 2 by 2 matrix, which we will call F:

F = [ A  B ]
    [ C  D ]
The linear transformation is now represented completely by this matrix, assuming you understand how to use it, which we shall cover. We can apply this linear transformation by juxtaposing the matrix with a vector, e.g., Fx.
We compute the result of this transformation by taking the dot product of each row in F with each column (only one here) of x, multiplying the paired elements and summing. If you do this, you get the same result as the explicit function definition above. Matrices do not need to be square; they can be any rectangular shape.
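The row-by-row dot product procedure can be checked against NumPy's built-in matrix-vector product. The coefficient values below stand in for A, B, C, D and are illustrative:

```python
import numpy as np

# F packages the coefficients A, B, C, D (illustrative values).
F = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, 1.0])

# Dot each row of F with the column vector x...
manual = np.array([F[0] @ x, F[1] @ x])
# ...which is exactly what the matrix-vector product computes.
builtin = F @ x

print(manual, builtin)  # both [3. 7.]
```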
We can graphically represent a matrix as a box with a string coming out of each end, each string labeled with an index:
We call this a string diagram. The n represents the dimensionality of the input vector and the m is the dimensionality of the output vector. You can imagine a vector flowing into the linear transformation from the left, and a new vector is produced on the right side. For the practical deep learning we use in this book, you only need to understand this much linear algebra, i.e., the principles of multiplying vectors by matrices. Any additional math will be introduced in the respective chapters.