**4**

MACHINE LEARNING

Machine learning is found in almost every area of computer science. Over the past few years, I’ve attended computer science conferences in fields as diverse as distributed systems, databases, and stream processing, and no matter where I go, machine learning is already there. At some conferences, more than half of the presented research ideas have relied on machine learning methods.

As a computer scientist, you must know the fundamental machine learning ideas and algorithms to round out your overall skill set. This chapter provides an introduction to the most important machine learning algorithms and methods, and gives you 10 practical one-liners to apply these algorithms in your own projects.

**The Basics of Supervised Machine Learning**

The main aim of machine learning is to make accurate predictions using existing data. Let’s say you want to write an algorithm that predicts the value of a specific stock over the next two days. To achieve this goal, you’ll need to train a machine learning model. But what exactly is a *model*?

From the perspective of a machine learning user, the machine learning model looks like a black box (Figure 4-1): you put data in and get predictions out.

*Figure 4-1: A machine learning model, shown as a black box*

In this model, you call the input data *features* and denote them using the variable *x*, which can be a numerical value or a multidimensional vector of numerical values. Then the box does its magic and processes your input data. After a bit of time, you get prediction *y* back, which is the model’s predicted output, given the input features. For regression problems, the prediction consists of one or multiple numerical values—just like the input features.

Supervised machine learning is divided into two separate phases: the training phase and the inference phase.

*Training Phase*

During the *training phase*, you tell your model your desired output *y’* for a given input *x*. When the model outputs the prediction *y*, you compare it to *y’*, and if they are not the same, you update the model to generate an output that is closer to *y’*, as shown in Figure 4-2. Let’s look at an example from image recognition. Say you train a model to predict fruit names (outputs) when given images (inputs). For example, your specific training input is an image of a banana, but your model wrongly predicts *apple*. Because your desired output is different from the model prediction, you change the model so that next time the model will correctly predict *banana*.

*Figure 4-2: The training phase of a machine learning model*

As you keep telling the model your desired outputs for many different inputs and adjusting the model, you train the model by using your *training data*. Over time, the model will learn which output you’d like to get for certain inputs. That’s why data is so important in the 21st century: your model will be only as good as its training data. Without good training data, the model is guaranteed to fail. Roughly speaking, the training data supervises the machine learning process. That’s why it’s called *supervised learning*.

*Inference Phase*

During the *inference phase*, you use the trained model to predict output values for new input features *x*. Note that the model has the power to predict outputs for inputs that have never been observed in the training data. For example, the fruit prediction model from the *training phase* can now identify the name of the fruits (learned in the training data) in images it has never seen before. In other words, suitable machine learning models possess the ability to *generalize*: they use their experience from the training data to predict outcomes for new inputs. Roughly speaking, models that generalize well produce accurate predictions for new input data. Generalized prediction for unseen input data is one of the strengths of machine learning and is a prime reason for its popularity across a wide range of applications.

**Linear Regression**

*Linear regression* is the one machine learning algorithm you’ll find most often in beginner-level machine learning tutorials. It’s commonly used in *regression problems*, for which the model predicts missing data values by using existing ones. A considerable advantage of linear regression, both for teachers and users, is its simplicity. But that doesn’t mean it can’t solve real problems! Linear regression has lots of practical use cases in diverse areas such as market research, astronomy, and biology. In this section, you’ll learn everything you need to know to get started with linear regression.

*The Basics*

How can you use linear regression to predict stock prices on a given day? Before I answer this question, let’s start with some definitions.

Every machine learning model consists of model parameters. *Model parameters* are internal configuration variables that are estimated from the data. These model parameters determine how exactly the model calculates the prediction, given the input features. For linear regression, the model parameters are called *coefficients*. You may remember the formula for two-dimensional lines from school: *f(x)* = *ax* + *c*. The two variables *a* and *c* are the coefficients in the linear equation *ax* + *c*. The coefficients describe how each input *x* is transformed into an output *f(x)*, so that all outputs together form a line in two-dimensional space. By changing the coefficients, you can describe any line in that space.
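As a tiny sketch (the coefficient values here are arbitrary, chosen only for illustration), here is what a line with fixed coefficients computes:

```python
# A line f(x) = a*x + c with illustrative coefficients a=2 and c=1
a, c = 2, 1

def f(x):
    return a * x + c

# Each input x maps to one point on the line
print([f(x) for x in range(4)])   # [1, 3, 5, 7]
```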

Given the input features *x*_{1}, *x*_{2}, . . ., *x*_{k}, the linear regression model combines the input features with the coefficients *a*_{1}, *a*_{2}, . . ., *a*_{k} to calculate the predicted output *y* by using this formula:

*y* = *f*(*x*) = *a*_{0} + *a*_{1} × *x*_{1} + *a*_{2} × *x*_{2} + ... + *a*_{k} × *x*_{k}
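In NumPy terms, this formula is just a dot product plus an intercept. A quick sketch with made-up coefficients and features:

```python
import numpy as np

a0 = 1.0                      # intercept a_0 (made-up value)
a = np.array([2.0, 3.0])      # coefficients a_1, a_2 (made-up values)
x = np.array([4.0, 5.0])      # input features x_1, x_2

# y = a_0 + a_1*x_1 + a_2*x_2
y = a0 + np.dot(a, x)
print(y)                      # 1 + 2*4 + 3*5 = 24.0
```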

In our stock price example, you have a single input feature, *x*, the day. You input the day *x* with the hope of getting a stock price, the output *y*. This simplifies the linear regression model to the formula of a two-dimensional line:

*y* = *f*(*x*) = *a*_{0} + *a*_{1}*x*

Let’s have a look at three lines for which you change only the two model parameters *a*_{0} and *a*_{1} in Figure 4-3. The first axis describes the input *x*. The second axis describes the output *y*. The line represents the (linear) relationship between input and output.

*Figure 4-3: Three linear regression models (lines) described by different model parameters (coefficients). Every line represents a unique relationship between the input and the output variables.*

In our stock price example, let’s say our training data is the indices of three days, [0, 1, 2], matched with the stock prices [155, 156, 157]. To put it differently:

- Input x=0 should cause output y=155
- Input x=1 should cause output y=156
- Input x=2 should cause output y=157

Now, which line best fits our training data? I plotted the training data in Figure 4-4.

*Figure 4-4: Our training data, with its index in the array as the* x *coordinate, and its price as the* y *coordinate*

To find the line that best describes the data and, thus, to create a linear regression model, we need to determine the coefficients. This is where machine learning comes in. There are two principal ways of determining model parameters for linear regression. First, you can analytically calculate the line of best fit between these points (the standard method for linear regression). Second, you can try different models, testing each against the labeled sample data, and ultimately deciding on the best one. In either case, you determine “best” through a process called *error minimization*: you select the coefficients that minimize the squared difference between the predicted model values and the ideal outputs.
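The second approach (trying candidate models and scoring each by its squared error) can be sketched in a few lines of NumPy; the candidate coefficients below are made up for illustration:

```python
import numpy as np

x = np.array([0, 1, 2])            # days
y = np.array([155, 156, 157])      # stock prices

# Candidate models (a_0, a_1): two made-up guesses plus the true fit
candidates = [(150.0, 2.0), (155.0, 0.5), (155.0, 1.0)]

# Squared error of each candidate line against the training data
errors = [np.sum((a0 + a1 * x - y) ** 2) for a0, a1 in candidates]
best = candidates[int(np.argmin(errors))]
print(best)                        # (155.0, 1.0) -- zero error
```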

For our data, you end up with coefficients of *a*_{0} = 155.0 and *a*_{1} = 1.0. Then you put them into our formula for linear regression:

*y* = *f*(*x*) = *a*_{0} + *a*_{1}*x* = 155.0 + 1.0 × *x*

and plot both the line and the training data in the same space, as shown in Figure 4-5.

*Figure 4-5: A prediction line made using our linear regression model*

A perfect fit! The squared distance between the line (model prediction) and the training data is zero—so you have found the model that minimizes the error. Using this model, you can now predict the stock price for any value of *x*. For example, say you want to predict the stock price on day *x* = 4. To accomplish this, you simply use the model to calculate *f(x)* = 155.0 + 1.0 × 4 = 159.0. The predicted stock price on day 4 is $159. Of course, whether this prediction accurately reflects the real world is another story.

That’s the high-level overview of what happens. Let’s take a closer look at how to do this in code.

*The Code*

Listing 4-1 shows how to build a simple linear regression model in a single line of code (you may need to install the scikit-learn library first by running pip install scikit-learn in your shell).

from sklearn.linear_model import LinearRegression
import numpy as np

## Data (Apple stock prices)
apple = np.array([155, 156, 157])
n = len(apple)

## One-liner
model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)

## Result & puzzle
print(model.predict([[3],[4]]))

Can you already guess the output of this code snippet?

*How It Works*

This one-liner uses two Python libraries: NumPy and scikit-learn. The former is the de facto standard library for numerical computations (like matrix operations). The latter is the most comprehensive library for machine learning and has implementations of hundreds of machine learning algorithms and techniques.

You may ask: “Why are you using libraries in a Python one-liner? Isn’t this cheating?” It’s a good question, and the answer is yes. Any Python program—with or without libraries—uses high-level functionality built on low-level operations. There’s not much point in reinventing the wheel when you can reuse existing code bases (that is, stand on the shoulders of giants). Aspiring coders often feel the urge to implement everything on their own, but this reduces their coding productivity. In this book, we’re going to use, not reject, the wide spectrum of powerful functionality implemented by some of the world’s best Python coders and pioneers. Each of these libraries took skilled coders years to develop, optimize, and tweak.

Let’s go through Listing 4-1 step by step. First, we create a simple data set of three values and store its length in a separate variable n to make the code more concise. Our data is three Apple stock prices for three consecutive days. The variable apple holds this data set as a one-dimensional NumPy array.

Second, we build the model by calling LinearRegression(). But what are the model parameters? To find them, we call the fit() function to train the model. The fit() function takes two arguments: the input features of the training data and the ideal outputs for these inputs. Our ideal outputs are the real stock prices of the Apple stock. But for the input features, fit() requires an array with the following format:

[<training_data_1>,
 <training_data_2>,
 --snip--
 <training_data_n>]

where each training data value is a sequence of feature values:

<training_data> = [feature_1, feature_2, ..., feature_k]

In our case, the input consists of only a single feature *x* (the current day). Moreover, the prediction also consists of only a single value *y* (the current stock price). To bring the input array into the correct shape, you need to reshape it to this strange-looking matrix form:

[[0],
 [1],
 [2]]

A matrix with only one column is called a *column vector*. You use np.arange() to create the sequence of increasing *x* values; then you use reshape((n, 1)) to convert the one-dimensional NumPy array into a two-dimensional array with one column and n rows (see Chapter 3). Note that scikit-learn allows the output to be a one-dimensional array (otherwise, you would have to reshape the apple data array as well).
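A minimal sketch of this reshaping step:

```python
import numpy as np

n = 3
X = np.arange(n).reshape((n, 1))   # n rows, one column: a column vector
print(X.shape)                     # (3, 1)
print(X)                           # the values 0, 1, 2 stacked vertically
```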

Once it has the training data and the ideal outputs, fit() then does error minimization: it finds the model parameters (that means *line*) so that the difference between the predicted model values and the desired outputs is minimal.

When fit() is satisfied with its model, it’ll return a model that you can use to predict two new stock values by using the predict() function. The predict() function has the same input requirements as fit(), so to satisfy them, you’ll pass a one-column matrix with our two new values that you want predictions for:

print(model.predict([[3],[4]]))

Because our error minimization was zero, you should get perfectly linear outputs of 158 and 159. This fits well along the line of fit plotted in Figure 4-5. But it’s often not possible to find such a perfectly fitting single straight-line linear model. For example, if our stock prices are [157, 156, 159], and you run the same function and plot it, you should get the line in Figure 4-6.

In this case, the fit() function finds the line that minimizes the squared error between the training data and the predictions as described previously.
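You can verify this yourself by refitting the same one-liner on the imperfect data; the resulting coefficients describe the least-squares line rather than a perfect fit:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

apple = np.array([157, 156, 159])   # no straight line hits all three points
n = len(apple)
model = LinearRegression().fit(np.arange(n).reshape((n, 1)), apple)

# Slope and intercept of the best (but imperfect) least-squares fit
print(model.coef_, model.intercept_)
```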

Let’s wrap this up. Linear regression is a machine learning technique whereby your model learns coefficients as model parameters. The resulting linear model (for example, a line in the two-dimensional space) directly provides you with predictions on new input data. This problem of predicting numerical values when given numerical input values belongs to the class of regression problems. In the next section, you’ll learn about another important area in machine learning called classification.

*Figure 4-6: A linear regression model with an imperfect fit*

**Logistic Regression in One Line**

Logistic regression is commonly used for *classification problems*, in which you predict whether a sample belongs to a specific category (or class). This contrasts with regression problems, where you’re given a sample and predict a numerical value that falls into a continuous range. An example classification problem is to divide Twitter users into male and female, given input features such as their *posting frequency* or the *number of tweet replies*. The logistic regression model is one of the most fundamental machine learning models. Many concepts introduced in this section will be the basis of more advanced machine learning techniques.

*The Basics*

To introduce logistic regression, let’s briefly review how linear regression works: given the training data, you compute a line that fits this training data and predicts the outcome for input *x*. In general, linear regression is great for predicting a *continuous* output, whose value can take an infinite number of values. The stock price predicted earlier, for example, could conceivably have been any number of positive values.

But what if the output is not continuous, but *categorical*, belonging to a limited number of groups or categories? For example, let’s say you want to predict the likelihood of lung cancer, given the number of cigarettes a patient smokes. Each patient can either have lung cancer or not. In contrast to the stock price, here you have only these two possible outcomes. Predicting the likelihood of categorical outcomes is the primary motivation for logistic regression.

**The Sigmoid Function**

Whereas linear regression fits a line to the training data, logistic regression fits an S-shaped curve, called *the sigmoid function*. The S-shaped curve helps you make binary decisions (for example, yes/no). For most input values, the sigmoid function will return a value that is either very close to 0 (one category) or very close to 1 (the other category). It’s relatively unlikely that your given input value generates an ambiguous output. Note that it is possible to generate 0.5 probabilities for a given input value—but the shape of the curve is designed in a way to minimize those in practical settings (for most possible values on the horizontal axis, the probability value is either very close to 0 or very close to 1). Figure 4-7 shows a logistic regression curve for the lung cancer scenario.
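The sigmoid function itself is a single line of math: σ(z) = 1 / (1 + e^(−z)). A quick sketch of its S-shaped behavior (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(-6))   # very close to 0 (one category)
print(sigmoid(0))    # exactly 0.5 -- the ambiguous middle
print(sigmoid(6))    # very close to 1 (the other category)
```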

*Figure 4-7: A logistic regression curve that predicts cancer based on cigarette use*

**NOTE**

*You can apply logistic regression for* multinomial classification *to classify the data into more than two classes. To accomplish this, you’ll use the generalization of the sigmoid function, called the* softmax function, *which returns a tuple of probabilities, one for each class. The sigmoid function transforms the input feature(s) into only a single probability value. However, for clarity and readability, I’ll focus on* binomial classification *and the sigmoid function in this section.*

The sigmoid function in Figure 4-7 approximates the probability that a patient has lung cancer, given the number of cigarettes they smoke. This probability helps you make a robust decision on the subject when the only information you have is the number of cigarettes the patient smokes: does the patient have lung cancer?

Have a look at the predictions in Figure 4-8, which shows two new patients (in light gray at the bottom of the graph). You know nothing about them but the number of cigarettes they smoke. You’ve trained our logistic regression model (the sigmoid function) that returns a probability value for any new input value *x*. If the probability given by the sigmoid function is higher than 50 percent, the model predicts *lung cancer positive*; otherwise, it predicts *lung cancer negative*.

*Figure 4-8: Using logistic regression to estimate probabilities of a result*

**Finding the Maximum Likelihood Model**

The main question for logistic regression is how to select the correct sigmoid function that best fits the training data. The answer lies in each model’s *likelihood*: the probability that the model would generate the observed training data. You want to select the model with the maximum likelihood. The intuition is that this model best approximates the real-world process that generated the training data.

To calculate the likelihood of a given model for a given set of training data, you calculate the likelihood for each single training data point, and then multiply those likelihoods together to get the likelihood of the whole set of training data. How do you calculate the likelihood of a single training data point? Simply apply this model’s sigmoid function to the training data point; it’ll give you the probability of the positive class under this model (for a data point with a negative label, the likelihood is one minus that probability). To select the maximum likelihood model for all data points, you repeat this same likelihood computation for different sigmoid functions (shifting the sigmoid function a little bit), as in Figure 4-9.
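This computation can be sketched directly; the tiny data set and the two candidate models below are made up for illustration, and z = a0 + a1·x is what shifts and scales the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up training set: (cigarettes per day, has lung cancer)
data = [(5, 0), (10, 0), (50, 1), (80, 1)]

def likelihood(a0, a1):
    """Probability that the model (a0, a1) generates the observed labels."""
    total = 1.0
    for x, label in data:
        p = sigmoid(a0 + a1 * x)         # P(cancer | x) under this model
        total *= p if label == 1 else 1 - p
    return total

# Compare two candidate sigmoid curves; the higher likelihood wins
print(likelihood(-1.0, 0.05), likelihood(-3.0, 0.1))
```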

In the previous paragraph, I described how to determine the maximum likelihood sigmoid function (model). This sigmoid function fits the data best—so you can use it to predict new data points.

Now that we’ve covered the theory, let’s look at how you’d implement logistic regression as a Python one-liner.

*Figure 4-9: Testing several sigmoid functions to determine maximum likelihood*

*The Code*

You’ve seen an example of using logistic regression for a health application (correlating cigarette consumption with cancer probability). This “virtual doc” application would be a great idea for a smartphone app, wouldn’t it? Let’s program your first virtual doc using logistic regression, as shown in Listing 4-2—in a single line of Python code!

from sklearn.linear_model import LogisticRegression
import numpy as np

## Data (#cigarettes, cancer)
X = np.array([[0, "No"],
              [10, "No"],
              [60, "Yes"],
              [90, "Yes"]])

## One-liner
model = LogisticRegression().fit(X[:,0].reshape(-1,1), X[:,1])

## Result & puzzle
print(model.predict([[2],[12],[13],[40],[90]]))

Take a guess: what’s the output of this code snippet?

*How It Works*

The training data X consists of four patient records (the rows) with two columns. The first column holds the number of cigarettes the patients smoke (*input feature*), and the second column holds the *class labels*, which say whether they ultimately suffered from lung cancer.

You create the model by calling the LogisticRegression() constructor. You call the fit() function on this model; fit() takes two arguments, which are the input (cigarette consumption) and the output class labels (cancer). The fit() function expects a two-dimensional input array format with one row per training data sample and one column per feature of this training data sample. In this case, you have only a single feature value so you transform the one-dimensional input into a two-dimensional NumPy array by using the reshape() operation. The first argument to reshape() specifies the number of rows, and the second specifies the number of columns. You care about only the number of columns, which here is 1. You’ll pass -1 as the number of desired rows, which is a special signal to NumPy to determine the number of rows automatically.

The input training data will look as follows after reshaping (in essence, you simply remove the class labels and keep the two-dimensional array shape intact):

[[0],
 [10],
 [60],
 [90]]

Next, you predict whether a patient has lung cancer, given the number of cigarettes they smoke: your input will be 2, 12, 13, 40, 90 cigarettes. That gives an output as follows:

# ['No' 'No' 'Yes' 'Yes' 'Yes']

The model predicts that the first two patients are lung cancer negative, while the latter three are lung cancer positive.

Let’s look in detail at the probabilities the sigmoid function came up with that lead to this prediction! Simply run the following code snippet after Listing 4-2:

for i in range(20):
    print("x=" + str(i) + " --> " + str(model.predict_proba([[i]])))

The predict_proba() function takes as input the number of cigarettes and returns an array containing the probability of lung cancer negative (at index 0) and the probability of lung cancer positive (index 1). When you run this code, you should get the following output:

x=0 --> [[0.67240789 0.32759211]]
x=1 --> [[0.65961501 0.34038499]]
x=2 --> [[0.64658514 0.35341486]]
x=3 --> [[0.63333374 0.36666626]]
x=4 --> [[0.61987758 0.38012242]]
x=5 --> [[0.60623463 0.39376537]]
x=6 --> [[0.59242397 0.40757603]]
x=7 --> [[0.57846573 0.42153427]]
x=8 --> [[0.56438097 0.43561903]]
x=9 --> [[0.55019154 0.44980846]]
x=10 --> [[0.53591997 0.46408003]]
x=11 --> [[0.52158933 0.47841067]]
x=12 --> [[0.50722306 0.49277694]]
x=13 --> [[0.49284485 0.50715515]]
x=14 --> [[0.47847846 0.52152154]]
x=15 --> [[0.46414759 0.53585241]]
x=16 --> [[0.44987569 0.55012431]]
x=17 --> [[0.43568582 0.56431418]]
x=18 --> [[0.42160051 0.57839949]]
x=19 --> [[0.40764163 0.59235837]]

If the probability of lung cancer being negative is higher than the probability of lung cancer being positive, the predicted outcome will be *lung cancer negative*. This happens the last time for x=12. If the patient has smoked more than 12 cigarettes, the algorithm will classify them as *lung cancer positive*.

In summary, you’ve learned how to solve classification problems easily with logistic regression using the scikit-learn library. The idea of logistic regression is to fit an S-shaped curve (the sigmoid function) to the data. This function assigns a numerical value between 0 and 1 to every new data point and each possible class. The numerical value models the probability of this data point belonging to the given class. However, in practice, you often have training data but no class label assigned to the training data. For example, you have customer data (say, their age and their income) but you don’t know any class label for each data point. To still extract useful insights from this kind of data, you will learn about another category of machine learning next: unsupervised learning. Specifically, you’ll learn about how to find similar clusters of data points, an important subset of unsupervised learning.

**K-Means Clustering in One Line**

If there’s one clustering algorithm you need to know—whether you’re a computer scientist, data scientist, or machine learning expert—it’s the *K-Means algorithm*. In this section, you’ll learn the general idea and when and how to use it in a single line of Python code.

*The Basics*

The previous sections covered supervised learning, in which the training data is *labeled*. In other words, you know the output value of every input value in the training data. But in practice, this isn’t always the case. Often, you’ll find yourself confronted with *unlabeled* data—especially in many data analytics applications—where it’s not clear what “the optimal output” means. In these situations, a prediction is impossible (because there is no output to start with), but you can still distill useful knowledge from these unlabeled data sets (for example, you can find clusters of similar unlabeled data). Models that use unlabeled data fall under the category of *unsupervised learning*.

As an example, suppose you’re working in a startup that serves different target markets with various income levels and ages. Your boss tells you to find a certain number of target personas that best fit your target markets. You can use clustering methods to identify the *average customer personas* that your company serves. Figure 4-10 shows an example.

*Figure 4-10: Observed customer data in the two-dimensional space*

Here, you can easily identify three types of personas with different types of incomes and ages. But how to find those algorithmically? This is the domain of clustering algorithms such as the widely popular K-Means algorithm. Given the data sets and an integer *k*, the K-Means algorithm finds *k* clusters of data such that the difference between the center of a cluster (called the *centroid*) and the data in the cluster is minimal. In other words, you can find the different personas by running the K-Means algorithm on your data sets, as shown in Figure 4-11.

*Figure 4-11: Customer data with customer personas (cluster centroids) in the two-dimensional space*

The cluster centers (black dots) match the clustered customer data. Every cluster center can be viewed as one customer persona. Thus, you have three idealized personas: a 20-year-old earning $2000, a 25-year-old earning $3000, and a 40-year-old earning $4000. And the great thing is that the K-Means algorithm finds those cluster centers even in high-dimensional spaces (where it would be hard for humans to find the personas visually).

The K-Means algorithm requires “the number of cluster centers *k*” as an input. In this case, you look at the data and “magically” define *k* = 3. More advanced algorithms can find the number of cluster centers automatically (for an example, look at the 2004 paper “Learning the k in K-Means” by Greg Hamerly and Charles Elkan).

So how does the K-Means algorithm work? In a nutshell, it performs the following procedure:

Initialize random cluster centers (centroids).
Repeat until convergence:
    Assign every data point to its closest cluster center.
    Recompute each cluster center as the centroid of all data points assigned to it.

This results in multiple loop iterations: you first assign the data to the *k* cluster centers, and then you recompute each cluster center as the centroid of the data assigned to it.
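A minimal from-scratch sketch of this loop, with a hand-picked (rather than random) initialization so the result is deterministic; the toy data points are made up:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = X[[0, 2]].copy()        # hand-picked initial cluster centers

for _ in range(10):                 # a few iterations suffice to converge here
    # Step 1: assign every data point to its closest centroid
    dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: recompute each centroid as the mean of its assigned points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(centroids)   # the two cluster centers: (1.25, 1.5) and (8.5, 8.75)
```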

Let’s implement it!

Consider the following problem: given two-dimensional salary data (*hours worked*, *salary earned*), find two clusters of employees in the given data set that work a similar number of hours and earn a similar salary.

*The Code*

How can you do all of this in a single line of code? Fortunately, the scikit-learn library in Python already has an efficient implementation of the K-Means algorithm. Listing 4-3 shows the one-liner code snippet that runs K-Means clustering for you.

## Dependencies
from sklearn.cluster import KMeans
import numpy as np

## Data (Work (h) / Salary ($))
X = np.array([[35, 7000], [45, 6900], [70, 7100],
              [20, 2000], [25, 2200], [15, 1800]])

## One-liner
kmeans = KMeans(n_clusters=2).fit(X)

## Result & puzzle
cc = kmeans.cluster_centers_
print(cc)

What’s the output of this code snippet? Try to guess a solution even if you don’t understand every syntactical detail. This will expose your knowledge gap and prepare your brain to absorb the algorithm much better.

*How It Works*

In the first lines, you import the KMeans module from the sklearn.cluster package. This module takes care of the clustering itself. You also need to import the NumPy library because the KMeans module works on NumPy arrays.

Our data is two-dimensional. It correlates the number of working hours with the salary of some workers. Figure 4-12 shows the six data points in this employee data set.

*Figure 4-12: Employee salary data*

The goal is to find the two cluster centers that best fit this data:

## One-liner

kmeans = KMeans(n_clusters=2).fit(X)

In the one-liner, you create a new KMeans object that handles the algorithm for you. When you create the KMeans object, you define the number of cluster centers by using the n_clusters argument. Then you simply call the instance method fit(X) to run the K-Means algorithm on the input data X. The KMeans object now holds all the results. All that’s left is to retrieve the results from its attributes:

cc = kmeans.cluster_centers_

print(cc)

Note that in the sklearn package, the convention is to use a trailing underscore for some attribute names (for example, cluster_centers_) to indicate that these attributes were created dynamically during the training phase (the fit() function). Before the training phase, these attributes do not exist yet. This is not a general Python convention (trailing underscores are usually used only to avoid naming conflicts with Python keywords and built-ins: the variable list_ instead of list). However, once you get used to it, you’ll appreciate the consistent use of attributes in the sklearn package. So, what are the cluster centers, and what is the output of this code snippet? Take a look at Figure 4-13.
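Besides cluster_centers_, fit() creates further trailing-underscore attributes, for example labels_ (the cluster index of each training point) and inertia_ (the summed squared distance that the algorithm minimized). A short sketch on the same employee data:

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[35, 7000], [45, 6900], [70, 7100],
              [20, 2000], [25, 2200], [15, 1800]])

kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)     # cluster index of each of the six employees
print(kmeans.inertia_)    # total squared distance to the closest centroids
```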

*Figure 4-13: Employee salary data with cluster centers in the two-dimensional space*

You can see that the two cluster centers are (20, 2000) and (50, 7000). This is also the result of the Python one-liner. These clusters correspond to two idealized employee personas: the first works for 20 hours a week and earns $2000 per month, while the second works for 50 hours a week and earns $7000 per month. Those two types of personas fit the data reasonably well. Thus, the result of the one-liner code snippet is as follows:

## Result & puzzle
cc = kmeans.cluster_centers_
print(cc)
'''
[[ 50. 7000.]
 [ 20. 2000.]]
'''

To summarize, this section introduced you to an important subtopic of unsupervised learning: clustering. The K-Means algorithm is a simple, efficient, and popular way of extracting *k* clusters from multidimensional data. Behind the scenes, the algorithm iteratively recomputes cluster centers and reassigns each data value to its closest cluster center until it finds the optimal clusters. But clusters are not always ideal for finding similar data items. Many data sets do not show a clustered behavior, but you’ll still want to leverage the distance information for machine learning and prediction. Let’s stay in the multidimensional space and explore another way to use the distance of (Euclidean) data values: the K-Nearest Neighbors algorithm.

**K-Nearest Neighbors in One Line**

The popular *K-Nearest Neighbors (KNN)* algorithm is used for regression and classification in many applications such as recommender systems, image classification, and financial data forecasting. It’s the basis of many advanced machine learning techniques (for example, in information retrieval). There is no doubt that understanding KNN is an important building block of your proficient computer science education.

*The Basics*


The KNN algorithm is a robust, straightforward, and popular machine learning method. It’s simple to implement but still a competitive and fast machine learning technique. All other machine learning models we’ve discussed so far use the training data to compute a *representation* of the original data. You can use this representation to predict, classify, or cluster new data. For example, the linear and logistic regression algorithms define learning parameters, while the clustering algorithm calculates cluster centers based on the training data. However, the KNN algorithm is different. In contrast to the other approaches, it does not compute a new model (or representation) but uses the *whole data set* as a model.

Yes, you read that right. The machine learning model is nothing more than a set of observations. Every single instance of your training data is one part of your model. This has advantages and disadvantages. A disadvantage is that the model can quickly blow up as the training data grows—which may require sampling or filtering as a preprocessing step. A great advantage, however, is the simplicity of the training phase (just add the new data values to the model). Additionally, you can use the KNN algorithm for prediction or classification. You execute the following strategy, given your input vector *x*:

- Find the *k* nearest neighbors of *x* (according to a predefined distance metric).
- Aggregate the *k* nearest neighbors into a single prediction or classification value. You can use any aggregator function, such as average, max, or min.

Let’s walk through an example. Your company sells homes for clients. It has acquired a large database of customers and house prices (see Figure 4-14). One day, your client asks how much they should expect to pay for a house of 52 square meters. You query your KNN model, and it immediately gives you the response $33,167. And indeed, your client finds a home for $33,489 the same week. How did the KNN system come to this surprisingly accurate prediction?

First, the KNN system simply calculates the *k = 3* nearest neighbors to the query *D = 52 square meters* using Euclidean distance. The three nearest neighbors are A, B, and C with prices $34,000, $33,500, and $32,000, respectively. Then, it aggregates the three nearest neighbors by calculating the simple average of their values. Because *k = 3* in this example, you denote the model as *3NN*. Of course, you can vary the similarity functions, the parameter *k*, and the aggregation method to come up with more sophisticated prediction models.

*Figure 4-14: Calculating the price of house D based on the three nearest neighbors A, B, and C*
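The aggregation step is easy to verify by hand: averaging the three neighbor prices reproduces the prediction from the example.

```python
## Prices of the three nearest neighbors A, B, and C from Figure 4-14
prices = [34000, 33500, 32000]

## 3NN regression: aggregate by simple average
prediction = sum(prices) / len(prices)
print(round(prediction))  # 33167
```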

Another advantage of KNN is that it can be easily adapted as new observations are made. This is not generally true for machine learning models. An obvious weakness in this regard is that the computational cost of finding the *k* nearest neighbors grows as you add more and more points to the model. To compensate, you can continuously remove stale values from the model.

As I mentioned, you can also use KNN for classification problems. Instead of averaging over the *k* nearest neighbors, you can use a voting mechanism: each nearest neighbor votes for its class, and the class with the most votes wins.
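As a minimal sketch of that voting mechanism (the data and labels here are made up for illustration), sklearn’s KNeighborsClassifier works just like the regressor:

```python
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

## Hypothetical labeled data: house size (square meters) -> market segment
X = np.array([[25], [35], [35], [40], [40], [45]])
y = np.array(["small", "small", "small", "mid", "mid", "mid"])

## Each of the three nearest neighbors votes for its class;
## the class with the most votes wins
KNN = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(KNN.predict([[30]]))  # ['small']
```

For the query 30, the three nearest neighbors are the houses of sizes 25, 35, and 35, which all vote "small".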

*The Code*


Let’s dive into how to use KNN in Python—in a single line of code (see Listing 4-4).

## Dependencies

from sklearn.neighbors import KNeighborsRegressor

import numpy as np

## Data (House Size (square meters) / House Price ($))

X = np.array([[35, 30000], [45, 45000], [40, 50000],

[35, 35000], [25, 32500], [40, 40000]])

## One-liner

KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1])

## Result & puzzle

res = KNN.predict([[30]])

print(res)

Take a guess: what’s the output of this code snippet?

*How It Works*


To help you see the result, let’s plot the housing data from this code in Figure 4-15.

*Figure 4-15: Housing data in the two-dimensional space*

Can you see the general trend? With the growing size of your house, you can expect a linear growth of its market price. Double the square meters, and the price will double too.

In the code (see Listing 4-4), the client requests your price prediction for a house of 30 square meters. What does KNN with *k = 3* (in short, 3NN) predict? Take a look at Figure 4-16.

Beautiful, isn’t it? The KNN algorithm finds the three closest houses with respect to house size and predicts the house price as the average of the *k = 3* nearest neighbors. Thus, the result is $32,500.

If you are confused by the data conversions in the one-liner, let me quickly explain what is happening here:

KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1])

*Figure 4-16: Housing data in the two-dimensional space with predicted house price for a new data point (house size equals 30 square meters) using KNN*

First, you create a new machine learning model called KNeighborsRegressor. If you wanted to use KNN for classification, you’d use KNeighborsClassifier.

Second, you train the model by using the fit() function with two parameters. The first parameter defines the input (the house size), and the second parameter defines the output (the house price). The shape of both parameters must be an array-like data structure. For example, to use 30 as an input, you’d have to pass it as [30]. The reason is that, in general, the input can be multidimensional rather than one-dimensional. Therefore, you reshape the input:

print(X[:,0])

# [35 45 40 35 25 40]

print(X[:,0].reshape(-1,1))

"""

[[35]

[45]

[40]

[35]

[25]

[40]]

"""

Notice that if you were to use this 1D NumPy array as an input to the fit() function, the function wouldn’t work because it expects an array of (array-like) observations, and not an array of integers.

In summary, this one-liner taught you how to create your first KNN regressor in a single line of code. If you have a lot of changing data and model updates, KNN is your best friend! Let’s move on to a wildly popular machine learning model these days: neural networks.

**Neural Network Analysis in One Line**

Neural networks have gained massive popularity in recent years. This is in part because the algorithms and learning techniques in the field have improved, but also because of improved hardware and the rise of general-purpose GPU (GPGPU) technology. In this section, you’ll learn about the *multilayer perceptron (MLP)*, which is one of the most popular neural network representations. After reading this, you’ll be able to write your own neural network in a single line of Python code!

*The Basics*


For this one-liner, I have prepared a special data set with fellow Python colleagues on my email list. My goal was to create a relatable real-world data set, so I asked my email subscribers to participate in a data-generation experiment for this chapter.

**The Data**

If you’re reading this book, you’re interested in learning Python. To create an interesting data set, I asked my email subscribers six anonymized questions about their Python expertise and income. The responses to these questions will serve as training data for the simple neural network example (as a Python one-liner).

The training data is based on the answers to the following six questions:

- How many hours have you looked at Python code in the last seven days?
- How many years ago did you start to learn about computer science?
- How many coding books are on your shelf?
- What percentage of your Python time do you spend working on real-world projects?
- How much do you earn per month (round to $1000) from selling your technical skills (in the widest sense)?
- What’s your approximate Finxter rating, rounded to 100 points?

The first five questions will be your input, and the sixth question will be the output for the neural network analysis. In this one-liner section, you’re examining neural network regression. In other words, you predict a numerical value (your Python skills) based on numerical input features. We’re not going to explore neural network classification in this book, which is another great strength of neural networks.

The sixth question approximates the skill level of a Python coder. Finxter (*https://finxter.com/*) is our puzzle-based learning application that assigns a rating value to any Python coder based on their performance in solving Python puzzles. In this way, it helps you quantify your skill level in Python.

Let’s start with visualizing how each question influences the output (the skill rating of a Python developer), as shown in Figure 4-17.

*Figure 4-17: Relationship between questionnaire answers and the Python skill rating at Finxter*

Note that these plots show only how each separate feature (question) impacts the final Finxter rating, but they tell you nothing about the impact of a combination of two or more features. Note also that some Pythonistas didn’t answer all six questions; in those cases, I used the dummy value -1.

**What Is an Artificial Neural Network?**

The idea of creating a theoretical model of the human brain (the biological neural network) has been studied extensively in recent decades. But the foundations of artificial neural networks were proposed as early as the 1940s and ’50s! Since then, the concept of artificial neural networks has been refined and continually improved.

The basic idea is to break the big task of learning and inference into multiple micro-tasks. These micro-tasks are not independent but interdependent. The brain consists of billions of neurons that are connected with trillions of synapses. In the simplified model, learning is merely adjusting the *strength* of synapses (also called *weights* or *parameters* in artificial neural networks). So how do you “create” a new synapse in the model? Simple—you increase its weight from zero to a nonzero value.

Figure 4-18 shows a basic neural network with three layers (input, hidden, output). Each layer consists of multiple neurons that are connected from the input layer via the hidden layer to the output layer.

*Figure 4-18: A simple neural network analysis for animal classification*

In this example, the neural network is trained to detect animals in images. In practice, you would use one input neuron per pixel of the image as an input layer. This can result in millions of input neurons that are connected with millions of hidden neurons. Often, each output neuron is responsible for one bit of the overall output. For example, to detect two different animals (say, cats and dogs), you’ll use only a single neuron in the output layer that can model two different states (0 = cat, 1 = dog).

The idea is that each neuron can be activated, or “fired”, when a certain input impulse arrives at the neuron. Each neuron decides independently, based on the strength of the input impulse, whether to fire or not. This way, you simulate the human brain, in which neurons activate each other via impulses. The activation of the input neurons propagates through the network until the output neurons are reached. Some output neurons will be activated, and others won’t. The specific pattern of firing output neurons forms your final output (or prediction) of the artificial neural network. In your model, a firing output neuron could encode a 1, and a nonfiring output neuron could encode a 0. This way, you can train your neural network to predict anything that can be encoded as a series of 0s and 1s (which is everything a computer can represent).

Let’s have a detailed look at how neurons work mathematically, in Figure 4-19.

*Figure 4-19: Mathematical model of a single neuron: the output is a function of the three inputs.*

Each neuron is connected to other neurons, but not all connections are equal. Instead, each connection has an associated weight. Formally, a firing neuron propagates an impulse of 1 to the outgoing neighbors, while a nonfiring neuron propagates an impulse of 0. You can think of the weight as indicating how much of the impulse of the firing input neuron is forwarded to the neuron via the connection. Mathematically, you multiply the impulse by the weight of the connection to calculate the input for the next neuron. In our example, the neuron simply sums over all inputs to calculate its own output. This is the *activation function* that describes how exactly the inputs of a neuron generate an output. In our example, a neuron fires with higher likelihood if its relevant input neurons fire too. This is how the impulses propagate through the neural network.
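This weighted-sum behavior of a single neuron can be sketched in a few lines of plain Python (the impulses and weights here are illustrative):

```python
## A single neuron: multiply each incoming impulse by its connection
## weight and sum the results (the simple activation from the text)
def neuron(impulses, weights):
    return sum(x * w for x, w in zip(impulses, weights))

## Two of the three input neurons fire (impulse 1), one doesn't (impulse 0)
print(neuron([1, 0, 1], [0.5, 0.8, 0.2]))
```

The output is 0.5 · 1 + 0.8 · 0 + 0.2 · 1 = 0.7: the nonfiring neuron contributes nothing, however large its weight.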

What does the learning algorithm do? It uses the training data to select the weights *w* of the neural network. Given a training input value *x*, different weights *w* lead to different outputs. Hence, the learning algorithm gradually changes the weights *w*—in many iterations—until the output layer produces similar results as the training data. In other words, the training algorithm gradually reduces the error of correctly predicting the training data.

There are many network structures, training algorithms, and activation functions. This chapter shows you a hands-on approach of using the neural network now, within a single line of code. You can then learn the finer details as you need to improve upon this (for example, you could start by reading the “Neural Network” entry on Wikipedia, *https://en.wikipedia.org/wiki/Neural_network*).

*The Code*


The goal is to create a neural network that predicts the Python skill level (Finxter rating) by using the five input features (answers to the questions):

**WEEK** How many hours have you been exposed to Python code in the last seven days?

**YEARS** How many years ago did you start to learn about computer science?

**BOOKS** How many coding books are on your shelf?

**PROJECTS** What percentage of your Python time do you spend implementing real-world projects?

**EARN** How much do you earn per month (round to $1000) from selling your technical skills (in the widest sense)?

Again, let’s stand on the shoulders of giants and use the scikit-learn (sklearn) library for neural network regression, as in Listing 4-5.

## Dependencies

from sklearn.neural_network import MLPRegressor

import numpy as np

## Questionnaire data (WEEK, YEARS, BOOKS, PROJECTS, EARN, RATING)

X = np.array(

[[20, 11, 20, 30, 4000, 3000],

[12, 4, 0, 0, 1000, 1500],

[2, 0, 1, 10, 0, 1400],

[35, 5, 10, 70, 6000, 3800],

[30, 1, 4, 65, 0, 3900],

[35, 1, 0, 0, 0, 100],

[15, 1, 2, 25, 0, 3700],

[40, 3, -1, 60, 1000, 2000],

[40, 1, 2, 95, 0, 1000],

[10, 0, 0, 0, 0, 1400],

[30, 1, 0, 50, 0, 1700],

[1, 0, 0, 45, 0, 1762],

[10, 32, 10, 5, 0, 2400],

[5, 35, 4, 0, 13000, 3900],

[8, 9, 40, 30, 1000, 2625],

[1, 0, 1, 0, 0, 1900],

[1, 30, 10, 0, 1000, 1900],

[7, 16, 5, 0, 0, 3000]])

## One-liner

neural_net = MLPRegressor(max_iter=10000).fit(X[:,:-1], X[:,-1])

## Result

res = neural_net.predict([[0, 0, 0, 0, 0]])

print(res)

It’s impossible for a human to correctly figure out the output—but would you like to try?

*How It Works*


In the first few lines, you create the data set. The machine learning algorithms in the scikit-learn library use a similar input format. Each row is a single observation with multiple features. The more rows, the more training data exists; the more columns, the more features per observation. In this case, you have five input features and one output feature for each training example.

The one-liner creates a neural network by using the constructor of the MLPRegressor class. I passed max_iter=10000 as an argument because the training doesn’t converge when using the default number of iterations (max_iter=200).

After that, you call the fit() function, which determines the parameters of the neural network. After calling fit(), the neural network has been successfully initialized. The fit() function takes a multidimensional input array (one observation per row, one feature per column) and a one-dimensional output array (size = number of observations).
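The slicing that splits the training array into inputs and outputs is standard NumPy; here is a minimal illustration on the first two rows of the questionnaire data:

```python
import numpy as np

## Two rows of the questionnaire data (WEEK, YEARS, BOOKS, PROJECTS, EARN, RATING)
X = np.array([[20, 11, 20, 30, 4000, 3000],
              [12, 4, 0, 0, 1000, 1500]])

inputs = X[:, :-1]   # every column except the last
outputs = X[:, -1]   # only the last column
print(inputs.shape)  # (2, 5)
print(outputs)       # [3000 1500]
```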

The only thing left is calling the predict function on some input values:

## Result

res = neural_net.predict([[0, 0, 0, 0, 0]])

print(res)

# [94.94925927]

Note that the actual output may vary slightly because of the nondeterministic nature of the function and the different convergence behavior.

In plain English: if . . .

- . . . you have trained 0 hours in the last week,
- . . . you started your computer science studies 0 years ago,
- . . . you have 0 coding books in your shelf,
- . . . you spend 0 percent of your time implementing real Python projects, and
- . . . you earn $0 selling your coding skills,

the neural network estimates that your skill level is *very* low (a Finxter rating of 94 means you have difficulty understanding the Python program print("hello, world")).

So let’s change this: what happens if you invest 20 hours a week learning and revisit the neural network after one week:

## Result

res = neural_net.predict([[20, 0, 0, 0, 0]])

print(res)

# [440.40167562]

Not bad—your skills improve quite significantly! But you’re still not happy with this rating number, are you? (An above-average Python coder has at least a 1500–1700 rating on Finxter.)

No problem. Buy 10 Python books (only nine left after this one). Let’s see what happens to your rating:

## Result

res = neural_net.predict([[20, 0, 10, 0, 0]])

print(res)

# [953.6317602]

Again, you make significant progress and double your rating number! But buying Python books alone will not help you much. You need to study them! Let’s do this for a year:

## Result

res = neural_net.predict([[20, 1, 10, 0, 0]])

print(res)

# [999.94308353]

Not much happens. This is where I don’t trust the neural network too much. In my opinion, you should have reached a much better performance of at least 1500. But this also shows that the neural network can be only as good as its training data. You have very limited data, and the neural network can’t really overcome this limitation: there’s just too little knowledge in a handful of data points.

But you don’t give up, right? Next, you spend 50 percent of your Python time selling your skills as a Python freelancer:

## Result

res = neural_net.predict([[20, 1, 10, 50, 1000]])

print(res)

# [1960.7595547]

Boom! Suddenly the neural network considers you to be an expert Python coder. A wise prediction of the neural network, indeed! Learn Python for at least a year and do practical projects, and you’ll become a great coder.

To sum up, you’ve learned about the basics of neural networks and how to use them in a single line of Python code. Interestingly, the questionnaire data indicates that starting out with practical projects—maybe even doing freelance projects from the beginning—matters a lot to your learning success. The neural network certainly knows that. If you want to learn my exact strategy of becoming a freelancer, join the free webinar about state-of-the-art Python freelancing at *https://blog.finxter.com/webinar-freelancer/*.

In the next section, you’ll dive deeper into another powerful model representation: decision trees. While neural networks can be quite expensive to train (they often need multiple machines and many hours, and sometimes even weeks, to train), decision trees are lightweight. Nevertheless, they are a fast, effective way to extract patterns from your training data.

**Decision-Tree Learning in One Line**

*Decision trees* are powerful and intuitive tools in your machine learning toolbelt. A big advantage of decision trees is that, unlike many other machine learning techniques, they’re human-readable. You can easily train a decision tree and show it to your supervisors, who do not need to know anything about machine learning in order to understand what your model does. This is especially great for data scientists who often must defend and present their results to management. In this section, I’ll show you how to use decision trees in a single line of Python code.

*The Basics*


Unlike many machine learning algorithms, the ideas behind decision trees might be familiar from your own experience. They represent a structured way of making decisions. Each decision opens new branches. By answering a bunch of questions, you’ll finally land on the recommended outcome. Figure 4-20 shows an example.

*Figure 4-20: A simplified decision tree for recommending a study subject*

Decision trees are used for classification problems such as “which subject should I study, given my interests?” You start at the top. Now, you repeatedly answer questions and select the choices that describe your features best. Finally, you reach a *leaf node* of the tree, a node with no *children*. This is the recommended class based on your feature selection.

Decision-tree learning has many nuances. In the preceding example, the first question carries more weight than the last question. If you like math, the decision tree will never recommend art or linguistics. This is useful because some features may be much more important for the classification decision than others. For example, a classification system that predicts your current health may use your sex (feature) to practically rule out many diseases (classes).

Hence, the order of the decision nodes lends itself to performance optimizations: place the features at the top that have a high impact on the final classification. In decision-tree learning, you’ll then aggregate the questions with little impact on the final classification, as shown in Figure 4-21.

*Figure 4-21: Pruning improves efficiency of decision-tree learning.*

Suppose the full decision tree looks like the tree on the left. For any combination of features, there’s a separate classification outcome (the tree leaves). However, some features may not give you any additional information with respect to the classification problem (for example, the first Language decision node in the example). Decision-tree learning would effectively get rid of these nodes for efficiency reasons, a process called *pruning*.

*The Code*


You can create your own decision tree in a single line of Python code. Listing 4-6 shows you how.

## Dependencies

from sklearn import tree

import numpy as np

## Data: student scores in (math, language, creativity) --> study field

X = np.array([[9, 5, 6, "computer science"],

[1, 8, 1, "linguistics"],

[5, 7, 9, "art"]])

## One-liner

Tree = tree.DecisionTreeClassifier().fit(X[:,:-1], X[:,-1])

## Result & puzzle

student_0 = Tree.predict([[8, 6, 5]])

print(student_0)

student_1 = Tree.predict([[3, 7, 9]])

print(student_1)

Guess the output of this code snippet!

*How It Works*


The data in this code describes three students with their estimated skill levels (a score from 1–10) in the three areas of math, language, and creativity. You also know the study subjects of these students. For example, the first student is highly skilled in math and studies computer science. The second student is skilled in language much more than in the other two skills and studies linguistics. The third student is skilled in creativity and studies art.

The one-liner creates a new decision-tree object and trains the model by using the fit() function on the labeled training data (the last column is the label). Internally, it creates three nodes, one for each feature: math, language, and creativity. When predicting the class of student_0 (math = 8, language = 6, creativity = 5), the decision tree returns computer science. It has learned that this feature pattern (high, medium, medium) is an indicator of the first class. On the other hand, when asked for (3, 7, 9), the decision tree predicts art because it has learned that the score (low, medium, high) hints to the third class.

Note that the algorithm is nondeterministic. In other words, when executing the same code twice, different results may arise. This is common for machine learning algorithms that work with random generators. In this case, the order of the features is randomly organized, so the final decision tree may have a different order of the features.
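If you need reproducible results, you can pin the random seed with the random_state parameter of DecisionTreeClassifier; a sketch on the same student data (with numeric scores rather than the string array from the listing):

```python
from sklearn import tree
import numpy as np

## Student scores (math, language, creativity) and their study fields
X = np.array([[9, 5, 6], [1, 8, 1], [5, 7, 9]])
y = np.array(["computer science", "linguistics", "art"])

## Fixing random_state makes the learned tree identical across runs
Tree = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
print(Tree.predict([[8, 6, 5]]))  # ['computer science']
```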

To summarize, decision trees are an intuitive way of creating human-readable machine learning models. Every branch represents a choice based on a single feature of a new sample. The leaves of the tree represent the final prediction (classification or regression). Next, we’ll leave concrete machine learning algorithms for a moment and explore a critical concept in machine learning: variance.

**Get Row with Minimal Variance in One Line**

You may have read about the Vs in Big Data: volume, velocity, variety, veracity, and value. *Variance* is yet another important V: it measures the expected (squared) deviation of the data from its mean. In practice, variance is an important measure with relevant application domains in financial services, weather forecasting, and image processing.

*The Basics*


Variance measures how much data spreads around its average in the one-dimensional or multidimensional space. You’ll see a graphical example in a moment. In fact, variance is one of the most important properties in machine learning. It captures the patterns of the data in a generalized manner—and machine learning is all about pattern recognition.

Many machine learning algorithms rely on variance in one form or another. For instance, the *bias-variance trade-off* is a well-known problem in machine learning: sophisticated machine learning models risk overfitting the data (high variance) but represent the training data very accurately (low bias). On the other hand, simple models often generalize well (low variance) but do not represent the data accurately (high bias).

So what exactly is variance? It’s a simple statistical property that captures how much the data set spreads from its mean. Figure 4-22 shows an example plotting two data sets: one with low variance, and one with high variance.

*Figure 4-22: Variance comparison of two company stock prices*

This example shows the stock prices of two companies. The stock price of the tech startup fluctuates heavily around its average. The stock price of the food company is quite stable and fluctuates only in minor ways around the average. In other words, the tech startup has high variance, and the food company has low variance.

In mathematical terms, you can calculate the variance *var(X)* of a set of numerical values *X* by using the following formula:

var(X) = (1/|X|) · Σx∈X (x − x̄)²

The value x̄ is the average value of the data in *X*.
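You can compute the variance by hand in plain Python. For example, for the values [1, 1, 2, 2]:

```python
## Variance: the mean squared deviation from the average
X = [1, 1, 2, 2]
mean = sum(X) / len(X)                          # 1.5
var = sum((x - mean) ** 2 for x in X) / len(X)
print(var)  # 0.25
```

Each value deviates from the mean by 0.5, so the variance is 0.5² = 0.25.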

*The Code*


As they get older, many investors want to reduce the overall risk of their investment portfolio. According to the dominant investment philosophy, you should consider stocks with lower variance as less-risky investment vehicles. Roughly speaking, you can lose less money investing in a stable, predictable, and large company than in a small tech startup.

The goal of the one-liner in Listing 4-7 is to identify the stock in your portfolio with minimal variance. By investing more money into this stock, you can expect a lower overall variance of your portfolio.

## Dependencies

import numpy as np

## Data (rows: stocks / cols: stock prices)

X = np.array([[25,27,29,30],

[1,5,3,2],

[12,11,8,3],

[1,1,2,2],

[2,6,2,2]])

## One-liner

# Find the stock with smallest variance

min_row = min([(i,np.var(X[i,:])) for i in range(len(X))], key=lambda x: x[1])

## Result & puzzle

print("Row with minimum variance: " + str(min_row[0]))

print("Variance: " + str(min_row[1]))

What’s the output of this code snippet?

*How It Works*


As usual, you first define the data you want to run the one-liner on (see the top of Listing 4-7). The NumPy array X contains five rows (one row per stock in your portfolio) with four values per row (stock prices).

The goal is to find the ID and variance of the stock with minimal variance. Hence, the outermost function of the one-liner is the min() function. You execute the min() function on a sequence of tuples (a,b), where the first tuple value a is the row index (stock index), and the second tuple value b is the variance of the row.

You may ask: what’s the minimal value of a sequence of tuples? Of course, you need to properly define this operation before using it. To this end, you use the key argument of the min() function. The key argument takes a function that returns a comparable object value, given a sequence value. Again, our sequence values are tuples, and you need to find the tuple with minimal variance (the second tuple value). Because variance is the second value, you’ll return x[1] as the basis for comparison. In other words, the tuple with the minimal second tuple value wins.

Let’s look at how to create the sequence of tuple values. You use list comprehension to create a tuple for any row index (stock). The first tuple element is simply the index of row *i*. The second tuple element is the variance of this row. You use the NumPy var() function in combination with slicing to calculate the row variance.
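Unrolled, the list comprehension builds the following sequence of (row index, variance) pairs before min() selects the pair with the smallest second element (float() is used here only so that plain numbers are printed):

```python
import numpy as np

X = np.array([[25, 27, 29, 30],
              [1, 5, 3, 2],
              [12, 11, 8, 3],
              [1, 1, 2, 2],
              [2, 6, 2, 2]])

## One (index, variance) pair per row
pairs = [(i, float(np.var(X[i, :]))) for i in range(len(X))]
print(pairs)
# [(0, 3.6875), (1, 2.1875), (2, 12.25), (3, 0.25), (4, 3.0)]

print(min(pairs, key=lambda x: x[1]))  # (3, 0.25)
```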

The result of the one-liner is, therefore, as follows:

"""

Row with minimum variance: 3

Variance: 0.25

"""

I’d like to add that there’s an alternative way of solving this problem. If this wasn’t a book about Python one-liners, I would prefer the following solution instead of the one-liner:

var = np.var(X, axis=1)

min_row = (np.where(var==min(var)), min(var))

In the first line, you calculate the variance of the NumPy array X along the columns (axis=1). In the second line, you create the tuple. The first tuple value is the index of the minimum in the variance array. The second tuple value is the minimum in the variance array. Note that multiple rows may have the same (minimal) variance.

This solution is more readable. So clearly, there is a trade-off between conciseness and readability. Just because you can cram everything into a single line of code doesn’t mean you should. All things being equal, it’s much better to write concise *and* readable code, instead of blowing up your code with unnecessary definitions, comments, or intermediate steps.

After learning the basics of variance in this section, you’re now ready to absorb how to calculate basic statistics.

**Basic Statistics in One Line**

As a data scientist and machine learning engineer, you need to know basic statistics. Some machine learning algorithms are entirely based on statistics (for example, Bayesian networks).

For example, extracting basic statistics from matrices (such as average, variance, and standard deviation) is a critical component for analyzing a wide range of data sets such as financial data, health data, or social media data. With the rise of machine learning and data science, knowing about how to use NumPy—which is at the heart of Python data science, statistics, and linear algebra—will become more and more valuable to the marketplace.

In this one-liner, you’ll learn how to calculate basic statistics with NumPy.

*The Basics*


This section explains how to calculate the average, the standard deviation, and the variance along an axis. These three calculations are very similar; if you understand one, you’ll understand all of them.

Here’s what you want to achieve: given a NumPy array of stock data with rows indicating the different companies and columns indicating their daily stock prices, the goal is to find the average and standard deviation of each company’s stock price (see Figure 4-23).

*Figure 4-23: Average and variance along axis 1*

This example shows a two-dimensional NumPy array, but in practice, the array can have much higher dimensionality.

**Simple Average, Variance, Standard Deviation**

Before examining how to accomplish this in NumPy, let’s slowly build the background you need to know. Say you want to calculate the simple average, the variance, or the standard deviation over all values in a NumPy array. You’ve already seen examples of the average and the variance function in this chapter. The standard deviation is simply the square root of the variance. You can achieve this easily with the following functions:

import numpy as np

X = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

print(np.average(X))
# 2.0

print(np.var(X))
# 2.4444444444444446

print(np.std(X))
# 1.5634719199411433

You may have noticed that you applied those functions to the two-dimensional NumPy array X. But NumPy simply flattens the array and calculates the functions on the flattened array. For example, the simple average of the flattened NumPy array X is calculated as follows:

(1 + 3 + 5 + 1 + 1 + 1 + 0 + 2 + 4) / 9 = 18 / 9 = 2.0
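A quick check confirms this equivalence: averaging the two-dimensional array gives the same result as averaging the explicitly flattened array.

```python
import numpy as np

X = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

# Averaging the 2D array is equivalent to averaging the
# flattened 1D array:
assert np.average(X) == np.average(X.flatten()) == 2.0
```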

**Calculating Average, Variance, Standard Deviation Along an Axis**

However, sometimes you want to calculate these functions along an axis. You can do this by specifying the keyword axis as an argument to the average, variance, and standard deviation functions (see Chapter 3 for a detailed introduction to the axis argument).
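Before the full listing, here’s a small sketch of the axis argument on the array X from above: axis=0 aggregates over the rows (one result per column), while axis=1 aggregates over the columns (one result per row).

```python
import numpy as np

X = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

# axis=0 collapses the rows: one average per column (2/3, 2, 10/3)
print(np.average(X, axis=0))

# axis=1 collapses the columns: one average per row
print(np.average(X, axis=1))
# [3. 1. 2.]
```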

*The Code*

Listing 4-8 shows you exactly how to calculate the average, variance, and standard deviation along an axis. Our goal is to calculate the averages, variances, and standard deviations of all stocks in a two-dimensional matrix with rows representing stocks and columns representing daily prices.

## Dependencies
import numpy as np

## Stock Price Data: 5 companies
# (row=[price_day_1, price_day_2, ...])
x = np.array([[8, 9, 11, 12],
              [1, 2, 2, 1],
              [2, 8, 9, 9],
              [9, 6, 6, 3],
              [3, 3, 3, 3]])

## One-liner
avg, var, std = np.average(x, axis=1), np.var(x, axis=1), np.std(x, axis=1)

## Result & puzzle
print("Averages: " + str(avg))
print("Variances: " + str(var))
print("Standard Deviations: " + str(std))

Guess the output of the puzzle!

*How It Works*

The one-liner uses the axis keyword to specify the axis along which to calculate the average, variance, and standard deviation. For example, if you perform these three functions along axis=1, each row is aggregated into a single value. Hence, the dimensionality of the resulting NumPy array is reduced by one.
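You can confirm the reduced dimensionality by checking the array shapes. Here’s a small sketch using the stock data from Listing 4-8:

```python
import numpy as np

x = np.array([[8, 9, 11, 12],
              [1, 2, 2, 1],
              [2, 8, 9, 9],
              [9, 6, 6, 3],
              [3, 3, 3, 3]])

avg = np.average(x, axis=1)

# Aggregating along axis=1 collapses the 5x4 matrix into
# one value per row:
assert x.shape == (5, 4)
assert avg.shape == (5,)
```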

The result of the puzzle is the following:

"""

Averages: [10. 1.5 7. 6. 3. ]

Variances: [2.5 0.25 8.5 4.5 0. ]

Standard Deviations: [1.58113883 0.5 2.91547595 2.12132034 0. ]

"""

Before moving on to the next one-liner, I want to show you how to use the same idea for an even higher-dimensional NumPy array.

When averaging along an axis for high-dimensional NumPy arrays, you’ll always aggregate the axis defined in the axis argument. Here’s an example:

import numpy as np

x = np.array([[[1, 2], [1, 1]],
              [[1, 1], [2, 1]],
              [[1, 0], [0, 0]]])

print(np.average(x, axis=2))
print(np.var(x, axis=2))
print(np.std(x, axis=2))

"""
[[1.5 1. ]
 [1.  1.5]
 [0.5 0. ]]
[[0.25 0.  ]
 [0.   0.25]
 [0.25 0.  ]]
[[0.5 0. ]
 [0.  0.5]
 [0.5 0. ]]
"""

The code shows three examples of computing the average, variance, and standard deviation along axis 2 (see Chapter 3), the innermost axis. In other words, all values along axis 2 are combined into a single value, so axis 2 is dropped from the resulting array. Dive into the three examples and figure out how exactly axis 2 is collapsed into a single average, variance, or standard deviation value.
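To see exactly how axis 2 collapses, you can recompute one entry by hand. Here’s a quick check of the first innermost pair [1, 2]:

```python
import numpy as np

x = np.array([[[1, 2], [1, 1]],
              [[1, 1], [2, 1]],
              [[1, 0], [0, 0]]])

avg = np.average(x, axis=2)

# Entry [0, 0] is the average of the first innermost pair [1, 2]:
assert avg[0, 0] == (1 + 2) / 2 == 1.5

# Axis 2 is dropped: shape (3, 2, 2) becomes (3, 2)
assert x.shape == (3, 2, 2)
assert avg.shape == (3, 2)
```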

To summarize, a wide range of data sets (including financial data, health data, and social media data) requires you to be able to extract basic insights from your data sets. This section gives you a deeper understanding of how to use the powerful NumPy toolset to extract basic statistics quickly and efficiently from multidimensional arrays. This is needed as a basic preprocessing step for many machine learning algorithms.

**Classification with Support-Vector Machines in One Line**

*Support-vector machines* (*SVMs*) have gained massive popularity in recent years because they have robust classification performance, even in high-dimensional spaces. Surprisingly, SVMs work even if there are more dimensions (features) than data items. This is unusual for classification algorithms because of the *curse of dimensionality*: with increasing dimensionality, the data becomes extremely sparse, which makes it hard for algorithms to find patterns in the data set. Understanding the basic ideas of SVMs is a fundamental step to becoming a sophisticated machine learning engineer.

*The Basics*

How do classification algorithms work? They use the training data to find a decision boundary that divides data in the one class from data in the other class (in “Logistic Regression in One Line” on page 89, the decision boundary would be whether the probability of the sigmoid function is below or above the 0.5 threshold).

**A High-Level Look at Classification**

Figure 4-24 shows an example of a general classifier.

*Figure 4-24: Diverse skill sets of computer scientists and artists*

Suppose you want to build a recommendation system for aspiring university students. The figure visualizes the training data consisting of users classified according to their skills in two areas: logic and creativity. Some people have high logic skills and relatively low creativity; others have high creativity and relatively low logic skills. The first group is labeled as *computer scientists*, and the second group is labeled as *artists*.

To classify new users, the machine learning model must find a decision boundary that separates the computer scientists from the artists. Roughly speaking, you’ll classify a user by where they fall with respect to the decision boundary. In the example, you’ll classify users who fall into the left area as computer scientists, and users who fall into the right area as artists.

In the two-dimensional space, the decision boundary is either a line or a (higher-order) curve. The former is called a *linear classifier*, and the latter is called a *nonlinear classifier*. In this section, we’ll explore only linear classifiers.

Figure 4-24 shows three decision boundaries that are all valid separators of the data. In our example, it’s impossible to quantify which of the given decision boundaries is better; they all lead to perfect accuracy when classifying the training data.

**But What Is the Best Decision Boundary?**

Support-vector machines provide a unique and beautiful answer to this question. Arguably, the best decision boundary provides a maximal margin of safety. In other words, SVMs maximize the distance between the closest data points and the decision boundary. The goal is to minimize the risk of misclassifying new points that fall close to the decision boundary.

Figure 4-25 shows an example.

*Figure 4-25: Support-vector machines maximize the margin of error.*

The SVM classifier finds the respective support vectors so that the zone between the support vectors is as wide as possible. Here, the support vectors are the data points that lie on the two dotted lines parallel to the decision boundary. These lines are called *margins*. The decision boundary is the line in the middle, with maximal distance to the margins. Because the zone between the margins and the decision boundary is maximized, the *margin of error* is expected to be maximal when classifying new data points. This idea leads to high classification accuracy for many practical problems.
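In scikit-learn, you can inspect the support vectors of a trained linear SVM via the support_vectors_ attribute. Here’s a minimal sketch on two made-up, linearly separable classes (the data points are for illustration only):

```python
import numpy as np
from sklearn import svm

# Two made-up, linearly separable classes
X = np.array([[1.0, 1.0], [2.0, 1.0],   # class 0
              [4.0, 4.0], [5.0, 4.0]])  # class 1
y = np.array([0, 0, 1, 1])

model = svm.SVC(kernel="linear").fit(X, y)

# The support vectors are the training points closest to the
# decision boundary; they define the margins:
print(model.support_vectors_)
```

With this toy data, the support vectors are typically the innermost points of each class, the ones that would lie on the dotted margin lines in Figure 4-25.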

*The Code*

Is it possible to create your own SVM in a single line of Python code? Take a look at Listing 4-9.

## Dependencies
from sklearn import svm
import numpy as np

## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [10, 1, 2, "computer science"],
              [1, 8, 1, "literature"],
              [4, 9, 3, "literature"],
              [0, 1, 10, "art"],
              [5, 7, 9, "art"]])

## One-liner
svm = svm.SVC().fit(X[:,:-1], X[:,-1])

## Result & puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)

student_1 = svm.predict([[8, 1, 1]])
print(student_1)

Guess the output of this code.

*How It Works*

The code breaks down how you can use support-vector machines in Python in the most basic form. The NumPy array holds the labeled training data with one row per user and one column per feature (skill level in math, language, and creativity). The last column is the label (the class).

Because you have three-dimensional data, the support-vector machine separates the data by using two-dimensional planes (the linear separator) rather than one-dimensional lines. As you can see, it’s also possible to separate three classes rather than only two as shown in the preceding examples.

The one-liner itself is straightforward: you first create the model by using the constructor of the svm.SVC class (*SVC* stands for *support-vector classification*). Then, you call the fit() function to perform the training based on your labeled training data.

In the results part of the code snippet, you call the predict() function on new observations. Because student_0 has skills indicated as math=3, language=3, and creativity=6, the support-vector machine predicts that the label *art* fits this student’s skills. Similarly, student_1 has skills indicated as math=8, language=1, and creativity=1. Thus, the support-vector machine predicts that the label *computer science* fits this student’s skills.

Here’s the final output of the one-liner:

## Result & puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)
# ['art']

student_1 = svm.predict([[8, 1, 1]])
print(student_1)
# ['computer science']

In summary, SVMs perform well even in high-dimensional spaces when there are more features than training data vectors. The idea of maximizing the *margin of safety* is intuitive and leads to robust performance when classifying *boundary cases*—that is, vectors that fall within the margin of safety. In the final section of this chapter, we’ll zoom one step back and have a look at a meta-algorithm for classification: ensemble learning with random forests.

**Classification with Random Forests in One Line**

Let’s move on to an exciting machine learning technique: *ensemble learning*. Here’s my quick-and-dirty tip if your prediction accuracy is lacking but you need to meet the deadline at all costs: try this meta-learning approach that combines the predictions (or classifications) of multiple machine learning algorithms. In many cases, it will give you better last-minute results.

*The Basics*

In the previous sections, you’ve studied multiple machine learning algorithms that you can use to get quick results. However, different algorithms have different strengths. For example, neural network classifiers can generate excellent results for complex problems. However, they are also prone to overfitting the data because of their powerful capacity to memorize fine-grained patterns of the data. Ensemble learning for classification problems partially overcomes the problem that you often don’t know in advance which machine learning technique works best.

How does this work? You create a meta-classifier consisting of multiple types or instances of basic machine learning algorithms. In other words, you train multiple models. To classify a single observation, you ask all models to classify the input independently. Next, you return the class that was returned most often, given your input, as a *meta-prediction*. This is the final output of your ensemble learning algorithm.
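The majority vote itself takes only a few lines of Python. Here’s a sketch using collections.Counter on three hypothetical model predictions:

```python
from collections import Counter

# Hypothetical predictions of three independently trained models
# for the same input observation:
predictions = ["computer science", "computer science", "art"]

# The meta-prediction is the class with the most votes:
meta_prediction = Counter(predictions).most_common(1)[0][0]
print(meta_prediction)
# computer science
```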

*Random forests* are a special type of ensemble learning algorithm that focuses on decision-tree learning. A forest consists of many trees; similarly, a random forest consists of many decision trees. Each decision tree is built by injecting randomness into the tree-generation procedure during the training phase (for example, which tree node to select first). This leads to a variety of decision trees—exactly what you want.

Figure 4-26 shows how the prediction works for a trained random forest using the following scenario. Alice has high math and language skills. The *ensemble* consists of three decision trees (building a random forest). To classify Alice, each decision tree is queried about Alice’s classification. Two of the decision trees classify Alice as a computer scientist. Because this is the class with the most votes, it’s returned as the final output for the classification.

*Figure 4-26: Random forest classifier aggregating the output of three decision trees*

*The Code*

Let’s stick to this example of classifying the study field based on a student’s skill level in three areas (math, language, creativity). You may think that implementing an ensemble learning method is complicated in Python. But it’s not, thanks to the comprehensive scikit-learn library (see Listing 4-10).

## Dependencies
import numpy as np
from sklearn.ensemble import RandomForestClassifier

## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [5, 1, 5, "computer science"],
              [8, 8, 8, "computer science"],
              [1, 10, 7, "literature"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"],
              [1, 1, 6, "art"]])

## One-liner
Forest = RandomForestClassifier(n_estimators=10).fit(X[:,:-1], X[:,-1])

## Result
students = Forest.predict([[8, 6, 5],
                           [3, 7, 9],
                           [2, 2, 1]])
print(students)

Take a guess: what’s the output of this code snippet?

*How It Works*

After initializing the labeled training data in Listing 4-10, the code creates a random forest by using the constructor on the class RandomForestClassifier with one parameter n_estimators that defines the number of trees in the forest. Next, you populate the model that results from the previous initialization (an empty forest) by calling the function fit(). To this end, the input training data consists of all but the last column of array X, while the labels of the training data are defined in the last column. As in the previous examples, you use slicing to extract the respective columns from the data array X.

The classification part is slightly different in this code snippet. I wanted to show you how to classify multiple observations instead of only one. You can achieve this here by creating a multidimensional array with one row per observation.

Here’s the output of the code snippet:

## Result
students = Forest.predict([[8, 6, 5],
                           [3, 7, 9],
                           [2, 2, 1]])
print(students)
# ['computer science' 'art' 'art']

Note that the result is still nondeterministic (the result may be different for different executions of the code) because the random forest algorithm relies on the random number generator that returns different numbers at different points in time. You can make this call deterministic by using the integer argument random_state. For example, you can set random_state=1 when calling the random forest constructor: RandomForestClassifier(n_estimators=10, random_state=1). In this case, each time you create a new random forest classifier, the same output results because the same random numbers are created: they are all based on the seed integer 1.
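Here’s a quick sketch demonstrating this determinism, reusing the training data from Listing 4-10: two forests created with the same random_state produce identical predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [5, 1, 5, "computer science"],
              [8, 8, 8, "computer science"],
              [1, 10, 7, "literature"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"],
              [1, 1, 6, "art"]])

# Two forests seeded with the same random_state make
# identical predictions:
f1 = RandomForestClassifier(n_estimators=10, random_state=1).fit(X[:,:-1], X[:,-1])
f2 = RandomForestClassifier(n_estimators=10, random_state=1).fit(X[:,:-1], X[:,-1])

new_students = [[8, 6, 5], [3, 7, 9], [2, 2, 1]]
assert (f1.predict(new_students) == f2.predict(new_students)).all()
```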

In summary, this section introduced a meta-approach for classification: using the output of various decision trees to reduce the variance of the classification error. This is one version of ensemble learning, which combines multiple basic models into a single meta-model that’s able to leverage their individual strengths.

**NOTE**

*Two different decision trees can lead to a high variance of the error: one generates good results, while the other one doesn’t. By using random forests, you mitigate this effect.*

Variations of this idea are common in machine learning—and if you need to quickly improve your prediction accuracy, simply run multiple machine learning models and evaluate their output to find the best one (a quick-and-dirty secret of machine learning practitioners). In a way, ensemble learning techniques automatically perform the task that’s often done by experts in practical machine learning pipelines: selecting, comparing, and combining the output of different machine learning models. The big strength of ensemble learning is that this can be done individually for each data value at runtime.

**Summary**

This chapter covered 10 basic machine learning algorithms that are fundamental to your success in the field. You’ve learned about regression algorithms to predict values, such as linear regression, KNNs, and neural networks. You’ve learned about classification algorithms such as logistic regression, decision-tree learning, SVMs, and random forests. Furthermore, you’ve learned how to calculate basic statistics of multidimensional data arrays, and to use the K-Means algorithm for unsupervised learning. These algorithms and methods are among the most important in the field of machine learning, and there is a lot more to study if you want to start working as a machine learning engineer. That learning will pay off—machine learning engineers usually earn six figures in the United States (a simple web search should confirm this)! For students who want to dive deeper into machine learning, I recommend the excellent (and free) Coursera course from Andrew Ng. You can find the course material online by asking your favorite search engine.

In the next chapter, you’ll study one of the most important (and most undervalued) skills of highly efficient programmers: regular expressions. While this chapter was a bit more on the conceptual side (you learned the general ideas, but the scikit-learn library did the heavy lifting), the next chapter will be highly technical. So, roll up your sleeves and read on!