Overview

This chapter introduces classification problems, classification using linear and logistic regression, K-nearest neighbors, and decision trees. You will also be briefly introduced to artificial neural networks as a type of classification technique.

By the end of this chapter, you will be able to implement logistic regression and explain how it can be used to classify data into specific groups or classes. You will also be able to use the k-nearest neighbors algorithm for classification and decision trees for data classification, including the ID3 algorithm. Additionally, you will be able to identify the entropy within data and explain how decision trees such as ID3 aim to reduce entropy.

# Introduction

In the previous chapters, we began our supervised machine learning journey using regression techniques, predicting a continuous output variable from a given set of input data. We will now turn to the other type of supervised machine learning problem: classification. Recall that classification tasks aim to assign given input data to one of two or more specified classes.

So, while regression is a task of estimating a continuous value for given input data (for example, estimating the price of a house given its location and dimensions as input data), classification is about predicting a (discrete) label for given input data. For example, a well-known machine learning classification task is the spam detection of emails, where the task is to predict whether a given email is *spam* or *not_spam*. Here, *spam* and *not_spam* are the labels for this task and the input data is the email, or rather the textual data contained in the different fields of the email, such as subject, body, and receiver. The textual data would be preprocessed into numerical features in order to be usable for a classification model. Because there are only two labels in this task, it is known as a binary classification task. And if there are more than two labels in a classification task, it is called a multiclass classification task.

There are various kinds of classification models with different learning algorithms, each having their pros and cons. But essentially, all models are trained using a labeled dataset and, once trained, can predict labels for unlabeled data samples. In this chapter, we will extend the concepts learned in *Chapter 3*, *Linear Regression*, and *Chapter 4*, *Autoregression*, and will apply them to a dataset labeled with classes, rather than continuous values, as output. We will discuss some of the well-known classification models and apply them to some example labeled datasets.

# Ordinary Least Squares as a Classifier

We covered **ordinary least squares** (**OLS**) as linear regression in the context of predicting continuous variable output in *Chapter 3*, *Linear Regression*, but it can also be used to predict the class that a set of data is a member of. OLS-based classifiers are not as powerful as other types of classifiers that we will cover in this chapter, but they are particularly useful in understanding the process of classification. To recap, an OLS-based classifier is a non-probabilistic, linear binary classifier. It is non-probabilistic because it does not generate any confidence score for its predictions, unlike, for example, logistic regression. It is a linear classifier because it is linear with respect to its parameters/coefficients.

Now, let's say we had a fictional dataset containing two separate groups, Xs and Os, as shown in *Figure 5.1*. We could construct a linear classifier by first using OLS linear regression to fit the equation of a straight line to the dataset. For any value that lies above the line, the *X* class would be predicted, and for any value beneath the line, the *O* class would be predicted. Any dataset that can be separated by a straight line is known as linearly separable (as in our example), which forms an important subset of data types in machine learning problems. The straight line, in this case, would be called the decision boundary. More generally, the decision boundary is defined as the hyperplane separating the data. In this case, the decision boundary is linear. There could be cases where a decision boundary can be non-linear. Datasets such as the one in our example can be learned by **linear classifiers** such as an OLS-based classifier, or **support vector machines** (**SVMs**) with linear kernels.

However, this does not mean that a linear model can only have a linear decision boundary. A linear classifier/model is a model that is linear with respect to the parameters/weights (β) of the model, but not necessarily with respect to inputs (**x**). Depending on how the input features are constructed, a linear model may have a linear or non-linear decision boundary. As mentioned before, examples of linear models include OLS, SVM with a linear kernel, and logistic regression, while examples of non-linear models include KNN, random forest, decision tree, and ANN. We will cover more of these models in the later parts of this chapter.
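As a rough illustration of the point above, the following sketch fits an OLS model to squared input features. The model stays linear in its weights, but its decision boundary in the original input space is a closed curve. The synthetic data, the +1/-1 label encoding, and the `make_features` helper are assumptions made for illustration and are not part of the chapter's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D points; class +1 lies inside the unit circle, class -1 outside
X = rng.uniform(-2, 2, size=(2000, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1.0, -1.0)

def make_features(X):
    """Hypothetical helper: a bias column plus the squared inputs."""
    return np.column_stack([np.ones(len(X)), X ** 2])

# OLS fit: linear in beta even though the features are non-linear in x
beta, *_ = np.linalg.lstsq(make_features(X), y, rcond=None)

# Classify by the sign of the fitted value; the boundary
# beta[0] + beta[1] * x1**2 + beta[2] * x2**2 = 0 is an ellipse in (x1, x2)
pred = np.where(make_features(X) @ beta > 0, 1.0, -1.0)
accuracy = (pred == y).mean()
print(f"Training accuracy: {accuracy:.3f}")
```

The boundary found this way is not the exact unit circle, because OLS minimizes squared error rather than classification error, but it separates most of the points.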

## Exercise 5.01: Ordinary Least Squares as a Classifier

This exercise contains a contrived example of using OLS as a classifier. In this exercise, we will use a completely fictional dataset, and test how the OLS model fares as a classifier. In order to implement OLS, we will use the **LinearRegression** API of **sklearn**. The dataset is composed of manually selected *x* and *y* values for a scatterplot which are approximately divided into two groups. The dataset has been specifically designed for this exercise, to demonstrate how linear regression can be used as a classifier, and this is available in the accompanying code files for this book, as well as on GitHub, at https://packt.live/3a7oAY8:

- Import the required packages:
import matplotlib.pyplot as plt

import matplotlib.lines as mlines

import numpy as np

import pandas as pd

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

- Load the **linear_classifier.csv** dataset into a pandas DataFrame:
df = pd.read_csv('../Datasets/linear_classifier.csv')

df.head()

The output will be as follows:

Looking through the dataset, each row contains a set of *x, y* coordinates, as well as the label corresponding to which class the data belongs to, either a cross (**x**) or a circle (**o**).

- Produce a scatterplot of the data with the marker for each point as the corresponding class label:
plt.figure(figsize=(10, 7))

for label, label_class in df.groupby('labels'):
    plt.scatter(label_class.values[:,0], label_class.values[:,1],
                label=f'Class {label}', marker=label, c='k')

plt.legend()

plt.title("Linear Classifier");

We'll get the following scatterplot:

- In order to impartially evaluate the model, we should split the dataset into a training set and a test set. We make that train/test split in the ratio 60:40 in the following step:
df_train, df_test = train_test_split(df.copy(), test_size=0.4, \

random_state=12)

- Using the scikit-learn **LinearRegression** API from the previous chapter, fit a linear model to the **x**, **y** coordinates of the training dataset and print out the linear equation:
# Fit a linear regression model

model = LinearRegression()

model.fit(df_train.x.values.reshape((-1, 1)), \

df_train.y.values.reshape((-1, 1)))

# Print out the parameters

print(f'y = {model.coef_[0][0]}x + {model.intercept_[0]}')

The output will be as follows:

y = 1.2718120805369124x + 8.865771812080538

Note

Throughout the exercises and activities in this chapter, owing to randomization, there could be a minor variation in the outputs presented here and those that you might obtain.

- Plot the fitted trendline over the test dataset:
# Plot the trendline

trend = model.predict(np.linspace(0, 10).reshape((-1, 1)))

plt.figure(figsize=(10, 7))

for label, label_class in df_test.groupby('labels'):
    plt.scatter(label_class.values[:,0], label_class.values[:,1],
                label=f'Class {label}', marker=label, c='k')

plt.plot(np.linspace(0, 10), trend, c='k', label='Trendline')

plt.legend()

plt.title("Linear Classifier");

The output will be as follows:

- With the fitted trendline, the classifier can then be applied. For each row in the test dataset, determine whether the *x, y* point lies above or below the linear model (or trendline). If the point lies below the trendline, the model predicts the **o** class; if above the line, the **x** class is predicted. Include these values as a column of predicted labels:
# Make predictions

y_pred = model.predict(df_test.x.values.reshape((-1, 1)))

pred_labels = []

for _y, _y_pred in zip(df_test.y, y_pred):
    # Below the trendline -> 'o'; on or above it -> 'x'
    if _y < _y_pred:
        pred_labels.append('o')
    else:
        pred_labels.append('x')

df_test['Pred Labels'] = pred_labels

df_test.head()

The output will be as follows:

- Plot the points with the corresponding ground truth labels. For those points where the labels were correctly predicted, plot the corresponding class. For those incorrect predictions, plot a diamond:
plt.figure(figsize=(10, 7))

for idx, label_class in df_test.iterrows():
    # Misclassified points are drawn as larger diamonds
    if label_class.labels != label_class['Pred Labels']:
        label = 'D'
        s = 70
    else:
        label = label_class.labels
        s = 50
    plt.scatter(label_class.values[0], label_class.values[1],
                label=f'Class {label}', marker=label, c='k', s=s)

plt.plot(np.linspace(0, 10), trend, c='k', label='Trendline')

plt.title("Linear Classifier");

incorrect_class = mlines.Line2D([], [], color='k', marker='D', \

markersize=10, \

label='Incorrect Classification');

plt.legend(handles=[incorrect_class]);

The output will be as follows:

We can see that, in this plot, the linear classifier made two incorrect predictions in this completely fictional dataset, one at *x = 1*, and another at *x = 3*.

Note

To access the source code for this specific section, please refer to https://packt.live/3hT3Fwy.

You can also run this example online at https://packt.live/3fECHai. You must execute the entire Notebook in order to get the desired result.
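The visual count of misclassified points in Exercise 5.01 can also be expressed as an accuracy figure. The sketch below does this on synthetic data that stands in for **linear_classifier.csv**; the data-generating line, the noise level, and the random seed are all assumptions chosen for illustration, and, unlike the exercise, no train/test split is made.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(12)

# Hypothetical stand-in for linear_classifier.csv: two clouds of points
# scattered above ('x') and below ('o') a rising line
x = rng.uniform(0, 10, size=200)
offset = np.where(rng.random(200) < 0.5, 3.0, -3.0)
y = 1.3 * x + 9.0 + offset + rng.normal(0, 1.0, size=200)
labels = np.where(offset > 0, 'x', 'o')

# Fit the trendline through all points, ignoring the labels
model = LinearRegression().fit(x.reshape(-1, 1), y)

# Points below the line are predicted 'o'; points on or above it 'x'
y_line = model.predict(x.reshape(-1, 1))
pred_labels = np.where(y < y_line, 'o', 'x')

accuracy = (pred_labels == labels).mean()
print(f"Accuracy: {accuracy:.3f}")
```

Because the two clouds here are well separated relative to the noise, the accuracy should be close to 1; with overlapping groups it would drop accordingly.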

But what if our dataset is not linearly separable and we cannot classify the data using a straight-line model, which is very frequently the case? Furthermore, the preceding approach doesn't give us a measure of confidence regarding the predictions. To cope with these challenges, we turn to other classification methods, many of which use different models, but the process logically flows from our simplified linear classifier model.

# Logistic Regression

The logistic, or logit, model is a linear model that has been effectively used for classification tasks in a number of different domains. Like the OLS model from the previous section, the logistic regression model takes as input a linear combination of the input features. In this section, we will use it to classify images of handwritten digits. In understanding the logistic model, we also take an important step in understanding the operation of a particularly powerful machine learning model: artificial neural networks. So, what exactly is the logistic model? Just as the OLS model is composed of a linear or straight-line function, the logistic model is composed of the standard logistic function, which, in mathematical terms, looks something like this:

$$p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

In practical terms, when trained, this function returns the probability of the input information belonging to a particular class or group. In the preceding equation, **x** is the input feature vector (an array of numbers, each representing a feature of the input data), **β**1 is the parameter vector of the model that has to be learned by training the model, **β**0 is the bias term or offset term (yet another parameter) that helps the model to deal with any constant value offsets in the relationship between input (**x**) and output (**y**), and **p(x)** is the output probability of the data sample **x** belonging to a certain class. For example, if we have two classes, A and B, then **p(x)** is the probability of class A and **1-p(x)** is the probability of class B.

So, how did we arrive at the logistic function? Well, the logistic regression model arises from the desire to model the log of the odds in favor of a data point belonging to class A of the two classes (A and B) via linear functions in **x**. The model has the following form:

$$\log\left(\frac{p(\text{class}=A)}{p(\text{class}=B)}\right) = \beta_0 + \beta_1 x$$

We are considering the case of binary classification here, with just two classes, A and B, although we could easily extend the discussion to multiclass classification as well using the one-versus-all classification trick. More on that will be discussed in a subsequent section. But for now, because we know there are only two classes, we know that:

$$p(\text{class}=A) + p(\text{class}=B) = 1$$

Using the preceding two equations, we can get:

$$\log\left(\frac{p(\text{class}=A)}{1 - p(\text{class}=A)}\right) = \beta_0 + \beta_1 x$$

And now, if we consider class A as our target class, we can replace **p(class=A)** with **y** (target output):

$$\log\left(\frac{y}{1 - y}\right) = \beta_0 + \beta_1 x$$

The left-hand side of the preceding equation is popularly known as the log-odds, as it is the logarithm of the odds, that is, the ratio of the probability of class A to the probability of class B. So, why is this important? Because the log-odds of logistic regression is linear with respect to the input **x**, logistic regression is a linear model, even though its output probability is a non-linear function of **x**.

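By rearranging the preceding equation slightly, we get the logistic function:

$$y = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$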

Notice the exponent of **e**, that is, **β**0 + **β**1**x**, and that this relationship is a linear function of the two training parameters or *weights*, β0 and β1, as well as the input feature vector, **x**. If we were to assume β0 = 0 and β1 = 1, and plot the logistic function over the range **(-6, 6)**, we would get the following result:

Note

The sigmoid curve centers around the point x = -β0 when β1 = 1 (more generally, x = -β0/β1). So, if β0 is nonzero, the curve would not be centered at x = 0 as it is in the preceding figure.

Examining *Figure 5.13*, we notice some important aspects of classification. The first thing to note is that, if we look at the probability values on the **y** axis at the extremes of the function, the values are almost at zero when **x = -6** and at one when **x = 6**. While it looks like the values are in fact **0** and **1**, this is not exactly the case. The logistic function approaches zero and one at these extremes and will only equal zero and one when **x** is at a positive or negative infinity. In practical terms, what this means is that the logistic function will never return a probability of greater than or equal to one, or less than or equal to zero, which is perfect for a classification task. In any event, we could never have a probability of greater than one since, by definition, a probability of one is a certainty of an event occurring. Likewise, we cannot have a probability of less than zero since, by definition, a probability of zero is a certainty of the event not occurring. The fact that the logistic function approaches but never equals one or zero means that there is always some uncertainty in the outcome or the classification.

The final aspect to notice regarding the logistic function is that at **x = 0**, the probability is 0.5, which, if we were to get this result, would indicate that the model is equally uncertain about the outcome of the corresponding class; that is, it really has no idea.
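To make these observations concrete, the following minimal sketch evaluates the logistic function at the two ends of the plotted range and at its center. The function name `logistic` and its default arguments (β0 = 0, β1 = 1, matching the plot described above) are choices made for this illustration.

```python
import numpy as np

def logistic(x, beta0=0.0, beta1=1.0):
    """Standard logistic function applied to the linear term beta0 + beta1 * x."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

for x in (-6.0, 0.0, 6.0):
    print(f"x = {x:+.1f}  ->  p(x) = {logistic(x):.5f}")
```

The values at x = -6 and x = 6 come out at roughly 0.0025 and 0.9975, close to but never exactly 0 and 1, while the value at x = 0 is exactly 0.5.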

Note

It is very important to correctly understand and interpret the probability information provided by classification models such as logistic regression. Consider this probability score as the chance of the input information belonging to a particular class given the variability in the information provided by the training data. One common mistake is to use this probability score as an objective measure of whether the model can be trusted regarding its prediction; unfortunately, this isn't necessarily the case. For example, *a model can provide a probability of 99.99% that some data belongs to a particular class and might still be absolutely wrong*.

What we do use the probability value for is selecting the class predicted by the classifier. Between the model outputting the probability and us deciding the predicted class lies the probability threshold value. We need to decide on a threshold value, **τ**, between 0 and 1. Taking the model output here to be the probability of class B, the two classes (say, A and B) can then be defined as:

- Data samples with a model output probability between 0 and τ belong to class A.
- Data samples with a model output probability between τ and 1 belong to class B.

Now, say we had a model that was to predict whether some set of data belonged to class A or class B, and we decided the threshold to be 0.5 (which is actually a very common choice). If the logistic model returned a probability of 0.7, then we would return class B as the predicted class for the model. If the probability was only 0.2, the predicted class for the model would be class A.
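The thresholding rule described above can be sketched in a few lines. The helper name `predict_class` and the class names are hypothetical; the only logic is the comparison of each output probability against τ.

```python
import numpy as np

def predict_class(probabilities, tau=0.5):
    """Map model output probabilities to class labels using threshold tau."""
    probabilities = np.asarray(probabilities)
    # Outputs above tau are assigned to class B, the rest to class A
    return np.where(probabilities > tau, 'B', 'A')

print(predict_class([0.7, 0.2]))           # ['B' 'A']
print(predict_class([0.7, 0.2], tau=0.8))  # ['A' 'A']
```

Raising the threshold from 0.5 to 0.8 changes the first prediction from B to A, which shows that the predicted class depends on the chosen threshold as well as on the model output.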

## Exercise 5.02: Logistic Regression as a Classifier – Binary Classifier

For this exercise, we will be using a sample of the famous MNIST dataset (available at http://yann.lecun.com/exdb/mnist/ or on GitHub at https://packt.live/3a7oAY8), which is a collection of images of handwritten digits, 0 through 9, with corresponding labels. The MNIST dataset comprises 60,000 training samples and 10,000 test samples, where each sample is a grayscale image with a size of 28 x 28 pixels. In this exercise, we will use logistic regression to build a classifier. The first classifier we will build is a binary classifier, where we will determine whether the image is a handwritten 0 or a 1:

- For this exercise, we will need to import a few dependencies. Execute the following import statements:
import struct

import numpy as np

import gzip

import urllib.request

import matplotlib.pyplot as plt

from array import array

from sklearn.linear_model import LogisticRegression

- We will also need to download the MNIST datasets. You will only need to do this once, so after this step, feel free to comment out or remove these cells. Download the image data, as follows:
request = urllib.request.urlopen('http://yann.lecun.com/exdb'
                                 '/mnist/train-images-idx3-ubyte.gz')

with open('../Datasets/train-images-idx3-ubyte.gz', 'wb') as f:
    f.write(request.read())

request = urllib.request.urlopen('http://yann.lecun.com/exdb'
                                 '/mnist/t10k-images-idx3-ubyte.gz')

with open('../Datasets/t10k-images-idx3-ubyte.gz', 'wb') as f:
    f.write(request.read())

- Download the corresponding labels for the data:
request = urllib.request.urlopen('http://yann.lecun.com/exdb'
                                 '/mnist/train-labels-idx1-ubyte.gz')

with open('../Datasets/train-labels-idx1-ubyte.gz', 'wb') as f:
    f.write(request.read())

request = urllib.request.urlopen('http://yann.lecun.com/exdb'
                                 '/mnist/t10k-labels-idx1-ubyte.gz')

with open('../Datasets/t10k-labels-idx1-ubyte.gz', 'wb') as f:
    f.write(request.read())

- Once all the files have been successfully downloaded, check that they are present in the local directory using the following command:
!ls *.gz  # on Windows, use !dir *.gz instead

The output will be as follows:

t10k-images-idx3-ubyte.gz train-images-idx3-ubyte.gz

t10k-labels-idx1-ubyte.gz train-labels-idx1-ubyte.gz

Note

The **!ls *.gz** command applies to Linux and macOS; on Windows, use **!dir *.gz**.

- Load the downloaded data. Don't worry too much about the exact details of reading the data, as these are specific to the MNIST dataset:
# Image files start with four big-endian 32-bit integers:
# magic number, number of images, rows, and columns
with gzip.open('../Datasets/train-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img = np.array(array("B", f.read())).reshape((size, rows, cols))

# Label files start with two: magic number and number of labels
with gzip.open('../Datasets/train-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels = np.array(array("B", f.read()))

with gzip.open('../Datasets/t10k-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img_test = np.array(array("B", f.read()))\
        .reshape((size, rows, cols))

with gzip.open('../Datasets/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels_test = np.array(array("B", f.read()))

- As always, having a thorough understanding of the data is key, so create an image plot of the first 10 images in the training sample. Notice the grayscale images and the fact that the corresponding labels are the digits 0 through 9:
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(img[i], cmap='gray');
    plt.title(f'{labels[i]}');
    plt.axis('off')

The output will be as follows:

- As the initial classifier is aiming to classify either images of zeros or images of ones, we must first select these samples from the dataset:
samples_0_1 = np.where((labels == 0) | (labels == 1))[0]

images_0_1 = img[samples_0_1]

labels_0_1 = labels[samples_0_1]

samples_0_1_test = np.where((labels_test == 0) | (labels_test == 1))

images_0_1_test = img_test[samples_0_1_test]\

.reshape((-1, rows * cols))

labels_0_1_test = labels_test[samples_0_1_test]

- Visualize one sample from the 0 selection and another from the handwritten 1 digits to ensure that we have correctly allocated the data.
Here is the code for 0:

sample_0 = np.where((labels == 0))[0][0]

plt.imshow(img[sample_0], cmap='gray');

The output will be as follows:

Here is the code for 1:

sample_1 = np.where((labels == 1))[0][0]

plt.imshow(img[sample_1], cmap='gray');

The output will be as follows:

- We are almost at the stage where we can start building the model. However, as each sample is an image and has data in a matrix format, we must first rearrange each of the images. The model needs the images to be provided in vector form, that is, all the information for each image is stored in one row. Execute this as follows:
images_0_1 = images_0_1.reshape((-1, rows * cols))

images_0_1.shape

- Now, we can build and fit the logistic regression model with the selected images and labels:
model = LogisticRegression(solver='liblinear')

model.fit(X=images_0_1, y=labels_0_1)

The output will be as follows:

Note how the scikit-learn API calls for logistic regression are consistent with those for linear regression. There is an additional argument, **solver**, which specifies the type of optimization process to be used. We have provided this argument here with the default value to suppress a future warning in this version of scikit-learn that requires **solver** to be specified. The specifics of the **solver** argument are beyond the scope of this chapter and have only been included to suppress the warning message.

- Check the performance of this model against the corresponding training data:
model.score(X=images_0_1, y=labels_0_1)

The output will be as follows:

1.0

In this example, the model was able to predict the training labels with 100% accuracy.

- Display the first two predicted labels for the training data using the model:
model.predict(images_0_1)[:2]

The output will be as follows:

array([0, 1], dtype=uint8)

- How is the logistic regression model making the classification decisions? Look at some of the probabilities produced by the model for the training set:
model.predict_proba(images_0_1)[:2]

The output will be as follows:

array([[9.99999999e-01, 9.89532857e-10],

[4.56461358e-09, 9.99999995e-01]])

We can see that, for each prediction made, there are two probability values. For the prediction of each image, the first value is the probability that it is an image of digit **0**, and the second value is the probability of digit **1**. These two values add up to 1. We can see that, in the first example, the prediction probability is 0.9999999 for digit **0** and, hence, the prediction is digit **0**. Similarly, the inverse is true for the second example.

Note

The probabilities should ideally add up to 1 but, due to computational limits and truncation errors, it is almost 1.

- Compute the performance of the model against the test set to check its performance against data that it has not seen:
model.score(X=images_0_1_test, y=labels_0_1_test)

The output will be as follows:

0.9995271867612293

Note

Refer to *Chapter 7*, *Model Evaluation*, for better methods of objectively measuring the model's performance.

We can see here that logistic regression is a powerful classifier that is able to distinguish between handwritten samples of 0 and 1.

Note

To access the source code for this specific section, please refer to https://packt.live/3dqqEvH.

You can also run this example online at https://packt.live/3hT6FJm. You must execute the entire Notebook in order to get the desired result.

Now that we have trained a logistic regression model on a binary classification problem, let's extend the model to multiple classes. Essentially, we will be using the same dataset and, instead of classifying into just two classes or digits, 0 and 1, we classify into all 10 classes, or digits 0–9. In essence, multiclass classification for logistic regression works as one-versus-all classification. That is, for classification into the 10 classes, we will be training 10 binary classifiers. Each classifier will have 1 digit as the first class, and all the other 9 digits as the second class. In this way, we get 10 binary classifiers that are then collectively used to make predictions. In other words, we get the prediction probabilities from each of the 10 binary classifiers and the final output digit/class is one whose classifier gave the highest probability.
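The one-versus-all scheme described above can be sketched by hand as follows. To keep the example small and self-contained, it uses the 8 x 8 digits dataset bundled with scikit-learn (`load_digits`) rather than MNIST; that substitution, the scaling by 16, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Small 8 x 8 digits dataset bundled with scikit-learn (not MNIST)
digits = load_digits()
X = digits.data / 16.0   # pixel values in this dataset range from 0 to 16
y = digits.target

# Train one binary classifier per digit: "this digit" versus "all others"
classifiers = []
for digit in range(10):
    clf = LogisticRegression(solver='liblinear')
    clf.fit(X, (y == digit).astype(int))
    classifiers.append(clf)

# Column k holds classifier k's probability that each sample is digit k
scores = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])

# The predicted digit is the one whose classifier gave the highest probability
predictions = scores.argmax(axis=1)
accuracy = (predictions == y).mean()
print(f"Training accuracy: {accuracy:.3f}")
```

In practice, scikit-learn can handle the multiclass case internally, as the next exercise shows; the explicit loop here is only meant to expose the mechanics.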

## Exercise 5.03: Logistic Regression – Multiclass Classifier

In the previous exercise, we examined using logistic regression to classify input into one of two groups. Logistic regression, however, can also be used to classify a set of input information into **k** different groups, and it is this multiclass classifier we will be investigating in this exercise. The process for loading the MNIST training and test data is identical to the previous exercise:

- Import the required packages:
import struct

import numpy as np

import gzip

import urllib.request

import matplotlib.pyplot as plt

from array import array

from sklearn.linear_model import LogisticRegression

- Load the training/test images and the corresponding labels:
with gzip.open('../Datasets/train-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img = np.array(array("B", f.read()))\
        .reshape((size, rows, cols))

with gzip.open('../Datasets/train-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels = np.array(array("B", f.read()))

with gzip.open('../Datasets/t10k-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img_test = np.array(array("B", f.read()))\
        .reshape((size, rows, cols))

with gzip.open('../Datasets/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels_test = np.array(array("B", f.read()))

- Visualize a sample of the data:
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(img[i], cmap='gray');
    plt.title(f'{labels[i]}');
    plt.axis('off')

The output will be as follows:

- Given that the training data is so large, we will select a subset of the overall data to reduce the training time as well as the system resources required for the training process:
np.random.seed(0) # Give consistent random numbers

selection = np.random.choice(len(img), 5000)

selected_images = img[selection]

selected_labels = labels[selection]

Note that, in this example, we are using data from all 10 classes, not just classes 0 and 1, so we are making this example a multiclass classification problem.

- Again, reshape the input data in vector form for later use:
selected_images = selected_images.reshape((-1, rows * cols))

selected_images.shape

The output will be as follows:

(5000, 784)

- The next cell is intentionally commented out. Leave this code commented out for the moment:
# selected_images = selected_images / 255.0

# img_test = img_test / 255.0

- Construct the logistic model. There are a few extra arguments, as follows: the **lbfgs** value for **solver** is geared up for multiclass problems, with additional **max_iter** iterations required for converging on a solution. The **multi_class** argument is set to **multinomial** to calculate the loss over the entire probability distribution:
model = LogisticRegression(solver='lbfgs',
                           multi_class='multinomial',
                           max_iter=500, tol=0.1)

model.fit(X=selected_images, y=selected_labels)

The output will be as follows:

Note

Refer to the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html for more information on the arguments.

- Determine the accuracy score against the training set:
model.score(X=selected_images, y=selected_labels)

The output will be as follows:

1.0

- Determine the first two predictions for the training set and plot the images with the corresponding predictions:
model.predict(selected_images)[:2]

The output will be as follows:

array([4, 1], dtype=uint8)

- Show the images for the first two samples of the training set to see whether we are correct:
plt.subplot(1, 2, 1)

plt.imshow(selected_images[0].reshape((28, 28)), cmap='gray');

plt.axis('off');

plt.subplot(1, 2, 2)

plt.imshow(selected_images[1].reshape((28, 28)), cmap='gray');

plt.axis('off');

The output will be as follows:

- Again, print out the probability scores provided by the model for the first sample of the training set. Confirm that there are 10 different values for each of the 10 classes in the set:
model.predict_proba(selected_images)[0]

The output will be as follows:

Notice that, in the probability array of the first sample, the fifth (index four) value is the highest probability, thereby indicating a prediction of **4**.

- Compute the accuracy of the model against the test set. This will provide a reasonable estimate of the model's *in the wild* performance, as it has never seen the data in the test set. It is expected that the accuracy rate of the test set will be slightly lower than the training set, given that the model has not been exposed to this data:
model.score(X=img_test.reshape((-1, rows * cols)), y=labels_test)

The output will be as follows:

0.878

When checked against the test set, the model produced an accuracy level of 87.8%. When applying a test set, a performance drop is expected, as this is the very first time the model has seen these samples; while, during training, the training set was repeatedly shown to the model.

- Find the cell with the commented-out code, as shown in *Step 6*. Uncomment the code in this cell:
selected_images = selected_images / 255.0

img_test = img_test / 255.0

This cell simply scales all the image values to between 0 and 1. Grayscale images consist of pixels with values from 0 to 255 inclusive, where 0 is black and 255 is white.

- Click **Restart** & **Run-All** to rerun the entire notebook.
- Find the training set accuracy:
model.score(X=selected_images, y=selected_labels)

We'll get the following score:

0.986

- Find the test set accuracy:
model.score(X=img_test.reshape((-1, rows * cols)), y=labels_test)

We'll get the following score:

0.9002

Note

To access the source code for this specific section, please refer to https://packt.live/2B1CNKe.

You can also run this example online at https://packt.live/3fQU4Vd. You must execute the entire Notebook in order to get the desired result.

What effect did normalizing the images have on the overall performance of the system? The training error is worse! We went from 100% accuracy in the training set to 98.6%. Yes, there was a reduction in the performance of the training set, but an increase in the test set accuracy from 87.8% to 90.02%. The test set performance is of more interest, as the model has not seen this data before, and so it is a better representation of the performance that we could expect once the model is in the field. So, why do we get a better result?

Recall what we discussed about normalization and data scaling methods in *Chapter 2, Exploratory Data Analysis and Visualization*. And now let's review *Figure 5.13*, and notice the shape of the curve as it approaches -6 and +6. The curve saturates or flattens at almost 0 and almost 1, respectively. So, if we use an image (or **x** values) of between 0 and 255, the class probability defined by the logistic function is well within this flat region of the curve. Predictions within this region are unlikely to change much, as they will need to have very large changes in **x** values for any meaningful change in **y**. Scaling the images to be between 0 and 1 initially puts the predictions closer to **p(x) = 0.5**, and so, changes in **x** can have a bigger impact on the value for **y**. This allows for more sensitive predictions and results in getting a couple of predictions in the training set wrong, but more in the test set right. It is recommended, for your logistic regression models, that you scale the input values to be between either 0 and 1 or -1 and 1 prior to training and testing.

The following function is one way of scaling values of a NumPy array between 0 and 1:

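def scale_input(x):
    # Use the array's own min/max methods so that arrays with more
    # than one dimension (such as a batch of images) are handled too
    normalized = (x - x.min()) / (x.max() - x.min())
    return normalized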

The preceding method of scaling is called min-max scaling, as it is based on scaling with respect to the minimum and maximum values of the array. Z-scaling and mean scaling are other well-known scaling methods.
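The three scaling methods named above can be compared side by side. This is a sketch under stated assumptions: the function names are hypothetical, and "mean scaling" is taken here to mean centering on the mean and dividing by the range, which is one common definition. None of the functions guard against a constant array, where the denominator would be zero.

```python
import numpy as np

def min_max_scale(x):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_scale(x):
    """Standardize values to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def mean_scale(x):
    """Center values on the mean and divide by the range (assumed definition)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

pixels = np.array([0, 64, 128, 255])  # made-up pixel intensities

for scaler in (min_max_scale, z_scale, mean_scale):
    print(scaler.__name__, np.round(scaler(pixels), 3))
```

Min-max scaling bounds the output to a fixed interval, whereas z-scaling does not; which one suits a given model depends on the data and on how sensitive the model is to outliers.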

Thus, we have successfully solved a multiclass classification problem using the logistic regression model. Let's now proceed toward an activity where, similar to *Exercise 5.02: Logistic Regression as a Classifier – Binary Classifier*, we will solve a binary classification problem. This time, however, we will use a simpler model – a linear regression classifier.

## Activity 5.01: Ordinary Least Squares Classifier – Binary Classifier

In this activity, we will build a two-class OLS (linear regression)-based classifier using the MNIST dataset to classify between two digits, 0 and 1.

The steps to be performed are as follows:

- Import the required dependencies:
import struct

import numpy as np

import gzip

import urllib.request

import matplotlib.pyplot as plt

from array import array

from sklearn.linear_model import LinearRegression

- Load the MNIST data into memory.
- Visualize a sample of the data.
- Construct a linear classifier model to classify the digits 0 and 1. The model we are going to create is to determine whether the samples are either the digits 0 or 1. To do this, we first need to select only those samples.
- Visualize the selected information with images of one sample of 0 and one sample of 1.
- In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape.
- Let's construct the model; use the **LinearRegression** API and call the **fit** function.
- Determine the accuracy against the training set.
- Determine the label predictions for each of the training samples, using a threshold of 0.5. Values greater than 0.5 classify as 1; values less than, or equal to, 0.5 classify as 0.
- Compute the classification accuracy of the predicted training values versus the ground truth.
- Compare the performance against the test set.
Note

The solution for this activity can be found via this link.

An interesting point to note here is that the test set performance here is worse than that in *Exercise 5.02: Logistic Regression as a Classifier – Binary Classifier*. The dataset is exactly the same in both cases, but the models are different. And, as expected, the linear regression classifier, being a simpler model, leads to poorer test set performance compared to a stronger, logistic regression model.
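The thresholding step of the activity can be sketched on synthetic data. This is not the activity's solution: the two Gaussian clusters below are made-up stand-ins for the flattened digit images, and the variable names are assumptions. Note that, for **LinearRegression**, the **score** method returns the R² coefficient rather than a classification accuracy, which is why the accuracy is computed separately from the thresholded predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened images: two well-separated clusters
X0 = rng.normal(loc=0.0, scale=0.5, size=(100, 4))  # class 0
X1 = rng.normal(loc=3.0, scale=0.5, size=(100, 4))  # class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Fit ordinary least squares directly on the 0/1 labels
model = LinearRegression().fit(X, y)

# The model outputs continuous values; threshold them at 0.5
y_continuous = model.predict(X)
y_pred = (y_continuous > 0.5).astype(int)

r_squared = model.score(X, y)        # R^2, not accuracy
accuracy = (y_pred == y).mean()      # fraction of correct labels

print(f'R^2: {r_squared:.3f}')
print(f'Classification accuracy: {accuracy:.3f}')
```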

### Select K Best Feature Selection

Now that we have established how to train and test the linear regression and logistic regression models on the MNIST dataset, we will solve another classification problem, this time a binary one, on a different dataset using the logistic regression model. As a prerequisite for the next exercise, let's quickly discuss a particular kind of feature selection method: select k best feature selection. In this method, we select features according to the k highest scores. The scores are derived from a scoring function, which takes in the input features (**X**) and target (**y**), and returns a score for each feature. An example of such a function is one that computes the ANOVA F-value between the label (**y**) and each feature (**X**). An implementation of this scoring function is available with scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif. The features are then sorted in decreasing order of score, and we choose the top k features from this ordered list. An implementation of the select k best feature selection method is available with scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html. The following example code demonstrates how this method is used in scikit-learn, here with the chi-squared statistic (**chi2**) as the scoring function:

>>> from sklearn.datasets import load_digits

>>> from sklearn.feature_selection import SelectKBest, chi2

>>> X, y = load_digits(return_X_y=True)

>>> X.shape

(1797, 64)

>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)

>>> X_new.shape

(1797, 20)
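The "score, sort, keep the top k" procedure described above can also be reproduced by hand. The sketch below assumes scikit-learn's bundled copy of the breast cancer dataset (**load_breast_cancer**) rather than the CSV file used in the next exercise, and uses **f_classif** as the scoring function. It checks that ranking the scores manually picks the same columns as **SelectKBest**.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Bundled dataset used as a stand-in (an assumption for this sketch)
data = load_breast_cancer()
X, y = data.data, data.target

k = 2
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
X_new = selector.transform(X)

# Manual equivalent: score every feature, then keep the k highest scores
scores, _ = f_classif(X, y)
top_k_manual = np.sort(np.argsort(scores)[-k:])

# get_support() returns a Boolean mask over the input columns
top_k_selector = np.flatnonzero(selector.get_support())

print('Selected column indices:', top_k_selector)
print('Selected feature names:', data.feature_names[top_k_selector])
```

The Boolean mask from **get_support** is aligned with the columns passed to **fit**, so it should be paired with those same column names when recovering the names of the selected features.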

And now we move on to our next exercise, where we solve a binary classification problem.

## Exercise 5.04: Breast Cancer Diagnosis Classification Using Logistic Regression

In this exercise, we will be using the Breast Cancer Diagnosis dataset (available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 or on GitHub at https://packt.live/3a7oAY8). This dataset is a part of the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The dataset contains characteristics of the cell nuclei present in the digitized image of a **Fine Needle Aspirate** (**FNA**) of a breast mass, with the labels of malignant and benign for each cell nucleus. Characteristics are features (30 in total), such as the mean radius, radius error, worst radius, mean texture, texture error, and worst texture of the cell nuclei. In this exercise, we will use the features provided in the dataset to classify between malignant and benign cells.

The steps to be performed are as follows:

- Import the required packages. For this exercise, we will require the pandas package to load the data, the Matplotlib package for plotting, and scikit-learn for creating the logistic regression model. Import all the required packages and relevant modules for these tasks:
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.feature_selection import SelectKBest

from sklearn.model_selection import train_test_split

- Load the Breast Cancer Diagnosis dataset using pandas and examine the first five rows:
df = pd.read_csv('../Datasets/breast-cancer-data.csv')

df.head()

The output will be as follows:

Additionally, dissect the dataset into input (X) and output (y) variables:

X, y = df[[c for c in df.columns if c != 'diagnosis']], df.diagnosis

- The next step is feature engineering. We use scikit-learn's select k best features sub-module under its feature selection module. Basically, this examines the power of each feature against the target output based on a scoring function. You can read about the details here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html:
# Restrict to the 2 best features so that we can visualize them on a plot
skb_model = SelectKBest(k=2)
X_new = skb_model.fit_transform(X, y)

# Get the names of the k best columns
mask = skb_model.get_support()  # Boolean mask aligned with the columns of X
selected_features = []  # The list of your k best features
for is_selected, feature in zip(mask, X.columns):
    if is_selected:
        selected_features.append(feature)
print(selected_features)

The output will be as follows:

['worst perimeter', 'worst concave points']

- And now let's visualize how these two most important features correlate with the target (diagnosis) and how well they separate the two classes of diagnosis:
markers = {'benign': {'marker': 'o'},
           'malignant': {'marker': 'x'}}
plt.figure(figsize=(10, 7))
for name, group in df.groupby('diagnosis'):
    plt.scatter(group[selected_features[0]],
                group[selected_features[1]], label=name,
                marker=markers[name]['marker'])
plt.title(f'Diagnosis Classification {selected_features[0]} vs '
          f'{selected_features[1]}');
plt.xlabel(selected_features[0]);
plt.ylabel(selected_features[1]);
plt.legend();

The output will be as follows:

- Before we can construct the model, we must first convert the **diagnosis** values into labels that can be used within the model. Replace the **benign** diagnosis string with the value **0**, and the **malignant** diagnosis string with the value **1**:
diagnoses = ['benign', 'malignant',]
output = [diagnoses.index(diag) for diag in df.diagnosis]

- Also, in order to impartially evaluate the model, we should split the training dataset into a training and a validation set:
train_X, valid_X, train_y, valid_y = train_test_split(
    df[selected_features], output, test_size=0.2, random_state=123)

- Create the model using the **selected_features** and the assigned **diagnosis** labels, fitting it on the training split only so that the validation set stays unseen:
model = LogisticRegression(solver='liblinear')
model.fit(train_X, train_y)

The output will be as follows:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, l1_ratio=None, max_iter=100,

multi_class='warn', n_jobs=None, penalty='l2',

random_state=None, solver='liblinear', tol=0.0001,

verbose=0,

warm_start=False)

- Compute the accuracy of the model against the validation set:
model.score(valid_X, valid_y)

The output will be as follows:

0.9385964912280702

- Construct another model using a random choice of **selected_features** and compare performance:
selected_features = ['mean radius',  # List features here
                     'mean texture', 'compactness error']
train_X, valid_X, train_y, valid_y = train_test_split(
    df[selected_features], output, test_size=0.2, random_state=123)
model = LogisticRegression(solver='liblinear')
model.fit(train_X, train_y)
model.score(valid_X, valid_y)

The output will be as follows:

0.8859649122807017

This reduced accuracy shows that indeed, using the two most important features renders a more powerful model than using three randomly chosen features.

- Construct another model using all the available information and compare performance:
selected_features = [feat for feat in df.columns
                     if feat != 'diagnosis']  # List features here
train_X, valid_X, train_y, valid_y = train_test_split(
    df[selected_features], output, test_size=0.2, random_state=123)
model = LogisticRegression(solver='liblinear')
model.fit(train_X, train_y)
model.score(valid_X, valid_y)

The output will be as follows:

0.9824561403508771

Note

To access the source code for this specific section, please refer to https://packt.live/2YWxjIN.

You can also run this example online at https://packt.live/2Bx8NWt. You must execute the entire Notebook in order to get the desired result.

This improvement in performance by using all the features shows that even those features that are not among the most important ones do still play a role in improving model performance.

# Classification Using K-Nearest Neighbors

Now that we are comfortable with creating multiclass classifiers using logistic regression and are getting reasonable performance with these models, we will turn our attention to another type of classifier: the K-nearest neighbors (KNN) classifier. KNN is a non-probabilistic, non-linear classifier. It does not predict the probability of a class. Also, as it does not learn any parameters, there is no linear combination of parameters and, thus, it is a non-linear model:

*Figure 5.24* represents the workings of a KNN classifier. The two different symbols, **X** and **O**, represent data points belonging to two different classes. The solid circle at the center is the test point requiring classification, the inner dotted circle shows the classification process where **k=3**, while the outer dotted circle shows the classification process where **k=5**. What we mean here is that, if **k=3**, we only look at the three data points nearest to the test point, which gives us the impression of that dotted circle encompassing those three nearest data points.

KNN is one of the simplest "learning" algorithms available for data classification. The use of learning in quotation marks is explicit, as KNN doesn't really learn from the data and encode these learnings in parameters or weights like other methods, such as logistic regression. KNN uses instance-based or lazy learning in that it simply stores or memorizes all the training samples and the corresponding classes. It derives its name, k-nearest neighbors, from the fact that, when a test sample is provided to the algorithm for class prediction, it uses a majority vote of the k-nearest points to determine the corresponding class. If we look at *Figure 5.24* and if we assume **k=3**, the nearest three points lie within the inner dotted circle, and, in this case, the classification would be a hollow circle (**O**).

If, however, we were to take **k=5**, the nearest five points lie within the outer dotted circle and the classification would be a cross (**X**) (three crosses to two hollow circles). So, how do we select **k**? Academically, we should plot the KNN model performance (error) as a function of **k**. Look for an elbow in this plot, and the moment when an increase in **k** does not change the error significantly; this means that we have found an optimal value for **k**. More practically, the choice of **k** depends on the data, with larger values of **k** reducing the effect of noise on the classification, but thereby making boundaries between classes less distinct.

The preceding figure highlights a few characteristics of KNN classification that should be considered:

- As mentioned previously, the selection of **k** is quite important. In this simple example, switching **k** from three to five flipped the class prediction due to the proximity of both classes. As the final classification is taken by a majority vote, it is often useful to use odd values of **k** to ensure that there is a winner in the voting process. If an even value of **k** is selected, and a tie in the vote occurs, then there are a number of different methods available for breaking the tie, including:
  - Reducing **k** by one until the tie is broken
  - Selecting the class on the basis of the smallest Euclidean distance to the nearest point
  - Applying a weighting function to bias the test point toward those neighbors that are closer

- KNN models have the ability to form extremely complex non-linear boundaries, which can be advantageous in classifying images or datasets with highly non-linear boundaries. Considering that, in *Figure 5.24*, the test point changes from a hollow circle classification to a cross with an increase in **k**, we can see here that a complex boundary could be formed.
- KNN models can be highly sensitive to local features in the data, given that the classification process is only really dependent on the nearby points.
- As KNN models memorize all the training information to make predictions, they can struggle with generalizing to new, unseen data.
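The majority vote described above can be sketched from scratch in a few lines. This is a minimal illustration rather than scikit-learn's implementation: the function name **knn_predict** and the five training points are made up, arranged so that the prediction flips between **k=3** and **k=5** in the same way as the figure discussed earlier. Ties are not handled specially here.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Classify `point` by majority vote among its k nearest training samples."""
    # Euclidean distance from the test point to every stored sample
    distances = np.linalg.norm(train_X - point, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # Count the class labels among those neighbors and return the winner
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up training set: distances to the origin are 1.0, 1.2, 1.1, 1.5, 1.6
train_X = np.array([[1.0, 0.0], [0.0, 1.2], [-1.1, 0.0],
                    [0.0, -1.5], [1.6, 0.0]])
train_y = np.array(['O', 'O', 'X', 'X', 'X'])
test_point = np.array([0.0, 0.0])

pred_k3 = knn_predict(train_X, train_y, test_point, k=3)
pred_k5 = knn_predict(train_X, train_y, test_point, k=5)
print('k=3 prediction:', pred_k3)
print('k=5 prediction:', pred_k5)
```

Notice that there is no training step at all: "fitting" amounts to keeping `train_X` and `train_y` around, and all of the work happens at prediction time.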

There is another variant of KNN, which, rather than specifying the number of nearest neighbors, specifies the size of the radius around the test point at which to look. This method, known as the radius neighbors classification, will not be considered in this chapter, but, in understanding KNN, you will also develop an understanding of the radius neighbors classification and how to use the model through scikit-learn.

Note

Our explanation of KNN classification and the next exercise examines modeling data with two features or two dimensions, as it enables simpler visualization and a greater understanding of the KNN modeling process. And then we will classify a dataset with a greater number of dimensions in *Activity 5.02: KNN Multiclass Classifier*, wherein we'll classify MNIST using KNN. Remember, just because there are too many dimensions to plot, this doesn't mean it cannot be classified with *N* dimensions.

To allow visualization of the KNN process, we will turn our attention in the following exercise to the Breast Cancer Diagnosis dataset. This dataset is provided as part of the accompanying code files for this book.

## Exercise 5.05: KNN Classification

In this exercise, we will be using the KNN classification algorithm to build a model on the Breast Cancer Diagnosis dataset and evaluate its performance by calculating its accuracy:

- For this exercise, we need to import pandas, Matplotlib, and the **KNeighborsClassifier** class and **train_test_split** function of scikit-learn. We will use the shorthand notation **KNN** for quick access:
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier as KNN

from sklearn.model_selection import train_test_split

- Load the Breast Cancer Diagnosis dataset and examine the first five rows:
df = pd.read_csv('../Datasets/breast-cancer-data.csv')

df.head()

The output will be as follows:

- At this stage, we need to choose the most appropriate features from the dataset for use with the classifier. We could simply select all 30 features. However, as this exercise is designed to allow visualization of the KNN process, we will arbitrarily only select the mean radius and worst radius. Construct a scatterplot for mean radius versus worst radius for each of the classes in the dataset with the corresponding diagnosis type:
markers = {'benign': {'marker': 'o', 'facecolor': 'g',
                      'edgecolor': 'g'},
           'malignant': {'marker': 'x', 'facecolor': 'r',
                         'edgecolor': 'r'}}
plt.figure(figsize=(10, 7))
for name, group in df.groupby('diagnosis'):
    plt.scatter(group['mean radius'], group['worst radius'],
                label=name, marker=markers[name]['marker'],
                facecolors=markers[name]['facecolor'],
                edgecolor=markers[name]['edgecolor'])
plt.title('Breast Cancer Diagnosis Classification Mean Radius '
          'vs Worst Radius');
plt.xlabel('Mean Radius');
plt.ylabel('Worst Radius');
plt.legend();

The output will be as follows:

- Before actually going into training a model, let's split the training dataset further into a training and a validation set in the ratio 80:20 to be able to impartially evaluate the model performance later using the validation set:
train_X, valid_X, train_y, valid_y = train_test_split(
    df[['mean radius', 'worst radius']], df.diagnosis,
    test_size=0.2, random_state=123)

- Construct a KNN classifier model with **k = 3** and fit it to the training data:
model = KNN(n_neighbors=3)

model.fit(X=train_X, y=train_y)

The output will be as follows:

- Check the performance of the model against the validation set:
model.score(X=valid_X, y=valid_y)

The output will show the performance score:

0.9385964912280702

As we can see, the accuracy is over 93% on the validation set. Next, by means of an exercise, we will try to understand what decision boundaries are formed by the KNN model during the training process. We will draw the boundaries in the exercise.

Note

To access the source code for this specific section, please refer to https://packt.live/3dovRUH.

You can also run this example online at https://packt.live/2V5hYEP. You must execute the entire Notebook in order to get the desired result.
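The earlier advice on choosing **k**, namely evaluating the error as a function of **k** and looking for the point where it stops changing much, can be sketched as a simple sweep. The sketch assumes scikit-learn's bundled copy of the dataset (**load_breast_cancer**) in place of the CSV file used in the exercise, so the exact numbers may differ from those shown above; the range of **k** values is also an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled dataset used as a stand-in for the exercise's CSV (an assumption)
data = load_breast_cancer(as_frame=True)
X = data.data[['mean radius', 'worst radius']]
y = data.target

train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Validation error for a range of odd k values
errors = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    errors[k] = 1.0 - model.score(valid_X, valid_y)

for k, err in errors.items():
    print(f'k={k:2d}  validation error={err:.4f}')
```

Plotting `errors` against **k** gives the curve in which to look for an elbow. With a validation set this small, the curve can be noisy, so cross-validation may give a steadier picture.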

## Exercise 5.06: Visualizing KNN Boundaries

To visualize the decision boundaries produced by the KNN classifier, we need to sweep over the prediction space, that is, the minimum and maximum values for the mean radius and worst radius, and determine the classifications made by the model at those points. Once we have this sweep, we can then plot the classification decisions made by the model:

- Import all the relevant packages. We will also need NumPy for this exercise:
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap

from sklearn.neighbors import KNeighborsClassifier as KNN

- Load the dataset into a pandas DataFrame:
df = pd.read_csv('../Datasets/breast-cancer-data.csv')

df.head()

The output will be as follows:

- While we could use the diagnosis strings to create the model in the previous exercise, in plotting the decision boundaries, it would be more useful to map the diagnosis to separate integer values. To do this, create a list of the labels for later reference and iterate through this list, replacing the existing label with the corresponding index in the list:
labelled_diagnoses = ['benign', 'malignant',]
for idx, label in enumerate(labelled_diagnoses):
    df.diagnosis = df.diagnosis.replace(label, idx)
df.head()

The output will be as follows:

Notice the use of the **enumerate** function in the **for** loop definition. When iterating through the **for** loop, the **enumerate** function provides the index of the value in the list as well as the value itself through each iteration. We assign the index of the value to the **idx** variable and the value to **label**. Using **enumerate** in this way provides an easy way to replace the diagnosis strings with a unique integer label.

- Construct a KNN classification model, again using three nearest neighbors, and fit it to the mean radius and worst radius with the newly labeled diagnosis data:
model = KNN(n_neighbors=3)

model.fit(X=df[['mean radius', 'worst radius']], y=df.diagnosis)

The output will be as follows:

- To visualize our decision boundaries, we need to create a mesh or range of predictions across the information space, that is, all possible combinations of values of mean radius and worst radius. Starting with **1** unit less than the minimum for both the mean radius and worst radius, and finishing at **1** unit more than the maximum for mean radius and worst radius, use the **arange** function of NumPy to create a range of values between these limits in increments of **0.1** (spacing):
spacing = 0.1
mean_radius_range = np.arange(df['mean radius'].min() - 1,
                              df['mean radius'].max() + 1, spacing)
worst_radius_range = np.arange(df['worst radius'].min() - 1,
                               df['worst radius'].max() + 1, spacing)

- Use the NumPy **meshgrid** function to combine the two ranges in a grid:
# Create the mesh
xx, yy = np.meshgrid(mean_radius_range, worst_radius_range)

Check out **xx**:
xx

The output will be as follows:

- Check out **yy**:
yy

The output will be as follows:

- Concatenate the mesh into a single NumPy array using **np.c_**:
pred_x = np.c_[xx.ravel(), yy.ravel()] # Concatenate the results
pred_x

The output will be as follows:

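While this function call looks a little mysterious, it simply stacks the two flattened arrays side by side as the columns of a single two-column array, so that each row holds one (mean radius, worst radius) pair from the mesh (refer to https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html). It is shorthand for concatenating along the second axis.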

- Produce the class predictions for the mesh:
pred_y = model.predict(pred_x).reshape(xx.shape)

pred_y

The output will be as follows:

- To consistently visualize the boundaries, we will need two sets of consistent colors: a lighter set of colors for the decision boundaries, and a darker set of colors for the points of the training set themselves. Create two color maps using **ListedColormap**:
# Create color maps

cmap_light = ListedColormap(['#6FF6A5', '#F6A56F',])

cmap_bold = ListedColormap(['#0EE664', '#E6640E',])

- To highlight the decision boundaries, first plot the training data according to the diagnosis types, using the **cmap_bold** color scheme and different markers for each of the different diagnosis types:
markers = {'benign': {'marker': 'o', 'facecolor': 'g',
                      'edgecolor': 'g'},
           'malignant': {'marker': 'x', 'facecolor': 'r',
                         'edgecolor': 'r'}}
plt.figure(figsize=(10, 7))
for name, group in df.groupby('diagnosis'):
    diagnoses = labelled_diagnoses[name]
    plt.scatter(group['mean radius'], group['worst radius'],
                c=cmap_bold.colors[name],
                label=labelled_diagnoses[name],
                marker=markers[diagnoses]['marker'])
plt.title('Breast Cancer Diagnosis Classification Mean Radius '
          'vs Worst Radius');
plt.xlabel('Mean Radius');
plt.ylabel('Worst Radius');
plt.legend();

The output will be as follows:

- Using the prediction mesh made previously, plot the decision boundaries in addition to the training data:
plt.figure(figsize=(10, 7))

plt.pcolormesh(xx, yy, pred_y, cmap=cmap_light);

plt.scatter(df['mean radius'], df['worst radius'], c=df.diagnosis, cmap=cmap_bold, edgecolor='k', s=20);

plt.title('Breast Cancer Diagnosis Decision Boundaries Mean Radius '\

'vs Worst Radius');

plt.xlabel('Mean Radius');

plt.ylabel('Worst Radius');

plt.text(15, 12, 'Benign', ha='center',va='center', \

size=20,color='k');

plt.text(15, 30, 'Malignant', ha='center',va='center', \

size=20,color='k');

The output will be as follows:

Note

To access the source code for this specific section, please refer to https://packt.live/3dpxPnY.

You can also run this example online at https://packt.live/3drmBPE. You must execute the entire Notebook in order to get the desired result.
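The mesh construction used in this exercise is easier to follow on a tiny grid. The sketch below uses made-up ranges of three and two values, and a hypothetical stand-in rule in place of a trained model, to show how **meshgrid**, **ravel**, **np.c_**, and **reshape** fit together.

```python
import numpy as np

# Two small, made-up axis ranges
x_range = np.arange(0, 3)      # three values along the x axis
y_range = np.arange(10, 12)    # two values along the y axis

# meshgrid repeats each range so every (x, y) combination appears once
xx, yy = np.meshgrid(x_range, y_range)

# Flatten both grids and stack them as two columns: one row per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for model.predict: label a point 1 when its x value is positive
labels = (grid_points[:, 0] > 0).astype(int)

# Reshape the flat predictions back to the grid layout for plotting
label_grid = labels.reshape(xx.shape)

print('xx shape:', xx.shape)
print('grid_points shape:', grid_points.shape)
print(grid_points)
print(label_grid)
```

Reshaping the predictions back to `xx.shape` is what allows **pcolormesh** to color each cell of the grid with the class predicted at that location.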

We have thus both trained a KNN classifier and understood how the KNN decision boundaries are formed. Next, we will train a KNN multiclass classifier for a different dataset and evaluate its performance.

## Activity 5.02: KNN Multiclass Classifier

In this activity, we will use the KNN model to classify the MNIST dataset into 10 different digit-based classes.

The steps to be performed are as follows:

- Import the following packages:
import struct

import numpy as np

import gzip

import urllib.request

import matplotlib.pyplot as plt

from array import array

from sklearn.neighbors import KNeighborsClassifier as KNN

- Load the MNIST data into memory; first the training images, then the training labels, then the test images, and, finally, the test labels.
- Visualize a sample of the data.
- Construct a KNN classifier, with three nearest neighbors to classify the MNIST dataset. Again, to save processing power, randomly sample 5,000 images for use in training.
- In order to provide the image information to the model, we must first flatten the data out such that each image is 1 x 784 pixels in shape.
- Build the KNN model with **k=3** and fit the data to the model. Note that, in this activity, we are providing 784 features or dimensions to the model, not just 2.
- Determine the score against the training set.
- Display the first two predictions for the model against the training data.
- Compare the performance against the test set.
The output will be as follows:

0.9376

Note

The solution for this activity can be found via this link.

If we compare the preceding test set performance with that in *Exercise 5.03, Logistic Regression – Multiclass Classifier*, we see that for the exact same dataset, the KNN model outperforms the logistic regression classifier on this task. This doesn't necessarily mean that KNN always outperforms logistic regression, but it does so for this task and this dataset.

# Classification Using Decision Trees

Another powerful classification method that we will be examining in this chapter is decision trees, which have found particular use in applications such as natural language processing. There are a number of different machine learning algorithms that fall within the overall umbrella of decision trees, such as **Iterative Dichotomiser 3** (**ID3**) and **Classification and Regression Tree** (**CART**). In this chapter, we will investigate the use of the ID3 method in classifying categorical data, and we will use the scikit-learn CART implementation as another method of classifying the dataset. So, what exactly are decision trees?

As the name suggests, decision trees are learning algorithms that apply a sequential series of decisions based on input information to make the final classification. Recalling your childhood biology class, you may have used a process similar to decision trees in the classification of different types of animals via dichotomous keys. Just like the dichotomous key example shown, decision trees aim to classify information following the result of a number of decision or question steps:

Depending upon the decision tree algorithm being used, the implementation of the decision steps may vary slightly, but we will be considering the implementation of the ID3 algorithm specifically. The ID3 algorithm aims to classify the data on the basis of each decision providing the largest information gain. To further understand this design, we also need to understand two additional concepts: entropy and information gain.

Note

The ID3 algorithm was first proposed by the Australian researcher Ross Quinlan in 1985 (https://doi.org/10.1007/BF00116251).

- Entropy: In simple terms, entropy shows the degree of uncertainty of the signal. For example, if a football (soccer) game is 5 minutes from finishing and if the score is 5-0, then we would say that the game has a low entropy, or, in other words, we are almost certain that the team with 5 goals will win. However, if the score is 1-1, then the game will be considered to have a high entropy (uncertainty). In the context of information theory, entropy is the average rate at which information is provided by a random source of data. Mathematically speaking, this entropy is defined as:

In this scenario, when the probability of an outcome is around 0.5, observing the outcome carries more information, as the final result is relatively uncertain compared to when the probability is extreme (very high or very low).

- Information gain: This quantifies the amount of uncertainty reduced if we have prior information about a variable **a** (the variable will be a feature in the case of machine learning models). In other words, how much information can variable **a** provide regarding an event? Given a dataset **S**, and an attribute to observe **a**, the information gain is defined mathematically as:

The information gain of dataset **S**, for attribute **a**, is equal to the entropy of **S** minus the entropy of **S** conditional on attribute **a**. Equivalently, it is the entropy of **S** minus the sum, over every category **t** of attribute **a**, of the ratio of the number of elements in **t** to the total number of elements in **S**, times the entropy of **t**.
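The two definitions described in words above can be written out as follows. This is a reconstruction from the surrounding description rather than a copy of the original typeset equations: the symbols $p_i$ (the proportion of samples in **S** belonging to class $i$), $T$ (the set of subsets of **S** produced by splitting on the categories of **a**), and $|\cdot|$ (the number of elements) are notation assumed here.

```latex
% Entropy of a dataset S whose classes occur with proportions p_i
H(S) = -\sum_{i} p_i \log_2 p_i

% Information gain of S for attribute a, where T is the set of subsets
% of S obtained by splitting on the categories of a
IG(S, a) = H(S) - H(S \mid a)
         = H(S) - \sum_{t \in T} \frac{|t|}{|S|} \, H(t)
```

A class with $p_i = 0$ contributes nothing to the sum, following the usual convention that $0 \log_2 0 = 0$.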

If at first you find the mathematics here a little daunting, don't worry, for it is far simpler than it seems. To clarify the ID3 process, we will walk through the process using the same dataset as was provided by Quinlan in the original paper.
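Before turning to the worked exercise, the two quantities can be sketched as small general-purpose functions. The function names and the four-sample toy data below are made up for illustration and are not part of Quinlan's dataset; the final line uses a split of 9 positive and 5 negative labels only to show the entropy of an unbalanced two-class set.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    # Only classes that actually occur are counted, so log2(0) never arises
    return float(-(probs * np.log2(probs)).sum())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(attribute_values):
        # Labels of the samples that fall into this category
        subset = [lab for att, lab in zip(attribute_values, labels)
                  if att == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: one attribute separates the classes perfectly, one not at all
labels = ['P', 'P', 'N', 'N']
perfect_attribute = ['a', 'a', 'b', 'b']
useless_attribute = ['a', 'b', 'a', 'b']

print(f'H(labels) = {entropy(labels):.4f}')
print(f'Gain (perfect) = {information_gain(perfect_attribute, labels):.4f}')
print(f'Gain (useless) = {information_gain(useless_attribute, labels):.4f}')
print(f"H(9 P, 5 N) = {entropy(['P'] * 9 + ['N'] * 5):.4f}")
```

An attribute that splits the data into pure subsets removes all of the uncertainty, so its gain equals the original entropy, while an attribute whose subsets are as mixed as the whole dataset yields no gain. ID3 picks the attribute with the highest gain at each step.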

## Exercise 5.07: ID3 Classification

In this exercise, we will be performing ID3 classification on a dataset. In the original paper, Quinlan provided a small dataset of 14 weather observation samples, each labeled **P** if the weather was suitable for, say, a Saturday morning game of cricket (or baseball, for our North American friends) or **N** if it was not. The example dataset described in the paper will be created in the exercise:

- Import the required packages:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

- In a Jupyter notebook, create a pandas DataFrame of the following training set:
df = pd.DataFrame()

df['Outlook'] = ['sunny', 'sunny', 'overcast', 'rain', 'rain', \

'rain', 'overcast', 'sunny', 'sunny', 'rain', \

'sunny', 'overcast', 'overcast', 'rain']

df['Temperature'] = ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', \

'cool', 'mild', 'cool', 'mild', 'mild', \

'mild', 'hot', 'mild',]

df['Humidity'] = ['high', 'high', 'high', 'high', 'normal', \

'normal', 'normal', 'high', 'normal', \

'normal', 'normal', 'high', 'normal', 'high']

df['Windy'] = ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', \

'Strong', 'Weak', 'Weak', 'Weak','Strong', 'Strong', \

'Weak', 'Strong']

df['Decision'] = ['N', 'N', 'P', 'P', 'P', 'N', 'P', 'N', 'P', \

'P','P', 'P', 'P', 'N']

df

The output will be as follows:

- In the original paper, the ID3 algorithm starts by taking a small sample of the training set at random and fitting the tree to this window. This can be a useful method for large datasets, but given that ours is quite small, we will simply start with the entire training set. The first step is to calculate the entropy for the **Decision** column, where there are two possible values, or classes, **P** and **N**:
# Probability of P

p_p = len(df.loc[df.Decision == 'P']) / len(df)

# Probability of N

p_n = len(df.loc[df.Decision == 'N']) / len(df)

entropy_decision = -p_n * np.log2(p_n) - p_p * np.log2(p_p)

print(f'H(S) = {entropy_decision:0.4f}')

The output will be as follows:

H(S) = 0.9403

- We will need to repeat this calculation, so wrap it in a function:
def f_entropy_decision(data):
    p_p = len(data.loc[data.Decision == 'P']) / len(data)
    p_n = len(data.loc[data.Decision == 'N']) / len(data)
    return -p_n * np.log2(p_n) - p_p * np.log2(p_p)
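The helper above assumes exactly two classes, and it returns `nan` for a subset that contains only one of them, because `log2(0)` is undefined. A minimal sketch of a more general version follows, assuming the class labels arrive as any Python iterable. The name `entropy_of` is hypothetical and is not part of the exercise code:

```python
from collections import Counter
from math import log2


def entropy_of(labels):
    """Shannon entropy (in bits) of an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Counter only stores classes that actually occur, so log2(0) never arises
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# The Decision column from the exercise: 9 samples of P and 5 of N
decisions = ['N', 'N', 'P', 'P', 'P', 'N', 'P', 'N', 'P',
             'P', 'P', 'P', 'P', 'N']

h_full = entropy_of(decisions)        # mixed classes
h_pure = entropy_of(['P', 'P', 'P'])  # a single class: no uncertainty
h_even = entropy_of(['P', 'N'])       # a 50/50 split: maximum uncertainty
print(f'{h_full:0.4f} {h_pure:0.4f} {h_even:0.4f}')  # 0.9403 0.0000 1.0000
```

The two extreme cases give a feel for the scale: a pure subset has an entropy of 0 bits, while an evenly split two-class subset has an entropy of 1 bit.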

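- The next step is to calculate which attribute provides the highest information gain out of **Outlook**, **Temperature**, **Humidity**, and **Windy**. Starting with the **Outlook** parameter, determine the probability of each decision given sunny, overcast, and rainy conditions. We need to evaluate the following equation:

$$\text{Gain}(\text{Decision}, \text{Outlook}) = \text{Entropy}(\text{Decision}) - \sum_{v \in \{\text{sunny},\, \text{overcast},\, \text{rain}\}} p(\text{Outlook} = v)\, \text{Entropy}(\text{Decision} \mid \text{Outlook} = v)$$

- Construct this equation in Python using the pandas **groupby** method:
IG_decision_Outlook = entropy_decision # H(S)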

# Create a string to print out the overall equation
overall_eqn = 'Gain(Decision, Outlook) = Entropy(Decision)'
# Iterate through the values for outlook and compute the
# probabilities and entropy values
for name, Outlook in df.groupby('Outlook'):
    num_p = len(Outlook.loc[Outlook.Decision == 'P'])
    num_n = len(Outlook.loc[Outlook.Decision != 'P'])
    num_Outlook = len(Outlook)
    print(f'p(Decision=P|Outlook={name}) = {num_p}/{num_Outlook}')
    print(f'p(Decision=N|Outlook={name}) = {num_n}/{num_Outlook}')
    print(f'p(Outlook={name}) = {num_Outlook}/{len(df)}')
    print(f'Entropy(Decision|Outlook={name}) = '\
          f'-{num_p}/{num_Outlook}.log2({num_p}/{num_Outlook}) - '\
          f'{num_n}/{num_Outlook}.log2({num_n}/{num_Outlook})')
    entropy_decision_outlook = 0
    # Cannot compute log of 0 so add checks
    if num_p != 0:
        entropy_decision_outlook -= (num_p / num_Outlook) \
                                    * np.log2(num_p / num_Outlook)
    # Cannot compute log of 0 so add checks
    if num_n != 0:
        entropy_decision_outlook -= (num_n / num_Outlook) \
                                    * np.log2(num_n / num_Outlook)
    IG_decision_Outlook -= (num_Outlook / len(df)) \
                           * entropy_decision_outlook
    print()
    overall_eqn += f' - p(Outlook={name}).'
    overall_eqn += f'Entropy(Decision|Outlook={name})'
print(overall_eqn)
print(f'Gain(Decision, Outlook) = {IG_decision_Outlook:0.4f}')

The output will be as follows:

- The final gain equation for **Outlook** can be rewritten as:

- We need to repeat this process quite a few times, so wrap it in a function for ease of use later:
def IG(data, column, ent_decision=entropy_decision):
    IG_decision = ent_decision
    for name, temp in data.groupby(column):
        p_p = len(temp.loc[temp.Decision == 'P']) / len(temp)
        p_n = len(temp.loc[temp.Decision != 'P']) / len(temp)
        entropy_decision = 0
        if p_p != 0:
            entropy_decision -= (p_p) * np.log2(p_p)
        if p_n != 0:
            entropy_decision -= (p_n) * np.log2(p_n)
        # Weight each subset by its share of the data being split
        IG_decision -= (len(temp) / len(data)) * entropy_decision
    return IG_decision

- Repeat this process for each of the other columns to compute the corresponding information gain:
for col in df.columns[:-1]:
    print(f'Gain(Decision, {col}) = {IG(df, col):0.4f}')

The output will be as follows:

Gain(Decision, Outlook) = 0.2467

Gain(Decision, Temperature) = 0.0292

Gain(Decision, Humidity) = 0.1518

Gain(Decision, Windy) = 0.0481

- This information provides the first decision of the tree. We want to split on the maximum information gain, so we split on **Outlook**. Look at the data splitting on **Outlook**:
for name, temp in df.groupby('Outlook'):
    print('-' * 15)
    print(name)
    print('-' * 15)
    print(temp)
    print('-' * 15)

The output will be as follows:

Notice that all the overcast records have a decision of **P**. This provides our first terminating leaf of the decision tree. If it is overcast, we are going to play, while if it is rainy or sunny, there is a chance we will not play. The decision tree so far can be represented as in the following figure:

Note

This figure was created manually for reference and is not contained in, or obtained from, the accompanying source code.

- We now repeat this process, splitting by information gain until all the data is allocated and all branches of the tree terminate. First, remove the overcast samples, as they no longer provide any additional information:
df_next = df.loc[df.Outlook != 'overcast']

df_next

The output will be as follows:

- Now, we will turn our attention to the sunny samples and will rerun the gain calculations to determine the best way to split the sunny information:
df_sunny = df_next.loc[df_next.Outlook == 'sunny']

- Recompute the entropy for the sunny samples:
entropy_decision = f_entropy_decision(df_sunny)

entropy_decision

The output will be as follows:

0.9709505944546686

- Run the gain calculations for the sunny samples:
for col in df_sunny.columns[1:-1]:
    print(f'Gain(Decision, {col}) = '
          f'{IG(df_sunny, col, entropy_decision):0.4f}')

The output will be as follows:

Gain(Decision, Temperature) = 0.5710

Gain(Decision, Humidity) = 0.9710

Gain(Decision, Windy) = 0.0200

- Again, we select the largest gain, which is **Humidity**. Group the data by **Humidity**:
for name, temp in df_sunny.groupby('Humidity'):
    print('-' * 15)
    print(name)
    print('-' * 15)
    print(temp)
    print('-' * 15)

The output will be as follows:

We can see here that we have two terminating leaves: when the **Humidity** is high, there is a decision not to play, and, vice versa, when the **Humidity** is normal, there is the decision to play. So, updating our representation of the decision tree, we have:

- So, the last set of data that requires classification is the rainy outlook data. Extract only the **rain** data and rerun the entropy calculation:
df_rain = df_next.loc[df_next.Outlook == 'rain']
entropy_decision = f_entropy_decision(df_rain)
entropy_decision

The output will be as follows:

0.9709505944546686

- Repeat the gain calculation with the **rain** subset:
for col in df_rain.columns[1:-1]:
    print(f'Gain(Decision, {col}) = '
          f'{IG(df_rain, col, entropy_decision):0.4f}')

The output will be as follows:

Gain(Decision, Temperature) = 0.0200

Gain(Decision, Humidity) = 0.0200

Gain(Decision, Windy) = 0.9710

- Again, splitting on the attribute with the largest gain value requires splitting on the **Windy** values. So, group the remaining information by **Windy**:
for name, temp in df_rain.groupby('Windy'):
    print('-' * 15)
    print(name)
    print('-' * 15)
    print(temp)
    print('-' * 15)

The output will be as follows:

- Finally, we have all the terminating leaves required to complete the tree, as splitting on **Windy** provides two sets, all of which indicate either play (**P**) or no-play (**N**) values. Our complete decision tree is as follows:

Note

To access the source code for this specific section, please refer to https://packt.live/37Rh7fX.

You can also run this example online at https://packt.live/3hTz4Px. You must execute the entire Notebook in order to get the desired result.
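The exercise performs each split by hand. The same procedure can be sketched as a short recursive function, shown here as an illustration rather than as the exercise's own code. The helper names `entropy`, `info_gain`, and `id3` are hypothetical, and the tree is represented as nested dictionaries, which is only one of several possible representations:

```python
import numpy as np
import pandas as pd

# The same 14 observations used in the exercise
df = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain',
                'overcast', 'sunny', 'sunny', 'rain', 'sunny',
                'overcast', 'overcast', 'rain'],
    'Temperature': ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool',
                    'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild'],
    'Humidity': ['high', 'high', 'high', 'high', 'normal', 'normal',
                 'normal', 'high', 'normal', 'normal', 'normal', 'high',
                 'normal', 'high'],
    'Windy': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
              'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Decision': ['N', 'N', 'P', 'P', 'P', 'N', 'P', 'N', 'P', 'P', 'P',
                 'P', 'P', 'N'],
})


def entropy(labels):
    """Entropy (in bits) of a pandas Series of class labels."""
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())


def info_gain(data, column, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    weighted = sum(len(sub) / len(data) * entropy(sub[target])
                   for _, sub in data.groupby(column))
    return entropy(data[target]) - weighted


def id3(data, features, target):
    """Recursively build a tree as nested dicts: {feature: {value: subtree}}."""
    labels = data[target]
    if labels.nunique() == 1:      # pure node: terminating leaf
        return labels.iloc[0]
    if not features:               # nothing left to split on: majority vote
        return labels.mode().iloc[0]
    best = max(features, key=lambda col: info_gain(data, col, target))
    remaining = [col for col in features if col != best]
    return {best: {value: id3(sub, remaining, target)
                   for value, sub in data.groupby(best)}}


tree = id3(df, ['Outlook', 'Temperature', 'Humidity', 'Windy'], 'Decision')
print(tree)
```

On this dataset, the sketch reproduces the tree derived in the exercise: **Outlook** at the root, **Humidity** under the sunny branch, and **Windy** under the rain branch.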

Decision trees, very much like KNN models, are discriminative models. Discriminative models aim to maximize the conditional probability of the class of the data given the features. In contrast, generative models learn the joint probability of data classes and features and, hence, learn the distribution of the data, which allows them to generate artificial samples.

So, how do we make predictions with unseen information in the case of a decision tree? Simply follow the tree. Look at the decision being made at each node and apply the data from the unseen sample. The prediction will then end up being the label specified at the terminating leaf. Let's say we had a weather forecast for the upcoming Saturday and we wanted to predict whether we were going to play or not. The weather forecast is as follows:

The decision tree for this would be as follows (the dashed circles indicate selected leaves in the tree):

Now, hopefully, you have a reasonable understanding of the underlying concept of decision trees and the process of making sequential decisions. With the principles of decision trees in our toolkit, we will now look at applying a more complicated model using the functionality provided in scikit-learn.
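Following the tree for an unseen sample can be sketched in a few lines. The nested-dictionary layout and the `predict` helper below are illustrative assumptions, and the forecast values are hypothetical rather than taken from the chapter's figure:

```python
# The tree derived in the ID3 exercise, written as nested dictionaries:
# {feature: {feature value: subtree or leaf label}}
tree = {'Outlook': {
    'overcast': 'P',
    'sunny': {'Humidity': {'high': 'N', 'normal': 'P'}},
    'rain': {'Windy': {'Weak': 'P', 'Strong': 'N'}},
}}


def predict(node, sample):
    """Walk from the root to a leaf, choosing the branch matching the sample."""
    while isinstance(node, dict):
        feature = next(iter(node))             # feature tested at this node
        node = node[feature][sample[feature]]  # follow the matching branch
    return node                                # leaf label: 'P' or 'N'


# Hypothetical forecast for the upcoming Saturday
forecast = {'Outlook': 'sunny', 'Temperature': 'mild',
            'Humidity': 'normal', 'Windy': 'Weak'}
prediction = predict(tree, forecast)
print(prediction)  # P
```

Note that **Temperature** never influences the result, because the tree built from this dataset does not split on it.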

### Classification and Regression Tree

The scikit-learn decision tree methods implement the CART method, which provides the ability to use decision trees in both classification and regression problems. CART differs from ID3 in that the decisions are made by comparing the values of features against a calculated threshold. In the ID3 algorithm, a decision is made based on the values of the feature that are present in the dataset. This serves the purpose well when the data is categorical; however, once the data becomes continuous, this method does not work well. In such cases, CART is used, which calculates a threshold value for comparison with a feature value. Because such a comparison has only two possible outcomes, (a) the feature value is less than or equal to the threshold value, or (b) the feature value is greater than the threshold value, CART results in binary trees.

In contrast, ID3 creates multiway trees because, as mentioned earlier, the decision is made based on existing feature values, and if the feature is categorical, the tree branches into potentially as many branches as there are categories. Another difference is that, whereas ID3 uses information gain as the metric to find the best split, CART uses another measure called the **Gini impurity measure**. Mathematically, you will recall that we defined entropy as:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

And so, Gini impurity is defined as:

$$G(S) = 1 - \sum_{i} p_i^2$$

Here, $p_i$ is the proportion of samples in $S$ that belong to class $i$.

Conceptually, this is a measure of the following: if we randomly pick a data point in our dataset and if we randomly classify (label) it according to the class distribution in the dataset, then what is the probability of classifying the data point incorrectly?

Having discussed the CART- and ID3-based decision tree methodologies, let's now solve a classification problem using the CART methodology.
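To make the threshold idea concrete, here is a minimal sketch of how a CART-style split on a single continuous feature could be chosen by minimizing the weighted Gini impurity. It is not scikit-learn's implementation. The function names and the toy measurements are made up for illustration, and using midpoints between consecutive distinct values as candidate thresholds is one common choice, assumed here:

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    distinct = np.unique(values)  # sorted distinct feature values
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = (distinct[:-1] + distinct[1:]) / 2
    best = (None, np.inf)
    for t in candidates:
        left = labels[values <= t]   # branch (a): value <= threshold
        right = labels[values > t]   # branch (b): value > threshold
        score = (len(left) * gini(left)
                 + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Made-up measurements: B = benign, M = malignant
radius = [11.2, 12.5, 13.0, 14.1, 17.9, 18.4, 20.1, 21.3]
diagnosis = ['B', 'B', 'B', 'B', 'M', 'M', 'M', 'M']

threshold, impurity = best_threshold(radius, diagnosis)
print(f'split at radius <= {threshold:0.2f}, weighted Gini = {impurity:0.3f}')
```

A full CART implementation repeats this search over every feature at every node and then recurses into the two resulting branches.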

## Exercise 5.08: Breast Cancer Diagnosis Classification Using a CART Decision Tree

In this exercise, we will classify the Breast Cancer Diagnosis data using scikit-learn's decision tree classifier, which can be used in both classification and regression problems:

- Import the required packages:
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

- Load the Breast Cancer dataset:
df = pd.read_csv('../Datasets/breast-cancer-data.csv')

df.head()

The output will be as follows:

- Before actually going into training a model, let's further split the training dataset into a training and a validation set in the ratio 70:30 to be able to impartially evaluate the model performance later using the validation set:
train_X, valid_X, \
train_y, valid_y = train_test_split(df.drop(columns='diagnosis'), \
                                    df.diagnosis, \
                                    test_size=0.3, random_state=123)

- Fit the model to the training data and check the corresponding accuracy:
model = DecisionTreeClassifier()

model = model.fit(train_X, train_y)

model.score(train_X, train_y)

The output will be as follows:

1.0

Our model achieves 100% accuracy on the training set.

- Check the performance against the validation set:
model.score(valid_X, valid_y)

As expected, the accuracy on the validation set is below 1:

0.9415204678362573
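The gap between a perfect training score and a lower validation score suggests that an unrestricted tree fits the training data very closely. The following sketch compares an unrestricted tree with a depth-limited one. It assumes the copy of the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn rather than the CSV file used in this exercise, and the choice of `max_depth=3` is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the dataset, not the book's CSV
X, y = load_breast_cancer(return_X_y=True)
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.3, random_state=123)

full_tree = DecisionTreeClassifier(random_state=0).fit(train_X, train_y)
small_tree = DecisionTreeClassifier(max_depth=3,
                                    random_state=0).fit(train_X, train_y)

# Compare depth, training accuracy, and validation accuracy
for name, tree in [('unrestricted', full_tree), ('max_depth=3', small_tree)]:
    print(f'{name}: depth={tree.get_depth()}, '
          f'train={tree.score(train_X, train_y):0.3f}, '
          f'valid={tree.score(valid_X, valid_y):0.3f}')
```

Whether the shallower tree generalizes better depends on the data and the split, so the comparison is best treated as a diagnostic rather than a rule.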

- One of the great things about decision trees is that we can visually represent the model and see exactly what is going on. Install the required dependency:
!conda install python-graphviz

- Import the graphing package:
import graphviz

from sklearn.tree import export_graphviz

- Plot the model:
dot_data = export_graphviz(model, out_file=None)

graph = graphviz.Source(dot_data)

graph

The output will be as follows:

This figure illustrates the decisions of the CART decision tree in the scikit-learn model. The first line of the node is the decision that is made at that step. The first node, *X[1] <= 16.795*, indicates that the training data is split on column 1 on the basis of being less than or equal to 16.795. Those samples with values in column 1 less than or equal to 16.795 (of which there are 254) are then further split on column 25. Similarly, samples with values in column 1 greater than 16.795 (of which there are 144) are then further split on column 28. This decision/branching process continues until a terminating condition is reached. The terminating condition can be defined in several ways. Some of them are as follows:

- The tree has been exhausted and all terminating leaves have been constructed/found.
- Impurity (the measure of the different number of classes that the elements in a node belong to) at a particular node is below a given threshold.
- The number of elements at a particular node is lower than a threshold number of elements.

Note

To access the source code for this specific section, please refer to https://packt.live/31btfY5.

You can also run this example online at https://packt.live/37PJTO4. You must execute the entire Notebook in order to get the desired result.
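The terminating conditions listed above correspond loosely to hyperparameters of scikit-learn's `DecisionTreeClassifier`. The sketch below sets a few of them and prints the resulting tree as text with `export_text`, which needs no extra graphing dependency. The specific values are arbitrary, the bundled scikit-learn copy of the dataset is assumed instead of the CSV file, and `min_impurity_decrease` is only the closest available analogue of an impurity threshold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled copy of the dataset, not the book's CSV
data = load_breast_cancer()

model = DecisionTreeClassifier(
    max_depth=3,                 # stop once the tree is 3 levels deep
    min_samples_split=20,        # do not split nodes with fewer than 20 samples
    min_impurity_decrease=0.01,  # split only if impurity drops by at least this
    random_state=0,
)
model.fit(data.data, data.target)

# Plain-text view of the learned decision rules
rules = export_text(model, feature_names=list(data.feature_names))
print(rules)
```

Each line of the printed rules shows a threshold comparison on one feature, mirroring the node labels in the graphical representation.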

Before we move on to the next topic, let's perform a binary classification task using the CART decision tree on the MNIST digits dataset. The task is to classify images of digits 0 and 1 into digits (or classes) 0 and 1.

## Activity 5.03: Binary Classification Using a CART Decision Tree

In this activity, we will build a CART Decision Tree-based classifier using the MNIST dataset to classify between two digits: 0 and 1.

The steps to be performed are as follows:

- Import the required dependencies:
import struct

import numpy as np

import pandas as pd

import gzip

import urllib.request

import matplotlib.pyplot as plt

from array import array

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

- Load the MNIST data into memory.
- Visualize a sample of the data.
- Construct a CART Decision Tree classifier model to classify the digits 0 and 1. The model we are going to create is to determine whether the samples are either the digits 0 or 1. To do this, we first need to select only those samples.
- Visualize the selected information with images of one sample of 0 and one sample of 1.
- In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape.
- Construct the model; use the **DecisionTreeClassifier** API and call the **fit** function.
- Determine the training set accuracy.
- Compare the performance against the test set.

The output will be as follows:

0.9962174940898345

Note

The solution for this activity can be found via this link.
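The selection and flattening steps in this activity can be sketched with NumPy alone. The arrays below are randomly generated stand-ins with the same shapes and dtypes as the MNIST arrays, not the real data, and the variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the MNIST arrays: 100 images of 28 x 28 pixels, labels 0 to 9
img = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=100).astype(np.uint8)

# Keep only the samples whose label is 0 or 1
mask = np.isin(labels, [0, 1])
images_01 = img[mask]
labels_01 = labels[mask]

# Flatten each 28 x 28 image into a single row of 784 pixel values
flat_01 = images_01.reshape((-1, 28 * 28))
print(flat_01.shape, np.unique(labels_01))
```

The flattened array has one row per selected image, which is the two-dimensional layout that scikit-learn estimators expect for their `fit` and `score` methods.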

An interesting point to note is that the test set performance here is much better than that in *Activity 5.01: Ordinary Least Squares Classifier – Binary Classifier*. The dataset is exactly the same in both cases, but the models are different. This shows that the CART decision tree-based model performs better than the OLS-based model on this binary classification task.

Now that we have acquired an understanding of decision trees for classification, we will next discuss one of the most popular and powerful types of machine learning model that is widely used in the industry as well as in academia – artificial neural networks.

# Artificial Neural Networks

The final type of classification model that we will be studying is **Artificial Neural Networks** (**ANNs**). This class of model is inspired by how the human brain functions. More specifically, we try to mathematically emulate the interconnected-neurons architecture, hence the name, neural networks. Essentially, an artificial neural network architecture looks something like that shown in *Figure 5.57*:

To the extreme left is the input data *X*, expanded into the **N0** different feature dimensions. This example has two hidden layers, **h1** and **h2**, having **N1** and **N2** number of neurons, respectively. Wait, what is a neuron? The nomenclature is derived from the human brain analogy, and a neuron in the context of an artificial neural network is essentially a node in the network/graph. And finally, in the figure, there is the output layer, Y, which consists of the *N* number of classes for the example of a multiclass classification task. Each arrow in this figure represents a network weight or parameter. As you can see, these models can therefore have a large number of arrows/parameters, which essentially makes them complex and powerful. And the way these weights come into play is, for example, **h11** is the weighted sum of all the input features, **x1**, **x2** … **xN0**, passed through an activation function.

Wait, what then is an activation function? In neural networks, inside each neuron or node is an implicit non-linear function. This helps make the model non-linear (hence complex), and if we remove these non-linearities, then the several hidden layers will collapse (by virtue of a series of matrix multiplications), resulting in an extremely simple linear model. This linear model would imply that the output class of data can be represented as the weighted sum of input features, which is not the case with ANNs. Popular non-linear activation functions used in neural networks are **sigmoid**, **tanh** (hyperbolic tangent), and **Rectified Linear Unit** (**ReLU**). In fact, if we use sigmoid as the activation function, omit all the hidden layers, and restrict the number of classes to two, we get the following neural network:

Does this look familiar? This model is precisely the same as our logistic regression model! First, we take the weighted sum of all the input features **x1**, **x2** … **xN0**, and then apply the sigmoid or logistic function in order to get the final output. This output is then compared with the ground truth label to compute the loss. Similar to the linear regression models discussed in the previous chapter, neural networks use gradient descent to derive the optimal set of weights or parameters by minimizing the loss. However, since a neural network model is much more complex than a linear regression model, the way its parameters are updated is much more sophisticated, and a technique called backpropagation is used to do so. The mathematical details of backpropagation are beyond the scope of this chapter, but we encourage readers to read further on the topic.
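Two of the claims above can be checked numerically with a few lines of NumPy: stacked layers without activation functions collapse into a single linear map, and a network with no hidden layer and a sigmoid output computes the same thing as logistic regression. This is a minimal sketch with randomly generated weights. The shapes are arbitrary, and bias terms are omitted from the hidden layers for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # 4 samples, 3 input features
W1 = rng.normal(size=(3, 5))   # input layer -> hidden layer (5 neurons)
W2 = rng.normal(size=(5, 2))   # hidden layer -> output layer (2 outputs)

# (1) Without activations, two layers are equivalent to one weight matrix
two_linear_layers = (X @ W1) @ W2
one_linear_layer = X @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)
print('Linear layers collapse:', collapses)


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# (2) No hidden layer plus a sigmoid output is the logistic regression model
w = rng.normal(size=3)         # one weight per input feature
b = 0.5                        # bias (intercept) term
p = sigmoid(X @ w + b)         # predicted probability of the positive class
print('Probabilities:', np.round(p, 3))

# (3) With a non-linearity (ReLU here), the layers no longer collapse
hidden = np.maximum(0, X @ W1)
nonlinear_output = hidden @ W2
print('Same as linear model:', np.allclose(nonlinear_output, one_linear_layer))
```

The third step shows why the activation function matters: once it is in place, the network's output can no longer be reproduced by a single matrix of weights.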

## Exercise 5.09: Neural Networks – Multiclass Classifier

Neural networks can be used for multiclass classification and are by no means restricted just to binary classification. In this exercise, we will be investigating a 10-class classification problem, in other words, the MNIST digits classification task. The process for loading the MNIST training and test data is identical to the previous exercises:

- Import the required packages:
import struct

import numpy as np

import gzip

import urllib.request

import matplotlib.pyplot as plt

from array import array

from sklearn.neural_network import MLPClassifier

- Load the training/test images and the corresponding labels:
with gzip.open('../Datasets/train-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img = np.array(array("B", f.read())).reshape((size, rows, cols))
with gzip.open('../Datasets/train-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels = np.array(array("B", f.read()))
with gzip.open('../Datasets/t10k-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img_test = np.array(array("B", f.read()))\
               .reshape((size, rows, cols))
with gzip.open('../Datasets/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels_test = np.array(array("B", f.read()))

- Visualize a sample of the data:
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(img[i], cmap='gray');
    plt.title(f'{labels[i]}');
    plt.axis('off')

The output will be as follows:

- Given that the training data is so large, we will select a subset of the overall data to reduce the training time as well as the system resources required for the training process:
np.random.seed(0) # Give consistent random numbers

selection = np.random.choice(len(img), 5000)

selected_images = img[selection]

selected_labels = labels[selection]

- Again, reshape the input data in vector form for later use:
selected_images = selected_images.reshape((-1, rows * cols))

selected_images.shape

The output will be as follows:

(5000, 784)

- Next, we normalize the image data. We scale all the image values between **0** and **1**. Originally, grayscale images consist of pixels with values from **0** to **255** inclusive, where **0** is black and **255** is white. Normalization is important because it helps the gradient descent algorithm perform effectively. Unnormalized data is more prone to vanishing or exploding gradients during weight updates, which can lead to negligible or unstable weight updates:
selected_images = selected_images / 255.0
img_test = img_test / 255.0

- Construct the neural network (or multilayer perceptron) model. There are a few extra arguments, as follows: the **sgd** value for **solver** tells the model to use stochastic gradient descent, and **max_iter** sets the maximum number of iterations the solver may run while converging on a solution. The **hidden_layer_sizes** argument essentially describes the model architecture, in other words, how many hidden layers there are and how many neurons there are in each hidden layer. For example, (20, 10, 5) would mean 3 hidden layers, with 20, 10, and 5 neurons in them, respectively. The **learning_rate_init** argument gives the initial learning rate for the gradient descent algorithm:
model = MLPClassifier(solver='sgd', hidden_layer_sizes=(100,), \
                      max_iter=1000, random_state=1, \
                      learning_rate_init=.01)
model.fit(X=selected_images, y=selected_labels)

The output will be as follows:

Note

Refer to the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier for more information on the arguments.

- Determine the accuracy score against the training set:
model.score(X=selected_images, y=selected_labels)

The output will be as follows:

1.0

- Determine the first two predictions for the training set and plot the images with the corresponding predictions:
model.predict(selected_images)[:2]

The output will be as follows:

array([4, 1], dtype=uint8)

- Show the images for the first two samples of the training set to see whether we are correct:
plt.subplot(1, 2, 1)

plt.imshow(selected_images[0].reshape((28, 28)), cmap='gray');

plt.axis('off');

plt.subplot(1, 2, 2)

plt.imshow(selected_images[1].reshape((28, 28)), cmap='gray');

plt.axis('off');

The output will be as follows:

- Again, print out the probability scores provided by the model for the first sample of the training set. Confirm that there are 10 different values for each of the 10 classes in the set:
model.predict_proba(selected_images)[0]

The output will be as follows:

Notice that, in the probability array of the first sample, the fifth number (digit **4**) is the highest probability, thus indicating a prediction of **4**.

- Compute the accuracy of the model against the test set. This will provide a reasonable estimate of the model's *in the wild* performance, as it has never seen the data in the test set. It is expected that the accuracy rate of the test set will be slightly lower than the training set, given that the model has not been exposed to this data:
model.score(X=img_test.reshape((-1, rows * cols)), y=labels_test)

The output will be as follows:

0.9384

If we compare these training and test set scores (1 and 0.9384) for the neural network model with those for the logistic regression model (0.986 and 0.9002) as obtained in *Exercise 5.03, Logistic Regression – Multiclass Classifier*, we can see that the neural network model expectedly outperforms the logistic regression model. This happens because there are many more parameters to be learned in a neural network compared to a logistic regression model, making neural networks more complex and hence powerful. Conversely, if we build a neural network binary classifier with no hidden layers and using sigmoidal activation functions, it essentially becomes the same as a logistic regression model.

Note

To access the source code for this specific section, please refer to https://packt.live/2NjfiyX.

You can also run this example online at https://packt.live/3dowv4z. You must execute the entire Notebook in order to get the desired result.

Before we conclude this chapter, let's work out a last classification task using neural networks, this time, on the Breast Cancer Diagnosis classification dataset.

## Activity 5.04: Breast Cancer Diagnosis Classification Using Artificial Neural Networks

In this activity, we will be using the Breast Cancer Diagnosis dataset (available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29 or on GitHub at https://packt.live/3a7oAY8). This dataset is a part of the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The dataset contains characteristics of the cell nuclei present in the digitized image of a fine needle aspirate (FNA) of a breast mass, with the labels malignant and benign for each cell nucleus. Characteristics are features (30 in total) such as the mean radius, radius error, worst radius, mean texture, texture error, and worst texture of the cell nuclei. In this activity, we will use the features provided in the dataset to classify between malignant and benign cells.

The steps to be performed are as follows:

- Import the required packages. For this activity, we will require the **pandas** package for loading the data, the **matplotlib** package for plotting, and scikit-learn for creating the neural network model, as well as to split the dataset into training and test sets. Import all the required packages and relevant modules for these tasks:
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import train_test_split

from sklearn import preprocessing

- Load the Breast Cancer Diagnosis dataset using pandas and examine the first five rows.
- The next step is feature engineering. Different columns of this dataset have different scales of magnitude; hence, before constructing and training a neural network model, we normalize the dataset. For this, we use the **MinMaxScaler** API from **sklearn** (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), which normalizes each column's values between 0 and 1, as discussed in the *Logistic Regression* section of this chapter (see *Exercise 5.03, Logistic Regression – Multiclass Classifier*).
- Before we can construct the model, we must first convert the **diagnosis** values into labels that can be used within the model. Replace the **benign** diagnosis string with the value **0**, and the **malignant** diagnosis string with the value **1**.
- Also, in order to impartially evaluate the model, we should split the training dataset into a training and a validation set.
- Create the model using the normalized dataset and the assigned **diagnosis** labels.
- Compute the accuracy of the model against the validation set.

The output will be similar to the following:

0.9824561403508771

Note

The solution for this activity can be found via this link.
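As a rough illustration of how the scaling, splitting, and modeling steps fit together, here is a hedged sketch that is not the activity's official solution. It assumes the copy of the dataset bundled with scikit-learn, whose target is already numeric and encoded the other way around (0 for malignant, 1 for benign), an 80:20 split, and the same `MLPClassifier` settings as in *Exercise 5.09*. It also fits the scaler on the training split only, which is a design choice made here to keep validation data out of the preprocessing:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Assumption: scikit-learn's bundled copy (target: 0 = malignant, 1 = benign)
X, y = load_breast_cancer(return_X_y=True)

train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Learn each column's minimum and maximum from the training split only,
# then apply the same scaling to both splits
scaler = MinMaxScaler().fit(train_X)
train_X_scaled = scaler.transform(train_X)
valid_X_scaled = scaler.transform(valid_X)

model = MLPClassifier(solver='sgd', hidden_layer_sizes=(100,),
                      max_iter=1000, random_state=1,
                      learning_rate_init=0.01)
model.fit(train_X_scaled, train_y)

accuracy = model.score(valid_X_scaled, valid_y)
print(f'Validation accuracy: {accuracy:0.4f}')
```

The exact accuracy will differ from the value quoted in the activity, since the data source, the split, and the label encoding are assumptions of this sketch.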

If we compare this validation set accuracy result with the result(s) from *Activity 5.02: KNN Multiclass Classifier*, we find artificial neural networks to be performing better than the logistic regression model on the exact same dataset. This is also expected, as the former is a more complex and powerful type of machine learning model than the latter.

# Summary

We covered a number of powerful and extremely useful classification models in this chapter, starting with the use of OLS as a classifier, and then we observed a significant increase in performance through the use of the logistic regression classifier. We then moved on to memorizing models, such as KNN, which, while simple to fit, was able to form complex non-linear boundaries in the classification process, even with images as input information into the model. Thereafter, we discussed decision trees and the ID3 algorithm. We saw how decision trees, like KNN models, memorize the training data using rules to make predictions with quite a high degree of accuracy. Finally, we concluded our introduction to classification problems with one of the most powerful classification models – artificial neural networks. We briefly covered the basics of a feedforward neural network and also showed through an exercise how it outperformed the logistic regression model on a classification task.

In the next chapter, we will be extending what we have learned in this chapter. It will cover ensemble techniques, including boosting, and the very effective random forest model.