Overview

This chapter examines different ways of performing ensemble modeling, along with its benefits and limitations. By the end of the chapter, you will be able to recognize when a machine learning model is underfitting or overfitting its data. You will also be able to devise a bagging classifier using decision trees and implement adaptive boosting and gradient boosting models. Finally, you will be able to build a stacked ensemble using a number of classifiers.

# Introduction

In the previous chapters, we discussed the two types of supervised learning problems: regression and classification. We looked at a number of algorithms for each type and delved into how those algorithms worked.

But there are times when these algorithms, no matter how complex they are, just don't seem to perform well on the data that we have. There could be a variety of causes and reasons for this – perhaps the data is not good enough, perhaps there really is no trend where we are trying to find one, or perhaps the model itself is too complex.

Wait. What?! How can a model being too complex be a problem? If a model is too complex and there isn't enough data, the model could fit so well to the data that it learns even the noise and outliers, which is not what we want.

Often, where a single complex algorithm can give us a result that is way off from actual results, aggregating the results from a group of models can give us a result that's closer to the actual truth. This is because there is a high likelihood that the errors from all the individual models would cancel out when we take them all into account when making a prediction.

This approach of grouping multiple algorithms to give an aggregated prediction is what ensemble modeling is based on. The ultimate goal of an ensemble method is to combine several underperforming base estimators (that is, individual algorithms) in such a way that the overall performance of the system improves and the ensemble of algorithms results in a model that is more robust and can generalize well compared to an individual algorithm.
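To make the cancelling-out argument concrete, here is a minimal simulation. It rests on a strong, purely illustrative assumption: every model's error is independent random noise around the true value. The numbers used (25 models, unit noise, 2,000 data points) are hypothetical and are not taken from this chapter's dataset.

```python
import numpy as np

rng = np.random.default_rng(11)

n_points, n_models = 2000, 25
truth = rng.uniform(0, 10, n_points)      # hypothetical true target values

# Assumption: each model predicts the truth plus its own independent noise
preds = truth + rng.normal(0, 1.0, size=(n_models, n_points))

def rmse(pred):
    """Root mean squared error of a prediction vector against the truth."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

single_rmse = rmse(preds[0])              # one model on its own
ensemble_rmse = rmse(preds.mean(axis=0))  # average of all 25 models

print(f'Single model RMSE: {single_rmse:.3f}')
print(f'Ensemble RMSE:     {ensemble_rmse:.3f}')
```

With fully independent errors, averaging 25 models should shrink the error by roughly a factor of five (the square root of 25). Models trained on the same data usually make correlated errors, so the gain in practice is smaller.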

In the first half of this chapter, we will discuss how building an ensemble model can help us build a robust system that makes accurate predictions without increasing variance. We will start by talking about some reasons as to why a model may not perform well, and then move on to discussing the concepts of bias and variance, as well as overfitting and underfitting. We will introduce ensemble modeling as a solution for these performance issues and discuss different ensemble methods that could be used to overcome different types of problems when it comes to underperforming models.

We will discuss three types of ensemble methods; namely, bagging, boosting, and stacking. Each of these will be discussed right from the basic theory to discussions on which use cases each type deals with well and which use cases each type might not be a good fit for. We will also go through a number of exercises to implement the models using the scikit-learn library in Python.

Before diving deep into the topics, we shall first get familiar with a dataset that we will be using to demonstrate and understand the different concepts that are to be covered in this chapter. The next exercise enables us to do that. Before we delve into the exercise, it is necessary to become familiar with the concept of **one-hot encoding**.

# One-Hot Encoding

So, what is one-hot encoding? Well, in machine learning, we sometimes have categorical input features such as name, gender, and color. Such features contain label values rather than numeric values, such as John and Tom for name, male and female for gender, and red, blue, and green for color. Here, blue is one such label for the categorical feature – color. All machine learning models can work with numeric data, but many machine learning models cannot work with categorical data because of the way their underlying algorithms are designed. For example, decision trees can in principle work with categorical data (although scikit-learn's implementation expects numeric input), but logistic regression cannot.

In order to still make use of categorical features with models such as logistic regression, we transform such features into a usable numeric format. *Figure 6.1* shows an example of what this transformation looks like:

*Figure 6.2* shows how one-hot encoding changes the dataset, once applied:

Basically, in this example, there are 3 categories of colors – red, blue, and green, and therefore 3 binary variables are needed – **color_red**, **color_blue**, and **color_green**. A **1** value is used to represent the binary variable for the color and **0** values for the other colors. These binary variables – **color_red**, **color_blue**, and **color_green** – are also known as **dummy** variables. Armed with this information, we can proceed to our exercise.
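The color example can be reproduced in a few lines with pandas' **get_dummies** function. This is a minimal sketch on a hypothetical four-row DataFrame; the column name `color` and its values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with a single categorical feature
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One binary (dummy) column per category; dtype=int yields 0/1 values
dummies = pd.get_dummies(df['color'], prefix='color', dtype=int)

# Reorder the columns to match the order used in the text
dummies = dummies[['color_red', 'color_blue', 'color_green']]
print(dummies)
```

Each row contains exactly one **1**, in the column that matches its original label.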

## Exercise 6.01: Importing Modules and Preparing the Dataset

In this exercise, we will import all the modules we will need for this chapter and get our dataset in shape for the exercises to come:

- Import all the modules required to manipulate the data and evaluate the model:

      import pandas as pd
      import numpy as np
      %matplotlib inline
      import matplotlib.pyplot as plt
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import KFold

The dataset that we will use in this exercise is the Titanic dataset, which was introduced in the previous chapters.

- Read the dataset and print the first five rows:

      data = pd.read_csv('titanic.csv')
      data.head()

Note

The code snippet presented above assumes that the dataset is stored in the same folder as the Jupyter Notebook for this exercise. However, if the dataset is saved in the **Datasets** folder, you then need to use the following code: **data = pd.read_csv('../Datasets/titanic.csv')**

The output is as follows:

- In order to make the dataset ready for use, we will add a **preprocess** function, which will preprocess the dataset to get it into a format that is ingestible by the scikit-learn library.

  First, we create a **fix_age** function to preprocess the **Age** column so that every row has a numeric value. If the age is **null**, the function returns a value of **-1** to differentiate it from the available values; otherwise, it returns the value unchanged. We then apply this function to the **Age** column.

  Then, we convert the **Gender** column into a binary variable with **1** for female and **0** for male values, and subsequently create dummy binary columns for the **Embarked** column using pandas' **get_dummies** function. Following this, we combine the DataFrame containing the dummy columns with the remaining numerical columns to create the final DataFrame, whose values are returned by the function as a NumPy array:

      def preprocess(data):
          def fix_age(age):
              # Flag missing ages with -1 so they stay distinguishable
              if np.isnan(age):
                  return -1
              else:
                  return age

          data.loc[:, 'Age'] = data.Age.apply(fix_age)
          # Encode gender as 1 (female) or 0 (male)
          data.loc[:, 'Gender'] = data.Gender.apply(lambda s: int(s == 'female'))
          # One-hot encode the port of embarkation
          embarked = pd.get_dummies(data.Embarked,
                                    prefix='Emb')[['Emb_C', 'Emb_Q', 'Emb_S']]
          cols = ['Pclass', 'Gender', 'Age', 'SibSp', 'Parch', 'Fare']
          # Six numeric columns plus three dummy columns give nine features
          return pd.concat([data[cols], embarked], axis=1).values

- Split the dataset into training and validation sets.

  We split the dataset into two parts – one on which we will train the models during the exercises (**train**), and another on which we will make predictions to evaluate the performance of each of those models (**val**). We will use the function we wrote in the previous step to preprocess the training and validation datasets separately.

  Here, the **Survived** binary variable is the target variable that determines whether or not the individual in each row survived the sinking of the Titanic, so we create **y_train** and **y_val** as the dependent variable columns from both the splits:

      train, val = train_test_split(data, test_size=0.2, random_state=11)
      x_train = preprocess(train)
      y_train = train['Survived'].values
      x_val = preprocess(val)
      y_val = val['Survived'].values
      print(x_train.shape)
      print(y_train.shape)
      print(x_val.shape)
      print(y_val.shape)

You should get the following output:

    (712, 9)
    (712,)
    (179, 9)
    (179,)

As we can see, the dataset is now split into 2 subsets, with the training set having **712** data points and the validation set having **179** data points.

Note

To access the source code for this specific section, please refer to https://packt.live/2Nm6KHM.

You can also run this example online at https://packt.live/2YWh9zg. You must execute the entire Notebook in order to get the desired result.

In this exercise, we began by loading the data and importing the necessary Python modules. We then preprocessed different columns of our dataset to make it usable for training machine learning models. Finally, we split the dataset into two subsets. And now, before doing anything further with the dataset, we will try to understand two important concepts of machine learning – overfitting and underfitting.

# Overfitting and Underfitting

Let's say we fit a supervised learning algorithm to our data and subsequently use the model to perform a prediction on a hold-out validation set. The performance of this model will be considered to be good based on how well it generalizes, that is, how well it makes predictions for data points in an independent validation dataset.

Sometimes, we find that the model is not able to make accurate predictions and gives poor performance on the validation data. This poor performance can be the result of a model that is too simple to model the data appropriately, or a model that is too complex to generalize to the validation dataset. In the former case, the model has a *high bias* and results in *underfitting*, while, in the latter case, the model has a *high variance* and results in *overfitting*.

**Bias**

The bias in the prediction of a machine learning model represents the difference between the predicted target value and the true target value of a data point. A model is said to have a high bias if the average predicted values are far off from the true values and is conversely said to have a low bias if the average predicted values are close to the true values.

A high bias indicates that the model cannot capture the complexity in the data and is unable to identify the relevant relationships between the inputs and outputs.

**Variance**

The variance in the predictions of a machine learning model represents how scattered the predicted values are, for example, how much they change when the model is trained on a different sample of the data. A model is said to have high variance if the predictions are scattered and unstable and is conversely said to have low variance if the predictions are consistent and not very scattered.

A high variance indicates the model's inability to generalize and make accurate predictions on data points previously unseen by the model. As you can see in the following figure, the center of these circles represents the true target value of the data points. And the dots represent the predicted target value of the data points:

## Underfitting

Let's say that we fit a simple model on the training dataset, one with low model complexity, such as a simple linear model. We have fit a function that's able to represent the relationship between the **X** (input data) and **Y** (target output) data points in the training data to some extent, but we see that the training error is still high:

For example, look at the two regression plots shown in *Figure 6.5*. While the first plot shows a model that fits a straight line to the data, the second plot shows a model that attempts to fit a relatively more complex polynomial to the data, one that seems to represent the mapping between **X** and **Y** quite well.

If we look closely at the first model (on the left in the figure), the straight line is usually far away from the individual data points, as opposed to the second model where the data points are quite close to the curve. According to the definition of bias that we made in the previous section, we can say that the first model has a high bias. And, if we refer to the definition of the variance of a model, the first model is quite consistent in its predictions in that it predicts a fixed straight line-based output for a given input. Hence, the first model has a low variance and we can say that the first model demonstrates underfitting, since it shows the characteristics of a high bias and low variance; that is, while it is unable to capture the complexity in the mapping between the inputs and outputs, it is consistent in its predictions. This model will have a high prediction error on both the training data and validation data.

## Overfitting

Let's say that we trained a highly complex model that is able to make predictions on the training dataset almost perfectly. We have managed to fit a function to represent the relationship between the **X** and **Y** data points in the training data such that the predicted error on the training data is extremely low:

Looking at the two plots in *Figure 6.6*, we can see that the second plot shows a model that attempts to fit a highly complex function to the data points, compared to the plot on the left, which represents the ideal fit for the given data.

When we use the second-plot model to predict the **Y** values for **X** data points that did not appear in the training set, the predictions are likely to be way off from the corresponding true values. This is a case of overfitting, the phenomenon where the model fits the training data so closely that it is unable to generalize to new data points, since the model learns even the random noise and outliers in the training data. This model shows the characteristics of high variance and low bias: while the average predicted values would be close to the true values, they would be quite scattered compared to the true values.

Another way in which overfitting can happen is when the number of data points is less than or equal to the degree of the polynomial that we are trying to fit to the data, since such a polynomial has enough coefficients to pass through every training point exactly. We should, therefore, avoid a model where:

*degree of polynomial ≥ number of data points*

With an extremely small dataset, trying to fit even a simple model can therefore also lead to overfitting.
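The contrast between underfitting and overfitting can be sketched numerically by fitting polynomials of increasing degree to a small noisy sample and comparing the error on the training points with the error on fresh validation points. The data below is synthetic: the sine curve, the noise level, and the degrees 1, 3, and 12 are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)

def make_data(n):
    """Hypothetical noisy samples from a smooth underlying curve."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    return x, y

x_tr, y_tr = make_data(15)     # small training set
x_va, y_va = make_data(200)    # larger validation set

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    errors[degree] = (mse(coeffs, x_tr, y_tr), mse(coeffs, x_va, y_va))
    print(f'degree {degree:2d}: train MSE = {errors[degree][0]:.4f}, '
          f'validation MSE = {errors[degree][1]:.4f}')
```

The training error can only fall as the degree rises, because each higher-degree polynomial contains the lower-degree ones as special cases. The pattern to look for is in the validation column: a straight line would typically leave both errors high (underfitting), while the degree-12 fit would typically show a very low training error alongside a much larger validation error (overfitting).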

## Overcoming the Problem of Underfitting and Overfitting

From the previous sections, we can see that, as we move from an overly simplistic to an overly complex model, we go from having an underfitting model with a high bias and low variance to an overfitting model with a low bias and high variance. The purpose of a supervised machine learning algorithm is to achieve a low bias with low variance and arrive at a place between underfitting and overfitting. This will also help the algorithm generalize well from the training data to validation data points, resulting in good prediction performance on data the model has never seen.

The best way to improve performance when the model underfits the data is to increase the model complexity so as to identify the relevant relationships in the data. This can be done by adding new features, or by creating an ensemble of high-bias models. However, in this case, adding more data to train on would not help, as the constraining factor is model complexity and more data will not help to reduce the model's bias.

Overfitting is, however, more difficult to tackle. Here are some common techniques used to overcome the problem posed by overfitting:

- **Getting more data**: A highly complex model can easily overfit to a small dataset, but will not be able to as easily on a larger dataset.
- **Dimensionality reduction**: Reducing the number of features can help make the model less complex.
- **Regularization**: A new term is added to the cost function to adjust the coefficients (especially the high-degree coefficients in linear regression) toward a low value.
- **Ensemble modeling**: Aggregating the predictions of several overfitting models can effectively eliminate high variance in prediction and perform better than individual models that overfit to the training data.

We will talk in more detail about ensemble modeling techniques in the following sections of this chapter. Some of the common types of ensembles are:

- **Bagging**: A shorter term for **Bootstrap Aggregation**, this technique is also used to decrease the model's variance and avoid overfitting. It involves taking a subset of features and data points at a time, training a model on each subset, and subsequently aggregating the results from all the models into a final prediction.
- **Boosting**: This technique is used to reduce bias rather than to reduce variance and involves incrementally training new models that focus on the misclassified data points in the previous model.
- **Stacking**: The aim of this technique is to increase the predictive power of the classifier, as it involves training multiple models and then using a combiner algorithm to make the final prediction by using the predictions from all these models as additional inputs.

Let's start with bagging, and then move on to boosting and stacking.

# Bagging

The term *bagging* is derived from a technique called bootstrap aggregation. In order to implement a successful predictive model, it's important to know in what situation we could benefit from using bootstrapping methods to build ensemble models. Such models are used extensively in both industry and academia.

One such application is the quality assessment of Wikipedia articles. Features such as **article_length**, **number_of_references**, **number_of_headings**, and **number_of_images** are used to build a classifier that classifies Wikipedia articles into low- or high-quality articles. Out of the several models that were tried for this task, the random forest model – a well-known bagging-based ensemble classifier that we will discuss in our next section – outperforms all other models such as SVM, logistic regression, and even neural networks, with the best precision and recall scores of **87.3%** and **87.2%**, respectively. This demonstrates the power of such models as well as their potential to be used in real-life applications.

In this section, we'll talk about a way to use bootstrap methods to create an ensemble model that minimizes variance and look at how we can build an ensemble of decision trees, that is, the random forest algorithm. But what is bootstrapping and how does it help us build robust ensemble models?

## Bootstrapping

The bootstrap method essentially refers to drawing multiple samples (each known as a resample) from the dataset, where each resample consists of data points chosen at random with replacement. As a result, the resamples can overlap, a data point can appear more than once in the same resample, and each data point has an equal probability of being selected from the overall dataset:

From the previous diagram, we can see that each of the five bootstrapped samples taken from the primary dataset is different and has different characteristics. As such, training models on each of these resamples would result in different predictions.

The following are the advantages of bootstrapping:

- Each resample can contain different characteristics from that of the entire dataset, allowing us a different perspective of how the data behaves.
- Algorithms that make use of bootstrapping tend to be more robust and handle unseen data better, especially on smaller datasets that have a tendency to cause overfitting.
- The bootstrap method can test the stability of a prediction by testing models using datasets with different variations and characteristics, resulting in a model that is more robust.
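As a minimal sketch of the resampling step, the following draws five bootstrap resamples from a hypothetical dataset of ten data points (represented here simply by their indices) using NumPy. The dataset size and the number of resamples are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)

data = np.arange(10)   # hypothetical dataset: ten data points, identified by index
n_resamples = 5

# Sampling with replacement: every point is equally likely on every draw,
# so a point can appear several times in a resample, or not at all
resamples = [rng.choice(data, size=len(data), replace=True)
             for _ in range(n_resamples)]

for i, sample in enumerate(resamples, start=1):
    n_unique = len(np.unique(sample))
    print(f'Resample {i}: {np.sort(sample)} ({n_unique} unique points)')
```

Because the draws are made with replacement, each resample usually leaves out some of the original points, which is why models trained on different resamples see slightly different data.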

Now that we are aware of what bootstrapping is, what exactly does a bagging ensemble do? In simple words, bagging means aggregating the outputs of parallel models, each of which is built by bootstrapping data. It is essentially an ensemble model that generates multiple versions of a predictor on each resample and uses these to get an aggregated predictor. The aggregation step gives us a *meta prediction*, which involves taking an average over the models when predicting a continuous numerical value for regression problems, while taking a *vote* when predicting a class for classification problems. Voting can be of two types:

- Hard voting (class-based)
- Soft voting (probabilistic)

In hard voting, we consider the majority among the classes predicted by the base estimators, whereas in soft voting, we average the probabilities of belonging to a class and then predict the class.
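The difference between the two voting schemes can be seen on a small made-up example. The probabilities below are hypothetical outputs from three base estimators for four data points in a binary problem; they do not come from any model in this chapter.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1
# (rows = base estimators, columns = data points)
proba = np.array([
    [0.90, 0.40, 0.40, 0.20],
    [0.60, 0.45, 0.45, 0.30],
    [0.30, 0.95, 0.60, 0.40],
])

# Hard voting: each estimator casts a class vote, and the majority wins
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) > proba.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then pick the class
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print('Hard voting:', hard)   # [1 0 0 0]
print('Soft voting:', soft)   # [1 1 0 0]
```

The two schemes disagree on the second data point: two estimators lean slightly toward class 0, but the third is so confident in class 1 that the averaged probability (0.6) tips the soft vote.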

The following diagram gives us a visual representation of how a bagging estimator is built from the bootstrap sampling shown in *Figure 6.7*:

Since each model is essentially independent of the others, all the base models can be trained in parallel, which considerably speeds up the training process and allows us to take advantage of the computational power we have on our hands today. Training is faster still when each resampled dataset is smaller than the original dataset.

Bagging essentially helps to reduce the variance of the entire ensemble. It does so by introducing randomization into its formulation procedure and is usually used with a base predictor that has a tendency to overfit the training data. The primary point of consideration here would be the stability (or lack thereof) of the training dataset: bagging proves effective in cases where a slight perturbation in data leads to a significant change in model results, that is, where the model has a high variance. This is how bagging helps in countering variance.

**scikit-learn** uses **BaggingClassifier** and **BaggingRegressor** to implement generic bagging ensembles for classification and regression tasks, respectively. The primary inputs to these are the base estimators to use on each resample, along with the number of estimators to use (that is, the number of resamples).

## Exercise 6.02: Using the Bagging Classifier

In this exercise, we will use scikit-learn's bagging classifier as our ensemble, with **DecisionTreeClassifier** as the base estimator. We know that decision trees are prone to overfitting, and so will have a high variance and low bias, both being important characteristics for the base estimators to be used in bagging ensembles.

The dataset that we will use in this exercise is the Titanic dataset. Please complete *Exercise 6.01, Importing Modules and Preparing the Dataset,* before you embark on this exercise:

- Import the base and ensemble classifiers:

      from sklearn.tree import DecisionTreeClassifier
      from sklearn.ensemble import BaggingClassifier

- Specify the hyperparameters and initialize the model.

  Here, we will first specify the hyperparameters of the base estimator, for which we are using the decision tree classifier with the entropy or information gain as the splitting criterion. We will not specify any limits on the depth of the tree or the size/number of leaves, so that each tree can grow fully. Following this, we will define the hyperparameters for the bagging classifier and pass the base estimator object to the classifier as a hyperparameter.

  We will take 50 base estimators for our example, which will run in parallel and utilize all the processors available in the machine (which is done by specifying **n_jobs=-1**). Additionally, we will specify **max_samples** as 0.5, indicating that the number of data points in each bootstrap sample should be half that in the total dataset. We will also set a random state (to any arbitrary value, which will stay constant throughout) to maintain the reproducibility of the results:

      dt_params = {'criterion': 'entropy', 'random_state': 11}
      dt = DecisionTreeClassifier(**dt_params)
      # Note: scikit-learn 1.2 renamed 'base_estimator' to 'estimator'
      # (the old name was removed in 1.4); use 'estimator' on newer versions
      bc_params = {'base_estimator': dt, 'n_estimators': 50,
                   'max_samples': 0.5, 'random_state': 11, 'n_jobs': -1}
      bc = BaggingClassifier(**bc_params)

- Fit the bagging classifier model to the training data and calculate the prediction accuracy.

  Let's now fit the bagging classifier and find the meta predictions for both the training and validation sets. Following this, let's find the prediction accuracy on the training and validation datasets:

      bc.fit(x_train, y_train)
      bc_preds_train = bc.predict(x_train)
      bc_preds_val = bc.predict(x_val)
      print('Bagging Classifier:\n> Accuracy on training data = {:.4f}'
            '\n> Accuracy on validation data = {:.4f}'
            .format(accuracy_score(y_true=y_train, y_pred=bc_preds_train),
                    accuracy_score(y_true=y_val, y_pred=bc_preds_val)))

The output is as follows:

    Bagging Classifier:
    > Accuracy on training data = 0.9270
    > Accuracy on validation data = 0.8659

- Fit the decision tree model to the training data to compare prediction accuracy.

  Let's also fit the decision tree (from the object we initialized in *Step 2*) so that we will be able to compare the prediction accuracies of the ensemble with that of the base predictor:

      dt.fit(x_train, y_train)
      dt_preds_train = dt.predict(x_train)
      dt_preds_val = dt.predict(x_val)
      print('Decision Tree:\n> Accuracy on training data = {:.4f}'
            '\n> Accuracy on validation data = {:.4f}'
            .format(accuracy_score(y_true=y_train, y_pred=dt_preds_train),
                    accuracy_score(y_true=y_val, y_pred=dt_preds_val)))

The output is as follows:

    Decision Tree:
    > Accuracy on training data = 0.9831
    > Accuracy on validation data = 0.7709

Here, we can see that, although the decision tree has a much higher training accuracy than the bagging classifier, its accuracy on the validation dataset is lower, a clear signal that the decision tree is overfitting to the training data. The bagging ensemble, on the other hand, reduces the overall variance and results in a much more accurate prediction.

Note

To access the source code for this specific section, please refer to https://packt.live/37O6735.

You can also run this example online at https://packt.live/2Nh3ayB. You must execute the entire Notebook in order to get the desired result.

Next, we will look at perhaps the most widely known bagging-based machine learning model there is, the random forest model. Random forest is a bagging ensemble model that uses a decision tree as the base estimator.

## Random Forest

An issue that is commonly faced with decision trees is that the split on each node is performed using a **greedy** algorithm that minimizes the entropy of the leaf nodes. Keeping this in mind, the base estimator decision trees in a bagging classifier can still be similar in terms of the features they split on, and so can have predictions that are quite similar. However, bagging is only useful in reducing the variance in predictions if the predictions from the base models are not correlated.

The random forest algorithm attempts to overcome this problem by not only bootstrapping the data points in the overall training dataset, but also bootstrapping the features available for each tree to split on. This ensures that when the greedy algorithm is searching for the *best* feature to split on, the overall *best* feature may not always be available in the bootstrapped features for the base estimator, and so would not be chosen – resulting in base trees that have different structures. This simple tweak lets the base estimators be trained in such a way that the predictions from each tree in the forest have a lower probability of being correlated to the predictions from other trees.

Each base estimator in the random forest has a random sample of data points as well as a random sample of features. And since the ensemble is made up of decision trees, the algorithm is called a random forest.
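One way to observe this effect is to compare which feature each tree uses for its very first split, with and without feature subsampling. The sketch below uses a synthetic dataset from scikit-learn's **make_classification** rather than the Titanic data, and `root_features` is a hypothetical helper written for this illustration. Note that scikit-learn's **RandomForestClassifier** draws the random feature subset at every split rather than once per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): 9 features, 3 informative
X, y = make_classification(n_samples=500, n_features=9, n_informative=3,
                           n_redundant=0, random_state=11)

def root_features(max_features):
    """Fit a forest and return the feature index used at each tree's root."""
    rf = RandomForestClassifier(n_estimators=50, max_features=max_features,
                                random_state=11)
    rf.fit(X, y)
    # tree_.feature[0] holds the index of the feature split on at the root
    return [int(est.tree_.feature[0]) for est in rf.estimators_]

all_feats = root_features(None)   # every feature is a candidate at each split
half_feats = root_features(0.5)   # about half of the features per split

print('Root-split counts, all features: ', np.bincount(all_feats, minlength=9))
print('Root-split counts, half features:', np.bincount(half_feats, minlength=9))
```

When every feature is available, the root splits would be expected to concentrate on the one or two strongest features; with subsampling, they would typically spread across more features, which is what makes the trees less correlated.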

## Exercise 6.03: Building the Ensemble Model Using Random Forest

The two primary parameters that random forest takes are the fraction of features and the fraction of data points to bootstrap on, to train each base decision tree.

In this exercise, we will use scikit-learn's random forest classifier to build the ensemble model.

The dataset that we will use in this exercise is the Titanic dataset. This exercise is a continuation of *Exercise 6.02, Using the Bagging Classifier*:

- Import the ensemble classifier:

      from sklearn.ensemble import RandomForestClassifier

- Specify the hyperparameters and initialize the model.

  Here, we will use entropy as the splitting criterion for the decision trees in a forest comprising 100 trees. As before, we will not specify any limit regarding the depth of the trees, but we will require each leaf to contain at least 10 data points (**min_samples_leaf**), which keeps the individual trees from growing fully. Unlike the bagging classifier, for which we specified **max_samples** during initialization, here we specify only **max_features**, indicating the number (or fraction) of features in the bootstrap sample. We will specify the value for this as 0.5, so that only about half of the nine features are considered at each split (scikit-learn draws this random subset of features afresh for every split rather than once per tree):

      rf_params = {'n_estimators': 100, 'criterion': 'entropy',
                   'max_features': 0.5, 'min_samples_leaf': 10,
                   'random_state': 11, 'n_jobs': -1}
      rf = RandomForestClassifier(**rf_params)

- Fit the random forest classifier model to the training data and calculate the prediction accuracy.

  Let's now fit the random forest model and find the meta predictions for both the training and validation sets. Following this, let's find the prediction accuracy on the training and validation datasets:

      rf.fit(x_train, y_train)
      rf_preds_train = rf.predict(x_train)
      rf_preds_val = rf.predict(x_val)
      print('Random Forest:\n> Accuracy on training data = {:.4f}'
            '\n> Accuracy on validation data = {:.4f}'
            .format(accuracy_score(y_true=y_train, y_pred=rf_preds_train),
                    accuracy_score(y_true=y_val, y_pred=rf_preds_val)))

The output is as follows:

    Random Forest:
    > Accuracy on training data = 0.8385
    > Accuracy on validation data = 0.8771

If we compare the prediction accuracies of random forest on our dataset to those of the bagging classifier, we can see that the accuracy on the validation set is similar (0.8771 versus 0.8659), although the bagging classifier has a higher accuracy on the training dataset (0.9270 versus 0.8385).

Note

To access the source code for this specific section, please refer to https://packt.live/3dlvGtd.

You can also run this example online at https://packt.live/2NkSPS5. You must execute the entire Notebook in order to get the desired result.

# Boosting

The second ensemble technique we'll be looking at is boosting, which involves incrementally training new models that focus on the misclassified data points in the previous model and utilizes weighted averages to turn weak models (underfitting models having a high bias) into stronger models. Unlike bagging, where each base estimator could be trained independently of the others, the training of each base estimator in a boosted algorithm depends on the previous one.

Although boosting also uses the concept of bootstrapping, it's done differently from bagging, since each sample of data is weighted, implying that some bootstrapped samples can be used for training more often than other samples. When training each model, the algorithm keeps track of which features are most useful and which data samples have the most prediction error; these are given higher weightage and are considered to require more iterations to properly train the model.

When predicting the output, the boosting ensemble takes a weighted average of the predictions from each base estimator, giving a higher weight to the ones that had lower errors during the training stage. During training, meanwhile, the weights of the data points that are misclassified by the model in an iteration are increased so that the next model is more likely to classify them correctly.

As was the case with bagging, the results from all the boosting base estimators are aggregated to produce a meta prediction. However, unlike bagging, the accuracy of a boosted ensemble increases significantly with the number of base estimators in the boosted ensemble:

In the diagram, we can see that, after each iteration, the misclassified points have increased weights (represented by larger icons) so that the next base estimator that is trained is able to focus on those points. The final predictor has aggregated the decision boundaries from each of its base estimators.

Boosting is used extensively in real-world applications. For example, the commercial web search engines Yahoo and Yandex use variants of boosting in their machine-learned ranking engines. Ranking is the task of finding the most relevant documents given a search query. Yandex, in particular, uses a gradient boosting-based approach to build an ensemble tree model that outperforms other models, including Yandex's previously used models, achieving a discounted cumulative gain of **4.14123**. This shows how useful boosting-based modeling can prove in real-life scenarios.

Note

Read more on Yandex at the following link: http://webmaster.ya.ru/replies.xml?item_no=5707&ncrnd=5118.

## Adaptive Boosting

Let's now talk about a boosting technique called **adaptive boosting**, which is best used to boost the performance of decision stumps for binary classification problems. Decision stumps are essentially decision trees with a maximum depth of one (only one split is made on a single feature), and, as such, are weak learners. The primary principle that adaptive boosting works on is the same: to improve on the areas where the base estimator fails, and thereby turn an ensemble of weak learners into a strong learner.

To start with, the first base estimator takes a bootstrap of data points from the main training set and fits a decision stump to classify the sampled points, after which the trained decision stump is used to classify the complete training data. For the samples that are misclassified, the weights are increased so that there is a higher probability of these data points being selected in the bootstrap for the next base estimator. A decision stump is again trained on the new bootstrap to classify the data points in the sample. Subsequently, the mini ensemble comprising the two base estimators is used to classify the data points in the entire training set. The misclassified data points from the second round are given a higher weight to improve their probability of selection, and so on, until the ensemble reaches the limit on the number of base estimators it should contain.
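The procedure described above can be sketched in a few lines. This is a simplified illustration rather than scikit-learn's implementation: the data is synthetic, the estimator weight formula is the classic AdaBoost one, each round updates the weights from the newest stump's mistakes rather than from the whole mini ensemble, and names such as **ensemble_predict** are our own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for a real training set; labels in {-1, +1}.
X, y = make_classification(n_samples=300, n_features=6, random_state=11)
y = np.where(y == 1, 1, -1)

rng = np.random.default_rng(11)
n_samples = X.shape[0]
weights = np.full(n_samples, 1 / n_samples)   # start with uniform weights
stumps, alphas = [], []

for _ in range(20):
    # Draw a bootstrap in which heavily weighted points are more likely to appear.
    idx = rng.choice(n_samples, size=n_samples, replace=True, p=weights)
    stump = DecisionTreeClassifier(max_depth=1, random_state=11)
    stump.fit(X[idx], y[idx])

    # Evaluate the stump on the complete training set.
    miss = stump.predict(X) != y
    error = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)

    # Estimator weight: the lower the weighted error, the larger its say.
    alpha = 0.5 * np.log((1 - error) / error)

    # Increase the weights of misclassified points, decrease the rest, renormalize.
    weights *= np.exp(np.where(miss, alpha, -alpha))
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def ensemble_predict(X_new):
    """Weighted vote over all stumps."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

first_stump_acc = (stumps[0].predict(X) == y).mean()
ensemble_acc = (ensemble_predict(X) == y).mean()
print(f'First stump accuracy: {first_stump_acc:.4f}')
print(f'Ensemble accuracy:    {ensemble_acc:.4f}')
```

Comparing the two printed accuracies gives a rough feel for how much the later stumps, which were trained mostly on the hard points, add to the first one.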

One drawback of adaptive boosting is that the algorithm is easily influenced by noisy data points and outliers since it tries to fit every point perfectly. As such, it is prone to overfitting if the number of estimators is very high.

## Exercise 6.04: Implementing Adaptive Boosting

In this exercise, we'll use scikit-learn's implementation of adaptive boosting for classification, **AdaBoostClassifier**:

We will again be using the Titanic dataset. This exercise is a continuation of *Exercise 6.03, Building the Ensemble Model Using Random Forest*:

- Import the classifier:
from sklearn.ensemble import AdaBoostClassifier

- Specify the hyperparameters and initialize the model.
Here, we will first specify the hyperparameters of the base estimator, for which we are using the decision tree classifier with a maximum depth of one, that is, a decision stump. Following this, we will define the hyperparameters for the AdaBoost classifier and pass the base estimator object to the classifier as a hyperparameter (note that scikit-learn 1.2 renamed the **base_estimator** argument to **estimator**, and later releases no longer accept the old name):

dt_params = {'max_depth': 1, 'random_state': 11}

dt = DecisionTreeClassifier(**dt_params)

ab_params = {'n_estimators': 100, 'base_estimator': dt, \

'random_state': 11}

ab = AdaBoostClassifier(**ab_params)

- Fit the model to the training data.
Let's now fit the **AdaBoost** model and find the meta predictions for both the training and validation sets. Following this, let's find the prediction accuracy on the training and validation datasets:

ab.fit(x_train, y_train)

ab_preds_train = ab.predict(x_train)

ab_preds_val = ab.predict(x_val)

print('Adaptive Boosting:\n> Accuracy on training data = {:.4f}'\

'\n> Accuracy on validation data = {:.4f}'\

.format(accuracy_score(y_true=y_train, \

y_pred=ab_preds_train), \

accuracy_score(y_true=y_val, y_pred=ab_preds_val)

))

The output is as follows:

Adaptive Boosting:

> Accuracy on training data = 0.8272

> Accuracy on validation data = 0.8547

- Calculate the prediction accuracy of the model on the training and validation data for a varying number of base estimators.
Earlier, we claimed that the accuracy tends to increase with an increasing number of base estimators, but also that the model has a tendency to overfit if too many base estimators are used. Let's calculate the prediction accuracies so that we can find the point where the model begins to overfit the training data:

ab_params = {'base_estimator': dt, 'random_state': 11}

n_estimator_values = list(range(10, 210, 10))

train_accuracies, val_accuracies = [], []

for n_estimators in n_estimator_values:

ab = AdaBoostClassifier(n_estimators=n_estimators, **ab_params)

ab.fit(x_train, y_train)

ab_preds_train = ab.predict(x_train)

ab_preds_val = ab.predict(x_val)

train_accuracies.append(accuracy_score(y_true=y_train, \

y_pred=ab_preds_train))

val_accuracies.append(accuracy_score(y_true=y_val, \

y_pred=ab_preds_val))

- Plot a line graph to visualize the trend of the prediction accuracies on both the training and validation datasets:
plt.figure(figsize=(10,7))

plt.plot(n_estimator_values, train_accuracies, label='Train')

plt.plot(n_estimator_values, val_accuracies, label='Validation')

plt.ylabel('Accuracy score')

plt.xlabel('n_estimators')

plt.legend()

plt.show()

The output is as follows:

As was mentioned earlier, we can see that the training accuracy almost consistently increases as the number of decision tree stumps increases from 10 to 200. However, the validation accuracy fluctuates between 0.84 and 0.86 and begins to drop as the number of decision stumps goes higher. This happens because the AdaBoost algorithm is trying to fit the noisy data points and outliers as well.
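Refitting the ensemble for every value of **n_estimators** is simple but repeats a lot of work. As a hedged alternative, the sketch below fits one large ensemble and uses the **staged_predict()** method to score it after each added estimator. The data here is synthetic (an assumption made so that the snippet runs on its own), and the default base estimator of **AdaBoostClassifier**, a decision stump, is used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy data standing in for the Titanic features.
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.1,
                           random_state=11)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          random_state=11)

# Fit once with the largest ensemble size we want to consider.
ab = AdaBoostClassifier(n_estimators=200, random_state=11)
ab.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ... estimators.
val_acc = [accuracy_score(y_va, pred) for pred in ab.staged_predict(X_va)]

best_n = int(np.argmax(val_acc)) + 1
print(f'Best validation accuracy {max(val_acc):.4f} with {best_n} estimators')
```

The list **val_acc** can be plotted in the same way as **val_accuracies** above, with one point per estimator instead of one point per ten estimators.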

Note

To access the source code for this specific section, please refer to https://packt.live/2V4zB7K.

You can also run this example online at https://packt.live/3dhSBpu. You must execute the entire Notebook in order to get the desired result.

## Gradient Boosting

Gradient boosting is an extension of the boosting method that frames boosting as an optimization problem. A loss function is defined that is representative of the error residuals (the difference between the predicted and true values), and the gradient descent algorithm is used to optimize the loss function.

In the first step, a base estimator (which would be a weak learner) is added and trained on the entire training dataset. The loss associated with its predictions is calculated and, in order to reduce the error residuals, a new base estimator is added that focuses on the data points where the existing estimators are performing poorly. The algorithm continues iteratively, adding new base estimators and recomputing the loss so that each addition moves the ensemble toward smaller residuals.

In the case of adaptive boosting, decision stumps were used as the weak learners for the base estimators. However, for gradient boosting methods, larger trees can be used, but the weak learners should still be constrained by providing a limit to the maximum number of layers, nodes, splits, or leaf nodes. This ensures that the base estimators are still weak learners, but they can be constructed in a greedy manner.

From the previous chapters, we know that the gradient descent algorithm can be used to minimize a loss function with respect to a set of parameters, such as the coefficients in a regression equation. When building an ensemble, however, we have decision trees instead of parameters that need to be optimized. After calculating the loss at each step, the gradient descent algorithm has to choose the parameters of the new tree that's to be added to the ensemble in such a way that the loss is reduced. This approach is more commonly known as **functional gradient descent**.
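The idea is perhaps easiest to see for regression with a squared-error loss, where the negative gradient of the loss with respect to the current prediction is simply the residual. The following is a minimal sketch under that assumption; the data is synthetic, and the learning rate, tree depth, and the **boosted_predict** helper are our own illustrative choices rather than anything prescribed by scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption for illustration).
rng = np.random.default_rng(11)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction (the mean minimizes the squared loss).
baseline = y.mean()
prediction = np.full_like(y, baseline)
trees, losses = [], []

for _ in range(n_trees):
    # For L = 0.5 * (y - F)^2, the negative gradient w.r.t. F is y - F.
    residuals = y - prediction

    # Fit a small tree to the negative gradient: one step in function space.
    tree = DecisionTreeRegressor(max_depth=2, random_state=11)
    tree.fit(X, residuals)

    # Move the ensemble's prediction a small step in that direction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    losses.append(np.mean((y - prediction) ** 2))

def boosted_predict(X_new):
    """Sum the baseline and the scaled contribution of every tree."""
    return baseline + learning_rate * sum(t.predict(X_new) for t in trees)

print(f'MSE after 1 tree:   {losses[0]:.4f}')
print(f'MSE after {n_trees} trees: {losses[-1]:.4f}')
```

Each tree here plays the role that a gradient step on the coefficients plays in ordinary gradient descent: instead of nudging parameters, we add a function that points in the direction that reduces the loss.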

## Exercise 6.05: Implementing GradientBoostingClassifier to Build an Ensemble Model

Two important parameters that the gradient boosting classifier takes are the fraction of features (**max_features**) and the fraction of data points (**subsample**) to sample when training each base decision tree.

In this exercise, we will use scikit-learn's gradient boosting classifier to build the boosting ensemble model.

This exercise is a continuation of *Exercise 6.04, Implementing Adaptive Boosting*:

- Import the ensemble classifier:
from sklearn.ensemble import GradientBoostingClassifier

- Specify the hyperparameters and initialize the model.
Here, we will use 100 decision trees as the base estimator, with each tree having a maximum depth of three and a minimum of five samples in each of its leaves. Although we are not using decision stumps, as in the previous example, the tree is still small and would be considered a weak learner:

gbc_params = {'n_estimators': 100, 'max_depth': 3, \

'min_samples_leaf': 5, 'random_state': 11}

gbc = GradientBoostingClassifier(**gbc_params)

- Fit the gradient boosting model to the training data and calculate the prediction accuracy.
Let's now fit the ensemble model and find the meta predictions for both the training and validation set. Following this, we will find the prediction accuracy on the training and validation datasets:

gbc.fit(x_train, y_train)

gbc_preds_train = gbc.predict(x_train)

gbc_preds_val = gbc.predict(x_val)

print('Gradient Boosting Classifier:'\

'\n> Accuracy on training data = {:.4f}'\

'\n> Accuracy on validation data = {:.4f}'\

.format(accuracy_score(y_true=y_train, \

y_pred=gbc_preds_train), \

accuracy_score(y_true=y_val, y_pred=gbc_preds_val)))

The output is as follows:

Gradient Boosting Classifier:

> Accuracy on training data = 0.8961

> Accuracy on validation data = 0.8771

Note

To access the source code for this specific section, please refer to https://packt.live/37QANjZ.

You can also run this example online at https://packt.live/2YljJ2D. You must execute the entire Notebook in order to get the desired result.

We can see that the gradient boosting ensemble has greater accuracy on both the training and validation datasets compared to those for the adaptive boosting ensemble.
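The exercise left the row and feature sampling at their defaults. As a hedged illustration of those two settings, the sketch below sets scikit-learn's **subsample** and **max_features** arguments on synthetic data; the values 0.8 and 0.5 are arbitrary choices for demonstration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features (an assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=11)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          random_state=11)

gbc_stochastic = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, min_samples_leaf=5,
    subsample=0.8,      # each tree is fit on a random 80% of the rows
    max_features=0.5,   # each split considers a random 50% of the features
    random_state=11,
)
gbc_stochastic.fit(X_tr, y_tr)

val_accuracy = accuracy_score(y_va, gbc_stochastic.predict(X_va))
print(f'Validation accuracy = {val_accuracy:.4f}')
```

Sampling rows and features in this way injects randomness into each tree, which often acts as a regularizer; whether it helps on a given dataset is something to check on the validation set.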

# Stacking

Stacking, or stacked generalization, is also called **meta ensembling**. It is a model ensembling technique that consists of combining data from multiple models' predictions and using them as features to generate a new model. The stacked model will most likely outperform each of the individual models due to the smoothing effect it adds, as well as due to its ability to "choose" the base model that performs best in certain scenarios. Keeping this in mind, stacking is usually most effective when each of the base models is significantly different from each other.

Stacking is widely used in real-world applications. One popular example comes from the well-known Netflix competition whose two top performers built solutions that were based on stacking models. Netflix is a well-known streaming platform and the competition was about building the best recommendation engine. The winning algorithm was based on feature-weighted-linear-stacking, which basically had meta features derived from individual models/algorithms such as **Singular Value Decomposition** (**SVD**), **Restricted Boltzmann Machines** (**RBMs**), and **K-Nearest Neighbors** (**KNN**). One such meta feature was the standard deviation of the prediction of a 60-factor ordinal SVD. These meta features were found necessary to be able to achieve the winning model, which proves the power of stacking in real-world applications.

Stacking uses the predictions of the base models as additional features when training the final model – these are known as **meta features**. The stacked model essentially acts as a classifier that determines where each model is performing well and where it is performing poorly.

However, you cannot simply train the base models on the full training data, generate predictions on the full validation dataset, and then output these for second-level training. This runs the risk of your base model predictions already having "seen" the test set and therefore overfitting when feeding these predictions.

It is important to note that the value of the meta features for each row cannot be predicted using a model that contained that row in the training data, as we then run the risk of overfitting since the base predictions would have already "seen" the target variable for that row. The common practice is to divide the training data into **k** subsets so that, when finding the meta features for each of those subsets, we only train the model on the remaining data. Doing this also avoids the problem of overfitting the data the model has already "seen":

The preceding diagram shows how this is done: we divide the training data into **k** folds and find the predictions from the base models on each fold by training the model on the remaining **k-1** folds. So, once we have the meta predictions for each of the folds, we can use those meta predictions along with the original features to train the stacked model.
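For a single base model, this out-of-fold procedure can be sketched with scikit-learn's **cross_val_predict()** function, which returns, for every row, the prediction made by a model that did not see that row during training. The data below is synthetic and the choice of k-nearest neighbors as the base model is only an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=11)

kf = KFold(n_splits=5, shuffle=True, random_state=11)
knn = KNeighborsClassifier(n_neighbors=4)

# Each row's prediction comes from a model trained on the other four folds.
oof_pred = cross_val_predict(knn, X, y, cv=kf)

# Append the out-of-fold predictions to the original features as a meta feature.
X_with_meta = np.hstack([X, oof_pred.reshape(-1, 1)])
print(X_with_meta.shape)
```

The resulting array has one more column than the original feature matrix, and that extra column is safe to use for training the stacked model because no prediction was made by a model that had seen the corresponding target value.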

## Exercise 6.06: Building a Stacked Model

In this exercise, we will use a support vector machine (scikit-learn's **LinearSVC**) and k-nearest neighbors (scikit-learn's **KNeighborsClassifier**) as the base predictors, and the stacked model will be a logistic regression classifier.

This exercise is a continuation of *Exercise 6.05, Implementing GradientBoostingClassifier to Build an Ensemble Model*:

- Import the base models and the model used for stacking:
# Base models

from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import LinearSVC

# Stacking model

from sklearn.linear_model import LogisticRegression

- Create a new training set with additional columns for predictions from base predictors.
We need to create two new columns for the predicted values from each model, to be used as features for the ensemble model in both the validation and training sets. Since a NumPy array has a fixed size and columns cannot be appended to it in place, we will create a new array that will have the same number of rows as the training dataset, and two columns more than those in the training dataset. Once the dataset is created, let's print it to see what it looks like:

x_train_with_metapreds = np.zeros((x_train.shape[0], \

x_train.shape[1]+2))

x_train_with_metapreds[:, :-2] = x_train

x_train_with_metapreds[:, -2:] = -1

print(x_train_with_metapreds)

The output is as follows:

As we can see, there are two extra columns filled with *-1* values at the end of each row.

- Train base models using the **k-fold** strategy.
Let's take *k=5*. For each of the five folds, train on the other four folds and predict on the fifth fold. These predictions should then be added to the placeholder columns for base predictions in the new NumPy array.

First, we initialize the **KFold** object with the value of **k** and a random state to maintain reproducibility (recent scikit-learn versions only accept a random state here when **shuffle=True** is also set). The **kf.split()** function takes the dataset to split as an input and returns an iterator, with each element in the iterator corresponding to the list of indices in the training and validation folds respectively. These index values in each loop over the iterator can be used to subdivide the training data for training and prediction for each row.

Once the data is adequately divided, we train the two base predictors on four-fifths of the data and predict the values on the remaining one-fifth of the rows. These predictions are then inserted into the two placeholder columns we initialized with **-1** in *Step 2*:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=11)

for train_indices, val_indices in kf.split(x_train):

kfold_x_train, kfold_x_val = x_train[train_indices], \

x_train[val_indices]

kfold_y_train, kfold_y_val = y_train[train_indices], \

y_train[val_indices]

svm = LinearSVC(random_state=11, max_iter=1000)

svm.fit(kfold_x_train, kfold_y_train)

svm_pred = svm.predict(kfold_x_val)

knn = KNeighborsClassifier(n_neighbors=4)

knn.fit(kfold_x_train, kfold_y_train)

knn_pred = knn.predict(kfold_x_val)

x_train_with_metapreds[val_indices, -2] = svm_pred

x_train_with_metapreds[val_indices, -1] = knn_pred

- Create a new validation set with additional columns for predictions from base predictors.
As we did in *Step 2*, we will add two placeholder columns for the base model predictions in the validation dataset as well:

x_val_with_metapreds = np.zeros((x_val.shape[0], \

x_val.shape[1]+2))

x_val_with_metapreds[:, :-2] = x_val

x_val_with_metapreds[:, -2:] = -1

print(x_val_with_metapreds)

The output is as follows:

- Fit base models on the complete training set to get meta features for the validation set.
Next, we will train the two base predictors on the complete training dataset to get the meta prediction values for the validation dataset. This is similar to what we did for each fold in *Step 3*:

svm = LinearSVC(random_state=11, max_iter=1000)

svm.fit(x_train, y_train)

knn = KNeighborsClassifier(n_neighbors=4)

knn.fit(x_train, y_train)

svm_pred = svm.predict(x_val)

knn_pred = knn.predict(x_val)

x_val_with_metapreds[:, -2] = svm_pred

x_val_with_metapreds[:, -1] = knn_pred

- Train the stacked model and use the final predictions to calculate accuracy.
The final step is to train the logistic regression model on all the columns of the training dataset plus the meta predictions from the base estimators. We use the model to find the prediction accuracies for both the training and validation datasets:

lr = LogisticRegression(random_state=11)

lr.fit(x_train_with_metapreds, y_train)

lr_preds_train = lr.predict(x_train_with_metapreds)

lr_preds_val = lr.predict(x_val_with_metapreds)

print('Stacked Classifier:\n> Accuracy on training data = {:.4f}'\

'\n> Accuracy on validation data = {:.4f}'\

.format(accuracy_score(y_true=y_train, \

y_pred=lr_preds_train), \

accuracy_score(y_true=y_val, y_pred=lr_preds_val)))

The output is as follows:

Stacked Classifier:

> Accuracy on training data = 0.7837

> Accuracy on validation data = 0.8827

Note

Owing to randomization, you might get an output that varies slightly in comparison to the output presented in the preceding step.

- Compare the accuracy with that of base models.
To get a sense of the performance boost from stacking, we calculate the accuracies of the base predictors on the training and validation datasets and compare this with that of the stacked model:

print('SVM:\n> Accuracy on training data = {:.4f}'\

'\n> Accuracy on validation data = {:.4f}'\

.format(accuracy_score(y_true=y_train, \

y_pred=svm.predict(x_train)), \

accuracy_score(y_true=y_val, y_pred=svm_pred)))

print('kNN:\n> Accuracy on training data = {:.4f}'\

'\n> Accuracy on validation data = {:.4f}'\

.format(accuracy_score(y_true=y_train, \

y_pred=knn.predict(x_train)), \

accuracy_score(y_true=y_val, y_pred=knn_pred)))

The output is as follows:

SVM:

> Accuracy on training data = 0.7205

> Accuracy on validation data = 0.7430

kNN:

> Accuracy on training data = 0.7921

> Accuracy on validation data = 0.6816

Note

Owing to randomization, you might get an output that varies slightly in comparison to the output presented in the preceding step.

As we can see, not only does the stacked model give us a validation accuracy that is significantly higher than that of either base predictor, but it also has the highest validation accuracy, about 88%, of all the ensemble models discussed in this chapter.

Note

To access the source code for this specific section, please refer to https://packt.live/37QANjZ.

You can also run this example online at https://packt.live/2YljJ2D. You must execute the entire Notebook in order to get the desired result.
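The manual bookkeeping in this exercise is useful for understanding what stacking does. For comparison, scikit-learn also ships a **StackingClassifier** that handles the k-fold meta features internally. The sketch below is an approximation of the exercise rather than an exact equivalent: it runs on synthetic data, and by default the class feeds decision scores or class probabilities from the base models to the final estimator instead of hard class predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic data standing in for the Titanic features (an assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=11)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          random_state=11)

stack = StackingClassifier(
    estimators=[
        ('svm', LinearSVC(random_state=11, max_iter=10000)),
        ('knn', KNeighborsClassifier(n_neighbors=4)),
    ],
    final_estimator=LogisticRegression(random_state=11),
    cv=5,              # out-of-fold predictions become the meta features
    passthrough=True,  # also pass the original features to the final model
)
stack.fit(X_tr, y_tr)

val_accuracy = stack.score(X_va, y_va)
print(f'Stacked validation accuracy = {val_accuracy:.4f}')
```

Setting **passthrough=True** mirrors the exercise's choice of training the logistic regression on the original columns plus the meta predictions; leaving it at its default of **False** trains the final model on the meta features alone.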

## Activity 6.01: Stacking with Standalone and Ensemble Algorithms

In this activity, we'll use the Boston House Prices: Advanced Regression Techniques Database (available at https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ or on GitHub at https://packt.live/2Vk002e).

This dataset is aimed toward solving a regression problem (that is, the target variable takes on a range of continuous values). In this activity, we will use decision trees, k-nearest neighbors, random forest, and gradient boosting algorithms to train individual regressors on the data. Then, we will build a stacked linear regression model that uses all these algorithms and compare the performance of each. We will use the **mean absolute error** (**MAE**) as the evaluation metric for this activity.

Note

The MAE function, **mean_absolute_error()**, can be used in a similar way to the **accuracy_score()** measure used previously.
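As a quick illustration of that call, here is a minimal sketch with made-up target values and predictions; the numbers are hypothetical and do not come from the housing dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([24.0, 21.6, 34.7, 33.4])   # hypothetical target values
y_pred = np.array([25.0, 20.6, 30.7, 35.4])   # hypothetical predictions

# Absolute errors are 1, 1, 4 and 2, so their mean is 2.
mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)
print(f'MAE = {mae:.4f}')
```

Unlike accuracy, a lower MAE indicates a better model, which is worth keeping in mind when comparing the scores in this activity.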

The steps to be performed are as follows:

- Import the relevant libraries.
- Read the data.
- Preprocess the dataset to remove null values and one-hot encode categorical variables to prepare the data for modeling.
- Divide the dataset into train and validation DataFrames.
- Initialize dictionaries in which to store the train and validation MAE values.
- Train a **DecisionTreeRegressor** model (**dt**) with the following hyperparameters and save the scores (in scikit-learn 1.0 and later, the **'mae'** criterion is named **'absolute_error'**):
dt_params = {

'criterion': 'mae',

'min_samples_leaf': 15,

'random_state': 11

}

- Train a **KNeighborsRegressor** model (**knn**) with the following hyperparameters and save the scores:
knn_params = {

'n_neighbors': 5

}

- Train a **RandomForestRegressor** model (**rf**) with the following hyperparameters and save the scores (in scikit-learn 1.0 and later, the **'mae'** criterion is named **'absolute_error'**):
rf_params = {

'n_estimators': 20,

'criterion': 'mae',

'max_features': 'sqrt',

'min_samples_leaf': 10,

'random_state': 11,

'n_jobs': -1

}

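- Train a **GradientBoostingRegressor** model (**gbr**) with the following hyperparameters and save the scores (recent scikit-learn versions no longer accept **'mae'** as a criterion for this model; there, omit the criterion to use the default):
gbr_params = {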

'n_estimators': 20,

'criterion': 'mae',

'max_features': 'sqrt',

'min_samples_leaf': 10,

'random_state': 11

}

- Prepare the training and validation datasets, with the four meta estimators having the same hyperparameters that were used in the previous steps.
- Train a **LinearRegression** model (**lr**) as the stacked model.
- Visualize the train and validation errors for each individual model and the stacked model.
The output will be as follows:

Note

The solution for this activity can be found via this link.

Thus, we have demonstrated how stacking as an ensembling technique can outperform the individual machine learning models it combines in terms of validation set performance on the datasets used in this chapter.

# Summary

In this chapter, we started off with a discussion on overfitting and underfitting and how they can affect the performance of a model on unseen data. The chapter looked at ensemble modeling as a solution for these problems and went on to discuss different ensemble methods that could be used, and how they could decrease the overall bias or variance encountered when making predictions. We first discussed bagging algorithms and introduced the concept of bootstrapping.

Then, we looked at random forest as a classic example of a bagged ensemble and solved exercises that involved building a bagging classifier and random forest classifier on the previously seen Titanic dataset. We then moved on to discussing boosting algorithms and how they successfully reduce bias in the system, and gained an understanding of how to implement adaptive boosting and gradient boosting. The last ensemble method we discussed was stacking, which, as we saw from the exercise, gave us the best accuracy score of all the ensemble methods we implemented. Although building an ensemble model is a great way to decrease bias and variance, and such models generally outperform any single model by itself, each ensemble method comes with its own trade-offs and use cases. While bagging is great when trying to avoid overfitting, boosting can reduce both bias and variance, though it may still have a tendency to overfit. Stacking, on the other hand, is a good choice when one model performs well on a portion of the data while another model performs better on another portion of the data.

In the next chapter, we will explore more ways to overcome the problems of overfitting and underfitting in detail by looking at validation techniques, that is, ways to judge our model's performance, and how to use different metrics as indicators to build the best possible model for our use case.