
5. Model Validation and Optimization

Overview

In this chapter, you will learn how to use k-fold cross validation to test model performance, as well as how to use validation curves to optimize model parameters. You will also learn how to implement dimensionality reduction techniques such as Principal Component Analysis (PCA). By the end of this chapter, you will have completed an end-to-end machine learning project and produced a final model that can be used to make business decisions.

Introduction

As we've seen in the previous chapters, it's easy to train models with scikit-learn using just a few lines of Python code. This is made possible by abstracting away the computational complexity of the algorithm, including details such as constructing cost functions and optimizing model parameters. In other words, we deal with a black box where the internal operations are hidden from us.

While the simplicity offered by this approach is quite nice on the surface, it does nothing to prevent the misuse of algorithms—for example, by selecting the wrong model for a dataset, overfitting on the training set, or failing to test properly on unseen data.

In this chapter, we'll show you how to avoid some of these pitfalls while training classification models and equip you with the tools to produce trustworthy results. We'll introduce k-fold cross validation and validation curves, and then look at ways to use them in Jupyter.

We'll also introduce the topic of dimensionality reduction and see how it can be used, along with k-fold cross validation, to perform model selection. We'll apply these techniques to our models for the Human Resource Analytics dataset in order to build and present an optimized final solution.

The topics in this chapter are highly practical with regard to real-world machine learning problems. The information and code presented here will enable you to build predictive models that perform well on unseen data, which is a crucial property of production models. To start things off, we'll learn about k-fold cross validation.

Assessing Models with k-Fold Cross Validation

Thus far, we have trained models on a subset of the data and then assessed performance on the unseen portion, called the test set. This is good practice because the model's performance on data that's used for training is not a good indicator of its effectiveness as a predictor. It's very easy to increase accuracy on a training dataset by overfitting a model, which results in poorer performance on unseen data.

That being said, simply training models on data that's been split in this way is not good enough. There is a natural variance in the data that causes accuracies to differ (even if only slightly) depending on the training and test splits. Furthermore, using only one training/test split to compare models can introduce bias toward certain models and lead to overfitting.

k-Fold cross validation offers a solution to this problem and allows the variance to be accounted for by way of an error estimate on each accuracy calculation.

The method of k-fold cross validation is illustrated in the following diagram, where we can see how the k-folds can be selected from the dataset:

Figure 5.1: Illustration of k-fold cross validation

Note

Image source: CC BY-SA 4.0: https://commons.wikimedia.org/wiki/File:K-fold_cross_validation_EN.svg.

Keeping the preceding illustration in mind, the k-fold cross validation algorithm works as follows:

  1. Split data into k folds of near-equal size.
  2. Test and train k models on different fold combinations, where each model includes k – 1 folds of training data and uses the left-out fold as the validation set. In this method, each fold ends up being used as the validation set exactly once.
  3. Calculate the model accuracy by taking the mean of the k accuracy values. The standard deviation is also calculated to provide error estimates on the value.

It's standard to set k = 10, but smaller values for k should be considered if you're using a big dataset.
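
To make these steps concrete, here is a minimal sketch of the algorithm using scikit-learn's KFold splitter and a decision tree. The dataset here is a synthetic placeholder; the exercise later in this chapter shows the more convenient cross_val_score function:

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X, y = make_classification(random_state=0)  # synthetic placeholder dataset
clf = DecisionTreeClassifier(max_depth=3)

accuracies = []
for train_idx, val_idx in KFold(n_splits=10).split(X):  # step 1: k folds
    clf.fit(X[train_idx], y[train_idx])  # step 2: train on the other k - 1 folds
    accuracies.append(clf.score(X[val_idx], y[val_idx]))  # score on the left-out fold
# step 3: mean accuracy with the standard deviation as an error estimate
print('accuracy = {:.3f} +/- {:.3f}'.format(np.mean(accuracies), np.std(accuracies)))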

This validation method can be used to compare model performance with different hyperparameters in a more reliable way than using a single train-test split on the data, as we were doing in Chapter 4, Training Classification Models. For example, we could use k-fold cross validation to optimize the value of C for an SVM or the value of k (number of nearest neighbors) for a KNN classifier.

Although k-fold cross validation involves repeatedly splitting the data into training and validation sets, only a subset of the full dataset should be included in this algorithm. This can be accomplished by setting aside a random sample of records from the full dataset and keeping the majority of records for training and validation with k-fold cross validation. The records that have been set aside will be used for testing purposes later in model development.

This is our first time using the term hyperparameter. It refers to a parameter that is defined when initializing a model, for example, the parameters we mentioned previously for the SVM and KNN classifiers, or the maximum depth of a decision tree. In contrast, the term parameter in machine learning refers to a model variable that is determined during training, such as the coefficients of the decision boundary hyperplane for a trained SVM, or the weights learned by a linear regression model.
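
As a minimal illustration of this distinction (using a placeholder dataset), the hyperparameter C is fixed when the model object is created, whereas the parameters of the decision boundary only exist after training:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=0)  # synthetic placeholder dataset
clf = LinearSVC(C=1.0)  # C is a hyperparameter, chosen at initialization
clf.fit(X, y)  # training determines the model parameters
print(clf.coef_, clf.intercept_)  # learned hyperplane coefficients (parameters)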

Once the best model has been identified, it's often beneficial to retrain on the entirety of the dataset before using it in production (that is, before using it to make predictions). This way, we expose the model to all the data that's available so that it has an opportunity to learn from the full range of patterns in the dataset.

When implementing this with scikit-learn, it's common to use a slightly improved variation of the normal k-fold algorithm instead, called stratified k-fold cross validation. The improvement is that stratified k-fold cross validation maintains roughly the same class label proportions in each fold. As you can imagine, this reduces the overall variance between the models and decreases the likelihood of highly unbalanced folds causing bias.
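
To see the difference, you can compare the proportion of positive labels in each validation fold under the two splitters; with stratification, the proportions should be nearly identical across folds. Here is a minimal sketch with a placeholder imbalanced dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

X, y = make_classification(weights=[0.8], random_state=0)  # imbalanced placeholder data
for splitter in (KFold(n_splits=5), StratifiedKFold(n_splits=5)):
    fractions = ['{:.2f}'.format(y[val].mean()) for _, val in splitter.split(X, y)]
    print(type(splitter).__name__, fractions)  # positive-label fraction per fold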

Tuning Hyperparameters with Validation Curves

k-Fold cross validation naturally lends itself to the use of validation curves for tuning model hyperparameters. As shown in the following graph, validation curves chart the model accuracy as a function of a hyperparameter, such as the number of decision trees used in a Random Forest or (as mentioned previously) the maximum depth. By understanding how to interpret these charts, we can make well-informed hyperparameter selections.

Note

Like most of the scikit-learn documentation, the information provided for validation curves is very informative and worth a read. This includes the recipes that we'll use in this chapter for creating and plotting validation curves. You can read about them at:

https://scikit-learn.org/stable/modules/learning_curve.html.

Consider this validation curve, where the accuracy score is plotted as a function of the gamma SVM hyperparameter:

Figure 5.2: Validation curve for an SVM model

Note

Image source: https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html.

Starting on the left-hand side of the plot, we can see that both the training data (orange, top line) and the testing data (blue, bottom line) produce the same score. This is good, since it means the model is generalizing well to unseen data. However, the score is also quite low compared to other gamma values; therefore, we say the model is underfitting the data.

Increasing the value of gamma, we can see a point where the error bars of these two lines no longer overlap. From this point on, the classifier is failing to generalize well for unseen data, since it's overfitting the training set. As gamma continues to increase, we can see the score of the testing data drop off dramatically, while the training data score continues to increase.

The optimal value for the gamma parameter can be found by looking for a high test data score where the error bars on each line still overlap.

Keep in mind that a validation curve for a given hyperparameter is only valid while the other hyperparameters remain constant. For example, if training the SVM in this plot, we could decide to pick gamma as 10^-4. However, we may want to optimize the C parameter as well. With a different value for C, the preceding plot would be different and our selection for gamma may no longer be optimal.

To handle problems such as this, you can look into grid search algorithms. These are available through scikit-learn and use many of the same ideas we've discussed here. Grid search works like a higher dimensional validation curve. For example, a grid search over gamma = [10^-5, 10^-4, 10^-3] and C = [0.1, 1, 10] would train and test nine sets of models, one for each combination of gamma and C. As you can imagine, this is prone to becoming computationally intense as the number of hyperparameters and their ranges increase.
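
For reference, here is a minimal sketch of that nine-combination grid search using scikit-learn's GridSearchCV with an SVM; the dataset is a synthetic placeholder:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(random_state=0)  # synthetic placeholder dataset
param_grid = {'gamma': [1e-5, 1e-4, 1e-3], 'C': [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=10)  # k-fold CV for each combination
search.fit(X, y)
print(search.best_params_, search.best_score_)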

Now that we've learned the basics of how k-fold cross validation and validation curves work, it's time to proceed with the hands-on section of this chapter. We'll open up a Jupyter notebook and continue building models for the employee retention problem.

Exercise 5.01: Using k-Fold Cross Validation and Validation Curves in Python with scikit-learn

In this exercise, you will implement k-fold cross validation and validation curves in Python and learn how to use these methods to assess models and tune hyperparameters in a Jupyter notebook. Perform the following steps to complete this exercise:

  1. Create a new notebook using one of the following commands:

    JupyterLab (run jupyter lab)

    Jupyter Notebook (run jupyter notebook)

    Then, open the chosen platform in your web browser by copying and pasting the URL, as prompted in the Terminal.

  2. Load the following libraries and set your plot setting for the notebook:

    import pandas as pd

    import numpy as np

    import datetime

    import time

    import os

    import matplotlib.pyplot as plt

    %matplotlib inline

    import seaborn as sns

    %config InlineBackend.figure_format='retina'

    sns.set() # Apply the seaborn defaults

    plt.rcParams['figure.figsize'] = (8, 8)

    plt.rcParams['axes.labelpad'] = 10

    sns.set_style("darkgrid")

    %load_ext watermark

    %watermark -d -v -m -p \

    numpy,pandas,matplotlib,seaborn,sklearn

  3. Start by loading the preprocessed training data (the same dataset we worked with in the previous chapter). Load the table by running the cell with the following code:

    df = pd.read_csv('../data/hr-analytics/hr_data_processed.csv')

    Note

    As a reminder, you can find this file at https://packt.live/2YE90iC.

    In this exercise, you will be working with the same two features as in the previous chapter: satisfaction_level and last_evaluation.

    As mentioned previously in relation to k-fold cross validation, you still need to split the full dataset into a training and validation set and a test set. You will use the training and validation set during this exercise, and use the test set later during model selection.

  4. Set up the training data by running the following command:

    from sklearn.model_selection import train_test_split

    features = ['satisfaction_level', 'last_evaluation']

    X, X_test, \

    y, y_test = train_test_split(df[features].values, \

                                 df['left'].values, \

                                 test_size=0.15, \

                                 random_state=1)

  5. Use a decision tree with max_depth=5 to instantiate a model for k-fold cross validation:

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=5)

    At this point, you have not performed any interesting computation. You have simply prepared a model object, clf, and defined its hyperparameters (for example, max_depth).

  6. To run the stratified k-fold cross validation algorithm, use the model_selection.cross_val_score function from scikit-learn and print the resulting score by running the following code:

    from sklearn.model_selection import cross_val_score

    np.random.seed(1)

    scores = cross_val_score(estimator=clf, X=X, \

                             y=y, cv=10,)

    print('accuracy = {:.3f} +/- {:.3f}'.format(scores.mean(), \

                                                scores.std(),))

    Here, you train 10 variations of your clf model using stratified k-fold cross validation. Note that scikit-learn's cross_val_score does this type of validation (stratified) by default.

    You use np.random.seed to set the seed for the random number generator, thereby ensuring reproducibility with respect to any computation that follows that depends on random numbers. In this case, setting the seed ensures reproducible model training, since scikit-learn estimators fall back on NumPy's global random state when no random_state argument is given.

    Notice that you printed the average accuracy and standard deviation of each fold. You can also look at the individual accuracies for each fold by looking at the scores variable.

  7. Insert a new cell and run print(scores). You should see the following output:

    [0.92241379 0.91529412 0.92784314 0.92941176 0.9254902 0.92705882

    0.91294118 0.91607843 0.92229199 0.9277865 ]

    Using cross_val_score is a convenient way to accomplish k-fold cross validation, but it doesn't tell you about the accuracies within each class. Since your problem is sensitive to each class's accuracy (as identified in the exercises in the previous chapters), you will need to manually implement k-fold cross validation so that this information is available to you. In particular, you are interested in the accuracy of class 1, which represents the employees who have left.

  8. Define a custom class for calculating k-fold cross validation class accuracies by running the following code:

    from sklearn.model_selection import StratifiedKFold

    from sklearn.metrics import confusion_matrix

    def cross_val_class_score(clf, X, y, cv=10):

        kfold = (StratifiedKFold(n_splits=cv).split(X, y))

        class_accuracy = []

        for k, (train, test) in enumerate(kfold):

            clf.fit(X[train], y[train])

            y_test = y[test]

            y_pred = clf.predict(X[test])

            cmat = confusion_matrix(y_test, y_pred)

            class_acc = cmat.diagonal()/cmat.sum(axis=1)

            class_accuracy.append(class_acc)

            print('fold: {:d} accuracy: {:s}'.format(k+1, \

                                                     str(class_acc),))

        return np.array(class_accuracy)

    You implement k-fold cross validation manually using the model_selection.StratifiedKFold class in scikit-learn. This class takes the number of folds as an initialization argument and provides the split method to build index masks for the data. In this instance, a mask is simply an array containing the indexes of items in another array, where those items can then be returned by running code such as data[mask].

  9. Having defined this function, you can now calculate the class accuracies with code that's very similar to model_selection.cross_val_score from before. Do this by running the following code:

    np.random.seed(1)

    scores = cross_val_class_score(clf, X, y)

    print('accuracy = {} +/- {}'.format(scores.mean(axis=0), \

                                        scores.std(axis=0),))

    This will print the following output when the folds are iterated over and 10 models are trained:

    fold: 1 accuracy: [0.98559671 0.72039474]

    fold: 2 accuracy: [0.98559671 0.68976898]

    fold: 3 accuracy: [0.98971193 0.72937294]

    fold: 4 accuracy: [0.98765432 0.74257426]

    fold: 5 accuracy: [0.99074074 0.71617162]

    fold: 6 accuracy: [0.98971193 0.72607261]

    fold: 7 accuracy: [0.98251029 0.68976898]

    fold: 8 accuracy: [0.98559671 0.69306931]

    fold: 9 accuracy: [0.98455201 0.72277228]

    fold: 10 accuracy: [0.98352214 0.74917492]

    accuracy = [0.98651935 0.71791406] +/- [0.00266409 0.0200439 ]

    These outputs show the class accuracies, where the first value corresponds to class 0 and the second corresponds to class 1.

    Having seen k-fold cross validation in action, we'll move on to the topic of validation curves. These can be generated easily with scikit-learn.

  10. Calculate validation curves with model_selection.validation_curve. This function uses stratified k-fold cross validation to train models for various values of a specified hyperparameter. Perform the calculations required to plot a validation curve by running the following code:

    from sklearn.model_selection import validation_curve

    clf = DecisionTreeClassifier()

    max_depth_range = np.arange(3, 20, 1)

    np.random.seed(1)

    train_scores, \

    test_scores = validation_curve(estimator=clf, \

                                   X=X, y=y, \

                                   param_name='max_depth', \

                                   param_range=max_depth_range, \

                                   cv=5,);

    By running this, you've trained a set of decision trees over the range of max_depth values. These values are defined in the max_depth_range = np.arange(3, 20, 1) line, which corresponds to the [3, 4, … 18, 19] array—that is, from max_depth=3 up to max_depth=19 (np.arange excludes the stop value of 20), with a step size of 1.

    The validation_curve function will return arrays with the cross validation (training and test) scores for a set of models, where each has a different max_depth value.
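
    As a quick sanity check on this output, each returned array should have one row per max_depth value and one column per cross validation fold:

    print(train_scores.shape)  # expected: (17, 5) -- 17 depth values x 5 folds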

  11. To visualize the results, leverage a function provided in the scikit-learn documentation:

    Note

    The triple-quotes ( """ ) shown in the code snippet below are used to denote the start and end points of a multi-line code comment. Comments are added into code to help explain specific bits of logic.

chapter_5_workbook.ipynb

def plot_validation_curve(train_scores, \

                          test_scores, \

                          param_range, \

                          xlabel='', \

                          log=False, \

):

    """This code is from scikit-learn docs (BSD License).

    http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

    """

    train_mean = np.mean(train_scores, axis=1)

    train_std = np.std(train_scores, axis=1)

    test_mean = np.mean(test_scores, axis=1)

    test_std = np.std(test_scores, axis=1)
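
    # NOTE: the remainder of this function (see chapter_5_workbook.ipynb) plots
    # these statistics; a minimal sketch of that plotting logic, based on the
    # scikit-learn recipe referenced in the docstring, might look as follows:
    plotter = plt.semilogx if log else plt.plot
    plotter(param_range, train_mean, marker='o', color='b', \
            label='Training score')
    plt.fill_between(param_range, train_mean - train_std, \
                     train_mean + train_std, alpha=0.15, color='b')
    plotter(param_range, test_mean, marker='s', color='r', \
            label='Cross-validation score')
    plt.fill_between(param_range, test_mean - test_std, \
                     test_mean + test_std, alpha=0.15, color='r')
    plt.xlabel(xlabel)
    plt.ylabel('Score')
    plt.legend(loc='best')

Calling the function with the results from the previous step, for example:

plot_validation_curve(train_scores, test_scores, \
                      max_depth_range, xlabel='max_depth',)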

This will result in the following graph:

Figure 5.3: Validation curve for a decision tree

Setting the max depth parameter for decision trees controls the balance between underfitting and overfitting. This is reflected in the validation curve, where we can see low accuracies for small maximum depth values (underfitting), since we are not allowing the decision tree to create enough branches to capture the patterns in the data.

For large max depth values to the right of the chart, we can see the opposite happen, as the decision trees here overfit the training data. This is evidenced by the fact that our validation accuracy (red squares) decreases as the maximum depth increases.

Notice how the training accuracy (blue circles) continues increasing as the maximum depth increases. This happens because the decision trees are able to capture increasingly detailed patterns in the training data. By looking at the validation accuracies, we can see that these patterns do not generalize well for unseen data.

Based on this chart, a good value for max_depth appears to be 6. At this point, we can see that the validation accuracy has hit a maximum and that the training and validation accuracies are in agreement (within error).

Note

To access the source code for this specific section, please refer to https://packt.live/30GTi9a.

You can also run this example online at https://packt.live/2BcG5tP.

To summarize, we have learned and implemented two important techniques for building reliable predictive models.

The first such technique was k-fold cross validation, where we train and validate a set of models over different subsets of the data in order to generate a variety of accuracy measurements for a single model choice. From this set, we then calculated the average accuracy and the standard deviation. This standard deviation is an important error metric to gauge the variability of our selected model.

The second technique we explored in this section was validation curves. By comparing training and validation accuracies (as generated by k-fold cross validation) over the range of our selected hyperparameter, validation curves allow us to visualize when our model is underfitting or overfitting and help us to identify optimal hyperparameter values.

In the next section, we'll introduce the concept of dimensionality reduction and why it's useful for training models. Then, we'll apply it to the Human Resource Analytics dataset and revisit the topics from this section in order to train highly accurate models for predicting employee turnover.

Dimensionality Reduction with PCA

Dimensionality reduction can be as simple as removing unimportant features from the training data. However, it's usually not obvious that removing a set of features will boost model performance. Even features that are highly noisy may offer some valuable information that models can learn from. For these reasons, we should know about better methods for reducing data dimensionality, such as the following:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)

These techniques allow for data compression, where the most important information from a large group of features can be encoded in just a few features.

In this section, we'll focus on PCA. This technique transforms the data by projecting it onto a new subspace of orthogonal principal components, where the components with the largest eigenvalues (as described below) encode the most information for training the model. Then, we can simply select a set of principal components in place of the original high-dimensional dataset. The number of principal components to select depends on the details of the specific dataset, but it will typically be a small fraction of the original number of features.
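
One common way to decide is to keep enough components to explain a chosen fraction of the total variance. Here is a hedged sketch using scikit-learn's explained_variance_ratio_ attribute on a placeholder dataset; the 95% threshold is an arbitrary example:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # placeholder data with 64 pixel features
pca = PCA().fit(X)  # fit with all components retained
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.95) + 1  # components covering 95% of variance
print(n_components)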

For example, PCA could be used to encode the information from every pixel in an image. In this case, the original feature space would have dimensions equal to the number of pixels in the image. This high-dimensional space could then be reduced with PCA, where the majority of useful information for training predictive models might be reduced to just a few dimensions. Not only does this save time when training and using models, but it allows them to perform better by removing noise from the dataset.

Similar to the algorithms we've discussed and implemented in this book, it's not necessary to have a detailed understanding of PCA in order to leverage its benefits. However, before implementing PCA with scikit-learn, we'll dig into the technical details a bit further in order to gain some appreciation for the underlying algorithm.

The key insight of PCA is to identify patterns between features based on correlations. To do this, the PCA algorithm calculates the covariance matrix of the data and then decomposes it into eigenvectors and eigenvalues. The eigenvectors are then used to transform the data into a new subspace, from which a fixed number of principal components can be selected. Through this process, we effectively take a high-dimensional dataset and find a set of vectors that follow the directions of largest variance, and can thereby encode much of the total information in fewer dimensions.
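
The following sketch mirrors this description in NumPy, using randomly generated placeholder data: it centers the features, computes the covariance matrix, decomposes it, and projects the records onto the two directions of largest variance:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)  # placeholder data: 100 records, 5 features
X_centered = X - X.mean(axis=0)  # center each feature at zero
cov = np.cov(X_centered, rowvar=False)  # covariance matrix between features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]  # re-sort by decreasing variance
components = eigvecs[:, order[:2]]  # keep the two largest-variance directions
X_reduced = X_centered @ components  # project into the new 2D subspace
print(X_reduced.shape)  # (100, 2)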

In the following exercise, we'll look at an example of how PCA can be used to reduce the dimensionality of our Human Resource Analytics dataset.

Exercise 5.02: Dimensionality Reduction with PCA

After training a variety of models for predicting employee turnover with the Human Resource Analytics dataset, we have yet to use the majority of the features at our disposal. In this exercise, we will take the first steps in putting these features to use.

First, you will learn about a modeling technique that calculates which features are most influential for making predictions. Then, using these so-called "feature importances", you will create a strategy for selecting good features for dimensionality reduction. Finally, you will learn how to implement PCA with scikit-learn. Perform the following steps to complete this exercise:

  1. Starting at the point in the notebook where the previous exercise ended, load the preprocessed dataset and print the columns by running the following code. This is the same table that you used in the previous exercise:

    df = pd.read_csv('../data/hr-analytics/hr_data_processed.csv')

    df.columns

    Here's the output of the preceding command:

    Figure 5.4: The columns of hr_data_processed.csv

    In order to determine which features are good candidates for reducing with PCA, you want to calculate how important each of them is for making predictions. Once you know this information, you can select those that are least important for PCA and leave the most important features intact.

  2. Determine feature importance using a decision tree classifier. Select all available features and train a decision tree on the full dataset (not doing a train-test split), by running the following code:

    features = ['satisfaction_level', 'last_evaluation', \

                'number_project','average_montly_hours', \

                'time_spend_company', 'work_accident', \

                'promotion_last_5years', 'department_IT', \

                'department_RandD','department_accounting', \

                'department_hr', 'department_management', \

                'department_marketing', 'department_product_mng', \

                'department_sales','department_support', \

                'department_technical', 'salary_high', \

                'salary_low', 'salary_medium']

    X = df[features].values

    y = df.left.values

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(max_depth=10)

    clf.fit(X, y)

    By now, you should recognize exactly what the preceding code is doing. Based on previous testing, you found max_depth=6 to be a good choice when training on just two features: satisfaction_level and last_evaluation. When more features are included in the model, decision trees tend to require more depth to avoid underfitting (assuming all the other hyperparameters remain constant). Therefore, select max_depth=10 as an educated guess. Most likely, this is not the optimal choice, but for our purposes here, this does not matter.

  3. Having trained a quick and dirty model, leverage it to see how important each feature is for making predictions by using the feature_importances_ attribute of clf. Visualize these in a bar chart by running the following code:

    (

        pd.Series(clf.feature_importances_, \

                  name='Feature importance', \

                  index=df[features].columns,)

        .sort_values()

        .plot.barh()

    )

    plt.xlabel('Feature importance')

    The bar plot for this is as follows:

    Figure 5.5: Feature importance calculated by a decision tree model

    As shown in the preceding bar plot, there are a handful of features that are of significant importance when it comes to making predictions, and the rest appear to have near-zero importance.

    Keep in mind, however, that this chart does not represent the true feature importance, but simply that of the quick and dirty decision tree model, clf. In other words, the features with near-zero importance in the preceding chart may be more important for other models. In any case, the information here is sufficient for us to make a selection on which features to reduce with PCA.

  4. Set aside the five most important features from the preceding chart so that you can use them for modeling later, and then select the remainder for use in the PCA algorithm. Do this with the following code:

    importances = list(pd.Series(clf.feature_importances_, \

                                 index=df[features].columns,)

                       .sort_values(ascending=False).index)

    low_importance_features = importances[5:]

    high_importance_features = importances[:5]

  5. Print the list of low-importance features as follows:

    np.array(low_importance_features)

    The output is as follows:

    array(['salary_low', 'department_technical', 'work_accident',

           'department_support', 'department_IT', 'department_RandD',

           'salary_high', 'salary_medium', 'department_management',

           'department_accounting', 'department_hr', 'department_sales',

           'department_product_mng', 'promotion_last_5years',

           'department_marketing'], dtype='<U22')

  6. Print the list of high-importance features as follows:

    np.array(high_importance_features)

    The output is as follows:

    array(['satisfaction_level', 'last_evaluation', 'time_spend_company',

           'number_project', 'average_montly_hours'], dtype='<U20')

  7. Having identified the features to use for dimensionality reduction, run the PCA algorithm with the following code:

    from sklearn.decomposition import PCA

    pca_features = ['salary_low', 'department_technical', \

                    'work_accident', 'department_support', \

                    'department_IT', 'department_RandD', \

                    'salary_high', 'salary_medium', \

                    'department_management', 'department_accounting', \

                    'department_hr', 'department_sales', \

                    'department_product_mng', 'promotion_last_5years', \

                    'department_marketing']

    X_reduce = df[pca_features]

    pca = PCA(n_components=3)

    pca.fit(X_reduce)

    X_pca = pca.transform(X_reduce)

    First, we define the list of features to use in PCA, which can conveniently be done by copying and pasting the output of np.array(low_importance_features) from the preceding cell. Note that this order matters: any future data transformed with the fitted PCA object must present its features in the same order.

    Next, we instantiate the PCA class from scikit-learn with n_components=3, indicating that we want to keep the first three components returned by the PCA algorithm. Finally, we fit our instantiated PCA class and then transform the same dataset.
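
    As a quick check on how much of the variance in these features the three components capture, you can inspect the fitted object's explained variance ratios (the exact values will depend on the data):

    print(pca.explained_variance_ratio_)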

  8. Check the shape of the component data by running X_pca.shape. This will print the following output:

    (14999, 3)

    This result shows that we have 14,999 rows and three columns – the three principal components computed for each record in the dataset.

  9. Insert these principal component features into df by running the following code:

    df['first_principle_component'] = X_pca.T[0]

    df['second_principle_component'] = X_pca.T[1]

    df['third_principle_component'] = X_pca.T[2]

  10. Save the updated dataset by running the following code:

    df.to_csv('../data/hr-analytics/hr_data_processed_pca.csv', \

              index=False,)

  11. Finally, save the fitted PCA object. This will be needed later to process future data before feeding it into our classifier's prediction method:

    import joblib

    joblib.dump(pca, 'hr-analytics-pca.pkl')

    Note

    To access the source code for this specific section, please refer to https://packt.live/30GTi9a.

    You can also run this example online at https://packt.live/2BcG5tP.

This concludes our exercise on PCA. You've learned how to generate feature importances and use that information to identify good candidates for dimensionality reduction. Using this technique, we found a set of features from the Human Resource Analytics dataset to apply the PCA algorithm to and reduced them to create three new features, representing their principal components.

Model Training for Production

So far in this book, we have trained many models and spent considerable effort learning about model assessment and optimization techniques. However, we have primarily focused on training models for instructional purposes, rather than producing production-ready models with optimal performance.

We have discussed the importance of training data several times in this book. Generally, we want to have as many training records and informative features as possible. One downside of having a massive set of records is the additional work required to clean that data in order to prepare it for use in machine learning algorithms. The same can be said for the number of features.

An additional problem that presents itself as the number of features grows is the difficulty in fitting models well. The variation of feature types, such as numerical, categorical, and Boolean, can restrict the type of models that are available to us and raise technical considerations around feature scaling during model training. In this chapter, we were able to avoid feature scaling altogether by using decision trees, which do not require features to be of a comparable scale.

More troubling than the preceding concerns, with respect to a growing number of features, is something known as the curse of dimensionality. This refers to the difficulty that models encounter when trying to fit a large number of features. As the number of dimensions in the training data increases, it becomes increasingly difficult for models to find patterns due to the inherently large distances that appear between records in a high-dimensional space. The dimensionality reduction techniques we learned about earlier can be effective for counteracting this effect.
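
This distance effect is easy to demonstrate: the average pairwise distance between uniformly random points grows steadily with the number of dimensions, as in this short sketch:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
for d in (2, 10, 100, 1000):
    X = rng.rand(100, d)  # 100 random points in the d-dimensional unit cube
    print(d, pdist(X).mean())  # the average pairwise distance increases with d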

Despite the difficulties outlined here, it still holds true that more training data is usually beneficial to model performance. So far in this book, we've trained most of our models on just two features. In this section, we'll apply what we learned previously about model assessment and optimization in order to train a production-ready model that uses information from all of the features that are available in our dataset.

Exercise 5.03: Training a Production-Ready Model for Employee Turnover

We have already spent considerable effort planning a machine learning strategy, cleaning the raw data, and building predictive models for the employee retention problem. Recall that our business objective was to help the client prevent employees from leaving. The strategy we decided upon was to build a classification model that would be able to predict employee turnover by estimating the probability of an employee leaving. This way, the company can assess the likelihood of current employee turnover and take action to prevent it.

Given our strategy, we can summarize the type of predictive modeling we are doing as follows:

  • Supervised learning on labeled training data
  • Classification with two class labels (binary)

In particular, we are training models to determine whether an employee has left the company, given a set of numerical and categorical features.

After preparing the data for machine learning in Chapter 3, Preparing Data for Predictive Modeling, we went on to implement SVM, KNN, and Random Forest algorithms using just two features. These models were able to make predictions with over 90% overall accuracy. When looking at the specific class accuracies, however, we found that employees who had left (class label 1) could only be predicted with 70-80% accuracy.

In this exercise, you will see how much these class 1 accuracies can be improved by utilizing the full feature space. You will look at a unified example using validation curves for hyperparameter tuning, k-fold cross validation and test set verification for model assessment, as well as the final steps in preparing a production-ready model. Perform the following steps to complete this exercise:

  1. Starting where you left off in the notebook, load the preprocessed dataset and print the columns by running the following code. This is the same table that you completed the previous exercise with:

    df = pd.read_csv('../data/hr-analytics/hr_data_processed.csv')

    df.columns

    This command displays the following output:

    Figure 5.6: The columns of hr_data_processed.csv

    As a quick refresher, we'll go through a brief summary of the variable descriptions. You are encouraged to look back at the analysis from Chapter 3, Preparing Data for Predictive Modeling, in order to review the feature distributions we generated.

    The first two features, satisfaction_level and last_evaluation, are numerical and span continuously from 0 to 1; these are what we used to train the models in the previous two exercises. Next, we have some numerical features, such as number_project and time_spend_company, followed by Boolean fields such as work_accident and promotion_last_5years. We also have the one-hot encoded categorical features, such as department_IT and salary_medium. Lastly, we have the PCA variables representing the first three principal components of the select feature set from the previous exercise.

    Given the mixed data types of our feature set, decision trees or Random Forests are very attractive models since they work well with feature sets composed of both numerical and categorical data. In this exercise, we are going to train a decision tree model.

    Note

    If you're interested in training an SVM or KNN classifier on mixed-type input features, you may find the data scaling prescription from this StackExchange answer useful: https://stats.stackexchange.com/questions/82923/mixing-continuous-and-binary-data-with-linear-svm/83086#83086.

    A simple approach would be to preprocess the data as follows:

    Standardize continuous variables, one-hot encode categorical features, and then shift binary values to -1 and 1 instead of 0 and 1.

    This would yield consistently scaled data from the mixed feature types, which could then be used to train a variety of classification models; a minimal sketch follows.
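
    Here, the column lists are hypothetical examples rather than a prescription for this dataset:

    from sklearn.preprocessing import StandardScaler

    df_scaled = df.copy()
    continuous_cols = ['satisfaction_level', 'last_evaluation']  # example columns
    binary_cols = ['work_accident', 'promotion_last_5years']  # example columns
    df_scaled[continuous_cols] = StandardScaler().fit_transform(df[continuous_cols])
    df_scaled[binary_cols] = df[binary_cols] * 2 - 1  # shift 0/1 flags to -1 and 1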

  2. Select the features to use for our model as the top five features from the PCA section, in terms of feature importance, and the first three principal components of the remaining features. Do this selection and split the data into a training and validation set (X, y) and a test set (X_test, y_test) by running the following code:

    from sklearn.model_selection import train_test_split

    features = ['satisfaction_level', 'last_evaluation', \

                'time_spend_company','number_project', \

                'average_montly_hours', 'first_principle_component', \

                'second_principle_component', \

                'third_principle_component',]

    X, X_test, \

    y, y_test = train_test_split(df[features].values, \

                                 df['left'].values, \

                                 test_size=0.15, \

                                 random_state=1)

    Notice that you set test_size=0.15 for the train-test split, since you want to set aside 15% of the full dataset for testing the model that you will select after hyperparameter tuning.

    The hyperparameter you are going to optimize is the decision tree's max_depth. You will do this in the same way as before, using a validation curve; recall that you found max_depth=6 to be optimal for a decision tree model trained on only two features.

  3. Calculate the validation curve for a decision tree with a maximum depth ranging from 2 up to 50, in steps of 2, by running the following code:

    %%time

    from sklearn.tree import DecisionTreeClassifier

    np.random.seed(1)

    clf = DecisionTreeClassifier()

    max_depth_range = np.arange(2, 52, 2)

    print('Training {} models ...'.format(len(max_depth_range)))

    train_scores, \

    test_scores = validation_curve(estimator=clf, X=X, y=y, \

                                   param_name='max_depth', \

                                   param_range=max_depth_range, \

                                   cv=10,);

    Since you are using the %%time magic function, this cell will print a message similar to the following:

    Training 25 models ...

    CPU times: user 7.93 s, sys: 29.5 ms, total: 7.96 s

    Wall time: 7.98 s

    The details of this will depend on your hardware and the other processes happening on your system at runtime.

    By executing this code, you run 25 sets of k-fold cross validation—one for each value of the max_depth hyperparameter in our defined range. By setting cv=10, you produce 10 estimates of the accuracy for each model (during k-fold cross validation), from which the mean and standard deviation are calculated in order to plot in the validation curve. In total, you train 250 models over various maximum depths and subsets of the data.

  4. Having run the calculations required for the validation curve, plot it with the plot_validation_curve function that was defined earlier in the notebook. If needed, scroll up and rerun that cell to define the function. Then, run the following code:

    plot_validation_curve(train_scores, test_scores, \

                          max_depth_range, xlabel='max_depth',)

    plt.ylim(0.95, 1.0)

    This will result in the following curve:

    Figure 5.7: Validation curve for a decision tree with PCA features

    Looking at this validation curve, you can see the accuracy of the training set (blue circles) quickly approach 100%, hitting this mark around max_depth=20. The validation set (red squares) reaches a maximum accuracy around max_depth=8, before dropping slightly as max_depth increases beyond this point. This happens because the models in this range are overfitting on the training data, learning patterns that don't generalize well to unseen data in the validation sets.

    Based on this result, we can select max_depth=8 as the optimal value to use for our production model.

  5. Check the k-fold cross validation accuracy for each class in our model with the cross_val_class_score function that was defined earlier in the notebook. If needed, scroll up and rerun that cell to define the function. Then, run the following code:

    clf = DecisionTreeClassifier(max_depth=8)

    np.random.seed(1)

    scores = cross_val_class_score(clf, X, y)

    print('accuracy = {} +/- {}'.format(scores.mean(axis=0), \

                                        scores.std(axis=0),))

    This will print the following output:

    fold: 1 accuracy: [0.99382716 0.91118421]

    fold: 2 accuracy: [0.99588477 0.91089109]

    fold: 3 accuracy: [0.99897119 0.91749175]

    fold: 4 accuracy: [0.99588477 0.95379538]

    fold: 5 accuracy: [0.99279835 0.91419142]

    fold: 6 accuracy: [0.99588477 0.92079208]

    fold: 7 accuracy: [0.99485597 0.92409241]

    fold: 8 accuracy: [0.99382716 0.9339934 ]

    fold: 9 accuracy: [0.9907312 0.91419142]

    fold: 10 accuracy: [0.99176107 0.94059406]

    accuracy = [0.99444264 0.92412172] +/- [0.00226594 0.01357943]

    As can be seen, this model is performing much better than previous models for class 1, with an average accuracy of 92.4% +/- 1.4%. This can be attributed to the additional features we are using here, compared to earlier models that relied on only two features.

  6. Visualize the new class accuracies with a boxplot by running the following code:

    fig = plt.figure(figsize=(5, 7))

    sns.boxplot(data=pd.DataFrame(scores, columns=[0, 1]), \

                                  palette=sns.color_palette('Set1'),)

    plt.xlabel('Left (0="no", 1="yes")')

    plt.ylabel('Accuracy')

    Here's the visualization created by the preceding code:

    Figure 5.8: Boxplot of class accuracies for the decision tree model

    At this point, having finished the hyperparameter optimization, you should now check how well the model performs on the test set. This result will give you more confidence that it will perform well when making predictions in production.

  7. Train a model on the full set of training and validation data (X, y). Then, determine the accuracy of each class for the test set (X_test, y_test) by running the following code:

    from sklearn.metrics import confusion_matrix

    clf = DecisionTreeClassifier(max_depth=8)

    clf.fit(X, y)

    y_pred = clf.predict(X_test)

    cmat = confusion_matrix(y_test, y_pred)

    cmat.diagonal() / cmat.sum(axis=1) * 100

    This will print the following output:

    array([99.23976608, 93.88888889])

    These test accuracies should fall within or very close to the range of the k-fold cross validation accuracies we calculated previously. For class 0, you can see 99.2%, which falls within the k-fold range of 99.2% – 99.6%, and for class 1, you can see 93.9%, which falls just above the k-fold range of 91.0% – 93.8%. These are good results, which give you confidence that your model will perform well in production.

  8. You have nearly finished creating your production model. Having selected the best hyperparameters and tested the accuracy, now train a new model on the full dataset with the following code:

    features = ['satisfaction_level', 'last_evaluation', \

                'time_spend_company', 'number_project', \

                'average_montly_hours', 'first_principle_component', \

                'second_principle_component', \

                'third_principle_component',]

    X = df[features].values

    y = df['left'].values

    clf = DecisionTreeClassifier(max_depth=8)

    clf.fit(X, y)

  9. To use this model in production without needing to retrain it each time, save it to disk. Using the joblib module, dump the model to a binary file by running the following code:

    import joblib

    joblib.dump(clf, 'hr-analytics-pca-tree.pkl')

  10. Check that your trained model was saved into the working directory. If your Jupyter Notebook environment has bash support, this can be done by running the following code:

    !ls .

    This will print the contents of the working directory:

    chapter_5_workbook.ipynb hr-analytics-pca-tree.pkl

    hr-analytics-pca.pkl

  11. In order to use this model to make predictions, load it from this binary file by running the following code:

    clf = joblib.load('hr-analytics-pca-tree.pkl')

    clf

    The output of this command is as follows:

    Figure 5.9: The decision tree model's representation

    Now run through an example showing how this model can be used to make predictions regarding employee turnover. You will pick a record from the training data and feed it into the model for prediction.

  12. Select a record from the training data and filter it on the original feature columns, and pretend this is the employee profile for Bob. Do this by running the following code:

    pca_features = ['salary_low', 'department_technical', \

                    'work_accident','department_support', \

                    'department_IT', 'department_RandD', \

                    'salary_high', 'salary_medium', \

                    'department_management','department_accounting', \

                    'department_hr', 'department_sales', \

                    'department_product_mng', 'promotion_last_5years', \

                    'department_marketing']

    non_pca_features = ['satisfaction_level', 'last_evaluation', \

                        'time_spend_company','number_project', \

                        'average_montly_hours']

    bob = df.iloc[8483][pca_features + non_pca_features]

    bob

    This will print the following output, showing Bob's value for each employee metric:

    salary_low 1.00

    department_technical 0.00

    work_accident 0.00

    department_support 0.00

    department_IT 0.00

    department_RandD 0.00

    salary_high 0.00

    salary_medium 0.00

    department_management 0.00

    department_accounting 0.00

    department_hr 0.00

    department_sales 1.00

    department_product_mng 0.00

    promotion_last_5years 0.00

    department_marketing 0.00

    satisfaction_level 0.77

    last_evaluation 0.68

    time_spend_company 2.00

    number_project 3.00

    average_montly_hours 225.00

    Name: 8483, dtype: float64

    In general, a prediction sample would need to be prepared in exactly the same way that the training data was, which includes the same method of data cleaning such as filling missing values and one-hot encoding categorical variables.

    In this case (for Bob), this preprocessing has already been done. However, the PCA transformation has not been applied yet; this is a necessary step in order to produce the input that your model requires.

  13. Load the PCA transformation class that was saved to disk earlier in this exercise and use it to transform the relevant features for Bob by running the following code:

    pca = joblib.load('hr-analytics-pca.pkl')

    pca_feature_values = pca.transform([bob[pca_features]])[0]

    pca_feature_values

    This will print the following output, showing the principal components that we need in order to make a prediction for Bob:

    array([-0.67733089, 0.75837169, -0.10493685])

  14. Create a prediction vector for Bob that can be input into the prediction method of your classification model by running the following code:

    X_bob = np.concatenate((bob[non_pca_features].values, \

                            pca_feature_values))

    X_bob

    This will print the following output:

    array([ 7.70000000e-01, 6.80000000e-01, 2.00000000e+00,

            3.00000000e+00, 2.25000000e+02, -6.77330887e-01,

            7.58371688e-01, -1.04936853e-01])

  15. You are finally ready to see whether the model is predicting that Bob will leave the company. Calculate this outcome by running the following code:

    clf.predict([X_bob])

    This will print the following output:

    array([0])

    This indicates that our model is predicting that Bob will not leave the company, since he was assigned to class 0.

  16. You can see what probability the model has assigned to this prediction by using its predict_proba method. Check this result by running the following code:

    clf.predict_proba([X_bob])

    This will print the following output:

    array([[0.98, 0.02]])

    This indicates that our model has assigned 98% probability to Bob remaining at the company.

    Note

    To access the source code for this specific section, please refer to https://packt.live/30GTi9a.

    You can also run this example online at https://packt.live/2BcG5tP.

You have now reached the end of our final exercise with the Human Resource Analytics dataset and successfully trained a model that can predict employee turnover. In this exercise, you used validation curves for hyperparameter optimization, along with k-fold cross validation and test set verification for model assessment, to build confidence in the final model.

By training a model on the most important features, in addition to those produced from dimensionality reduction, we were able to build a model that performs much better than previous ones from Chapter 4, Training Classification Models.

Finally, you learned how to persist models on disk and reload them for use in making predictions.

In the following activity, you will attempt to improve on the model we trained here. This will give you an opportunity to apply the topics from this chapter and use the skills you have learned from this book.

Activity 5.01: Hyperparameter Tuning and Model Selection

In this final activity related to machine learning, we'll take everything we have learned so far and put it together in order to build another predictive model for the employee retention problem. We seek to improve the accuracy of the model from the preceding exercise by training a Random Forest model.

In order to accomplish this, you will need to use the methods you've seen being implemented throughout this chapter, such as k-fold cross validation and validation curves. You will also need to confirm the validity of your model on testing data and determine whether it's an improvement on previous work. Finally, you will apply the model to a practical business situation. Perform the following steps to complete this activity:

Note

The detailed steps for this activity, along with the solutions and additional commentary, can be found via this link.

  1. Start up one of the following platforms for running Jupyter Notebooks:

    JupyterLab (run jupyter lab)

    Jupyter Notebook (run jupyter notebook)

    Then, open the platform you chose in your web browser by copying and pasting the URL, as prompted in the Terminal.

  2. Load the required libraries and set up your plotting environment for the notebook.
  3. Start by loading the training dataset you generated earlier in the notebook (hr_data_processed_pca.csv), assigning it to the df variable.
  4. Select the same features from the table that were used in the final exercise of this chapter (when we trained a decision tree with max_depth=8). Then, split these into a training and validation set and a test set (X, X_test, y, y_test). The test set should include 15% of the records.
  5. Calculate a validation curve for Random Forest classification models with n_estimators=50 over a range of max_depth values from 2 up to 50, in increments of 2. In the k-fold cross validation step of the validation curve calculation, assign the value of k to 5 by setting cv=5.
  6. Draw the validation curve using the plot_validation_curve visualization that was defined earlier in the notebook. Interpret the chart and note anything that's different from the validation curve in the previous exercise. What would you pick as the optimal value for max_depth?
  7. Perform k-fold cross validation using the cross_val_class_score function you defined earlier in the notebook, setting the Random Forest hyperparameters as n_estimators=50 and max_depth=25. Are these results better than the decision tree we trained in the previous exercise?
  8. Evaluate the performance of this model on the test set by training it on the full test and validation set (X, y), and then calculating its accuracy on each class in the test set (X_test, y_test). Are the scores in an appropriate range to validate the model?
  9. Train this model on the full set of records in df.
  10. Save the model to disk, and then check that it is saved properly by reloading it.
  11. Check the model performance for an imaginary employee, Alice, by selecting the appropriate features from row 573 of df. Make sure you select all of the features needed to make a prediction, including the first, second, and third principal components.
  12. Predict whether Alice is going to leave the company. Then, determine the probability assigned to that prediction by the model.
  13. Adjust the feature values for Alice in order to determine the changes that would be required to alter the model's prediction. Try setting average_montly_hours=100 and time_spend_company=2. Then, rerun the model's prediction probabilities. Was this adjustment enough to sway the model's prediction on whether or not Alice is going to leave?

Summary

In this chapter, we have seen how to use Jupyter Notebooks to perform hyperparameter optimization and model selection.

We built upon the work we did in the previous chapter, where we trained predictive classification models for our binary problem and saw how decision boundaries are drawn for SVM, KNN, and Random Forest models. We improved on these simple models by using validation curves to optimize parameters and explored how dimensionality reduction can improve model performance as well.

Finally, at the end of the last exercise, we explored how the final model can be used in practice to make data-driven decisions. This demonstration connects our results back to the original business problem that inspired our modeling work in the first place.

In the next chapter, we will depart from machine learning and focus on data acquisition instead. Specifically, we will discuss methods for extracting web data and learn about HTTP requests, web scraping with Python, and more data processing with pandas. These topics can be highly relevant to data scientists, given the huge importance of having good quality data to study and model.