Appendix – The Supervised Learning Workshop

Appendix

1. Fundamentals

Activity 1.01: Implementing Pandas Functions

  1. Open a new Jupyter notebook.
  2. Use pandas to load the Titanic dataset:

    import pandas as pd

    df = pd.read_csv(r'../Datasets/titanic.csv')

  3. Use the head function on the dataset as follows:

    # Have a look at the first 5 samples of the data

    df.head()

    The output will be as follows:

    Figure 1.26: First five rows

  4. Use the describe function as follows:

    df.describe(include='all')

    The output will be as follows:

    Figure 1.27: Output of describe()

  5. We do not need the Unnamed: 0 column. We can remove it without using the del command, as follows (an alternative using drop is sketched after this step):

    # del df['Unnamed: 0'] would also remove it; here we slice the columns instead

    df = df[df.columns[1:]] # keep every column except the first (Unnamed: 0)

    df.head()

    The output will be as follows:

    Figure 1.28: First five rows after deleting the Unnamed: 0 column
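    As an alternative, pandas' drop method removes a column by name; a minimal sketch, assuming a freshly loaded copy of the same CSV (df_alt is an illustrative name):

    df_alt = pd.read_csv(r'../Datasets/titanic.csv')

    df_alt = df_alt.drop(columns=['Unnamed: 0'])  # same effect as the slicing above

    df_alt.head()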

  6. Compute the mean, standard deviation, minimum, and maximum values for the columns of the DataFrame without using describe:

    df.mean()

    The output will be as follows:

    Figure 1.29: Output for mean()

    Now, calculate the standard deviation:

    df.std()

    The output will be as follows:

    Figure 1.30: Output for std()

  7. Calculate the minimum value of the columns:

    df.min()

    The output will be as follows:

    Figure 1.31: Output for min()

    Next, calculate the maximum values of the columns of the DataFrame:

    df.max()

    The output will be as follows:

    Figure 1.32: Output for max()

  8. Use the quantile method for the 33%, 66%, and 99% quantiles, as shown in the following code snippet (all three can also be computed in a single call, as sketched after this step):

    df.quantile(0.33)

    The output will be as follows:

    Figure 1.33: Output for the 33% quantile

    Similarly, use the quantile method for 66%:

    df.quantile(0.66)

    The output will be as follows:

    Figure 1.34: Output for the 66% quantile

    Use the same method for 99%:

    df.quantile(0.99)

    The output will be as follows:

    Figure 1.35: Output for the 99% quantile
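    The three quantiles can also be computed in a single call by passing a list to quantile, which returns one row per quantile; a minimal sketch assuming the same df:

    df.quantile([0.33, 0.66, 0.99])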

  9. Find out how many passengers were from each class using the groupby method:

    class_groups = df.groupby('Pclass')

    for name, group in class_groups:

        print(f'Class: {name}: {len(group)}')

    The output will be as follows:

    Class: 1: 323

    Class: 2: 277

    Class: 3: 709

  10. Find out how many passengers were from each class by using selection/indexing methods to count the members of each class:

    for clsGrp in df.Pclass.unique():

        num_class = len(df[df.Pclass == clsGrp])

        print(f'Class {clsGrp}: {num_class}')

    The result will be as follows:

    Class 3: 709

    Class 1: 323

    Class 2: 277

    The answers from Step 9 and Step 10 match (a more compact alternative using value_counts is sketched below).
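    For reference, the same per-class counts can be obtained in a single call with value_counts; a minimal sketch assuming the same df:

    # counts of passengers per class, sorted by frequency

    df.Pclass.value_counts()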

  11. Determine who the eldest passenger in third class was:

    third_class = df.loc[(df.Pclass == 3)]

    third_class.loc[(third_class.Age == third_class.Age.max())]

    The output will be as follows:

    Figure 1.36: Eldest passenger in third class

  12. For a number of machine learning problems, it is very common to scale the numerical values between 0 and 1. Use the agg method with lambda functions to scale the Fare and Age columns between 0 and 1 (a scikit-learn alternative is sketched after this step):

    fare_max = df.Fare.max()

    age_max = df.Age.max()

    df.agg({'Fare': lambda x: x / fare_max, \

            'Age': lambda x: x / age_max,}).head()

    The output will be as follows:

    Figure 1.37: Scaling numerical values between 0 and 1
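    If a strict min-max scaling to the [0, 1] range is preferred over dividing by the maximum, scikit-learn's MinMaxScaler is a common alternative; a minimal sketch, assuming the same df (recent scikit-learn versions ignore NaN values when fitting and pass them through):

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()

    scaled = df[['Fare', 'Age']].copy()

    scaled[['Fare', 'Age']] = scaler.fit_transform(scaled[['Fare', 'Age']])

    scaled.head()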

  13. Identify the one individual entry in the dataset without a listed Fare value:

    df_nan_fare = df.loc[(df.Fare.isna())]

    df_nan_fare

    The output will be as follows:

    Figure 1.38: Individual without a listed fare value

  14. Using the groupby method, replace the NaN Fare value of this row in the main DataFrame with the mean Fare of passengers in the same class and Embarked location (a more compact transform-based alternative is sketched after this step):

    embarked_class_groups = df.groupby(['Embarked', 'Pclass'])

    indices = embarked_class_groups\

              .groups[(df_nan_fare.Embarked.values[0], \

                       df_nan_fare.Pclass.values[0])]

    mean_fare = df.iloc[indices].Fare.mean()

    df.loc[(df.index == 1043), 'Fare'] = mean_fare

    df.iloc[1043]

    The output will be as follows:

Figure 1.39: Output for the individual without listed fare details
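    The same imputation can be written more compactly with groupby and transform, which fills every missing Fare with the mean of its own Embarked and Pclass group; a minimal sketch, assuming a df in which the Fare value is still missing:

    # fill each missing Fare with its (Embarked, Pclass) group mean

    group_means = df.groupby(['Embarked', 'Pclass'])['Fare'].transform('mean')

    df['Fare'] = df['Fare'].fillna(group_means)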

Note

To access the source code for this specific section, please refer to https://packt.live/2AWHbu0.

You can also run this example online at https://packt.live/2NmAnse. You must execute the entire Notebook in order to get the desired result.

2. Exploratory Data Analysis and Visualization

Activity 2.01: Summary Statistics and Missing Values

The steps to complete this activity are as follows:

  1. Import the required libraries:

    import json

    import pandas as pd

    import numpy as np

    import missingno as msno

    from sklearn.impute import SimpleImputer

    import matplotlib.pyplot as plt

    import seaborn as sns

  2. Read the data. Use pandas' .read_csv method to read the CSV file into a pandas DataFrame:

    data = pd.read_csv('../Datasets/house_prices.csv')

  3. Use pandas' .info() and .describe() methods to view the summary statistics of the dataset:

    data.info()

    data.describe().T

    The output of info() will be as follows:

    Figure 2.50: The output of the info() method (abbreviated)

    The output of describe() will be as follows:

    Figure 2.51: The output of the describe() method (abbreviated)

  4. Find the total count and total percentage of missing values in each column of the DataFrame and display them for columns having at least one null value, in descending order of missing percentages.

    As we did in Exercise 2.02: Visualizing Missing Values, we will use the .isnull() function on the DataFrame to get a mask. The count of null values in each column is then the .sum() of that mask, and the fraction of null values is its .mean(), multiplied by 100 to convert it to a percentage. Then, we'll use pd.concat() to combine the total and percentage of null values into a single DataFrame and sort the rows according to the percentage of missing values:

    mask = data.isnull()

    total = mask.sum()

    percent = 100*mask.mean()

    #

    missing_data = pd.concat([total, percent], axis=1,join='outer', \

                             keys=['count_missing', 'perc_missing'])

    missing_data.sort_values(by='perc_missing', ascending=False, \

                             inplace=True)

    #

    missing_data[missing_data.count_missing > 0]

    The output will be as follows:

    Figure 2.52: Total count and percentage of missing values in each column

  5. Plot the nullity matrix and nullity correlation heatmap. First, we find the list of column names for those having at least one null value. Then, we use the missingno library to plot the nullity matrix (as we did in Exercise 2.02: Visualizing Missing Values) for a sample of 500 points, and the nullity correlation heatmap for the data in those columns:

    nullable_columns = data.columns[mask.any()].tolist()

    msno.matrix(data[nullable_columns].sample(500))

    plt.show()

    msno.heatmap(data[nullable_columns], vmin = -0.1, \

                 figsize=(18,18))

    plt.show()

    The nullity matrix will look like this:

    Figure 2.53: Nullity matrix

    The nullity correlation heatmap will look like this:

    Figure 2.54: Nullity correlation heatmap

  6. Delete the columns having more than 80% of their values missing. Use the .loc operator on the data DataFrame, together with the missing_data summary from Step 4, to select only those columns that had less than 80% of their values missing (an approximately equivalent dropna-based approach is sketched after this step):

    data = data.loc[:,missing_data[missing_data.perc_missing < 80].index]
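    An approximately equivalent result can be obtained with dropna on the columns axis, using thresh to require at least 20% non-null values (the boundary case of exactly 80% missing may differ slightly); a minimal sketch assuming the data DataFrame as it was before this step (data_alt is an illustrative name):

    # keep only columns with at least 20% non-null values

    min_non_null = int(0.2 * len(data))

    data_alt = data.dropna(axis=1, thresh=min_non_null)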

  7. Replace null values in the FireplaceQu column with NA values. Use the .fillna() method to replace null values with the NA string:

    data['FireplaceQu'] = data['FireplaceQu'].fillna('NA')

    data['FireplaceQu']

    The output should appear as follows:

Figure 2.55: Replacing null values
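    The SimpleImputer class imported in Step 1 could perform the same constant fill; a minimal sketch, assuming FireplaceQu still contains its null values (that is, run in place of the .fillna() call above):

    # equivalent constant fill using scikit-learn's SimpleImputer

    imputer = SimpleImputer(strategy='constant', fill_value='NA')

    data[['FireplaceQu']] = imputer.fit_transform(data[['FireplaceQu']])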

Note

To access the source code for this specific section, please refer to https://packt.live/316c4a0.

You can also run this example online at https://packt.live/2Z21v5c. You must execute the entire Notebook in order to get the desired result.

Activity 2.02: Representing the Distribution of Values Visually

  1. Plot a histogram using Matplotlib for the target variable, SalePrice. First, we initialize the figure using the plt.figure command and set the figure size. Then, we use matplotlib's .hist() function as our primary plotting function, to which we pass the SalePrice series object for plotting the histogram. Lastly, we specify the axes' labels and show the plot:

    plt.figure(figsize=(8,6))

    plt.hist(data.SalePrice, bins=range(0,800000,50000))

    plt.ylabel('Number of Houses')

    plt.xlabel('Sale Price')

    plt.show()

    The output will be as follows:

    Figure 2.56: Histogram for the target variable

  2. Find the number of unique values within each column having the object type. Create a new DataFrame called object_variables by using the .select_dtypes function on the original DataFrame to select those columns with the object data type. Then, find the number of unique values for each column in this DataFrame by using the .nunique() function, and sort the resultant series:

    object_variables = data.select_dtypes(include=['object'])

    object_variables.nunique().sort_values()

    The output will be as follows:

    Figure 2.57: Number of unique values within each column having the object type (truncated)

  3. Create a DataFrame representing the number of occurrences for each categorical value in the HouseStyle column. Use the .value_counts() function to calculate the frequencies of each value in decreasing order in the form of a pandas series, and then reset the index to give us a DataFrame and sort the values according to the index:

    counts = data.HouseStyle.value_counts(dropna=False)

    counts.reset_index().sort_values(by='index')

    The output will be as follows:

    Figure 2.58: Number of occurrences of each categorical value in the HouseStyle column

  4. Plot a pie chart representing these counts. As in Step 1, we initialize the plot (here using plt.subplots()) and use the plt.title() and plt.show() methods to set the figure title and display it, respectively. The primary plotting function used is ax.pie(), to which we pass the series we created in the previous step:

    fig, ax = plt.subplots(figsize=(10,10))

    slices = ax.pie(counts, labels = counts.index, \

                    colors = ['white'], \

                    wedgeprops = {'edgecolor': 'black'})

    patches = slices[0]

    hatches = ['/', '\\', '|', '-', '+', 'x', 'o', 'O', '.', '*']

    colors = ['white', 'white', 'lightgrey', 'white', \

              'lightgrey', 'white', 'lightgrey', 'white']

    for patch in range(len(patches)):

        patches[patch].set_hatch(hatches[patch])

        patches[patch].set_facecolor(colors[patch])

    plt.title('Pie chart showing counts for\nvarious house styles')

    plt.show()

    The output will be as follows:

    Figure 2.59: Pie chart representing the counts

  5. Find the number of unique values within each column having the number type. As was executed in Step 2, now select columns having the numpy.number data type and find the number of unique values in each column using .nunique(). Sort the resultant series in descending order:

    numeric_variables = data.select_dtypes(include=[np.number])

    numeric_variables.nunique().sort_values(ascending=False)

    The output will be as follows:

    Figure 2.60: Number of unique values within each numeric column (truncated)

  6. Plot a histogram using seaborn for the LotArea variable. Use seaborn's .distplot() function as the primary plotting function, to which the LotArea series in the DataFrame needs to be passed (without any null values, use .dropna() on the series to remove them). To improve the plot view, also set the bins parameter and specify the X-axis limits using plt.xlim():

    plt.figure(figsize=(10,7))

    sns.distplot(data.LotArea.dropna(), bins=range(0,100000,1000))

    plt.xlim(0,100000)

    plt.show()

    The output will be as follows:

    Figure 2.61: Histogram for the LotArea variable
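    Note that .distplot() has been deprecated in recent seaborn releases; on newer versions, .histplot() produces a comparable histogram. A minimal sketch under that assumption:

    # newer seaborn: histplot replaces the deprecated distplot

    plt.figure(figsize=(10,7))

    sns.histplot(data.LotArea.dropna(), bins=range(0,100000,1000))

    plt.xlim(0,100000)

    plt.show()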

  7. Calculate the skew and kurtosis values for the values in each column:

    data.skew().sort_values()

    data.kurt()

    The output for skew values will be:

Figure 2.62: Skew values for each column (truncated)

The output for kurtosis values will be:

Figure 2.63: Kurtosis values for each column (truncated)

Note

To access the source code for this specific section, please refer to https://packt.live/3fR91qj.

You can also run this example online at https://packt.live/37PYOI4. You must execute the entire Notebook in order to get the desired result.

Activity 2.03: Relationships within the Data

  1. Plot the correlation heatmap for the dataset. As we did in Exercise 2.13: Plotting a Correlation Heatmap, plot the heatmap using seaborn's .heatmap() function and pass the feature correlation matrix (as determined by using pandas' .corr() function on the DataFrame). Additionally, set the color map to RdBu using the cmap parameter, and the minimum and maximum values on the color scale to -1 and 1 using the vmin and vmax parameters, respectively:

    plt.figure(figsize = (12,10))

    sns.heatmap(data.corr(), square=True, cmap="RdBu", \

                vmin=-1, vmax=1)

    plt.show()

    The output will be as follows:

    Figure 2.64: Correlation heatmap for the dataset

  2. Plot a more compact heatmap having annotations for correlation values using the following subset of features:

    feature_subset = ['GarageArea','GarageCars','GarageCond', \

                      'GarageFinish', 'GarageQual','GarageType', \

                      'GarageYrBlt','GrLivArea','LotArea', \

                      'MasVnrArea','SalePrice']

    Now do the same as in the previous step, this time selecting only the above columns in the dataset and adding a parameter, annot, with a True value to the primary plotting function, with everything else remaining the same:

    plt.figure(figsize = (12,10))

    sns.heatmap(data[feature_subset].corr(), square=True, \

                annot=True, cmap="RdBu", vmin=-1, vmax=1)

    plt.show()

    The output will be as follows:

    Figure 2.65: Correlation heatmap for a feature subset with annotations for correlation values

  3. Display the pairplot for the same subset of features, with the KDE plot on the diagonals and the scatter plot elsewhere. Use seaborn's .pairplot() function to plot the pairplot for the non-null values in the selected columns of the DataFrame. To render the diagonal KDE plots, pass kde to the diag_kind parameter and, to set all other plots as scatter plots, pass scatter to the kind parameter:

    sns.pairplot(data[feature_subset].dropna(), \

                 kind ='scatter', diag_kind='kde')

    plt.show()

    The output will be as follows:

    Figure 2.66: Pairplot for the same subset of features

  4. Create a boxplot to show the variation in SalePrice for each category of GarageCars. The primary plotting function used here will be seaborn's .boxplot() function, to which we pass the DataFrame along with the parameters x and y, the former being the categorical variable and the latter the continuous variable over which we want to see the variation within each category, that is, GarageCars and SalePrice, respectively:

    plt.figure(figsize=(10, 10))

    sns.boxplot(x='GarageCars', y="SalePrice", data=data)

    plt.show()

    The output will be as follows:

    Figure 2.67: Boxplot showing the variation in SalePrice for each category of GarageCars

  5. Plot a line graph using seaborn to show the variation in SalePrice for older to more recently built flats. Here, we will plot a line graph using seaborn's .lineplot() function. Since we want to see the variation in SalePrice, we take this as the y variable and, since the variation is across a period of time, we take YearBuilt as the x variable. Keeping this in mind, we pass the respective series as values to the y and x parameters for the primary plotting function. We also pass ci=None to hide the confidence interval band that seaborn would otherwise draw around the line:

    plt.figure(figsize=(10,7))

    sns.lineplot(x=data.YearBuilt, y=data.SalePrice, ci=None)

    plt.show()

    The output will be as follows:

Figure 2.68: Line graph showing the variation in SalePrice for older to more recently built flats

Figure 2.68 illustrates how to use a line chart to emphasize both overall trends and the ups and downs on shorter time cycles. You may want to compare this chart to a scatter chart of the same data and consider what sort of information each conveys.

Note

To access the source code for this specific section, please refer to https://packt.live/2Z4bqHM.

You can also run this example online at https://packt.live/2Nl5ggI. You must execute the entire Notebook in order to get the desired result.

3. Linear Regression

Activity 3.01: Plotting Data with a Moving Average

  1. Load the two required packages:

    import pandas as pd

    import matplotlib.pyplot as plt

  2. Load the dataset into a pandas DataFrame from the CSV file:

    df = pd.read_csv('../Datasets/austin_weather.csv')

    df.head()

    The output will show the initial five rows of the austin_weather.csv file:

    Figure 3.61: The first five rows of the Austin weather data (note that additional columns to the right are not shown)

  3. Since we only need the Date and TempAvgF columns, we'll remove all the other columns from the dataset:

    df = df.loc[:, ['Date', 'TempAvgF']]

    df.head()

    The output will be as follows:

    Figure 3.62: Date and TempAvgF columns of the Austin weather data

  4. Initially, we are only interested in the first year's data, so we need to extract that information only. Create a column in the DataFrame for the year value, extract the year value as an integer from the strings in the Date column, and assign these values to the Year column (note that temperatures are recorded daily). Repeat the process to create the Month and Day columns, and then extract the first year's worth of data:

    df.loc[:, 'Year'] = df.loc[:, 'Date'].str.slice(0, 4).astype('int')

    df.loc[:, 'Month'] = df.loc[:, 'Date'].str.slice(5, 7).astype('int')

    df.loc[:, 'Day'] = df.loc[:, 'Date'].str.slice(8, 10).astype('int')

    df = df.loc[df.index < 365]

    print(df.head())

    print(df.tail())

    The output will be as follows:

    Figure 3.63: New DataFrame with one year's worth of data

  5. Compute a 20-day moving average using the rolling() method:

    window = 20

    rolling = df.TempAvgF.rolling(window).mean()

    print(rolling.head())

    print(rolling.tail())

    The output will be as follows:

    Figure 3.64: DataFrame with moving average data
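    Note that the first window - 1 values of the moving average are NaN because a full 20-day window is not yet available; if values from the first day onward are preferred, rolling accepts a min_periods argument. A minimal sketch assuming the same df:

    # start averaging as soon as at least one observation is available

    rolling_partial = df.TempAvgF.rolling(window, min_periods=1).mean()

    print(rolling_partial.head())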

  6. Plot the raw data and the moving averages, with the x axis as the day number in the year:

    fig = plt.figure(figsize=(10, 7))

    ax = fig.add_axes([1, 1, 1, 1]);

    # Raw data

    ax.scatter(df.index, df.TempAvgF, \

               label = 'Raw Data', c = 'k')

    # Moving averages

    ax.plot(rolling.index, rolling, c = 'r', \

            linestyle = '--', label = f'{window} day moving average')

    ax.set_title('Air Temperature Measurements', fontsize = 16)

    ax.set_xlabel('Day', fontsize = 14)

    ax.set_ylabel('Temperature ($^\circ$F)', fontsize = 14)

    ax.set_xticks(range(df.index.min(), df.index.max(), 30))

    ax.tick_params(labelsize = 12)

    ax.legend(fontsize = 12)

    plt.show()

    The output will be as follows:

Figure 3.65: Temperature data with the 20-day moving average overlaid

Note

To access the source code for this specific section, please refer to https://packt.live/2Nl5m85.

You can also run this example online at https://packt.live/3epJvs6. You must execute the entire Notebook in order to get the desired result.

Activity 3.02: Linear Regression Using the Least Squares Method

  1. Import the required packages and classes:

    import pandas as pd

    import matplotlib.pyplot as plt

    from sklearn.linear_model import LinearRegression

  2. Load the data from the CSV (austin_weather.csv) and inspect the data (using the head() and tail() methods):

    # load data and inspect

    df = pd.read_csv('../Datasets/austin_weather.csv')

    print(df.head())

    print(df.tail())

    The output for df.head() will be as follows:

    Figure 3.66: Output for df.head()

    The output for df.tail() will be as follows:

    Figure 3.67: Output for df.tail()

  3. Drop everything except the Date and TempAvgF columns:

    df = df.loc[:, ['Date', 'TempAvgF']]

    df.head()

    The output will be as follows:

    Figure 3.68: Two columns used for Activity 3.02

  4. Create new Year, Month, and Day columns and populate them by parsing the Date column:

    # add some useful columns

    df.loc[:, 'Year'] = df.loc[:, 'Date']\

                        .str.slice(0, 4).astype('int')

    df.loc[:, 'Month'] = df.loc[:, 'Date']\

                         .str.slice(5, 7).astype('int')

    df.loc[:, 'Day'] = df.loc[:, 'Date']\

                       .str.slice(8, 10).astype('int')

    print(df.head())

    print(df.tail())

    The output will be as follows:

    Figure 3.69: Augmented data

  5. Create a new column for a moving average and populate it with a 20-day moving average of the TempAvgF column:

    """

    set a 20 day window then use that to smooth temperature in a new column

    """

    window = 20

    df['20_d_mov_avg'] = df.TempAvgF.rolling(window).mean()

    print(df.head())

    print(df.tail())

    The output will be as follows:

    Figure 3.70: Addition of the 20-day moving average

  6. Slice one complete year of data to use in a model. Ensure the year doesn't have missing data due to the moving average. Also create a column for Day_of_Year (it should start at 1):

    """

    now let's slice exactly one year on the

    calendar start and end dates

    we see from the previous output that

    2014 is the first year with complete data,

    however it will still have NaN values for

    the moving average, so we'll use 2015

    """

    df_one_year = df.loc[df.Year == 2015, :].reset_index()

    df_one_year['Day_of_Year'] = df_one_year.index + 1

    print(df_one_year.head())

    print(df_one_year.tail())

    The output will be as follows:

    Figure 3.71: One year's worth of data

  7. Create a scatterplot of the raw data (the original TempAvgF column) and overlay it with a line for the 20-day moving average:

    fig = plt.figure(figsize=(10, 7))

    ax = fig.add_axes([1, 1, 1, 1]);

    # Raw data

    ax.scatter(df_one_year.Day_of_Year, df_one_year.TempAvgF, \

               label = 'Raw Data', c = 'k')

    # Moving averages

    ax.plot(df_one_year.Day_of_Year, df_one_year['20_d_mov_avg'], \

            c = 'r', linestyle = '--', \

            label = f'{window} day moving average')

    ax.set_title('Air Temperature Measurements', fontsize = 16)

    ax.set_xlabel('Day', fontsize = 14)

    ax.set_ylabel('Temperature ($^\circ$F)', fontsize = 14)

    ax.set_xticks(range(df_one_year.Day_of_Year.min(), \

                        df_one_year.Day_of_Year.max(), 30))

    ax.tick_params(labelsize = 12)

    ax.legend(fontsize = 12)

    plt.show()

    The output will be as follows:

    Figure 3.72: Raw data with the 20-day moving average overlaid

  8. Create a linear regression model using the default parameters, that is, calculate a y intercept for the model and do not normalize the data. The day numbers for the year (1 to 365) constitute the input data and the average temperatures constitute the output data. Print the parameters of the model and the r2 value:

    # fit a linear model

    linear_model = LinearRegression(fit_intercept = True)

    linear_model.fit(df_one_year['Day_of_Year']\

                     .values.reshape((-1, 1)), \

                     df_one_year.TempAvgF)

    print('model slope: ', linear_model.coef_)

    print('model intercept: ', linear_model.intercept_)

    print('model r squared: ', \

          linear_model.score(df_one_year['Day_of_Year']\

                             .values.reshape((-1, 1)), \

                             df_one_year.TempAvgF))

    The results should be as follows:

    model slope: [0.04304568]

    model intercept: 62.23496914044859

    model r squared: 0.09549593659736466

    Note that the r2 value is very low, which is not surprising given that the data has a significant variation in the slope over time, and we are fitting a single linear model with a constant slope.
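    As a cross-check, the same slope and intercept can be recovered with NumPy's polyfit, which solves the ordinary least-squares problem directly; a minimal sketch assuming df_one_year from the previous steps (NumPy imported here as np):

    import numpy as np

    # degree-1 polynomial fit: returns [slope, intercept]

    slope, intercept = np.polyfit(df_one_year.Day_of_Year, \

                                  df_one_year.TempAvgF, deg=1)

    print('polyfit slope: ', slope)          # should be close to 0.043

    print('polyfit intercept: ', intercept)  # should be close to 62.23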

  9. Generate predictions from the model using the same x data:

    # make predictions using the training data

    y_pred = linear_model.predict(df_one_year['Day_of_Year']\

                                  .values.reshape((-1, 1)))

    x_pred = df_one_year.Day_of_Year

  10. Create a new scatterplot, as before, adding an overlay of the predictions of the model:

    fig = plt.figure(figsize=(10, 7))

    ax = fig.add_axes([1, 1, 1, 1]);

    # Raw data

    ax.scatter(df_one_year.Day_of_Year, df_one_year.TempAvgF, \

               label = 'Raw Data', c = 'k')

    # Moving averages

    ax.plot(df_one_year.Day_of_Year, df_one_year['20_d_mov_avg'], \

            c = 'r', linestyle = '--', \

            label = f'{window} day moving average')

    # linear model

    ax.plot(x_pred, y_pred, c = "blue", linestyle = '-.', \

            label = 'linear model')

    ax.set_title('Air Temperature Measurements', fontsize = 16)

    ax.set_xlabel('Day', fontsize = 14)

    ax.set_ylabel('Temperature ($^\circ$F)', fontsize = 14)

    ax.set_xticks(range(df_one_year.Day_of_Year.min(), \

                        df_one_year.Day_of_Year.max(), 30))

    ax.tick_params(labelsize = 12)

    ax.legend(fontsize = 12)

    plt.show()

    The output will be as follows:

Figure 3.73: Raw data, 20-day moving average, and linear fit

Note

To access the source code for this specific section, please refer to https://packt.live/2CwEKyT.

You can also run this example online at https://packt.live/3hKJSzD. You must execute the entire Notebook in order to get the desired result.

Activity 3.03: Dummy Variables

  1. Import the required packages and classes:

    import pandas as pd

    import matplotlib.pyplot as plt

    from sklearn.linear_model import LinearRegression

  2. Load and inspect the data:

    # load data and inspect

    df = pd.read_csv('../Datasets/austin_weather.csv')

    print(df.head())

    print(df.tail())

    The output for df.head() should appear as follows:

    Figure 3.74: Output for the df.head() function

    The output for df.tail() should appear as follows:

    Figure 3.75: Output for the df.tail() function

  3. Carry out the preprocessing as before. Drop all but the Date and TempAvgF columns. Add columns for Year, Month, and Day. Create a new column with a 20-day moving average. Slice out the first complete year (2015):

    df = df.loc[:, ['Date', 'TempAvgF']]

    # add some useful columns

    df.loc[:, 'Year'] = df.loc[:, 'Date'].str.slice(0, 4).astype('int')

    df.loc[:, 'Month'] = df.loc[:, 'Date'].str.slice(5, 7).astype('int')

    df.loc[:, 'Day'] = df.loc[:, 'Date'].str.slice(8, 10).astype('int')

    """

    set a 20 day window then use that to smooth

    temperature in a new column

    """

    window = 20

    df['20_d_mov_avg'] = df.TempAvgF.rolling(window).mean()

    """

    now let's slice exactly one year on the

    calendar start and end dates

    we see from the previous output that

    2014 is the first year with complete data,

    however it will still have NaN values for

    the moving average, so we'll use 2015

    """

    df_one_year = df.loc[df.Year == 2015, :].reset_index()

    df_one_year['Day_of_Year'] = df_one_year.index + 1

    print(df_one_year.head())

    print(df_one_year.tail())

    The data should appear as follows:

    Figure 3.76: Preprocessed data

  4. Visualize the results:

    fig = plt.figure(figsize=(10, 7))

    ax = fig.add_axes([1, 1, 1, 1]);

    # Raw data

    ax.scatter(df_one_year.Day_of_Year, df_one_year.TempAvgF, \

               label = 'Raw Data', c = 'k')

    # Moving averages

    ax.plot(df_one_year.Day_of_Year, df_one_year['20_d_mov_avg'], \

            c = 'r', linestyle = '--', \

            label = f'{window} day moving average')

    ax.set_title('Air Temperature Measurements', fontsize = 16)

    ax.set_xlabel('Day', fontsize = 14)

    ax.set_ylabel('Temperature ($^\circ$F)', fontsize = 14)

    ax.set_xticks(range(df_one_year.Day_of_Year.min(), \

                        df_one_year.Day_of_Year.max(), 30))

    ax.tick_params(labelsize = 12)

    ax.legend(fontsize = 12)

    plt.show()

    The plot should appear as follows:

    Figure 3.77: Austin temperatures and moving average

  5. We can see that the temperature rises from January to around September, and then falls again. This is a clear seasonal cycle. As a first improvement, we can include the month in the model. As described in the introduction to dummy variables, if we just encoded the months as the integers 1 to 12, the model might interpret December (12) as being more important than January (1). So, we encode the month as dummy variables to avoid this (a toy example of get_dummies is sketched after this step):

    # use the month as a dummy variable

    dummy_vars = pd.get_dummies(df_one_year['Month'], drop_first = True)

    dummy_vars.columns = ['Feb', 'Mar', 'Apr', 'May', 'Jun', \

                          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

    df_one_year = pd.concat([df_one_year, dummy_vars], \

                            axis = 1).drop('Month', axis = 1)

    df_one_year

    The data should appear as follows:

    Figure 3.78: Data augmented with dummy variables for the month
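    To see concretely what get_dummies with drop_first=True does, here is a minimal standalone sketch on a toy month column (the values are illustrative only):

    # toy example: months 1-3; month 1 (January) is dropped as the baseline

    toy = pd.DataFrame({'Month': [1, 2, 3, 1]})

    pd.get_dummies(toy['Month'], drop_first=True)

    # result has columns 2 and 3; a row of all zeros encodes month 1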

  6. Now, fit a linear model using Day_of_Year and the dummy variables, and print the model coefficients and the r2 value:

    # fit model using the month dummy vars

    linear_model = LinearRegression(fit_intercept = True)

    linear_model.fit(pd.concat([df_one_year.Day_of_Year, \

                                df_one_year.loc[:, 'Feb':'Dec']], \

                                                axis = 1),

                                df_one_year['TempAvgF'])

    print('model coefficients: ', linear_model.coef_)

    print('model intercept: ', linear_model.intercept_)

    print('model r squared: ', \

          linear_model.score(pd.concat([df_one_year.Day_of_Year, \

                                        df_one_year.loc[:, 'Feb':'Dec']], \

                                                        axis = 1),

                                        df_one_year['TempAvgF']))

    The results should be as follows:

    model coefficients: [ 0.03719346 1.57445204 9.35397321 19.16903518 22.02065629 26.80023439

    30.17121033 30.82466482 25.6117698 15.71715435 1.542969 -4.06777548]

    model intercept: 48.34038858048261

    model r squared: 0.7834805472165678

    Note the coefficients—the first value is associated with Day_of_Year, and the remaining eleven correspond to February through December (January is dropped as the baseline by drop_first=True). Relative to that baseline, the coefficients rise through spring, peak in July and August, fall through autumn, and only December is negative. This pattern makes sense for the seasons in Texas.

  7. Now, make predictions using the single-year data, and visualize the results:

    # make predictions using the data

    y_pred = \

    linear_model.predict(pd.concat([df_one_year.Day_of_Year, \

                                    df_one_year.loc[:, 'Feb':'Dec']], \

                                                    axis = 1))

    x_pred = df_one_year.Day_of_Year

    fig = plt.figure(figsize=(10, 7))

    ax = fig.add_axes([1, 1, 1, 1]);

    # Raw data

    ax.scatter(df_one_year.Day_of_Year, df_one_year.TempAvgF, \

               label = 'Raw Data', c = 'k')

    # Moving averages

    ax.plot(df_one_year.Day_of_Year, df_one_year['20_d_mov_avg'], \

            c = 'r', linestyle = '--', \

            label = f'{window} day moving average')

    # regression predictions

    ax.plot(x_pred, y_pred, c = "blue", linestyle = '-.', \

            label = 'linear model w/dummy vars')

    ax.set_title('Air Temperature Measurements', fontsize = 16)

    ax.set_xlabel('Day', fontsize = 14)

    ax.set_ylabel('Temperature ($^\circ$F)', fontsize = 14)

    ax.set_xticks(range(df_one_year.Day_of_Year.min(), \

                        df_one_year.Day_of_Year.max(), 30))

    ax.tick_params(labelsize = 12)

    ax.legend(fontsize = 12, loc = 'upper left')

    plt.show()

    The output should appear as follows:

Figure 3.79: Linear regression results with month dummy variables

Note

To access the source code for this specific section, please refer to https://packt.live/3enegOg.

You can also run this example online at https://packt.live/2V4VgMM. You must execute the entire Notebook in order to get the desired result.

Activity 3.04: Feature Engineering with Linear Regression

  1. Load the required packages and classes:

    import pandas as pd

    import numpy as np

    import matplotlib.pyplot as plt

    from sklearn.linear_model import LinearRegression

  2. Load the data and carry out preprocessing through to the point where Day_of_Year is added:

    # load data

    df = pd.read_csv('../Datasets/austin_weather.csv')

    df = df.loc[:, ['Date', 'TempAvgF']]

    # add some useful columns

    df.loc[:, 'Year'] = df.loc[:, 'Date'].str.slice(0, 4).astype('int')

    df.loc[:, 'Month'] = df.loc[:, 'Date'].str.slice(5, 7).astype('int')

    df.loc[:, 'Day'] = df.loc[:, 'Date'].str.slice(8, 10).astype('int')

    """

    set a 20 day window then use that to smooth

    temperature in a new column

    """

    window = 20

    df['20_d_mov_avg'] = df.TempAvgF.rolling(window).mean()

    """

    now let's slice exactly one year on the

    calendar start and end dates

    we see from the previous output that

    2014 is the first year with complete data,

    however it will still have NaN values for

    the moving average, so we'll use 2015

    """

    df_one_year = df.loc[df.Year == 2015, :].reset_index()

    df_one_year['Day_of_Year'] = df_one_year.index + 1

  3. Now, for the feature engineering, we construct the sine and cosine of Day_of_Year with a period of 365 days:

    # add two columns for sine and cosine of the Day_of_Year

    df_one_year['sine_Day'] = np.sin(2 * np.pi \

                              * df_one_year['Day_of_Year'] / 365)

    df_one_year['cosine_Day'] = np.cos(2 * np.pi \

                              * df_one_year['Day_of_Year'] / 365)

    df_one_year

    The data should appear as follows:

    Figure 3.80: Austin weather data with the new features, sine_Day and cosine_Day

  4. We can now fit the model using the LinearRegression class from scikit-learn, and print the coefficients and the r2 score:

    # fit model using the Day_of_Year and sin/cos

    linear_model = LinearRegression(fit_intercept = True)

    linear_model.fit(df_one_year[['Day_of_Year', 'sine_Day', \

                                  'cosine_Day']], \

                     df_one_year['TempAvgF'])

    print('model coefficients: ', linear_model.coef_)

    print('model intercept: ', linear_model.intercept_)

    print('model r squared: ', \

    linear_model.score(df_one_year[['Day_of_Year', 'sine_Day', \

                                    'cosine_Day']], \

                       df_one_year['TempAvgF']))

    The output should be as follows:

    model coefficients: [ 1.46396364e-02 -5.57332499e+00 -1.67824174e+01]

    model intercept: 67.43327530313064

    model r squared: 0.779745650129063

    Note that the r2 value is about the same as we achieved with the dummy variables. However, let's look at the predictions and see whether this model might be more or less suitable than before.
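    With these coefficients, the fitted model has the form intercept + c1*d + c2*sin(2*pi*d/365) + c3*cos(2*pi*d/365), where d is Day_of_Year; a minimal sketch evaluating it by hand for one day, assuming the fitted linear_model and df_one_year from above:

    # reconstruct the model's prediction for day 100 from the raw coefficients

    d = 100

    c1, c2, c3 = linear_model.coef_

    manual_pred = (linear_model.intercept_ + c1 * d \

                   + c2 * np.sin(2 * np.pi * d / 365) \

                   + c3 * np.cos(2 * np.pi * d / 365))

    print(manual_pred)  # should match linear_model.predict for day 100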

  5. Generate predictions using the augmented data:

    # make predictions using the data

    y_pred = \

    linear_model.predict(df_one_year[['Day_of_Year', 'sine_Day', \

                                      'cosine_Day']])

    x_pred = df_one_year.Day_of_Year

  6. Now, visualize the results:

    fig = plt.figure(figsize=(10, 7))

    ax = fig.add_axes([1, 1, 1, 1])

    # Raw data

    ax.scatter(df_one_year.Day_of_Year, df_one_year.TempAvgF, \

               label = 'Raw Data', c = 'k')

    # Moving averages

    ax.plot(df_one_year.Day_of_Year, df_one_year['20_d_mov_avg'], \

            c = 'r', linestyle = '--', \

            label = f'{window} day moving average')

    # regression predictions

    ax.plot(x_pred, y_pred, c = "blue", linestyle = '-.', \

            label = 'linear model w/sin-cos fit')

    ax.set_title('Air Temperature Measurements', fontsize = 16)

    ax.set_xlabel('Day', fontsize = 14)

    ax.set_ylabel('Temperature ($^\circ$F)', fontsize = 14)

    ax.set_xticks(range(df_one_year.Day_of_Year.min(), \

                        df_one_year.Day_of_Year.max(), 30))

    ax.tick_params(labelsize = 12)

    ax.legend(fontsize = 12, loc = 'upper left')

    plt.show()

    The output will be as follows:

Figure 3.81: Austin temperature data with moving average overlay and periodic feature fit overlay

Note

To access the source code for this specific section, please refer to https://packt.live/3dvkmet.

You can also run this example online at https://packt.live/3epnOIJ. You must execute the entire Notebook in order to get the desired result.

Activity 3.05: Gradient Descent

  1. Import the modules and classes:

    import pandas as pd

    import numpy as np

    import matplotlib.pyplot as plt

    from sklearn.metrics import r2_score

    from sklearn.linear_model import SGDRegressor

  2. Load the data (austin_weather.csv) and preprocess it up to the point of creating the Day_of_Year column and slicing one full year (2015):

    # load data and inspect

    df = pd.read_csv('../Datasets/austin_weather.csv')

    df = df.loc[:, ['Date', 'TempAvgF']]

    # add time-based columns

    df.loc[:, 'Year'] = df.loc[:, 'Date'].str.slice(0, 4).astype('int')

    df.loc[:, 'Month'] = df.loc[:, 'Date'].str.slice(5, 7).astype('int')

    df.loc[:, 'Day'] = df.loc[:, 'Date'].str.slice(8, 10).astype('int')

    """

    set a 20 day window then use that to smooth

    temperature in a new column

    """

    window = 20

    df['20_d_mov_avg'] = df.TempAvgF.rolling(window).mean()

    """

    now let's slice exactly one year on the

    calendar start and end dates

    we see from the previous output that

    2014 is the first year with complete data,

    however it will still have NaN values for

    the moving average, so we'll use 2015

    """

    df_one_year = df.loc[df.Year == 2015, :].reset_index()

    df_one_year['Day_of_Year'] = df_one_year.index + 1

    print(df_one_year.head())

    print(df_one_year.tail())

    The output will be as follows:

    Figure 3.82: Preprocessed data before scaling

  3. Scale the data for training:

    # scale the data

    X_min = df_one_year.Day_of_Year.min()

    X_range = df_one_year.Day_of_Year.max() \

              - df_one_year.Day_of_Year.min()

    Y_min = df_one_year.TempAvgF.min()

    Y_range = df_one_year.TempAvgF.max() \

              - df_one_year.TempAvgF.min()

    scale_X = (df_one_year.Day_of_Year - X_min) / X_range

    train_X = scale_X.ravel()

    train_Y = ((df_one_year.TempAvgF - Y_min) / Y_range).ravel()

  4. Set random.seed, instantiate the model object with SGDRegressor, and fit the model to the training data:

    # create the model object

    np.random.seed(42)

    model = SGDRegressor(loss = 'squared_loss', max_iter = 100, \

                         learning_rate = 'constant', eta0 = 0.0005, \

                         tol = 0.00009, penalty = 'none')

    # fit the model

    model.fit(train_X.reshape((-1, 1)), train_Y)

    The output should be as follows:

    Figure 3.83: Model object using SGDRegressor

  5. Extract the model coefficients and rescale:

    Beta0 = (Y_min + Y_range * model.intercept_[0] \

             - Y_range * model.coef_[0] * X_min / X_range)

    Beta1 = Y_range * model.coef_[0] / X_range

    print(Beta0)

    print(Beta1)

    The output should be similar to the following:

    61.45512325422412

    0.04533603293003107
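    These formulas simply invert the scaling: the model fits y_s = a + b * x_s with x_s = (x - X_min) / X_range and y_s = (y - Y_min) / Y_range, so substituting back gives y = Y_min + Y_range * a + (Y_range * b / X_range) * (x - X_min), which is exactly Beta0 + Beta1 * x. A minimal numeric check, assuming the variables from the previous steps:

    # sanity check: un-scaling a scaled prediction reproduces Beta0 + Beta1 * x

    x = 100  # an arbitrary day of the year

    x_s = (x - X_min) / X_range

    y_s = model.intercept_[0] + model.coef_[0] * x_s

    y = Y_min + Y_range * y_s

    print(y, Beta0 + Beta1 * x)  # the two values should agree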

  6. Generate predictions using the scaled data, and then get the r2 value:

    # generate predictions

    pred_X = df_one_year['Day_of_Year']

    pred_Y = model.predict(train_X.reshape((-1, 1)))

    # calculate the r squared value

    r2 = r2_score(train_Y, pred_Y)

    print('r squared = ', r2)

    The result should be similar to the following:

    r squared = 0.09462157379706759

  7. Scale the predictions back to real values and then visualize the results:

    # scale predictions back to real values

    pred_Y = (pred_Y * Y_range) + Y_min

    fig = plt.figure(figsize = (10, 7))

    ax = fig.add_axes([1, 1, 1, 1])

    # Raw data

    ax.scatter(df_one_year.Day_of_Year, df_one_year.TempAvgF, \

               label = 'Raw Data', c = 'k')

    # Moving averages

    ax.plot(df_one_year.Day_of_Year, df_one_year['20_d_mov_avg'], \

            c = 'r', linestyle = '--', \

            label = f'{window} day moving average')

    # Regression predictions

    ax.plot(pred_X, pred_Y, c = "blue", linestyle = '-.', \

            linewidth = 4, label = 'linear fit (from SGD)')

    # put the model on the plot

    ax.text(1, 85, 'Temp = ' + str(round(Beta0, 2)) + ' + ' \

            + str(round(Beta1, 4)) + ' * Day', fontsize = 16)

    ax.set_title('Air Temperature Measurements', fontsize = 16)

    ax.set_xlabel('Day', fontsize = 16)

    ax.set_ylabel('Temperature ($^\circ$F)', fontsize = 14)

    ax.set_xticks(range(df_one_year.Day_of_Year.min(), \

                        df_one_year.Day_of_Year.max(), 30))

    ax.tick_params(labelsize = 12)

    ax.legend(fontsize = 12)

    plt.show()

    The output will be as follows:

Figure 3.84: Results of linear regression using SGDRegressor

Note

To access the source code for this specific section, please refer to https://packt.live/2AY1bMZ.

You can also run this example online at https://packt.live/2NgCI86. You must execute the entire Notebook in order to get the desired result.

4. Autoregression

Activity 4.01: Autoregression Model Based on Periodic Data

  1. Import the necessary packages, classes, and libraries.

    Note

    This activity will work on an earlier version of pandas, ensure that you downgrade the version of pandas using the command:

    pip install pandas==0.24.2

    The code is as follows:

    import pandas as pd

    import numpy as np

    from statsmodels.tsa.ar_model import AR

    from statsmodels.graphics.tsaplots import plot_acf

    import matplotlib.pyplot as plt

  2. Load the data and convert the Date column to datetime:

    df = pd.read_csv('../Datasets/austin_weather.csv')

    df.Date = pd.to_datetime(df.Date)

    print(df.head())

    print(df.tail())

    The output for df.head() should look as follows:

    Figure 4.22: Output for df.head()

    The output for df.tail() should look as follows:

    Figure 4.23: Output for df.tail()

  3. Plot the complete set of average temperature values (df.TempAvgF) with Date on the x axis:

    fig, ax = plt.subplots(figsize = (10, 7))

    ax.scatter(df.Date, df.TempAvgF)

    plt.show()

    The output will be as follows:

    Figure 4.24: Plot of Austin temperature data over several years

    Note the periodic behavior of the data. It's sensible given that temperature varies over an annual weather cycle.

  4. Construct an autocorrelation plot (using statsmodels) to see whether the average temperature can be used with an autoregression model. Where is the lag acceptable and where is it not for an autoregression model? Check the following code:

    max_lag = 730

    fig, ax = plt.subplots(figsize = (10, 7))

    acf_plot = plot_acf(x = df.TempAvgF, ax = ax, lags = max_lag, \

                        use_vlines = False, alpha = 0.9, \

                        title = 'Autocorrelation of Austin Temperature '\

                                'vs. lag')

    ax.grid(True)

    ax.text(280, -0.01, '90% confidence interval', fontsize = 9)

    ax.set_xlabel('Lag', fontsize = 14)

    ax.tick_params(axis = 'both', labelsize = 12)

    The plot should look as follows:

    Figure 4.25: Autocorrelation versus lag (days)

    The lag is acceptable only when the autocorrelation line lies outside the 90% confidence bounds, as represented by the shaded area. Note that, in this case, instead of a steadily decreasing ACF value, we see peaks and valleys. This should match your intuition because the original data shows a periodic pattern. Also, note that there are very strong positive and negative correlations. It is possible to leverage the strong negative correlation at around 180 days (half a year), but that is a more advanced time series topic beyond our scope here. The main takeaway from Figure 4.25 is that there is a very steep drop in the ACF after short lag times. Now, use the same methods as before to look at the lag plots versus the ACF.

  5. Get the actual ACF values:

    corr0 = np.correlate(df.TempAvgF[0: ] - df.TempAvgF.mean(), \

            df.TempAvgF[0: ] - df.TempAvgF.mean(), mode = 'valid')

    corrs = [np.correlate(df.TempAvgF[:(df.TempAvgF.shape[0] - i)] \

             - df.TempAvgF.mean(), df.TempAvgF[i: ] \

             - df.TempAvgF.mean(), mode = 'valid')

             for i in range(max_lag)] / corr0

  6. We need the same utility grid plotting function we developed in Exercise 4.01, Creating an Autoregression Model:

    """

    utility function to plot out a range of

    plots depicting self-correlation

    """

    def plot_lag_grid(series, corrs, axis_min, axis_max, \

                      num_plots, total_lag, n_rows, n_cols):

        lag_step = int(total_lag / num_plots)

        fig = plt.figure(figsize = (18, 16))

        for i, var_name in enumerate(range(num_plots)):

            corr = corrs[lag_step * i]

            ax = fig.add_subplot(n_rows, n_cols, i + 1)

            ax.scatter(series, series.shift(lag_step * i))

            ax.set_xlim(axis_min, axis_max)

            ax.set_ylim(axis_min, axis_max)

            ax.set_title('lag = ' + str(lag_step * i))

            ax.text(axis_min + 0.05 * (axis_max - axis_min), \

                    axis_max - 0.05 * (axis_max - axis_min), \

                    'correlation = ' + str(round(corr[0], 3)))

        fig.tight_layout()

        plt.show()

  7. Now, given that we have an indication that we are interested in short lags, but also that there are strong correlations around a half year and a full year, let's look at two timescales:

    plot_lag_grid(df.TempAvgF, corrs, df.TempAvgF.min(), \

                  df.TempAvgF.max(), 9, 45, 3, 3)

    plot_lag_grid(df.TempAvgF, corrs, df.TempAvgF.min(), \

                  df.TempAvgF.max(), 9, 405, 3, 3)

    The output for short lags will be as follows:

    Figure 4.26: Lag plots with short lags

    The output for longer lags will be as follows:

    Figure 4.27: Lag plots with longer lags

    We can see from Figure 4.26 that the correlation degrades consistently from lag 5 to 40. Over a longer timescale, Figure 4.27 shows that the correlation degrades rapidly and then improves as we near a lag of one year. This matches the intuition from the plot of the raw data (side note—this should reinforce the importance of EDA).

  8. We would expect from our initial analysis that the autoregression model would focus on fairly short lags. Let's use the statsmodels AR function to build a model and see the results:

    """

    statsmodels AR function builds an autoregression model

    using all the defaults, it will determine the max lag

    and provide all the model coefficients

    """

    model = AR(df.TempAvgF)

    model_fit = model.fit()

    # model fit now contains all the model information

    max_lag = model_fit.k_ar

    """

    note that by using defaults, the maximum lag is

    computed as round(12*(nobs/100.)**(1/4.))

    see https://www.statsmodels.org/devel/generated/statsmodels.tsa.ar_model.AR.fit.html#statsmodels.tsa.ar_model.AR.fit

    """

    print('Max Lag: ' + str(max_lag))

    print('Coefficients: \n' + str(model_fit.params))

    # how far into the future we want to predict

    max_forecast = 365

    # generate predictions from the model

    pred_temp = pd.DataFrame({'pred_temp': \

                              model_fit.predict(start = max_lag, \

                                                end = df.shape[0] \

                                                + max_forecast - 1)})

    # attach the dates for visualization

    pred_temp['Date'] = df.loc[pred_temp.index, 'Date'].reindex()

    pred_temp.loc[(max(df.index) + 1):, 'Date'] = \

        pd.to_datetime([max(df.Date) \

                        + pd.Timedelta(days = i)

                        for i in range(1, max_forecast + 1)])

    The result is a model with lags of up to 23 days:

    Figure 4.28: AR model of Austin temperature data
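    The reported maximum lag follows directly from the default lag-selection rule quoted in the code comment; a minimal sketch evaluating it, assuming df from the previous steps (for the roughly 1,300 daily observations in this dataset it evaluates to 23):

    # default AR max lag: round(12 * (nobs / 100) ** (1 / 4))

    nobs = df.TempAvgF.shape[0]

    print(round(12 * (nobs / 100.) ** (1 / 4.)))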

  9. Plot the predictions, forecast, and raw data on the same plot:

    """

    visualize the predictions overlaid on the real data

    as well as the extrapolation to the future

    """

    fig, ax = plt.subplots(figsize = (10, 7))

    ax.plot(df.Date, df.TempAvgF, c = "blue", \

            linewidth = 4, label = 'Actual Average Temperature')

    ax.plot(pred_temp.loc[0 : len(df.TempAvgF), 'Date'], \

            pred_temp.loc[0 : len(df.TempAvgF), 'pred_temp'], \

            c = "yellow", linewidth = 0.5, \

            label = 'Predicted Temperature')

    ax.plot(pred_temp.loc[len(df.TempAvgF):, 'Date'], \

            pred_temp.loc[len(df.TempAvgF):, 'pred_temp'], \

            c = "red", linewidth = 2, \

            label = 'Forecast Temperature')

    ax.set_xlabel('Date', fontsize = 14)

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.set_title('Austin Texas Average Daily Temperature')

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.legend()

    plt.show()

    The output will be as follows:

    Figure 4.29: Austin temperature predictions and forecast

  10. Let's zoom in on the end of the data, on the last 30 days of the data and on the first 30 forecast values:

    # zoom in on a window near the end of the raw data

    window = 30

    fig, ax = plt.subplots(figsize = (10, 7))

    ax.plot(df.Date[(len(df.TempAvgF) - window) : len(df.TempAvgF)], \

            df.TempAvgF[(len(df.TempAvgF) - window) : \

                         len(df.TempAvgF)], \

            c = "blue", linewidth = 4, \

            label = 'Actual Average Temperature')

    ax.plot(pred_temp.Date.iloc[(-max_forecast \

                                 - window):(-max_forecast)], \

            pred_temp.pred_temp.iloc[(-max_forecast \

                                      - window):(-max_forecast)], \

            c = "red", linewidth = 2, label = 'Predicted Temperature')

    ax.plot(pred_temp.loc[len(df.TempAvgF):\

                         (len(df.TempAvgF) + window), 'Date'], \

            pred_temp.loc[len(df.TempAvgF):\

                         (len(df.TempAvgF) + window), 'pred_temp'], \

            c = "green", linewidth = 2, label = 'Forecast Temperature')

    ax.set_xlabel('Date', fontsize = 14)

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.set_title('Austin Texas Average Daily Temperature')

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.set_xticks(pd.date_range(df.Date[len(df.TempAvgF) - window], \

                                df.Date[len(df.TempAvgF) - 1] \

                                + pd.Timedelta(days = window), 5))

    ax.legend()

    plt.show()

    We will get the following output:

Figure 4.30: Detail of predictions near the end of the data

Note

To access the source code for this specific section, please refer to https://packt.live/3hOXUQL.

You can also run this example online at https://packt.live/313Vmbl. You must execute the entire Notebook in order to get the desired result.

Now that the activity is successfully completed, upgrade the version of pandas to continue to smoothly run the exercises and activities present in the rest of the book. To upgrade pandas, run:

pip install pandas==1.0.3

5. Classification Techniques

Activity 5.01: Ordinary Least Squares Classifier – Binary Classifier

Solution:

  1. Import the required dependencies:

    import struct

    import numpy as np

    import gzip

    import urllib.request

    import matplotlib.pyplot as plt

    from array import array

    from sklearn.linear_model import LinearRegression

  2. Load the MNIST data into memory:

    with gzip.open('../Datasets/train-images-idx3-ubyte.gz', 'rb') as f:

        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))

        img = np.array(array("B", f.read())).reshape((size, rows, cols))

    with gzip.open('../Datasets/train-labels-idx1-ubyte.gz', 'rb') as f:

        magic, size = struct.unpack(">II", f.read(8))

        labels = np.array(array("B", f.read()))

    with gzip.open('../Datasets/t10k-images-idx3-ubyte.gz', 'rb') as f:

        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))

        img_test = np.array(array("B", f.read()))\

                   .reshape((size, rows, cols))

    with gzip.open('../Datasets/t10k-labels-idx1-ubyte.gz', 'rb') as f:

        magic, size = struct.unpack(">II", f.read(8))

        labels_test = np.array(array("B", f.read()))

  3. Visualize a sample of the data:

    for i in range(10):

        plt.subplot(2, 5, i + 1)

        plt.imshow(img[i], cmap='gray');

        plt.title(f'{labels[i]}');

        plt.axis('off')

    The output will be as follows:

    Figure 5.63: Sample data

  4. Construct a linear classifier model to classify the digits 0 and 1. The model we are going to create will determine whether a sample is the digit 0 or the digit 1. To do this, we first need to select only those samples:

    samples_0_1 = np.where((labels == 0) | (labels == 1))[0]

    images_0_1 = img[samples_0_1]

    labels_0_1 = labels[samples_0_1]

    samples_0_1_test = np.where((labels_test == 0) | (labels_test == 1))

    images_0_1_test = img_test[samples_0_1_test]\

                      .reshape((-1, rows * cols))

    labels_0_1_test = labels_test[samples_0_1_test]

  5. Visualize the selected information. Here's the code for 0:

    sample_0 = np.where((labels == 0))[0][0]

    plt.imshow(img[sample_0], cmap='gray');

    The output will be as follows:

    Figure 5.64: First sample data

    Here's the code for 1:

    sample_1 = np.where((labels == 1))[0][0]

    plt.imshow(img[sample_1], cmap='gray');

    The output will be as follows:

    Figure 5.65: Second sample data

  6. In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape:

    images_0_1 = images_0_1.reshape((-1, rows * cols))

    images_0_1.shape

    The output will be as follows:

    (12665, 784)

  7. Let's construct the model; use the LinearRegression API and call the fit function:

    model = LinearRegression()

    model.fit(X=images_0_1, y=labels_0_1)

    The output will be as follows:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,

                     normalize=False)

  8. Determine the training set accuracy:

    model.score(X=images_0_1, y=labels_0_1)

    The output will be as follows:

    0.9705320567708795

  9. Determine the label predictions for each of the training samples, using a threshold of 0.5. Values greater than 0.5 are classified as 1, while values less than or equal to 0.5 are classified as 0:

    y_pred = model.predict(images_0_1) > 0.5

    y_pred = y_pred.astype(int)

    y_pred

    The output will be as follows:

    array([0, 1, 1, ..., 1, 0, 1])

  10. Compute the classification accuracy of the predicted training values versus the ground truth:

    np.sum(y_pred == labels_0_1) / len(labels_0_1)

    The output will be as follows:

    0.9947887879984209
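    The same accuracy can also be computed with scikit-learn's accuracy_score; a minimal sketch assuming y_pred and labels_0_1 from the previous steps:

    from sklearn.metrics import accuracy_score

    # equivalent to np.sum(y_pred == labels_0_1) / len(labels_0_1)

    accuracy_score(labels_0_1, y_pred)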

  11. Compare the performance against the test set:

    y_pred = model.predict(images_0_1_test) > 0.5

    y_pred = y_pred.astype(int)

    np.sum(y_pred == labels_0_1_test) / len(labels_0_1_test)

    The output will be as follows:

    0.9938534278959811

    Note

    To access the source code for this specific section, please refer to https://packt.live/3emRZAk.

    You can also run this example online at https://packt.live/37T4bGh. You must execute the entire Notebook in order to get the desired result.

Activity 5.02: KNN Multiclass Classifier

  1. Import the following packages:

    import struct

    import numpy as np

    import gzip

    import urllib.request

    import matplotlib.pyplot as plt

    from array import array

    from sklearn.neighbors import KNeighborsClassifier as KNN

  2. Load the MNIST data into memory.

    Training images:

    with gzip.open('../Datasets/train-images-idx3-ubyte.gz', 'rb') as f:

        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))

        img = np.array(array("B", f.read())).reshape((size, rows, cols))

    Training labels:

    with gzip.open('../Datasets/train-labels-idx1-ubyte.gz', 'rb') as f:

        magic, size = struct.unpack(">II", f.read(8))

        labels = np.array(array("B", f.read()))

    Test images:

    with gzip.open('../Datasets/t10k-images-idx3-ubyte.gz', 'rb') as f:

        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))

        img_test = np.array(array("B", f.read()))\

                   .reshape((size, rows, cols))

    Test labels:

    with gzip.open('../Datasets/t10k-labels-idx1-ubyte.gz', 'rb') as f:

        magic, size = struct.unpack(">II", f.read(8))

        labels_test = np.array(array("B", f.read()))

  3. Visualize a sample of the data:

    for i in range(10):

        plt.subplot(2, 5, i + 1)

        plt.imshow(img[i], cmap='gray');

        plt.title(f'{labels[i]}');

        plt.axis('off')

    The output will be as follows:

    Figure 5.66: Sample images

  4. Construct a KNN classifier with k=3 to classify the MNIST dataset. Again, to save processing power, randomly sample 5,000 images for use in training:

    np.random.seed(0)

    selection = np.random.choice(len(img), 5000)

    selected_images = img[selection]

    selected_labels = labels[selection]

  5. In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape:

    selected_images = selected_images.reshape((-1, rows * cols))

    selected_images.shape

    The output will be as follows:

    (5000, 784)

  6. Build the three-neighbor KNN model and fit the data to the model. Note that, in this activity, we are providing 784 features or dimensions to the model, not just 2:

    model = KNN(n_neighbors=3)

    model.fit(X=selected_images, y=selected_labels)

    The output will be as follows:

    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',

                         metric_params=None, n_jobs=None, n_neighbors=3, p=2,

                         weights='uniform')

  7. Determine the score against the training set:

    model.score(X=selected_images, y=selected_labels)

    The output will be as follows:

    0.9692

  8. Display the first two predictions for the model against the training data:

    model.predict(selected_images)[:2]

    plt.subplot(1, 2, 1)

    plt.imshow(selected_images[0].reshape((28, 28)), cmap='gray');

    plt.axis('off');

    plt.subplot(1, 2, 2)

    plt.imshow(selected_images[1].reshape((28, 28)), cmap='gray');

    plt.axis('off');

    The output will be as follows:

    Figure 5.67: First two samples of the training data

  9. Compare the performance against the test set:

    model.score(X=img_test.reshape((-1, rows * cols)), y=labels_test)

    The output will be as follows:

    0.9376
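    If you want to see which digits the model confuses, rather than just the overall accuracy, a confusion matrix is a quick follow-up. This is a minimal sketch that assumes the fitted model, img_test, rows, cols, and labels_test are still in memory:

    from sklearn.metrics import confusion_matrix

    y_pred_test = model.predict(img_test.reshape((-1, rows * cols)))

    # Rows correspond to the true digits, columns to the predicted digits

    confusion_matrix(labels_test, y_pred_test)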

    Note

    To access the source code for this specific section, please refer to https://packt.live/313xdlc.

    You can also run this example online at https://packt.live/2Nl6DMo. You must execute the entire Notebook in order to get the desired result.

Activity 5.03: Binary Classification Using a CART Decision Tree

Solution:

  1. Import the required dependencies:

    import struct

    import numpy as np

    import pandas as pd

    import gzip

    import urllib.request

    import matplotlib.pyplot as plt

    from array import array

    from sklearn.model_selection import train_test_split

    from sklearn.tree import DecisionTreeClassifier

  2. Load the MNIST data into memory:

    with gzip.open('../Datasets/train-images-idx3-ubyte.gz', 'rb') as f:

        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))

        img = np.array(array("B", f.read())).reshape((size, rows, cols))

    with gzip.open('../Datasets/train-labels-idx1-ubyte.gz', 'rb') as f:

        magic, size = struct.unpack(">II", f.read(8))

        labels = np.array(array("B", f.read()))

    with gzip.open('../Datasets/t10k-images-idx3-ubyte.gz', 'rb') as f:

        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))

        img_test = np.array(array("B", f.read()))\

                   .reshape((size, rows, cols))

    with gzip.open('../Datasets/t10k-labels-idx1-ubyte.gz', 'rb') as f:

        magic, size = struct.unpack(">II", f.read(8))

        labels_test = np.array(array("B", f.read()))

  3. Visualize a sample of the data:

    for i in range(10):

        plt.subplot(2, 5, i + 1)

        plt.imshow(img[i], cmap='gray');

        plt.title(f'{labels[i]}');

        plt.axis('off')

    The output will be as follows:

    Figure 5.68: Sample data

  4. Construct a classifier model to classify the digits 0 and 1. The model will determine whether a given sample is the digit 0 or the digit 1. To do this, we first need to select only those samples:

    samples_0_1 = np.where((labels == 0) | (labels == 1))[0]

    images_0_1 = img[samples_0_1]

    labels_0_1 = labels[samples_0_1]

    samples_0_1_test = np.where((labels_test == 0) | (labels_test == 1))

    images_0_1_test = img_test[samples_0_1_test]\

                      .reshape((-1, rows * cols))

    labels_0_1_test = labels_test[samples_0_1_test]

  5. Visualize the selected information. Here's the code for 0:

    sample_0 = np.where((labels == 0))[0][0]

    plt.imshow(img[sample_0], cmap='gray');

    The output will be as follows:

    Figure 5.69: First sample data

    Here's the code for 1:

    sample_1 = np.where((labels == 1))[0][0]

    plt.imshow(img[sample_1], cmap='gray');

    The output will be as follows:

    Figure 5.70: Second sample data

  6. In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape:

    images_0_1 = images_0_1.reshape((-1, rows * cols))

    images_0_1.shape

    The output will be as follows:

    (12665, 784)

  7. Let's construct the model; use the DecisionTreeClassifier API and call the fit function:

    model = DecisionTreeClassifier(random_state=123)

    model = model.fit(X=images_0_1, y=labels_0_1)

    model

    The output will be as follows:

    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,

                           max_features=None, max_leaf_nodes=None,

                           min_impurity_decrease=0.0, min_impurity_split=None,

                           min_samples_leaf=1, min_samples_split=2,

                           min_weight_fraction_leaf=0.0, presort=False,

                           random_state=None, splitter='best')

  8. Determine the training set accuracy:

    model.score(X=images_0_1, y=labels_0_1)

    The output will be as follows:

    1.0
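    A perfect training score usually means the unpruned tree has memorized the training digits. If you want to gauge the effect of pruning, a depth-limited variant is a quick side experiment; this is a sketch only, and max_depth=5 is an arbitrary choice rather than part of the activity:

    shallow_model = DecisionTreeClassifier(max_depth=5, random_state=123)

    shallow_model.fit(X=images_0_1, y=labels_0_1)

    # The training accuracy will typically drop slightly below 1.0

    shallow_model.score(X=images_0_1, y=labels_0_1)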

  9. Determine the label predictions for each of the training samples, using a threshold of 0.5. Values greater than 0.5 are classified as 1, while values less than or equal to 0.5 are classified as 0:

    y_pred = model.predict(images_0_1) > 0.5

    y_pred = y_pred.astype(int)

    y_pred

  10. Compute the classification accuracy of the predicted training values versus the ground truth:

    np.sum(y_pred == labels_0_1) / len(labels_0_1)

  11. Compare the performance against the test set:

    y_pred = model.predict(images_0_1_test) > 0.5

    y_pred = y_pred.astype(int)

    np.sum(y_pred == labels_0_1_test) / len(labels_0_1_test)

    The output will be as follows:

    0.9962174940898345

    Note

    To access the source code for this specific section, please refer to https://packt.live/3hNUJbT.

    You can also run this example online at https://packt.live/2Cq5W25. You must execute the entire Notebook in order to get the desired result.

Activity 5.04: Breast Cancer Diagnosis Classification Using Artificial Neural Networks

  1. Import the required packages. For this activity, we will require the pandas package for loading the data, the matplotlib package for plotting, and scikit-learn for creating the neural network model, as well as to split the dataset into training and test sets. Import all the required packages and relevant modules for these tasks:

    import pandas as pd

    import matplotlib.pyplot as plt

    from sklearn.neural_network import MLPClassifier

    from sklearn.model_selection import train_test_split

    from sklearn import preprocessing

  2. Load the Breast Cancer Diagnosis dataset using pandas and examine the first five rows:

    df = pd.read_csv('../Datasets/breast-cancer-data.csv')

    df.head()

    The output will be as follows:

    Figure 5.71: First five rows of the breast cancer dataset

    Additionally, dissect the dataset into input (X) and output (y) variables:

    X, y = df[[c for c in df.columns if c != 'diagnosis']], df.diagnosis

  3. The next step is feature engineering. The columns of this dataset have very different scales of magnitude, so before constructing and training a neural network model, we normalize the dataset. For this, we use the MinMaxScaler API from sklearn, which normalizes each column's values between 0 and 1, as discussed in the Logistic Regression section of this chapter (see Exercise 5.03, Logistic Regression – Multiclass Classifier): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html:

    X_array = X.values #returns a numpy array

    min_max_scaler = preprocessing.MinMaxScaler()

    X_array_scaled = min_max_scaler.fit_transform(X_array)

    X = pd.DataFrame(X_array_scaled, columns=X.columns)

    Examine the first five rows of the normalized dataset:

    X.head()

    The output will be as follows:

    Figure 5.72: First five rows of the normalized dataset

  4. Before we can construct the model, we must first convert the diagnosis values into labels that can be used within the model. Replace the benign diagnosis string with the value 0, and the malignant diagnosis string with the value 1:

    diagnoses = ['benign', 'malignant',]

    output = [diagnoses.index(diag) for diag in y]
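    An equivalent, and arguably more idiomatic, way to encode the labels is with the pandas map method. This sketch assumes y is the diagnosis Series created in Step 2 and produces the same 0/1 encoding as an array:

    # 'benign' -> 0, 'malignant' -> 1

    output = y.map({'benign': 0, 'malignant': 1}).values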

  5. Also, in order to impartially evaluate the model, we should split the training dataset into a training and a validation set:

    train_X, valid_X, \

    train_y, valid_y = train_test_split(X, output, \

                                        test_size=0.2, random_state=123)

  6. Create the model using the normalized dataset and the assigned diagnosis labels:

    model = MLPClassifier(solver='sgd', hidden_layer_sizes=(100,), \

                          max_iter=1000, random_state=1, \

                          learning_rate_init=.01)

    model.fit(X=train_X, y=train_y)

    The output will be as follows:

    MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto',

                  beta_1=0.9, beta_2=0.999, early_stopping=False,

                  epsilon=1e-08, hidden_layer_sizes=(100,),

                  learning_rate='constant',

                  learning_rate_init=0.01, max_iter=1000, momentum=0.9,

                  n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,

                  random_state=1, shuffle=True, solver='sgd', tol=0.0001,

                  validation_fraction=0.1, verbose=False, warm_start=False)

  7. Compute the accuracy of the model against the validation set:

    model.score(valid_X, valid_y)

    The output will be as follows:

    0.9824561403508771
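    Accuracy alone can hide class-specific mistakes. If you want a quick breakdown of the validation errors, a confusion matrix is one option; this is a sketch that assumes the fitted model, valid_X, and valid_y are still in memory, and it adds an import that is not part of the activity:

    from sklearn.metrics import confusion_matrix

    # Rows are the true classes (0 = benign, 1 = malignant), columns are the predictions

    confusion_matrix(valid_y, model.predict(valid_X))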

    Note

    To access the source code for this specific section, please refer to https://packt.live/3dpNt2G.

    You can also run this example online at https://packt.live/37OpdWM. You must execute the entire Notebook in order to get the desired result.

6. Ensemble Modeling

Activity 6.01: Stacking with Standalone and Ensemble Algorithms

Solution

  1. Import the relevant libraries:

    import pandas as pd

    import numpy as np

    import seaborn as sns

    %matplotlib inline

    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split

    from sklearn.metrics import mean_absolute_error

    from sklearn.model_selection import KFold

    from sklearn.linear_model import LinearRegression

    from sklearn.tree import DecisionTreeRegressor

    from sklearn.neighbors import KNeighborsRegressor

    from sklearn.ensemble import GradientBoostingRegressor, \

    RandomForestRegressor

  2. Read the data:

    data = pd.read_csv('boston_house_prices.csv')

    data.head()

    Note

    The preceding code snippet assumes that the dataset is in the same folder as the exercise notebook. If your dataset is in the Datasets folder, use the following code instead: data = pd.read_csv('../Datasets/boston_house_prices.csv')

    You will get the following output:

    Figure 6.15: Top rows of the Boston housing dataset

  3. Preprocess the dataset to remove null values to prepare the data for modeling:

    # check how many columns have less than 10 % null data

    perc_missing = data.isnull().mean()*100

    cols = perc_missing[perc_missing < 10].index.tolist()

    cols

    You will get the following output:

    Figure 6.16: Number of columns

    And then fill in the missing values, if any:

    data_final = data.fillna(-1)

  4. Divide the dataset into train and validation DataFrames:

    train, val = train_test_split(data_final, \

                                  test_size=0.2, \

                                  random_state=11)

    x_train = train.drop(columns=['PRICE'])

    y_train = train['PRICE'].values

    x_val = val.drop(columns=['PRICE'])

    y_val = val['PRICE'].values

  5. Initialize dictionaries in which to store the train and validation MAE values:

    train_mae_values, val_mae_values = {}, {}

  6. Train a decision tree (dt) model with the following hyperparameters and save the scores:

    dt_params = {'criterion': 'mae', 'min_samples_leaf': 15, \

                 'random_state': 11}

    dt = DecisionTreeRegressor(**dt_params)

    dt.fit(x_train, y_train)

    dt_preds_train = dt.predict(x_train)

    dt_preds_val = dt.predict(x_val)

    train_mae_values['dt'] = mean_absolute_error(y_true=y_train, \

                                                 y_pred=dt_preds_train)

    val_mae_values['dt'] = mean_absolute_error(y_true=y_val, \

                                               y_pred=dt_preds_val)

  7. Train a k-nearest neighbours (knn) model with the following hyperparameters and save the scores:

    knn_params = {'n_neighbors': 5}

    knn = KNeighborsRegressor(**knn_params)

    knn.fit(x_train, y_train)

    knn_preds_train = knn.predict(x_train)

    knn_preds_val = knn.predict(x_val)

    train_mae_values['knn'] = mean_absolute_error(y_true=y_train, \

                                                  y_pred=knn_preds_train)

    val_mae_values['knn'] = mean_absolute_error(y_true=y_val, \

                                                y_pred=knn_preds_val)

  8. Train a random forest (rf) model with the following hyperparameters and save the scores:

    rf_params = {'n_estimators': 20, 'criterion': 'mae', \

                 'max_features': 'sqrt', 'min_samples_leaf': 10, \

                 'random_state': 11, 'n_jobs': -1}

    rf = RandomForestRegressor(**rf_params)

    rf.fit(x_train, y_train)

    rf_preds_train = rf.predict(x_train)

    rf_preds_val = rf.predict(x_val)

    train_mae_values['rf'] = mean_absolute_error(y_true=y_train, \

                                                 y_pred=rf_preds_train)

    val_mae_values['rf'] = mean_absolute_error(y_true=y_val, \

                                               y_pred=rf_preds_val)

  9. Train a gradient boosting regression (gbr) model with the following hyperparameters and save the scores:

    gbr_params = {'n_estimators': 20, 'criterion': 'mae', \

                  'max_features': 'sqrt', 'min_samples_leaf': 10, \

                  'random_state': 11}

    gbr = GradientBoostingRegressor(**gbr_params)

    gbr.fit(x_train, y_train)

    gbr_preds_train = gbr.predict(x_train)

    gbr_preds_val = gbr.predict(x_val)

    train_mae_values['gbr'] = mean_absolute_error(y_true=y_train, \

                                                  y_pred=gbr_preds_train)

    val_mae_values['gbr'] = mean_absolute_error(y_true=y_val, \

                                                y_pred=gbr_preds_val)

  10. Prepare the training and validation datasets, with the four base estimators using the same hyperparameters that were used in the previous steps. First, we build the training set:

    num_base_predictors = len(train_mae_values) # 4

    x_train_with_metapreds = np.zeros((x_train.shape[0], \

                             x_train.shape[1]+num_base_predictors))

    x_train_with_metapreds[:, :-num_base_predictors] = x_train

    x_train_with_metapreds[:, -num_base_predictors:] = -1

    kf = KFold(n_splits=5, random_state=11)

    for train_indices, val_indices in kf.split(x_train):

        kfold_x_train, kfold_x_val = x_train.iloc[train_indices], \

                                     x_train.iloc[val_indices]

        kfold_y_train, kfold_y_val = y_train[train_indices], \

                                     y_train[val_indices]

        predictions = []

        dt = DecisionTreeRegressor(**dt_params)

        dt.fit(kfold_x_train, kfold_y_train)

        predictions.append(dt.predict(kfold_x_val))

        knn = KNeighborsRegressor(**knn_params)

        knn.fit(kfold_x_train, kfold_y_train)

        predictions.append(knn.predict(kfold_x_val))

        rf = RandomForestRegressor(**rf_params)

        rf.fit(kfold_x_train, kfold_y_train)

        predictions.append(rf.predict(kfold_x_val))

        gbr = GradientBoostingRegressor(**gbr_params)

        gbr.fit(kfold_x_train, kfold_y_train)

        predictions.append(gbr.predict(kfold_x_val))

        for i, preds in enumerate(predictions):

            x_train_with_metapreds[val_indices, -(i+1)] = preds

  11. Prepare the validation set:

    x_val_with_metapreds = np.zeros((x_val.shape[0], \

                                     x_val.shape[1]+num_base_predictors))

    x_val_with_metapreds[:, :-num_base_predictors] = x_val

    x_val_with_metapreds[:, -num_base_predictors:] = -1

    predictions = []

    dt = DecisionTreeRegressor(**dt_params)

    dt.fit(x_train, y_train)

    predictions.append(dt.predict(x_val))

    knn = KNeighborsRegressor(**knn_params)

    knn.fit(x_train, y_train)

    predictions.append(knn.predict(x_val))

    rf = RandomForestRegressor(**rf_params)

    rf.fit(x_train, y_train)

    predictions.append(rf.predict(x_val))

    gbr = GradientBoostingRegressor(**gbr_params)

    gbr.fit(x_train, y_train)

    predictions.append(gbr.predict(x_val))

    for i, preds in enumerate(predictions):

        x_val_with_metapreds[:, -(i+1)] = preds

  12. Train a linear regression (lr) model as the stacked model:

    lr = LinearRegression(normalize=True)

    lr.fit(x_train_with_metapreds, y_train)

    lr_preds_train = lr.predict(x_train_with_metapreds)

    lr_preds_val = lr.predict(x_val_with_metapreds)

    train_mae_values['lr'] = mean_absolute_error(y_true=y_train, \

                                                 y_pred=lr_preds_train)

    val_mae_values['lr'] = mean_absolute_error(y_true=y_val, \

                                               y_pred=lr_preds_val)

  13. Visualize the train and validation errors for each individual model and the stacked model:

    mae_scores = pd.concat([pd.Series(train_mae_values, name='train'), \

                            pd.Series(val_mae_values, name='val')], \

                            axis=1)

    mae_scores

    First, you get the following output:

Figure 6.17: Values of training and validation errors

Now, plot the MAE scores on a bar plot using the following code:

mae_scores.plot(kind='bar', figsize=(10,7))

plt.ylabel('MAE')

plt.xlabel('Model')

plt.show()

The final output will be as follows:

Figure 6.18: Visualization of training and validation errors

Note

To access the source code for this specific section, please refer to https://packt.live/3fNqtMG.

You can also run this example online at https://packt.live/2Yn2VIl. You must execute the entire Notebook in order to get the desired result.

7. Model Evaluation

Activity 7.01: Final Test Project

  1. Import the relevant libraries:

    import pandas as pd

    import numpy as np

    import json

    %matplotlib inline

    import matplotlib.pyplot as plt

    from sklearn.preprocessing import OneHotEncoder

    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    from sklearn.ensemble import GradientBoostingClassifier

    from sklearn.metrics import (accuracy_score, precision_score, \

    recall_score, confusion_matrix, precision_recall_curve)

  2. Read the breast-cancer-data.csv dataset:

    data = pd.read_csv('../Datasets/breast-cancer-data.csv')

    data.info()

  3. Let's separate the input data (X) and the target (y):

    X = data.drop(columns=['diagnosis'])

    y = data['diagnosis'].map({'malignant': 1, 'benign': 0}.get).values

  4. Split the dataset into training and test sets:

    X_train, X_test, \

    y_train, y_test = train_test_split(X, y, \

                                       test_size=0.2, random_state=11)

    print(X_train.shape)

    print(y_train.shape)

    print(X_test.shape)

    print(y_test.shape)

    You should get the following output:

    (455, 30)

    (455,)

    (114, 30)

    (114,)

  5. Choose a base model and define the range of hyperparameter values corresponding to the model to be searched for hyperparameter tuning. Let's use a gradient-boosted classifier as our model. We then define ranges of values for all hyperparameters we want to tune in the form of a dictionary:

    meta_gbc = GradientBoostingClassifier()

    param_dist = {'n_estimators': list(range(10, 210, 10)), \

                  'criterion': ['mae', 'mse'],\

                  'max_features': ['sqrt', 'log2', 0.25, 0.3, \

                                   0.5, 0.8, None], \

                  'max_depth': list(range(1, 10)), \

                  'min_samples_leaf': list(range(1, 10))}

  6. Define the parameters with which to initialize the RandomizedSearchCV object and use K-fold cross-validation to identify the best model hyperparameters. Set cv to 5 so that the hyperparameters are chosen by evaluating performance with 5-fold cross-validation. Then, initialize the RandomizedSearchCV object and use the .fit() method to initiate optimization:

    rand_search_params = {'param_distributions': param_dist, \

                          'scoring': 'accuracy', 'n_iter': 100, \

                          'cv': 5, 'return_train_score': True, \

                          'n_jobs': -1, 'random_state': 11 }

    random_search = RandomizedSearchCV(meta_gbc, **rand_search_params)

    random_search.fit(X_train, y_train)

    You should get the following output:

    Figure 7.36: The RandomizedSearchCSV object

    Once the tuning is complete, find the position (iteration number) at which the highest mean test score was obtained. Find the corresponding hyperparameters and save them to a dictionary:

    idx = np.argmax(random_search.cv_results_['mean_test_score'])

    final_params = random_search.cv_results_['params'][idx]

    final_params

    You should get the following output:

    Figure 7.37: Hyperparameters
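    Because RandomizedSearchCV refits the best estimator by default, the same information is also exposed directly on the fitted search object; this is simply an equivalent shortcut, not an extra tuning step:

    # Should match final_params and the highest mean test score found above

    random_search.best_params_, random_search.best_score_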

  7. Split the training dataset further into training and validation sets and train a new model using the final hyperparameters on the training dataset. Use scikit-learn's train_test_split() method to split X and y into train and validation components, with the validation set comprising 15% of the dataset:

    train_X, val_X, \

    train_y, val_y = train_test_split(X_train, y_train, \

                                      test_size=0.15, random_state=11)

    train_X.shape, train_y.shape, val_X.shape, val_y.shape

    You should get the following output:

    ((386, 30), (386,), (69, 30), (69,))

  8. Train the gradient-boosted classification model using the final hyperparameters and make predictions in relation to the training and validation sets. Also, calculate the probability regarding the validation set:

    gbc = GradientBoostingClassifier(**final_params)

    gbc.fit(train_X, train_y)

    preds_train = gbc.predict(train_X)

    preds_val = gbc.predict(val_X)

    pred_probs_val = np.array([each[1] \

                     for each in gbc.predict_proba(val_X)])

  9. Calculate accuracy, precision, and recall for predictions in relation to the validation set, and print the confusion matrix:

    print('train accuracy_score = {}'\

    .format(accuracy_score(y_true=train_y, y_pred=preds_train)))

    print('validation accuracy_score = {}'\

    .format(accuracy_score(y_true=val_y, y_pred=preds_val)))

    print('confusion_matrix: \n{}'\

    .format(confusion_matrix(y_true=val_y, y_pred=preds_val)))

    print('precision_score = {}'\

    .format(precision_score(y_true=val_y, y_pred=preds_val)))

    print('recall_score = {}'\

    .format(recall_score(y_true=val_y, y_pred=preds_val)))

    You should get the following output:

    Figure 7.38: Evaluation scores and the confusion matrix

  10. Experiment with varying thresholds to find an optimal operating point with high recall.

    Plot the precision-recall curve:

    plt.figure(figsize=(10,7))

    precision, recall, \

    thresholds = precision_recall_curve(val_y, \

                                        pred_probs_val)

    plt.plot(recall, precision)

    plt.xlabel('Recall')

    plt.ylabel('Precision')

    plt.show()

    The output will be as follows:

    Figure 7.39: Precision recall curve

    """

    Plot the variation in precision and recall with increasing threshold values.

    """

    PR_variation_df = pd.DataFrame({'precision': precision, \

                                    'recall': recall}, \

                                    index=list(thresholds)+[1])

    PR_variation_df.plot(figsize=(10,7))

    plt.xlabel('Threshold')

    plt.ylabel('P/R values')

    plt.show()

    You should get the following output:

    Figure 7.40: Variation in precision and recall with increasing threshold values

  11. Finalize a threshold that will be used for predictions in relation to the test dataset. Let's finalize a value, say, 0.05. This value is entirely dependent on what you feel would be optimal based on your exploration in the previous step:

    final_threshold = 0.05
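    If you would rather derive the cut-off from the curve instead of reading it off the plot, one option is to take the highest threshold that still meets a target recall. This is a sketch using the precision, recall, and thresholds arrays from Step 10; the 0.95 target is an arbitrary assumption, and it presumes at least one threshold satisfies it:

    target_recall = 0.95

    # thresholds has one fewer element than recall, so align the arrays first

    candidate_idx = np.where(recall[:-1] >= target_recall)[0]

    # Highest threshold that still keeps recall at or above the target

    programmatic_threshold = thresholds[candidate_idx].max()

    programmatic_threshold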

  12. Predict the final values in relation to the test dataset and save them to a file. Use the final threshold value determined in the previous step to find the class for each sample in the test set. Then, write the final predictions to the final_predictions.csv file:

    pred_probs_test = np.array([each[1] \

                      for each in gbc.predict_proba(X_test)])

    preds_test = (pred_probs_test > final_threshold).astype(int)

    preds_test

    The output will be as follows:

Figure 7.41: Prediction for final values for the test dataset

Alternatively, you can also get the output in CSV format:

with open('final_predictions.csv', 'w') as f:

    f.writelines([str(val)+'\n' for val in preds_test])

The output will be a CSV file as follows:

Figure 7.42: Output for the final values
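If you prefer pandas for the write-out, an equivalent one-liner is shown below (a sketch; it assumes preds_test from the previous step is still in memory):

# Write one prediction per line, with no index or header

pd.Series(preds_test).to_csv('final_predictions.csv', index=False, header=False)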

Note

To access the source code for this specific section, please refer to https://packt.live/2Ynw6Lt.

You can also run this example online at https://packt.live/3erAajt. You must execute the entire Notebook in order to get the desired result.