The Supervised Learning Workshop

Second Edition

Copyright © 2020 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Blaine Bateman, Ashish Ranjan Jha, Benjamin Johnston, and Ishita Mathur

Reviewers: Tiffany Ford, Sukanya Mandal, Ashish Pratik Patil, and Ratan Singh

Managing Editor: Snehal Tambe

Acquisitions Editors: Manuraj Nair, Sneha Shinde, and Anindya Sil

Production Editor: Salma Patel

Editorial Board: Megan Carlisle, Samuel Christa, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: April 2019

Second edition: February 2020

Production reference: 2270720

ISBN: 978-1-80020-904-6

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface

1. Fundamentals

Introduction

When to Use Supervised Learning

Python Packages and Modules

Loading Data in Pandas

Exercise 1.01: Loading and Summarizing the Titanic Dataset

Exercise 1.02: Indexing and Selecting Data

Exercise 1.03: Advanced Indexing and Selection

Pandas Methods

Exercise 1.04: Using the Aggregate Method

Quantiles

Lambda Functions

Exercise 1.05: Creating Lambda Functions

Data Quality Considerations

Managing Missing Data

Class Imbalance

Low Sample Size

Activity 1.01: Implementing Pandas Functions

Summary

2. Exploratory Data Analysis and Visualization

Introduction

Exploratory Data Analysis (EDA)

Summary Statistics and Central Values

Exercise 2.01: Summarizing the Statistics of Our Dataset

Missing Values

Finding Missing Values

Exercise 2.02: Visualizing Missing Values

Imputation Strategies for Missing Values

Exercise 2.03: Performing Imputation Using Pandas

Exercise 2.04: Performing Imputation Using Scikit-Learn

Exercise 2.05: Performing Imputation Using Inferred Values

Activity 2.01: Summary Statistics and Missing Values

Distribution of Values

Target Variable

Exercise 2.06: Plotting a Bar Chart

Categorical Data

Exercise 2.07: Identifying Data Types for Categorical Variables

Exercise 2.08: Calculating Category Value Counts

Exercise 2.09: Plotting a Pie Chart

Continuous Data

Skewness

Kurtosis

Exercise 2.10: Plotting a Histogram

Exercise 2.11: Computing Skew and Kurtosis

Activity 2.02: Representing the Distribution of Values Visually

Relationships within the Data

Relationship between Two Continuous Variables

Pearson's Coefficient of Correlation

Exercise 2.12: Plotting a Scatter Plot

Exercise 2.13: Plotting a Correlation Heatmap

Using Pairplots

Exercise 2.14: Implementing a Pairplot

Relationship between a Continuous and a Categorical Variable

Exercise 2.15: Plotting a Bar Chart

Exercise 2.16: Visualizing a Box Plot

Relationship between Two Categorical Variables

Exercise 2.17: Plotting a Stacked Bar Chart

Activity 2.03: Relationships within the Data

Summary

3. Linear Regression

Introduction

Regression and Classification Problems

The Machine Learning Workflow

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

Exercise 3.01: Plotting Data with a Moving Average

Activity 3.01: Plotting Data with a Moving Average

Linear Regression

Least Squares Method

The Scikit-Learn Model API

Exercise 3.02: Fitting a Linear Model Using the Least Squares Method

Activity 3.02: Linear Regression Using the Least Squares Method

Linear Regression with Categorical Variables

Exercise 3.03: Introducing Dummy Variables

Activity 3.03: Dummy Variables

Polynomial Models with Linear Regression

Exercise 3.04: Polynomial Models with Linear Regression

Activity 3.04: Feature Engineering with Linear Regression

Generic Model Training

Gradient Descent

Exercise 3.05: Linear Regression with Gradient Descent

Exercise 3.06: Optimizing Gradient Descent

Activity 3.05: Gradient Descent

Multiple Linear Regression

Exercise 3.07: Multiple Linear Regression

Summary

4. Autoregression

Introduction

Autoregression Models

Exercise 4.01: Creating an Autoregression Model

Activity 4.01: Autoregression Model Based on Periodic Data

Summary

5. Classification Techniques

Introduction

Ordinary Least Squares as a Classifier

Exercise 5.01: Ordinary Least Squares as a Classifier

Logistic Regression

Exercise 5.02: Logistic Regression as a Classifier – Binary Classifier

Exercise 5.03: Logistic Regression – Multiclass Classifier

Activity 5.01: Ordinary Least Squares Classifier – Binary Classifier

Select K Best Feature Selection

Exercise 5.04: Breast Cancer Diagnosis Classification Using Logistic Regression

Classification Using K-Nearest Neighbors

Exercise 5.05: KNN Classification

Exercise 5.06: Visualizing KNN Boundaries

Activity 5.02: KNN Multiclass Classifier

Classification Using Decision Trees

Exercise 5.07: ID3 Classification

Classification and Regression Tree

Exercise 5.08: Breast Cancer Diagnosis Classification Using a CART Decision Tree

Activity 5.03: Binary Classification Using a CART Decision Tree

Artificial Neural Networks

Exercise 5.09: Neural Networks – Multiclass Classifier

Activity 5.04: Breast Cancer Diagnosis Classification Using Artificial Neural Networks

Summary

6. Ensemble Modeling

Introduction

One-Hot Encoding

Exercise 6.01: Importing Modules and Preparing the Dataset

Overfitting and Underfitting

Underfitting

Overfitting

Overcoming the Problem of Underfitting and Overfitting

Bagging

Bootstrapping

Exercise 6.02: Using the Bagging Classifier

Random Forest

Exercise 6.03: Building the Ensemble Model Using Random Forest

Boosting

Adaptive Boosting

Exercise 6.04: Implementing Adaptive Boosting

Gradient Boosting

Exercise 6.05: Implementing GradientBoostingClassifier to Build an Ensemble Model

Stacking

Exercise 6.06: Building a Stacked Model

Activity 6.01: Stacking with Standalone and Ensemble Algorithms

Summary

7. Model Evaluation

Introduction

Importing the Modules and Preparing Our Dataset

Evaluation Metrics

Regression Metrics

Exercise 7.01: Calculating Regression Metrics

Classification Metrics

Numerical Metrics

Curve Plots

Exercise 7.02: Calculating Classification Metrics

Splitting a Dataset

Hold-Out Data

K-Fold Cross-Validation

Sampling

Exercise 7.03: Performing K-Fold Cross-Validation with Stratified Sampling

Performance Improvement Tactics

Variation in Train and Test Errors

Learning Curve

Validation Curve

Hyperparameter Tuning

Exercise 7.04: Hyperparameter Tuning with Random Search

Feature Importance

Exercise 7.05: Feature Importance Using Random Forest

Activity 7.01: Final Test Project

Summary

Appendix