
2. Exploratory Data Analysis and Visualization

Overview

This chapter takes us through how to perform exploration and analysis on a new dataset. By the end of this chapter, you will be able to explain the importance of data exploration and communicate the summary statistics of a dataset. You will visualize patterns in missing values in data and be able to replace null values appropriately. You will be equipped to identify continuous and categorical features, and to visualize the distributions of values across individual variables. You will also be able to describe and analyze relationships between different types of variables using correlation and visualizations.

Introduction

Say we have a problem statement that involves predicting whether a particular earthquake caused a tsunami. How do we decide what model to use? What do we know about the data we have? Nothing! But if we don't know and understand our data, chances are we'll end up building a model that's not very interpretable or reliable. When it comes to data science, it's important to have a thorough understanding of the data we're dealing with, in order to generate features that are highly informative and, consequently, to build accurate and powerful models. To acquire this understanding, we perform an exploratory analysis of the data to see what the data can tell us about the relationships between the features and the target variable (the value that you are trying to predict using the other variables). Getting to know our data will even help us interpret the model we build and identify ways we can improve its accuracy. The approach we take to achieve this is to allow the data to reveal its structure or model, which helps us gain some new, often unsuspected, insight into the data.

We will first begin with a brief introduction to exploratory data analysis and then progress to summary statistics and central values. This chapter also teaches you how to find and visualize missing values, and describes various imputation strategies for addressing them. The remainder of the chapter focuses on visualizations. Specifically, it teaches you how to create various plots, such as scatter plots, histograms, pie charts, heatmaps, pair plots, and more. Let's begin with exploratory data analysis.

Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is defined as a method to analyze datasets and sum up their main characteristics to derive useful conclusions, often with visual methods.

The purpose of EDA is to:

  • Discover patterns within a dataset
  • Spot anomalies
  • Form hypotheses regarding the behavior of data
  • Validate assumptions

Everything from basic summary statistics to complex visualizations helps us gain an intuitive understanding of the data itself, which is highly important when it comes to forming new hypotheses about the data and uncovering what parameters affect the target variable. Often, discovering how the target variable varies across a single feature gives us an indication of how important a feature might be, and a variation across a combination of several features helps us to come up with ideas for new informative features to engineer.

Most explorations and visualizations are intended to understand the relationship between the features and the target variable. This is because we want to find out what relationships exist (or don't exist) between the data we have and the values we want to predict.

EDA can tell us about:

  • Features that are unclean, have missing values, or have outliers
  • Features that are informative and are a good indicator of the target
  • The kind of relationships features have with the target
  • Further features that the data might need that we don't already have
  • Edge cases you might need to account for separately
  • Filters you might need to apply to the dataset
  • The presence of incorrect or fake data points

Now that we've looked at why EDA is important and what it can tell us, let's talk about what exactly EDA involves. EDA can involve anything from looking at basic summary statistics to visualizing complex trends over multiple variables. However, even simple statistics and plots can be powerful tools, as they may reveal important facts about the data that could change our modeling perspective. When we see plots representing data, we are able to easily detect trends and patterns, compared to just raw data and numbers. These visualizations further allow us to ask questions such as "How?" and "Why?", and form hypotheses about the dataset that can be validated by further visualizations. This is a continuous process that leads to a deeper understanding of the data.

The dataset that we will use for our exploratory analysis and visualizations has been taken from the Significant Earthquake Database from NOAA, available as a public dataset on Google BigQuery (table ID: 'bigquery-public-data.noaa_significant_earthquakes.earthquakes'). We will be using a subset of the columns available, the metadata for which is available at https://console.cloud.google.com/bigquery?project=packt-data&folder&organizationId&p=bigquery-public-data&d=noaa_significant_earthquakes&t=earthquakes&page=table, and will load it into a pandas DataFrame to perform the exploration. We'll primarily be using Matplotlib for most of our visualizations, along with the Seaborn and Missingno libraries for some. It is to be noted, however, that Seaborn merely provides a wrapper over Matplotlib's functionalities, so anything that is plotted using Seaborn can also be plotted using Matplotlib. We'll try to keep things interesting by using visualizations from both libraries.

The exploration and analysis will be conducted keeping in mind a sample problem statement: Given the data we have, we want to predict whether an earthquake caused a tsunami. This will be a classification problem (more on this in Chapter 5, Classification Techniques) where the target variable is the flag_tsunami column.

Before we begin, let's first import the required libraries, which we will be using for most of our data manipulations and visualizations.

In a Jupyter notebook, import the following libraries:

import json

import pandas as pd

import numpy as np

import missingno as msno

from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt

import seaborn as sns

We can also read in the metadata containing the data types for each column, which are stored in the form of a JSON file. Do this using the following command. This command opens the file in a readable format and uses the json library to read the file into a dictionary:

with open('../dtypes.json', 'r') as jsonfile:

    dtyp = json.load(jsonfile)

Note

The output of the preceding command can be found here: https://packt.live/3a4Zjhm

Summary Statistics and Central Values

In order to find out what our data really looks like, we use a technique known as data profiling. This is defined as the process of examining the data available from an existing information source (for example, a database or a file) and collecting statistics or informative summaries about that data. The goal is to make sure that you understand your data well and are able to identify any challenges that the data may pose early on in the project, which is done by summarizing the dataset and assessing its structure, content, and quality.

Data profiling includes collecting descriptive statistics and data types. Common data profile commands include those you have seen previously, including data.describe(), data.head(), and data.tail(). You can also use data.info(), which tells you how many non-null values there are in each column, along with the data type of the values (non-numeric types are represented as object types).

Exercise 2.01: Summarizing the Statistics of Our Dataset

In this exercise, we will use the summary statistics functions we read about previously to get a basic idea of our dataset:

Note

The dataset can be found on our GitHub repository here: https://packt.live/2TjU9aj

  1. Begin by loading the requisite libraries and reading the JSON file we have prepared with the data types into the dtyp dictionary, as in the previous section. We will use this dictionary in the next step to specify the data type of each column when reading the CSV. You can inspect the data types before reading the data:

    import json

    import pandas as pd

    import numpy as np

    import missingno as msno

    from sklearn.impute import SimpleImputer

    import matplotlib.pyplot as plt

    import seaborn as sns

    with open('../dtypes.json', 'r') as jsonfile:

        dtyp = json.load(jsonfile)

    dtyp

    The output will be as follows:

    Figure 2.1: Inspecting data types

  2. Read the earthquake data into a pandas DataFrame named data, passing the dtyp dictionary to specify the data types of each column, and use the data.info() method to get an overview of the dataset:

    data = pd.read_csv('../Datasets/earthquake_data.csv', dtype = dtyp)

    data.info()

    The output will be as follows:

    Figure 2.2: Overview of the dataset

  3. Print the first five and the last five rows of the dataset using the head() and tail() methods:

    data.head()

    data.tail()

    The output will be as follows:

    Figure 2.3: The first and last five rows

    We can see in these outputs that there are 28 columns, but not all of them are displayed. Only the first 10 and last 10 columns are displayed, with the ellipses representing the fact that there are columns in between that are not displayed.

  4. Use data.describe() to find the summary statistics of the dataset. Run data.describe().T:

    data.describe().T

    Here, .T indicates that we're taking a transpose of the DataFrame to which it is applied, that is, turning the columns into rows and vice versa. Applying it to the describe() function allows us to see the output more easily with each row in the transposed DataFrame now corresponding to the statistics for a single feature.

    We should get an output like this:

Figure 2.4: Summary statistics

Note

To access the source code for this specific section, please refer to https://packt.live/2Yl5qer.

You can also run this example online at https://packt.live/2V3I76D. You must execute the entire Notebook in order to get the desired result.

Notice here that the describe() function only shows the statistics for columns with numerical values. This is because we cannot calculate the statistics for the columns having non-numerical values (although we can visualize their values, as we will see later).

Missing Values

When there is no value (that is, a null value) recorded for a particular feature in a data point, we say that the data is missing. Having missing values in a real dataset is inevitable; no dataset is ever perfect. However, it is important to understand why the data is missing, and whether there is a factor that has affected the loss of data. Appreciating and recognizing this allows us to handle the remaining data in an appropriate manner. For example, if the data is missing randomly, then it's highly likely that the remaining data is still representative of the population. However, if the missing data is not random in nature and we assume that it is, it could bias our analysis and subsequent modeling.

Let's look at the common reasons (or mechanisms) for missing data:

  • Missing Completely at Random (MCAR): Values in a dataset are said to be MCAR if there is no correlation whatsoever between the value missing and any other recorded variable or external parameter. This means that the remaining data is still representative of the population, though this is rarely the case and taking missing data to be completely random is usually an unrealistic assumption.

    For example, in a study that aims to determine the reasons for obesity among K12 children, the data would be MCAR if, say, the parents simply forgot to take their children to the clinic for a session of the study.

  • Missing at Random (MAR): If the probability of a value being missing depends on the data that was recorded rather than on the missing value itself, the data is said to be MAR. Since it's infeasible to statistically verify whether data is MAR, we have to judge whether it's a reasonable possibility.

    In the K12 study, the data would be MAR if, for instance, parents moved to a different city and so their children had to leave the study; the missingness has nothing to do with the value being measured itself.

  • Missing Not at Random (MNAR): Data that is neither MAR nor MCAR is said to be MNAR. This is the case of a non-ignorable non-response, that is, the value of the variable that's missing is related to the reason it is missing.

    Continuing with the example of the case study, data would be MNAR if the parents were offended by the nature of the study and did not want their children to be bullied, so they withdrew their children from the study.

Finding Missing Values

So, now that we know why it's important to familiarize ourselves with the reasons behind why our data is missing, let's talk about how we can find these missing values in a dataset. For a pandas DataFrame, this is most commonly executed using the .isnull() method on a DataFrame to create a mask of the null values (that is, a DataFrame of Boolean values) indicating where the null values exist—a True value at any position indicates a null value, while a False value indicates the existence of a valid value at that position.

Note

The .isnull() method can be used interchangeably with the .isna() method for pandas DataFrames. Both methods do exactly the same thing; the reason there are two methods for the same task is that pandas DataFrames were originally based on R DataFrames, and hence have reproduced much of the latter's syntax and ideas.

It may not be immediately obvious whether the missing data is random or not. Discovering the nature of missing values across features in a dataset is possible through two common visualization techniques:

  • Nullity matrix: This is a data-dense display that lets us quickly visualize the patterns in data completion. It gives us a quick glance at how the null values within a feature (and across features) are distributed, how many there are, and how often they appear with other features.
  • Nullity-correlation heatmap: This heatmap visually describes the nullity relationship (or a data completeness relationship) between each pair of features; that is, it measures how strongly the presence or absence of one variable affects the presence of another.

    Akin to regular correlation, nullity correlation values range from -1 to 1, the former indicating that one variable appears when the other definitely does not, and the latter indicating the simultaneous presence of both variables. A value of 0 implies that one variable having a null value has no effect on the other being null.

Exercise 2.02: Visualizing Missing Values

Let's analyze the nature of the missing values by first looking at the count and percentage of missing values for each feature, and then plotting a nullity matrix and correlation heatmap using the missingno library in Python. We will be using the same dataset from the previous exercises.

Please note that this exercise is a continuation of Exercise 2.01: Summarizing the Statistics of Our Dataset.

The following steps will help you complete this exercise to visualize the missing values in the dataset:

  1. Calculate the count and percentage of missing values in each column and arrange these in decreasing order. We will use the .isnull() function on the DataFrame to get a mask. The count of null values in each column can then be found using the .sum() function over the DataFrame mask. Similarly, the fraction of null values can be found using .mean() over the DataFrame mask and multiplied by 100 to convert it to a percentage.

    Then, we combine the total and percentage of null values into a single DataFrame using the pd.concat() function, and subsequently sort the rows by percentage of missing values and print the DataFrame:

    mask = data.isnull()

    total = mask.sum()

    percent = 100*mask.mean()

    missing_data = pd.concat([total, percent], axis=1,join='outer', \

                             keys=['count_missing', 'perc_missing'])

    missing_data.sort_values(by='perc_missing', ascending=False, \

                             inplace=True)

    missing_data

    The output will be as follows:

    Figure 2.5: The count and percentage of missing values in each column

    Here, we can see that the state, total_damage_millions_dollars, and damage_millions_dollars columns have over 90% missing values, which means that values are available for fewer than 10% of the data points for these columns. On the other hand, year, flag_tsunami, country, and region_code have no missing values.

  2. Plot the nullity matrix. First, we find the list of columns that have any null values in them using the .any() function on the DataFrame mask from the previous step. Then, we use the missingno library to plot the nullity matrix for a random sample of 500 data points from our dataset, for only those columns that have missing values:

    nullable_columns = data.columns[mask.any()].tolist()

    msno.matrix(data[nullable_columns].sample(500))

    plt.show()

    The output will be as follows:

    Figure 2.6: The nullity matrix

    Here, black lines represent non-nullity while the white lines indicate the presence of a null value in that column. At a glance, location_name appears to be completely populated (we know from the previous step that there is, in fact, only one missing value in this column), while latitude and longitude seem mostly complete, but spottier.

    The spark line on the right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset. Note that this is only for the sample of 500 points.

  3. Plot the nullity correlation heatmap. We will plot the nullity correlation heatmap using the missingno library for our dataset, for only those columns that have missing values:

    msno.heatmap(data[nullable_columns], figsize=(18,18))

    plt.show()

    The output will be as follows:

Figure 2.7: The nullity correlation heatmap

Here, we can also see some boxes labeled <1: this just means that the correlation values in those cases are all close to 1.0, but still not quite perfectly so. We can see a value of <1 between injuries and total_injuries, which means that the missing values in each category are correlated. We would need to dig deeper to understand whether the missing values are correlated because they are based upon the same or similar information, or for some other reason.

Note

To access the source code for this specific section, please refer to https://packt.live/2YSXq3k.

You can also run this example online at https://packt.live/2Yn3Us7. You must execute the entire Notebook in order to get the desired result.

Imputation Strategies for Missing Values

There are multiple ways of dealing with missing values in a column. The simplest way is to simply delete rows having missing values; however, this can result in the loss of valuable information from other columns. Another option is to impute the data, that is, replace the missing values with a valid value inferred from the known part of the data. The common ways in which this can be done are listed here:

  • Create a new value that is distinct from the other values to replace the missing values in the column so as to differentiate those rows altogether. Then, use a non-linear machine learning algorithm (such as ensemble models or support vector machines) that can separate the values out.
  • Use an appropriate central value from the column (mean, median, or mode) to replace the missing values.
  • Use a model (such as a K-nearest neighbors or a Gaussian mixture model) to learn the best value with which to replace the missing values.

Python has a few functions that are useful for replacing null values in a column with a static value. One way to do this is to use the inherent pandas .fillna(0) function: there is no ambiguity in the imputation here, since the static value with which to substitute each null data point in the column is simply the argument passed to the function (the value in parentheses).

However, if the number of null values in a column is significant and it's not immediately obvious which central value could appropriately replace them, we can either delete the rows having null values or drop the column altogether, as it may not add any significant value from a modeling perspective. This can be done by using the .dropna() function on the DataFrame (a short sketch follows the parameter list below). The parameters that can be passed to the function are as follows:

  • axis: This defines whether to drop rows or columns, which is determined by assigning the parameter a value of 0 or 1, respectively.
  • how: Assign the value all to drop a row/column only if all of its values are null, or any to drop it if it contains at least one null value.
  • thresh: This defines the minimum number of non-null values a row/column must contain in order to be kept; rows/columns with fewer non-null values than this threshold are dropped.
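The following is a minimal sketch of how these parameters might be used on our earthquake DataFrame; the axis choices and the threshold of 1,000 are illustrative assumptions rather than recommended values:

# A sketch of .dropna() usage; assumes the `data` DataFrame loaded earlier.
# Drop any row that contains at least one null value
rows_any_dropped = data.dropna(axis=0, how='any')
# Drop a column only if every value in it is null
cols_all_dropped = data.dropna(axis=1, how='all')
# Keep only columns that have at least 1,000 non-null values
cols_thresh_dropped = data.dropna(axis=1, thresh=1000)
print(data.shape, rows_any_dropped.shape,
      cols_all_dropped.shape, cols_thresh_dropped.shape)

Note that none of these calls assign back to data, so the original DataFrame is left untouched; this makes it easy to compare the resulting shapes before committing to a strategy.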

Additionally, if an appropriate replacement for a null value for a categorical feature cannot be determined, a possible alternative to deleting the column is to create a new category in the feature that can represent the null values.

Note

If it is immediately obvious how a null value for a column can be replaced from an intuitive understanding or domain knowledge, then we can replace the value on the spot. Keep in mind that any such data changes should be made in your code and never directly on the raw data. One reason for this is that it allows the strategy to be updated easily in the future. Another reason is that it makes it visible to others who may later be reviewing the work where changes were made. Directly changing raw data can lead to data versioning problems and make it impossible for others to reproduce your work. In many cases, inferences become more obvious at later stages in the exploration process. In these cases, we can substitute null values as and when we find an appropriate way to do so.

Exercise 2.03: Performing Imputation Using Pandas

Let's look at missing values and replace them with zeros in the time-based (continuous) features having at least one null value (month, day, hour, minute, and second). We do this because, where no value was recorded, it is reasonable to assume that the event took place at the beginning of the corresponding time period. This exercise is a continuation of Exercise 2.02: Visualizing Missing Values:

  1. Create a list containing the names of the columns whose values we want to impute:

    time_features = ['month', 'day', 'hour', 'minute', 'second']

  2. Impute the null values using .fillna(). We will replace the missing values in these columns with 0 using the inherent pandas .fillna() function and pass 0 as an argument to the function:

    data[time_features] = data[time_features].fillna(0)

  3. Use the .info() function to view null value counts for the imputed columns:

    data[time_features].info()

    The output will be as follows:

Figure 2.8: Null value counts

As we can now see, all values for our features in the DataFrame are now non-null.

Note

To access the source code for this specific section, please refer to https://packt.live/2V9nMx3.

You can also run this example online at https://packt.live/2BqoZZM. You must execute the entire Notebook in order to get the desired result.

Exercise 2.04: Performing Imputation Using Scikit-Learn

In this exercise, you will replace the null values in the description-related categorical features using scikit-learn's SimpleImputer class. In Exercise 2.02: Visualizing Missing Values, we saw that almost all of these features had more than 50% null values. Replacing these null values with a central value would likely bias any model we try to build with these features and render them largely uninformative. Let's instead replace the null values with a separate category having the value NA. This exercise is a continuation of Exercise 2.02: Visualizing Missing Values:

  1. Create a list containing the names of the columns whose values we want to impute:

    description_features = ['injuries_description', \

                            'damage_description', \

                            'total_injuries_description', \

                            'total_damage_description']

  2. Create an object of the SimpleImputer class. Here, we first create an imp object of the SimpleImputer class and initialize it with parameters that represent how we want to impute the data. The parameters we will pass to initialize the object are as follows:

    missing_values: This is the placeholder for the missing values, that is, all occurrences of the values in the missing_values parameter will be imputed.

    strategy: This is the imputation strategy, which can be one of mean, median, most_frequent (that is, the mode), or constant. While the first three can only be used with numeric data and will replace missing values using the specified central value along each column, the last one will replace missing values with a constant as per the fill_value parameter.

    fill_value: This specifies the value with which to replace all occurrences of missing_values. If left to the default, the imputed value will be 0 when imputing numerical data and the missing_value string for strings or object data types:

    imp = SimpleImputer(missing_values=np.nan, \

                        strategy='constant', \

                        fill_value='NA')

  3. Perform the imputation. We will use imp.fit_transform() to actually perform the imputation. It takes the DataFrame with null values as input and returns the imputed DataFrame:

    data[description_features] = \

    imp.fit_transform(data[description_features])

  4. Use the .info() function to view null value counts for the imputed columns:

    data[description_features].info()

    The output will be as follows:

Figure 2.9: The null value counts

Note

To access the source code for this specific section, please refer to https://packt.live/3ervLgk.

You can also run this example online at https://packt.live/3doEX3G. You must execute the entire Notebook in order to get the desired result.

In the last two exercises, we looked at two ways to use pandas and scikit-learn methods to impute missing values. These methods are very basic methods we can use if we have little or no information about the underlying data. Next, we'll look at more advanced techniques we can use to fill in missing data.

Exercise 2.05: Performing Imputation Using Inferred Values

Let's replace the null values in the continuous damage_millions_dollars feature with information from the categorical damage_description feature. Although we may not know the exact dollar amount that was incurred, the categorical feature gives us information on the range of the amount that was incurred due to damage from the earthquake. This exercise is a continuation of Exercise 2.04: Performing Imputation Using scikit-learn:

  1. Find how many rows have null damage_millions_dollars values, and how many of those have non-null damage_description values:

    print(data[pd.isnull(data.damage_millions_dollars)].shape[0])

    print(data[pd.isnull(data.damage_millions_dollars) \

               & (data.damage_description != 'NA')].shape[0])

    The output will be as follows:

    5594

    3849

    As we can see, 3,849 of the 5,594 null values can easily be substituted with the help of another variable. For example, we know that all columns whose names end with _description are descriptor fields containing estimates for data that may not be available in the original numerical column. For deaths, injuries, and total_injuries, the corresponding categorical values represent the following:

    0 = None

    1 = Few (~1 to 50 deaths)

    2 = Some (~51 to 100 deaths)

    3 = Many (~101 to 1,000 deaths)

    4 = Very Many (~1,001 or more deaths)

    As regards damage_millions_dollars, the corresponding categorical values represent the following:

    0 = None

    1 = Limited (roughly corresponding to less than 1 million dollars)

    2 = Moderate (~1 to 5 million dollars)

    3 = Severe (~>5 to 24 million dollars)

    4 = Extreme (~25 million dollars or more)

  2. Find the mean damage_millions_dollars value for each category. Since each of the categories in damage_description represents a range of values, we find the mean damage_millions_dollars value for each category from the non-null values already available. These provide a reasonable estimate for the most likely value for that category:

    category_means = data[['damage_description', \

                           'damage_millions_dollars']]\

                           .groupby('damage_description').mean()

    category_means

    The output will be as follows:

    Figure 2.10: The mean damage_millions_dollars value for each category

    Note that the first three values make intuitive sense given the preceding definitions: 0.42 is between 0 and 1, 3.1 is between 1 and 5, and 13.8 is between 5 and 24. The last category is defined as 25 million or more; it transpires that the mean of these extreme cases is very high (3,575!).

  3. Store the mean values as a dictionary. In this step, we will convert the DataFrame containing the mean values to a dictionary (a Python dict object), so that accessing them is convenient.

    Additionally, since the value for the newly created NA category (the imputed value in the previous exercise) was NaN, and the value for the 0 category was absent (no rows had damage_description equal to 0 in the dataset), we explicitly added these values to the dictionary as well:

    replacement_values = category_means\

                         .damage_millions_dollars.to_dict()

    replacement_values['NA'] = -1

    replacement_values['0'] = 0

    replacement_values

    The output will be as follows:

    Figure 2.11: The dictionary of mean values

  4. Create a series of replacement values. For each value in the damage_description column, we map the categorical value onto the mean value using the map function. The .map() function is used to map the keys in the column to the corresponding values for each element from the replacement_values dictionary:

    imputed_values = data.damage_description.map(replacement_values)

  5. Replace null values in the column. We do this by using np.where as a ternary operator: the first argument is the mask, the second is the series from which to take values where the mask is True, and the third is the series from which to take values where the mask is False.

    This ensures that the array returned by np.where only replaces the null values in damage_millions_dollars with values from the imputed_values series:

    data['damage_millions_dollars'] = \

    np.where(data.damage_millions_dollars.isnull(), \

    data.damage_description.map(replacement_values), \

    data.damage_millions_dollars)

  6. Use the .info() function to view null value counts for the imputed columns:

    data[['damage_millions_dollars']].info()

    The output will be as follows:

Figure 2.12: The null value counts

We can see that, after replacement, there are no null values in the damage_millions_dollars column.

Note

To access the source code for this specific section, please refer to https://packt.live/3fMRqQo.

You can also run this example online at https://packt.live/2YkBgYC. You must execute the entire Notebook in order to get the desired result.

In this section, we have looked at replacing missing values in more than one way. In one case, we replaced values with zeros; in another case, we looked at more information about the dataset to reason that we could replace missing values with a combination of information from a descriptive field and the means of values we did have. These sorts of decisions and steps are extremely common when working with real data. We also noted that, occasionally, when we have sufficient data and the instances with missing values are few, we can just drop them. In the following activity, we'll use a different dataset for you to practice and reinforce these methods.

Activity 2.01: Summary Statistics and Missing Values

In this activity, we'll revise some of the summary statistics and missing value exploration we have looked at thus far in this chapter. We will be using a new dataset, House Prices: Advanced Regression Techniques, available on Kaggle.

Note

The original dataset is available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data or on our GitHub repository at https://packt.live/2TjU9aj.

While the Earthquakes dataset used in the exercises is aimed at solving a classification problem (when the target variable has only discrete values), the dataset we will use in the activities will be aimed at solving a regression problem (when the target variable takes on a range of continuous values). We will use pandas functions to generate summary statistics and visualize missing values using a nullity matrix and nullity correlation heatmap.

The steps to be performed are as follows:

  1. Read the data (house_prices.csv).
  2. Use pandas' .info() and .describe() methods to view the summary statistics of the dataset.

    The output of the info() method will be as follows:

    Figure 2.13: The output of the info() method (abbreviated)

    The output of the describe() method will be as follows:

    Figure 2.14: The output of the describe() method (abbreviated)

    Note

    The outputs of the info() and describe() methods have been truncated for presentation purposes. You can find the outputs in their entirety here: https://packt.live/2TjZSgi

  3. Find the total count and total percentage of missing values in each column of the DataFrame and display them for columns having at least one null value, in descending order of missing percentages.
  4. Plot the nullity matrix and nullity correlation heatmap.

    The nullity matrix will be as follows:

    Figure 2.15: Nullity matrix

    The nullity correlation heatmap will be as follows:

    Figure 2.16: Nullity correlation heatmap

  5. Delete the columns having more than 80% of their values missing.
  6. Replace null values in the FireplaceQu column with NA values.

    Note

    The solution for this activity can be found via this link.

You should now be comfortable using the approaches we've learned to investigate missing values in any type of tabular data.

Distribution of Values

In this section, we'll look at how individual variables behave—what kind of values they take, what the distribution across those values is, and how those distributions can be represented visually.

Target Variable

The target variable can either have values that are continuous (in the case of a regression problem) or discrete (as in the case of a classification problem). The problem statement we're looking at in this chapter involves predicting whether an earthquake caused a tsunami, that is, the flag_tsunami variable, which takes on two discrete values only—making it a classification problem.

One way of visualizing how many earthquakes resulted in tsunamis and how many didn't involves the use of a bar chart, where each bar represents a single discrete value of the variable, and the height of the bars is equal to the count of the data points having the corresponding discrete value. This gives us a good comparison of the absolute counts of each category.

Exercise 2.06: Plotting a Bar Chart

Let's look at how many of the earthquakes in our dataset resulted in a tsunami. We will do this by using the value_counts() method over the column and using the .plot(kind='bar') function directly on the returned pandas series. This exercise is a continuation of Exercise 2.05: Performing Imputation Using Inferred Values:

  1. Use plt.figure() to initiate the plotting:

    plt.figure(figsize=(8,6))

  2. Next, type in our primary plotting command:

    data.flag_tsunami.value_counts().plot(kind='bar', \

                                          color = ('grey', \

                                                   'black'))

  3. Set the display parameters and display the plot:

    plt.ylabel('Number of data points')

    plt.xlabel('flag_tsunami')

    plt.show()

    The output will be as follows:

Figure 2.17: Bar chart showing how many earthquakes resulted in a tsunami

From this bar plot, we can see that most of the earthquakes did not result in tsunamis and that fewer than one-third of the earthquakes actually did. This shows us that the dataset is slightly imbalanced.

Note

To access the source code for this specific section, please refer to https://packt.live/2Yn4UfR.

You can also run this example online at https://packt.live/37QvoJI. You must execute the entire Notebook in order to get the desired result.

Let's look more closely at what these Matplotlib commands do:

  • plt.figure(figsize=(8,6)): This command defines how big our plot should be by providing width and height values. It is typically the first command written, before any plotting commands.
  • plt.xlabel() and plt.ylabel(): These commands take a string as input and allow us to specify what the labels for the X and Y axes on the plot should be.
  • plt.show(): This is the final command written when plotting a visualization; it displays the plot inline within the Jupyter notebook.

Categorical Data

Categorical variables are ones that take discrete values representing different categories or levels of observation that can either be string objects or integer values. For example, our target variable, flag_tsunami, is a categorical variable with two categories, Tsu and No.

Categorical variables can be of two types:

  • Nominal variables: Variables in which the categories are labeled without any order of precedence are called nominal variables. An example of a nominal variable from our dataset would be location_name. The values that this variable takes cannot be said to be ordered, that is, one location is not greater than the other. Similarly, more examples of such a variable would be color, types of footwear, ethnicity type, and so on.
  • Ordinal variables: Variables that have some order associated with them are called ordinal variables. An example from our dataset would be damage_description, since each value represents an increasing amount of damage incurred. Another example is the day of the week, with values from Monday to Sunday: these have an inherent order, and we know that Thursday comes after Wednesday but before Friday.

    Although ordinal variables can be represented by object data types, they are often represented as numerical data types as well, which can make it difficult to differentiate them from continuous variables.

One of the major challenges faced when dealing with categorical variables in a dataset is high cardinality, that is, a large number of categories or distinct values with each value appearing a relatively small number of times. For example, location_name has a large number of unique values, with each value occurring a small number of times in the dataset.

Additionally, non-numerical categorical variables will always require some form of preprocessing to be converted into a numerical format so that they can be ingested for training by a machine learning model. It can be a challenge to encode categorical variables numerically without losing out on contextual information that, despite being easy for humans to interpret (due to domain knowledge or otherwise just plain common sense), would be hard for a computer to automatically understand. For example, a geographical feature such as country or location name by itself would give no indication of the geographical proximity of different values, but that might just be an important feature—what if earthquakes that occur at locations in South East Asia trigger more tsunamis than those that occur in Europe? There would be no way of capturing that information by merely encoding the feature numerically.
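As a rough illustration of what such preprocessing can look like, the following sketch shows two simple encodings. The column names come from our dataset, but the ordinal category labels assumed for damage_description ('NA' and '1' to '4') follow the imputation performed in Exercise 2.04, and neither encoding captures geographic proximity:

# A sketch of two simple categorical encodings; assumes `data` loaded earlier.
import pandas as pd

# One-hot encoding for a nominal variable: each category becomes a 0/1 column
country_dummies = pd.get_dummies(data['country'], prefix='country')

# Integer codes for an ordinal variable, preserving its assumed order
damage_order = ['NA', '1', '2', '3', '4']  # assumed labels after Exercise 2.04
damage_codes = pd.Categorical(data['damage_description'],
                              categories=damage_order,
                              ordered=True).codes

print(country_dummies.shape)
print(damage_codes[:5])

Any value not found in the assumed category list would be encoded as -1 by pd.Categorical, which is one reason to check the actual labels with value_counts() before encoding.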

Exercise 2.07: Identifying Data Types for Categorical Variables

Let's establish which variables in our Earthquake dataset are categorical and which are continuous. As we now know, categorical variables can also have numerical values, so having a numeric data type doesn't guarantee that a variable is continuous. This exercise is a continuation of Exercise 2.05: Performing Imputation Using Inferred Values:

  1. Find all the columns that are numerical and object types. We use the .select_dtypes() method on the DataFrame to create a subset DataFrame having numeric (np.number) and categorical (np.object) columns, and then print the column names for each. For numeric columns, use this command:

    numeric_variables = data.select_dtypes(include=[np.number])

    numeric_variables.columns

    The output will be as follows:

    Figure 2.18: All columns that are numerical

    For categorical columns, use this command:

    object_variables = data.select_dtypes(include=[np.object])

    object_variables.columns

    The output will be as follows:

    Figure 2.19: All columns that are object types

    Here, it is evident that the columns that are object types are categorical variables. To differentiate between the categorical and continuous variables from the numeric columns, let's see how many unique values there are for each of these features.

  2. Find the number of unique values for each feature. We use the .nunique() method on each subset DataFrame to find the number of unique values in each column and sort the resulting series in ascending order. For numeric columns, use this command:

    numeric_variables.nunique().sort_values()

    The output will be as follows:

Figure 2.20: Number of unique values for numeric features

For categorical columns, use this command:

object_variables.nunique().sort_values()

The output will be as follows:

Figure 2.21: Number of unique values for categorical columns

Note

To access the source code for this specific section, please refer to https://packt.live/2YlSmFt.

You can also run this example online at https://packt.live/31hnuIr. You must execute the entire Notebook in order to get the desired result.

For the numeric variables, we can see that the first nine features have significantly fewer unique values than the rest, and it's likely that these are categorical variables. However, we must keep in mind that some of them might just be continuous variables with a small range of rounded values. Also, month and day would not be considered categorical variables here.

Exercise 2.08: Calculating Category Value Counts

For columns with categorical values, it is useful to see what the unique values (categories) of the feature are, along with their frequencies, that is, how often each distinct value occurs in the dataset. Let's find the number of occurrences of each label from 0 to 4, as well as the NaN values, for the injuries_description categorical variable. This exercise is a continuation of Exercise 2.07: Identifying Data Types for Categorical Variables:

  1. Use the value_counts() function on the injuries_description column to find the frequency of each category. Using value_counts gives us the frequencies of each value in decreasing order in the form of a pandas series:

    counts = data.injuries_description.value_counts(dropna=False)

    counts

    The output should be as follows:

    Figure 2.22: Frequency of each category

  2. Sort the values in increasing order of the ordinal variable. If we want the frequencies in the order of the values themselves, we can reset the index to give us a DataFrame and sort values by the index (that is, the ordinal variable):

    counts.reset_index().sort_values(by='index')

    The output will be as follows:

Figure 2.23: Sorted values

Note

To access the source code for this specific section, please refer to https://packt.live/2Yn5URj.

You can also run this example online at https://packt.live/314dYIr. You must execute the entire Notebook in order to get the desired result.

Exercise 2.09: Plotting a Pie Chart

Since our target variable in our sample data is categorical, the example in Exercise 2.06: Plotting a Bar Chart, showed us one way of visualizing how the categorical values are distributed (using a bar chart). Another plot that can make it easy to see how each category functions as a fraction of the overall dataset is a pie chart. Let's plot a pie chart to visualize the distribution of the discrete values of the damage_description variable. This exercise is a continuation of Exercise 2.08, Calculating Category Value Counts:

  1. Format the data into the form that needs to be plotted. Here, we run value_counts() over the column and sort the series by index:

    counts = data.damage_description.value_counts()

    counts = counts.sort_index()

  2. Plot the pie chart. The ax.pie() function plots the pie chart using the count data. We will use the same three steps for plotting as described in Exercise 2.06: Plotting a Bar Chart:

    fig, ax = plt.subplots(figsize=(10,10))

    slices = ax.pie(counts, \

                    labels=counts.index, \

                    colors = ['white'], \

                    wedgeprops = {'edgecolor': 'black'})

    patches = slices[0]

    hatches = ['/', '\\', '|', '-', '+', 'x', 'o', 'O', '\\.', '*']

    for patch in range(len(patches)):

        patches[patch].set_hatch(hatches[patch])

    plt.title('Pie chart showing counts for\ndamage_description '\

              'categories')

    plt.show()

    The output will be as follows:

Figure 2.24: Pie chart showing counts for damage_description categories

Note

To access the source code for this specific section, please refer to https://packt.live/37Ovj9s.

You can also run this example online at https://packt.live/37OvotM. You must execute the entire Notebook in order to get the desired result.

Figure 2.24 tells us the relative number of items in each of the five damage description categories. Note that it would be good practice to do the extra work of replacing the uninformative numeric labels with the category names; recall from the EDA discussion that:

0 = NONE

1 = LIMITED (roughly corresponding to less than $1 million)

2 = MODERATE (~$1 to $5 million)

3 = SEVERE (~>$5 to $24 million)

4 = EXTREME (~$25 million or more)

In addition, while the pie chart gives us a quick visual impression of which are the largest and smallest categories, we get no idea of the actual quantities, so adding those labels would increase the value of the chart. You can use the code in the repository for this book to update the chart.
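One possible way to do this, sketched below, is to map each category code to a readable label and append its count. The label text assumes the category definitions above, and the sketch assumes the counts series from Exercise 2.09 is still available:

# A sketch of a more informative pie chart; assumes `counts` from Exercise 2.09.
category_labels = {'NA': 'Not available', '0': 'NONE', '1': 'LIMITED',
                   '2': 'MODERATE', '3': 'SEVERE', '4': 'EXTREME'}
labels = ['{} ({})'.format(category_labels.get(str(idx), str(idx)), count)
          for idx, count in counts.items()]

fig, ax = plt.subplots(figsize=(10, 10))
ax.pie(counts, labels=labels, autopct='%1.1f%%',
       colors=['white'], wedgeprops={'edgecolor': 'black'})
plt.title('damage_description categories with counts and percentages')
plt.show()

Here, autopct adds each slice's percentage share, so the chart conveys both the category names and the actual quantities.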

Continuous Data

Continuous variables can take any number of values and are usually integer (for example, number of deaths) or float data types (for example, the height of a mountain). It's useful to get an idea of the basic statistics of the values in the feature: the minimum, maximum, and percentile values we see from the output of the describe() function gives us a fair estimate of this.

However, for continuous variables, it is also very useful to see how the values are distributed across the range they occupy. Since we cannot simply count individual values, we instead order the values in ascending order, group them into evenly sized intervals (bins), and find the count for each interval. This gives us the underlying frequency distribution, and plotting it gives us a histogram, which allows us to examine the shape, central values, and amount of variability in the data.

Histograms give us an easy view of the data that we're looking at. They tell us about the behavior of the values at a glance in terms of the underlying distribution (for example, a normal or exponential distribution), the presence of outliers, skewness, and more.

Note

It is easy to get confused between a bar chart and a histogram. The major difference is that a histogram is used to plot continuous data that has been binned to visualize the frequency distribution, while bar charts can be used for a variety of other use cases, including to represent categorical variables as we have done. Additionally, with histograms, the number of bins is something we can vary, so the range of values in a bin is determined by the number of bins, as is the height of the bars in the histogram. In a bar chart, the width of the bars does not generally convey meaning, and the height is usually a property of the category, like a count.

One of the most common frequency distributions is a Gaussian (or normal) distribution. This is a symmetric distribution that has a bell-shaped curve, which indicates that the values near the middle of the range have the highest occurrences in the dataset with a symmetrically decreasing frequency of occurrences as we move away from the middle. You almost certainly have seen examples of Gaussian distributions, because many natural and man-made processes generate values that vary nearly like the Gaussian distribution. Thus, it is extremely common to see data compared to the Gaussian distribution.

It is a probability distribution and the area under the curve equals one, as shown in Figure 2.25:

Figure 2.25: Gaussian (normal) distribution

A symmetric distribution such as the normal distribution can be characterized entirely by two parameters: the mean (µ) and the standard deviation (σ). In Figure 2.25, for example, the mean is at 7.5. However, a significant amount of real data does not follow a normal distribution and may be asymmetric. The asymmetry of data is often referred to as skew.
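To build some intuition about these two parameters, the short sketch below (independent of the earthquake data) draws random samples from a normal distribution with a chosen mean and standard deviation and plots their histogram; the parameter values are arbitrary:

# A sketch: sampling from a Gaussian and inspecting its shape.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

samples = np.random.normal(loc=7.5, scale=1.0, size=10000)  # mean 7.5, std 1.0
plt.figure(figsize=(8, 5))
sns.distplot(samples, bins=50)  # distplot matches the book; newer seaborn uses histplot
plt.title('Samples from a normal distribution (mean=7.5, std=1.0)')
plt.show()
print('sample mean:', samples.mean(), 'sample std:', samples.std())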

Skewness

A distribution is said to be skewed if it is not symmetric in nature, and skewness measures the asymmetry of a variable about its mean. The value can be positive or negative (or undefined). In the former case, the tail is on the right-hand side of the distribution, while the latter indicates that the tail is on the left-hand side.

However, it must be noted that a thick and short tail would have the same effect on the value of skewness as a long, thin tail.

Kurtosis

Kurtosis is a measure of the tailedness of the distribution of a variable, that is, how heavy its tails are, and it is used to gauge how prone the variable is to producing outliers. A high kurtosis value indicates fatter tails and a greater likelihood of outliers. Like skewness, kurtosis describes the shape of the distribution.
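Before computing these statistics on our dataset, the following sketch (again on synthetic data, with arbitrary parameters) shows how skew and kurtosis behave for a symmetric sample versus a right-tailed one:

# A sketch: skew and kurtosis of a symmetric vs. a right-tailed sample.
import numpy as np
import pandas as pd

symmetric = pd.Series(np.random.normal(loc=0.0, scale=1.0, size=10000))
right_tailed = pd.Series(np.random.exponential(scale=2.0, size=10000))

print('normal skew:', symmetric.skew())           # close to 0
print('exponential skew:', right_tailed.skew())   # clearly positive (tail on the right)
print('normal kurtosis:', symmetric.kurt())       # close to 0 (pandas reports excess kurtosis)
print('exponential kurtosis:', right_tailed.kurt())  # well above 0: heavy tail, more outliers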

Exercise 2.10: Plotting a Histogram

Let's plot the histogram for the eq_primary feature using the Seaborn library. This exercise is a continuation of Exercise 2.09, Plotting a Pie Chart:

  1. Use plt.figure() to initiate the plotting:

    plt.figure(figsize=(10,7))

  2. sns.distplot() is the primary command that we will use to plot the histogram. The first parameter is the one-dimensional data over which to plot the histogram, while the bins parameter defines the number and size of the bins. Use this as follows:

    sns.distplot(data.eq_primary.dropna(), \

                 bins=np.linspace(0,10,21))

  3. Display the plot using plt.show():

    plt.show()

    The output will be as follows:

Figure 2.26: Histogram for the eq_primary feature

The plot gives us a normed (or normalized) histogram, which means that the area under the bars of the histogram equals unity. Additionally, the line over the histogram is the kernel density estimate, which gives us an idea of what the probability distribution for the variable would look like.

Note

To access the source code for this specific section, please refer to https://packt.live/2BwZrdj.

You can also run this example online at https://packt.live/3fMSxj2. You must execute the entire Notebook in order to get the desired result.

From the plot, we can see that the values of eq_primary lie mostly between 5 and 8, which means that most earthquakes had a magnitude with a moderate to high value, with barely any earthquakes having a low or very high magnitude.

Exercise 2.11: Computing Skew and Kurtosis

Let's calculate the skew and kurtosis values for all of the features in the dataset using the core pandas functions available to us. This exercise is a continuation of Exercise 2.10, Plotting a Histogram:

  1. Use the .skew() DataFrame method to calculate the skew for all features and then sort the values in ascending order:

    data.skew().sort_values()

    The output will be as follows:

    Figure 2.27: Skew values for all the features in the dataset

  2. Use the .kurt() DataFrame method to calculate the kurtosis for all features:

    data.kurt()

    The output will be as follows:

Figure 2.28: Kurtosis values for all the features in the dataset

Here, we can see that the kurtosis values for some variables deviate significantly from 0, which means that these columns have long tails. The values at the tail ends of these variables (which record the number of people killed or injured and the monetary value of damage) may be outliers that we need to pay special attention to. Larger values might, in fact, indicate an additional force that added to the devastation caused by an earthquake, that is, a tsunami.

Note

To access the source code for this specific section, please refer to https://packt.live/2Yklmh0.

You can also run this example online at https://packt.live/37PcMdj. You must execute the entire Notebook in order to get the desired result.

Activity 2.02: Representing the Distribution of Values Visually

In this activity, we will implement what we learned in the previous section by creating different plots, such as histograms and pie charts. Furthermore, we will calculate the skew and kurtosis for the features of the dataset. We will use the same dataset as in Activity 2.01: Summary Statistics and Missing Values, that is, House Prices: Advanced Regression Techniques, and use different types of plots to visually represent the distribution of its values. This activity is a continuation of Activity 2.01: Summary Statistics and Missing Values.

The steps to be performed are as follows:

  1. Plot a histogram using Matplotlib for the target variable, SalePrice.

    The output will be as follows:

    Figure 2.29: Histogram for the target variable

  2. Find the number of unique values within each column having an object type.
  3. Create a DataFrame representing the number of occurrences for each categorical value in the HouseStyle column.
  4. Plot a pie chart representing these counts.

    The output will be as follows:

    Figure 2.30: Pie chart representing the counts

  5. Find the number of unique values within each column having a number type.
  6. Plot a histogram using seaborn for the LotArea variable.

    The output will be as follows:

    Figure 2.31: Histogram for the LotArea variable

  7. Calculate the skew and kurtosis values for the values in each column.

    The output for skew values will be:

Figure 2.32: Skew values for each column

The output for kurtosis values will be:

Figure 2.33: Kurtosis values for each column

Note

The solution for this activity can be found via this link.

We have seen how to look into the nature of data in more detail, in particular, by beginning to understand the distribution of the data using histograms or density plots, relative counts of data using pie charts, as well as inspecting the skew and kurtosis of the variables as a first step to finding potentially problematic data, outliers, and so on.

By now, you should have a comfort level handling various statistical measures of data such as summary statistics, counts, and the distribution of values. Using tools such as histograms and density plots, you can explore the shape of datasets, and augment that understanding by calculating statistics such as skew and kurtosis. You should be developing some intuition for some flags that warrant further investigation, such as large skew or kurtosis values.

Relationships within the Data

There are two reasons why it is important to find relationships between variables in the data:

  • Establishing which features are potentially important is essential, since identifying the features that have a strong relationship with the target variable will aid in the feature selection process.
  • Finding relationships between the features themselves is also useful, since the variables in a dataset are rarely completely independent of one another, and this can affect our modeling in a number of ways.

There are a number of ways in which we can visualize these relationships; the choice really depends on the types of variables we are trying to relate and on how many of them we are considering as part of the comparison.

Relationship between Two Continuous Variables

Establishing a relationship between two continuous variables is basically seeing how one varies as the value of the other is increased. The most common way to visualize this would be to use a scatter plot, in which we take each variable along a single axis (the X and Y axes in a two-dimensional plane when we have two variables) and plot each data point using a marker in the X-Y plane. This visualization gives us a good idea of whether any kind of relationship exists between the two variables at all.

If we want to quantify the relationship between the two variables, however, the most common method is to find the correlation between them. If the target variable is continuous and has a high degree of correlation with another variable, this is an indication that the feature would be an important part of the model.

Pearson's Coefficient of Correlation

Pearson's Coefficient of Correlation is a commonly used measure of the linear relationship between a pair of variables. The formula returns a value between -1 and +1, where:

  • +1 indicates a perfect positive linear relationship
  • -1 indicates a perfect negative linear relationship
  • 0 indicates no linear relationship at all
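To make the definition concrete, here is a minimal sketch on two made-up series (not the earthquake data) that computes Pearson's r directly from its definition, the sum of the products of the deviations from the means divided by the square root of the product of the summed squared deviations, and checks the result against pandas' built-in calculation:

    import numpy as np
    import pandas as pd

    # Two made-up, positively related series, for illustration only.
    rng = np.random.default_rng(0)
    x = pd.Series(rng.normal(size=200))
    y = 0.8 * x + pd.Series(rng.normal(scale=0.5, size=200))

    # Pearson's r from its definition.
    x_dev, y_dev = x - x.mean(), y - y.mean()
    r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

    # pandas computes the same quantity (Pearson is the default method).
    r_pandas = x.corr(y)
    print(r_manual, r_pandas)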

It's also useful to find correlations between pairs of features themselves. In some models, highly correlated features can cause issues, including coefficients that vary strongly with small changes in the data or model parameters. In the extreme case, perfectly correlated features (such as X2 = 2.5 * X1) cause some models, including linear regression, to return undefined coefficients (values of Inf).
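As a brief illustration of this instability, the following minimal sketch (synthetic data and scikit-learn's LinearRegression, not part of the original exercises) fits the same model twice on data generated in the same way, differing only in the random noise drawn. With two almost perfectly correlated features, the individual coefficients can swing noticeably between fits even though the two features carry essentially the same information:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_once(seed):
        """Generate a small synthetic dataset with two nearly identical features and fit OLS."""
        rng = np.random.default_rng(seed)
        x1 = rng.normal(size=100)
        x2 = 2.5 * x1 + rng.normal(scale=0.01, size=100)  # almost an exact copy of x1
        y = 3 * x1 + rng.normal(scale=0.1, size=100)
        return LinearRegression().fit(np.column_stack([x1, x2]), y).coef_

    # Two fits that differ only in the random noise drawn; compare the coefficient pairs.
    print(fit_once(0))
    print(fit_once(1))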

Note

When fitting a linear model, having features that are highly correlated with one another can result in an unpredictable and widely varying model. This is because the coefficient of each feature in a linear model can be interpreted as the change in the target variable per unit change in that feature, keeping all other features constant. When a set of features is not independent (that is, the features are correlated), however, we cannot separate out the effect on the target variable of an independent change in each feature, which results in widely varying coefficients.

To find the pairwise correlation for every numeric feature in a DataFrame with every other feature, we can use the .corr() function on the DataFrame.
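For example, the following minimal sketch builds a tiny made-up DataFrame and then ranks its features by their correlation with a (likewise made-up) target column, a pattern that comes in handy during EDA:

    import pandas as pd

    # Tiny illustrative DataFrame; in practice this would be your own data.
    df = pd.DataFrame({'feature_a': [1, 2, 3, 4, 5],
                       'feature_b': [2, 1, 4, 3, 5],
                       'target':    [1.2, 1.9, 3.1, 3.9, 5.2]})

    corr_matrix = df.corr()   # pairwise Pearson correlations between all numeric columns

    # Features most strongly (linearly) related to the target, in descending order.
    print(corr_matrix['target'].sort_values(ascending=False))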

Exercise 2.12: Plotting a Scatter Plot

Let's plot a scatter plot between the primary earthquake magnitude on the X axis and the corresponding number of injuries on the Y axis. This exercise is a continuation of Exercise 2.11, Computing Skew and Kurtosis:

  1. Filter out the null values. Since we know that there are null values in both columns, let's first filter the data to include only the non-null rows:

    data_to_plot = data[~pd.isnull(data.injuries) \

                   & ~pd.isnull(data.eq_primary)]

  2. Create and display the scatter plot. We will use Matplotlib's plt.scatter(x=..., y=...) command as the primary command for plotting the data. The x and y parameters state which feature is to be considered along which axis. They take a one-dimensional data structure such as a list, a tuple, or a pandas series. We can also pass the scatter function further parameters that define, say, the marker used to plot each individual data point. For example, to use a red cross as the marker, we would pass the parameters marker='x', c='r':

    plt.figure(figsize=(12,9))

    plt.scatter(x=data_to_plot.eq_primary, y=data_to_plot.injuries)

    plt.xlabel('Primary earthquake magnitude')

    plt.ylabel('No. of injuries')

    plt.show()

    The output will be as follows:

Figure 2.34: Scatter plot

From the plot, we can infer that there is no clear overall trend between earthquake magnitude and the number of people injured; for the majority of earthquakes, no relationship is apparent. However, earthquakes with very large injury counts do become more frequent as the magnitude increases.

Note

To access the source code for this specific section, please refer to https://packt.live/314eupR.

You can also run this example online at https://packt.live/2YWtbsm. You must execute the entire Notebook in order to get the desired result.

Exercise 2.13: Plotting a Correlation Heatmap

Let's plot a correlation heatmap of all the numeric variables in our dataset by applying seaborn's sns.heatmap() function to the inter-feature correlation values. This exercise is a continuation of Exercise 2.12, Plotting a Scatter Plot.

The optional parameters passed to the sns.heatmap() function are square and cmap, which make each cell of the plot square and specify the color scheme to use, respectively:

  1. Plot a basic heatmap with all the features:

    plt.figure(figsize = (12,10))

    sns.heatmap(data.corr(), square=True, cmap="YlGnBu")

    plt.show()

    The output will be as follows:

    Figure 2.35: Correlation heatmap

    We can see from the color bar on the right of the plot that the lightest shade corresponds to the minimum value in the data, which is around -0.2. This misrepresents the correlation scale, which actually ranges from -1 to 1.

  2. Plot a subset of features in a more customized heatmap. We will specify the upper and lower limits using the vmin and vmax parameters and plot the heatmap again with annotations specifying the pairwise correlation values on a subset of features. We will also change the color scheme to one that can be better interpreted—while the neutral white will represent no correlation, increasingly darker shades of blue and red will represent higher positive and negative correlation values, respectively:

    feature_subset = ['focal_depth', 'eq_primary', 'eq_mag_mw', \

                      'eq_mag_ms', 'eq_mag_mb', 'intensity', \

                      'latitude', 'longitude', 'injuries', \

                      'damage_millions_dollars','total_injuries', \

                      'total_damage_millions_dollars']

    plt.figure(figsize = (12,10))

    sns.heatmap(data[feature_subset].corr(), square=True, \

                annot=True, cmap="RdBu", vmin=-1, vmax=1)

    plt.show()

    The output will be as follows:

Figure 2.36: Customized correlation heatmap

Note

To access the source code for this specific section, please refer to https://packt.live/2Z1lPUB.

You can also run this example online at https://packt.live/2YntBc8. You must execute the entire Notebook in order to get the desired result.

Now, while we can calculate the value of correlation, it only gives us an indication of a linear relationship. To better judge whether there is a possible dependency, we can plot a scatter plot between pairs of features. This is most useful when the relationship between the two variables is not known, since visualizing how the data points are scattered or distributed can give us an idea of whether (and how) the two may be related.

Using Pairplots

A pairplot is useful for visualizing multiple relationships between pairs of features at once and can be plotted using Seaborn's .pairplot() function. In the following exercise, we will create a pairplot and visualize relations between the features in a dataset.

Exercise 2.14: Implementing a Pairplot

In this exercise, we will look at a pairplot between the features having the highest pairwise correlation in the dataset. This exercise is a continuation of Exercise 2.13, Plotting a Correlation Heatmap:

  1. Define a list having the subset of features on which to create the pairplot:

    feature_subset = ['focal_depth', 'eq_primary', 'eq_mag_mw', \

                      'eq_mag_ms', 'eq_mag_mb', 'intensity',]

  2. Create the pairplot using seaborn. The arguments sent to the plotting function are kind='scatter', which indicates that we want each individual plot between the pair of variables in the grid to be represented as a scatter plot, and diag_kind='kde', which indicates that we want the plots along the diagonal (where both the features in the pair are the same) to be a kernel density estimate.

    It should also be noted here that the plots that sit symmetrically across the diagonal from one another are essentially the same, just with the axes reversed:

    sns.pairplot(data[feature_subset].dropna(), kind ='scatter', \

                 diag_kind='kde')

    plt.show()

    The output will be as follows:

Figure 2.37: Pairplot between the features having the highest pairwise correlation

We have successfully visualized a pairplot to look at the features that have high correlation between them within a dataset.

Note

To access the source code for this specific section, please refer to https://packt.live/2Ni11T0.

You can also run this example online at https://packt.live/3eol7aj. You must execute the entire Notebook in order to get the desired result.

Relationship between a Continuous and a Categorical Variable

A common way to view the relationship between two variables when one is categorical and the other is continuous is to use a bar plot or a box plot:

  • A bar plot helps compare the value of a variable across a discrete set of categories and is one of the most common types of plots. Each bar represents a categorical value, and the height of the bar usually represents an aggregated value of the continuous variable over that category (such as the average, sum, or count of its values in that category).
  • A box plot is a rectangle drawn to represent the distribution of the continuous variable for each discrete value of the categorical variable. It not only allows us to visualize outliers efficiently but also allows us to compare the distribution of the continuous variable across categories of the categorical variable. The lower and upper edges of the rectangle represent the first and third quartiles, respectively, the line down through the middle represents the median value, and the points (or fliers) above and below the rectangle represent outlier values.

Exercise 2.15: Plotting a Bar Chart

Let's visualize the total number of tsunamis created by earthquakes of each intensity level using a bar chart. This exercise is a continuation of Exercise 2.14, Implementing a Pairplot:

  1. Preprocess the flag_tsunami variable. Before we can use the flag_tsunami variable, we need to preprocess it to convert the No values to zeros and the Tsu values to ones. This will give us the binary target variable. To do this, we set the values in the column using the .loc operator, with : indicating that values need to be set for all rows, and the second parameter specifying the name of the column for which values are to be set:

    data.loc[:,'flag_tsunami'] = data.flag_tsunami\

                                .apply(lambda t: int(str(t) == 'Tsu'))

  2. Remove all rows having null intensity values from the data we want to plot:

    subset = data[~pd.isnull(data.intensity)][['intensity',\

                                               'flag_tsunami']]

  3. Find the total number of tsunamis for each intensity level and display the DataFrame. To get the data into a format from which a bar plot can be drawn, we need to group the rows by intensity level and then sum the flag_tsunami values to get the total number of tsunamis for each intensity level:

    data_to_plot = subset.groupby('intensity').sum()

    data_to_plot

    The output will be as follows:

    Figure 2.38: Total number of tsunamis for each intensity level

  4. Plot the bar chart, using Matplotlib's plt.bar(x=..., height=...) method, which takes two arguments, one specifying the x values at which bars need to be drawn, and the second specifying the height of each bar. Both of these are one-dimensional data structures that must have the same length:

    plt.figure(figsize=(12,9))

    plt.bar(x=data_to_plot.index, height=data_to_plot.flag_tsunami)

    plt.xlabel('Earthquake intensity')

    plt.ylabel('No. of tsunamis')

    plt.show()

    The output will be as follows:

Figure 2.39: Bar chart

From this plot, we can see that as the earthquake intensity increases, the number of tsunamis caused also increases, but beyond an intensity of 9, the number of tsunamis seems to suddenly drop.

Think about why this could be happening. Perhaps it's just that there are fewer earthquakes with an intensity that high, and hence fewer tsunamis. Or it could be an entirely independent factor; maybe high-intensity earthquakes have historically occurred on land and couldn't trigger a tsunami. Explore the data to find out.

Note

To access the source code for this specific section, please refer to https://packt.live/3enFjsZ.

You can also run this example online at https://packt.live/2V5apxV. You must execute the entire Notebook in order to get the desired result.

Exercise 2.16: Visualizing a Box Plot

In this exercise, we'll plot a box plot that represents the variation in eq_primary over those countries with at least 100 earthquakes. This exercise is a continuation of Exercise 2.15, Plotting a Bar Chart:

  1. Find countries with over 100 earthquakes. We will find the value counts for all the countries in the dataset. Then, we'll create a series comprising only those countries having a count greater than 100:

    country_counts = data.country.value_counts()

    top_countries = country_counts[country_counts > 100]

    top_countries

    The output will be as follows:

    Figure 2.40: Countries with over 100 earthquakes

  2. Subset the DataFrame to filter in only those rows having countries in the preceding set. To filter the rows, we use the .isin() method on the pandas series to select those rows containing a value in the array-like object passed as a parameter:

    subset = data[data.country.isin(top_countries.index)]

  3. Create and display the box plot. The primary command for plotting the data is sns.boxplot(x=..., y=..., data=..., order=). The x and y parameters are the names of the columns in the DataFrame to be plotted on each axis—the former is assumed to be the categorical variable and the latter the continuous. The data parameter takes the DataFrame from which to take the data and order takes a list of category names that indicates the order in which to display the categories on the X axis:

    plt.figure(figsize=(15, 15))

    sns.boxplot(x='country', y="eq_primary", data=subset, \

                order=top_countries.index)

    plt.show()

    The output will be as follows:

Figure 2.41: Box plot

Note

To access the source code for this specific section, please refer to https://packt.live/2zQHPZw.

You can also run this example online at https://packt.live/3hPAzhN. You must execute the entire Notebook in order to get the desired result.

Relationship Between Two Categorical Variables

When we are looking at a pair of categorical variables to find a relationship between them, the most intuitive approach is to divide the data on the basis of the first category, subdivide it further on the basis of the second categorical variable, and then look at the resultant counts to find the distribution of data points. A popular way to visualize this is to use a stacked bar chart. As in a regular bar chart, each bar represents a value of the primary categorical variable, but each bar is further subdivided into color-coded segments that indicate what fraction of the data points in that primary category fall into each subcategory (that is, each value of the second categorical variable). The variable with the larger number of categories is usually treated as the primary category.

Exercise 2.17: Plotting a Stacked Bar Chart

In this exercise, we'll plot a stacked bar chart that represents the number of tsunamis that occurred for each intensity level. This exercise is a continuation of Exercise 2.16, Visualizing a Box Plot:

  1. Find the number of data points that fall into each grouped value of intensity and flag_tsunami:

    grouped_data = data.groupby(['intensity', \

                                 'flag_tsunami']).size()

    grouped_data

    The output will be as follows:

    Figure 2.42: Data points falling into each grouped value of intensity and flag_tsunami

  2. Use the .unstack() method on the resultant series to pivot the level-1 index (flag_tsunami) into columns:

    data_to_plot = grouped_data.unstack()

    data_to_plot

    The output will be as follows:

    Figure 2.43: The level-1 index

  3. Create the stacked bar chart. We first call the sns.set() function to apply seaborn's default plot styling. Then, we can easily use the native .plot() function in pandas to plot a stacked bar chart by passing the kind='bar' and stacked=True arguments:

    sns.set()

    data_to_plot.plot(kind='bar', stacked=True, figsize=(12,8))

    plt.show()

    The output will be as follows:

Figure 2.44: A stacked bar chart

Note

To access the source code for this specific section, please refer to https://packt.live/37SnqA8.

You can also run this example online at https://packt.live/3dllvVx. You must execute the entire Notebook in order to get the desired result.

The plot now lets us visualize and interpret the fraction of earthquakes that caused tsunamis at each intensity level. In Exercise 2.15: Plotting a Bar Chart, we saw the number of tsunamis drop for earthquakes having an intensity of greater than 9. From this plot, we can now confirm that this was primarily because the number of earthquakes itself dropped beyond level 10; the fraction of tsunamis even increased at level 11.

Activity 2.03: Relationships within the Data

In this activity, we will revise what we learned in the previous section about relationships within the data. We will use the same dataset we used in Activity 2.01: Summary Statistics and Missing Values, that is, House Prices: Advanced Regression Techniques, and use different plots to highlight relationships between values in this dataset. This activity is a continuation of Activity 2.01: Summary Statistics and Missing Values.

The steps to be performed are as follows:

  1. Plot the correlation heatmap for the dataset.

    The output should be similar to the following:

    Figure 2.45: Correlation Heatmap for the Housing dataset

  2. Plot a more compact heatmap having annotations for correlation values using the following subset of features:

    feature_subset = ['GarageArea','GarageCars','GarageCond', \

                      'GarageFinish','GarageQual','GarageType', \

                      'GarageYrBlt','GrLivArea','LotArea', \

                      'MasVnrArea','SalePrice']

    The output should be similar to the following:

    Figure 2.46: Correlation heatmap for selected variables of the Housing dataset

  3. Display the pairplot for the same subset of features, with the KDE plot on the diagonals and the scatter plot elsewhere.

    The output will be as follows:

    Figure 2.47: Pairplot for the same subset of features

  4. Create a boxplot to show the variation in SalePrice for each category of GarageCars:

    The output will be as follows:

    Figure 2.48: Boxplot showing variation in SalePrice for each category of GarageCars

  5. Plot a line graph using seaborn to show the variation in SalePrice for older and more recently built homes:

    The output will be as follows:

Figure 2.49: Line graph showing the variation in SalePrice for older to more recently built homes

Note

The solution for this activity can be found via this link.
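Again, for orientation only and not the linked solution: steps 1 to 4 might look roughly like the sketch below, assuming the housing data is loaded into a DataFrame named df (the file path is a placeholder). The categorical Garage* columns in the listed subset would first need to be numerically encoded, so this sketch restricts itself to the numeric ones, and step 5 is omitted because it requires a year-built column that is not named in the steps above:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv('train.csv')   # placeholder path for the housing data

    # Step 1: correlation heatmap over all numeric columns
    plt.figure(figsize=(12, 10))
    sns.heatmap(df.select_dtypes(include='number').corr(), \
                square=True, cmap='RdBu', vmin=-1, vmax=1)
    plt.show()

    # Step 2: annotated heatmap on the numeric part of the listed subset
    numeric_subset = ['GarageArea', 'GarageCars', 'GarageYrBlt', \
                      'GrLivArea', 'LotArea', 'MasVnrArea', 'SalePrice']
    plt.figure(figsize=(10, 8))
    sns.heatmap(df[numeric_subset].corr(), square=True, annot=True, \
                cmap='RdBu', vmin=-1, vmax=1)
    plt.show()

    # Step 3: pairplot on the same numeric subset
    sns.pairplot(df[numeric_subset].dropna(), kind='scatter', diag_kind='kde')
    plt.show()

    # Step 4: box plot of SalePrice for each category of GarageCars
    plt.figure(figsize=(10, 8))
    sns.boxplot(x='GarageCars', y='SalePrice', data=df)
    plt.show()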

You have learned how to use more advanced methods from the seaborn package to visualize large numbers of variables at once, using charts such as the correlation heatmap, pairplot, and boxplots. With boxplots, you learned how to visualize the range of one variable segmented across another, categorical variable. The boxplot further directly visualizes the quantiles and outliers, making it a powerful tool in your EDA toolkit. You have also created some preliminary line and scatter plots that are helpful in visualizing continuous data that trends over time or some other variable.

Summary

In this chapter, we started by talking about why data exploration is an important part of the modeling process and how it helps not only in preprocessing the dataset but also in engineering informative features and improving model accuracy. The chapter focused on gaining a basic overview of the dataset and its features, as well as on gaining insights by creating visualizations that combine several features. We looked at how to find the summary statistics of a dataset using core functionality from pandas. We looked at how to find missing values and why they are important, learning how to use the Missingno library to analyze them and the pandas and scikit-learn libraries to impute them. Then, we looked at how to study the univariate distributions of variables in the dataset and visualize them for both categorical and continuous variables using bar charts, pie charts, and histograms. Lastly, we learned how to explore relationships between variables and how they can be represented using scatter plots, heatmaps, box plots, and stacked bar charts, to name but a few.

In the following chapters, we will start exploring supervised machine learning algorithms. Now that we have an idea of how to explore a dataset that we have, we can proceed to the modeling phase. The next chapter will introduce regression, a class of algorithms that are primarily used to build models for continuous target variables.