Chapter 4 – Applied Unsupervised Learning with R

Chapter 4

Dimension Reduction

Learning Objectives

By the end of this chapter, you will be able to:

  • Apply different dimension reduction techniques
  • Execute market basket analysis using the Apriori algorithm
  • Perform principal component analysis on a dataset

In this chapter, we will have a look at different dimension reduction techniques.

Introduction

This chapter presents techniques for unsupervised learning that accomplish something called dimension reduction. First, we will discuss what a dimension is, why we want to avoid having too many dimensions, and the basic idea of dimension reduction. The chapter then covers two dimension reduction techniques in detail: market basket analysis and Principal Component Analysis (PCA). Market basket analysis is a technique for generating associative rules in datasets. The chapter will contain a walk-through of detailed R code that accomplishes this. PCA, a very common dimension reduction technique, comes from theoretical linear algebra. The chapter will also show a detailed walk-through of how to accomplish PCA with R.

The Idea of Dimension Reduction

The dimensions of a dataset are nothing more than the collection of distinct numbers that are required to describe observations in it. For example, consider the position of Pac-Man in the game named after him. Pac-Man is a game that was popular in the 20th century in America. It is an extremely simple game: Pac-Man is a little circular creature on a screen who likes to eat little dots and fruits. He lives in a maze that he has to navigate with only two sets of directions to move in: up/down and left/right. There are some monsters who try to chase Pac-Man and kill him. You can see in the following illustration what a Pac-Man game looks like, and what the world that he inhabits and has to move in looks like:

Figure 4.1: Illustration of a Pac-Man-style game

As you can see, Pac-Man's position can be fully described by two numbers: how far he is from the left side of the screen and how far he is from the top of the screen. If we know those two numeric measurements, then there is only one unique place on the screen where he could be. So, if we wanted to collect data on where Pac-Man was over time, we would be able to collect a two-dimensional dataset that consisted of those two numbers measured repeatedly. We would feel completely confident that each observation, consisting of two numbers, fully described everything that could be known about where Pac-Man was located at the time of the observation.

It is not only location data or geometric data that can be described as two-dimensional. Any dataset that contains two different measurements can be described as two-dimensional. For example, if we measured individuals' heights and weights, we could create a two-dimensional dataset that consisted of their height and weight measurements. If we recorded height, weight, and shoe size, then we would have a three-dimensional dataset. There is no limit to the number of dimensions that can be contained in a dataset.

Dimension reduction is the process of finding a lower-dimensional dataset that approximates a higher-dimensional dataset. Consider an example related to Pac-Man. Imagine that we have a three-dimensional dataset that describes Pac-Man's location. Suppose that the dimensions of this dataset are (1) how far Pac-Man is from the left side of the screen, (2) how far Pac-Man is from the top of the screen, and (3) how far Pac-Man is from the blue monster that is chasing him. This is a three-dimensional dataset; however, we can have complete knowledge of Pac-Man's location with only the information contained in the first two dimensions. The simplest way we could perform effective dimension reduction here would be to discard the third dimension, since it would not help us locate Pac-Man any better than we would be able to with only the first two dimensions. So, the two-dimensional dataset consisting of the dataset's first two dimensions would be a good approximation of the three-dimensional dataset that we started with.

In most real-life scenarios, dimension reduction is not as easy as discarding dimensions. Typically, we will attempt to use data from all dimensions to create a completely new dataset whose dimensions have different meanings from the dimensions in the original dataset. The exercises in the rest of the chapter will illustrate this process.

In the following exercise, we will look at a dataset that contains multiple dimensions. We will create plots that illustrate dimension reduction and how it can help us.

Exercise 21: Examining a Dataset that Contains the Chemical Attributes of Different Wines

Prerequisites:

To download the data, go to https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson04/Exercise21/wine.csv.

Note

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at http://archive.ics.uci.edu/ml/datasets/Wine. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson04/Exercise21/wine.csv.

Download this data and store it on your computer in a file called wine.csv.

Note

For all the exercises and activities where we are importing external csv or images, Go to R Studio-> Session-> Set Working Directory-> To Source File Location. You can see in the console that the path is set automatically.

This data contains information about the chemical measurements of 13 different attributes of 178 different samples of wine. Altogether, this is a 13-dimensional dataset. If we consider a subset of the data consisting of only 2 of the 13 attributes, we will have a 2-dimensional dataset comparable to our hypothetical Pac-Man data. With 2-dimensional data, we can always plot it on a 2-dimensional scatterplot.

In this dataset, the first column records the class of the wine, or, in other words, what type of wine it is. Every other column records a measurement related to the chemical makeup of the wine. One wonderful thing about machine learning is that even without knowing anything about the chemistry of wine, we can use pure data analysis tools to find patterns and draw conclusions that may be unnoticed even by chemistry experts.

Here are the steps for completion:

  1. Open the R console and make sure you have saved the data file (wine.csv) in a location that R can access. You can use the setwd() command to make sure your file is accessible. For example, if your wine.csv file is located in the C:/Users/me/datasets folder, then you can run the setwd('C:/Users/me/datasets') command in the R console. Then, you will be able to open the wine data file in R as follows:

    wine<-read.csv('wine.csv')

  2. Consider the following scatterplot of the two-dimensional data created by the flavanoids and total phenols attributes:

    plot(wine$flavanoid,wine$phenol)

    The output is as follows:

    Figure 4.2: Scatterplot of two-dimensional data of flavanoids and phenol
  3. After plotting the data, we observe that there appears to be a strong correlation between the flavanoid and phenol measurements. We can draw a line on the plot that represents this correlation. For now, don't worry about where we found the coefficients labeled a and b in the following commands:

    plot(wine$flavanoid,wine$phenol)

    abline(a=1.1954,b=.54171,col='red',lwd=5)

Figure 4.3: Scatterplot with a line representing correlation between flavanoids and phenol

As you can see, the red line follows the geometric shape of our data quite closely. The majority of the points in the data are quite close to the red line. If we wanted a concise way to describe the points, we could simply say what point on the red line they are closest to. This would not be a perfect description of the data, since some points would map to the same point on the red line even though they have different flavanoid and phenol levels. However, describing this data using only the red line is a reasonable approximation to the actual data.

If we describe each observation using the point on the red line that it is closest to, then what we have accomplished is dimension reduction. We started with a dataset that requires two measurements to describe each observation and found a way to describe each observation using only one point. This is the basic idea of every dimension reduction strategy, and this chapter contains several practical strategies to accomplish it.

Importance of Dimension Reduction

Why is dimension reduction something that we are interested in doing? Here are a couple of reasons:

  • One reason may be for the sake of compressing data. If a dataset is particularly large, and if the R instance running on your laptop takes too long to do simple calculations on it, it may be useful to reduce the dimensions of the data so that it can more easily fit into your computer's memory.
  • A more interesting reason for dimension reduction is that it provides insights into the underlying structure of our data and the ways that different attributes relate to each other. In the preceding exercise, even if we don't have advanced training in chemistry, we can use what we have learned from our simple dimension reduction exercise to understand wine chemistry better.

    Note

    If we read a little more about phenols and flavanoids (for example, at this website: https://www.researchgate.net/post/What_is_the_relation_between_total_Phenol_total_Flavonoids), we can learn that phenols and flavanoids both possess antioxidant activity. So, it could be that the red line on the chart represents the level of antioxidant activity of a particular wine, and that flavanoid and phenol measurements are both just capturing a noisy measurement of this one thing. So, dimension reduction has enabled us to generate a hypothesis about the chemical makeup of wine, even without advanced domain knowledge.

Market Basket Analysis

Market basket analysis is a method that allows us to take high-dimensional data and reduce it to something that is simple and manageable without losing too much information along the way. In market basket analysis, our goal is to generate rules that govern the data.

Market basket analysis is also called affinity analysis. It is named after the example of a grocery store trying to do analysis on its customers' transactions – analysis of the products each customer puts in his or her basket. A large grocery store may have something like 5,000 items for sale at any given time. They may have thousands of customers per day. For each customer, the grocery store can keep a record of those customers' transactions. One way to do this would be to use binary encodings, as shown in the following example:

Customer 1's transactions on Day 1:

Peanut Butter: No

Jelly: Yes

Bread: No

Milk: No

Customer 2's transactions on Day 1:

Peanut Butter: Yes

Jelly: Yes

Bread: No

Milk: No

...

These transactions could be stored in a table that had 5,000 columns – one for each item for sale in the store – and one row for every recorded transaction. And instead of storing "Yes" and "No" values for every item, they could store 1s and 0s, where 1 denotes "Yes" and 0 denotes "No", in a table that looks something like the following:

Figure 4.4: Table demonstrating transactions of the customers

The preceding table shows only four columns and five rows, but in practice, the table would be much, much larger.

The simplest use case for market basket analysis is to answer a simple question: what items are usually bought together? A grocery store owner may be interested in this purely out of curiosity. But, in fact, there are some compelling business reasons why they would want to know about their customers' most common baskets.

So far, this problem seems quite simple. Our binary data is as simple as it could be, consisting only of 0s and 1s. Our problem is merely to find what items tend to be purchased together. The complexity lies not in these simple ideas, but rather in their practical implementation.

Consider the brute-force approach to finding items that tend to be bought together. If we consider every possible basket of items, which is every possible combination of 0s and 1s in the preceding data, we find that there are 2^5000 possible baskets. This is much more than the estimated number of particles in the known universe, and it would not be computationally feasible to check each possible basket in a reasonable amount of time, or to store findings about each possible basket.

If we cannot check each possible basket, how can we find baskets that are bought together with any confidence that we are doing a comprehensive check? The answer is to apply an algorithmic solution. The Apriori algorithm is the most popular method to do thorough market basket analysis given time and space constraints. It was invented by Agrawal and Srikant, who published a paper about it in 1994. It proceeds sequentially through increasing market basket sizes.

The Apriori algorithm consists of several steps. In the first few steps, we will pass through our data set to find the most common baskets. In our first pass, we will find the most common baskets that have exactly one item in them. In our second pass, we will find the most common baskets that have exactly two items in them. We will continue these passes until we have found the most common baskets of every size that interests us. In the example of a grocery store, maybe the most common two-item basket is "peanut butter, jelly" and the most common three-item basket is "peanut butter, jelly, bread."

After finding the most common baskets, we will generate associative rules for these baskets. These rules will express relationships between items in the most common baskets. For example, an associative rule for a grocery store might be something such as, "if peanut butter and jelly are both in a basket, then it is likely that bread is also in the basket." These types of rules enable us to find associations between different individual items that could be useful for us. For example, after knowing that peanut butter and jelly are often accompanied by bread, the grocery store owner might be interested in rearranging the displays of these items so that they are closer together in the store and easier for shoppers to put into their baskets with minimal effort.

Note

The rule "If peanut butter and jelly are present [in a basket], then bread is likely to be present [in that basket]" is a simple associative rule. Associative rules are sometimes drawn with an arrow pointing from X to Y, indicating the idea that X "implies" Y, although associative rules are not necessarily causal.

Many Americans have grown up eating peanut butter and jelly sandwiches, so it may seem obvious to them that peanut butter, jelly, and bread are likely to be bought together. Market basket analysis may generate some seemingly obvious associative rules like these. However, in practice, market basket analysis is likely to generate associative rules that are surprising and unexpected. This is another example where without being an expert in grocery shopping or retail, we can use machine learning to find patterns and insights that are surprising even to experts.

In the next exercise, we will be applying market basket analysis to census survey data. The data in the dataset looks as follows:

Figure 4.5: Screenshot of the dataset

Note

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at http://archive.ics.uci.edu/ml/datasets/Adult. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson04/Exercise22-Exercise25/.

This data is different from the hypothetical grocery basket data we described previously as its columns are not 0-1 binary encodings, but rather can take in multiple values. Since the Apriori algorithm is designed for 0-1 data, we will do re-coding of the data. Here, re-coding means that we will create new variables that are simpler and easier to work with than the original variables, but nevertheless convey the same information. The re-coding that we will perform here will transform the data so that it consists of 0-1 encodings. Another term for what we will do here is creating dummy variables. A dummy variable is a variable that only takes on the values 0 and 1. For each of the columns in the dataset, we can refer to the data at http://archive.ics.uci.edu/ml/datasets/Adult to find information about the column, and then use that information for our re-coding. We can do analogous transformations to all of the variables.

For categorical variables such as employment status, we make new 0-1 variables for each possible response. For ordinal variables such as age, we make two new variables, indicating whether the value is high or low.

We will draw conclusions about what survey answers tend to be answered in the same way. Market basket analysis can be used for a wide variety of datasets outside of just grocery data. No matter what dataset is used, market basket analysis will generate associative rules and tell us which attributes of the data tend to take the same values.

Exercise 22: Data Preparation for the Apriori Algorithm

Note

Exercise 22-25 should be executed together.

In this exercise, we will use data that is freely available at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson04/Exercise22-Exercise25/census.csv. This is survey data. To use this data, you should first download it to your computer – save it to a file called census.csv. You will not need to load any special packages in order to run this data or complete any prerequisites:

  1. Use the setwd() function in R to read data. After you have set the working directory, you can read it into R as follows:

    filepath='census.csv'

    mkt<-read.csv(filepath,stringsAsFactors=FALSE,header=FALSE,sep=',')

  2. Examine the data:

    head(mkt)

    Figure 4.6: Screenshot of the data

    One thing you will notice is that R has automatically assigned column names to the data, since the raw data file did not contain column names. By default, R assigns numbered column names beginning with V, since each column can be thought of as a vector.

  3. Create dummy variables.

    We can see from the data's website that the first variable, which R has called V1, is a measurement of age in years. For this variable, we recode it as a 0-1 binary variable based on whether its value is above the median age value or below the median age value. We can calculate the median age value with "median(mkt$V1)":

    mkt$old<-1*(mkt$V1>median(mkt$V1))

    mkt$young<-1*(mkt$V1<=median(mkt$V1))

  4. Similarly, we can see on the website that the second column, which R has labeled V2, refers to employment status. For employment, we can create several new variables, one for each class of employment:

    mkt$government_employee<-1*(mkt$V2 %in% c(" State-gov"," Local-gov"," Federal-gov"))

    mkt$self_employed<-1*(mkt$V2 %in% c(" Self-emp-not-inc"," Self-emp-inc"))

    mkt$never_worked<-1*(mkt$V2 %in% c(" Never-worked"))

    mkt$private_employment<-1*(mkt$V2 %in% c(" Private"))

    mkt$other_employment<-1*(mkt$V2 %in% c(" ?"," Without-pay" ))

  5. Here we encode 0-1 variables for the education level of a respondent:

    mkt$high_school_incomplete<-1*(mkt$V4 %in% c(" 1st-4th"," Preschool"," 5th-6th"," 7th-8th"," 9th"," 10th"," 11th"," 12th"))

    mkt$high_school_complete<-1*(mkt$V4 %in% c(" HS-grad"," Some-college"," Assoc-acdm"," Assoc-voc"))

    mkt$bachelors<-1*(mkt$V4 %in% c(" Bachelors"))

    mkt$post_bachelors<-1*(mkt$V4 %in% c(" Masters"," Prof-school"," Doctorate" ))

    We use the V4 column to encode education levels as the column labeled V3 is not useful for our purposes. We will not use the V5 column, which contains the same data expressed in a different way.

  6. Here we encode 0-1 variables for a person's marital status:

    mkt$married<-1*(mkt$V6 %in% c(" Married-civ-spouse"," Married-AF-spouse"," Married-spouse-absent"))

    mkt$never_married<-1*(mkt$V6 %in% c(" Never-married"))

    mkt$divorced_separated<-1*(mkt$V6 %in% c(" Divorced"," Separated"))

    mkt$widowed<-1*(mkt$V6 %in% c( " Widowed"))

  7. Here we encode 0-1 variables for a respondent's occupation:

    mkt$clerical<-1*(mkt$V7 %in% c(" Adm-clerical"))

    mkt$managerial<-1*(mkt$V7 %in% c(" Exec-managerial"))

    mkt$moving<-1*(mkt$V7 %in% c(" Transport-moving"))

    mkt$farming_fishing<-1*(mkt$V7 %in% c(" Farming-fishing"))

    mkt$craft_repair<-1*(mkt$V7 %in% c(" Craft-repair" ))

    mkt$sales<-1*(mkt$V7 %in% c(" Sales"))

    mkt$tech_support<-1*(mkt$V7 %in% c(" Tech-support"))

    mkt$service<-1*(mkt$V7 %in% c(" Protective-serv"," Priv-house-serv", " Other-service"))

    mkt$armed_forces<-1*(mkt$V7 %in% c(" Armed-Forces"))

    mkt$other_occupation<-1*(mkt$V7 %in% c(" Handlers-cleaners"," ?"," Machine-op-inspct"," Prof-specialty"))

    We will not use the V8 column since it is recorded for census purposes and is not useful for our analysis.

  8. Here we encode 0-1 variables for a respondent's self-reported sex:

    mkt$male<-1*(mkt$V9 %in% c(" Male"))

    mkt$female<-1*(mkt$V9 %in% c(" Female"))

    The V10 and V11 columns are not very informative, so we will not use them in our analysis.

  9. Here we encode 0-1 variables for the self-reported number of work hours of each respondent:

    mkt$high_hours<-1*(mkt$V12 > median(mkt$V12))

    mkt$low_hours<-1*(mkt$V12 <= median(mkt$V12))

  10. Here we encode 0-1 variables for whether a respondent reports that their native country is the United States:

    mkt$usa<-1*(mkt$V13==" United-States")

    mkt$not_usa<-1*(mkt$V13!=" United-States")

  11. Here we encode 0-1 variables for whether a respondent reports an income above or below $50,000:

    mkt$low_income<-1*(mkt$V14==" <=50K")

    mkt$high_income<-1*(mkt$V14==" >50K")

  12. Now, we have added 33 new variables that are 0-1 encodings. Since we will be performing market basket analysis only on the 0-1 encodings, we can remove the initial 14 variables that we started with to create a dummy-only dataset as follows:

    mktdummies<-mkt[,15:ncol(mkt)]

    mktdummies

  13. We can see the mean values of each of our variables by running the following code:

    print(colMeans(mktdummies,na.rm=TRUE))

    The mean of a dummy variable is equal to the percentage of the time that it is equal to 1. So, when we see that the mean of the married variable is 0.473, we know that about 47.3% of survey respondents were married.

    After completing this exercise, your data will have 33 columns, each of which is a dummy variable instance that takes only the values 0 and 1. If you print the top 6 rows by running print(head(mktdummies)) in the console, then you can see that the resulting dataset looks as follows:

Figure 4.7: Section of resulting dataset of dummy variables

Now that we have completed the exercise, we have a dummy-only dataset that consists of 0-1 variables that give true/false information about each of the original variables in the dataset.

Finally, we are ready to actually perform the Apriori algorithm. In the following exercise, we will begin to "take passes" through our data. In each pass, we will find the most common baskets that have a particular size.

Before we begin to pass through our data, we will need to specify something called support. Support is a name for one of the parameters of the Apriori algorithm. Here, support refers to the percentage of baskets that contain a particular combination of items. If we find that 40% of survey takers in the marketing data are both high income and female, then we will say that high-income, female "baskets" have 40% support in our data.

We need to make a decision about the minimum support we are interested in. If we set the minimum support threshold too high, we will not find any baskets that meet the threshold. If we set the minimum support threshold too low, we will find so many baskets that it will be difficult to look at all of them to find an interesting one. Also, since we want to find rules that will be practically useful, we want to find baskets that are relatively common, because more common baskets are more likely to have practical use for us.

Exercise 23: Passing through the Data to Find the Most Common Baskets

Now our data is prepared for the main steps of market basket analysis. Before going further, we have to make decisions about the parameters we will use in our algorithm:

  1. The first parameter we will work with is support, as explained previously. In this case, we can start by setting the minimum support threshold at 10%.

    support_thresh<-0.1

  2. First, we will find all one-item baskets that match our support threshold as follows:

    firstpass<-unname(which(colMeans(mktdummies,na.rm=TRUE)>support_thresh))

    This shows the collection of all survey items that were answered in the same way by at least 10% of respondents.

  3. To take the second pass through the data, we will define all possible candidates for two-item baskets that might have more than 10% support as follows:

    secondcand<-t(combn(firstpass,2))

    secondpass<-NULL

    Note

    If less than 10% of baskets contain a particular item, then it is impossible that more than 10% of baskets contain that item plus a different item. So, the candidates for two-item baskets that have more than 10% support will be combinations of items that survived the first pass through the data.

    We have defined secondcand, which is the set of candidates for our second pass, and secondpass, which we will use to store the results of the second pass. The secondpass variable starts with a NULL value because we have not yet begun the second pass.

    If we look at secondcand, we can see that it consists of pairs of numbers. Each number refers to a column in the mktdummies data. For example, the fourth row of secondcand refers to a potential basket consisting of people who responded that they are older than the median age and also privately employed. In the second pass through the data, we will check each two-item candidate in secondcand, and if it has greater than 10% support, it will survive the second pass through the data.

  4. In order to check the support of the fourth row of our candidates in secondcand, we can do the following calculation:

    k<-4

    support<-mean(mktdummies[,secondcand[k,1]]*mktdummies[,secondcand[k,2]],na.rm=TRUE)

    print(support)

    The output is as follows:

    0.05515801

  5. We need to do this same calculation for every candidate basket, which we can do by putting this calculation in a loop. This loop will save the final two-item baskets that reach the support threshold in the secondpass variable:

    k<-1

    while(k<=nrow(secondcand)){

    support<-mean(mktdummies[,secondcand[k,1]]*mktdummies[,secondcand[k,2]],na.rm=TRUE)

    if(support>support_thresh){

    secondpass<-rbind(secondpass,secondcand[k,])

    }

    k<-k+1

    }

  6. The important outcome variable of this exercise is the variable called secondpass. This variable contains all two-item baskets that reach the support threshold (10%) that we have specified. Look at the top six rows of this variable by running the following in the console:

    print(head(secondpass))

    The output is as follows:

    [,1] [,2]

    [1,] 1 6

    [2,] 1 9

    [3,] 1 12

    [4,] 1 14

    [5,] 1 25

    [6,] 1 26

    Here, each row contains two numbers, and each refers to a column number in the original dataset. For example, the first row indicates that the first column and the sixth column of the mktdummies dataset together constitute a two-item basket that has greater than 10% support. Since the first column of our dataset is called old and the sixth column in our dataset is called private_employment, then we conclude that survey respondents who are both old and employed privately constitute more than 10% of all survey respondents.

After this, we have finalized the second pass through the data. By completing the second pass, we now have a list of all of the most common baskets that have size two.

The point of the Apriori algorithm is that we can use the two-item baskets and one-item baskets to narrow down the three-item candidate baskets that we look at, which makes our search much faster.

To get a full sense of how the Apriori algorithm works, we should pass through the data at least one more time, which is covered in the following exercise.

Exercise 24: More Passes through the Data

In the following exercise, we will take more passes through the data. Recall that each time we pass through the data, we are looking for baskets that meet our support threshold. In each pass, we seek baskets that have more items than we sought in previous passes. So, in the first pass, we sought one-item baskets that met our support threshold. In the second pass, we sought two-item baskets that met our support threshold. In the following exercise, we will illustrate how to take more passes through the data, including a third pass, in which we will seek baskets with three items that meet our support threshold, and a fourth pass, in which we will seek baskets with four items that meet our support threshold.

Being able to take many passes through the data will be important to us if we are interested in complex rules that govern many items:

  1. In the third pass through the data, we will look for three-item baskets that have at least 10% support. The third pass through the data will start with a product variable equal to 1. This product variable will give us the product of different columns of our data, and the mean of the product variable will give us the support of different baskets, as follows:

    product<-1

    n<-1

  2. This product variable will be multiplied by the observations related to a two-item basket that survived the second pass:

    thirdpass<-NULL

    k<-1

    while(k<=nrow(secondpass)){

    j<-1

    while(j<=length(firstpass)){

    n<-1

    product<-1

    while(n<=ncol(secondpass)){

    product<-product*mktdummies[,secondpass[k,n]]

    n<-n+1

    }

  3. Finally, each product variable will be multiplied by the observations of a one-item basket that survived the first pass:

    if(!(firstpass[j] %in% secondpass[k,])){

    product<-product*mktdummies[,firstpass[j]]

  4. We take the mean of our product to find the support of the basket we have specified:

    support<-mean(product,na.rm=TRUE)

  5. If the support of the resulting three-item basket is higher than our specified support threshold, then we save it to our final thirdpass variable:

    if(support>support_thresh){

    thirdpass<-rbind(thirdpass,c(secondpass[k,],firstpass[j]))

    }

    }

    j<-j+1

    }

    k<-k+1

    }

    Note

    Steps 2-5 should be executed together.

    Now we have a list of all baskets of size three that are common in the data.

  6. After going through several passes through the data, we can start to see the general form of the steps taken in the Apriori algorithm. In general, to find the baskets that survive pass n, we need to take the baskets that survived pass n-1, add an item to them that survived pass 1, and see whether the resulting combination has support greater than our chosen threshold:

    fourthpass<-NULL

    k<-1

    while(k<=nrow(thirdpass)){

    j<-1

    while(j<=length(firstpass)){

    n<-1

    product<-1

    while(n<=ncol(thirdpass)){

    product<-product*mktdummies[,thirdpass[k,n]]

    n<-n+1

    }

    if(!(firstpass[j] %in% thirdpass[k,])){

    product<-product*mktdummies[,firstpass[j]]

    support<-mean(product,na.rm=TRUE)

    if(support>support_thresh){

    fourthpass<-rbind(fourthpass,c(thirdpass[k,],firstpass[j]))

    }

    }

    j<-j+1

    }

    k<-k+1

    }

    We can continue in this way indefinitely and create baskets of any size that meet our support threshold. For our purpose here, we will stop after four passes through the data, and we will examine the results of our third pass.

  7. The final important outcomes of this exercise are the thirdpass and fourthpass variables. These variables contain information about the three-item and four-item baskets that have met our support threshold. You can interpret each row of these variables in the same way you interpreted each row of secondpass. Each row represents one basket that meets our support threshold, and each number in each row refers to a column number in our dataset.

    You can verify what the top six rows of thirdpass look like by executing the following:

    print(head(thirdpass))

    The output is as follows:

    [,1] [,2] [,3]

    [1,] 1 6 9

    [2,] 1 6 12

    [3,] 1 6 26

    [4,] 1 6 29

    [5,] 1 6 30

    [6,] 1 6 32

    We can interpret row 2 as indicating that the basket containing item 1, item 6, and item 12 meets our support threshold.

  8. You can verify the top six rows of fourthpass as follows:

    print(head(fourthpass))

    The output is as follows:

    [,1] [,2] [,3] [,4]

    [1,] 1 6 9 26

    [2,] 1 6 9 29

    [3,] 1 6 9 30

    [4,] 1 6 9 32

    [5,] 1 6 12 26

    [6,] 1 6 12 29

    We can interpret row 5 as telling us that the basket containing items 1, 6, 12, and 26 meets our support threshold.

In the previous exercises, we have found the baskets that interest us. In this exercise, we will obtain the final product of market basket analysis. The final product we are interested in will be coherent associative rules. In other words, we are not only interested in the fact that a basket containing "old", "private_employment", and "low_hours" is common. We are also interested in generating a rule that relates these three items. One such rule might be "people who are older than the median survey respondent and who are privately employed are highly likely to work fewer hours than the median respondent". Market basket analysis thus goes further than other distribution analyses and clustering methods that only find groups in data. Market basket analysis not only finds groups but also groups them in coherent, meaningful rules.

In order to generate these rules, we will need to specify more parameters, similar to the support threshold we specified earlier.

One of these parameters is called confidence. Confidence is merely a conditional likelihood. Given that a person is both female and low-income, what is the likelihood that she is also divorced? What we have determined so far is support, which may tell us that the three-item basket consisting of female, low income, and divorced makes up more than 10% of all survey takers. Confidence tells us more – it tells us whether "divorced" is only a common basket item, or whether it is especially common conditional on the presence of "female" and "low income."

The final parameter we will have to specify is called lift. Lift is the confidence divided by the overall prevalence of the item predicted by the rule. In this case, suppose that if a person is female and low income, she has a 90% likelihood of also being divorced. Then 90% is the confidence of this rule, which seems quite high. However, this confidence will not seem impressive if 89% of all people are divorced anyway. If so, then knowing the presence of "female" and "low income" in the basket only improves our predictive capabilities very slightly, by about 1%. The value of lift in this case will be 90%/89%, or about 1.011. That is just a hypothetical – we will have to check the actual data to see what the actual value of lift is.

Together, confidence and lift provide measurements that help us decide whether an associative rule is useful or not. In a complex situation such as the many-question survey we are looking at here, we specify minimum thresholds for confidence and lift that filter out associative rules that are not sufficiently useful, so that we finish the Apriori algorithm with a small number of very useful rules.

Exercise 25: Generating Associative Rules as the Final Step of the Apriori Algorithm

In this exercise, we will complete the final steps of the Apriori algorithm. Any of the baskets that have survived our passes through the data so far can be considered candidate rules. In the final steps of market basket analysis, we will reduce the candidate rules further based on our final criteria – confidence and lift:

  1. Examine baskets that have survived multiple passes through the data as follows:

    head(thirdpass)

    The output is as follows:

    [,1] [,2] [,3]

    [1,] 1 6 9

    [2,] 1 6 12

    [3,] 1 6 26

    [4,] 1 6 29

    [5,] 1 6 30

    [6,] 1 6 32

    You can see the number of three-item baskets that survived the third pass as follows:

    nrow(thirdpass)

    The output is as follows:

    [1] 549

    We see that there are 549 three-item baskets, that is, 549 candidate rules that have at least 10% support in our data. These baskets are not the final products of market basket analysis – associative rules are the final products we are looking for.

  2. The formula for confidence for our three-item baskets is as follows: the support of the basket consisting of all three items, divided by the support of a basket consisting of only the first two items. We can calculate confidence as follows for the fifth row of our thirdpass three-item baskets:

    k<-5

    confidence<-mean(mktdummies[,thirdpass[k,1]]*mktdummies[,thirdpass[k,2]]*mktdummies[,thirdpass[k,3]],na.rm=TRUE)/mean(mktdummies[,thirdpass[k,1]]*mktdummies[,thirdpass[k,2]],na.rm=TRUE)

    Note

    This is just the support of the full three-item basket, divided by the support of the two-item basket not containing the third item.

  3. Lift is the confidence divided by the overall prevalence of the item predicted by the rule. Lift can be calculated easily as follows for the fifth row of our third-pass candidates:

    k<-5

    lift<-confidence/mean(mktdummies[,thirdpass[k,3]],na.rm=TRUE)

  4. To narrow down candidate rules to a final set of acceptable associative rules, we will specify minimum thresholds for confidence and lift, just like we did for support. Here, we have specified a lift threshold of 1.8 and a confidence threshold of 0.8:

    Note

    You can choose any values you prefer for those thresholds, but remember that lift thresholds should be higher than 1, and confidence thresholds should be between 0 and 1, and close to 1.

    lift_thresh<-1.8

    conf_thresh<-.8

  5. We can calculate lift and confidence for each of our candidate rules by constructing a loop as follows:

    thirdpass_conf<-NULL

    k<-1

    while(k<=nrow(thirdpass)){

    support<-mean(mktdummies[,thirdpass[k,1]]*mktdummies[,thirdpass[k,2]]*mktdummies[,thirdpass[k,3]],na.rm=TRUE)

    confidence<-mean(mktdummies[,thirdpass[k,1]]*mktdummies[,

    thirdpass[k,2]]*mktdummies[,thirdpass[k,3]],na.rm=TRUE)/

    mean(mktdummies[,thirdpass[k,1]]*mktdummies[,thirdpass[k,2]],na.rm=TRUE)

    lift<-confidence/mean(mktdummies[,thirdpass[k,3]],na.rm=TRUE)

    thirdpass_conf<-rbind(thirdpass_conf,unname(c(thirdpass[k,],support,confidence,lift)))

    k<-k+1

    }

    This has generated a new variable called thirdpass_conf, which is a DataFrame that contains columns for the support, confidence, and lift for each candidate rule. Here, conf is used to be short for confidence, something we have added to the thirdpass data.

  6. Finally, we can eliminate all candidate rules that do not meet the specified confidence and lift thresholds, as follows:

    thirdpass_high<-thirdpass_conf[which(thirdpass_conf[,5]>conf_thresh & thirdpass_conf[,6]>lift_thresh),]

  7. Now we have thirdpass_high, which is a set of associative three-item rules that have high confidence and high lift in our data. We can browse through some of them by printing the DataFrame to the console as follows:

    head(thirdpass_high)

Figure 4.8: Output of thirdpass_high

Altogether, the steps we have followed in market basket analysis can be summarized as follows:

Figure 4.9: Flowchart of steps followed in market basket analysis

Note

Remember, these refer to the dummy variables we created in Exercise 22, Data Preparation for the Apriori Algorithm, where we created a dummy variable called old that was 1 for individuals in the higher age ranges and 0 otherwise. We also created a dummy variable for high income where 1 was used to indicate annual income greater than $50,000, and 0 was used otherwise.

The interpretation of the rule on the first row of thirdpass_high is that people who are older than median, and also have high income, are likely (with high confidence and high lift) to also be married. This makes intuitive sense: marriage and high income can both take many years to achieve, so it makes sense that there are not many young, married, high income individuals. We find that this has confidence of about 87% and lift of about 1.84.

In this case, the firm that conducted the survey could use this data to create advertising campaigns – either creating a campaign that targeted older married people for homeownership because that is a proven high-income demographic, or targeting younger married people for homeownership because that could be an underserved demographic that would constitute a business opportunity. Each of the seven three-item rules we found could provide insights into population patterns and business opportunities, together with quantified measurements of what these rules tell us and what certainty they provide.

There are some different choices we could make in our market basket analysis process that could change our results. If we change the thresholds we specified, we could potentially get more rules, or more useful rules. For example, if we set a 9% instead of a 10% support threshold, fewer rules would be filtered out, and we might have ended with a rule such as "young students who live in condominiums are likely to be Asian-Americans," a rule referring to a group that constitutes only about 9% of survey respondents.

We have focused only on three-item baskets and rules that relate elements of these baskets. By allowing more or fewer items into the baskets we are using to search for rules, we could find more interesting rules that could lead to solid business insights. All of this has been done with relatively few lines of code in a relatively short amount of time. This indicates the usefulness and potential of market basket analysis for solving data problems and business problems.

Market basket analysis has taken a high-dimensional problem (the problem of finding patterns in a large dataset) and given us a low-dimensional solution (six simple, high-confidence rules) without too much effort, computational power, or time.

Principal Component Analysis

The next type of dimension reduction method we will cover is called PCA. This is a very common technique used by researchers in a wide variety of fields.

Linear Algebra Refresher

This short section will not contain an exhaustive review of linear algebra, but merely a reminder of some of its main points.

Note

http://joshua.smcvt.edu/linearalgebra/#current_version covers some basics, including matrices, covariance matrices, eigenvectors, and eigenvalues. You can feel free to skip the linear algebra refresher if you are already familiar with these terms.

Matrices

Linear algebra is largely concerned with the analysis of matrices. A matrix can be thought of as just a collection of numbers in a rectangular format. We can create a matrix in R as follows:

matrix1<-matrix(c(1,2,3,4,5,6),nrow=2)

Here we have created a matrix with two rows and three columns, with six entries total. We describe entries in a matrix according to the row and column in which they appear. In our "matrix1" that we have just created, the number 3 is in the "1-2" position, because it is in the first row and the second column. We can access that particular position in R by calling matrix1[1,2].

Variance

In general, the variance of a variable gives us an idea of how widely that variable is spread out.

Covariance

Covariance is variance that is measured for two different variables together. It measures the extent to which their dispersion matches. In other words, it measures the extent to which if one is high, the other is also high, and how high each of them is expected to be.

Exercise 26: Examining Variance and Covariance on the Wine Dataset

Execute all the steps that are to be followed in Exercise 21, Examining a Dataset that Contains Chemical Attributes of Different Wines. Then calculate variance and covariance for the same dataset:

  1. The alcohol measurements are all between 11.03 and 14.83, which you can see by running the following:

    range(wine$alcohol)

    The output is as follows:

    [1] 11.03 14.83

  2. We can calculate variance by using R's var command. For the wine's alcohol measurement, we find var(wine$alcohol) is about 0.66. By contrast, we find that the magnesium measurements in our dataset are more widely dispersed by executing the following:

    range(wine$magnesium)

    The output is as follows:

    [1] 70 162

  3. This shows that the variable ranges from 70 to 162. Since it is more widely dispersed, we should expect a higher variance, which we indeed find by executing the following:

    var(wine$magnesium)

    The output is as follows:

    [1] 203.9893

  4. To calculate covariance, execute the following code:

    cov(wine$alcohol,wine$magnesium)

    The output is as follows:

    [1] 3.139878

  5. In Step 4, we found that the covariance of the alcohol and magnesium variables is about 3.14. Please note that covariance is symmetric, so the covariance of X with Y is the same as the covariance of Y with X. You can check this by trying the following:

    cov(wine$magnesium,wine$alcohol)

    The output is as follows:

    [1] 3.139878

    You will note that it yields the same value.

  6. Variance of a variable is just the covariance of that variable with itself. You can see this by running the following code:

    var(wine$magnesium)

    The output is as follows:

    [1] 203.9893

    You will yield the same output by executing the following code:

    cov(wine$magnesium,wine$magnesium)

    The output is as follows:

    [1] 203.9893

The covariance matrix is a square matrix where every entry is a variance or a covariance. To construct a covariance matrix, first we must number each of the variables in our dataset. In the wine dataset, we can give each variable a number according to its order in the list of columns. So, alcohol would be variable 1, malic would be variable 2, and so on.

Note

Remember that you can see the list of variables at the data's source website at https://archive.ics.uci.edu/ml/datasets/wine.

After ordering the variables, we can create the covariance matrix. In this matrix, what we say is, "the i-j entry is the covariance of variable i and variable j." So, the item in the first row, second column is the 1-2 entry, and it will be equal to the covariance of the first variable (alcohol) with the second variable (malic). Since covariance is a symmetric operation, the 2-1 entry will be the same as the 1-2 entry. This means that the matrix itself will be symmetrical – every entry is the same as the entry on the mirror image other side of the main diagonal.

The entries on the main diagonal of the covariance matrix will be variances rather than covariances. For example, the entry in the 3-3 position of the matrix will be the covariance of variable 3 with variable 3 – this is the covariance of a variable with itself, which is another way of saying it is the variable's variance.

Eigenvectors and Eigenvalues

When we have a square matrix such as a covariance matrix, there are certain special vectors we can calculate called eigenvectors. Each eigenvector has a value associated with it called an eigenvalue. A discussion about eigenvectors and eigenvalues could easily fill a whole book. For our purposes, the most important thing to know about eigenvectors is that they express the directions of maximum variance in our data. The most important thing to know about eigenvalues is that they indicate which eigenvectors are the most important.

The Idea of PCA

PCA is a powerful dimension reduction technique that is based on the linear algebra topics described in the preceding refresher.

To accomplish PCA, we will take the covariance matrix of our data, and then find its eigenvectors. The eigenvectors of the covariance matrix are called principal components. The principal components enable us to re-express the data in different terms and different numbers of dimensions.

We will use the dataset related to wine that we explored at the beginning of the chapter. Recall that the wine dataset had 13 dimensions that measured a particular chemical attribute of a particular wine. One observation in that dataset consists of 13 numbers – one for each of the dimensions of the data.

One of the things that PCA enables is re-expressing data in different terms. The covariance matrix of the wine dataset will have 13 eigenvectors. We can interpret those eigenvectors as a set of 13 new dimensions – we will see how to do this in the following exercise. Essentially, we will be able to fully describe each observation in terms of a new set of dimensions that we discovered with PCA.

More importantly, PCA enables us to do dimension reduction. Instead of re-expressing the data in terms of 13 new dimensions defined by the eigenvectors, we can select only the 12 most important of these new dimensions, and express the data in terms of those 12 dimensions instead of the original 13. PCA makes it easy to select which dimensions are the most important, because the importance of each eigenvector is measured by its corresponding eigenvalue. The following exercise will illustrate how to do this more thoroughly.

There is a new type of plot that we will create as part of PCA, called a scree plot. A scree plot is a simple line segment plot that shows the eigenvalues of a matrix represented in order from highest to lowest, in order to indicate the relative importance of their associated eigenvectors.

A scree plot shows the values of the eigenvalues of a matrix, plotted in order from largest to smallest. We will use the scree plot to decide which eigenvectors (that is, which dimensions) are the most important.

PCA may sound difficult, and it is based on some terms and ideas that may be new to you, but actually it is relatively simple to implement in R.

Exercise 27: Performing PCA

If we have a covariance matrix, we are ready to perform PCA. In this case, we will use the wine dataset that we explored earlier in this chapter. Our goal is to perform dimension reduction – to express the wine dataset in fewer dimensions than it originally possessed. This exercise is built on top of Exercise 26, Examining Variance and Covariance on the Wine Dataset:

  1. To begin, load the same wine dataset that we used earlier in the chapter. As a first step, we will remove the class column from our wine dataset. We are doing this because class is not a chemical attribute of the wine, but rather a label, and we are interested in studying the chemical attributes of wine. We can remove this column as follows:

    wine_attributes<-wine[,2:14]

  2. We can get the covariance matrix of this smaller matrix as follows:

    wine_cov<-cov(wine_attributes)

  3. Next, we will use a function in R called eigen. This function calculates the special vectors called eigenvectors, and the special values called eigenvalues. We can apply it to our covariance matrix as follows:

    wine_eigen<-eigen(wine_cov)

  4. Now we can look at the eigenvectors we have found:

    print(wine_eigen$vectors)

    The output is as follows:

    Figure 4.10: Eigenvectors of wine
  5. R has compiled the eigenvectors into a square matrix the same size as our original covariance matrix. Each column of this new matrix is one of the eigenvectors of the covariance matrix. If we look at the eigenvalues we have found, we can see the relative importance of each of these eigenvectors. Execute the following to look at the eigenvalues:

    print(wine_eigen$values)

    The output is as follows:

    Figure 4.11: Eigenvalues of wine
  6. We are essentially finished with our PCA. The eigenvectors of the covariance matrix are called the principal components of the data. Let's look at the first one:

    print(wine_eigen$vectors[,1])

    The output is as follows:

Figure 4.12: This first eigenvector expresses a linear combination of our original dimensions.

We can understand our first principal component as follows:

Principal Component 1 = -0.0016592647 * alcohol + 0.0006810156 * malic -0.0001949057 * ash + 0.0046713006 * alcalinity -0.0178680075 * magnesium - 0.0009898297 * phenol -0.0015672883 * flavanoid +0.0001230867 * nonfphenol -0.0006006078 * proanthocyanin -0.0023271432 * color -0.0001713800 * hue -0.0007049316 * od280 -0.9998229365 * proline

So, each element of the eigenvector is a coefficient in this equation to generate a new principal component. The principal component is a linear combination of the original dimensions. We can use each of the principal components as new dimensions. So, instead of describing an observation by saying "it has a 14.23 measurement for alcohol, a 1.71 measurement for malic…." and so on, we can describe it by saying something like "it has a 5.62 measurement for principal component 1, a 9.19 measurement for principal component 2…." and so on.

The most important outcomes for this exercise are the wine_eigen$vectors and wine_eigen$values objects.

Any dimension reduction technique will mean that we have to lose some of the information encoded in the dataset. This is inevitable: one number can never completely express everything that is expressed in 13 numbers. The benefit of PCA is that it guarantees that it is the most efficient way to do dimension reduction – that by expressing data in terms of the principal components, we have lost the least possible amount of what is encoded in the original data.

In the following exercise, we will discuss how to transform the data to accomplish dimension reduction.

Exercise 28: Performing Dimension Reduction with PCA

This exercise is a continuation of the previous exercise – it will use the same data and the same matrices and eigenvectors that we calculated there:

  1. Remember, each of the eigenvectors of our covariance matrix tells us a linear combination of the 13 wine attributes that can be used to summarize the data. In this case, the first eigenvector is telling us that we can make transformation of the data as follows:

    neigen<-1

    transformed<-t(t(as.matrix(wine_eigen$vectors[,1:neigen])) %*% t(as.matrix(wine_attributes)))

    Here, we have specified a number of eigenvectors (1), and we have multiplied our original dataset by this number of eigenvectors, creating a transformed dataset that is expressed in terms of this eigenvector or our first principal component.

  2. We can look at part of our transformed dataset as follows:

    print(head(transformed))

    This will give us the following output:

    Figure 4.13: Transformed dataset

    Here, we have a one-dimensional dataset that describes each observation with only one number. So, we are saying that the first observed wine has a score of -1067.0557 on principal component 1. We have accomplished dimension reduction.

  3. We can do a partial restoration of the dataset with the following multiplication:

    restored<- t(as.matrix(wine_eigen$vectors[,1:neigen]) %*% t(as.matrix(transformed)))

    This should approximately restore our original dataset.

    Note

    Since dimension reduction always loses some of the original information encoded in data, it will not be a perfect restoration.

  4. We can test whether our transformation has led to an accurate reconstruction of the data as follows:

    print(mean(abs(wine_attributes[,13]-restored[,13])))

    The output is as follows:

    [1] 1.466919

    In this case, the error is quite small, indicating that we have been fairly successful in restoring our data.

  5. We can do dimension reduction using any number of dimensions. In general, we can determine how many dimensions we should use in transformations by generating a scree plot, as follows:

    plot(wine_eigen$values,type='o')

    In this case, our scree plot appears as follows:

Figure 4.14: Scree plot showing the eigenvalues of a covariance matrix

To decide how many dimensions to use for dimension reduction, we can look at this scree plot and choose a number of dimensions corresponding to the number of eigenvalues that are relatively high compared to the rest.

We can see that the first eigenvalue is by far the highest, so therefore the first eigenvector is by far the most important one, telling us that the first principal component is the most important dimension. In this case, reduction to a one-dimensional dataset is quite appropriate.

You have just performed PCA on a covariance matrix.

Activity 10: Performing PCA and Market Basket Analysis on a New Dataset

In the following activity, you will load a new dataset, and then you will perform PCA and market basket analysis on it. The activity will go through each of the major steps of both of those procedures, including the required data preparation. The dataset we will use comes from a study that was done of neighborhoods in the area around Boston, Massachusetts, and it contains features of many neighborhoods, including tax rates, property values, and demographics of the local populations.

For this activity, use the "Boston" dataset, which can be obtained by running the following code in R:

library(MASS)

data(Boston)

These steps will help you complete the activity:

  1. Convert all variables into dummy variables by doing the following: for each variable, create a new variable that is equal to 1 if it is at or above that variable's median, and 0 if it is below that variable's median. Create another new variable that is the complement of this: where every 0 in the previously created dummy variable is a 1 and every 1 in the previously created dummy variable is a 0. Save all the dummy variables into a new dataset called Bostondummy.
  2. Find all eigenvectors and eigenvalues of the original data
  3. Create a scree plot of the eigenvalues of this data. How should this scree plot be interpreted?
  4. Attempt to approximate this data using only a few of the principal components. How close is your approximation to the original data?
  5. Using the dummy variables you created in Step 1, perform the first pass of market basket analysis by finding all of the variables whose value is 1 for more than 10% of rows.
  6. Perform the second pass of market basket analysis by finding all combinations of variables in the data that have more than 10% support in the data.
  7. Complete market basket analysis up to three-item baskets.

Expected output: The most important outputs for this activity are the principal components of the dataset, as well as the three-item rules obtained from market basket analysis. The principal components are obtained in the solution to the activity's second step, when we create Boston_eigen, and we can run the print(Boston_eigen$vectors) command to see the principal components as follows:

Figure 4.15: Principal components of the original data

The three-item rules for market basket analysis are obtained in the solution to the activity's Step 14, and we can see the final results when we run print(head(thirdpass_conf)) in the console:

Figure 4.16: Three-item rules for market basket analysis

Note

The solution for this activity can be found on page 222.

Summary

In this chapter, we discussed the idea of the dimensionality of data. We went over why it could be useful to reduce the dimensionality of data and highlighted that the process of dimension reduction can reveal important truths about the underlying structure of data. We covered two important dimension reduction methods. The first method we discussed was market basket analysis. This method is useful for generating associative rules from complex data and can be used for the use case it was named after (analyzing baskets of groceries) or a wide variety of other applications (such as analyzing the clustering of survey responses). We also discussed PCA, a common way to describe data in terms of linear combinations of its dimensions. PCA is easy to perform with some linear algebra tools, and provides an easy way to approximate even very complex data.

In the next chapter, we will have a look at the different data comparison methods.