Chapter 2 – Applied Unsupervised Learning with R

Chapter 2

Advanced Clustering Methods

Learning Objectives

By the end of this chapter, you will be able to:

  • Perform k-modes clustering
  • Implement DBSCAN clustering
  • Perform hierarchical clustering and record clusters in a dendrogram
  • Perform divisive and agglomerative clustering

In this chapter, we will have a look at some advanced clustering methods and how to record clusters in a dendrogram.

Introduction

So far, we've learned about some of the most basic algorithms of unsupervised learning: k-means clustering and k-medoids clustering. These are not only important for practical use, but are also important for understanding clustering itself.

In this chapter, we're going to study some other advanced clustering algorithms. We aren't calling them advanced because they are difficult to understand, but because, before using them, a data scientist should have insights into why he or she is using these algorithms instead of the general clustering algorithms we studied in the last chapter. k-means is a general-purpose clustering algorithm that is sufficient for most cases, but in some special cases, depending on the type of data, advanced clustering algorithms can produce better results.

Introduction to k-modes Clustering

All the types of clustering that we have studied so far are based on a distance metric. But what if we get a dataset in which it's not possible to measure the distance between variables in a traditional sense, as in the case of categorical variables? In such cases, we use k-modes clustering.

k-modes clustering is an extension of k-means clustering, dealing with modes instead of means. One of the major applications of k-modes clustering is analyzing categorical data such as survey results.

Steps for k-Modes Clustering

In statistics, mode is defined as the most frequently occurring value. So, for k-modes clustering, we're going to calculate the mode of categorical values to choose centers. So, the steps to perform k-modes clustering are as follows:

  1. Choose any k number of random points as cluster centers.
  2. Find the Hamming distance (discussed in Chapter 1, Introduction to Clustering Methods) of each point from the center.
  3. Assign each point to a cluster whose center it is closest to according to the Hamming distance.
  4. Choose new cluster centers in each cluster by finding the mode of all data points in that cluster.
  5. Repeat this from Step 2 until the cluster centers stop changing.

You might have noticed that these steps are very similar to those for k-means clustering. Only the type of distance metric is changed here. So, if you understand k-means clustering, it will be very easy for you to understand k-modes clustering as well.

Exercise 10: Implementing k-modes Clustering

Note

For all the exercises and activities where we are importing external CSV's or images, go to RStudio-> Session-> Set Working Directory-> To Source File Location. You can see in the console that the path is set automatically.

The data for this exercise can be downloaded from here: https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Exercise10/breast_cancer.csv. This is a categorical dataset that includes nine variables, some categorical and some nominal, describing different breast cancer cases. After saving this data in a file called breast_cancer.csv, we'll do the following:

Note 

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data. This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Exercise10/breast_cancer.csv.

  1. Read the dataset and store it in a variable:

    bc_data<-read.csv('breast_cancer.csv',header = FALSE)

  2. Store all of the columns from the second column to the end in a new variable:

    k_bc_data<-bc_data[,2:10]

  3. View the first six rows of the k_bc_data variable:

    head(k_bc_data)

    The output contains the six rows with values for different attributes describing the patient, their symptoms, and their treatment:

    V2 V3 V4 V5 V6 V7 V8 V9 V10

    1 30-39 premeno 30-34 0-2 no 3 left left_low no

    2 40-49 premeno 20-24 0-2 no 2 right right_up no

    3 40-49 premeno 20-24 0-2 no 2 left left_low no

    4 60-69 ge40 15-19 0-2 no 2 right left_up no

    5 40-49 premeno 0-4 0-2 no 2 right right_low no

    6 60-69 ge40 15-19 0-2 no 2 left left_low no

  4. Import the klaR library, which has the kmodes function. klaR is an R library that is used for classification and visualization:

    install.packages("klaR")

    library(klaR)

  5. Predict and store the final cluster centers in a variable. In this step, we enter the dataset and the number of clusters (that is, k and the maximum amount of iterations to find the number of clusters):

    k.centers<-kmodes(k_bc_data,2,iter.max = 100)

  6. View the cluster centers:

    k.centers

    The output is as follows:

Figure 2.1: Screenshot of the cluster centers

The clustering algorithm has grouped all of the breast cancer cases into two clusters, with each cluster containing cases that are similar to each other. In the output, there are two main components: the cluster modes and the clustering vector. The cluster modes section is telling us the modes or coordinates of the centers for cluster 1 and cluster 2. Below that, the clustering vector contains the cluster number of each data point in the index sequence.

Note

You could get different results every time you run the algorithm because of the random starting positions of the centers.

As it is a multi-dimensional categorical dataset, there's no easy way to visualize the results other than printing the data to the R console.

Activity 5: Implementing k-modes Clustering on the Mushroom Dataset

Note 

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/datasets/Mushroom. We have downloaded the file, cleaned the data, and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Activity05

In this activity, we will perform k-modes clustering on the mushroom dataset. This dataset lists the attributes of 23 different species of mushrooms. Each species is classified as being either edible (e) or poisonous (p). We will see how well unsupervised learning can classify poisonous and edible mushrooms by grouping the data into two clusters. These steps will help you complete the activity:

  1. Download mushrooms.csv from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Activity05.
  2. Read the mushroom.csv file into a variable.
  3. Import the klaR library.
  4. Calculate the clusters according to k-modes clustering.
  5. Check the results of clustering by forming a matrix of data labels versus the cluster assigned.

    Note

    The solution for this activity can be found on page 212.

The output will be a table of true labels and cluster labels, as follows:

1 2

e 80 4128

p 3052 864

Introduction to Density-Based Clustering (DBSCAN)

Density-based clustering or DBSCAN is one of the most intuitive forms of clustering. It is very easy to find naturally occurring clusters and outliers in data with this type of clustering. Also, it doesn't require you to define a number of clusters. For example, consider the following figure:

Figure 2.2: A sample scatter plot

There are four natural clusters in this dataset and a few outliers. So, DBSCAN will segregate the clusters and outliers, as depicted in the following figure, without you having to tell it how many clusters to identify in the dataset:

Figure 2.3: Clusters and outliers classified by DBSCAN

So, DBSCAN can find regions of high density separated by regions of low density in a scatter plot.

Steps for DBSCAN

As mentioned before, DBSCAN doesn't require you to choose a number of clusters, but you have to choose the other two parameters to perform DBSCAN. The first parameter is commonly denoted by ε (epsilon), which denotes the maximum distance between two points in the same cluster. Another parameter is the minimum number of points in a cluster, which is usually denoted by minPts. Now we'll look at the DBSCAN clustering algorithm step by step:

  1. Select any point, R, in the dataset.
  2. Find all the points within distance epsilon from point R.
  3. If the total number of points within distance epsilon from point R is greater than minPts, then it is a cluster and R is the core point.
  4. If the total number of points within distance epsilon from point p is less than minPts, all the points within distance epsilon will be classified as noise. We then start the process again from step 2 after selecting a new point, R, that has neither been classified as noise nor as part of the cluster.
  5. Repeat the process for other points in the cluster to find points within distance epsilon that are not already in the cluster. Those new points will also be classified in the same cluster.
  6. Once these steps have been performed for all points in the cluster, repeat the same process by selecting a new random point, R, that has neither been classified in a cluster nor as noise.

So, to illustrate of the preceding algorithm, let's take an example with epsilon as x and the minimum number of points in a cluster as 4. Look at the following figure:

Figure 2.4: Only 2 points lie within x distance of point R1

Only three points lie within x distance of point R1 and our threshold for the minimum number of points within x radius is four. So, these four points will be classified as outliers or noise. But if there were one more point, R5, somewhere between R1 and R4, all of these four points would belong to a cluster as in the following figure:

Figure 2.5: All of these four points belong to a cluster

In the preceding figure, points R1 and R5 are core points, as they have four points each within x distance from them. And points R4, R2, and R3 are not core points, as illustrated in the following figure:

Figure 2.6: Core versus non-core points

Any point outside of any of these circles would be classified as a noise point like, R6 in the following figure:

Figure 2.7: Noise point R6

Exercise 11: Implementing DBSCAN

In this exercise, we will have a look at the implementation of DBSCAN using the Iris dataset. To perform DBSCAN, we will execute the following steps:

  1. Store the first two columns of the Iris dataset in iris_data:

    iris_data<-iris[,1:2]

  2. Import the dbscan library, which contains the implementation of various DBSCAN algorithms:

    install.packages("dbscan")

    library(dbscan)

  3. Calculate and store clusters in the clus variable. In this step, you also have to choose the values of epsilon and minPts:

    clus<-dbscan(iris_data,eps=.3,minPts = 5)

    Note

    You can experiment with the values of epsilon. To get the desired output, we have set the value as 0.3.

  4. Import the factoextra library for visualizing clusters:

    install.packages("factoextra") #use it only the first time if library is not installed already

    library(factoextra)

  5. Plot the cluster centers. You need to enter the variable in which you stored the results of DBSCAN as the first parameter of the plot function. As the second parameter, you need to enter a dataset. In our case, it's iris_data. The geom variable in the function is used to define the geometry of the graph. We will only use point. Now, ggtheme is used to select a theme for the plot. palette is used to select the geometry of the points. The ellipse value is set to false so that the function does not draw an outline of the clusters.:

    fviz_cluster(clus,data=iris_data,geom = "point",palette="set2",ggtheme=theme_minimal(),ellipse=FALSE)

    Here is the output:

Figure 2.8: DBSCAN clusters in different colors

In Figure 2.8, there are three clusters in orange, blue, and green. The points in black are noise or outliers. Points in orange here belong to one species, but points in green belong to two different species.

Note

You can make a quick comparison of this scatter plot with Figure 1.18 of Chapter 1, Introduction to Clustering Methods.

DBSCAN is different from all the clustering we've studied so far and is one of the most common and one of the most frequently cited types of clustering methods in scientific literature. There are a few advantages that DBSCAN has over other clustering methods:

  • You don't need to select the number of clusters initially.
  • It does not produce different results every time you run it, unlike k-means or other "k-type" clustering methods, where starting points can have an effect on the end results. So, results are reproducible.
  • It can discover clusters of any shape in data.

But there are also a few disadvantages of DBSCAN:

  • The correct set of parameters, that is, epsilon and minPts, are hard to determine for the proper clustering of data.
  • DBSCAN cannot differentiate between clusters based on density. If a dataset has large variations in density, it may not perform as well.

Uses of DBSCAN

As DBSCAN is one of the most frequently cited clustering methods, it has many practical uses. Some of the practical real-life uses of DBSCAN clustering are the following:

  • DBSCAN can be used in urban planning in many ways. For example, given data about crime incidents with locations, DBSCAN can be used to identify crime-ridden areas in a city. This data can be used to plan police force deployment or even investigate potential gang activity.
  • DBSCAN can be used to form strategies in games such as cricket and basketball. Given the data related to ball pitching on a cricket pitch, you can identify the strengths and weaknesses of batsmen and bowlers. Or if we have data on a batsman hitting the ball, data gained from DBSCAN can be used to adjust fielding accordingly.
  • During natural disasters or the spread of viral diseases, the geolocation of tweets can be used to identify highly affected areas with DBSCAN.

Activity 6: Implementing DBSCAN and Visualizing the Results

In this activity, we will perform DBSCAN and compare the results with k-means clustering. To do this, we are going to use the multishapes dataset, which contains simulated data that represents different shapes.

These steps will help you complete the activity:

  1. Generate and store the multishapes dataset in the factoextra library.
  2. Plot the first two columns of the multishapes dataset.
  3. Perform k-means clustering on the dataset and visualize.
  4. Perform DBSCAN on the dataset and visualize the data to compare the results with k-means clustering.

    Note

    The solution for this activity can be found on page 213.

The plot of DBSCAN clustering will look as follows:

Figure 2.9: Expected plot of DBCAN on the multishapes dataset

Here, all the points in black are anomalies and are not classified in any cluster, and the clusters formed in DBSCAN cannot be obtained with any other type of clustering method. These clusters have taken several types of shapes and sizes, whereas, in k-means, all clusters are of approximately spherical shape.

Introduction to Hierarchical Clustering

The last type of clustering that we're going to study is hierarchical clustering. A hierarchy is defined as "a system in which people or things are placed in a series of levels with different importance or status." Hierarchical clustering merges clusters sequentially. This sequence of merged clusters is called a hierarchy. We can see that more clearly with the help of the output of a hierarchical clustering algorithm called a dendrogram.

Hierarchical clustering comes in two types:

  • Agglomerative
  • Divisive

Since both types of hierarchical clustering are similar, it makes sense to study both of them together.

Agglomerative clustering is also known as the bottom-up approach to hierarchical clustering. In this method, each data point is assumed to be a single cluster at the outset. From there, we start merging the most similar clusters according to a similarity or distance metric until all the data points are merged in a single cluster.

In divisive clustering, we do exactly the opposite. It is a top-down approach to hierarchical clustering. In this method, all the data points are assumed to be in a single cluster initially. From there on, we start splitting the cluster into multiple clusters until each data point is a cluster on its own. Differences and similarities between the two clustering types will become clear in further sections. But first, we should try to understand why we need this other type of clustering and the special purpose it serves that other types of clustering don't serve. Hierarchical clustering is used mainly for the following reasons:

  • Just like DBSCAN, we don't have to choose a number of clusters initially.
  • The final output of hierarchical clustering, dendrograms, can help us visualize the clustering results in a way that means we don't need to re-run the algorithm to see a different number of clusters in the results.
  • Unlike k-means, any type of distance metric can be used in hierarchical clustering.
  • It can find complex-shaped clusters, unlike other clustering algorithms, such as k-means, which only finds approximately spherical clusters.

The combination of all the preceding factors make hierarchical clustering an important clustering method in unsupervised learning.

Types of Similarity Metrics

As described previously, agglomerative hierarchical clustering is a bottom-up approach to hierarchical clustering. We merge the most similar clusters one by one based on a similarity metric. This similarity metric can be chosen from one of several different types:

  • Single link: In single-link similarity, we measure the distance or similarity between the two most similar points of two clusters:
Figure 2.10: Demonstration of the single-link metric
  • Complete link: In this type of metric, we measure the distance or similarity between the two most distant points of a cluster:
Figure 2.11: Demonstration of the complete-link metric
  • Group average: In this metric, we measure the average distance between all members of one cluster and any members of a second cluster:
Figure 2.12: Demonstration of the group-average metric

Note

The distance between members of the same cluster is not measured in these similarity metrics.

  • Centroid similarity: In this type of similarity, the similarity between two clusters is defined as the similarity between the centroids of both clusters:

Figure 2.13: Demonstration of centroid similarity

Steps to Perform Agglomerative Hierarchical Clustering

With the knowledge of these similarity metrics, we can now understand the algorithm to perform agglomerative hierarchical clustering:

  1. Initialize each point as a single cluster.
  2. Calculate the similarity metric between every pair of clusters. The similarity metric can be any of the four metrics we just read about.
  3. Merge the two most similar clusters according to the similarity metric selected in step 2.
  4. Repeat the process from step 2 until we have only one cluster left.

This whole process will produce a graph called a dendrogram. This graph records the clusters formed at each step. A simple dendrogram with very few elements would look as follows:

Figure 2.14: A sample dendrogram

In the preceding dendogram, suppose that point A and point B are the closest among all points on the similarity measure that we are using. Their closeness is used to determine the height of the joining line, which is at L1 in the case of point A and point B. So, point A and point B are clustered first. After that, at L2, point D and point E are clustered, then, at L3, points A, B, and C are clustered. At this point, we have two clusters that are joined together to form one cluster at L4.

Now, to get clusters from this dendrogram, we make horizontal cuts. For example, if we make a cut between L4 and L3, we will get two clusters:

  • Cluster 1 – A, B, and C
  • Cluster 2 – D and E

The clusters will look as follows:

Figure 2.15: Clusters represented in the dendrogram

Similarly, if we make a horizontal cut in the dendrogram between L3 and L2, we will get three clusters:

  • Cluster 1 – A and B
  • Cluster 2 – C
  • Cluster 3 – D and E

The clusters will look as follows:

Figure 2.16: Representation of clusters in a dendrogram

So, to get a different number of clusters, we didn't need to rerun the process. With this method, we can get any number of clusters possible in the data without executing the whole process again.

Exercise 12: Agglomerative Clustering with Different Similarity Measures

In this exercise, we're going to perform agglomerative hierarchical clustering with different similarity measures and compare the results:

  1. Let's enter the last three columns of the iris_flowers dataset, which are petal length, petal width, and species, in the iris_data variable, as follows:

    iris_data<-iris[,3:5]

    install.packages('cluster')

    library(cluster)

  2. In this step, we use the hclust function to get hierarchical clustering. In the hclust function, we need to enter the pair-wise distance of all the points with each other, for which we use the dist() function. The second parameter, method, is used to define the similarity measure for the hierarchical clustering:

    h.clus<-hclust(dist(iris_data),method="complete")

  3. Now, let's plot the results of the hierarchical clustering in a dendrogram, as follows:

    plot(h.clus)

    The output is as follows:

    Figure 2.17: Dendrogram derived from the complete similarity metric
  4. To select a number of clusters from the preceding dendrogram, we can use an R function called cutree. We feed the results of hclust along with the number of clusters to this function:

    clusterCut <- cutree(h.clus, 3)

    table(clusterCut, iris_data$Species)

    The output table is as follows:

    Figure 2.18: Table displaying the distribution of clusters

    The setosa and virginica species get classified accurately with this clustering method.

  5. Use single as a similarity metric as follows:.

    h.clus<-hclust(dist(iris_data),method = "single")

    plot(h.clus)

    The output is as follows:

    Figure 2.19: Dendrogram derived from the single similarity metric

    Notice how this dendrogram is different from the dendrogram created with the complete similarity metric.

  6. Divide this dataset into three clusters:

    clusterCut <- cutree(h.clus, 3)

    table(clusterCut, iris_data$Species)

    The output is as follows:

    Figure 2.20: Table displaying the distribution of clusters

    Here, our clustering method is successfully able to separate only one class from the other two.

  7. Now let's perform hierarchical clustering with the average similarity metric:

    h.clus<-hclust(dist(iris_data),method = "average")

    plot(h.clus)

    The output is as follows:

    Figure 2.21: Dendrogram derived from the average similarity metric
  8. Let's divide the preceding dendrogram into three clusters again and see the results:

    clusterCut <- cutree(h.clus, 3)

    table(clusterCut, iris_data$Species)

    The output is as follows:

    Figure 2.22: Table displaying the distribution of clusters

    Here, with the average similarity metric, we get almost completely correct classification results.

  9. Now let's try creating a dendrogram with the last similarity metric, centroid:

    h.clus<-hclust(dist(iris_data),method = "centroid")

    plot(h.clus)

    The output is as follows:

    Figure 2.23: Dendrogram derived from the centroid similarity metric
  10. Now, let's divide the preceding dendrogram into three clusters and see the results:

    clusterCut <- cutree(h.clus, 3)

    table(clusterCut, iris_data$Species)

    The output is as follows:

Figure 2.24: Table displaying the distribution of clusters

Although dendrograms of clusters with the average and centroid similarity metrics look different, they have the same number of elements in each cluster when we cut the dendrogram at three clusters.

Divisive Clustering

Divisive clustering is the opposite of agglomerative clustering. In agglomerative clustering, we start with each point as its own cluster, while in divisive clustering, we start with the whole dataset as one cluster and from there, we start dividing it into more clusters:

Figure 2.25: Representation of agglomerative and divisive clustering

So, the divisive clustering process can be summarized in one figure as follows:

Figure 2.26: Representation of the divisive clustering process

Steps to Perform Divisive Clustering

From step 1 to step 6, the divisive clustering process keeps on dividing points into further clusters until each point is a cluster on its own. Figure 2.26 shows the first 6 steps of the divisive clustering process, but to correctly run the complete process, we will need to execute as many steps as there are points in the dataset.

So, for divisive clustering, we will use the DIANA algorithm – DIANA stands for Divisive Analysis. For this, we need to carry out the following steps:

  1. Start with all the points in the dataset in one single cluster.
  2. Choose two of the most dissimilar clusters of all the possible clusters in the dataset according to any distance metric you like.
  3. Repeat step 2 until all the points in the dataset are clustered on their own.

We're going to use the cluster library in R to perform DIANA clustering.

Exercise 13: Performing DIANA Clustering

In this exercise, we will perform DIANA clustering:

  1. Put the petal length, petal width and species name in the iris_data variable:

    iris_data<-iris[,3:5]

  2. Import the cluster library:

    install.packages("cluster")

    library("cluster")

  3. Pass the iris_data dataset and the metric by which to measure dissimilarity to the diana() function:

    h.clus<-diana(iris_data, metric="euclidean")

  4. Plot the dendrogram with the pltree() function. To plot the dendrogram, pass the results of the diana function and the title of the graph to the pltree() function:

    pltree(h.clus, cex = 0.6, main = "Dendrogram of divisive clustering")

    The dendrogram of the divisive clustering results appears as follows:

    Figure 2.27: Dendrogram of divisive clustering
  5. If we divide the preceding dendrogram into three clusters, we'll see that this clustering method is also able to identify different species of flowers on its own:

    clusterCut <- cutree(h.clus, 3)

    table(clusterCut, iris_data$Species)

    The output is as follows:

    clusterCut setosa versicolor virginica

    1 50 1 0

    2 0 49 0

    3 0 0 50

In the preceding output, only one flower is misclassified into another category. This is the best performance of all the clustering algorithms we have encountered in this book when it comes to classifying flower species without knowing anything about them.

Activity 7: Performing Hierarchical Cluster Analysis on the Seeds Dataset

In this activity, we will perform hierarchical cluster analysis on the seeds dataset. We will see what the results of the clustering are when classifying three types of seeds.

Note 

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at http://archive.ics.uci.edu/ml/machine-learning-databases/00236/. We have downloaded the file, cleaned it, and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Activity07/seeds_data.txt

These steps will help you complete the activity:

  1. Download the seeds dataset from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson02/Activity07/seeds_data.txt.
  2. Perform agglomerative hierarchical clustering of the dataset and plot the dendrogram.
  3. Make a cut at k=3 and check the results of the clustering by forming a table with original labels.
  4. Perform divisive clustering on the dataset and plot the dendrogram.
  5. Make cut at k=3 and check the results of the clustering by forming a table with original labels.

    Note

    The solution for this activity can be found on page 215.

The output of this activity will be a table that shows how the results of the clustering have performed at classifying the three types of seeds. It will look as follows:

Figure 2.28: Expected table classifying the three types of seeds

Summary

Congratulations on completing the second chapter on clustering techniques! With this, we've covered all the major clustering techniques, including k-modes, DBSCAN, and both types of hierarchical clustering, and we've also looked at what connects them. We can apply these techniques to any type of dataset we may encounter. These new methods, at times, also produced better results on the same dataset that we used in the first chapter. In the next chapter, we're going to study probability distributions and their uses in exploratory data analysis.