
Chapter 6

Anomaly Detection

Learning Objectives

By the end of this chapter, you will be able to:

  • Use parametric and non-parametric methods to find outliers in univariate and multivariate data
  • Use data transformations to identify outliers in univariate and multivariate data
  • Work with Mahalanobis distances
  • Improve anomaly detection performance by incorporating a model of seasonality

In this chapter, we will have a look at different anomaly detection techniques.

Introduction

Data analysis often begins with an implicit assumption that all observations are valid, accurate, and trustworthy. But this is not always a reasonable assumption. Consider the case of credit card companies, who collect data consisting of records of charges to an individual's credit card. If they assumed that all charges were valid, they would open the door to thieves and fraudsters to take advantage of them. Instead, they examine their transaction datasets and look for anomalies – transactions that deviate from the general observed pattern. Since fraudulent transactions are not labeled, they have to use unsupervised learning to find these anomalies and prevent criminal activity.

There are many other situations in which anomaly detection is useful. For example, manufacturers may use anomaly detection methods to find defects in their products. Medical researchers may look for anomalies in otherwise regular heartbeat patterns to diagnose illnesses. IT security professionals try to find anomalous activities on servers or computers to identify malware. In each case, unsupervised learning methods can help separate the valid observations from the anomalous ones.

This chapter will cover several anomaly detection techniques. We will begin by using parametric and non-parametric methods to find outliers in univariate and multivariate data. We will then discuss using data transformations to identify outliers in univariate and multivariate data. Next, we will have a look at Mahalanobis distances, a multivariate tool for anomaly detection. We will conclude the chapter with a discussion of using regression modeling to improve anomaly detection performance by incorporating a model of seasonality, and detection of contextual and collective anomalies.

Univariate Outlier Detection

Anomaly detection is simplest in the univariate case, that is, when each observation is only one number. In this case, we might start by doing a common-sense check for anomalies by checking whether observations are missing, NULL, NA, or recorded as infinity or something that doesn't match the type of the rest of the observations. After performing this check, we can apply true unsupervised learning.

For univariate data, anomaly detection consists of looking for outliers. R's built-in boxplot function makes an initial exploratory check for outliers quite easy, as can be seen in the following exercise.

Exercise 37: Performing an Exploratory Visual Check for Outliers Using R's boxplot Function

For univariate data, anomaly detection consists of looking for outliers. R's built-in boxplot function makes an initial exploratory check for outliers quite easy, as demonstrated in this exercise. We will use a dataset called mtcars, which is built into R.

In this exercise, we will create a boxplot that you can see in Figure 6.3. A boxplot is an important type of univariate plot. Here, the thick horizontal line around 3 indicates the median value in the data. The box surrounding this median line has a lower limit at the first quartile (the 25th percentile) and an upper limit at the third quartile (the 75th percentile). The vertical dotted lines extend to the lower and upper ends of all of the non-outlier data. These dotted lines are called the whiskers of the plot, so this type of plot is sometimes called a "box and whiskers" plot. Finally, the two observations that are represented as circular points near the top of the plot are (at least, according to R) outliers.

Percentiles can also be expressed as quantiles, and we will use both terms in discussing this boxplot. A quantile is a point in the data that is greater than some fixed proportion of all data points. For example, the 0.1 quantile is an observation in the data that is greater than 10% of all observations and less than the rest. The 0.25 quantile is also called the 25th percentile or the first quartile, and it is greater than 25% of the data; likewise, the 0.75 quantile, or 75th percentile, is greater than 75% of the data. When we take an observation and find what proportion of observations it is greater than, we call that finding its quantile. When we take a quantile such as 0.25 and try to find an observation that corresponds to that quantile, we can call that taking an inverse quantile.
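
As a quick illustration of this terminology in R (a sketch using the mtcars weights from this exercise; the weight value 3.0 is just a hypothetical example), quantile() performs what we are calling an inverse quantile, while ecdf() finds an observation's quantile:

    data(mtcars)
    # Inverse quantile: the weight that 25% of the cars fall at or below
    quantile(mtcars$wt, 0.25)
    # Quantile of an observation: the proportion of weights at or below a hypothetical 3.0
    ecdf(mtcars$wt)(3.0)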

To perform an exploratory visual check for outliers using R's boxplot function, perform the following steps:

  1. To load the data, open the R console and enter the following command:

    data(mtcars)

  2. Execute the following command to view the first six rows of the dataset:

    head(mtcars)

    The output is as follows:

    Figure 6.1: Top six rows of the mtcars dataset
  3. You can find details about this dataset by executing the following in the R console:

    ?mtcars

    The documentation is as follows:

    Figure 6.2: Section of output
  4. Create a boxplot of the weights of cars as follows:

    boxplot(mtcars$wt)

    The output will look like this:

    Figure 6.3: Boxplot representing weight of cars
  5. There appear to be two observations that R has classified as outliers, both with values higher than 5. Note that these weights are measured in thousands of pounds, so these are actually weights higher than 5,000 pounds. We can identify these observations by running a simple filter on our data:

    highest<-mtcars[which(mtcars$wt>5),]

    print(highest)

    The output is as follows:

Figure 6.4: Cars with weights greater than 5,000 pounds

We can see that there are three models of car whose weights are higher than 5,000 pounds. Since we only observed two outliers, we conclude that the Cadillac Fleetwood, the car model with the third-highest weight, is not one of the outliers. This leaves the other two – the Lincoln Continental and the Chrysler Imperial – as the car models that apparently have outlier weights. These car models, when compared to the other car models, constitute anomalies. A potential next step for a researcher would be to investigate why these car models have anomalously high weights.

In Chapter 3, Probability Distributions, we discussed different distributions that datasets tend to follow. Many datasets have so-called long tails or fat tails, meaning that a disproportionate amount of observations are very far from the mean – not necessarily because they are anomalous outliers, but only because of the nature of their distribution. Our standard for defining an outlier should change if we happen to be working with a dataset that follows a fat-tailed distribution.

In the following exercise, we will transform a dataset that follows a fat-tailed distribution and observe the changes in which observations are reported as outliers.

Exercise 38: Transforming a Fat-Tailed Dataset to Improve Outlier Classification

In the following exercise, we will transform a dataset that follows a fat-tailed distribution and observe the changes in which observations are reported as outliers. We will use the rivers dataset, which comes pre-loaded into R:

  1. Load the dataset as follows:

    data(rivers)

  2. Execute the following command to view the first six observations of the dataset:

    head(rivers)

    The output is as follows:

    [1] 735 320 325 392 524 450

  3. You can see that the rivers dataset is a vector. You can find out more about it by typing this:

    ?rivers

    The documentation is as follows:

    Figure 6.5: Information of the rivers dataset
  4. Observe the distribution of outliers. First, try a boxplot of the rivers data by running the following command:

    boxplot(rivers)

    The boxplot looks like this:

    Figure 6.6: Boxplot of the rivers dataset

    You can see that this boxplot looks different from the mtcars boxplot that we looked at previously. In particular, the box and whiskers take up a smaller portion of the plot, and a large share of the plot consists of observations that R has classified as outliers. However, outliers should not be this numerous in any dataset – by definition, they are supposed to be rare. When we see a boxplot like this one, in which many outliers take up a large portion of the plot, we can reasonably conclude that our distribution is fat-tailed.

  5. In order to get a better classification of outliers, it could be helpful to transform the data. Many datasets related to nature and natural phenomena are known to follow a log-normal distribution. If the data is log-normal, then taking the logarithm of every observation yields an approximately normal (and therefore not fat-tailed) distribution. To transform the data, you can use the following command:

    log_rivers<-log(rivers)

  6. Observe the boxplot of the transformed data. Finally, we can execute the following command to see the boxplot of the transformed data:

    boxplot(log_rivers)

    The boxplot will look as follows:

Figure 6.7: Boxplot of transformed dataset

As expected, the box and whiskers take up a greater proportion of the plot, and there are fewer observations that are classified as outliers. In this case, we can use the log_rivers plot instead of the rivers plot to classify outliers.

The preceding exercise shows the importance of data preparation in anomaly detection. We can also attempt anomaly detection on a raw dataset. But sometimes, like in the example of rivers that we looked at, we can get different and better results by performing some simple data preparation. One thing to keep in mind is that there are many possible transformations of data. We have used the log transformation, but there are many others that could work. Another thing to keep in mind is that data preparation can be harmful as well as helpful: some data preparation methods, such as averaging and data smoothing, can cause us to throw out valuable information that will make our anomaly detection less effective.
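
For instance, a square-root transformation is another common choice for right-skewed data. The following sketch, using the same rivers data, lets you compare its boxplot to the log-transformed version from the previous exercise:

    # One alternative transformation to experiment with
    sqrt_rivers<-sqrt(rivers)
    boxplot(sqrt_rivers)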

So far, we have relied on R's built-in outlier detection methods, and we have done simple visual inspections of boxplots to determine which observations were outliers. In the next exercise, we will determine which observations are outliers ourselves without using R's built-in outlier classification. We will be finding quantiles of the data – specifically the 0.25 and 0.75 quantiles.

Exercise 39: Finding Outliers without Using R's Built-In boxplot Function

In this exercise, we will determine which observations are outliers ourselves without using R's built-in outlier classification. We will be finding quantiles of the data – specifically the .25 and .75 quantiles. We will use the rivers data again, which we used in the preceding exercise:

  1. Load the data by executing the following command:

    data(rivers)

  2. The interquartile range is the difference between the first quartile (the 25th percentile) and the third quartile (the 75th percentile). So, we can store the interquartile range in a variable called interquartile_range by running the following:

    interquartile_range<-unname(quantile(rivers,.75)-quantile(rivers,.25))

    In this case, we use unname because quantile returns a named vector; removing the name ensures that interquartile_range is just a plain number rather than a named vector or other data type. This will make it easier and more reliable to work with later.

  3. Use the following command to check the interquartile range:

    print(interquartile_range)

    The output is as follows:

    [1] 370

  4. The standard method in R's boxplot function is to use 1.5 times the interquartile range as a limit to how far non-outlier observations can be dispersed. Then the upper limit of non-outliers is the third quartile plus 1.5 times the interquartile range, and the lower limit of non-outliers is the first quartile minus 1.5 times the interquartile range. We can calculate this as follows:

    upper_limit<-unname(quantile(rivers,.75)+1.5*interquartile_range)

    lower_limit<-unname(quantile(rivers,.25)-1.5*interquartile_range)

  5. Our outliers are the observations that are above our upper_limit or below our lower_limit. We can determine these as follows:

    rivers[which(rivers>upper_limit | rivers<lower_limit)]

    This will output a list of observations that we classify as outliers:

    [1] 1459 1450 1243 2348 3710 2315 2533 1306 1270 1885 1770

    Note

    This exercise uses the method that is used in R's boxplot function. But with unsupervised learning, there is always flexibility about the details of the methods.

  6. Another way to look for outliers would be to use the method from the preceding steps, but with different values for the upper and lower limits. We could do this by changing the coefficient that we multiply by interquartile_range, as follows:

    upper_limit<-unname(quantile(rivers,.75)+3*interquartile_range)

    lower_limit<-unname(quantile(rivers,.25)-3*interquartile_range)

    We have changed the coefficient from 1.5 to 3. This change makes our method less likely to classify any particular observation as an outlier, because it increases the upper limit and decreases the lower limit.

In general, you can try to be creative about changing unsupervised learning methods in ways that you believe are reasonable and will lead to good results.

The method in the preceding exercise is what is called a non-parametric method. In statistics, there are some methods that are called parametric and others that are called non-parametric. Parametric methods make assumptions about the underlying distribution of data, for example, the assumption that the data follows a normal distribution. Non-parametric methods are meant to be free of these constraining assumptions. The method in the preceding exercise relies only on quantiles, so it does not make any assumptions about the distribution of the data or the parameters (such as mean and variance) that go with them. Because of this, we call it a non-parametric method. A parametric anomaly detection method, by contrast, is one that makes assumptions about the distribution of the data or its parameters (such as mean and variance). Please note, non-parametric methods and parametric methods are always looking for the same anomalies: there is no such thing as a parametric anomaly or a non-parametric anomaly, but only parametric methods and non-parametric methods. We will discuss a parametric anomaly detection method in the next exercise.

We will calculate something called a z-score. A z-score is a standardized measurement of how far an observation is from the mean. Each z-score is measured in units of standard deviations. So, a z-score of 0 means that an observation is 0 standard deviations away from the mean; in other words, it is equal to the mean. A z-score of 1 means that an observation is 1 standard deviation above the mean. A z-score of -2.5 means that an observation is 2.5 standard deviations below the mean. It is conventional in some contexts to consider observations that are more than two standard deviations away from the mean unusual or anomalous.
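
As a small illustration, using a made-up sample rather than any dataset from this chapter, z-scores can be computed directly from their definition or with R's scale() function:

    # z-scores computed from the definition
    x<-c(2,4,4,4,5,5,7,9)
    (x-mean(x))/sd(x)
    # scale() performs the same standardization
    as.numeric(scale(x))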

Exercise 40: Detecting Outliers Using a Parametric Method

In this exercise, we will investigate anomalies by looking for observations whose z-scores are greater than 2 or less than -2. We will use the rivers dataset from the previous exercises:

  1. Load the data and determine the standard deviation as follows:

    data(rivers)

    standard_deviation<-sd(rivers)

  2. Determine z-scores for every observation by calculating how many standard deviations each observation is away from the mean:

    z_scores<-(rivers-mean(rivers))/standard_deviation

  3. Determine which observations are outliers by selecting observations whose z-scores are greater than 2 or less than -2:

    outliers<-rivers[which(z_scores>2 | z_scores<(-2))]

    outliers

    The output is as follows:

    [1] 2348 3710 2315 2533 1885 1770

    In this case, we can see that there are six rivers classified as outliers – all of them with z-scores higher than 2. Here, "outlier" is perhaps a strong word for these anomalies, since a z-score of 2 is not particularly extreme. There is no strict rule about what defines an outlier, and whether to use the term will depend on the particular situation and context.

Parametric anomaly detection, as shown in this exercise, is common practice. However, its applicability depends on whether the parametric assumptions behind it are valid. In this case, we calculated a standard deviation and looked for outliers more than two standard deviations away from the mean. This practice is justified if the data comes from a Gaussian (normal) distribution. In this case, some evidence indicates that the rivers data does not come from a Gaussian distribution. For cases in which there is doubt about the distribution of the underlying data, a non-parametric or transformed method (as used in the previous exercises) could be more suitable.
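
One quick, informal way to check this assumption before trusting a parametric method is to draw a normal Q-Q plot or run a Shapiro-Wilk test. The following sketch applies both to the rivers data:

    # Informal checks of the normality assumption
    qqnorm(rivers)
    qqline(rivers)        # strong curvature away from the line suggests non-normality
    shapiro.test(rivers)  # a very small p-value is evidence against normality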

Multivariate Outlier Detection

All of the methods so far have focused on univariate outlier detection. In practice, however, purely univariate data is rare: most datasets contain observations about multiple attributes. In these cases, it is not obvious how to decide whether a point is an outlier. We could calculate a z-score on each individual dimension, but what should be done about observations whose z-scores are high on one dimension and low on another, or relatively large in one dimension and average in the other? There is no easy answer. Instead, we can calculate a multidimensional analogue of the z-score using something called the Mahalanobis distance. It measures the distance between two points (in this chapter, between an observation and the center of the data), but in special units that, just like a z-score, depend on the variance of the data. We will be better able to understand the meaning of a Mahalanobis distance after going through the following exercise.

Exercise 41: Calculating Mahalanobis Distance

In this exercise, we will learn a new measurement that will eventually help us find which individuals are outliers compared to the general population when considering both their height and their weight. The expected output of this exercise will be a Mahalanobis distance for a single data point, measuring how far from the center of the data it is.

Note

For all the exercises and activities where we are importing external CSV files or images, go to RStudio-> Session-> Set Working Directory-> To Source File Location. You can see in the console that the path is set automatically.

In a later exercise, we will find the Mahalanobis distance for all the points in the dataset. We will then be able to use these measurements to classify which individuals are outliers compared to the general population when considering both their height and their weight:

Note

This dataset is taken from the Statistics Online Computational Resource. You can find the dataset at http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson06/Exercise41-42/heightsweights.csv. We have used the first 200 rows of the data.

  1. Start by loading this in R. First, download it from our GitHub repository at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson06/Exercise41-42/heightsweights.csv. Then, save it to your computer in R's working directory. Then, load it into R as follows:

    filename<-'heightsweights.csv'

    raw<-read.csv(filename, stringsAsFactors=FALSE)

  2. We have given the dataset the name raw. You can make sure the columns have the right names as follows:

    names(raw)<-c('index','height','weight')

  3. Plot the data and observe the patterns:

    plot(raw$height,raw$weight)

    The plot appears as follows:

    Figure 6.8: Plot of height and weight

    We can see that there is great variation in both heights and weights in this sample. Note that our univariate methods will not work in a straightforward way in this data. We could calculate a standard deviation or interquartile range for height, and find outliers for height. But the outliers for height will not necessarily be outliers for weight – they might be a normal weight, or exactly what is expected for weight. Similarly, outliers for weight might have completely average or expected heights. It is not immediately obvious how we should calculate "complete outliers," or observations that are outliers in some holistic sense, taking into account both height and weight.

  4. Next, we will calculate the multidimensional equivalent of the mean, called centroid:

    centroid<-c(mean(raw$height),mean(raw$weight))

  5. Calculate the distance between any given point and the centroid. As an example, we will choose the first observation in the dataset:

    example_distance<-raw[1,c('height','weight')]-centroid

  6. Calculate the inverse of the covariance matrix of our data. First, we calculate the covariance matrix of our height and weight data:

    cov_mat<-cov(raw[,c('height','weight')])

  7. Calculate its inverse using the solve function, which calculates matrix inverses in R:

    inv_cov_mat<-solve(cov_mat)

  8. Calculate the Mahalanobis distance between our point and the centroid of our dataset:

    mahalanobis_dist<-t(matrix(as.numeric(example_distance))) %*% matrix(inv_cov_mat,nrow=2) %*% matrix(as.numeric(example_distance))

    In this case, we use %*% because it indicates matrix multiplication, which is what we want to perform, and we likewise need to turn every argument into a numeric matrix.

    The output of this exercise is a Mahalanobis distance, which is a generalization of a z-score in multiple dimensions: that is, a generalized measure of how many standard deviations each point is away from the mean. In this case, the Mahalanobis distance we found is 1.71672. Mahalanobis distances are like any type of distance measurement – 0 is the lowest possible measurement and the higher the number, the farther the distance. Only the centroid will have a measured Mahalanobis distance of 0. The advantage of Mahalanobis distances is that they are standardized in a way that takes into account the variances of each variable and makes them effective for outlier detection. Later on in this chapter, we will see how to use this type of measurement to find outliers in this multi-dimensional dataset.
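
As a cross-check – a sketch, assuming the raw, centroid, and cov_mat objects from this exercise are still in your workspace – base R's mahalanobis() function evaluates the same matrix product in a single call and should return approximately the same value:

    # Cross-check using base R's mahalanobis() function
    mahalanobis(raw[1,c('height','weight')], center=centroid, cov=cov_mat)

Because mahalanobis() also accepts a whole matrix or data frame of observations, it can compute distances for every row at once, which is essentially what the loop in the next exercise does one row at a time.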

Detecting Anomalies in Clusters

In the first two chapters, we discussed the different clustering methods. The method we are discussing now, Mahalanobis distances, could be used fruitfully in clustering applications, if you imagine that the data we are looking at is the data corresponding to one particular cluster. For example, in divisive clustering, the point with the highest Mahalanobis distance could be selected as the point to remove from a cluster. Also, the range of Mahalanobis distances could be used to express the dispersion of any given cluster.

Other Methods for Multivariate Outlier Detection

There are other methods for multivariate outlier detection, including some that are known as non-parametric methods. Non-parametric methods, like those used in some of the preceding exercises, may rely on quantiles – in other words, on the rank of each observation from largest to smallest – in order to classify outliers. Some non-parametric methods use the sums of such rankings to understand the distribution of the data. However, such methods are not common and are generally no more effective than Mahalanobis distances, so we recommend relying on the Mahalanobis distance for multivariate outlier detection.

Exercise 42: Classifying Outliers based on Comparisons of Mahalanobis Distances

In this exercise, we will use a comparison of Mahalanobis distances to classify outliers. This exercise is a continuation of the previous exercise, and it will rely on the same dataset and some of the same variables. In that exercise, we found the Mahalanobis distance for one data point; now we're going to find it for all the data points. Before executing the following exercise, you should run all of the code in the previous exercise and make sure you are familiar with the ideas presented there:

Note

This dataset is taken from the Statistics Online Computational Resource. You can find the dataset at http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson06/Exercise41-42/heightsweights.csv. We have used the first 200 rows of the data.

  1. Create a NULL variable that will hold each of our calculated distances:

    all_distances<-NULL

  2. Loop through each observation and find the Mahalanobis distance between it and the centroid of the data. The code inside this loop is taken from the previous exercise, where we learned how to calculate Mahalanobis distances:

    k<-1

    while(k<=nrow(raw)){

    the_distance<-raw[k,c('height','weight')]-centroid

    mahalanobis_dist<-t(matrix(as.numeric(the_distance))) %*% matrix(inv_cov_mat,nrow=2) %*% matrix(as.numeric(the_distance))

    all_distances<-c(all_distances,mahalanobis_dist)

    k<-k+1

    }

    After running this loop, we have a measured Mahalanobis distance for each of the points in our data.

  3. Plot all observations that have particularly high Mahalanobis distances. In this case, we will say that particularly high means the highest 10% of Mahalanobis distances. This means all Mahalanobis distances that are higher than the 0.9 quantile of Mahalanobis distances, which we select in the following code:

    plot(raw$height,raw$weight)

    points(raw$height[which(all_distances>quantile(all_distances,.9))],
           raw$weight[which(all_distances>quantile(all_distances,.9))],
           col='red',pch=19)

    The output is as follows:

Figure 6.9: Observations with high Mahalanobis distances

This plot shows solid red points for each point that is in the top 10% of Mahalanobis distances. We can see that some of them appear to be outliers both for height and for weight, and could be observed simply by performing our univariate outlier detection methods. However, many of the red points in the plot above are neither univariate outliers for height nor for weight. They are outliers only relative to the entire cloud of points, and Mahalanobis distances enable us to quantify and detect that.

Detecting Outliers in Seasonal Data

So far, we have only discussed outlier detection as a way to detect anomalies. However, anomaly detection consists of more than only outlier detection. Some anomalies cannot easily be detected as raw outliers. Next, we will look at seasonal, trended data. In this data, we want to find anomalies that occur specifically in the context of a seasonal trend or cycle.

We will use data from the expsmooth package in R.

The data we will use records the monthly number of visitors to Australia between 1985 and 2005, measured in thousands of people.

In the following exercise, we will be working with data that has a time trend. By time trend, we mean that over time, the observations tend to increase or decrease (in this case, they tend to increase). In order to detect outliers, we want to do something called de-trending. To de-trend data means to remove, as much as possible, its time trend so that we can find how much each observation deviates from the expected values.
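
As a minimal sketch of the idea, using a made-up toy series rather than the visitors data, de-trending amounts to fitting a time trend and keeping only the residuals:

    # A toy example of de-trending: fit a linear time trend, then keep the residuals
    toy<-1:20*0.5+rnorm(20)          # an artificial upward-trending series
    trend_fit<-lm(toy~seq_along(toy))
    detrended<-residuals(trend_fit)  # the series with its linear trend removed
    plot(detrended,type='o')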

Exercise 43: Performing Seasonality Modeling

In this exercise, we will attempt to model this data to establish what we should regard as the expected values of the data, and what we should regard as deviations from expectations. The expected output is a set of error values, which we will use in a future exercise to classify outliers – the observations with the largest errors will be classified as the dataset's outliers:

  1. First, load this data into R by executing the following commands:

    install.packages("expsmooth")

    library(expsmooth)

    data(visitors)

    plot(visitors)

    The output is as follows:

    Figure 6.10: Plot of monthly visitors to Australia between 1985 and 2005
  2. Check the first six observations of the data as follows:

    head(visitors)

    The output is as follows:

    May Jun Jul Aug Sep Oct

    1985 75.7 75.4 83.1 82.9 77.3 105.7

    Check the last six observations as follows:

    tail(visitors)

    The output is as follows:

           Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec

    2004                                                               479.9 593.1

    2005 462.4 501.6 504.7 409.5

    (The series ends in April 2005, so the 2004 row has values only in the Nov and Dec columns.)

  3. Since dates can be hard to work with, we can assign a numeric variable that tracks the dates in order. We will call this variable period and define it as follows:

    period<-1:length(visitors)

  4. Combine the visitors data with the period variable we just created by putting them both into one DataFrame called raw:

    raw<-data.frame(cbind(visitors,period))

  5. This step is technically supervised, rather than unsupervised, learning. In this case, we will use supervised learning, but only as an intermediate step on the way to doing unsupervised learning. To find the time trend in the data, we can run a linear regression relating the visitor numbers to the numeric time period as follows:

    timetrend<-lm(visitors~period+I(log(period)),data=raw)

  6. Next, we can obtain fitted values for this time trend, and store them as part of the raw DataFrame:

    raw$timetrend<-predict(timetrend,raw)

  7. The process of de-trending means we subtract the predicted trend that we found in Step 6. The reason we do this is that we want to find anomalies in the data. If we keep the trend in the data, something that looks like an anomaly may actually be the expected result of a trend. By removing the trend from the data, we can be sure that observed anomalies are not the result of a trend. We can accomplish de-trending as follows:

    raw$withouttimetrend<-raw$visitors-raw$timetrend

  8. We can draw a simple plot of the de-trended data as follows:

    plot(raw$withouttimetrend,type='o')

    The plot looks as follows:

    Figure 6.11: Plot of the de-trended data

    In this plot, you should notice that there is no apparent left-to-right trend in the data, showing that we have successfully de-trended it.

    Our data records the monthly number of visitors to Australia. There are variations in temperature and weather across seasons in Australia, and it is reasonable to hypothesize that tourist visits could increase or decline depending on how favorable the weather tends to be in a particular month. There could even be variations in business or diplomatic visits to Australia that relate to seasons and weather changes throughout the year. So, it is reasonable to suppose that there are yearly patterns in visits to Australia. Our next step will be to remove these yearly patterns from our data.

  9. First, we create a matrix where each of the columns contains de-trended visitor data about a separate month of the year:

    seasonsmatrix<-t(matrix(data = raw$withouttimetrend, nrow = 12))

  10. Take the mean of each column to get the mean of de-trended visitors in that particular month:

    seasons<-colMeans(seasonsmatrix, na.rm = TRUE)

  11. This gives us a vector with 12 values – one for each month of the year. Since we have 20 years of data, we repeat this vector 20 times:

    raw$seasons<-c(rep(seasons,20))

  12. Finally, we can obtain our de-trended, de-cycled data, which we will name error because random error is the only thing remaining after removing the time trends and yearly cycles in the data:

    raw$error<-raw$visitors-raw$timetrend-raw$seasons

  13. We can plot this to see what it looks like:

    plot(raw$error,type='o')

    The plot looks as follows:

    Figure 6.12: Plot of the de-trended, de-cycled data (the error)
  14. Plot all elements of seasonality modeling together. Finally, we can show all of the elements we have isolated from our seasonal data: a time trend, a yearly cycle, and random error:

    par(mfrow=c(3,1))

    plot(raw$timetrend,type='o')

    plot(raw$seasons,type='o')

    plot(raw$error,type='o')

    The output is as follows:

Figure 6.13: Plot of elements of seasonality modelling

The three plots shown together in Figure 6.13 show a decomposition of the original data. The first plot shows the time trend, or in other words the overall pattern observed in the data that each month tends to have more visitors than the previous month. Of course, this is an oversimplification of the data, because there are also seasonal patterns: certain months tend to have more or fewer visitors than other months, irrespective of the overall trend. The second plot shows these seasonal patterns, which repeat in the same way year after year. Finally, the third plot shows the error, which is all of the variation in the data that is not captured by an overall time trend or by a seasonal pattern within each year. If you take the sum of all the data presented in these three plots, you will recover the original dataset. But in these three plots, we can see these elements decomposed, and this decomposition gives us a better understanding of how time trends, seasonality, and error interact to constitute the observed data.

This exercise enabled us to create a variable called error that we added to the raw data frame. Now that we have created the error vector, we can use standard anomaly detection to find anomalies in the data frame.
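
As an aside, because visitors is a monthly time series (ts) object, base R's stl() function can produce a comparable trend/seasonal/remainder decomposition in a couple of lines. This is only a sketch of an alternative, not a replacement for the modeling we just performed:

    # A comparable decomposition using base R's stl() function
    decomposition<-stl(visitors,s.window="periodic")
    plot(decomposition)  # panels show the data, the seasonal component, the trend, and the remainder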

Exercise 44: Finding Anomalies in Seasonal Data Using a Parametric Method

In this exercise, we will do anomaly detection by finding the largest deviations from expected values:

  1. Calculate the standard deviation of the error data calculated in the previous exercise:

    stdev<-sd(raw$error)

  2. Find which data points are more than two standard deviations away from the mean value:

    high_outliers<-which(raw$error>(mean(raw$error)+2*stdev))

    low_outliers<-which(raw$error<(mean(raw$error)-2*stdev))

  3. Examine the observations that we have classified as outliers:

    raw[high_outliers,]

    The output is as follows:

    visitors period timetrend withouttimetrend seasons error

    1 75.7 1 67.18931 8.510688 -55.94655 64.45724

    130 392.7 130 305.93840 86.761602 24.35847 62.40313

    142 408.0 142 323.44067 84.559332 24.35847 60.20086

    147 397.4 147 330.70509 66.694909 11.55558 55.13933

    188 559.9 188 389.78579 170.114205 91.11673 78.99748

    Low outliers are classified as follows:

    raw[low_outliers,]

    The output is as follows:

    visitors period timetrend withouttimetrend seasons error

    80 266.8 80 231.4934 35.30663 91.11673 -55.81010

    216 321.5 216 429.7569 -108.25691 -20.46137 -87.79553

    217 260.9 217 431.1801 -170.28007 -55.94655 -114.33352

    218 308.3 218 432.6029 -124.30295 -42.40371 -81.8992

  4. We can plot these points as follows:

    plot(raw$period,raw$visitors,type='o')

    points(raw$period[high_outliers],raw$visitors[high_outliers],pch=19,col='red')

    points(raw$period[low_outliers],raw$visitors[low_outliers],pch=19,col='blue')

    The plot looks as follows:

Figure 6.14: Plot of data classified as anomalies

The plot shows all of our data and the points that we have classified as anomalies, with high anomalies plotted in red and low anomalies plotted in blue. You should note that not all of these points are immediately obvious as outliers. We have only been able to recognize them as outliers after doing the seasonality modeling exercise earlier, and determining expected values that the anomalies depart from.

Next, we will introduce two more types of anomalies: contextual and collective anomalies. In order to introduce these concepts, we will generate an artificial dataset that contains both contextual and collective anomalies.

Contextual and Collective Anomalies

At around x=1 in Figure 6.15 (which we will generate in the next exercise), you can see a single point that is quite far from its neighbors. This is an example of a contextual anomaly, and we will discuss what that means and how to detect this type of anomaly first. At around x=3.6, you can see a region where y is flat, with every value equal to 0. Zero values are not anomalous in this data, but having so many of them together is anomalous. This collection of data is therefore referred to as a collective anomaly.

We will consider contextual anomalies first. Contextual anomalies are observations that are considered anomalies only because of their neighbors. In the dataset we are about to generate, there is one point at x=1 where y=0. However, the values of y at x=0.99 and x=1.01 are close to 0.84, very far from 0 in this context. Contextual anomalies can be detected by finding observations that are anomalously distant from their neighbors, as we will see in the following exercise.

Exercise 45: Detecting Contextual Anomalies

The following exercise shows how to detect contextual anomalies in the dataset we have just introduced. Since contextual anomalies are observations that are very different from their neighbors, we need to do an explicit comparison of each observation with its neighbor. In order to do this, we calculate a first difference. A first difference is simply the value of an observation minus the value of the observation before it.
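
For example, the first differences of the values 5, 7, and 4 are 2 and -3. R's diff() function computes this directly, although in the exercise we will build the same vector by hand using indexing:

    # First differences of a short hypothetical vector
    diff(c(5,7,4))
    # [1]  2 -3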

The expected outcome of this exercise will be the observations in the dataset that are contextual anomalies:

  1. We will generate an artificial dataset that contains both contextual and collective anomalies. The dataset can be generated by running the following code in the R console:

    x<-1:round(2*pi*100+100)/100

    y<-rep(0,round(2*pi*100)+100)

    y[1:314]<-sin(x[1:314])

    y[415:728]<-sin(x[315:628])

    y[100]<-0

  2. You can see what this data looks like by plotting x and y as follows:

    plot(x,y,type='o')

    The output is as follows:

    Figure 6.15: Plot of generated dataset

    This plot shows a sine curve: a simple type of curve that begins at 0, slopes gently upward and then downward, and finally returns to 0 again. This data is artificially generated; however, we might imagine that it represents observations of temperature: low in some months, climbing in others, peaking, and then falling again. Temperature and weather data often follow a pattern that can be modeled with a trigonometric curve such as a sine or cosine. We have altered our sine curve so that it includes some anomalous data.

  3. Find the first difference for every observation at the same time as follows:

    difference_y<-y[2:length(y)]-y[1:(length(y)-1)]

  4. Create a boxplot of the first difference data:

    boxplot(difference_y)

    The resulting boxplot looks as follows:

    Figure 6.16: Boxplot of the first difference data

    This boxplot shows that nearly all first differences are very close to zero, while two outlier differences are far away from the rest. We can see from the first difference boxplot that the single high outlier is larger than 0.5.

    You may notice that there are two apparent outliers in Figure 6.16, but only one single outlier apparent in Figure 6.15. The reason for this is that Figure 6.16 shows first difference data, and the single outlier in the data leads to two large first differences: the difference between the 99th and the 100th values, and the difference between the 100th and 101st values. One outlier observation in the original data has led to two outlier observations in the first difference data.

  5. Determine which observation corresponds to this outlier by using R's useful which function:

    which(difference_y>0.5)

    which returns the value 100, indicating that it is the 100th observation that is a contextual anomaly. If we check, y[100] is equal to 0, and its neighbors are not, so we have successfully found the contextual anomaly.

Next, we will discuss collective anomalies. In our x-y plot, we pointed out the 100 observations that were all equal to 0 around x=3.64. A zero value is not in itself an anomaly in this dataset, but having 100 observations all equal to 0 is anomalous. The collection of 100 zero values together here is referred to as a collective anomaly. Collective anomalies are more difficult to detect than contextual anomalies, but we will attempt to detect them anyway in the following exercise.
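
Before the exercise, here is a quicker alternative you could also try. Because this particular collective anomaly is a run of values that are exactly zero (the flat region was assigned exactly 0 when we generated the data), base R's rle() function, which computes run-length encodings, can find it directly. This is a sketch that assumes the y vector from the previous exercise is still in your workspace:

    # Find the longest run of exact zeros using run-length encoding
    runs<-rle(y)
    zero_runs<-runs$lengths*(runs$values==0)  # run lengths, kept only for zero-valued runs
    max(zero_runs)                            # the length of the longest run of zeros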

Exercise 46: Detecting Collective Anomalies

In the following exercise, we will detect collective anomalies in the dataset we created earlier. The expected outcome of this exercise is a list of observations that constitute collective anomalies:

  1. In order to detect this kind of anomaly, we need to look for groups of observations, or neighborhoods, that contain no variation or only small variations. The following loop accomplishes this. For each observation, it records the larger of two absolute values: the value of the observation 50 periods before and the value of the observation 50 periods later. If this maximum is zero or very small, then both neighbors sit at or near zero, meaning we have found an extremely flat, zero-valued neighborhood centered on that observation – a sign of this kind of collective anomaly. Here is the loop we will use:

    changes<-NULL

    ks<-NULL

    k<-51

    while(k<(length(y)-50)){

    changes<-c(changes,max(abs(y[k-50]),abs(y[k+50])))

    ks<-c(ks,k)

    k<-k+1

    }

    This loop has created a vector called changes. Each element of this vector records how far from zero the observations 50 periods before and 50 periods after a given observation are. Especially small values of the changes vector indicate that we may have found a collective anomaly consisting of a flat, zero-valued neighborhood.

  2. Now that we have a vector measuring neighborhood changes, we can create a simple boxplot, the same as we have done in previous exercises:

    boxplot(changes)

    The output is as follows:

    Figure 6.17: Boxplot of neighborhood changes

    We can see that there are many observations classified as outliers, which shows evidence of very low neighborhood changes.

  3. We can find which observation leads to this collective anomaly as follows:

    print(ks[which(changes==min(changes))])

    The output is 364.

  4. We can verify that this is an index corresponding to the collective anomaly by checking the y value at that index:

    print(y[ks[which(changes==min(changes))]])

    The output is 0 – the value of y at this index lies in the flat, zero-valued region – providing evidence that we have found the collective anomaly.

Kernel Density

To conclude the chapter, we will discuss using kernel density estimates to perform outlier detection on a set of blood samples. Kernel density estimation provides a natural way to test whether a particular set of blood results is anomalous, even without specialized knowledge of the particular blood test being used, or of medicine in general.

Suppose that you are working at a doctor's office and your boss asks you to do a new type of blood test on patients. Your boss wants to know if any of the patients have anomalous test results. However, you are not familiar with this new blood test and you do not know what normal and anomalous results are supposed to look like. All you have is a record of previous blood tests that your boss assures you are from normal patients. Suppose that these tests had the following results:

normal_results<-c(100,95,106,92,109,190,210,201,198)

Now suppose that your boss wants you to find anomalies (if any) in the following new blood test results:

new_results<-c(98,35,270,140,200)

In kernel density estimation, we model a dataset with a collection of kernels. For our purposes, the kernels will just be normal distributions with different means and variances. You can learn more about the normal distribution in Chapter 3, Probability Distributions. We will suppose that our data has a density that is captured by a sum of normal distributions. Then, any data that appears to be inconsistent with the sum of normal distributions that we specify can be classified as an anomaly.

When we calculate a kernel density estimate, we have to specify something called the bandwidth. Here, the bandwidth is a measure of the variance of the normal distributions that we are using to model our data. If we specify a high bandwidth, we are assuming that the data is widely dispersed, and if we specify a low bandwidth, we are assuming that the data is largely contained in a relatively narrow range. This should become clearer to you as you work through the following exercise.
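
To get a feel for what the bandwidth does before starting the exercise, you can plot the same normal_results data with a deliberately low and a deliberately high bandwidth. The values 5 and 50 below are arbitrary choices for illustration:

    # Comparing a low and a high bandwidth on the same data
    normal_results<-c(100,95,106,92,109,190,210,201,198)
    plot(density(normal_results,bw=5))   # low bandwidth: a narrow, spiky estimate
    plot(density(normal_results,bw=50))  # high bandwidth: a wide, smooth estimate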

Exercise 47: Finding Anomalies Using Kernel Density Estimation

In this exercise, we will cover how to find anomalies using kernel density estimation. The expected output of this exercise is a list of observations that constitute anomalies according to a kernel density estimation method:

  1. Specify our data and our parameters. For our data, we will use the normal results and new results we specified earlier:

    normal_results<-c(100,95,106,92,109,190,210,201,198)

    new_results<-c(98,35,270,140,200)

  2. Kernel density estimation relies on a parameter called the bandwidth. The choice of bandwidth will depend on your own preferences and on the original data. If you are not sure, you can choose a bandwidth that is about the same size as the standard deviation of your data. In this case, we will choose 25, which is somewhat lower than the standard deviation of our data. You can set the bandwidth at 25 as follows:

    bandwidth<-25

    Note

    You can feel free to experiment with other bandwidth values if you like.

  3. Use R's density function to obtain a kernel density estimate:

    our_estimate<-density(normal_results, bw=bandwidth)

  4. Plot the density estimate:

    plot(our_estimate)

    The resulting plot appears as follows:

    Figure 6.18: Plot of density estimate
  5. The shape of the graph in Figure 6.18 represents the distribution of our original data. You can learn more about probability distributions in Chapter 3, Probability Distributions. This distribution is called bimodal, which means that the data appears to cluster mainly around two points: most of the observations are close to either 100 or 200. We will interpret an observation as an anomaly if its corresponding point on the distribution shown in Figure 6.18 shows that it is particularly unlikely.
  6. We can obtain density estimates for each of our new results. For each of the observations in new_results, we will calculate the density estimate according to the kernel illustrated in Figure 6.18. We will store each of these density estimates in new variables as follows:

    new_density_1<-density(normal_results,bw=25,n=1,from=new_results[1],to=new_results[1])$y

    new_density_2<-density(normal_results,bw=25,n=1,from=new_results[2],to=new_results[2])$y

    new_density_3<-density(normal_results,bw=25,n=1,from=new_results[3],to=new_results[3])$y

    new_density_4<-density(normal_results,bw=25,n=1,from=new_results[4],to=new_results[4])$y

    new_density_5<-density(normal_results,bw=25,n=1,from=new_results[5],to=new_results[5])$y

    The output will be the heights on the density curve plotted in Step 4 that correspond to each of the x-values specified in the new_results vector. We can observe each of these values by printing them. Print new_density_1 as follows:

    print(new_density_1)

    The output is as follows:

    [1] 0.00854745

    Print new_density_2 as follows:

    print(new_density_2)

    The output is as follows:

    [1] 0.0003474778

    Print new_density_3 as follows:

    print(new_density_3)

    The output is as follows:

    [1] 0.0001787185

    Print new_density_4 as follows:

    print(new_density_4)

    The output is as follows:

    [1] 0.003143966

    Print new_density_5 as follows:

    print(new_density_5)

    The output is as follows:

    [1] 0.006817359

  7. We can plot these points on our density plot as follows:

    plot(our_estimate)

    points(new_results[1],new_density_1,col='red',pch=19)

    points(new_results[2],new_density_2,col='red',pch=19)

    points(new_results[3],new_density_3,col='red',pch=19)

    points(new_results[4],new_density_4,col='red',pch=19)

    points(new_results[5],new_density_5,col='red',pch=19)

  8. The result of executing these commands is the following plot:
    Figure 6.19: Points mapped on density plot

    This shows the same density plot we examined earlier, with five points added – one for each of the new results that we are examining to find anomalies. Each of the points shows the relative likelihood of each particular observation: points that have a higher estimated density value are more likely to be observed, while points that have a lower estimated density value are less likely to be observed and are therefore more likely to be anomalies. There is no strict rule about which observations are anomalies and which are not, but in general the observations whose estimated densities are closest to zero are most likely to be anomalies.

  9. Interpret the results to classify anomalies.

    Each of the density values appears quite close to zero. However, some are much closer to zero than others, and the closer a density estimate is to zero, the more confident we are that the observation is an anomaly. In this case, we will choose a threshold value and say that blood test results whose kernel density estimates are lower than the threshold are anomalous, and results whose kernel density estimates are higher than the threshold are not. It seems that 0.001 is a reasonable threshold, since it separates the higher-density values from the lowest-density values – the observations whose density values are lower than 0.001 appear on the plot shown in Step 8 to be very unlikely. So, we will classify the blood test results 35 and 270 as anomalous, and all of the others as reasonable, because we saw in Step 6 that 35 and 270 corresponded to density estimates lower than 0.001. (A short snippet applying this threshold programmatically is shown after this exercise.)

    So, the final output of our exercise is the conclusion that the blood test results 35 and 270 are anomalies, while the rest of the blood test results are reasonable.
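
Here is the short snippet mentioned in Step 9 – a sketch that applies the 0.001 threshold programmatically, assuming the new_results vector and the five new_density_* values from Step 6 are still in your workspace:

    # Gather the density estimates and apply the threshold
    densities<-c(new_density_1,new_density_2,new_density_3,new_density_4,new_density_5)
    new_results[densities<0.001]  # returns the anomalous results, 35 and 270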

Continuing in Your Studies of Anomaly Detection

If you continue to learn about anomaly detection, you will find that there are a huge number of different anomaly detection techniques. However, all of them follow the same basic pattern that we have seen in our seasonality modeling example. Specifically, advanced anomaly detection usually consists of the following:

  • Specifying a model of what is expected
  • Calculating the difference between what is expected based on the model, and what is observed – this is called the error
  • Using univariate outlier detection on the error vector to determine anomalies

The biggest difficulty comes in the first step: specifying a useful model of what is expected. In this case, we specified a seasonal model. In other cases, it will be necessary to specify models that take into account multi-dimensional image data, audio recordings, or complicated economic indicators or other complex attributes. The right way to set up models for those cases will require study of the particular domains from which the data is drawn. However, in each case, anomaly detection will follow the three-step bulleted pattern listed earlier.
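
As a compact illustration of this pattern, the three steps can be wrapped in a small helper function. This is only a sketch: the model of what is expected is whatever you supply as the expected argument, and the function name is our own invention:

    # A minimal sketch of the three-step anomaly detection pattern
    detect_anomalies<-function(observed,expected,threshold=2){
      error<-observed-expected          # the difference between what is observed and what is expected
      z<-(error-mean(error))/sd(error)  # standardize the errors
      which(abs(z)>threshold)           # univariate outlier detection on the error vector
    }
    # For example, with the seasonal model from Exercise 43:
    # detect_anomalies(raw$visitors, raw$timetrend + raw$seasons)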

Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method

The aim of this activity is to find univariate anomalies using both a parametric method and a non-parametric method. For this activity, we will use a dataset that is built into R, called islands. If you execute ?islands in R, you can find the documentation of this dataset. The documentation notes that this dataset contains the areas, in thousands of square miles, of the landmasses on Earth that exceed 10,000 square miles.

This might be a dataset that would be of interest to a geologist who studied the formation of landmasses on earth. According to scientists, there are numerous ways that islands can form: sometimes through volcanic activity, sometimes through coral growth, and sometimes in other ways. A geologist might be interested in finding islands that are anomalously large or small – these islands might be the best places to do further research to try to understand the natural processes of island formation. In this activity, we will find anomalies in the islands dataset.

These steps will help you complete the activity:

  1. Load the data called islands in R's built-in datasets and create a boxplot of this data. What do you notice about the distribution of data and outliers?
  2. Transform the islands data with a logarithm transformation and create a boxplot of this transformed data. How has this changed the data points that are classified as outliers?
  3. Calculate the outliers in the islands dataset manually using a non-parametric method (the method that defines outliers as points lying more than 1.5 times the interquartile range above or below the first and third quartiles, respectively). Do the same for the log transformation of the islands data.
  4. Classify outliers in the islands dataset using a parametric method by calculating the mean and standard deviation of the data and classifying outliers as observations that are more than two standard deviations away from the mean. Do the same for the logarithm transformation of the islands data.
  5. Compare the results of each of these outlier detection methods.

    Note

    The solution for this activity can be found on page 234.

Activity 15: Using Mahalanobis Distance to Find Anomalies

In the following activity, we will examine data related to the speed and stopping distance of cars. This data might be useful to an automotive engineer who is trying to research which cars perform the best. Cars that have especially low stopping distance relative to their speeds can be used as examples of high-performing cars, while those that have anomalously high stopping distance relative to their speeds may be candidates for further research to find areas for improvement. In this activity, we will find anomalies based on both speed and stopping distance. Because we are working with multivariate data, it makes sense to use Mahalanobis distance to find anomalies.

These steps will help you complete this activity:

  1. Load the cars dataset from R's built-in datasets. This dataset contains the speed of some very old cars together with the distance required to stop at that speed. Plot the data.
  2. Calculate the centroid of this data, and calculate the Mahalanobis distance between each point and the centroid. Find the outliers (points whose Mahalanobis distance from the centroid is the highest). Draw a plot showing all the observations and the outliers plotted together.

    The plot will look as follows:

Figure 6.20: Plot with outliers marked

Note

The solution for this activity is on page 237.

Summary

In this chapter, we have discussed anomaly detection. We began with univariate anomaly detection, including non-parametric and parametric approaches. We discussed performing data transformations to obtain better classifications of outliers. We then discussed multivariate anomaly detection using Mahalanobis distances. We completed more advanced exercises to classify anomalies related to seasonally varying data. We discussed collective and contextual anomalies, and concluded the chapter with a discussion of how to use kernel density estimation in anomaly detection.