
Chapter 3

Probability Distributions

Learning Objectives

By the end of this chapter, you will be able to:

  • Generate different distributions in R
  • Estimate probability distribution functions for new datasets in R
  • Compare the closeness of two different samples of the same distribution or different distributions

In this chapter, we will learn how to use probability distributions as a form of unsupervised learning.

Introduction

In this chapter, we're going to study another aspect of unsupervised learning: probability distributions. Probability distributions are part of classical statistics and are covered in many mathematical textbooks and courses. With the advent of big data, probability distributions have also come into use in exploratory data analysis and other modeling applications. So, in this chapter, we're going to study how to use probability distributions in unsupervised learning.

Basic Terminology of Probability Distributions

There are two families of methods in statistics: parametric and non-parametric methods. Non-parametric methods are meant to deal with data that could take any shape. Parametric methods, by contrast, make assumptions about the particular shape that data takes on. These assumptions are often encoded as parameters. The following are the two main parameters that you should be aware of:

  • Mean: This is the average of all values in the distribution.
  • Standard Deviation: This is the measure of the spread of values around the mean of a distribution.

Most of the parametric methods in statistics depend in some way on those two parameters. The parametric distributions that we're going to study in this chapter are these:

  • Uniform distributions
  • Normal distributions
  • Log-normal distributions
  • Binomial distributions
  • Poisson distributions
  • Pareto distributions

Uniform Distribution

In the uniform distribution, all values within an interval, say [a, b], are equally probable. Mathematically, it's defined as follows:

Figure 3.1: Mathematical formula for a uniform distribution
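In standard notation, the density shown in the figure has the following form, where a and b are the endpoints of the interval:

f(x) = 1/(b - a) for a ≤ x ≤ b, and f(x) = 0 otherwise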

It can be plotted on a graph as follows:

Figure 3.2: Graph of a uniform distribution

The uniform distribution is the simplest of the parametric distributions. There are many processes in the real world that follow uniform sampling:

  • If it's raining in a very large area, the distribution of where raindrops will fall can be assumed to be uniform in a small area.
  • The last digit of a social security number should follow a uniform distribution for any subset of people, such as for all the CEOs of start-ups in California.
  • Uniform random sampling is very important for generating data for experiments and running controlled trials.

Exercise 14: Generating and Plotting Uniform Samples in R

In this exercise, we will generate uniform samples and plot them. To do this, perform the following steps:

  1. Use the built-in runif R function to generate uniform samples. Firstly, enter the number of samples you want to generate. Here we're generating 1,000 samples. Then, enter the min value and max value:

    uni<-runif(1000, min=0, max=100)

  2. After storing the generated random numbers in the uni variable, we'll plot their values versus their index:

    plot(uni)

    The output is as follows:

    Figure 3.3: Uniform distribution

    As you can see, the points are scattered almost evenly across the whole range. We can also plot a histogram of this to get a clearer picture of the distribution.

  3. We'll use the hist() function of R to plot a histogram of the generated sample:

    hist(uni)

    The output is as follows:

Figure 3.4: Histogram of the distribution

As you can see, it's not exactly flat, as we envisioned it previously. It's more or less uniform, but not exactly uniform, because it was randomly generated. Each time we generate a new sample, it will resemble this histogram, and most probably won't be exactly flat, because of the noise that comes with all random sampling methods.
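The flatness improves as the sample grows. As a quick check (not part of the exercise), you can regenerate the histogram with a much larger sample:

    # With many more samples, the histogram gets much closer to the ideal flat shape
    uni_big <- runif(100000, min = 0, max = 100)
    hist(uni_big)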

Normal Distribution

The normal distribution is a type of parametric distribution that is governed by two parameters: the mean and the standard deviation. It's symmetric about the mean, and most values are near the mean. Its curve is also known as a bell curve:

Figure 3.5: Approximate representation of a bell curve, typical with normally distributed data

The normal distribution is defined by the following equation:

Figure 3.6: Equation for the normal distribution
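In standard notation, the equation shown in the figure is the normal probability density function:

f(x) = (1 / (σ * sqrt(2 * π))) * exp(-(x - μ)^2 / (2 * σ^2))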

Here, μ is the mean and σ is the standard deviation.

The normal distribution is a very common type of distribution in the real world. The following are some examples of the normal distribution:

  • The height of NBA players is approximately normally distributed.
  • The scores of students in a class could have a distribution that is very close to the normal distribution.
  • The Nile's yearly flow volume is normally distributed.

Now we're going to generate and plot a normal distribution in R.

Exercise 15: Generating and Plotting a Normal Distribution in R

In this exercise, we will generate a normal distribution to model the test scores (out of 100) of 1,000 school pupils and plot them. To do this, perform the following steps:

  1. Generate a normal distribution by entering the number of samples, the mean, and the standard deviation in the rnorm function in R:

    nor<-rnorm(1000,mean=50, sd= 15)

  2. Plot the generated numbers against their index:

    plot(nor)

    The output is as follows:

    Figure 3.7: Normal distribution

    As you can see here, most values are around the mean of 50, and as we move away from 50, the number of points starts decreasing. This distribution will become clearer with the histogram in the next step.

  3. Plot a histogram of the normal distribution with the hist() function:

    hist(nor)

    The output is as follows:

Figure 3.8: Normal distribution histogram

You can see that this shape very much resembles the bell curve of the normal distribution.

Skew and Kurtosis

As we have seen, many of the distributions you'll see in practice are assumed to be normal distributions. But not every distribution is a normal distribution. To measure the degree to which a distribution deviates from a standard normal distribution, we use two parameters:

  • Skew
  • Kurtosis

The skew of a distribution is a measure of its asymmetry compared to the standard normal distribution. In a dataset with high skew, the mean and the mode will differ from each other. The skew can be of two types: positive and negative:

Figure 3.9: Negative skew and positive skew

Negative skew is when there is a long tail of values on the left-hand side of the mean, and positive skew is when there is a long tail of values on the right-hand side of the mean. Skewness can also be expressed with the following formula:

Figure 3.10: Mathematical formula for skewness
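In standard notation, the formula shown in the figure is:

skewness = E[((X - μ)/σ)^3]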

Here, E denotes the expected value (the mean) of ((X - μ)/σ)^3, where μ and σ are the mean and standard deviation of the distribution, respectively.

Kurtosis is a measure of the fatness of the tails of a distribution compared to the normal distribution. Unlike skewness, kurtosis does not introduce asymmetry into a distribution. An illustration of this is provided here:

Figure 3.11: Kurtosis demonstration

Kurtosis can also be expressed with the following formula:

Figure 3.12: Mathematical formula for Kurtosis
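In standard notation, the formula shown in the figure is:

kurtosis = E[((X - μ)/σ)^4]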

Here, E denotes the expected (average) value of ((X - μ)/σ)^4, where μ and σ are the mean and standard deviation of the distribution, respectively. A standard normal distribution has a skew of 0 and a kurtosis measure equal to 3. Because normal distributions are very common, we sometimes just measure excess kurtosis, which is this:

Kexcess = K - 3

So, a normal distribution has excess kurtosis equal to 0.
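If you want to compute these measures for a sample in R, one option is the skewness() and kurtosis() functions from the moments package (assuming it is installed); note that kurtosis() in that package reports plain kurtosis, so excess kurtosis is obtained by subtracting 3:

    install.packages("moments")   # only needed once
    library(moments)
    nor <- rnorm(1000, mean = 50, sd = 15)
    skewness(nor)       # should be close to 0 for a normal sample
    kurtosis(nor) - 3   # excess kurtosis, should also be close to 0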

Log-Normal Distributions

Log-normal distributions are distributions of values whose logarithms are normally distributed. If we plot a log-normal distribution on a log scale, it looks just like a normal distribution, but if we plot it on a linear scale, it shows very strong positive skewness:

Figure 3.13: Log-normal distribution

To show that the log-normal distribution is a normal distribution on a log scale, we're going to do an exercise.

Log-normal distributions are used in the field of financial risk management to model stock prices. As the growth factor is assumed to be normally distributed, the prices of stock can be modeled log-normally. This distribution is also used in calculations related to option pricing, including value at risk (VaR).
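As a side note, R can also generate log-normal samples directly with the built-in rlnorm() function, where meanlog and sdlog are the mean and standard deviation of the underlying normal distribution; the next exercise builds the same thing by hand to make the relationship explicit:

    # Direct log-normal sampling, equivalent in spirit to exponentiating rnorm() values
    lnor_direct <- rlnorm(1000, meanlog = 5, sdlog = 1)
    hist(lnor_direct, breaks = 100)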

Exercise 16: Generating a Log-Normal Distribution from a Normal Distribution

In this exercise, we will generate a log-normal distribution from a normal distribution. To do this, perform the following steps:

  1. Generate a normal distribution and store the values in a variable:

    nor<-rnorm(1000,mean=5, sd= 1)

  2. Plot a histogram of the normal distribution with 100 different bins:

    hist(nor,breaks = 100)

    The output is as follows:

    Figure 3.14: Normal distribution with a mean of 5 and a standard deviation of 1
  3. Create a vector that will store 1,000 values for a log-normal distribution:

    lnor <- vector("list", 1000)

  4. Enter the exponentials of the normal values into the lnor vector. The exponential function is the inverse of the log function:

    for (x in 1:1000){

    lnor[x]=exp(nor[x])

    }

  5. Plot a histogram of lnor:

    hist(as.integer(lnor), breaks = 200)

    The output is as follows:

Figure 3.15: Log-normal distribution

Notice how the preceding figure looks like a log-normal distribution plot and that this plot is generated from a normal distribution. If we plot a new graph after taking the log of values in the preceding graph, then it'll be a normal distribution again.
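As a side note, the loop in step 4 can be replaced by R's vectorized exp() call; this short sketch (using the nor vector from step 1) produces the same kind of plot without the intermediate list:

    # Vectorized alternative to the loop: exponentiate the whole vector at once
    lnor_vec <- exp(nor)
    hist(lnor_vec, breaks = 200)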

The Binomial Distribution

The binomial distribution is a discrete distribution, as opposed to the normal and uniform distributions, which are continuous in nature. The binomial distribution is used to model the probability of multiple events occurring together when there are two possibilities for each event. One example of where the binomial distribution can be applied is in finding the probability of heads coming up on all three coins when we toss three coins together.

The mean and variance of a binomial distribution are n*p and n*p*(1-p) respectively, where p is the probability of success and n is the number of trials. The binomial distribution is symmetric when p = 0.5. When p is less than 0.5, it's skewed more toward the right, and it's skewed more toward the left when p is greater than 0.5.

The formula for the binomial distribution is as follows:

P(x) = n!/((n-x)!x!)*(p^x)*((1-p)^(n-x))

Here, n is the total number of trials, x is the number of successes, and p is the probability of success.
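You can check the mean and variance formulas empirically with R's rbinom() function; as a small sketch, with n = 50 trials and p = 0.5 the mean should be close to 25 and the variance close to 12.5:

    # Simulate many experiments of 50 coin tosses each and check the moments
    heads <- rbinom(10000, size = 50, prob = 0.5)
    mean(heads)   # should be close to n*p = 25
    var(heads)    # should be close to n*p*(1-p) = 12.5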

Exercise 17: Generating a Binomial Distribution

In this exercise, we will generate a binomial distribution to model the number of times that heads will come up when tossing a coin 50 times. To do this, perform the following steps:

  1. To generate a binomial distribution, we'll first need a sequence of the integers from 0 to 50 as an index, which will act as the numbers of successes we are interested in modeling. This would be x in the formula in the previous section:

    s <- seq(0,50,by = 1)

  2. Now we will pass s as a parameter to the dbinom() function in R, which will calculate probabilities for every value in the s variable and store them in a new probs variable. Firstly, in the function, we enter the sequence that encodes the range of the number of successes. Then, we enter the number of trials, and then we enter the probability of success:

    probs <- dbinom(s,50,0.5)

  3. In the final step, we plot s and probs together:

    plot(s,probs)

    The output is as follows:

Figure 3.16: Binomial distribution

Here, the x axis shows the number of heads we are interested in, and the y axis shows the probability of getting exactly that number of heads in 50 trials. When we toss a coin 50 times, the most probable outcome is that we will get 25 heads and 25 tails, but the probability of getting all 50 heads or all 50 tails is very low. This is shown clearly by the preceding graph.
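You can confirm these probabilities directly with dbinom(); the values in the comments are approximate, since they come from the formula rather than from random sampling:

    dbinom(25, 50, 0.5)         # probability of exactly 25 heads, roughly 0.11
    dbinom(50, 50, 0.5)         # probability of all 50 heads, about 9e-16
    sum(dbinom(0:50, 50, 0.5))  # the probabilities sum to 1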

The Poisson Distribution

The Poisson distribution is another type of discrete distribution. It is used to model the number of occurrences of an event in a given time period, given the mean number of occurrences of that event per time period.

It's formulated by the following equation:

Figure 3.17: Formula for the Poisson distribution
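In standard notation, the formula shown in the figure is:

P(X = x) = (lambda^x * e^(-lambda)) / x!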

Here, lambda is the mean number of occurrences in the given time period, e is Euler's constant, and x is the number of events for which you want to find the probability. For example, given the mean number of new people observed arriving at an event per minute so far, the Poisson distribution can be used to calculate the probability of any particular number of people arriving at that event in the next minute.

A Poisson distribution plot looks like this:

Figure 3.18: Plot of the Poisson distribution

Here, we can see how the probabilities of the different values of x vary with respect to lambda.
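R provides dpois() and rpois() for the Poisson distribution. As a small example (the numbers here are illustrative), if on average 3 people arrive per minute (lambda = 3), the probability of exactly 5 people arriving in the next minute is:

    dpois(5, lambda = 3)                              # roughly 0.10
    plot(0:15, dpois(0:15, lambda = 3), type = "h")   # the distribution for lambda = 3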

The Pareto Distribution

The Pareto distribution is a power-law probability distribution. This distribution was devised to model the 80:20 rule that is observed in many real-world situations. Some fascinating situations that follow the 80:20 rule are listed here:

  • Approximately 80% of the world's wealth is owned by 20% of people.
  • In business management, it was found that for most companies, 80% of their revenue was generated by 20% of their customers.
  • It is said that 20% of all drivers cause 80% of all accidents.

There are many other real-world observations that can be modeled by the Pareto distribution. The mathematical formula of the Pareto distribution is as follows:

Figure 3.19: Mathematical formula of the Pareto distribution
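In standard notation, the density of the (Type I) Pareto distribution is usually written as follows, where x_m is the minimum possible value and α is the shape parameter:

f(x) = (α * x_m^α) / x^(α + 1) for x ≥ x_m

Base R has no built-in Pareto generator, but a minimal sketch using inverse-transform sampling (with an assumed shape of α = 2 and a minimum of 1) looks like this:

    # Inverse-transform sampling from a Pareto distribution
    alpha <- 2                         # shape parameter (assumed for illustration)
    x_m <- 1                           # minimum possible value
    u <- runif(1000)
    pareto_sample <- x_m / (u^(1 / alpha))
    hist(pareto_sample, breaks = 100)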

Introduction to Kernel Density Estimation

So far, we've studied parametric distributions in this chapter, but in real life, all distributions are either approximations of parametric distributions or don't resemble any parametric distributions at all. In such cases, we use a technique called Kernel Density Estimation, or KDE, to estimate their probability distributions.

KDE is used to estimate the probability density function of a distribution or random variable, given a finite set of points from that distribution, using something called a kernel. This will become clearer as you continue through the chapter.

KDE Algorithm

Despite its heavy-sounding name, KDE is a very simple two-step process:

  1. Choosing a kernel
  2. Placing the kernel on data points and taking the sum of kernels

A kernel is a non-negative symmetric function that is used to model distributions. In KDE, the normal (Gaussian) density is the most commonly used kernel function. Kernel functions come in different types, and they are closely related to the distributions we studied earlier in the chapter. Some of the types are summarized here, with a short code sketch after this list:

  • In a uniform kernel, all values in a range are given equal weightage. This is represented as follows:
Figure 3.20: Representation of a uniform kernel function
  • In a triangular kernel, weightage increases linearly as values move toward the middle of the range. This is represented as follows:
Figure 3.21: Representation of a triangular kernel function
  • In a Gaussian kernel, weightage is distributed normally. This is represented as follows:
Figure 3.22: Representation of a Gaussian kernel function
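For reference, these three kernels can be written as simple R functions of a standardized distance u (the distance from the data point divided by the bandwidth). This is just a sketch of the standard textbook forms, not code from any particular package:

    # Standard kernel shapes evaluated at a scaled distance u
    uniform_kernel  <- function(u) ifelse(abs(u) <= 1, 0.5, 0)        # equal weight on [-1, 1]
    triangle_kernel <- function(u) ifelse(abs(u) <= 1, 1 - abs(u), 0) # weight rises linearly toward the middle
    gaussian_kernel <- function(u) dnorm(u)                           # weight falls off as a bell curve

    u <- seq(-3, 3, by = 0.01)
    plot(u, gaussian_kernel(u), type = "l")
    lines(u, uniform_kernel(u), lty = 2)
    lines(u, triangle_kernel(u), lty = 3)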

Along with the kernel, in the first step, we have to choose another parameter called the bandwidth of the kernel. Bandwidth is the parameter that affects the smoothness of the kernel. Choosing the right bandwidth is very important, even more important than choosing the right kernel. We'll look at an example here.

Exercise 18: Visualizing and Understanding KDE

Let's suppose we have a distribution of five different points (1, 2, 3, 4, and 5). Let's visualize and understand KDE using this example:

  1. Store the vector of the five points in a variable:

    x<- c(1,2,3,4,5)

  2. Plot the points:

    y<-c(0,0,0,0,0)

    plot(x,y)

    The output is as follows:

    Figure 3.23: Plot of the five points
  3. Install the kdensity package, if you don't have it already, and import it:

    install.packages("kdensity")

    library('kdensity')

  4. Compute the kernel density with the kdensity() function. Enter the distribution, x, and the bandwidth parameter as .35. The kernel is gaussian by default:

    dist <- kdensity(x, bw=.35)

  5. Plot the KDE as follows:

    plot(dist)

    The output is as follows:

    Figure 3.24: Plot of the Gaussian kernel

    This is the final output of KDE. To produce it, the function assumed that there was a Gaussian kernel centered on every point (1, 2, 3, 4, and 5) and summed these kernels together to get this plot. The following figure will make it clearer:

    Note

    This graph is for illustration purposes rather than for generation in R.

    Figure 3.25: Gaussian kernel plotted on each point

    As you can see in the preceding figure, a Gaussian kernel was plotted on each one of the points and then all the kernels were summed to get the final curve.

    Now, what if we were to change the bandwidth to 0.5 instead of 0.35?

  6. Change the bandwidth to 0.5 in the kdensity() function and plot the kdensity plot again:

    dist <- kdensity(x, bw=.5)

    plot(dist)

    The output is as follows:

Figure 3.26: Plot of the Gaussian kernel with a bandwidth of .5

You can see that the density estimate is much smoother now. The following kernels were used:

Figure 3.27: Gaussian kernel plotted on each point

Note

The graph is for illustration purposes rather than for generation in R.

This time, the kernels are much wider.

If we were given a sufficient number of points for estimation, the choice of kernel wouldn't change the shape of the final KDE as much as the choice of the bandwidth parameter. So, choosing the ideal bandwidth parameter is an important step. There are many techniques used to select the ideal bandwidth. Studying them is beyond the scope of this book, but R libraries can take care of selecting a suitable bandwidth on their own. We'll study this in the next exercise.
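For example, base R's density() function defaults to Silverman's rule of thumb, which is also exposed directly as bw.nrd0(); this is just one common automatic bandwidth choice:

    # Silverman's rule-of-thumb bandwidth for the five-point example
    x <- c(1, 2, 3, 4, 5)
    bw.nrd0(x)        # the bandwidth density() would pick by default
    plot(density(x))  # KDE with the automatically chosen bandwidth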

Exercise 19: Studying the Effect of Changing Kernels on a Distribution

In this exercise, we'll generate two normal distributions with different standard deviations and means, and combine them both to generate their combined KDE:

  1. Generate two different normal distributions and store them in two variables:

    y1 <- rnorm(100,mean = 0, sd = 1)

    y2<-rnorm(100, mean = 3, sd=.2)

  2. Combine the generated distributions and plot them:

    y3<-c(y1,y2)

    plot(y3)

    The output is as follows:

    Figure 3.28: Plot of combined distributions

    You can see there are two different distributions with different means and spreads (standard deviations).

  3. Plot a histogram of y3 for reference:

    hist(y3)

    The output is as follows:

    Figure 3.29: Histogram of the resultant distribution
  4. Generate and plot the KDE of y3 with a gaussian kernel:

    dist<-kdensity(y3,kernel = "gaussian")

    plot(dist)

    The output is as follows:

    Figure 3.30: Plot of Gaussian kernel density

    In the preceding plot, we used a Gaussian kernel and the bandwidth was selected automatically by the function. In this distribution, we have 200 points, which should be enough to generate a robust KDE plot, such that changing the kernel type won't produce a significant difference in the final KDE plot. In the next step, let's change the kernel and look at the final plot.

  5. Generate and plot the KDE with the triangular kernel:

    dist<-kdensity(y3,kernel = "triangular")

    plot(dist)

    The output is as follows:

Figure 3.31: KDE with triangular kernel

Both plots with different kernels look almost identical. So, the choice of bandwidth is much more important than the choice of kernel. In this exercise, the bandwidth was chosen automatically by the kdensity library of R.

Activity 8: Finding the Standard Distribution Closest to the Distribution of Variables of the Iris Dataset

In this activity, we will find the standard distribution closest to the distribution of variables of the Iris dataset for the setosa species. These steps will help you complete the activity:

  1. Load the Iris dataset.
  2. Select rows corresponding to the setosa species only.
  3. Plot the distribution generated by the kdensity function for sepal length and sepal width.

    Note

    The solution for this activity can be found on page 218.

The final outcome of this activity will be a plot of KDE for sepal width, as follows:

Figure 3.32: Expected plot of the KDE for sepal width

Introduction to the Kolmogorov-Smirnov Test

Now that we've learned how to estimate the probability density functions of datasets that don't closely resemble standard distributions, we'll learn how to perform some tests to distinguish these non-standard distributions from each other.

Sometimes, we're given multiple observed samples of data and we want to find out whether those samples belong to the same distribution or not. In the case of standard distributions, we have multiple tests, such as Student's t-test and z-test, to determine this. For non-standard distributions, or where we don't know the type of distribution, we use the Kolmogorov-Smirnov test. To understand the Kolmogorov-Smirnov test, you first need to understand a few terms:

  • Cumulative Distribution Function (CDF): This is a function whose value gives the probability of a random variable being less than or equal to the argument of the function (a short R example follows this list).
  • Null Hypothesis: In hypothesis testing, the null hypothesis states that there is no significant difference between the observed samples. Our aim in hypothesis testing is to determine whether there is enough evidence to reject the null hypothesis.
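In R, the empirical CDF of a sample is available through the ecdf() function, which returns a function that you can evaluate at any value. A tiny example:

    F <- ecdf(c(1, 2, 3, 4))
    F(2.5)   # proportion of sample values less than or equal to 2.5, which is 0.5 here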

The Kolmogorov-Smirnov Test Algorithm

In a Kolmogorov-Smirnov test, we perform the following steps:

  1. Generate the empirical CDF of each sample.
  2. Specify one of the distributions as the parent distribution.
  3. Plot the CDFs of the two samples together in the same plot.
  4. Find the greatest vertical difference between the two CDFs.
  5. Calculate the test statistic from the distance measured in the previous step.
  6. Find the critical values in the Kolmogorov-Smirnov table.

In R, these steps are automated, so we don't have to do each one of them individually.
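To make the test statistic concrete, here is a minimal sketch (not required for the exercise) that computes the D statistic by hand as the largest vertical gap between two empirical CDFs:

    # Manual computation of the Kolmogorov-Smirnov D statistic
    x <- rnorm(100, mean = 100, sd = 5)
    y <- runif(100, min = 75, max = 125)
    Fx <- ecdf(x)
    Fy <- ecdf(y)
    grid <- sort(c(x, y))               # evaluate at every observed point
    D <- max(abs(Fx(grid) - Fy(grid)))  # largest vertical difference
    D                                   # should match the D reported by ks.test(x, y)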

Exercise 20: Performing the Kolmogorov-Smirnov Test on Two Samples

To perform the Kolmogorov-Smirnov test on two samples, execute the following steps:

  1. Generate two independent distributions for comparison:

    x_norm<-rnorm(100, mean = 100, sd=5)

    y_unif<-runif(100,min=75,max=125)

  2. Plot the CDF of x_norm as follows:

    plot(ecdf(x_norm))

    The output is as follows:

    Figure 3.33: Plot of ecdf(x_norm)

    To plot ecdf(y_unif), execute the following:

    plot(ecdf(y_unif),add=TRUE)

    The output is as follows:

    Figure 3.34: Plot of ecdf(y_unif)

    As you can see, the CDFs of the two samples look quite different, so the Kolmogorov-Smirnov test should return a very small p-value.

  3. Run the Kolmogorov-Smirnov test with the ks.test() function in R:

    ks.test(x_norm,y_unif)

    The output is as follows:

    Two-sample Kolmogorov-Smirnov test

    data: x_norm and y_unif

    D = 0.29, p-value = 0.0004453

    alternative hypothesis: two-sided

    Note

    This exercise depends on randomly generated data, so when you run this code, some of the numbers might be different. In hypothesis testing, there are two hypotheses: the null hypothesis and the test (alternative) hypothesis. The goal of hypothesis testing is to determine whether we have strong enough evidence to reject the null hypothesis. In this case, the null hypothesis is that the two samples were generated by the same distribution, and the test hypothesis is that they were not. The p-value represents the probability, assuming the null hypothesis is true, of observing differences as extreme as or more extreme than what is observed. When the p-value is very close to zero, we take that as evidence that the null hypothesis is false; when it is large, we have no such evidence.

    As you can see, ks.test() returns two values, D and the p-value. The D value is the maximum absolute vertical distance between the CDFs of the two samples. The closer it is to zero, the greater the chance that both samples belong to the same distribution. The p-value has the same interpretation as in any other hypothesis test.

    In our case, D is 0.29 and the p-value is very low, near zero. So, we reject the null hypothesis that both samples belong to the same distribution. Now, in the next step, let's generate a new normal distribution and see its effect on the p-value and D.

  4. Generate a new normal distribution with the same mean and sd as x_norm:

    x_norm2<-rnorm(100,mean=100,sd=5)

  5. Plot the combined CDF of x_norm and x_norm2:

    plot(ecdf(x_norm))

    plot(ecdf(x_norm2),add=TRUE)

    The output is as follows:

    Figure 3.35: Plot of the combined CDFs
  6. Run ks.test() on x_norm and x_norm2:

    ks.test(x_norm,x_norm2)

    The output is as follows:

    Two-sample Kolmogorov-Smirnov test

    data: x_norm and x_norm2

    D = 0.15, p-value = 0.2106

    alternative hypothesis: two-sided

    As you can see, the p-value is much higher this time and D is much lower. So, according to the p-value, we are less justified in rejecting the null hypothesis that both samples belong to the same distribution.

Activity 9: Calculating the CDF and Performing the Kolmogorov-Smirnov Test with the Normal Distribution

With the help of randomly generated distributions, determine which standard distribution the samples of sepal length and sepal width are closest to:

  1. Load the Iris dataset into a variable.
  2. Keep rows with the setosa species only.
  3. Calculate the mean and standard deviation of sepal length.
  4. Generate a new normal distribution with the mean and standard deviation of the sepal length column.
  5. Plot the CDF of both functions.
  6. Generate the results of the Kolmogorov-Smirnov test and check whether the distribution is a normal distribution.
  7. Repeat steps 3, 4, 5, and 6 for the sepal width column.

    Note

    The solution for this activity can be found on page 219.

The final outcome of this activity will be as follows:

Two-sample Kolmogorov-Smirnov test

data: xnorm and df$Sepal.Width

D = 0.12, p-value = 0.7232

alternative hypothesis: two-sided

Summary

Congratulations on completing the third chapter of the book! In this chapter, we learned about the types of standard probability distributions, as well as when and how to generate them in R. We also learned how to estimate the PDFs and CDFs of unknown distributions with KDE. In the final section, we learned how to compare two samples in R and determine whether they belong to the same distribution. In the coming chapters, we will learn about other types of unsupervised learning techniques that will not only help in exploratory data analysis, but will also give us other useful insights into the data.