
© Hong Zhou 2020
H. Zhou, Learn Data Mining Through Excel, https://doi.org/10.1007/978-1-4842-5982-5_4

4. Linear Discriminant Analysis

Hong Zhou
University of Saint Joseph, West Hartford, CT, USA

Please download the sample Excel files from https://github.com/hhohho/Learn-Data-Mining-through-Excel for this chapter’s exercises.

General Understanding

Linear discriminant analysis (LDA) is similar to linear regression and to K-means clustering, yet different from both. K-means clustering is an unsupervised classification method, while LDA is supervised: its classification model is trained on data points whose categories are already known. In addition, the predictive results of linear regression are quantitatively numerical, while those of LDA are qualitatively categorical.

LDA tries to find the best linear function that can separate data points into distinct categories. The variances between group means should be as large as possible, while the variances inside each group should be as small as possible. Note that the variances are computed after each data point is projected onto a linear function P = w1x1 + w2x2 + ... + wnxn + b. The training dataset is used to find the best function P, which is represented by the parameter set (w1, w2, ..., wn, b). Figure 4-1 demonstrates the concept in two dimensions.
Figure 4-1

LDA aims to separate groups
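To see the idea behind Figure 4-1 numerically, the short Python sketch below (using made-up two-dimensional data, not one of the book's worksheets) projects two groups of points onto a direction w and computes the ratio of inter-group to within-group variance for a good and a poor choice of w:

```python
import numpy as np

# Two made-up 2-D groups, offset from each other along roughly (3, 1).
rng = np.random.default_rng(0)
group1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
group2 = rng.normal(loc=[3.0, 1.0], scale=0.5, size=(20, 2))

def separation_ratio(w, groups):
    """Sg/Sw after projecting every point onto w (the P = w1*x1 + w2*x2 idea)."""
    projected = [g @ w for g in groups]
    means = [p.mean() for p in projected]
    grand = np.mean(means)
    s_g = sum((m - grand) ** 2 for m in means)                         # inter-group
    s_w = sum(((p - m) ** 2).sum() for p, m in zip(projected, means))  # within-group
    return s_g / s_w

good = separation_ratio(np.array([1.0, 0.3]), [group1, group2])
poor = separation_ratio(np.array([-0.3, 1.0]), [group1, group2])
print(good > poor)  # prints True: the direction aligned with the group offset wins
```

A direction roughly parallel to the offset between group centers yields a far larger ratio than one nearly perpendicular to it, which is exactly what LDA searches for.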

Given a set S of training data points that have already been sorted into different groups, applying LDA can be simplified into the following three steps:

Step 1: Sum the inter-group variances via the following expression:

Sg(P) = Σi=1..k (mi - m)^2(4-1)

where k is the number of groups, m is the mean of the group means, and mi is the mean of an individual group i, computed on the projected values.

Step 2: Sum the within-group variances via the following expression:

Sw(P) = Σi=1..k Σj=1..n (pj - mi)^2(4-2)

where n is the number of data points in group i and pj is the projected value of data point j in group i.

Step 3: P is the best projection linear function such that Sg(P)/Sw(P) is maximized. As P is defined by (w1, w2, ..., wn, b), the model construction phase of LDA is to solve for (w1, w2, ..., wn, b). This is expressed in Equation (4-3):

(w1, w2, ..., wn, b) = argmax Sg(P)/Sw(P), where P = w1x1 + w2x2 + ... + wnxn + b(4-3)

The least squares method can be applied to obtain an initial (w1, w2, ..., wn, b) first. We can then further optimize this parameter set by using Excel Solver.
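As a quick sketch of this first stage, the NumPy snippet below fits (w1, ..., w4, b) by least squares on synthetic three-group data (the data, group centers, and random seed are illustrative assumptions; in Excel this role will be played by LINEST):

```python
import numpy as np

# Least squares starting point for LDA: regress the numeric class codes
# (0, 1, 2) on the features, the same role LINEST plays in the worksheet.
# Synthetic data: three groups of 30 points with 4 features each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(30, 4)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0.0, 1.0, 2.0], 30)

A = np.column_stack([X, np.ones(len(X))])      # extra column of 1s for the intercept b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # returns (w1, w2, w3, w4, b)
w, b = coef[:-1], coef[-1]
P = X @ w + b                                  # projected value of every data point
```

The projected values P correspond to column G in the worksheet below; their three group means come out well ordered, which is what makes simple cutoffs between adjacent means work.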

Solver

Excel Solver is an optimization tool. It adjusts the values of decision variables to search for a maximum or minimum target value. Solver is a built-in add-in that can be enabled without any downloading. Follow these instructions to install Solver together with the Data Analysis add-in:
  1.

    Open a blank Excel file.

     
  2.
    Click File ➤ Options. The Options link is at the bottom, as shown in Figure 4-2.
    Figure 4-2

    Click File ➤ Options

     
  3.
    Click Add-ins as shown in Figure 4-3.
    Figure 4-3

    Click Add-ins

     
  4.
    On the upcoming menu, select Solver Add-in, and then click the button Go... as shown in Figure 4-4. Select Analysis ToolPak and click Go... in the same way.
    Figure 4-4

    Select Solver Add-in and click Go

     
  5.
    Click the button OK as shown in Figure 4-4. A small menu comes up as shown in Figure 4-5. Make sure that both Solver and Analysis ToolPak are checked. Click the button OK.
    Figure 4-5

    Add Solver and Analysis ToolPak

     
  6.
    Click the main tab Data; both Data Analysis and Solver are available as shown in Figure 4-6.
    Figure 4-6

    Solver is now available

     

Learn LDA Through Excel

Chapter4-1a.xlsx contains the famous Iris dataset that is a perfect sample for LDA practice. In the Iris dataset, the four attributes x1, x2, x3, and x4 are some flower characteristics that are used to determine the type of iris: the target variable. This is the training dataset. Note that there are three distinct values for the target variable, that is, the target variable has three classes (categories). Follow these instructions:
  1.

    Open Chapter4-1a.xlsx; enter “y” in cell F1 (without quotation marks).

     
  2.

    In cells K1, K2, and K3, enter 0, 1, and 2, respectively.

     
  3.
    In cells L1, L2, and L3, enter Iris-setosa, Iris-versicolor, and Iris-virginica. Our worksheet looks like Figure 4-7.
    Figure 4-7

    Iris dataset and table setup

    We would like to make use of the least squares method to obtain the linear function P. Thus, we need to convert the categorical values into numerical values first. Cells K1:L3 define that 0 ➤ Iris-setosa, 1 ➤ Iris-versicolor, and 2 ➤ Iris-virginica.

     
  4.

    In cell F2, enter the following formula:

    =IF(A2=L$1,K$1,IF(A2=L$2,K$2,K$3))

     
  5.
    Autofill cell F2 all the way to F151 (there are 150 data points). Our worksheet should look like Figure 4-8.
    Figure 4-8

    Iris dataset categorized

     
  6.

    In cells K5:K9, enter b, w1, w2, w3, and w4, respectively. These represent the coefficients (weights) and intercept of the linear function P.

     
  7.

    In cells L5:L9, enter 5, 4, 3, 2, and 1, respectively. We are going to make use of the functions INDEX and LINEST to find the coefficients and intercept for the linear function P. The function LINEST returns an array containing five elements. The order of the elements in the array is w4, w3, w2, w1, and b.

     
  8.

    In cell M5, enter the formula:

    =INDEX(LINEST(F$2:F$151,B$2:E$151,TRUE,TRUE),1,L5)

     
  9.
    Autofill cells M5 to M9. Part of our worksheet should look like Figure 4-9. Cells M5:M9 contain the coefficients of the linear function P. Our next step is to apply P to obtain the predicted value for each data point. We will store such predicted values in column G.
    Figure 4-9

    Using least square method to obtain function coefficients

     
  10.

    Enter “Numerical prediction” in cell G1. In cell G2, enter the following formula:

    =M$6*B2+M$7*C2+M$8*D2+M$9*E2+M$5

    This formula represents the expression w1x1 + w2x2 + ... + wnxn + b.

     
  11.

    Autofill cells G2 to G151. Once we have the predicted value for each data point, what we need next is to find cutoffs to classify each predicted value into a specific iris type.

     
  12.

    In cells K11, L11, and M11, enter “mean”, “sample number”, and “cutoff”, respectively (no quotation marks).

     
  13.

    In cell K12, enter the following formula:

    =AVERAGEIFS(G$2:G$151,F$2:F$151,K1)

    This formula computes the mean of the predicted values for Iris-setosa.

     
  14.

    Autofill K12 to K14. The means of the predicted values for Iris-versicolor and Iris-virginica are stored in cells K13 and K14, respectively.

     
  15.

    In cell L12, enter the formula:

    =COUNTIFS(F$2:F$151,K1)

    This formula counts the number of Iris-setosa in the training dataset.

     
  16.

    Autofill L12 to L14. Cells L13 and L14 store the numbers of Iris-versicolor and Iris-virginica, respectively.

     
  17.

    In cell M12, enter the formula to compute the first cutoff:

    =(K12*L12+K13*L13)/(L12+L13)

     
  18.
    Autofill M12 to M13. Do not autofill to cell M14. Based on the two cutoffs, each predicted numerical value can be converted into an iris type. Part of our worksheet looks like Figure 4-10.
    Figure 4-10

    Cutoffs computed

     
  19.

    In cell H1, type “Type classification”.

     
  20.

    In cell H2, enter the following formula and then autofill from H2 to H151:

    =IF(G2<M$12,L$1,IF(G2<M$13,L$2,L$3))

    The preceding formula converts each numerical value in column G into an iris type. A glance at the results shows that most data points are classified into the correct iris group.

     
  21.

    To count how many iris data points are incorrectly classified, enter “Difference” in cell I1.

     
  22.

    In I2, enter the following formula:

    =IF(A2=H2,0,1)

    This formula compares the known iris type with the classified iris type. If a mismatch is found, 1 is returned; otherwise, 0 is returned.

     
  23.

    Autofill cells I2 to I151.

     
  24.

    Select cells K16 and L16, merge these two cells, and type the text “Difference =” in the merged cell.

     
  25.
    In cell M16, enter the formula =SUM(I2:I151). Part of our worksheet should look like Figure 4-11.
    Figure 4-11

    LDA analysis on the Iris training dataset

     

This LDA analysis is quite successful, though there are still four iris data points incorrectly classified. Note that it is common for a data mining method to miss a few data points in the training dataset. Generally speaking, however, we should not use the training dataset itself to assess how well a model performs. A perfect classification or prediction on the training dataset might suggest overfitting, which indicates that the model is too specific to the training data and is likely to fail on unknown scoring data.
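The worksheet pipeline of steps 10 through 25 can be condensed into a short Python sketch. The synthetic three-group data below is an illustrative stand-in for the Iris worksheet, and the class codes 0, 1, and 2 mirror cells K1:K3:

```python
import numpy as np

# Condensed sketch of the worksheet pipeline: least squares projection
# (column G), size-weighted cutoffs between adjacent group means
# (cells M12:M13), classification (column H), and the error count (cell M16).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(30, 4)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 30)

A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)
P = A @ coef                                              # column G

means = np.array([P[y == g].mean() for g in (0, 1, 2)])   # cells K12:K14
sizes = np.array([(y == g).sum() for g in (0, 1, 2)])     # cells L12:L14
# Each cutoff is the size-weighted average of two adjacent group means.
cuts = (means[:-1] * sizes[:-1] + means[1:] * sizes[1:]) / (sizes[:-1] + sizes[1:])

predicted = np.digitize(P, cuts)                          # column H
errors = int((predicted != y).sum())                      # cell M16
print(errors)
```

With well-separated synthetic groups the error count is zero or close to it; on the real Iris worksheet the same procedure reports 4.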

If we assign 0 to Iris-virginica, 1 to Iris-versicolor, and 2 to Iris-setosa in cells K1:L3, we will get a very similar result. However, if we assign 0 to Iris-virginica, 1 to Iris-setosa, and 2 to Iris-versicolor in cells K1:L3, we will find that the classification is totally messed up. This is a disadvantage of carrying out LDA analysis in Excel.

With the first model constructed, that is, the initial linear function P found, we can predict the iris type based on x1, x2, x3, and x4. We can view such predictions in the worksheet iris-LINEST of the file Chapter4-1b.xlsx.

Equation (4-3) requires maximizing the ratio Sg/Sw. For this purpose, we need to make use of Solver. Note that we could obtain an optimized parameter set by using Solver directly, without using LINEST first. In LDA, however, it is a good idea to apply the least squares method before applying Solver.

Continue our learning through the following step-by-step instructions:
  26.

    In cells K18, K19, and K20, type “Inter-group variance”, “Within-group variance”, and “inter/within ratio”, respectively.

     
  27.

    Enter the following formula in cell L18:

    =(K12-AVERAGE(K12:K14))^2+(K13-AVERAGE(K12:K14))^2+(K14-AVERAGE(K12:K14))^2

    This formula computes the variance among the group means, that is, the Sg shown in Equation (4-1). The larger this value, the higher the separability among groups. Note: Cells K12, K13, and K14 reference the group means.

     
  28.

    We must first compute each data point's squared deviation from its group mean before summing the within-group variance. Enter “Within-g variances” in cell J1, and then enter the following formula in cell J2:

    =IF(F2=K$1,(G2-K$12)^2,IF(F2=K$2,(G2-K$13)^2,(G2-K$14)^2))

    In this formula, column F is used to determine which group each data point lies in. Note that column F represents the true group each data point belongs to; because LDA is a supervised classification method, it is more accurate to use column F than column G here. If F2 = K1, then the squared deviation of the corresponding data point is computed by the expression (G2-K12)^2. Otherwise, if F2 = K2, it is computed via the expression (G2-K13)^2; if neither, via the expression (G2-K14)^2.

     
  29.

    Autofill cells J2 to J151.

     
  30.

    Enter the formula =SUM(J2:J151) in cell L19. This is the Sw value as shown in Equation (4-2).

     
  31.
    Enter the formula =L18/L19 in cell L20. This is the Sg(P)/Sw(P) that needs to be maximized by Solver. By now, part of our worksheet looks like Figure 4-12.
    Figure 4-12

    Variances computed

     
  32.

    Copy the content of cells M5:M9 to cells N5:N9 by value only; remember to make use of the Excel feature “Paste Special”. We are going to use Solver to optimize the values in cells M5:M9. Before doing that, we make a copy of them so that we can tell what has changed after using Solver. At this point, take a look at the values in cells L18 and L19, and try to remember them.

     
  33.
    Click the menu tab Data, and locate the Solver as shown in Figure 4-13.
    Figure 4-13

    Using Excel Solver

     
  34.
    Click Solver. On the upcoming menu, specify the values exactly as shown in Figure 4-14. Set Objective indicates the value to be optimized; in this case, we need the value in cell L20 to be maximized. Solver is requested to maximize L20 by changing the values in cells M5:M9. The selected solving method is GRG Nonlinear.
    Figure 4-14

    Using Excel Solver to optimize the parameters for function P

     
  35.
    After clicking Solve, click OK on the upcoming menu as demonstrated in Figure 4-15.
    Figure 4-15

    Keep Solver Solution

     
The values inside cells M5:M9 are changed, and the value in cell M16 has changed from 4 to 2. This is shown in Figure 4-16.
Figure 4-16

Solver further improves the LDA model

Though the new parameter set slightly improves the classification results, we need to pay close attention to the changes in cells K12:K14 and L18:L20, as shown in Figure 4-17. Clearly, the inter-group variance decreases significantly, but the within-group variance decreases even more.
Figure 4-17

Watch for the parameters
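The Sg/Sw objective of steps 26 through 31 and Solver's refinement can be sketched in Python as well. The random-search loop below is only a crude stand-in for Solver's GRG Nonlinear engine (a gradient-based method), and the synthetic data is again an illustrative assumption:

```python
import numpy as np

# Sg/Sw ratio (cells L18:L20) and a crude refinement of the least squares
# weights, standing in for Solver's GRG Nonlinear optimization.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(30, 4)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 30)
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)

def ratio(c):
    P = A @ c
    means = np.array([P[y == g].mean() for g in (0, 1, 2)])
    s_g = ((means - means.mean()) ** 2).sum()                          # cell L18
    s_w = sum(((P[y == g] - means[g]) ** 2).sum() for g in (0, 1, 2))  # cell L19
    return s_g / s_w                                                   # cell L20

best, best_r = coef.copy(), ratio(coef)
for _ in range(2000):                       # keep only perturbations that help
    trial = best + rng.normal(0.0, 0.05, size=best.shape)
    r = ratio(trial)
    if r > best_r:
        best, best_r = trial, r
print(best_r >= ratio(coef))  # prints True: the ratio never gets worse
```

As in the worksheet, maximizing the ratio may shrink both Sg and Sw individually; what matters is their quotient.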

To assess how good our model is, we need to conduct cross-validation and learn receiver operating characteristic (ROC) curve analysis. This will be the topic of the next chapter.

Review Points

  1.

    General understanding of LDA

     
  2.

    Solver, including its installation and use

     
  3.

    Excel functions INDEX and LINEST

     
  4.

    Excel functions IF, COUNTIFS, and AVERAGEIFS