4. Linear Discriminant Analysis
Please download the sample Excel files from https://github.com/hhohho/Learn-Data-Mining-through-Excel for this chapter’s exercises.
Linear discriminant analysis (LDA) is similar to linear regression and K-means clustering, but different from both, too. K-means clustering is an unsupervised classification method, while LDA is supervised since its classification is trained on known data, however. The predictive results are quantitatively numerical in linear regression, while they are qualitatively categorical in LDA.
Given a set S of training data points which have already been sorted into different groups, the three steps of applying LDA can be simplified as the following:
where k is the number of groups, is the mean of group means, and is the mean of an individual group.
where n is the number of data points in a group.
Step 3: P is the best projection linear function such that Sg(P)/Sw(P) is maximized. As P is defined by (w1,w2,...,wn,b), the model construction phase of LDA is to solve for (w1,w2,...,wn,b). This is explained in Equation (4-3):
w1x1 + w2x2 + ... + wnxn + b = max(Sg(P)/Sw(P))(4-3)
The least square method can be applied to solve for (w1,w2,...,wn,b) first. We can then further optimize (w1,w2,...,wn,b) by using Excel Solver.
Open a blank Excel file.
- 2.Click File ➤ click Options. The link Options is at the bottom as shown in Figure 4-2.
- 3.Click Add-ins as shown in Figure 4-3.
- 4.On the upcoming menu, select Solver Add-in, and then click the button Go... as shown in Figure 4-4. Also, select Analysis ToolPak and click the button Go..., too.
- 6.Click the main tab Data; both Data Analysis and Solver are available as shown in Figure 4-6.
Learn LDA Through Excel
Open Chapter4-1a.xlsx; enter “y” in cell F1 (without quotation marks).
In cells K1, K2, and K3, enter 0, 1, and 2, respectively.
- 3.In cells L1, L2, and L3, enter Iris-setosa, Iris-versicolor, and Iris-virginica. Our worksheet looks like Figure 4-7.
We would like to make use of the least square method first to obtain the linear function P. Thus, we need to convert the categorical values into numerical values first. Cells K1:L3 define that 0 ➤ Iris-setosa, 1 ➤ Iris-versicolor, and 2 ➤ Iris-virginica.
In cell F2, enter the following formula:
- 5.Autofill cell F2 all the way to F151 (there are 150 data points). Our worksheet should look like Figure 4-8.
In cells K5:K9, enter b, w1, w2, w3, and w4, respectively. These represent the coefficients (weights) and intercept of the linear function P.
In cells L5:L9, enter 5, 4, 3, 2, and 1, respectively. We are going to make use of the functions INDEX and LINEST to find the coefficients and intercept for the linear function P. The function LINEST returns an array containing five elements. The order of the elements in the array is w4, w3, w2, w1, and b.
In cell M5, enter the formula:
- 9.Autofill cells M5 to M9. Part of our worksheet should look like Figure 4-9. Cells M5:M9 contain the coefficients of the linear function P. Our next step is to apply P to obtain the predicted value for each data point. We will store such predicted values in column G.
Enter “Numerical prediction” in cell G1. In cell G2, enter the following formula:
This formula represents the expression w1x1 + w2x2 + ...+ wnxn + b.
Autofill cells G2 to G152. Once we have the predicted values for each data point, what we need next is to find cutoffs to classify each predicted value into a specific iris type.
In cells K11, L11, and M11, enter “mean”, “sample number”, and “cutoff”, respectively (no quotation marks).
In cell K12, enter the following formula:
This formula computes the mean of the predicted values for Iris-setosa.
Autofill K12 to K14. The means of the predicted values for Iris-versicolor and Iris-virginica are stored in cells K13 and K14, respectively.
In cell L12, enter the formula:
This formula counts the number of Iris-setosa in the training dataset.
Autofill L12 to L14. Cells L13 and 14 store the number of Iris-versicolor and Iris-virginica, respectively.
In cell M12, enter the formula to compute the first cutoff:
- 18.Autofill M12 to M13. Do not autofill to cell M14. Based on the two cutoffs, each predicted numerical value can be converted into an iris type. Part of our worksheet looks like Figure 4-10.
In cell H1, type “Type classification”.
In cell H2, enter the following formula and then autofill from H2 to H152:
The preceding formula converts each numerical value in column G into an iris type. A glance of the result would tell that most iris types are classified into the right group.
To count how many iris data points are incorrectly classified, enter “Difference” in cell I1.
In I2, enter the following formula:
This formula compares the known iris type with the classified iris type. If a mismatch is found, 1 is returned.
Autofill cells I2 to I152.
Select cells K16 and L16, merge these two cells, and type inside the text “Difference =” in the merged cell.
- 25.In cell M16, enter the formula =SUM(I2:I151). Part of our formula should look like Figure 4-11.
This LDA analysis is quite successful, though there are still four iris data points incorrectly classified. Note that it is common for a data mining method to miss a few data points on the training dataset. Generally speaking, we should not use the same training dataset to assess how good a model can perform, however. A model with a perfect classification or prediction on the same training dataset might suggest an overfit, which indicates that the model is too specific to the training data but is likely to fail on unknown scoring data.
If we assign 0 to Iris-virginica, 1 to Iris-versicolor, and 2 to Iris-setosa in cells K1:L3, we will get a very similar result. However, if we assign 0 to Iris-virginica, 1 to Iris-setosa, and 2 to Iris-versicolor in cells K1:L3, we will find that the classification is totally messed up. This is a disadvantage of carrying out LDA analysis in Excel.
With the first model constructed, that is, the initial linear function P found, we can predict the iris type based on x1, x2, x3, and x4. We can view such predictions in the worksheet iris-LINEST of the file Chapter4-1b.xlsx.
Equation (4-3) requests to maximize the ratio of Sg/Sw. For this purpose, we need to make use of Solver. Note, we can obtain an optimized parameter set by using Solver directly without using LINEST first. However, in LDA, it is a good idea to use least square method first before applying Solver.
In cells K18, K19, and K20, type “Inter-group variance”, “Within-group variance”, and “inter/within ratio”, respectively.
Enter the following formula in cell L18:
This formula computes the variance among group means, that is, the Sg as shown in Equation (4-1). The larger this value, the higher the separability among groups. Note: Cells K12, K13, and K14 reference the within-group means.
We must compute the variance of each data point in their belonging group first before computing the within-group variance. Enter “Within-g variances” in cell J1, and then enter the following formula in cell J2:
In this formula, column F is used to determine which group each data point lies in. Note that column F represents the true group each data point belongs. LDA is a supervised classification method; it is more accurate to use column F than column G here. If F2 = K1, then the within-group variance of the corresponding data point is computed by the expression (G2-K12)^2. If not but F2 = K2, the variance is computed via the expression (G2-K13)^2; otherwise using the expression (G2-K14)^2.
Autofill cells J2 to J152.
Enter the formula =SUM(J1:J152) in cell L19. This is the Sw value as shown in Equation (4-2).
- 31.Enter the formula =L18/L19 in cell L20. This is the Sg(P)/Sw(P) that needs to be maximized by Solver. By now, part of our worksheet looks like Figure 4-12.
Copy the content of cells M5:M9 to cells N5:N9 by value only. Remember to make use of the Excel feature “Paste Special”. We are going to use Solver to optimize these values in cell M5:M9. Before doing that, we would like to make a copy of them so that we can tell the changes after using Solver. At this point, take a look at the values in cells L18 and L19, and try to remember them.
- 33.Click the menu tab Data, and locate the Solver as shown in Figure 4-13.
- 34.Click Solver. On the upcoming menu, specify the values exactly as shown in Figure 4-14. Set Objective indicates the value to be optimized. In this case, we need the value in L20 to be maximized. The Solver is requested to maximize L20 by changing values in cells M5:M9. The selected solving method is GRG Nonlinear.
- 35.After clicking Solve, click OK on the upcoming menu as demonstrated in Figure 4-15.
To assess how good our model is, we need to conduct cross-validation on our model and learn receiver operating characteristic (ROC) curve analysis . This will be the topic of the next chapter.
General understanding of LDA
Solver, including its installation and use
Excel functions INDEX and LINEST
Excel functions IF, COUNTIFS, and AVERAGEIFS