Chapter 7: Parametric Classification Models

Overview

In the last chapter, we focused on linear regression models. Two important characteristics define those models. 1.) They are designed to support a quantitative target variable. 2.) They assume a functional form (Y ≈ β0 + β1 X1); therefore, they are categorized as parametric models.

Linear regression models are constrained to fit the data by utilizing the linear regression equation. The constraint of an assumed functional form is what makes the linear regression model so powerful. The equation is transparent and models the data quickly because it only has to fit the data to a predetermined functional form. In later chapters, we will review the non-parametric models that are incredibly versatile and powerful but come at the cost of computational resources and processing time.

Although linear regression models are powerful, they are not appropriate to use in many situations. The primary decision point concerning what type of model to use is often driven by the nature of the target variable. Remember that the linear regression model is appropriate for quantitative dependent variables. What if the dependent variable is qualitative?

Qualitative variables are often binary (yes or no, 1 or 0, on or off), but they can also be multi-level categories (1st, 2nd, 3rd) or (red, yellow, green). Notice that these categories do not need to be numeric. They are often descriptive categories that we will need to transform into numeric representations of the data.

There are specialized models that are designed to predict a categorical dependent variable while retaining a predetermined functional form. This chapter will focus on two of these models: Logistic Regression and Linear Discriminant Analysis.

Classification Overview

Classification problems are prevalent in the modeling world. In fact, it is often more important to determine whether an event will occur rather than the exact time or value of the event. Some examples include:

● Will a viewer click on an ad?

● Will a borrower default on a loan?

● Will this new drug lower a patient’s blood pressure?

● Is this transaction fraudulent?

● Which one of three locations should the company drill for oil?

The answer to each of these examples can be represented as a binary or multi-level categorical response. We can still structure our modeling data set in the same manner that we reviewed in Chapter 5: Create a Modeling Data Set; the only difference is the selection or creation of the dependent target variable. The modeling data set will still contain several independent predictor variables that we will use to try to determine the outcome variable. So, if nearly everything is the same as the linear regression model, why not just use the linear regression model to predict categorical outcomes?

Difference Between Linear and Logistic Regression

As stated previously, the main difference between linear and logistic regression is the structure of the dependent variable. Linear regression assumes a quantitative numeric variable while logistic regression assumes a qualitative categorical target variable. Figure 7.1 demonstrates an example distribution of values for these modeling types.

Figure 7.1: Linear vs. Logistic Regression Data Distributions

In this example, the linear regression model can take on any value between zero and one. Let’s assume that the target value represents “the percentage of tickets sold (Y) given a certain discount rate (X).” This value can be 0% or 31% or 73% or 100% or any value in between.

The figure on the right shows an example binary distribution for a logistic regression model. Let’s assume that the target value represents “sale or no sale.” There are no values in between “sale” (represented as a 1) and “no sale” (represented as a 0) given a certain discount rate (X).

If we were to try and model a binary outcome with a linear regression model, we would get some confusing results. Figure 7.2 compares the modeling approaches of linear and logistic regression on the binary target variable.

Figure 7.2: Linear vs. Logistic Models on a Binary Outcome

The linear regression model aims to minimize the Residual Sum of Squares (RSS) for the data. The underlying functional form of the model assumes that the dependent variable (Y) is a continuous value. Under this assumption, the straight line is the best fit for the data. We could actually use the linear regression model to make predictions by stating that any observation with a predicted value above the green dotted line will make a purchase and any observation with a predicted value below the green dotted line will not make a purchase.

One of the problems with this approach is that as you approach the highest or lowest values for the predictor, the logic of the linear regression breaks down. Notice that as the value of X nears 0%, the predicted value of a purchase is negative, and as we approach the high end of X, the predicted value is greater than one. Neither of these outcomes is possible because they lie outside of the [0,1] range.

Maybe we are just being picky and overly academic about the use of the linear regression model. We could simply state that all predicted values greater than one will be capped at one, and all predicted values less than zero will be capped at zero. Problem solved, right? Unfortunately, not.

There are three main reasons why we should not use linear regression on classification problems:

1. The issue of probabilities being greater than one or less than zero as described above.

2. Homoscedasticity – One of the assumptions of linear regression is that the variance of Y is constant across values of X. However, for a binary target, the variance is the proportion of positive values (P) times the proportion of negative values (Q); that is, the variance is PQ, and this value changes across the logistic regression line. For example, at the midpoint of the logistic regression line where P = 0.5 and Q = 0.5, the variance is PQ = 0.25. However, on the far end of the logistic regression line where P = 0.1 and Q = 0.9, the variance is PQ = 0.09.

3. Linear regression assumes that the residuals are normally distributed. Because a classification problem has a binary target, this assumption is hard to justify.

Logistic Regression

A better approach is to model the binary outcome variable as a probability that the event will occur. This probability is represented as the S-curved regression line in Figure 7.2. The formula for this line is much different from the linear regression formula, but it does share some similarities.

Equation 7.1: Linear Regression

Y ≈ β0 + β1 X1

Equation 7.2: Logistic Regression

p(X) = e^(β0 + β1X1) / (1 + e^(β0 + β1X1))

The logistic regression equation (Equation 7.2) contains four main parts:

1. p(X) represents the probability of a positive value (1), which is also the proportion of 1s. This is the mean value of Y (just like linear regression).

2. Since this is a logistic model, we should expect to see the base value of the natural logarithm (e) represented in the model.

3. The value β0 represents P when X is zero (similar to the linear regression intercept value).

4. The value β1 adjusts how quickly the probability changes with a single-unit change in X (similar to the β1 value in linear regression).

The logistic regression equation is called the logistic function, and we use this function because it produces outputs between 0 and 1 for all values of X, and it allows the variance of Y to vary across values of X.
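For reference, the same relationship can be written in terms of the log odds (a standard rearrangement of Equation 7.2, not shown in the original text), which makes the linear structure explicit:

log( p(X) / (1 − p(X)) ) = β0 + β1X1

This logit form is why the β estimates reported later by PROC LOGISTIC are interpreted as changes in the log odds of the event for a one-unit change in X.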

Loss Function

In the linear regression chapter, we described how the β values are determined by minimizing the RSS cost function through gradient descent. For logistic regression, because we are fitting an S-shaped logistic regression line, we cannot simply minimize the sum of squares as we did with the linear regression model. Instead, the β values are determined through the process of maximum likelihood.

Remember that we are dealing with probabilities. The term likelihood represents the conditional probability P(Y|X), the probability of Y given X. The parameters of the logistic regression function (β0 and β1) are determined by selecting values such that the predicted probability p(xi) for each observation corresponds as closely as possible to that observation’s actual value.

Therefore, maximum likelihood means that we are selecting values for β0 and β1 that result in a probability close to one for all ones and a probability close to zero for all zero values. This is called maximum likelihood because we are attempting to maximize the likelihood (conditional probability) of the model estimate matching the sample data.
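In symbols (standard maximum likelihood notation, not shown in the original text), the quantity being maximized is the likelihood function:

L(β0, β1) = ∏(i: yi = 1) p(xi) × ∏(i: yi = 0) (1 − p(xi))

The β0 and β1 values that make this product as large as possible are the maximum likelihood estimates that PROC LOGISTIC reports.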

Data

In Chapter 5: Create a Modeling Data Set, we developed a modeling data set with a binary target variable. We should be familiar with this data set by now. We have extensively examined the variables in Chapter 5, and we used the data set to develop our linear regression models in Chapter 6: Linear Regression Models. Let’s finish off our tour of parametric models by using this same data set to demonstrate the development and interpretation of parametric binary classifiers.

For those of you who skipped Chapters 5 and 6 (shame on you), the data set is the Lending Club loan database. This data set represents loan accounts sourced from the Lending Club website and provides anonymized account-level information with data attributes representing the borrower, loan type, loan duration, loan performance, and loan status.

The target variable is a binary indicator labeled “bad,” and it flags loan accounts that resulted in a loan status of “Charged Off,” “Default,” or “Does not meet the credit policy: Charged Off.” All of these loan status categories represent losses to the loan issuer.

The data set was split into separate TRAIN and TEST data sets using a 70/30 split ratio. The predictor variables have been modified to cap outliers according to the 1.5 IQR rule. We also identified variables with less than 30% of their observations missing and imputed the missing values according to a process appropriate for each data type. We then created dummy variables for the categorical variables and developed several new variables through feature engineering techniques. If you have any questions on how to perform these data transformations, check out Chapter 5.

Data Restriction

Now that we have our base modeling data set constructed, we need to make a final decision prior to modeling. Since our data set is the full list of loan accounts since the inception of the Lending Club database, we need to understand that there is an initial ramp-up period that will have a low volume of loans, and we also need to realize that loans do not go bad right away. There will be a lag between application and eventual delinquency. Figure 7.3 shows the trend in the number of accounts with a bad status over the eleven-year history of the Lending Club data set.

Figure 7.3: Lending Club Account Trend

We can easily see that the ramp-up period for the data set extends from inception to around the beginning of 2012. Throughout this time period, loan volume is very low. If we were to include this time period in our modeling data set, it would not be adequately representative of the loan type and volume that we would expect at the point of model implementation. Figure 7.4 isolates the volume of accounts with a bad status for the Lending Club data set.

Figure 7.4: Lending Club Bad Status Trend

This graph shows that the number of bads (accounts with a bad status) starts to increase around the beginning of 2012 and reaches its peak in March 2016, after which the number of bads decreases through the end of the time period.

What is a data scientist supposed to do with this information? These charts give us insight into the nature of the target variable. We now understand that there is a clear ramp-up period where a small number of loans are being issued, and during that time, the number of bad loans is very low. This situation is most likely because loans do not go bad right away. There is generally a lag between the time period of loan issuance and delinquency. We can also see this occurring in the later time periods where the number of bads is decreasing in the later months. This is most likely because these loans have not had enough time to go bad.

To build a model that adequately represents the true bad rate, we should choose a tight time frame in which loan volume is high enough and the loans have had enough time to go bad. For our modeling purposes, I will restrict the data to loan issue dates between Jan 1, 2015, and Dec 31, 2015. That gives us a solid one-year period of representative loan volume.

Program 7.1 limits the training data set to the specified time period and develops a frequency distribution for the “bad” variable.

Program 7.1: Limit Data Set by Time Frame

DATA TRAIN;
  SET MYDATA.MODEL_TRAIN;
  WHERE '01JAN2015'd le issue_date le '01DEC2015'd;
RUN;
PROC FREQ DATA=TRAIN; TABLES BAD; RUN;

Output 7.1: Target Value Frequency Distribution

Output 7.1 shows that 18.01% of loans issued in the specified time period result in a bad status.

Visualization

The first step to any modeling process is to visualize the data. This step will help you to understand the relationships between the predictors and the relationship between each predictor and the target variable, but there is another, often-unspoken, reason to visualize the data. With the visualization, you can gain an a priori idea of how well a model can separate the positive target values from the non-positive target values.

With just one look, you can often know before you do any modeling whether there is a clear separation between the binary target values, in which case you could expect strong modeling results. If there is not a clear separation between the binary target values, you can expect that your model will not be as strong.

Output 7.2 shows a scatter plot of the relationship between the predictor variables loan_amnt and int_rate. This graph also represents the target variable bad as a blue circle for cases where bad = 0 and a red triangle where bad = 1. Program 7.2 provides the code to develop this graph.

Program 7.2: Scatter Plot of Predictor Variables

ods graphics / attrpriority=none;
proc sgplot data=TRAIN (obs=1000);
  styleattrs datasymbols=(circlefilled trianglefilled);
  scatter x=loan_amnt y=int_rate / group=bad;
  title 'Scatter Plot of Loan Data';
run;

The first thing that we notice when we examine Output 7.2 is that the target variable appears to be randomly scattered throughout the decision space. There does not appear to be a strong separation between the “goods” and the “bads” in relation to these two variables.

Output 7.2: Scatter Plot of Loan Data

Maybe this issue is just the result of these particular variables. In order to get a broader understanding of the issue, we can develop a scatter plot matrix that contains several variables and plot them against each other. This approach can provide us with different views of the target variable, and we can gain additional information on what our expectations should be for any model that we develop.

Program 7.3 provides the code to develop the scatter plot matrix with each variable grouped by the target variable.

Program 7.3: Scatter Plot Matrix of Predictor Variables

proc sgscatter data=TRAIN (obs=250);
  title "Scatterplot Matrix for Loan Data";
  matrix loan_amnt total_bc_limit dti int_rate / group=bad;
run;

Output 7.3 shows the scatter plot matrix for four of the predictor variables where each has been grouped by the target variable. In this plot, the blue circles indicate where bad = 0 and the red plus signs represent where bad = 1.

A scatter plot matrix can easily get visually overwhelming. A lot is going on in one chart. Notice that I’ve included only 250 observations in the chart. Keep in mind that with a sample this small, we can expect some bias in the results. However, this does show us that the same issue of a non-distinct decision boundary between the goods and bads is prevalent across all these predictor variables.

Output 7.3: Scatter Plot Matrix of Predictor Variables

Most books about data science rely on examples where there are clear decision boundaries between the “goods” and the “bads” and where each predictor variable is a strongly significant predictor. However, in the real world, data is messy. In fact, most of the time, it will be a challenge to find a significant predictive relationship between your target and a list of predictor variables.

This challenge is especially true when there are restrictions (such as regulatory restrictions) put in place about what types of variables you are allowed to use (such as age, race, sex, location, and so on). There are also modeling issues of data bleed, which is where direct information about your target variable is contained in a predictor variable. There are other technical issues of data availability and data refresh schedules and the ability to move certain variables into a production environment that will limit your ability to build the best model possible. All of these restrictions need to be taken into account when developing your model.

Now that we have taken all of our restrictions into account and we have an a priori idea about how effective a model will be in separating our goods from our bads, we can expect that a binary classifier model will have a moderate level of effectiveness. We should not be surprised if we get a Gini score of around 0.40 and an overall accuracy of around 80%. Don’t worry, the concepts of Gini and accuracy will be explained in the next couple of sections.

Logistic Regression Model

We have stressed the importance of variable selection prior to any modeling efforts. This selection process ensures that the variables that we will include in our model are appropriate (no data bleed) and permitted (allowed by regulation standards) and available (allowed by an IT department that controls the data infrastructure). Once these restrictions have been met, we can develop our predictive model.

Remember that the business goal of our model is to predict whether a loan will result in a bad status at some point in the future. We would like to implement this model at the point of the loan application, so all variables that we will feed into our model must be available at the point of application. When someone applies for a loan, we would not know if they have missed any payments yet or if they are delinquent or if they have a collection recovery fee. We wouldn’t know any of this information because, at the point of application, these variables do not exist, so we must not include them in the model.

If we did include those variables, this would be an issue of data bleed, which is also known as information leakage. This is when you have variables that are directly related to the target variable in the model. If you were to include these variables, your model would be highly accurate, but it is a false accuracy because you could not implement that model due to those variables not being available at the point of application.

I have taken the liberty of examining the list of all variables in the Lending Club data set and selecting only those that are available at the point of application. Table 7.1 shows the list of 45 variables that meet our criteria.

Table 7.1: Variables Available at the Point of Application

acc_open_past_24mths    grade_F                  purpose_other
annual_inc              grade_G                  revol_bal
app_individual          home_mort                term_36
app_joint               home_own                 term_60
bc_util                 home_rent                tot_hi_cred_lim
dti                     inq_last_6mths           total_bc_limit
emp_10                  int_rate                 ver_not
emp_0to4                loan_amnt                ver_source
emp_5to9                mo_sin_old_il_acct       ver_verified
emp_NA                  mo_sin_rcnt_tl           num_actv_bc_tl
grade_A                 mort_acc                 num_bc_tl
grade_B                 mths_since_recent_bc     num_il_tl
grade_C                 mths_since_recent_inq    open_acc
grade_D                 purpose_dc               pct_tl_nvr_dlq
grade_E                 purpose_hi               purpose_cc

These variables represent a mixture of loan attributes that include information about the borrower (annual income, length of employment, homeownership), loan information (loan grade, purpose, term length), and prior borrower behavior (number of accounts opened in the previous 24 months, loan inquiries in the last 6 months, percent of trades never delinquent), along with several other descriptive variables.

I can easily place these variables into a global macro variable by using the %LET statement. Program 7.4 puts the variables into a macro variable labeled num_vars.

Program 7.4: Create Macro Variable That Contains Model Variable Names

%LET num_vars = acc_open_past_24mths annual_inc app_individual app_joint bc_util dti emp_10 emp_0to4 emp_5to9 emp_NA grade_A grade_B grade_C grade_D grade_E grade_F grade_G home_mort home_own home_rent inq_last_6mths int_rate loan_amnt mo_sin_old_il_acct mo_sin_rcnt_tl mort_acc mths_since_recent_bc mths_since_recent_inq num_actv_bc_tl num_bc_tl num_il_tl open_acc pct_tl_nvr_dlq purpose_cc purpose_dc purpose_hi purpose_other revol_bal term_36 term_60 tot_hi_cred_lim total_bc_limit ver_not ver_source ver_verified ;

PROC LOGISTIC Code

Now that we have our training data set constructed and we have placed the predictors into a macro variable, we are ready to develop the logistic regression model. Program 7.5 develops a standard logistic regression model with the target variable bad and the predictors contained in the num_vars macro variable.

Program 7.5: Logistic Regression Model

ODS GRAPHICS ON;
PROC LOGISTIC DATA=TRAIN DESCENDING PLOTS=ALL;
       MODEL BAD = &num_vars. / SELECTION=STEPWISE
              SLE=0.01 SLS=0.01 CORRB OUTROC=performance;
       OUTPUT OUT=MYDATA.LOG_REG_PROB PROB=score;
RUN;

Notes on Program 7.5:

● The ODS GRAPHICS statement is set to ON. This will provide us with all the output graphics that we specify in the logistic regression options.

● PROC LOGISTIC is used to develop the model. This is a highly flexible procedure that can be used to predict binary, ordinal, or nominal responses. There are several options available that allow the researcher to customize the specifications of their model:

◦ The DESCENDING option reverses the sorting order for the levels of the response variable. This ensures that the model is predicting the event where BAD = 1. If this option is not selected, then the model will predict where BAD = 0.

◦ The PLOTS option is set to ALL. You can specify which plots you want to appear in the output. The ALL option states that we want all available plots to appear in the output.

◦ The MODEL statement specifies the target and predictor variables. This is the core part of the model where we specify that the target variable is BAD, and the predictors are the variables contained in the macro variable num_vars. Remember that it is not necessary to use a macro variable. You can easily place the individual predictor variables directly in the PROC LOGISTIC code. (Make sure that they are separated by spaces and not by commas.)

◦ The SELECTION method is set to STEPWISE. This selection is similar to the FORWARD selection method, but variables are not guaranteed to remain in the model. They can be replaced with new variables that are introduced further in the selection cycle.

◦ The SLE option specifies the Chi-Square significance level to enter the model with the FORWARD or STEPWISE selection methods.

◦ The SLS option specifies the Chi-Square significance level for a variable to remain in the model during the backward elimination step.

◦ The CORRB option displays the correlation matrix of the parameter estimates.

◦ The OUTROC= option creates an output data set that contains the information that we will need to develop the Receiver Operating Characteristic (ROC) curve.

◦ The OUTPUT OUT= statement creates a new data set that contains the target variable and all of the predictors as well as the newly developed model score.

◦ The PROB= option on the OUTPUT statement names the newly created predicted probability variable “score.”

Program 7.5 may seem like a lot of code to develop for a simple logistic regression. Most of the code has been created to specify the options available in the LOGISTIC procedure. However, the code does not need to be complicated at all. Program 7.6 develops a logistic regression in a single line of code.

Program 7.6: Simple Logistic Regression Program

PROC LOGISTIC DATA=TRAIN DESCENDING; MODEL BAD = &num_vars.; RUN;

That’s all you really need to develop a logistic regression in SAS. However, all of the available options allow you to customize your model according to your needs.

PROC LOGISTIC Model Output

PROC LOGISTIC creates a variety of output, including summary information about the modeling data set, summary information for each step in our stepwise selection, maximum likelihood estimates, odds ratio estimates, model fit statistics, and any graphs that we specified in the PLOTS statement.

Model Summary Information

The first set of outputs provides verification of your model inputs. Output 7.4 shows that the model is developed on the WORK.TRAIN data set and that the target variable is BAD. It also specifies that this is a binary response model that was optimized with Fisher’s scoring technique.

The model used all 65,252 observations. This means that there were no missing values in the data set. The Response Profile shows that the model is specified where the target variable BAD = 1.

Output 7.4: Logistic Regression Output

Stepwise Selection Summary

The next section of the model output shows each step of the stepwise selection technique. The predictor variables are added to the model one at a time and evaluated for their predictive power as defined by the Chi-Square metric. Output 7.5 shows the summary of the stepwise selection.

Output 7.5: Logistic Regression Stepwise Selection Table

Summary of Stepwise Selection

Step   Effect Entered          Effect Removed   DF   Number In   Score Chi-Square   Pr > ChiSq
  1    int_rate                                  1        1            4387.7371      <.0001
  2    acc_open_past_24mths                      1        2             378.4419      <.0001
  3    tot_hi_cred_lim                           1        3             310.9002      <.0001
  4    dti                                       1        4             131.341       <.0001
  5    emp_NA                                    1        5              77.7841      <.0001
  6    loan_amnt                                 1        6              54.8411      <.0001
  7    grade_A                                   1        7              48.8296      <.0001
  8    home_mort                                 1        8              39.9212      <.0001
  9    total_bc_limit                            1        9              20.5846      <.0001
 10    num_actv_bc_tl                            1       10              47.8554      <.0001
 11    term_36                                   1       11              20.8089      <.0001
 12    inq_last_6mths                            1       12              23.3802      <.0001
 13    mort_acc                                  1       13              20.3845      <.0001
 14    grade_C                                   1       14              12.6847      0.0004
 15    ver_not                                   1       15               9.4022      0.0022
 16    home_own                                  1       16               8.7725      0.0031
 17    grade_D                                   1       17               6.9997      0.0082

This table of information shows that there were 17 variables that made it into the final model. These variables are listed in the order in which they entered the model, which broadly reflects their predictive power as defined by the score Chi-Square value. Although we specified that the selection method was stepwise, in this particular model, none of the variables that were initially added to the model were replaced by new variables being added to the model. The Removed column would show where a new variable replaced a previous variable and at which step in the model that activity occurred.

Maximum Likelihood Estimates

The Maximum Likelihood Estimates table provides the information that we would typically use to state our model. The standard output for this table is that the intercept value is stated first, and then the model predictors are listed in alphabetical order. However, I have reordered the list of predictors by sorting them by their Chi-Square values. This adjustment allows us to see which variables carry the most predictive power in the model. Output 7.6 shows the re-sorted Analysis of Maximum Likelihood Estimates table.

Output 7.6: Logistic Regression Maximum Likelihood Estimates

Analysis of Maximum Likelihood Estimates

Parameter               DF    Estimate     Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept                1     -3.5418         0.0758            2182.8133        <.0001
int_rate                 1      0.1144         0.00379            913.5426        <.0001
acc_open_past_24mths     1      0.0744         0.00421            311.9744        <.0001
dti                      1      0.013          0.00128            102.5576        <.0001
emp_NA                   1      0.4312         0.0426             102.5457        <.0001
total_bc_limit           1   -8.48E-06        1.12E-06             57.261         <.0001
tot_hi_cred_lim          1   -9.90E-07        1.32E-07             56.2152        <.0001
loan_amnt                1      0.000012      1.72E-06             46.2241        <.0001
num_actv_bc_tl           1      0.044          0.00676             42.4342        <.0001
home_mort                1     -0.1802         0.0293              37.7032        <.0001
term_36                  1     -0.1451         0.0268              29.2797        <.0001
inq_last_6mths           1      0.0903         0.0187              23.2999        <.0001
grade_A                  1     -0.2291         0.0518              19.575         <.0001
grade_C                  1      0.109          0.0259              17.74          <.0001
mort_acc                 1     -0.04           0.0095              17.6877        <.0001
ver_not                  1     -0.0832         0.0275               9.1287        0.0025
home_own                 1     -0.1067         0.0363               8.6406        0.0033
grade_D                  1      0.0776         0.0293               6.9983        0.0082

We can further refine this information to make it more user-friendly with some data manipulation. Output 7.7 shows a new column labeled “Percent Contribution.” This newly calculated field is simply each predictor’s Wald Chi-Square value divided by the sum of the Wald Chi-Square values across all predictors (excluding the intercept).
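The Percent Contribution column is not a standard PROC LOGISTIC output. One way to reproduce it (a sketch, not a program from the book, assuming the ODS table name ParameterEstimates and its WaldChiSq column) is to capture the estimates table while refitting the Program 7.5 model and then divide each Wald Chi-Square by the total across the non-intercept parameters:

/* Capture the Analysis of Maximum Likelihood Estimates table */
ODS OUTPUT ParameterEstimates=mle_est;
PROC LOGISTIC DATA=TRAIN DESCENDING;
       MODEL BAD = &num_vars. / SELECTION=STEPWISE SLE=0.01 SLS=0.01;
RUN;

/* Percent contribution = Wald Chi-Square / sum of Wald Chi-Squares (intercept excluded) */
PROC SQL;
       CREATE TABLE mle_contrib AS
       SELECT Variable, Estimate, WaldChiSq,
              WaldChiSq / (SELECT SUM(WaldChiSq) FROM mle_est
                           WHERE UPCASE(Variable) NE 'INTERCEPT')
              AS pct_contribution FORMAT=percent8.1
       FROM mle_est
       WHERE UPCASE(Variable) NE 'INTERCEPT'
       ORDER BY WaldChiSq DESC;
QUIT;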

Notice that the variable int_rate contains 50.7% of the model’s predictive value. This makes intuitive sense since the interest rate is based on either a separate predictive model or a set of heuristic rules that reflect the riskiness of the borrower. It would be a reasonable approach to remove this variable from the list of predictors that are fed into the model and rerun the algorithm. We would expect that a few new variables could replace the overbearing int_rate variable, and we should expect that the model’s predictive power will decrease slightly due to the removal of this powerful variable.

I have also rescaled three of the predictors to make their interpretation clearer. The variables total_bc_limit, tot_hi_cred_lim, and loan_amnt have model estimates that are very small. This is because a single dollar increase in a borrower’s loan amount changes the log odds of the dependent variable by only 0.000012. We can easily divide these values by 10,000 so that the current variable loan_amnt is transformed into a new variable “loan amount per $10,000,” and the new model estimate is 0.12. This adjustment can be interpreted as the log odds of BAD increasing by 0.12 for every $10,000 borrowed, with everything else being held constant.

Output 7.7: Transformed Maximum Likelihood Estimates

Analysis of Maximum Likelihood Estimates

Parameter                   DF    Estimate      Standard Error    Wald Chi-Square    Percent Contribution
Intercept                    1    -3.541800         0.08              2,182.81
int_rate                     1     0.114400         0.00                913.54              50.7%
acc_open_past_24mths         1     0.074400         0.00                311.97              17.3%
dti                          1     0.013000         0.00                102.56               5.7%
emp_NA                       1     0.431200         0.04                102.55               5.7%
total_bc_limit / 10000       1    -0.084800         0.00                 57.26               3.2%
tot_hi_cred_lim / 10000      1    -0.009900         0.00                 56.22               3.1%
loan_amnt / 10000            1     0.120000         0.00                 46.22               2.6%
num_actv_bc_tl               1     0.044000         0.01                 42.43               2.4%
home_mort                    1    -0.180200         0.03                 37.70               2.1%
term_36                      1    -0.145100         0.03                 29.28               1.6%
inq_last_6mths               1     0.090300         0.02                 23.30               1.3%
grade_A                      1    -0.229100         0.05                 19.58               1.1%
grade_C                      1     0.109000         0.03                 17.74               1.0%
mort_acc                     1    -0.040000         0.01                 17.69               1.0%
ver_not                      1    -0.083200         0.03                  9.13               0.5%
home_own                     1    -0.106700         0.04                  8.64               0.5%
grade_D                      1     0.077600         0.03                  7.00               0.4%

Output 7.7 also shows us that if we wanted to develop a parsimonious model and select only the variables that significantly contribute to the overall predictive power of the model, we could comfortably retain the top 5 or 6 variables and still have nearly the same overall strength of the model.

Odds Ratio Estimates

The next section of the model output shows the table of Odds Ratio Estimates. The PLOTS=ALL option also provides a visual description of the table. These pieces of output are shown in Output 7.8.

Output 7.8: Logistic Regression Odds Ratios

Odds ratios are easy to interpret. If a variable has an odds ratio of 1, this means that the variable has no influence on the target variable. An odds ratio greater than 1 means that there are higher odds of the outcome event happening when there is exposure to this variable. An odds ratio of less than 1 means that there are lower odds of the outcome happening when there is exposure to this variable.

Let’s look at our example. The variables total_bc_limit, tot_hi_cred_lim, and loan_amnt have odds ratios of one. This makes sense in their raw form because the model estimate for loan_amnt is 0.000012, which is nearly zero. If the odds ratios were extended to seven decimal places, then we would see a slight impact on the odds of the outcome event happening in relation to that variable. If the Odds Ratio Estimates table were constructed with the newly scaled versions of these variables (example: loan amount per $10,000 = 0.12), then we would see distinct positive or negative odds ratios for those scaled variables.

The variable grade_A has an odds ratio of 0.795. This can be interpreted as follows: when the dummy variable grade_A equals 1, there are lower odds of the target variable BAD being positive. This makes sense because grade_A represents the loan category with the least risky borrowers. If a loan is categorized as grade A, then there is less of a chance (lower odds) of it going bad.

In contrast, the variables grade_C and grade_D represent riskier loans. When either of these variables equals 1, there is an increased chance that the response variable BAD will be positive.
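Because the odds ratio for a one-unit change is e raised to the parameter estimate, Output 7.8 can be tied directly back to Output 7.6. Using the grade_A estimate of -0.2291 from the maximum likelihood table:

Odds ratio for grade_A = e^(-0.2291) ≈ 0.795

which matches the value reported in the Odds Ratio Estimates table.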

Model Evaluation Metrics

Up to this point, we have seen SAS output related to several stages of the model development: what was fed into the model, how the model was constructed, each step of the model selection process, the point estimates, and the odds ratios. These are all incredibly valuable pieces of information, but they do not answer one of the most important questions: Is my model any good?

For binary classification models, there are several methods of model evaluation. We will cover several of these methods in the chapter devoted to evaluating model output. For now, we will focus on the two main evaluation methods, which are standard outputs for PROC LOGISTIC:

● Gini metric and Somers’ D score

● ROC and AUC curves

Gini Metric and Somers’ D scores

One of the standard model output tables that the PROC LOGISTIC statement creates is titled “Association of Predicted Probabilities and Observed Responses.” This table contains an important model evaluation metric and the information about how this metric was constructed. Output 7.9 shows this table and its associated components.

Output 7.9: Logistic Regression Evaluation Metrics

The main metric that this table of information displays is the Somers’ D. This metric is also commonly called the Gini metric or the Accuracy Ratio. Although there are different ways of calculating these metrics, they all mean the same thing. The default methodology of calculating the Somers’ D statistic in PROC LOGISTIC is through the concordance and tied percent. All of this information is contained in the above table, but let’s review how this information is created and used.

Step 1: The logistic regression output creates an output table that contains the actual event data and the predicted probability field. This data is separated into two data sets. The first data set contains all of the observations where the target event (BAD) = 1 along with the associated predicted probabilities for those observations. The second data set contains all of the observations where the target event (BAD) = 0 along with the associated predicted probabilities for those observations.

Step 2: SAS creates a matrix data set where each observation in the first data set is compared to each observation in the second data set. This is a Cartesian product (cross-join) of events and non-events. The volume of this matrix data set is displayed in Output 7.9 in the field titled Pairs; it is the number of events multiplied by the number of non-events, which for this data set is roughly 629 million pairs.

Step 3: Concordance is evaluated. A pair of observations is concordant if the observation where event = 1 has a higher predicted probability than the observation where event = 0. A pair is discordant if the opposite is true, that is, the observation where event = 0 has a higher predicted probability than the observation where event = 1. A pair is tied if the two observations have the same predicted probability.

Step 4: Final percent values are calculated:

Percent Concordant = 100*[(Number of concordant pairs) / Total number of pairs]

Percent Discordant = 100*[(Number of discordant pairs) / Total number of pairs]

Percent Tied = 100*[(Number of tied pairs) / Total number of pairs]

Step 5: Evaluation metrics calculated:

Somers’ D = 2 × AUC – 1 (also calculated as (Percent Concordant – Percent Discordant) / 100)

◦ This metric is used to determine the strength and direction of the relationship between pairs of variables. Its values range from -1.0 (all pairs disagree) to 1.0 (all pairs agree).

◦ This is very similar to the Gini metric produced by other binary classifiers.

Gamma – Uses nearly the same methodology as the Somers’ D metric, but the Gamma metric does not penalize for ties. Because it does not penalize for ties, its value will generally be higher than the Somers’ D value.

Tau-a = 2 × (number of concordant pairs – number of discordant pairs) / (N(N – 1))

◦ The denominator of this equation represents all possible pairs.

◦ This value is generally much lower than the Somers’ D value since there are generally many paired observations with the same response.

Area Under the Curve (AUC) = (Percent Concordant + 0.5 * Percent Tied) / 100

◦ This is also labeled as “c” in the association table shown in Output 7.9. This value is also often represented as a Receiver Operating Characteristic (ROC) curve.

◦ This value ranges from 0.5 to 1.0 where 0.5 represents a model randomly selecting a response and 1.0 represents a model perfectly predicting a response.

◦ Output 7.10 displays the ROC curve produced by PROC LOGISTIC.

Output 7.10: ROC Curve

Detailed information about the ROC curve as well as several model evaluation metrics will be provided in the chapter dedicated to model evaluation.
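To make the concordance calculation concrete, the following sketch (not a program from the book) approximates the concordant, discordant, and tied percentages directly from the scored TRAIN output created by Program 7.5 (MYDATA.LOG_REG_PROB, where the predicted probability is named score). The full event/non-event matrix is several hundred million pairs, so the sketch first draws a small random sample:

/* Draw a small random sample so the event/non-event cross join stays manageable */
PROC SURVEYSELECT DATA=MYDATA.LOG_REG_PROB OUT=samp SAMPSIZE=2000 SEED=1234;
RUN;

/* Compare every sampled event (BAD=1) with every sampled non-event (BAD=0) */
PROC SQL;
       SELECT MEAN(e.score > n.score) AS pct_concordant FORMAT=percent8.2,
              MEAN(e.score < n.score) AS pct_discordant FORMAT=percent8.2,
              MEAN(e.score = n.score) AS pct_tied FORMAT=percent8.2
       FROM samp(WHERE=(BAD=1)) AS e,
            samp(WHERE=(BAD=0)) AS n;
QUIT;

Somers’ D is then the concordant proportion minus the discordant proportion, and the AUC follows from the formula above; because this is only a sample, the values will approximate, not match, the figures in Output 7.9.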

Scoring the TEST Data Set

At this point, we have selected and transformed our TRAIN data set, developed our logistic regression model, and evaluated the results on the TRAIN data set. However, in order to truly evaluate the predictive accuracy of our model, we will need to apply it to a hold-out TEST population.

There are two main methods of applying our model to a data set that was not used for the development of the model.

1. The SCORE statement in PROC LOGISTIC

2. Hard-coding the equation into a DATA step

Both of these methods will produce the same score for a given data set.

Transform the Hold-out TEST Data Set

The first step to scoring a hold-out TEST data set is to make sure that the data set has been exposed to the same treatments and filters that were applied to the TRAIN data set. For our example, many of the original data transformations (missing value imputation, feature engineering, outlier adjustments) were already applied to the TEST data set. This process was detailed in Chapter 5: Create a Modeling Data Set.

However, in this chapter, prior to developing the logistic regression model we made a few additional filters to the data. Due to a ramp-up period and a lag in the response variable, we made the decision to restrict the data to loan issue dates between Jan 1, 2015, and Dec 31, 2015.

If we do not restrict the data in the same way, then we will get very poor model results on the TEST data set. This is because the model was designed to calculate the probability of the dependent variable being positive given a set of independent variables during a time frame where the dependent variable has an event rate of 18.01% (Output 7.1). If the model is applied to earlier or later time periods where the event rate is much lower, the model will underperform.

Program 7.7 filters the MYDATA.MODEL_TEST data set to the same time period as the TRAIN data set. Again, the only reason that we are selecting these dates is because the event rate varies across the entire time period of the Lending Club data set.

Program 7.7: Limit the TEST Data Set by Time Frame

DATA TEST;
  SET MYDATA.MODEL_TEST;
  WHERE '01JAN2015'd le issue_date le '01DEC2015'd;
RUN;
PROC FREQ DATA=TEST; TABLES BAD; RUN;

Output 7.11 shows that the event rate for the TEST data set for this time period is 17.66%. This is within a half-percent range of the event rate of the TRAIN data set.

Output 7.11: TEST Data Set Target Value Frequency Distribution

Now that we have selected the appropriate time frame for the TEST data set, we can apply the model to the TEST data set.

SCORE Option

PROC LOGISTIC includes a SCORE statement that allows you to specify a data set that you want to score with the model that was developed on the TRAIN data set. Program 7.8 shows this statement in context with the original PROC LOGISTIC. Notice that the evaluation options (PLOTS, CORRB, OUTROC) have all been removed from the program. You can retain these options if you want, but they will just reproduce the same evaluation output that we already examined in the initial model development section.

Program 7.8: Score the TEST Data Set in the PROC LOGISTIC Program

PROC LOGISTIC DATA=TRAIN DESCENDING PLOTS=NONE;
       MODEL BAD = &num_vars. / SELECTION=STEPWISE SLE=0.01 SLS=0.01;
       SCORE DATA=TEST OUT=TEST_SCORE;
RUN;

The SCORE statement allows you to specify the data set that you want to score. The DATA=TEST option tells SAS that you want to score the hold-out TEST data set, and the OUT=TEST_SCORE option tells SAS that you want to create an output data set that contains the probability scores and all of the original data from the TEST data set.

The default output of the SCORE statement contains all of the original data from the TEST data set and two additional fields:

● P_1 – The probability that the observation has a dependent variable = 1

● P_0 – The probability that the observation has a dependent variable = 0

We can examine the scored TEST data with the use of PROC MEANS. Program 7.9 develops this program.

Program 7.9: Analyze Model Predicted Values

PROC MEANS DATA=TEST_SCORE N NMISS MIN MAX MEAN;
       VAR BAD P_0 P_1;
RUN;

Program 7.9 creates an analysis table that shows the values for the original target variable BAD along with the two predicted probability variables. Output 7.12 shows the analytical output.

Output 7.12: PROC MEANS Output

This table shows that the average rate of the actual target variable BAD is 17.66%. The variable P_1 shows the predicted probability of the event, and the average predicted probability is 17.95%. This gives us an initial high-level assessment of the predictive accuracy of the model: the actual event rate and the predicted event rate are not far from one another.

Evaluation Macros

We have seen that the standard LOGISTIC procedure provides evaluation metrics (Somers’ D and ROC chart) for the data set that the model was developed on. However, the standard model output does not contain evaluation metrics for the scored data set. In order to evaluate the scored data set, you have to calculate the evaluation metrics yourself.
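Before turning to a macro, note one quick workaround (a sketch, not from the book): refit PROC LOGISTIC on the scored TEST data with the predicted probability P_1 as the only predictor. Because a one-variable logistic fit is a monotone transformation of P_1, the c statistic and Somers’ D in the resulting association table equal the AUC and Gini of the scored data (the ODS SELECT below assumes the table name Association):

ODS SELECT Association; /* keep only the association statistics table */
PROC LOGISTIC DATA=TEST_SCORE DESCENDING;
       MODEL BAD = P_1;
RUN;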

Luckily, there are many evaluation macros already developed and free to use on the internet. I will explore some of these in detail in the chapter on model evaluation metrics. But for now, I will quickly apply one of the model evaluation macros that I gathered from the internet and use it to assess the accuracy of the model on the TEST data set.

The model evaluation macro that I will use was developed by Wensui Liu and is labeled “separation.” This easy-to-use macro creates several evaluation metrics and charts. I have placed this macro on my C drive, and I call it with a %INCLUDE statement. Program 7.10 shows the application of this macro.

Program 7.10: Application of Macro

%INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Projects/separation.sas';
%separation(data = TEST_SCORE, score = P_1, y = bad);

The first line simply states where I placed the macro and the name of the macro. The actual macro is called with the %separation statement followed by specifications of your data set that you want to evaluate.

● The data parameter specifies the name of the data set that you want to evaluate.

● The score parameter specifies the name of the predicted probability variable.

● The y parameter specifies the name of the target variable.

With these three simple statements, the macro will develop a detailed analysis of the separation power and accuracy of your scored data set. Output 7.13 shows one of the output tables generated by this macro. This output table is called a lift table and will be reviewed in detail in the model evaluation metrics chapter.

The main takeaways from this output for our purposes are the AUC and Gini scores. In the header information in Output 7.13, we can see that the TEST data set has been evaluated with an AUC = 0.7097 and a Gini = 0.4193.

Output 7.13: TEST Data Set Lift Table

GOOD BAD SEPARATION REPORT FOR P_1 IN DATA TEST_SCORE

( AUC STATISTICS = 0.7097, GINI COEFFICIENT = 0.4193, DIVERGENCE = 0.5339 )

 

MIN SCORE   MAX SCORE    GOOD #    BAD #   TOTAL #    ODDS   BAD RATE   CUMULATIVE BAD RATE   BAD PERCENT   CUMU. BAD PERCENT
   0.342       0.7104     1,689    1,088     2,777    1.55     39.18%                39.18%        22.18%              22.18%
   0.2708      0.342      1,947      830     2,777    2.35     29.89%                34.53%        16.92%              39.10%
   0.2239      0.2708     2,112      665     2,777    3.18     23.95%                31.00%        13.56%              52.66%
   0.1874      0.2239     2,196      581     2,777    3.78     20.92%                28.48%        11.85%              64.51%
   0.1562      0.1874     2,288      490     2,778    4.67     17.64%                26.31%         9.99%              74.50%
   0.1284      0.1562     2,384      393     2,777    6.07     14.15%                24.29%         8.01%              82.51%
   0.1021      0.1284     2,438      339     2,777    7.19     12.21%                22.56%         6.91%              89.42%
   0.0766      0.1021     2,548      229     2,777   11.13      8.25%                20.77%         4.67%              94.09%
   0.0533      0.0766     2,581      196     2,777   13.17      7.06%                19.25%         4.00%              98.08%
   0.0142      0.0533     2,683       94     2,777   28.54      3.38%                17.66%         1.92%             100.00%
   0.0142      0.7104    22,866    4,905    27,771

(Rows are ordered from the highest-scoring group, labeled BAD in the original report, down to the lowest-scoring group, labeled GOOD; the final row is the overall total.)

The AUC for the development TRAIN data set was 0.7111, and the Somers’ D score was 0.422. When we compare these scores to the scores that were generated on the hold-out TEST data set, we can see that the TEST metrics are slightly lower for both the AUC and the Gini metrics. This shows that the model is not overfitting the data and can be adequately applied to other data sets.
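As a quick arithmetic check of the Somers’ D = 2 × AUC − 1 relationship described earlier, 2 × 0.7111 − 1 = 0.4222 ≈ 0.422 for the TRAIN data and 2 × 0.7097 − 1 = 0.4194 ≈ 0.4193 for the TEST data, so the reported Gini values are consistent with the reported AUC values.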

Hard-coding Scoring Method

This scoring method allows you to input all of the components of the logistic regression model directly into a DATA step. All of the information that we need to define the logistic regression equation is contained in the model output that was generated from the LOGISTIC procedure developed on the TRAIN data set (Output 7.6).

Program 7.11 shows the development of the logistic regression equation within the same data set that we used to filter the TEST data set.

Program 7.11: Hard-coding Scoring Method

DATA TEST_SCORE;
       SET MYDATA.MODEL_TEST;
       WHERE '01JAN2015'd le issue_date le '01DEC2015'd;
       /*Model variables and coefficients from TRAIN model output*/
       xb = (-3.5418) +
       int_rate * 0.1144 +
       acc_open_past_24mths * 0.0744 +
       dti * 0.013 +
       emp_NA * 0.4312 +
       (total_bc_limit / 10000) * -0.0848 +
       (tot_hi_cred_lim / 10000) * -0.0099 +
       (loan_amnt / 10000) * 0.12 +
       num_actv_bc_tl * 0.044 +
       home_mort * -0.1802 +
       term_36 * -0.1451 +
       inq_last_6mths * 0.0903 +
       grade_A       * -0.2291 +
       grade_C       * 0.109 +
       mort_acc * -0.04 +
       ver_not * -0.0832 +
       home_own * -0.1067 +
       grade_D * 0.0776;
score = exp(xb)/(1+exp(xb));
RUN;

Notes on Program 7.11:

● A DATA step is used to develop and process the logistic regression equation. This DATA step inputs data from the MYDATA.MODEL_TEST data set.

● The data is filtered to the defined date range in order to be aligned with the developmental data’s event rate. At this point, the data set is exactly the same as the TEST data set that we fed into the previous PROC LOGISTIC with the SCORE option.

● The logistic regression equation is created by inputting the model intercept and each model variable along with their associated weighted values.

● Notice that the three variables with extremely small weights (total_bc_limit, tot_hi_cred_lim, loan_amnt) have been transformed by dividing each value by 10,000. Their associated weighted values have been adjusted accordingly.

● The final predicted score value is labeled “score”. This value is based on the logistic function that was introduced in Equation 7.2.

The hard-coded scoring technique is very useful because you do not have to redevelop PROC LOGISTIC every time that you want to apply the model to a new data set. It is also very practical when you move your model into production and have to hand over the model to your IT team to implement the model in a live environment. It is impractical to run PROC LOGISTIC every time you want to score a new loan applicant.
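As a quick sanity check (a sketch, not a program from the book), we can summarize the hard-coded score the same way that Program 7.9 summarized P_1. The average predicted probability should land close to the 17.95% reported in Output 7.12, with small differences caused by the rounded coefficients:

PROC MEANS DATA=TEST_SCORE N NMISS MIN MAX MEAN;
       VAR BAD score;
RUN;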

Linear Discriminant Analysis

We just reviewed how the logistic regression model directly models the conditional distribution of the response variable Y given the predictors X (Equation 7.2). The Linear Discriminant Analysis (LDA) model also has the goal of estimating these conditional probabilities, but with a slightly different approach. The LDA algorithm models the distribution of the predictors (X) separately for each response class (Y). The algorithm then incorporates Bayes’ theorem to transform these distributions into a form very similar to the logistic regression form.
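In symbols (standard Bayes’ theorem notation, not shown in the original text), if πk is the prior probability of class k and fk(x) is the distribution of the predictors within class k, the posterior probability that drives the classification is:

Pr(Y = k | X = x) = πk fk(x) / (π1 f1(x) + … + πK fK(x))

LDA estimates the πk and fk(x) terms from the training data (assuming normally distributed predictors) and assigns each observation to the class with the largest posterior probability.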

The LDA algorithm develops a discriminant function that is also known as the classification criterion. The target variable defines a set of groups, and the discriminant function classifies each observation into one of the defined groups. A measure of generalized squared distance determines this discriminant function. These groups are projected onto a lower-dimensional plane that maximizes the separation between the groups. The discriminant function determines the minimum number of dimensions needed to describe the differences between the groups.

Because the LDA algorithm is essentially projecting the data set onto a lower-dimensional space, it is often used as a dimensionality reduction technique. It is very similar to Principal Component Analysis (PCA); however, while PCA attempts to find the component axes that maximize the variance of the data, the LDA algorithm attempts to find the component axes that maximize the separation between groups.

Because of the projection mapping methodology, LDA can be used in two different ways:

1. Descriptive Discriminant Analysis – Data pre-processing dimensionality reduction methodology that reduces several independent variables onto a projected lower-dimensional plane. These newly created variables that maximize the separation between groups can be fed into other predictive models.

2. Predictive Discriminant Analysis – This is a machine learning classification technique that predicts the probability of two or more classes.

In this chapter, we will focus on the predictive discriminant analysis approach.

When to Use LDA for Prediction

Although logistic regression is the standard modeling approach for parametric classification models, there are specific instances when you might use the LDA model instead of a logistic regression model:

1. When there are more than two response classes.

2. When classes are well-separated. (In this case, the parameter estimates for the logistic regression model can be unstable, so the LDA is preferable.)

3. There is a small number of observations.

4. The distribution of the predictors is approximately normal.

If your data set contains one or more of the above characteristics, then the LDA model could provide significant predictive power with stable model coefficients.

PROC DISCRIM

The linear discriminant analysis predictive model can be created in SAS with PROC DISCRIM. This flexible procedure allows researchers to customize their model by specifying a variety of options, including whether the model will be parametric or non-parametric, whether the covariance matrices will be pooled, the threshold for classification, and the cross-validation method.

For our purposes, we will develop a basic LDA predictive model on the same Lending Club data set that we used for the logistic regression model and the same target variable BAD. This will allow us to compare the logistic regression and LDA output.

Program 7.12 creates a macro variable called “log_vars” that contains all of the predictors that were significant in the logistic regression model. This list of variables is fed into PROC DISCRIM.

Program 7.12: Create a Macro Variable That Contains All the Predictor Variables

%LET log_vars = int_rate acc_open_past_24mths dti emp_NA total_bc_limit tot_hi_cred_lim loan_amnt num_actv_bc_tl home_mort term_36 inq_last_6mths grade_A grade_C mort_acc ver_not home_own grade_D ;
PROC DISCRIM DATA=TRAIN OUTSTAT=DIS out=discrim_out 
       TESTDATA=TEST TESTOUT=TEST_OUT;
       CLASS BAD;
       VAR &log_vars.;
RUN;

Notes on Program 7.12:

● PROC DISCRIM is used to create the LDA model on the TRAIN data set.

● The OUTSTAT= option tells SAS to create an output data set containing various statistics such as means, standard deviations, and correlations.

● The OUT= option tells SAS to create an output data set that contains all of the TRAIN data set variables along with the model score (also called the “posterior probabilities”).

● The TESTDATA= option specifies that we want to score the hold-out TEST data set with the LDA model that was constructed on the TRAIN data set.

● The TESTOUT= option tells SAS to create an output data set that contains all of the TEST data set variables along with the model score.

● The CLASS statement specifies the groups for analysis.

● The VAR statement specifies the predictive variables for the model. In this example, I have limited the list of variables to the variables that were indicated as significant in the logistic regression model.

PROC DISCRIM Model Output

PROC DISCRIM creates a variety of output including summary information about the modeling data set, generalized squared distance, and the linear discriminant function for the target variable.

Model Summary Information

The first set of outputs provides verification of your model inputs. Output 7.14 shows that the model is developed on the TRAIN data set, which contains 17 predictors and a target variable that has two classes. It also specifies that the model used all 65,252 observations in the development of the model, and no observations were excluded.

The Class Level Information table shows the frequency and proportion of the target variable and the prior probability of that target variable.

Output 7.14: PROC DISCRIM Output

Method Details

PROC DISCRIM allows you to specify whether you want to use the pooled or within-group covariance matrix to calculate the generalized squared distances. The default option is POOL=YES. This option specifies that the model will compute the linear discriminant function. If the option was changed to POOL=NO, then the procedure would use the within-group covariance matrix and the model would compute the quadratic discriminant function.

Output 7.15 shows the natural log of the determinant of the pooled covariance matrix.

Output 7.15: Pooled Covariance Matrix

The procedure also allows you to specify whether you want to create a parametric or non-parametric model. This is controlled with the METHOD= option. The default value is METHOD=NORMAL, which creates a parametric LDA model. If the option is changed to METHOD=NPAR, the procedure uses a non-parametric approach to develop the model.
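As a sketch of how these options change the model (not a program from the book; the output data set names TEST_QDA and TEST_NPAR are illustrative), the two variations described above could be requested as follows:

/* Quadratic discriminant function: within-group covariance matrices */
PROC DISCRIM DATA=TRAIN TESTDATA=TEST TESTOUT=TEST_QDA POOL=NO;
       CLASS BAD;
       VAR &log_vars.;
RUN;

/* Non-parametric (k-nearest-neighbor) discriminant analysis; the K= value is an arbitrary illustration */
PROC DISCRIM DATA=TRAIN TESTDATA=TEST TESTOUT=TEST_NPAR METHOD=NPAR K=5;
       CLASS BAD;
       VAR &log_vars.;
RUN;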

Linear Discriminant Function

The linear discriminant function is similar to the logistic regression model estimates. It shows how the data is used to classify an observation into one of the given groups (target levels). This example shows a binary classification of 0 and 1. However, if there were more than two classification groups, there would be a separate linear discriminant function for each level of classification. Output 7.16 shows the linear discriminant function for the binary target BAD.

Output 7.16: Linear Discriminant Function

Classification Matrix and Error Rate

PROC DISCRIM outputs a classification matrix (also called a confusion matrix). This table of information shows the actual number of observations for a given group (target level) and the predicted number of observations that were classified for a given group. The table also supplies the row percentage. These values allow the researcher to determine the number of predictions that the model got right and the number of predictions that the model got wrong.

If the actual value for an observation is 1 and the predicted value for that observation is categorized as 1, then it is an accurate prediction. Output 7.17 shows that when the actual value is 1, the model correctly predicts the value in 62.58% of the observations and incorrectly predicts a value of 0 in the remaining 37.42% of the observations.

Classifications are determined by the 0.5 threshold value. If a predicted probability is less than 0.5, then the observation is classified as group 0. If a predicted probability is greater than or equal to 0.5, then the observation is classified as group 1. Output 7.17 displays the classification matrix.

Output 7.17: Confusion Matrix

The Error Count Estimates table shows the percentage of errors compared to the prior probability values. This table shows that given a random assignment of a classification group, we would expect 50% of the observations to be incorrect. However, for observations where the actual value is 1, the model provides a predicted value that is incorrect in 37% of the observations. These are Type II errors that represent false negatives. Even though a relatively high percentage of the observations is incorrectly categorized, a comparison to the prior probabilities shows that the model provides a significant lift over a random assignment.

Output 7.18 shows the classification matrix for the scored hold-out TEST data set. The first part of the table verifies that all of the TEST data observations were able to be scored.

Output 7.18: TEST Data Set Confusion Matrix

The remaining information in the classification matrix and the error count estimate are very similar to the data generated from the TRAIN data set. This shows that the model is not overfitting. If the error counts for the TRAIN data set were much lower than the error counts for the TEST data set, then we would have to be concerned that the model was overfitting the TRAIN data and not generalizing well to the hold-out TEST data.
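If you want to reproduce the TEST classification matrix yourself (a sketch, not a program from the book), the posterior probability for the BAD = 1 group in the TESTOUT= data set is stored in a variable named 1 (referenced as '1'N, as discussed in the next section), and the default 0.5 threshold can be applied in a DATA step:

DATA test_class;
       SET TEST_OUT;
       predicted_bad = ('1'N >= 0.5); /* 1 if the posterior probability is at least 0.5 */
RUN;

PROC FREQ DATA=test_class;
       TABLES BAD*predicted_bad / NOCOL NOPERCENT; /* counts and row percentages */
RUN;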

Evaluation Macro

As a final look at the LDA predictive model, we can use the same evaluation macro that we applied to the logistic regression. These types of macros are flexible evaluation tools that can be applied to any data set that has a binary target variable and a continuous predicted probability score.

Program 7.13 shows the code that applies the macro created by Wensui Liu to the scored TEST data set.

Program 7.13: Application of Macro

%INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Projects/separation.sas';
%separation(data = TEST_OUT, score = '1'N, y = BAD);

This is obviously very similar to the scoring code that we developed for the logistic regression model. However, notice that the score field is labeled '1'N. This odd naming convention is a result of the default predicted probability field created by the DISCRIM procedure. In this case, the field is labeled 1. In order to specify that field, SAS requires us to refer to the field as '1'N.

Output 7.19 shows the lift table generated by the evaluation macro. If you compare this lift table to the one applied to the logistic regression model output (Output 7.13), you will see that they are very similar. This shows that for this particular data set, the two modeling procedures (log reg and LDA) produce very similar results.

Output 7.19: Lift Table Generated by the Evaluation Macro

GOOD BAD SEPARATION REPORT FOR '1'N IN DATA TEST_OUT

( AUC STATISTICS = 0.7097, GINI COEFFICIENT = 0.4193, DIVERGENCE = 0.5734 )

MIN SCORE   MAX SCORE    GOOD #    BAD #   TOTAL #    ODDS   BAD RATE   CUMULATIVE BAD RATE   BAD PERCENT   CUMU. BAD PERCENT
   0.7196      0.9393     1,685    1,092     2,777    1.54     39.32%                39.32%        22.26%              22.26%
   0.6235      0.7195     1,967      810     2,777    2.43     29.17%                34.25%        16.51%              38.78%
   0.5505      0.6235     2,093      684     2,777    3.06     24.63%                31.04%        13.94%              52.72%
   0.4869      0.5505     2,200      577     2,777    3.81     20.78%                28.47%        11.76%              64.49%
   0.4322      0.4868     2,291      487     2,778    4.7      17.53%                26.29%         9.93%              74.41%
   0.3815      0.4321     2,400      377     2,777    6.37     13.58%                24.17%         7.69%              82.10%
   0.332       0.3815     2,437      340     2,777    7.17     12.24%                22.46%         6.93%              89.03%
   0.2819      0.332      2,531      246     2,777   10.29      8.86%                20.76%         5.02%              94.05%
   0.2286      0.2819     2,584      193     2,777   13.39      6.95%                19.23%         3.93%              97.98%
   0.0812      0.2286     2,678       99     2,777   27.05      3.56%                17.66%         2.02%             100.00%
   0.0812      0.9393    22,866    4,905    27,771

(Rows are ordered from the highest-scoring group, labeled BAD in the original report, down to the lowest-scoring group, labeled GOOD; the final row is the overall total.)

Chapter Review

The goal of this chapter was to introduce you to the concept of parametric classification models and demonstrate a few methods of producing those models in SAS. The logistic regression model is the go-to model for most binary target variable problems. This model design will deliver a high-performance model on large data sets with many predictors. Due to the parametric nature of the model design, the model can be generated quickly and efficiently.

However, the logistic regression and LDA model designs are not applicable for all binary target variable data sets. There are often cases where the target values are not well separated and cannot be segmented by a straight or curved line. This will be the subject of the next chapter on non-parametric classification models.