Chapter 4: Advanced SAS Coding – End-to-End Data Science with SAS

Chapter 4: Advanced SAS Coding

Overview

DO Loop

ARRAY Statements

SCAN Function

FIND Function

PUT Function

FIRST. and LAST. Statements

Macros Overview

Macro Variables

Automatic Macro Variables

User-Defined Macro Variables

Macros

Defining and Calling Macros

Chapter Review

Overview

We have covered several basic data exploration and data manipulation techniques. These techniques will certainly get you up and running with SAS programming, and you can spend years learning all the various options of each of these techniques and become proficient in these methods. However, there are several additional techniques that can significantly expand and improve a data scientist’s codebase. These methods can improve your ability to efficiently manipulate data and take your coding to the next level.

DO Loop

Looping is a convenient way to execute one or more statements, usually over changing parameters. This technique can be a highly efficient method of iterating through different levels of a variable and applying a computation to each level. You can replace a long series of IF-THEN statements with a single DO loop.

A simple example of a DO loop is to iterate over a small set of numbers and perform a calculation on each value in the number set.

Program 4.1: DO Loop Example

DATA X;
       DO i = 1 TO 10;
              y = i*2;
              OUTPUT;
       END;
RUN;

Notes on Program 4.1:

● The DO loop is performed inside of a DATA step.

● The DO loop starts with a DO statement and contains a TO specification.

● You assign an index variable (in this case, “i”) that is assigned a value in each cycle of the loop based on the TO specification.

● The calculation is performed using the current value of the index variable.

● You use the OUTPUT statement to output the current contents of the PDV in each cycle of the loop.

● The END statement marks the end of the iteration. The index value is incremented here.

● The loop iterates until the index variable value is outside of the range.

● The END statement marks the end of the statements within the loop.

The default for the DO loop is to iterate by single whole numbers; however, you have the option of specifying the unit increase in the DO loop with a BY statement.

Program 4.2: DO Loop with BY Statement

DATA X;
       DO i = 1 to 10 BY 2;
              y = i*2;
              OUTPUT;
       END;
RUN;

This code will loop through values 1 to 10 in increments of 2. So, your “i” values will be 1, 3, 5, 7, 9 and your output values will be 1, 6, 10, 14, 18.

One last option that we will review with the DO loop is the specification that uses WHILE. This statement will stop the DO loop if a certain condition occurs.

Program 4.3: DO Loop with WHILE Statement

DATA X;
       DO i = 1 to 10 BY 2 WHILE (y < 15);
              y = i*2;
              OUTPUT;
       END;
RUN;

In this example, the DO loop will check before each iteration to see if the condition y < 15 is true. It will continue to process the DO loop while that condition is true. When the condition becomes false, the restriction implemented by the WHILE statement will stop the looping.

ARRAY Statements

Many people find arrays challenging to learn. You are not alone in your struggles. However, the benefits of the ARRAY statement greatly outweigh the learning difficulties. I will attempt to break it down into very simple concepts.

If you have many lines of code in a DATA step where the only change is a variable name, then you will greatly benefit from learning the ARRAY statement. The ARRAY statement is often used in conjunction with a DO loop to perform the same calculation on several different variables, all with just a few lines of code.

Let’s learn this concept by example. Let’s say you have a data set that has the weight of 100 boxers with each boxer’s weight as its own variable. So, you have a data set with 100 columns of weight reported in pounds. However, you have been tasked to report these values in kilograms.

The formula to convert pounds to kilograms is to multiply the number of pounds (x) by 0.45: x * 0.45

Although this is a very easy calculation, you do not want to have to apply it once for each of the 100 variables. That would be 100 lines of code.

An ARRAY statement identifies a group of variables of a similar type that you want to perform an operation on. In our example, that would be all 100 of the boxer’s weight variables, and the operation that we would like to perform is to convert the weight as represented in pounds to kilograms. Each item in an array is referred to as an element. Array elements can be referenced using a numeric expression instead of a hard-coded variable name.

There are three essential components to an ARRAY statement

array array-name {n} array-elements ;

1. Array_name – This is what you decide to name your array. You can use any valid SAS name.

2. n – This is the number of elements in your array. In our example, there are 100 elements.

3. Elements – A list of common SAS variables that are part of the array. In our example, the elements are all of the boxer weight columns.

The following code combines an ARRAY statement and a DO loop (they often occur together) to perform the necessary conversion:

Program 4.4: ARRAY Example

DATA convert;
       SET boxers;
       ARRAY weight_array [100] weight1-weight100;
       DO i = 1 TO 100;
              weight_array{i} = weight_array{i} * 0.45;
       END;
RUN;

Notes on Program 4.4:

● The ARRAY statement is compiled within a DATA statement.

● The ARRAY statement performs the following steps:

◦ Assigns an array name (weight_array) that will be referenced in the calculation.

◦ Specifies the number of elements in the array (100 columns).

◦ Specifies the list of variables that will comprise the elements of the array (weight1 through weight100 from the boxer data set).

● A DO loop will loop through each of the 100 elements contained in the weight_array because the index variable, i, increments from 1 to 100 by 1 and in each cycle of the reference (weight_array{i} will resolve to each element of the array.

● The conversion calculation is executed on each of the elements in the array.

The resulting data set will contain the newly calculated kilogram weight variables. This code works great if you want to overwrite existing variables with a newly calculated variable. However, we often need to keep the original variable as it is and create brand new variables that contain the newly calculated values. This is an easy modification to the code by adding a second ARRAY statement that contains the new variables.

Program 4.5: ARRAY Statement Creates New Variables

DATA convert;
       SET boxers;
       ARRAY weight_array {100} weight1-weight100;
       ARRAY kilo_array {100} kilo_weight1-kilo_weight100;
       DO i = 1 TO 100;
              kilo_array{i} = weight_array{i} * 0.45;
       END;
RUN;

Notes on Program 4.5:

● The ARRAY statements are compiled within a DATA statement.

● The first ARRAY statement performs the following steps:

◦ Assigns an array name (weight_array) that will be referenced in the calculation.

◦ Specifies the number of elements in the array (100 columns).

◦ Specifies the list of variable that will comprise the elements of the array (weight1 through weight100 from the boxer data set).

● The second array performs the same steps for the newly created variables weight_array1 thorough weight_array100.

● A DO loop will loop through each of the 100 elements contained in the weight_array, perform the conversion calculation and assign the result to the corresponding element of the kilo_array.

The resulting data set will contain all of the original weight variables along with the newly calculated kilogram weight variables.

Note that arrays created with the ARRAY statement exist as the DATA step processes as a way to refer to a series of variables. The arrays are not stored in data sets, only the variables processed or created with arrays.

SCAN Function

The SCAN function is a very easy way to parse text data from character strings. Data sets will often contain fields such as name, address, city, and state, or any character variable where there is more than one piece of information contained in a single variable. The classic example is customer_name. This field can contain first name, last name, and title. Usually, these fields are represented in a consistent order such as “Gearheart, James, Mr.”. We will often want to parse out the first name and last name into separate variables. The SCAN function is an easy way to accomplish this.

Program 4.6: SCAN Example

DATA names;
  SET dataset;
  first_name = SCAN(customer_name, 2);
  last_name = SCAN(customer_name, 1);
RUN;

Notes on Program 4.6:

● The SCAN function is performed inside of a DATA statement.

● The SCAN function has two required arguments:

◦ The first argument inside of the parentheses specifies the text field that you would like to parse.

◦ The second argument inside of the parentheses indicates the number of the word to be extracted from the text string specified in the first argument.

● SCAN determines where one word ends and the next begins using a pre-defined set of delimiters. If the pre-set delimiters don’t delimit words the way you need for your application, you can specify your own list of delimiters as the third argument within the parentheses. If you do choose to specify your own delimiters, only the delimiters specified will be used by the SCAN functions; the original delimiter list is ignored.

FIND Function

The FIND function is another very useful function to apply to text data. This function can scan through a text field and find a specified value. If the function finds the specified value, the function then returns the position of the given value.

The syntax of the FIND function is:

FIND(string, substring, <modifiers>, <startpos>)

The FIND function has two required arguments, two optional arguments, and returns a numeric value.

String – this required argument identifies the source text in which you want to search. You can either directly type constant text into the field (the example below), or you can reference a variable containing text.

Substring – this required argument specifies the text that you would like to search for. If this search text is found, the output will specify the position of the text. If not found, the output will be 0.

Modifiers – this optional argument provides flexibility in how FIND is applied to text. For example, you can specify a case-insensitive match or trim away trailing blanks from the search text.

i – ignores the character case during search

t – trims trailing blanks

Startpos – this optional argument specifies the starting position of the text search.

Program 4.7: FIND Example

DATA chk;
       pos_1 = FIND (“Data Science”,”nce”);
       pos_2 = FIND (“Data Science”,”sci”);
       pos_3 = FIND (“Data Science”,”sci”,”i”);
       pos_4 = FIND (“Data Science”,”ata”,42);
RUN;
PROC PRINT DATA = chk; 
RUN;

The output in Table 4.1 shows the results of the above code.

Table 4.1: FIND Output

Notes on Program 4.7:

● The FIND function is performed within a DATA step.

● The first statement finds the substring “nce” at position 10 in the referenced text string.

● The second statement does not find the substring “sci” in the referenced text string because it is a case-sensitive search.

● The third statement uses the modifier “i” to make the search indifferent to case and therefore finds the substring “sci” starting at the sixth position.

● The fourth statement specifies a starting position past the end of the source string and thus returns 0.

PUT Function

The PUT function is most widely used to convert a numeric value to a character value. This simple function is incredibly valuable and is often used when manipulating data. There are often times when a variable is represented as a numeric value in one database and as a character value in another database. To join these databases, you will need to make the data types of the two variables consistent. The PUT function is an easy way to change a variable value from numeric to character.

The syntax of the PUT function is:

PUT(source, format.)

The PUT function has two required arguments.

Source – the variable containing the value (or expression) that you would like to convert

Format –the format that you want to use to convert the variable value to text

Program 4.8: Convert Numeric to Character

DATA temp;
  SET dataset;
  id_char = PUT(id_num, $8.);
RUN;

The above code creates a new variable id_char by converting the existing numeric variable id_num into a character variable. The first argument identifies the variable that you want to convert, and the second argument specifies that you want to convert the variable to a character format $8.

FIRST. and LAST. Statements

Certain processes happen “behind the scenes” in SAS. These processes are how SAS processes information. I touched on this topic in the explanation of the DATA steps and how the program data vector (PDV) iterates through a data set, line by line, in Chapter 3.

When a DATA step contains a BY statement, the DATA statement creates two temporary indicator variables for each variable in the BY statement. The names of these variables are FIRST.variable and LAST.variable (where “variable” is the name of the associated variable in the BY statement). If you sort the data set by the variables contained in the BY statement, you are able to perform a running calculation iteratively applied to each observation and to perform summary calculations within each BY group.

For example, if we wanted to count the number of observations for each level of a categorical variable, we can use the FIRST.variable LAST.variable statements to calculate a total within each level of the BY variable. Program 4.9 demonstrates this technique applied to the MYDATA.Listings data set to count the number of observations for each level of the variable room_type.

Program 4.9: First. Last. Summary BY Variable

PROC SORT DATA=MYDATA.LISTINGS;
  BY room_type;
RUN;
DATA count;
  SET MYDATA.LISTINGS;
  BY room_type;
  IF FIRST.room_type THEN
      count = 0;
  count + 1;
  IF LAST.room_type;
RUN;
PROC PRINT DATA=count NOOBS;
  FORMAT count comma10.;
  VAR room_type count;
RUN;

Notes on Program 4.9:

● The data must be sorted by the variable that you will include in the BY statement.

● Within a DATA step, the BY statement controls the operation of the SET statement and sets up the grouping variable.

● SAS creates a FIRST.room_type temporary indicator variable and initializes a count = 0 at the beginning of each level of the BY variable.

◦ Because of the presence of the BY statement, when the SET statement reads a row, it checks to see if this is the first row of a group. If so, it sets the value of first.room_type to 1, otherwise it sets the value to 0. It also checks to see if the next row to be read contains another row from the current group. If so, it sets the value of last.room_type to 0, otherwise it sets the value to 1.

◦ The value of first.room_type is tested and, if it is 1, the value of count is set to 0.

◦ Next, 1 is added to the count variable for every row of data read.

◦ In the subsetting IF statement, the value of last.room_type is tested and, if it is 1, the program continues on to excute the RUN statement, outputting a record as the count data. Otherwise, processing returns to the top of the DATA step without outputting a row.

◦ The process continues through each level of the room_type variable until no more rows are available in the room_type data set.

● Because we know that there are three distinct values for the room_type variable, we will see three rows output to the count data set.

● The PROC PRINT statement is optional and allows you to create a report displaying the output data set containing the summarized results

The output of the above code is shown in Table 4.2.

Table 4.2: First. Last. Summary Output

Macros Overview

We have discussed many ways in which you can manipulate data. SAS offers an incredible number of methods that allow you to ingest, transform, and report data. However, one of the most powerful and often misunderstood methods is the use of macros. Sometimes the terminology can get a bit confusing, so let’s make a quick distinction between a macro variable and a macro.

There are two main categories of macros:

Macro variable – This is a variable that is typically defined and assigned a value once and then referenced throughout your program. Usually, a %LET statement is used to define and assign a value to a macro variable.

Macro – This is a named repeatable set of stored SAS code that can be called for execution within SAS programs. The method used to name and define this set of stored statements is to use a %MACRO statement to mark the beginning of the set of statements and a %MEND to mark the end of the set of statements.

Macro Variables

There are two main types of macro variables. When you invoke a SAS program, the SAS system’s macro processor generates automatic macro variables that supply information related to the SAS session. These are called automatic macro variables.

A user can also create, define, and call their own macro variables. These are called user-defined macro variables. These variables can have a maximum of 65,534 characters and must be printable characters. This requirement is because macro variables are essentially a method of dynamically modifying text within the SAS program through the use of symbolic substitution. If the text is not printable, then SAS would not be able to read it and interpret it in its execution.

Automatic Macro Variables

When you invoke a SAS program, the SAS system automatically generates several macro variables that supply information related to the SAS session. This information includes how the data will be processed as well as system information such as the date and time that the session was invoked. You have the ability to access and use these system-generated macro variables as part of your program. Most of these system-generated macro variables begin with the “SYS” prefix. A full list of these automatic macro variables can be found in the SAS documentation.

A common use of the system-generated macro variables is to use them in reporting to specify when the report was created. Program 4.10 incorporates the code that we developed for the PROC REPORT section in the previous chapter; however, we have now added a FOOTNOTE statement that includes automatic macro variables. The system-generated macro variables SYSDAY and SYSDATE9 are used to specify the day and date that the SAS program was invoked. Macro variables are referenced by placing an ampersand (&) before the name of the requested variable. The dot (.) following the macro variable delineates the end of the macro variable reference. This is required if the macro variable reference is not followed by a white space or other character that will otherwise mark the end of the macro variable reference.

Program 4.10: System-Generated Macro Footnote

PROC REPORT DATA=MYDATA.LISTINGS ;
  title ‘Summarized AirBnB Report’;
  COLUMN bed_type room_type price;
  DEFINE bed_type / GROUP ‘Bed Type’;
  DEFINE room_type / GROUP ORDER=FREQ DESCENDING’Room Type’;
  DEFINE price / MEAN FORMAT=DOLLAR10.2 ‘Average Price’;
  FOOTNOTE “Report for &sysday., &sysdate9.”;
RUN;

This report will include a footnote that specifies the day and date in which the report was created. If this report is run according to a scheduled process, then the FOOTNOTE statement will detail which day and date that the specific report was created.

Output 4.1: System-Generated Macro Footnote Output

User-Defined Macro Variables

User-defined macro variables allow you to increase your program’s efficiency and to increase quality control. User-defined macro variables must begin with either a letter or an underscore and can be followed by letters or digits. These types of macro variables can be used for a variety of purposes, including representing constant text, equations, and null values. They can also be used to refer to another macro variable. The possibilities of these combinations are nearly endless.

Table 4.3: User-Defined Macro Examples

Assignment Type

User-Defined Macro Variable

Purpose

Output

Constant text

%LET animal = DOG;

Create a text reference for a specific value

DOG

Digits

%LET num = 100+200

Create a text reference for specific digits

100+200

Equation

%LET eq = %EVAL(20-15);

Create an equation that resolves to a calculated value

5

Null value

%LET actor = ;

Create a null value for a specific variable

 

Macro variable reference

%LET first_name = George;
%LET last_name = Washington
%LET name = &first_name &last_name;

Create a text reference developed from previously defined macro variables

George Washington

Table 4.3 provides a few examples of user-defined macros based on assignment type. A few items to note about the information contained in Table 4.3:

● The “Digits” assignment demonstrates that the macro does not evaluate the expression 100+200. It treats the expression as character values and therefore the output is “100+200” rather than 300.

● The “Equation” assignment incorporates the %EVAL function to evaluate the expression 20-15. This expression results in the calculated value of 5 as the output value.

● The “Macro variable reference” assignment allows you to create macro variables and then refer to those variables within another user-defined macro variable.

Let’s look more closely at a common application of user-defined macro variables. If you have a SAS program that references a particular value repeatedly, it is extremely useful to incorporate macro variables. We often write programs that constantly refer to a certain date or categorical value throughout the program. If we need to update that date value to a new time period, then we will have to search throughout the program and change it at every instance that it is specified. Hopefully, you will not miss one of the instances where it is called and throw off your entire data output!

An easy fix to this problem is to create a macro variable. This technique allows you to specify the value of a variable once, usually at the beginning of your program, and then refer to the macro variable throughout your program. This way, if you need to change the value of the variable, you will need to do it only once, and all of the referred instances of the macro variable will be in line with the new value.

Program 4.11 demonstrates the development of the macro variable and provides an example of how it is called within a DATA step.

Program 4.11: Macro Variable Example

%LET date = ‘01MAY2019’d;
DATA POP;
  SET DATASET;
  IF start_date = &date.;
RUN;

Notes on Program 4.11:

● The %LET statement is used to create and define the macro variable. We are creating a macro variable that we have decided to call “date,” and we are setting this value to ‘01MAY2019’d. This statement is telling SAS that every time that the macro variable “date” is referenced, that it should interpret it as “’01MAY2019’d”.

● The DATA step is one way of incorporating the newly formed macro variable. You can use and define macro variables anywhere within the SAS program except in data lines used as input to a DATA step through the use of the DATALINES statement. In this example, the macro variable is used to filter the data set to observations where the “start_date” is equal to ‘01MAY2019’d.

● The macro variable is referenced by placing an ampersand (&) in front of the defined macro variable name. The dot (.) following the macro variable name delineates the end of the macro variable reference. This is required if the macro variable reference is not followed by a white space or other character that will otherwise mark the end of the macro variable reference.

Macros

Macros are reusable pieces of code that perform a function and can easily be updated with new information. You may be familiar with the concept of macros if you have used functions in other programming languages. Table 4.4 demonstrates a simple reusable algorithm developed in the three most popular data science programming languages, R, Python, and SAS. The example algorithm raises the first value by the power of the second value.

Table 4.4: Macro Comparison Table

Language

Programming Code

Output

R Programming

power <- function(x, y) {

  result <- x^y

  print(paste(x,”raised to the power”, y, “is”, result))

}

power(8,2)

 

 

“8 raised to the power 2 is 64”

 

 

Python Programming

def power(x,y):

    result = x**y

    print(x, “raised to the power”, y, “is”, result)

power(8,2)

 

 

8 raised to the power 2 is 64

 

SAS Programming

%MACRO POWER(x,y);

    %LET result = %EVAL(&x.**&y.);

    %PUT &x. raised to the power &y. is &result.;

%MEND;

%POWER(8,2);

 

 

8 raised to the power 2 is 64

 

 

Now that these functions have been developed, the programmer would merely have to call the name of the function with updated values to get new results.

Table 4.5: Macro Call Examples

Call Function

Output

%power(3,5)

%power(10,3)

%power(-7,3)

3 raised to the power 5 is 243

10 raised to the power 3 is 1000

-7 raised to the power 3 is -343

This simple example demonstrates a powerful programming concept. Complex programs can be developed with symbolic placeholders for values to be input later. This methodology is the path to program efficiency. Reusable code can save a lot of time and be shared with team members to ensure methodological consistency and quality control.

Defining and Calling Macros

The basic structure of a macro contains three necessary parts. The %MACRO statement begins the process by telling the SAS system that you are generating a user-defined macro. The macro name assigns a name to the macro that can be called later. The %MEND statement tells the SAS system that the definition is complete. Program 4.12 demonstrates the basic structure.

Program 4.12: Macro Structure

%MACRO <macro_name>;
    programming code;
%MEND;

When you submit a macro definition, the macro processer compiles the definition and adds it to the available SAS macros for that session. Once a macro is compiled, user-defined macros are available during the current SAS session. Once the session ends, the user-defined macros will be removed from the available macro list. If you begin a new session, you will need to compile the macro prior to calling it.

Once the algorithm has been compiled, you can call the macro any time by simply calling the macro name. Program 4.13 demonstrates how to call the macro.

Program 4.13: Macro Call Example

%<macro_name>;

Let’s take an in-depth look at the previous example of the polynomial macro and explain each part.

Program 4.14: Polynomial Macro Example

%MACRO POWER(x,y);
       %LET result = %EVAL(&x.**&y.);
       %PUT &x. raised to the power &y. is &result.;
%MEND;

Notes on Program 4.14:

● The %MACRO statement begins the creation of a user-defined macro.

● The macro is given the name POWER. You can assign any valid SAS name for a macro name.

● The parentheses contain the symbolic placeholders for values to be used in the macro program. These are generally called the “parameters” of the macro. You can use any valid SAS name for the parameters. These parameters will be used in the following macro program. The parameters within the macro will be called by placing an ampersand (&) symbol before the parameter name. In the provided example, the first parameter is called by the value &x. The second value is called by the value &y.

● Once the macro is initiated and the name and parameters are defined, the next step is to develop the content of the macro. You can use any valid SAS programming code within a macro. If you have defined parameters in the parentheses following your macro name, you can use them within the content of the macro.

● The %MEND statement completes the macro process.

Let’s take a look at another example of a macro statement developed on the Airbnb data set from our case study. If we had decided that we wanted to implement a discount on Airbnb properties based on room type and price, we can create a macro that performs this calculation without knowing the specific values of room type and price. Program 4.15 provides an example of this code.

Program 4.15: Macro Price Change Example

%MACRO SALE(type, price, discount);
  DATA chk;
      SET TEST;
         WHERE bed_type = &type.
         AND price gt &price.;
         discount = price* &discount.;
  KEEP host_id price discount;
  PROC SORT DATA=chk; BY price; RUN;
%MEND;

Program 4.15 creates a macro called SALE that contains three parameters. The type parameter will assign a user-defined value that will be used to specify bed type. The price parameter will be used to determine the per diem price of the property. The discount parameter will specify the percentage discount to be applied to the calculation.

The first two values are used to filter the data set to observations that meet these criteria. This data set is then subject to the discount calculation. We can now call the macro by calling the macro name along with values for the parameters. Program 4.16 demonstrates the execution of the macro with the parameters specified.

Program 4.16: Macro Price Change Call

%SALE(‘Real Bed’, 100, 0.8);

This example limits the data set to values where the room type is specified as “Real Bed” and that the per diem price of the property is greater than $100. This data set is then subjected to a discount rate of 80%.

Now that the macro has been defined, we can easily create new sales strategies by specifying different values for the parameters. Program 4.17 demonstrates different implementations of the user-defined macro.

Program 4.17: Macro Price Change Call Examples

%SALE(‘Real Bed’, 100, 0.8);
%SALE(‘Couch’, 50, 0.6);
%SALE(‘Futon’, 75, 0.75);

These examples demonstrate how easy it is to change the parameters contained in the macro and develop new output.

Chapter Review

Congratulations on finishing the Advanced SAS Coding chapter! If you are new to programming, these concepts and techniques can be challenging to learn. If you have programmed in other languages, you can probably see similar approaches and coding techniques in SAS that you have used elsewhere. The methods that we have discussed in chapters 3 and 4 comprise most of the hands-on programming that is necessary for data ingestion, manipulation, and analysis. However, there is always more coding to learn. The methods that I have demonstrated are only a small fraction of the possibilities that SAS has to offer to programmers.