Chapter 3: SAS Coding – End-to-End Data Science with SAS

Chapter 3: SAS Coding

Overview

Get Data

Structured and Unstructured Data

SAS Libraries

PROC IMPORT

WORK Library

DATA Step

Server Connections

Explore Data

PROC CONTENTS

PROC FREQ

PROC UNIVARIATE

PROC MEANS

PROC SUMMARY

Manipulate Data

DATA Step

Filtering Data

PROC SORT

Export Data

PROC EXPORT

PROC REPORT

Graphs

Chapter Review

Overview

I am going to assume that if you bought a book on data science in SAS that you have some familiarity with the SAS programming language. If you do have some experience in SAS coding, then this chapter will be a refresher on common SAS coding techniques. If you do not have much experience in SAS coding, then you may have to spend a little extra time studying this chapter and trying out the programming concepts yourself on the data that I’ve provided in the GitHub repository.

I will break the SAS coding concepts into four sections:

● Get data

● Explore data

● Manipulate data

● Export data

Regardless of what industry or academic program that you are in, these four areas of SAS coding will cover about 80% of what you will need to know to sufficiently work with data.

An important programming note to understand before we begin developing SAS code is that SAS does not incorporate “white space formatting.” In many other programming languages, the programmer is required to indent four spaces or use a tab to begin or continue a process. SAS does not require the programmer to do this. Although a programmer could either continuously write all their code on a single line or they could put as much white space as they wish between each statement, it is a best practice to use standard formatting to make your code easy to understand.

Get Data

The first step in any analytical project is to get the data. This section will explore the various methods of getting data into the SAS environment.

Structured and Unstructured Data

The first step in any data project is to get the data. Data can come in many forms and from many environments. To simplify the complex subject of data gathering, I will focus on structured data that is contained in existing databases and unstructured data that is contained in text files.

Structured data looks like a spreadsheet. It has defined rows and columns, and it can be stored in many structured environments like SAS tables, Microsoft Excel spreadsheets, and Microsoft Access tables. Structured data can also be stored in database management systems (DMBS) such as Teradata, Microsoft SQL Server, Apache Hadoop, and many others.

Unstructured data does not have defined columns. This data includes text files, JSON files, and weblogs. Some unstructured data may look as if it is structured (for example, a tab-delimited text file), but the computer just sees a long string of text with delimiters. This data would need to be brought into a structured environment for it to be manipulated and analyzed.

The first step to working with either structured or unstructured data is to put the data into a SAS table. A SAS table is a structured data file that has defined columns and rows. This data file has the extension .sas7bdat.

A SAS table has columns and rows with the variable names in the first row and the data in the following rows. Although the structure is similar to a spreadsheet, there are a few SAS specific rules for a SAS table:

● SAS column names can be between 1 and 32 characters in length.

● SAS column names must start with either a letter or an underscore. (They cannot begin with a number.)

● SAS column names cannot contain special values.

● SAS column names can be uppercase, lowercase or mixed case.

● You cannot mix character and numeric data types within the same column.

● Numeric columns are always stored with a length of 8 bytes (which equates to 16 digits).

● Character columns can be any length between 1 and 32,767 bytes (one byte per character).

SAS Libraries

A SAS library is a collection of data files that are of the same data type and in the same location. For example, a SAS library can be constructed to contain all the Teradata files. Another SAS library can be built to contain all the Hadoop files, while another SAS library can be built to hold all your locally stored text files. Many SAS programs will have several library statements at the beginning of the program to construct pathways to all the various data files that you will need for your project.

You can create a SAS library with a LIBNAME statement. The general structure of a LIBNAME statement is:

LIBNAME libref engine “path”;

Let’s look at each part of the LIBNAME statement:

LIBNAME – This is the keyword reference that calls the LIBNAME procedure.

libref – This is the user-defined name of the library. You can name it whatever you want as long as it is eight characters or less, starts with a letter or an underscore, and includes only letters, numbers, and underscores in the name.

engine – An engine is a set of “behind-the-scenes” instructions for reading structured data. Different types of data environments require different engines. Therefore, SAS has different engines for Excel, Teradata, Hadoop, and several others. SAS also has the Base SAS engine to read in SAS tables. The Base SAS engine is the default engine that SAS uses when an engine is not specified in the LIBNAME statement.

path – The path is the location of the data that you want to read. This location is relative to the installation of the SAS environment. If you have SAS installed on your local computer, you can specify a path to a local file. If you have SAS installed on a server, then your paths must be relative to the server.

For example, I have a folder on my desktop that I’ve labeled “SAS Book Stuff,” and within this folder, I have another folder that I’ve labeled “Data.” This folder is where I put all of the data that I will use for my examples. The Windows file path for this folder is: C:\Users\James Gearheart\Desktop\SAS Book Stuff\Data.

I can create a SAS library that I will call “MYDATA.”

Program 3.1: Establish a Library Pathway

LIBNAME MYDATA BASE “C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data”;

An important thing to notice is that the pathway in the LIBNAME statement in the UNIX/Linux environment requires you to use the forward-slash “/” instead of the backslash “\” that you normally see in a Windows environment.

Notes on Program 3.1:

● I use the LIBNAME statement to create the library.

● I have specified that the libref that I will use is MYDATA.

● I am using the Base SAS engine (although I do not need to specify it because it is the default engine).

● I am specifying the path to my local folder (making sure that I use the forward slash).

SAS libraries are a great way to manage your data environment. They can act as pointers to existing locations that contain similar data files, or you can use SAS libraries to create a place to store files that you create through your programming code permanently. By placing data into a library, the data files become permanent until you choose to delete them.

PROC IMPORT

Now that we have established a library to house our data, let’s get some data to populate the library. The code below imports a comma-delimited text file called listings_clean.csv into our newly created MYDATA library.

Program 3.2: Import Data with PROC IMPORT

FILENAME REFFILE ‘C:\Users\James Gearheart\Desktop\SAS Book Stuff\Data\listings_clean.csv’;
PROC IMPORT DATAFILE=REFFILE
       DBMS=CSV
       OUT= MYDATA.Listings;
       GETNAMES=YES;
RUN;

Let’s examine each piece of the above code:

● I have used the FILENAME statement to create a path to the data that I want to import. I have called this pathway REFFILE. Notice that that pathway uses the backslash in the file path.

● The IMPORT procedure reads unstructured data from an external source and writes it into a SAS data set.

● The DATAFILE statement specifies the file reference that I created in the FILENAME statement. You can choose not to use a separate file reference and instead use the file path.

● The DBMS statement specifies the type of information that you want to import. An important note is that since the data file that we are importing has a .csv extension, we do not need to specify that the delimiter is a comma. PROC IMPORT can identify that this type of delimited external file contains comma-separated data values.

● The OUT statement specifies where the imported data will be placed and what it will be labeled. This example shows that we are putting the data in our newly created MYLIB library, and we are assigning the data to the name “Listings.”

● The GETNAMES statement imports variable names from the first row of the imported data set.

The MYDATA library permanently contains the Listings data set. Even if you delete your MYDATA library reference or shut off SAS completely, the Listings data set will still exist in the location that you specified. This is because the libref is merely a pointer to the data. You can delete the pointer, but the data remains until you choose to delete it.

WORK Library

You certainly do not want to permanently keep every data set that you work with throughout your project. Some data files are meant to be temporarily used as part of a process flow and then discarded. Data files consume a lot of computer memory, so you should be judicious in deciding which ones you want to retain after your SAS session is finished.

The WORK library is a temporary library that is automatically created by SAS at the beginning of a SAS session. The files that are written to the WORK library are deleted at the end of the SAS session. If a data file is created and does not contain the reference to a specific library, then this data file is placed into the WORK library.

A programmer can either specify the WORK library directly, or they can infer the WORK library by not providing a specific library. The code below shows two examples of placing the MYDATA.Listings data into a temporary WORK library labeled “Test.”

Program 3.3: Place Data into a WORK Library

DATA WORK.TEST;
  SET MYDATA.Listings;
RUN;
DATA TEST;
  SET MYDATA.LISTINGS;
RUN;

The first example explicitly states the WORK library while the second example implicitly states the WORK library by not specifying a library.

If you were to restart your SAS session, the MYDATA.Listings data set would still exist; however, the WORK.Test data would not exist.

DATA Step

SAS libraries are a great way to access structured data. The easiest example to understand is copying an existing SAS table from one library to another SAS library. If a SAS table exists in another SAS library, you can copy it into a new library with a DATA step:

Program 3.4: Place Data into a SAS Library

DATA MYDATA.DEMOGRAPHICS;
  SET ORG.DEMOGRAPHICS;
RUN;

The above code accesses the data set named “DEMOGRAPHICS” that is found in the ORG library previously created by someone else and creates a copy of it in your MYDATA library. We have chosen to keep the same name of the file, but we could have decided to call it whatever we wanted. We will explore the DATA step in more detail in the “Manipulate Data” section of this chapter, but for now, it is important to know that existing SAS tables can easily be copied from one library to another.

Server Connections

Most data scientists work in corporations where data is not stored locally or in Excel files. In the era of big data, data files are generally stored on servers that the data scientist will need to connect to and navigate the correct pathway to access and select the data that they need.

The following code establishes a connection to a specific set of tables on a Teradata server.

Program 3.5: Establish a Connection to a Teradata Server

%INCLUDE ‘~/pw/userinfo.sas’;
LIBNAME MY_TERA TERADATA USER=”&sysuserid” PASSWORD=”&pw”
SERVER=”TERADATA_SERVER” SCHEMA=SPECIFIC_AREA

Let’s look at each piece of the above code:

● In this example, the %INCLUDE statement uses a Linux/UNIX path. This is because this example assumes that you are connecting to a Teradata server in a Linux environment. Linux paths are also compatible with SAS University Edition and most SAS Viya installations.

● It is vitally important to protect your sign-on information. The %INCLUDE statement points to a protected file that contains your username and password information. This file is not to be shared with anyone else. This step is not necessary to connect to servers, but it is highly recommended that you do not directly insert your username and password information into your programs. The tilde (~) indicates the home directory.

● The LIBNAME statement is used to create a library that we will call MY_TERA.

● The TERADATA statement is used to access the Teradata engine that will connect to the Teradata server.

● The USER and PASSWORD statements pull information from the %INCLUDE file to establish your access privileges.

● SERVER specifies the name of your organization’s Teradata server. This can also be called with the TDPID statement which is an alias for the SERVER statement.

● The SCHEMA statement points to a specific collection of tables on the Teradata server. Servers will usually have several different schemas that contain different types of data tables.

Pulling data from corporate servers is a bit trickier than using a simple PROC IMPORT. The data found in these servers are generally massive in scale. These data sets can range from a few gigabytes up to the petabyte and exabyte scale. The data contained in these servers are usually not in SAS tables. You will need to access the specific data that you need and bring it into the SAS environment and place it into a SAS table. Let’s look at an example piece of code:

Program 3.6: PROC SQL Connection to Teradata Server

PROC SQL;
       CONNECT TO TERADATA(USER=”&sysuserid” PASSWORD=”&pw”
              SERVER=”TERADATA_SERVER” SCHEMA=SPECIFIC_AREA 
              MODE=TERADATA);
       CREATE TABLE MYLIB.WEBLOGS AS SELECT * 
FROM CONNECTION TO TERADATA
              (SELECT a.*
                     FROM SPECIFIC_AREA.TABLE_NAME AS a
                            WHERE date >= ‘2017-12-01’
              );
QUIT;

Notes on Program 3.6:

● The SQL procedure implements the Structured Query Language (SQL) in SAS. SQL is a widely used programming language that accesses existing data structures and creates data tables.

● The CONNECT TO statement uses the same connection specifications that we used to establish a connection to the Teradata server.

● SQL provides the option to create a table or a view. We are creating a table named “WEBLOGS” in the MYLIB library that we have previously created. This table is a permanent data set that is a SAS table.

● The AS SELECT * FROM statement tells SQL where we want to get the data to populate the MYLIB.WEBLOGS data set that we are creating.

● The CONNECTION TO TERADATA statement specifies that we are pulling data from our newly established Teradata connection.

● The remainder of the code within the parentheses is standard SQL coding that we will explore shortly. It is important to note that because the data contained in servers is massive, it is important to limit your data pull. In this example, I am using a WHERE statement to select only the observations that occur after a specific date.

We have just seen a few examples of how to get data into a SAS environment. The LIBNAME statement helps you create a place to put data. PROC IMPORT, the DATA step, and PROC SQL are just a few of the many ways that you can populate your library with SAS tables. Now that we have some data to play with, we will need to explore the data.

Explore Data

After you have successfully gathered some data, you will need to know some information about your data, such as:

● Number of columns

● Number of observations

● Format for each column

PROC CONTENTS

The CONTENTS procedure shows the contents of a specified SAS table. The procedure writes the contents to an output file and provides a summary of the information in the Results window of your SAS environment.

Program 3.7: Explore the Contents of a Data Set

PROC CONTENTS DATA=MYDATA.Listings; RUN;

The output contained in the Results window has three parts. The first two parts provide summary information about the data set, including when it was created, where it is stored, and the owner of the data set. In this example, we can see that the MYDATA.LISTINGS data set contains 49,056 observations and 54 variables. It was created on March 2nd, 2019, and the file size is 21MB.

Output 3.1: PROC CONTENTS Output

The third and final section of the output of Program 3.7 shows the individual columns in the SAS table along with their format information. This table provides an alphabetic list that contains the variable name, type, length, format, and informat information.

Output 3.2: PROC CONTENTS Output

Although there are 54 variables in the MYDATA.LISTINGS data set, I have elected to show only the first 16 variables. From this list, you can see that there is a mix of character and numeric variables. The numeric variables are mostly in the BEST12. format with the exception being the first_review variable. This variable is a date field and is formatted in the MMDDYY10. format.

The character variables are individually formatted to their specific length. It is important to inspect the format of character variables. These variables are often stored in formats that are unnecessarily large and consume a lot of computer resources.

PROC FREQ

The FREQ procedure is used extensively by data scientists to understand the contents of both character and numeric variables. This procedure produces one-way to multi-way frequency distributions of data elements. For a single variable, the default settings of the procedure will provide a list of unique observations and the frequency at which they occur. This default setting will also provide the individual percentage of total observations, the cumulative frequency of observations, the cumulative percent of observations, and the total number of missing values.

Be cautious with using PROC FREQ on numeric data. If your numeric variable has decimal points or has values greater than 100, this procedure can easily produce extremely long tables of information. To explore continuous data with PROC FREQ, first can use a DATA step with the ROUND function to round values up to a whole number. Alternatively, you can apply a format to create numeric bins for frequency analysis.

The example below shows a standard FREQ procedure called on the Listings data set contained in the MYDATA library. The TABLES statement requests a one-way to multi-way crosstabulation table for the specified variables.

Program 3.8: Frequency Distribution Example

PROC FREQ DATA=MYDATA.LISTINGS;
       TABLES neighbourhood_group_cleansed;
RUN;

Output 3.3: PROC FREQ Output

By default, the data is sorted in alphabetical order. It is often useful to have the data sorted in order of frequency from high to low. The ORDER statement controls the order in which the data is sorted.

Program 3.9: Ordered Frequency Distribution Example

PROC FREQ DATA=MYDATA.LISTINGS ORDER=FREQ;
       TABLES neighbourhood_group_cleansed;
RUN;

Output 3.4: Ordered Frequency Distribution Example Output

PROC FREQ can also create crosstabulation tables for two or more variables. The following code shows the crosstabulation between the variable neighbourhood_group_cleansed and the variable room_type. The code also specifies that I want to suppress the column, row, and overall percentage information from the output. Finally, I also want to create a temporary output data set of this crosstabulation that I will name “room_freq.”

Program 3.10: Crosstabulation Example

PROC FREQ DATA=MYDATA.LISTINGS ORDER=FREQ;
       TABLES neighbourhood_group_cleansed * room_type /
       NOCOL NOROW NOPERCENT OUT=room_freq;
RUN;

Output 3.5: Crosstabulation Output

The FREQ procedure has several options to evaluate the statistics of a given set of variables. These options include:

● Chi-square tests and measures

● Measures of association

● Binomial proportions and risk differences for 2 x 2 tables

● Odds ratios and relative risks for 2 x 2 tables

● Tests for trend

● Tests and measures of agreement

● Cochran-Mantel-Haenszel statistics

Each of these topics is certainly worthy of its own chapter. However, this chapter has a “Getting Started with SAS” level of detail. I recommend that the reader go to the SAS Support Documentation for further information: https://support.sas.com/en/documentation.html

PROC UNIVARIATE

The UNIVARIATE procedure produces descriptive statistics for numeric data. This procedure provides a lot of information about the distribution of the data and the measures of centrality. These include the mean, median, and mode along with the standard deviation, range, and interquartile range. PROC UNIVARIATE is used extensively by data scientists to investigate the distribution of numeric data and assess the need to transform variables into a normal distribution.

The code below shows a simple UNIVARIATE procedure to be run against the Listings data set contained in the MYDATA library. We are specifying that we want to see the descriptive statistics for the “price” variable contained in that data set. We also have specified that we want to see a histogram of the “price” variable by including the HISTOGRAM statement.

Program 3.11: PROC UNIVARIATE Example

PROC UNIVARIATE DATA=MYDATA.LISTINGS;
  VAR Price;
  HISTOGRAM;
RUN;

The first section of the output from the procedure is contained in Output 3.6a. This table shows the total number of observations (N), the central point of the data (Mean), the distribution of the data (Std Deviation) along with several additional statistics that describe the distribution of the data. This table provides an excellent overview of the selected variable that will be detailed in the following sections of the PROC UNIVARIATE output.

Output 3.6a: PROC UNIVARIATE Output – Moments Table

The second section contains the Basic Statistical Measures. This valuable table provides additional information on the measures of centrality including the median and the mode. This table also provides additional information on the distribution of the data including the range and the interquartile range.

Output 3.6b: PROC UNIVARIATE Output – Statistical Measures

The third section of the output contains the Tests for Location. PROC UNIVARIATE produces a test statistic for the null hypothesis. The null hypothesis is the point at which the mean or median is equal to a given value of µ. This statistic is a two-sided test with the default value of µ = 0.

Output 3.6c: PROC UNIVARIATE Output – Tests for Location

The fourth section of PROC UNIVARIATE provides a table of the quantiles of the data. This table provides a deeper look of the distribution by providing the values of the data at predetermined percentage markers. The example shows that the range of the price variable is from 0 (0% Min) to 10,000 (100% Max) with a median value of 110 (50% Median). This table also provides data values that mark the first and third quartile of the data distribution as well as various high and low percentage markers.

The quantile table is extremely valuable because it gives the researcher a concise table of information describing the distribution of the data. This table can serve as a decision point as to whether a variable has outliers, needs to be capped on the high or low ends, or needs to be transformed into a normally distributed variable.

Output 3.6d: PROC UNIVARIATE Output – Quantiles

The fifth section of the PROC UNIVARIATE output provides the Extreme Observations table. This table shows the top five and bottom five observation values and their associated observation number. This table allows the researcher to quickly see if there is a single high or low outlier or if there are several values on the high or low end of the spectrum.

Output 3.6e: PROC UNIVARIATE Output – Extreme Observations

The final default table produced by the UNIVARIATE procedure is the Missing Values table. This table provides an overview of the volume of observations that have missing values for the selected variable. This table also provides the percentage of observations that have missing values. This information is very important because it tells the researcher how big of an issue that missing values are for the selected variable. As a rule of thumb, if missing values are less than 20% of the total observations, then you can infer the missing values through a variety of methods. If the missing values are greater than 20% of the total observations, then you may need to take an alternative approach that we will discuss in the “Create a Modeling Data Set” chapter.

Output 3.6f: PROC UNIVARIATE Output – Missing Values

We have also decided to include a histogram graphic that is not part of the default PROC UNIVARIATE output. The histogram shows a graphical demonstration of the data distribution. The example shown in Figure 3.1 shows a highly skewed price variable with extreme high-end outliers. This data would need to be adjusted before it could be used for modeling.

Figure 3.1: PROC UNIVARIATE Histogram Output

PROC MEANS

The MEANS procedure is another way to calculate descriptive statistics for your data set. This procedure allows you to select the types of statistical features that you would like to include in the output. The procedure provides you with the flexibility to stratify the data and create output data sets of the calculated statistics.

The following code calls the MEANS procedure on the Listings data set contained in the MYDATA library. The programmer has the option of specifying the variables that they want to investigate through the use of the VAR statement. If no VAR statement is included, the PROC MEANS will calculate the default statistics for all the numeric variables in the selected data set.

Program 3.12: PROC MEANS Example

PROC MEANS DATA=MYDATA.LISTINGS;
  VAR price accommodates bathrooms;
RUN;

The output data shown in Output 3.7 shows the default statistics generated from PROC MEANS.

Output 3.7: PROC MEANS Output

We can expand the MEANS procedure to include additional statistics that we specify. The following code specifies that I want the output to show the number of observations (N), the number of missing values (NMISS), the minimum value (MIN), the maximum value (MAX), the mean and median values (MEAN, MEDIAN) as well as the standard deviation (STDDEV).

Program 3.13: PROC MEANS with a CLASS Statement

PROC MEANS DATA=MYDATA.LISTINGS 
  N NMISS MIN MAX MEAN MEDIAN STDDEV;
  CLASS room_type;
  VAR price accommodates bathrooms;
  OUTPUT OUT=list_stats;
RUN;

The CLASS statement stratifies the data by the selected variable. The output will contain the selected statistics for each level of the variable in the CLASS statement.

Output 3.8: PROC MEANS with a CLASS Statement Output

Output 3.8 shows the output for the expanded MEANS procedure. The selected statistics are calculated for each level of the CLASS variable.

The final element created from the MEANS procedure is the output data set. The OUTPUT statement specifies the statistics to be stored in the data set specified in the OUT option. The default output includes the N, MIN, MAX, MEAN and STD statistics. These are the same defaults as the unspecified PROC MEANS output.

The OUTPUT statement creates the output data set shown below (Output 3.9). It is important to note that the output data set has two distinct levels as specified by the _TYPE_ field. The output where _TYPE_ = 0 provides summary statistics for the entire data set. The output where _TYPE_ = 1 provides the calculated statistics for each of the levels specified in the CLASS statement.

Output 3.9: PROC MEANS with a CLASS Statement Output Data Set

PROC SUMMARY

The last procedure that we will look at in the “Explore Data” section of this chapter will be PROC SUMMARY. The SUMMARY procedure can calculate summary statistics for a data set much in the same manner as PROC MEANS and PROC UNIVARIATE. The default PROC SUMMARY creates an output data set that is very similar to what we have just seen in the default MEANS procedure.

Program 3.14: PROC SUMMARY Example

PROC SUMMARY DATA=MYDATA.LISTINGS;
  VAR price accommodates bathrooms;
  OUTPUT OUT=list_sum;
RUN;

Output 3.10: PROC SUMMARY Output Data Set

The code displayed in Program 3.14 produces the output data set in Output 3.10. This data set is identical to the information that we receive in the default MEANS procedure.

So, you have to ask yourself the question, why would I need to use PROC SUMMARY if I already have ways of getting that information? Good question.

As you can probably tell by the title of the procedure, the SUMMARY procedure can summarize information in many different ways. For example, if you have an ID variable, you can summarize specific information at the ID level and create a summary output data set. Program 3.15 below shows a SUMMARY procedure at the ID level to create a summary data set.

If a SUMMARY procedure includes a BY statement, the data will need to be sorted by that variable prior to the SUMMARY procedure.

Program 3.15: PROC SUMMARY with a BY Statement

PROC SORT DATA=MYDATA.LISTINGS OUT=list_sort;
  BY host_id;
RUN;
PROC SUMMARY DATA=list_sort;
  BY host_id;
  VAR price accomodates bathrooms;
  OUTPUT OUT=list_sum SUM=;
RUN;

After the data has been sorted, the PROC SUMMARY code is developed to specify that the procedure will use the sorted data and summarize the price, accommodates, and bathrooms variables. These variables will be summarized at the host_id level, and an output data set will be created named list_sum. The SUM=; option in the OUTPUT statement suppresses the statistics from the output data set. The first ten rows of the final data set are shown in Output 3.11.

Output 3.11: PROC SUMMARY with a BY Statement Output Data Set

This data set shows the summarized information at the host_id level. The _FREQ_ column shows how many observations are included in the host_id summary. For example, on line 5, we can see that host_id 2787 has eight observations. This can be interpreted as “Host ID # 2787 has eight properties listed on Airbnb in NYC as of Dec 5th, 2018. These properties can accommodate a total of 42 people at a total nightly price of $667.”

It is useful to create a summary data set and perform statistical analysis on the summarized data or to take it a step further and create a clustering algorithm on this information. We will explore this idea further in the unsupervised modeling chapter.

Manipulate Data

At this point, you have gathered some data, created a library to store your data, imported the data into your library, and explored the contents of the data. This is a great start! However, data is rarely useful in its raw form. We often have to do things to the data to prepare it for reporting or modeling. This process can include filtering the data, appending it with additional data sets, creating new variables, and many, many other adjustments that are necessary prior to extracting actionable information from our data.

DATA Step

The DATA step is considered “home base” in SAS programming. This is where much of your data adjustments will take place. Many of the techniques that we will be discussing will take place within a DATA step. The DATA step is the primary method of creating a SAS data set, merging two or more data sets, creating variables, filtering data, and many more data manipulation techniques.

Behind the scenes, the DATA step executes in two phases:

1. The Compilation Phase – This initial phase has three important functions:

a. Checks the SAS syntax for errors and then translates your SAS code into machine code.

b. Initializes the program data vector (PDV). This is a logical area in memory where SAS builds a data set one observation at a time. This data set contains each column that you referenced in the DATA step and its attributes. These attributes include the column name, type, and length. The compilation phase establishes the rules for the PDV based on your code. This includes which records will be read into the PDV. This phase also creates a one-row temporary data set of empty columns that will be populated during the execution phase.

c. Creates descriptor information. This is the table metadata that is created by the compilation phase. This table metadata will be used as a rules engine in the execution phase of the DATA step.

2. The Execution Phase – This phase occurs as a loop through the input data set.

a. Initializes the PDV. The general structure and the rules of the PDV were created in the compilation phase of the DATA step. The newly created program variables are set to missing in the PDV.

b. Reads a single row from the input table into the PDV. You can use a SET, INPUT, MERGE, UPDATE, or MODIFY statement to read data into the PDV.

c. Executes processing statements and updates values in the PDV. This step performs all the logical functions that you have specified in the DATA step.

d. At the end of the statements specified in the DATA step, SAS writes an observation to the output SAS data set.

e. Returns to the top of the DATA step and performs the same process on the next row of data in the input table.

f. The DATA step is finished when the process has iterated through each row of the input data set.

So, now that we understand how data is processed through the DATA step, let’s look at some of the cool stuff that we can do with this statement.

Filtering Data

One of the most common techniques performed on data sets is to filter out records that are not needed for your analysis. The two most widely used functions for filtering data are the WHERE and IF statements.

Both the WHERE and IF statements select observations that meet a particular condition. The main difference between the WHERE and IF statement is that the WHERE statement restricts the volume of observations that are initially read from the input table. A WHERE statement is preferred when the condition is not dependent upon rows other than the current, due to its efficiency.

For example, Program 3.16 shows the creation of a temporary data set named TEST that is built from the input table MYDATA.Listings. The WHERE statement restricts the observations that are initially read from the MYDATA.Listings data set to include only those that met the logical condition that was stated in the WHERE statement.

Program 3.16: Filtering Data with a WHERE Statement

DATA WORK.TEST;
  SET MYDATA.LISTINGS;
  WHERE accommodates le 4;
RUN;
NOTE: There were 42565 observations read from the data set MYDATA.LISTINGS.
      WHERE accommodates<=4;
NOTE: The data set WORK.TEST has 42565 observations and 54 variables.
NOTE: DATA statement used (Total process time):
      real time           0.31 seconds
      cpu time            0.07 seconds

The log output shows that 42,565 observations were read from the input table. In contrast, let’s look at the IF statement to perform the same type of filtering.

The same restrictions can be implemented with an IF statement, as demonstrated in Program 3.17.

Program 3.17: Filtering Data with an IF Statement

DATA WORK.TEST;
  SET MYDATA.LISTINGS;
  IF accommodates le 4;
RUN;
NOTE: There were 49056 observations read from the data set MYDATA.LISTINGS.
NOTE: The data set WORK.TEST has 42565 observations and 54 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds

The log output shows that all 49,056 observations were read from the input table. The IF statement then selected 42,565 observations that met the logical condition specified in the IF statement.

This might not seem like too big of a difference to you. However, consider that most corporate data sets contain millions of records with thousands of columns. Rather than reading all of them into temporary memory and then performing your logical conditions, it is much more efficient to only read in the observations that meet the logical condition. This method can be the difference between a program taking five minutes to run or five hours to run.

SAS Operators

SAS operators can be used to filter data, create new variables, and perform calculations on existing data. We can classify SAS operators into three broad groups:

● Arithmetic operators

● Conditional operators

● Logical operators

The arithmetic operators are mathematical expressions that can be applied to numeric data. Table 3.1 provides a list of arithmetic operators along with examples of their application.

Table 3.1: Arithmetic Operators

Symbol

Definition

Example

Result

**

exponentiation

a**3

raise A to the third power

*

multiplication

2*y

multiply 2 by the value of Y

/

division

var / 5

divide the value of VAR by 5

+

addition

num + 3

add 3 to the value of NUM

-

subtraction

sale - discount

subtract the value of DISCOUNT from the value of SALE

Conditional operands establish a comparison, operation, or calculation between two variables. These operands can be applied to both numeric and character variables. The application of these operands to numeric data is straightforward. It establishes a conditional mathematical relationship between the two variables, just as one would expect. However, when you apply these conditional operands to character data, the comparison will always yield a numeric result (0 or 1). Character operands are compared character by character from left to right. Character order depends on the collating sequence, usually ASCII or EBCDIC, used by your computer.

For example, in the EBCDIC and ASCII collating sequences, G is greater than A; therefore, this expression is true:

Gray Adams

Table 3.2: Conditional Operands

Symbol

Mnemonic Equivalent

Definition

Example

=

EQ

equal to

a=3

^=

NE

not equal to

a ne 3

-=

NE

not equal to

a-=3

~=

NE

not equal to

a~=3

>

GT

greater than

a>3

<

LT

less than

a<3

>=

GE

greater than or equal to

a>=3

<=

LE

less than or equal to

a<=3

 

IN

equal to one of a list

a in (3, 4, 5)

SAS can also implement logical (Boolean) expressions to link sequences of comparisons. These expressions can be applied to both numeric and character variables.

● If both quantities linked by the AND operator are true, then the result of the AND operation is 1 (true).

● If one of the qualities linked by the OR operator is true, then the result of the OR operation is 1 (true).

● The result of putting NOT in front of a quantity whose value is 0 (false) is 1 (true).

Table 3.3: Logical Expressions

Symbol

Mnemonic Equivalent

Definition

&

AND

(a>b & c>d)

|

OR

(a>b | c>d)

!

OR

(a>b ! c>d)

^

NOT

(a>b ^ c>d)

~

NOT

(a>b ~ c>d)

Program 3.18 provides an example of using each of the SAS operators within a DATA step.

Program 3.18: SAS Operators Examples

DATA WORK.TEST;
  SET MYDATA.LISTINGS;
  /*Specify length of new character variable*/
  LENGTH bath_count $8.;
  /*Limit the data imported into memory with a 
  symbol operand*/
  WHERE accommodates ^= 0;
  /*IF-THEN statement with mnemonic operand*/
  IF bathrooms EQ 1 THEN bath_count = ‘Single’;
  ELSE bath_count = ‘Multiple’;
  /*Calculation with arithmetic operand*/
  poly_bath = bathrooms**2;
  /*IF-THEN with logical condition*/
  IF (accommodates >= 5 AND bathrooms GE 2)
  THEN big_house = 1; ELSE big_house = 0;
RUN;

Program 3.18 demonstrates the ability to utilize SAS operands to limit data, filter observations and create new variables all within a single DATA step.

IF-THEN/ELSE Logical Statements

SAS evaluates IF-THEN statements and determines if the result of a logical expression is true or false. If the result is true, SAS executes the statement defined in the THEN statement. If the statement is false, SAS continues to the statement defined in the ELSE statement.

In the code shown above (Program 3.18), the IF-THEN / ELSE logical statement follows a pattern:

1. Evaluates the arithmetic expression “IF bathrooms EQ 1”.

2. If this value is true, SAS executes the THEN statement and creates a character variable named bath_count and sets the value to “Single”.

3. If the value is false, SAS executes the ELSE statement and sets the bath_count to “Multiple”.

SAS also allows the programmer to link together multiple IF-THEN / ELSE statements.

SAS Dates

SAS provides the programmer with the flexibility to perform calculations on date values. Dates are stored as numeric values in SAS. This value represents the number of days between January 1, 1960, and a specified date. Any date value before January 1, 1960, is stored as a negative value, while dates after January 1, 1960, are stored as positive values. These date values account for all leap year days.

Figure 3.2: SAS Date Values Example

The SAS code below demonstrates several different processes that can be applied to SAS date values. These processes include visualizing the date value in various ways through the FORMAT statement, performing arithmetic operations with the BETWEEN and DATDIF functions, and extracting information from date values with specific date functions.

Program 3.19: SAS Date Formats Examples

DATA dates;
       set MYDATA.Listings;
       WHERE first_review BETWEEN ‘01JAN2018’d AND ‘30APR2018’d;
       date_fmt1 = first_review;
       date_fmt2 = first_review;
       date_fmt3 = first_review;
       day   = day  (first_review);
       week  = week (first_review);
       month = month(first_review);
       year  = year (first_review);
       btw_dates = DATDIF(first_review, last_review, ‘ACT/ACT’);
       KEEP date_fmt1 date_fmt2 date_fmt3 first_review
           last_review
               day week month year btw_dates;
       FORMAT date_fmt1 DATE. date_fmt2 DATE9.
             date_fmt3 WEEKDATE.;
RUN;

● The WHERE statement performs an arithmetic calculation to identify dates between the two date values specified with a BETWEEN – AND statement.

● Three new variables were created that are each equivalent to the first_review variable. Each of these new variables were treated with a different FORMAT statement to demonstrate different visualizations of the same date value.

● The day, week, month, and year date functions were implemented to create variables specific to these aspects of the first_review field.

● The DATDIF function performs an arithmetic calculation that outputs the number of days between dates. The ACT/ACT specification tells SAS to use the actual number of days between the specified dates.

Output 3.12: SAS Date Values Examples

The output shown in Output 3.12 demonstrates the different manipulations that we have performed on the date variables.

● The DATE. format presents the first_review variable as 12FEB18.

● The DATE9. format presents the first_review variable as 12FEB2018.

● The WEEKDATE. format presents the first_review variable as Monday, February 12, 2018.

● The DAY function returns the numeric day of the month.

● The WEEK function returns the numeric week of the year.

● The MONTH function returns the numeric month of the year.

● The YEAR function returns the year.

● The DATDIF function returns the number of days between the two specified dates.

This example is just a small sampling of the formats that can be applied to any date value, and this example just scratches the surface on the functions that can be applied to date values. I encourage everyone to look through the SAS documentation for a complete review of SAS date, time, and datetime values: https://support.sas.com/en/documentation.html.

Merging Data Sets

A single data set rarely contains all the information that you will need for your data science project. A data scientist will often have to bring together information from various data sources to create one or more data sets that will be used for their analysis. There are multiple ways in which you can combine data sets in SAS, but we will focus on the two most common methods:

● PROC SQL

● SAS MATCH MERGE

PROC SQL

Structured Query Language (SQL – pronounced “sequel”) is a commonly used programming language that is supported by many other programming languages (SAS, Python, R, Scala, and so on). Due to the ubiquity of this language, it is a great investment of your time to learn the basics of SQL. The PROC SQL statement in SAS enables you to do the following:

● Retrieve data stored in tables and views

● Manipulate data and create new variables

● Create new data sets and views

The difference between a table and a view is that the table contains data while the view contains no data. The view is a stored query expression that reads data values from its underlying tables dynamically.

There are four main sections to a SQL procedure:

1. Specify the data set that you want to create.

2. Select and create variables to populate your data set.

3. Identify the table or tables that you want to pull the data from.

4. Specify any conditional logic to limit the data pull.

The following code provides an example of creating a new data set in the MYDATA library that I will name “Combine.” This data set will contain a mixture of raw, renamed, and created variables from two data sets that I’ve joined together through a common variable available in both tables. Certain logical conditions have also limited the data.

Program 3.20: PROC SQL with Logical Conditions Example

PROC SQL;
       CREATE TABLE MYDATA.Combine AS
       SELECT  a.host_id,
                     a.room_type,
                     a.price AS base_price,
                     b.price AS calendar_price,
                     CASE
                            WHEN a.price GE 150 THEN ‘Expensive’
                            ELSE ‘Cheap’
                            END
                            AS price_cat,
                     b.date AS calendar_date
              FROM MYDATA.Listings AS a
              LEFT JOIN MYDATA.Calendar AS b
                     ON a.listing_id = b.listing_id
                            WHERE a.host_id IS NOT NULL
                            AND a.bedrooms GT 1
                            AND b.date BETWEEN ‘01JAN2019’d AND ‘30JAN2019’d;
QUIT;

One of the main benefits of SQL language is that it is very easy to understand because it reads like common language. Even if you are not a programmer, you can probably understand through the naming conventions that I am creating a table called MYDATA.Combine, and I am selecting some variables from the MYDATA.Listings and the MYDATA.Calendar tables and limiting it to specific instances.

Let’s look at each step of the SQL code:

● PROC SQL – this tells SAS that you are about to write some SQL code.

● CREATE TABLE AS – this step specifies that you will create a table rather than a view. You have the option of creating a temporary table or a permanent table with the use of a library statement.

● SELECT – this step specifies which variables that you want to include in your output data set and allows you to create new variables.

◦ You can select all the variables from a specified data set through the use of the “*” symbol (for example A.* will select all of the variables from the table assigned to the alias A).

◦ You can rename a variable by including an AS statement. This method is a quick and easy way to rename variables, especially variables that have identical names in different tables but have different values (for example, the variable name could be found in both tables A and B; however, in table A it represents company name, and in table B it represents customer name. It would be beneficial to rename the variable to specify the meaning of name).

◦ You can create new variables with the use of the CASE-WHEN-THEN-ELSE statements. The CASE statement selects values if certain conditions are met. You can use an arithmetic, logical, or conditional statement to specify your selection method. The CASE statement allows you to define your conditions, specify the output data, and create a variable name, all within the same CASE statement.

● A comma separates each of the selected variables. The final variable does not have a comma following the variable name.

● FROM – specifies a table from a specific database that you will use to either populate your SAS table or limit data that will be included in your SAS table

● JOIN – when you have more than one data source, the JOIN statement tells SAS how to merge the two data sets. There are three main options to join tables:

◦ INNER JOIN – retains records that match a common variable in both tables

◦ OUTER JOIN – retains records that match a common variable in the specified table (right or left)

◦ FULL JOIN – retains records contained in either table

● WHERE – you can specify conditions that limit the data to include in your SAS table.

SAS Match-Merge

If your data sets are already in SAS tables, the SAS match-merge process provides you with the flexibility of all of the functions available in a DATA step with the ability to combine multiple data sets on a common variable as defined in the BY statement.

Program 3.21: SAS Match-Merge Example

PROC SORT DATA=MYDATA.LISTINGS;
  BY listing_id; 
RUN;
PROC SORT DATA=MYDATA.CALENDAR;
  BY listing_id;
RUN;
DATA Left Right Left_Outer Inner;
  MERGE MYDATA.LISTINGS (IN=a) MYDATA.CALENDAR (IN=b);
  BY listing_id;
  IF a THEN OUTPUT left;
  IF b THEN OUTPUT Right;
  IF a AND NOT b THEN OUTPUT Left_Outer;
  IF a AND b THEN OUTPUT Inner;
RUN;

Program 3.21 demonstrates some important aspects of the match-merge process.

1. Each data set that will be merged must be sorted by the common variable that the data sets will be joined on (the BY variable).

2. The use of the IN statement assigns each of the data sets to a group that will control the matching conditional logic. These groupings will be control points for how the data will be joined.

3. You can create multiple data sets in a single DATA step through the use of the OUTPUT statement.

4. The matching logic is controlled by the IF-THEN logic applied to the groupings assigned in the IN statement.

a. LEFT JOIN – IF a

b. RIGHT JOIN – IF b

c. LEFT OUTER JOIN – IF a AND NOT b

d. INNER JOIN – IF a AND b

Table 3.4: Comparing Merge Techniques

PROC SORT

Sorting data is an essential part of data management. Many SAS procedures require you to sort the data before committing to a BY statement within a certain procedure such as MERGE or PROC MEANS. The SORT procedure provides you with an easy and flexible method of sorting your data with options to customize your output.

The standard PROC SORT simply calls the SORT procedure on a specified data set and sorts it by a specified variable:

Program 3.22: PROC SORT Example

PROC SORT DATA=MYDATA.LISTINGS;
  BY listing_id; 
RUN;

Program 3.22 will sort the data contained in the MYDATA.Listings data set by the listing_id variable. This code will replace the existing MYDATA.Listings data set with a data set with the exact same values only sorted in ascending sequence by the variable specified in the BY statement.

In contrast, the Program 3.23 utilizes the OUT statement to create a new data set that contains the same data elements found in the data sets specified in the DATA statement; however, the new data set is sorted by the variable specified in the BY statement while the original data set is left unchanged. This PROC SORT also utilizes the DESCENDING option to change the order of the sort.

Program 3.23: PROC SORT with DESCENDING Option

PROC SORT DATA=MYDATA.LISTINGS OUT=list_sort;
  BY DESCENDING listing_id;
RUN;

The final example of PROC SORT is demonstrated below. This version includes the NODUPKEY statement. This statement is used to de-duplicate a data set by one or multiple variables. It is a good practice to use an OUT statement when you are performing a de-duplicated sort to maintain the original full data set.

Program 3.24: De-duplicated Sort with DESCENDING Option

PROC SORT NODUPKEY DATA=MYDATA.LISTINGS OUT=list_sort;
  BY DESCENDING listing_id;
RUN

Export Data

Data in a vacuum is useless. What I mean by that is there should be a purpose to all the data manipulations that you have performed. You might have successfully accessed data, read it into a SAS table, filtered, summarized, and manipulated the data. However, all those data manipulation techniques are all means to an end. If you are performing all those actions to answer a question, then we will need a way to share that answer with an audience. The three main methods of moving data from the SAS environment to an environment outside of SAS are exporting, reporting, and graphing.

PROC EXPORT

The first method of sharing data will focus on exporting a SAS data set to an external environment. This method is generally used to move data from a SAS table to an Excel file, CSV file, or Access database. However, there are multiple output formats that you can write your data to including JMP, SPSS, Stata, Lotus 1-2-3, tab-delimited, and many more.

PROC EXPORT reads in a SAS data set and writes the data to an external data file. In the example below, PROC EXPORT is used to access an existing SAS data set named MYDATA.list_sort and write the contents of that file to an external file.

Program 3.25: PROC EXPORT Example

PROC EXPORT DATA=MYDATA.list_sort
       OUTFILE=”C:\Users\James Gearheart\Data\list.xls”
       DBMS=EXCEL REPLACE;
       SHEET=”list”;
RUN;

A few notes on Program 3.25:

● PROC EXPORT is called to write data from a SAS table into an external file.

● The DATA statement specifies the data set to be exported.

● The OUTFILE statement specifies where you would like to place the newly created data. This statement also allows you to specify the name and format of the output file.

● The DBMS statement specifies the type of external data file that the PROC EXPORT will create. This step is where you can select from a wide range of data formats including Access, JMP, dBASE, and many others.

● The REPLACE statement specifies that if the data set already exists, then it should be replaced with the newly exported data.

● The SHEET statement is specific to the EXCEL format and allows you to name the sheet in the newly created Excel workbook.

If you need to write a large amount of data from a SAS environment to a CSV file, SAS has created a high-performance macro that can efficiently move larger quantities of data to a CSV file. The SAS utility macro that is already included in SAS is called %DS2CSV. Although we will cover macros in the next chapter, it is important to demonstrate this export methodology in the “Export Data” section of the book.

Program 3.26: SAS Utility Macro Example

%DS2CSV(DATA=MYDATA.Listings, RUNMODE=B, 
CSVFILE=”C:\Users\James Gearheart\Data\listings.csv”);

A few notes on Program 3.26:

● The above code calls the %DS2CSV SAS utility macro.

● The DATA statement specifies the file that you would like to export.

● The RUNMODE statement allows the option of running the macro in batch or server mode.

◦ Batch mode (RUNMODE = B) means that you are submitting the %DS2CSV macro in the SAS Program Editor.

◦ Server mode (RUNMODE = S) is used with Application Dispatcher programs and streaming output stored processes.

● The CSVFILE statement specifies the location and name of the export file.

PROC REPORT

PROC REPORT is a highly flexible procedure that enables a programmer to control the content and format of data to be displayed in a text-based or ODS (Output Delivery System) report. This procedure writes output to the Results window of the SAS programming environment.

The REPORT procedure provides you the ability to control the order of variables that you would like to include in your report as well as if and how they are summarized, and the format in which the results are displayed. Program 3.27 provides an example of the REPORT procedure that builds a summarized report for the MYDATA.Listings data set.

Program 3.27: PROC REPORT Example

PROC REPORT DATA=MYDATA.LISTINGS ;
  TITLE ‘Summarized AirBnB Report’;
  COLUMN bed_type room_type price;
  DEFINE bed_type / GROUP ‘Bed Type’;
  DEFINE room_type / GROUP ORDER=FREQ DESCENDING ‘Room Type’;
  DEFINE price / MEAN FORMAT=DOLLAR10.2 ‘Average Price’;
RUN;

Notes on Program 3.27:

● The REPORT procedure is called to create a summarized report of the MYDATA.Listings data set.

● The TITLE statement is used to create a title to be displayed in the output.

● The COLUMN statement is used to specify the variables to be displayed in the report and the order in which they will appear.

● The DEFINE statement defines how to use and display a report item. The associated GROUP statement specifies that the report should summarize the data for each level of the DEFINE specified variable.

● The single-quoted statement after the GROUP statement specifies how the variable will be labeled in the report.

● The ORDER = FREQ statement specifies that the variable identified in the DEFINE statement will be ordered according to the frequency of occurrence. The DESCENDING statement specifies that the order frequency will be descending in the report.

● The MEAN statement for the numeric variable price specifies that the report will contain the mean value of price as defined by the proceeding GROUP statements.

Output 3.13: PROC REPORT Output

Graphs

One of the most common methods of sharing data with a broad audience is through the use of graphs. When done correctly, graphs provide a quick visual representation of the underlying data that is clear, concise, and easy to understand. PROC SGPLOT enables you to create an incredible variety of graphs, including bar charts, line charts, scatter plots, pie charts, and many others. I will briefly demonstrate the programming code and the associated output for a few of the most commonly utilized graphs.

Histograms

A histogram is a visual display of a continuous variable’s distribution that is grouped into defined ranges. This graph can demonstrate the range of values as well as the volume of observations at each point in the distribution. A histogram is commonly used to visually demonstrate a continuous variable’s minimum value, maximum value, and mean value, as demonstrated by the volume of observations at those points.

Program 3.28 demonstrates the SAS code used to create a histogram of the log adjusted price variable that we used in the case study example. This code will create the histogram and define the bin ranges that comprise the histogram. It will also create a density curve that shows a smooth continuous curve for the distribution of the numeric values.

Program 3.28: Distribution Plot Example

PROC SGPLOT DATA=WORK.TRAIN_ADJ;
  HISTOGRAM Price_Log / BINWIDTH= 0.25;
  DENSITY Price_Log;
  TITLE ‘Distribution of Log Adjusted Price’;
RUN;

Notes on Program 3.28:

● The SGPLOT procedure is called to create a visual display of the TRAIN_ADJ data set located in the WORK library.

● The HISTOGRAM statement specifies the type of graph to be created. The HISTOGRAM statement can only be applied to numeric variables. The Price_Log variable is specified to be represented in the histogram.

● The BINWIDTH statement allows the programmer to specify the width of the categorical bins that are created by the HISTOGRAM statement. This statement is optional because SAS will determine a bin width value by default.

● The DENSITY statement creates a density curve for the distribution of numeric values.

● The TITLE statement is used to create a title to be displayed in the output.

Output 3.14: Distribution of Log-Adjusted Price

Output 3.14 shows the distribution of the Price_Log variable. This chart demonstrates how the log transformation of the continuous Price variable results in a normally distributed variable.

Bar Charts

Bar charts show the distribution of a categorical variable. The volume of observations for each level of a categorical variable is represented as vertical rectangular bars. SAS provides many options to customize bar charts, including grouping, stacking, and selecting an output statistic.

Program code 3.29 creates a bar chart on the case study data. The data will represent the average log price grouped by neighborhood and room type.

Program 3.29: Bar Chart Example

PROC SGPLOT DATA=WORK.TRAIN_ADJ;
  VBAR neighbourhood_group_cleansed / 
  RESPONSE=Price_Log GROUP= room_type STAT=MEAN;
  YAXIS LABEL=’Mean Log Price’;
  XAXIS LABEL=’Neighbourhood Group’;
  TITLE ‘Mean Log Price of Room Type by Neighbourhood Group’;
RUN;

Notes on Program 3.29:

● PROC SGPLOT is called to create a visual display of the TRAIN_ADJ data set located in the WORK library.

● The VBAR statement specifies the type of graph to be created. The VBAR statement can only be applied to character variables. The neighbourhood_group_cleansed variable is specified to be represented in the histogram.

● The RESPONSE statement allows the programmer to specify the output statistic that is created by the VBAR statement. This statement is optional because SAS will report the volume of observations by default.

● The GROUP statement is optional and allows the programmer to create a stacked bar chart using another categorical variable.

● The STAT statement is optional because SAS will report the volume of observations by default.

● The TITLE statement is used to create a title to be displayed in the output.

Output 3.15: Bar Chart Example Output

Output 3.15 shows the mean log price grouped by neighborhood and stacked by room type. This chart demonstrates how the average price varies by the categorical variables.

Line Charts

Line charts are called series plots in SAS. These charts are most often used to show the trend of a numeric variable over time. The value of a numeric variable is plotted for each date over a range of time. Each of the plotted values is connected by a line. This methodology allows the researcher to demonstrate the change in the variable over time.

Program 3.30 first prepares a data set to be graphed with a line chart. The DATA step selects three unique identifiers from the MYDATA.Calendar data set and creates separate price variables for each Listing ID. This data is then sorted by the Listing ID and Date fields. The SGPLOT procedure is incorporated to develop the series plot on the adjusted data set.

Program 3.30: Line Chart Example

DATA CAL; 
  SET MYDATA.Calendar;
  WHERE Listing_id in (21456, 2539, 5178);
  IF listing_id = 21456 THEN price_1 = price;
  IF listing_id = 2539 THEN price_2 = price;
  IF listing_id = 5178 THEN price_3 = price;
run;
PROC SORT DATA=CAL; BY listing_id date; run;
PROC SGPLOT DATA=CAL ;
  SERIES X=date Y=price_1 / LEGENDLABEL=’Listing 21456’;
  SERIES X=date Y=price_2 / LEGENDLABEL=’Listing 2539’;
  SERIES X=date Y=price_3 / LEGENDLABEL=’Listing 5178’;
  YAXIS LABEL= ‘Daily Price’;
  XAXIS LABEL= ‘Date’;
  TITLE ‘Price Per Night’;
RUN;

Notes on Program 3.30:

● The SGPLOT procedure is called to create a visual display of the adjusted MYDATA.Calendar data set.

● The SERIES statement specifies the type of graph to be created. The SERIES statement can be applied only to numeric variables. The date variable is selected for the x-axis while the newly created price variable is selected for the y-axis. This process is repeated for each of the price variables that we want to be represented in the chart.

● The LEGENDLABEL allows you to specify how the data will be referred to in the associated legend.

● The TITLE statement is used to create a title to be displayed in the output.

Output 3.16: Line Chart Example Output

Output 3.16 shows the daily price for each of the selected listings over a defined time period. This chart highlights different pricing strategies for each of the listings, with one of the listings doubling its daily rate on the days surrounding New Year’s Eve while the other listings choose to drop their rate at that same time period.

Scatter Plots

A scatter plot graphs the observations of numeric variables in relation to one another. These graphs primarily focus on the relationship between two variables represented in a two-dimensional graph. Each observation is represented as a single point in the two-dimensional space.

Program 3.31 uses PROC MEANS to calculate the average price for each value of the ordinal review score variable. PROC SGPLOT is used to create a scatter plot where the review score is on the x-axis, and average price is on the y-axis. Conditional logic is also applied to the data set to retain only those cases where there are greater than or equal to twenty observations for each level of the review score.

Program 3.31: Scatter Plot Example

PROC MEANS DATA=MYDATA.LISTINGS;
  VAR price;
  CLASS review_scores_rating;
  OUTPUT OUT= AVG; 
RUN;
PROC SGPLOT DATA=AVG (WHERE=(_STAT_=’MEAN’ and _FREQ_ GE 20));
  SCATTER X = review_scores_rating Y = price;
  YAXIS LABEL= ‘Average Price’;
  XAXIS LABEL= ‘Review Score’;
  TITLE ‘Average Price by Review Score’;
RUN;

Notes on Program 3.31:

● The SGPLOT procedure is called to create a visual display of the newly calculated AVG data set created by PROC MEANS.

● The SCATTER statement specifies the type of graph to be created. The SCATTER statement can be applied only to numeric variables. The review_scores_rating variable is selected for the x-axis while the newly created average price variable is selected for the y-axis.

● The TITLE statement is used to create a title to be displayed in the output.

Output 3.17: Scatter Plot Example Output

Output 3.17 shows the relationship between customers’ review scores and the average price of a property. This is a great way to quickly demonstrate the relationship between price and quality. This chart provides a visual demonstration that the customer review score has a positive association with price.

Chapter Review

This chapter focused on introducing the SAS programming language and providing you with concepts and sample code to get you up and running in SAS. Each of the concepts that this chapter covered could be the basis for an entire chapter devoted to that concept. There are numerous ways to expand upon each of the demonstrated techniques with a variety of options that SAS has available for you. These options can be utilized to perform nearly any data manipulation and analysis that you would like to develop.

Although a skilled data scientist can use the techniques demonstrated in this chapter to develop some incredible analytical products, the next chapter is designed to increase your programming efficiency, introduce automation, and take your programming skills to the next level.