Dealing with storage and memory issues
This section covers issues with storage and memory that arise when using SAS for big data, including the following:
How SAS dealt with competition from structured query language (SQL) for data storage
How PROC SQL works and can be used in data warehouse processing
Considerations about memory and storage that need to be made when using SAS in a data warehouse in modern times
How SAS can work in the cloud
Avoiding memory issues
Even as SAS got more powerful, datasets kept getting bigger, and there were always challenges with running out of memory during processing. For example, using WHERE instead of IF when reading in data would not only reduce CPU usage and the time it took for code to run, it would also prevent unnecessary usage of memory. Even today, tuning SAS code may be necessary to avoid memory issues.
In a data warehouse, mart, or lake, datasets that were transformed in SAS may be stored outside of SAS in SAS format. This makes them easy to read into SAS. However, this format can be very large, so the COMPRESS dataset option was created. Curtis Smith reported on his tests compressing SAS files in his white paper (available under Further reading) and found that, depending upon the dataset, compression could cut the storage space needed in half.
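As an illustration, the COMPRESS option can be set on an individual dataset as it is written, or globally for a session; the library and dataset names here are hypothetical:

```sas
/* Write a compressed copy of a dataset; the SAS log reports
   the percentage reduction in size achieved */
data warehouse.claims_c (compress=YES);
    set warehouse.claims;
run;

/* Alternatively, compress every dataset created in the session */
options compress=YES;
```

Whether compression helps depends on the data; wide character-heavy datasets tend to benefit most, which is consistent with Smith's finding that results vary by dataset.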
Smith recommended not only compressing datasets but also deleting unneeded variables to make datasets smaller. In a data warehouse, mart, or lake, source datasets contain native variables. In a data lake, these datasets may remain relatively unprocessed. However, in a data warehouse or data mart, decisions need to be made about which variables to keep available for analysis in the warehouse. Furthermore, transformed variables may be added during processing to serve the needs of the users of the warehouse.
The team running the data warehouse should ask the following for each raw dataset:
If native variables should be available for analysis, which ones should be kept?
If transformed variables should be available for analysis, which ones should be provided?
By carefully answering these questions, only the columns needed from each dataset can be retained in analysis files, thus reducing processing time for warehouse developers and users.
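Once these questions are answered, the decisions can be implemented with the KEEP (or DROP) dataset option; the library, dataset, and variable selections below are hypothetical:

```sas
/* Retain only the native and transformed variables chosen for
   analysis; all other columns are discarded at write time */
data mart.analysis_file (keep=SEX1 _STATE _AGE80 _BMI5);
    set raw.source_file;
run;
```

Applying KEEP on the output dataset keeps the analysis file small; KEEP can also be applied to the input dataset on the SET statement to avoid reading unneeded columns at all.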
Accommodating Structured Query Language
SQL was developed in the 1970s and was deployed by various database companies through the 1990s and early 2000s. SQL was aimed at data maintenance and storage using relational tables rather than flat files. SQL approaches only became practical at scale in the 1990s due to upgrades in technology that allowed faster processing of data.
SQL languages accomplish the same data editing tasks that data steps do in SAS, but they use a different approach. Unlike SAS, which is a procedural language, SQL is a declarative language:
In SAS, the programmer must write the data step so that its operations run in the most efficient way to optimize data handling.
In SQL, the programmer declares what query output they desire using easy-to-understand, simple English statements, and an optimization program (or optimizer) running in the background figures out the most efficient way to execute the query.
While using efficient code in SQL can still improve performance, the efficiency of SQL is less dependent upon the programmer's code and more dependent on the function of its optimizer. Hence, maintaining data in a database became easier using SQL rather than SAS data steps. In SQL, programmers had to learn a few basic commands that could perform a variety of tasks when used together. But with SAS data steps, programmers needed to study a broad set of commands, and they also had to learn the most efficient way to assemble those commands together in order to achieve optimal SAS performance.
What SQL cannot do is analyze data the way SAS can. Therefore, over the latter half of the 1990s and early 2000s, while many databases began to be stored and maintained in SQL, SAS could still be used on them for analysis through the SAS/Access feature.
A 1995 edition of the periodical Computerworld described current options for SAS users in an article titled SAS Institute's customers keep the faith by Rosemary Cafasso (available under Further reading). There were two ways to conceive of data storage in SAS at that time:
Using SAS only for analysis, and connecting to a non-SAS data storage system to do this
Using SAS for both data storage and analysis
For the first option, SAS/Access features could be used. For the second option, SAS/Application Facility (SAS/AF) was used to create a client/server environment to support both data storage and analysis in SAS. Another term for this setup is server SAS (as opposed to PC SAS, which is an application that runs entirely on a PC without a client/server relationship). The advantage of using SAS/AF is that a comprehensive SAS solution could be used that optimized the client/server relationship (through, for example, partitioning the application so it ran on different processors).
Also, in 1995, SQL optimizers had not been improved to the point where they outperformed SAS data steps, so at that time SAS/AF was a better approach than SAS/Access to connect to a SQL database. As noted in the Computerworld article, this led programmers to gravitate toward working either entirely in a SAS environment, or entirely outside of one.
With the visualization tools included in SAS/AF, SAS was now competing with visualization applications as well as data management applications. SAS's users continued to rate it highly, and were very loyal, as moving away from the SAS/AF platform would be very difficult given its dissimilarity to other applications.
Using PROC SQL
SAS's trajectory in general through its release of version 8 in 1999 and later version 9 (the current one) in 2002 has been to build extra functions into its core analysis products, and to design supporting products around that functionality. Unlike in the early years, SAS has not revisited data step functioning, nor considered redeveloping its data step language as declarative rather than procedural.
Through the late 1990s and early 2000s, SQL became more predominant, and therefore more programmers were trained in SQL. These SQL programmers had a lot of trouble transferring their skills to use in SAS data steps, so SAS developed a SQL language within SAS called PROC SQL.
PROC SQL has the following features:
It is a language within SAS, in that PROC SQL code starts with a PROC SQL statement and ends with a QUIT statement.
It uses SQL commands, such as CREATE TABLE and SELECT with GROUP BY and WHERE.
It allows the user to control its use of processors during execution through the THREADED option.
It includes a WHERE clause and other clauses that use indexes if they are available.
Unlike other SQL implementations, it does not have an independent optimizer program, so writing optimized code is important.
Like SQL, PROC SQL is much easier to use than data step language for a few common tasks. One particularly useful task that is much easier in PROC SQL is creating a VIEW of the data, which allows the user to look at a particular section of the dataset.
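For example, a view restricted to a subset of rows can be created as follows; the view name women_ma is hypothetical:

```sas
PROC SQL;
    /* The view stores only the query, not the data; the subset
       is assembled from Chap_1_1_Infile each time it is used */
    create view women_ma as
        select *
        from Chap_1_1_Infile
        where SEX1 = 2 and _STATE = 25;
quit;
```

Because a view holds no rows of its own, it takes almost no storage and always reflects the current contents of the underlying dataset.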
Imagine we wanted to view the data in our example dataset, Chap_1_1_Infile, but we only wanted to look at the data for women (SEX1 = 2) who live in Massachusetts (_STATE = 25). We could use this PROC SQL code:
PROC SQL;
    select *
    from Chap_1_1_Infile
    where SEX1 = 2 and _STATE = 25;
quit;
This code produces output in the following structure (with just the first three rows provided):

Table 1.4 – Output from PROC SQL
To get similar output using SAS commands, the following PROC PRINT code could be used. Note that all variables are displayed, in the order stored in the dataset, since the VAR statement is excluded:
PROC PRINT DATA=Chap_1_1_Infile;
    where SEX1 = 2 and _STATE = 25;
RUN;
But imagine we did not want to return all the variables – assume we only wanted to return age (_AGE80) and BMI (_BMI5). We could easily replace the asterisk in our PROC SQL code to specify only those two columns:
PROC SQL;
    select _AGE80, _BMI5
    from Chap_1_1_Infile
    where SEX1 = 2 and _STATE = 25;
quit;
In PROC PRINT, to achieve the same output, we would add a VAR statement to our previous code:
PROC PRINT DATA=Chap_1_1_Infile;
    where SEX1 = 2 and _STATE = 25;
    var _AGE80 _BMI5;
RUN;
Even in this short example, it is easy to see how SAS PROCs and data steps are more complicated than SQL commands because SQL has fewer, more modular commands. By contrast, SAS has an extensive toolset of commands and options that, when understood and used wisely, can achieve just about any result with big data.
Using SAS today in a warehouse environment
While PROC SQL may appear to be a way to avoid learning the more complicated data step language, this is not the case in data warehousing. Because PROC SQL lacks an independent optimizer, in many environments it is very slow and can only feasibly be used with smaller datasets. Even today, when transforming big data in SAS, in most environments it is necessary to use data step language, and this affords the programmer an opportunity to develop optimized code, as efficiency is always necessary when dealing with big data in SAS.
However, when interfacing with another database management system (DBMS) where native data are stored in SQL tables, SAS PROC SQL might be more useful. In his recent white paper on working with big data in SAS, Mark Jordan describes various modern approaches to improving the processing efficiency of both PROC SQL and SAS data steps, both in server SAS environments and in environments where SAS is used as the analysis engine and connects to a non-SAS DBMS through SAS/Access.
Jordan describes two scenarios for big data storage and SAS:
Using a modern server SAS setup: Server SAS comes with its own OS, and Base SAS version 9.4 includes its own DS2 programming language. These can be used together to create threaded processing that can optimize data retrieval.
Using SAS for analysis connected to non-SAS data storage: In this setup, SAS/Access is used to connect to a non-SAS DBMS and pull data for analysis into the SAS application. This can create a lag, but if SAS and the DBMS are co-located together and the DBMS can use parallel processing, speed can be achieved.
Ultimately, the main bottleneck in SAS processing has to do with I/O, so the easier it is for the SAS analytic engine to interact with the stored data, the faster processing will go. But even in this modern era, limitations surrounding data I/O continue to force SAS users to develop efficient code.
Jordan provides the following tips for thinking about coding for a SAS data warehouse:
Use WHERE instead of IF wherever possible (due to its increased processing efficiency).
As stated earlier, reduce the columns retained to just the native and transformed variables needed in the warehouse.
Using the options SASTRACE and SASTRACELOC will echo all the SQL generated to the SAS log file, which can be useful for performance tuning.
Use PROC SQL and data steps to do the same tasks, and then compare their processing times using information from the SAS log to choose the most efficient code.
It is especially helpful to compare PROC SQL code performance on summary tasks, such as developing a report of order summaries, because PROC SQL may perform better than PROCs or data steps.
If using a server SAS setup with DS2 and data steps, and the log from your data steps shows a CPU time close to the program runtime, then your data steps are CPU-bound. In those cases, rewriting the data step process in DS2 could be helpful because it could take advantage of threaded processing.
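The logging tips above can be sketched with standard SAS system options; the values shown here are illustrative:

```sas
/* Echo the SQL generated for the DBMS to the SAS log;
   ',,,d' requests the generated SQL statements themselves */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* FULLSTIMER adds detailed real time, CPU time, and memory
   statistics to the log for each step, which makes it easier
   to compare a PROC SQL query against an equivalent data step */
options fullstimer;
```

With these options set, running the PROC SQL and data step versions of the same task back to back puts their timing statistics side by side in the log.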
DS2 has another advantage: it can compute results at a higher level of precision than data steps.
DS2 code uses different commands than data step code but can achieve the same results.
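A minimal DS2 sketch of the same row subset used earlier in this chapter might look like the following (assuming Chap_1_1_Infile is in the WORK library); this is illustrative, not Jordan's code:

```sas
proc ds2;
    data women_ma (overwrite=yes);
        /* DS2 reuses familiar data step ideas, but statements
           live inside declared methods such as run() */
        method run();
            set Chap_1_1_Infile;
            if SEX1 = 2 and _STATE = 25 then output;
        end;
    enddata;
    run;
quit;
```

The structure is recognizably data step-like, but the method-based syntax is what allows DS2 programs to be threaded or pushed into a database.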
On massively parallel processing (MPP) DBMS platforms such as Teradata and Hadoop, DS2 can run as an in-database process using the SAS In-Database Code Accelerator. Using this code accelerator can significantly improve the efficiency of data throughput in these environments.
Note:
In his white paper, Mark Jordan compared PROC SQL processing using the SCAN function with the LIKE operator for retrieving a record with criteria set on a high-cardinality variable, and found the LIKE operator to be more efficient.
Using SAS in the cloud
In his white paper, Jordan also describes how SAS now has a new Viya architecture that offers cloud analytic services (CAS). A CAS library allows the following capabilities:
Fast-loading data into memory
Conducting distributed processing across multiple nodes
Retaining data in memory for use by other processes until deliberately saved to disk
A CAS library has application programming interfaces (APIs) that allow actions to be executed from a variety of languages, including Java, Python, and R, and of course, the SAS Version 9.4 client application.
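As an illustrative sketch from the SAS 9.4 client (the session name is hypothetical), a table can be loaded into CAS memory like this:

```sas
/* Start a CAS session and make its libraries visible to SAS */
cas mysess;
caslib _all_ assign;

/* Load a table into CAS memory, where it stays available to
   other processes until deliberately saved or dropped */
proc casutil;
    load data=work.Chap_1_1_Infile casout="infile" outcaslib="casuser";
quit;
```

Once loaded, the in-memory table can be processed by distributed CAS actions, or reached from Java, Python, or R through the CAS APIs.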
Today, not all warehouse data is stored in the cloud, and many datasets are still stored on traditional servers. Jordan recommended that users who have an installation of the SAS 9.4M5 application and access to SAS Viya CAS, and who want to decide whether to move to CAS from a traditional server, should compare the processing time on a subset of data in both environments. Jordan was able to demonstrate cutting the processing time from over 1 minute to 2.35 seconds by moving his data from a traditional server to SAS Viya CAS.