Dealing with storage and memory issues
This section covers issues with storage and memory that arise when using SAS for big data, including the following:
How SAS dealt with competition from structured query language (SQL) for data storage
How PROC SQL works and can be used in data warehouse processing
Considerations about memory and storage that need to be made when using SAS in a data warehouse in modern times
How SAS can work in the cloud
Avoiding memory issues
Even as SAS got more powerful, datasets kept getting bigger, and there were always challenges with running out of memory during processing. For example, using WHERE instead of IF when reading in data would not only reduce CPU usage and the time it took for code to run, it would also prevent unnecessary usage of memory. Even today, tuning SAS code may be necessary to avoid memory issues.
In a data warehouse, mart, or lake, datasets that were transformed in SAS may be stored outside of SAS in SAS format. This makes them easy to read into SAS. However, this format can be very large, so the COMPRESS dataset option was created. Curtis Smith reported on his tests compressing SAS files in his white paper (available under Further reading) and found that, depending upon the dataset, compression could cut the storage space needed in half.
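As an illustration, the COMPRESS option can be set on an individual dataset as it is written, or globally for a session; the library and dataset names here are hypothetical:

```sas
/* Write a compressed copy of a dataset; the SAS log reports
   the percentage reduction in size achieved */
data warehouse.claims_c (compress=YES);
    set warehouse.claims;
run;

/* Alternatively, compress every dataset created in the session */
options compress=YES;
```

Whether compression helps depends on the data; wide character-heavy datasets tend to benefit most, which is consistent with Smith's finding that results vary by dataset.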
Smith recommended not only compressing datasets but also deleting unneeded variables to make datasets smaller. In a data warehouse, mart, or lake, source datasets contain native variables. In a data lake, these datasets may remain relatively unprocessed. However, in a data warehouse or data mart, decisions need to be made about which variables to keep available for analysis in the warehouse. Furthermore, transformed variables may be added during processing to serve the needs of the users of the warehouse.
The team running the data warehouse should ask the following for each raw dataset:
If native variables should be available for analysis, which ones should be kept?
If transformed variables should be available for analysis, which ones should be provided?
By carefully answering these questions, only the columns needed from each dataset can be retained in analysis files, thus reducing processing time for warehouse developers and users.
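Once these questions are answered, the decisions can be implemented with the KEEP (or DROP) dataset option; the library, dataset, and variable selections below are hypothetical:

```sas
/* Retain only the native and transformed variables chosen for
   analysis; all other columns are discarded at write time */
data mart.analysis_file (keep=SEX1 _STATE _AGE80 _BMI5);
    set raw.source_file;
run;
```

Applying KEEP on the output dataset keeps the analysis file small; KEEP can also be applied to the input dataset on the SET statement to avoid reading unneeded columns at all.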
Accommodating Structured Query Language
SQL was developed in the 1970s and was deployed by various database companies through the 1990s and early 2000s. SQL was aimed at data maintenance and storage using relational tables rather than flat files. SQL approaches only became practical at scale in the 1990s due to upgrades in technology that allowed faster processing of data.
SQL languages accomplish the same data editing tasks that data steps do in SAS, but they use a different approach. Unlike SAS, which is a procedural language, SQL is a declarative language:
In SAS, the programmer must write the data step so that its operations run in the most efficient way to optimize data handling.
In SQL, the programmer declares what query output they desire using easy-to-understand, simple English statements, and an optimization program (or optimizer) running in the background figures out the most efficient way to execute the query.
While using efficient code in SQL can still improve performance, the efficiency of SQL is less dependent upon the programmer's code and more dependent on the function of its optimizer. Hence, maintaining data in a database became easier using SQL rather than SAS data steps. In SQL, programmers had to learn a few basic commands that could perform a variety of tasks when used together. But with SAS data steps, programmers needed to study a broad set of commands, and they also had to learn the most efficient way to assemble those commands together in order to achieve optimal SAS performance.
What SQL cannot do is analyze data the way SAS can. Therefore, over the latter half of the 1990s and early 2000s, while many databases began to be stored and maintained in SQL, SAS could still be used on them for analysis through the SAS/Access feature.
A 1995 edition of the periodical Computerworld described current options for SAS users in an article titled SAS Institute's customers keep the faith by Rosemary Cafasso (available under Further reading). There were two ways to conceive of data storage in SAS at that time:
Using SAS only for analysis, and connecting to a non-SAS data storage system to do this
Using SAS for both data storage and analysis
For the first option, SAS/Access features could be used. For the second option, SAS/Application Facility (SAS/AF) was used to create a client/server environment to support both data storage and analysis in SAS. Another term for this setup is server SAS (as opposed to PC SAS, which is an application that runs entirely on a PC without a client/server relationship). The advantage of using SAS/AF is that a comprehensive SAS solution could be used that optimized the client/server relationship (through, for example, partitioning the application so it ran on different processors).
Also, in 1995, SQL optimizers had not been improved to the point where they outperformed SAS data steps, so at that time SAS/AF was a better approach than SAS/Access to connect to a SQL database. As noted in the Computerworld article, this led programmers to gravitate toward working either entirely in a SAS environment, or entirely outside of one.
With the visualization tools included in SAS/AF, SAS was now competing with visualization applications as well as data management applications. SAS's users continued to rate it highly, and were very loyal, as moving away from the SAS/AF platform would be very difficult given its dissimilarity to other applications.
Using PROC SQL
SAS's trajectory in general through its release of version 8 in 1999 and later version 9 (the current one) in 2002 has been to build extra functions into its core analysis products, and to design supporting products around that functionality. Unlike in the early years, SAS has not revisited data step functioning, nor considered redeveloping its data step language as declarative rather than procedural.
Through the late 1990s and early 2000s, SQL became more predominant, and therefore more programmers were trained in SQL. These SQL programmers had a lot of trouble transferring their skills to use in SAS data steps, so SAS developed a SQL language within SAS called PROC SQL.
PROC SQL has the following features:
It is a language within SAS, in that PROC SQL code starts with a PROC SQL statement and ends with a QUIT statement.
It uses SQL commands, such as CREATE TABLE and SELECT with GROUP BY and WHERE.
It allows the user to control its use of processors during execution through the THREADED option.
It includes a WHERE clause and other clauses that use indexes if they are available.
Unlike other SQL implementations, it does not have an independent optimizer program, so writing optimized code is important.
Like SQL, PROC SQL is much easier to use than data step language for a few common tasks. One particularly useful task that is much easier in PROC SQL is creating a VIEW of the data, which allows the user to look at a particular section of the dataset.
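For example, a view restricted to a subset of rows can be created as follows; the view name women_ma is hypothetical:

```sas
PROC SQL;
    /* The view stores only the query, not the data; the subset
       is assembled from Chap_1_1_Infile each time it is used */
    create view women_ma as
        select *
        from Chap_1_1_Infile
        where SEX1 = 2 and _STATE = 25;
quit;
```

Because a view holds no rows of its own, it takes almost no storage and always reflects the current contents of the underlying dataset.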
Imagine we wanted to view the data in our example dataset, Chap_1_1_Infile, but we only wanted to look at the data for women (SEX1 = 2) who live in Massachusetts (_STATE = 25). We could use this PROC SQL code:
PROC SQL;
    select *
    from Chap_1_1_Infile
    where SEX1 = 2 and _STATE = 25;
quit;
This code produces output in the following structure (with just the first three rows provided):

Table 1.4 – Output from PROC SQL
To get similar output using SAS commands, the following PROC PRINT code could be used. Note that all variables are displayed, in the order stored in the dataset, since the VAR statement is excluded:
PROC PRINT DATA=Chap_1_1_Infile;
    where SEX1 = 2 and _STATE = 25;
RUN;
But imagine we did not want to return all the variables – assume we only wanted to return age (_AGE80) and BMI (_BMI5). We could easily replace the asterisk in our PROC SQL code to specify only those two columns:
PROC SQL;
    select _AGE80, _BMI5
    from Chap_1_1_Infile
    where SEX1 = 2 and _STATE = 25;
quit;
In PROC PRINT, to achieve the same output, we would add a VAR statement to our previous code:
PROC PRINT DATA=Chap_1_1_Infile;
    where SEX1 = 2 and _STATE = 25;
    var _AGE80 _BMI5;
RUN;
Even in this short example, it is easy to see how SAS PROCs and data steps are more complicated than SQL commands because SQL has fewer, more modular commands. By contrast, SAS has an extensive toolset of commands and options that, when understood and used wisely, can achieve just about any result with big data.
Using SAS today in a warehouse environment
While PROC SQL may appear to be a way to avoid learning the more complicated data step language, this is not the case in data warehousing. Because PROC SQL lacks an independent optimizer, in many environments it is very slow and can only feasibly be used with smaller datasets. Even today, when transforming big data in SAS, in most environments it is necessary to use data step language, and this affords the programmer an opportunity to develop optimized code, as efficiency is always necessary when dealing with big data in SAS.
However, when interfacing with another database management system (DBMS) where native data are stored in SQL tables, SAS PROC SQL might be more useful. In his recent white paper on working with big data in SAS, Mark Jordan describes various modern approaches to improving the processing efficiency of both PROC SQL and SAS data steps, both in server SAS environments and in environments where SAS is used as the analysis engine and connects to a non-SAS DBMS through SAS/Access.
Jordan describes two scenarios for big data storage and SAS:
Using a modern server SAS setup: Server SAS comes with its own OS, and Base SAS version 9.4 includes its own DS2 programming language. These can be used together to create threaded processing that can optimize data retrieval.
Using SAS for analysis connected to non-SAS data storage: In this setup, SAS/Access is used to connect to a non-SAS DBMS and pull data for analysis into the SAS application. This can create a lag, but if SAS and the DBMS are co-located together and the DBMS can use parallel processing, speed can be achieved.
Ultimately, the main bottleneck in SAS processing has to do with I/O, so the easier it is for the SAS analytic engine to interact with the stored data, the faster processing will go. But even in this modern era, limitations surrounding data I/O continue to force SAS users to develop efficient code.
Jordan provides the following tips for thinking about coding for a SAS data warehouse:
Use WHERE instead of IF wherever possible (due to its increased processing efficiency).
As stated earlier, reduce the columns retained to just the native and transformed variables needed in the warehouse.
Using the options SASTRACE and SASTRACELOC will echo all the SQL generated to the SAS log file, which can be useful for performance tuning.
Use PROC SQL and data steps to do the same tasks, and then compare their processing times using information from the SAS log to choose the most efficient code.
It is especially helpful to compare PROC SQL code performance on summary tasks, such as developing a report of order summaries, because PROC SQL may perform better than PROCs or data steps.
If using a server SAS setup with DS2 and data steps, and the log from your data steps shows a CPU time close to the program runtime, then your data steps are CPU-bound. In those cases, rewriting the data step process in DS2 could be helpful because it could take advantage of threaded processing.
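The logging tips above can be sketched with standard SAS system options; the values shown here are illustrative:

```sas
/* Echo the SQL generated for the DBMS to the SAS log;
   ',,,d' requests the generated SQL statements themselves */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* FULLSTIMER adds detailed real time, CPU time, and memory
   statistics to the log for each step, which makes it easier
   to compare a PROC SQL query against an equivalent data step */
options fullstimer;
```

With these options set, running the PROC SQL and data step versions of the same task back to back puts their timing statistics side by side in the log.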
DS2 has another advantage: it can compute results at a higher level of precision than data steps.
DS2 code uses different commands than data step code but can achieve the same results.
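A minimal DS2 sketch of the same row subset used earlier in this chapter might look like the following (assuming Chap_1_1_Infile is in the WORK library); this is illustrative, not Jordan's code:

```sas
proc ds2;
    data women_ma (overwrite=yes);
        /* DS2 reuses familiar data step ideas, but statements
           live inside declared methods such as run() */
        method run();
            set Chap_1_1_Infile;
            if SEX1 = 2 and _STATE = 25 then output;
        end;
    enddata;
    run;
quit;
```

The structure is recognizably data step-like, but the method-based syntax is what allows DS2 programs to be threaded or pushed into a database.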
On massively parallel processing (MPP) DBMS platforms such as Teradata and Hadoop, DS2 can run as an in-database process using the SAS In-Database Code Accelerator. Using this code accelerator can significantly improve the efficiency of data throughput in these environments.
Note:
In his white paper, Mark Jordan compared PROC SQL processing using the SCAN function with the LIKE operator for retrieving a record with criteria set on a high-cardinality variable, and found the LIKE operator to be more efficient.
Using SAS in the cloud
In his white paper, Jordan also describes how SAS now has a new Viya architecture that offers cloud analytic services (CAS). A CAS library allows the following capabilities:
Fast-loading data into memory
Conducting distributed processing across multiple nodes
Retaining data in memory for use by other processes until deliberately saved to disk
A CAS library has application programming interfaces (APIs) that allow actions to be executed from a variety of languages, including Java, Python, and R, and of course, the SAS Version 9.4 client application.
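As an illustrative sketch from the SAS 9.4 client (the session name is hypothetical), a table can be loaded into CAS memory like this:

```sas
/* Start a CAS session and make its libraries visible to SAS */
cas mysess;
caslib _all_ assign;

/* Load a table into CAS memory, where it stays available to
   other processes until deliberately saved or dropped */
proc casutil;
    load data=work.Chap_1_1_Infile casout="infile" outcaslib="casuser";
quit;
```

Once loaded, the in-memory table can be processed by distributed CAS actions, or reached from Java, Python, or R through the CAS APIs.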
Today, not all warehouse data is stored in the cloud, and many datasets are still stored on traditional servers. Jordan recommended that users who have an installation of the SAS 9.4M5 application and access to SAS Viya CAS, and who want to decide whether to move to CAS from a traditional server, should compare the processing time on a subset of data in both environments. Jordan was able to demonstrate cutting the processing time from over 1 minute to 2.35 seconds by moving his data from a traditional server to SAS Viya CAS.