Using SAS in modern warehousing
Today, SAS data warehousing is more complicated than it used to be because there are so many options. Learning about these options can help users envision the possibilities and design a SAS data warehousing system appropriate to their organization's needs. This section will cover the following:
A modern case study that used SAS components for analyzing unstructured text in helpdesk tickets
A case study of a SAS data warehouse that upgraded an old system to include a new API, giving users more visualization functionality through SAS Visual Analytics
A case study of a legacy SAS shop that began to incorporate R into their system
A review of how SAS connects with a newer cloud data platform, Snowflake
Warehousing unstructured text
In his white paper on warehousing unstructured text in SAS, Nick Evangelopoulos describes how the IT Shared Services (ITSS) division at the University of North Texas (UNT) used SAS to study their service tickets to try to improve services (link available under Further reading). Here are the steps they took:
They chose to study a set of 9,691 tickets (representing approximately 18 months' worth of tickets) consisting mainly of unstructured text from the ticketing platform ServiceNow.
Using the open source statistical application R, they cleaned the text. Mostly, this consisted of removing back-and-forth email conversations that had been recorded in the unstructured ticket text.
Using SAS Text Miner, the text mining component that runs within SAS Enterprise Miner (SAS EM), they used text extraction to help classify the tickets by topic.
Next, the team used Base SAS and the analytics component SAS/STAT to add indicator variables and other quantitative variables to the topics, thus creating a quantitative dataset that could be analyzed and visualized.
After doing this, the team wondered whether SAS EM would classify a ticket under the same topic as the user entering it would. To answer this question, they analyzed 1,481 new tickets that were classified both by SAS EM and by the users. They found dramatic differences between the two classifications, suggesting that this classification may need additional development in order to be useful. (A code sketch of these last two steps follows this list.)
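As an illustration of those last two steps, here is a minimal SAS sketch. The dataset and variable names (tickets_scored, em_topic, user_category, and so on) are hypothetical stand-ins for illustration, not the ITSS team's actual code:

/* Hypothetical data: each ticket carries the topic assigned by SAS EM
   and the category chosen by the user who entered the ticket */
data tickets_flagged;
    set work.tickets_scored;
    /* Indicator variables for two example topics */
    topic_password = (em_topic = 'Password Reset');
    topic_network  = (em_topic = 'Network Outage');
run;

/* Cross-tabulate the two classifications to see how often they agree */
proc freq data=tickets_flagged;
    tables em_topic * user_category / norow nocol;
run;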
Using SAS components for warehousing
A white paper by Li-Hui Chen and Manuel Figallo describes a modern SAS data warehouse built from SAS applications (available under Further reading). The US Department of Health and Human Services (DHHS) maintains a data warehouse of health indicators called the Health Indicators Warehouse (HIW). The authors describe how they upgraded this SAS data warehouse system to improve performance and customer service using SAS Visual Analytics (VA) accessed through an API.
The HIW serves many users over the internet. Prior to the upgrade, SAS datasets were retrieved from storage using SAS, and extract-transform-load (ETL) processes had to be performed manually on the data before it could be visualized. This made the data in the warehouse difficult to visualize.
With the upgrade, this is the new process:
Users obtain permission to access the API, which controls access to the underlying data as well as the VA capabilities.
Using the API's GUI, users indicate which health indicator they want to extract from the HIW and how they want to visualize it.
The API extracts the necessary data from the HIW data store using automated processing.
The API performs necessary ETL processes to support visualization.
The API then visualizes the results using VA. (A hypothetical sketch of such an API call from SAS follows this list.)
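To make the flow concrete, here is a purely hypothetical sketch of what a programmatic request to such an API might look like from SAS. The endpoint URL, parameters, and token are invented for illustration and do not come from the white paper; PROC HTTP and the JSON LIBNAME engine require SAS 9.4M4 or later:

/* Request one health indicator from a hypothetical HIW endpoint */
filename resp temp;

proc http
    url='https://example.gov/hiw/api/v1/indicators?id=1234&format=json'
    method="GET"
    out=resp;
    /* assumes &api_token was obtained in the access-permission step */
    headers "Authorization"="Bearer &api_token.";
run;

/* Read the JSON response into SAS tables for further processing */
libname hiw json fileref=resp;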
Here is a conceptual diagram of the old and new systems:

Figure 1.3 – SAS warehousing system before and after adding an API layer
Focusing on I/O, the authors point out that ETL in the API is achieved by running SAS macros: code routines, developed in the SAS macro language, that can take user or system inputs and be run automatically. These macros can be run either through a stored process (where the macro runs on one dataset at a time) or a batch process (where the macro runs on several datasets at once). The authors found that they needed to use a batch process when transferring large amounts of HIW data through an API call.
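Here is a minimal sketch of that macro pattern, assuming hypothetical dataset and parameter names rather than the authors' actual code:

/* A parameterized ETL macro: ds= names the indicator dataset to process */
%macro etl_indicator(ds=);
    data work.&ds._clean;
        set hiw.&ds.;
        where not missing(value);   /* drop empty measurements */
        year = year(report_date);   /* derive a field for visualization */
    run;
%mend etl_indicator;

/* Stored-process style: run the macro on one dataset at a time */
%etl_indicator(ds=obesity_rate)

/* Batch style: loop the same macro over several datasets in one run */
%macro etl_batch(dslist=);
    %local i ds;
    %do i = 1 %to %sysfunc(countw(&dslist.));
        %let ds = %scan(&dslist., &i.);
        %etl_indicator(ds=&ds.)
    %end;
%mend etl_batch;

%etl_batch(dslist=obesity_rate smoking_rate flu_vaccination)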
Using other applications with SAS
SAS has been around a long time and has typically been the first choice for warehousing big data. However, since the rise of SQL, there has been competition between SAS and SQL for data storage functions. And with the rise of R, an open source statistical language known for its visualization capabilities and approachable tooling, SAS has seen competition on statistical analysis functions as well.
Over time, SAS responded to this competition by building in extra functionality; SAS/ACCESS, SAS VA, and SAS Viya are all examples. However, SAS's core strength remains analytics, so other applications tend to be superior at these other functions. This has created challenges for legacy SAS warehouses, which are now rethinking how they use SAS in their systems. Teams are approaching this challenge with a variety of responses.
Dr. Elizabeth Atkinson shared her team's story of moving from a 100% SAS shop to incorporating R for some functions. She leads a biostatistics service at the Mayo Clinic, a famous specialty clinic in the US, which has been a SAS shop since 1974, when punch cards were still being used, and now has a staff of 300 across three locations. The service supports data storage and analysis for studies both large and small.
In 2014, Mayo went to negotiate its SAS license and found that the price had increased significantly. SAS has always been a distinctive product with a high price: according to a Computerworld article, in 1995 a full SAS application development package, bundled for 10 users, cost $1,575 per seat, which is expensive even by today's standards. In 2014, however, the increase in cost was felt to be unsustainable, and the Mayo team started looking at other options.
They wanted to decrease their dependence on SAS by moving some functions to R, while also improving customer service and satisfaction. They faced the following challenges:
SAS infrastructure was entrenched: All training was based on SAS, SAS was integrated into every workflow, and automation relied on SAS macros. Many users trusted only SAS and were skeptical of numbers coming out of R, and SAS users relied on their personal code repositories.
R infrastructure was dynamic: Unlike SAS, R releases new versions often. R innovates quickly, so it is hard to maintain a stable R environment. R packages, add-on libraries that extend base R, were also upgraded regularly, leading to code that would break without warning and confuse users.
Time constraints: Reworking SAS functions to be done in R required substantial effort to deconstruct the SAS code and reconstruct it in R, and both leaders and users had limited time.
Different learning styles and levels of knowledge: SAS users had spent years learning data steps, and R's approach to data management is completely different. It was hard for SAS users to learn R, and for R users to learn SAS.
R support needed: SAS provides customer support, but no equivalent is available for open source software like R, so the organization needed to build its own R support desk. Compared to SAS documentation, R's documentation is also less standardized and comprehensive.
To integrate R into their shop, they took the following steps:
Committed funding: Divisional funding was committed to the project.
Identified R champions: This was a group of R users with expertise in R and SAS.
Set up an R server: Having an R server available increased enthusiasm and interest in R.
Rebuilt popular local SAS macros in R: The most heavily used local macros, many of them for reporting, were deconstructed and rebuilt in R. The team took the opportunity to improve the reports while rebuilding these macros.
Developed integrated SAS and R training: Because they are now a combined shop, their training shows how to do the same tasks in SAS and R. They also hold events demonstrating R and providing online examples.
Set up an R helpdesk: This provides on-call, in-house R support. They maintain a distribution list and send out R tips.
Even after R was offered as an alternative, many users chose to stay with SAS. The reasons the shop could not completely convert from SAS to R include the following:
Time and cost constraints: It was not possible to move all the small projects already in SAS over to R.
Data retrieval and ETL: R cannot handle big data the way SAS can. The SAS data step provides fine-grained procedural control over row-by-row data processing, and this has no direct equivalent in R (see the sketch after this list).
Analysis limitations: Certain tasks are much clumsier in R than in SAS. At the Mayo Clinic, they found that mixed effect models were much more challenging in R than in SAS.
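As a small illustration of that procedural control, the following data step sketch (with hypothetical dataset and variable names) computes, for each patient, the days elapsed since their previous visit by carrying state across rows:

proc sort data=work.visits;
    by patient_id visit_date;
run;

data work.visit_gaps;
    set work.visits;
    by patient_id;
    retain last_date;                       /* state carried across rows */
    if first.patient_id then last_date = .; /* reset for each new patient */
    days_since_last = visit_date - last_date;
    last_date = visit_date;
    drop last_date;
run;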
One of the overall benefits of this effort was that it opened a larger conversation about what skills the division's analysts will need in the future. These considerations run parallel to decisions about which SAS and non-SAS components the data system will use in the near future, what roles they will play, how they will be supported, and how they will work together to improve the user experience.
Connecting to Snowflake
As data gets bigger and bigger, new solutions have been developed for storing data in the cloud. Microsoft Azure and Amazon Web Services (AWS) are cloud platforms that help move business operations to the cloud. Snowflake (https://www.snowflake.com/) is a relatively new cloud data platform that runs on Microsoft Azure and AWS and may run on other cloud services in the future.
Snowflake enables a programmer to create a virtual data warehouse at little cost, thus solving the data storage problem. However, data still needs to be accessed to be analyzed, so SAS upgraded its SAS/ACCESS component to connect directly to Snowflake.
SAS documentation about connecting to Snowflake indicates that Snowflake uses SQL as its query language. Both PROC SQL queries and many regular SAS functions can be passed down to Snowflake for processing, although there are cases where SAS and Snowflake function names conflict. Furthermore, careful setting of options and code tuning are needed to improve I/O between SAS and Snowflake.
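Here is a hedged sketch of the two basic connection styles, assuming SAS/ACCESS Interface to Snowflake is licensed; the engine name, connection options, and account details shown are illustrative and should be checked against current SAS documentation:

/* LIBNAME connection: Snowflake tables appear as SAS datasets */
libname snowlib snow
    server="myaccount.snowflakecomputing.com"
    database=sales schema=public warehouse=compute_wh
    user=myuser password="&snow_pw";   /* credentials are placeholders */

/* Implicit pass-through: SAS translates the query into Snowflake SQL */
proc sql;
    select region, count(*) as n_orders
    from snowlib.orders
    group by region;
quit;

/* Explicit pass-through: the inner query runs entirely in Snowflake */
proc sql;
    connect using snowlib;
    select * from connection to snowlib
        (select region, sum(amount) as total from orders group by region);
    disconnect from snowlib;
quit;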
Although products like Snowflake can solve the big data storage problem, the issue for SAS will always be I/O. Using the newest, most appropriate technology along with the most efficient coding approaches will always be the best strategy for warehousing big data in SAS.