You're reading from Cloud Analytics with Microsoft Azure Transform your business with the power of analytics in Azure

Product type Paperback

Published in Jan 2021

Publisher Packt

ISBN-13 9781800202436

Length 184 pages

Edition 2nd Edition

Languages

SQL

Tools

Azure

Concepts

Cloud Computing

Authors (3):

Michael John Peña

Has Altaiar

Jack Lee

View More author details

Table of Contents (7) Chapters

Preface

About Cloud Analytics with Microsoft Azure, Second Edition

1. Introducing analytics on Azure

2. Introducing the Azure Synapse Analytics workspace and Synapse Studio FREE CHAPTER

3. Processing and visualizing data

4. Business use cases

5. Conclusion

Index

Why do you need a modern data warehouse?

A data warehouse is a centralized repository that aggregates different (often disparate) data sources. The main difference between a data warehouse and a database is that data warehouses are meant for Online Analytical Processing (OLAP) and databases, on the other hand, are intended for Online Transaction Processing (OLTP). OLAP means that data warehouses are primarily used to generate analytics, business intelligence, and even machine learning models. OLTP means that databases are primarily used for transactions. These transactions are the day-to-day operations of applications, which concurrently read and write data to databases.

A data warehouse is essential if you want to analyze your big data as it also contains historical data (often called cold data). Most data that's stored has legacy information, such as data stored 5 years ago, 10 years ago, or even 15 years ago. You probably don't want the same database instance that your end users are querying against to also contain that historical data, as it might affect your performance when at scale.

Here are some of the advantages of having a modern data warehouse:

Supports any data source
Highly scalable and available
Provides insights from analytical dashboards in real-time
Supports a machine learning environment

Microsoft offers the following tools and services that collectively create a modern data warehouse:

Figure 1.4: Modern data warehouse using Azure Synapse Analytics

There are a lot of emerging patterns and architectures for data warehousing, but the most popular ones are those that support the separation of duties and responsibilities in different phases of the data pipeline (more on this in the Creating a data pipeline section).

In order to understand what it means for a data warehouse to be modern, you first need to understand how you create and manage a traditional one. It boils down to two major concepts:

Compute: This refers to the ability to process the data and make sense out of it. It can be in the form of a database query to make the results accessible to another interface, such as web applications.
Storage: This refers to the ability to keep data in order for it to be accessible at any time in the future.

A modern data warehouse separates compute and storage in cost-effective ways. Unlike the case traditionally with SQL Server and SQL Server Integration Services (SSIS), the pricing model involves both the storage capacity and computing power to analyze data. Azure is the first cloud provider to offer a data warehouse that separates compute and storage.

Another change in pattern is that the traditional Extract-Transform-Load (ETL) model of data warehousing has now changed to Extract-Load-Transform (ELT). In the traditional ETL model, analysts are accustomed to waiting for the data to be transformed first, since they don't have direct access to all data sources. In a modern data warehouse, massive amounts of data can be stored in either a data lake or data warehouse, and can be transformed anytime by analysts without the need to wait for data engineers or database admins to serve the data.

Of course, there are more factors to consider in order to modernize your data warehouse, such as extensibility, disaster recovery, and availability. However, this section will focus on compute for the time being.

Bringing your data together

In the past, databases were often the only source of data for your applications. But nowadays, you have hundreds and thousands of different data sources. The data coming from these different sources is of different data types—some structured, some unstructured, some semi-structured.

Structured data: The word "structured" suggests that there is a pattern that can be easily interpreted. This usually comes with a predefined set of models and a schema. A relational database management system (RDBMS) such as Microsoft SQL Server is a common example of a data storage solution that is structured. This is because it comes with a database schema and table columns that define the data that you are storing.

Here are some examples of structured data types:

Customer names
Addresses
Geolocation
Date and time
Mobile and phone numbers
Credit card numbers
Product names and Stock Keeping Units (SKUs)
General transaction information such as "From" and "To" with time stamps and amount values

A good example of structured data is the information provided by the users when signing up to an application for the first time. They are presented with a form that needs to be filled in. Once that person clicks the submit button, it sends the data to a database and inserts it into a user table with predefined columns: names, addresses, and other details. This will then allow the user to log into the application since the system can now look up the existing record for the registered user in the database.

From there, a user can access the application and perform transactions, such as transferring money and assets. In time, users will generate a series of transactions that will eventually make your database larger. Your database schema will also expand to support different business requirements.

Once you have enough data, you can perform data exploration. This is where you start looking for patterns in data. You may identify fraudulent transactions and test hypotheses by analyzing large and repeated transaction amounts from the same user.

Your data exploration is limited because you can only base it on a dataset that is structured and with a semantic form. What if you also want to consider other data sources that are unstructured, such as free-form text? An example is a transaction description, which may state the nature or the recipient of the transaction. You don't want to manually read each transaction description and insert it in the right column of a database table. You probably want to extract only the relevant information and transform it into a structured format. This is where unstructured data comes in.

Unstructured data: This data type, more or less, is the rest—that is, everything that isn't structured data. This is mainly because you are not limited to any storage and data type.

Unstructured data types usually don't have a predefined data model that can fit directly into a database. Unstructured data can be text-heavy and is usually read per line or is space-separated.

Here are some examples of unstructured data sources:

Image files
Videos
Email messages and documents
Log files
IoT devices and sensors
NoSQL databases such as MongoDB
Social media and Microsoft Graph

Image files and videos are classified as unstructured data because of their dynamic nature. Although their metadata is something you can consider as structured (such as title, artist, filename, and so on), the content itself is unstructured. With modern tools and data analytics technology, you can now examine this data and make sense of it. The usual example is face recognition in either images or videos.

Emails, documents, and log files all have metadata, but what you're actually more interested in is the content of those files. Usually, in emails, documents, and log files, data is separated per line and the messages are unstructured. You would want to describe the content without manually reading everything (which could be hundreds or even millions of files). An example is doing sentiment analysis on content to determine whether the prevailing emotion is happy, sad, or angry. For log files, you probably want to separate the error messages, time stamps (dates), and measurements (traces) between messages.

IoT devices and sensors, similarly to log files, are used to capture measurements and errors about a certain item. The main difference is that these devices usually work on a large number of clusters (hundreds to thousands of devices) and continuously stream data. Data generated from these devices is semi-structured or unstructured since it is in JSON or XML format. Modern technologies, such as Azure IoT services, already solve these complexities with services such as Azure IoT Hub, which aggregates all this data from various sensors and continuously exports it to a data source. Sometimes you can classify this data as semi-structured since these traces and logs are things that a system can easily understand.

Social media platforms and Microsoft Graph both provide semi-structured data. It is classified this way because just querying all of Twitter's tweets about a topic is not enough. The results don't really make a lot of sense until you do some analysis of them. The primary focus is to discern patterns and anomalies. For example, you may want to identify trends about news and topics but also want to remove data that is irrelevant, such as tweets coming from fake accounts.

Interestingly, some line-of-business (LOB) applications provide both structured and unstructured data. For example, both Microsoft Dynamics CRM and Salesforce provide structured data that can easily be interpreted and exported to your SQL database tables, such as data for products and their amounts and value. However, they also support unstructured data such as images, videos, and text notes. Note that even though text notes are considered as the string data type, they can still be considered as unstructured data because they are designed to be free text. They don't have a proper format to follow, but they are still worth exploring. A common scenario for unstructured data its use is to understand why sales were not successful.