Azure Synapse is a unified analytics platform that brings together several cloud services to help you manage your data science projects from data ingestion all the way to serving data to end users. Figure 1.3 illustrates the service architecture for Azure Synapse Analytics:
Figure 1.3 – Azure Synapse Analytics service architecture
All these capabilities are managed by the umbrella Azure Synapse service in the form of what is called an Azure Synapse workspace (the shaded area on top in Figure 1.3). When you provision a new Azure Synapse workspace, you are offered a single point of entry and single point of management for all services included in Azure Synapse. You don’t need to go to different places to create data ingestion pipelines, explore data using Apache Spark, or author Power BI reports—instead, all the work is done through one development and management environment called Azure Synapse Studio, reachable through https://web.azuresynapse.net.
Before Azure Synapse, E2E advanced analytics and data science projects were built by putting together several different services that could be hosted on the cloud or on-premises. The promise of Azure Synapse is to offer one platform where all advanced analytics tasks can be performed.
Let’s look in detail at the capabilities offered by Azure Synapse.
Data integration
Leveraging the Azure Data Factory (ADF) code base, Azure Synapse pipelines offer a code-free experience to build data integration jobs that enable data ingestion from data sources in the cloud, on-premises, Software-as-a-Service (SaaS) sources, data streaming, and more. It includes native connectors to more than 95 data sources.
With Azure Synapse pipelines, data engineers can build E2E workflows for data moving and processing. Azure Synapse supports nested activities, linked services, and execution triggers, and offers common data transformation and data wrangling (transforming data, or mapping data to other columns) activities. In Azure Synapse, you can even add a notebook activity that contains complex logic to process data using an Azure Synapse notebook, a code-rich experience, or use flowcharts with a rich user-friendly interface that implements complex pipelines using a code-free experience. This is illustrated in Figure 1.4.
Figure 1.4 – Code-free pipeline and data flow authoring experience
Note
Not all features offered in ADF are available in Azure Synapse Analytics. To explore the differences between ADF and Azure Synapse Analytics, visit https://docs.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences.
You should avoid loading data into Azure Synapse SQL using traditional methods that are normal practice in Microsoft SQL Server. For example, issuing batch INSERT
, UPDATE
, and DELETE
statements to load or update data is not an optimal process in Azure Synapse SQL, because the massively parallel processing (MPP) engine in Azure Synapse SQL was not designed for singleton operations. To help load data efficiently, besides using pipelines, Azure Synapse offers a convenient COPY
Transact-SQL (T-SQL) command that helps move data from Azure Data Lake Storage Gen2 (ADLS Gen2) to Synapse SQL tables in an optimal fashion.
Enterprise data warehousing
An enterprise data warehouse—or, simply put, a data warehouse—is a centralized system that integrates data from disparate data sources to enable reporting and analytics in organizations. It stores the data efficiently and is configured to serve data through reporting or user queries, without inflicting overhead on transactional systems.
Azure Synapse offers a highly scalable data warehousing solution through Synapse SQL pools—a distributed query engine to process SQL queries at petabyte (PB) volume. The SQL analytical engine in Azure Synapse is an evolution of a product previously called Azure SQL Data Warehouse. An Azure Synapse workspace can have several Synapse SQL pools, and the user can run queries using any of the compute pools available. Figure 1.5 illustrates the ability to pick the desired compute pool for a given query. Pools that have a gray icon without a checkmark are either stopped or not available for use:
Figure 1.5 – Picking a SQL pool in the query editor
Synapse SQL is offered in two flavors, as detailed here:
- Dedicated SQL pool: This is a pre-provisioned compute cluster that offers predictable performance and cost. When a dedicated SQL pool is provisioned, the cluster capacity is reserved and kept online to respond to user queries, unless you choose to pause the SQL pool—a strategy to save money when the cluster is not in use. Dedicated SQL pools run an MPP engine to distribute data across nodes on a cluster and retrieve data. A central control node receives user queries and distributes them across the cluster nodes, resolving the user queries in parallel. When you provision a new dedicated SQL pool, you specify its desired cluster size based on your service-level objective. Dedicated SQL pool sizes range from one cluster node to up to 60 nodes processing user queries.
- Serverless SQL pools: A query engine that is always available to use when needed, mostly applicable for unplanned use, or bursty workloads. You do not need to pre-provision serverless SQL pools. You are charged based on the data volume processed in queries. Every Azure Synapse workspace includes a serverless SQL pool. The distributed query processing engine that runs serverless SQL pools is more robust and more complex than the engine that runs dedicated SQL pools and assigns resources to the cluster as needed. You do not control the size of your compute pool or how many resources are allocated to user queries.
Figure 1.6 illustrates the service architecture for dedicated and serverless SQL pools:
Figure 1.6 – Service architecture for dedicated and serverless SQL pools (adapted from https://docs.microsoft.com/azure/synapse-analytics/sql/overview-architecture)
To learn in depth how serverless pools work in Azure Synapse, I recommend the POLARIS: The Distributed SQL Engine in Azure Synapse white paper, which can be found at https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf.
Exploration on the data lake
Through native integration with ADLS Gen2, serverless SQL pools allow you to query data directly from files residing on Azure Storage. You can store data in a variety of file formats, such as Parquet or comma-separated files (CSV), and query it using the familiar T-SQL language.
Exploration on the data lake offers a quick alternative for users who want to explore and experiment with the existing data before it gets loaded into Synapse SQL tables for high-performance querying. In Figure 1.7, you can see a T-SQL query that uses the OPENROWSET
operator to reference data from a Parquet file stored on ADLS:
Figure 1.7 – Running T-SQL queries to query data stored on the data lake
This is a powerful capability for users who want to explore data before they decide how to store it to enable the processing of queries at scale. To learn more about exploring data on the data lake using serverless SQL pools, visit https://learn.microsoft.com/azure/synapse-analytics/get-started-analyze-sql-on-demand.
Apache Spark
Apache Spark is an open source, highly scalable big data processing engine. It achieves high performance by supporting in-memory data processing and automatically distributing jobs across nodes in a cluster of servers. Apache Spark is widely popular in the data science community not only for its performance benefits (achieved due to its support for in-memory processing and scalability, as described), but also due to the fact that it has built-in support for popular languages such as Python, R, and Scala. Some Apache Spark distributors add additional support for third-party languages as well, such as SQL or C#.
Azure Synapse includes a fully managed Spark service that can be used for data exploration, data engineering, data preparation, and creating ML models and applications. You can choose from a range of programming languages for your data exploration needs, including C#, Scala, R, PySpark, and Spark SQL. The Apache Spark service offered by Azure Synapse is automatically provisioned based on your workload size, so you do not need to worry about managing the actual instances in the cluster.
Apache Spark in Azure Synapse comes with a rich set of libraries, including some of the most used by data engineers and data scientists, such as NumPy, Pandas, Scikit-learn, Matplotlib, and many others. You can also install any packages that are compatible with the Spark distribution used in your Apache Spark pool.
Note
To see a full list of libraries that are pre-installed in Apache Spark pools in Azure Synapse, visit https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-version-support and select your desired Spark runtime version.
To explore data using Apache Spark, Azure Synapse offers a notebook experience that allows users to use markdown cells and code cells, as with other popular notebook experiences that are available on the market. This experience is illustrated in Figure 1.8.
Figure 1.8 – Synapse notebooks authoring experience (subscription ID obfuscated)
Notebooks are saved in your Synapse workspace (or in your source control system if you configured Git integration) just like other workspace artifacts, so anyone connecting to the same workspace will be able to collaborate on your notebooks.
Log and telemetry analytics
Azure Synapse includes native integration with Azure Data Explorer to bring log and telemetry data to E2E advanced analytics and data science projects. You can pre-provision Data Explorer pools in Azure Synapse and have reserved compute capacity for your analytical needs.
Through Azure Synapse Data Explorer (in preview at the time of writing), Data Explorer pools in Azure Synapse enable interesting new scenarios for analysts, data scientists, and data engineers. For example, they offer integration with notebooks in Azure Synapse, allowing you to explore data using your language of choice, in a fully collaborative environment. As you can see in Figure 1.9, by right-clicking a table on a Data Explorer pool database and selecting New notebook, Azure Synapse Studio can automatically generate a notebook with code to load that table to a Spark data frame:
Figure 1.9 – Notebooks reading data from Data Explorer pools
The notebook experience is the same as with Apache Spark reading data from the data lake, except in this case, Azure Synapse generates code for you to load data from Data Explorer into a Spark DataFrame (a table-like data structure that allows you to work with data).
Integrated business intelligence
Having all these data capabilities at your fingertips, it would make sense to be able to richly visualize data. Azure Synapse offers integration with Microsoft Power BI, allowing you to add Power BI datasets, edit reports directly in Azure Synapse Studio, and automatically see those changes reflected in the Power BI workspace that hosts that report.
Note
Azure Synapse does not provision new Power BI workspaces for you. Instead, you add your existing Power BI workspaces to the Azure Synapse workspace using a linked service connection. A separate license to Power BI may be required.
Thanks to the Azure Active Directory (AAD) integration, connecting to the Power BI service is a simple process. Synapse Studio uses your credentials to look for Power BI workspaces in your AAD tenant and allows you to select the desired one from a combobox, as illustrated in Figure 1.10.
Figure 1.10 – Adding a Power BI workspace as a linked service in Azure Synapse (tenant and workspace globally unique IDs (GUIDs) obfuscated)
This is a powerful and quite useful capability. Not only does it allow you to be more productive and avoid switching between apps to do different work, but it also allows analysts to see the shape and form of their data while they are still exploring it and experimenting with transformations.
Data governance
With the growth of a data culture in corporations, an explosion happened in the number of data marts, data sources, and amount of data that can be used for analytical needs. In Azure Synapse alone, projects normally consume data residing on several data sources. This data can then be copied to a data lake on Azure, transformed, and eventually copied to SQL tables. That is a lot of data and metadata to maintain! How do you get a global view of all the data assets in your organization, and how do you govern this data and classify sensitive data so that it is not misused?
Microsoft Purview is Microsoft’s data governance solution for enterprises. It connects to data on-premises, on the cloud, and even to SaaS sources, giving companies a unified view of their data estate. It has advanced data governance features such as data catalogs, data classification, data lineage, data sharing, and more. You can learn more about Microsoft Purview at https://azure.microsoft.com/en-us/services/purview/.
Note
Just as with the Power BI integration, Microsoft Purview requires you to have a Purview account, with the appropriate rights, configured separately from your Synapse workspace. To learn how to connect a Synapse workspace to a Purview account, visit https://docs.microsoft.com/en-us/azure/synapse-analytics/catalog-and-governance/quickstart-connect-azure-purview.
Configuring your integration with Purview is a simple process: Azure Synapse Studio allows you to pick the Purview account from a list of subscriptions, or to provide the details of your Purview account manually. Once you have configured the integration, you can manage it from Synapse Studio, as illustrated in Figure 1.11:
Figure 1.11 – Managing the Microsoft Purview integration
After configuring the Purview integration on a Synapse workspace, you benefit from the following features:
- Discovery: Search for any data assets cataloged by Purview by using the global search box.
- Data lineage: Understand how the data traveled through the organization and how it was transformed before it landed in the current shape and location. It also allows you to see the raw form of data before it was transformed.
- Connect to new data: Having discovered new data assets, instantly connect to them using linked services. From here, you can leverage any service on Azure Synapse to work with the data, from experimentation on Apache Spark to moving data using pipelines.
- Push lineage back to Microsoft Purview: After you apply transformations to data and create new datasets on Azure Synapse, you can push metadata that describes your new datasets to Microsoft Purview’s central repository for discovery from future users.
While Purview integration is outside of the scope of this book, make sure you understand how to make governance a first-class citizen in your projects—it is a critical aspect of analytical projects and is quickly becoming a hard prerequisite for enterprises.
Broad support for ML
ML is a first-class citizen in Azure Synapse. Models can be trained on Apache Spark, as discussed previously, using a variety of algorithms and libraries, such as Spark MLlib or Scikit-learn. Another option is to connect to the Azure Machine Learning service from within an Azure Synapse notebook and train models using Azure Machine Learning’s compute engine. Because Azure Machine Learning offers automated ML (AutoML), you do not even need to know the best algorithm and features to achieve your objective: AutoML tests a series of parameters and algorithms with the given data and offers you the best algorithm based on a series of results.
Besides model training, Azure Synapse can consume models that were previously trained to run batch scoring with data residing on Azure Synapse. The SQL analytical engine in Azure Synapse can import Open Neural Network Exchange (ONNX) models (which can be generated from Azure Machine Learning) into its model registry and allow you to use the PREDICT
function in T-SQL to score columns in real time, as part of regular SQL queries. This is quite powerful!
For example, given a dbo.Trips
SQL table that contains New York taxi trip data, the query shown in Figure 1.12 uses the model scored with Id = 60
to predict taxi fares, given other columns such as passenger count, trip distance, and the date and time of the trip:
Figure 1.12 – Using PREDICT to score columns in a SQL query
Note that in this example, the model is stored in Azure Synapse SQL’s model registry and scoring is done locally, which produces a quick query response time with negligible impact on the query plan. No external services are called.
For consumption of more complex models, or in scenarios where an ML application needs more complex logic that can be better achieved by using a language other than T-SQL, Apache Spark is a perfect alternative. By leveraging Spark pools in Azure Synapse, the regular Notebook experience can also be used for batch scoring.
Security and Managed Virtual Network
The fact that Azure Synapse offers all these cloud services on a single platform may give the impression that it is hard to protect your data. In reality, Azure Synapse workspaces can be created so that they are fully isolated from other workspaces, at the network layer. This is achieved by using Managed Virtual Network (or Managed VNet) in Azure Synapse.
Managed VNet manages the network isolation, and you do not need to configure network security group (NSG) rules to allow traffic to and from your virtual network. When associated with a managed private endpoint, workspaces configured on a Managed VNet are fully protected against data exfiltration.
Besides network isolation, Azure Synapse offers an industry-leading set of security features for data protection, role-based access control (RBAC), different authentication mechanisms, threat protection, and more.
Note
For a detailed view of the security capabilities across all Azure Synapse services, refer to the Azure Synapse Analytics security white paper at https://docs.microsoft.com/en-us/azure/synapse-analytics/guidance/security-white-paper-introduction.
Management interface
As you can tell by now, Azure Synapse offers several different services that allow you to have a unified platform to build your advanced analytics and data science projects—from data ingestion, all the way to serving data to end users using Power BI. To manage all these services and to build your projects, the primary (and almost only) tool we will use is Azure Synapse Studio.
If you have used ADF before, you will find the user experience (UX) in Azure Synapse Studio a familiar one. It organizes your work and resources you are using by implementing the concept of hubs (seen on the left-hand side of the user interface (UI)), as follows:
Figure 1.13 – The home page of Azure Synapse Studio
The hubs in Azure Synapse Studio are set out here:
- Home: This is the landing page of Azure Synapse Studio. It offers quick links to recently used resources, pinned resources, links to the Knowledge center, and other links to useful content.
- Data: The Data hub lets you navigate through any SQL, Apache Spark, or Data Explorer pools that you provisioned for your environment, as well as any linked data sources, such as ADLS. It provides a tree view where you can navigate through your compute pools and databases, and even glance at data by right-clicking tables and generating preview scripts.
- Develop: Go to the Develop hub to find all the SQL/Kusto Query Language (KQL) scripts you are working on, as well as Synapse notebooks, Power BI reports, and even data flows. All the work you do and save (the actual word used in Synapse Studio is publish) is stored with the Synapse service, in the context of your workspace. However, you can configure source control and collaboration using Git, and save your work in a Git repository, on Azure DevOps, or on GitHub.
- Integrate: This is where you manage all the data integration pipelines created in your Synapse workspace. Azure Synapse Studio provides a code-free experience to build pipelines in a workflow-like environment with a rich UX that mirrors the same experience on ADF.
- Monitor: The Monitor hub is your single pane of glass (SPOG) to monitor the status of your compute pools and pipeline runs, Apache Spark jobs, and more. You can also browse through a history of recent activities and verify their execution.
- Manage: Finally, the Manage hub is where you configure connections to linked services and integration runtimes, scale your SQL, Data Explorer, or Apache Spark pools, and even create pipeline triggers.
Other tools can be used to connect to Azure Synapse services to perform queries and some basic management tasks. Tools such as SQL Server Management Studio (SSMS) and ADS can be used to run queries on SQL pools. For overall service management and provisioning, while some tasks can be accomplished via T-SQL statements, Azure Synapse can be managed using PowerShell, as well as through the Azure portal.
As you can see, Azure Synapse brings together several different services from data ingestion, through data processing (using your choice of analytical engine) and data visualization to deliver an E2E approach to analytics. It is an industry-defining service offering that brings pieces of the analytical puzzle together for a 360-degree view of your data estate.
This book will focus on Data Explorer in Synapse and how it integrates with these services. So, let’s look into it in detail.