Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Hands-On Data Warehousing with Azure Data Factory

You're reading from   Hands-On Data Warehousing with Azure Data Factory ETL techniques to load and transform data from various sources, both on-premises and on cloud

Arrow left icon
Product type Paperback
Published in May 2018
Publisher Packt
ISBN-13 9781789137620
Length 284 pages
Edition 1st Edition
Languages
Tools
Concepts
Arrow right icon
Authors (3):
Arrow left icon
Christian Cote Christian Cote
Author Profile Icon Christian Cote
Christian Cote
Giuseppe Ciaburro Giuseppe Ciaburro
Author Profile Icon Giuseppe Ciaburro
Giuseppe Ciaburro
Michelle Gutzait Michelle Gutzait
Author Profile Icon Michelle Gutzait
Michelle Gutzait
Arrow right icon
View More author details
Toc

What's new in V2.0?

With V2, ADF has now been overhauled. This section will describe the main novelties of ADF V2.

Integration runtime

This is one of the main features of version 2.0. It represents the compute infrastructure and performs data integration across networks. Here are some enhancements it can provide:

  • Data movements between public and private networks either on-premises or using a virtual private network (VPN). They were known as data management gateways in V1 and Power BI.
    • Public: They are used by Azure and other cloud connections. There's a default integration runtime that comes with ADF.
    • Private: They are used to connect private computer resources such as SQL Server on-premises to ADF. We need to install a service on one Windows machine in the private network. That machine can connect to the enterprise resources and send the data to ADF via the service installed on it.
  • SSIS package execution—managing SSIS packages in Azure. This is one of the main topics of this book. Chapter 3, SSIS Lift and Shift, is completely dedicated to this feature.

Linked services

Linked services now have a connectVia property to be able to use the Integration Runtimes that we mentioned in this chapter before. They can now connect to a lot more of data stores than it was possible before.

Datasets

Datasets are the same as they were in V1, but we don't need to define any availability schedules in them now. This means that they have more flexibility in their usage. In conjunction with Linked Services, the datasets have now access to a whole lot of new data stores: sources and destinations.

Pipelines

Pipelines have been modified quite a lot in V2. They don't have any windows of execution, with start times and end times. Pipelines can now be executed using the following technique:

  • On demand via .NET, PowerShell, REST API, or Python
  • Trigger:
    • Schedule trigger: This trigger uses a wall clock kind of schedule, for example, a pipeline can be executed on a weekly basis every Tuesday and Thursday at 10:00 AM
    • Tumbling window trigger: This works on a periodic interval, for example, every 15 minutes between two specific dates

Activities

Pipelines now have the following control activities:

  • Execute pipeline: Calls another pipeline in the same factory.
  • For each activity: Executes activities in a loop similar to any for each loop in structured programming languages.
  • Web activity: Used to call custom REST endpoints.
  • Lookup activity: Gets a record from any external data. The output can later be used by subsequent activities.
  • Get metadata activity: Gets the metadata of activities in ADF.
  • Until activity: Loops the execution of activity sets until the condition is evaluated to true.
  • If condition activity: This is like any if statement in standard programming languages.
  • Wait activity: Pauses the pipeline for a time before resuming other activities.

Parameters

Parameters can be used in pipelines. They are read-only values that are passed when the pipeline is executed manually or when they are scheduled to be executed.

Expressions

In V1, functions could be used to filter out dataset queries. In V2, expressions can be used anywhere in JSON-defined factory objects.

Controlling the flow of activities

Calling activities is more flexible in V2 than in the previous one (V1). As stated in the Pipeline section, there are many new activities, such as for each, if, until, lookup, and so on.

SSIS package deployment in Azure

There is now a new SSI runtime that completely manages clusters of Azure VMs dedicated to running SSIS in the cloud. Packages are deployed in the same manner that they are deployed on-premises when using the Azure SSIS integration runtime. SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) can be used to deploy SSIS packages.

Spark cluster data store

There are many more data stores available now.

Spark clusters are now available in V2. Since Spark is very performant and now integrates more functionalities, it has become an almost essential player in the big data world. In the previous version of ADF, Spark clusters were available via MapReduce custom activities. In this version, Spark is now a first-class citizen, so there will be no more headaches when it comes to integrating it in our data flow.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at £16.99/month. Cancel anytime