Creating big data pipelines using Azure Data Lake and Azure Data Factory
Running big data pipelines is a core feature of Azure Data Factory. Pipelines let you ingest and preprocess data at any scale, and you can build and test ELT/ETL processes directly from the web UI. Building such pipelines is one of the core tasks of the data engineer in your company.
Getting ready
Let's load and preprocess the MovieLens dataset (F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872). It contains ratings and free-text tagging activity from a movie recommendation service.
The MovieLens dataset comes in several sizes, all of which share the same structure. The smallest one has 100,000 ratings, 600 users, and 9,000 movies. The largest one can reach 1.2 billion ratings, 2.2 million users, and 855,000 items.
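Because every size shares the same structure, one small script can check the scale of whichever variant you download before you build a pipeline around it. The following is a minimal sketch, not part of the recipe itself: it assumes a ratings file in CSV form with `userId` and `movieId` columns (the layout used by the small MovieLens release), and the function name `summarize_ratings` and the inline sample rows are hypothetical, made up purely for illustration.

```python
import csv
import io


def summarize_ratings(csv_file):
    """Count ratings, distinct users, and distinct movies in a ratings file.

    Assumes a header row containing 'userId' and 'movieId' columns.
    """
    reader = csv.DictReader(csv_file)
    users, movies, n_ratings = set(), set(), 0
    for row in reader:
        users.add(row["userId"])
        movies.add(row["movieId"])
        n_ratings += 1
    return {"ratings": n_ratings, "users": len(users), "movies": len(movies)}


# Tiny inline sample: made-up rows for illustration, not real MovieLens data.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,964982703\n"
    "1,20,3.5,964981247\n"
    "2,10,5.0,964982224\n"
)

summary = summarize_ratings(sample)
print(summary)  # {'ratings': 3, 'users': 2, 'movies': 2}
```

To run it against a real download, pass an open file handle instead of the sample, for example `open("ratings.csv", newline="")`, where the path is whatever location you extracted the archive to. The function streams rows one at a time, so it works for the smaller variants on a laptop; the largest variants are better handled inside the pipeline itself.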
MovieLens is distributed as a set...