Working with real data
We will now work with the public IMDb dataset, which is more complex than the previous one and is split across several tables.
The following code downloads five tables from the IMDb dataset and saves them to the ./data/imdb/ path (also available at https://github.com/PacktPublishing/Bigdata-on-Kubernetes/blob/main/Chapter05/get_imdb_data.py).
First, we need to download the data locally:
get_imdb_data.py
import os
import requests

urls_dict = {
    "names.tsv.gz": "https://datasets.imdbws.com/name.basics.tsv.gz",
    "basics.tsv.gz": "https://datasets.imdbws.com/title.basics.tsv.gz",
    "crew.tsv.gz": "https://datasets.imdbws.com/title.crew.tsv.gz",
    "principals.tsv.gz": "https://datasets.imdbws.com/title.principals.tsv.gz",
    "ratings.tsv.gz": "https://datasets.imdbws...
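The download step described above, where each URL is fetched and written under its dictionary key inside ./data/imdb/, can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the script from the repository: the function name download_files, the 60-second timeout, and the use of the standard library's urllib.request in place of requests are all choices made here for illustration.

```python
import os
import shutil
import urllib.request


def download_files(urls_dict, target_dir="./data/imdb/"):
    """Fetch every URL in urls_dict and save it in target_dir under its key.

    Hypothetical helper: the name, the timeout, and the use of urllib
    (instead of requests) are assumptions made for this sketch.
    """
    # Create the target folder if it does not exist yet.
    os.makedirs(target_dir, exist_ok=True)

    saved_paths = []
    for filename, url in urls_dict.items():
        destination = os.path.join(target_dir, filename)
        # Copy the response to disk in chunks so that large .tsv.gz files
        # are never held in memory as a whole.
        with urllib.request.urlopen(url, timeout=60) as response, \
                open(destination, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
        saved_paths.append(destination)
    return saved_paths
```

Calling download_files(urls_dict) with a dictionary that maps file names to URLs, like the one in the listing, would write each table to ./data/imdb/ and return the list of saved paths. Streaming the response to disk is a deliberate choice here, since some of the IMDb tables are large even when compressed.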