Getting started with data extraction
We will use open source data for the CSV, Parquet, and API examples, and manually prepare data for the RDBMS and HTML examples, drawing on public safety data from NYC Open Data (available at https://data.cityofnewyork.us).
Within your PyCharm terminal, verify that your pipenv virtual environment has been activated and open the Jupyter notebook associated with Chapter 4. In the first cell, import the pandas module into your notebook, like so:
# Import modules
import pandas as pd
CSV and Excel data files
Not surprisingly, stored data files are commonly used as an input data source for an extract, transform, load (ETL) pipeline. Data files can be sourced from anywhere, from locally stored files on your device to cloud storage filesystems. Even when you work primarily with databases or external APIs, keeping timestamped copies of the data as physical files is an easy safeguard that comes in handy during temporary connection issues.
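To make the idea of timestamped files concrete, here is a minimal sketch of one way to save an extract to a CSV file stamped with the time it was written and to reload the most recent copy later. The sample rows, the column names, the extract_ filename prefix, and the helper functions save_snapshot and load_latest_snapshot are all hypothetical and chosen only for illustration. They are not taken from the NYC Open Data files used in this chapter.

```python
import io
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

# Hypothetical sample rows standing in for a real extract
SAMPLE_CSV = """collision_id,borough,persons_injured
1,BROOKLYN,0
2,QUEENS,2
3,MANHATTAN,1
"""


def save_snapshot(df, directory, prefix="extract"):
    """Write df to a CSV file whose name embeds a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"{prefix}_{stamp}.csv"
    df.to_csv(path, index=False)
    return path


def load_latest_snapshot(directory, prefix="extract"):
    """Read the newest snapshot; the timestamp format sorts chronologically."""
    candidates = sorted(Path(directory).glob(f"{prefix}_*.csv"))
    if not candidates:
        raise FileNotFoundError(f"No snapshots found in {directory}")
    return pd.read_csv(candidates[-1])


# read_csv accepts a path, a URL, or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# A temporary directory keeps the sketch self-contained
snapshot_dir = tempfile.mkdtemp()
snapshot_path = save_snapshot(df, snapshot_dir)

# Later, for example while the original source is unreachable
restored = load_latest_snapshot(snapshot_dir)
print(snapshot_path.name)
print(restored.head())
```

Because the timestamp is written as year, month, day, and then time, sorting the filenames alphabetically also sorts them chronologically, so the last entry is the newest snapshot. In a real pipeline you would point snapshot_dir at a persistent location rather than a temporary directory.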
Download...