SampleData – a simple API for loading data
Loading data into a Notebook is one of the most repetitive tasks a data scientist can do, yet depending on the framework or data source being used, writing the code can be difficult and time-consuming.
Let's take a concrete example: loading a CSV file from an open data site (say https://data.cityofnewyork.us) into both a pandas DataFrame and an Apache Spark DataFrame.
Note: Going forward, all the code is assumed to run in a Jupyter Notebook.
For pandas, the code is pretty straightforward because the library provides an API that loads directly from a URL:
import pandas
data_url = "https://data.cityofnewyork.us/api/views/e98g-f8hy/rows.csv?accessType=DOWNLOAD"
building_df = pandas.read_csv(data_url)
building_df
The last statement, building_df on a line by itself, displays the DataFrame's contents in the output cell. No print call is needed because Jupyter interprets a bare variable reference in the last statement of a cell as a directive to print it...
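To make this behavior less magical, here is a minimal sketch of how a notebook-like front end could decide what to display. It is an illustration only, not Jupyter's or IPython's actual implementation: the function name run_cell and the namespace argument are hypothetical, and the real kernel does considerably more (rich display formats, output history, and so on). The idea is simply to parse the cell, execute everything except a trailing bare expression, then evaluate that expression and hand its value back for display.

```python
import ast


def run_cell(source, namespace=None):
    """Run the code in `source` and return the value of a trailing bare
    expression, or None if the cell does not end with one.

    Hypothetical helper for illustration; not part of Jupyter or IPython.
    """
    if namespace is None:
        namespace = {}

    tree = ast.parse(source, mode="exec")
    last_value = None

    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # The cell ends with a bare expression such as `building_df`.
        # Detach it so it can be evaluated separately from the rest.
        last_expr = ast.Expression(body=tree.body.pop().value)
        exec(compile(tree, "<cell>", "exec"), namespace)
        last_value = eval(compile(last_expr, "<cell>", "eval"), namespace)
    else:
        # The cell ends with a statement (assignment, import, ...),
        # so there is nothing to display.
        exec(compile(tree, "<cell>", "exec"), namespace)

    return last_value


# A cell that ends with a bare variable yields a value to display.
shared_namespace = {}
result = run_cell("total = 1 + 2\ntotal", shared_namespace)
if result is not None:
    print(repr(result))

# A cell that ends with an assignment yields nothing.
assert run_cell("other = 10", shared_namespace) is None
```

This also explains why only the last line matters in this sketch: a bare expression in the middle of the cell is executed as part of the body, and its value is discarded.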