Using the Datasets API in Python
Before you ask: yes, the Datasets API is available in Python too! Let’s do a quick rundown of the same features we just covered, this time using the PyArrow Python module instead of C++. Because most data scientists work in Python, it makes sense to show how these APIs integrate with existing workflows and utilities. Python code is also more concise than the equivalent C++, so we can move through everything quickly in the following sections.
Creating our sample dataset
We can start by creating a sample dataset similar to the one we used for the C++ examples, with three columns, but this time in Python:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pathlib
>>> import numpy as np
>>> import os
>>> base = pathlib.Path(os.getcwd())
>>> (base / "parquet_dataset").mkdir...