Using the DataFrame API
For those who are familiar with Python's pandas package, it might be interesting to know that Apache Beam has a pandas-compatible API. It is called the DataFrame API, and we will briefly introduce it here. We will not walk through the details of the pandas API itself, because its documentation is easy to find online. Instead, we will explain how to use the DataFrame API in Beam and how to switch between it and the classical PCollection API.
The basic idea behind a DataFrame (both in Beam and in pandas) is that a data point can be viewed as a row in a table, where each row can have multiple fields (columns). Each field has an associated name and data type. Not every row (data point) has to carry a value for every field; a field that a row does not set is treated as a missing value.
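To make the row-and-column view concrete, here is a minimal sketch in plain pandas (not Beam). The column names user and amount and the sample values are made up for illustration. The second data point does not set amount, and pandas fills the gap with a missing value instead of rejecting the row.

```python
import pandas as pd

# Two data points (rows); the field names are hypothetical.
# The second row does not provide an "amount".
records = [
    {"user": "alice", "amount": 3.5},
    {"user": "bob"},
]

df = pd.DataFrame(records)

# Every column has a name and a data type.
print(df.dtypes)

# The field missing from the second row shows up as a missing value (NaN).
print(df["amount"].isna().tolist())
```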
We can either use the DataFrame API directly from the beginning or swap between the classical API and the DataFrame API, depending on the situation and which API gives more readable code.
We'll start by introducing the first option – creating a DataFrame...