Combining datasets
So far we have seen how to move a data frame into Spark for analysis. A Spark data frame behaves much like a SQL table, and in SQL it is standard practice not to duplicate data across tables. For example, a product table might hold the price, while an order table references the product by its product identifier rather than repeating that data. A related SQL practice is to join, or combine, tables to assemble the full set of information needed. Keeping with the order analogy, we combine all of the tables involved, as each table holds a piece of the data needed for a complete order.
How difficult would it be to create a set of tables and join them using Spark? We will use example tables of `Product`, `Order`, and `ProductOrder`:
| Table | Columns |
| --- | --- |
| Product | Product ID, Description, Price |
| Order | Order ID, Order Date |
| ProductOrder | Order ID, Product ID, Quantity |
So, an `Order` has a list of `Product`/`Quantity` values associated with it.
We can populate the data frames and move them into Spark:
from...