Denormalizing tables
In this section, we will look at an example use case. There is a fictional e-commerce company that sells products and has a website that allows people to buy these products. There are three tables stored in the web system – two dimension tables, product
and customer
, and one fact table, sales
. The product
table stores the product’s name, category, and price. The customer
table stores individual customer names, email addresses, and phone numbers. These email addresses and phone numbers are sensitive pieces of information that need to be handled carefully. When a customer buys a product, that activity is recorded in the sales
table. One new record is inserted into the sales
table every time a customer buys a product.
The following is the product
dimension table:
Figure 6.5 – Product table
The following code can be used to populate the preceding sample data in a Spark DataFrame:
df_product = spark.createDataFrame...