Practical lab
We have a bronze table that is loaded into our data lake by a third-party tool. There has been a request to clean up the data and resolve several known issues. Your task is to write the Python code that addresses each of the following issues.
The following are the issues present:
- Wrong column name: The date column is spelled wrong
- Nulls not correctly identified: The sales_id column has null values as NA strings
- Data with missing values is unwanted: Any data with a null in sales_id should be dropped
- Duplicate sales_id: Take the first value of any duplicate rows
- Date column not DateType: The date column is not a DateType
Loading the problem data
The following code will create our bronze table:
bronze_sales = spark.createDataFrame(data = [ ("1", "LA", "2000-01-01",5, 1400), ("2", "LA", "1998-2-01",4, 1500), ...