Transformation and data cleansing
As the next step in the pipeline creation tutorial, you clean each of the DataFrames so that the data you deliver to your clients is reliable and trustworthy. As a team, you decide to perform the following data cleansing tasks on each of the DataFrames:
- Remove duplicates: Remove any duplicate rows in each DataFrame using the drop_duplicates() function:
  df = df.drop_duplicates()
- Handle missing values: Check for any missing values in the DataFrames and handle them appropriately. For example, you can replace missing values in numeric columns with the mean and in categorical columns with the mode using the fillna() function:
  # Replace missing values in numeric columns with the mean
  # (numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions)
  df.fillna(df.mean(numeric_only=True), inplace=True)
  # Replace missing values in categorical columns with the mode
  df.fillna(df.mode().iloc[0], inplace=True)
- Convert data types: Convert...
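The duplicate-removal and missing-value tasks from the list above can be combined into one helper that is applied to each DataFrame in turn. The following is a minimal sketch, not the tutorial's own implementation: the function name clean_dataframe, the sample column names speed and weather, and the sample values are all hypothetical and chosen only for illustration. It assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd


def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, then fill gaps: mean for numeric columns, mode for the rest."""
    # Work on a copy so the caller's DataFrame is left untouched.
    cleaned = df.drop_duplicates().copy()

    numeric_cols = cleaned.select_dtypes(include="number").columns
    other_cols = cleaned.columns.difference(numeric_cols)

    # Numeric columns: replace missing values with the column mean.
    for col in numeric_cols:
        cleaned[col] = cleaned[col].fillna(cleaned[col].mean())

    # Remaining columns (assumed categorical): replace missing values with the mode.
    for col in other_cols:
        modes = cleaned[col].mode()
        if not modes.empty:  # a column with no values at all has no mode
            cleaned[col] = cleaned[col].fillna(modes.iloc[0])

    return cleaned


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    "speed": [30.0, 30.0, None, 50.0, 40.0],
    "weather": ["rain", "rain", "clear", None, "rain"],
})
result = clean_dataframe(sample)
print(result)
```

Filling column by column keeps the numeric and categorical rules separate, so a text column is never passed to mean() and a numeric column is never filled with its mode.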