Data cleaning in Python for Excel data
Data cleaning is a critical step when working with Excel data in Python. It ensures that your data is in the right format and free of errors, so that any exploratory data analysis (EDA) you run on it is accurate.
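To make "in the right format and free of errors" concrete, here is a minimal sketch of three common cleaning steps: dropping duplicate rows, converting currency strings to numbers, and filling missing values. The small DataFrame, the helper name clean, and the choice of median imputation are assumptions for illustration only; in practice the data would typically come from pd.read_excel, and the right strategy depends on your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, shaped like data that might be loaded from a spreadsheet.
raw = pd.DataFrame({
    "Name": ["Ann", "Ben", "Ben", "Cy"],
    "Age": [31, np.nan, np.nan, 40],
    "Salary": ["$50,000", "$60,000", "$60,000", "Missing"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of frame (illustrative helper, not a library API)."""
    # 1. Drop rows that are exact duplicates of an earlier row.
    out = frame.drop_duplicates().copy()

    # 2. Strip "$" and "," from Salary, then convert to numbers.
    #    Entries that cannot be parsed (e.g. "Missing") become NaN.
    out["Salary"] = pd.to_numeric(
        out["Salary"].str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )

    # 3. Fill missing ages with the median age (one possible strategy).
    out["Age"] = out["Age"].fillna(out["Age"].median())

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Median imputation is used here only because it is simple and robust to outliers; dropping the rows or leaving the gaps as NaN may be more appropriate, depending on the analysis.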
We will start by generating some dirty data as an example:
import pandas as pd
import numpy as np

# Create a DataFrame with missing data, duplicates, and mixed data types
data = {
    'ID': [1, 2, 3, 4, 5, 6],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eva', 'Eva'],
    'Age': [25, np.nan, 30, 28, 22, 23],
    'Salary': ['$50,000', '$60,000', 'Missing', '$65,000', '$55,000', '$75,000']
}
df = pd.DataFrame(data)

# Introduce some missing data
df.loc[1, 'Age'] = np.nan
df.loc[3, '...