Introducing the dataset
First, let’s introduce our problem statement. For loan providers, it is important to ensure that people who receive a loan can make their payments and don’t default. However, it is equally important that people are not denied a loan because a model was trained on poor-quality data. This is where the data-centric approach helps: it gives data scientists and data engineers a framework for questioning the quality of their data.
For this chapter, we will use the loan prediction dataset from Analytics Vidhya. You can download the dataset from https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/#ProblemStatement. There are two files: one for training and one for testing. The test file doesn’t contain any labels. For this chapter, we will utilize the training file, which has been downloaded and saved as train_loan_prediction.csv.
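Before going further, it can be worth confirming that the download worked. The following is a minimal sketch that uses only the Python standard library; the helper name summarize_csv is hypothetical and not part of the chapter's own code. It assumes the file has a header row and is saved in the working directory.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row.

    Hypothetical helper for a quick sanity check of the downloaded file.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} not found; download the training file first")
    # utf-8-sig transparently drops a byte-order mark if the file has one
    with path.open(newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # first row holds the column names
        # count the remaining non-empty rows as records
        row_count = sum(1 for row in reader if row)
    return header, row_count
```

Calling summarize_csv("train_loan_prediction.csv") should return the list of column names and the number of records, which you can compare against the description on the download page.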
First, we will look at the dataset and check the first...