Checking that the data is unique
Now that we have ensured the data is consistent, we must also ensure that it is unique before it enters the machine learning system.
In this section, we will investigate the data and check whether the values in the loan_id column are unique, and whether a combination of certain columns can uniquely identify each record.
In pandas, we can use the .nunique() method to count the unique values in a column and compare that count with the number of rows. First, we will check that loan_id is unique and that no duplicate applications have been entered:
df.loan_id.nunique(), df.shape[0]
(614, 614)
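The comparison above tells us whether duplicates exist, but not which rows they are. The following is a minimal sketch of how the check could be extended to list the offending rows. It runs on a small made-up DataFrame rather than the loan dataset, and the helper name find_duplicate_ids, the loan_amount column, and all values are assumptions for illustration only.

```python
import pandas as pd


def find_duplicate_ids(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return every row whose value in `column` appears more than once."""
    # keep=False flags all members of a duplicated group, not only the repeats
    return frame[frame[column].duplicated(keep=False)]


# Toy data, made up for illustration: LP003 has been entered twice
toy = pd.DataFrame(
    {
        "loan_id": ["LP001", "LP002", "LP003", "LP003"],
        "loan_amount": [120, 95, 150, 150],
    }
)

# Same comparison as in the text: unique IDs versus number of rows
is_unique = toy["loan_id"].nunique() == toy.shape[0]
duplicates = find_duplicate_ids(toy, "loan_id")

print(is_unique)
print(duplicates)
```

Note that .nunique() ignores missing values by default, so a column containing missing IDs would report fewer unique values than rows even when no ID is repeated.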
With this, we have confirmed that the loan IDs are unique. However, we can go one step further and check that data from one loan application has not been entered under another. We believe it is quite unlikely that the same combination of income and loan amount will appear in more than one loan application. We must check that we can use a combination of column values...
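One way to test whether a combination of columns is unique is to pass the columns to the subset argument of DataFrame.duplicated(). The sketch below is a minimal illustration on made-up data. The column names applicant_income and loan_amount and all values are assumptions, so they should be replaced with the actual columns in the dataset.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only
toy = pd.DataFrame(
    {
        "loan_id": ["LP001", "LP002", "LP003", "LP004"],
        "applicant_income": [5000, 4200, 5000, 6100],
        "loan_amount": [120, 95, 120, 150],
    }
)

key_columns = ["applicant_income", "loan_amount"]

# Rows whose (income, loan amount) pair appears more than once
repeated = toy[toy.duplicated(subset=key_columns, keep=False)]

# Number of distinct combinations, to compare with the number of rows
n_combinations = toy[key_columns].drop_duplicates().shape[0]

print(n_combinations, toy.shape[0])
print(repeated)
```

If the two printed counts differ, the rows in repeated are the candidates to inspect: they have different loan IDs but share the same combination of values.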