Ensuring that the data is valid
So far, we have ensured that our data is consistent, unique, and complete. But do we know whether the data is valid? Do the data labels conform to the business rules? For example, what if the rules for the property area do not allow the value semi_urban? One or two annotators may have believed that some suburbs are neither urban nor rural, and entered semi_urban in violation of the rules. To measure validity, we look at the business rules and check the percentage of data that conforms to them. Let’s assume that semi_urban is an invalid value. In Python, we could compute the percentage of invalid labels and then reach out to the annotators to correct the data. We could also validate the labels against the data that was used to generate them. If we had a mapping from suburb_name to property_area, and suburb_name was available in the dataset, then we could leverage the mapping and catch invalid values...
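
Both checks can be sketched with pandas. This is a minimal illustration, not the dataset's actual schema: the sample rows, the suburb names, the set of valid areas (urban and rural), and the suburb_to_area mapping are all made up for the example.

```python
import pandas as pd

# Hypothetical sample data; column names follow the text, values are invented.
df = pd.DataFrame({
    "suburb_name": ["north_hill", "river_bend", "old_town", "green_field", "old_town"],
    "property_area": ["urban", "semi_urban", "urban", "rural", "semi_urban"],
})

# Assumed business rule: only these labels are valid.
VALID_AREAS = {"urban", "rural"}

# Check 1: percentage of labels that violate the business rule.
invalid_mask = ~df["property_area"].isin(VALID_AREAS)
invalid_pct = invalid_mask.mean() * 100
print(f"{invalid_pct:.1f}% of property_area labels are invalid")

# Rows to send back to the annotators for correction.
rows_to_review = df[invalid_mask]

# Check 2: compare each label with a suburb-to-area mapping (hypothetical).
suburb_to_area = {
    "north_hill": "urban",
    "river_bend": "rural",
    "old_town": "urban",
    "green_field": "rural",
}
expected = df["suburb_name"].map(suburb_to_area)

# Flag rows where the mapping knows the suburb and the label disagrees with it.
mismatch_mask = expected.notna() & (df["property_area"] != expected)
mismatches = df[mismatch_mask].assign(expected_area=expected[mismatch_mask])
print(mismatches)
```

The first check needs only the list of allowed values, so it works even when no source data is available. The second check is stricter: it would also catch a label that is allowed by the rules but wrong for that suburb, and it suggests the corrected value in the expected_area column.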