Ensuring that the data is accurate
Even though the data is valid, it may not be accurate. Data accuracy measures the percentage of data that matches real-world data or verifiable sources. Returning to the preceding example of the property area, measuring accuracy means looking up a reliable published dataset and checking both the population of each area and its area type. Let’s assume that the population matches the verifiable data source, but no source is available for the area type. In that case, we can fall back on the rule that defines what counts as a rural area and what counts as an urban area, and measure data accuracy against it.
Using this business rule, we will create a new label called true_property_area that takes rural as a value when the population is 20,000 or below; otherwise, it takes urban as a value:
df['true_property_area'] = df.population.apply(lambda value: 'rural' if value <= 20000 else 'urban')
Next, we print the rows of the dataset to see if there are any mismatches...
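As a minimal sketch of how such a mismatch check might look, the snippet below applies the rule to a small made-up dataset and compares the result with the recorded label. The sample values and the column name property_area are assumptions for illustration, not the chapter's actual data; only population and true_property_area come from the text above.

```python
import pandas as pd

# Hypothetical sample data; the values and the `property_area` column name
# are assumptions made for this illustration.
df = pd.DataFrame({
    "population": [5000, 20000, 20001, 150000, 12000],
    "property_area": ["rural", "urban", "urban", "urban", "urban"],
})

# Business rule from the text: 20,000 or below is rural, otherwise urban.
df["true_property_area"] = df.population.apply(
    lambda value: "rural" if value <= 20000 else "urban"
)

# Rows where the recorded label disagrees with the rule-derived label.
mismatches = df[df["property_area"] != df["true_property_area"]]
print(mismatches)

# Accuracy as the percentage of rows whose recorded label matches the rule.
accuracy_pct = (df["property_area"] == df["true_property_area"]).mean() * 100
print(f"Accuracy: {accuracy_pct:.1f}%")
```

In this made-up sample, two of the five rows are labeled urban despite a population of 20,000 or below, so the accuracy under the rule is 60%.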