Data from the real world is rarely as clean as we would like it to be. In some cases, the errors in the data are so critical that the only solution is to report them or even abort the process.
There is, however, a different kind of issue with data: minor problems that can be fixed, as in the following examples:
- You have a field that contains years. Among the values, you see 2912. This can be considered a typo; assume that the proper value is 2012.
- You have a string field that holds the name of a country, and the names are supposed to belong to a predefined list of valid countries. However, you see values such as USA, U.S.A., or United States. Your list contains only USA as valid, but all of these values clearly refer to the same country and should be easy to unify.
- You have a field that should contain integer numbers...
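
The first two fixes above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `fix_year` and `normalize_country`, the alias table, the accepted year range, and the rule that only the second digit of a year may be mistyped. It is not a general-purpose cleansing routine.

```python
# Hypothetical lookup table: known spelling variants mapped to the single
# valid name on the predefined list.
COUNTRY_ALIASES = {
    "usa": "USA",
    "u.s.a.": "USA",
    "united states": "USA",
}


def normalize_country(raw):
    """Return the valid country name, or None if the value is not recognized."""
    key = raw.strip().lower()
    return COUNTRY_ALIASES.get(key)


def fix_year(year, min_year=1900, max_year=2100):
    """Return a plausible year, or None if the value cannot be repaired.

    Assumption: an out-of-range four-digit year has a mistyped second
    digit that should have been 0 (for example, 2912 instead of 2012).
    """
    if min_year <= year <= max_year:
        return year  # already plausible, leave untouched
    digits = str(year)
    if len(digits) == 4:
        candidate = int(digits[0] + "0" + digits[2:])
        if min_year <= candidate <= max_year:
            return candidate
    return None  # not fixable under this rule; report it instead


print(fix_year(2912))                 # 2012
print(normalize_country("U.S.A."))    # USA
```

Returning `None` for values that cannot be repaired keeps the two kinds of problems separate: minor issues are fixed silently, while anything else is left for the caller to report.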