Data validation is the process of checking data quality to ensure it is both correct and useful for analysis. It relies on routines, often called validation rules, that verify the integrity of the data fed into models. In the age of big data, where vast volumes of information are generated automatically by computers and other technologies, using such data without checking its quality is risky, which makes data validation essential.
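As a minimal sketch of what such validation rules can look like, the snippet below applies two hypothetical rules (a non-null check and a range check, both assumed for illustration) to a small pandas DataFrame and collects the indices of failing rows:

```python
import pandas as pd

# Hypothetical sample data standing in for incoming records.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25, -3, 41, None],
})

# Each validation rule is a boolean mask; rows that fail any rule
# are flagged rather than silently passed on to analysis.
rules = {
    "age_not_null": df["age"].notna(),
    "age_in_range": df["age"].between(0, 120),
}

failures = {name: df.index[~mask].tolist() for name, mask in rules.items()}
print(failures)
```

Keeping each rule as a named mask makes it easy to report exactly which rule a given row violated.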
In this case study, we are going to consider two DataFrames:
- Test DataFrame (from a flat file)
- Validation DataFrame (from MongoDB)
Validation routines are run on the test DataFrame, with the validation DataFrame serving as the reference.
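One common way to check a test DataFrame against a reference is an outer merge with an indicator column, which surfaces rows that exist in only one source or that differ between the two. The sketch below uses in-memory stand-ins (with hypothetical column names) for the flat file and MongoDB sources; in practice the test DataFrame might come from `pd.read_csv(...)` and the validation DataFrame from a MongoDB query via pymongo, e.g. `pd.DataFrame(list(collection.find()))`:

```python
import pandas as pd

# Stand-ins for the two sources described above; columns are hypothetical.
test_df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 99]})
validation_df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

# An outer merge on all shared columns, with indicator=True, labels each
# row as "both", "left_only" (test only), or "right_only" (reference only).
diff = test_df.merge(validation_df, how="outer", indicator=True)
mismatches = diff[diff["_merge"] != "both"]
print(mismatches)
```

Rows tagged `left_only` are present in the test data but absent from (or different in) the reference, and `right_only` rows are the reverse, giving a quick inventory of discrepancies.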