Categorizing data quality
It is perhaps an accepted notion that issues with data quality may be categorized into one of the following areas:
- Accuracy
- Completeness
- Update status
- Relevance
- Consistency (across sources)
- Reliability
- Appropriateness
- Accessibility
The quality or level of quality of your data can be affected by the way it is entered, stored, and managed. The process of addressing data quality (referred to most often as data quality assurance (DQA)) requires a routine and regular review and evaluation of the data and performing ongoing processes termed profiling and scrubbing (this is vital even if the data is stored in multiple disparate systems, making these processes difficult).
Here, tidying the data will be much more project centric in that we're probably not concerned with creating a formal DQA process, but are only concerned with making certain that the data is correct for your particular predictive project.
In statistics, data unobserved or not yet reviewed by the data scientist...