Compensating for missing and out-of-range data
Some columns will have missing data. The business use case determines how serious that is and what to do about it. If a field is used as an input to a model, every record needs a value for it. Here are some strategies you can use:
- Drop the affected records. This is OK when you do not need to use the information for downstream workloads.
- Flag the row/column by adding a marker value (for example, -1). This lets you identify missing data later on without violating a schema.
- Perform basic imputation so that you have a "best guess" at what the value could have been, often by using the mean of the non-missing data.
The following is an example of filling default values for specific columns:
The following is an example of using the "average strategy" to impute the values of the specified columns:
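As a minimal sketch of the three strategies above (dropping, flagging with a marker or default value, and mean imputation), the snippet below uses only the Python standard library. The text does not name a library, so the record layout (a list of dictionaries with `None` for missing values), the sample data, and the helper names `drop_missing`, `fill_defaults`, and `impute_mean` are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "age": None, "income": 61000.0},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 48000.0},
]


def drop_missing(rows, columns):
    """Keep only the rows that have a value in every listed column."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


def fill_defaults(rows, defaults):
    """Replace missing values with a per-column marker or default value."""
    return [
        {k: (defaults[k] if v is None and k in defaults else v) for k, v in r.items()}
        for r in rows
    ]


def impute_mean(rows, columns):
    """Replace missing values with the mean of each column's non-missing values."""
    means = {}
    for c in columns:
        present = [r[c] for r in rows if r[c] is not None]
        if present:  # a column with no observed values is left untouched
            means[c] = mean(present)
    return fill_defaults(rows, means)


dropped = drop_missing(records, ["age", "income"])   # rows with id 1 and 4 remain
flagged = fill_defaults(records, {"age": -1})        # missing age becomes -1
imputed = impute_mean(records, ["age", "income"])    # gaps filled with column means
```

Each helper returns a new list and leaves `records` unchanged, so the strategies can be compared side by side. A marker such as -1 is only safe when it can never be a legitimate value for that column. DataFrame libraries usually provide built-in equivalents; pandas, for instance, has `DataFrame.dropna` and `DataFrame.fillna`.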