Improving pre-training data processing
Data processing in the early stages of the machine learning life cycle, before model training and evaluation, determines the quality of the data we feed into training, validation, and testing. It therefore directly affects our chances of achieving a high-performance and reliable model.
Anomaly detection and outlier removal
Anomalies and outliers in your data could decrease the performance and reliability of your models in production. Their impact differs depending on whether they appear in the training data, in the data you use for model evaluation, or in unseen data in production:
- Outliers in model training: Outliers in the training data of supervised learning models could result in lower model generalizability. They could cause unnecessarily complex decision boundaries in classification models or unnecessary nonlinearity in regression models.
- Outliers in model evaluation: Outliers in validation and test data could lower the model...