As a dataset grows, so do its inconsistencies and errors. Whether as a result of human error, system failure, or evolving data structures, real-world data is rife with invalid, absurd, or missing values. Even when the dataset is spotless, the nature of some variables needs to be adapted to the model. We look at the most common data anomalies and characteristics that need to be corrected in the context of Amazon ML linear models.
Dealing with messy data
Classic datasets versus real-world datasets
Data scientists and machine-learning practitioners often use classic datasets to demonstrate the behavior of certain models. The Iris dataset, composed of 150 samples of three types of iris flowers, is one of the most commonly used to demonstrate or to teach...