Noise
Data quality in machine learning systems has one additional and crucial attribute: noise. Noise can be defined as data points that reduce the ability of machine learning systems to identify patterns in the data. These data points can be outliers that skew a dataset toward one or several classes in classification problems. Outliers can also cause prediction systems to over- or under-predict because they emphasize patterns that are not representative of the rest of the data.
Another type of noise is contradictory entries, where two (or more) identical data points are labeled with different labels. We can illustrate this with the example of product reviews on Amazon, which we saw in Chapter 3. Let’s import them into a new Python script with dfData = pd.read_csv('./book_chapter_4_embedded_1k_reviews.csv'). In this case, the dataset contains a summary of each review and its score. We focus on these two columns and we define noise as different scores...
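To make the idea of contradictory entries concrete, here is a minimal sketch of how identical summaries with different scores could be found. It uses a small, made-up DataFrame instead of the reviews file so that it runs on its own, and the column names Summary and Score are assumptions about how the CSV is laid out. The helper names (dfScoresPerSummary, lstNoisySummaries, dfNoise) are hypothetical and not taken from the original script.

```python
import pandas as pd

# Made-up rows standing in for the reviews file; the column names
# 'Summary' and 'Score' are assumptions about the CSV's layout.
dfData = pd.DataFrame({
    'Summary': ['Great product', 'Great product', 'Not good', 'Not good', 'Okay'],
    'Score':   [5, 1, 2, 2, 3],
})

# Count how many distinct scores are attached to each summary text.
dfScoresPerSummary = dfData.groupby('Summary')['Score'].nunique()

# A summary is contradictory when the same text carries more than one score.
lstNoisySummaries = dfScoresPerSummary[dfScoresPerSummary > 1].index.tolist()

# Keep every row that belongs to a contradictory summary.
dfNoise = dfData[dfData['Summary'].isin(lstNoisySummaries)]

print(f'{len(dfNoise)} of {len(dfData)} rows are contradictory')
print(dfNoise)
```

With the real data, the toy DataFrame would be replaced by the one loaded with pd.read_csv. Whether to drop the contradictory rows, keep one of them, or relabel them is a separate decision that depends on the task.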