Preprocessing
The scikit-learn library has a preprocessing module, which is the topic of this section. In the previous chapter, Chapter 9, Analyzing Textual Data and Social Media, we installed scikit-learn and practiced a form of data preprocessing by filtering out stopwords. Some machine learning algorithms have trouble with data that is not distributed as a Gaussian with a mean of 0 and a variance of 1. The sklearn.preprocessing module addresses this by rescaling data to zero mean and unit variance, and we will demonstrate it in this section.
We will preprocess the meteorological data from the Dutch KNMI institute (original data for the De Bilt weather station from http://www.knmi.nl/climatology/daily_data/datafiles3/260/etmgeg_260.zip). The data is just one column of the original data file and contains daily rainfall values. It is stored in the .npy format discussed in Chapter 5, Retrieving, Processing, and Storing Data, so we can load it into a NumPy array. The values are integers that we have to multiply by 0.1 in order...
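As a rough illustration of the rescaling described in this section, the following sketch standardizes an array with the scale() function from sklearn.preprocessing. The integer array here is synthetic and only stands in for the rainfall column; the variable names raw, rain, and scaled are chosen for this example, and the real data would instead be loaded from the .npy file with np.load().

```python
import numpy as np
from sklearn import preprocessing

# Synthetic stand-in for the rainfall column (an assumption for this sketch):
# non-negative integers, as the text describes for the stored values.
rng = np.random.default_rng(42)
raw = rng.integers(0, 300, size=1000)

# Multiply by 0.1, as described in the text, to convert the stored integers.
rain = 0.1 * raw

# Rescale to zero mean and unit variance.
scaled = preprocessing.scale(rain)

print("Before: mean=%.3f variance=%.3f" % (rain.mean(), rain.var()))
print("After:  mean=%.3f variance=%.3f" % (scaled.mean(), scaled.var()))
```

Note that scaling shifts and stretches the values but does not change the shape of their distribution, so skewed data such as rainfall remains skewed after this step.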