Data processing
In the previous step, we structured the raw data, which is now ready for further analysis. Our objective is to analyze two types of data:
- Textual data in the description variable
- Numerical data in the other variables
Each of them requires a different pre-processing technique. Let's take a look at each type in detail.
Textual data
For the first kind, we have to create a new variable that contains a cleaned string. We will do it in three steps, all of which have already been presented in previous chapters:
- Selecting English descriptions
- Tokenization
- Stopwords removal
As we work only with English data, we should remove all descriptions written in other languages. The main reason is that each language requires its own processing and analysis flow. If we kept descriptions in Russian or Chinese, we would end up with very noisy data that we could not interpret. As a consequence, our analysis describes trends in the English-speaking world only.
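As a rough illustration of the three steps listed above, the sketch below cleans a single description using only the Python standard library. It is not the implementation from the previous chapters: the names `STOPWORDS`, `tokenize`, `looks_english` and `clean_description` are hypothetical, the stopword list is a tiny hand-picked sample, and the language check is a crude stand-in heuristic (the share of tokens that are common English stopwords) rather than a proper language detector.

```python
import re

# Tiny illustrative stopword list (an assumption, not the list used in
# previous chapters); a real analysis would rely on a much fuller one.
STOPWORDS = {
    "a", "an", "and", "are", "for", "in", "is", "it",
    "of", "on", "that", "the", "this", "to", "with",
}


def tokenize(text):
    """Lowercase the text and split it into tokens made of letters only."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscores).
    return re.findall(r"[^\W\d_]+", text.lower())


def looks_english(tokens, min_share=0.2):
    """Crude heuristic: enough tokens must be common English stopwords.

    min_share is an arbitrary threshold chosen for this sketch.
    """
    if not tokens:
        return False
    hits = sum(token in STOPWORDS for token in tokens)
    return hits / len(tokens) >= min_share


def clean_description(text):
    """Return the cleaned string, or None for empty or non-English text."""
    tokens = tokenize(text)
    if not looks_english(tokens):
        return None
    # Stopword removal happens last, after the language check has used them.
    return " ".join(token for token in tokens if token not in STOPWORDS)


print(clean_description("This is a library for parsing dates in the browser"))
print(clean_description("Библиотека для работы с датами"))
```

With these sample inputs, the English description is reduced to its content words and the Russian one is rejected. The heuristic also rejects short English descriptions that contain no stopwords at all, which is one reason a dedicated language-detection tool is usually preferable for real data.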
First, we remove all the empty strings...