Feature engineering for the baseline model
For this application, we will use a basic statistical feature extraction approach to generate features from the raw text data. In the NLP domain, we need to convert raw text into a numerical format so that ML algorithms can be applied to that numerical data. There are many techniques available for this, including indexing, count-based vectorization, Term Frequency-Inverse Document Frequency (TF-IDF), and so on. I have already discussed the concept of TF-IDF in Chapter 4, Generate features using TF-IDF; a short sketch of these techniques also follows the note below:
Note
Indexing is basically used for fast data retrieval. In indexing, we assign a unique identification number to each word; this number can be assigned in alphabetical order or based on frequency, as sketched below. You can refer to this link: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
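As a rough sketch of the indexing idea, the snippet below uses scikit-learn's LabelEncoder from the link above to assign each unique word an integer identifier; the word list is just an illustrative assumption, not data from this application.

from sklearn.preprocessing import LabelEncoder

# Illustrative tokens only; in practice these come from the corpus.
words = ["movie", "was", "great", "the", "movie"]

encoder = LabelEncoder()
ids = encoder.fit_transform(words)   # IDs are assigned in alphabetical order of the unique words

print(list(encoder.classes_))        # ['great', 'movie', 'the', 'was']
print(list(ids))                     # [1, 3, 0, 2, 1]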
Count-based vectorization sorts the vocabulary words in alphabetical order; if a particular word is present in a document, its vector value is the number of times that word occurs, and it is zero otherwise.
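To make the ideas above concrete, here is a minimal sketch using scikit-learn's CountVectorizer for count-based vectorization and TfidfVectorizer for TF-IDF; the two toy sentences are assumptions used purely for illustration, not part of this application's dataset.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus; in the real application this would be the raw text data.
corpus = ["the movie was great", "the movie was boring"]

# Count-based vectorization: alphabetical vocabulary, occurrence counts per document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())   # use get_feature_names() on older scikit-learn
print(counts.toarray())

# TF-IDF: the same vocabulary, but counts are re-weighted by inverse document frequency.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)
print(tfidf.toarray().round(2))

Words that appear in every document (such as "the" here) receive a lower TF-IDF weight than words that distinguish documents, which is why TF-IDF is often preferred over raw counts.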