Text pre-processing
Before we build our model, we need to prepare our data so that it can be fed to the model. We want a feature vector and a class label. In our case, the class label can take two values, positive or negative, depending on whether the sentence has a positive or a negative sentiment. Words are our features. We will use the bag-of-words model to represent our text as features. In a bag-of-words model, the following steps are performed to transform a text into a feature vector:
- Extract all unique individual words from the text dataset. We call a text dataset a corpus.
- Process the words. Processing typically involves removing numbers and other characters, placing the words in lowercase, stemming the words, and removing unnecessary white spaces.
- Each word is assigned a unique number, and together they form the vocabulary. A word, unknown, is added to the vocabulary. It stands in for words that are not in the vocabulary but may appear in future datasets.
- Finally, a document term matrix is created. The rows of this...
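The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in this text: the function names, the two-sentence corpus, and the `<unk>` spelling of the unknown token are all assumptions made for the example, and stemming is left out to keep the sketch free of external libraries. The matrix is laid out in the conventional way, with one row per document and one column per vocabulary word.

```python
import re

# Hypothetical placeholder for out-of-vocabulary words. The angle brackets
# guarantee it can never collide with a real token, because the tokenizer
# below keeps letters only.
UNKNOWN = "<unk>"


def tokenize(text):
    """Lowercase the text, replace digits and punctuation with spaces,
    and split on whitespace (which also drops extra white space)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def build_vocabulary(corpus):
    """Assign each unique word a number; 0 is reserved for UNKNOWN."""
    vocab = {UNKNOWN: 0}
    for doc in corpus:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def document_term_matrix(corpus, vocab):
    """Build one row per document and one column per vocabulary word.
    Each cell counts how often that word occurs in that document."""
    matrix = []
    for doc in corpus:
        row = [0] * len(vocab)
        for word in tokenize(doc):
            # Words missing from the vocabulary are counted under UNKNOWN.
            row[vocab.get(word, vocab[UNKNOWN])] += 1
        matrix.append(row)
    return matrix


# Illustrative corpus, made up for this example.
corpus = ["I loved the movie", "I hated the movie 2 times"]
vocab = build_vocabulary(corpus)
dtm = document_term_matrix(corpus, vocab)

for row in dtm:
    print(row)
```

Passing a new sentence such as "the movie was great" through `document_term_matrix` with the same `vocab` counts "was" and "great" under the unknown token, which is how unseen words in future datasets would be handled in this sketch.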