Building the training and testing datasets for the baseline model
In this section, we generate the training dataset and the testing dataset. We iterate over the files of our dataset and treat every file whose name starts with 12 as part of the test dataset. As a result, roughly 90% of our dataset is used as the training dataset and roughly 10% as the testing dataset. You can refer to the code for this in the following figure:
As you can see, if a filename starts with 12, we add the content of that file to the testing dataset. All other files make up the training dataset. You can find the code at this GitHub link: https://github.com/jalajthanaki/Sentiment_Analysis/blob/master/Baseline_approach.ipynb.
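The split described above can be sketched in a few lines of Python. This is only a minimal illustration, not the notebook's actual code: the function name split_by_filename_prefix, the data_dir argument, and the assumption that all files sit in a single flat directory of UTF-8 text files are hypothetical choices made for this example.

```python
import os


def split_by_filename_prefix(data_dir, test_prefix="12"):
    """Split the files in data_dir into training and testing documents.

    Hypothetical helper: files whose names start with test_prefix go to
    the test set, and every other file goes to the training set.
    """
    train_docs, test_docs = [], []
    # Sort the names so the order of documents is reproducible.
    for filename in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # Assumption: the files are plain text encoded as UTF-8.
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        if filename.startswith(test_prefix):
            test_docs.append(text)
        else:
            train_docs.append(text)
    return train_docs, test_docs
```

Calling train_docs, test_docs = split_by_filename_prefix("path/to/files") returns two lists of document strings. Note that the actual train/test ratio depends on how the filenames are distributed, so it is worth checking the lengths of both lists after the split.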