Training the baseline model
As you know, we have selected the RandomForestRegressor algorithm for our baseline model, and we will use the scikit-learn library to train it. These are the steps we need to follow:
Splitting the training and testing dataset
Splitting prediction labels for the training and testing dataset
Converting sentiment scores into a NumPy array
Training the ML model
So, let's implement each of these steps one by one.
Splitting the training and testing dataset
We have 10 years of data. For training, we will use 8 years of it, which is the data from 2007 to 2014. For testing, we will use the remaining 2 years, which is the data from 2015 and 2016. You can refer to the code snippet in the following screenshot to implement this:
As you can see from the preceding screenshot, our training dataset has been stored in the train dataframe and our testing dataset has been stored in the test dataframe...
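For readers who cannot see the screenshot, the following is a minimal sketch of a year-based split like the one described above. It is not the original snippet: the df dataframe, its date and close columns, and the synthetic values are assumptions made for illustration only. Only the train and test names and the year ranges come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: one row per calendar day
# from 2007 through 2016. The column names are assumed for illustration.
dates = pd.date_range("2007-01-01", "2016-12-31", freq="D")
df = pd.DataFrame({"date": dates, "close": range(len(dates))})

# Extract the calendar year of each row once, then filter on it.
year = df["date"].dt.year

# Training set: the 8 years from 2007 to 2014 (both ends inclusive).
train = df[year.between(2007, 2014)]

# Testing set: the 2 years 2015 and 2016.
test = df[year.between(2015, 2016)]

print(train["date"].min().date(), "to", train["date"].max().date())
print(test["date"].min().date(), "to", test["date"].max().date())
```

Splitting by date rather than at random keeps every test row later in time than every training row, so the model is evaluated on a period it has not seen.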