Training a logistic regression model for document classification
In this section, we will train a logistic regression model to classify the movie reviews into positive and negative reviews based on the bag-of-words model. First, we will divide the DataFrame
of cleaned text documents into 25,000 documents for training and 25,000 documents for testing:
>>> X_train = df.loc[:25000, 'review'].values
>>> y_train = df.loc[:25000, 'sentiment'].values
>>> X_test = df.loc[25000:, 'review'].values
>>> y_test = df.loc[25000:, 'sentiment'].values
Next, we will use a GridSearchCV
object to find the optimal set of parameters for our logistic regression model using 5-fold stratified cross-validation:
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_extraction...