Training a logistic regression model for document classification
In this section, we will train a logistic regression model to classify the movie reviews as positive or negative. First, we will divide the DataFrame of cleaned text documents into 25,000 documents for training and 25,000 documents for testing:
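>>> # .loc slices by label and includes the end label, so with the default
>>> # integer index, labels 0 to 24999 select exactly 25,000 rows
>>> X_train = df.loc[:24999, 'review'].values
>>> y_train = df.loc[:24999, 'sentiment'].values
>>> X_test = df.loc[25000:, 'review'].values
>>> y_test = df.loc[25000:, 'sentiment'].values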
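When splitting by row, it helps to check how pandas slices. The following is a minimal sketch on a hypothetical 10-row stand-in for the review DataFrame (the name `df_toy` and its contents are assumptions for illustration, not the actual movie review data). It assumes the default integer index and shows that label-based `.loc` slices include the end label, while position-based `.iloc` slices exclude the stop position:

```python
import pandas as pd

# Hypothetical toy stand-in for the cleaned review DataFrame
# (10 rows here; it is not the real movie review dataset).
df_toy = pd.DataFrame({
    'review': [f'review text {i}' for i in range(10)],
    'sentiment': [i % 2 for i in range(10)],
})

half = len(df_toy) // 2

# Label-based slicing with .loc includes BOTH endpoints,
# so the training slice must stop at label half - 1.
loc_train = df_toy.loc[:half - 1, 'review'].values
loc_test = df_toy.loc[half:, 'review'].values

# Position-based slicing with .iloc excludes the stop position,
# just like slicing a Python list.
iloc_train = df_toy['review'].iloc[:half].values
iloc_test = df_toy['review'].iloc[half:].values

# Both approaches yield two equal halves.
print(len(loc_train), len(loc_test))
print(len(iloc_train), len(iloc_test))

# Stopping the .loc slice at label `half` selects one extra row,
# and that row would also appear in the test slice.
print(len(df_toy.loc[:half, 'review']))
```

Either style works as long as the two slices do not share a row. A row that lands in both sets leaks a training example into the test data.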
Next, we will use a GridSearchCV object to find the optimal set of parameters for our logistic regression model using 5-fold stratified cross-validation:
>>> # GridSearchCV is provided by sklearn.model_selection in scikit-learn 0.18 and later
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(strip_accents=None, ....