We use a question classification dataset that has been made openly available by the University of Illinois at Urbana-Champaign. We will try to classify questions, based on their text, into one of the following six classes:
- ABBREVIATION
- ENTITY
- DESCRIPTION
- HUMAN
- LOCATION
- NUMERIC
More information about the dataset is available at https://cogcomp.seas.upenn.edu/Data/QA/QC/.
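To make the six classes concrete, here is a minimal sketch of how one labeled line could be turned into a class name and a question. It assumes that each line of the label files has the layout `COARSE:fine question text` (for example, a tag such as `DESC:manner` followed by the question) and that the coarse tags are the short forms `ABBR`, `ENTY`, `DESC`, `HUM`, `LOC`, and `NUM`. The names `COARSE_LABELS` and `parse_question_line`, as well as the sample questions, are hypothetical and serve only as illustration.

```python
# Assumed mapping from the dataset's short coarse tags to the six class names.
COARSE_LABELS = {
    "ABBR": "ABBREVIATION",
    "ENTY": "ENTITY",
    "DESC": "DESCRIPTION",
    "HUM": "HUMAN",
    "LOC": "LOCATION",
    "NUM": "NUMERIC",
}


def parse_question_line(line):
    """Split one 'COARSE:fine question text' line into (class name, question)."""
    # The label is everything before the first space; the rest is the question.
    label, question = line.strip().split(" ", 1)
    # Keep only the coarse part of the label, e.g. 'LOC' from 'LOC:city'.
    coarse_tag = label.split(":", 1)[0]
    return COARSE_LABELS[coarse_tag], question


# Hypothetical lines written in the assumed layout (not taken from the dataset).
sample_lines = [
    "LOC:city What is the capital of France ?",
    "NUM:count How many legs does a spider have ?",
]
parsed = [parse_question_line(line) for line in sample_lines]

for class_name, question in parsed:
    print(class_name, "|", question)
```

If the files use a different layout, only `parse_question_line` would need to change; the rest of the pipeline can keep working with the class name and question pairs.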
Follow these steps to classify the questions based on their text:
- Import the basic libraries:
import nltk
# Download the stop-word list and the WordNet data that the lemmatizer relies on
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
- Now, we will read the dataset using the following code snippet:
train_data = open(...