Before we start coding, we need to load the dataset in Python and make sure that all the packages the project depends on are available. The following packages must be installed on your system (the latest versions should suffice; no specific package version is required):
- NumPy
- pandas
- fuzzywuzzy
- python-Levenshtein
- scikit-learn
- gensim
- pyemd
- NLTK
We will provide specific installation instructions and tips for each of these packages as we come to use it in the project.
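Before going further, it can be useful to check which of the listed packages are already present. The following is a minimal sketch, not part of the original project code: the helper name missing_packages and the REQUIRED mapping are introduced here for illustration only. Note that the name passed to pip sometimes differs from the name used in an import statement (scikit-learn is imported as sklearn, and python-Levenshtein as Levenshtein).

```python
from importlib.util import find_spec

# pip distribution name -> module name used in `import` statements.
# This mapping is an illustrative helper, not part of the original project.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "fuzzywuzzy": "fuzzywuzzy",
    "python-Levenshtein": "Levenshtein",
    "scikit-learn": "sklearn",
    "gensim": "gensim",
    "pyemd": "pyemd",
    "nltk": "nltk",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    # Print one ready-to-run install command for everything that is absent.
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

The check uses find_spec rather than importing each module, so it only looks the modules up and does not pay the cost of loading large libraries such as gensim.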
For all dataset operations, we will use pandas (NumPy will come in handy, too). To install NumPy and pandas:
pip install numpy
pip install pandas
The dataset can be loaded into memory easily with pandas, using its specialized data structure, the DataFrame (we expect the dataset to be in the same directory as your script or Jupyter notebook):
import pandas...
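As a hedged illustration of the loading step, the sketch below reads a delimited text file into a DataFrame. Everything specific in it is an assumption made for the example: the file name dataset.tsv, the tab separator, the column names, and the helper load_dataset are placeholders, not the project's actual dataset or code. The sketch writes a tiny stand-in file first so that it runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(path, sep="\t"):
    """Read a delimited text file into a pandas DataFrame.

    The tab separator is an assumption; pass sep="," for a CSV file.
    """
    return pd.read_csv(path, sep=sep)


# Create a small stand-in file; the name and columns are hypothetical.
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / "dataset.tsv"
sample_path.write_text(
    "id\ttext\n0\tfirst row\n1\tsecond row\n", encoding="utf-8"
)

data = load_dataset(sample_path)

# Quick sanity checks after loading: size and first rows.
print(data.shape)
print(data.head())
```

With a real dataset placed next to the script or notebook, a relative path such as the bare file name is enough, because pandas resolves it against the current working directory.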