Preparing the data
A number of data-preprocessing steps must be performed before the data can be fed into the model. This section describes how to clean the data and prepare it for the model.
Getting ready
All the text from the .txt files is first combined into one large corpus. This is done by reading each sentence from each file and appending it to an initially empty corpus. A number of preprocessing steps are then applied to remove irregularities such as extra whitespace, spelling errors, stopwords, and so on. The cleaned text then has to be tokenized, and the tokenized sentences are appended to an initially empty array by running them through a loop.
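The cleaning-and-tokenizing loop described above can be sketched as follows. This is a minimal illustration, not the book's exact code: the stopword list, the `clean_sentence` helper, and the sample sentences are all hypothetical stand-ins.

```python
import re

# Hypothetical stopword list -- a real pipeline would use a fuller set.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def clean_sentence(sentence):
    # Lowercase, replace non-letter characters with spaces,
    # and collapse repeated whitespace.
    sentence = re.sub(r"[^a-zA-Z\s]", " ", sentence.lower())
    sentence = re.sub(r"\s+", " ", sentence).strip()
    # Drop stopwords and return the remaining tokens.
    return [w for w in sentence.split() if w not in STOPWORDS]

# Tokenized sentences are collected into an initially empty list.
corpus_raw = ["The night is dark, and full of terrors!",
              "Winter is coming."]
tokenized = []
for sent in corpus_raw:
    tokens = clean_sentence(sent)
    if tokens:  # skip sentences that are empty after cleaning
        tokenized.append(tokens)
```

The result is a list of token lists, which is the shape most word-embedding models expect as input.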
How to do it...
The steps are as follows:
Type in the following commands to search for the .txt files within the working directory and print the names of the files found:

import glob

book_names = sorted(glob.glob("./*.txt"))
print("Found books:")
book_names
In our case, there are five books, named got1, got2, got3, got4, and got5, saved in the working directory....
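Once the files are found, their contents can be concatenated into the single large corpus described earlier. The following is a sketch under stated assumptions: the `build_corpus` helper name and the UTF-8 encoding are illustrative choices, not taken from the book.

```python
import glob

def build_corpus(pattern="./*.txt"):
    """Concatenate every matching .txt file into one large corpus string."""
    corpus = ""
    for name in sorted(glob.glob(pattern)):
        # Read each book and append its full text to the corpus.
        with open(name, "r", encoding="utf-8") as f:
            corpus += f.read()
    return corpus
```

Sorting the file names keeps the books in a stable order (got1 through got5), so the corpus is reproducible across runs.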