Preparing the data
Preparing the text is a task in its own right, because real-world text is often messy and cannot be fixed with a few simple scaling operations. For instance, people make typos, add unnecessary characters, or use text encodings that we cannot read. NLP involves its own set of data cleaning challenges and techniques.
Sanitizing characters
To store text, computers need to encode the characters into bits. There are several different ways to do this, and not all of them can deal with all the characters out there.
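To see why this matters, here is a minimal sketch (the sample strings are purely illustrative): the same text maps to different bytes under different encodings, a narrow encoding such as Latin-1 cannot represent every character, and decoding bytes with the wrong codec produces garbled output.

```python
text = "naïve café"

utf8_bytes = text.encode("utf-8")      # works: UTF-8 covers all of Unicode
latin1_bytes = text.encode("latin-1")  # works: these characters exist in Latin-1

print(utf8_bytes)    # b'na\xc3\xafve caf\xc3\xa9'
print(latin1_bytes)  # b'na\xefve caf\xe9'

try:
    "日本語".encode("latin-1")  # fails: Latin-1 has no Japanese characters
except UnicodeEncodeError as err:
    print(err)

# Decoding bytes with the wrong codec silently produces mojibake:
print(utf8_bytes.decode("latin-1"))  # 'naÃ¯ve cafÃ©'
```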
It is good practice to keep all the text files in one encoding scheme, usually UTF-8, but of course, that does not always happen. Files might also be corrupted, meaning that a few bits are off, which renders some characters unreadable. So before we do anything else, we need to sanitize our inputs.
Python offers a helpful codecs library, which allows us to deal with different encodings. Our data is UTF-8 encoded...
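As a minimal sketch of how codecs can be used here (the file name "reviews.txt" is hypothetical), we can open a file with an explicit encoding and replace any bytes that cannot be decoded, so a handful of corrupted characters do not crash the whole pipeline:

```python
import codecs

# Open the file with an explicit encoding; undecodable bytes are replaced
# instead of raising an error.
with codecs.open("reviews.txt", mode="r", encoding="utf-8", errors="replace") as handle:
    text = handle.read()

# Undecodable bytes show up as the Unicode replacement character U+FFFD,
# which we can count or strip as part of sanitizing the input.
print(text.count("\ufffd"), "unreadable characters replaced")
```

Passing errors="ignore" instead would drop the unreadable characters entirely; errors="replace" has the advantage of leaving a visible marker so we can measure how much of the data was affected.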