For our NLP experiments, we need some reasonably large texts. I used the complete works of classical writers and statesmen from Project Gutenberg because they are in the public domain, but you can find your own texts and train models on them. If you want to use the same texts as I did, they are included in the supplementary material for this chapter under the Corpuses folder. There should be five of them: Benjamin Franklin, John Galsworthy, Mark Twain, William Shakespeare, and Winston Churchill. Create a new Jupyter notebook and load Mark Twain's corpus as one long string:
import zipfile
zip_ref = zipfile.ZipFile('Corpuses.zip', 'r')
zip_ref.extractall('')
zip_ref.close()

In [1]: import codecs

In [2]: one_long_string = ""
        with codecs.open('Corpuses/MarkTwain.txt', 'r', 'utf-8-sig'...
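The listing above is cut off, so here is a minimal, self-contained sketch of the same idea: reading an entire text file into one string while decoding it as 'utf-8-sig'. It is only an illustration under stated assumptions, not the notebook's actual code. The helper name load_corpus and the temporary sample file are hypothetical; the sample stands in for Corpuses/MarkTwain.txt so that the snippet runs without the archive.

```python
import os
import tempfile


def load_corpus(path, encoding="utf-8-sig"):
    """Read a whole text file into a single string (hypothetical helper)."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark (BOM)
    # if the file starts with one.
    with open(path, "r", encoding=encoding) as text_file:
        return text_file.read()


# Assumption: a small stand-in file replaces the real corpus here.
# Point load_corpus at 'Corpuses/MarkTwain.txt' after extracting the archive.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as sample_file:
        # The first three bytes are the UTF-8 BOM.
        sample_file.write(b"\xef\xbb\xbfSample line one.\nSample line two.\n")
    one_long_string = load_corpus(sample_path)

print(len(one_long_string))
print(repr(one_long_string[:16]))
```

Decoding with 'utf-8-sig' rather than plain 'utf-8' keeps a stray '\ufeff' character from ending up at the start of the string, which would otherwise show up as a spurious extra symbol in any character-level vocabulary built from the text.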