4. Deep Learning for Text – Embeddings
Activity 4.01: Text Preprocessing of the 'Alice in Wonderland' Text
Solution
You need to perform the following steps:
Note
Before commencing this activity, make sure you have defined the alice_raw
variable as demonstrated in the section titled Downloading Text Corpora Using NLTK.
- Convert the data to lowercase and split it into sentences:
txt_sents = tokenize.sent_tokenize(alice_raw.lower())
- Tokenize the sentences:
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]
- Import punctuation from the string module and stopwords from NLTK:
from string import punctuation
stop_punct = list(punctuation)
from nltk.corpus import stopwords
stop_nltk = stopwords.words("english")
- Create a variable holding the contextual stop words -- and said:
stop_context = ["--", "said"]
- Create a master list for the stop words to remove words that contain terms from punctuation, NLTK stop...
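The stop-word steps so far can be sketched roughly as follows. This is a minimal illustration, not the activity's own solution code: stop_nltk_sample is a small hand-written stand-in for stopwords.words("english") so that the sketch runs without NLTK data, and the names stop_all and drop_stop are hypothetical. Note that "--" has to be listed as a contextual stop word because list(punctuation) contains only single characters.

```python
from string import punctuation

# Single-character punctuation marks, as in the step above.
stop_punct = list(punctuation)

# ASSUMPTION: a tiny hand-picked stand-in for stopwords.words("english"),
# used only so this sketch runs without downloading NLTK data.
stop_nltk_sample = ["the", "a", "and", "to", "was", "she", "it"]

# Contextual stop words from the step above.
stop_context = ["--", "said"]

# Hypothetical name: one combined list of every term to drop.
stop_all = stop_punct + stop_nltk_sample + stop_context


def drop_stop(tokens, stop_words=stop_all):
    """Return the tokens that do not appear in stop_words."""
    stop_set = set(stop_words)  # set membership checks are fast
    return [tok for tok in tokens if tok not in stop_set]


# A hand-written, already lowercased and tokenized sentence (not NLTK output).
sample = ["alice", "said", "--", "it", "was", "the", "rabbit", "!"]
print(drop_stop(sample))  # ['alice', 'rabbit']
```

With the real data, the same kind of filter would presumably be applied to each tokenized sentence in txt_words, using stop_nltk in place of the stand-in list.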