Removing stop words from the text
Stop words are very common words in a language, such as the or and in English. They are often removed before applying NLP techniques because they occur frequently but carry little meaning, which can distract from the more informative words in the text.
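To make the idea concrete before moving to Spark, here is a minimal sketch in plain Python. It is not the PySpark approach this recipe uses. The shortened STOP_WORDS set, the remove_stop_words helper, and the sample sentence are all hypothetical and exist only for illustration.

```python
# Hypothetical, shortened stop word list used only for this illustration.
STOP_WORDS = {"the", "and", "a", "is", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


# Hypothetical sample sentence.
sample = "The doctor and the patient agreed to a plan"
filtered = remove_stop_words(sample)
print(filtered)  # ['doctor', 'patient', 'agreed', 'plan']
```

The remaining words are the ones that carry most of the meaning of the sentence, which is the same effect the Spark-based steps aim for at scale.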
Getting ready
This section requires importing the following libraries:
import pyspark.sql.functions as F  # provides F.split and F.col used in the steps that follow
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml import Pipeline
How to do it...
This section walks through the steps to remove stop words.
- Execute the following script to split each message in the chat column into an array of individual words:
# Split the chat text on single spaces and store the resulting array in a new words column
df = df.withColumn('words', F.split(F.col('chat'), ' '))
- Assign a list of common words that will be considered stop words to a variable, stop_words, using the following script:
stop_words = ['i','me','my','myself','we','our','ours','ourselves', 'you','your','yours','yourself','yourselves','he','him', 'his','himself','she','her','hers','herself','it','its', 'itself','they','them','their','theirs','themselves', 'what','which','who','whom','this','that','these','those...