Stemming text data
Stemming reduces each word produced by the tokenizer to a shorter base form, its stem, by stripping common endings. The stem does not have to be a dictionary word.
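To make the idea concrete, here is a deliberately simplified sketch of suffix stripping. It is not the Porter, Lancaster, or Snowball algorithm: the `toy_stem` function, the `SUFFIXES` rule list, and the `min_stem_len` parameter are hypothetical names invented for illustration only.

```python
# Toy suffix stripper for illustration only.
# This is NOT the Porter, Lancaster, or Snowball algorithm.
SUFFIXES = ('ing', 'es', 's')  # hypothetical, illustrative rule list


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


for word in ['playing', 'beaches', 'is', 'dream']:
    print(word, '->', toy_stem(word))
```

Real stemmers apply many more rules and conditions than this, which is why the recipe relies on the implementations that ship with NLTK.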
How to do it...
- Create a new Python file and import the stemmers:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
- Define some words to stem, as follows:
words = ['ability', 'baby', 'college', 'playing', 'is', 'dream', 'election', 'beaches', 'image', 'group', 'happy']
- List the names of the stemmers to be used:
stemmers = ['PORTER', 'LANCASTER', 'SNOWBALL']
- Create an instance of each of the chosen stemmers:
stem_porter = PorterStemmer()
stem_lancaster = LancasterStemmer()
stem_snowball = SnowballStemmer('english')
- Format a table to print the results:
formatted_row = '{:>16}' * (len(stemmers) + 1)
print('\n' + formatted_row.format('WORD', *stemmers) + '\n')
- Iterate over the list of words and stem each one with the chosen stemmers:
for word in words:
    stem_words...
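One way the steps above could fit together is sketched below. It assumes Python 3 with NLTK installed. The `stem_row` helper and the `stemmer_names` and `stemmer_objects` lists are hypothetical names chosen for this sketch, not taken from the recipe, and the loop body is only one plausible reading of the recipe's intent.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

words = ['ability', 'baby', 'college', 'playing', 'is', 'dream',
         'election', 'beaches', 'image', 'group', 'happy']

# Column titles and the matching stemmer instances, in the same order.
stemmer_names = ['PORTER', 'LANCASTER', 'SNOWBALL']
stemmer_objects = [PorterStemmer(), LancasterStemmer(), SnowballStemmer('english')]

# One right-aligned, 16-character column for the word plus one per stemmer.
formatted_row = '{:>16}' * (len(stemmer_names) + 1)


def stem_row(word):
    """Hypothetical helper: the word followed by its stem from each stemmer."""
    return [word] + [stemmer.stem(word) for stemmer in stemmer_objects]


rows = [stem_row(word) for word in words]

# Print the header, then one line per word.
print('\n' + formatted_row.format('WORD', *stemmer_names) + '\n')
for row in rows:
    print(formatted_row.format(*row))
```

Keeping the stemmer instances in a list means another stemmer can be compared by adding one name and one object, without touching the loop.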