Identifying the topic of an article
Counting words is a simple and popular technique that usually gives good results when you want to get a feel for the topic of a body of text. In this recipe, we will count the words in The Seattle Times article we have been working with so far to identify its topic without even reading it.
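As a minimal sketch of the idea, independent of this recipe's own script, the snippet below counts words in a short string using only the standard library. The sample text, the tiny stop-word set, and the count_words name are assumptions made for illustration; they are not the article or the code used in this recipe.

```python
import re
from collections import Counter

# Hypothetical stop-word set; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}


def count_words(text):
    """Return a Counter of lowercase words, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


# Invented sample text, not The Seattle Times article.
sample = (
    "The salmon run in the river is early. "
    "Salmon counts rose, and the river is full of salmon."
)

counts = count_words(sample)
print(counts.most_common(2))
```

The most frequent remaining words hint at the subject of the text, which is why filtering out common function words matters: without it, words such as "the" would dominate the counts.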
Getting ready
To execute this recipe, you will need NLTK, the regular expressions module from Python, NumPy, and Matplotlib. No other prerequisites are required.
How to do it…
The beginning of the code for this recipe is very similar to that of the previous recipe, so we will present only the relevant parts (the nlp_countWords.py file):
# part-of-speech tagging
tagged_sentences = [nltk.pos_tag(w) for w in tokenized]

# extract named entities -- regular expressions approach
tagged = []

pattern = '''
    ENT: {<DT>?(<NNP|NNPS>)+}
'''

tokenizer = nltk.RegexpParser(pattern)

for sent in tagged_sentences:
    ...
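To see what the ENT grammar matches, the following sketch applies the same pattern to a single hand-tagged sentence. The sentence and its tags are invented for illustration (an assumption, so that no tagger model is needed), and the chunker and entities names are hypothetical. It is not a reconstruction of the elided part of the recipe's loop.

```python
import nltk

# Same grammar as in the recipe: an optional determiner (DT) followed by
# one or more proper nouns (NNP or NNPS) forms an ENT chunk.
pattern = '''
    ENT: {<DT>?(<NNP|NNPS>)+}
'''
chunker = nltk.RegexpParser(pattern)

# Hand-tagged, invented sentence (assumption) in (word, tag) form.
sentence = [
    ("The", "DT"), ("Seattle", "NNP"), ("Times", "NNP"),
    ("reported", "VBD"), ("on", "IN"), ("Boeing", "NNP"),
]

# parse() returns a tree whose ENT subtrees hold the matched chunks.
tree = chunker.parse(sentence)

entities = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "ENT"
]
print(entities)
```

Each ENT subtree groups a run of proper nouns, optionally preceded by a determiner, so multi-word names stay together as a single item rather than being split into separate words.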