Python 3 Text Processing with NLTK 3 Cookbook

Product type Paperback
Published in Aug 2014
Publisher
ISBN-13 9781782167853
Length 304 pages
Edition 2nd Edition
Author (1): Jacob Perkins

Table of Contents (12 chapters)

Preface
1. Tokenizing Text and WordNet Basics
2. Replacing and Correcting Words
3. Creating Custom Corpora
4. Part-of-speech Tagging
5. Extracting Chunks
6. Transforming Chunks and Trees
7. Text Classification
8. Distributed Processing and Handling Large Datasets
9. Parsing Specific Data Types
A. Penn Treebank Part-of-speech Tags
Index

Looking up Synsets for a word in WordNet

WordNet is a lexical database for the English language. In other words, it's a dictionary designed specifically for natural language processing.

NLTK comes with a simple interface to look up words in WordNet. What you get is a list of Synset instances, which are groupings of synonymous words that express the same concept. Many words have only one Synset, but some have several. In this recipe, we'll explore a single Synset, and in the next recipe, we'll look at several in more detail.

Getting ready

Be sure you've unzipped the wordnet corpus at nltk_data/corpora/wordnet. This will allow WordNetCorpusReader to access it.
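If you're not sure whether the corpus is installed, a check along the following lines may help. This is only a minimal sketch: the helper name wordnet_available is hypothetical, and it assumes the standard nltk.data.find() lookup, which raises LookupError when a resource cannot be located.

```python
def wordnet_available():
    """Return True if NLTK can locate the wordnet corpus on its data path."""
    try:
        import nltk
    except ImportError:
        return False  # NLTK itself is not installed
    try:
        nltk.data.find('corpora/wordnet')
        return True
    except LookupError:
        return False  # NLTK is installed, but the corpus was not found


if not wordnet_available():
    # One way to fetch the corpus, assuming NLTK is installed and you
    # have network access:
    #   >>> import nltk
    #   >>> nltk.download('wordnet')
    print('wordnet corpus not found')
```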

How to do it...

Now we're going to look up the Synset for cookbook, and explore some of the properties and methods of a Synset using the following code:

>>> from nltk.corpus import wordnet
>>> syn = wordnet.synsets('cookbook')[0]
>>> syn.name()
'cookbook.n.01'
>>> syn.definition()
'a book of recipes and cooking directions'

How it works...

You can look up any word in WordNet using wordnet.synsets(word) to get a list of Synsets. The list may be empty if the word is not found. The list may also have quite a few elements, as some words can have many possible meanings, and, therefore, many Synsets.
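Because the list can be empty, indexing it with [0] raises an IndexError for a word that WordNet doesn't know. A small guard avoids that. The following is a sketch under stated assumptions: first_synset is a hypothetical helper, and the second word in the loop is a made-up string that is assumed to be absent from WordNet.

```python
def first_synset(word, wn):
    """Return the first Synset for `word`, or None if there is no entry.

    `wn` is expected to be nltk.corpus.wordnet, or any object that offers
    a compatible synsets() method.
    """
    synsets = wn.synsets(word)
    return synsets[0] if synsets else None


try:
    from nltk.corpus import wordnet
    for word in ('cookbook', 'qwertyzxcv'):
        syn = first_synset(word, wordnet)
        print(word, '->', syn.name() if syn else 'not found')
except (ImportError, LookupError) as exc:
    # Raised when NLTK or the wordnet corpus is missing
    print('WordNet is not available:', exc)
```

Passing the corpus reader in as an argument keeps the helper easy to try out with a stand-in object when the corpus isn't installed.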

There's more...

Each Synset in the list has a number of methods you can use to learn more about it. The name() method will give you a unique name for the Synset, which you can use to get the Synset directly:

>>> wordnet.synset('cookbook.n.01')
Synset('cookbook.n.01')

The definition() method should be self-explanatory. Every Synset also has an examples() method, which returns a list of phrases that use the word in context. The list is empty when WordNet provides no examples for that Synset:

>>> wordnet.synsets('cooking')[0].examples()
['cooking can be a great art', 'people are needed who have experience in cookery', 'he left the preparation of meals to his wife']
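To see these methods side by side, the following sketch collects the name, definition, and examples of every Synset for a word. The helper name summarize is hypothetical, and it relies only on the name(), definition(), and examples() methods covered in this recipe.

```python
def summarize(word, wn):
    """Return one dict per Synset of `word`: name, definition, and examples."""
    return [
        {
            'name': syn.name(),
            'definition': syn.definition(),
            'examples': syn.examples(),  # may be an empty list
        }
        for syn in wn.synsets(word)
    ]


try:
    from nltk.corpus import wordnet
    for entry in summarize('cooking', wordnet):
        print(entry['name'], '-', entry['definition'])
        for example in entry['examples']:
            print('    e.g.', example)
except (ImportError, LookupError) as exc:
    # Raised when NLTK or the wordnet corpus is missing
    print('WordNet is not available:', exc)
```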

Working with hypernyms

Synsets are organized in a structure similar to that of an inheritance tree. More abstract terms are known as hypernyms and more specific terms are hyponyms. This tree can be traced all the way up to a root hypernym.

Hypernyms provide a way to categorize and group words based on their similarity to each other. The Calculating WordNet Synset similarity recipe details the functions used to calculate similarity based on the distance between two words in the hypernym tree. For now, you can inspect the hypernyms and hyponyms of a Synset directly:

>>> syn.hypernyms()
[Synset('reference_book.n.01')]
>>> syn.hypernyms()[0].hyponyms()
[Synset('annual.n.02'), Synset('atlas.n.02'), Synset('cookbook.n.01'), Synset('directory.n.01'), Synset('encyclopedia.n.01'), Synset('handbook.n.01'), Synset('instruction_book.n.01'), Synset('source_book.n.01'), Synset('wordbook.n.01')]
>>> syn.root_hypernyms()
[Synset('entity.n.01')]

As you can see, reference_book is a hypernym of cookbook, but cookbook is only one of the many hyponyms of reference_book. And all these types of books have the same root hypernym, which is entity, one of the most abstract terms in the English language. You can trace the entire path from entity down to cookbook using the hypernym_paths() method, as follows:

>>> syn.hypernym_paths()
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('creation.n.02'), Synset('product.n.02'), Synset('work.n.02'), Synset('publication.n.01'), Synset('book.n.01'), Synset('reference_book.n.01'), Synset('cookbook.n.01')]]

The hypernym_paths() method returns a list of lists, where each list starts at the root hypernym and ends with the original Synset. Most of the time, you'll get only one nested list of Synsets. You get more than one when the Synset, or one of its ancestors, has several hypernyms.
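Paths are easier to read as names than as Synset objects. The following sketch converts each path into a list of names and prints the number of steps from the root alongside it. The helper name hypernym_chains is hypothetical, and it uses only the hypernym_paths() and name() methods.

```python
def hypernym_chains(syn):
    """Return every hypernym path of `syn` as a list of Synset names.

    Each inner list starts at the root hypernym and ends with `syn` itself,
    the same order that hypernym_paths() uses.
    """
    return [[s.name() for s in path] for path in syn.hypernym_paths()]


try:
    from nltk.corpus import wordnet
    syn = wordnet.synsets('cookbook')[0]
    for chain in hypernym_chains(syn):
        # len(chain) - 1 is the number of steps from the root to the Synset
        print(len(chain) - 1, ' > '.join(chain))
except (ImportError, LookupError) as exc:
    # Raised when NLTK or the wordnet corpus is missing
    print('WordNet is not available:', exc)
```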

Part of speech (POS)

You can also look up a simplified part-of-speech tag as follows:

>>> syn.pos()
'n'

There are four common part-of-speech tags (or POS tags) found in WordNet, as shown in the following table:

Part of speech    Tag
Noun              n
Adjective         a
Adverb            r
Verb              v

These POS tags can be used to look up specific Synsets for a word. For example, the word 'great' can be used as a noun or an adjective. In WordNet, 'great' has 1 noun Synset and 6 adjective Synsets, as shown in the following code:

>>> len(wordnet.synsets('great'))
7
>>> len(wordnet.synsets('great', pos='n'))
1
>>> len(wordnet.synsets('great', pos='a'))
6

These POS tags will be referenced more in the Using WordNet for tagging recipe in Chapter 4, Part-of-speech Tagging.
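Instead of filtering one tag at a time, you can group all the Synsets of a word by the tag each one reports. The following is a sketch, and synset_counts_by_pos is a hypothetical helper. Note that pos() may also return 's' (adjective satellite, which NLTK exposes as wordnet.ADJ_SAT), so the keys you see may not be limited to the four tags in the table.

```python
from collections import Counter


def synset_counts_by_pos(word, wn):
    """Count the Synsets of `word`, grouped by the tag each reports via pos()."""
    return Counter(syn.pos() for syn in wn.synsets(word))


try:
    from nltk.corpus import wordnet
    counts = synset_counts_by_pos('great', wordnet)
    for tag, count in sorted(counts.items()):
        print(tag, count)
except (ImportError, LookupError) as exc:
    # Raised when NLTK or the wordnet corpus is missing
    print('WordNet is not available:', exc)
```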

See also

In the next two recipes, we'll explore lemmas and how to calculate Synset similarity. And in Chapter 2, Replacing and Correcting Words, we'll use WordNet for lemmatization, synonym replacement, and then explore the use of antonyms.
