Python 3 Text Processing with NLTK 3 Cookbook

Product type: Paperback
Published: Aug 2014
ISBN-13: 9781782167853
Length: 304 pages
Edition: 2nd Edition
Author: Jacob Perkins


Calculating WordNet Synset similarity

Synsets are organized in a hypernym tree. This tree can be used for reasoning about the similarity between the Synsets it contains. The closer the two Synsets are in the tree, the more similar they are.

How to do it...

If you were to look at all the hyponyms of reference_book (which is the hypernym of cookbook), you'd see that one of them is instruction_book. This seems intuitively very similar to a cookbook, so let's see what WordNet similarity has to say about it with the help of the following code:

>>> from nltk.corpus import wordnet
>>> cb = wordnet.synset('cookbook.n.01')
>>> ib = wordnet.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)
0.9166666666666666

So they are over 91% similar!

How it works...

The wup_similarity method is short for Wu-Palmer Similarity, which is a scoring method based on how similar the word senses are and where the Synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two Synsets and their common hypernym:

>>> ref = cb.hypernyms()[0]
>>> cb.shortest_path_distance(ref)
1
>>> ib.shortest_path_distance(ref)
1
>>> cb.shortest_path_distance(ib)
2

So cookbook and instruction_book must be very similar, because they are only one step away from the same reference_book hypernym, and, therefore, only two steps away from each other.
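
To make the arithmetic concrete, here is a minimal sketch of the Wu-Palmer ratio as it is commonly stated: twice the depth of the deepest common hypernym, divided by the sum of the depths of the two Synsets. The function name wu_palmer and the depth of 11 for reference_book are assumptions for illustration (11 is the value consistent with the score shown earlier); the code does not query WordNet.

```python
def wu_palmer(lcs_depth, dist_a, dist_b):
    """Wu-Palmer ratio from the depth of the deepest common hypernym
    and each Synset's distance (in edges) below that hypernym."""
    depth_a = lcs_depth + dist_a
    depth_b = lcs_depth + dist_b
    return 2 * lcs_depth / (depth_a + depth_b)


# Assumed depth of reference_book, counting the root (entity) as depth 1.
REFERENCE_BOOK_DEPTH = 11

# cookbook and instruction_book each sit one step below reference_book.
score = wu_palmer(REFERENCE_BOOK_DEPTH, 1, 1)
print(score)
```

With these inputs the ratio is 22/24, which agrees with the wup_similarity result above. The deeper the shared hypernym, the less each extra step between the Synsets lowers the score.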

There's more...

Let's look at two dissimilar words to see what kind of score we get. We'll compare dog with cookbook, two seemingly very different words.

>>> dog = wordnet.synsets('dog')[0]
>>> dog.wup_similarity(cb)
0.38095238095238093

Wow, dog and cookbook are apparently 38% similar! This is because they share common hypernyms further up the tree:

>>> sorted(dog.common_hypernyms(cb))
[Synset('entity.n.01'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('whole.n.02')]
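
The 38% figure can be reproduced by hand. The sketch below uses two hand-written hypernym paths, one for dog and one for cookbook. These lists are assumptions for illustration rather than output from WordNet, so check them against your own installation. The code finds the shared hypernyms, picks the deepest one, and applies the Wu-Palmer ratio (twice the shared depth over the sum of both depths).

```python
# Hand-written hypernym paths from the root down (assumed, not queried).
dog_path = ["entity", "physical_entity", "object", "whole",
            "living_thing", "organism", "animal", "domestic_animal", "dog"]
cookbook_path = ["entity", "physical_entity", "object", "whole",
                 "artifact", "creation", "product", "work",
                 "publication", "book", "reference_book", "cookbook"]

# Hypernyms that appear on both paths.
common = sorted(set(dog_path) & set(cookbook_path))

# The deepest shared hypernym is the one furthest from the root.
deepest = max(common, key=dog_path.index)
lcs_depth = dog_path.index(deepest) + 1  # the root counts as depth 1

# Wu-Palmer: twice the shared depth over the sum of both depths.
score = 2 * lcs_depth / (len(dog_path) + len(cookbook_path))
print(common, deepest, score)
```

Under these assumed paths the deepest shared hypernym is whole at depth 4, and the ratio works out to 8/21, in line with the score returned above.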

Comparing verbs

The previous comparisons were all between nouns, but the same can be done for verbs as well:

>>> cook = wordnet.synset('cook.v.01')
>>> bake = wordnet.synset('bake.v.02')
>>> cook.wup_similarity(bake)
0.6666666666666666

The previous Synsets were obviously handpicked for demonstration, and the reason is that the hypernym tree for verbs has a lot more breadth and a lot less depth. While most nouns can be traced up to the hypernym object, thereby providing a basis for similarity, many verbs do not share common hypernyms, making WordNet unable to calculate the similarity. For example, if you were to use the Synset for bake.v.01 in the previous code, instead of bake.v.02, the return value would be None. This is because the root hypernyms of both the Synsets are different, with no overlapping paths. For this reason, you also cannot calculate the similarity between words with different parts of speech.
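
Because a similarity call can return None, code that compares many Synsets usually needs a guard. The following is a minimal sketch: similarity_or_default is a hypothetical helper, and FakeSynset is a stand-in class (with made-up root names and scores) used only so the example runs without WordNet data. With NLTK you would pass real Synset objects instead.

```python
def similarity_or_default(a, b, default=0.0):
    """Return a.wup_similarity(b), or default when no score exists."""
    score = a.wup_similarity(b)
    return default if score is None else score


class FakeSynset:
    """Hypothetical stand-in for a Synset with a single root hypernym."""

    def __init__(self, name, root):
        self.name = name
        self.root = root

    def wup_similarity(self, other):
        # Mimic the behaviour described above: no shared root, no score.
        if self.root != other.root:
            return None
        # Arbitrary placeholder scores, not real WordNet values.
        return 1.0 if self.name == other.name else 0.5


cook = FakeSynset("cook.v.01", root="root_a")
bake = FakeSynset("bake.v.01", root="root_b")

print(cook.wup_similarity(bake))          # no shared root, so no score
print(similarity_or_default(cook, bake))  # falls back to the default
```

Whether a missing score should count as 0.0 or be skipped entirely depends on the application, which is why the default is a parameter.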

Path and Leacock-Chodorow (LCH) similarity

Two other similarity comparisons are the path similarity and the LCH similarity, as shown in the following code:

>>> cb.path_similarity(ib)
0.3333333333333333
>>> cb.path_similarity(dog)
0.07142857142857142
>>> cb.lch_similarity(ib)
2.538973871058276
>>> cb.lch_similarity(dog)
0.9985288301111273

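As you can see, the number ranges are very different for these scoring methods: path_similarity stays between 0 and 1, while lch_similarity is a log-scaled score that can exceed 1. This is why I prefer the wup_similarity method.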
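
The difference in ranges follows from how each score is derived from the shortest path distance. The sketch below assumes the usual definitions: path similarity is 1 / (distance + 1), and LCH similarity is -log((distance + 1) / (2 * depth)), where depth is the maximum depth of the taxonomy. The depth of 19 and the distance of 13 between cookbook and dog are assumptions chosen to be consistent with the scores above, and the function names are illustrative.

```python
import math


def path_score(distance):
    """Path similarity from a shortest path distance (in edges)."""
    return 1 / (distance + 1)


def lch_score(distance, taxonomy_depth):
    """Leacock-Chodorow similarity from a distance and taxonomy depth."""
    return -math.log((distance + 1) / (2 * taxonomy_depth))


NOUN_DEPTH = 19  # assumed maximum depth of the noun taxonomy

# Distances from cookbook: 2 comes from the recipe, 13 is assumed.
for label, distance in [("instruction_book", 2), ("dog", 13)]:
    print(label, path_score(distance), lch_score(distance, NOUN_DEPTH))
```

Under these definitions path similarity can never exceed 1, while the LCH score grows as the path gets shorter relative to the taxonomy depth, reaching log(2 * depth) for identical Synsets.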

See also

The recipe on Looking up Synsets for a word in WordNet has more details about hypernyms and the hypernym tree.
