You're reading from Python 3 Text Processing with NLTK 3 Cookbook

Product type Paperback

Published in Aug 2014

Publisher

ISBN-13 9781782167853

Length 304 pages

Edition 2nd Edition

Languages

Python

Tools

NLTK

Concepts

Data Processing

Author (1):

Jacob Perkins

View More author details

Table of Contents (12) Chapters

Preface

1. Tokenizing Text and WordNet Basics FREE CHAPTER

2. Replacing and Correcting Words

3. Creating Custom Corpora

4. Part-of-speech Tagging

5. Extracting Chunks

6. Transforming Chunks and Trees

7. Text Classification

8. Distributed Processing and Handling Large Datasets

9. Parsing Specific Data Types

A. Penn Treebank Part-of-speech Tags

Index

Calculating WordNet Synset similarity

Synsets are organized in a hypernym tree. This tree can be used for reasoning about the similarity between the Synsets it contains. The closer the two Synsets are in the tree, the more similar they are.

How to do it...

If you were to look at all the hyponyms of reference_book (which is the hypernym of cookbook), you'd see that one of them is instruction_book. This seems intuitively very similar to a cookbook, so let's see what WordNet similarity has to say about it with the help of the following code:

>>> from nltk.corpus import wordnet
>>> cb = wordnet.synset('cookbook.n.01')
>>> ib = wordnet.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)
0.9166666666666666

So they are over 91% similar!

How it works...

The wup_similarity method is short for Wu-Palmer Similarity, which is a scoring method based on how similar the word senses are and where the Synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two Synsets and their common hypernym:

>>> ref = cb.hypernyms()[0]
>>> cb.shortest_path_distance(ref)
1
>>> ib.shortest_path_distance(ref)
1
>>> cb.shortest_path_distance(ib)
2

So cookbook and instruction_book must be very similar, because they are only one step away from the same reference_book hypernym, and, therefore, only two steps away from each other.

There's more...

Let's look at two dissimilar words to see what kind of score we get. We'll compare dog with cookbook, two seemingly very different words.

>>> dog = wordnet.synsets('dog')[0]
>>> dog.wup_similarity(cb)
0.38095238095238093

Wow, dog and cookbook are apparently 38% similar! This is because they share common hypernyms further up the tree:

>>> sorted(dog.common_hypernyms(cb))
[Synset('entity.n.01'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('whole.n.02')]

Comparing verbs

The previous comparisons were all between nouns, but the same can be done for verbs as well:

>>> cook = wordnet.synset('cook.v.01')
>>> bake = wordnet.0('bake.v.02')
>>> cook.wup_similarity(bake)
00.6666666666666666

The previous Synsets were obviously handpicked for demonstration, and the reason is that the hypernym tree for verbs has a lot more breadth and a lot less depth. While most nouns can be traced up to the hypernym object, thereby providing a basis for similarity, many verbs do not share common hypernyms, making WordNet unable to calculate the similarity. For example, if you were to use the Synset for bake.v.01 in the previous code, instead of bake.v.02, the return value would be None. This is because the root hypernyms of both the Synsets are different, with no overlapping paths. For this reason, you also cannot calculate the similarity between words with different parts of speech.

Path and Leacock Chordorow (LCH) similarity

Two other similarity comparisons are the path similarity and the LCH similarity, as shown in the following code:

>>> cb.path_similarity(ib)
0.3333333333333333
>>> cb.path_similarity(dog)
0.07142857142857142
>>> cb.lch_similarity(ib)
2.538973871058276
>>> cb.lch_similarity(dog)
0.9985288301111273

As you can see, the number ranges are very different for these scoring methods, which is why I prefer the wup_similarity method.