You're reading from Python Natural Language Processing Cookbook Over 50 recipes to understand, analyze, and generate text for implementing language processing tasks

Product type Paperback

Published in Mar 2021

Publisher Packt

ISBN-13 9781838987312

Length 284 pages

Edition 1st Edition

Languages

Processing

Tools

Processing

Concepts

Mobile Application Development

Author (1):

Zhenya Antić

View More author details

Table of Contents (10) Chapters

Preface

1. Chapter 1: Learning NLP Basics

2. Chapter 2: Playing with Grammar FREE CHAPTER

3. Chapter 3: Representing Text – Capturing Semantics

4. Chapter 4: Classifying Texts

5. Chapter 5: Getting Started with Information Extraction

6. Chapter 6: Topic Modeling

7. Chapter 7: Building Chatbots

8. Chapter 8: Visualizing Text Data

9. Other Books You May Enjoy

Extracting entities and relations

It is possible to extract triplets of the subject entity-relation-object entity from documents, which are frequently used in knowledge graphs. These triplets can then be analyzed for further relations and inform other NLP tasks, such as searches.

Getting ready

For this recipe, we will need another Python package based on spacy, called textacy. The main advantage of this package is that it allows regular expression-like searching for tokens based on their part of speech tags. See the installation instructions in the Technical requirements section at the beginning of this chapter for more information.

How to do it…

We will find all verb phrases in the text, as well as all the noun phrases (see the previous section). Then, we will find the left noun phrase (subject) and the right noun phrase (object) that relate to a particular verb phrase. We will use two simple sentences, All living things are made of cells and Cells have organelles. Follow these steps:

Import spaCy and textacy:

import spacy
import textacy
from Chapter02.split_into_clauses import find_root_of_sentence

Load the spacy engine:
```
nlp = spacy.load('en_core_web_sm')
```

We will get a list of sentences that we will be processing:

sentences = ["All living things are made of cells.", 
             "Cells have organelles."]

In order to find verb phrases, we will need to compile regular expression-like patterns for the part of speech combinations of the words that make up the verb phrase. If we print out parts of speech of verb phrases of the two preceding sentences, are made of and have, we will see that the part of speech sequences are AUX, VERB, ADP, and AUX.
```
verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, 
                  {"POS":"ADP"}], 
                 [{"POS":"AUX"}]]
```

The contains_root function checks if a verb phrase contains the root of the sentence:

def contains_root(verb_phrase, root):
    vp_start = verb_phrase.start
    vp_end = verb_phrase.end
    if (root.i >= vp_start and root.i <= vp_end):
        return True
    else:
        return False

The get_verb_phrases function gets the verb phrases from a spaCy Doc object:

def get_verb_phrases(doc):
    root = find_root_of_sentence(doc)
    verb_phrases = textacy.extract.matches(doc, 
                                           verb_patterns)
    new_vps = []
    for verb_phrase in verb_phrases:
        if (contains_root(verb_phrase, root)):
            new_vps.append(verb_phrase)
    return new_vps

The longer_verb_phrase function finds the longest verb phrase:

def longer_verb_phrase(verb_phrases):
    longest_length = 0
    longest_verb_phrase = None
    for verb_phrase in verb_phrases:
        if len(verb_phrase) > longest_length:
            longest_verb_phrase = verb_phrase
    return longest_verb_phrase

The find_noun_phrase function will look for noun phrases either on the left- or right-hand side of the main verb phrase:

def find_noun_phrase(verb_phrase, noun_phrases, side):
    for noun_phrase in noun_phrases:
        if (side == "left" and \
            noun_phrase.start < verb_phrase.start):
            return noun_phrase
        elif (side == "right" and \
              noun_phrase.start > verb_phrase.start):
            return noun_phrase

In this function, we will use the preceding functions to find triplets of subject-relation-object in the sentences:

def find_triplet(sentence):
    doc = nlp(sentence)
    verb_phrases = get_verb_phrases(doc)
    noun_phrases = doc.noun_chunks
    verb_phrase = None
    if (len(verb_phrases) > 1):
        verb_phrase = \
        longer_verb_phrase(list(verb_phrases))
    else:
        verb_phrase = verb_phrases[0]
    left_noun_phrase = find_noun_phrase(verb_phrase, 
                                        noun_phrases, 
                                        "left")
    right_noun_phrase = find_noun_phrase(verb_phrase, 
                                         noun_phrases, 
                                         "right")
    return (left_noun_phrase, verb_phrase, 
            right_noun_phrase)

We can now loop through our sentence list to find its relation triplets:

for sentence in sentences:
    (left_np, vp, right_np) = find_triplet(sentence)
    print(left_np, "\t", vp, "\t", right_np)

The result will be as follows:

All living things        are made of     cells
Cells    have    organelles

How it works…

The code finds triplets of subject-relation-object by looking for the root verb phrase and finding its surrounding nouns. The verb phrases are found using the textacy package, which provides a very useful tool for finding patterns of words of certain parts of speech. In effect, we can use it to write small grammars describing the necessary phrases.

Important note

The textacy package, while very useful, is not bug-free, so use it with caution.

Once the verb phrases have been found, we can prune through the sentence noun chunks to find those that are around the verb phrase containing the root.

A step-by-step explanation follows.

In step 1, we import the necessary packages and the find_root_of_sentence function from the previous recipe. In step 2, we initialize the spacy engine, and in step 3, we initialize a list with the sentences we will be using.

In step 4, we compile part of speech patterns that we will use for finding relations. For these two sentences, the patterns are AUX, VERB, ADP, and AUX.

In step 5, we create the contains_root function, which will make sure that a verb phrase contains the root of the sentence. It does that by checking the index of the root and making sure that it falls within the verb phrase span boundaries.

In step 6, we create the get_verb_phrases function, which extracts all the verb phrases from the Doc object that is passed in. It uses the part of speech patterns we created in step 4.

In step 7, we create the longer_verb_phrase function, which will find the longest verb phrase from a list. We do this because some verb phrases might be shorter than necessary. For example, in the sentence All living things are made of cells, both are and are made of will be found.

In step 8, we create the find_noun_phrase function, which finds noun phrases on either side of the verb. We specify the side as a parameter.

In step 9, we create the find_triplet function, which will find triplets of subject-relation-object in a sentence. In this function, first, we process the sentence with spaCy. Then, we use the functions defined in the previous steps to find the longest verb phrase and the nouns to the left- and right-hand sides of it.

In step 10, we apply the find_triplet function to the two sentences we defined at the beginning. The resulting triplets are correct.

In this recipe, we made a few assumptions that will not always be correct. The first assumption is that there will only be one main verb phrase. The second assumption is that there will be a noun chunk on either side of the verb phrase. Once we start working with sentences that are complex or compound, or contain relative clauses, these assumptions no longer hold. I leave it as an exercise for you to work with more complex cases.

There's more…

Once you've parsed out the entities and relations, you might want to input them into a knowledge graph for further use. There are a variety of tools you can use to work with knowledge graphs, such as neo4j.

You're reading from Python Natural Language Processing Cookbook Over 50 recipes to understand, analyze, and generate text for implementing language processing tasks

Table of Contents (10) Chapters

Extracting entities and relations

Getting ready

How to do it…

How it works…

There's more…

Authors (1)

Other recommended products