Python Natural Language Processing Cookbook
Over 50 recipes to understand, analyze, and generate text for implementing language processing tasks

Author: Zhenya Antić
Published: March 2021 | Publisher: Packt | 1st Edition | 284 pages | ISBN-13: 9781838987312

Table of Contents
Preface
Chapter 1: Learning NLP Basics
Chapter 2: Playing with Grammar
Chapter 3: Representing Text – Capturing Semantics
Chapter 4: Classifying Texts
Chapter 5: Getting Started with Information Extraction
Chapter 6: Topic Modeling
Chapter 7: Building Chatbots
Chapter 8: Visualizing Text Data
Other Books You May Enjoy

Extracting entities and relations

It is possible to extract triplets of the form subject entity-relation-object entity from documents; such triplets are frequently used in knowledge graphs. They can then be analyzed for further relations and can inform other NLP tasks, such as search.

Getting ready

For this recipe, we will need another Python package, textacy, which is built on top of spaCy. Its main advantage is that it allows regular expression-like searching for tokens based on their part-of-speech tags. See the installation instructions in the Technical requirements section at the beginning of this chapter for more information.
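
If you have not set the packages up yet, a typical pip-based installation looks something like the following; the exact package versions used in this book are listed in the Technical requirements section:

    pip install spacy textacy
    python -m spacy download en_core_web_sm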

How to do it…

We will find all the verb phrases in the text, as well as all the noun phrases (see the previous recipe). Then, we will find the left noun phrase (the subject) and the right noun phrase (the object) that relate to a particular verb phrase. We will use two simple sentences: All living things are made of cells and Cells have organelles. Follow these steps:

  1. Import spaCy and textacy:
    import spacy
    import textacy
    from Chapter02.split_into_clauses import find_root_of_sentence
  2. Load the spaCy engine:
    nlp = spacy.load('en_core_web_sm')
  3. Define the list of sentences that we will process:
    sentences = ["All living things are made of cells.", 
                 "Cells have organelles."]
  4. In order to find verb phrases, we will need to compile regular expression-like patterns for the part-of-speech combinations of the words that make up those phrases. If we print out the parts of speech of the verb phrases in the two preceding sentences, are made of and have, we will see that the part-of-speech sequences are AUX VERB ADP (for are made of) and AUX (for have):
    verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, 
                      {"POS":"ADP"}], 
                     [{"POS":"AUX"}]]
  5. The contains_root function checks whether a verb phrase contains the root of the sentence:
    def contains_root(verb_phrase, root):
        vp_start = verb_phrase.start
        vp_end = verb_phrase.end
        # a Span's end index is exclusive, so the root is inside
        # the phrase if vp_start <= root.i < vp_end
        return vp_start <= root.i < vp_end
  6. The get_verb_phrases function gets the verb phrases from a spaCy Doc object:
    def get_verb_phrases(doc):
        root = find_root_of_sentence(doc)
        verb_phrases = textacy.extract.matches(doc, 
                                               verb_patterns)
        new_vps = []
        for verb_phrase in verb_phrases:
            if (contains_root(verb_phrase, root)):
                new_vps.append(verb_phrase)
        return new_vps
  7. The longer_verb_phrase function finds the longest verb phrase in a list:
    def longer_verb_phrase(verb_phrases):
        longest_length = 0
        longest_verb_phrase = None
        for verb_phrase in verb_phrases:
            if len(verb_phrase) > longest_length:
                longest_length = len(verb_phrase)
                longest_verb_phrase = verb_phrase
        return longest_verb_phrase
  8. The find_noun_phrase function will look for noun phrases either on the left- or right-hand side of the main verb phrase:
    def find_noun_phrase(verb_phrase, noun_phrases, side):
        # return the first noun phrase that starts before ("left")
        # or after ("right") the start of the verb phrase
        for noun_phrase in noun_phrases:
            if (side == "left" and
                    noun_phrase.start < verb_phrase.start):
                return noun_phrase
            elif (side == "right" and
                    noun_phrase.start > verb_phrase.start):
                return noun_phrase
  9. In this function, we will use the preceding functions to find subject-relation-object triplets in the sentences:
    def find_triplet(sentence):
        doc = nlp(sentence)
        verb_phrases = get_verb_phrases(doc)
        # doc.noun_chunks is a generator; turn it into a list so
        # that it can be iterated over more than once
        noun_phrases = list(doc.noun_chunks)
        if len(verb_phrases) > 1:
            verb_phrase = longer_verb_phrase(verb_phrases)
        else:
            verb_phrase = verb_phrases[0]
        left_noun_phrase = find_noun_phrase(verb_phrase, 
                                            noun_phrases, 
                                            "left")
        right_noun_phrase = find_noun_phrase(verb_phrase, 
                                             noun_phrases, 
                                             "right")
        return (left_noun_phrase, verb_phrase, 
                right_noun_phrase)
  10. We can now loop through our sentence list and find the relation triplet for each sentence:
    for sentence in sentences:
        (left_np, vp, right_np) = find_triplet(sentence)
        print(left_np, "\t", vp, "\t", right_np)
  11. The result will be as follows:
    All living things        are made of     cells
    Cells    have    organelles

How it works…

The code finds subject-relation-object triplets by looking for the verb phrase that contains the root of the sentence and then finding the noun phrases surrounding it. The verb phrases are found using the textacy package, which provides a very useful tool for finding patterns of words with certain part-of-speech tags. In effect, we can use it to write small grammars describing the necessary phrases.
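
As a minimal sketch of how such a "grammar" can be written (the sentence and pattern here are illustrative and not part of the recipe), here is a pattern matching an adjective followed by a noun, using the same textacy.extract.matches call as in step 6:

    doc = nlp("Cells have tiny organelles.")
    # one pattern: an adjective immediately followed by a noun
    patterns = [[{"POS": "ADJ"}, {"POS": "NOUN"}]]
    for span in textacy.extract.matches(doc, patterns):
        print(span.text)  # prints: tiny organelles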

Important note

The textacy package, while very useful, is not bug-free, so use it with caution.

Once the verb phrases have been found, we can filter the sentence's noun chunks down to those that surround the verb phrase containing the root.
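
To see which noun chunks spaCy produces for our first sentence, along with their span boundaries, you can run a quick check like the following (the printed indices depend on the model's tokenization):

    doc = nlp("All living things are made of cells.")
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text, noun_chunk.start, noun_chunk.end)
    # expected output, roughly:
    #   All living things 0 3
    #   cells 6 7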

A step-by-step explanation follows.

In step 1, we import the necessary packages and the find_root_of_sentence function from the previous recipe. In step 2, we initialize the spaCy engine, and in step 3, we initialize a list with the sentences we will be using.

In step 4, we compile the part-of-speech patterns that we will use for finding relations. For these two sentences, the patterns are AUX VERB ADP (matching are made of) and AUX (matching have).
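
You can verify these tags yourself by printing the part of speech of each token; note that the exact tags can vary slightly between spaCy model versions:

    doc = nlp("All living things are made of cells.")
    for token in doc:
        print(token.text, token.pos_)
    # the tokens that matter here: are AUX, made VERB, of ADP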

In step 5, we create the contains_root function, which will make sure that a verb phrase contains the root of the sentence. It does that by checking the index of the root and making sure that it falls within the verb phrase span boundaries.

In step 6, we create the get_verb_phrases function, which extracts all the verb phrases from the Doc object that is passed in. It uses the part of speech patterns we created in step 4.

In step 7, we create the longer_verb_phrase function, which will find the longest verb phrase in a list. We do this because some of the matches might be shorter than necessary. For example, in the sentence All living things are made of cells, the patterns will match both are and are made of.
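
You can see both matches by running the patterns from step 4 directly against the sentence (the order in which the matches are returned may vary):

    doc = nlp("All living things are made of cells.")
    for verb_phrase in textacy.extract.matches(doc, verb_patterns):
        print(verb_phrase.text)
    # prints both candidate phrases: "are" and "are made of"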

In step 8, we create the find_noun_phrase function, which finds noun phrases on either side of the verb. We specify the side as a parameter.

In step 9, we create the find_triplet function, which will find triplets of subject-relation-object in a sentence. In this function, first, we process the sentence with spaCy. Then, we use the functions defined in the previous steps to find the longest verb phrase and the nouns to the left- and right-hand sides of it.

In step 10, we apply the find_triplet function to the two sentences we defined at the beginning. The resulting triplets are correct.

In this recipe, we made a few assumptions that will not always be correct. The first assumption is that there will only be one main verb phrase. The second assumption is that there will be a noun chunk on either side of the verb phrase. Once we start working with sentences that are complex or compound, or contain relative clauses, these assumptions no longer hold. I leave it as an exercise for you to work with more complex cases.

There's more…

Once you've parsed out the entities and relations, you might want to input them into a knowledge graph for further use. There are a variety of tools you can use to work with knowledge graphs, such as neo4j.
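
As a rough sketch of what loading a triplet into neo4j could look like, here is an example using the official neo4j Python driver; the connection URI, the credentials, and the Entity/RELATION schema are placeholder assumptions, not part of this recipe:

    from neo4j import GraphDatabase

    # placeholder connection details for a local neo4j instance
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def add_triplet(tx, subj, relation, obj):
        # MERGE avoids creating duplicate nodes and relations for
        # entities that appear in more than one triplet
        tx.run("MERGE (s:Entity {name: $subj}) "
               "MERGE (o:Entity {name: $obj}) "
               "MERGE (s)-[:RELATION {label: $relation}]->(o)",
               subj=subj, relation=relation, obj=obj)

    with driver.session() as session:
        session.write_transaction(add_triplet, "All living things",
                                  "are made of", "cells")
    driver.close()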
