Extracting entities and relations
It is possible to extract subject-relation-object triplets from documents; such triplets are frequently used in knowledge graphs. They can then be analyzed for further relations and inform other NLP tasks, such as search.
Getting ready
For this recipe, we will need another Python package based on spacy, called textacy. The main advantage of this package is that it allows regular expression-like searching for tokens based on their part of speech tags. See the installation instructions in the Technical requirements section at the beginning of this chapter for more information.
How to do it…
We will find all verb phrases in the text, as well as all the noun phrases (see the previous section). Then, we will find the left noun phrase (subject) and the right noun phrase (object) that relate to a particular verb phrase. We will use two simple sentences, All living things are made of cells and Cells have organelles. Follow these steps:
- Import spaCy and textacy:
    import spacy
    import textacy
    from Chapter02.split_into_clauses import find_root_of_sentence
- Load the spacy engine:
    nlp = spacy.load('en_core_web_sm')
- We will get a list of sentences that we will be processing:
    sentences = ["All living things are made of cells.",
                 "Cells have organelles."]
- In order to find verb phrases, we will need to compile regular expression-like patterns for the part of speech combinations of the words that make up the verb phrase. If we print out the parts of speech of the verb phrases in the two preceding sentences, are made of and have, we will see that the part of speech sequences are AUX, VERB, ADP for the first and AUX for the second:
    verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}],
                     [{"POS":"AUX"}]]
- The contains_root function checks if a verb phrase contains the root of the sentence:
    def contains_root(verb_phrase, root):
        # A span's end index is exclusive, so the root must lie in [start, end)
        return verb_phrase.start <= root.i < verb_phrase.end
- The get_verb_phrases function gets the verb phrases from a spaCy Doc object:
    def get_verb_phrases(doc):
        root = find_root_of_sentence(doc)
        verb_phrases = textacy.extract.matches(doc, verb_patterns)
        new_vps = []
        for verb_phrase in verb_phrases:
            if (contains_root(verb_phrase, root)):
                new_vps.append(verb_phrase)
        return new_vps
- The longer_verb_phrase function finds the longest verb phrase:
    def longer_verb_phrase(verb_phrases):
        longest_length = 0
        longest_verb_phrase = None
        for verb_phrase in verb_phrases:
            if len(verb_phrase) > longest_length:
                # Record the new maximum so a later, shorter phrase cannot replace it
                longest_length = len(verb_phrase)
                longest_verb_phrase = verb_phrase
        return longest_verb_phrase
- The find_noun_phrase function will look for noun phrases either on the left- or right-hand side of the main verb phrase:
    def find_noun_phrase(verb_phrase, noun_phrases, side):
        for noun_phrase in noun_phrases:
            if (side == "left" and noun_phrase.start < verb_phrase.start):
                return noun_phrase
            elif (side == "right" and noun_phrase.start > verb_phrase.start):
                return noun_phrase
        # No noun phrase was found on the requested side
        return None
- In this function, we will use the preceding functions to find triplets of subject-relation-object in the sentences:
    def find_triplet(sentence):
        doc = nlp(sentence)
        verb_phrases = get_verb_phrases(doc)
        # doc.noun_chunks is a generator; make a list so it can be searched twice
        noun_phrases = list(doc.noun_chunks)
        verb_phrase = None
        if (len(verb_phrases) > 1):
            verb_phrase = longer_verb_phrase(list(verb_phrases))
        else:
            verb_phrase = verb_phrases[0]
        left_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "left")
        right_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "right")
        return (left_noun_phrase, verb_phrase, right_noun_phrase)
- We can now loop through our sentence list to find its relation triplets:
    for sentence in sentences:
        (left_np, vp, right_np) = find_triplet(sentence)
        print(left_np, "\t", vp, "\t", right_np)
- The result will be as follows:
    All living things    are made of    cells
    Cells    have    organelles
How it works…
The code finds triplets of subject-relation-object by looking for the root verb phrase and finding its surrounding nouns. The verb phrases are found using the textacy package, which provides a very useful tool for finding patterns of words of certain parts of speech. In effect, we can use it to write small grammars describing the necessary phrases.
Important note
The textacy package, while very useful, is not bug-free, so use it with caution.
Once the verb phrases have been found, we can search the sentence's noun chunks for those that surround the verb phrase containing the root.
A step-by-step explanation follows.
In step 1, we import the necessary packages and the find_root_of_sentence function from the previous recipe. In step 2, we initialize the spacy engine, and in step 3, we initialize a list with the sentences we will be using.
In step 4, we compile part of speech patterns that we will use for finding relations. For these two sentences, the patterns are AUX, VERB, ADP (for are made of) and AUX (for have).
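To make the idea of a part of speech pattern concrete, here is a minimal pure-Python sketch of what such a matcher does conceptually. It is not how textacy or spaCy implement matching; the function name match_pos_patterns and the hand-written tag list are assumptions made for illustration, with the tags chosen to agree with the sequences described above.

```python
def match_pos_patterns(tags, patterns):
    """Return (start, end) index pairs where a pattern matches the tag list.

    `end` is exclusive, mirroring how spaCy spans are indexed.
    """
    spans = []
    for start in range(len(tags)):
        for pattern in patterns:
            end = start + len(pattern)
            # A slice that runs past the end is shorter than the pattern,
            # so it can never compare equal.
            if tags[start:end] == pattern:
                spans.append((start, end))
    return spans


# Hand-assigned tags for "All living things are made of cells" (an assumption
# for illustration; a real tagger may label some tokens differently).
tokens = ["All", "living", "things", "are", "made", "of", "cells"]
tags = ["DET", "ADJ", "NOUN", "AUX", "VERB", "ADP", "NOUN"]
patterns = [["AUX", "VERB", "ADP"], ["AUX"]]

spans = match_pos_patterns(tags, patterns)
phrases = [" ".join(tokens[start:end]) for start, end in spans]
print(phrases)
```

Both the short match and the long match start at the same token, which is why a later step has to choose between them.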
In step 5, we create the contains_root function, which will make sure that a verb phrase contains the root of the sentence. It does that by checking the index of the root and making sure that it falls within the verb phrase span boundaries.
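A spaCy span's end index points one past its last token, so a containment check needs a strict comparison on the right-hand side. The sketch below shows this with plain integers instead of spaCy objects; span_contains is a hypothetical helper name, and the token indices are counted by hand for the first example sentence.

```python
def span_contains(start, end, index):
    """True if `index` lies in the half-open interval [start, end)."""
    return start <= index < end


# "All living things are made of cells": "made" is token 4
root_index = 4
are_made_of = (3, 6)  # covers tokens 3, 4 and 5
are_only = (3, 4)     # covers token 3 only

print(span_contains(*are_made_of, root_index))
print(span_contains(*are_only, root_index))
```

With an inclusive comparison on the right, the one-token span for are would wrongly be reported as containing the root made.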
In step 6, we create the get_verb_phrases function, which extracts all the verb phrases from the Doc object that is passed in. It uses the part of speech patterns we created in step 4.
In step 7, we create the longer_verb_phrase function, which will find the longest verb phrase from a list. We do this because some verb phrases might be shorter than necessary. For example, in the sentence All living things are made of cells, both are and are made of will be found.
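Picking the longest candidate can also be expressed with Python's built-in max. The following sketch uses lists of strings as stand-ins for spans; longest_phrase is a hypothetical name, not part of the recipe's code.

```python
def longest_phrase(phrases):
    """Return the phrase with the most tokens, or None for an empty list."""
    # On ties, max() keeps the first of the equally long phrases.
    return max(phrases, key=len, default=None)


candidates = [["are"], ["are", "made", "of"]]
best = longest_phrase(candidates)
print(" ".join(best))
```

The default=None argument means an empty candidate list yields None instead of raising an error.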
In step 8, we create the find_noun_phrase function, which finds noun phrases on either side of the verb. We specify the side as a parameter.
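The recipe's function returns the first noun phrase it meets on the requested side. A possible variation, sketched here under the assumption that spans are given as (start, end) pairs with an exclusive end, picks the noun phrase nearest to the verb instead. The name nearest_chunk, the example sentence, and its hand-counted spans are all hypothetical.

```python
def nearest_chunk(verb_span, chunk_spans, side):
    """Pick the chunk closest to the verb phrase on the given side.

    All spans are (start, end) pairs with an exclusive end.
    """
    verb_start, verb_end = verb_span
    if side == "left":
        candidates = [c for c in chunk_spans if c[1] <= verb_start]
        return max(candidates, key=lambda c: c[1], default=None)
    if side == "right":
        candidates = [c for c in chunk_spans if c[0] >= verb_end]
        return min(candidates, key=lambda c: c[0], default=None)
    raise ValueError("side must be 'left' or 'right'")


# Hypothetical sentence, tokenized by hand: In(0) plants(1) ,(2) cells(3) have(4) walls(5)
chunks = [(1, 2), (3, 4), (5, 6)]  # "plants", "cells", "walls"
verb = (4, 5)                      # "have"

print(nearest_chunk(verb, chunks, "left"))
print(nearest_chunk(verb, chunks, "right"))
```

In this example, the first noun phrase on the left would be plants, while the nearest one is cells, which is the more plausible subject.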
In step 9, we create the find_triplet function, which will find triplets of subject-relation-object in a sentence. In this function, first, we process the sentence with spaCy. Then, we use the functions defined in the previous steps to find the longest verb phrase and the nouns to the left- and right-hand sides of it.
In step 10, we apply the find_triplet function to the two sentences we defined at the beginning. The resulting triplets are correct.
In this recipe, we made a few assumptions that will not always be correct. The first assumption is that there will only be one main verb phrase. The second assumption is that there will be a noun chunk on either side of the verb phrase. Once we start working with sentences that are complex or compound, or contain relative clauses, these assumptions no longer hold. I leave it as an exercise for you to work with more complex cases.
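As a starting point for that exercise, one option is to make the triplet assembly return nothing when either assumption fails, rather than raising an error. The sketch below works on plain (start, end) index pairs instead of spaCy spans and assumes the noun chunk spans arrive in document order; build_triplet and the Span alias are hypothetical names.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) with an exclusive end


def build_triplet(
    verb_spans: List[Span], chunk_spans: List[Span]
) -> Optional[Tuple[Span, Span, Span]]:
    """Return (subject, verb, object) spans, or None if a part is missing."""
    if not verb_spans:
        return None  # no verb phrase containing the root was found
    verb = max(verb_spans, key=lambda s: s[1] - s[0])
    left = [c for c in chunk_spans if c[1] <= verb[0]]
    right = [c for c in chunk_spans if c[0] >= verb[1]]
    if not left or not right:
        return None  # a noun chunk is missing on one side
    # chunk_spans is assumed to be in document order, so the last chunk on
    # the left and the first on the right are the ones nearest the verb.
    return (left[-1], verb, right[0])


# "All living things are made of cells", indices counted by hand
print(build_triplet([(3, 4), (3, 6)], [(0, 3), (6, 7)]))
# A sentence with a subject but no object
print(build_triplet([(1, 2)], [(0, 1)]))
```

A caller can then skip sentences for which None is returned and handle them with a more elaborate strategy later.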
There's more…
Once you've parsed out the entities and relations, you might want to input them into a knowledge graph for further use. There are a variety of tools you can use to work with knowledge graphs, such as neo4j.
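Before moving to a dedicated graph database, it can help to see the shape of the data. The following is a minimal in-memory sketch using only the standard library; build_graph is a hypothetical helper, and lowercasing entity names is an assumption made here so that Cells and cells map to the same node.

```python
from collections import defaultdict


def build_graph(triplets):
    """Map each subject to a list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triplets:
        # Lowercase entity names so "Cells" and "cells" become one node.
        graph[subject.lower()].append((relation, obj.lower()))
    return graph


triplets = [
    ("All living things", "are made of", "cells"),
    ("Cells", "have", "organelles"),
]
graph = build_graph(triplets)

# Follow the edges two hops out from "all living things"
for relation, obj in graph["all living things"]:
    print("all living things", relation, obj)
    for next_relation, next_obj in graph[obj]:
        print(obj, next_relation, next_obj)
```

A graph database would replace the dictionary with persistent storage and a query language, but the subject, relation, and object roles stay the same.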