Finding patterns in text using grammatical information
In this section, we will use the spaCy Matcher object to find patterns in text. The patterns are built from the grammatical properties of the words. For example, to look for verb phrases instead of noun phrases, we can specify part-of-speech patterns that match verb phrases.
Getting ready
We will be using the spaCy Matcher object to specify and find patterns. It can match on many different token properties, not just grammatical ones. You can find out more in the documentation at https://spacy.io/usage/rule-based-matching/.
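The steps below use `small_model` and `read_text_file`, which come from the utility notebooks loaded in the first step. If those notebooks are not at hand, stand-ins along the following lines might serve. This is only a sketch: it assumes that `small_model` is spaCy's `en_core_web_sm` pipeline and that `read_text_file` returns a file's contents as a single string, neither of which is confirmed by this section. The `load_small_model` helper is hypothetical and exists only for this illustration.

```python
import spacy


def read_text_file(filename):
    # Assumed behaviour: return the whole file as one UTF-8 string.
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()


def load_small_model(name="en_core_web_sm"):
    # Hypothetical helper. The model must be installed first, for example
    # with: python -m spacy download en_core_web_sm
    return spacy.load(name)


# Uncomment once the model is installed:
# small_model = load_small_model()
```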
How to do it…
- Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
- Import the Matcher object and initialize it. We need to pass in the vocabulary object, which must be the vocabulary of the model we will use to process the text:
    from spacy.matcher import Matcher
    matcher = Matcher(small_model.vocab)
- Create a list of patterns and add them to the matcher. Each pattern is a list of dictionaries, where each dictionary describes one token. In our patterns, we only specify the part of speech for each token. The patterns we will be using are a verb by itself (for example, paints), an auxiliary followed by a verb (for example, was observing), an auxiliary followed by an adjective (for example, were late), and an auxiliary followed by a verb and a preposition (for example, were staring at). This is not an exhaustive list; feel free to come up with other examples. We then add these patterns to the Matcher object under the key "Verb":
    patterns = [
        [{"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "ADJ"}],
        [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}]
    ]
    matcher.add("Verb", patterns)
- Read in a small part of the Sherlock Holmes text and process it using the small model:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    doc = small_model(sherlock_holmes_part_of_text)
- Now, we find the matches using the Matcher object and the processed text. We then loop through the matches and print out the match ID, the string ID (the identifier of the pattern), the start and end token indices of the match, and the text of the match:
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = small_model.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)
The result will be as follows:
    14677086776663181681 Verb 14 15 heard
    14677086776663181681 Verb 17 18 mention
    14677086776663181681 Verb 28 29 eclipses
    14677086776663181681 Verb 31 32 predominates
    14677086776663181681 Verb 43 44 felt
    14677086776663181681 Verb 49 50 love
    14677086776663181681 Verb 63 65 were abhorrent
    14677086776663181681 Verb 80 81 take
    14677086776663181681 Verb 88 89 observing
    14677086776663181681 Verb 94 96 has seen
    14677086776663181681 Verb 95 96 seen
    14677086776663181681 Verb 103 105 have placed
    14677086776663181681 Verb 104 105 placed
    14677086776663181681 Verb 114 115 spoke
    14677086776663181681 Verb 120 121 save
    14677086776663181681 Verb 130 132 were admirable
    14677086776663181681 Verb 140 141 drawing
    14677086776663181681 Verb 153 154 trained
    14677086776663181681 Verb 157 158 admit
    14677086776663181681 Verb 167 168 adjusted
    14677086776663181681 Verb 171 172 introduce
    14677086776663181681 Verb 173 174 distracting
    14677086776663181681 Verb 178 179 throw
    14677086776663181681 Verb 228 229 was
The code finds some of the verb phrases in the text. Sometimes, it reports a partial match that is part of another match: for example, seen (tokens 95 to 96) is reported alongside has seen (tokens 94 to 96). Weeding out these partial matches is left as an exercise.
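One possible approach to that exercise, offered as a sketch rather than as the recipe's intended solution, is spaCy's `filter_spans` utility, which keeps the longest span among overlapping ones. So that the example runs without a trained model, it builds a `Doc` from a made-up sentence with hand-assigned part-of-speech tags; the sentence and its tags are assumptions for illustration only, and in the recipe you would pass the `doc` produced by `small_model` instead.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

# A blank English pipeline supplies the vocabulary. No trained model is
# needed because the POS tags are assigned by hand (illustrative assumption).
nlp = spacy.blank("en")
words = ["She", "has", "seen", "it", "and", "they", "were", "staring", "at", "him"]
pos = ["PRON", "AUX", "VERB", "PRON", "CCONJ", "PRON", "AUX", "VERB", "ADP", "PRON"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# The same four patterns as in the recipe.
matcher = Matcher(nlp.vocab)
patterns = [
    [{"POS": "VERB"}],
    [{"POS": "AUX"}, {"POS": "VERB"}],
    [{"POS": "AUX"}, {"POS": "ADJ"}],
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}],
]
matcher.add("Verb", patterns)

# as_spans=True returns Span objects instead of (match_id, start, end) tuples.
spans = matcher(doc, as_spans=True)
all_matches = [span.text for span in spans]

# filter_spans drops any span that overlaps a longer one.
longest = [span.text for span in filter_spans(spans)]

print(all_matches)
print(longest)
```

Here the unfiltered matches include partial ones such as seen and were staring, while the filtered list retains only has seen and were staring at. Note that `filter_spans` removes every overlapping span, not only strict sub-spans, which may or may not be what a given application wants.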
See also
We can use other attributes apart from parts of speech. It is possible to match on the text itself, its length, whether it is alphanumeric, the punctuation, the word's case, the dep_ and morph attributes, lemma, entity type, and others. It is also possible to use regular expressions in the patterns. For more information, see the spaCy documentation: https://spacy.io/usage/rule-based-matching.
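A few of these attributes can be sketched as follows. The example uses only lexical attributes (`LOWER`, `IS_PUNCT`, `LENGTH`, and a `REGEX` on `TEXT`), so a blank pipeline that merely tokenizes is enough; attributes such as `DEP`, `MORPH`, `LEMMA`, and `ENT_TYPE` would need a trained pipeline like the one used in the recipe. The pattern names and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which suffices for lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Case-insensitive text match followed by any punctuation token.
matcher.add("Greeting", [[{"LOWER": "hello"}, {"IS_PUNCT": True}]])
# A regular expression applied to the text of a single token.
matcher.add("Year", [[{"TEXT": {"REGEX": r"^(18|19)\d{2}$"}}]])
# A comparison on token length (number of characters).
matcher.add("LongWord", [[{"LENGTH": {">=": 7}}]])

doc = nlp("Hello, Watson! Holmes retired in 1903.")

# Group the matched texts by the name of the pattern that produced them.
found = {}
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]
    found.setdefault(label, []).append(doc[start:end].text)

print(found)
```

Each pattern name ends up with one match in this sentence: the greeting with its comma, the year, and the single token of seven or more characters.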