Python Natural Language Processing Cookbook

Over 60 recipes for building powerful NLP solutions using Python and LLM libraries

Product type: Paperback
Published in: Sep 2024
Publisher: Packt
ISBN-13: 9781803245744
Length: 312 pages
Edition: 2nd Edition
Authors (2): Saurabh Chakravarty, Zhenya Antić

Finding patterns in text using grammatical information

In this section, we will use the spaCy Matcher object to find patterns in text. The patterns are built from the grammatical properties of words. For example, to find verb phrases rather than noun phrases, we can specify sequences of parts of speech that verb phrases typically follow.

Getting ready

We will be using the spaCy Matcher object to specify and find patterns. It can match on many token properties, not just grammatical ones. You can find out more in the documentation at https://spacy.io/usage/rule-based-matching/.

How to do it…

Follow these steps:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Import the Matcher object and initialize it. We need to pass in the vocabulary object, which must be the vocabulary of the model we will use to process the text:
    from spacy.matcher import Matcher
    matcher = Matcher(small_model.vocab)
  3. Create a list of patterns and add them to the matcher. Each pattern is a list of dictionaries, where each dictionary describes a token. In our patterns, we only specify the part of speech for each token. We then add these patterns to the Matcher object. The patterns we will be using are a verb by itself (for example, paints), an auxiliary followed by a verb (for example, was observing), an auxiliary followed by an adjective (for example, were late), and an auxiliary followed by a verb and a preposition (for example, were staring at). This is not an exhaustive list; feel free to come up with other examples:
    patterns = [
        [{"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "ADJ"}],
        [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}]
    ]
    matcher.add("Verb", patterns)
  4. Read in a small part of the Sherlock Holmes text and process it using the small model:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    doc = small_model(sherlock_holmes_part_of_text)
  5. Now, we find the matches using the Matcher object and the processed text. Each match is a tuple of the match ID (a hash of the pattern name), the start token index, and the end token index (exclusive). We loop through the matches and print out the match ID, the string ID (the pattern name, here Verb), the start and end of the match, and the text of the match:
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = small_model.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)

    The result will be as follows:

    14677086776663181681 Verb 14 15 heard
    14677086776663181681 Verb 17 18 mention
    14677086776663181681 Verb 28 29 eclipses
    14677086776663181681 Verb 31 32 predominates
    14677086776663181681 Verb 43 44 felt
    14677086776663181681 Verb 49 50 love
    14677086776663181681 Verb 63 65 were abhorrent
    14677086776663181681 Verb 80 81 take
    14677086776663181681 Verb 88 89 observing
    14677086776663181681 Verb 94 96 has seen
    14677086776663181681 Verb 95 96 seen
    14677086776663181681 Verb 103 105 have placed
    14677086776663181681 Verb 104 105 placed
    14677086776663181681 Verb 114 115 spoke
    14677086776663181681 Verb 120 121 save
    14677086776663181681 Verb 130 132 were admirable
    14677086776663181681 Verb 140 141 drawing
    14677086776663181681 Verb 153 154 trained
    14677086776663181681 Verb 157 158 admit
    14677086776663181681 Verb 167 168 adjusted
    14677086776663181681 Verb 171 172 introduce
    14677086776663181681 Verb 173 174 distracting
    14677086776663181681 Verb 178 179 throw
    14677086776663181681 Verb 228 229 was

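The code finds some of the verb phrases in the text. Sometimes, it reports a partial match that is part of another match: for example, seen (tokens 95 to 96) appears alongside has seen (tokens 94 to 96), because both the single-verb pattern and the auxiliary-plus-verb pattern apply. Weeding out these partial matches is left as an exercise.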
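
One possible approach to this exercise, offered as a minimal sketch rather than the authors' solution, is to keep a match only if it is not contained in a longer match. The function name drop_partial_matches and the sample list below are hypothetical; the sample tuples reuse a few of the token offsets from the output above and have the same (match_id, start, end) shape that the Matcher returns. The sketch uses plain Python so that it runs without a spaCy model:

```python
def drop_partial_matches(matches):
    """Keep only matches that are not contained in a longer kept match.

    `matches` is a list of (match_id, start, end) tuples, the shape
    returned by calling a spaCy Matcher on a Doc. Hypothetical helper.
    """
    # Visit the longest matches first; break ties by earlier start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    for match_id, start, end in ordered:
        # A match is partial if an already kept match fully covers it.
        contained = any(s <= start and end <= e for _, s, e in kept)
        if not contained:
            kept.append((match_id, start, end))
    # Restore document order.
    return sorted(kept, key=lambda m: (m[1], m[2]))


# Sample data copied from a few rows of the output above.
VERB_ID = 14677086776663181681
sample = [
    (VERB_ID, 94, 96),    # has seen
    (VERB_ID, 95, 96),    # seen (partial)
    (VERB_ID, 103, 105),  # have placed
    (VERB_ID, 104, 105),  # placed (partial)
    (VERB_ID, 114, 115),  # spoke
]
filtered = drop_partial_matches(sample)
for match_id, start, end in filtered:
    print(match_id, start, end)
```

In the recipe, the same function could be applied to the matches list from step 5 before printing. spaCy also offers built-in alternatives worth checking in its documentation: the greedy="LONGEST" argument of Matcher.add and the spacy.util.filter_spans function, which works on Span objects and removes any overlapping spans, not only contained ones.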

See also

We can use other attributes apart from parts of speech. It is possible to match on the text itself, its length, whether it consists of alphabetic characters or digits, whether it is punctuation, the word’s case, the dep_ and morph attributes, lemma, entity type, and others. It is also possible to use regular expressions in the patterns. For more information, see the spaCy documentation: https://spacy.io/usage/rule-based-matching.
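
As a rough illustration of these attributes, the sketch below defines one token pattern per attribute. The dictionary keys (TEXT, LOWER, LENGTH, and so on) are token attributes listed in the spaCy rule-based matching documentation; the pattern names and the values are made up for this example and are not taken from the book. The block only builds the pattern data, so it runs without spaCy installed:

```python
# Illustrative patterns only: names and values are assumptions for this sketch.
other_patterns = {
    "ExactText": [{"TEXT": "Holmes"}],                       # exact token text
    "AnyCase": [{"LOWER": "sherlock"}, {"LOWER": "holmes"}],  # case-insensitive
    "LongWord": [{"LENGTH": {">=": 10}}],                     # token length
    "WordThenPunct": [{"IS_ALPHA": True}, {"IS_PUNCT": True}],
    "TitleCase": [{"IS_TITLE": True}],                        # word's case
    "Subject": [{"DEP": "nsubj"}],                            # dependency label
    "PastTense": [{"MORPH": {"IS_SUPERSET": ["Tense=Past"]}}],
    "FormsOfBe": [{"LEMMA": "be"}],                           # lemma
    "Person": [{"ENT_TYPE": "PERSON"}],                       # entity type
    "IngWord": [{"TEXT": {"REGEX": r"^\w+ing$"}}],            # regular expression
}

for name, pattern in other_patterns.items():
    print(name, pattern)
```

Each entry could be registered in the same way as the Verb patterns in step 3, for example with matcher.add(name, [pattern]). Attributes such as DEP, MORPH, LEMMA, and ENT_TYPE only match when the pipeline that processed the text sets them.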
