Python Natural Language Processing Cookbook

Over 60 recipes for building powerful NLP solutions using Python and LLM libraries

Product type Paperback
Published in Sep 2024
Publisher Packt
ISBN-13 9781803245744
Length 312 pages
Edition 2nd Edition
Authors (2): Saurabh Chakravarty, Zhenya Antić

Extracting noun chunks

Noun chunks are known in linguistics as noun phrases. They consist of a noun and the words that depend on and accompany it. For example, in the sentence The big red apple fell on the scared cat, the noun chunks are the big red apple and the scared cat. Extracting noun chunks is instrumental to many downstream NLP tasks, such as named entity recognition and processing entities and the relations between them. In this recipe, we will explore how to extract noun chunks from a text.

Getting ready

We will use the spaCy package, which has a function for extracting noun chunks, and the text from the sherlock_holmes_1.txt file as an example.

How to do it…

Use the following steps to get the noun chunks from a text:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Define the function that will print out the noun chunks. The noun chunks are available through the doc.noun_chunks property of the Doc object:
    def print_noun_chunks(text, model):
        doc = model(text)
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
  3. Read the text from the sherlock_holmes_1.txt file and use the function on the resulting text:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    print_noun_chunks(sherlock_holmes_part_of_text, small_model)

    This is a partial result. See the output of the notebook at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter02/noun_chunks_2.3.ipynb for the full printout. The function correctly extracts the pronouns, nouns, and noun phrases in the text:

    Sherlock Holmes
    she
    the_ woman
    I
    him
    her
    any other name
    his eyes
    she
    the whole
    …
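
The printout above mixes pronouns such as she and him with fuller noun phrases. If only the latter are needed, one option is to skip chunks whose root token is tagged as a pronoun. The following is a minimal sketch, not part of the original recipe: the helper name get_non_pronoun_chunks is hypothetical, and it assumes that model is a loaded spaCy pipeline with a tagger and parser, such as small_model:

```python
def get_non_pronoun_chunks(text, model):
    """Return the text of each noun chunk whose root is not a pronoun.

    Hypothetical helper (not from the recipe). `model` is assumed to be a
    loaded spaCy pipeline, for example the recipe's small_model.
    """
    doc = model(text)
    return [
        chunk.text
        for chunk in doc.noun_chunks
        # Token.pos_ holds the coarse part-of-speech tag; "PRON" marks pronouns
        if chunk.root.pos_ != "PRON"
    ]
```

Calling get_non_pronoun_chunks(sherlock_holmes_part_of_text, small_model) should return the chunks from the printout without entries such as she, I, him, and her, provided the tagger labels those tokens as pronouns.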

There’s more…

Noun chunks are spaCy Span objects and have all their properties. See the official documentation at https://spacy.io/api/span.
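
Two of those Span properties, start and end, are token offsets into the parent Doc, with end being exclusive, so slicing the Doc with them gives back the chunk. The sketch below imitates this with a plain Python list instead of a real Doc. The token list and the offsets are written by hand for the example sentence from the introduction, on the assumption that spaCy splits that sentence on whitespace:

```python
# Hand-written stand-in for the tokens of a spaCy Doc (assumed tokenization)
tokens = ["The", "big", "red", "apple", "fell", "on", "the", "scared", "cat"]

# (start, end) token offsets of the two noun chunks; end is exclusive,
# mirroring how Span.start and Span.end index into a Doc
chunk_offsets = [(0, 4), (6, 9)]

for start, end in chunk_offsets:
    # Slicing with the offsets recovers the tokens that make up the chunk
    print(start, end, " ".join(tokens[start:end]))
```

With a real Doc, doc[noun_chunk.start:noun_chunk.end] returns a Span covering the same tokens as noun_chunk.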

Let’s explore some properties of noun chunks:

  1. We will define a function that will print out the different properties of noun chunks. It will print the text of the noun chunk, its start and end indices within the Doc object, the sentence it belongs to (useful when there is more than one sentence), the root of the noun chunk (its main word), and the chunk’s similarity to the word emotions. Finally, it will print out the similarity of the whole input sentence to emotions:
    def explore_properties(sentence, model):
        doc = model(sentence)
        other_span = "emotions"
        other_doc = model(other_span)
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
            print("Noun chunk start and end", "\t",
                noun_chunk.start, "\t", noun_chunk.end)
            print("Noun chunk sentence:", noun_chunk.sent)
            print("Noun chunk root:", noun_chunk.root.text)
            print(f"Noun chunk similarity to '{other_span}'",
                noun_chunk.similarity(other_doc))
        print(f"Similarity of the sentence '{sentence}' to "
            f"'{other_span}':",
            doc.similarity(other_doc))
  2. Set the sentence to All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
  3. Use the explore_properties function on the sentence using the small model:
    explore_properties(sentence, small_model)

    This is the result:

    All emotions
    Noun chunk start and end    0    2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.4026421588260174
    his cold, precise but admirably balanced mind
    Noun chunk start and end    11    19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' -0.036891259527462
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.03174900767577446

    You will also see a warning message similar to this one because the small model does not ship with word vectors of its own:

    /tmp/ipykernel_1807/2430050149.py:10: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Span.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
      print(f"Noun chunk similarity to '{other_span}'", noun_chunk.similarity(other_doc))
  4. Now, let’s apply the same function to the same sentence with the large model:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
    explore_properties(sentence, large_model)

    The large model does come with its own word vectors and does not result in a warning:

    All emotions
    Noun chunk start and end    0    2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.6302678068015664
    his cold, precise but admirably balanced mind
    Noun chunk start and end    11    19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' 0.5744456705692561
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.640366414527618

    We see that the All emotions noun chunk is more similar to the word emotions (about 0.63) than the his cold, precise but admirably balanced mind noun chunk is (about 0.57).
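
According to the spaCy documentation, the default similarity estimate is the cosine similarity of averaged word vectors. The following sketch shows that calculation in plain Python on made-up three-dimensional vectors. The vectors, the words chosen, and the helper names are illustrative assumptions and do not come from any spaCy model:

```python
import math


def average_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors (NOT real spaCy vectors), for illustration only
toy_vectors = {
    "all": [0.2, 0.1, 0.0],
    "emotions": [0.9, 0.8, 0.1],
    "mind": [0.1, 0.7, 0.6],
}

# A chunk is represented by the average of its token vectors
chunk_vector = average_vector([toy_vectors["all"], toy_vectors["emotions"]])
print(cosine_similarity(chunk_vector, toy_vectors["emotions"]))
print(cosine_similarity(toy_vectors["mind"], toy_vectors["emotions"]))
```

A score near 1 means that the two vectors point in nearly the same direction. In this toy setup, the chunk that contains the word emotions scores higher against emotions than the vector for mind does.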

Important note

A larger spaCy model, such as en_core_web_lg, takes up more disk space and memory but includes word vectors and is generally more accurate.
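
Before relying on similarity scores, it can help to check whether a pipeline has a word-vector table at all. The sketch below is one possible check, assuming that model is a loaded spaCy pipeline. The function name has_word_vectors is hypothetical, while vocab.vectors.n_keys is the spaCy attribute that reports the number of keys in the vector table:

```python
def has_word_vectors(model):
    """Return True if the pipeline's vocabulary has a non-empty vector table.

    Hypothetical helper; `model` is assumed to be a loaded spaCy pipeline,
    such as the recipe's small_model or large_model.
    """
    # Vectors.n_keys is the number of keys in the vector table; it is
    # expected to be 0 when the pipeline ships without word vectors
    return model.vocab.vectors.n_keys > 0
```

With the models from this recipe, has_word_vectors(small_model) would be expected to return False and has_word_vectors(large_model) to return True.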

See also

The topic of semantic similarity will be explored in more detail in Chapter 3.
