Python Natural Language Processing Cookbook

Over 50 recipes to understand, analyze, and generate text for implementing language processing tasks

Product type Paperback
Published in Mar 2021
Publisher Packt
ISBN-13 9781838987312
Length 284 pages
Edition 1st Edition
Author Zhenya Antić
Table of Contents

Preface
  1. Chapter 1: Learning NLP Basics
  2. Chapter 2: Playing with Grammar
  3. Chapter 3: Representing Text – Capturing Semantics
  4. Chapter 4: Classifying Texts
  5. Chapter 5: Getting Started with Information Extraction
  6. Chapter 6: Topic Modeling
  7. Chapter 7: Building Chatbots
  8. Chapter 8: Visualizing Text Data
  9. Other Books You May Enjoy

Extracting noun chunks

Noun chunks are known in linguistics as noun phrases. A noun chunk consists of a noun together with the words that depend on and accompany it. For example, in the sentence "The big red apple fell on the scared cat", the noun chunks are "the big red apple" and "the scared cat". Extracting noun chunks is useful for many downstream NLP tasks, such as named entity recognition and processing entities and the relationships between them. In this recipe, we will explore how to extract noun chunks from a piece of text.

Getting ready

We will be using the spacy package, which has a function for extracting noun chunks, and the text from the sherlock_holmes_1.txt file as an example.

In this section, we will use another spaCy language model, en_core_web_md. Follow the instructions in the Technical requirements section to learn how to download it.

How to do it…

Use the following steps to get the noun chunks from a piece of text:

  1. Import the spacy package and the read_text_file function from the code files of Chapter 1:
    import spacy
    from Chapter01.dividing_into_sentences import read_text_file

    Important note

    If you are importing functions from other chapters, run the script from the parent directory of Chapter02 using the python -m Chapter02.extract_noun_chunks command.

  2. Read in the sherlock_holmes_1.txt file:
    text = read_text_file("sherlock_holmes_1.txt")
  3. Initialize the spacy engine and then use it to process the text:
    nlp = spacy.load('en_core_web_md')
    doc = nlp(text)
  4. The noun chunks are available through the doc.noun_chunks property of the Doc object. We can print out the chunks:
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text)

    This is the partial result. See this book's GitHub repository for the full printout, which can be found in the Chapter02/all_text_noun_chunks.txt file:

    Sherlock Holmes
    she
    the_ woman
    I
    him
    her
    any other name
    his eyes
    she
    the whole
    …

How it works…

The spaCy Doc object, as we saw in the previous recipe, contains information about grammatical relationships between words in a sentence. Using this information, spaCy determines noun phrases or chunks contained in the text.
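
To make the idea concrete, here is a minimal sketch in plain Python, without spaCy, of how a noun chunk can be read off a dependency parse: for each noun, take everything from the leftmost word that depends on it up to the noun itself. The Token class, the part-of-speech tags, and the head indices below are hand-written assumptions for the example sentence from the introduction, not spaCy output, and spaCy's own rules are more selective (they also look at dependency labels).

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str   # coarse part-of-speech tag (hand-assigned assumption)
    head: int  # index of the syntactic head; the root points to itself


# Hand-annotated parse of "The big red apple fell on the scared cat".
TOKENS = [
    Token("The", "DET", 3),
    Token("big", "ADJ", 3),
    Token("red", "ADJ", 3),
    Token("apple", "NOUN", 4),
    Token("fell", "VERB", 4),   # root of the sentence
    Token("on", "ADP", 4),
    Token("the", "DET", 8),
    Token("scared", "ADJ", 8),
    Token("cat", "NOUN", 5),
]


def left_edge(tokens, i):
    """Return the smallest index among token i and its transitive dependents."""
    edge = i
    for j in range(len(tokens)):
        k = j
        # Walk from token j up towards the root; if we pass through i,
        # then j belongs to the subtree headed by i.
        while tokens[k].head != k:
            k = tokens[k].head
            if k == i:
                edge = min(edge, j)
                break
    return edge


def noun_chunks(tokens):
    """Collect the text from each noun's left edge up to the noun itself."""
    chunks = []
    for i, tok in enumerate(tokens):
        if tok.pos in {"NOUN", "PROPN", "PRON"}:
            start = left_edge(tokens, i)
            chunks.append(" ".join(t.text for t in tokens[start:i + 1]))
    return chunks


print(noun_chunks(TOKENS))
```

With this hand-made parse, the two chunks recovered are the same two phrases named in the introduction to this recipe.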

In step 1, we import spacy and the read_text_file function from the Chapter01 module. In step 2, we read in the text from the sherlock_holmes_1.txt file.
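
The body of read_text_file is not shown in this recipe. If you do not have the Chapter 1 code at hand, a stand-in along the following lines should be enough to follow along. This is an assumption about what the helper does (read a whole file as UTF-8 text), not the book's actual implementation.

```python
from pathlib import Path


def read_text_file(filename):
    """Hypothetical stand-in for the Chapter 1 helper.

    Assumes the helper simply returns the full contents of a UTF-8 text file.
    """
    return Path(filename).read_text(encoding="utf-8")
```

It can then be called exactly as in step 2, for example read_text_file("sherlock_holmes_1.txt"), provided that file exists in the working directory.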

In step 3, we initialize the spacy engine with a different model, en_core_web_md, which is larger than the small model and will most likely give better results. There is also a large model, en_core_web_lg, which is larger still. It will likely give better results again, but processing will be slower. After loading the engine, we run it on the text we loaded in step 2.
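
Because spaCy models are installed as ordinary Python packages, you can check which ones are available before calling spacy.load. The following is a small sketch using only the standard library. The helper name first_installed and the preference order are assumptions made for illustration.

```python
from importlib.util import find_spec

# Candidate model packages, in an assumed order of preference for this recipe.
CANDIDATE_MODELS = ["en_core_web_md", "en_core_web_lg", "en_core_web_sm"]


def first_installed(candidates):
    """Return the first name importable as a top-level package, else None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


model_name = first_installed(CANDIDATE_MODELS)
print(model_name or "No spaCy English model found; see Technical requirements.")
```

The returned name can be passed to spacy.load in step 3 in place of the hardcoded string.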

In step 4, we print out the noun chunks that appear in the text. As you can see, it correctly identifies the pronouns, nouns, and noun phrases in the text.

There's more…

Noun chunks are spaCy Span objects and have all their properties. See the official documentation at https://spacy.io/api/span.
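
Two of those properties, start and end, are token indices into the parent Doc rather than character positions, and end is exclusive (character positions are exposed separately as start_char and end_char). The sketch below mimics this with a plain Python list, using the sentence from the steps that follow. The tokenization is written by hand as an assumption, and token_slice is a hypothetical helper, not a spaCy function.

```python
sentence = ("All emotions, and that one particularly, were abhorrent to his "
            "cold, precise but admirably balanced mind.")

# Hand-written tokenization (an assumption, not produced by spaCy).
tokens = ["All", "emotions", ",", "and", "that", "one", "particularly", ",",
          "were", "abhorrent", "to", "his", "cold", ",", "precise", "but",
          "admirably", "balanced", "mind", "."]


def token_slice(tokens, start, end):
    """Return the tokens in the half-open range [start, end)."""
    return tokens[start:end]


# Token offsets: the end index is not included in the slice.
print(token_slice(tokens, 0, 2))
print(token_slice(tokens, 11, 19))

# Character offsets of the same chunk, computed on the raw string.
chunk_text = "his cold, precise but admirably balanced mind"
start_char = sentence.index(chunk_text)
end_char = start_char + len(chunk_text)
print(sentence[start_char:end_char])
```

Keeping the two kinds of offsets apart matters when you map chunks back onto the original string, for instance for highlighting.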

Let's explore some properties of noun chunks:

  1. Import the spacy package:
    import spacy
  2. Load the spacy engine:
    nlp = spacy.load('en_core_web_sm')
  3. Set the sentence to All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
  4. Process the sentence with the spacy engine:
    doc = nlp(sentence)
  5. Let's look at the noun chunks in this sentence:
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text)
  6. This is the result:
    All emotions
    his cold, precise but admirably balanced mind
  7. Two basic properties of a noun chunk are its start and end offsets, which are token indices within the Doc (the end index is exclusive); we can print them out together with the noun chunks:
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text, "\t", noun_chunk.start, "\t", 
              noun_chunk.end)

    The result will be as follows:

    All emotions     0       2
    his cold, precise but admirably balanced mind    11      19
  8. We can also print out the sentence where the noun chunk belongs:
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text, "\t", noun_chunk.sent)

    Predictably, this results in the following:

    All emotions     All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    his cold, precise but admirably balanced mind    All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
  9. Just like a sentence, any noun chunk includes a root, which is the token that all other tokens depend on. In a noun phrase, that is the noun:
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text, "\t", noun_chunk.root.text)
  10. The result will be as follows:
    All emotions     emotions
    his cold, precise but admirably balanced mind    mind
  11. Another very useful feature of Span is its similarity method, which estimates the semantic similarity between two texts. Let's try it out. We will define another piece of text, emotions, and process it using spacy:
    other_span = "emotions"
    other_doc = nlp(other_span)
  12. We can now compare it to the noun chunks in the sentence by using this code:
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text)
        print(noun_chunk.similarity(other_doc))

    This is the result:

    UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Span.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
      print(noun_chunk.similarity(other_doc))
    All emotions
    0.373233604751925
    his cold, precise but admirably balanced mind
    0.030945358271699138
  13. Although the result makes sense, with All emotions being more similar to emotions than his cold, precise but admirably balanced mind is, we get a warning. To fix this, we will use the medium spaCy model, which contains vector representations for words. Replace the line in step 2 with the following line; the rest of the code remains the same:
    nlp = spacy.load('en_core_web_md')
  14. Now, when we run this code with the new model, we get this result:
    All emotions
    0.8876554549427152
    that one
    0.37378867755652434
    his cold, precise but admirably balanced mind
    0.5102475977383759

    The result shows that the similarity of All emotions to emotions is very high, 0.89, while the similarity of his cold, precise but admirably balanced mind to emotions is 0.51. We can also see that the larger model detects another noun chunk, that one.

    Important note

    A larger spaCy model, such as en_core_web_md, takes up more space but is generally more accurate.
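
When word vectors are available, spaCy's default similarity estimate is the cosine similarity between the two objects' vectors, where the vector of a multi-token span defaults to the average of its token vectors. The following sketch reproduces that calculation in plain Python. The three-dimensional vectors in TOY_VECTORS are invented for illustration and have nothing to do with the real en_core_web_md vectors.

```python
import math


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up toy vectors (assumption): real models use hundreds of dimensions.
TOY_VECTORS = {
    "all": [0.2, 0.1, 0.0],
    "emotions": [0.9, 0.8, 0.1],
    "mind": [0.3, 0.9, 0.4],
}


def phrase_vector(words):
    """Average the toy vectors of the given words."""
    return average([TOY_VECTORS[w] for w in words])


target = phrase_vector(["emotions"])
print(cosine(phrase_vector(["all", "emotions"]), target))
print(cosine(phrase_vector(["mind"]), target))
```

Averaging explains why All emotions scores so close to emotions in step 14: one of its two token vectors is the target vector itself.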

See also

The topic of semantic similarity will be explored in more detail in Chapter 3, Representing Text: Capturing Semantics.
