Extracting noun chunks
Noun chunks are known in linguistics as noun phrases. They consist of a noun and any words that depend on and accompany it. For example, in the sentence "The big red apple fell on the scared cat", the noun chunks are "the big red apple" and "the scared cat". Extracting noun chunks is instrumental to many downstream NLP tasks, such as named entity recognition and processing entities and the relations between them. In this recipe, we will explore how to extract noun chunks from a text.
Getting ready
We will use the spaCy package, which has a function for extracting noun chunks, and the text from the sherlock_holmes_1.txt file as an example.
How to do it…
Use the following steps to get the noun chunks from a text:
- Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
- Define the function that will print out the noun chunks. The noun chunks are available through the doc.noun_chunks property of the processed Doc object:
    def print_noun_chunks(text, model):
        doc = model(text)  # run the spaCy pipeline on the text
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
- Read the text from the sherlock_holmes_1.txt file and use the function on the resulting text:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    print_noun_chunks(sherlock_holmes_part_of_text, small_model)
This is the partial result. See the output of the notebook at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter02/noun_chunks_2.3.ipynb for the full printout. The function gets the pronouns, nouns, and noun phrases that are in the text correctly:
    Sherlock Holmes
    she
    the_ woman
    I
    him
    her
    any other name
    his eyes
    she
    the whole
    …
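The recipe's function prints the chunks and relies on the utility notebooks for small_model. As a minimal sketch for use outside that notebook setup, the variant below returns the chunks as a list instead. The names get_noun_chunks and load_model are hypothetical helpers introduced here for illustration, and the sketch assumes that spaCy and the en_core_web_sm model are installed.

```python
def load_model(name="en_core_web_sm"):
    """Load a spaCy pipeline by name.

    Assumption: spaCy and the named model are installed locally.
    """
    import spacy  # imported lazily so get_noun_chunks can be reused on its own

    return spacy.load(name)


def get_noun_chunks(text, model):
    """Return the text of every noun chunk that `model` finds in `text`."""
    doc = model(text)
    return [noun_chunk.text for noun_chunk in doc.noun_chunks]
```

Calling get_noun_chunks("The big red apple fell on the scared cat.", load_model()) should return the two noun phrases from the introduction, although the exact chunks depend on the model and its version. Returning a list makes it easier to pass the chunks on to later processing steps.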
There’s more…
Noun chunks are spaCy Span objects and have all their properties. See the official Span documentation at https://spacy.io/api/span.
Let’s explore some properties of noun chunks:
- We will define a function that will print out the different properties of noun chunks. It will print the text of the noun chunk, its start and end token indices within the Doc object, the sentence it belongs to (useful when there is more than one sentence), the root of the noun chunk (its main word), and the chunk’s similarity to the word emotions. Finally, it will print out the similarity of the whole input sentence to emotions:
    def explore_properties(sentence, model):
        doc = model(sentence)
        other_span = "emotions"
        other_doc = model(other_span)  # the text to compare against
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
            print("Noun chunk start and end", "\t", noun_chunk.start, "\t", noun_chunk.end)
            print("Noun chunk sentence:", noun_chunk.sent)
            print("Noun chunk root:", noun_chunk.root.text)
            print(f"Noun chunk similarity to '{other_span}'", noun_chunk.similarity(other_doc))
        # Printed once, after all noun chunks have been processed
        print(f"Similarity of the sentence '{sentence}' to '{other_span}':", doc.similarity(other_doc))
- Set the sentence to "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind":
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
- Use the explore_properties function on the sentence using the small model:
    explore_properties(sentence, small_model)
This is the result:
    All emotions
    Noun chunk start and end 0 2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.4026421588260174
    his cold, precise but admirably balanced mind
    Noun chunk start and end 11 19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' -0.036891259527462
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.03174900767577446
You will also see a warning message similar to this one, because the small model does not ship with word vectors of its own:
    /tmp/ipykernel_1807/2430050149.py:10: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Span.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
      print(f"Noun chunk similarity to '{other_span}'", noun_chunk.similarity(other_doc))
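The start and end values in the output are token offsets rather than character offsets, and the end offset is exclusive. The following sketch illustrates this with plain Python slicing. The token list is written by hand to mirror the indices printed in the output (punctuation marks count as separate tokens); it is an assumption made for illustration, not something produced by spaCy, and span_tokens is a hypothetical helper.

```python
# Hand-written tokens for the example sentence (an assumption, not spaCy output).
# Punctuation marks are separate tokens, which is why "his" sits at index 11.
tokens = [
    "All", "emotions", ",", "and", "that", "one", "particularly", ",",
    "were", "abhorrent", "to", "his", "cold", ",", "precise", "but",
    "admirably", "balanced", "mind", ".",
]


def span_tokens(tokens, start, end):
    """Return the tokens from `start` up to, but not including, `end`."""
    return tokens[start:end]


first_chunk = span_tokens(tokens, 0, 2)     # offsets reported for the first chunk
second_chunk = span_tokens(tokens, 11, 19)  # offsets reported for the second chunk

print(first_chunk)
print(second_chunk)
```

Because the end offset is exclusive, the length of a chunk in tokens is simply end minus start: 2 tokens for the first chunk and 8 for the second.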
- Now, let’s apply the same function to the same sentence with the large model:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
    explore_properties(sentence, large_model)
The large model does come with its own word vectors and does not result in a warning:
    All emotions
    Noun chunk start and end 0 2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.6302678068015664
    his cold, precise but admirably balanced mind
    Noun chunk start and end 11 19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' 0.5744456705692561
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.640366414527618
We see that the similarity of the "All emotions" noun chunk to the word emotions (about 0.63) is higher than the similarity of the "his cold, precise but admirably balanced mind" noun chunk to the same word (about 0.57).
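To give a rough idea of where such numbers come from: by default, spaCy estimates similarity as the cosine similarity between averaged word vectors. The sketch below reproduces that calculation in plain Python. The three-dimensional vectors are invented toy values (real model vectors have hundreds of dimensions), and mean_vector and cosine_similarity are hypothetical helpers, so the printed numbers will not match the output above.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word vectors, made up for this illustration only.
toy_vectors = {
    "all": [0.1, 0.3, 0.0],
    "emotions": [0.9, 0.1, 0.2],
    "mind": [0.4, 0.8, 0.1],
}

# A chunk's vector is the average of the vectors of its tokens.
chunk_vector = mean_vector([toy_vectors["all"], toy_vectors["emotions"]])

chunk_similarity = cosine_similarity(chunk_vector, toy_vectors["emotions"])
mind_similarity = cosine_similarity(toy_vectors["mind"], toy_vectors["emotions"])

print(chunk_similarity)
print(mind_similarity)
```

In this toy setup, the chunk that contains the word emotions ends up closer to the emotions vector than the unrelated word does, which is the same pattern the large model shows.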
Important note
A larger spaCy model, such as en_core_web_lg, takes up more space but is generally more accurate, and it ships with its own word vectors.
See also
The topic of semantic similarity will be explored in more detail in Chapter 3.