Python Natural Language Processing Cookbook

Python Natural Language Processing Cookbook: Over 60 recipes for building powerful NLP solutions using Python and LLM libraries , Second Edition

Zhenya Antić, Saurabh Chakravarty
Paperback Sep 2024 312 pages 2nd Edition

Playing with Grammar

Grammar is one of the main building blocks of language. Each human language, and each programming language for that matter, has a set of rules that everyone using it must follow or risk not being understood. These grammatical rules can be uncovered using NLP and are useful for extracting data from sentences. For example, using information about the grammatical structure of text, we can parse out subjects, objects, and relations between different entities.

In this chapter, you will learn how to use different packages to reveal the grammatical structure of words and sentences, as well as extract certain parts of sentences. These are the topics covered in this chapter:

  • Counting nouns – plural and singular nouns
  • Getting the dependency parse
  • Extracting noun chunks
  • Extracting the subjects and objects of the sentence
  • Finding patterns in text using grammatical information

Technical requirements

Please follow the installation requirements given in Chapter 1 to run the notebooks in this chapter.

Counting nouns – plural and singular nouns

In this recipe, we will do two things: determine whether a noun is plural or singular and turn plural nouns into singular, and vice versa.

You might need these two things for a variety of tasks. For example, you might want to compute word statistics, and for that, you most likely need to count the singular and plural forms of a noun together. To do that, you need a way to recognize whether a word is plural or singular.

Getting ready

To determine whether a noun is singular or plural, we will use spaCy via two different methods: by looking at the difference between the lemma and the actual word and by looking at the morph attribute. To inflect these nouns, that is, to turn singular nouns into plural or vice versa, we will use the textblob package. We will also see how to determine the noun’s number using GPT-3.5 through the OpenAI API. The code for this section is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter02.

How to do it…

We will first use spaCy’s lemma information to infer whether a noun is singular or plural. Then, we will use the morph attribute of Token objects. We will then create a function that uses one of those methods. Finally, we will use GPT-3.5 to determine the number of each noun:

  1. Run the code in the file and language utility notebooks. If you run into an error saying that the small or large models do not exist, you need to open the lang_utils.ipynb file, uncomment, and run the statement that downloads the model:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Initialize the text variable and process it using the spaCy small model to get the resulting Doc object:
    text = "I have five birds"
    doc = small_model(text)
  3. In this step, we loop through the Doc object. For each token in the object, we check whether it’s a noun and whether the lemma is the same as the word itself. Since the lemma is the basic form of the word, we treat a noun whose lemma differs from the word itself as plural:
    for token in doc:
        if (token.pos_ == "NOUN" and token.lemma_ != token.text):
            print(token.text, "plural")

    The result should be as follows:

    birds plural
  4. Now, we will check the number of a noun using a different method: the morph features of a Token object. The morph features are the morphological features of a word, such as number, case, and so on. Since we know that the token at index 3 (birds) is a noun, we directly access its morph features and get the Number value, which gives the same result as previously:
    doc = small_model("I have five birds.")
    print(doc[3].morph.get("Number"))

    Here is the result:

    ['Plur']
  5. In this step, we prepare to define a function that returns a list of (noun, number) tuples. To encode the noun number, we use an Enum class whose members carry numeric values: 1 for singular and 2 for plural. Once we create the class, we can refer to the noun numbers as Noun_number.SINGULAR and Noun_number.PLURAL:
    from enum import Enum
    class Noun_number(Enum):
        SINGULAR = 1
        PLURAL = 2
  6. In this step, we define the function. It takes as input the text, the spaCy model, and the method of determining the noun number. The two methods are lemma and morph, the same two methods we used in steps 3 and 4, respectively. The function outputs a list of tuples, each of the format (<noun text>, <noun number>), where the noun number is expressed using the Noun_number class defined in step 5:
    def get_nouns_number(text, model, method="lemma"):
        nouns = []
        doc = model(text)
        for token in doc:
            if (token.pos_ == "NOUN"):
                if method == "lemma":
                    if token.lemma_ != token.text:
                        nouns.append((token.text, 
                            Noun_number.PLURAL))
                    else:
                        nouns.append((token.text,
                            Noun_number.SINGULAR))
                elif method == "morph":
                    # morph.get() returns a list of values, e.g. ['Plur']
                    if "Plur" in token.morph.get("Number"):
                        nouns.append((token.text,
                            Noun_number.PLURAL))
                    else:
                        nouns.append((token.text,
                            Noun_number.SINGULAR))
        return nouns
  7. We can use the preceding function and see its performance with different spaCy models. In this step, we use the small spaCy model with the function we just defined. In the output below, both methods get the number of the irregular noun geese wrong:
    text = "Three geese crossed the road"
    nouns = get_nouns_number(text, small_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, small_model)
    print(nouns)

    The result should be as follows:

    [('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
    [('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
  8. Now, let’s do the same using the large model. If you have not yet downloaded the large model, do so by running the first line. Otherwise, you can comment it out. In the output below, the morph method still assigns singular to geese, while the lemma method provides the correct answer:
    !python -m spacy download en_core_web_lg
    large_model = spacy.load("en_core_web_lg")
    nouns = get_nouns_number(text, large_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, large_model)
    print(nouns)

    The result should be as follows:

    [('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
    [('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]
  9. Let’s now use GPT-3.5 to get the noun number. In the results, we see that GPT-3.5 correctly identifies both the number for geese and the number for road:
    from openai import OpenAI
    client = OpenAI(api_key=OPEN_AI_KEY)
    prompt="""Decide whether each noun in the following text is singular or plural.
    Return the list in the format of a python tuple: (word, number). Do not provide any additional explanations.
    Sentence: Three geese crossed the road."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=256,
        top_p=1.0,
        frequency_penalty=0,
        presence_penalty=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
    )
    print(response.choices[0].message.content)

    The result should be as follows:

    ('geese', 'plural')
    ('road', 'singular')
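The model returns its answer as plain text, so it usually has to be converted back into Python objects before it can be used further. The following is a minimal sketch of one way to do that, assuming the reply contains one tuple per line as in the output above. The parse_noun_numbers helper is a hypothetical name introduced here and is not part of the book's code; it relies only on the standard library's ast.literal_eval, which evaluates literals without executing arbitrary code.

```python
import ast

def parse_noun_numbers(reply):
    """Parse lines such as "('geese', 'plural')" into (word, number) tuples."""
    pairs = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            value = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            # Skip any line that is not a valid Python literal,
            # for example extra commentary added by the model.
            continue
        if (isinstance(value, tuple) and len(value) == 2
                and all(isinstance(item, str) for item in value)):
            pairs.append(value)
    return pairs

# The reply text is copied from the result shown above.
reply = "('geese', 'plural')\n('road', 'singular')"
print(parse_noun_numbers(reply))
```

Lines that do not parse are skipped rather than raising an error, which is a design choice for this sketch: model replies do not always follow the requested format exactly.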

There’s more…

We can also change the nouns from plural to singular, and vice versa. We will use the textblob package for that. The package should be installed automatically via the Poetry environment:

  1. Import the TextBlob class from the package:
    from textblob import TextBlob
  2. Initialize a list of text variables and process them using the TextBlob class via a list comprehension:
    texts = ["book", "goose", "pen", "point", "deer"]
    blob_objs = [TextBlob(text) for text in texts]
  3. Use the pluralize function of the object to get the plural. This function returns a list and we access its first element. Print the result:
    plurals = [blob_obj.words.pluralize()[0] 
        for blob_obj in blob_objs]
    print(plurals)

    The result should be as follows:

    ['books', 'geese', 'pens', 'points', 'deer']
  4. Now, we will do the reverse. We use the preceding plurals list to turn the plural nouns into TextBlob objects:
    blob_objs = [TextBlob(text) for text in plurals]
  5. Turn the nouns into singular using the singularize function and print:
    singulars = [blob_obj.words.singularize()[0] 
        for blob_obj in blob_objs]
    print(singulars)

    The result should be the same as the list we started with in step 2:

    ['book', 'goose', 'pen', 'point', 'deer']
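Returning to the word statistics use case mentioned at the start of this recipe, the sketch below shows why a base form is useful: counting by surface form keeps singular and plural apart, while counting by lemma merges them. The (text, lemma) pairs are written by hand for illustration and are an assumption; in practice they would come from token.text and token.lemma_, or from singularize.

```python
from collections import Counter

# Hypothetical (text, lemma) pairs in the shape spaCy exposes through
# token.text and token.lemma_; the values are written by hand here.
noun_tokens = [
    ("birds", "bird"),
    ("bird", "bird"),
    ("geese", "goose"),
    ("goose", "goose"),
    ("geese", "goose"),
]

# Counting surface forms treats "goose" and "geese" as different words.
surface_counts = Counter(text for text, _ in noun_tokens)

# Counting lemmas merges the singular and plural forms of each noun.
lemma_counts = Counter(lemma for _, lemma in noun_tokens)

print(surface_counts)
print(lemma_counts)
```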

Getting the dependency parse

A dependency parse shows the grammatical dependencies between the words in a sentence. For example, in the sentence The cat wore a hat, the root of the sentence is the verb, wore, and both the subject, the cat, and the object, a hat, depend on it. The dependency parse can be very useful in many NLP tasks since it shows the grammatical structure of the sentence, with the subject, the main verb, the object, and so on. It can then be used in downstream processing.

The spaCy NLP engine does the dependency parse as part of its overall analysis. The dependency parse tags explain the role of each word in the sentence. ROOT is the main word that all other words depend on, usually the verb.

Getting ready

We will use spaCy to create the dependency parse. The required packages are part of the Poetry environment.

How to do it…

We will take a sentence from the sherlock_holmes_1.txt file to illustrate the dependency parse. The steps are as follows:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Define the sentence we will be parsing:
    sentence = 'I have seldom heard him mention her under any other name.'
  3. Define a function that will print the word, its grammatical function embedded in the dep_ attribute, and the explanation of that attribute. The dep_ attribute of the Token object shows the grammatical function of the word in the sentence:
    def print_dependencies(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text, "\t", token.dep_, "\t", 
                spacy.explain(token.dep_))
  4. Now, let’s use this function on our sentence. We can see that the verb heard is the ROOT word of the sentence, with all other words depending on it:
    print_dependencies(sentence, small_model)

    The result should be as follows:

    I    nsubj    nominal subject
    have    aux    auxiliary
    seldom    advmod    adverbial modifier
    heard    ROOT    root
    him    nsubj    nominal subject
    mention    ccomp    clausal complement
    her    dobj    direct object
    under    prep    prepositional modifier
    any    det    determiner
    other    amod    adjectival modifier
    name    pobj    object of preposition
    .    punct    punctuation
  5. To explore the dependency parse structure, we can use the attributes of the Token class. Using the ancestors and children attributes, we can get the tokens that this token depends on and the tokens that depend on it, respectively. The function to print the ancestors is as follows:
    def print_ancestors(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text, [t.text for t in token.ancestors])
  6. Now, let’s use this function on our sentence:
    print_ancestors(sentence, small_model)

    The output will be as follows. In the result, we see that heard has no ancestors since it is the main word in the sentence. All other words depend on it, and in fact, contain heard in their ancestor lists.

    The dependency chain can be seen by following the ancestor links for each word. For example, if we look at the word name, we see that its ancestors are under, mention, and heard. The immediate parent of name is under, the parent of under is mention, and the parent of mention is heard. A dependency chain will always lead to the root, or the main word, of the sentence:

    I ['heard']
    have ['heard']
    seldom ['heard']
    heard []
    him ['mention', 'heard']
    mention ['heard']
    her ['mention', 'heard']
    under ['mention', 'heard']
    any ['name', 'under', 'mention', 'heard']
    other ['name', 'under', 'mention', 'heard']
    name ['under', 'mention', 'heard']
    . ['heard']
  7. To see all the children, use the following function. This function prints out each word and the words that depend on it, its children:
    def print_children(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text,[t.text for t in token.children])
  8. Now, let’s use this function on our sentence:
    print_children(sentence, small_model)

    The result should be as follows. Now, the word heard has a list of words that depend on it since it is the main word in the sentence:

    I []
    have []
    seldom []
    heard ['I', 'have', 'seldom', 'mention', '.']
    him []
    mention ['him', 'her', 'under']
    her []
    under ['name']
    any []
    other []
    name ['any', 'other']
    . []
  9. We can also see left and right children in separate lists. In the following function, we print the children as two separate lists, left and right. This can be useful when doing grammatical transformations in the sentence:
    def print_lefts_and_rights(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text,
                [t.text for t in token.lefts],
                [t.text for t in token.rights])
  10. Let’s use this function on our sentence:
    print_lefts_and_rights(sentence, small_model)

    The result should be as follows:

    I [] []
    have [] []
    seldom [] []
    heard ['I', 'have', 'seldom'] ['mention', '.']
    him [] []
    mention ['him'] ['her', 'under']
    her [] []
    under [] ['name']
    any [] []
    other [] []
    name ['any', 'other'] []
    . [] []
  11. We can also see the subtree that the token is in by using this function:
    def print_subtree(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text, [t.text for t in token.subtree])
  12. Let’s use this function on our sentence:
    print_subtree(sentence, small_model)

    The result should be as follows. From the subtrees that each word is part of, we can see the grammatical phrases that appear in the sentence, such as the noun phrase, any other name, and the prepositional phrase, under any other name:

    I ['I']
    have ['have']
    seldom ['seldom']
    heard ['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.']
    him ['him']
    mention ['him', 'mention', 'her', 'under', 'any', 'other', 'name']
    her ['her']
    under ['under', 'any', 'other', 'name']
    any ['any']
    other ['other']
    name ['any', 'other', 'name']
    . ['.']
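The ancestors, children, and subtree views in the steps above can all be derived from a single piece of information: the head of each token. The following is a minimal pure-Python sketch of that idea, not spaCy's implementation. The heads list is transcribed by hand from the children output above, and the helper names are hypothetical.

```python
tokens = ["I", "have", "seldom", "heard", "him", "mention",
          "her", "under", "any", "other", "name", "."]
# heads[i] is the index of the token that token i depends on.
# The root ("heard") points to itself.
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]

def ancestor_indices(i):
    """Follow head links from token i up to the root."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain

def ancestors(i):
    """Texts of the ancestors of token i, nearest first."""
    return [tokens[j] for j in ancestor_indices(i)]

def children(i):
    """Texts of the tokens whose immediate head is token i."""
    return [tokens[j] for j, h in enumerate(heads) if h == i and j != i]

def subtree(i):
    """Token i and all of its descendants, in sentence order."""
    return [tokens[j] for j in range(len(tokens))
            if j == i or i in ancestor_indices(j)]

print(ancestors(tokens.index("name")))   # ['under', 'mention', 'heard']
print(children(tokens.index("heard")))   # ['I', 'have', 'seldom', 'mention', '.']
print(subtree(tokens.index("under")))    # ['under', 'any', 'other', 'name']
```

The printed lists match the corresponding lines in the ancestors, children, and subtree outputs above.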

See also

The dependency parse can be visualized graphically using the displaCy package, which is part of spaCy. Please see Chapter 7, Visualizing Text Data, for a detailed recipe on how to do the visualization.

Extracting noun chunks

Noun chunks are known in linguistics as noun phrases. They represent nouns and any words that depend on and accompany nouns. For example, in the sentence The big red apple fell on the scared cat, the noun chunks are the big red apple and the scared cat. Extracting these noun chunks is instrumental to many other downstream NLP tasks, such as named entity recognition and processing entities and relations between them. In this recipe, we will explore how to extract noun chunks from a text.

Getting ready

We will use the spaCy package, which has a function for extracting noun chunks, and the text from the sherlock_holmes_1.txt file as an example.

How to do it…

Use the following steps to get the noun chunks from a text:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Define the function that will print out the noun chunks. The noun chunks are available through the noun_chunks property of the Doc object:
    def print_noun_chunks(text, model):
        doc = model(text)
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
  3. Read the text from the sherlock_holmes_1.txt file and use the function on the resulting text:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    print_noun_chunks(sherlock_holmes_part_of_text, small_model)

    This is the partial result. See the output of the notebook at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter02/noun_chunks_2.3.ipynb for the full printout. The function gets the pronouns, nouns, and noun phrases that are in the text correctly:

    Sherlock Holmes
    she
    the_ woman
    I
    him
    her
    any other name
    his eyes
    she
    the whole
    …

There’s more…

Noun chunks are spaCy Span objects and have all their properties. See the official documentation at https://spacy.io/api/span.

Let’s explore some properties of noun chunks:

  1. We will define a function that will print out the different properties of noun chunks. It will print the text of the noun chunk, its start and end indices within the Doc object, the sentence it belongs to (useful when there is more than one sentence), the root of the noun chunk (its main word), and the chunk’s similarity to the word emotions. Finally, it will print out the similarity of the whole input sentence to emotions:
    def explore_properties(sentence, model):
        doc = model(sentence)
        other_span = "emotions"
        other_doc = model(other_span)
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
            print("Noun chunk start and end", "\t",
                noun_chunk.start, "\t", noun_chunk.end)
            print("Noun chunk sentence:", noun_chunk.sent)
            print("Noun chunk root:", noun_chunk.root.text)
            print(f"Noun chunk similarity to '{other_span}'",
                noun_chunk.similarity(other_doc))
        print(f"Similarity of the sentence '{sentence}' to "
            f"'{other_span}':",
            doc.similarity(other_doc))
  2. Set the sentence to All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
  3. Use the explore_properties function on the sentence using the small model:
    explore_properties(sentence, small_model)

    This is the result:

    All emotions
    Noun chunk start and end    0    2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.4026421588260174
    his cold, precise but admirably balanced mind
    Noun chunk start and end    11    19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' -0.036891259527462
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.03174900767577446

    You will also see a warning message similar to this one due to the fact that the small model does not ship with word vectors of its own:

    /tmp/ipykernel_1807/2430050149.py:10: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Span.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
      print(f"Noun chunk similarity to '{other_span}'", noun_chunk.similarity(other_doc))
  4. Now, let’s apply the same function to the same sentence with the large model:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
    explore_properties(sentence, large_model)

    The large model does come with its own word vectors and does not result in a warning:

    All emotions
    Noun chunk start and end    0    2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.6302678068015664
    his cold, precise but admirably balanced mind
    Noun chunk start and end    11    19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' 0.5744456705692561
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.640366414527618

    We see that the All emotions noun chunk is more similar to the word emotions (about 0.63) than the his cold, precise but admirably balanced mind noun chunk is (about 0.57).
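To give some intuition for these numbers: by default, spaCy's similarity is a cosine similarity between vectors, and the vector of a Span or Doc defaults to the average of its token vectors. The sketch below illustrates that computation in plain Python. The three-dimensional vectors are made up for illustration and are an assumption; they are not taken from any spaCy model, so the printed values will not match the results above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Made-up toy vectors; real model vectors have hundreds of dimensions.
toy_vectors = {
    "all": [0.2, 0.1, 0.0],
    "emotions": [0.9, 0.8, 0.1],
    "mind": [0.1, 0.7, 0.9],
}

# A multi-token chunk is represented by the average of its token vectors.
chunk_vector = average([toy_vectors["all"], toy_vectors["emotions"]])

print(cosine_similarity(chunk_vector, toy_vectors["emotions"]))
print(cosine_similarity(toy_vectors["mind"], toy_vectors["emotions"]))
```

Because a chunk containing the word emotions shares part of its averaged vector with emotions itself, it tends to score higher than an unrelated chunk.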

Important note

A larger spaCy model, such as en_core_web_lg, takes up more space but is generally more accurate.

See also

The topic of semantic similarity will be explored in more detail in Chapter 3.

Extracting the subjects and objects of the sentence

Sometimes, we might need to find the subject and direct objects of the sentence, and that is easily accomplished with the spaCy package.

Getting ready

We will be using the dependency tags from spaCy to find subjects and objects. The code uses the spaCy engine to parse the sentence. Then, the subject function loops through the tokens, and if the dependency tag contains subj, it returns that token’s subtree, a Span object. There are different subject tags, including nsubj for regular subjects and nsubjpass for subjects of passive sentences, thus we want to look for both.

How to do it…

We will use the subtree attribute of tokens to find the complete noun chunk that is the subject or direct object of the verb (see the Getting the dependency parse recipe). We will define functions to find the subject, direct object, dative phrase, and prepositional phrases:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. We will use two functions to find the subject and the direct object of the sentence. These functions will loop through the tokens and return the subtree that contains the token with subj or dobj in the dependency tag, respectively. Here is the subject function. It looks for the token that has a dependency tag that contains subj and then returns the subtree that contains that token. There are several subject dependency tags, including nsubj and nsubjpass (for the subject of a passive sentence), so we look for the most general pattern:
    def get_subject_phrase(doc):
        for token in doc:
            if ("subj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  3. Here is the direct object function. It works similarly to get_subject_phrase but looks for the dobj dependency tag instead of a tag that contains subj. If the sentence does not have a direct object, it will return None:
    def get_object_phrase(doc):
        for token in doc:
            if ("dobj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  4. Assign a list of sentences to a variable, loop through them, and use the preceding functions to print out their subjects and objects:
    sentences = [
        "The big black cat stared at the small dog.",
        "Jane watched her brother in the evenings.",
        "Laura gave Sam a very interesting book."
    ]
    for sentence in sentences:
        doc = small_model(sentence)
        subject_phrase = get_subject_phrase(doc)
        object_phrase = get_object_phrase(doc)
        print(sentence)
        print("\tSubject:", subject_phrase)
        print("\tDirect object:", object_phrase)

    The result will be as follows. Since the first sentence does not have a direct object, None is printed out. For the sentence The big black cat stared at the small dog, the subject is the big black cat and there is no direct object (the small dog is the object of the preposition at). For the sentence Jane watched her brother in the evenings, the subject is Jane and the direct object is her brother. In the sentence Laura gave Sam a very interesting book, the subject is Laura and the direct object is a very interesting book:

    The big black cat stared at the small dog.
      Subject: The big black cat
      Direct object: None
    Jane watched her brother in the evenings.
      Subject: Jane
      Direct object: her brother
    Laura gave Sam a very interesting book.
      Subject: Laura
      Direct object: a very interesting book

There’s more…

We can look for other objects, for example, the dative objects of verbs such as give and objects of prepositional phrases. The functions will look very similar, with the main difference being the dependency tags: dative for the dative object function, and pobj for the prepositional object function. The prepositional object function will return a list since there can be more than one prepositional phrase in a sentence:

  1. The dative object function checks the tokens for the dative tag. It returns None if there are no dative objects:
    def get_dative_phrase(doc):
        for token in doc:
            if ("dative" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  2. We can also combine the subject, object, and dative functions into one with an argument that specifies which object to look for:
    def get_phrase(doc, phrase):
        # phrase is matched as a substring of the dependency tag,
        # for example "subj", "dobj", or "dative"
        for token in doc:
            if (phrase in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  3. Let us now define a sentence with a dative object and run the function for all three types of phrases:
    sentence = "Laura gave Sam a very interesting book."
    doc = small_model(sentence)
    subject_phrase = get_phrase(doc, "subj")
    object_phrase = get_phrase(doc, "dobj")
    dative_phrase = get_phrase(doc, "dative")
    print(sentence)
    print("\tSubject:", subject_phrase)
    print("\tDirect object:", object_phrase)
    print("\tDative object:", dative_phrase)

    The result will be as follows. The dative object is Sam:

    Laura gave Sam a very interesting book.
      Subject: Laura
      Direct object: a very interesting book
      Dative object: Sam
  4. Here is the prepositional object function. It returns a list of objects of prepositions, which will be empty if there are none:
    def get_prepositional_phrase_objs(doc):
        prep_spans = []
        for token in doc:
            if ("pobj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                prep_spans.append(doc[start:end])
        return prep_spans
  5. Let’s define a list of sentences and run the two functions on them:
    sentences = [
        "The big black cat stared at the small dog.",
        "Jane watched her brother in the evenings."
    ]
    for sentence in sentences:
        doc = small_model(sentence)
        subject_phrase = get_phrase(doc, "subj")
        object_phrase = get_phrase(doc, "dobj")
        dative_phrase = get_phrase(doc, "dative")
        prepositional_phrase_objs = \
            get_prepositional_phrase_objs(doc)
        print(sentence)
        print("\tSubject:", subject_phrase)
        print("\tDirect object:", object_phrase)
        print("\tPrepositional phrases:", prepositional_phrase_objs)

    The result will be as follows:

    The big black cat stared at the small dog.
      Subject: The big black cat
      Direct object: None
      Prepositional phrases: [the small dog]
    Jane watched her brother in the evenings.
      Subject: Jane
      Direct object: her brother
      Prepositional phrases: [the evenings]

    There is one prepositional phrase in each sentence. In the sentence The big black cat stared at the small dog, it is at the small dog, and in the sentence Jane watched her brother in the evenings, it is in the evenings.
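    Because get_phrase tests whether its phrase argument is a substring of token.dep_, a short argument such as "obj" matches pobj as well as dobj, whereas "dobj" matches only direct objects. The following minimal sketch shows the difference without running a model. The deps list is written by hand and is assumed to reflect the tags spaCy typically assigns to this sentence, and first_match is a hypothetical helper that mirrors only the matching condition.

```python
# Dependency tags for "The big black cat stared at the small dog.",
# written by hand; in the recipe they come from token.dep_.
words = ["The", "big", "black", "cat", "stared",
         "at", "the", "small", "dog", "."]
deps = ["det", "amod", "amod", "nsubj", "ROOT",
        "prep", "det", "amod", "pobj", "punct"]

def first_match(deps, phrase):
    """Return the index of the first tag that contains phrase, or None."""
    for i, dep in enumerate(deps):
        if phrase in dep:
            return i
    return None

print(first_match(deps, "obj"))    # 8: "pobj" contains "obj"
print(first_match(deps, "dobj"))   # None: there is no direct object
print(first_match(deps, "subj"))   # 3: "nsubj" contains "subj"
```

    Substring matching is convenient for "subj", which is meant to cover both nsubj and nsubjpass, but for objects it is safer to pass the full tag.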

It is left as an exercise for you to find the actual prepositional phrases with prepositions intact instead of just the noun phrases that are dependent on these prepositions.

Finding patterns in text using grammatical information

In this section, we will use the spaCy Matcher object to find patterns in the text. We will use the grammatical properties of the words to create these patterns. For example, we might be looking for verb phrases instead of noun phrases. We can specify grammatical patterns to match verb phrases.

Getting ready

We will be using the spaCy Matcher object to specify and find patterns. It can match on many different token properties, not just grammatical ones. You can find out more in the documentation at https://spacy.io/usage/rule-based-matching/.

How to do it…

The steps for this recipe are as follows:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Import the Matcher object and initialize it. We need to pass in the vocabulary object, which must be the same as the vocabulary of the model we will use to process the text:
    from spacy.matcher import Matcher
    matcher = Matcher(small_model.vocab)
  3. Create a list of patterns and add them to the matcher. Each pattern is a list of dictionaries, where each dictionary describes a token. In our patterns, we only specify the part of speech for each token. We then add these patterns to the Matcher object. The patterns we will be using are a verb by itself (for example, paints), an auxiliary followed by a verb (for example, was observing), an auxiliary followed by an adjective (for example, were late), and an auxiliary followed by a verb and a preposition (for example, were staring at). This is not an exhaustive list; feel free to come up with other examples:
    patterns = [
        [{"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "ADJ"}],
        [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}]
    ]
    matcher.add("Verb", patterns)
  4. Read in the small part of the Sherlock Holmes text and process it using the small model:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    doc = small_model(sherlock_holmes_part_of_text)
  5. Now, we find the matches using the Matcher object and the processed text. We then loop through the matches and print out the match ID, the string ID (the identifier of the pattern), the start and end of the match, and the text of the match:
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = small_model.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)

    The result will be as follows:

    14677086776663181681 Verb 14 15 heard
    14677086776663181681 Verb 17 18 mention
    14677086776663181681 Verb 28 29 eclipses
    14677086776663181681 Verb 31 32 predominates
    14677086776663181681 Verb 43 44 felt
    14677086776663181681 Verb 49 50 love
    14677086776663181681 Verb 63 65 were abhorrent
    14677086776663181681 Verb 80 81 take
    14677086776663181681 Verb 88 89 observing
    14677086776663181681 Verb 94 96 has seen
    14677086776663181681 Verb 95 96 seen
    14677086776663181681 Verb 103 105 have placed
    14677086776663181681 Verb 104 105 placed
    14677086776663181681 Verb 114 115 spoke
    14677086776663181681 Verb 120 121 save
    14677086776663181681 Verb 130 132 were admirable
    14677086776663181681 Verb 140 141 drawing
    14677086776663181681 Verb 153 154 trained
    14677086776663181681 Verb 157 158 admit
    14677086776663181681 Verb 167 168 adjusted
    14677086776663181681 Verb 171 172 introduce
    14677086776663181681 Verb 173 174 distracting
    14677086776663181681 Verb 178 179 throw
    14677086776663181681 Verb 228 229 was

The code finds some of the verb phrases in the text. Sometimes, it finds a partial match that is part of another match: for example, seen (tokens 95 to 96) is reported alongside has seen (tokens 94 to 96). Weeding out these partial matches is left as an exercise.
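One way to weed out the partial matches is to keep the longest matches and drop anything that overlaps a match that has already been kept. The sketch below works on plain (match_id, start, end) tuples, so it does not need spaCy at all. The function name drop_partial_matches is hypothetical, and the sample tuples are copied from the output above:

```python
def drop_partial_matches(matches):
    """Keep the longest matches; drop any match overlapping one already kept.

    `matches` is a list of (match_id, start, end) tuples, the shape that
    the Matcher returns.
    """
    # Longest first; ties are broken by the earlier start position.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept = []
    covered = set()  # token positions already claimed by a kept match
    for match_id, start, end in ordered:
        positions = set(range(start, end))
        if positions & covered:
            continue  # overlaps a longer match, so skip it
        covered |= positions
        kept.append((match_id, start, end))
    # Restore document order.
    return sorted(kept, key=lambda m: m[1])


sample = [
    (14677086776663181681, 94, 96),    # has seen
    (14677086776663181681, 95, 96),    # seen
    (14677086776663181681, 103, 105),  # have placed
    (14677086776663181681, 104, 105),  # placed
    (14677086776663181681, 114, 115),  # spoke
]
filtered = drop_partial_matches(sample)
print(filtered)
# [(14677086776663181681, 94, 96), (14677086776663181681, 103, 105), (14677086776663181681, 114, 115)]
```

spaCy also offers built-in options that serve a similar purpose, such as spacy.util.filter_spans for Span objects and the greedy="LONGEST" argument of matcher.add; check the documentation for the version you are using.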

See also

We can use other attributes apart from parts of speech. It is possible to match on the text itself, its length, whether it is alphanumeric, the punctuation, the word’s case, the dep_ and morph attributes, lemma, entity type, and others. It is also possible to use regular expressions in the patterns. For more information, see the spaCy documentation: https://spacy.io/usage/rule-based-matching.

Key benefits

  • Leverage ready-to-use recipes with the latest LLMs, including Mistral, Llama, and OpenAI models
  • Use LLM-powered agents for custom tasks and real-world interactions
  • Gain practical, in-depth knowledge of transformers and their role in implementing various NLP tasks with open-source and advanced LLMs
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Harness the power of Natural Language Processing to overcome real-world text analysis challenges with this recipe-based roadmap written by two seasoned NLP experts with vast experience transforming various industries with their NLP prowess. You’ll be able to make the most of the latest NLP advancements, including large language models (LLMs), and leverage their capabilities through Hugging Face transformers. Through a series of hands-on recipes, you’ll master essential techniques such as extracting entities and visualizing text data. The authors will expertly guide you through building pipelines for sentiment analysis, topic modeling, and question-answering using popular libraries like spaCy, Gensim, and NLTK. You’ll also learn to implement RAG pipelines to draw out precise answers from a text corpus using LLMs. This second edition expands your skillset with new chapters on cutting-edge LLMs like GPT-4, Natural Language Understanding (NLU), and Explainable AI (XAI), fostering trust and transparency in your NLP models. By the end of this book, you'll be equipped with the skills to apply advanced text processing techniques, use pre-trained transformer models, and build custom NLP pipelines to extract valuable insights from text data to drive informed decision-making.

Who is this book for?

This updated edition of the Python Natural Language Processing Cookbook is for data scientists, machine learning engineers, and developers with a background in Python. Whether you’re looking to learn NLP techniques, extract valuable insights from textual data, or create foundational applications, this book will equip you with basic to intermediate skills. No prior NLP knowledge is necessary to get started. All you need is familiarity with basic programming principles. For seasoned developers, the updated sections offer the latest on transformers, explainable AI, and Generative AI with LLMs.

What you will learn

  • Understand fundamental NLP concepts along with their applications using examples in Python
  • Classify text quickly and accurately with rule-based and supervised methods
  • Train NER models and perform sentiment analysis to identify entities and emotions in text
  • Explore topic modeling and text visualization to reveal themes and relationships within text
  • Leverage Hugging Face and OpenAI LLMs to perform advanced NLP tasks
  • Use question-answering techniques to handle both open and closed domains
  • Apply XAI techniques to better understand your model predictions

Product Details

Publication date: Sep 13, 2024
Length: 312 pages
Edition: 2nd
Language: English
ISBN-13: 9781803245744

Table of Contents

12 Chapters
Chapter 1: Learning NLP Basics
Chapter 2: Playing with Grammar
Chapter 3: Representing Text – Capturing Semantics
Chapter 4: Classifying Texts
Chapter 5: Getting Started with Information Extraction
Chapter 6: Topic Modeling
Chapter 7: Visualizing Text Data
Chapter 8: Transformers and Their Applications
Chapter 9: Natural Language Understanding
Chapter 10: Generative AI and Large Language Models
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
5 out of 5 stars (4 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Amazon Customer, Oct 26, 2024, 5 stars (Amazon verified review)
This book is a remarkable resource for anyone eager to dive into the world of Natural Language Processing (NLP). Authored by two seasoned NLP experts, this book offers a recipe-based approach that effectively addresses real-world text analysis challenges. One of the main strengths of this edition is its focus on the latest advancements in NLP, particularly large language models (LLMs) like GPT-4, and the practical application of these technologies using Hugging Face transformers. The authors provide a wealth of hands-on recipes that guide readers through essential techniques, such as entity extraction, sentiment analysis, and topic modeling, using popular libraries like spaCy, Gensim, and NLTK. This practical approach makes complex concepts accessible, allowing both beginners and seasoned developers to enhance their skills. The addition of new chapters on Natural Language Understanding (NLU) and Explainable AI (XAI) enriches the content, fostering a deeper understanding of model transparency and trustworthiness, an increasingly important aspect of AI applications. By the end of the book, readers will be well-equipped to build custom NLP pipelines and apply advanced techniques to extract valuable insights from text data.

Om S, Oct 15, 2024, 5 stars (Amazon verified review)
The Python Natural Language Processing Cookbook offers a hands-on, recipe-based approach to mastering NLP techniques, making it a valuable resource for both beginners and experienced developers. This second edition stands out by introducing the latest in Large Language Models (LLMs), such as GPT-4, Mistral, and Llama, while covering foundational NLP concepts like text classification, topic modeling, and information extraction. With practical examples using popular tools like Hugging Face and OpenAI, the book excels in showcasing how to implement LLM-powered agents and advanced NLP tasks. The new chapters on transformers, explainable AI, and natural language understanding (NLU) make it particularly relevant for anyone eager to dive into cutting-edge NLP technologies. Whether you're just starting out or looking to enhance your expertise, this book provides clear, actionable insights into NLP's evolving landscape.

Advitya Gemawat, Oct 16, 2024, 5 stars (Amazon verified review)
Throughout the book, I appreciated the infusion of both traditional NLP and GenAI concepts in almost every chapter of the book, which helps in ‘grounding’ GenAI concepts in foundational knowledge (get the pun? :)) Here're my top takeaways from the book: 📖 The traditional NLP content in Chapters 1 and 3-6 closely mirrors what I studied and utilized during college, making it a relevant resource for college students in similar NLP/AI classes. 🏫 The dedicated Chapter 8 on Transformers is likely my top pick as a common topic for interview prep across multiple levels of recent grads and experienced AI practitioners. 📊 Among all the content on LLMs, the 2 topics that especially stood out to me were the coding examples on: (1) Running an LLM locally, for quick experimentation and for college students with limited access to cloud resources (2) Building an Agent workflow: The ease of initializing built-in tools to perform actions like internet search as a reasoning step is honestly quite refreshing, due to the democratization of typical Function Calling capabilities that it demonstrates. More than anything, it's especially fun to get a refresher on foundational concepts and then position these in context of using modern tools that help build workflows at a higher level of abstraction.

SA, Sep 13, 2024, 5 stars (Amazon verified review)
This book is a practical guide for solving NLP problems using Python. The book contains over 60 easy-to-follow recipes that help readers learn everything from basic NLP tasks like tokenization and lemmatization to more advanced topics, including transformers and large language models. The book is useful for a wide range of readers, from data scientists to software developers, because it explains concepts clearly and provides code examples that can be used right away. One of the highlights is that it covers the latest trends, like GPT models and transformers, which makes the book relevant in today’s fast-changing NLP field. Even though the book covers complex topics, it keeps the explanations simple, making it easy to understand and apply. The focus on Python and its libraries, like spaCy and Hugging Face, may limit its appeal for those looking for more general NLP approaches across different platforms. Overall, I found this book to be an excellent resource, motivating me to work on modern NLP projects with Python. It serves both as a learning guide and a useful reference for practical applications.