Counting nouns – plural and singular nouns
In this recipe, we will do two things: determine whether a noun is plural or singular and turn plural nouns into singular, and vice versa.
You might need these two things for a variety of tasks. For example, you might want to compute word statistics, and for that you will most likely need to count the singular and plural forms of a noun together. To do that, you need a way to recognize whether a word is singular or plural.
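As a minimal sketch of why this matters, the following snippet merges singular and plural counts by keying each noun on a base form. The SINGULAR_FORMS table and the count_nouns function are hypothetical and hand-written for illustration. They stand in for a real lemmatizer only to keep the example self-contained.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer.
SINGULAR_FORMS = {"birds": "bird", "geese": "goose"}


def count_nouns(nouns):
    """Count nouns so that singular and plural forms share one entry."""
    return Counter(SINGULAR_FORMS.get(noun, noun) for noun in nouns)


counts = count_nouns(["bird", "birds", "goose", "geese", "road"])
print(counts)
```

Without the mapping step, bird and birds would be counted as two unrelated words.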
Getting ready
To determine whether a noun is singular or plural, we will use spaCy via two different methods: by looking at the difference between the lemma and the actual word, and by looking at the morph attribute. To inflect these nouns, that is, to turn singular nouns into plural or vice versa, we will use the textblob package. We will also see how to determine the noun’s number using GPT-3.5 through the OpenAI API. The code for this section is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter02.
How to do it…
We will first use spaCy’s lemma information to infer whether a noun is singular or plural. Then, we will use the morph attribute of Token objects. We will then create a function that uses one of those methods. Finally, we will use GPT-3.5 to determine each noun’s number:
- Run the file and language utility notebooks. If you run into an error saying that the small or large model does not exist, open the lang_utils.ipynb file, uncomment the statement that downloads the model, and run it:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
- Initialize the text variable and process it using the spaCy small model to get the resulting Doc object:
    text = "I have five birds"
    doc = small_model(text)
- In this step, we loop through the Doc object. For each token in the object, we check whether it’s a noun and whether the lemma is the same as the word itself. Since the lemma is the base form of the word, we treat a noun whose lemma differs from the word as plural:
    for token in doc:
        if token.pos_ == "NOUN" and token.lemma_ != token.text:
            print(token.text, "plural")
The result should be as follows:
birds plural
- Now, we will check the number of a noun using a different method: the morph features of a Token object. The morph features are the morphological features of a word, such as number, case, and so on. Since we know that the token at index 3 is a noun, we directly access its morph features and get the Number value, which gives the same result as previously:
    doc = small_model("I have five birds.")
    print(doc[3].morph.get("Number"))
Here is the result:
['Plur']
- In this step, we prepare to define a function that returns a list of tuples, each of the form (noun, number). To encode the noun number, we use an Enum class that assigns integer values to names. We assign 1 to singular and 2 to plural. Once we create the class, we can refer to the noun number values as Noun_number.SINGULAR and Noun_number.PLURAL:
    from enum import Enum
    class Noun_number(Enum):
        SINGULAR = 1
        PLURAL = 2
- In this step, we define the function. It takes as input the text, the spaCy model, and the method of determining the noun number. The two methods are lemma and morph, the same two methods we used in steps 3 and 4, respectively. The function outputs a list of tuples, each of the format (<noun text>, <noun number>), where the noun number is expressed using the Noun_number class defined in step 5. Because morph.get("Number") returns a list such as ['Plur'], the morph branch tests whether "Plur" is in that list instead of comparing the list to a string:
    def get_nouns_number(text, model, method="lemma"):
        nouns = []
        doc = model(text)
        for token in doc:
            if token.pos_ == "NOUN":
                if method == "lemma":
                    # A lemma that differs from the word suggests a plural
                    if token.lemma_ != token.text:
                        nouns.append((token.text, Noun_number.PLURAL))
                    else:
                        nouns.append((token.text, Noun_number.SINGULAR))
                elif method == "morph":
                    # morph.get() returns a list, for example ['Plur']
                    if "Plur" in token.morph.get("Number"):
                        nouns.append((token.text, Noun_number.PLURAL))
                    else:
                        nouns.append((token.text, Noun_number.SINGULAR))
        return nouns
- We can use the preceding function to compare different spaCy models. In this step, we use the small spaCy model with the function we just defined. With the lemma method, the small model gets the number of the irregular noun geese wrong, because it returns geese as its own lemma:
    text = "Three geese crossed the road"
    nouns = get_nouns_number(text, small_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, small_model)
    print(nouns)
The second line of output, produced by the lemma method, should be as follows:
[('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
The first line, produced by the morph method, reports geese as plural only if the model tags it with Number=Plur.
- Now, let’s do the same using the large model. If you have not yet downloaded the large model, do so by running the first line. Otherwise, you can comment it out. Here, we see that the lemma method provides the correct answer for geese:
    !python -m spacy download en_core_web_lg
    large_model = spacy.load("en_core_web_lg")
    nouns = get_nouns_number(text, large_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, large_model)
    print(nouns)
The second line of output, produced by the lemma method, should be as follows:
[('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]
As before, the first line depends on the Number value that the model assigns to each noun.
- Let’s now use GPT-3.5 to get the noun number. In the results, we see that GPT-3.5 correctly identifies the number for both geese and road:
    from openai import OpenAI
    client = OpenAI(api_key=OPEN_AI_KEY)
    prompt = """Decide whether each noun in the following text is singular or plural.
    Return the list in the format of a python tuple: (word, number). Do not provide any additional explanations.
    Sentence: Three geese crossed the road."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=256,
        top_p=1.0,
        frequency_penalty=0,
        presence_penalty=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
    )
    print(response.choices[0].message.content)
The result should be as follows:
('geese', 'plural') ('road', 'singular')
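The model’s reply is plain text, so it has to be parsed before it can be compared with the tuples that a function such as get_nouns_number returns. The following is a minimal sketch of one way to do that, assuming the reply follows the (word, number) format requested in the prompt, which the model is not guaranteed to do. The parse_noun_numbers name is hypothetical, and the Noun_number class is repeated so that the snippet runs on its own.

```python
import ast
import re
from enum import Enum


class Noun_number(Enum):
    SINGULAR = 1
    PLURAL = 2


def parse_noun_numbers(reply):
    """Turn a text reply made of (word, number) tuples into (word, Noun_number) pairs."""
    results = []
    # Pick out every parenthesized group, wherever the line breaks fall.
    for group in re.findall(r"\([^()]*\)", reply):
        try:
            word, number = ast.literal_eval(group)
        except (ValueError, SyntaxError, TypeError):
            continue  # skip anything that is not a well-formed two-item tuple
        key = str(number).strip().upper()
        if key in Noun_number.__members__:
            results.append((word, Noun_number[key]))
    return results


reply = "('geese', 'plural') ('road', 'singular')"
parsed = parse_noun_numbers(reply)
print(parsed)
```

Groups that cannot be parsed are skipped instead of raising an error, so a reply with extra commentary still yields whatever tuples it contains.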
There’s more…
We can also change the nouns from plural to singular, and vice versa. We will use the textblob package for that. The package should be installed automatically via the Poetry environment:
- Import the TextBlob class from the package:
    from textblob import TextBlob
- Initialize a list of text variables and process them using the TextBlob class via a list comprehension:
    texts = ["book", "goose", "pen", "point", "deer"]
    blob_objs = [TextBlob(text) for text in texts]
- Use the pluralize function of the object’s words list to get the plural. This function returns a list, and we access its first element. Print the result:
    plurals = [blob_obj.words.pluralize()[0] for blob_obj in blob_objs]
    print(plurals)
The result should be as follows:
['books', 'geese', 'pens', 'points', 'deer']
- Now, we will do the reverse. We use the preceding plurals list to turn the plural nouns into TextBlob objects:
    blob_objs = [TextBlob(text) for text in plurals]
- Turn the nouns into singular using the singularize function and print:
    singulars = [blob_obj.words.singularize()[0] for blob_obj in blob_objs]
    print(singulars)
The result should be the same as the list we started with in step 2:
['book', 'goose', 'pen', 'point', 'deer']
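Comparing the final list with the starting list is a round-trip check, and it can be automated for larger word lists. The sketch below shows the idea with a hypothetical helper, find_round_trip_failures, that accepts any pair of pluralizing and singularizing functions. The toy_pluralize and toy_singularize functions are deliberately naive stand-ins written for this example and are not part of textblob.

```python
def find_round_trip_failures(words, pluralize, singularize):
    """Return (word, plural, back) triples where singularize(pluralize(word)) != word."""
    failures = []
    for word in words:
        plural = pluralize(word)
        back = singularize(plural)
        if back != word:
            failures.append((word, plural, back))
    return failures


# Toy stand-ins, for illustration only.
def toy_pluralize(word):
    return word + "s"


def toy_singularize(word):
    # Naively strips "es" before trying "s", which breaks words such as "notes".
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word


failures = find_round_trip_failures(["book", "pen", "note"], toy_pluralize, toy_singularize)
print(failures)
```

To run the same check against textblob, the two toy functions could be replaced with small wrappers around the pluralize and singularize calls used in the preceding steps.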