Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Python Natural Language Processing

You're reading from   Python Natural Language Processing Advanced machine learning and deep learning techniques for natural language processing

Arrow left icon
Product type Paperback
Published in Jul 2017
Publisher Packt
ISBN-13 9781787121423
Length 486 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Jalaj Thanaki Jalaj Thanaki
Author Profile Icon Jalaj Thanaki
Jalaj Thanaki
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Introduction FREE CHAPTER 2. Practical Understanding of a Corpus and Dataset 3. Understanding the Structure of a Sentences 4. Preprocessing 5. Feature Engineering and NLP Algorithms 6. Advanced Feature Engineering and NLP Algorithms 7. Rule-Based System for NLP 8. Machine Learning for NLP Problems 9. Deep Learning for NLU and NLG Problems 10. Advanced Tools 11. How to Improve Your NLP Skills 12. Installation Guide

Morphological analysis

Here, we are going to explore the basic terminology used in field of morphological analysis. The terminology and concepts will help you when you are solving real-life problems.

What is morphology?

Morphology is branch of linguistics that studies how words can be structured and formed.

What are morphemes?

In linguistics, a morpheme is the smallest meaningful unit of a given language. The important part of morphology is morphemes, which are the basic unit of morphology.

Let's take an example. The word boy consists of single morpheme whereas boys consists of two morphemes; one is boy and the other morpheme -s

What is a stem?

The part of a word that an affix is attached to is called as stem. The word tie is root whereas Untie is stem.

Now, let's understand morphological analysis.

What is morphological analysis?

Morphological analysis is defined as grammatical analysis of how words are formed by using morphemes, which are the minimum unit of meaning.

Generally, morphemes are affixes. Those affixes can be divided into four types:

  • Prefixes, which appear before a stem, such as unhappy
  • Suffixes, which appear after a stem, such as happiness
  • Infixes, which appear inside a stem, such as bumili (this means buy in Tagalog, a language from the Philippines)
  • Circumfixes surround a word. It is attached to the beginning and end of the stem. For example, kabaddangan (this means help in Tuwali Ifugao, another language from the Philippines)

Morphological analysis is used in word segmentation, and Part Of Speech (POS) tagging uses this analysis. I will explain about POS in the Lexical analysis section, so bear with me until we will connect the dots.

Let's take an example to practically explain the concepts that I have proposed. I would like to take the word Unexpected. Refer to Figure 3.3, which gives you an idea about the morphemes and how morphological analysis has taken place:

Figure 3.3: Morphemes and morphological analysis

In Figure 3.3, we have expressed Unexpected as morphemes and performed morphological analysis the morphemes. Here, Un is a Prefix, and ed is a Suffix. Un and ed can be considered as affixes, Unexpect is the Stem.

Let's refer to another important concept and try to relate it to the concept of morphemes. I'm talking about how you define a word. Let's see.

What is a word?

A word can be isolated from a sentence as the single smallest element of a sentence that carries meaning. This smallest single isolated part of a sentence is called a word.

Please refer to the morphemes definition again and try to relate it to the definition of word. The reason why I have told you to do this is that you may confuse words and morphemes, or maybe you are not sure what the difference is between them. It is completely fine that you have thought in this way. They are confusing if you do not understand them properly.

The definitions look similar, but there is a very small difference between words and morphemes. We can see the differences in the following table:

Morpheme

Word

Morphemes can or cannot stand alone. The word cat can stand alone but plural marker -s cannot stand alone. Here cat and -s both are morpheme.

A word can stand alone. So, words are basically free-standing units in sentences.

When a morpheme stands alone then that morpheme is called root because it conveys the meaning of its own, otherwise morpheme mostly takes affixes.

The analysis of what kind of affixes morpheme will take is covered under morphological analysis.

A word can consist of a single morpheme.

For example, cat is a standalone morpheme, but when you consider cats, then the suffix -s is there, which conveys the information that cat is one morpheme and the suffix -s indicates the grammatical information that the given morpheme is the plural form of cat.

For example: Cat is a standalone word.

Cats is also a standalone word.

Classification of morphemes

The classification of morphemes gives us lots of information about how the whole concept of morphological analysis works. Refer to Figure 3.4:

Figure 3.4: Classification of morphemes

There two major part of morphemes.

Free morphemes

Free morphemes can be explained as follows:

  • Free morphemes can stand alone and act as a word. They are also called unbound morphemes or free-standing morphemes.
  • Let's see some of examples:
    • Dog, cats, town, and house.
    • All the preceding words can be used with other words as well. Free morphemes can appear with other words as well. These kinds of words convey meaning that is different if you see the words individually.
      • Let's see examples of that:
        • Doghouse, town hall.
        • Here, the meaning of doghouse is different from the individual meanings of dog and house. The same applies to town hall.

Bound morphemes

Bound morphemes usually take affixes. They are further divided into two classes.

Derivational morphemes

Derivational morphemes are identified when infixes combine with the root and changes either the semantic meaning.

Now, let's look at some examples:

  • Take the word unkind. In this word, un is a prefix and kind is the root. The prefix un acts as a derivational morpheme that changes the meaning of the word kind to its opposite meaning, unkind.
  • Take the word happiness. In this word, -ness is a derivational morpheme and happy is the root word. So, -ness changes happy to happiness. Check the POS tag, happy is an adjective and happiness is a noun. Here, tags that indicate the class of word, such as adjective and noun, are called POS.

Inflectional morphemes

Inflection morphemes are suffixes that are added to a word to assign particular grammatical property to that word. Inflectional morphemes are considered to be grammatical markers that indicate tense, number, POS, and so on. So, in more simple language, we can say that inflectional morphemes are identified as types of morpheme that modify the verb tense, aspect, mood, person, number (singular and plural), gender, or case, without affecting the words meaning or POS.

Here's some examples:

  • In the word dogs, -s changes the number of dog. -s converts dog from singular to plural form of it
  • The word expected contains -ed, which is an inflectional morpheme that modifies the verb tense

Here is the code for generating the stem from morphemes. We are using the nltk and polyglot libraries. You can find the code on this link: https://github.com/jalajthanaki/NLPython/blob/master/ch3/3_1_wordsteam.py

See the code snippets in Figure 3.5 and Figure 3.6:

Figure 3.5: Generating stems from morphemes using NLTK

Now, let's see how the polyglot library has been used refer to Figure 3.6:

Figure 3.6: Generating stems from morphemes using the polyglot library

The output of the code snippet is displayed in Figure 3.7:

Figure 3.7: Output of code snippets in Figure 3.5 and Figure 3.6

What is the difference between a stem and a root?

This could be explained as follows:

Stem

Root

In order to generate a stem, we need to remove affixes from the word

A root cannot be further divided into smaller morphemes

From the stem, we can generate the root by further dividing it

A stem is generated by using a root plus derivational morphemes

The word Untie is stem

The word tie is root

Exercise

  1. Do a morphological analysis like we did in Figure 3.3 for the morphemes in redness, quickly, teacher, unhappy, and disagreement. Define prefixes, suffixes, verbs, and stems.
  2. Generate the stems of the words redness, quickly, teacher, disagreement, reduce, construction, deconstruction, and deduce using the nltk and polyglot libraries.
  3. Generate the stems and roots of disagree, disagreement, historical.

Lexical analysis

Lexical analysis is defined as the process of breaking down a text into words, phrases, and other meaningful elements. Lexical analysis is based on word-level analysis. In this kind of analysis, we also focus on the meaning of the words, phrases, and other elements, such as symbols.

Sometimes, lexical analysis is also loosely described as the tokenization process. So, before discussing tokenization, let's understand what a token is and what a POS tag is.

What is a token?

Tokens are defined as the meaningful elements that are generated by using techniques of lexical analysis.

What are part of speech tags?

A part of speech is a category of words or lexical items that have similar grammatical properties. Words belonging to the same part of speech (POS) category have similar behavior within the grammatical structure of sentences.

In English, POS categories are verb, noun, adjective, adverb, pronoun, preposition, conjunction, interjection, and sometimes numeral, article, or determiner.

Process of deriving tokens

Sentences are formed by stream of words and from a sentence we need to derive individual meaningful chunks which are called the tokens and process of deriving token is called tokenization:

  • The process of deriving tokens from a stream of text has two stages. If you have a lot of paragraphs, then first you need to do sentence tokenization, then word tokenization, and generate the meaning of the tokens.
  • Tokenization and lemmatization are processes that are helpful for lexical analysis. Using the nltk library, we can perform tokenization and lemmatization.
  • Tokenization can be defined as identifying the boundary of sentences or words.
  • Lemmatization can be defined as a process that identifies the correct intended POS and meaning of words that are present in sentences.
  • Lemmatization also includes POS tagging to disambiguate the meaning of the tokens. In this process, the context window is either phrase level or sentence level.

You can find the code at the GitHub link: https://github.com/jalajthanaki/NLPython/tree/master/ch3

The code snippet is shown in Figure 3.8:

Figure 3.8: Code snippet for tokenization

The output of the code in Figure 3.8 is shown in Figure 3.9:

Figure 3.9: Output of tokenization and lemmatization

Difference between stemming and lemmatization

Stemming and lemmatization both of these concepts are used to normalized the given word by removing infixes and consider its meaning. The major difference between these is as shown:

Stemming

Lemmatization

Stemming usually operates on single word without knowledge of the context

Lemmatization usually considers words and the context of the word in the sentence

In stemming, we do not consider POS tags

In lemmatization, we consider POS tags

Stemming is used to group words with a similar basic meaning together

Lemmatization concept is used to make dictionary or WordNet kind of dictionary.

Applications

You must think how this lexical analysis has been used for developing NLP applications. So, here we have listed some of the NLP applications which uses lexical analysis concepts:

  • Lexical analysis such as sentence tokenization and stop word identification is often used in preprocessing.
  • Lexical analysis also used to develop a POS tagger. A POS tagger is a tool that generates POS tags for a stream of text.
lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime