Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Python 3 Text Processing with NLTK 3 Cookbook
Python 3 Text Processing with NLTK 3 Cookbook

Python 3 Text Processing with NLTK 3 Cookbook: , Second Edition

Arrow left icon
Profile Icon Jacob Perkins
Arrow right icon
Free Trial
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.8 (12 Ratings)
Paperback Aug 2014 304 pages 2nd Edition
eBook
S$36.99 S$52.99
Paperback
S$66.99
Subscription
Free Trial
Arrow left icon
Profile Icon Jacob Perkins
Arrow right icon
Free Trial
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.8 (12 Ratings)
Paperback Aug 2014 304 pages 2nd Edition
eBook
S$36.99 S$52.99
Paperback
S$66.99
Subscription
Free Trial
eBook
S$36.99 S$52.99
Paperback
S$66.99
Subscription
Free Trial

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Python 3 Text Processing with NLTK 3 Cookbook

Chapter 2. Replacing and Correcting Words

In this chapter, we will cover the following recipes:

  • Stemming words
  • Lemmatizing words with WordNet
  • Replacing words matching regular expressions
  • Removing repeating characters
  • Spelling correction with Enchant
  • Replacing synonyms
  • Replacing negations with antonyms

Introduction

In this chapter, we will go over various word replacement and correction techniques. The recipes cover the gamut of linguistic compression, spelling correction, and text normalization. All of these methods can be very useful for preprocessing text before search indexing, document classification, and text analysis.

Stemming words

Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of index while increasing retrieval accuracy.

One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words, and its usage in NLTK will be covered in the next section.

Note

The resulting stem is not always a valid word. For example, the stem of cookery is cookeri. This is a feature, not a bug.

How to do it...

NLTK comes with an implementation of the Porter stemming algorithm, which is very easy to use. Simply instantiate the PorterStemmer class and call the stem() method with the word you want to stem:

>&gt...

Lemmatizing words with WordNet

Lemmatization is very similar to stemming, but is more akin to synonym replacement. A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always left with a valid word that means the same thing. However, the word you end up with can be completely different. A few examples will explain this.

Getting ready

Make sure that you have unzipped the wordnet corpus in nltk_data/corpora/wordnet. This will allow the WordNetLemmatizer class to access WordNet. You should also be familiar with the part-of-speech tags covered in the Looking up Synsets for a word in WordNet recipe of Chapter 1, Tokenizing Text and WordNet Basics.

How to do it...

We will use the WordNetLemmatizer class to find lemmas:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
&apos...

Replacing words matching regular expressions

Now, we are going to get into the process of replacing words. If stemming and lemmatization are a kind of linguistic compression, then word replacement can be thought of as error correction or text normalization.

In this recipe, we will replace words based on regular expressions, with a focus on expanding contractions. Remember when we were tokenizing words in Chapter 1, Tokenizing Text and WordNet Basics, and it was clear that most tokenizers had trouble with contractions? This recipe aims to fix this by replacing contractions with their expanded forms, for example, by replacing "can't" with "cannot" or "would've" with "would have".

Getting ready

Understanding how this recipe works will require a basic knowledge of regular expressions and the re module. The key things to know are matching patterns and the re.sub() function.

How to do it...

First, we need to define a number of replacement patterns...

Removing repeating characters

In everyday language, people are often not strictly grammatical. They will write things such as I looooooove it in order to emphasize the word love. However, computers don't know that "looooooove" is a variation of "love" unless they are told. This recipe presents a method to remove these annoying repeating characters in order to end up with a proper English word.

Getting ready

As in the previous recipe, we will be making use of the re module, and more specifically, backreferences. A backreference is a way to refer to a previously matched group in a regular expression. This will allow us to match and remove repeating characters.

How to do it...

We will create a class that has the same form as the RegexpReplacer class from the previous recipe. It will have a replace() method that takes a single word and returns a more correct version of that word, with the dubious repeating characters removed. This code can be found in replacers.py in the...

Introduction


In this chapter, we will go over various word replacement and correction techniques. The recipes cover the gamut of linguistic compression, spelling correction, and text normalization. All of these methods can be very useful for preprocessing text before search indexing, document classification, and text analysis.

Stemming words


Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of index while increasing retrieval accuracy.

One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words, and its usage in NLTK will be covered in the next section.

Note

The resulting stem is not always a valid word. For example, the stem of cookery is cookeri. This is a feature, not a bug.

How to do it...

NLTK comes with an implementation of the Porter stemming algorithm, which is very easy to use. Simply instantiate the PorterStemmer class and call the stem() method with the word you want to stem:

>&gt...

Lemmatizing words with WordNet


Lemmatization is very similar to stemming, but is more akin to synonym replacement. A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always left with a valid word that means the same thing. However, the word you end up with can be completely different. A few examples will explain this.

Getting ready

Make sure that you have unzipped the wordnet corpus in nltk_data/corpora/wordnet. This will allow the WordNetLemmatizer class to access WordNet. You should also be familiar with the part-of-speech tags covered in the Looking up Synsets for a word in WordNet recipe of Chapter 1, Tokenizing Text and WordNet Basics.

How to do it...

We will use the WordNetLemmatizer class to find lemmas:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('cooking')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
>>> lemmatizer.lemmatize('cookbooks...
Left arrow icon Right arrow icon

Description

This book is intended for Python programmers interested in learning how to do natural language processing. Maybe you’ve learned the limits of regular expressions the hard way, or you’ve realized that human language cannot be deterministically parsed like a computer language. Perhaps you have more text than you know what to do with, and need automated ways to analyze and structure that text. This Cookbook will show you how to train and use statistical language models to process text in ways that are practically impossible with standard programming tools. A basic knowledge of Python and the basic text processing concepts is expected. Some experience with regular expressions will also be helpful.

What you will learn

  • Tokenize text into sentences, and sentences into words
  • Look up words in the WordNet dictionary
  • Apply spelling correction and word replacement
  • Access the builtin text corpora and create your own custom corpus
  • Tag words with parts of speech
  • Chunk phrases and recognize named entities
  • Grammatically transform phrases and chunks
  • Classify text and perform sentiment analysis

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Aug 26, 2014
Length: 304 pages
Edition : 2nd
Language : English
ISBN-13 : 9781782167853
Category :
Languages :
Tools :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Aug 26, 2014
Length: 304 pages
Edition : 2nd
Language : English
ISBN-13 : 9781782167853
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just S$6 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just S$6 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total S$ 208.97
Python 3 Text Processing with NLTK 3 Cookbook
S$66.99
Python Data Analysis
S$74.99
Mastering Object-oriented Python
S$66.99
Total S$ 208.97 Stars icon

Table of Contents

11 Chapters
1. Tokenizing Text and WordNet Basics Chevron down icon Chevron up icon
2. Replacing and Correcting Words Chevron down icon Chevron up icon
3. Creating Custom Corpora Chevron down icon Chevron up icon
4. Part-of-speech Tagging Chevron down icon Chevron up icon
5. Extracting Chunks Chevron down icon Chevron up icon
6. Transforming Chunks and Trees Chevron down icon Chevron up icon
7. Text Classification Chevron down icon Chevron up icon
8. Distributed Processing and Handling Large Datasets Chevron down icon Chevron up icon
9. Parsing Specific Data Types Chevron down icon Chevron up icon
A. Penn Treebank Part-of-speech Tags Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Half star icon Empty star icon 3.8
(12 Ratings)
5 star 50%
4 star 16.7%
3 star 8.3%
2 star 8.3%
1 star 16.7%
Filter icon Filter
Top Reviews

Filter reviews by




Sheng-miao Kung Nov 19, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
It's a GREAT book!
Amazon Verified review Amazon
Luis Felipe Yepez Barrios Feb 15, 2019
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Good book
Amazon Verified review Amazon
P.Rotondo Nov 01, 2014
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Finally one of the best cookbook I've never read.Author is highly lucid and consistent in his explanation; He leaves nothing to chance and,when required, He delves into the topic.Material is extremely useful, code is very well designed! (I have read many texts on Python where code was catastrophic…) ; Excellent Support.
Amazon Verified review Amazon
rakesh patra Aug 27, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
It’s a very good information and with lots of hands on code. Really useful for ppl who are in there mid journey on NLP research
Amazon Verified review Amazon
Glenn W. Feb 08, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Informative
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.