Chapter 1. Tokenizing Text and WordNet Basics
In this chapter, we will cover the following recipes:
- Tokenizing text into sentences
- Tokenizing sentences into words
- Tokenizing sentences using regular expressions
- Training a sentence tokenizer
- Filtering stopwords in a tokenized sentence
- Looking up Synsets for a word in WordNet
- Looking up lemmas and synonyms in WordNet
- Calculating WordNet Synset similarity
- Discovering word collocations
Introduction
The Natural Language Toolkit (NLTK) is a comprehensive Python library for natural language processing and text analytics. Originally designed for teaching, it has been adopted by industry for research and development because of its usefulness and breadth of coverage. NLTK is often used for rapid prototyping of text processing programs, and can even be used in production applications. Demos of select NLTK functionality and production-ready APIs are available at http://text-processing.com.
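As a quick taste of the kind of code the recipes in this chapter will build up, here is a minimal sketch of sentence and word tokenization with NLTK. It assumes NLTK is already installed and that the punkt tokenizer models have been downloaded (for example, by running nltk.download() and selecting the punkt package); the example sentence is just an illustration.

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> para = "Hello World. It's good to see you. Thanks for buying this book."
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
>>> word_tokenize(sent_tokenize(para)[0])
['Hello', 'World', '.']

The first call splits the paragraph into sentences, and the second splits one of those sentences into individual word and punctuation tokens; both steps are covered in detail in the recipes that follow.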
This chapter will cover the basics of tokenizing text and using WordNet, along with filtering stopwords and discovering word collocations.