Tokenization
Next, we will learn about tokenization, a way of pre-processing text before it enters our models. Tokenization splits text up into smaller parts: this could mean splitting a sentence into its individual words, or splitting a whole document into individual sentences (a sentence-level sketch appears at the end of this section). This is an essential pre-processing step for NLP and can be done fairly simply in Python:
- We first take a basic sentence and split this up into individual words using the word tokenizer in NLTK:
from nltk.tokenize import word_tokenize

# word_tokenize requires the 'punkt' tokenizer data: nltk.download('punkt')
text = 'This is a single sentence.'
tokens = word_tokenize(text)
print(tokens)
This results in the following output:

['This', 'is', 'a', 'single', 'sentence', '.']
- Note how the period (.) is considered a token, as it is a part of natural language. Depending on what we want to do with the text, we may wish to keep or discard the punctuation:

# Keep only purely alphabetic tokens, lowercased
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
print(no_punctuation)
This results in the following output:

['this', 'is', 'a', 'single', 'sentence']
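As mentioned at the start of this section, tokenization can also operate at the sentence level. As a minimal sketch, NLTK's sent_tokenize can split a document into individual sentences; the sample document here is an invented example:

from nltk.tokenize import sent_tokenize

# A small sample document (an invented example) containing several sentences
document = 'This is the first sentence. This is the second. Here is a third!'

# sent_tokenize returns a list of sentence strings; like word_tokenize,
# it relies on the 'punkt' tokenizer data
sentences = sent_tokenize(document)
print(sentences)
# ['This is the first sentence.', 'This is the second.', 'Here is a third!']

Each sentence can then be passed to word_tokenize in turn, giving a list of tokens per sentence, which is a common shape for document-level pre-processing.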