You're reading from Hands-On Natural Language Processing with PyTorch 1.x: Build smart, AI-driven linguistic applications using deep learning and NLP techniques

Product type Paperback

Published in Jul 2020

Publisher Packt

ISBN-13 9781789802740

Length 276 pages

Edition 1st Edition

Author (1):

Thomas Dop

Text preprocessing

Textual data comes in a variety of formats and styles. It may be structured and readable, or raw and unstructured. It may contain punctuation and symbols that we don't want in our models, or HTML and other non-textual formatting. This is a particular concern when scraping text from online sources. Before text can be fed into an NLP model, we must preprocess it so that the data is clean and in a standard format. In this section, we will illustrate some of these preprocessing steps in more detail.

Removing HTML

When scraping text from online sources, you may find that your text contains HTML markup and other non-textual artifacts. We generally do not want these in the inputs to our NLP models, so they should be removed by default. For example, in HTML, the <b> tag indicates that the text it encloses should be displayed in bold font. However...
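
As a rough illustration of removing HTML, the following minimal sketch strips tags with a regular expression and decodes HTML entities using only the Python standard library. The helper name strip_html and the sample string are hypothetical, chosen for this example, and the pattern is a simple heuristic rather than a full HTML parser.

```python
import re
from html import unescape

# Heuristic pattern: anything between "<" and the next ">" is treated as a tag.
# This is an assumption that holds for simple markup, not a complete HTML grammar.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove HTML tags, then decode entities such as &amp;."""
    no_tags = TAG_PATTERN.sub("", text)
    return unescape(no_tags)


sample = "This is <b>bold</b> text &amp; more."
print(strip_html(sample))  # This is bold text & more.
```

Because the pattern treats any "<...>" span as a tag, it can also delete legitimate text that happens to contain angle brackets (for example, "a < b and c > d"). For messy, real-world pages, a dedicated HTML parser is likely to be a more robust choice.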