Understanding why NLP is becoming mainstream
According to this report (https://www.marketsandmarkets.com/Market-Reports/natural-language-processing-nlp-825.html, accessed on March 23, 2021), the global NLP market is expected to grow to USD 35.1 billion by 2026, at a Compound Annual Growth Rate (CAGR) of 20.3% during the forecast period. This is not surprising given the impact ML is making across industries (finance, retail, manufacturing, energy, utilities, real estate, healthcare, and more) and in organizations of every size, driven primarily by the advent of cloud computing and the economies of scale it offers.
This article about Gartner's Emergence Cycle (https://blogs.gartner.com/anthony_bradley/2020/10/07/announcing-gartners-new-emergence-cycle-research-for-ai/), research into emerging NLP technologies (based on submitted patents and covering technology still in labs or only recently released), shows that the most mature use of NLP is multimedia content analysis. This matches our experience: in our discussions with a number of organizations across industries, content analysis to gain strategic insights comes up as a common NLP requirement.
For example, in 2020, when the world was struggling with the effects of the pandemic, a number of organizations adopted AI, and specifically NLP, to predict virus spread patterns, assimilate knowledge on virus behavior and vaccine research, and monitor the effectiveness of safety measures, to name a few use cases. In April 2020, AWS launched an NLP-powered search site called CORD-19 Search (https://cord19.aws/) using an AWS AI service called Amazon Kendra (https://aws.amazon.com/kendra/). The site provides an easy interface for searching the COVID-19 Open Research Dataset using natural language questions. As the dataset is constantly updated with the latest research on COVID-19, CORD-19 Search, thanks to its support for NLP, makes it easy to navigate this ever-expanding collection of research documents and find precise answers to questions. The search results provide not only the specific text that contains the answer to a question but also the original body of text in which that answer is located.
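To give a sense of what a natural language query against a Kendra index looks like programmatically, here is a minimal sketch using the boto3 SDK. This is not the code behind CORD-19 Search itself; the index ID, region, and question are placeholder assumptions, and you would need AWS credentials and an existing Kendra index for it to run:

```python
import boto3

# Assumes AWS credentials are configured and a Kendra index already exists.
# The index ID and region below are illustrative placeholders.
kendra = boto3.client("kendra", region_name="us-east-1")

response = kendra.query(
    IndexId="REPLACE-WITH-YOUR-INDEX-ID",
    QueryText="What is the incubation period of the virus?",
)

# Kendra returns ranked result items; ANSWER items carry extracted answers,
# while DOCUMENT items point back to the source passages.
for item in response["ResultItems"]:
    if item["Type"] in ("ANSWER", "DOCUMENT"):
        print(item["Type"], "-", item["DocumentExcerpt"]["Text"][:200])
```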
Fred Hutchinson Cancer Research Center is a research institute focused on curing cancer by 2025. Matthew Trunnell, Chief Information Officer of Fred Hutchinson Cancer Research Center, has said the following:
For more details and usage examples of Amazon Comprehend and Amazon Comprehend Medical, please refer to Chapter 3, Introducing Amazon Comprehend.
So, how can AI and NLP help us cure cancer or prepare for a pandemic? It's about recognizing patterns where none seem to exist. Unstructured text, such as documents, social media posts, and email messages, is similar to the treasure waiting in Ali Baba's cave. To understand why, let's briefly look at how NLP works.
NLP models are trained by learning what are called word embeddings, which are vector representations of words derived from large collections of documents. These embeddings capture semantic relationships and word distributions, thereby helping to map the context of a word based on its relationship to other words in a document. The two common training architectures for learning word embeddings are Skip-gram and Continuous Bag of Words (CBOW). In Skip-gram, the embedding of the input word is used to predict the distribution of its surrounding context words; in CBOW, the embeddings of the surrounding words are used to predict the word in the middle. Both are neural network-based architectures and work well for context-based analytics use cases.
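To make this concrete, here is a minimal sketch that trains both architectures on a tiny toy corpus using the open source gensim library (not an AWS service). The corpus, hyperparameters, and choice of library are our own illustrative assumptions:

```python
from gensim.models import Word2Vec

# A tiny, pre-tokenized toy corpus; real training uses millions of sentences.
corpus = [
    ["patients", "respond", "well", "to", "the", "new", "vaccine"],
    ["the", "vaccine", "triggers", "a", "strong", "immune", "response"],
    ["rubber", "production", "depends", "on", "healthy", "trees"],
]

# sg=1 trains Skip-gram (predict context words from the center word);
# sg=0 trains CBOW (predict the center word from its context words).
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)
cbow = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0)

# Each word now maps to a 50-dimensional vector that encodes its usage context.
print(skipgram.wv["vaccine"].shape)           # (50,)
print(cbow.wv.most_similar("vaccine", topn=3))
```

On a corpus this small the similarities are not meaningful, but the mechanics are the same at scale: the learned vectors are what downstream models use to find relationships between words and documents.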
Now that we understand the basics of NLP (analyzing patterns in text by converting words into their vector representations), we can see why training models on text data from disparate sources often surfaces unique insights: because relationships in the text are computed numerically, patterns emerge that stayed hidden when each source was viewed in a narrower context. For example, the Rubber episode of the Amazon Prime TV show This Giant Beast That Is The Global Economy shows how a fungal disease has the potential to devastate the global economy, even though at first there appears to be no link between the two. According to the US National Library of Medicine, natural rubber accounts for 40% of the world's rubber consumption, and the South American Leaf Blight (SALB) fungal disease has the potential to spread worldwide and severely inhibit rubber production. Airplanes can't land without rubber, and its uses are so myriad that a shortage would have unprecedented implications for the economy. This is an example of the kind of hidden connection that ML and NLP models are so good at finding across vast text corpora.
Before AWS and cloud computing revolutionized access to advanced technologies, setting up NLP models for text analytics was challenging to say the least. The most common reasons were as follows:
- Lack of skills: Identifying data, feature engineering, building models, training, and tuning are all tasks that require a unique combination of skills, including software engineering, mathematics, statistics, and data engineering, that only a few practitioners have.
- Initial infrastructure setup cost: ML training is an iterative process, often requiring a trial-and-error approach to tune models to the desired accuracy. Furthermore, training and inference may need GPU acceleration, depending on the volume of data and the number of requests, which demands a high initial investment.
- Scalability with the current on-premises environment: Running ML training and inference on on-premises servers constrains the elasticity required to scale compute and storage to match model size, data volumes, and the inference throughput needed. For example, training large-scale transformer models may require massively parallel clusters, and capacity planning for such scenarios is challenging.
- Availability of tools to help orchestrate the various moving parts of NLP training: As mentioned before, the ML workflow comprises many tasks, such as data discovery, feature engineering, algorithm selection, and model building, which includes training and fine-tuning models several times before finally deploying them into production. Furthermore, getting an accurate model is a highly iterative process. Each of these tasks requires purpose-built tools and expertise to reach the level of efficiency needed for good models, and assembling all of this is not easy.
Not anymore. The AWS AI services for language add speech and text intelligence to applications through API calls, with no need to develop and train your own models. For speech, Amazon Transcribe (https://aws.amazon.com/transcribe/) converts speech to text and Amazon Polly (https://aws.amazon.com/polly/) converts text to speech. For text, Amazon Textract (https://aws.amazon.com/textract/) enables applications to read and process handwritten and printed text from images and PDF documents, and Amazon Comprehend (https://aws.amazon.com/comprehend/) lets applications quickly analyze text and find insights and relationships, with no prior ML experience required. For example, Assent, a supply chain data management company, used Amazon Textract to read forms, tables, and free-form text, and Amazon Comprehend to derive business-specific entities and values from that text. In this book, we will walk you through how to use these services for some popular workflows. For more details, please refer to Chapter 4, Automating Document Processing Workflows.
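As an illustration of how simple these API calls are, here is a minimal sketch that sends a sentence to Amazon Comprehend using the boto3 SDK. The region and sample text are our own assumptions, and you would need AWS credentials configured for it to run:

```python
import boto3

# Assumes AWS credentials are configured; the region is an illustrative choice.
comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "Assent used Amazon Textract and Amazon Comprehend to process supplier documents."

# Detect named entities (organizations, dates, quantities, and so on).
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
for entity in entities["Entities"]:
    print(entity["Type"], "->", entity["Text"], f"(score: {entity['Score']:.2f})")

# Detect the overall sentiment of the text.
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print("Sentiment:", sentiment["Sentiment"])
```

No model building, training, or tuning is involved; the service returns entities and sentiment from a single API call, which is exactly the point of the managed AI services discussed in this book.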
In this section, we saw examples of NLP's significance in solving real-world challenges and looked at what NLP actually involves. We learned that finding patterns in data can bring new meaning to light, and that NLP models are very good at deriving these patterns. We then reviewed some of the technology challenges in NLP implementations and took a brief look at the AWS AI services. In the next section, we will introduce the AWS ML stack and provide a brief overview of each of its layers.