You're reading from Natural Language Processing with AWS AI Services Derive strategic insights from unstructured data with Amazon Textract and Amazon Comprehend

Product type Paperback

Published in Nov 2021

Publisher Packt

ISBN-13 9781801812535

Length 508 pages

Edition 1st Edition

Authors (2):

Mona M

Premkumar Rangarajan

Table of Contents (23) Chapters

Preface

1. Section 1: Introduction to AWS AI NLP Services

2. Chapter 1: NLP in the Business Context and Introduction to AWS AI Services

3. Chapter 2: Introducing Amazon Textract

4. Chapter 3: Introducing Amazon Comprehend

5. Section 2: Using NLP to Accelerate Business Outcomes

6. Chapter 4: Automating Document Processing Workflows

7. Chapter 5: Creating NLP Search

8. Chapter 6: Using NLP to Improve Customer Service Efficiency

9. Chapter 7: Understanding the Voice of Your Customer Analytics

10. Chapter 8: Leveraging NLP to Monetize Your Media Content

11. Chapter 9: Extracting Metadata from Financial Documents

12. Chapter 10: Reducing Localization Costs with Machine Translation

13. Chapter 11: Using Chatbots for Querying Documents

14. Chapter 12: AI and NLP in Healthcare

15. Section 3: Improving NLP Models in Production

16. Chapter 13: Improving the Accuracy of Document Processing Workflows

17. Chapter 14: Auditing Named Entity Recognition Workflows

18. Chapter 15: Classifying Documents and Setting up Human in the Loop for Active Learning

19. Chapter 16: Improving the Accuracy of PDF Batch Processing

20. Chapter 17: Visualizing Insights from Handwritten Content

21. Chapter 18: Building Secure, Reliable, and Efficient NLP Solutions

22. Other Books You May Enjoy

Overcoming the challenges in building NLP solutions

We read earlier that the main difference between the algorithms used for regular programming and those used for ML is the ability of ML algorithms to modify their processing based on the input data fed to them. In the NLP context, as in other areas of ML, these differences add significant value and accelerate enterprise business outcomes. Consider, for example, a book publishing organization that needs to create an intelligent search capability displaying book recommendations to users based on topics of interest they enter.

Traditionally, you would need multiple teams to go through the entire book collection, read each book, identify keywords, phrases, topics, and other relevant information, create an index that associates book titles, authors, and genres with these keywords, and link this index to the search capability. This is a massive effort that takes months or years to set up, depending on the size of the collection, the number of people, and their skill levels, and the accuracy of the index is prone to human error. As books are updated to newer editions, and as books are added or removed, this effort would have to be repeated incrementally. It is also a significant cost and time investment that may deter many organizations unless the time and resources have already been budgeted for.

To bring in a semblance of automation in our previous example, we need the ability to digitize text from documents. However, this is not the only requirement, as we are interested in deriving context-based insights from the books to power a recommendations index for a reader. And if we are talking about, for example, a publishing house such as Packt, with 7,500+ books in its collection, we need a solution that not only scales to process large numbers of pages, but also understands relationships in text, and provides interpretations based on semantics, grammar, word tokenization, and language to create smart indexes. We will cover a detailed walkthrough of this solution, along with code samples and demo videos, in Chapter 5, Creating NLP Search.
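To make the indexing problem concrete, here is a minimal sketch of a keyword-only index of the kind described above. The book titles, descriptions, stop-word list, and function names are all hypothetical and chosen purely for illustration; this is not the solution built in Chapter 5.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(books):
    """Map each keyword to the set of book titles whose description contains it."""
    index = defaultdict(set)
    for title, description in books.items():
        for token in tokenize(description):
            index[token].add(title)
    return index


def search(index, query):
    """Return the titles that contain every keyword in the query, sorted by title."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


# Made-up sample data (assumption, not from the book).
books = {
    "Cloud Basics": "An introduction to storage and compute in the cloud",
    "Text Mining": "Extracting topics and sentiment from unstructured text",
}
index = build_index(books)

print(search(index, "unstructured text"))  # exact keyword match
print(search(index, "opinions"))           # related idea, but no shared keyword
```

The second query returns nothing even though a book about sentiment is clearly relevant to it. That gap between matching keywords and understanding meaning is why the solution needs to interpret semantics rather than only tokenize words.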

Today's enterprises are grappling with deriving meaningful insights from their data, primarily due to the pace at which it is growing. Until a decade or so ago, most organizations used relational databases for all their data management needs, and some still do even today. This was fine because data volumes were in the single-digit terabytes or less. In the last few years, the technology landscape has witnessed a significant upheaval, with smartphones becoming ubiquitous, the large-scale proliferation of connected devices (in the billions), the ability to dynamically scale infrastructure in size and into new geographies, and storage and compute costs becoming cheaper due to the democratization offered by the cloud. All of this means that applications get used more often, have much larger user bases, have more processing power and capabilities, and can accelerate their pace of innovation with faster go-to-market cycles. As a result, they need to store and manage petabytes of data. This, coupled with application users demanding faster response times and higher throughput, has put a strain on the performance of relational databases, fueling a move toward purpose-built databases such as Amazon DynamoDB, a key-value and document database that delivers single-digit millisecond latency at any scale.

While this move signals a positive trend, what is more interesting is how enterprises utilize this data to gain strategic insights. After all, data is only as useful as the information we can glean from it. We see many organizations, while accepting the benefits of purpose-built tools, implementing these changes in silos. So, there are varying levels of maturity in properly harnessing the advantages of data. Some departments use an S3 data lake (https://aws.amazon.com/products/storage/data-lake-storage/) to source data from disparate sources and run ML to derive context-based insights, others are consolidating their data in purpose-built databases, while the rest are still using relational databases for all their needs.

You can see a basic explanation of the main components of a data lake in Figure 1.4, An example of an Amazon S3 data lake:

Figure 1.4 – An example of an Amazon S3 data lake

Let's see how NLP can continue to add business value in this situation by referring back to our book publishing example. Suppose we successfully built our smart indexing solution, and now we need to update it with book reviews received via Twitter feeds. The searchable index should provide book recommendations based on review sentiment (for example, don't recommend a book if more than 50% of its reviews in the last 3 months are negative). Traditionally, business insights are generated by running a suite of reports on behemoth data warehouses that collect, mine, and organize data into marts and dimensions. A tweet may not even be under consideration as a data source. These days, things have changed, and mining social media data is an important aspect of generating insights. Setting up business rules to examine every tweet is a time-consuming and compute-intensive task. Furthermore, since a tweet is unstructured text, a slight change in semantics may impact the effectiveness of the solution.
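The recommendation rule in the example can be written down in a few lines once each review already carries a sentiment label. The following is a rough sketch under several assumptions: the function name, the 90-day approximation of "3 months", the `"NEGATIVE"` label string, and the choice to recommend a book that has no recent reviews are all illustrative and not taken from the book.

```python
from datetime import date, timedelta


def should_recommend(reviews, today, window_days=90, max_negative_share=0.5):
    """Decide whether a book stays in the recommendations.

    reviews: iterable of (review_date, sentiment_label) tuples.
    Returns False when more than max_negative_share of the reviews
    inside the window are labeled "NEGATIVE".
    """
    cutoff = today - timedelta(days=window_days)
    recent = [label for day, label in reviews if cutoff <= day <= today]
    if not recent:
        # Assumption: with no recent reviews there is no reason to suppress the book.
        return True
    negative = sum(1 for label in recent if label == "NEGATIVE")
    return negative / len(recent) <= max_negative_share


# Made-up sample reviews for illustration.
today = date(2021, 11, 1)
reviews = [
    (date(2021, 10, 20), "NEGATIVE"),
    (date(2021, 10, 1), "NEGATIVE"),
    (date(2021, 9, 15), "POSITIVE"),
    (date(2021, 1, 5), "POSITIVE"),  # older than the window, so ignored
]
print(should_recommend(reviews, today))
```

The rule itself is simple. The hard part, as the paragraph above notes, is producing a reliable sentiment label for every unstructured tweet in the first place.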

Now, consider model training. Accurate NLP models are typically built on the deep learning architecture called the Transformer (please see https://www.packtpub.com/product/transformers-for-natural-language-processing/9781800565791), which performs sequence-to-sequence processing without needing to process the tokens in order, resulting in a higher degree of parallelization. Transformer model families can have billions of parameters, and training them requires clusters of instances for distributed learning, which adds to time and costs.

AWS offers AI services that allow you, with just a few lines of code, to add NLP to your applications for the sentiment analysis of unstructured text at an almost limitless scale and immediately take advantage of the immense potential waiting to be discovered in unstructured text. We will cover AWS AI services in more detail from Chapter 2, Introducing Amazon Textract, onward.
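As a taste of what "a few lines of code" can look like, here is a minimal sketch that wraps the Amazon Comprehend `detect_sentiment` operation. The wrapper function and its name are hypothetical. The client is passed in as a parameter so that the sketch does not assume any particular credentials or Region setup.

```python
def detect_review_sentiment(comprehend_client, text, language_code="en"):
    """Return the overall sentiment label and per-class scores for a piece of text.

    comprehend_client is expected to be a boto3 Amazon Comprehend client,
    or any object exposing a compatible detect_sentiment method.
    """
    response = comprehend_client.detect_sentiment(
        Text=text,
        LanguageCode=language_code,
    )
    # "Sentiment" is a label such as POSITIVE, NEGATIVE, NEUTRAL, or MIXED;
    # "SentimentScore" holds the confidence score for each of those classes.
    return response["Sentiment"], response["SentimentScore"]


# Example usage, assuming boto3 is installed and AWS credentials are configured:
#
#   import boto3
#   client = boto3.client("comprehend")
#   label, scores = detect_review_sentiment(client, "I could not put this book down!")
```

Calling the function requires an AWS account with permission to use Amazon Comprehend, and each request is billed by the service.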

In this section, we reviewed some challenges organizations encounter when building NLP solutions, such as complexities in digitizing paper-based text, understanding patterns from structured and unstructured data, and how resource-intensive these solutions can be. Let's now understand why NLP is an important mainstream technology for enterprises today.