Understanding why NLP is becoming mainstream
According to this report (https://www.marketsandmarkets.com/Market-Reports/natural-language-processing-nlp-825.html, accessed on March 23, 2021), the global NLP market is expected to grow to USD 35.1 billion by 2026, at a Compound Annual Growth Rate (CAGR) of 20.3% during the forecast period. This is not surprising given the impact ML is making across industries (finance, retail, manufacturing, energy, utilities, real estate, healthcare, and more) and in organizations of every size, driven primarily by the advent of cloud computing and the economies of scale it offers.
This article about Gartner's Emergence Cycle (https://blogs.gartner.com/anthony_bradley/2020/10/07/announcing-gartners-new-emergence-cycle-research-for-ai/), research into emerging NLP technologies (based on submitted patents and covering technology still in labs or only recently released), shows that the most mature use of NLP is multimedia content analysis. This matches our experience: in our discussions with a number of organizations across industries, content analysis to gain strategic insights comes up as a common NLP requirement.
For example, in 2020, when the world was struggling with the effects of the pandemic, a number of organizations adopted AI, and specifically NLP, to predict virus spread patterns, assimilate knowledge on virus behavior and vaccine research, and monitor the effectiveness of safety measures, to name a few use cases. In April 2020, AWS launched an NLP-powered search site called CORD-19 Search (https://cord19.aws/) using an AWS AI service called Amazon Kendra (https://aws.amazon.com/kendra/). The site provides an easy interface for searching the COVID-19 Open Research Dataset using natural language questions. As the dataset is constantly updated with the latest research on COVID-19, CORD-19 Search, thanks to its support for NLP, makes it easy to navigate this ever-expanding collection of research documents and find precise answers to questions. The search results provide not only the specific text that contains the answer to a question but also the original body of text in which that answer is located.
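To give a sense of what a natural language query against a Kendra index looks like programmatically, here is a minimal sketch using the boto3 SDK. This is not the code behind CORD-19 Search itself; the index ID, region, and question are placeholder assumptions, and you would need AWS credentials and an existing Kendra index for it to run:

```python
import boto3

# Assumes AWS credentials are configured and a Kendra index already exists.
# The index ID and region below are illustrative placeholders.
kendra = boto3.client("kendra", region_name="us-east-1")

response = kendra.query(
    IndexId="REPLACE-WITH-YOUR-INDEX-ID",
    QueryText="What is the incubation period of the virus?",
)

# Kendra returns ranked result items; ANSWER items carry extracted answers,
# while DOCUMENT items point back to the source passages.
for item in response["ResultItems"]:
    if item["Type"] in ("ANSWER", "DOCUMENT"):
        print(item["Type"], "-", item["DocumentExcerpt"]["Text"][:200])
```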
Fred Hutchinson Cancer Research Center is a research institute focused on curing cancer by 2025. Matthew Trunnell, Chief Information Officer of Fred Hutchinson Cancer Research Center, has said the following:
For more details and usage examples of Amazon Comprehend and Amazon Comprehend Medical, please refer to Chapter 3, Introducing Amazon Comprehend.
So, how can AI and NLP help us cure cancer or prepare for a pandemic? It's about recognizing patterns where none seem to exist. Unstructured text, such as documents, social media posts, and email messages, is similar to the treasure waiting in Ali Baba's cave. To understand why, let's briefly look at how NLP works.
NLP models are trained by learning what are called word embeddings, which are vector representations of words derived from large collections of documents. These embeddings capture semantic relationships and word distributions, thereby helping to map the context of a word based on its relationship to other words in a document. The two common training architectures for learning word embeddings are Skip-gram and Continuous Bag of Words (CBOW). In Skip-gram, the embedding of the input word is used to predict the distribution of its surrounding context words; in CBOW, the embeddings of the surrounding words are used to predict the word in the middle. Both are neural network-based architectures and work well for context-based analytics use cases.
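To make this concrete, here is a minimal sketch that trains both architectures on a tiny toy corpus using the open source gensim library (not an AWS service). The corpus, hyperparameters, and choice of library are our own illustrative assumptions:

```python
from gensim.models import Word2Vec

# A tiny, pre-tokenized toy corpus; real training uses millions of sentences.
corpus = [
    ["patients", "respond", "well", "to", "the", "new", "vaccine"],
    ["the", "vaccine", "triggers", "a", "strong", "immune", "response"],
    ["rubber", "production", "depends", "on", "healthy", "trees"],
]

# sg=1 trains Skip-gram (predict context words from the center word);
# sg=0 trains CBOW (predict the center word from its context words).
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)
cbow = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=0)

# Each word now maps to a 50-dimensional vector that encodes its usage context.
print(skipgram.wv["vaccine"].shape)           # (50,)
print(cbow.wv.most_similar("vaccine", topn=3))
```

On a corpus this small the similarities are not meaningful, but the mechanics are the same at scale: the learned vectors are what downstream models use to find relationships between words and documents.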
Now that we understand the basics of NLP (analyzing patterns in text by converting words into their vector representations), we can see why training models on text data from disparate sources often surfaces unique insights: because relationships in the text are computed numerically, patterns emerge that stayed hidden when each source was viewed in a narrower context. For example, the Rubber episode of the Amazon Prime TV show This Giant Beast That Is The Global Economy shows how a fungal disease has the potential to devastate the global economy, even though at first there appears to be no link between the two. According to the US National Library of Medicine, natural rubber accounts for 40% of the world's rubber consumption, and the South American Leaf Blight (SALB) fungal disease has the potential to spread worldwide and severely inhibit rubber production. Airplanes can't land without rubber, and its uses are so myriad that a shortage would have unprecedented implications for the economy. This is an example of the kind of hidden connection that ML and NLP models are so good at finding across vast text corpora.
Before AWS and cloud computing revolutionized access to advanced technologies, setting up NLP models for text analytics was challenging to say the least. The most common reasons were as follows:
- Lack of skills: Identifying data, feature engineering, building models, training, and tuning are all tasks that require a unique combination of skills, including software engineering, mathematics, statistics, and data engineering, that only a few practitioners have.
- Initial infrastructure setup cost: ML training is an iterative process, often requiring a trial-and-error approach to tune models to the desired accuracy. Furthermore, training and inference may need GPU acceleration, depending on the volume of data and the number of requests, which demands a high initial investment.
- Scalability with the current on-premises environment: Running ML training and inference on on-premises servers constrains the elasticity required to scale compute and storage to match model size, data volumes, and the inference throughput needed. For example, training large-scale transformer models may require massively parallel clusters, and capacity planning for such scenarios is challenging.
- Availability of tools to help orchestrate the various moving parts of NLP training: As mentioned before, the ML workflow comprises many tasks, such as data discovery, feature engineering, algorithm selection, and model building, which includes training and fine-tuning models several times before finally deploying them into production. Furthermore, getting an accurate model is a highly iterative process. Each of these tasks requires purpose-built tools and expertise to reach the level of efficiency needed for good models, and assembling all of this is not easy.
Not anymore. The AWS AI services for language add speech and text intelligence to applications through API calls, with no need to develop and train your own models. For speech, Amazon Transcribe (https://aws.amazon.com/transcribe/) converts speech to text and Amazon Polly (https://aws.amazon.com/polly/) converts text to speech. For text, Amazon Textract (https://aws.amazon.com/textract/) enables applications to read and process handwritten and printed text from images and PDF documents, and Amazon Comprehend (https://aws.amazon.com/comprehend/) lets applications quickly analyze text and find insights and relationships, with no prior ML experience required. For example, Assent, a supply chain data management company, used Amazon Textract to read forms, tables, and free-form text, and Amazon Comprehend to derive business-specific entities and values from that text. In this book, we will walk you through how to use these services for some popular workflows. For more details, please refer to Chapter 4, Automating Document Processing Workflows.
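As an illustration of how simple these API calls are, here is a minimal sketch that sends a sentence to Amazon Comprehend using the boto3 SDK. The region and sample text are our own assumptions, and you would need AWS credentials configured for it to run:

```python
import boto3

# Assumes AWS credentials are configured; the region is an illustrative choice.
comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "Assent used Amazon Textract and Amazon Comprehend to process supplier documents."

# Detect named entities (organizations, dates, quantities, and so on).
entities = comprehend.detect_entities(Text=text, LanguageCode="en")
for entity in entities["Entities"]:
    print(entity["Type"], "->", entity["Text"], f"(score: {entity['Score']:.2f})")

# Detect the overall sentiment of the text.
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print("Sentiment:", sentiment["Sentiment"])
```

No model building, training, or tuning is involved; the service returns entities and sentiment from a single API call, which is exactly the point of the managed AI services discussed in this book.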
In this section, we saw examples of NLP's significance in solving real-world challenges and looked at what NLP actually involves. We learned that finding patterns in data can bring new meaning to light, and that NLP models are very good at deriving these patterns. We then reviewed some of the technology challenges in NLP implementations and took a brief look at the AWS AI services. In the next section, we will introduce the AWS ML stack and provide a brief overview of each of its layers.