A chatbot is a software system that converses with people via text or speech. Chatbots are used for various purposes, including customer service, enterprise operations, and healthcare. Depending on their goals, chatbots fall into two main types: task-oriented bots and chitchat bots. Task-oriented bots aim to finish specific tasks by interacting with people, such as booking a flight ticket for someone, while chitchat bots behave more like human beings—their goal is to respond to users' messages smoothly, just as in chitchat between people.
A chatbot is the jewel in the crown of NLP. Building chatbot applications is challenging, and we rarely find the same patterns used everywhere, from both technology and business perspectives. Here, we try to clear the fog and introduce some common processes for developing task-oriented chatbots focused on vertical domains. Open-domain chitchat chatbots are also very important and interesting, but they are not within the scope of this book.
In the next section, we will discuss the advantages of chatbots in the business domain.
Is a chatbot really necessary?
Before we dive deep into the technology, we should ask ourselves the following question after looking at client requirements: do we really need a chatbot?
If you go to McDonald's, you have probably seen the automatic order system. It has a big touchscreen with some big buttons and pictures. It supports multiple payment methods and requires customers to go through only a few intuitive steps to buy the food they want. Nowadays, many McDonald's outlets have only one or two employees at the counter who handle customers paying with cash, and most customers are already quite used to the automatic order system.
This is an example of a user interface (UI) requirement that deals with a single, clear customer goal through a few intuitive steps. Similar examples include purchasing movie tickets, booking train or plane tickets, booking hotel rooms, and buying coffee or food. Although many of these are commonly used as chatbot examples, especially in academic research, we have to understand that for such cases a chatbot may not be the best choice compared to a big touchscreen with picture buttons.
The UI scenarios in which a chatbot has a certain advantage are listed here:
- Customer service in vertical domains where customers generate a large number of similar questions and requirements. Goals are clear or semi-clear, and customers potentially need help and guidance to understand their own needs.
- Customer service where the chatbot owns domain expert knowledge (for example, a knowledge graph) and rich question-answering experience (historical customer service conversation data) and can solve customer problems within minutes.
- If the chatbot cannot eventually solve the customer's problem, it should collect as much information as possible and switch to manual customer service with all that information.
In many scenarios, the 10 most frequently asked questions can already solve the majority of the general problems customers have. The advantage of using a chatbot is that it can automatically retrieve customer profiles, instantly search large knowledge bases, conduct multiple rounds of conversation, and quickly give personalized solutions according to user needs.
Some example scenarios in which a chatbot may have an advantage are listed here:
- Hospital reception or medical consulting
- Online shopping customer service
- After-sales service
- Investment consulting
- Bank services
We have already seen many chatbot applications in the preceding scenarios. However, there is still a long way to go for chatbot applications to work in real life.
In the next section, we will learn about the theoretical principles of chatbots.
Introduction to chatbot architecture
In the early days, chatbots were mainly based on templates and rules. An example is Artificial Intelligence Markup Language (AIML). AIML is quite powerful. It can extract important information from users' questions by rules, and it can run scripts to fetch information through an external application programming interface (API) to enrich the answers. There is a chatbot called Artificial Linguistic Internet Computer Entity (Alicebot) that is based on AIML, and it contains more than 40,000 categories of data, which effectively constitutes a huge rule-based knowledge base.
An advantage of using rules is that we can achieve high precision. However, there is also an obvious disadvantage: the same question can be phrased in many alternative ways, and even the best rules will only cover part of them. Take a weather bot as an example—a user can ask about the weather in hundreds of different ways. Also, maintenance becomes very difficult once more and more rules are written into the system: contradicting rules creep in easily, and many times a change in business logic means rewriting a good part of all the rules.
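To make this limitation concrete, here is a minimal rule-based sketch in Python. The weather bot, its rules, and the function names are hypothetical, invented purely for illustration:

import re

# Two hand-written rules for a hypothetical weather intent
WEATHER_RULES = [
    re.compile(r"what('s| is) the weather (like )?in (?P<city>\w+)", re.IGNORECASE),
    re.compile(r"weather forecast for (?P<city>\w+)", re.IGNORECASE),
]

def match_weather(question):
    """Return the extracted city if any rule matches, otherwise None."""
    for rule in WEATHER_RULES:
        match = rule.search(question)
        if match:
            return match.group("city")
    return None

print(match_weather("What is the weather in Berlin?"))    # Berlin
print(match_weather("Do I need an umbrella in Berlin?"))  # None: no rule covers this phrasing

Every new phrasing needs a new rule, which is exactly the maintenance burden described previously.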
Another way to build a chatbot is to have a huge QA database. When a user question comes in, the system calculates the similarity between that question and all the questions in the database, chooses the most similar one, and gives the corresponding answer. There are many similar tasks in competitions held by Zhihu and Quora: those websites do not want users to raise many duplicated questions, so they match new questions against existing questions and alert users when there is a high chance of duplication. Techniques such as skip-thought, which calculates sentence embeddings, were invented to tackle this sentence-similarity problem.
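As a rough sketch of this retrieval approach, the following Python snippet uses TF-IDF vectors and cosine similarity in place of skip-thought embeddings; the QA pairs are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny invented QA database
qa_pairs = [
    ("How do I reset my password?", "Click 'Forgot password' on the login page."),
    ("What are your opening hours?", "We are open 9 a.m. to 6 p.m., Monday to Friday."),
    ("How can I track my order?", "Use the tracking link in your confirmation email."),
]

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform([q for q, _ in qa_pairs])

def answer(user_question):
    # Compare the new question against every stored question
    user_vector = vectorizer.transform([user_question])
    scores = cosine_similarity(user_vector, question_matrix)[0]
    return qa_pairs[scores.argmax()][1]

print(answer("I forgot my password, what should I do?"))

A real system would use stronger sentence embeddings, but the retrieve-the-most-similar logic stays the same.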
Recently, the mainstream process for building a chatbot has become unified. It mainly consists of the following five modules (a conceptual sketch of how they fit together follows the list):
- ASR to convert user speech into text
- NLU to interpret user input
- DM to decide on the next action with respect to the current dialogue status
- Natural language generation (NLG) to generate text-based responses to the user
- TTS to convert text output into voice
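The following is a conceptual Python sketch of how these five modules fit together in a single turn of conversation. Each module is reduced to a bare function signature, which is of course a heavy simplification:

# Each module as a placeholder function; real implementations are far more complex
def asr(audio: bytes) -> str: ...                  # speech -> text
def nlu(text: str) -> dict: ...                    # text -> {"intent": ..., "entities": [...]}
def dm(state: dict, nlu_result: dict) -> str: ...  # dialogue state -> next action
def nlg(action: str, state: dict) -> str: ...      # action -> response text
def tts(text: str) -> bytes: ...                   # text -> speech

def handle_turn(audio: bytes, state: dict) -> bytes:
    """One full turn of a voice-based conversation."""
    text = asr(audio)
    nlu_result = nlu(text)
    action = dm(state, nlu_result)
    response = nlg(action, state)
    return tts(response)

For a text-based chatbot, ASR and TTS are simply dropped from the pipeline.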
In this book, we mainly focus on NLU and DM.
Here, we briefly introduce each of the modules.
Automatic Speech Recognition (ASR)
ASR converts human speech into corresponding text. There are many open source and commercial solutions for ASR, but we are not covering them in this book.
Natural Language Understanding (NLU)
NLU interprets text-based user input. It recognizes the intent and the relevant entities in a user's input. The NLU module classifies a user's question at the sentence level, obtaining the user's intent through intent classification. It also recognizes the key entities at the word level in a user's question and performs slot filling.
For multi-domain dialogue systems, there is an additional task before intent classification and NER: domain classification. Domain classification predicts which domain (topic) the user wants to talk about—for example, is the user talking about the music domain ("Play Michael Jackson's Billie Jean"), the navigation domain ("Navigate to Carrefour"), or the radio domain ("Turn on radio 106.6 FM")? Of course, domain classification is unnecessary for single-domain dialogue systems that focus on only one domain. Since the Rasa framework is designed for single-domain dialogue systems, it does not include a domain classification feature. In this book, we will focus on how to implement a single-domain dialogue system using Rasa.
Here is a simple example of intent classification and NER. A user inputs I want to eat pizza. The NLU module can quickly recognize that the user's intent is Restaurant Search and that the key entity is pizza. With the intent and key entities, the following DM module can make queries against the backend database to extract the target information, or continue more rounds of conversation to fill the other missing slots and complete the question.
From an NLP and ML point of view, intent recognition is a typical text classification task, and slot filling is a typical ER task.
Both tasks need labeled data. Here is an example of the labels. It consists of intents such as greet, affirm, restaurant_search, and medical. The restaurant_search intent also contains a food type of entity, and the medical intent also contains a disease type of entity. In reality, we will need much more labeled data to be able to train a working model.
Here are some training data samples used by the Rasa framework (we will introduce this in the next section). The data format clearly shows that it contains text and labels:
{
  "common_examples": [
    {
      "text": "Hello",
      "intent": "greet",
      "entities": []
    },
    {
      "text": "Good Morning",
      "intent": "greet",
      "entities": []
    },
    {
      "text": "Where can I find a place for ramen?",
      "intent": "restaurant_search",
      "entities": [
        {
          "start": 29,
          "end": 34,
          "value": "ramen",
          "entity": "food"
        }
      ]
    },
    {
      "text": "I'm having a fever. What medicine should I take?",
      "intent": "medical",
      "entities": [
        {
          "start": 13,
          "end": 18,
          "value": "fever",
          "entity": "disease"
        }
      ]
    }
  ]
}
At first glance, this seems very similar to the rule-based AIML data. In fact, we use this labeled data to train a much more complicated ML model. Such a model can generalize to far more scenarios than a rule-based system—for example, we give pizza and ramen as examples of food; when the user inputs cake or salad, a good NLU system should be able to label them as food entities as well.
The user input text first needs to go through NLP preprocessing, such as sentence splitting, tokenization, part-of-speech (POS) labeling, and so on. For certain applications, it is also important to perform coreference resolution, replacing pronouns with the full names they refer to in order to reduce ambiguity.
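As an illustration, the following snippet runs a few of these preprocessing steps with the spaCy library; it assumes the small English model has been installed via python -m spacy download en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I'm having a fever. What medicine should I take?")

# Sentence splitting
for sentence in doc.sents:
    print(sentence.text)

# Tokenization and POS labeling
for token in doc:
    print(token.text, token.pos_)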
Then, we need to do feature engineering and model training. Traditionally, there can be many manually engineered features, such as number_of_tokens, symbols_in_between, and bag_of_words_in_between. We then apply traditional ML classification algorithms, such as linear classifiers or support-vector machines (SVMs), to do intent classification, and traditional sequence labeling models, such as a hidden Markov model (HMM) or CRF, to do ER. Alternatively, we can use word2vec to perform unsupervised learning (UL) on a large corpus, embedding the hidden features of words into word embeddings, and feed these into DNN models such as convolutional neural networks (CNNs) or RNNs to do intent classification and ER.
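Here is a minimal sketch of the traditional route for intent classification, combining TF-IDF features with a linear SVM in scikit-learn. The tiny training set mirrors the intents shown earlier and is invented for illustration; a real system needs far more labeled data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# A tiny labeled dataset, mirroring the intents shown earlier
texts = [
    "Hello",
    "Good Morning",
    "Where can I find a place for ramen?",
    "I want to eat pizza",
    "I'm having a fever. What medicine should I take?",
]
intents = ["greet", "greet", "restaurant_search", "restaurant_search", "medical"]

# TF-IDF features feeding a linear SVM classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, intents)

print(model.predict(["Any good place for salad?"]))  # likely: restaurant_search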
By training a model, we can achieve higher recall so that the system covers more kinds of user input. We can also use the rule-based modules mentioned before to generate new features from those high-precision rules, helping us train a better ML model. The whole architecture is illustrated in the following diagram:
Figure 1.8 – A complex NLU system
Later, we will see how Rasa works in its NLU module to implement NLP in an efficient and open style.
Dialogue Management (DM)
DM decides the system's next action according to the conversation so far. DM is the control center of the human-machine conversation process and is particularly important for multi-turn, task-oriented dialogue systems. The main task of the DM module is to coordinate and manage the whole conversation flow. By analyzing and maintaining the context, the DM module decides whether the user's intent is clear enough and whether the information in the entity slots is sufficient to start database queries or perform the corresponding actions.
When the DM module thinks the information from user input is not complete or too ambiguous, it will start managing a multi-turn conversation context and keep prompting the user to get more information or provide the user with possible items to choose from. DM is responsible for storing and maintaining the current conversation status, the user's action history, the system's action history, and potential results from the knowledge base. When DM decides that it has clearly got all the information needed, it then converts the user's request into a corresponding query into the database (for example, a knowledge graph) to search for the right information or act to complete the task (for example, checking out for shopping, calling a friend's number with Siri, or pulling up a curtain with smart home devices).
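A minimal rule-based DM for the restaurant-search example could look like the following sketch; the slot names and action names are invented for illustration:

# Slots that must be filled before we can query the database
REQUIRED_SLOTS = ["food", "location"]

def decide_next_action(state):
    """Return the next system action given the current dialogue state."""
    for slot in REQUIRED_SLOTS:
        if state.get(slot) is None:
            return f"ask_{slot}"        # prompt the user for the missing slot
    return "query_restaurant_database"  # all slots filled: act on the request

state = {"food": "pizza", "location": None}
print(decide_next_action(state))   # ask_location

state["location"] = "downtown"
print(decide_next_action(state))   # query_restaurant_database

Real DM modules also track action history and handle corrections, digressions, and ambiguity, but this fill-slots-then-act loop is the core idea.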
The following diagram shows the workflow and functions of DM:
Figure 1.9 – DM in the dialogue system
In real-life use cases, DM is responsible for many small tasks and is highly customized according to product requirements. Many implementations of DM use a rule-based system, and such a system is not easy to either code or maintain. In recent work, including Rasa, people have started to model the DM status as a sequence labeling (SL) task. Some advanced work makes use of deep RL, where a user-simulation module is added. We will see later how Rasa implements the DM module in an easy and elegant way with Rasa Core.
Natural Language Generation (NLG)
NLG converts the agent's response into human-readable text. There are mainly two ways of doing this: template-based methods and DL-based methods. Template-based methods create simple responses without much flexibility; however, as the templates are written by humans, they generally read very naturally. DL-based methods can generate flexible and personalized responses; however, as these are automatically generated by DNNs, it is difficult to control the quality and stability of the results. In real situations, people tend to use the template-based method and add functionality (for example, choosing randomly from a pool of templates) for more flexibility.
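The following sketch shows this common pattern: slot values are filled into hand-written templates, and one template is chosen randomly from a pool to add variety. The templates, action name, and slot names are invented for illustration:

import random

# A pool of hand-written templates per system action
TEMPLATES = {
    "confirm_restaurant": [
        "I found {name}, a {food} place near {location}.",
        "How about {name}? They serve {food} around {location}.",
    ],
}

def generate(action, slots):
    template = random.choice(TEMPLATES[action])  # random choice adds variety
    return template.format(**slots)

print(generate("confirm_restaurant",
               {"name": "Luigi's", "food": "pizza", "location": "downtown"}))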
NLG is almost the last challenging mile in human-machine interaction. For a chitchat bot, we normally apply a Seq2Seq generative model to a large corpus and directly generate a response to the user's input. However, this does not normally work for a task-oriented customer service chatbot in a vertical domain: users need accurate and concise responses to their inquiries. We are still working toward the day when we have enough data to train a model that generates text almost indistinguishable from that of a real human being—perhaps models such as GPT-3 already achieve this.
Still, most of the current NLG modules use rule-based templates. This is like the reverse operation for slot filling, to fill results into a template and generate a response to users. More advanced works also use DL to automatically generate templates with slots based on training data.
There is also work that tries to use DL to train an end-to-end (E2E) task-oriented chatbot. Some researchers have tried to convert each of the NLU, DM, and NLG modules into DL modules; some also add a user simulation to train an E2E RL model. Another important line of academic research is memory networks. A memory network is similar to Seq2Seq: it encodes the entire knowledge base into a complicated DNN and then combines this with encoded questions to decode target answers. This approach has been applied to machine reading tasks such as the Stanford Question Answering Dataset (SQuAD) competition, with great results. For task-oriented chatbots, this is still pioneering work that needs to be tested.
Text to Speech (TTS)
TTS converts normal language text into speech. TTS has been developed over many years, and there are mature solutions in the industry that are production-ready. In real-life use cases, as with ASR, we tend to use the TTS engine or service provided by professional vendors. We will not cover TTS in this book.
So far, we have covered the necessary background on chatbots. It's now time to do something real. In the next chapter, we will introduce the basics of the Rasa framework, a conversational AI framework built for real production use.