How Well Does It Work? – Evaluation
In this chapter, we will address the question of quantifying how well a natural language understanding (NLU) system works. Throughout this book, we have assumed that we want the NLU systems we develop to do a good job on the tasks they are designed for. However, we haven't yet dealt in detail with the tools that let us tell how well a system actually works – that is, how to evaluate it. This chapter illustrates a number of evaluation techniques that will enable you to measure a system's performance and to compare systems with one another. We will also look at some ways to avoid drawing erroneous conclusions from evaluation metrics.
The topics we will cover in this chapter are as follows:
- Why evaluate an NLU system?
- Evaluation paradigms
- Data partitioning
- Evaluation metrics
- User testing
- Statistical significance of differences
- Comparing three text classification methods
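To give a concrete preview of the workflow these topics build toward, here is a minimal sketch using scikit-learn (assumed here purely for illustration; the 20 Newsgroups data, TF-IDF features, and Naive Bayes classifier are placeholder choices, not the specific systems compared later in the chapter). It partitions the data into training and test sets, trains a simple text classifier, and reports accuracy along with precision, recall, and F1 on the held-out test set.

```python
# A minimal sketch of the evaluation workflow covered in this chapter:
# hold out a test set, train a simple text classifier, and score it.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load a small two-class subset to keep the example quick
data = fetch_20newsgroups(subset="all",
                          categories=["rec.autos", "sci.space"],
                          remove=("headers", "footers", "quotes"))

# Partition the data: 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Train a TF-IDF + multinomial Naive Bayes classifier on the training split only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Evaluate on the held-out test set
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions, target_names=data.target_names))
```

The sections that follow break down each step in this sketch: why we partition the data, which metrics to report, and how to decide whether the differences between classifiers are statistically meaningful.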