Summary
In this chapter, you learned about several important topics in evaluating NLU systems. You learned how to separate data into different subsets for training and testing, and you learned about the most commonly used NLU performance metrics and tools – accuracy, precision, recall, F1, AUC, and confusion matrices – and how to use them to compare systems. You also learned about related topics, such as comparing systems with ablation, evaluation with shared tasks, statistical significance testing, and user testing.
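As a brief recap of how these metrics relate to one another, the following minimal sketch computes accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix. It is an illustration only, not code from the chapter: the function names `confusion_counts` and `evaluate`, the labels, and the toy data are all hypothetical, and a real project would more likely use an established library.

```python
def confusion_counts(gold, predicted, positive):
    """Count the four cells of a binary confusion matrix.

    `gold` and `predicted` are equal-length lists of labels;
    `positive` is the label treated as the positive class.
    (All names here are hypothetical, chosen for illustration.)
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # true positive
        elif p == positive:
            fp += 1  # false positive
        elif g == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp, fp, fn, tn


def evaluate(gold, predicted, positive):
    """Return accuracy, precision, recall, and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(gold, predicted, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data, invented for this example.
gold = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]
scores = evaluate(gold, predicted, positive="pos")
print(scores)
```

On this toy data, the system makes one false positive and one false negative, so precision and recall are equal and F1 matches both.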
The next chapter begins Part 3 of this book, which covers systems in action – applying NLU at scale. Part 3 starts by looking at what to do if a system isn’t working: if the original model isn’t adequate, or if the system models a real-world situation that changes, what has to be changed? That chapter discusses topics such as adding new data and changing the structure of the application.