User testing
In addition to direct system measurements, systems can be evaluated through user testing, in which test users who are representative of the system’s intended users interact with it.
User testing is time-consuming and expensive, but it is sometimes the only way to assess qualitative aspects of system performance, for example how easy it is for users to complete tasks with a system or how much they enjoy using it. User testing can only address aspects of the system that users can perceive, such as conversations, and users should only be expected to evaluate the system as a whole. For instance, users can’t be expected to reliably discriminate between the performance of the speech recognition and the NLU components of the system.
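As a rough illustration of how such whole-system judgments might be summarized, the following minimal sketch computes a task completion rate and a mean enjoyment rating over a set of test sessions. The `Session` record, its field names, and the 1-to-5 rating scale are assumptions made for this example, not a prescribed format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    """One test user's interaction with the whole system (hypothetical record)."""
    user_id: str
    task_completed: bool  # did the user finish the assigned task?
    enjoyment: int        # self-reported rating, assumed to be on a 1-5 scale


def summarize(sessions):
    """Return whole-system summary measures for a list of sessions."""
    if not sessions:
        raise ValueError("no sessions to summarize")
    # True counts as 1 and False as 0, so the sum is the number of completed tasks.
    completion_rate = sum(s.task_completed for s in sessions) / len(sessions)
    mean_enjoyment = mean(s.enjoyment for s in sessions)
    return {
        "n_users": len(sessions),
        "completion_rate": completion_rate,
        "mean_enjoyment": mean_enjoyment,
    }


if __name__ == "__main__":
    # Illustrative values only, not real results.
    example = [
        Session("u1", True, 4),
        Session("u2", False, 2),
        Session("u3", True, 5),
        Session("u4", True, 3),
    ]
    print(summarize(example))
```

Both measures describe the system as a whole; nothing in these records attributes a failed task to speech recognition or to NLU.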
Carrying out a valid and reliable evaluation with users amounts to running a psychological experiment. This is a complex topic, and it’s easy to make mistakes that...