Evaluating production systems
Developing a prototype ChatGPT application with a RAG component is relatively straightforward; preparing it for production and maintaining it effectively is not, because both demand consistent evaluation. Using LLMs to assist in that evaluation is an obvious approach, but it has pitfalls: LLM judges exhibit positional bias, prefer certain answer styles, and produce variable results from one run to the next, so a more structured approach is needed. The use case and the nature of the agent also shape how you evaluate. A simple RAG question-answering system, for example, calls for a different evaluation method than a more conversational agent.
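To make this concrete, here is a minimal sketch (not from the text) of one common mitigation: present the two candidate answers to the judge in both orders and aggregate several runs before trusting the verdict. The prompt wording and the judge_fn wrapper, any callable that sends a prompt to an LLM and returns its text reply, are assumptions for illustration.

```python
from collections import Counter
from typing import Callable

# Hypothetical judging prompt; the exact wording is an assumption for illustration.
JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one word: A, B, or TIE."""

def judge_pair(
    judge_fn: Callable[[str], str],  # assumed wrapper: sends a prompt to your LLM, returns its reply
    question: str,
    answer_1: str,
    answer_2: str,
    n_trials: int = 4,
) -> str:
    """Compare two answers while countering positional bias and run-to-run variance."""
    votes = Counter()
    for trial in range(n_trials):
        # Alternate the presentation order so neither answer always sits in slot A.
        swapped = trial % 2 == 1
        first, second = (answer_2, answer_1) if swapped else (answer_1, answer_2)
        prompt = JUDGE_PROMPT.format(question=question, answer_a=first, answer_b=second)
        verdict = judge_fn(prompt).strip().upper()
        # Map the positional label back to the original answer identity.
        if verdict == "A":
            votes["answer_2" if swapped else "answer_1"] += 1
        elif verdict == "B":
            votes["answer_1" if swapped else "answer_2"] += 1
        else:
            votes["tie"] += 1
    # Demand a clear majority; otherwise report a tie rather than a noisy single-run call.
    winner, count = votes.most_common(1)[0]
    return winner if count > n_trials / 2 else "tie"
```

Swapping the presentation order cancels the bias toward whichever answer appears first, and requiring a majority across repeated runs damps the judge's nondeterminism; answer-style preference is usually addressed separately, by pinning the judging criteria (for example, factuality only) in the prompt.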
As in any machine learning (ML) project, you should evaluate the RAG pipeline’s performance against a validation dataset using a well-defined evaluation metric. This involves assessing the components...