What is LLM evaluation?
LLM evaluation, or LLM evals, is the systematic process of assessing LLMs and the intelligent applications that use them. This involves profiling their performance on specific tasks, their reliability under certain conditions, their effectiveness in particular use cases, and other criteria that help you understand a model’s overall capabilities. You want to make sure that your intelligent application meets certain standards as measured by your evaluations.
You should also be able to measure how the AI system’s performance evolves as you change components of the application or the data it relies on. For example, if you want to swap the LLM used in your application or modify a prompt, you should be able to measure the impact of those changes with evaluations, as sketched below.
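To make this concrete, here is a minimal, hypothetical sketch of that idea: a small fixed test set, a scoring function, and two stand-in "variants" representing a before-and-after change to the application. The names (`evaluate`, `variant_a`, `variant_b`, `TEST_CASES`) and the substring-based grading are illustrative assumptions, not a specific evaluation framework; real evals typically use curated datasets and task-specific metrics or an LLM judge.

```python
import statistics
from typing import Callable

# A tiny, illustrative test set; in practice you would curate real examples.
TEST_CASES = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
]

def evaluate(llm: Callable[[str], str], test_cases: list[dict]) -> float:
    """Run every test case through the model and return the fraction graded correct.

    Grading here is a simple substring check; real evals often use
    task-specific metrics or an LLM judge instead.
    """
    scores = []
    for case in test_cases:
        answer = llm(case["question"])
        scores.append(1.0 if case["expected"].lower() in answer.lower() else 0.0)
    return statistics.mean(scores)

def variant_a(question: str) -> str:
    # Stand-in for "prompt A + model X"; replace with a real model call.
    return "Paris" if "France" in question else "unsure"

def variant_b(question: str) -> str:
    # Stand-in for "prompt B + model Y" after a change you want to measure.
    return "Paris" if "France" in question else "4"

# Running the same eval on both variants quantifies the impact of the change.
print(f"variant A: {evaluate(variant_a, TEST_CASES):.2f}")  # 0.50
print(f"variant B: {evaluate(variant_b, TEST_CASES):.2f}")  # 1.00
```

Because the test set and scoring stay fixed, the difference in scores can be attributed to the change between variants, which is the basic mechanic behind measuring whether a prompt or model swap helps or hurts.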
Being able to measure the impact of changes is particularly important as the quality of an application improves. Once an intelligent application is “pretty good,” it can be quite challenging to tell whether a further change actually makes it better or quietly makes it worse without systematic evaluation.