Summary
In this chapter, you explored methods for evaluating LLM outputs in your intelligent application. You learned what LLM evaluation is and why it matters. Model benchmarking is a form of evaluation that can help you decide which LLMs to use in your application.
Once your application has functional AI modules, you can build evaluation datasets and run metrics against them to measure performance and track how it changes over time. In addition to automated evaluations, you can perform manual human review to further assess application quality. Finally, you can use reference-free metrics, which need no gold answers, as guardrails within your application.
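As a hypothetical illustration of that last point, the sketch below applies a few reference-free checks to a raw model response before it reaches the user. The function name, refusal phrases, and length threshold are illustrative assumptions, not values from this chapter.

from dataclasses import dataclass, field

@dataclass
class GuardrailResult:
    passed: bool
    reasons: list[str] = field(default_factory=list)

# Hypothetical refusal phrases and length limit, chosen only for this sketch.
REFUSAL_PHRASES = ("i cannot help with that", "as an ai language model")
MAX_CHARS = 4000

def check_response(response: str) -> GuardrailResult:
    """Reference-free checks: they inspect only the output itself, no gold answer needed."""
    reasons = []
    text = response.strip()
    if not text:
        reasons.append("empty response")
    if len(text) > MAX_CHARS:
        reasons.append(f"response exceeds {MAX_CHARS} characters")
    if any(phrase in text.lower() for phrase in REFUSAL_PHRASES):
        reasons.append("response looks like a refusal")
    return GuardrailResult(passed=not reasons, reasons=reasons)

# Example: gate the response before returning it to the user.
result = check_response("Here is the itinerary you asked for...")
if not result.passed:
    print("Guardrail triggered:", result.reasons)

Because checks like these need only the output, they can run on every request in production rather than only against a curated evaluation dataset.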
In the next chapter, you will learn how to optimize the semantic data model to enhance retrieval accuracy and overall performance.