LLM Output Evaluation
Regardless of the form factor of your intelligent application, you must evaluate your use of large language models (LLMs). Evaluating a computational system means measuring its performance, gauging its reliability, and assessing its security and privacy.
AI systems are non-deterministic: you cannot be certain what one will output until you run an input through it. You must therefore evaluate how the system performs across a variety of inputs to gain confidence that it meets your requirements. Robust evaluations also let you change the AI system without introducing unexpected regressions, catching them before you release to customers.
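A minimal sketch of such an evaluation harness, in Python. The model call here is a hypothetical stub (`call_model`) standing in for a real LLM client; the case list, check functions, and pass-rate logic are illustrative assumptions, not a specific library's API.

```python
# Minimal eval-harness sketch. `call_model` is a stub standing in
# for a real LLM call; swap in your actual client.

def call_model(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned answer."""
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
    }
    return canned.get(prompt, "I don't know")

# Each case pairs an input with a check on the output. Checks are
# functions, so they can express exact matches, substrings, etc.
EVAL_CASES = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Capital of France?", lambda out: "Paris" in out),
]

def run_evals(cases):
    """Run every case; return the pass rate and the failing cases."""
    failures = []
    for prompt, check in cases:
        output = call_model(prompt)
        if not check(output):
            failures.append((prompt, output))
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures

if __name__ == "__main__":
    rate, fails = run_evals(EVAL_CASES)
    print(f"pass rate: {rate:.0%}, failures: {fails}")
```

Running a suite like this before and after each change to the system is what surfaces regressions prior to release.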
In LLM-powered intelligent applications, evaluations measure the effect of components such as the chosen model, any hyperparameters used with it (such as temperature), the prompting strategy, and retrieval-augmented generation (RAG).
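To make the effect of such a component concrete, here is a small sketch that runs one eval suite under two temperature settings and compares pass rates. The stub model and its temperature behavior are invented for illustration; a real comparison would call your actual model with each configuration.

```python
# Sketch: measuring the effect of a hyperparameter (temperature)
# on one eval suite. The model is a stub whose high-temperature
# behavior imitates wordier, more variable sampling.

def call_model(prompt: str, temperature: float) -> str:
    """Stub: low temperature returns the exact answer; high
    temperature returns a wordier paraphrase."""
    answers = {"Capital of France?": "Paris"}
    answer = answers.get(prompt, "unsure")
    return answer if temperature < 0.5 else f"I believe it is {answer}."

# A strict exact-match check, deliberately sensitive to phrasing.
CASES = [("Capital of France?", lambda out: out.strip() == "Paris")]

def pass_rate(cases, temperature):
    passed = sum(check(call_model(p, temperature)) for p, check in cases)
    return passed / len(cases)

if __name__ == "__main__":
    for temp in (0.0, 0.9):
        print(f"temperature={temp}: pass rate {pass_rate(CASES, temp):.0%}")
```

The same suite scoring differently across configurations is exactly the signal evaluations exist to provide; it also shows why checks should match what you actually require (exact answers versus answers containing the right fact).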