Comparing two outputs
A pairwise comparison requires an evaluator, a dataset of inputs, and two or more LLMs, chains, or agents to compare. The per-input results are then aggregated to determine the preferred model.
The evaluation process involves several steps:
- Create the evaluator: Load the evaluator using the `load_evaluator()` function, specifying the type of evaluator (in this case, `pairwise_string`).
- Select the dataset: Load a dataset of inputs using the `load_dataset()` function.
- Define models to compare: Initialize the LLMs, chains, or agents to compare using the necessary configurations. This involves initializing the language model and any additional tools or agents required.
- Generate responses: Generate outputs for each of the models before evaluating them. This is typically done in batches to improve efficiency.
- Evaluate pairs: Evaluate the results by comparing the outputs of different models for each input. This is often done using a random selection...
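
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `model_a`, `model_b`, `judge`, and `compare` are hypothetical stand-ins, and the toy `judge` (which simply prefers the longer answer) takes the place of a real `pairwise_string` evaluator. The sketch also randomizes which output the judge sees first, a common way to reduce position bias; that choice is an assumption rather than something the text above specifies.

```python
import random
from collections import Counter
from typing import Callable, List

Model = Callable[[str], str]


def model_a(question: str) -> str:
    """Hypothetical stand-in for the first LLM, chain, or agent."""
    return f"Short answer to: {question}"


def model_b(question: str) -> str:
    """Hypothetical stand-in for the second LLM, chain, or agent."""
    return f"A longer and more detailed answer to: {question}"


def judge(question: str, first: str, second: str) -> float:
    """Hypothetical stand-in for a pairwise evaluator.

    Returns 1.0 if `first` is preferred, 0.0 if `second` is preferred,
    and 0.5 for a tie. This toy version just prefers the longer text.
    """
    if len(first) == len(second):
        return 0.5
    return 1.0 if len(first) > len(second) else 0.0


def compare(dataset: List[str], a: Model, b: Model, rng: random.Random) -> Counter:
    """Generate outputs for both models, judge each pair, and tally wins."""
    # Generate all responses up front, before any evaluation happens.
    outputs_a = [a(q) for q in dataset]
    outputs_b = [b(q) for q in dataset]

    tally: Counter = Counter()
    for question, out_a, out_b in zip(dataset, outputs_a, outputs_b):
        # Randomly decide which output is shown to the judge first.
        swapped = rng.random() < 0.5
        first, second = (out_b, out_a) if swapped else (out_a, out_b)

        score = judge(question, first, second)
        if score == 0.5:
            tally["tie"] += 1
        else:
            first_won = score == 1.0
            # Map the judge's positional verdict back to the right model.
            a_won = first_won != swapped
            tally["a" if a_won else "b"] += 1
    return tally


dataset = [
    "What is 2 + 2?",
    "Name a primary color.",
    "What is the capital of France?",
]
results = compare(dataset, model_a, model_b, random.Random(0))
print(results)
```

Because the toy judge prefers longer text and `model_b` always produces the longer answer, `model_b` wins every comparison regardless of presentation order. With a real evaluator, the tally would instead reflect the judge's actual preferences across the dataset.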