Summary
In this chapter, we learned about various methods for evaluating and monitoring Amazon Bedrock models.
We began by exploring the two primary model evaluation methods offered by Amazon Bedrock: automatic model evaluation and human evaluation. Automatic model evaluation runs an evaluation algorithm script against either a built-in or a custom dataset, scoring metrics such as accuracy, toxicity, and robustness. Human evaluation, on the other hand, brings human reviewers into the evaluation process, ensuring that the models not only deliver accurate results but also that those results align with real-world expectations and requirements.
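To make this concrete, the following is a minimal sketch of starting an automatic evaluation job with the boto3 bedrock client. The job name, IAM role ARN, model identifier, built-in dataset, and S3 output location are illustrative placeholders, and the exact request structure should be verified against the current CreateEvaluationJob API reference:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # placeholder region

# Start an automatic (algorithm-based) model evaluation job.
# All names, ARNs, dataset/metric identifiers, and S3 URIs below are placeholders.
response = bedrock.create_evaluation_job(
    jobName="qa-accuracy-eval-001",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {"name": "Builtin.BoolQ"},  # or point to a custom dataset in S3
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Toxicity",
                        "Builtin.Robustness",
                    ],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-v2",
                    "inferenceParams": '{"temperature": 0.0}',
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-eval-results-bucket/bedrock-eval/"},
)
print(response["jobArn"])  # use this ARN to track the job's status
```

Once such a job completes, the per-metric results are written to the specified S3 location and can also be reviewed in the Amazon Bedrock console.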
Furthermore, we discussed open source tools such as fmeval and Ragas, which provide additional evaluation capabilities specifically tailored to LLMs and RAG pipelines.
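As an illustration of the kind of evaluation these libraries enable, the following sketch scores a single RAG sample with Ragas. It assumes a recent Ragas release that exposes the evaluate function and the faithfulness, answer_relevancy, and context_precision metrics, and it assumes an evaluator ("judge") LLM is already configured for Ragas; the question, contexts, and answers are made-up sample data:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# A tiny, hand-built sample of RAG pipeline outputs (illustrative data only).
sample = Dataset.from_dict(
    {
        "question": ["What is Amazon Bedrock?"],
        "contexts": [
            [
                "Amazon Bedrock is a fully managed service that offers "
                "foundation models through a single API."
            ]
        ],
        "answer": [
            "Amazon Bedrock is a managed AWS service that provides access "
            "to foundation models via a unified API."
        ],
        "ground_truth": [
            "Amazon Bedrock is a managed service for building generative AI "
            "applications with foundation models."
        ],
    }
)

# Score faithfulness, answer relevancy, and context precision for the sample.
results = evaluate(
    sample,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
```

Each metric returns a score between 0 and 1, which makes it straightforward to track the quality of a RAG pipeline across iterations.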
Moving on to the section on monitoring, we discussed how Amazon CloudWatch can be leveraged to gain valuable insights into model performance.
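For example, a short sketch of pulling Bedrock runtime metrics from CloudWatch with boto3 is shown below; the region, model ID, and time window are placeholders, and the metric and dimension names assume the AWS/Bedrock namespace documented by AWS (for example, Invocations and InvocationLatency, keyed by ModelId):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

# Average and maximum invocation latency (in milliseconds) for one model
# over the last 24 hours, with one data point per hour.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InvocationLatency",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-v2"}],  # placeholder model
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), round(point["Maximum"], 2))
```

The same pattern works for other Bedrock metrics, such as invocation counts, throttles, and token usage, and these metrics can also be wired into CloudWatch dashboards and alarms.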