Building a multimodal agent with LangChain
So far, we’ve covered the main aspects of multimodality and how to achieve it with modern LFMs. As we saw throughout Part 2 of this book, LangChain offers a variety of components that we have leveraged extensively, such as chains, agents, and tools. As a result, we already have all the ingredients we need to start building our multimodal agent.
However, in this chapter, we will adopt three approaches to tackle the problem:
- The agentic, out-of-the-box approach: Here, we will leverage the Azure Cognitive Services toolkit, which offers native integrations with a set of AI models that can be consumed via API and that cover various domains, such as image analysis, audio, and OCR.
- The agentic, custom approach: Here, we are going to select individual models and tools (including custom tools that we define ourselves) and combine them into a single agent that can leverage all of them.
- The hard-coded approach: Here, we are going to build...