Summary
This chapter introduced us to the world of multimodal modular RAG, which uses distinct modules for different data types (text and image) and tasks. We leveraged the functionality of LlamaIndex, Deep Lake, and OpenAI, which we explored in the previous chapters. The Deep Lake VisDrone dataset further introduced us to drone technology for analyzing images and identifying objects. The dataset contained images, labels, and bounding box information. Working with drone technology involves multimodal data, encouraging us to develop skills that we can apply across many domains, such as tracking wildlife, streamlining commercial deliveries, and making infrastructure inspections safer.
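As a quick reminder of how such a dataset is accessed, the following sketch loads a Deep Lake dataset and inspects its tensors. The dataset path and the tensor names (images, labels, boxes) are assumptions based on the typical layout of the VisDrone detection dataset and may differ from the exact values used in the chapter.

```python
import deeplake

# Hypothetical path: the exact VisDrone dataset path on the Activeloop hub may differ.
ds = deeplake.load("hub://activeloop/visdrone-det-train", read_only=True)

# List the tensors the dataset exposes (names such as 'images', 'labels',
# and 'boxes' are assumptions about the dataset layout).
print(ds.tensors.keys())

sample_image = ds.images[0].numpy()   # pixel data for the first frame
sample_labels = ds.labels[0].numpy()  # object class IDs detected in that frame
sample_boxes = ds.boxes[0].numpy()    # bounding boxes, typically [x, y, w, h]
```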
We built a multimodal modular RAG-driven generative AI system. The first step was to define a baseline user query that served both the LLM and the multimodal queries. We began by querying the Deep Lake textual dataset that we implemented in Chapter 3. LlamaIndex seamlessly ran a query engine to retrieve, augment, and...
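A minimal sketch of this textual retrieval step might look like the following, assuming the Deep Lake vector store built in Chapter 3 and a recent LlamaIndex API; the dataset path and the query text are illustrative placeholders, not the exact values from the chapter, and an OpenAI API key is expected in the environment for embedding and generation.

```python
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.deeplake import DeepLakeVectorStore

# Hypothetical path to the textual Deep Lake vector store built in Chapter 3.
vector_store = DeepLakeVectorStore(
    dataset_path="hub://<your_org>/drone_text_dataset", read_only=True
)

# Wrap the existing vector store in an index and expose it as a query engine.
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=3)

# Retrieve relevant chunks, augment the prompt, and generate a response.
response = query_engine.query("How can drones identify trucks in traffic?")
print(response)
```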