Using embeddings to understand unstructured data
So far, we’ve focused on exploring structured data. What about unstructured data, such as images or text? Recall that in Chapter 3’s RAG chatbot project we converted PDF text chunks into embeddings, that is, numerical vector representations of the data. Embeddings are what let us perform a similarity (or hybrid) search across chunks of text. That way, when someone asks our chatbot a question such as “What are the economic impacts of automation technologies using LLMs?”, the chatbot can search through the stored chunks of text from the arXiv articles, retrieve the most relevant ones, and use them to answer the question more accurately. For more visual readers, Figure 4.14 shows the data preparation workflow. We completed the Data Preparation step in Chapter 3; we’ll now run through the remaining setup steps in the workflow.
Figure 4.14 The data preparation workflow
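To make the idea concrete, here is a minimal sketch of embedding a handful of text chunks and scoring them against a question with cosine similarity. It assumes the `sentence-transformers` library and the `all-MiniLM-L6-v2` model purely for illustration; the Chapter 3 project may use a different embedding model and a vector database rather than in-memory comparison.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example chunks, standing in for the chunked arXiv article text from Chapter 3.
chunks = [
    "Large language models can automate routine writing and coding tasks.",
    "Automation driven by LLMs may displace some jobs while creating new ones.",
    "Transformer architectures rely on self-attention over token sequences.",
]

# Convert each chunk into a dense numerical vector (an embedding).
# The model choice here is an assumption, not necessarily the one used in Chapter 3.
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks)  # shape: (num_chunks, embedding_dim)

# Embed the user's question the same way.
question = "What are the economic impacts of automation technologies using LLMs?"
question_embedding = model.encode([question])[0]

# Cosine similarity: higher scores mean the chunk is semantically closer to the question.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(question_embedding, emb) for emb in chunk_embeddings]

# Retrieve the most relevant chunk, which the chatbot would pass to the LLM as context.
best = int(np.argmax(scores))
print(f"Most relevant chunk ({scores[best]:.3f}): {chunks[best]}")
```

In the full RAG setup, the same similarity scoring happens inside the vector store rather than in a Python loop, but the principle is identical: the question and the chunks live in the same vector space, so the closest vectors point to the most relevant text.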