This article is an excerpt from the book Pretrain Vision and Large Language Models in Python, by Emily Webber. Master the art of training vision and large language models with conceptual fundamentals and industry-expert guidance, and learn about AWS services and design patterns with relevant coding examples.
In this article, we’ll explore trends in foundation model application development, like using LangChain to build interactive dialogue applications, along with techniques like retrieval augmented generation (RAG) to reduce LLM hallucination. We’ll explore ways to use generative models to solve classification tasks, human-centered design, and other generative modalities like code, music, product documentation, presentations, and more! We’ll talk through AWS offerings like SageMaker JumpStart Foundation Models, Amazon Bedrock, Amazon Titan, and Amazon CodeWhisperer.
In particular, we’ll dive into the following topics: building interactive dialogue applications with LangChain, retrieval augmented generation, using generation in place of classification, keeping humans in the loop, upcoming generative modalities, and AWS offerings for foundation models.
Now that you’ve learned about foundation models, and especially large language models, let’s talk through a few key ways you can use them to build applications. One of the most significant takeaways of the ChatGPT moment in December 2022 is that customers clearly love a chat experience that is knowledgeable about every moment in the conversation, remembers topics mentioned earlier, and encompasses all the twists and turns of the dialogue. Said another way, beyond generic question answering, there’s a clear consumer preference for a chat to be chained. Let’s take a look at an example in the following screenshot:
Figure 15.1 – Chaining questions for chat applications
The key difference between the left-hand and the right-hand sides of Figure 15.1 is that on the left-hand side, the answers are discontinuous. That means the model simply sees each question as a single entity before providing its response. On the right-hand side, however, the answers are continuous. That means the entire dialogue is provided to the model, with the newest question at the bottom. This helps ensure the continuity of responses, with the model better able to maintain context.
How can you set this up yourself? Well, on the one hand, what I’ve just described isn’t terribly difficult. Imagine reading the conversation from your HTML page, packing all of that call-and-response data into the prompt, and siphoning out the response to return to your end user. If you don’t want to build it yourself, however, you can just use a few great open-source options!
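To make the do-it-yourself route concrete, here is a minimal sketch of manually chaining a dialogue in Python. The call_llm helper is a hypothetical stand-in for whatever model API or endpoint you actually use; the point is simply that every new question is sent together with the full conversation so far.

```python
# Minimal sketch of a manually chained chat: the whole history is packed into
# every prompt so the model keeps the conversational context.
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your model call (OpenAI API, SageMaker endpoint, etc.)."""
    raise NotImplementedError

def build_prompt(history: list[tuple[str, str]], new_question: str) -> str:
    turns = [f"User: {q}\nAssistant: {a}" for q, a in history]
    turns.append(f"User: {new_question}\nAssistant:")
    return "\n".join(turns)

history: list[tuple[str, str]] = []

def chat(question: str) -> str:
    prompt = build_prompt(history, question)
    answer = call_llm(prompt)
    history.append((question, answer))  # remember this turn for the next request
    return answer
```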
If you haven’t seen it before, let me quickly introduce you to LangChain. Available for free on GitHub here: https://github.com/hwchase17/langchain, LangChain is an open-source toolkit built by Harrison Chase and more than 600 other contributors. It provides functionality similar to the famous ChatGPT by pointing to OpenAI’s API, or any other foundation model, while letting you, as the developer and data scientist, create your own frontend and customer experience.
Decoupling the application from the model is a smart move; in the last few months alone the world has seen nothing short of hundreds of new large language models come online, with teams around the world actively developing more. When your application interacts with the model via a single API call, then you can more easily move from one model to the next as the licensing, pricing, and capabilities upgrade over time. This is a big plus for you!
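If you want to see what the chained experience looks like through LangChain itself, here is a short sketch using its classic conversation chain with buffer memory. Imports and class locations have moved around across LangChain releases, and the OpenAI wrapper is just one example of a backing model, so treat this as an illustration of the pattern rather than pinned, current code.

```python
# Sketch of a chained chat with LangChain (classic API): ConversationBufferMemory
# stores every turn and re-injects it into the prompt for the next question.
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = OpenAI(temperature=0)  # any supported LLM wrapper could be swapped in here
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())

print(conversation.predict(input="What is the capital of France?"))
print(conversation.predict(input="And roughly how many people live there?"))  # follow-up relies on memory
```

Because the application only touches the model through this one object, swapping in a different LLM later is a small change, which is exactly the decoupling benefit described above.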
Another interesting open-source technology here is Haystack (26). Developed by the German start-up Deepset, Haystack is a useful tool for, well, finding a needle in a haystack. Specifically, it operates as an interface for you to bring your own LLMs into expansive question-answering scenarios. This was their original area of expertise, and they have expanded quite a bit since then!
At AWS, we have an open-source template for building applications with LangChain on AWS. It’s available on GitHub here: https://github.com/3coins/langchain-aws-template.
In the following diagram, you can see a quick representation of the architecture:
While this can point to any frontend, we provide an example template you can use to get your app off the ground. You can also easily point to any custom model, whether it’s on a SageMaker endpoint or in the new AWS service, Bedrock! More on that a bit later in this chapter. As you can see in the previous image, with this template you can easily run a UI anywhere that interacts with the cloud. Let’s take a look at all of the steps:
1. First, the UI hits the API gateway.
2. Second, credentials are retrieved via IAM.
3. Third, the service is invoked via Lambda.
4. Fourth, the model credentials are retrieved via Secrets Manager.
5. Fifth, your model is invoked, either through an API call to a serverless model SDK or by calling a custom model you’ve trained and hosted on a SageMaker endpoint.
6. Sixth, the relevant conversation history is looked up in DynamoDB to ensure your answer is accurate. A hedged sketch of such a Lambda handler follows this list.
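Here is a minimal sketch of what the Lambda handler behind steps 5 and 6 might look like; credential retrieval via Secrets Manager is omitted for brevity. The endpoint name, table name, and request payload format are placeholders for illustration, not values taken from the actual langchain-aws-template repository, and the exact response shape depends on your serving container.

```python
# Hypothetical Lambda handler: read the question, pull the conversation history
# from DynamoDB, invoke the SageMaker-hosted model, and store the new turn.
import json
import boto3

ENDPOINT_NAME = "my-llm-endpoint"      # placeholder SageMaker endpoint name
TABLE_NAME = "conversation-history"    # placeholder DynamoDB table name

sm_runtime = boto3.client("sagemaker-runtime")
table = boto3.resource("dynamodb").Table(TABLE_NAME)

def handler(event, context):
    body = json.loads(event["body"])
    session_id, question = body["session_id"], body["question"]

    # Step 6: fetch prior turns so the model sees the whole dialogue.
    item = table.get_item(Key={"session_id": session_id}).get("Item", {})
    turns = item.get("turns", [])

    prompt = "\n".join(turns + [f"User: {question}", "Assistant:"])

    # Step 5: invoke the hosted model; the payload format depends on your container.
    response = sm_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),
    )
    answer = json.loads(response["Body"].read())

    # Persist the new turn for the next request in this session.
    table.put_item(Item={"session_id": session_id,
                         "turns": turns + [f"User: {question}", f"Assistant: {answer}"]})

    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```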
How does this chat interface ensure it’s not hallucinating answers? How does it point to a set of data stored in a database? Through retrieval augmented generation (RAG), which we will cover next.
As explained in the original 2020 paper (1), RAG is a way to retrieve documents relevant to a given query. Imagine your chat application takes in a question about a specific item in your database, such as one of your products. Rather than having the model make up the answer, you’d be better off retrieving the right document from your database and simply using the LLM to stylize the response. That’s where RAG is so powerful; you can use it to keep the accuracy of your generated answers high, while keeping the customer experience consistent in both style and tone. Let’s take a closer look:
Figure 15.3 – RAG
First, a question comes in from the left-hand side. In the top left, you can see a simple question, Define “middle ear”. This is processed by a query encoder, which is simply a language model producing an embedding of the query. This embedding is then applied to the index of a database, with many candidate algorithms in use here: k-nearest neighbors, maximum inner product search (MIPS), and others. Once you’ve retrieved a set of similar documents, you can feed the best ones into the generator, the final model on the right-hand side. This takes the input documents and returns a simple answer to the question. Here, the answer is The middle ear includes the tympanic cavity and the three ossicles.
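To make the retrieval half of Figure 15.3 concrete, here is a small sketch of dense retrieval: embed the query, score it against pre-embedded documents by inner product (brute-force MIPS), and hand the top hits to the generator. The model name and the tiny document list are purely illustrative, and a real system would use a vector index such as FAISS or OpenSearch rather than scoring every document.

```python
# Minimal dense-retrieval sketch: embed the query, rank documents by inner
# product, and return the top-k passages for the generator to answer from.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stands in for the query/document encoder

documents = [
    "The middle ear includes the tympanic cavity and the three ossicles.",
    "The inner ear contains the cochlea and the vestibular system.",
    "The outer ear consists of the pinna and the ear canal.",
]
doc_embeddings = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q                 # inner-product similarity
    top = np.argsort(scores)[::-1][:k]          # best-scoring documents first
    return [documents[i] for i in top]

context = retrieve('Define "middle ear"')
prompt = "Answer using the context.\n" + "\n".join(context) + '\nQuestion: Define "middle ear"\nAnswer:'
# The prompt is then passed to the generator LLM to stylize the final answer.
```

The retrieved context is prepended to the question, and the generator is asked to answer using only that context, which is what keeps hallucination down.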
Interestingly, however, the LLM here doesn’t really define what the middle ear is. It’s actually answering the question, “what objects are contained within the middle ear?” Arguably, any definition of the middle ear would include its purpose, notably serving as a buffer between your ear canal and your inner ear, which helps you keep your balance and lets you hear. So, this would be a good candidate for optimization with expert reinforcement learning from human feedback (RLHF).
As shown in Figure 15.3, this entire RAG system is tunable. That means you can and should fine-tune the encoder and decoder aspects of the architecture to dial in model performance based on your datasets and query types. Another way to classify documents, as we’ll see, is generation!
As we learned in Chapter 13, Prompt Engineering, there are many ways you can push your language model to output the type of response you are looking for. One of these ways is actually to have it classify what it sees in the text! Here is a simple diagram to illustrate this concept:
Figure 15.4 – Using generation in place of classification
As you can see in the diagram, with traditional classification you train the model ahead of time to perform one task: classification. This model may do well on classification, but it won’t be able to handle new tasks at all. This key drawback is one of the main reasons why foundation models, and especially large language models, are now so popular: they are extremely flexible and can handle many different tasks without needing to be retrained.
On the right-hand side of Figure 15.4, you can see we’re using the same text as the starting point, but instead of passing it to an encoder-based text model, we’re passing it to a decoder-based model and simply adding the instruction to classify this sentence into positive or negative sentiment. You could just as easily say, “tell me more about how this customer really feels,” or “how optimistic is this home buyer?” or “help this homebuyer find a different house that meets their needs.” Arguably, each of those three instructions is slightly different, veering away from pure classification and into more general application development or customer experience. Expect to see more of this over time!
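Here is a brief sketch of that right-hand path, using a small instruction-tuned model from the Hugging Face Hub to classify sentiment purely through generation. The specific checkpoint and generation settings are illustrative; any instruction-following model you already host would work the same way.

```python
# "Generation in place of classification": give an instruction-tuned model the
# text plus an instruction, then read the label out of the generated response.
from transformers import pipeline

generate = pipeline("text2text-generation", model="google/flan-t5-base")

review = "The checkout process was painless and my order arrived a day early."
prompt = (
    "Classify this sentence into positive or negative sentiment.\n"
    f"Sentence: {review}\n"
    "Sentiment:"
)
print(generate(prompt, max_new_tokens=5)[0]["generated_text"])  # expected: something like "positive"
```

Swapping the instruction, for example to “tell me more about how this customer really feels,” requires no retraining at all, which is exactly the flexibility described above. Now, let’s look at one more key technique for building applications with LLMs: keeping humans in the loop.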
We touched on this topic previously, in Chapter 2, Dataset Preparation: Part One, Chapter 10, Fine-Tuning and Evaluating, Chapter 11, Detecting, Mitigating, and Monitoring Bias, and Chapter 14, MLOps for Vision and Language. Let me say it again: I believe that human labeling will become even more of a competitive advantage that companies can provide. Why? Building LLMs is now incredibly competitive; both the open-source and proprietary sides are actively competing for your business. Open-source options come from the likes of Hugging Face and Stability, while proprietary offerings come from AI21, Anthropic, and OpenAI. The differences between these options are debatable; you can look up the latest models at the top of the leaderboard from Stanford’s HELM (2), which incidentally falls under their human-centered AI initiative. With enough fine-tuning and customization, you should generally be able to meet your performance targets.
What then determines the best LLM applications, if it’s not the foundation model? Obviously, the end-to-end customer experience is critical, and will always remain so. Consumer preferences wax and wane over time, but a few tenets remain for general technology: speed, simplicity, flexibility, and low cost. With foundation models we can clearly see that customers prefer explainability and models they can trust. This means that application designers and developers should grapple with these long-term consumer preferences, picking solutions and systems that maximize them. As you may have guessed, that alone is no small task.
Beyond the core skill of designing and building successful applications, what else can we do to stay competitive in this brave new world of LLMs? I would argue that it amounts to customizing your data. Focus on making your data and your datasets unique: singular in purpose, breadth, depth, and completeness. Lean into labeling your data with the best resources you can, and keep that a core part of your entire application workflow. This brings you to continuous learning, or the ability of the model to constantly get better based on signals from your end users.
Next, let’s take a look at upcoming generative modalities.
Since the 2022 ChatGPT moment, most of the technical world has been fascinated by the proposition of generating novel content. While this was always somewhat interesting, the meeting of high-performance foundation models with an abundance of media euphoria over the capabilities, combined with a post-pandemic community with an extremely intense fear of missing out, has led us to the perfect storm of a global fixation on generative AI.
Is this a good thing? Honestly, I’m happy to finally see the shift; I’ve been working on generating content with AI/ML models in some fashion since at least 2019, and as a writer and creative person myself, I’ve always thought this was the most interesting part of machine learning. I was very impressed by David Foster’s book (3) on the topic. He’s just published an updated version that includes the latest foundation models and methods! Let’s quickly recap some other types of modalities that are common in generative AI applications today.
Generating code should be no surprise to most of you; its core similarities to language generation make it a perfect candidate! Fine-tuning an LLM to spit out code in your language of choice is pretty easy; here’s my 2019 project (4) doing exactly that with the SageMaker example notebooks! Is the code great? Absolutely not, but fortunately, LLMs have come a long way since then. Many modern code-generating models are excellent, and thanks to a collaboration between Hugging Face and ServiceNow, we have an open-source model to use! This is called StarCoder and is available for free on Hugging Face right here: https://huggingface.co/bigcode/starcoder.
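As a quick illustration, here is a minimal sketch of prompting StarCoder through the transformers library. The checkpoint is gated on the Hugging Face Hub, so you may need to accept its license and authenticate first, and the generation settings below are just illustrative defaults.

```python
# Sketch of code completion with StarCoder via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")  # device_map needs accelerate

prompt = 'def fibonacci(n):\n    """Return the nth Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```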
What I love about using an open-source LLM for code generation is that you can customize it! This means you can point to your own private code repositories, tokenize the data, update the model, and immediately train this LLM to generate code in the style of your organization! At the organizational level, you might even do some continued pretraining of an open-source code-generation LLM on your own repositories to speed up all of your developers. We’ll take a look at more ways you can use LLMs to write your own code faster in the next section, when we focus on AWS offerings, especially Amazon CodeWhisperer (27).
The other modalities mentioned earlier, such as music, product documentation, and presentations, can all be great candidates for your own generative AI projects. Truly, just as we saw general machine learning move from the science lab into the foundation of most businesses and projects, it’s likely that generative capabilities will, in some fashion, do the same.
Does that mean engineering roles will be eliminated? Honestly, I doubt it. Just as the rise of great search engines didn’t eliminate software engineering roles but made them more fun and doable for a lot of people, I’m expecting generative capabilities to do the same. They are great at searching many possibilities and quickly finding great options, but it’s still up to you to know the ins and outs of your consumers, your product, and your design. Models aren’t great at critical thinking, but they are good at coming up with ideas and finding shortcomings, at least in words.
Now that we’ve looked at other generative modalities at a very high level, let’s learn about AWS offerings for foundation models!
On AWS, as you’ve seen throughout the book, you have literally hundreds of ways to optimize your foundation model development and operationalization. Let’s now look at a few ways AWS is explicitly investing to improve the customer experience in this domain, such as SageMaker JumpStart Foundation Models, Amazon Bedrock, and Amazon Titan.
Close readers of Swami’s blog post will also notice updates to our latest ML infrastructure, such as the second generation of the Inferentia chip, available through Inf2 instances, and a Trainium instance with more network bandwidth, Trn1n. We also released our code generation service, CodeWhisperer, at no cost to you!
In summary, the field of pretraining foundation models is filled with innovation. We have exciting advancements like LangChain and AWS's state-of-the-art solutions such as Amazon Bedrock and Titan, opening up vast possibilities in AI development. Open-source tools empower developers, and the focus on human-centered design remains crucial. As we embrace continuous learning and explore new generative methods, we anticipate significant progress in content creation and software development. By emphasizing customization, innovation, and responsiveness to user preferences, we stand on the cusp of fully unleashing the potential of foundation models, reshaping the landscape of AI applications. Keep an eye out for the thrilling journey ahead in the realm of AI.
Emily Webber is a Principal Machine Learning Specialist Solutions Architect at Amazon Web Services. She has assisted hundreds of customers on their journey to ML in the cloud, specializing in distributed training for large language and vision models. She mentors Machine Learning Solutions Architects, authors countless feature designs for SageMaker and AWS, and guides the Amazon SageMaker product and engineering teams on best practices around machine learning and customers. Emily is widely known in the AWS community for a 16-video YouTube series featuring SageMaker with 160,000 views, plus a keynote at O’Reilly AI London 2019 on a novel reinforcement learning approach she developed for public policy.