Deploying the LLM Twin service
The last step is implementing the architecture presented in the previous section. More concretely, we will deploy the LLM microservice using AWS SageMaker and the business microservice using FastAPI. Within the business microservice, we will glue the RAG logic written in Chapter 9 to our fine-tuned LLM Twin, which will ultimately let us test the inference pipeline end to end.
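To make the glue between the two microservices concrete before we walk through the full implementation, here is a minimal sketch of the pattern: a FastAPI endpoint that builds an augmented prompt and forwards it to a SageMaker endpoint hosting the LLM. The endpoint name, the request payload shape, and the retrieve_context() stub standing in for the Chapter 9 retrieval logic are illustrative assumptions, not the exact code we build in this chapter.

```python
import json

import boto3
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical endpoint name; in practice, this comes from your deployment settings.
SAGEMAKER_ENDPOINT_NAME = "llm-twin-endpoint"

sagemaker_client = boto3.client("sagemaker-runtime")


class QueryRequest(BaseModel):
    query: str


def retrieve_context(query: str) -> str:
    # Stub for the RAG retrieval step from Chapter 9: in the real business
    # microservice, this queries the vector DB and returns the relevant chunks.
    return "<retrieved context>"


@app.post("/rag")
def rag(request: QueryRequest) -> dict:
    # Build the augmented prompt from the user query and the retrieved context.
    context = retrieve_context(request.query)
    prompt = (
        "Answer the query using the context below.\n\n"
        f"Context: {context}\n\nQuery: {request.query}"
    )

    # Forward the prompt to the LLM microservice deployed as a SageMaker endpoint.
    response = sagemaker_client.invoke_endpoint(
        EndpointName=SAGEMAKER_ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),
    )
    answer = json.loads(response["Body"].read().decode("utf-8"))

    return {"answer": answer}
```

The key design point is the separation of concerns: the business microservice owns the retrieval and prompt-building logic, while the SageMaker endpoint only runs inference, so each side can be scaled and updated independently.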
Serving the ML model is one of the most critical steps in any ML application’s life cycle, as users can only interact with our model after this phase is completed. If the serving architecture isn’t designed correctly or the infrastructure isn’t working properly, it doesn’t matter how powerful the model you implemented is. As long as users cannot interact with it properly, it has near zero value from a business point of view. For example, if you have the best code assistant on the market, but the latency to use it is too high, users will quickly abandon it for a faster alternative.