Exploring the LLM Twin’s inference pipeline deployment strategy
Now that we’ve reviewed the design choices available for deploying the LLM Twin’s inference pipeline, let’s explore the concrete decisions we made to implement it.
Our primary objective is to build a chatbot that supports content creation. To achieve this, we must serve each user request synchronously as it arrives, with a strong emphasis on low latency. This calls for an online real-time inference deployment architecture.
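To make the synchronous request-response flow concrete, here is a minimal sketch of what such an online real-time endpoint could look like. It uses FastAPI purely for illustration; the /generate route, the request/response schemas, and the generate_answer helper are assumptions, not the project’s actual API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    query: str


class QueryResponse(BaseModel):
    answer: str


async def generate_answer(query: str) -> str:
    # Placeholder for the actual LLM call; in a real service this would
    # invoke the deployed model and return its completion.
    return f"Generated answer for: {query}"


@app.post("/generate", response_model=QueryResponse)
async def generate(request: QueryRequest) -> QueryResponse:
    # The request is handled synchronously: the client blocks until the
    # answer is ready, which is why low latency is the key requirement.
    answer = await generate_answer(request.query)
    return QueryResponse(answer=answer)
```

The defining trait of this architecture is that the user waits on the open connection for the answer, so every optimization we discuss next targets end-to-end latency rather than throughput of offline batches.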
On the monolith versus microservice question, we will split the ML service into a REST API server that contains the business logic and an LLM microservice optimized for running the given LLM. Because the LLM requires a powerful machine for inference, and because we can further optimize it with specialized inference engines to reduce latency and memory usage, the microservice architecture makes the most sense. By doing so, we can scale and optimize each component independently: the business logic can run on inexpensive machines, while only the LLM microservice needs the powerful, GPU-backed hardware.
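As a rough sketch of this split, the business microservice could forward prompts to the separately deployed LLM microservice over HTTP, as shown below. The service URL, endpoint path, and JSON payload schema are hypothetical placeholders for illustration, not the project’s actual interface.

```python
import requests

# Hypothetical address of the GPU-backed LLM microservice. The business
# service only forwards inference calls and never loads the model itself.
LLM_MICROSERVICE_URL = "http://llm-service:8000/completions"


def call_llm_microservice(prompt: str, timeout: float = 30.0) -> str:
    """Forward a prompt to the LLM microservice and return its completion."""
    response = requests.post(
        LLM_MICROSERVICE_URL,
        json={"prompt": prompt},
        timeout=timeout,
    )
    response.raise_for_status()
    # Assumes the LLM microservice returns a JSON body with a "completion" field.
    return response.json()["completion"]
```

Keeping this HTTP boundary between the two services means we can swap or tune the inference engine behind the LLM microservice, or resize its hardware, without touching the business logic on the other side.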