Elements of a production machine learning system
Modern machine learning models are very capable because they are trained on large quantities of data and contain a large number of trainable parameters. Among the largest models available are Generative Pre-trained Transformer-3 (GPT-3) from OpenAI (175 billion parameters) and Megatron-Turing NLG from NVIDIA and Microsoft (530 billion parameters). These models can generate text (for example, novels) and hold conversations, but they can also write program code, create user interfaces, and write requirements.
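To get a feeling for what such parameter counts mean in practice, the short sketch below estimates how much memory is needed only to store the weights of a 175-billion-parameter model. The function name and the storage sizes (4 bytes per parameter for 32-bit floats, 2 bytes for 16-bit floats) are assumptions made for illustration. The estimate ignores everything else that training requires, such as gradients, optimizer state, and activations, so real requirements are higher.

```python
def weight_memory_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Parameter count quoted in the text for GPT-3.
GPT3_PARAMETERS = 175_000_000_000

# Assumed storage formats: 4 bytes (32-bit float) and 2 bytes (16-bit float).
fp32_gb = weight_memory_gb(GPT3_PARAMETERS, 4)
fp16_gb = weight_memory_gb(GPT3_PARAMETERS, 2)

print(f"32-bit weights: {fp32_gb:.0f} GB")  # prints "32-bit weights: 700 GB"
print(f"16-bit weights: {fp16_gb:.0f} GB")  # prints "16-bit weights: 350 GB"
```

Even in the more compact 16-bit format, the weights alone come to hundreds of gigabytes, which is far more than the memory of a typical desktop computer or laptop.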
Such large models cannot be run on a desktop computer, a laptop, or even a single dedicated server. They need advanced computing infrastructure that can sustain long-running training and evaluation. Such infrastructure also needs to feed the models with data automatically, monitor the training process, and, finally, let users access the models to make inferences. One of the modern ways of providing such...