Serving Transformer Models
So far, we have explored many aspects of Transformers, and you have learned how to train and use a Transformer model from scratch, as well as how to fine-tune it for a variety of tasks. However, we have not yet covered how to serve these models in production. Like any other modern, real-world solution, natural language processing (NLP)-based solutions must be deployable in a production environment, and metrics such as response time must be taken into consideration while developing them.
This chapter explains how to serve a Transformer-based NLP solution in environments where a CPU or GPU is available. TensorFlow Extended (TFX) will be described as a solution for machine learning deployment, and other approaches for serving Transformers as application programming interfaces (APIs), such as FastAPI, will be illustrated. You will also learn the basics of Docker, as well as how to Dockerize your service...
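To give a flavor of what serving a Transformer as an API looks like, here is a minimal sketch using FastAPI together with the Hugging Face pipeline API. The model checkpoint, request schema, and endpoint path are illustrative assumptions, not a prescribed setup from this chapter:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load a sentiment-analysis pipeline once at startup.
# The checkpoint name below is just an example; swap in any fine-tuned model.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: PredictRequest):
    # Run inference and return the label/score pair as JSON.
    return classifier(request.text)[0]
```

Assuming this file is saved as `app.py`, it could be served locally with `uvicorn app:app --host 0.0.0.0 --port 8000`; later sections discuss how such a service can be containerized with Docker.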