How PyTorch is bridging the gap between research and production at Facebook: PyTorch team at F8 conference

PyTorch, the machine learning library which was originally developed as a research framework by a Facebook intern in 2017, has now grown into a popular deep learning workflow. One of the most loved products by Facebook, PyTorch is free, open source, and used for applications like computer vision and natural language processing (NLP).

At the F8 conference held this year, the PyTorch team consisting of Joe Spisak, the project manager for PyTorch at Facebook AI and Dmytro Dzhulgakov, the tech lead at Facebook AI gave a talk on how Facebook is developing and scaling AI experiences with PyTorch.

Spisak describes PyTorch as an eager and graph-based execution that is defined by ‘run’. This means that when a user executes a Python code, it generates a graph on the fly. It is dynamic in nature and allows the compilation of the static graph. The dynamic neural networks are accessible, thus, allowing the user to change the parameters very quickly. This feature comes in handy for applications like control flow in NLP. Another important feature of PyTorch, according to Spisak, is the ability to generate accurately distributed training models that possess close to billion parameters, including the cutting-edge ones. It also has a simple and easy API that is very intuitive by nature. This is one of the qualities of PyTorch which has endeared many developers, claims Spisak.

Become a pro at Deep Learning with PyTorch!

If you want to become an expert in building and training neural network models with high speed and flexibility in text, vision, and advanced analytics using PyTorch 1.x, read our book Deep Learning with PyTorch 1.x - Second Edition written by Sri. Yogesh K., Laura Mitchell, et al.

It will give you an insight into solving real-world problems using CNNs, RNNs, and LSTMs, along with discovering state-of-the-art modern deep learning architectures, such as ResNet, DenseNet, and Inception.

How PyTorch is bridging the gap between research and production at Facebook

Dzhulgakov points out how general advances in AI are driven by innovative research in the fields of academia or industry and why it’s necessary to bridge this big lag between research and production.

He says, “If you have a new idea and you want to take it all the way through to deployment, you usually need to go through multiple steps - figure out what the approach is and then find the training data maybe prepare massage it a little bit. Actually, build and train your model after that and then there is this painful step of transferring your model to a production environment which often historically involved reimplementation of a lot of code so you can actually take and deploy it and scale-up.”

According to Dzhulgakov, PyTorch is trying to minimize this big gap by encouraging advances and experimentations in the field, so that the research is brought into production in a few days, instead of months.

Challenges in bringing research to production

Following are the various classes of challenges associated with bringing research to production, according to the PyTorch team.

Hardware efficiency: In case of a tight latency constraint environment, users are required to fit all the hardware into the performance budget. On the other hand, an underused hardware environment can lead to an increase in cost.

Scalability: In Facebook’s recent work, Dzhulgakov says, they have trained on billions of public images, thus indicating significant accuracy gains as compared to regular datasets like imageNet. Similarly, when models are taken to inference, it means that billions of inferences per second are running with multiple diverse models sharing the same hardware.

Cross-platform: Neural networks are mostly not isolated as they need to be deployed inside their target application. It has a lot of interdependence with the surrounding code and application, thus posing different constraints like the user will not be able to run Python code or the user will have to work on very constrained computer capabilities if running a mobile device, and more.

Reliability: A lot of PyTorch jobs run for multiple weeks on hundreds of GPUs, hence it is important to design a reliable software which can tolerate hardware failures and deliver results.

How PyTorch is tackling these challenges

In order to tackle the above-listed challenges, Dzhulgakov says Facebook develops systems that can take up a training job and perform optimizations focused on performance for the performance-critical pieces. The system also applies “recipes for reliability” so that the developer written modeling code is automatically transformed. The Jit package comes into the picture here and acts like a key factor that is built to capture the structure of the Python program with minimal changes. The main goal of the Jit package is to make this process almost seamless.

He asserts that PyTorch has been successful since it feels like regular programming in Python and most of its users start developing in traditional PyTorch mode (eager mode) just by writing and prototyping in the program. “For the subset of promising models which shall show what results you need to bring to production either scale up, so you can apply techniques provided by Jit to exist mental codes and annotated in order to run in so-called script code.”

The Jit is like a subset of Python with a thread list of request semantics, which allows the user to apply transparent transformations for the eager mode to the user. The annotations include adding a few lines of Python code on top of the function in such a way that it can be done incrementally on function by function or module by module fashion. This hybrid fashion ensures that the model works along the way. Such powerful PyTorch tools permit the user to share the same code base between research and production environments.

Next, Dzhulgakov deduces that the common factor between research and production is that both teams of developers work on the same code base built on top of PyTorch. Thus, they share the codes among the teams that have a common domain like text classification or object detection or reinforcement learning. These developers prototype models, train new algorithms and address new tasks for quickly transitioning this functionality to the opposite environment. Watch the full talk to see Dzhulgakov’s examples of PyTorch bridging the gap between research and production at Facebook.

If you want to become an expert at implementing deep learning applications in PyTorch, check out our latest book Deep Learning with PyTorch 1.x - Second Edition written by Sri. Yogesh K., Laura Mitchell, and Et al. This book will show you how to apply neural networks to domains such as computer vision and NLP. It will also guide you to build, train, and scale a model with PyTorch and cover complex neural networks such as GANs and autoencoders for producing text and images.

NVIDIA releases Kaolin, a PyTorch library to accelerate research in 3D computer vision and AI

Introducing ESPRESSO, an open-source, PyTorch based, end-to-end neural automatic speech recognition (ASR) toolkit for distributed training across GPUs

Facebook releases PyTorch 1.3 with named tensors, PyTorch Mobile, 8-bit model quantization, and more

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages

PyTorch announces the availability of PyTorch Hub for improving machine learning research reproducibility