Computer vision
This book is about NLP, not computer vision. However, in the previous section, we implemented general-purpose sequence models that can be applied to many domains. Computer vision is one of them.
The title of the article by Dosovitskiy et al. (2021) says it all: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. The authors processed images as sequences of fixed-size patches, and the results proved their point.
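To make this concrete, the following minimal sketch, which is an illustration and not the authors' implementation, cuts a toy 224x224 RGB image into 16x16 patches and flattens each patch into a vector, producing a sequence of "visual words." The image size and values are assumptions chosen to match the paper's default configuration.

```python
import numpy as np

# A toy 224x224 RGB image (random values; a real pipeline would load an image)
image = np.random.rand(224, 224, 3)

patch_size = 16  # each "word" is a 16x16 patch, as in Dosovitskiy et al. (2021)
h, w, c = image.shape
n_h, n_w = h // patch_size, w // patch_size  # 14 x 14 = 196 patches

# Cut the image into non-overlapping patches, then flatten each patch
patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_h * n_w, patch_size * patch_size * c)

print(patches.shape)  # (196, 768): a sequence of 196 tokens, each a 768-dimensional vector
```

From this point on, the 196 patch vectors can be fed to a transformer encoder exactly like a sequence of word embeddings.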
Google has made vision transformers available in a Colaboratory notebook. Open Vision_Transformer_MLP_Mixer.ipynb in the Chapter16 directory of this book's GitHub repository.
Vision_Transformer_MLP_Mixer.ipynb contains a transformer computer vision model written in JAX. JAX combines Autograd and XLA: it can differentiate native Python and NumPy functions, and it speeds them up through just-in-time compilation and parallelization.
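The following minimal sketch, a toy example rather than code from the notebook, shows these ideas with the public JAX API: jax.grad for differentiation, jax.jit for XLA compilation, and jax.vmap for vectorized parallelism.

```python
import jax
import jax.numpy as jnp

# A plain NumPy-style function written with jax.numpy
def loss(x):
    return jnp.sum(jnp.tanh(x) ** 2)

# Autograd: differentiate the Python function directly
grad_loss = jax.grad(loss)

# XLA: compile the gradient function just-in-time for speed
fast_grad = jax.jit(grad_loss)

# Parallelization: map the compiled gradient over a batch of inputs
batched_grad = jax.vmap(fast_grad)

x = jnp.ones((4, 3))          # a batch of 4 vectors of size 3
print(batched_grad(x).shape)  # (4, 3): one gradient per batch element
```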
The notebook is self-explanatory. You can explore it to see how it works. However, bear in mind that when Industry...