In this chapter, we first discussed how image captioning powered by modern end-to-end deep learning works, then summarized how to train such a model with the TensorFlow im2txt project. We covered in detail how to find the correct input and output node names, how to freeze the model, and how to use the latest graph transform tool and the memmapped conversion tool to fix some nasty bugs that occur when loading the model on mobile devices. After that, we walked through detailed tutorials on building iOS and Android apps that use the model and run new sequence inference with its LSTM RNN component.
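The freeze, transform, and memmap steps recapped above can be sketched as a short command pipeline. This is a minimal sketch, not the chapter's exact commands: the checkpoint paths, checkpoint number, and node names shown here are illustrative assumptions for an im2txt training run, and the tools are assumed to have been built from a TensorFlow 1.x source tree with Bazel.

```shell
# Sketch of the freeze -> transform -> memmap pipeline.
# All paths, node names, and the checkpoint number below are
# illustrative assumptions; adjust them to your own training run.

# 1. Freeze the trained im2txt checkpoint and graph into one GraphDef.
python tensorflow/python/tools/freeze_graph.py \
  --input_meta_graph=/tmp/im2txt/model.ckpt-1000000.meta \
  --input_checkpoint=/tmp/im2txt/model.ckpt-1000000 \
  --output_graph=/tmp/image2text_frozen.pb \
  --output_node_names="softmax,lstm/initial_state,lstm/state" \
  --input_binary=true

# 2. Strip training-only ops that can break the mobile runtimes.
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=/tmp/image2text_frozen.pb \
  --out_graph=/tmp/image2text_frozen_transformed.pb \
  --inputs="image_feed,input_feed,lstm/state_feed" \
  --outputs="softmax,lstm/initial_state,lstm/state" \
  --transforms='strip_unused_nodes fold_constants(ignore_errors=true)'

# 3. Convert to a memory-mapped format to reduce memory pressure on iOS.
bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format \
  --in_graph=/tmp/image2text_frozen_transformed.pb \
  --out_graph=/tmp/image2text_frozen_transformed_memmapped.pb
```

The memmapped output cannot be loaded as a regular GraphDef; on device it must be read through TensorFlow's memmapped file-system environment rather than a plain file read.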
It's pretty amazing that, after training on tens of thousands of image captioning examples, and powered by modern CNN and LSTM models, we can build and use a model that generates a sensible natural language description of a picture right on our mobile devices.