Chapter 18. Debugging TensorFlow Models
As we have seen throughout this book, TensorFlow programs build and train models that make predictions for many kinds of tasks. When training a model, you build the computation graph, run the graph for training, and evaluate the graph for predictions. You repeat these steps until you are satisfied with the quality of the model, and then you save the graph along with the learned parameters. In production, the graph is built or restored from a file and populated with the parameters.
Building deep learning models is a complex art, and the TensorFlow API and its ecosystem are equally complex. When we build and train models in TensorFlow, we sometimes get errors of various kinds, or the models do not work as expected. For example, how often do you find yourself stuck in one or more of the following situations?
- Getting NaN in loss and metrics output
- The loss or some other metric doesn't improve even after several iterations
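Both symptoms can be spotted automatically from the loss values that a training loop records. The following is a minimal sketch in plain Python, not a TensorFlow API: the function name `diagnose_losses` and the `patience` and `min_delta` parameters are hypothetical, and their default values are arbitrary assumptions chosen only for illustration.

```python
import math


def diagnose_losses(losses, patience=5, min_delta=1e-4):
    """Scan recorded loss values and return (step, message) warnings.

    Hypothetical helper for illustration only; it is not part of TensorFlow.
    patience:  number of consecutive non-improving steps before warning.
    min_delta: smallest decrease in loss that counts as an improvement.
    """
    warnings = []
    best = math.inf
    stale = 0  # consecutive steps without an improvement of at least min_delta
    for step, loss in enumerate(losses):
        if math.isnan(loss) or math.isinf(loss):
            warnings.append((step, "non-finite loss"))
            break  # values after a NaN/inf loss are not meaningful
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale == patience:
                warnings.append((step, "loss has not improved"))
    return warnings


if __name__ == "__main__":
    print(diagnose_losses([1.0, 0.5, float("nan"), 0.1]))
    print(diagnose_losses([1.0, 0.9, 0.9, 0.9, 0.9], patience=3))
```

In practice, you would collect the loss returned at each training step into a list and pass it to a check like this one, treating any warning as a signal to stop and investigate rather than as a diagnosis of the cause.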
In such situations...