Monitoring model training progress
In the previous section, we saw how easy it is to launch a Vertex AI custom training job with the desired configuration and machine types. Vertex AI training jobs are particularly useful for large-scale experiments that need substantial compute (multiple GPUs or TPUs) and may run for days, workloads that are impractical in a Jupyter Notebook-based environment. Another advantage of launching Vertex AI jobs is that all metadata and lineage are tracked systematically, so we can later revisit past experiments and compare them with newer ones easily and accurately.
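To make the launch step concrete, the sketch below builds the `worker_pool_specs` structure that the Vertex AI Python SDK's `aiplatform.CustomJob` expects. The project, bucket, and container image names are placeholders, and the exact machine and accelerator choices are assumptions for illustration:

```python
# Sketch: configuring a Vertex AI custom training job.
# All resource names (project, bucket, image URI) are hypothetical.
def build_worker_pool_specs(image_uri, machine_type="n1-standard-8",
                            accelerator_type="NVIDIA_TESLA_T4",
                            accelerator_count=2, replica_count=1):
    """Build the worker_pool_specs list passed to aiplatform.CustomJob."""
    return [{
        "machine_spec": {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        },
        "replica_count": replica_count,
        "container_spec": {"image_uri": image_uri},
    }]

specs = build_worker_pool_specs("gcr.io/my-project/trainer:latest")
print(specs[0]["machine_spec"])

# Submitting the job would look roughly like this (requires GCP credentials,
# so it is left commented out here):
# from google.cloud import aiplatform
# aiplatform.init(project="my-project", location="us-central1",
#                 staging_bucket="gs://my-staging-bucket")
# job = aiplatform.CustomJob(display_name="my-training-job",
#                            worker_pool_specs=specs)
# job.run()
```

Keeping the spec in a small helper like this makes it easy to sweep machine types or accelerator counts across experiments without duplicating the job-submission code.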
Another important aspect is monitoring the live progress of training jobs, including metrics such as loss and accuracy. For this purpose, we can set up Vertex AI TensorBoard within our Vertex AI job and track progress in near real time. In this section, we will set up a TensorBoard...
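When a TensorBoard instance is attached to a Vertex AI training job, the service exposes the upload location to the training code through the `AIP_TENSORBOARD_LOG_DIR` environment variable; writing summaries there is what makes them stream into Vertex AI TensorBoard. A minimal sketch of resolving that location, with a local fallback for running outside Vertex AI (the `./logs` default is an assumption):

```python
import os

def resolve_log_dir(default="./logs"):
    """Return the TensorBoard log directory for the current run.

    Inside a Vertex AI job with an attached TensorBoard instance,
    AIP_TENSORBOARD_LOG_DIR points at a Cloud Storage path; locally
    it is unset, so we fall back to a local directory.
    """
    return os.environ.get("AIP_TENSORBOARD_LOG_DIR", default)

# Training code would then hand this path to its summary writer,
# e.g. tf.summary.create_file_writer(resolve_log_dir()).
print(resolve_log_dir())
```

Because the same code path works locally and on Vertex AI, the training script does not need to branch on where it is running.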