Tracking notebook and pipeline versioning
Data scientists usually start by experimenting with Python notebooks offline, where interactive execution is a key benefit. Python notebooks have come a long way since the days of Jupyter notebooks (https://jupyter-notebook.readthedocs.io/en/stable/). The success and popularity of Jupyter notebooks are undeniable. However, Jupyter notebooks are hard to put under version control because they are stored as JSON data that mixes code with output. This is especially difficult if we are trying to track code using MLflow while relying only on Jupyter's native format, whose file extension is .ipynb. You may also not be able to see the exact Git hash in the MLflow tracking server for each run that uses a Jupyter notebook. There are a lot of interesting debates on whether or when a Jupyter notebook should be used, especially in a production environment (see a discussion here: https://medium.com/mlops-community/jupyter...