Integrating MLflow with Apache Spark
Apache Spark is a highly scalable and widely used big data framework for processing data at large scale. For more details and documentation, please go to https://spark.apache.org/. As a big data tool, it can be used to speed up parts of your ML pipeline, at either the training or the inference stage.
In this particular case, we will illustrate how to use the model developed in the previous section in the Databricks environment, scaling the batch-inference job to larger amounts of data.
In order to explore the Spark integration with MLflow, we will execute the following steps:
- Create a new notebook named inference_job_spark in Python, linking it to a running cluster where the bitpred_poc.ipynb notebook was just created.
- Upload your data to dbfs using the File/Upload data link in the environment.
- Execute the script shown below in a cell of the notebook, changing the logged_model and df filenames to the ones in your environment.
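As a minimal sketch of such a cell, assuming the model was logged to an MLflow run and the uploaded file is a CSV (the run ID, the dbfs path, and the result type below are placeholders to adapt to your environment), the script loads the model as a Spark UDF and appends a predictions column:

```python
import mlflow
from pyspark.sql.functions import struct, col

# Placeholder model URI; replace <run_id> with the run that logged your model
logged_model = "runs:/<run_id>/model"

# Wrap the logged MLflow model as a Spark UDF for distributed batch scoring
loaded_model = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model,
                                       result_type="double")

# Placeholder dbfs path; point this at the file you uploaded via File/Upload data
df = (spark.read.option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/FileStore/tables/training_data.csv"))

# Score every row by passing all columns to the model and append the result
df = df.withColumn("predictions", loaded_model(struct(*map(col, df.columns))))

# spark and display are predefined in Databricks notebooks
display(df)
```

Because mlflow.pyfunc.spark_udf wraps the model as a regular Spark UDF, scoring is distributed across the cluster's workers, so the same cell scales from a sample file to much larger datasets without code changes.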