Summary
In this chapter, you learned techniques for horizontally scaling out standard Python-based ML libraries such as scikit-learn and XGBoost. First, techniques for scaling out exploratory data analysis (EDA) with the PySpark DataFrame API were introduced, along with code examples. Then, techniques for distributing ML model inference and scoring were presented, using a combination of MLflow's pyfunc functionality and Spark DataFrames. Techniques for scaling out ML models with embarrassingly parallel computing on Apache Spark were also presented, as was distributed tuning of models trained with standard Python ML libraries, using a third-party package called spark_sklearn. Then, pandas UDFs were introduced as a way to scale out arbitrary Python code in a vectorized manner, enabling high-performance, low-overhead Python user-defined functions right within PySpark. Finally, Koalas was introduced as a way for pandas developers to use a pandas-like API without having...
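As a refresher on the MLflow pyfunc approach to distributed scoring mentioned above, the sketch below shows the shape of a custom pyfunc-style model wrapper. The class name and the add-one logic are purely illustrative (not from the chapter), and the Spark-side usage is shown only in comments since it requires a live SparkSession and a logged model:

```python
import pandas as pd

# Minimal sketch of a custom pyfunc-style model wrapper. The predict()
# signature mirrors mlflow.pyfunc.PythonModel; the model logic here is
# a hypothetical placeholder, not a real trained model.
class AddOneModel:
    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        # Score a batch of rows at once: return first column plus one.
        return model_input.iloc[:, 0] + 1.0

# Hedged usage sketch: with mlflow installed and the model logged, it
# could be applied to a Spark DataFrame column as a UDF, e.g.:
#   import mlflow.pyfunc
#   udf = mlflow.pyfunc.spark_udf(spark, model_uri)
#   scored_df = df.withColumn("prediction", udf("feature"))
```

This pattern lets Spark parallelize scoring across partitions while the model itself remains an ordinary single-node Python object.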
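To make the pandas UDF idea concrete, here is a minimal sketch. The core of a pandas UDF is an ordinary vectorized function from a pandas Series to a pandas Series; the Fahrenheit-to-Celsius conversion and the column names are illustrative assumptions, and the PySpark registration is shown in comments because it needs a running SparkSession and PyArrow:

```python
import pandas as pd

# A plain vectorized Series-to-Series function. This is exactly the kind
# of function a pandas UDF wraps: PySpark ships each partition to it as
# a pandas Series, so the arithmetic runs vectorized, with far less
# per-row overhead than a conventional Python UDF.
def fahrenheit_to_celsius(s: pd.Series) -> pd.Series:
    return (s - 32.0) * 5.0 / 9.0

# Hedged usage sketch inside PySpark (requires pyspark and pyarrow):
#   from pyspark.sql.functions import pandas_udf
#   f2c = pandas_udf(fahrenheit_to_celsius, returnType="double")
#   df.withColumn("celsius", f2c("fahrenheit"))
```

The same function can be tested locally on a plain pandas Series before it is ever registered with Spark, which keeps the development loop fast.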