Regression models on big data
We may find ourselves working with data that is too big to fit on a single machine, in which case we need big data solutions. These are the same tools we would use for logistic regression on the classification side. In fact, many of them provide a GLM (generalized linear model) that can be configured to perform either logistic or linear regression. Some of our options are:
- Vowpal Wabbit
- H2O
- TensorFlow
- Spark (pyspark)
- Dask
- AWS SageMaker's or Google Cloud's Linear Learner
Of course, the solution we choose depends on what we have available and what is easiest to use. Cloud providers like AWS and GCP generally offer easy-to-use services, and we may need cloud resources to run other solutions like Spark or Dask anyway. However, if we have an on-premises cluster with the relevant software installed, or specific data privacy concerns, we may not want to use cloud solutions.
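To make the GLM idea concrete, here is a minimal sketch in plain Python of a single training routine that becomes linear or logistic regression depending on one setting, and that reads the data one row at a time so the full dataset never has to sit in memory. It is a toy illustration, not the implementation used by any of the tools listed above. The names fit_glm_sgd, linear_rows, and logistic_rows are hypothetical, the data is synthetic, and the learning rate and epoch count are arbitrary choices for this example.

```python
import math
import random


def identity(z):
    return z


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# The "family" picks the inverse link: identity gives linear regression,
# sigmoid gives logistic regression.
INVERSE_LINKS = {"gaussian": identity, "binomial": sigmoid}


def fit_glm_sgd(rows, n_features, family="gaussian", lr=0.05, epochs=1):
    """Fit a GLM with stochastic gradient descent.

    rows is a callable that returns a fresh iterator of (features, target)
    pairs, so each pass streams the data instead of loading it all at once.
    """
    inverse_link = INVERSE_LINKS[family]
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in rows():
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            # With these link functions, the per-row gradient has the
            # same form for both families: (prediction - target) * x.
            error = inverse_link(z) - y
            bias -= lr * error
            for j, xi in enumerate(x):
                weights[j] -= lr * error * xi
    return weights, bias


def linear_rows(n=5000, seed=0):
    # Synthetic stream: y = 2*x0 - 1*x1 + 0.5 plus a little noise.
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2.0 * x[0] - 1.0 * x[1] + 0.5 + rng.gauss(0, 0.01)


def logistic_rows(n=5000, seed=1):
    # Synthetic stream: binary labels drawn with P(y=1) = sigmoid(3*x0 - 2*x1).
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        p = sigmoid(3.0 * x[0] - 2.0 * x[1])
        yield x, (1.0 if rng.random() < p else 0.0)


lin_w, lin_b = fit_glm_sgd(linear_rows, 2, family="gaussian", epochs=5)
log_w, log_b = fit_glm_sgd(logistic_rows, 2, family="binomial", epochs=5)
print("linear:", lin_w, lin_b)
print("logistic:", log_w, log_b)
```

Only the family argument changes between the two calls. In a real setting, the generator functions would be replaced by something that reads rows from disk or a distributed store, and we would more likely rely on one of the libraries above than on hand-written gradient descent.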