Training a text classifier with Azure Databricks
For this chapter, you can download the ready-to-use notebook from GitHub at https://github.com/marconline/azure-databricks. The file is SMS Spam Classification.py.
Now we will put this brand new cluster to concrete use. As you already know, Apache Spark includes MLlib. With this library, data scientists can focus on their data problems and models instead of on the complexities that surround distributed data (such as infrastructure, configuration, and so on).
Before we continue with this walkthrough, let me state the problem we need to solve: we want to build a model that, given the text of an SMS message as input, predicts whether that message is spam or not.
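To make the problem statement more tangible, the following is a minimal sketch in plain Python of the contract we are after: a function that takes a message text and returns one of two labels. Everything in it is an assumption made for illustration. The sample messages are invented (they are not taken from the chapter's dataset), the names `LABELED_SMS`, `SpamClassifier`, `keyword_baseline`, and `accuracy` are hypothetical, and the keyword rule is only a toy stand-in for a trained model, not the MLlib approach used in the notebook.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled examples: (message text, label), where 1 = spam, 0 = not spam.
# These messages are invented for illustration; they are not the chapter's dataset.
LABELED_SMS: List[Tuple[str, int]] = [
    ("WINNER! Claim your free prize now", 1),
    ("Are we still meeting for lunch today?", 0),
    ("Free entry in a weekly competition, text WIN to enter", 1),
    ("I'll call you when I get home", 0),
]

# Conceptually, the model we want is a function from text to a binary label.
SpamClassifier = Callable[[str], int]


def keyword_baseline(text: str) -> int:
    """Toy stand-in for a trained model: flag messages containing 'free' or 'winner'."""
    lowered = text.lower()
    return 1 if ("free" in lowered or "winner" in lowered) else 0


def accuracy(classifier: SpamClassifier, examples: List[Tuple[str, int]]) -> float:
    """Return the fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for text, label in examples if classifier(text) == label)
    return correct / len(examples)


if __name__ == "__main__":
    print(accuracy(keyword_baseline, LABELED_SMS))
```

A hand-written rule like this breaks down quickly on real messages, which is why we would rather learn the mapping from examples than code it by hand.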
This is a typical supervised machine learning problem (specifically, a binary classification problem) and, as such, we need to have a set of data to use...