Linear regression with SGD optimization in Spark 2.0
In this recipe, we use Spark's RDD-based regression API to demonstrate how an iterative optimization technique can minimize the cost function and arrive at a solution for a linear regression.
We examine how Spark uses an iterative method to converge on a solution to the regression problem using a well-known technique called Gradient Descent. Spark provides a variant known as Stochastic Gradient Descent (SGD), which is used to compute the intercept (in this case set to 0) and the weights for the parameters.
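To make the idea concrete before turning to Spark, the following is a minimal sketch of stochastic gradient descent for a linear model with the intercept fixed at 0. It is plain Python, not Spark's implementation. The function names, the toy data, the step size, and the iteration count are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(rows, labels, step_size=0.01, num_iterations=5000, seed=42):
    """Fit weights for y ~ w . x (intercept fixed at 0) using per-sample SGD.

    Hypothetical helper for illustration; Spark's own optimizer differs in
    details such as mini-batching and step-size scheduling.
    """
    rng = random.Random(seed)          # seeded so the run is repeatable
    weights = [0.0] * len(rows[0])     # start from all-zero weights
    for _ in range(num_iterations):
        i = rng.randrange(len(rows))   # pick one training example at random
        x, y = rows[i], labels[i]
        prediction = sum(w * xi for w, xi in zip(weights, x))
        error = prediction - y
        # Gradient of 0.5 * error^2 with respect to w_j is error * x_j,
        # so each weight moves a small step against its gradient.
        weights = [w - step_size * error * xi for w, xi in zip(weights, x)]
    return weights


def mean_squared_error(rows, labels, weights):
    """Average squared difference between predictions and labels (the cost)."""
    total = 0.0
    for x, y in zip(rows, labels):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        total += (prediction - y) ** 2
    return total / len(rows)


# Toy data generated from y = 2*x1 + 3*x2 with no noise (assumed for the demo).
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
labels = [2.0 * a + 3.0 * b for a, b in rows]

weights = sgd_linear_regression(rows, labels)
cost = mean_squared_error(rows, labels, weights)
print("weights:", weights)
print("mean squared error:", cost)
```

Each iteration looks at a single randomly chosen example rather than the whole dataset, which is what distinguishes SGD from batch gradient descent. On this noise-free toy data the weights should approach the generating values of 2 and 3 as the cost shrinks toward zero.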
How to do it...
- We use a dataset from the UCI Machine Learning Repository. You can download the entire dataset from the following URL:
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
The dataset comprises 14 columns. The first 13 columns are the independent variables (features), which are used to explain the median price (last column) of owner-occupied houses in Boston, USA.
We have chosen and cleaned the first eight columns as features. We use the first...