Logistic regression is commonly used in statistical inference.
The following implementation of the binomial logistic regression classifier exposes a single classify method, in keeping with our desire to reduce the complexity and life cycle of objects. The model weight parameters are computed during training, when the LogBinRegression class/model is instantiated. As mentioned earlier, the sections of the code nonessential to the understanding of the algorithm are omitted.
The goal of training a model against expected values is to compute the optimal weights that minimize the error or cost function. We select the batch gradient descent algorithm to minimize the cumulative error between the predicted and expected values over all the observations. Although there are quite a few alternative optimizers, gradient descent is robust and simple enough for this first chapter. The algorithm consists of updating the weights wi of the regression model by minimizing the cost.
Note
Cost function
M5: Cost (or compound error = predicted – expected):
M6: The batch gradient descent method to update model weights wi is as follows:
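The original M5 and M6 formulas are not reproduced in this extract. For reference, a standard formulation of the cumulative cost over n observations and of the batch weight update with learning rate η is sketched below; this is an assumption, not necessarily the book's exact notation:

```latex
% Standard forms, assumed; the book's exact M5/M6 notation may differ.
% M5: cumulative cost, with predicted value \hat{y}_j and expected value y_j
J(w) = \frac{1}{2n}\sum_{j=1}^{n}\left(\hat{y}_j - y_j\right)^2
% M6: batch gradient descent update of each weight w_i
w_i \leftarrow w_i - \eta\,\frac{\partial J(w)}{\partial w_i}
```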
For those interested in learning about optimization techniques, the Summary of optimization techniques section in Appendix A, Basic Concepts presents an overview of the most commonly used optimizers. The batch gradient descent method is also used for the training of the multilayer perceptron (refer to The training epoch section under The multilayer perceptron section in Chapter 9, Artificial Neural Networks).
The execution of the batch gradient descent algorithm follows these steps:
1. Initialize the weights of the regression model.
2. Shuffle the order of the observations and expected values.
3. Aggregate the cost or error for the entire observation set.
4. Update the model weights using the cost as the objective function.
5. Repeat from step 2 until either the maximum number of iterations is reached or the incremental update of the cost is close to zero.
The purpose of shuffling the order of the observations between iterations is to prevent the minimization of the cost from settling into a local minimum.
Tip
Batch and stochastic gradient descent
The stochastic gradient descent is a variant of gradient descent that updates the model weights after computing the error on each observation. Although stochastic gradient descent requires a higher computational effort to process each observation, it converges toward the optimal weights fairly quickly, after a small number of iterations. However, stochastic gradient descent is sensitive to the initial value of the weights and to the selection of the learning rate, which is usually defined by an adaptive formula.
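The difference between the two update schedules can be sketched on a toy one-variable linear model with squared error; the names lr, batchStep, and sgdEpoch are illustrative, not taken from the book's code:

```scala
// Illustrative comparison of batch and stochastic updates on y ≈ w * x.
val lr = 0.1 // learning rate (illustrative value)

// Batch: accumulate the gradient over ALL observations, then update once.
def batchStep(w: Double, xs: Seq[Double], ys: Seq[Double]): Double = {
  val grad = xs.zip(ys).map { case (x, y) => (w * x - y) * x }.sum / xs.size
  w - lr * grad
}

// Stochastic: update the weight after EACH observation.
def sgdEpoch(w0: Double, xs: Seq[Double], ys: Seq[Double]): Double =
  xs.zip(ys).foldLeft(w0) { case (w, (x, y)) => w - lr * (w * x - y) * x }
```

On data generated with y = 2x, both schedules converge toward w = 2; the stochastic version applies one update per observation instead of one per pass.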
The train method consists of iterating through the computation of the weights using a simple gradient descent method. The method computes the weights and returns an instance of the LogBinRegressionModel model:
The train method extracts the number of weights, nWeights, for the regression model as the number of variables in each observation + 1 (line 21). The method initializes weights with random values over [0, 1] (line 22). The weights are computed through the tail-recursive gradientDescent method, and the method returns a new model for the binomial logistic regression (line 23).
Tip
Unwrapping values from Try
It is usually not recommended to invoke the get method on a Try value unless it is enclosed in a Try statement. The best course of action is to do one of the following:
1. Catch the failure with match { case Success(m) => .. case Failure(e) => }
2. Extract the result safely with getOrElse( /* … */ )
3. Propagate the result as a Try type with map( _.m)
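The three options can be sketched as follows, using a hypothetical parsePositive function that returns a Try; the names are illustrative:

```scala
import scala.util.{Try, Success, Failure}

// Hypothetical Try-producing function used to illustrate the three options.
def parsePositive(s: String): Try[Int] = Try(s.toInt).filter(_ > 0)

// 1. Catch the failure with a pattern match on Success/Failure
def describe(s: String): String = parsePositive(s) match {
  case Success(n) => s"parsed $n"
  case Failure(e) => s"failed: ${e.getClass.getSimpleName}"
}

// 2. Extract the result safely with a default value
def orZero(s: String): Int = parsePositive(s).getOrElse(0)

// 3. Propagate the result as a Try type with map
def doubled(s: String): Try[Int] = parsePositive(s).map(_ * 2)
```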
Let's take a look at the computation of the weights through the minimization of the cost function in the gradientDescent method:
The gradientDescent method recurses on the vector of pairs of observations and expected values, obsAndLbl, the cost, and the model weights (line 24). It throws an exception if the maximum number of iterations allowed for the optimization is reached (line 25). It shuffles the order of the observations (line 26) before computing the errorGrad derivatives of the cost over each weight (line 27). The computation of the derivative of the cost (or error = predicted value – expected value) with formula M5 returns a pair of cumulative cost and derivative values (line 28).
Next, the method computes the overall compound cost using the formula M4 (line 29) and converts it to an incremental relativeError cost that is compared to the eps convergence criterion (line 30). The method extracts the derivatives of the cost over the weights by transposing the matrix of errors, and then prepends the bias value 1.0 to match the array of weights (line 31).
Note
Bias value
The purpose of the bias value is to prepend 1.0 to the vector of observations so it can be directly processed (for example, with zip and dot) with the weights. For instance, a regression model for two-dimensional observations (x, y) has three weights (w0, w1, w2). The bias value +1 is prepended to the observations to compute the predicted value as 1.0·w0 + x·w1 + y·w2.
This technique is used in the computation of the activation function of the multilayer perceptron, as described in the The multilayer perceptron section in Chapter 9, Artificial Neural Networks.
The formula M6 updates the weights for the next iteration (line 32) before the method invokes itself with the new weights, cost, and iteration count (line 33).
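The loop just described can be sketched in a simplified, self-contained form. The names mirror the text (gradientDescent, obsAndLbl, eps), but the learning rate, the mean squared cost, and the exact convergence test are assumptions, not the book's code:

```scala
import scala.annotation.tailrec
import scala.util.Random

val eta = 1.0          // learning rate (illustrative value)
val eps = 1e-9         // convergence criterion on the incremental cost
val maxIters = 100000  // maximum number of iterations
val rng = new Random(42)

def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// Inner product with the bias weight w(0) paired with an implicit 1.0
def margin(w: Array[Double], x: Array[Double]): Double =
  w(0) + w.drop(1).zip(x).map { case (wi, xi) => wi * xi }.sum

@tailrec
def gradientDescent(
    obsAndLbl: Vector[(Array[Double], Double)],
    cost: Double,
    iters: Int,
    w: Array[Double]): Array[Double] = {
  // throw an exception once the maximum iteration count is reached
  require(iters < maxIters, "failed to converge")
  val shuffled = rng.shuffle(obsAndLbl) // reorder the observations
  // cumulative cost and per-weight gradient (error = predicted - expected)
  val errors = shuffled.map { case (x, y) => (x, sigmoid(margin(w, x)) - y) }
  val newCost = errors.map { case (_, e) => e * e }.sum / errors.size
  val grad = (0 until w.length).map { i =>
    errors.map { case (x, e) => e * (if (i == 0) 1.0 else x(i - 1)) }.sum
  }
  // update the weights, then recurse until the cost barely changes
  val newW = w.zip(grad).map { case (wi, g) => wi - eta * g / errors.size }
  if (math.abs(cost - newCost) < eps) newW
  else gradientDescent(obsAndLbl, newCost, iters + 1, newW)
}
```

Being tail recursive, the method consumes no additional stack space per iteration, which matters when convergence takes thousands of steps.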
Let's take a look at the shuffling of the order of observations using a random sequence generator. The following implementation is an alternative to the Scala standard library method scala.util.Random.shuffle for shuffling the elements of collections. The purpose is to change the order of observations and labels between iterations in order to prevent the optimizer from reaching a local minimum. The shuffle method permutes the order of the labelObs vector of observations by partitioning it into segments of random size and reversing the order of every other segment:
Once the order of the observations is updated, the vector of (observation, label) pairs is easily built through a map (line 34). The actual shuffling of the indices is performed in the following shuffle recursive function:
The maximum size of a partition of the vector of observations, maxChunkSize, is randomly computed (line 35). The method extracts the next slice (start, end) (line 36). The slice is either added to the existing vector of indices and returned once all the observations have been shuffled (line 37), or passed to the next invocation.
The slice method returns an array of indices over the range (start, end), either in the right order if the number of segments processed is odd, or in reverse order if the number of segments processed is even:
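A minimal sketch of this segment-based shuffle is shown below. The names shuffle and maxChunkSize follow the text, while the random chunk sizing and the even/odd reversal details are assumptions:

```scala
import scala.annotation.tailrec
import scala.util.Random

// Permute indices 0 until n by cutting the range into chunks of random
// size and reversing the order of every other chunk (illustrative sketch).
def shuffle(n: Int, rng: Random = new Random): Vector[Int] = {
  val maxChunkSize = math.max(2, rng.nextInt(n / 2 + 2)) // random partition size

  @tailrec
  def loop(start: Int, count: Int, acc: Vector[Int]): Vector[Int] =
    if (start >= n) acc
    else {
      val end = math.min(n, start + 1 + rng.nextInt(maxChunkSize))
      val slice = (start until end).toVector
      // reverse the order of every other segment
      val chunk = if (count % 2 == 0) slice.reverse else slice
      loop(end, count + 1, acc ++ chunk)
    }

  loop(0, 0, Vector.empty)
}
```

The result is a permutation of the original indices: every index appears exactly once, only the order changes.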
Note
Iterative versus tail recursive computation
Tail recursion in Scala is a very efficient alternative to iterative algorithms. Tail recursion avoids the need to create a new stack frame for each invocation of the method. It is applied to the implementation of many machine learning algorithms presented throughout the book.
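A minimal illustration of the note, with an iterative sum and its tail-recursive equivalent; the @tailrec annotation makes the compiler verify that the recursive call is in tail position and compiled into a loop:

```scala
import scala.annotation.tailrec

// Iterative version: an explicit loop with a mutable accumulator.
def sumIterative(xs: Seq[Double]): Double = {
  var acc = 0.0
  for (x <- xs) acc += x
  acc
}

// Tail-recursive version: the recursive call is the last operation,
// so no new stack frame is created per invocation.
@tailrec
def sumTailRec(xs: List[Double], acc: Double = 0.0): Double = xs match {
  case Nil    => acc
  case h :: t => sumTailRec(t, acc + h)
}
```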
In order to train the model, we need to label the input data. The labeling process consists of associating the relative price movement during a session (price at close/price at open – 1) with one of the following two configurations:
The two classes of training observations are segregated by a decision boundary drawn on the scatter plot in the previous section. The labeling process is usually quite cumbersome and should be automated as much as possible.
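The labeling rule can be sketched as follows; the 0.0 threshold on the relative price movement and the {0, 1} class encoding are assumptions for illustration:

```scala
// Map the relative price movement during a session, close/open - 1,
// to one of two classes (illustrative threshold at 0.0).
def label(open: Double, close: Double): Int =
  if (close / open - 1.0 > 0.0) 1 else 0
```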
Note
Automated labeling
Although quite convenient, automated creation of training labels is not without risk as it may mislabel singular observations. This technique is used in this test for convenience, but it is not recommended unless a domain expert reviews the labels manually.
Once the model is successfully created through training, it is available to classify new observations. The runtime classification of observations using the binomial logistic regression is implemented by the classify method:
The method applies the logistic function to linear, the inner product of the weights of the model and the new observation obs (line 37). The method returns the tuple (predicted class of the observation {0, 1}, prediction value), where the class is defined by comparing the prediction to the boundary value 0.0 (line 38).
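A simplified, self-contained sketch of such a classify method, assuming the bias weight is stored first and the class boundary is applied to the linear term, might read:

```scala
// Logistic function applied to the linear inner product.
def logistic(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// Return (predicted class in {0, 1}, prediction value); the class is
// derived from the 0.0 boundary on the linear term (illustrative sketch).
def classify(weights: Array[Double], obs: Array[Double]): (Int, Double) = {
  // the bias weight weights(0) pairs with the implicit 1.0 prepended to obs
  val linear = weights(0) +
    weights.drop(1).zip(obs).map { case (w, x) => w * x }.sum
  val prediction = logistic(linear)
  (if (linear > 0.0) 1 else 0, prediction)
}
```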
The computation of the dot product of the weights and observations uses the bias value as follows:
The alternative implementation of the dot product of the weights and observations consists of extracting the first weight, w.head:
The dot method is used in the classify method.
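Both variants can be sketched as follows; dotWithBias prepends the 1.0 bias value to the observation, while dotHead peels off the first weight with w.head (the function names are illustrative):

```scala
// Variant 1: prepend the 1.0 bias to the observation, then zip with weights.
def dotWithBias(w: Array[Double], obs: Array[Double]): Double =
  w.zip(1.0 +: obs).map { case (wi, xi) => wi * xi }.sum

// Variant 2: extract the bias weight w.head, dot the rest with the observation.
def dotHead(w: Array[Double], obs: Array[Double]): Double =
  w.head + w.tail.zip(obs).map { case (wi, xi) => wi * xi }.sum
```

The two variants compute the same value; the second avoids allocating the extended observation array.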