Throughout the book, we will rely on financial data to evaluate and discuss the merits of different data processing and machine learning methods. In this example, the data is extracted from Yahoo! Finance using the CSV format with the following fields:
- Date
- Price at open
- Highest price in session
- Lowest price in session
- Price at session close
- Volume
- Adjusted price at session close
Let's create a simple program that loads the content of the file, executes some simple preprocessing functions, and creates a simple model. We selected the CSCO stock price between January 1, 2012 and December 1, 2013 as our data input.
Let's consider two variables, price and volume, as illustrated by the following screenshot. The top graph displays the variation of the price of Cisco stock over time and the bottom bar chart represents the daily trading volume on Cisco stock over time:
The first step is loading the dataset from a local file. Typically, large datasets are loaded from a database or distributed filesystem such as Hadoop Distributed File System (HDFS), as shown here:
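The original listing is not reproduced here; the following is a minimal sketch of what the loading code might look like. The file name, the load signature, and the transform helper are assumptions made for illustration:

```scala
import scala.io.Source

// Sketch: load the dataset from a local CSV file
def load(fileName: String): Array[(Double, Double)] = {
  val src = Source.fromFile(fileName)
  val fields = src.getLines.map(_.split(",")).toArray  // line 1: extract the fields
  val data = fields.drop(1)                            // line 2: remove the header row
  src.close()                                          // line 3: close the file handle
  transform(data)
}
```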
The transform method will be described in the next section.
The data file is extracted through an invocation of Source.fromFile, a method defined on the scala.io.Source singleton object, and then the fields are extracted through a map (line 1). The header (first) row is removed with a call to drop (line 2).
Tip
Data extraction
The Source.fromFile.getLines.map invocation pipeline returns an iterator, which needs to be converted into an array to store the information in memory. The file has to be closed to avoid leaking the file handle (line 3).
Tip
Code readability
A long pipeline of Scala higher-order method calls makes the code quite difficult to read, so it is recommended to break down long chains of method calls. The following code is an example of a long chain of method calls:
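For instance (the file name and field indices here are illustrative), every processing step can be packed into a single expression:

```scala
import scala.io.Source

// Hard to read: every processing step packed into one expression
val volatilityVolume = Source.fromFile("resources/data/CSCO.csv")
  .getLines
  .toArray
  .drop(1)
  .map(_.split(","))
  .map(f => (f(2).toDouble - f(3).toDouble, f(5).toDouble))
```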
We can break down such method calls into several steps as follows:
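A sketch of the same computation, decomposed into named intermediate steps:

```scala
import scala.io.Source

// Same computation, with each step named and the file handle closed explicitly
val src = Source.fromFile("resources/data/CSCO.csv")
val rawFields = src.getLines.toArray.drop(1).map(_.split(","))
// (high - low, volume) for each trading session
val volatilityVolume = rawFields.map(f =>
  (f(2).toDouble - f(3).toDouble, f(5).toDouble))
src.close()
```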
We strongly encourage you to consult the excellent guide Effective Scala, written by Marius Eriksen of Twitter. It is definitely a must-read for any Scala developer [1:10].
Preprocessing the dataset
The next step is to normalize the data into the range [-0.5, 0.5] so it can be used to train the logistic binary classifier. It is time to introduce a simple statistics class.
We select the computation of the mean and standard deviation of the two time series as the first step of the preprocessing phase. The computation of these statistics can be implemented by the reduce methods reduceLeft and foldLeft:
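A naive implementation along these lines (a sketch, not the book's exact listing) might be:

```scala
// Mean via reduceLeft: one full traversal of the dataset
def mean(x: Array[Double]): Double = x.reduceLeft(_ + _) / x.length

// Standard deviation via foldLeft: yet another traversal,
// plus the one hidden inside the call to mean
def stdDev(x: Array[Double]): Double = {
  val m = mean(x)
  Math.sqrt(x.foldLeft(0.0)((s, v) => s + (v - m) * (v - m)) / (x.length - 1))
}
```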
However, this implementation has one major drawback: the dataset (price in this example) has to be traversed for each method (mean, stdDev, min, max, and so on).
One solution is to create a class that computes the counters and the statistics on demand using, once again, lazy values:
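A sketch of such a class follows; the general shape is an assumption, and the book's actual Stats implementation may differ in detail:

```scala
// On-demand statistics computed from a single traversal of the data
class Stats[T <% Double](val values: Array[T]) {
  // One foldLeft computes all four counters in a single pass:
  // (minimum, maximum, sum of values, sum of squared values)
  private[this] lazy val counters: (Double, Double, Double, Double) =
    values.foldLeft((Double.MaxValue, Double.MinValue, 0.0, 0.0)) {
      case ((mn, mx, s, sSq), x) =>
        (Math.min(mn, x), Math.max(mx, x), s + x, sSq + x * x)
    }

  // Lazy values: each statistic is computed only if and when needed
  lazy val min: Double = counters._1
  lazy val max: Double = counters._2
  lazy val mean: Double = counters._3 / values.length
  lazy val stdDev: Double = {
    val n = values.length
    Math.sqrt((counters._4 - counters._3 * counters._3 / n) / (n - 1))
  }
}
```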
We made the statistics object generic by using the view bound T <% Double, which assumes an implicit conversion from type T to Double. By defining the statistics as a tuple of counters (minimum value, maximum value, sum of values, and sum of squared values) and folding these values into a statistics object, we limit the number of invocations of the foldLeft reducer method to one, and therefore avoid recomputing these statistics over the existing dataset each time new data is added.
The code illustrates the use and benefit of lazy values in Scala. The mean is computed only if and when needed.
Normalization and Gaussian distribution
Statistics are usually used to normalize data into a probability value [0, 1], as required by most classification or clustering algorithms. It is logical to add the normalization method to the Stats class, as we have already extracted the min and max values:
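These methods could be added to the Stats class sketched above (a sketch, not the book's exact listing); the second overload covers the [-0.5, 0.5] range required by the classifier:

```scala
// Inside the Stats[T] class sketched earlier
lazy val range: Double = max - min

// Normalize each value into [0, 1] using the already extracted min and max
def normalize: Array[Double] = values.map(x => (x - min) / range)

// Normalize into an arbitrary range, for example [-0.5, 0.5]
def normalize(low: Double, high: Double): Array[Double] =
  values.map(x => low + (x - min) * (high - low) / range)
```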
The same approach is used to compute the multivariate normal distribution:
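A sketch of the corresponding density computation, also inside Stats. Applying the univariate density element-wise amounts to assuming a multivariate normal with a diagonal covariance matrix (independent variables), which is a simplification:

```scala
// Inside the Stats[T] class: normal (Gaussian) density of each value,
// using the lazily computed mean and standard deviation
def gauss: Array[Double] = values.map { x =>
  val y = (x - mean) / stdDev
  Math.exp(-0.5 * y * y) / (stdDev * Math.sqrt(2.0 * Math.PI))
}
```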
The price action chart has a very interesting characteristic. On closer inspection, a sudden change in price and an increase in volume occur about every three months or so. Experienced investors will undoubtedly recognize that these price-volume patterns are related to the release of Cisco's quarterly earnings. Such regular but unpredictable patterns can be a source of concern or of opportunity if the risk can be managed. The strong reaction of the stock price to the release of corporate earnings may scare some long-term investors while enticing day traders.
The following graph visualizes the potential correlation between sudden price change (volatility) and heavy trading volume:
Let's try to correlate the volatility of the stock price with its trading volume. For the sake of this exercise, we define the volatility as the maximum variation of the stock price within each trading session: the relative difference between the session's highest and lowest prices.
The YahooFinancials enumeration extracts historical stock prices and session volume from a CSV file. For example, the volatility is extracted from the CSV fields of each line as follows:
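A sketch of such an enumeration, with the field order taken from the CSV fields listed at the start of this section; the extractor definitions are illustrative assumptions:

```scala
// Field order matches the CSV layout described earlier
object YahooFinancials extends Enumeration {
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

  // Relative session volatility: 1 - low/high
  val volatility: Array[String] => Double =
    fields => 1.0 - fields(LOW.id).toDouble / fields(HIGH.id).toDouble

  // Session trading volume
  val volume: Array[String] => Double =
    fields => fields(VOLUME.id).toDouble
}
```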
The transform method uses the YahooFinancials enumeration to generate the input data for the model:
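A sketch of transform, consistent with the previous snippets:

```scala
// Convert raw CSV fields into normalized (volatility, volume) observations
def transform(fields: Array[Array[String]]): Array[(Double, Double)] = {
  val volatility = new Stats(fields.map(YahooFinancials.volatility)).normalize
  val volume = new Stats(fields.map(YahooFinancials.volume)).normalize
  volatility.zip(volume)
}
```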
The volatility and volume data is normalized using the Stats.normalize method defined earlier.
Although charting is not the primary goal of this book, we thought you would benefit from a brief introduction to JFreeChart. The skeleton code to generate a scatter plot is rather simple. The most relevant part is the transformation of the XYTSeries into a graphical JFreeChart XYSeries:
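A minimal sketch of that conversion, assuming XYTSeries is an array of (Double, Double) pairs; the method name is illustrative:

```scala
import org.jfree.data.xy.{XYSeries, XYSeriesCollection}

// Convert an array of (x, y) pairs into a JFreeChart dataset
def toDataset(data: Array[(Double, Double)], label: String): XYSeriesCollection = {
  val series = new XYSeries(label)
  data.foreach { case (x, y) => series.add(x, y) }
  new XYSeriesCollection(series)
}
```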
Note
Visualization
The JFreeChart library is introduced as a robust charting tool. The visualization of the results of a computation is beyond the scope of this book, so the code related to plots and charts is omitted in order to keep the code snippets concise and dedicated to machine learning. On a few occasions, output data is formatted as a CSV file so it can simply be imported into a spreadsheet.
Here is an example of a plot using the ScatterPlot.display method:
There is a level of correlation between session volume and session volatility. We can use this information to classify trading sessions by their volatility.
Creating a model (learning)
The objective of the training is to build a model that can discriminate between volatile and nonvolatile trading sessions. For the sake of the exercise, a volatile session is characterized by a wide spread between the session high and session low prices coupled with heavy trading volume; relative volatility and volume constitute the two parameters of the model.
Logistic regression is commonly used in statistical inference. The following implementation of the binary logistic regression classifier exposes a single method, classify, to comply with our desire to reduce the complexity and life cycle of objects. The model parameters, weights, are computed during training, when the LogBinRegression class (the model) is instantiated. As mentioned earlier, the sections of the code nonessential to the understanding of the algorithm are omitted:
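The following is a sketch of such a classifier, assuming this general structure; the book's actual LogBinRegression implementation may differ in detail:

```scala
import scala.util.Random

// Binary logistic regression: training occurs at instantiation
class LogBinRegression(obsSet: Array[Array[Double]],
                       expected: Array[Double],
                       maxIters: Int,
                       eta: Double,    // gradient slope (learning rate)
                       eps: Double) {  // convergence criterion

  // Model parameters, computed once when the model is instantiated
  private[this] val weights: Option[Array[Double]] = train

  // Runtime classification: returns (predicted label, sigmoid value)
  def classify(obs: Array[Double]): Option[(Boolean, Double)] =
    weights.map { w =>
      val s = sigmoid(margin(w, obs))
      (s > 0.5, s)
    }

  private def train: Option[Array[Double]] = {
    // Initialize the weights with small random values (index 0 is the bias)
    val w = Array.fill(obsSet.head.length + 1)(0.1 * Random.nextDouble)

    // find exits as soon as an iteration converges (diff < eps);
    // it returns None if the maximum number of iterations is reached
    (0 until maxIters).find { _ =>
      val diff = obsSet.zip(expected).map { case (x, y) =>
        val err = y - sigmoid(margin(w, x))   // predicted vs. observed
        w(0) += eta * err                     // update the bias
        x.indices.foreach(i => w(i + 1) += eta * err * x(i))
        Math.abs(err)
      }.sum / obsSet.length
      diff < eps
    }.map(_ => w)
  }

  private def sigmoid(x: Double): Double = 1.0 / (1.0 + Math.exp(-x))

  private def margin(w: Array[Double], x: Array[Double]): Double =
    w(0) + x.indices.map(i => w(i + 1) * x(i)).sum
}
```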
The training method, train, consists of iterating through the computation of the weights using simple gradient descent. The method computes the weights and returns an option, so the model is either trained and ready for runtime classification or nonexistent (None), as shown in the train method of the sketch above.
The iteration is encapsulated in the Scala find method, which exits as soon as the algorithm converges (diff < eps). The model parameters, weights, are set to None if the maximum number of iterations is reached.
The train method iterates across the set of observations by computing the gradient between the predicted and observed values. In our simplistic approach, the gradient is computed as a linear function of the sigmoid of the sum of the products of the weights and the training observations. As with any optimization problem, the initialization of the solution vector, weights, is critical. We chose to initialize the weights with random values, although in practice, you would use a more deterministic approach to initialize the model parameters.
In order to train the model, we need labeled data. The process consists of tagging every trading session as volatile or nonvolatile according to the observations (relative session volatility and session volume). The labeling process is usually quite cumbersome, so let's generate the labels automatically. A trading session is considered volatile if its volatility and volume are both greater than 60 percent of the maximum relative volatility and volume:
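A sketch of the automated labeling; normVolatilityVolume stands for the normalized (volatility, volume) pairs produced by transform, and the 0.6 threshold corresponds to the 60 percent mentioned above:

```scala
// Label a session as volatile (1.0) if both normalized features exceed 0.6
val labels: Array[Double] = normVolatilityVolume.map {
  case (volatility, volume) => if (volatility > 0.6 && volume > 0.6) 1.0 else 0.0
}
```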
Note
Automated labeling
Although quite convenient, automated creation of training labels is not without risk because it may mislabel singular observations. This technique is used in this test for convenience but it is not recommended unless a domain expert reviews the labels manually.
The model is created (trained) by a simple instantiation of the logistic binary classifier:
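A sketch of the instantiation, using the configuration values listed below; the features and labels names come from the previous snippets:

```scala
// Instantiation triggers the training of the model parameters (weights)
val features: Array[Array[Double]] = normVolatilityVolume.map {
  case (volatility, volume) => Array(volatility, volume)
}
val model = new LogBinRegression(features, labels, 300, 0.00005, 0.02)
```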
The training run is configured with a maximum of 300 iterations, a gradient slope of 0.00005, and a convergence criterion of 0.02.
Finally, the model can be tested with a fresh dataset, not related to the training set:
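A sketch, reusing the load helper sketched earlier; the file name is an assumption:

```scala
// Load a fresh dataset, unrelated to the training set
val testData: Array[(Double, Double)] = load("resources/data/CSCO2.csv")
```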
It is just a matter of executing the classification method (exceptions, conditions on method arguments, and returned values are omitted):
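A sketch of the invocation; the input values are illustrative only:

```scala
// Classify two fresh observations: (normalized volatility, normalized volume)
val sample1 = Array(0.75, 0.52)
val sample2 = Array(0.11, 0.10)
model.classify(sample1).foreach(println)
model.classify(sample2).foreach(println)
```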
The result of the classification is (true, 0.516) for the first sample and (false, 0.1180) for the second sample.
Note
Validation
The simple classification in this test case is provided to illustrate the runtime application of the model. It does not constitute a validation of the model by any stretch of the imagination. The next chapter digs into validation metrics and methodology.