The use of statistical methods in data science
Statistical analysis is central to many data science tasks. It ranges from computing simple measures, such as the mean and median, to complex multiple regression analysis. Chapter 5, Statistical Data Analysis Techniques, introduces this type of analysis and the Java support available.
Statistical analysis is not always easy. Advanced statistical techniques often require a particular mindset to fully comprehend, and that mindset can be difficult to acquire. Fortunately, many techniques are not that difficult to use, and various libraries hide some of their inherent complexity.
Regression analysis, in particular, is an important technique for analyzing data. The technique attempts to fit a line to a set of data points. An equation representing the line is calculated and can be used to predict future behavior. There are several types of regression analysis, including simple and multiple regression. They differ in the number of independent variables considered: simple regression uses one, while multiple regression uses two or more.
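To make the idea of fitting a line concrete, here is a minimal sketch of simple regression by ordinary least squares in plain Java. The class name LeastSquaresSketch, its methods, and the sample data are hypothetical and chosen only for illustration; this is not the Apache Commons approach covered in Chapter 5.

```java
public class LeastSquaresSketch {

    /**
     * Fits y = intercept + slope * x by ordinary least squares.
     * Returns a two-element array: {intercept, slope}.
     */
    public static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need at least two paired observations");
        }
        int n = x.length;

        // Means of the x and y values
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums of cross-deviations and squared x-deviations
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }

        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    /** Evaluates the fitted line at the given x value. */
    public static double predict(double[] line, double x) {
        return line[0] + line[1] * x;
    }

    public static void main(String[] args) {
        // Made-up data for illustration only (not real population figures)
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1};

        double[] line = fit(x, y);
        System.out.printf("y = %.3f + %.3f * x%n", line[0], line[1]);
        System.out.printf("Predicted y at x = 6: %.3f%n", predict(line, 6));
    }
}
```

The slope and intercept define the equation of the line, and predict applies that equation to a new x value, which is how a fitted line is used to estimate future behavior.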
The following graph shows the straight line that closely matches a set of data points representing the population of Belgium over several decades:
Simple statistical techniques, such as mean and standard deviation, can be computed using basic Java. They can also be handled by libraries such as Apache Commons Math. For example, to calculate the median, we can use the Apache Commons DescriptiveStatistics class. This is illustrated next, where the median of an array of doubles is calculated. The numbers are added to an instance of SynchronizedDescriptiveStatistics, a thread-safe subclass of this class, as shown here:
double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3,
    16.2, 12.5, 17.8, 16.5, 12.5};
DescriptiveStatistics statTest =
    new SynchronizedDescriptiveStatistics();
// Add each observation to the statistics object
for (double num : testData) {
    statTest.addValue(num);
}
The getPercentile method returns the value at the percentile specified in its argument. To find the median, we use the value of 50.
out.println("The median is " + statTest.getPercentile(50));
Our output is as follows:
The median is 16.2
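For comparison, the earlier remark that simple statistics can be computed using basic Java can be sketched as follows. The class name BasicStatsSketch and its method names are hypothetical, and the standard deviation shown is the sample form (dividing by n - 1), which is an assumption about which definition is wanted.

```java
import java.util.Arrays;

public class BasicStatsSketch {

    /** Arithmetic mean of the values. */
    public static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    public static double sampleStandardDeviation(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("Need at least two values");
        }
        double m = mean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumSquares / (values.length - 1));
    }

    /** Median: the middle value, or the average of the two middle values. */
    public static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("Need at least one value");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3,
                16.2, 12.5, 17.8, 16.5, 12.5};
        System.out.println("The mean is " + mean(testData));
        System.out.println("The standard deviation is "
                + sampleStandardDeviation(testData));
        System.out.println("The median is " + median(testData));
    }
}
```

With the same eleven values, the median computed this way is 16.2, which agrees with the library result above.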
In Chapter 5, Statistical Data Analysis Techniques, we will demonstrate how to perform regression analysis using the Apache Commons SimpleRegression class.