Profiling data
The selection of a pre-processing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – pre-processing data subsection in the Let's kick the tires section of Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using its minimum and maximum values.
Immutable statistics
The mean and standard deviation are the most commonly used statistics.
Note
Mean and variance
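Arithmetic mean:
$$\hat{\mu} = \frac{1}{n}\sum_{i=0}^{n-1} x_i$$
Variance:
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=0}^{n-1} (x_i - \hat{\mu})^2$$
Variance adjusted for sampling bias:
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=0}^{n-1} (x_i - \hat{\mu})^2$$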
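As a quick illustration of the three statistics named in this note, here is a minimal, self-contained sketch. The object VarianceSketch and its method names are hypothetical and are not part of the book's library; the sketch works on Vector[Double] directly rather than on a generic element type. The last method shows that the adjusted variance can also be obtained in a single pass by accumulating the sum and the sum of squares.

```scala
// Hypothetical helper for illustration only; not part of the book's library.
object VarianceSketch {
  // Arithmetic mean of a non-empty sample.
  def mean(xs: Vector[Double]): Double = xs.sum / xs.size

  // Variance: average squared deviation from the mean (divides by n).
  def variance(xs: Vector[Double]): Double = {
    val mu = mean(xs)
    xs.map(x => (x - mu) * (x - mu)).sum / xs.size
  }

  // Variance adjusted for sampling bias: divides by n - 1 (requires n > 1).
  def sampleVariance(xs: Vector[Double]): Double = {
    val mu = mean(xs)
    xs.map(x => (x - mu) * (x - mu)).sum / (xs.size - 1)
  }

  // Same adjusted variance computed in one traversal, using the identity
  // sum((x - mu)^2) = sum(x^2) - n * mu^2.
  def sampleVarianceOnePass(xs: Vector[Double]): Double = {
    val (sum, sumSq) = xs.foldLeft((0.0, 0.0)) {
      case ((s, sq), x) => (s + x, sq + x * x)
    }
    val mu = sum / xs.size
    (sumSq - mu * mu * xs.size) / (xs.size - 1)
  }
}
```

For the sample Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0), mean returns 5.0 and variance returns 4.0, while both adjusted variants return 32/7. The one-pass form avoids a second traversal, but it can lose precision when the mean is large relative to the spread of the values, so the two-pass form may be the safer choice for such data.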
Let's extend the MinMax class with some basic statistics capabilities, Stats:
import scala.math.sqrt

class Stats[T: ToDouble](values: Vector[T]) extends MinMax[T](values) {
  val zero = (0.0, 0.0)
  // Single pass: accumulate the sum and the sum of squares
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  // Variance adjusted for sampling bias (divides by n - 1)
  lazy val variance = (sums._2 - mean*mean*values.size)/(values.size - 1)
  lazy val stdDev = sqrt(variance)
  …
}
The...