Profiling data
The selection of a pre-processing, clustering, or classification algorithm depends heavily on the quality and profile of the input data (observations and, whenever available, expected values). The Step 3 – pre-processing data subsection in the Let's kick the tires section of Chapter 1, Getting Started, introduced the MinMax class for normalizing a dataset using its minimum and maximum values.
Immutable statistics
The mean and standard deviation are the most commonly used statistics.
Note
Mean and variance
Arithmetic mean:
$$\mu = \frac{1}{n}\sum_{i=0}^{n-1} x_i$$
Variance:
$$\sigma^2 = \frac{1}{n}\sum_{i=0}^{n-1} (x_i - \mu)^2$$
Variance adjusted for sampling bias:
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=0}^{n-1} (x_i - \mu)^2$$
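As a quick illustration of the three definitions in the note, here is a minimal sketch that computes them directly, one traversal for the mean and a second for the squared deviations. The object name MeanVarianceSketch and its method names are hypothetical and are not part of the book's library.

```scala
// Direct, two-pass translation of the mean and variance definitions.
// MeanVarianceSketch is an illustrative name, not a class from the book.
object MeanVarianceSketch {
  // Arithmetic mean: sum of the observations divided by their count
  def mean(xs: Vector[Double]): Double = {
    require(xs.nonEmpty, "need at least one observation")
    xs.sum / xs.size
  }

  // Sum of squared deviations from the mean, shared by both variances
  private def squaredDeviations(xs: Vector[Double]): Double = {
    val mu = mean(xs)
    xs.map(x => (x - mu) * (x - mu)).sum
  }

  // Variance: divide by n
  def variance(xs: Vector[Double]): Double =
    squaredDeviations(xs) / xs.size

  // Variance adjusted for sampling bias: divide by n - 1
  def sampleVariance(xs: Vector[Double]): Double = {
    require(xs.size > 1, "need at least two observations")
    squaredDeviations(xs) / (xs.size - 1)
  }
}
```

For the sample Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0), the mean is 5, the sum of squared deviations is 32, the variance is 32/8 = 4, and the bias-adjusted variance is 32/7.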
Let's extend the MinMax class with some basic statistics capabilities, Stats:
class Stats[T: ToDouble](values: Vector[T]) extends MinMax[T](values) {
  val zero = (0.0, 0.0)
  val sums = values./:(zero)((s, x) => (s._1 + x, s._2 + x*x)) //1
  lazy val mean = sums._1/values.size //2
  lazy val variance = (sums._2 - mean*mean*values.size)/(values.size-1)
  lazy val stdDev = sqrt(variance)
  …
}
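Because the definitions of MinMax and ToDouble are not repeated in this section, the single-pass idea behind Stats can be tried in isolation with the following sketch. It is an assumption-laden stand-in rather than the book's class: SimpleStats is a hypothetical name, it accepts only Vector[Double], and it uses foldLeft, which is equivalent to the /: operator (deprecated since Scala 2.13). The variance relies on the identity that the sum of squared deviations equals the sum of squares minus n times the squared mean.

```scala
import scala.math.sqrt

// Hypothetical, self-contained stand-in for Stats (not the book's class):
// it works on Vector[Double] so that neither MinMax nor ToDouble is needed.
class SimpleStats(values: Vector[Double]) {
  require(values.size > 1, "need at least two observations")

  // A single traversal accumulates (sum of x, sum of x*x)
  private val sums: (Double, Double) =
    values.foldLeft((0.0, 0.0)) { case ((s, sq), x) => (s + x, sq + x * x) }

  // Lazy values are computed on first access, then cached
  lazy val mean: Double = sums._1 / values.size

  // Bias-adjusted variance: (sum of squares - n * mean^2) / (n - 1)
  lazy val variance: Double =
    (sums._2 - mean * mean * values.size) / (values.size - 1)

  lazy val stdDev: Double = sqrt(variance)
}
```

For Vector(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0), the sums are (40, 232), so the mean is 5 and the variance is (232 - 200)/7 = 32/7. One caveat on the design: subtracting two large, nearly equal quantities can lose precision when the mean is large relative to the spread of the data, so a two-pass computation may be preferable for such inputs.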
The...