In the previous section, we looked at statistics for columns containing a single numeric value. In machine learning (ML), however, data is more commonly represented as vectors of multiple numeric values. A vector is a generalized structure that consists of one or more elements of the same data type. For example, the following are two vectors, each with three elements of type double:
[2.0,3.0,5.0]
[4.0,6.0,7.0]
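Computing statistics in the classic way won't work for vectors, because each statistic has to be computed element by element across all the vectors rather than over a single scalar column. It is also quite common to have weights associated with these vectors, and there are times when the weights have to be considered as well while computing statistics on such a data type.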
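To make the idea concrete, here is a minimal plain-Python sketch (not Spark code) of what an element-wise mean over weighted vectors looks like, using the two example vectors above. The function name weighted_mean and the weights 1.0 and 3.0 are assumptions made up for illustration; they do not come from Spark or from the dataset.

```python
from typing import List, Sequence


def weighted_mean(vectors: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Element-wise weighted mean of equal-length vectors (hypothetical helper)."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector and at least one vector")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    # For each position i, average the i-th elements, scaled by each vector's weight.
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total_weight
        for i in range(dim)
    ]


vectors = [[2.0, 3.0, 5.0], [4.0, 6.0, 7.0]]
weights = [1.0, 3.0]  # assumed weights, for illustration only

print(weighted_mean(vectors, [1.0, 1.0]))  # equal weights -> [3.0, 4.5, 6.0]
print(weighted_mean(vectors, weights))     # weighted      -> [3.5, 5.25, 6.5]
```

With equal weights the result is the ordinary per-element mean; with the assumed weights, the second vector counts three times as much, so every element of the mean shifts toward it.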
Spark MLlib's Summarizer (https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/stat/Summarizer.html) provides several convenient methods to compute stats on vector...