References
The best reference is the online documentation, including:
- The pipeline API: http://spark.apache.org/docs/latest/mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines
- A full list of transformers: http://spark.apache.org/docs/latest/ml-features.html
Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, provides a detailed and up-to-date introduction to machine learning with Spark.
Several books introduce machine learning in more detail than we can here. We have mentioned The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman, several times in this book. It is one of the most complete introductions to the mathematical underpinnings of machine learning currently available.
Andrew Ng's Machine Learning course on https://www.coursera.org/ provides a good introduction to machine learning. The course uses Octave/MATLAB as its programming language, but the exercises should be straightforward to adapt to Breeze and Scala.