Other libraries
For mining massive datasets, we can use Algebird, an abstract algebra library for Scala that was also open sourced by Twitter. The code was originally developed as part of the Scalding Matrix API. Because it proved to have broader applications in aggregation systems such as Scalding and Storm, it was split out into a separate library.
Locality Sensitive Hashing is a technique that reduces the dimensionality of the data and provides an approximate measure of similarity. It is based on the idea that high-dimensional items can be hashed into a much smaller space in such a way that similar items are likely to end up in the same bucket, so similarity can still be estimated with high accuracy.
An implementation of the approximate Jaccard item-similarity using Locality Sensitive Hashing (LSH) is provided in the source code accompanying this book.
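To make the idea concrete, the following is a minimal, dependency-free sketch of MinHash signatures with LSH banding for approximate Jaccard similarity. It is not the code accompanying this book and it does not use the Algebird API. The object name MinHashLsh and all of its methods and parameters (numHashes, rows) are hypothetical names chosen for illustration, and the use of seeded MurmurHash3 as the hash family is an assumption.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper object; names are illustrative, not taken from any library.
object MinHashLsh {

  // Exact Jaccard similarity |A ∩ B| / |A ∪ B|, kept as a reference value.
  def jaccard[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  // MinHash signature: for each of numHashes seeded hash functions,
  // keep the minimum hash value observed over the elements of the set.
  def signature(items: Set[String], numHashes: Int): Vector[Int] = {
    require(items.nonEmpty, "cannot build a signature for an empty set")
    require(numHashes > 0, "numHashes must be positive")
    Vector.tabulate(numHashes) { seed =>
      items.iterator.map(item => MurmurHash3.stringHash(item, seed)).min
    }
  }

  // The fraction of positions on which two signatures agree
  // is an estimate of the Jaccard similarity of the underlying sets.
  def estimate(a: Vector[Int], b: Vector[Int]): Double = {
    require(a.length == b.length, "signatures must have the same length")
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
  }

  // LSH banding: cut the signature into bands of `rows` values.
  // Each (band index, band contents) pair acts as a bucket key.
  def bucketKeys(sig: Vector[Int], rows: Int): Vector[(Int, Vector[Int])] = {
    require(rows > 0, "rows must be positive")
    sig.grouped(rows).zipWithIndex.map { case (band, i) => (i, band) }.toVector
  }

  // Items that share at least one bucket key become candidate pairs,
  // so only those pairs need to be compared in detail.
  def candidatePairs(sigs: Map[String, Vector[Int]], rows: Int): Set[(String, String)] =
    sigs.toSeq
      .flatMap { case (id, sig) => bucketKeys(sig, rows).map(key => key -> id) }
      .groupBy(_._1)
      .values
      .flatMap { group =>
        val ids = group.map(_._2).sorted
        for { i <- ids.indices; j <- (i + 1) until ids.length } yield (ids(i), ids(j))
      }
      .toSet
}
```

As a usage example, MinHashLsh.signature(Set("a", "b", "c"), 64) produces a 64-value signature, and MinHashLsh.estimate applied to two such signatures approximates the value that MinHashLsh.jaccard computes exactly. Using more hash functions tightens the estimate, while fewer rows per band makes candidatePairs more permissive, trading extra comparisons for fewer missed pairs.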
Another interesting open source project is Ganitha, which integrates Mahout vectors into Scalding and provides implementations of Naive Bayes classifiers and K-Means. It can be found at https://github.com/tresata/ganitha. This library, among others, simplifies...