You're reading from Apache Mahout Essentials: Implement top-notch machine learning algorithms for classification, clustering, and recommendations with Apache Mahout

Product type Paperback

Published in Jun 2015

ISBN-13 9781783554997

Length 164 pages

Edition 1st Edition

Languages Java

Tools Mahout

Concepts Machine Learning

Author (1):

Jayani Withanawasam

Machine learning libraries

Machine learning libraries can be categorized using different criteria, which are explained in the sections that follow.

Open source or commercial

Free and open source libraries are cost-effective, and most of them provide a framework that allows you to implement new algorithms on your own. Support for these libraries is generally not as good as the support available for proprietary libraries, although some open source libraries have very active mailing lists that help to address this issue.

Apache Mahout, OpenCV, Spark MLlib, and MALLET are some open source libraries.

MATLAB is a commercial numerical environment that contains a machine learning library.

Scalability

Machine learning algorithms are resource-intensive (CPU, memory, and storage) operations, and most of the time they are applied to large volumes of data. Decentralization (for example, of data and algorithms), distribution, and replication techniques are therefore used to scale out a system. The following libraries take different approaches:

Apache Mahout (data distributed over clusters and parallel algorithms)
Spark MLlib (distributed memory-based Spark architecture)
MLPACK (low memory or CPU requirements due to the use of C++)
GraphLab (multicore parallelism)
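
The idea of distributing data and running an algorithm in parallel can be illustrated with a minimal sketch. It is plain Java and does not use Mahout's API: the class PartitionedMean and its methods are hypothetical names, and threads in a single JVM stand in for the nodes of a cluster. Each partition is summarized independently, and the partial results are then merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: threads in one JVM stand in for cluster nodes.
final class PartitionedMean {

    // Partial result computed independently for one partition.
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Combine two partial results into one.
        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }
    }

    // Split the data into contiguous partitions of roughly equal size.
    static List<double[]> partition(double[] data, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be at least 1");
        }
        List<double[]> out = new ArrayList<>();
        int size = (data.length + parts - 1) / parts; // ceiling division
        for (int start = 0; start < data.length; start += size) {
            int end = Math.min(start + size, data.length);
            out.add(Arrays.copyOfRange(data, start, end));
        }
        return out;
    }

    // Work done on a single partition: no access to the other partitions.
    static Partial summarize(double[] chunk) {
        double sum = 0.0;
        for (double v : chunk) {
            sum += v;
        }
        return new Partial(sum, chunk.length);
    }

    // Summarize all partitions in parallel, then merge the partial results.
    static double mean(double[] data, int parts) {
        Partial total = partition(data, parts).parallelStream()
                .map(PartitionedMean::summarize)
                .reduce(new Partial(0.0, 0), Partial::merge);
        return total.sum / total.count;
    }
}
```

For example, PartitionedMean.mean(new double[] {1, 2, 3, 4, 5, 6, 7}, 3) splits the seven values into three partitions before combining them. In a real cluster the partitions would live on different machines, and only the small partial results would travel over the network.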

Languages used

Most machine learning libraries are implemented in languages such as Java, C#, C++, Python, and Scala.

Algorithm support

Machine learning tools such as R (through its packages) and Weka have many machine learning algorithms implemented. However, they are not scalable. Among scalable machine learning libraries, Apache Mahout has better algorithm support than Spark MLlib at the time of writing, as Spark MLlib is relatively young.

Batch processing versus stream processing

Stream processing mechanisms, for example, Jubatus and SAMOA, use incremental learning to update a model as soon as new data is received.

In batch processing, data is collected over a period of time and then processed together. In the context of machine learning, this means that the model is updated only after data has been collected for a period of time. The batch processing mechanism (for example, Apache Mahout) is best suited to processing large volumes of data.

LIBSVM implements support vector machines and is specialized for that purpose.
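
The difference between the two update styles can be sketched in plain Java. This is an illustration under simplifying assumptions, not code from Mahout, Jubatus, or SAMOA: the "model" is only the mean of the observed values, and the names MeanModels, IncrementalMean, and BatchMean are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: the "model" is just the mean of the observations.
final class MeanModels {

    // Stream style: the model is updated as soon as each value arrives.
    static final class IncrementalMean {
        private long n = 0;
        private double mean = 0.0;

        void observe(double x) {
            n++;
            mean += (x - mean) / n; // incremental update, no history is kept
        }

        double value() {
            return mean;
        }
    }

    // Batch style: values are buffered, and the model changes only on retrain().
    static final class BatchMean {
        private final List<Double> buffer = new ArrayList<>();
        private double mean = 0.0;

        void collect(double x) {
            buffer.add(x);
        }

        // Recompute the model from everything collected so far.
        void retrain() {
            if (buffer.isEmpty()) {
                return;
            }
            double sum = 0.0;
            for (double v : buffer) {
                sum += v;
            }
            mean = sum / buffer.size();
        }

        double value() {
            return mean;
        }
    }
}
```

With IncrementalMean, value() reflects every observation immediately. With BatchMean, value() stays unchanged while data is being collected and only moves when retrain() is called, which is the trade-off the two paragraphs above describe.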

A comparison of some popular machine learning libraries is given in the following table.

Table 1: Comparison between popular machine learning libraries

Machine learning library	Open source or commercial	Scalable?	Language used	Algorithm support
MATLAB	Commercial	No	Mostly C	High
R packages	Open source	No	R	High
Weka	Open source	No	Java	High
scikit-learn	Open source	No	Python
Apache Mahout	Open source	Yes	Java	Medium
Spark MLlib	Open source	Yes	Scala	Low
SAMOA	Open source	Yes	Java