Apache Mahout
In this section, we will have a quick look at Apache Mahout.
Do you know how Mahout got its name?
As you can see in the logo, a mahout is a person who drives an elephant. Hadoop's logo is an elephant, so the name indicates that Mahout's goal is to drive, that is, to use, Hadoop in the right manner.
The following are the features of Mahout:
- It is a project of the Apache software foundation
- It is a scalable machine learning library
- The MapReduce implementation scales linearly with the data
- Fast sequential algorithms (the runtime does not depend on the size of the dataset)
- It mainly contains clustering, classification, and recommendation (collaborative filtering) algorithms
- Machine learning algorithms can be executed in sequential (in-memory) mode or in distributed (MapReduce-enabled) mode; a brief in-memory example is sketched after this list
- Most of the algorithms are implemented using the MapReduce paradigm
- It runs on top of the Hadoop framework for scaling
- Data is stored in HDFS (data storage) or in memory
- It is a Java library (no user interface!)
- The latest released version is 0.9, and 1.0 is coming soon
- It is a general-purpose library, not a domain-specific one
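To give a feel for the in-memory (sequential) mode mentioned in the preceding list, here is a minimal sketch of a user-based recommender built with Mahout's Taste API. The file name ratings.csv, the neighborhood size of 10, and the user ID are assumptions chosen for illustration; the file is expected to contain lines of the form userID,itemID,preference:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class InMemoryRecommenderExample {
      public static void main(String[] args) throws Exception {
        // Load the whole dataset into memory (sequential mode, no Hadoop involved).
        // "ratings.csv" is a hypothetical file name used only for this sketch.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Compare users by the Pearson correlation of their ratings.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users as the neighborhood.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend 3 items for the user with ID 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }

The distributed counterparts of the clustering, classification, and recommendation algorithms run on Hadoop and are driven through the command-line launcher described later in this section.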
Note
For those of you who are curious: what are the problems that Mahout is trying to solve? They are the following:
The amount of available data is growing drastically.
The computer hardware market keeps delivering better-performing machines (for example, multicore computers). Machine learning algorithms are computationally expensive, but there was no framework that could harness the power of this hardware to gain better performance.
There is a need for a parallel programming framework to speed up machine learning algorithms.
Mahout provides a general parallelization approach for machine learning algorithms (the parallelization method is not algorithm-specific).
No specialized optimizations are required to improve the performance of each algorithm; you just need to add more cores.
The speedup is linear in the number of cores.
Each algorithm, such as Naïve Bayes, K-Means, and Expectation-maximization, is expressed in the summation form. (I will explain this in detail in future chapters.)
For more information, please read Map-Reduce for Machine Learning on Multicore, which can be found at http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf.
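To give a flavor of the summation form used in that paper, consider its linear regression example (an illustration of the idea, not Mahout code). The solution depends only on sums over the data points, so each mapper can compute partial sums over its split of the data and a reducer simply adds them up:

    A = \sum_{i=1}^{m} x_i x_i^{\top}, \qquad
    b = \sum_{i=1}^{m} x_i y_i, \qquad
    \theta^{*} = A^{-1} b

    A = \sum_{j} \Big( \sum_{i \in \mathrm{split}_j} x_i x_i^{\top} \Big), \qquad
    b = \sum_{j} \Big( \sum_{i \in \mathrm{split}_j} x_i y_i \Big)

Here, mapper j computes the inner sums over its own data split, and the reducer combines the partial results and solves for \theta^{*}.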
Setting up Apache Mahout
Download the latest release of Mahout from https://mahout.apache.org/general/downloads.html.
If you are referencing Mahout as a Maven project, add the following dependency to your pom.xml file (${mahout.version} is a Maven property that you define yourself, for example, 0.9):
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>${mahout.version}</version>
</dependency>
If required, add the following Maven dependencies as well:
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-math</artifactId>
    <version>${mahout.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-integration</artifactId>
    <version>${mahout.version}</version>
</dependency>
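Once the dependencies resolve, a quick way to verify the setup is to compile and run a trivial class against mahout-math. This is only a sanity-check sketch; the class name is an arbitrary choice:

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class MahoutSetupCheck {
      public static void main(String[] args) {
        // Create two small dense vectors and compute their dot product
        // to confirm that the Mahout jars are on the classpath.
        Vector v1 = new DenseVector(new double[] {1.0, 2.0, 3.0});
        Vector v2 = new DenseVector(new double[] {4.0, 5.0, 6.0});
        System.out.println("Dot product: " + v1.dot(v2));
      }
    }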
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
More details on setting up a Maven project can be found at http://maven.apache.org/.
Follow the instructions given at https://mahout.apache.org/developers/buildingmahout.html to build Mahout from the source.
The Mahout command-line launcher is located at bin/mahout. Running it without any arguments prints the list of valid program names.