Machine learning libraries
Machine learning libraries can be categorized using different criteria, which are explained in the sections that follow.
Open source or commercial
Free and open source libraries are cost-effective, and most of them provide a framework that allows you to implement new algorithms on your own. Support for these libraries is generally not as good as the support available for proprietary libraries, although some open source libraries have very active mailing lists that help to address this issue.
Apache Mahout, OpenCV, MLlib, and Mallet are some open source libraries.
MATLAB is a commercial numerical environment that contains a machine learning library.
Scalability
Machine learning algorithms are resource-intensive operations in terms of CPU, memory, and storage, and most of the time they are applied to large volumes of data. Decentralization (for example, of data and algorithms), distribution, and replication techniques are therefore used to scale out a system. The following libraries address scalability in different ways:
- Apache Mahout (data distributed over clusters and parallel algorithms)
- Spark MLlib (distributed memory-based Spark architecture)
- MLPACK (low memory or CPU requirements due to the use of C++)
- GraphLab (multicore parallelism)
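The idea of distributing data and running an algorithm in parallel can be illustrated with a minimal sketch. It is not taken from any of the libraries listed above: the function names (`partition`, `local_stats`, `merge_stats`, `distributed_mean`) are hypothetical, threads on one machine stand in for cluster nodes, and the input is assumed to be non-empty. Each worker summarizes only its own chunk, and the partial summaries are then merged into the final result.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(data, n_parts):
    """Split data into at most n_parts chunks (stand-ins for cluster nodes)."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def local_stats(chunk):
    """Map step: a worker summarizes only the chunk it holds."""
    return (len(chunk), sum(chunk))


def merge_stats(a, b):
    """Reduce step: combine two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(data, n_parts=4):
    """Compute the mean from per-chunk summaries instead of the full dataset."""
    chunks = partition(data, n_parts)
    # Threads are an assumption made for illustration; a real cluster would
    # run local_stats on separate machines and ship back only the summaries.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(local_stats, chunks))
    count, total = reduce(merge_stats, partials)
    return total / count


values = [float(x) for x in range(1, 101)]
print(distributed_mean(values))
```

Only the small `(count, sum)` pairs travel between workers and the coordinator, which is why this pattern scales to datasets that do not fit on a single machine.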
Languages used
Most machine learning libraries are implemented in languages such as Java, C#, C++, Python, and Scala.
Algorithm support
R packages and Weka implement many machine learning algorithms. However, they are not scalable. Among scalable machine learning libraries, Apache Mahout has better algorithm support than Spark MLlib at the moment, as Spark MLlib is relatively young.
Batch processing versus stream processing
Stream processing mechanisms, for example, Jubatus and Samoa, use incremental learning to update a model as soon as new data is received.
In batch processing, data is collected over a period of time and then processed together. In the context of machine learning, the model is updated after collecting data for a period of time. The batch processing mechanism (for example, Apache Mahout) is mostly suitable for processing large volumes of data.
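The difference between the two update styles can be sketched with a toy one-parameter model, y = w * x. This is only an illustration under stated assumptions, not how Jubatus, Samoa, or Apache Mahout work internally: the class names `IncrementalModel` and `BatchModel`, the learning rate, and the sample records are all hypothetical.

```python
class IncrementalModel:
    """Stream style: fits y = w * x, adjusting w after every single record."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.learning_rate = learning_rate

    def update(self, x, y):
        error = y - self.w * x
        self.w += self.learning_rate * error * x  # one gradient step per record


class BatchModel:
    """Batch style: buffers records and refits w only when fit() is called."""

    def __init__(self):
        self.w = 0.0
        self.buffer = []

    def collect(self, x, y):
        self.buffer.append((x, y))  # the model itself does not change here

    def fit(self):
        sxy = sum(x * y for x, y in self.buffer)
        sxx = sum(x * x for x, _ in self.buffer)
        self.w = sxy / sxx  # closed-form least squares for y = w * x
        self.buffer.clear()


records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x

stream_model = IncrementalModel()
batch_model = BatchModel()

for x, y in records:
    stream_model.update(x, y)  # w moves after each record
    batch_model.collect(x, y)  # w stays at 0.0 until fit()

w_before_fit = batch_model.w
batch_model.fit()

print("incremental w after one pass:", stream_model.w)
print("batch w before fit:", w_before_fit)
print("batch w after fit:", batch_model.w)
```

The incremental model has a usable, if approximate, estimate after every record, while the batch model produces nothing new until the whole collection period has ended, at which point it uses all of the buffered data at once.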
LIBSVM implements support vector machines and it is specialized for that purpose.
A comparison of some of the popular machine learning libraries is given in the following table:
Table 1: Comparison between popular machine learning libraries
Machine learning library | Open source or commercial | Scalable? | Language used | Algorithm support |
---|---|---|---|---|
MATLAB | Commercial | No | Mostly C | High |
R packages | Open source | No | R | High |
Weka | Open source | No | Java | High |
scikit-learn | Open source | No | Python | |
Apache Mahout | Open source | Yes | Java | Medium |
Spark MLlib | Open source | Yes | Scala | Low |
Samoa | Open source | Yes | Java |