Java for Data Science: Examine the techniques and Java tools supporting the growing field of data science

Product type Paperback

Published in Jan 2017

Publisher Packt

ISBN-13 9781785280115

Length 386 pages

Edition 1st Edition

Languages

Java

Tools

Deeplearning4j

Concepts

Data Science

Authors (2):

Jennifer L. Reese

Richard M. Reese

Table of Contents (13 chapters)

Preface

1. Getting Started with Data Science

2. Data Acquisition

3. Data Cleaning

4. Data Visualization

5. Statistical Data Analysis Techniques

6. Machine Learning

7. Neural Networks

8. Deep Learning

9. Text Analysis

10. Visual and Audio Analysis

11. Mathematical and Parallel Techniques for Data Analysis

12. Bringing It All Together

Machine learning applied to data science

Machine learning has become increasingly important for data science, as it has for many other fields. A defining characteristic of machine learning is that a model is trained on a set of representative data and later used to solve similar problems, so there is no need to explicitly program an application to solve the problem. A model, in this sense, is a representation of a real-world object.

For example, customer purchases can be used to train a model, which can then predict the types of purchases a customer might make next. This allows an organization to tailor ads and coupons to a customer, potentially providing a better customer experience.

Training can follow one of several approaches:

Supervised learning: The model is trained with annotated (labeled) data that shows the corresponding correct results
Unsupervised learning: The data does not contain results, and the model is expected to find relationships on its own
Semi-supervised learning: A small amount of labeled data is combined with a larger amount of unlabeled data
Reinforcement learning: The model also learns from feedback, but instead of labeled correct results it receives a reward for good results
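
To make the supervised case concrete, here is a minimal sketch in plain Java. It uses a 1-nearest-neighbor classifier, which is chosen here only because it is short; it is not one of the techniques the book covers in this section. The class name, the feature layout (items per order, average item price), and the labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Supervised-learning sketch: a 1-nearest-neighbor classifier.
// All names and data are hypothetical and for illustration only.
class NearestNeighborSketch {
    private final List<double[]> features = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    // "Training" simply stores each labeled example.
    void train(double[] x, String label) {
        features.add(x.clone());
        labels.add(label);
    }

    // Predicts the label of the closest stored example
    // (squared Euclidean distance; all vectors are assumed to have the same length).
    String predict(double[] x) {
        if (features.isEmpty()) {
            throw new IllegalStateException("model has not been trained");
        }
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < features.size(); i++) {
            double[] f = features.get(i);
            double distance = 0;
            for (int j = 0; j < x.length; j++) {
                double diff = f[j] - x[j];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return labels.get(best);
    }

    public static void main(String[] args) {
        NearestNeighborSketch model = new NearestNeighborSketch();
        // Hypothetical labeled data: {items per order, average item price} -> customer type
        model.train(new double[]{1, 200}, "premium");
        model.train(new double[]{2, 150}, "premium");
        model.train(new double[]{8, 10}, "bargain");
        model.train(new double[]{10, 15}, "bargain");

        // The trained model is then applied to customers it has not seen.
        System.out.println(model.predict(new double[]{2, 180}));
        System.out.println(model.predict(new double[]{9, 12}));
    }
}
```

The labels passed to train are what make this supervised: without them, the model would have to group the customers on its own, which is the unsupervised setting.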

There are several approaches that support machine learning. In Chapter 6, Machine Learning, we will illustrate three techniques:

Decision trees: A tree is constructed using features of the problem as internal nodes and the results as leaves
Support vector machines: These are used for classification by creating a hyperplane that partitions the dataset; predictions are then made based on which side of the hyperplane a data point falls
Bayesian networks: These are used to depict probabilistic relationships between events
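
The decision tree structure described above, with features as internal nodes and results as leaves, can be sketched in plain Java as follows. This tree is built by hand rather than learned from data, and the feature names (age, income), thresholds, and results are hypothetical; they are not taken from the Chapter 6 dataset.

```java
import java.util.Map;

// Decision-tree sketch: internal nodes test a feature, leaves hold a result.
// The tree is hand-built; a real learner would derive it from training data.
class DecisionTreeSketch {

    static final class Node {
        final String feature;    // feature tested at an internal node; null for a leaf
        final double threshold;  // split point for the tested feature
        final Node below;        // subtree used when value < threshold
        final Node atOrAbove;    // subtree used when value >= threshold
        final String result;     // result held by a leaf; null for an internal node

        private Node(String feature, double threshold, Node below, Node atOrAbove, String result) {
            this.feature = feature;
            this.threshold = threshold;
            this.below = below;
            this.atOrAbove = atOrAbove;
            this.result = result;
        }

        static Node leaf(String result) {
            return new Node(null, 0, null, null, result);
        }

        static Node split(String feature, double threshold, Node below, Node atOrAbove) {
            return new Node(feature, threshold, below, atOrAbove, null);
        }

        // Walks from this node down to a leaf and returns the leaf's result.
        String classify(Map<String, Double> sample) {
            if (feature == null) {
                return result;
            }
            Double value = sample.get(feature);
            if (value == null) {
                throw new IllegalArgumentException("missing feature: " + feature);
            }
            return (value < threshold ? below : atOrAbove).classify(sample);
        }
    }

    // Hypothetical tree: younger people with sufficient income are predicted to camp.
    static Node exampleTree() {
        return Node.split("age", 40,
                Node.split("income", 30000, Node.leaf("no"), Node.leaf("yes")),
                Node.leaf("no"));
    }

    public static void main(String[] args) {
        Node tree = exampleTree();
        System.out.println(tree.classify(Map.of("age", 30.0, "income", 50000.0)));
        System.out.println(tree.classify(Map.of("age", 55.0, "income", 50000.0)));
    }
}
```

Each call to classify follows a single path from the root to a leaf, so the cost of a prediction depends on the depth of the tree rather than on the size of the training data.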

A Support Vector Machine (SVM) is used primarily for classification problems. The approach creates a hyperplane to categorize data, which can be envisioned as a geometric plane that separates two regions. In a two-dimensional space, the hyperplane is a line. In a three-dimensional space, it is a two-dimensional plane. In Chapter 6, Machine Learning, we will demonstrate the approach using a dataset relating to the propensity of individuals to camp. We will use the Weka class, SMO, to demonstrate this type of analysis.
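
As a rough illustration of how a hyperplane categorizes data, the sketch below classifies a point by the sign of w · x + b, where w is a weight vector and b a bias. It covers only the prediction side: finding a good w and b is the training step that a class such as Weka's SMO performs, and it is not shown here. The class name and the hand-picked line x + y - 5 = 0 are assumptions made for this example.

```java
// Sketch only: classifies points by which side of a fixed hyperplane they fall on.
// The weights and bias are hand-picked, hypothetical values, not learned ones.
class HyperplaneSketch {
    private final double[] weights;
    private final double bias;

    HyperplaneSketch(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    // Signed score w . x + b: zero on the hyperplane, positive on one side, negative on the other.
    double score(double[] x) {
        if (x.length != weights.length) {
            throw new IllegalArgumentException("expected " + weights.length + " features");
        }
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    // Returns +1 for points on or above the hyperplane, -1 otherwise.
    int classify(double[] x) {
        return score(x) >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // In two dimensions the hyperplane is a line, here x + y - 5 = 0.
        HyperplaneSketch line = new HyperplaneSketch(new double[]{1, 1}, -5);
        System.out.println(line.classify(new double[]{1, 1})); // prints -1
        System.out.println(line.classify(new double[]{4, 4})); // prints 1
    }
}
```

The same class works unchanged in three dimensions: passing three weights makes the boundary a two-dimensional plane.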

The following figure shows a distribution of two types of data points. The lines represent possible hyperplanes that separate these points. Each line clearly separates the two types except for one outlier.
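
The situation in the figure can be approximated in code. The sketch below counts how many points fall on the wrong side of each candidate line. The points, labels, and candidate lines are invented for illustration and do not reproduce the figure's data.

```java
// Sketch: compares candidate separating lines by counting misclassified points.
// All data and candidate lines are hypothetical.
class CandidateHyperplanes {

    // Two clusters labeled -1 and +1, plus one +1 outlier sitting near the -1 cluster.
    static final double[][] POINTS = {
        {1, 1}, {2, 1}, {1, 2}, {2, 2},   // label -1
        {5, 5}, {6, 5}, {5, 6}, {6, 6},   // label +1
        {2, 3}                            // label +1 (outlier)
    };
    static final int[] LABELS = {-1, -1, -1, -1, 1, 1, 1, 1, 1};

    // Counts points whose label disagrees with their side of the line w . x + b = 0.
    static int countMisclassified(double[] w, double b, double[][] points, int[] labels) {
        int errors = 0;
        for (int i = 0; i < points.length; i++) {
            double score = b;
            for (int j = 0; j < w.length; j++) {
                score += w[j] * points[i][j];
            }
            int predicted = score >= 0 ? 1 : -1;
            if (predicted != labels[i]) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Candidate 1: the diagonal line x + y = 7.
        int diagonal = countMisclassified(new double[]{1, 1}, -7, POINTS, LABELS);
        // Candidate 2: the vertical line x = 3.5.
        int vertical = countMisclassified(new double[]{1, 0}, -3.5, POINTS, LABELS);
        System.out.println("diagonal line errors: " + diagonal);
        System.out.println("vertical line errors: " + vertical);
    }
}
```

With this data, both candidate lines misclassify only the outlier, which shows that several different hyperplanes can separate the same points equally well.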