Machine Learning in Java

Helpful techniques to design, build, and deploy powerful machine learning applications in Java

Product type Paperback
Published in Nov 2018
Publisher Packt
ISBN-13 9781788474399
Length 300 pages
Edition 2nd Edition
Authors (2): Ashish Bhatia, Bostjan Kaluza

Table of Contents (13 chapters)

Preface
1. Applied Machine Learning Quick Start
2. Java Libraries and Platforms for Machine Learning
3. Basic Algorithms - Classification, Regression, and Clustering
4. Customer Relationship Prediction with Ensembles
5. Affinity Analysis
6. Recommendation Engines with Apache Mahout
7. Fraud and Anomaly Detection
8. Image Recognition with Deeplearning4j
9. Activity Recognition with Mobile Phone Sensors
10. Text Mining with Mallet - Topic Modeling and Spam Detection
11. What Is Next?
12. Other Books You May Enjoy

Data and problem definition

When presented with a problem definition, we need to ask questions that help us understand the objective and the information we want to extract from the data. Common questions include: What do we expect to find once the data is explored? What kind of information can be extracted after data exploration? What format is required for the question to be answered? Asking the right questions gives a clearer understanding of how to proceed. Data is simply a collection of measurements in the form of numbers, words, observations, descriptions of things, images, and more.

Measurement scales

The most common way to represent data is using a set of attribute-value pairs. Consider the following example:

Bob = { 
height: 185cm, 
eye color: blue, 
hobbies: climbing, sky diving 
} 

Here, Bob has three attributes named height, eye color, and hobbies. Their values are 185cm, blue, and the pair climbing and sky diving, respectively.
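The attribute-value representation above maps naturally onto a Java `Map`. The following is a minimal sketch, not code from the book: the class name `AttributeValueExample` and the choice of `Object` as the value type are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, chosen for illustration only.
final class AttributeValueExample {

    // Builds the "Bob" example as a map from attribute name to value.
    static Map<String, Object> bob() {
        Map<String, Object> bob = new LinkedHashMap<>(); // keeps insertion order
        bob.put("height", 185.0);                              // a number, in cm
        bob.put("eye color", "blue");                          // text
        bob.put("hobbies", List.of("climbing", "sky diving")); // a list of values
        return bob;
    }

    public static void main(String[] args) {
        // Prints one attribute-value pair per line.
        bob().forEach((attribute, value) ->
                System.out.println(attribute + " = " + value));
    }
}
```

Using `Object` as the value type makes the point of this section visible in code: the three values have different types, so they cannot all be treated the same way.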

A set of data can be presented simply as a table, where columns correspond to attributes or features and rows correspond to particular data examples or instances. In supervised machine learning, the attribute whose value we want to predict (the outcome, Y) from the values of the other attributes (X) is called the class or target variable. The following table shows a dataset in this form:

Name | Height [cm] | Eye color | Hobbies
Bob  | 185.0       | Blue      | Climbing, sky diving
Anna | 163.0       | Brown     | Reading
...  | ...         | ...       | ...
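The same table can be sketched in Java as a list of instances, one object per row and one field per column. All names here (`TableExample`, `Person`, `targetColumn`) are hypothetical, and treating eye color as the target variable Y is an assumption made only for the example; the text does not say which attribute is the target.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; a sketch, not a library API.
final class TableExample {

    // One row of the table, that is, one instance.
    static final class Person {
        final String name;
        final double heightCm;
        final String eyeColor;
        final List<String> hobbies;

        Person(String name, double heightCm, String eyeColor, List<String> hobbies) {
            this.name = name;
            this.heightCm = heightCm;
            this.eyeColor = eyeColor;
            this.hobbies = hobbies;
        }
    }

    // The two rows that the table spells out.
    static List<Person> rows() {
        return List.of(
                new Person("Bob", 185.0, "Blue", List.of("Climbing", "sky diving")),
                new Person("Anna", 163.0, "Brown", List.of("Reading")));
    }

    // Reads one column: the assumed target variable Y (eye color).
    static List<String> targetColumn(List<Person> rows) {
        List<String> y = new ArrayList<>();
        for (Person p : rows) {
            y.add(p.eyeColor);
        }
        return y;
    }

    public static void main(String[] args) {
        System.out.println("Y = " + targetColumn(rows()));
    }
}
```

The remaining fields (height and hobbies) would play the role of X, the attributes used to predict Y.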

The first thing we notice is how much the attribute values vary. For instance, height is a number, eye color is text, and hobbies are a list. To gain a better understanding of the value types, let's take a closer look at the different types of data or measurement scales. Stanley Smith Stevens (1946) defined the following four scales of measurement with increasingly expressive properties:

  • Nominal data consists of categories that are mutually exclusive, but not ordered. Examples include eye color, marital status, type of car owned, and so on.
  • Ordinal data corresponds to categories where the order matters, but the difference between the values does not, such as pain level, student letter grades, service quality ratings, IMDb movie ratings, and so on.
  • Interval data consists of values where the difference between two values is meaningful, but there is no true zero point, for instance, standardized exam scores, temperature in Fahrenheit, and so on.
  • Ratio data has all of the properties of interval data and also a clear definition of zero; a value of zero means that the measured quantity is absent. Variables such as height, age, stock prices, and weekly food spending are ratio variables.
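The difference between the interval and ratio scales is easiest to see with temperature. The sketch below is an illustration added here, not an example from the book: it uses the standard Fahrenheit-to-Kelvin conversion, and the class name `IntervalVsRatio` and the two sample temperatures are assumptions.

```java
// Hypothetical class name; the sample temperatures are illustrative.
final class IntervalVsRatio {

    // Converts degrees Fahrenheit (interval scale) to kelvins (ratio scale).
    static double fahrenheitToKelvin(double fahrenheit) {
        return (fahrenheit - 32.0) * 5.0 / 9.0 + 273.15;
    }

    public static void main(String[] args) {
        double cool = 40.0; // degrees Fahrenheit
        double warm = 80.0; // degrees Fahrenheit

        // Differences are meaningful on an interval scale.
        System.out.println("Difference in F: " + (warm - cool));

        // The raw ratio is 2, but it only reflects where 0 F was placed.
        System.out.println("Naive ratio in F: " + (warm / cool));

        // Kelvin starts at absolute zero, so the ratio is meaningful there.
        System.out.println("Ratio in K: "
                + (fahrenheitToKelvin(warm) / fahrenheitToKelvin(cool)));
    }
}
```

Although 80 is twice 40, the ratio on the Kelvin scale is only slightly above 1, which is why 80°F cannot be described as twice as hot as 40°F.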

Why should we care about measurement scales? Well, machine learning depends heavily on the statistical properties of the data; hence, we should be aware of the limitations that each data type possesses. Some machine learning algorithms can only be applied to a subset of measurement scales.

The following table summarizes the main operations and statistical properties supported by each measurement scale:

# | Property                                   | Nominal | Ordinal | Interval | Ratio
1 | Frequency of distribution                  | True    | True    | True     | True
2 | Mode and median                            |         | True    | True     | True
3 | Order of values is known                   |         | True    | True     | True
4 | Can quantify difference between each value |         |         | True     | True
5 | Can add or subtract values                 |         |         | True     | True
6 | Can multiply and divide values             |         |         |          | True
7 | Has true zero                              |         |         |          | True
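Because each scale adds properties to the one before it, the seven rows of the table can be sketched as a single ordered Java enum. This is an illustrative sketch with hypothetical names. It assumes that the True entries in each row belong to the most expressive scales, so a row with three True entries applies to ordinal, interval, and ratio data.

```java
// Hypothetical enum; constants are declared from least to most expressive.
enum MeasurementScale {
    NOMINAL, ORDINAL, INTERVAL, RATIO;

    // True if this scale is at least as expressive as the given one.
    private boolean atLeast(MeasurementScale other) {
        return compareTo(other) >= 0;
    }

    boolean hasFrequencyDistribution() { return atLeast(NOMINAL); }  // row 1
    boolean hasModeAndMedian()         { return atLeast(ORDINAL); }  // row 2
    boolean hasKnownOrder()            { return atLeast(ORDINAL); }  // row 3
    boolean canQuantifyDifference()    { return atLeast(INTERVAL); } // row 4
    boolean canAddOrSubtract()         { return atLeast(INTERVAL); } // row 5
    boolean canMultiplyOrDivide()      { return atLeast(RATIO); }    // row 6
    boolean hasTrueZero()              { return atLeast(RATIO); }    // row 7

    public static void main(String[] args) {
        for (MeasurementScale scale : values()) {
            System.out.println(scale
                    + ": ordered=" + scale.hasKnownOrder()
                    + ", differences=" + scale.canQuantifyDifference()
                    + ", trueZero=" + scale.hasTrueZero());
        }
    }
}
```

The sketch follows the table as given, which groups mode and median in one row. Strictly speaking, the mode can also be computed for nominal data, while the median needs at least an ordinal scale.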

Furthermore, nominal and ordinal data correspond to discrete values, while interval and ratio data can correspond to continuous values as well. In supervised learning, the measurement scale of the attribute that we want to predict dictates the kind of machine learning algorithm that can be used. For instance, predicting discrete values from a limited list is called classification and can be achieved using decision trees, while predicting continuous values is called regression and can be achieved using model trees.
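The rule of thumb in the previous paragraph can be sketched as a small lookup from the target's measurement scale to the learning task. The names `TaskChooser`, `Scale`, `Task`, and `taskFor` are hypothetical. The mapping is also a simplification: interval and ratio targets can be discrete as well (counts, for example), in which case the choice is less clear-cut.

```java
// Hypothetical helper; a simplified reading of the rule in the text.
final class TaskChooser {

    enum Scale { NOMINAL, ORDINAL, INTERVAL, RATIO }

    enum Task { CLASSIFICATION, REGRESSION }

    // Discrete targets map to classification, continuous targets to regression.
    static Task taskFor(Scale targetScale) {
        switch (targetScale) {
            case NOMINAL:
            case ORDINAL:
                return Task.CLASSIFICATION; // for example, a decision tree
            default:
                return Task.REGRESSION;     // for example, a model tree
        }
    }

    public static void main(String[] args) {
        for (Scale scale : Scale.values()) {
            System.out.println(scale + " target -> " + taskFor(scale));
        }
    }
}
```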
