Mastering Java Machine Learning

You're reading from Mastering Java Machine Learning: A Java developer's guide to implementing machine learning and big data architectures

Product type Paperback
Published in Jul 2017
Publisher Packt
ISBN-13 9781785880513
Length 556 pages
Edition 1st Edition
Authors (2):
Uday Kamath
Krishna Choppella
Table of Contents (13 chapters)

Preface
1. Machine Learning Review
2. Practical Approach to Real-World Supervised Learning
3. Unsupervised Machine Learning Techniques
4. Semi-Supervised and Active Learning
5. Real-Time Stream Machine Learning
6. Probabilistic Graph Modeling
7. Deep Learning
8. Text Mining and Natural Language Processing
9. Big Data Machine Learning – The Final Frontier
A. Linear Algebra
B. Probability
Index

Datasets used in machine learning

To learn from data, we must be able to understand and manage data in all forms. Data originates from many different sources, and consequently, datasets may differ widely in structure or have little or no structure at all. In this section, we present a high-level classification of datasets with commonly occurring examples.

Based on their structure, or the lack thereof, datasets may be classified as containing the following:

  • Structured data: Structured data is the form most readily usable as input to most machine learning algorithms. The data consists of records or rows that follow a well-known format, with features that are either columns in a table or fields delimited by separators or tokens. There is no explicit relationship between the records or instances. Such datasets are typically stored in flat files or relational databases. The records of financial transactions at a bank shown in the following figure are an example of structured data:

    Financial card transactional data with labels of fraud

  • Transaction or market data: This is a special form of structured data where each entry corresponds to a collection of items. Examples of market datasets are the lists of grocery items purchased by different customers or movies viewed by customers, as shown in the following table:

    Market dataset for items bought from grocery store

  • Unstructured data: Unlike structured data, unstructured data is normally not available in a well-known format. Text, image, and video data are different forms of unstructured data. Usually, a transformation of some kind is needed to extract features from these forms of data into a structured dataset so that traditional machine learning algorithms can be applied.

    Sample text data with no discernible structure, hence unstructured. Separating spam from normal messages (ham) is a binary classification problem. Here the positive class (spam) and the negative class (ham) are distinguished by their labels, the second token in each instance of data. SMS Spam Collection Dataset (UCI Machine Learning Repository), source: Tiago A. Almeida from the Federal University of São Carlos.

  • Sequential data: Sequential data have an explicit notion of "order" to them. The order can be some relationship between features and a time variable in time series data, or it can be symbols repeating in some form in genomic datasets. Two examples of sequential data are weather data and genomic sequence data. The following figure shows the relationship between time and the sensor level for weather:

    Time series from sensor data

    The following figure shows three genomic sequences, each of which contains the repeated subsequences CGGGT and TTGAAAGTGGTG:

    Genomic sequences of DNA as a sequence of symbols.

  • Graph data: Graph data is characterized by relationships between entities in the data that form a graph structure. Graph datasets may be in a structured record format or an unstructured format, and typically the graph relationships have to be mined from the dataset. In the insurance domain, for example, claims can be treated as structured records containing the relevant claim details, with claimants related through addresses, phone numbers, and so on; these relationships can be viewed as a graph. On the World Wide Web, pages are available as unstructured data containing links, and the graphs of relationships built from those links are some of the most extensively mined graph datasets today:

    Insurance claim data, converted into a graph structure showing the relationship between vehicles, drivers, policies, and addresses
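
Two of the categories above depend on a transformation step: unstructured text must be turned into structured features, and graph relationships usually have to be mined from records. The following is a minimal Java sketch of both ideas, not the book's implementation. The class `DatasetShapes`, the `Claim` record type, its fields, and the sample values are all hypothetical and chosen only for illustration. Word counting is used here as one simple way to extract features from text, and a shared address is assumed to be the only link between claimants.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; none of these names come from the book.
class DatasetShapes {

    // Hypothetical structured record: one insurance claim with fixed fields.
    static final class Claim {
        final String claimId;
        final String claimant;
        final String address;

        Claim(String claimId, String claimant, String address) {
            this.claimId = claimId;
            this.claimant = claimant;
            this.address = address;
        }
    }

    // Unstructured -> structured: count lower-cased word tokens in a message.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Structured -> graph: connect claimants whose claims share an address.
    static Map<String, Set<String>> linkByAddress(List<Claim> claims) {
        // Group claimant names by the address on their claims.
        Map<String, Set<String>> byAddress = new TreeMap<>();
        for (Claim c : claims) {
            byAddress.computeIfAbsent(c.address, k -> new TreeSet<>()).add(c.claimant);
        }
        // Build an undirected adjacency list; every claimant gets a node.
        Map<String, Set<String>> graph = new TreeMap<>();
        for (Claim c : claims) {
            graph.putIfAbsent(c.claimant, new TreeSet<>());
        }
        for (Set<String> group : byAddress.values()) {
            for (String a : group) {
                for (String b : group) {
                    if (!a.equals(b)) {
                        graph.get(a).add(b);
                    }
                }
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        // Prints {a=1, entry=1, free=2, prize=1, win=1}
        System.out.println(bagOfWords("Free entry! Win a FREE prize"));

        List<Claim> claims = Arrays.asList(
            new Claim("C1", "Alice", "12 Oak St"),
            new Claim("C2", "Bob", "12 Oak St"),
            new Claim("C3", "Carol", "9 Elm Rd"));
        // Prints {Alice=[Bob], Bob=[Alice], Carol=[]}
        System.out.println(linkByAddress(claims));
    }
}
```

The word counts form a row of features that a traditional algorithm could consume, and the adjacency map is one simple in-memory form of the claimant graph. A real pipeline would likely need richer tokenization and more linking attributes, such as phone numbers.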
