You're reading from Java for Data Science: Examine the techniques and Java tools supporting the growing field of data science

Product type Paperback

Published in Jan 2017

Publisher Packt

ISBN-13 9781785280115

Length 386 pages

Edition 1st Edition

Languages

Java

Tools

Deeplearning4j

Concepts

Data Science

Authors (2):

Jennifer L. Reese

Richard M. Reese

Table of Contents (13) Chapters

Preface

1. Getting Started with Data Science

2. Data Acquisition

3. Data Cleaning

4. Data Visualization

5. Statistical Data Analysis Techniques

6. Machine Learning

7. Neural Networks

8. Deep Learning

9. Text Analysis

10. Visual and Audio Analysis

11. Mathematical and Parallel Techniques for Data Analysis

12. Bringing It All Together

Acquiring data for an application

Data acquisition is an important step in the data analysis process. Acquired data often arrives in a specialized form, and its contents may be inconsistent or may not match what the application needs. Many sources of data are available on the Internet. Several examples are demonstrated in Chapter 2, Data Acquisition.

Data may be stored in a variety of formats. Popular formats for text data include HTML, Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image and audio data are also stored in a number of formats. It is frequently necessary to convert data from one format into another, typically plain text.
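
As a small illustration of converting one format into plain text, the following sketch splits a single CSV line into its fields using only the standard library. The class name `CsvLineSplitter` and the sample record are assumptions made for this illustration, and a production application would more likely rely on a dedicated CSV library.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSplitter {

    // Splits one CSV line into fields. Commas inside double quotes are kept,
    // and a doubled quote ("") inside a quoted field becomes a single quote.
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample record, not taken from the book.
        String line = "1,\"Reese, Richard\",Java";
        System.out.println(String.join(" | ", split(line)));
        // prints: 1 | Reese, Richard | Java
    }
}
```

The sketch handles one line at a time, so quoted fields that span several lines are outside its scope.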

For example, JSON (http://www.JSON.org/) stores data as key-value pairs enclosed in blocks of curly braces. The following example shows part of a YouTube search result, with unquoted words such as string standing in for actual values:

    {
      "kind": "youtube#searchResult",
      "etag": etag,
      "id": {
        "kind": string,
        "videoId": string,
        "channelId": string,
        "playlistId": string
      },
      ...
    }
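
To show how key-value pairs like these might be read from Java, here is a minimal sketch that looks up string values by key with a regular expression. It is not a JSON parser: it ignores escaped quotes and non-string values, and a real application would normally use a JSON library instead. The class name `JsonKeyLookup` and the sample values filled into the template are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonKeyLookup {

    // Returns every string value stored under the given key, at any nesting
    // depth, in document order. Escaped quotes inside values are not handled.
    public static List<String> stringValues(String json, String key) {
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical values filled into the template shown above.
        String json = "{ \"kind\": \"youtube#searchResult\","
                + " \"id\": { \"kind\": \"youtube#video\", \"videoId\": \"abc123\" } }";
        System.out.println(stringValues(json, "kind"));
        // prints: [youtube#searchResult, youtube#video]
        System.out.println(stringValues(json, "videoId"));
        // prints: [abc123]
    }
}
```

Because the same key can appear at several nesting levels, as "kind" does here, the method returns a list rather than a single value.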

Data is acquired using techniques such as processing live streams, downloading compressed files, and screen scraping, in which the information on a web page is extracted. Web crawling is a technique in which a program examines a series of web pages, moving from one page to another and acquiring the data it needs.
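
The idea of scraping links from a page and crawling from one page to the next can be sketched as follows. To keep the example runnable without a network connection, the "web" is assumed to be an in-memory map from URL to HTML; the class name `TinyCrawler`, the three-page site, and the regular expression for `href` attributes are all assumptions made for this illustration. A real crawler would fetch pages over HTTP and use an HTML parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {

    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Screen scraping step: pull the link targets out of a page's HTML.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Crawling step: breadth-first visit of every page reachable from start.
    // Returns the pages that were found, in the order they were visited.
    public static List<String> crawl(Map<String, String> web, String start) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            String html = web.get(url);
            if (html == null) {
                continue; // the link points outside the known pages
            }
            visited.add(url);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // true only the first time a link is seen
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical three-page site.
        Map<String, String> web = Map.of(
                "/index", "<a href=\"/a\">A</a> <a href=\"/b\">B</a>",
                "/a", "<a href=\"/b\">B</a> <a href=\"/index\">Home</a>",
                "/b", "<a href=\"/missing\">Gone</a>");
        System.out.println(crawl(web, "/index"));
        // prints: [/index, /a, /b]
    }
}
```

The `seen` set keeps the crawler from revisiting a page, which matters as soon as pages link back to each other.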

Many popular media sites require a user ID and password before their data can be accessed. A commonly used technique is OAuth, an open standard for authorizing access on behalf of users across many different websites. It delegates access to a server resource and works over HTTPS. Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.
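
As a rough sketch of how delegated access begins in OAuth 2.0, the following code builds the URL that sends a user to a provider's consent page in the authorization code flow. The query parameter names (`response_type`, `client_id`, `redirect_uri`, `scope`, `state`) come from the OAuth 2.0 specification; the endpoint, client ID, redirect URI, scope, and class name `OAuthRequestUrl` are hypothetical values chosen for this illustration. Each provider documents its own endpoints and scopes. The sketch assumes Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class OAuthRequestUrl {

    // Builds the URL that sends a user to the provider's consent page.
    // Only the first step of the authorization code flow is covered here.
    public static String authorizationUrl(String endpoint, String clientId,
                                          String redirectUri, String scope,
                                          String state) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("response_type", "code");
        params.put("client_id", clientId);
        params.put("redirect_uri", redirectUri);
        params.put("scope", scope);
        params.put("state", state); // opaque value echoed back by the provider

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return endpoint + "?" + query;
    }

    public static void main(String[] args) {
        // Every argument below is a hypothetical placeholder.
        System.out.println(authorizationUrl(
                "https://auth.example.com/authorize",
                "my-client",
                "https://app.example.com/callback",
                "read",
                "xyz123"));
    }
}
```

After the user approves the request, the provider redirects to the redirect URI with a short-lived code, which the application then exchanges for an access token over HTTPS. That exchange is omitted here.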