You're reading from Java for Data Science Examine the techniques and Java tools supporting the growing field of data science

Product type Paperback

Published in Jan 2017

Publisher Packt

ISBN-13 9781785280115

Length 386 pages

Edition 1st Edition

Languages

Java

Tools

Deeplearning4j

Concepts

Data Science

Authors (2):

Jennifer L. Reese

Richard M. Reese

View More author details

Table of Contents (13) Chapters

Preface

1. Getting Started with Data Science FREE CHAPTER

2. Data Acquisition

3. Data Cleaning

4. Data Visualization

5. Statistical Data Analysis Techniques

6. Machine Learning

7. Neural Networks

8. Deep Learning

9. Text Analysis

10. Visual and Audio Analysis

11. Mathematical and Parallel Techniques for Data Analysis

12. Bringing It All Together

The importance and process of cleaning data

Once the data has been acquired, it will need to be cleaned. Frequently, the data will contain errors, duplicate entries, or be inconsistent. It often needs to be converted to a simpler data type such as text. Data cleaning is often referred to as data wrangling, reshaping, or munging. They are effectively synonyms.

When data is cleaned, there are several tasks that often need to be performed, including checking its validity, accuracy, completeness, consistency, and uniformity. For example, when the data is incomplete, it may be necessary to provide substitute values.

Consider CSV data. It can be handled in one of several ways. We can use simple Java techniques such as the String class' split method. In the following sequence, a string array, csvArray, is assumed to hold comma-delimited data. The split method populates a second array, tokenArray.

for(int i=0; i<csvArray.length; i++) { 
    tokenArray[i] = csvArray[i].split(","); 
}

More complex data types require APIs to retrieve the data. For example, in Chapter 3, Data Cleaning, we will use the Jackson Project (https://github.com/FasterXML/jackson) to retrieve fields from a JSON file. The example uses a file containing a JSON-formatted presentation of a person, as shown next:

{ 
 "firstname":"Smith",
 "lastname":"Peter", 
 "phone":8475552222,
 "address":["100 Main Street","Corpus","Oklahoma"] 
}

The code sequence that follows shows how to extract the values for fields of a person. A parser is created, which uses getCurrentName to retrieve a field name. If the name is firstname, then the getText method returns the value for that field. The other fields are handled in a similar manner.

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser( 
        new File("Person.json")); 
    while (parser.nextToken() != JsonToken.END_OBJECT) { 
        String token = parser.getCurrentName(); 
        if ("firstname".equals(token)) { 
            parser.nextToken(); 
            String fname = parser.getText(); 
            out.println("firstname : " + fname); 
        } 
        ... 
    } 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
}

The output of this example is as follows:

firstname : Smith

Simple data cleaning may involve converting the text to lowercase, replacing certain text with blanks, and removing multiple whitespace characters with a single blank. One way of doing this is shown next, where a combination of the String class' toLowerCase, replaceAll, and trim methods are used. Here, a string containing dirty text is processed:

dirtyText = dirtyText 
    .toLowerCase() 
    .replaceAll("[\\d[^\\w\\s]]+", "  
    .trim(); 
while(dirtyText.contains("  ")){ 
      dirtyText = dirtyText.replaceAll("  ", " "); 
}

Stop words are words such as the, and, or but that do not always contribute to the analysis of text. Removing these stop words can often improve the results and speed up the processing.

The LingPipe API can be used to remove stop words. In the next code sequence, a TokenizerFactory class instance is used to tokenize text. Tokenization is the process of returning individual words. The EnglishStopTokenizerFactory class is a special tokenizer that removes common English stop words.

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer( 
    text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
      out.print(word + " "); 
}

Consider the following text, which was pulled from the book, Moby Dick:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

The output will be as follows:

call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

These are just a couple of the data cleaning tasks discussed in Chapter 3, Data Cleaning.