Java Data Analysis: Data mining, big data analysis, NoSQL, and data visualization

John R. Hubbard
eBook Sep 2017 412 pages 1st Edition

Chapter 1. Introduction to Data Analysis

Data analysis is the process of organizing, cleaning, transforming, and modeling data to obtain useful information and, ultimately, new knowledge. The terms data analytics, business analytics, data mining, artificial intelligence, machine learning, knowledge discovery, and big data are also used to describe similar processes. The distinctions among these fields probably lie more in their areas of application than in their fundamental nature. Some argue that these are all part of the new discipline of data science.

The central process of gaining useful information from organized data is managed by the application of computer science algorithms. Consequently, these will be a central focus of this book.

Data analysis is both an old field and a new one. Its origins lie among the mathematical fields of numerical methods and statistical analysis, which reach back into the eighteenth century. But many of the methods that we shall study gained prominence much more recently, with the ubiquitous force of the internet and the consequent availability of massive datasets.

In this first chapter, we look at a few famous historical examples of data analysis. These can help us appreciate the importance of the science and its promise for the future.

Origins of data analysis

Data is as old as civilization itself, maybe even older. The 17,000-year-old paintings in the Lascaux caves in France could well have been attempts by those primitive dwellers to record their greatest hunting triumphs. Those records provide us with data about humanity in the Paleolithic era. That data was not analyzed, in the modern sense, to obtain new knowledge. But its existence does attest to the need humans have to preserve their ideas in data.

Five thousand years ago, the Sumerians of ancient Mesopotamia recorded far more important data on clay tablets. That cuneiform writing included substantial accounting data about daily business transactions. To apply that data, the Sumerians invented not only text writing, but also the first number system.

In 1086, King William the Conqueror ordered a massive collection of data to determine the extent of the lands and properties of the crown and of his subjects. This was called the Domesday Book, because it was a final tallying of people's (material) lives. That data was analyzed to determine ownership and tax obligations for centuries to follow.

The scientific method

On November 11, 1572, a young Danish nobleman named Tycho Brahe observed the supernova of a star that we now call SN 1572. From that time until his death 30 years later, he devoted his wealth and energies to the accumulation of astronomical data. His young German assistant, Johannes Kepler, spent 18 years analyzing that data before he finally formulated his three laws of planetary motion in 1618.

Figure 1 Kepler

Historians of science usually regard Kepler's achievement as the beginning of the Scientific Revolution. Here were the essential steps of the scientific method: observe nature, collect the data, analyze the data, formulate a theory, and then test that theory with more data. Note the central step here: data analysis.

Of course, Kepler did not have either of the modern tools that data analysts use today: algorithms and computers on which to implement them. He did, however, apply one technological breakthrough that surely facilitated his number crunching: logarithms. In 1620, he stated that Napier's invention of logarithms in 1614 had been essential to his discovery of the third law of planetary motion.

Kepler's achievements had a profound effect upon Galileo Galilei a generation later, and upon Isaac Newton a generation after him. Both men practiced the scientific method with spectacular success.
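To make the "analyze the data" step concrete, here is a minimal sketch of the kind of check that logarithms make easy. Kepler's third law says that the square of a planet's orbital period is proportional to the cube of its mean distance from the sun, so a plot of log(period) against log(distance) should be a straight line with slope 3/2. The class name, method name, and the rounded modern values below are assumptions chosen for illustration; they are not Brahe's measurements or Kepler's actual procedure.

```java
public class KeplerThirdLaw {
    // Rounded modern values (assumed for illustration; not Brahe's measurements):
    // semi-major axis in astronomical units and orbital period in Earth years.
    static final String[] PLANETS = {"Mercury", "Venus", "Earth", "Mars", "Jupiter"};
    static final double[] AXIS_AU = {0.387, 0.723, 1.000, 1.524, 5.203};
    static final double[] PERIOD_YEARS = {0.241, 0.615, 1.000, 1.881, 11.862};

    /** Least-squares slope of log10(period) regressed on log10(axis). */
    static double logLogSlope(double[] axis, double[] period) {
        int n = axis.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            double x = Math.log10(axis[i]);
            double y = Math.log10(period[i]);
            sumX += x;
            sumY += y;
            sumXY += x * y;
            sumXX += x * x;
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public static void main(String[] args) {
        // If the third law holds, T^2 / a^3 is (nearly) the same for every planet.
        for (int i = 0; i < PLANETS.length; i++) {
            double ratio = Math.pow(PERIOD_YEARS[i], 2) / Math.pow(AXIS_AU[i], 3);
            System.out.printf("%-8s T^2/a^3 = %.3f%n", PLANETS[i], ratio);
        }
        // The fitted slope should come out close to 1.5.
        System.out.printf("log-log slope = %.3f%n", logLogSlope(AXIS_AU, PERIOD_YEARS));
    }
}
```

With these units, the ratio for each planet should be close to 1 and the fitted slope close to 1.5, which is the pattern that the third law describes.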

Actuarial science

One of Newton's few friends was Edmund Halley, the man who first computed the orbit of his eponymous comet. Halley was a polymath, with expertise in astronomy, mathematics, physics, meteorology, geophysics, and cartography.

In 1693, Halley analyzed mortality data that had been compiled by Caspar Neumann in Breslau, Germany. Like Kepler's work with Brahe's data 90 years earlier, Halley's analysis led to new knowledge. His published results allowed the British government to sell life annuities at the appropriate price, based on the age of the annuitant.

Most data today is still numeric. But most of the algorithms we will be studying apply to a much broader range of possible values, including text, images, audio and video files, and even complete web pages on the internet.

Calculated by steam

In 1821, a young Cambridge student named Charles Babbage was poring over some trigonometric and logarithmic tables that had been recently computed by hand. When he realized how many errors they had, he exclaimed, "I wish to God these calculations had been executed by steam." He was suggesting that the tables could have been computed automatically by some mechanism that would be powered by a steam engine.

Babbage was a mathematician by vocation, holding the same Lucasian Chair of Mathematics at Cambridge University that Isaac Newton had held 150 years earlier and that Stephen Hawking would hold 150 years later. However, he spent a large part of his life working on automatic computing. Having invented the idea of a programmable computer, he is generally regarded as the first computer scientist. His assistant, Lady Ada Lovelace, has been recognized as the first computer programmer.

Babbage's goal was to build a machine that could analyze data to obtain useful information, the central step of data analysis. By automating that step, it could be carried out on much larger datasets and much more rapidly. His interest in trigonometric and logarithmic tables was related to his objective of improving methods of navigation, which was critical to the expanding British Empire.

A spectacular example

In 1854, cholera broke out among the poor in London. The epidemic spread quickly, partly because nobody knew the source of the problem. But a physician named John Snow suspected it was caused by contaminated water. At that time, most Londoners drew their water from public wells that were supplied directly from the River Thames. The following figure shows the map that Snow drew, with black rectangles indicating the frequencies of cholera occurrences:

Figure 3 Dr. Snow's Cholera Map

If you look closely, you can also see the locations of nine public water pumps, marked as black dots and labeled PUMP. From this data, we can easily see that the pump at the corner of Broad Street and Cambridge Street is in the middle of the epidemic. This data analysis led Snow to investigate the water supply at that pump, discovering that raw sewage was leaking into it through a break in the pipe.

By also locating the public pumps on the map, he demonstrated that the source was probably the pump at the corner of Broad Street and Cambridge Street. This was one of the first great examples of the successful application of data analysis to public health (for more information, see https://www1.udel.edu/johnmack/frec682/cholera/cholera2.html). President James K. Polk and composer Pyotr Ilyich Tchaikovsky were among the millions who died from cholera in the nineteenth century. But even today the disease is still a pandemic, killing around 100,000 per year world-wide.

Herman Hollerith

The decennial United States Census was mandated by the U.S. Constitution in 1789 for the purposes of apportioning representatives and taxes. The first census was taken in 1790 when the U.S. population was under four million. It simply counted free men. But by 1880, the country had grown to over 50 million, and the census itself had become much more complicated, recording dependents, parents, places of birth, property, and income.

Figure 4 Hollerith

The 1880 census took over eight years to compile. The United States Census Bureau realized that some sort of automation would be required to complete the 1890 census. They hired a young engineer named Herman Hollerith, who had proposed a system of electromechanical tabulating machines that would use punched cards to record the data.

This was the first successful application of automated data processing. It was a huge success. The total population of nearly 62 million was reported after only six weeks of tabulation.

Hollerith was awarded a Ph.D. from Columbia University for his achievement. In 1911, he founded the Computing-Tabulating-Recording Company, which became the International Business Machines Corporation (IBM) in 1924. Recently IBM built the supercomputer Watson, which was probably the most successful commercial application of data mining and artificial intelligence yet produced.

ENIAC

During World War II, the U.S. Navy had battleships with guns that could shoot 2700-pound projectiles 24 miles. At that range, a projectile spent almost 90 seconds in flight. In addition to the guns' elevation, angle of amplitude, and initial speed of propulsion, those trajectories were also affected by the motion of the ship, the weather conditions, and even the motion of the earth's rotation. Accurate calculations of those trajectories posed great problems.

To solve these computational problems, the U.S. Army contracted an engineering team at the University of Pennsylvania to build the Electronic Numerical Integrator and Computer (ENIAC), the first complete electronic programmable digital computer. Although not completed until after the war was over, it was a huge success.

It was also enormous, occupying a large room and requiring a staff of engineers and programmers to operate. The input and output data for the computer were recorded on Hollerith cards. These could be read automatically by other machines that could then print their contents.

ENIAC played an important role in the development of the hydrogen bomb. Instead of artillery tables, it was used to simulate the first test run for the project. That involved over a million cards.

Figure 5 ENIAC

VisiCalc

In 1979, Harvard student Dan Bricklin was watching his professor correct entries in a table of finance data on a chalkboard. After correcting a mistake in one entry, the professor proceeded to correct the corresponding marginal entries. Bricklin realized that such tedious work could be done much more easily and accurately on his new Apple II microcomputer. This resulted in his invention of VisiCalc, the first spreadsheet computer program for microcomputers. Many agree that that innovation transformed the microcomputer from a hobbyist's game platform to a serious business tool.

The consequence of Bricklin's VisiCalc was a paradigm shift in commercial computing. Spreadsheet calculations, an essential form of commercial data processing, had until then required very large and expensive mainframe computing centers. Now they could be done by a single person on a personal computer. When the IBM PC was released two years later, VisiCalc was regarded as essential software for business and accounting.

Data, information, and knowledge

The 1854 cholera epidemic case is a good example for understanding the differences between data, information, and knowledge. The data that Dr. Snow used, the locations of cholera outbreaks and water pumps, was already available. But the connection between them had not yet been discovered. By plotting both datasets on the same city map, he was able to determine that the pump at Broad Street and Cambridge Street was the source of the contamination. That connection was new information. That finally led to the new knowledge that the disease is transmitted by foul water, and thus the new knowledge on how to prevent the disease.
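The step from data to information can be sketched in a few lines of Java. The following is only an illustration of the idea of combining two datasets, not a reconstruction of Snow's method: it assigns each case to its nearest pump and counts the cases per pump. The class name, method names, and all coordinates are hypothetical values on an arbitrary grid; they are not Snow's actual data.

```java
import java.util.Arrays;

public class NearestPump {
    // Hypothetical pump and case coordinates on an arbitrary grid.
    // These are made-up values for illustration, NOT Snow's actual data.
    static final double[][] HYPOTHETICAL_PUMPS = {{0, 0}, {10, 0}, {5, 8}};
    static final double[][] HYPOTHETICAL_CASES = {
        {4.5, 7.0}, {5.5, 8.5}, {6.0, 7.5}, {5.0, 9.0},
        {1.0, 0.5}, {9.0, 1.0}, {4.0, 8.0}
    };

    /** Returns the index of the pump closest to the point (x, y). */
    static int nearestPump(double x, double y, double[][] pumps) {
        int best = 0;
        double bestDistSq = Double.MAX_VALUE;
        for (int i = 0; i < pumps.length; i++) {
            double dx = x - pumps[i][0];
            double dy = y - pumps[i][1];
            double distSq = dx * dx + dy * dy;  // squared distance suffices for comparison
            if (distSq < bestDistSq) {
                bestDistSq = distSq;
                best = i;
            }
        }
        return best;
    }

    /** Counts, for each pump, the cases for which it is the nearest pump. */
    static int[] casesPerPump(double[][] cases, double[][] pumps) {
        int[] counts = new int[pumps.length];
        for (double[] c : cases) {
            counts[nearestPump(c[0], c[1], pumps)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = casesPerPump(HYPOTHETICAL_CASES, HYPOTHETICAL_PUMPS);
        System.out.println(Arrays.toString(counts));
    }
}
```

Neither array says much on its own. The per-pump counts are the new information: a pump with a disproportionate share of nearby cases is the one worth investigating.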

Why Java?

Java is, as it has been for over a decade, the most popular programming language in the world. And its popularity is growing. There are several good reasons for this:

  • Java runs the same way on all computers
  • It supports the object-oriented programming (OOP) paradigm
  • It interfaces easily with other languages, including the database query language SQL
  • Its Javadoc documentation is easy to access and use
  • Much open-source software is written in Java, including many of the tools used for data analysis

Python may be easier to learn, R may be simpler to run, JavaScript may be easier for developing websites, and C/C++ may be faster, but for general purpose programming, Java can't be beat.

Java was developed in 1995 by a team led by James Gosling at Sun Microsystems. In 2010, the Oracle Corporation bought Sun for $7.4 billion and has supported Java since then. The current version is Java 8, released in 2014. But by the time you buy this book, Java 9 should be available; it is scheduled to be released in late 2017.

As the title of this book suggests, we will be using Java in all our examples.

Note

The Appendix includes instructions on how to set up your computer with Java.

Java Integrated Development Environments

To simplify Java software development, many programmers use an Integrated Development Environment (IDE). There are several good, free Java IDEs available for download. Among them are:

  • NetBeans
  • Eclipse
  • JDeveloper
  • JCreator
  • IntelliJ IDEA

These are quite similar in how they work, so once you have used one, it's easy to switch to another.

Although all the Java examples in this book can be run at the command line, we will instead show them running on NetBeans. This has several advantages, including:

  • Code listings include line numbers
  • Standard indentation rules are followed automatically
  • Code syntax is colored

Here is the standard Hello World program in NetBeans:

Listing 1 Hello World program
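The listing itself is an image that is not reproduced in this text. As a stand-in, here is a minimal sketch of what a standard Hello World program looks like in Java. The class name, the header comment, and the exact greeting string are assumptions, and the package designation is left out so that the file compiles on its own.

```java
/*  Hello World: a minimal Java program.
 *  The class name and greeting text are placeholders chosen for this sketch.
 */
public class HelloWorld {
    public static void main(String[] args) {
        // Print one line of text to standard output.
        System.out.println("Hello, World!");
    }
}
```

Saved as HelloWorld.java, this can be compiled with `javac HelloWorld.java` and run with `java HelloWorld` at the command line, or run directly from an IDE.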

When you run this program in NetBeans, you will see some of its syntax coloring: gray for comments, blue for reserved words, green for objects, and orange for strings.

In most cases, to save space, we will omit the header comments and the package designation from the listing displays, showing only the program, like this:

Listing 2 Hello World program abbreviated

Or, sometimes we'll show just the main() method, like this:

Listing 3 Hello World program abbreviated further

Nevertheless, all the complete source code files are available for download at the Packt Publishing website.

Here is the output from the Hello World program:

Figure 6 Output from the Hello World program

Note

The Appendix describes how to install and start using NetBeans.

Summary

The first part of this chapter described some important historical events that have led to the development of data analysis: ancient commercial record keeping, royal compilations of land and property, and accurate mathematical models in astronomy, physics, and navigation. It was this activity that led Babbage to invent the computer. Data analysis was born of necessity in the advance of civilization, from the identification of the source of cholera, to the management of economic data, and the modern processing of massive datasets.

This chapter also briefly explained our choice of the Java programming language for the implementation of the data analysis algorithms to be studied in this book. And finally, it introduced the NetBeans IDE, which we will also use throughout the book.

Key benefits

  • Get your basics right for data analysis with Java and make sense of your data through effective visualizations.
  • Use various Java APIs and tools such as RapidMiner and WEKA for effective data analysis and machine learning.
  • This is your companion to understanding and implementing a solid data analysis solution using Java.

Description

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the aim of discovering useful information. Java is one of the most popular languages for data analysis tasks. This book will help you learn the tools and techniques in Java to conduct data analysis without any hassle. After getting a quick overview of what data science is and the steps involved in the process, you'll learn statistical data analysis techniques and implement them using popular Java APIs and libraries. Through practical examples, you will also learn machine learning concepts such as classification and regression. In the process, you'll familiarize yourself with tools such as RapidMiner and WEKA and see how these Java-based tools can be used effectively for analysis. You will also learn how to analyze text and other types of multimedia, and how to work with relational, NoSQL, and time-series data. This book will also show you how you can utilize different Java-based libraries to create insightful and easy-to-understand plots and graphs. By the end of this book, you will have a solid understanding of the various data analysis techniques, and how to implement them using Java.

Who is this book for?

If you are a student, a Java developer, or a budding data scientist who wishes to learn the fundamentals of data analysis and how to perform data analysis with Java, this book is for you. Some familiarity with elementary statistics and relational databases will help you get the most out of this book, but is not mandatory. A firm understanding of Java is required.

What you will learn

  • Develop Java programs that analyze data sets of nearly any size, including text
  • Implement important machine learning algorithms such as regression, classification, and clustering
  • Interface with and apply standard open source Java libraries and APIs to analyze and visualize data
  • Process data from both relational and non-relational databases and from time-series data
  • Employ Java tools to visualize data in various forms
  • Understand multimedia data analysis algorithms and implement them in Java

Product Details

Publication date: Sep 19, 2017
Length: 412 pages
Edition: 1st
Language: English
ISBN-13: 9781787286405

Table of Contents

13 Chapters
1. Introduction to Data Analysis
2. Data Preprocessing
3. Data Visualization
4. Statistics
5. Relational Databases
6. Regression Analysis
7. Classification Analysis
8. Cluster Analysis
9. Recommender Systems
10. NoSQL Databases
11. Big Data Analysis with Java
A. Java Tools
Index
