Java Data Analysis

You're reading from Java Data Analysis: Data mining, big data analysis, NoSQL, and data visualization

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787285651

Length 412 pages

Edition 1st Edition

Languages

Java

Tools

RapidMiner

Concepts

Big Data

Author (1):

John R. Hubbard


Table of Contents (14) Chapters

Preface

1. Introduction to Data Analysis

2. Data Preprocessing

3. Data Visualization

4. Statistics

5. Relational Databases

6. Regression Analysis

7. Classification Analysis

8. Cluster Analysis

9. Recommender Systems

10. NoSQL Databases

11. Big Data Analysis with Java

A. Java Tools

Index

Apache Hadoop

Apache Hadoop is an open-source software system that allows for the distributed storage and processing of very large datasets. It implements the MapReduce framework.

The system includes these modules:

Hadoop Common: The common libraries and utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS™): A distributed filesystem that stores data on commodity machines, providing high-throughput access across the cluster
Hadoop YARN: A platform for job scheduling and cluster resource management
Hadoop MapReduce: An implementation of the Google MapReduce framework
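To make the MapReduce idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, counting words in a few lines of text. It is plain Java and does not use the Hadoop API; the class name WordCountSketch and its method names are hypothetical, chosen only for illustration. In a real Hadoop job, the framework would distribute these phases across the nodes of a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the MapReduce phases,
// not Hadoop's actual Mapper/Reducer API.
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: combine the values for one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in sequence on a small in-memory input.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog", "The fox");
        // prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(run(lines));
    }
}
```

Because each call to map depends on a single line and each call to reduce depends on a single key, both phases can run in parallel on different machines. That independence is what a system like Hadoop exploits.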

Hadoop grew out of ideas in Google's 2003 paper on the Google File System. Its developer, Doug Cutting, named it after his son's toy elephant. By 2006, its storage layer had become HDFS, the Hadoop Distributed File System.

In April 2006, using MapReduce, Hadoop set a record by sorting 1.8 TB of data, distributed across 188 nodes, in under 48 hours. Two years later, it set the world record by sorting one terabyte of data in 209 seconds...


Authors (1)

John R. Hubbard

John R. Hubbard has been doing computer-based data analysis for over 40 years at colleges and universities in Pennsylvania and Virginia. He holds an MSc in computer science from Penn State University and a PhD in mathematics from the University of Michigan. He is currently a professor of mathematics and computer science, Emeritus, at the University of Richmond, where he has been teaching data structures, database systems, numerical analysis, and big data. Dr. Hubbard has published many books and research papers, including six other books on computing. Some of these books have been translated into German, French, Chinese, and five other languages. He is also an amateur timpanist.
