What this book covers
Chapter 1, Obtaining and Cleaning Data, covers different ways to read and write data and to clean it of noise. It also familiarizes readers with different data file types, such as PDF, ASCII, CSV, TSV, XML, and JSON. The chapter also includes recipes for extracting web data.
Chapter 2, Indexing and Searching Data, covers how to index data for fast searching using Apache Lucene. The techniques described in this chapter can be seen as the basis for modern-day search techniques.
Chapter 3, Analyzing Data Statistically, covers the use of the Apache Commons Math API to collect and analyze statistics from data. The chapter also covers higher-level concepts such as statistical significance tests, a standard tool researchers use when comparing their results with benchmarks.
Chapter 4, Learning from Data - Part 1, covers basic classification, clustering, and feature selection exercises using the Weka machine learning workbench.
Chapter 5, Learning from Data - Part 2, is a follow-up chapter that covers data import and export, classification, and feature selection using another Java library, the Java Machine Learning (Java-ML) library. The chapter also covers basic classification with the Stanford Classifier and Massive Online Analysis (MOA).
Chapter 6, Retrieving Information from Text Data, covers the application of data science to text data for information retrieval. It uses core Java as well as popular libraries such as OpenNLP, Stanford CoreNLP, Mallet, and Weka to apply machine learning to information extraction and retrieval tasks.
Chapter 7, Handling Big Data, covers the use of big data platforms for machine learning, such as Apache Mahout and Spark MLlib.
Chapter 8, Learn Deeply from Data, covers the very basics of deep learning using the Deeplearning4j (DL4J) library. We cover the word2vec algorithm, belief networks, and autoencoders.
Chapter 9, Visualizing Data, covers the use of the GRAL package to generate appealing and informative displays of data. Of the package's many features, only the fundamental plot types have been selected.