Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Mastering Java for Data Science
Mastering Java for Data Science

Mastering Java for Data Science: Analytics and more for production-ready applications

eBook
$9.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Mastering Java for Data Science

Data Science Using Java

This book is about building data science applications using the Java language. In this book, we will cover all the aspects of implementing projects from data preparation to model deployment.

The readers of this book are assumed to have some previous exposure to Java and data science, and the book will help to take this knowledge to the next level. This means learning how to effectively tackle a specific data science problem and get the most out of the available data.

This is an introductory chapter where we will prepare the foundation for all the other chapters. Here we will cover the following topics:

  • What is machine learning and data science?
  • Cross Industry Standard Process for Data Mining (CRIPS-DM), a methodology for doing data science projects
  • Machine learning libraries in Java for medium and large-scale data science applications

By the end of this chapter, you will know how to approach a data science project and what Java libraries to use to do that.

Data science

Data science is the discipline of extracting actionable knowledge from data of various forms. The name data science emerged quite recently--it was invented by DJ Patil and Jeff Hammerbacher and popularized in the article Data Scientist: The Sexiest Job of the 21st Century in 2012. But the discipline itself had existed before for quite a while and previously was known by other names such as data mining or predictive analytics. Data science, like its predecessors, is built on statistics and machine learning algorithms for knowledge extraction and model building.

The science part of the term data science is no coincidence--if we look up science, its definition can be summarized to systematic organization of knowledge in terms testable explanations and predictions. This is exactly what data scientists do, by extracting patterns from available data, they can make predictions about future unseen data, and they make sure the predictions are validated beforehand. 

Nowadays, data science is used across many fields, including (but not limited to):

  • Banking: Risk management (for example, credit scoring), fraud detection, trading
  • Insurance: Claims management (for example, accelerating claim approval), risk and losses estimation, also fraud detection
  • Health care: Predicting diseases (such as strokes, diabetes, cancer) and relapses
  • Retail and e-commerce: Market basket analysis (identifying product that go well together), recommendation engines, product categorization, and personalized searches

This book covers the following practical use cases:

  • Predicting whether an URL is likely to appear on the first page of a search engine
  • Predicting how fast an operation will be completed given the hardware specifications
  • Ranking text documents for a search engine
  • Checking whether there is a cat or a dog on a picture
  • Recommending friends in a social network
  • Processing large-scale textual data on a cluster of computers

In all these cases, we will use data science to learn from data and use the learned knowledge to solve a particular business problem.

We will also use a running example throughout the book, building a search engine. We will use it to illustrate many data science concepts such as, supervised machine learning, dimensionality reduction, text mining, and learning to rank models. 

Machine learning

Machine learning is a part of computer science, and it is at the core of data science. The data itself, especially in big volumes, is hardly useful, but inside it hides highly valuable patterns. With the help of machine learning, we can recognize these hidden patterns, extract them, and then apply the learned information to the new unseen items. 

For example, given the image of an animal, a machine learning algorithm can say whether the picture is a dog or a cat; or, given the history of a bank client, it will say how likely the client is to default, that is, to fail to pay the debt.

Often, machine learning models are seen as black boxes that take in a data point and output a prediction for it. In this book, we will look at what is inside these black boxes and see how and when it is best to use them.

The typical problems that machine learning solves can be categorized in the following groups:

  • Supervised learning: For each data point, we have a label--extra information that describes the outcome that we want to learn. In the cats versus dogs case, the data point is an image of the animal; the label describes whether it's a dog or a cat.
  • Unsupervised learning: We only have raw data points and no label information is available. For example, we have a collection of e-mails and we would like to group them based on how similar they are. There is no explicit label associated with the e-mails, which makes this problem unsupervised.
  • Semi-supervised learning: Labels are given only for a part of the data.
  • Reinforcement learning: Instead of labels, we have a reward; something the model gets by interacting with the environment it runs in. Based on the reward, it can adapt and maximize it. For example, a model that learns how to play chess gets a positive reward each time it eats a figure of the opponent, and gets a negative reward each time it loses a figure; and the reward is proportional to the value of the figure. 

Supervised learning

As we discussed previously, for supervised learning we have some information attached to each data point, the label, and we can train a model to use it and to learn from it. For example, if we want to build a model that tells us whether there is a dog or a cat on a picture, then the picture is the data point and the information whether it is a dog or a cat is the label. Another example is predicting the price of a house--the description of a house is the data point, and the price is the label. 

We can group the algorithms of supervised learning into classification and regression algorithms based on the nature of this information.

In classification problems, the labels come from some fixed finite set of classes, such as {cat, dog}, {default, not default}, or {office, food, entertainment, home}. Depending on the number of classes, the classification problem can be binary (only two possible classes) or multi-class (several classes).

Examples of classification algorithms are Naive Bayes, logistic regression, perceptron, Support Vector Machine (SVM), and many others. We will discuss classification algorithms in more detail in the first part of Chapter 4, Supervised Learning - Classification and Regression.

In regression problems, the labels are real numbers. For example, a person can have a salary in the range from $0 per year to several billions per year. Hence, predicting the salary is a regression problem.

Examples of regression algorithms are linear regression, LASSO, Support Vector Regression (SVR), and others. These algorithms will be described in more detail in the second part of Chapter 4, Supervised Learning - Classification and Regression.

Some of the supervised learning methods are universal and can be applied to both classification and regression problems. For example, decision trees, random forest, and other tree-based methods can tackle both types. We will discuss one such algorithm, gradient boosting machines in Chapter 7, Extreme Gradient Boosting.

Neural networks can also deal with both classification and regression problems, and we will talk about them in Chapter 8Deep Learning with DeepLearning4J.

Unsupervised learning

Unsupervised learning covers the cases where we have no labels available, but still want to find some patterns hidden in the data. There are several types of unsupervised learning, and we will look into cluster analysis, or clustering and unsupervised dimensionality reduction. 

Clustering

Typically, when people talk about unsupervised learning, they talk about cluster analysis or clustering. A cluster analysis algorithm takes a set of data points and tries to categorize them into groups such that similar items belong to the same group, and different items do not. There are many ways where it can be used, for example, in customer segmentation or text categorization.

Customer segmentation is an example of clustering. Given some description of customers, we try to put them into groups such that the customers in one group have similar profiles and behave in a similar way. This information can be used to understand what do the people in these groups want, and this can be used to target them with better advertisements and other promotional messages.

Another example is text categorization. Given a collection of texts, we would like to find common topics among these texts and arrange the texts according to these topics. For example, given a set of complaints in an e-commerce store, we may want to put ones that talk about similar things together, and this should help the users of the system navigate through the complaints easier.

Examples of cluster analysis algorithms are hierarchical clustering, k-means, density-based spatial clustering of applications with noise (DBSCAN), and many others. We will talk about clustering in detail in the first part of Chapter 5, Unsupervised Learning - Clustering and Dimensionality Reduction.

Dimensionality reduction

Another group of unsupervised learning algorithms is dimensionality reduction algorithms. This group of algorithms compresses the dataset, keeping only the most useful information. If our dataset has too much information, it can be hard for a machine learning algorithm to use all of it at the same time. It may just take too long for the algorithm to process all the data and we would like to compress the data, so processing it takes less time. 

There are multiple algorithms that can reduce the dimensionality of the data, including Principal Component Analysis (PCA), Locally linear embedding, and t-SNE. All these algorithms are examples of unsupervised dimensionality reduction techniques.

Not all dimensionality reduction algorithms are unsupervised; some of them can use labels to reduce the dimensionality better. For example, many feature selection algorithms rely on labels to see what features are useful and what are not. 

We will talk more about this in Chapter 5Unsupervised Learning - Clustering and Dimensionality Reduction.

Natural Language Processing

Processing natural language texts is very complex, they are not very well structured and require a lot of cleaning and normalizing. Yet the amount of textual information around us is tremendous: a lot of text data is generated every minute, and it is very hard to retrieve useful information from them. Using data science and machine learning is very helpful for text problems as well; they allow us to find the right text, process it, and extract the valuable bits of information.

There are multiple ways we can use the text information. One example is information retrieval, or, simply, text search--given a user query and a collection of documents, we want to find what are the most relevant documents in the corpus with respect to the query, and present them to the user. Other applications include sentiment analysis--predicting whether a product review is positive, neutral or negative, or grouping the reviews according to how they talk about the products. 

We will talk more about information retrieval, Natural Language Processing (NLP) and working with texts in Chapter 6, Working with Text - Natural Language Processing and Information Retrieval. Additionally, we will see how to process large amounts of text data in Chapter 9Scaling Data Science.  

The methods we can use for machine learning and data science are very important. What is equally important is the the way we create them and then put them to use in production systems. Data science process models help us make it more organized and systematic, which is why we will talk about them next.

Data science process models

Applying data science is much more than just selecting a suitable machine learning algorithm and using it on the data. It is always good to keep in mind that machine learning is only a small part of the project; there are other parts such as understanding the problem, collecting the data, testing the solution and deploying to the production.

When working on any project, not just data science ones, it is beneficial to break it down into smaller manageable pieces and complete them one-by-one. For data science, there are best practices that describe how to do it the best way, and they are called process models. There are multiple models, including CRISP-DM and OSEMN.

In this chapter, CRISP-DM is explained as Obtain, Scrub, Explore, Model, and iNterpret (OSEMN), which is more suitable for data analysis tasks and addresses many important steps to a lesser extent.

CRISP-DM

Cross Industry Standard Process for Data Mining (CRISP-DM) is a process methodology for developing data mining applications. It was created before the term data science became popular, it's reliable and time-tested by several generations of analytics. These practices are still useful nowadays and describe the high-level steps of any analytical project quite well. 

The CRISP-DM methodology breaks down a project into the following steps:

  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Evaluation
  • Deployment

The methodology itself defines much more than just these steps, but typically knowing what the steps are and what happens at each step is enough for a successful data science project. Let's look at each of these steps separately.

The first step is Business Understanding. This step aims at learning what kinds of problems the business has and what they want to achieve by solving these problems. To be successful, a data science application must be useful for the business. The result of this step is the formulation of a problem which we want to solve and what is the desired outcome of the project.

The second step is Data Understanding. In this step, we try to find out what data can be used to solve the problem. We also need to find out if we already have the data; if not, we need to think how we can we get it. Depending on what data we find (or do not find), we may want to alter the original goal.

When the data is collected, we need to explore it. The process of reviewing the data is often called Exploratory Data Analysis and it is an integral part of any data science project. it helps to understand the processes that created the data, and can already suggest approaches for tackling the problem. The result of this step is the knowledge about which data sources are needed to solve the problem. We will talk more about this step in Chapter 3, Exploratory Data Analysis.

The third step of CRISP-DM is Data Preparation. For a dataset to be useful, it needs to be cleaned and transformed to a tabular form. The tabular form means that each row corresponds to exactly one observation. If our data is not in this shape, it cannot be used by most of the machine learning algorithms. Thus, we need to prepare the data such that it eventually can be converted to a matrix form and fed to a model.

Also, there could be different datasets that contain the needed information, and they may  not be homogenous. What this means is that we need to convert these datasets to some common format, which can be read by the model.

This step also includes Feature Engineering--the process of creating features that are most informative for the problem and describe the data in the best way.

Many data scientists say that they spend most of their time on this step when building Data Science applications. We will talk about this step in Chapter 2, Data Processing Toolbox and throughout the book.

The fourth step is Modeling. In this step, the data is already in the right shape and we feed it to different Machine Learning algorithms. This step also includes parameter tuning, feature selection, and selecting the best model.

Evaluation of the quality of the models from the machine learning point of view happens during this step. The most important thing to check is the ability to generalize, and this is typically done via cross validation. In this step, we also may want to go back to the previous step and do extra cleaning and feature engineering. The outcome is a model that is potentially useful for solving the problem defined in Step 1.

The fifth step is Evaluation. It includes evaluating the model from the business perspective--not from the machine learning perspective. This means that we need to perform a critical review of the results so far and plan the next steps. Does the model achieve what we want? Additionally, some of the findings may lead to reconsidering the initial question. After this step, we can go to the deployment step or re-iterate the process.

The, final, sixth step is Model Deployment. During this step, the produced model is added to the production, so the result is the model integrated to the live system. We will cover this step in Chapter 10Deploying Data Science Models.

Often, evaluation is hard because it is not always possible to say whether the model achieves the desired result or not. In these cases, the evaluation and deployment steps can be combined into one, the model is deployed and applied only to a part of users, and then the data for evaluating the model is collected. We will also briefly cover the ways of doing them, such as A/B testing and multi-armed bandits, in the last chapter of the book.

A running example

There will be many practical use cases throughout the book, sometimes a couple in each chapter. But we will also have a running example, building a search engine. This problem is interesting for a number of reasons:

  • It is fun
  • Business in almost any domain can benefit from a search engine
  • Many businesses already have text data; often it is not used effectively, and its use can be improved
  • Processing text requires a lot of effort, and it is useful to learn to do this effectively

We will try to keep it simple, yet, with this example, we will touch on all the technical parts of the data science process throughout the book:

  • Data Understanding: Which data can be useful for the problem? How can we obtain this data?
  • Data Preparation: Once the data is obtained, how can we process it? If it is HTML, how do we extract text from it? How do we extract individual sentences and words from the text?
  • Modeling: Ranking documents by their relevance with respect to a query is a data science problem and we will discuss how it can be approached.
  • Evaluation: The search engine can be tested to see if it is useful for solving the business problem or not.
  • Deployment: Finally, the engine can be deployed as a REST service or integrated directly to the live system.

We will obtain and prepare the data in Chapter 2Data Processing Toolbox, understand the data in Chapter 3Exploratory Data Analysis, build simple models and evaluate them in Chapter 4, Supervised Machine Learning - Classification and Regression, look at how to process text in Chapter 6Working with Text - Natural Language Processing and Information Retrieval, see how to apply it to millions of webpages in Chapter 9Scaling Data Science, and, finally, learn how we can deploy it in Chapter 10Deploying Data Science Models.

Data science in Java

In this book, we will use Java for doing data science projects. Java might not seem a good choice for data science at first glance, unlike Python or R, it has fewer data science and machine learning libraries, it is more verbose and lacks interactivity. On the other hand, it has a lot of upsides as follows:

  • Java is a statically typed language, which makes it easier to maintain the code base and harder to make silly mistakes--the compiler can detect some of them.
  • The standard library for data processing is very rich, and there are even richer external libraries.
  • Java code is typically faster than the code in scripting languages that are usually used for data science (such as R or Python).
  • Maven, the de-facto standard for dependency management in the Java world, makes it very easy to add new libraries to the project and avoid version conflicts.
  • Most of big data frameworks for scalable data processing are written in either Java or JVM languages, such as Apache Hadoop, Apache Spark, or Apache Flink.
  • Very often production systems are written in Java and building models in other languages adds unnecessary levels of complexity. Creating the models in Java makes it easier to integrate them to the product.

Next, we will look at the data science libraries available in Java.

Data science libraries

While there are not as many data science libraries in Java compared to R, there are quite a few. Additionally, it is often possible to use machine learning and data mining libraries written in other JVM languages, such as Scala, Groovy, or Clojure. Because these languages share the runtime environment, it makes it very easy to import libraries written in Scala and use them directly in Java code.

We can divide the libraries into the following categories:

  • Data processing libraries
  • Math and stats libraries
  • Machine learning and data mining libraries
  • Text processing libraries

Now we will see each of them in detail. 

Data processing libraries

The standard Java library is very rich and offers a lot of tools for data processing, such as collections, I/O tools, data streams, and means of parallel task execution. 

There are very powerful extensions to the standard library such as:

We will cover both the standard API for data processing and its extensions in Chapter 2Data Processing Toolbox. In this book, we will use Maven for including external libraries such as Google Guava or Apache Commons IO. It is a dependency management tool and allows to specify the external dependencies with a few lines of XML code. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:

<dependency> 
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>19.0</version>
</dependency>

When we do it, Maven will go to the Maven Central repository and download the dependency of the specified version. The best way to find the dependency snippets for pom.xml (such as the previous one) is to use the search at https://mvnrepository.com or your favorite search engine.

Java gives an easy way to access databases through Java Database Connectivity (JDBC)--a unified database access protocol. JDBC makes it possible to connect virtually any relational database that supports SQL, such as MySQL, MS SQL, Oracle, PostgreSQL, and many others. This allows moving the data manipulation from Java to the database side.

When it is not possible to use a database for handling tabular data, then we can use DataFrame libraries for doing it directly in Java. The DataFrame is a data structure that originally comes from R and it allows to easily manipulate textual data in the program, without resorting to external database.

For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group by some condition or join with another DataFrame. Additionally, some data frame libraries make it easy to convert tabular data to a matrix form so that the data can be used by machine learning algorithms. 

There are a few data frame libraries available in Java. Some of them are as follows:

We will also cover databases and data frames in Chapter 2, Data Processing Toolbox and we will use DataFrames throughout the book. 

There are more complex data processing libraries such as Spring Batch (http://projects.spring.io/spring-batch/). They allow creating complex data pipelines (called ETLs from Extract-Transform-Load) and manage their execution.

Additionally, there are libraries for distributed data processing such as:

We will talk about distributed data processing in Chapter 9Scaling Data Science.

Math and stats libraries

The math support in the standard Java library is quite limited, and only includes methods such as log for computing the logarithm, exp for computing the exponent and other basic methods.

There are external libraries with richer support of mathematics. For example:

Also, many machine learning libraries come with some extra math functionality, often linear algebra, stats, and optimization.

Machine learning and data mining libraries

There are quite a few machine learning and data mining libraries available for Java and other JVM languages. Some of them are as follows:

  • Weka (http://www.cs.waikato.ac.nz/ml/weka/) is probably the most famous data mining library in Java, contains a lot of algorithms and has many extensions.
  • JavaML (http://java-ml.sourceforge.net/) is quite an old and reliable ML library, but unfortunately not updated anymore
  • Smile (http://haifengl.github.io/smile/) is a promising ML library that is under active development at the moment and a lot of new methods are being added there.
  • JSAT (https://github.com/EdwardRaff/JSAT) contains quite an impressive list of machine learning algorithms.
  • H2O (http://www.h2o.ai/) is a framework for distributed ML written in Java, but is available for multiple languages, including Scala, R, and Python.
  • Apache Mahout (http://mahout.apache.org/) is used for in-core (one machine) and distributed machine learning. The Mahout Samsara framework allows writing the code in a framework-independent way and then executes it on Spark, Flink, or H2O.

There are several libraries that specialize solely on neural networks:

We will cover some of these libraries throughout the book.

Text processing

It is possible to do simple text processing using only the standard Java library with classes such as StringTokenizer, the java.text package, or the regular expressions.

In addition to that, there is a big variety of text processing frameworks available for Java as follows:

Most NLP libraries have very similar functionality and coverage of algorithms, which is why selecting which one to use is usually a matter of habit or taste. They all typically have tokenization, parsing, part-of-speech tagging, named entity recognition, and other algorithms for text processing. Some of them (such as StanfordNLP) support multiple languages, and some support only English.

We will cover some of these libraries in Chapter 6Working with Text - Natural Language Processing and Information Retrival.

Summary

In this chapter, we briefly discussed data science and what role machine learning plays in it. Then we talked about doing a data science project, and what methodologies are useful for it. We discussed one of them, CRISP-DM, the steps it defines, how these steps are related and the outcome of each step.

Finally, we spoke about why doing a data science project in Java is a good idea, it is statically compiled, it's fast, and often the existing production systems already run in Java. We also mentioned libraries and frameworks one can use to successfully accomplish a data science project using the Java language.

With this foundation, we will now go to the most important (and most time-consuming) step in a data science project--Data Preparation.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • An overview of modern Data Science and Machine Learning libraries available in Java
  • Coverage of a broad set of topics, going from the basics of Machine Learning to Deep Learning and Big Data frameworks.
  • Easy-to-follow illustrations and the running example of building a search engine.

Description

Java is the most popular programming language, according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises. Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software, and bring data science into production with less effort. This book will teach you how to create data science applications with Java. First, we will revise the most important things when starting a data science application, and then brush up the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and libraries with machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data. Finally, we finish the book by talking about the ways to deploy the model and evaluate it in production settings.

Who is this book for?

This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. Additionally, it will also be useful for data scientists who do not yet know Java but want or need to learn it. If you are willing to build efficient data science applications and bring them in the enterprise environment without changing the existing stack, this book is for you!

What you will learn

  • Get a solid understanding of the data processing toolbox available in Java
  • Explore the data science ecosystem available in Java
  • Find out how to approach different machine learning problems with Java
  • Process unstructured information such as natural language text
  • or images
  • Create your own search engine
  • Get state-of-the-art performance with XGBoost
  • Learn how to build deep neural networks with DeepLearning4j
  • Build applications that scale and process large amounts of data
  • Deploy data science models to production and evaluate their
  • performance
Estimated delivery fee Deliver to Chile

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Apr 27, 2017
Length: 364 pages
Edition : 1st
Language : English
ISBN-13 : 9781782174271
Category :
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Chile

Standard delivery 10 - 13 business days

$19.95

Premium delivery 3 - 6 business days

$40.95
(Includes tracking information)

Product Details

Publication date : Apr 27, 2017
Length: 364 pages
Edition : 1st
Language : English
ISBN-13 : 9781782174271
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 170.97
Java for Data Science
$54.99
Mastering Java Machine Learning
$60.99
Mastering Java for Data Science
$54.99
Total $ 170.97 Stars icon
Banner background image

Table of Contents

10 Chapters
Data Science Using Java Chevron down icon Chevron up icon
Data Processing Toolbox Chevron down icon Chevron up icon
Exploratory Data Analysis Chevron down icon Chevron up icon
Supervised Learning - Classification and Regression Chevron down icon Chevron up icon
Unsupervised Learning - Clustering and Dimensionality Reduction Chevron down icon Chevron up icon
Working with Text - Natural Language Processing and Information Retrieval Chevron down icon Chevron up icon
Extreme Gradient Boosting Chevron down icon Chevron up icon
Deep Learning with DeepLearning4J Chevron down icon Chevron up icon
Scaling Data Science Chevron down icon Chevron up icon
Deploying Data Science Models Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Full star icon 5
(1 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Luca Massaron May 07, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Disclaimer: I am one of the technical reviewers of this book and an author of other data science and machine learning books.I strongly recommend reading this book in order to have a complete well-rounded overview of what it means doing data science in a company environment today. Commonly we think of data science as just the series of data preparation, experimentation and model development that can lead to a PoC, a Proof of Concept. Clearly, such can be easily done in Python or R in a successful way, but to discover soon after that the project has to be completely rewritten and rethought, if it has to go beyond the PoC status and become an integrated part of the software of company that sustains its competitive advantage in the market.Alexey clearly describes what Java can do for your data science projects, allowing you to go into production with your data science projects leveraging an existing Java ecosystem, which is common to so many small and large companies around the World. Most IT people consider Java as solid, mature, trusted, and reliable and it is (most likely) already been applied to existing ETL processing on the data you are using for your projects. Java can really be the key allowing your data science projects to go smoothly and successfully into production.Moreover, Alexey illustrates in the book many original interesting examples, uncommon to many other books around (no, really no Iris dataset here) and it presents so many fabulous tricks he clearly derived from his many years of practice in data science, that the book is a gem in itself. I found myself learning something new each chapter I read :-)
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela