What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Apache Spark for Data Science Cookbook

Chapter 2. Tricky Statistics with Spark

In this chapter, you will learn the following recipes:

Working with Pandas
Variable identification
Sampling data
Summary and descriptive statistics
Generating frequency tables
Installing Pandas on Linux
Installing Pandas from source
Using IPython with PySpark
Creating Pandas DataFrames over Spark
Splitting, slicing, sorting, filtering and grouping DataFrames over Spark
Implementing covariance and correlation using DataFrames over Spark
Concatenating and merging operations over DataFrames
Complex operations over DataFrames
Sparkling Pandas

Key benefits

Use Apache Spark for data processing with these hands-on recipes

Implement end-to-end, large-scale data analysis better than ever before

Work with powerful libraries such as MLlib, SciPy, NumPy, and Pandas to gain insights from your data

Description

Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark’s selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. It lets you tackle the complexities that come with raw, unstructured data sets with ease. This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to problematic concepts in data science using Spark’s data science libraries such as MLlib, Pandas, NumPy, SciPy, and more. These simple and efficient recipes will show you how to implement algorithms and optimize your work.

Who is this book for?

This book is for novice and intermediate level data science professionals and data analysts who want to solve data science problems with a distributed computing framework. Basic experience with data science implementation tasks is expected. Data science professionals looking to skill up and gain an edge in the field will find this book helpful.

What you will learn

Explore the topics of data mining, text mining, Natural Language Processing, information retrieval, and machine learning.

Solve real-world analytical problems with large data sets.

Address data science challenges with analytical tools on a distributed system such as Spark, which is well suited to iterative algorithms and offers in-memory processing and greater flexibility for data analysis at scale.

Get hands-on experience with algorithms like classification, regression, and recommendation on real datasets using the Spark MLlib package.

Learn about numerical and scientific computing using NumPy and SciPy on Spark.

Use Predictive Model Markup Language (PMML) in Spark for statistical data mining models.

pavan kumar jalla Sep 10, 2019

As a big data engineer with 3 years in the industry, I was looking around for a solid hands-on book on data science. This book has great content and is well structured from beginning to end, and it takes you on a deep dive into data science concepts. I appreciate the author for sharing her knowledge. I would recommend it to anyone who is looking for a practical approach to data science.

Amazon Verified review

Brandon Jan 23, 2017

This book is a useful resource for learning the Spark programming model and how to employ it in several tasks. The approach followed is very practical, with code provided in every chapter, which guarantees a fast learning process. As the technical reviewer of this book, I would suggest it to people who want to understand how to perform data exploration, analysis and visualization tasks in Spark. With the many use cases covered in the book, it will serve as a resource to inspire solutions for daily working tasks.

Dimitri Shvorob Jun 01, 2017

I would dismiss a five-star review by the book's technical reviewer - conflict of interest, anyone? - and "Apache Spark for Data Science Cookbook" is not a five-star book. It is, however, a decent book which compensates for the Packt-standard weakness of explanations with a thoughtful collection of (Scala) code, paying attention to the less glamorous but essential job of data manipulation. And yet, I hesitate to recommend it, and feel that a combo of "Machine Learning with Spark" by Pentreath and "Spark for Data Science" by Duvvuri and Singhal would be a better choice. I would suggest getting all three and deciding which one(s) to leave.

Santanu Feb 25, 2017

This book does not improve your Spark knowledge. It is only a bunch of code with input and output. No proper comments on the code.

Apache Spark for Data Science Cookbook: Solve real-world analytical problems
