You're reading from Mastering Apache Spark 2.x Advanced techniques in complex Big Data processing, streaming analytics and machine learning

Product type Paperback

Published in Jul 2017

Publisher Packt

ISBN-13 9781786462749

Length 354 pages

Edition 2nd Edition

Languages

Scala

Tools

Apache Spark

Concepts

Big Data

Author (1):

Romeo Kienzler

Table of Contents (15) Chapters

Preface

1. A First Taste and What’s New in Apache Spark V2

2. Apache Spark SQL

3. The Catalyst Optimizer

4. Project Tungsten

5. Apache Spark Streaming

6. Structured Streaming

7. Apache Spark MLlib

8. Apache SparkML

9. Apache SystemML

10. Deep Learning on Apache Spark with DeepLearning4j and H2O

11. Apache Spark GraphX

12. Apache Spark GraphFrames

13. Apache Spark with Jupyter Notebooks on IBM DataScience Experience

14. Apache Spark on Kubernetes

DataFrames

We have already used DataFrames in previous examples. A DataFrame is based on a columnar format, and temporary tables can be created from it; we will expand on this in the next section. Many methods are available on a DataFrame for data manipulation and processing.

Let's start with a simple example and load some JSON data coming from an IoT sensor on a washing machine. We are again using the Apache Spark DataSource API under the hood to read and parse the JSON data. The result of the parser is a DataFrame, and it is possible to display its schema.
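The load-and-inspect step can be sketched as follows. This is a minimal illustration, not the book's original listing: the sample records, their field names (`id`, `key`, `value.rev`, `ts`, `temperature`), the temporary file, and the object name `ReadJsonSketch` are all assumptions made so the example is self-contained. Only `spark.read.json` and `printSchema` are standard Spark calls.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonSketch {
  // Hypothetical sample in JSON Lines form (one object per line).
  // The sensor payload sits under `doc`; the other fields stand in
  // for metadata that a document store might add.
  val sampleJson: String =
    """{"id":"a1","key":"a1","value":{"rev":"1-x"},"doc":{"ts":1,"temperature":40.5}}
      |{"id":"a2","key":"a2","value":{"rev":"1-y"},"doc":{"ts":2,"temperature":41.0}}
      |""".stripMargin

  /** Writes the sample to a temporary file and returns its path. */
  def writeSample(): Path = {
    val path = Files.createTempFile("washing", ".json")
    Files.write(path, sampleJson.getBytes(StandardCharsets.UTF_8))
    path
  }

  /** Reads a JSON Lines file; Spark infers the schema from the data. */
  def load(spark: SparkSession, path: Path): DataFrame =
    spark.read.json(path.toUri.toString)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-json-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    val washing = load(spark, writeSample())
    washing.printSchema() // prints the inferred, nested schema as a tree

    spark.stop()
  }
}
```

With real data, the temporary file would be replaced by the path to the exported sensor JSON.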

As you can see, this is a nested data structure. The doc field contains all the information that we are interested in, and we want to get rid of the meta information that Cloudant/Apache CouchDB added to the original JSON file. This can be accomplished by a call to the select method on the DataFrame...
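One way such a `select` call might look is sketched below. The source text is truncated at this point, so this is not necessarily the author's exact call: it assumes the `"doc.*"` star expansion, which promotes every field nested under `doc` to a top-level column and drops everything else. The case classes `Doc` and `Record`, their fields, and the helper `flattenDoc` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SelectDocSketch {
  // Hypothetical record shape: the sensor payload is nested under `doc`,
  // next to metadata fields (`id`, `key`) that we want to discard.
  case class Doc(ts: Long, temperature: Double, flowrate: Int)
  case class Record(id: String, key: String, doc: Doc)

  /** Keeps only the fields nested under `doc`, as top-level columns. */
  def flattenDoc(df: DataFrame): DataFrame = df.select("doc.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("select-doc-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    val nested = Seq(
      Record("a1", "a1", Doc(1L, 40.5, 11)),
      Record("a2", "a2", Doc(2L, 41.0, 12))
    ).toDF()

    val flat = flattenDoc(nested)
    flat.printSchema() // the schema no longer contains id, key, or doc
    flat.show()

    spark.stop()
  }
}
```

Selecting individual nested fields by dotted name, for example `df.select("doc.ts", "doc.temperature")`, is an alternative when only some of the payload is needed.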

Authors (1)

Romeo Kienzler

Romeo Kienzler works as the chief data scientist in the IBM Watson IoT worldwide team, helping clients to apply advanced machine learning at scale on their IoT sensor data. He holds a Master's degree in computer science from the Swiss Federal Institute of Technology, Zurich, with a specialization in information systems, bioinformatics, and applied statistics.