Apache Spark 2.x for Java Developers

Explore big data at scale using Apache Spark 2.x Java APIs

Product type Paperback
Published in Jul 2017
Publisher Packt
ISBN-13 9781787126497
Length 350 pages
Edition 1st Edition
Authors (2):
Sourav Gulati
Sumit Kumar

Table of Contents (12 chapters)

Preface
1. Introduction to Spark
2. Revisiting Java
3. Let Us Spark
4. Understanding the Spark Programming Model
5. Working with Data and Storage
6. Spark on Cluster
7. Spark Programming Model - Advanced
8. Working with Spark SQL
9. Near Real-Time Processing with Spark Streaming
10. Machine Learning Analytics with Spark MLlib
11. Learning Spark GraphX

What's new in Spark 2.X?

  • Unified DataFrame and Dataset: Spark 2.X unifies the two APIs. A DataFrame is now simply a Dataset of Row objects (Dataset<Row> in Java), that is, a Dataset whose columns carry no compile-time type information.
  • SparkSession: Prior to Spark 2.X, different kinds of Spark jobs had different entry points; for Spark SQL it was SQLContext, and if Hive features were also required, HiveContext. Spark 2.X removes this ambiguity by providing a single entry point called SparkSession. Note, however, that the module-specific entry points are still available for backward compatibility, although HiveContext is marked as deprecated in favor of a SparkSession with Hive support enabled.
  • Catalog API: Spark 2.X introduces the Catalog API for accessing metadata in Spark SQL. It can be seen as a parallel to HCatalog in Hive. It is a significant step toward unifying the metadata structure around Spark SQL, so that the very same metadata can be exposed to non-Spark SQL applications. It also helps in debugging temporary tables registered in a Spark SQL session. Because the Catalog API is accessed through SparkSession, the metadata that was previously split between SQLContext and HiveContext is now available in one place.
  • Structured streaming: Structured Streaming makes Spark SQL available to streaming jobs by running a Spark SQL query continuously and updating the aggregated result as new data arrives on a streaming Dataset. DataFrame and Dataset operations, including windowed aggregations, are available in Structured Streaming.
  • Whole-stage code generation: The code generation engine has been reworked to generate faster code for a whole query stage at once, for example by avoiding virtual function dispatches and by keeping intermediate data in CPU registers instead of writing it to memory.
  • Accumulator API: A new, simpler, and more performant Accumulator API (AccumulatorV2) has been added in Spark 2.X, and the older API has been deprecated.
  • A native SQL parser that supports both ANSI SQL and Hive SQL has been introduced in Spark 2.X.
  • Hive-style bucketing support has been added to Spark SQL.
  • Subquery support has been added to Spark SQL, including predicate forms such as IN, NOT IN, and EXISTS.
  • A native CSV data source, based on the Databricks spark-csv implementation, has been incorporated into Spark.
  • The DataFrame-based spark.ml package is now the primary machine learning API. The RDD-based spark.mllib package is kept in maintenance mode, with the objective of deprecating it once spark.ml has matured enough in features to replace it.
  • Machine learning pipelines and models can now be persisted across all languages supported by Spark.
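
The SparkSession, Dataset<Row>, and Catalog API items above can be tried together in a few lines. What follows is a minimal sketch rather than code from the book: it assumes Spark 2.x SQL is on the classpath, and the class name SparkSessionSketch, the application name, the local[*] master, and the people view are all illustrative choices.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class SparkSessionSketch {

    // Single entry point in Spark 2.x. The app name and local master are
    // assumptions made for this sketch, not values from the book.
    static SparkSession localSession() {
        return SparkSession.builder()
                .appName("spark2-whats-new-sketch")
                .master("local[*]")
                .getOrCreate();
    }

    // Registers a tiny DataFrame as a temporary view and returns the table
    // names that the Catalog API reports for the current database.
    static List<String> registerAndListTables(SparkSession spark) {
        // A typed Dataset<String> built from an in-memory list.
        Dataset<String> names = spark.createDataset(
                Arrays.asList("alice", "bob", "carol"), Encoders.STRING());

        // In Java, a DataFrame is just Dataset<Row>.
        Dataset<Row> people = names.toDF("name");
        people.createOrReplaceTempView("people");

        // Catalog metadata is itself exposed as a Dataset.
        return spark.catalog().listTables()
                .select("name")
                .as(Encoders.STRING())
                .collectAsList();
    }

    // Runs a SQL query against the temporary view through the same session.
    static long countNamesStartingWithA(SparkSession spark) {
        return spark.sql("SELECT name FROM people WHERE name LIKE 'a%'").count();
    }

    public static void main(String[] args) {
        SparkSession spark = localSession();
        try {
            System.out.println(registerAndListTables(spark));
            System.out.println(countNamesStartingWithA(spark));
        } finally {
            spark.stop();
        }
    }
}
```

The same session object serves the Dataset API, SQL queries, and catalog lookups, which is the practical effect of replacing the separate context classes with one entry point.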
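
The idea behind whole-stage code generation, mentioned in the list above, can be illustrated without Spark at all. The sketch below is a plain-Java analogy and not Spark's generated code: it evaluates a toy query (keep values greater than 10, double them, sum the result) once through a chain of operator objects, where every row crosses several virtual calls, and once as a single fused loop of the kind a code generator would aim to produce. All class and method names here are hypothetical.

```java
class WholeStageSketch {

    // Iterator-style operator: each row is pulled through virtual calls.
    interface Operator {
        boolean advance();   // moves to the next row; false when exhausted
        long current();      // value of the current row
    }

    // Leaf operator that scans an in-memory array.
    static final class Scan implements Operator {
        private final long[] data;
        private int pos = -1;
        Scan(long[] data) { this.data = data; }
        public boolean advance() { return ++pos < data.length; }
        public long current() { return data[pos]; }
    }

    // Keeps only rows whose value is greater than the threshold.
    static final class Filter implements Operator {
        private final Operator child;
        private final long threshold;
        Filter(Operator child, long threshold) {
            this.child = child;
            this.threshold = threshold;
        }
        public boolean advance() {
            while (child.advance()) {
                if (child.current() > threshold) {
                    return true;
                }
            }
            return false;
        }
        public long current() { return child.current(); }
    }

    // Doubles each row's value.
    static final class Project implements Operator {
        private final Operator child;
        Project(Operator child) { this.child = child; }
        public boolean advance() { return child.advance(); }
        public long current() { return child.current() * 2; }
    }

    // Operator-at-a-time evaluation: several interface calls per row.
    static long sumInterpreted(long[] data) {
        Operator plan = new Project(new Filter(new Scan(data), 10));
        long sum = 0;
        while (plan.advance()) {
            sum += plan.current();
        }
        return sum;
    }

    // Fused evaluation: the whole pipeline collapsed into one loop, with the
    // intermediate value held in a local variable.
    static long sumFused(long[] data) {
        long sum = 0;
        for (long x : data) {
            if (x > 10) {
                sum += x * 2;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = {5, 11, 20, 10};
        System.out.println(sumInterpreted(data));
        System.out.println(sumFused(data));
    }
}
```

Both methods compute the same answer; the difference is only in how much per-row overhead sits between the operators, which is the overhead whole-stage code generation targets.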