What this book covers
Chapter 1, Big Data and Data Science – An Introduction, briefly discusses the various challenges in big data analytics and how Apache Spark addresses them on a single platform. It also explains how data analytics has evolved into its present form and gives a basic overview of the Spark stack.
Chapter 2, The Spark Programming Model, discusses the design considerations of Apache Spark and the programming languages it supports. It also explains the Spark core components and covers in detail the RDD API, which is the basic building block of Spark.
Chapter 3, Introduction to DataFrames, explains DataFrames, which are the most convenient and useful component for data scientists to work with. It explains Spark SQL and the Catalyst optimizer that powers DataFrames. It also demonstrates various DataFrame operations with code examples.
Chapter 4, Unified Data Access, discusses the various ways to source data from different sources, consolidate it, and work with it in a unified way. It covers the streaming aspect of collecting real-time data and operating on it. It also discusses the under-the-hood fundamentals of these APIs.
Chapter 5, Data Analysis on Spark, discusses the complete data analytics lifecycle. With ample code examples, it explains how to source data from different sources, prepare the data using data cleaning and transformation techniques, and apply descriptive and inferential statistics to uncover hidden insights in the data.
Chapter 6, Machine Learning, explains various machine learning algorithms, how they are implemented in the MLlib library, and how they can be used with the pipeline API for streamlined execution. It covers the fundamentals of every algorithm it discusses, so it can serve as a one-stop reference.
Chapter 7, Extending Spark with SparkR, is primarily intended for R programmers who want to leverage Spark for data analytics. It explains how to program with SparkR and how to use the machine learning algorithms of R libraries.
Chapter 8, Analyzing Unstructured Data, focuses exclusively on unstructured data analysis. It explains how to source unstructured data, process it, and perform machine learning on it. It also covers some of the dimensionality reduction techniques that were not covered in the “Machine Learning” chapter.
Chapter 9, Visualizing Big Data, teaches readers various visualization techniques that are supported on Spark. It explains the different visualization requirements of data engineers, data scientists, and business users, and suggests the right kinds of tools and techniques for each. It also discusses leveraging the IPython/Jupyter notebook and Zeppelin, an Apache project for data visualization.
Chapter 10, Putting It All Together, brings together the data analytics components that the earlier chapters discussed separately. It stitches together the various steps of a typical data science project and demonstrates a step-by-step approach to executing a full-blown analytics project.
Chapter 11, Building Data Science Applications, builds on the earlier chapters, which mostly discussed the data science components along with a full-blown execution example. It provides a heads-up on how to build data products that can be deployed in production. It also gives an idea of the current development status of the Apache Spark project and what is in store for it.