Essential PySpark for Scalable Data Analytics
A beginner's guide to harnessing the power and ease of PySpark 3

Product type: Paperback
Published in: Oct 2021
Publisher: Packt
ISBN-13: 9781800568877
Length: 322 pages
Edition: 1st Edition
Author: Sreeram Nudurupati
Table of Contents

Preface
Section 1: Data Engineering
  Chapter 1: Distributed Computing Primer
  Chapter 2: Data Ingestion
  Chapter 3: Data Cleansing and Integration
  Chapter 4: Real-Time Data Analytics
Section 2: Data Science
  Chapter 5: Scalable Machine Learning with PySpark
  Chapter 6: Feature Engineering – Extraction, Transformation, and Selection
  Chapter 7: Supervised Machine Learning
  Chapter 8: Unsupervised Machine Learning
  Chapter 9: Machine Learning Life Cycle Management
  Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark
Section 3: Data Analysis
  Chapter 11: Data Visualization with PySpark
  Chapter 12: Spark SQL Primer
  Chapter 13: Integrating External Tools with Spark SQL
  Chapter 14: The Data Lakehouse
Other Books You May Enjoy

Summary

In this chapter, you learned about the concept of Distributed Computing. You discovered why Distributed Computing has become so important: the amount of data being generated is growing rapidly, and it is neither practical nor feasible to process all of your data using a single specialized system.

You then learned about the concept of Data Parallel Processing and reviewed a practical example of its implementation by means of the MapReduce paradigm.
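To make the pattern concrete, here is a minimal sketch of the map-and-reduce idea in plain Python, independent of any cluster framework; the word-count data and function choices are illustrative only, not taken from the chapter's code.

from functools import reduce
from collections import Counter

# Illustrative word count using the MapReduce pattern on a single machine.
documents = [
    "spark makes big data simple",
    "big data needs distributed computing",
]

# Map phase: each document is mapped independently to partial word counts.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce phase: partial results are merged pairwise into the final result.
word_counts = reduce(lambda left, right: left + right, mapped)

print(word_counts.most_common(3))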

Then, you were introduced to Apache Spark, an in-memory, unified analytics engine, and learned how fast and efficient it is for data processing. You also learned how intuitive and easy it is to get started with when developing data processing applications. You then explored the architecture and components of Apache Spark and how they come together as a framework.
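As a rough sketch of how an application attaches to this architecture, the snippet below creates a SparkSession, the unified entry point that the driver program uses to coordinate work on the cluster; the application name and the local master URL are placeholder values, assumed here for illustration.

from pyspark.sql import SparkSession

# The driver program creates a SparkSession, Spark's unified entry point.
# "local[*]" runs Spark locally on all available cores; on a real cluster
# the master URL would point at a cluster manager such as YARN or Kubernetes.
spark = (
    SparkSession.builder
    .appName("distributed-computing-primer")   # placeholder application name
    .master("local[*]")
    .getOrCreate()
)

print(spark.version)
spark.stop()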

Next, you came to understand RDDs, which are the core abstraction of Apache Spark, how they store data on a cluster of machines in a distributed manner, and how you can leverage higher-order functions along with lambda functions to implement Data Parallel Processing via RDDs.
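For example, a classic word count expressed with RDDs looks roughly like the following; the sample data is made up, but flatMap, map, and reduceByKey are the standard higher-order RDD functions that accept lambda functions and run them in parallel across partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-word-count").getOrCreate()
sc = spark.sparkContext

# A tiny in-memory dataset, distributed across the cluster as an RDD.
lines = sc.parallelize([
    "spark makes big data simple",
    "big data needs distributed computing",
])

# Higher-order functions take lambda functions and apply them to every
# element of the RDD in parallel, partition by partition.
word_counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

print(word_counts.collect())
spark.stop()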

You also learned about the Spark SQL engine component of Apache Spark, how it provides a higher level of abstraction than RDDs, and how it comes with several built-in functions that you might already be familiar with. You learned to leverage the DataFrame DSL to implement your data processing business logic in an easier, more familiar way. You also learned about Spark's SQL API, how it is ANSI SQL standards-compliant, and how it allows you to seamlessly and efficiently perform SQL analytics on large amounts of data.
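The following sketch contrasts the two styles on a small, made-up dataset: the same aggregation is expressed once with the DataFrame DSL and once through the SQL API against a temporary view; the column names and values are purely illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-vs-sql").getOrCreate()

# A small, made-up sales dataset created inline for illustration.
sales = spark.createDataFrame(
    [("US", 120.0), ("US", 80.0), ("IN", 60.0)],
    ["country", "amount"],
)

# DataFrame DSL: relational operations expressed as method calls.
sales.groupBy("country").agg(F.sum("amount").alias("total")).show()

# SQL API: the same logic expressed as ANSI-compliant SQL on a temporary view.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()

spark.stop()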

You were also introduced to some of the prominent improvements in Apache Spark 3.0, such as adaptive query execution and dynamic partition pruning, which make Spark 3.0 considerably faster than its predecessors.
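Both features are controlled by standard Spark SQL configuration properties; the sketch below simply shows how they can be switched on explicitly when building a session (they are already enabled by default in recent Spark 3.x releases).

from pyspark.sql import SparkSession

# Explicitly enable adaptive query execution and dynamic partition pruning.
# Both are on by default in recent Spark 3.x releases; shown here for clarity.
spark = (
    SparkSession.builder
    .appName("spark3-improvements")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)

print(spark.conf.get("spark.sql.adaptive.enabled"))
spark.stop()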

Now that you have learned the basics of big data processing with Apache Spark, you are ready to embark on a data analytics journey using Spark. A typical data analytics journey starts with acquiring raw data from various source systems, ingesting it into a historical storage component such as a data warehouse or a data lake, and then cleansing, integrating, and transforming the raw data to arrive at a single source of truth. Finally, you can gain actionable business insights from the clean, integrated data by leveraging descriptive and predictive analytics. We will cover each of these aspects in the subsequent chapters of this book, starting with the process of data ingestion and data cleansing in the following chapters.
