Essential PySpark for Scalable Data Analytics
A beginner's guide to harnessing the power and ease of PySpark 3

Product type: Paperback
Published in: Oct 2021
Publisher: Packt
ISBN-13: 9781800568877
Length: 322 pages
Edition: 1st Edition
Author: Sreeram Nudurupati
Table of Contents

Preface
Section 1: Data Engineering
  Chapter 1: Distributed Computing Primer
  Chapter 2: Data Ingestion
  Chapter 3: Data Cleansing and Integration
  Chapter 4: Real-Time Data Analytics
Section 2: Data Science
  Chapter 5: Scalable Machine Learning with PySpark
  Chapter 6: Feature Engineering – Extraction, Transformation, and Selection
  Chapter 7: Supervised Machine Learning
  Chapter 8: Unsupervised Machine Learning
  Chapter 9: Machine Learning Life Cycle Management
  Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark
Section 3: Data Analysis
  Chapter 11: Data Visualization with PySpark
  Chapter 12: Spark SQL Primer
  Chapter 13: Integrating External Tools with Spark SQL
  Chapter 14: The Data Lakehouse
Other Books You May Enjoy

Summary

In this chapter, you learned about the concept of Distributed Computing. You discovered why Distributed Computing has become so important: the amount of data being generated is growing rapidly, and it is neither practical nor feasible to process all of your data using a single specialized system.

You then learned about the concept of Data Parallel Processing and reviewed a practical example of its implementation by means of the MapReduce paradigm.
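To make the pattern concrete, here is a minimal sketch of the map-and-reduce idea in plain Python, independent of any cluster framework; the word-count data and function choices are illustrative only, not taken from the chapter's code.

from functools import reduce
from collections import Counter

# Illustrative word count using the MapReduce pattern on a single machine.
documents = [
    "spark makes big data simple",
    "big data needs distributed computing",
]

# Map phase: each document is mapped independently to partial word counts.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce phase: partial results are merged pairwise into the final result.
word_counts = reduce(lambda left, right: left + right, mapped)

print(word_counts.most_common(3))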

Then, you were introduced to Apache Spark, an in-memory, unified analytics engine, and learned how fast and efficient it is for data processing. You also learned how intuitive and easy it is to get started with when developing data processing applications. You then explored the architecture and components of Apache Spark and how they come together as a framework.
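As a rough sketch of how an application attaches to this architecture, the snippet below creates a SparkSession, the unified entry point that the driver program uses to coordinate work on the cluster; the application name and the local master URL are placeholder values, assumed here for illustration.

from pyspark.sql import SparkSession

# The driver program creates a SparkSession, Spark's unified entry point.
# "local[*]" runs Spark locally on all available cores; on a real cluster
# the master URL would point at a cluster manager such as YARN or Kubernetes.
spark = (
    SparkSession.builder
    .appName("distributed-computing-primer")   # placeholder application name
    .master("local[*]")
    .getOrCreate()
)

print(spark.version)
spark.stop()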

Next, you came to understand RDDs, which are the core abstraction of Apache Spark, how they store data on a cluster of machines in a distributed manner, and how you can leverage higher-order functions along with lambda functions to implement Data Parallel Processing via RDDs.
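For example, a classic word count expressed with RDDs looks roughly like the following; the sample data is made up, but flatMap, map, and reduceByKey are the standard higher-order RDD functions that accept lambda functions and run them in parallel across partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-word-count").getOrCreate()
sc = spark.sparkContext

# A tiny in-memory dataset, distributed across the cluster as an RDD.
lines = sc.parallelize([
    "spark makes big data simple",
    "big data needs distributed computing",
])

# Higher-order functions take lambda functions and apply them to every
# element of the RDD in parallel, partition by partition.
word_counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

print(word_counts.collect())
spark.stop()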

You also learned about the Spark SQL engine component of Apache Spark, how it provides a higher level of abstraction than RDDs, and how it comes with several built-in functions that you might already be familiar with. You learned to leverage the DataFrame DSL to implement your data processing business logic in an easier, more familiar way. You also learned about Spark's SQL API, how it is ANSI SQL standards-compliant, and how it allows you to seamlessly and efficiently perform SQL analytics on large amounts of data.
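The following sketch contrasts the two styles on a small, made-up dataset: the same aggregation is expressed once with the DataFrame DSL and once through the SQL API against a temporary view; the column names and values are purely illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-vs-sql").getOrCreate()

# A small, made-up sales dataset created inline for illustration.
sales = spark.createDataFrame(
    [("US", 120.0), ("US", 80.0), ("IN", 60.0)],
    ["country", "amount"],
)

# DataFrame DSL: relational operations expressed as method calls.
sales.groupBy("country").agg(F.sum("amount").alias("total")).show()

# SQL API: the same logic expressed as ANSI-compliant SQL on a temporary view.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()

spark.stop()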

You were also introduced to some of the prominent improvements in Apache Spark 3.0, such as adaptive query execution and dynamic partition pruning, which make Spark 3.0 considerably faster than its predecessors.
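Both features are controlled by standard Spark SQL configuration properties; the sketch below simply shows how they can be switched on explicitly when building a session (they are already enabled by default in recent Spark 3.x releases).

from pyspark.sql import SparkSession

# Explicitly enable adaptive query execution and dynamic partition pruning.
# Both are on by default in recent Spark 3.x releases; shown here for clarity.
spark = (
    SparkSession.builder
    .appName("spark3-improvements")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)

print(spark.conf.get("spark.sql.adaptive.enabled"))
spark.stop()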

Now that you have learned the basics of big data processing with Apache Spark, you are ready to embark on a data analytics journey using Spark. A typical data analytics journey starts with acquiring raw data from various source systems, ingesting it into a historical storage component such as a data warehouse or a data lake, and then cleansing, integrating, and transforming the raw data to arrive at a single source of truth. Finally, you can gain actionable business insights from the clean, integrated data by leveraging descriptive and predictive analytics. We will cover each of these aspects in the subsequent chapters of this book, starting with the process of data ingestion and data cleansing in the following chapters.
