You're reading from Modern Data Architectures with Python A practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python

Product type Paperback

Published in Sep 2023

Publisher Packt

ISBN-13 9781801070492

Length 318 pages

Edition 1st Edition

Languages

Python

Tools

MLflow

Concepts

Data Science

Author (1):

Brian Lipp

View More author details

Table of Contents (19) Chapters

Preface

1. Part 1:Fundamental Data Knowledge

2. Chapter 1: Modern Data Processing Architecture FREE CHAPTER

3. Chapter 2: Understanding Data Analytics

4. Part 2: Data Engineering Toolset

5. Chapter 3: Apache Spark Deep Dive

6. Chapter 4: Batch and Stream Data Processing Using PySpark

7. Chapter 5: Streaming Data with Kafka

8. Part 3:Modernizing the Data Platform

9. Chapter 6: MLOps

10. Chapter 7: Data and Information Visualization

11. Chapter 8: Integrating Continous Integration into Your Workflow

12. Chapter 9: Orchestrating Your Data Workflows

13. Part 4:Hands-on Project

14. Chapter 10: Data Governance

15. Chapter 11: Building out the Groundwork

16. Chapter 12: Completing Our Project

17. Index

Why subscribe?

18. Other Books You May Enjoy

Stream processing

Streaming is a very useful mode of processing data and can come with a large amount of complexity. One thing a purist must consider is that Spark doesn’t do “streaming data” – Spark does micro-batch data processing. So, it will load whatever the new messages are and run a batch process on them in a continuous loop while checking for new data. A pure streaming data processing engine such as Apache Flink will only process one new load of “data.” So, as a simple example, let’s say there are 100 new messages in a Kafka queue; Spark would process all of them in one micro-batch. Flink, on the other hand, would process each message separately.

Spark Structured Streaming is a DataFrame API on top of the normal Spark Streaming, much like the DataFrame API sits on the RDD API. Streaming DataFrames are optimized just like normal DataFrames, so I suggest always using structured streaming over normal Spark Streaming. Also, Spark...

The rest of the chapter is locked

Tech Concepts

Programming languages

Tech Tools

Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

50+ new titles added per month and exclusive early access to books as they are being written.

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at €18.99/month. Cancel anytime

Authors (1)

Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.

See other products by Brian Lipp