Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Simplifying Data Engineering and Analytics with Delta

You're reading from   Simplifying Data Engineering and Analytics with Delta Create analytics-ready data that fuels artificial intelligence and business intelligence

Arrow left icon
Product type Paperback
Published in Jul 2022
Publisher Packt
ISBN-13 9781801814867
Length 334 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Anindita Mahapatra Anindita Mahapatra
Author Profile Icon Anindita Mahapatra
Anindita Mahapatra
Arrow right icon
View More author details
Toc

Table of Contents (18) Chapters Close

Preface 1. Section 1 – Introduction to Delta Lake and Data Engineering Principles
2. Chapter 1: Introduction to Data Engineering FREE CHAPTER 3. Chapter 2: Data Modeling and ETL 4. Chapter 3: Delta – The Foundation Block for Big Data 5. Section 2 – End-to-End Process of Building Delta Pipelines
6. Chapter 4: Unifying Batch and Streaming with Delta 7. Chapter 5: Data Consolidation in Delta Lake 8. Chapter 6: Solving Common Data Pattern Scenarios with Delta 9. Chapter 7: Delta for Data Warehouse Use Cases 10. Chapter 8: Handling Atypical Data Scenarios with Delta 11. Chapter 9: Delta for Reproducible Machine Learning Pipelines 12. Chapter 10: Delta for Data Products and Services 13. Section 3 – Operationalizing and Productionalizing Delta Pipelines
14. Chapter 11: Operationalizing Data and ML Pipelines 15. Chapter 12: Optimizing Cost and Performance with Delta 16. Chapter 13: Managing Your Data Journey 17. Other Books You May Enjoy

What this book covers

Chapter 1, Introduction to Data Engineering, covers how data is the new oil. Just as oil has to burn to get heat and light, data also has to be harnessed to get valuable insights. The quality of insights will depend on the quality of the data. So, understanding how to manage data is an important function for every industry vertical. This chapter introduces the fundamental principles of data engineering and addresses the growing trends in the industry of data-driven organizations and how to leverage IT data operation units as a competitive advantage instead of viewing them as a cost center.

Chapter 2, Data Modeling and ETL, covers how leveraging the scalability and elasticity of the cloud helps turn on compute on demand and move CAPEX allocation towards OPEX. This chapter introduces common big data design patterns and best practices for modeling big data.

Chapter 3, Delta – The Foundational Block for Big Data, introduces Delta as a file format and points out features that Delta brings to the table over vanilla Parquet and why it is a natural choice for any pipeline. Delta is an overloaded term – it is a protocol first, a table next, and a lake finally!

Chapter 4, Unifying Batch and Streaming with Delta, covers how the trend is towards real-time ingestion, analysis, and consumption of data. Batching is actually a type of streaming workload. Reader/writer isolation is necessary in an environment with multiple producers/consumers involving the same data assets to work independently with the promise that bad or partial data is never presented to the user.

Chapter 5, Data Consolidation in Delta Lake, covers how bringing data together from various silos is only the first step towards building a data lake. The real deal is in increased reliability, quality, and governance, which needs to be enforced to get the most out of the data and infrastructure investment while adding value to any BI or AI use case built on top of it.

Chapter 6, Solving Common Data Pattern Scenarios with Delta, covers common CRUD operations on big data and looks at use cases where they can be applied as a repeatable blueprint.

Chapter 7, Delta for Data Warehouse Use Cases, covers the journey from databases to data warehouses to data lakes, and finally, to lakehouses. The unification of data platforms has never been more important. Is it possible to house all kinds of use cases with a single architecture paradigm? This chapter focuses on the data handling needs and capability requirements that drive the next round of innovation.

Chapter 8, Handling Atypical Data Scenarios with Delta, covers several conditions, such as data imbalance, skew, and bias, that need to be addressed to ensure data is not only cleansed and transformed per the business requirements but is also conducive to the underlying compute and for the use case at hand. Even when the logic of the pipelines has been ironed out, there are other statistical attributes of the data that need to be addressed to ensure that the data characteristics for which it was initially designed still hold and make the most of the distributed compute.

Chapter 9, Delta for Reproducible Machine Learning Pipelines, emphasizes that if ML is hard, then reproducible ML and productionizing of ML is even harder. A large part of ML is data preparation. The quality of insights will be as good as the quality of the data that is used to build the models. In this chapter, we look at the role of Delta in ensuring reproducible ML.

Chapter 10, Delta for Data Products and Services, covers consumption patterns of data democratization that ensure the curated data gets into the hands of the consumers in a timely and secure manner so that the insights can be leveraged meaningfully. Data can be served both as a product and a service, especially in the context of a mesh architecture involving multiple lines of businesses specializing in different domains.

Chapter 11, Operationalizing Data and ML Pipelines, looks at the aspects of a mature pipeline that make it considered production worthy. A lot of the data around us remains in unstructured form and carries a wealth of information, and integrating it with more structured transactional data is where firms can not only get competitive intelligence but also begin to get a holistic view of their customers to employ predictive analytics.

Chapter 12, Optimizing Cost and Performance with Delta, looks at how running a pipeline faster has cost implications that translate directly to increased infrastructural savings. This applies to both the ETL pipeline that brings in the data and curates it as well as the consumption pipeline where the stakeholders tap into this curated data. In this chapter, we look at strategies such as file skipping, z-ordering, small file coalescing, and bloom filtering to improve query runtime.

Chapter 13, Managing Your Data Journey, emphasizes the need for policies around data access and data use that need to be honored as per regulatory and compliance guidelines. In some industries, it may be necessary to provide evidence of all data access and transformations. Hence, there is a need to be able to set controls in place, detect if something has been changed, and provide a transparent audit trail.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime