Data Lakehouse in Action: Architecting a modern and scalable data analytics platform


Chapter 1: Introducing the Evolution of Data Analytics Patterns

Data analytics is an ever-changing field. A little history will help you appreciate the strides made in this field and how data architecture patterns have evolved to keep pace with the demand for analytics.

First, let's start with some definitions:

  • What is analytics? Analytics is defined as any action that converts data into insights.
  • What is data architecture? Data architecture is the structure that enables the storage, transformation, exploitation, and governance of data.

Analytics, and the data architecture that enables it, go back a long way. Let's now explore some of the patterns that have evolved over the last few decades.

This chapter explores the genesis of data growth and explains the need for a new paradigm in data architecture. It starts by examining the paradigm that was predominant in the 1990s and 2000s, the enterprise data warehouse, and the challenges associated with it, and then covers the drivers that caused an explosion in data. It goes on to examine the rise of a new paradigm, the data lake, and its challenges. Finally, the chapter makes the case for yet another paradigm, the data lakehouse, and clarifies the key benefits delivered by a well-architected data lakehouse.

We'll cover all of this in the following topics:

  • Discovering the enterprise data warehouse era
  • Exploring the five factors of change
  • Investigating the data lake era
  • Introducing the data lakehouse paradigm

Discovering the enterprise data warehouse era

The Enterprise Data Warehouse (EDW) pattern, popularized by Ralph Kimball and Bill Inmon, was predominant in the 1990s and 2000s. The needs of this era were relatively straightforward (at least compared to the current context). The focus was predominantly on optimizing database structures to satisfy reporting requirements. Analytics was synonymous with reporting. Machine learning was a specialized field and was not ubiquitous in enterprises.

A typical EDW pattern is depicted in the following figure:

Figure 1.1 – A typical EDW pattern

As shown in Figure 1.1, the pattern entails source systems composed of databases or flat-file structures. These data sources are predominantly structured, that is, organized into rows and columns. A process called Extract-Transform-Load (ETL) first extracts the data from the source systems. Then, the process transforms the data into a shape and form that is conducive to analysis. Once the data is transformed, it is loaded into an EDW. From there, subsets of data are populated into downstream data marts. Data marts can be thought of as mini data warehouses that cater to the business requirements of a specific department.
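
To make the flow concrete, the following is a minimal, hypothetical sketch of those ETL steps in Python using pandas and SQLAlchemy. The file path, column names, table name, and connection string are illustrative assumptions, not details from the book.

```python
# Minimal ETL sketch (illustrative only): extract from a flat file exported by a
# source system, transform it into a reporting-friendly shape, and load it into
# a warehouse table.
import pandas as pd
from sqlalchemy import create_engine

# Extract: read a structured flat file (hypothetical path and columns).
orders = pd.read_csv("exports/orders.csv", parse_dates=["order_date"])

# Transform: reshape the rows into an aggregate that reports can consume directly.
monthly_sales = (
    orders
    .assign(month=orders["order_date"].dt.to_period("M").astype(str))
    .groupby(["month", "region"], as_index=False)["amount"]
    .sum()
)

# Load: write the transformed data into the EDW (hypothetical connection string).
# A departmental data mart could be populated from this table by a similar,
# narrower query.
engine = create_engine("postgresql://user:password@edw-host/warehouse")
monthly_sales.to_sql("fact_monthly_sales", engine, if_exists="replace", index=False)
```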

As you can imagine, this pattern was primarily focused on the following:

  • Creating a data structure that is optimized for storage and modeled for reporting
  • Focusing on the reporting requirements of the business
  • Harnessing the structured data into actionable insights

Every coin has two sides, and the EDW pattern is no exception; it has its pros and its cons. This pattern survived the test of time. It was widespread and well adopted because of the following key advantages:

  • Since most of the analytical requirements were related to reporting, this pattern effectively addressed many organizations' reporting requirements.
  • Large enterprise data models were able to organize an organization's data into logical and physical models. This gave organizations a way to manage their data in a modular and efficient manner.
  • Since this pattern catered only to structured data, the technology required to harness structured data was mature and readily available. Relational Database Management Systems (RDBMSes) evolved, and their features were well suited to reporting workloads.

However, it also had its own set of challenges that surfaced as the data volumes grew and new data formats started emerging. A few challenges associated with the EDW pattern are as follows:

  • This pattern was not as agile as the changing business requirements needed it to be. Any change in the reporting requirements had to go through a long-winded process of data model changes, ETL code changes, and corresponding changes to the reporting system. ETL development was often a specialized skill and became a bottleneck in reducing the data-to-insight turnaround time. The nature of analytics is unique: the more output you see, the more you demand. Many EDW projects were deemed failures, not from a technical perspective, but from a business one. Operationally, the design changes required to cater to these fast-evolving requirements were too difficult to handle.
  • As data volumes grew, this pattern proved cost-prohibitive. Massively parallel processing database technologies that specialized in data warehouse workloads started evolving, but the cost of maintaining these databases was prohibitive as well. It involved expensive software licenses, frequent hardware refreshes, and substantial staffing costs. The return on investment was no longer justifiable.
  • As data formats started evolving, the challenges associated with the EDW became more evident. Database technologies were developed to cater to semi-structured data such as JSON, but the fundamental concept was still RDBMS-based, and the underlying technology could not effectively handle these new types of data. Increasingly, the value lay in analyzing data that was not structured, and the sheer variety of data was too complex for EDWs to handle.
  • The EDW was focused predominantly on Business Intelligence (BI). It facilitated the creation of scheduled reports, ad hoc data analysis, and self-service BI. Although it catered to most of the personas who performed analysis, it was not conducive to AI/ML use cases. The data in the EDW was already cleansed and structured with a razor-sharp focus on reporting, which left little room for data scientists (known as statistical modelers at the time) to explore data and form new hypotheses.

While the EDW pattern was becoming mainstream, a perfect storm was brewing that would change the landscape. The following section focuses on the five factors that came together to change the data architecture pattern for good.

Exploring the five factors of change

The year 2007 changed the world as we know it; the day Steve Jobs took the stage and announced the iPhone launch was a turning point in the age of data. That day brewed the perfect "data" storm.

A perfect storm is a meteorological event that occurs as a result of a rare combination of factors. In the world of data, such a perfect storm occurred over the last decade, one that has elevated data to the status of a strategic enterprise asset. Five ingredients caused the perfect "data" storm.

Figure 1.2 – Ingredients of the perfect "data" storm

As depicted in Figure 1.2, there were five factors to the perfect storm. An exponential growth of data and an increase in computing power were the first two factors. These two factors coincided with a decrease in storage cost. The rise of AI and the advancement of cloud computing coalesced at the same time to form the perfect storm.

These factors developed independently and converged, changing and shaping industries. Let's look briefly at each of them.

The exponential growth of data

The exponential growth of data is the first ingredient of the perfect storm.

Figure 1.3 – Estimated data growth between 2010 and 2020

According to the International Data Corporation (IDC), by 2025, the total volume of data generated will reach around 163 ZB (zettabytes; a zettabyte is a trillion gigabytes). In 2010, that number was approximately 0.5 ZB. This exponential growth of data is attributed to a vast improvement in internet technologies that fueled the growth of many industries. The telecommunications industry was transformed first, and this, in turn, transformed many other industries. Data became ubiquitous, and every business craved more data bandwidth. Social media platforms such as Facebook, Twitter, and Instagram flooded the internet with more data, and streaming services and e-commerce generated tons of data as well. This data was used to shape and influence consumer behavior. Last, but not least, technological leaps in the Internet of Things (IoT) space generated loads of data.

The traditional EDW pattern was not able to cope with this growth; it was designed for structured data. Big data changed the definition of usable data. The data was now big (volume); some of it flowed continuously (velocity), it was generated in different shapes and forms (variety), and it came from a plethora of sources with noise (veracity).

The increase in compute

The exponential increase in computing power is the second ingredient of the perfect storm.

Figure 1.4 – Estimated growth in transistors per microprocessor between 2010 and 2020

Moore's law is the prediction made by American engineer Gordon Moore in 1965 that the number of transistors per silicon chip would double every year; Moore later revised the cadence to roughly every two years, and in that form the prediction has largely held. In 2010, the number of transistors in a microprocessor was around 2 billion. In 2020, that number stood at 54 billion. This exponential increase in computing power dovetails with the rise of cloud computing technologies that provide virtually limitless compute at an affordable price point.
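
As a quick back-of-the-envelope check of those figures (assuming one doubling roughly every two years, an assumption of this sketch rather than a figure from the book):

```python
# Rough sanity check of the transistor counts cited above
# (assumption: one doubling roughly every two years).
transistors_2010 = 2e9                 # ~2 billion transistors per microprocessor in 2010
doublings = 10 / 2                     # ten years, one doubling every two years
estimate_2020 = transistors_2010 * 2 ** doublings
print(f"{estimate_2020:.1e}")          # ~6.4e+10, the same order of magnitude as the ~54 billion observed in 2020
```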

The increase in computing power at a reasonable price point provided a much-needed impetus for big data. Organizations can now procure more and more compute at a much lower price point. The compute available in cloud computing can now be used to process and analyze data on demand.

The decrease in storage cost

The rapid decrease in storage cost is the third ingredient of the perfect storm.

Figure 1.5 – The estimated decrease in storage cost between 2010 and 2020

The cost of storage has also decreased exponentially. In 2010, the average cost of storing a GB of data on a Hard Disk Drive (HDD) was around $0.1. Within 10 years, that number fell to approximately $0.01. In the traditional EDW pattern, organizations had to be picky about which data to store for analysis and which data to discard; holding data was an expensive proposition. The exponential decrease in storage cost meant that all data could now be stored at a fraction of the previous cost. There was no longer a need to pick and choose what should be stored and what should be discarded. Data in whatever shape or form could now be kept at a fraction of the price. The mantra of store first and analyze later could now be put into practice.

The rise of artificial intelligence

Artificial Intelligence (AI) systems are not new to the world. In fact, their genesis goes back to the 1950s, when statistical models were used to estimate the values of data points based on past data. The field then lay dormant for an extended period, as the computing power and the large corpus of data required to run these models were not available.

Figure 1.6 – Timeline of the evolution of AI

However, after a long hibernation, AI technologies saw a resurgence in the early 2010s. This resurgence was partly due to the abundance of powerful computing resources and an equally abundant supply of data. AI models could now be trained faster, and the results were stunningly accurate.

The combination of reduced storage costs and more available computing resources was a boon for AI: increasingly complex models could now be trained.

Figure 1.7 – Accuracy of AI systems in matching humans for image recognition

This was especially true for deep learning algorithms. For instance, a deep learning technique called Convolutional Neural Networks (CNNs) became very popular for image recognition. Over time, deeper and deeper neural networks were created, and AI systems have now surpassed human beings at detecting objects in images.

As AI systems became more accurate, they gained in popularity. This fueled a virtuous cycle, and more and more businesses employed AI in their digital transformation agendas.

The advancement of cloud computing

The fifth ingredient of the perfect "data" storm is the rise of cloud computing. Cloud computing is the on-demand availability of computing and storage resources. Typical public cloud service providers include big technology companies such as Amazon (AWS), Microsoft (Azure), and Google (GCP). Cloud computing eliminates the need to host large servers for compute and storage in the organization's own data center. Depending on the services subscribed to, organizations can also reduce their dependency on software and hardware maintenance. The cloud provides a plethora of on-demand services at a very economical price point, and the cloud computing landscape has grown steadily since 2010: worldwide spending on public clouds rose from around $77 billion in 2010 to around $441 billion in 2020. Cloud computing also enabled the rise of the Digitally Native Business (DNB), propelling organizations such as Uber, Deliveroo, TikTok, and Instagram, to name a few.

Cloud computing has been a boon for data. With the rise of cloud computing, data can now be stored at a fraction of the cost. The comparatively limitless compute power that the cloud provides translates into the ability to transform data rapidly. Cloud computing also provides innovative data platforms that can be utilized at the click of a button.

These five ingredients crossed paths at an opportune moment to challenge the existing data architecture patterns. The perfect "data" storm facilitated the rise of a new data architecture paradigm focused on big data, the data lake.

Investigating the data lake era

The genesis of the data lake dates back to 2004, when Google researchers Jeffrey Dean and Sanjay Ghemawat published a paper titled MapReduce: Simplified Data Processing on Large Clusters. This paper laid the foundation for a new technology that evolved into Hadoop, originally created by Doug Cutting and Mike Cafarella.

Hadoop was later brought into the Apache Software Foundation, a decentralized open source community of developers, and has been one of the top open source projects within the Apache ecosystem.

Hadoop was based on a simple concept – divide and conquer. The idea entailed three steps (a minimal sketch follows the list):

  1. Split the data into multiple files and distribute them across the various nodes of a cluster.
  2. Process the data locally on each node.
  3. Use an orchestrator that communicates with each node and aggregates the data for the final output.
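
The following toy word-count sketch in Python mimics this divide-and-conquer flow. It is an illustration of the idea only, not actual Hadoop or MapReduce code, and the sample documents and worker count are made up.

```python
# Toy illustration of divide and conquer: split the data, process each split
# independently (as a node would), then aggregate the partial results.
from collections import Counter
from multiprocessing import Pool

def count_words(lines):
    """'Map' step: count words locally within one split."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    documents = [
        "the data lake stores raw data",
        "the warehouse stores modeled data",
        "the lakehouse combines both",
    ]
    # Step 1: split the data and distribute it to workers.
    splits = [documents[i::2] for i in range(2)]

    # Step 2: each worker processes its split locally.
    with Pool(processes=2) as pool:
        partial_counts = pool.map(count_words, splits)

    # Step 3: the orchestrator aggregates the partial results ('reduce').
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
```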

Over the years, this concept gained traction, and a new kind of paradigm emerged for analytics: the data lake. A typical data lake pattern is depicted in the following figure:

Figure 1.8 – A typical data lake pattern

This pattern addressed the challenges prevalent in the EDW pattern. The key advantages that the data lake architecture offers are as follows:

  • The data lake caters to both structured and unstructured data. The Hadoop ecosystem was primarily developed to store and process data formats such as JSON, text, and images. The EDW pattern was not designed to store or analyze these data types.
  • The data lake pattern can process large volumes of data at a relatively low cost. The volumes of data that data lakes can store and process are on the order of hundreds of terabytes (TB) or petabytes (PB). The EDW pattern found such volumes challenging to store and process efficiently.
  • Data lakes are better able to address fast-changing business requirements, and evolving AI technologies can leverage data lakes more effectively.

This pattern is widely adopted as it meets the need of the hour. However, it has its own challenges. A few of the challenges associated with this pattern are as follows:

  • It is easy for a data lake to become a data swamp. Data lakes take in any form of data and store it in its raw form; the philosophy is to ingest data first and figure out what to do with it later. Governance easily slips, and the data lake becomes challenging to govern. Without proper data governance, data mushrooms all over the place, and the lake soon turns into a swamp.
  • Data lakes also struggle with the rapid evolution of technology. The data lake paradigm relies mainly on open source software, which evolves rapidly into a sprawl of components that can become difficult to manage. The software is predominantly community-driven and lacks proper enterprise support, which leads to significant maintenance overhead and implementation complexity. Many features demanded by enterprises, for example, a robust security framework, are missing from open source software.
  • Data lakes focus far more on enabling AI than BI. It was natural for the open source ecosystem to evolve in that direction: AI was on its own journey, riding a wave that crested together with Hadoop, while BI was seen as retro because it was already mature in its life cycle.

Soon, it became evident that the data lake pattern alone wouldn't be sustainable in the long run. There was a need for a new paradigm that fused these two patterns.

Introducing the data lakehouse paradigm

In 2006, Clive Humby, a British mathematician, coined the now-famous phrase, "Data is the new oil." It was akin to peering through a crystal ball and peeking into the future. Data is the lifeblood of organizations, and competitive advantage is defined by how an organization uses its data. Data management is paramount in this age of digital transformation. More and more organizations are embracing digital transformation programs, and data is at the core of these transformations.

As discussed earlier, the paradigms of the EDW and data lakes were opportune for their times. They had their benefits and their challenges. A new paradigm needed to emerge that was disciplined at its core and flexible at its edges.

Figure 1.9 – Data lakehouse paradigm

The new data architectural paradigm is called the data lakehouse. It strives to combine the advantages of both the data lake and the EDW paradigms while minimizing their challenges.

An adequately architected data lakehouse delivers four key benefits.

Figure 1.10 – Benefits of the data lakehouse

  1. It derives insights from both structured and unstructured data: The data lakehouse architecture should be able to store, transform, and integrate structured and unstructured data. It should be able to fuse them together and enable the extraction of valuable insights from the data.
  2. It caters to the different personas of an organization: Data is a dish with different tastes for different personas, and the data lakehouse should be able to cater to their needs. A data scientist should get a playground for testing their hypotheses, an analyst should be able to analyze data using their tools of choice, and business users should get their reports accurately and on time. In short, the data lakehouse democratizes data for analytics.
  3. It facilitates the adoption of a robust governance framework: The primary challenge with the data lake architecture pattern was the lack of a strong governance framework; it was easy for a data lake to become a data swamp. In contrast, the EDW architecture was stymied by too much governance for too little content. The data lakehouse architecture strives to strike the right governance balance, applying appropriate governance to each data type while granting access to the right stakeholders.
  4. It leverages cloud computing: The data lakehouse architecture needs to be agile and innovative. The pattern needs to adapt to changing organizational requirements and reduce the data-to-insight turnaround time. To achieve this agility, it is imperative to adopt cloud computing technology. Cloud computing platforms offer the required pace of innovation and provide an appropriate technology stack, with the scalability and flexibility to fulfill the demands of a modern data analytics platform.

The data lakehouse paradigm addresses the challenges faced by the EDW and data lake paradigms. Yet, it has its own set of challenges that need to be managed. A few of those challenges are as follows:

  • Architectural complexity: Given that the data lakehouse pattern amalgamates the EDW and data lake patterns, it is inevitable that it will have its fair share of architectural complexity, which manifests in the number of components required to realize the pattern. Architectural patterns always involve trade-offs; it is vital to weigh architectural complexity carefully against the potential business benefit, and the data lakehouse architecture needs to tread that path carefully.
  • The need for holistic data governance: The challenges pertinent to the data lake paradigm do not magically go away with the data lakehouse paradigm. The biggest challenge of a data lake was that it was prone to becoming a data swamp. As the data lakehouse grows in scope and complexity, the lack of a holistic governance framework is a surefire way to turn it into a swamp as well.
  • Balancing flexibility with discipline: The data lakehouse paradigm strives to be flexible and to adapt to changing business requirements with agility. The ethos under which it operates is to have discipline at the core and flexibility at the edges. Achieving this objective is a careful balancing act that requires clearly defining the limits of flexibility and the strictness of discipline, and the data lakehouse stewards play an essential role in maintaining this balance.

Let's recap what we've discussed in this chapter.

Summary

This chapter was about the genesis of a new paradigm. It is important to understand that genesis so that we can see the shortcomings of the predecessors and how new frameworks evolve to address them. Understanding the drivers that caused this evolution is also important: developments in other fields of technology, such as storage, cloud computing, and AI, have had a ripple effect on data architecture. In this chapter, we started by exploring the EDW architecture pattern that was predominant for a long time. Then, we explored the factors that created the perfect "data" storm. Subsequently, the chapter delved into the data lake architecture pattern. The need for a new architectural paradigm, the data lakehouse, was then discussed. The chapter concluded by highlighting the key benefits of the new architectural paradigm.

The next chapter aims to zoom in on the components of the data lakehouse architecture.


Key benefits

  • Understand how data is ingested, stored, served, governed, and secured for enabling data analytics
  • Explore a practical way to implement Data Lakehouse using cloud computing platforms like Azure
  • Combine multiple architectural patterns based on an organization’s needs and maturity level

Description

The Data Lakehouse architecture is a new paradigm that enables large-scale analytics. This book will guide you in developing your data architecture the right way to ensure your organization's success. The first part of the book discusses the different data architecture patterns used in the past and the need for a new architectural paradigm, as well as the drivers that have caused this change. It covers the principles that govern the target architecture, the components that form the Data Lakehouse architecture, and the rationale and need for those components. The second part takes a deep dive into the different layers of the Data Lakehouse. It covers various scenarios and components for data ingestion, storage, data processing, data serving, analytics, governance, and data security. The third part focuses on the practical implementation of the Data Lakehouse architecture on a cloud computing platform, and on ways to combine the Data Lakehouse pattern into macro-patterns, such as Data Mesh and Data Hub-Spoke, based on the organization's needs and maturity level. The frameworks introduced are practical, and organizations can readily benefit from their application. By the end of this book, you'll clearly understand how to implement the Data Lakehouse architecture pattern in a scalable, agile, and cost-effective manner.

Who is this book for?

This book is for data architects, big data engineers, data strategists and practitioners, data stewards, and cloud computing practitioners looking to become well-versed with modern data architecture patterns to enable large-scale analytics. Basic knowledge of data architecture and familiarity with data warehousing concepts are required.

What you will learn

  • Understand the evolution of the Data Architecture patterns for analytics
  • Become well versed in the Data Lakehouse pattern and how it enables data analytics
  • Focus on methods to ingest, process, store, and govern data in a Data Lakehouse architecture
  • Learn techniques to serve data and perform analytics in a Data Lakehouse architecture
  • Cover methods to secure the data in a Data Lakehouse architecture
  • Implement Data Lakehouse in a cloud computing platform such as Azure
  • Combine the Data Lakehouse with macro-architecture patterns such as Data Mesh

Product Details

Publication date: Mar 17, 2022
Length: 206 pages
Edition: 1st
Language: English
ISBN-13: 9781801815932



Table of Contents

PART 1: Architectural Patterns for Analytics
Chapter 1: Introducing the Evolution of Data Analytics Patterns
Chapter 2: The Data Lakehouse Architecture Overview
PART 2: Data Lakehouse Component Deep Dive
Chapter 3: Ingesting and Processing Data in a Data Lakehouse
Chapter 4: Storing and Serving Data in a Data Lakehouse
Chapter 5: Deriving Insights from a Data Lakehouse
Chapter 6: Applying Data Governance in the Data Lakehouse
Chapter 7: Applying Data Security in a Data Lakehouse
PART 3: Implementing and Governing a Data Lakehouse
Chapter 8: Implementing a Data Lakehouse on Microsoft Azure
Chapter 9: Scaling the Data Lakehouse Architecture
Other Books You May Enjoy
