Data Lake for Enterprises

Introduction to Data

Through this book, we are embarking on a huge task of implementing a technology masterpiece for your enterprise. In this journey, you will not only have to learn many new tools and technologies but also have to know a good amount of jargon and theoretical stuff. This will surely help you in your journey to reach the ultimate goal of creating the masterpiece, namely Data lake.

This part of the book aims at preparing you for a tough road ahead so that you are quite clear in the head as to what you want to achieve. The concept of a Data lake has evolved over time in enterprises, starting with concepts of data warehouse which contained data for long term retention and stored differently for reporting and historic needs. Then the concept of data mart came into existence which would expose small sets of data with enterprise relevant attributes. Data lake evolved with these concepts as a central data repository for an enterprise that could capture data as is, produce processed data, and serve the most relevant enterprise information.

The topic or technology of Data lake is not new, but very few enterprises have implemented a fully functional Data lake in their organization. Through this book, we want enterprises to start thinking seriously on investing in a Data lake. Also, with the help of you engineers, we want to give the top management in your organization a glimpse of what can be achieved by creating a Data lake which can then be used to implement a use case more relevant to your own enterprise.

So, fasten your seatbelt, hold on tight, and let's start the journey!

Rest assured that after completing this book, you will help your enterprise (small or big) to think and model their business in a data-centric approach, using Data lake as its technical nucleus.

The intent of this chapter is to give the reader insight into data, big data, and some of the important details in connection with data. The chapter gives some important textbook-based definitions, which need to be understood in depth so that the reader is convinced about how data is relevant to an enterprise. The reader would also have grasped the main crux of the difference between data and big data. The chapter soon delves into the types of data in depth and where we can find in an enterprise.

The latter part of the chapter tries to enlighten the user with the current state of enterprises with regard to data management and also tries to give a high-level glimpse on what enterprises are looking to transform themselves into, with data at the core. The whole book is based on a real-life example, and the last section is dedicated to explaining this example in more detail. The example is detailed in such a manner that the reader would get a good amount of concepts implemented in the form of this example.

Exploring data

Data refers to a set of values of qualitative or quantitative variables.

Data is measured, collected and reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools. Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.

- Wikipedia

Data can be broadly categorized into three types:

Structured data
Unstructured data
Semi-structured data

Structured data is data that we conventionally capture in a business application in the form of data residing in a relational database (relational database management system (RDBMS)) or non-relational database (NoSQL - originally referred to as non SQL).

Structured data can again be broadly categorized into two, namely raw and cleansed data. Data that is taken in as it is, without much cleansing or filtering, is called raw data. Data that is taken in with a lot of cleansing and filtering, catering to a particular analysis by business users, is called cleansed data.

All the other data, which doesn’t fall in the category of structured, can be called unstructured data. Data collected in the form of videos, images, and so on are examples of unstructured data.

There is a third category called semi-structured data, which has come into existence because of the Internet and is becoming more and more predominant with the evolution of social sites. The Wikipedia definition of semi-structured data is as follows:

Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.

Some of the examples of semi-structured data are the well-known data formats, namely JavaScript Object Notation (JSON) and Extensible Markup Language (XML).

The following figure (Figure 01) covers whatever we discussed on different types of data, in a pictorial fashion. Please don't get confused by seeing spreadsheets and text files in the structured section. This is because the data presented in the following figure is in the form of a record, which, indeed, qualifies it to be structured data:

Figure 01: Types of Data

What is Enterprise Data?

Enterprise data refers to data shared by employees and their partners in an organization, across various departments and different locations, spread across different continents. This is data that is valuable to the enterprise, such as financial data, business data, employee personal data, and so on, and the enterprise spends considerable time and money to keep this data secure and clean in all aspects.

During all this, this so-called enterprise data passes the current state and becomes stale, or rather dead, and lives in some form of storage, which is hard to analyze and retrieve. This is where the significance of this data and having a single place to analyze it in order to discover various future business opportunities leads to the implementation of a Data lake.

Enterprise data falls into three major high-level categories, as detailed next:

Master data refers to the data that details the main entities within an enterprise. Looking at the master data, one can, in fact, find the business that the enterprise is involved in. This data is usually managed and owned by different departments. The other categories of data, as follows, need the master data to make meaningful values of them.
Transaction data refers to the data that various applications (internal and external) produce while transacting various business processes within an enterprise. This also includes people-related data, which, in a way, doesn’t categorize itself as business data but is significant. This data, when analyzed, can give businesses many optimization techniques to be employed. This data also depends and often refers to the master data.
Analytic data refers to data that is actually derived from the preceding two kinds of enterprise data. This data gives enough insight into various entities (master data) in the enterprise and can also combine with transaction data to make positive recommendations, which can be implemented by the enterprise, after performing the necessary due diligence.

The previously explained different types of enterprise data are very significant to the enterprise, because of which most enterprises have a process for the management of these types of data, commonly known as enterprise data management. This aspect is explained in more detail in the following section.

The following diagram shows the various enterprise data types available and how they interact with each other:

Figure 02: Different types of Enterprise Data

The preceding figure shows that master data is being utilized by both transaction and analytic data. Analytic data also depends on transaction data for deriving meaningful insights as needed by users who use these data for various clients.

Enterprise Data Management

Ability of an organization to precisely define, easily integrate and effectively retrieve data for both internal applications and external communication

- Wikipedia

EDM emphasizes data precision, granularity and meaning and is concerned with how the content is integrated into business applications as well as how it is passed along from one business process to another.

- Wikipedia

As the preceding wikipedia definition clearly states, EDM is the process or strategy of determining how this enterprise data needs to be stored, where it has to be stored, and what technologies it has to use to store and retrieve this data in an enterprise. Being very valuable, this data has to be secured using the right controls and needs to be managed and owned in a defined fashion. It also defines how the data can be taken out to communicate with both internal and external applications alike. Furthermore, the policies and processes around the data exchange have to be well defined.

Looking at the previous paragraph, it seems that it is very easy to have EDM in place for an enterprise, but in reality, it is very difficult. In an enterprise, there are multiple departments, and each department churns out data; based on the significance of these departments, the data churned would also be very relevant to the organization as a whole. Because of the distinction and data relevance, the owner of each data in EDM has different interests, causing conflicts and thus creating problems in the enterprise. This calls for various policies and procedures along with ownership of each data in EDM.

In the context of this book, learning about enterprise data, enterprise data management, and issues around maintaining an EDM are quite significant. This is the reason why it's good to know these aspects at the start of the book itself. In the following sections we will discuss big data concepts and ways in which big data can be incorporated into enterprise data management and extend its capabilities with opportunities that could not be imagined without big data technologies.

Big data concepts

Let me start this section by giving the Wikipedia definition for Big Data:

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. The term "big data" often refers simply to the use of predictive analytics, user behaviour analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

- Wikipedia

Let's try explaining, the two sentences that are given in the preceding Wikipedia definition. Earlier, big data referred to any data that is large and complex in nature. There isn't any specified size of data for it to be called big data. This data was considered so big that conventional data processing applications found it difficult to use it in a meaningful fashion. In the last decade or so, many technologies have evolved in this space in order to analyze such big data in the enterprise. Nowadays, the term big data is used to refer to any sort of analysis method that can comprehend and extract this complex data and make valuable use of it in the enterprise.

Big data and 4Vs

Whenever you encountered the term big data being overly used, you must have come across an important aspect with regard to it, called 4Vs (until recently, it was 3Vs, and then the fourth, very significant, V got introduced). The 4Vs, namely variety, velocity, volume, and veracity (in no particular order) determine whether the data we call Big Data really qualifies to be called big:

Variety: In the context of big data, variety has a very important place. Variety refers to vivid types of data and the myriad sources from which these are arrived at. With the proliferation of technologies and the ever-growing number of applications (enterprise and different personal ones), there is high emphasis on data variety. This is not going to come down any time soon; rather, this is set to increase over a period of time, for sure. Broadly, data types can be categorized into structured and unstructured. Applications during this time deal mainly with structured data stored mostly in a relational database management system (RDBMS). This is very common, but nowadays, there has been the need to look at more unstructured data, and some of the examples can be video content, image content, file content in the form of binaries, and so on.
Velocity: In the context of big data, velocity is referred to in two aspects. First is the rate at which the data is generated, and second is the capability by which the enormous amount of data can be analyzed in real time to derive some meaningful conclusions. As the proverb goes, Time is money, this V is a very important aspect, which makes it easy to take quick decisions in real time. This aspect is one of the strongholds of some of the businesses, especially retail. Giving the customer a personalized and timely offer can be the deciding factor of the customer buying a product from you or ditching you to select a more favorable one.
Volume: In the context of big data, volume refers to the amount/scale of data that needs to be analyzed for a meaningful result to be derived. There isn't a quantitative figure that categorizes a data to be falling into big data. But usually, this volume is definitely more than what a conventional application is handling as of now. So, in general, this is quite big and does pose a problem for a traditional application to deal with in a day-to-day fashion (OLTP - OnLine Transaction Processing). For many businesses, analyzing and making use of social data has become a necessity. These social apps (Facebook, Google+, LinkedIn, and so on) have billions of registered users producing billions of data (structured and unstructured) in a day-to-day fashion. In addition to this, there are applications that themselves produce a huge amount of data in the form of conventional transactions and other analytics (behavioral, location-based, and so on). Also, with the growing number of wearables and sensors that emit data every millisecond, the volume aspect is going to be very important, and this is not going to come down any time soon.

As detailed in the previous section, until recently, there used to be 3Vs. But quite recently, the fourth V was introduced by IBM, namely veracity. For data growing at an exponential rate and as deduced from different reliable and unreliable sources, the significance of this V is huge.

You must have already heard/read of fake news/material being circulated in various social media when there is something important happening in the world. This V brings this a very important aspect of accuracy in big data. With proliferation of data, especially in social channels, this V is going to be very important, and rather than 3Vs, it is leaning highly towards 4Vs of Big Data.

Veracity: In the context of big data, veracity refers to accuracy of data being analyzed to get to a meaningful result. With a variety of sources, especially the not-so-reliable user-entered unstructured data, the data coming from some of these channels has to be consumed in a judicial manner. If an enterprise wants to use this data to generate business, its authenticity has to be verified to an even greater extent.

Big Data and its significant 4V's are shown in a pictorial representation, as follows:

Figure 03: 4V's of Big Data

Figure 03 clearly shows what the 4V's are and what each of these V's means, with adequate bullet points for easy understanding.

Relevance of data

To any enterprise, data is very important. Enterprises have been collecting a good amount of past data and keeping it in a data warehouse for analysis. This proves the importance of data for enterprises for past data analysis and using this for future enterprise growth. In the last decade or so, with the proliferation of social media and myriads of applications (internal to the enterprise and external cloud offerings), the data collected has grown exponentially. This data is increasing in amount as the day goes by, but enterprises are finding it really difficult to make use of these high volumes of diverse data in an effective manner. Data relevance is at the highest for enterprises nowadays as they are now trying to make use of this collected data to transform or energize their existing business.

A business user when fed with these huge amounts of data and right tools can derive real good value. For example, if customer-related data from various applications flows into a place where this data can be analyzed, this data could give a good amount of valuable insights, such as who is the customer who engages with various website pages of the enterprise and how. These derivations can be used as a way in which they can look at either changing their existing business model or tweaking certain business processes to derive maximum profit for the enterprise. For example, looking at various insights from centralized customer data, a new business model can be thought through, say in the form of starting to look at giving loyalty points to such customers. This data can also be made use of, giving more personalized offers closer to customer recommendations. For example, looking at the customer behavior, rather than giving a general offer, more personalized offers suiting the customer's needs could be offered. However, these are fully dependent on the business, and there isn't one approach fitting all the scenarios. These data can, however, be transformed and cleansed to make them more usable for a business user through different data visualizations techniques available as of now in the form of different types of graphs and charts.

Data is relevant, but where exactly this data lives in an enterprise is detailed in the following section.

Vit Soupal (Head of Big Data, Deutsche Telekom AG) in one of his blogs defines these 4V’s of big data as technical parameters and defines another three V’s bringing in business into context. We thought that we would not cover these additional V’s in our book, but these are definitely required for Data lake (Big Data) to be successful in an enterprise.

These additional 3 Vs (business parameters) are as follows:

Vision: Every enterprise embarking on Big Data (Data lake) should have a well-defined vision and should also be ready to transform processes to make full use of it. Also, management in the enterprise should fully understand this and should be in a position to make decisions considering its merits.
Visualization: Data lake is expected to have a huge amount of data. Some will make a lot of sense and some won't at certain points in time. Data scientists work on these data and derive meaningful deductions, and these need to be communicated in an effective manner to the management. For Big Data to be successful, visualization of meaningful data in various formats is required and mandated.
Value: Big Data should be of value to the enterprise. These values could bring about changes in business processes or bring in new innovative solutions (say IoT) and entirely transform the business model.

Vit also gives a very good way of representing these 7 V’s as shown in the following figure:

Figure 04: 7 V's of big data

Figure 04 shows that Big Data becomes successful in an enterprise only if both business and technical attributes are met.

The preceding figure (Figure 04) conveys that Data lake needs to have a well-defined vision and then a different variety of data flows with different velocity and volume into the lake. The data coming into the lake has different quality attributes (veracity). The data in Data lake requires various kinds of visualization to be really useful to various departments and higher management. These useful visualizations will derive various value to the organization and would also help in making various decisions helpful to the enterprise. Technical attributes in a Data lake are needed for sure (variety, velocity, volume, and veracity), but business attributes/parameters are very much required (vision, visualization, and value), and these make Data lake a success in the enterprise.

Quality of data

There is no doubt that high-quality data (cleansed data) is an irresistible asset to an organization. But in the same way, bad quality or mediocre quality data, if used to make decisions for an enterprise, cannot only be bad for your enterprise but can also tarnish the brand value of your enterprise, which is very hard to get back. The data, in general, becomes not so usable if it is inconsistent, duplicate, ambiguous, and incomplete. Business users wouldn't consider using these data if they do not have a pleasant experience while using these data for various analyzes. That's when we realize the importance of the fourth V, namely veracity.

Quality of data is an assessment of data to ascertain its fit for the purpose in a given context, where it is going to be used. There are various characteristics based on which data quality can be ascertained. Some of which, not in any particular order, are as follows:

Correctness/accuracy: This measures the degree to which the collected data describes the real-world entity that's being captured.
Completeness: This is measured by counting the attributes captured during the data-capturing process to the expected/defined attributes.
Consistency: This is measured by comparing the data captured in multiple systems, converging them, and showing a single picture (single source of truth).
Timeliness: This is measured by the ability to provide high-quality data to the right people in the right context at a specified/defined time.
Metadata: This is measured by the amount of additional data about captured data. As the term suggests, it is data about data, which is useful for defining or getting more value about the data itself.
Data lineage: Keeping track of data across a data life cycle can have immense benefits to the organization. Such traceability of data can provide very interesting business insights to an organization.

There are characteristics/dimensions other than what have been described in the preceding section, which can also determine the quality of data. But this is just detailed in the right amount here so that at least you have this concept clear in the head; these will become clearer as you go through the next chapters in this book.

Where does this data live in an enterprise?

The data in an enterprise lives in different formats in the form of raw data, binaries (images and videos), and so on, and in different application's persistent storage internally within an organization or externally in a private or public cloud. Let's first classify these different types of data. One way to categorize where the data lives is as follows:

Intranet (within enterprise)
Internet (external to enterprise)

Another way in which data living in an enterprise can be categorized is in the form of different formats in which they exist, as follows:

Data stores or persistent stores (RDBMS or NoSQL)
Traditional data warehouses (making use of RDBMS, NoSQL etc.)
File stores

Now let's get into a bit more detail about these different data categories.

Intranet (within enterprise)

In simple terms, enterprise data that only exists and lives within its own private network falls in the category of intranet.

Various applications within an enterprise exist within the enterprise's own network, and access is denied to others apart from designated employees. Due to this reason, the data captured using these applications lives within an enterprise in a secure and private fashion.

The applications churning out this data can be data of the employees or various transactional data captured while using enterprises in day-to-day applications.

Technologies used to establish intranet for an enterprise include Local Area Network (LAN) and Wide Area Network (WAN). Also, there are multiple application platforms that can be used within an enterprise, enabling intranet culture within the enterprise and its employees. The data could be stored in a structured format in different stores, such as traditional RDBMS and NoSQL databases. In addition to these stores, there lies unstructured data in the form of different file types. Also, most enterprises have traditional data warehouses, where data is cleansed and made ready to be analyzed.

Internet (external to enterprise)

A decade or so ago, most enterprises had their own data centers, and almost all the data would reside in that. But with the evolution of cloud, enterprises are looking to put some data outside their own data center into cloud, with security and controls in place, so that the data is never accessed by unauthorized people. Going the cloud way also takes a good amount of operational costs away from the enterprise, and that is one of the biggest advantages. Let's get into the subcategories in this space in more detail.

Business applications hosted in cloud

With the various options provided by cloud providers, such as SaaS, PaaS, IaaS, and so on, there are ways in which business applications can be hosted in cloud, taking care of all the essential enterprise policies and governance. Because of this, many enterprises have chosen this as a way to host internally developed applications in these cloud providers. Employees use these applications from the cloud and go about doing their day-to-day operations very similar to how they would have for a business application hosted within an enterprise's own data center.

Third-party cloud solutions

With so many companies now providing their applications/services hosted in cloud, enterprises needing them could use these as is and not worry about maintaining and managing on-premises infrastructure. These products, just by being on the cloud, provide enterprises with huge incentives with regard to how they charge for these services.

Due to this benefit, enterprises favorably choose these cloud products, and due to its mere nature, the enterprises now save their data (very specific to their business)in the cloud on someone else infrastructure, with the cloud provider having full control on how these data live in there.

Google BigQuery is one such piece of software, which, as a service product, allows us to export the enterprise data to their cloud, running this software for various kinds of analysis work. The good thing about these products is that after the analysis, we can decide on whether to keep this data for future use or just discard it. Due to the elastic (ability to expand and contract at will, with regard to hardware in this case) nature of cloud, you can very well ask for a big machine if your analysis is complex, and after use, you can just discard or reduce these servers back to their old configuration.

Due to this nature, Google BigQuery calls itself anEnterprise Cloud Data Warehouse, and it does stay true to its promise. It gives speed and scale to enterprises along with the important security, reliability, and availability. It also gives integration with other similar software products again in cloud for various other needs.

Google BigQuery is just one example; there are other similar software available in cloud with varying degrees of features. Enterprises nowadays need to do many things quickly, and they don't want to spend time doing research on this and hosting these in their own infrastructure due to various overheads; these solutions give all they want without much trouble and at a very handy price tag.

The list of such solutions at this stage is ever growing, and I don't think that naming these is required. So we picked BigQuery as an example to explain this very nature.

Similar to software as a service available in the cloud, there are many business applications available in cloud as services. One such example is Salesforce. Basically, Salesforce is a Customer Relationship Management (CRM) solution, but it does have many packaged features in it. It's not a sales pitch, but I just want to give some very important features such business applications in cloud bring to the enterprise. Salesforce brings all the customer information together and allows enterprises to build a customer-centric business model from sales, business analysis, and customer service.

Being in cloud, it also brings many of the features that software as a service in cloud brings.

Because of the ever-increasing impact of cloud on enterprises, a good amount of enterprise data now lives on the Internet (in cloud), obviously taking care of privacy and other common features an enterprise data should comply with to safeguard enterprise’s business objectives.

Social data (structured and unstructured)

Social connection of an enterprise nowadays is quite crucial, and even though enterprise data doesn’t live in social sites, it does have rich information fed by the real customer on enterprise business and its services.

Comments and suggestions on these special sites can indeed be used to reinvent the way enterprises do business and interact with the customers.

Comments in these social sites can damage the reputation and brand of an enterprise if no due diligence in taken on these comments from customers. The enterprise takes these social sites really seriously nowadays, because of which even though it doesn't have enterprise data, it does have customer reviews and comments, which, in a way, show how customer perceive the brand.

Because of this nature, I would like to classify this data also as enterprise data fed in by non-enterprise users. Its very important to take care of the fourth V, namely veracity in big data while analyzing this data as there are people out there who want to use these just as channels to get some undue advantages while dealing with the enterprise in the process of the business.
Another way of categorizing enterprise data is by the way the data is finally getting stored. Let's see this categorization in more detail in the following section.

Data stores or persistent stores (RDBMS or NoSQL)

This data, whether on premises (enterprise infrastructure) or in cloud, is stored as structured data in the so-called traditional RDBMS or new generation NoSQL persistent stores. This data comes into these stores through business applications, and most of the data is scattered in nature, and enterprises can easily find a sense of each and every data captured without much trouble. The main issue when data is stored in a traditional RDBMS kind of store is when the amount of data grows beyond an acceptable state. In that situation, the amount of analysis that we can make of the data takes a good amount of effort and time. Because of this, enterprises force themselves to segregate this data into production (data that can be queried and made use of by the business application) and non-production (data that is old and not in the production system, rather moved to a different storage).

Because of this segregation, analysis usually spans a few years and doesn't give enterprises a large span of how the business was dealing with certain business parameters. Say for example, if the production has five years of sales data, and 15 years of sales data is in the non-production storage, the users, when dealing with sales data analysis, just have a view of the last five years of data. There might be trends that are changing every five years, and this can only be known when we do an analysis of 20 years of sales data. Most of the time, because of RDBMS, storing and analyzing huge data is not possible. Even if this is possible, it is time consuming and doesn't give a great deal of flexibility, which an analyst looks for. This renders to the analyst a certain restricted analysis, which can be a big problem if the enterprise is looking into this data for business process tweaks.

The so-called new generation NoSQL (different databases in this space have different capabilities) gives more flexibility on analysis and the amount of data storage. It also gives the kind of performance and other aspects that analysts look for, but it still lacks certain aspects.

Even though the data is stored in an individual business application, it doesn't have a single view from various business application data, and that is what implementing a proper Data lake would bring into the enterprise.

Traditional data warehouse

As explained in the previous section, due to the amount of data captured in production business applications, almost all the time, the data in production is segregated from non-production. The non-production data usually lives in different forms/areas of the enterprise and flows into a different data store (usually RDBMS or NoSQL) called the data warehouse. Usually, the data is cleansed and cut out as required by the data analyst. Cutting out the data again puts a boundary on the type of analysis an analyst can do on the data. In most cases, there should be hidden gems of data that haven’t flown into the data warehouse, which would result in more analysis, using which the enterprises can tweak certain processes; however, since they are cleansed and cut out, this innovative analysis never happens. This aspect is also something that needs correction. The Data lake approach explained in this book allows the analyst to bring in any data captured in the production business application to do any analysis as the case may be.

The way these data warehouses are created today is by employing an ETL (Extract, Transform, Load) from the production database to the data warehouse database. ETL is entrusted with cleaning the data as needed by the analyst who works with these data warehouses for various analyses.

File stores

Business applications are ever changing, and new applications allow the end users to capture data in different formats apart from keying in data (using a keyboard), which are structured in nature.

Another way in which the end users now feed in data is in the form of documents in different formats. Some of the well-known formats are as follows:

Different document formats (PDF, DOC, XLS, and so on)
Binary formats
- Image-based formats (JPG, PNG, and so on)
- Audio formats (MP3, RAM, AC3)
- Video formats (MP4, MPEG, MKV)

As you saw in the previous sections, dealing with structured data itself is in question, and now we are bringing in the analysis of unstructured data. But analysis of this data is also as important nowadays as structured ones. By implementing Data lake, we could bring in new technologies surrounding this lake, which will allow us to make some good value out of this unstructured data as well, using the latest and greatest technologies in this space.

Apart from various file formats and data living in it, we have many applications that allow end users to capture a huge amount of data in the form of sentences, which also need analysis. To deal with these comments from end users manually is a Herculean task, and in this modern age, we need to decipher the sentences/comments in an automatic fashion and get a view of their sentiment. Again, there are many such technologies available that can make sense of this data (free flowing text) and help enterprises deal with it in the right fashion.

For example, if we do have a suggestion capturing system in place for an enterprise and (let's say) we have close to 1000 suggestions that we get in a day, because of the nature of the business, it's very hard to get into the filtering of these suggestions. Here, we could use technologies aiding in the sentiment analysis of these comments, and according to the rating these analysis tools provide, perform an initial level of filtering and then hand it over to the human who can understand and make use of it.

Enterprise's current state

As explained briefly in the previous sections, the current state of enterprise data in an organization can be summarized in bullets points as follows:

Conventional DW (Data Warehouse) /BI (Business Intelligence):
- Refined/ cleansed data transferred from production business application using ETL.
- Data earlier than a certain period would have already been transferred to a storage, which is hard to retrieve, such as magnetic tape storage.
- Some of its notable deficiencies are as follows:
  - A subset of production data in a cleansed format exists in DW; for any new element in DW, effort has to be made
  - A subset of the data is again in DW, and the rest gets transferred to permanent storage
  - Usually, analysis is really slow, and it is optimized again to perform queries, which are, to an extent, defined

Siloed Big Data:
- Some departments would have taken the right step in building big data. But departments generally don’t collaborate with each other, and this big data becomes siloed and doesn't give the value of a true big data for the enterprise.
- Some of its deficiencies are as follows:
  - Because of its siloed nature, the analyst is again constrained and not able to mix and match data between departments.
  - A good amount of money would have been spent to build and maintain/manage this and usually over a period of time is not sustainable.
Myriad of non-connected applications:
- There is a good amount of applications on premises and on cloud.
- Applications apart from churning structured data also produce unstructured data.
- Some of the deficiencies are as follows:
  - Don't talk to each other
  - Even if it talks, data scientists are not able to use it in an effective way to transform the enterprise in a meaningful way
  - Replication of technology usage for handling many aspects in each business application

We wouldn't say that creating or investing in Data lake is a silver bullet to solve all the aforementioned deficiencies. But it is definitely a step in the right direction, and every enterprise should at least spend some time discussing whether this is indeed required, and if it is a yes, don't deliberate over it too much and take the next step in the path of implementation.

Data lake is an enterprise initiative, and when built, it has to be with the consent of all the stakeholders, and it should have buy-ins from the top executives. It can definitely find ways to improve processes by which enterprises do business. It can help the higher management know more about their business and can increase the success rate of the decision-making process.

Enterprise digital transformation

Digital transformation is the application of digital technologies to fundamentally impact all aspects of business and society.

- infoworld.com

Digital transformation (DX) is an industry buzzword and a very strong initiative that every enterprise is taking without much deliberation. As the word suggests, it refers to transforming enterprises with information technology as one of its core pillars. Investing in technologies would definitely happen as part of this initiative, but data is one of the key aspects in achieving the so-called transformation.

Enterprises has known the importance of data and its analysis more and more in recent times, and that has definitely made every enterprise think out-of-the-box; this initiative is a way to establish data at the center.

As part of this business transformation, enterprises should definitely have Data lake as one of the core investments, with every department agreeing to share their data to flow into this Data lake, without much prejudice or pride.

Enterprises embarking on this journey

A Forrester Consulting research study commissioned by Accenture Interactive found that the key drivers of digital transformation are profitability, customer satisfaction, and increased speed-to-market.

Many enterprises are, in fact, already on the path of digital transformation. It is no more a buzzword, and enterprises are indeed making every effort to transform themselves using technology as one of the drivers and, you guessed it right, the other one being data.

Enterprises taking this path have clearly defined objectives. Obviously, this changes according to the enterprise and the business they are in. But some of the common ones, in no particular order, are as follows:

Radically improve customer experience
Reduce cost
Increase in revenue
Bring in differentiation from competitors
Tweak business processes and, in turn, change the business model

Some examples

There are a number of clear examples about what enterprises want to achieve in this space, some of which are as follows:

Ability to segment customers and give them personalized products. Targeting campaigns to the right person at the right time.
Bringing in more technologies and reducing manual work, basically digitizing many aspects in the enterprise.
Using social information clubbed together with enterprise data to make some important decisions.
Predicting the future in a more quantitative fashion and taking necessary steps and also preparing accordingly, well in advance, obviously.
Taking business global using technology as an important vehicle.

The next section details one of the use cases that enterprises want to achieve as part of digital transformation, with data as the main contributor. Understanding the use case is important as this is the use case we will try to implement in this book throughout.

Data lake use case enlightenment

We saw the importance of data in an enterprise. What enterprises face today is how to mine this data for information that can be used in favor of the business.

Even if we are able to bring this data into one place somehow, it's quite difficult to deal with this huge quantity of data and that too in a reasonable time. This is when the significance of Data lake comes into the picture. The next chapter details, in a holistic fashion, what Data lake is. Before getting there, let's detail the use case that we are trying to achieve throughout this book, with Data lake taking the center stage.

Data lake implementation using modern technologies would bring in many benefits, some of which are given as follows:

Ability for business users, using various analyzes, to find various important aspects in the business with regard to people, processes, and also a good insight into various customers
Allowing the business to do these analytics in a modest time frame rather than waiting for weeks or months
Performance and quickness of data analysis in the hands of business users to quickly tweak business processes

The use case that we will be covering throughout this book is called Single Customer View. Single Customer View (SCV) is a well-known term in the industry, and so it has quite a few definitions, one of which is as follows:

A Single Customer View is an aggregated, consistent and holistic representation of the data known by an organisation about its customers.

- Wikipedia

Enterprises keeps customer data in varying degrees siloed in different business applications. The use case aims at collating these varying degrees of data from these business applications into one and helping the analysts looking at this data create a single customer view with all the distinct data collected. This single view brings in the capability of segmenting customers and helping the business to target the right customers with the right content.

The significance of this use case for the enterprise can be narrowed down to points as listed next:

Customer segmentation
Collating information
Improving customer relations and, in turn, bringing is retention
Deeper analytics/insight, and so on

Conceptually, the following figure (Figure 05) summarizes the use case that we plan to implement throughout this book. Structured, semi-structured, and unstructured data is fed into the Data lake. From the Data lake, the Single Customer View (SCV) is derived in a holistic fashion. The various data examples are also depicted in each category, which we will implement in this book. Doing so gives a full use of a Data lake in an enterprise and is more realistic:

Figure 05: Conceptual view of Data lake use case for SCV

Figure 05 shows that our Data lake acquires data from various sources (variety), has different velocities and volumes. This is more a conceptual high-level view of what we will be achieving after going through the whole book.

We are really excited, and we hope you are, too!

Filter reviews by

All

Amazon verified reviews

Sherihan Sheriff Dec 12, 2017

An excellent guide for both beginners and seasoned professionals that gives a practical insight on building a data lake using Big data technologies. Looking forward to more similar work from the authors in future.

Amazon Verified review

aussiejim Sep 30, 2018

I like the diagrams that simplified the various conceptsall in all I found this a useful resource

Anonymous Jun 04, 2019

I am writing a detailed review in hopes that it will help others decide if this book is right for them. More importantly, I hope that the author will see these comments and correct some of the current issues in the next version.I was looking for a book to increase my knowledge of data lake implementation patterns, with technical details on batch vs real time processing, data storage, and data processing strategies. I liked the outline and approach the author chose to discuss these topics (refer to TOC), and it did contain some useful information that I was able to apply to my situation. For anyone using the Apache tools it describes several of the major technologies and when to and when not to use them. I was able to apply these to other technologies as well.The problem is that the book is full of bad grammar, misspelled words (e.g., “willn’t”), wordy/repetitive sentences (see example below), and sections where the pictures don’t match the accompanying text (e.g., the author refers to colors on a B/W picture). I give the book 3 ½ stars out of 5 in its current state. It would be a 4 if they had a tech writer proof read it, fix the grammar issues and rewrite some of the sentences to be easier to understand. With a second pass to fix consistency issues, it would be a solid 5.Detailed example of wordy/repetitive sentences…(Coped from Chapter 8 – Data Processing using Apache Flink)The technology that we have shortlisted to do this very important job of processing data is Apache Flink. I have to say that this selection was quite difficult as we have another technology in mind, namely Apache Spark, which was really strong in this area and more matured. But we decided to go with Flink in the end considering its pros. However, we have also detailed Spark a bit as opposed to other chapters in which we have just named other options and left it, because of its significance in this space.(2 pages later)For covering our use case and to build Data Lake we use Apache Flink in this layer as the technology. Other strong technology choices namely Apache Spark will also be explained a bit as we do feel that this is an equally good choice, in this layer. This chapter dives deep into Flink, though.(next page)The technology choice in this layer was really tough for us. Apache Spark was initially our choice, but Apache Flink had something in it that made us think over and at the time of writing this book, the industry did have some pointers favoring Flink and this made us do the final choice as Flink. However, we could have implemented this layer using Spark and it would have worked well for sure.After 50 pages of Flink related discussions, there is ½ page high level overview of Apache Spark.

Dimitri Shvorob Apr 04, 2018

If it is "data lake" that piqued your interest, don't bother - as far as I can tell, it is just the current buzzword for company's data estate. "Data Lake for Enterprises" is a big-data book, starting with a discussion of Nathan Marz's "lambda architecture" and continuing with a tour of a set of big-data technologies which could be used to flesh out elements of that architecture. The Manning-published "Big Data" by Marz and Warren immediately suggests itself as an alternative, and I am sure that others exists - it's too bad that the earlier reviews mention none. Unfortunately, I am not a big-data guy, and cannot offer competent advice. I can say that (a) Stephen Yegge's complaints are overblown - as could be expected from Packt, the book is sloppily written and never proof-read, but it is not difficult to understand, (b) when skimmed, the book has made a decent impression.

VG Dec 22, 2017

Expected a lot more. It is lots of small bits and pieces of information trying to touch too many topics, mixing concepts (Very little of it) and implementation products (more of it).