Components of the Big Data ecosystem
The next step on the journey to Big Data is to understand its levels and layers of abstraction, and the components that sit at each of them. The following figure depicts some common components of Big Data analytical stacks and their integration with each other. The caveat here is that, in most cases, HDFS/Hadoop forms the core of a Big-Data-centric application, but that's not a universal rule of thumb.
Speaking about Big Data in generic terms, its components are as follows:
- A storage system can be one of the following:
- HDFS (short for Hadoop Distributed File System) is the storage layer that handles the storing of data, as well as the metadata required to complete the computation
- NoSQL stores, which can be tabular stores such as HBase or column-family stores such as Cassandra
- A computation or logic layer can be one of the following:
- MapReduce: This is a combination of two separate phases, the mapper and the reducer. The mapper executes first, taking the raw dataset and transforming it into key-value pairs. The reducer then kicks in, taking the output created by the mapper job as its input, and collating and converging it into a smaller, aggregated dataset (see the word-count sketch after this list).
- Pig: This is another platform that sits on top of Hadoop for processing, and it can be used in conjunction with or as a substitute for MapReduce. It is a high-level language and is widely used for creating processing components to analyze very large datasets. One of its key aspects is that its structure is amenable to substantial parallelization. At its core, it has a compiler that translates Pig scripts into MapReduce jobs.
It is used very widely because:
- Programming in Pig Latin is easy
- Optimizing the jobs is efficient and easy
- It is extensible
- Application logic or interaction can be one of the following:
- Hive: This is a data warehousing layer built on top of the Hadoop platform. In simple terms, Hive provides a facility to interact with, process, and analyze HDFS data using Hive queries, which are very much like SQL (see the JDBC sketch after this list). This makes the transition from the RDBMS world to Hadoop easier.
- Cascading: This is a framework that exposes a set of data processing APIs and other components that define, share, and execute data processing over the Hadoop/Big Data stack. It is essentially an abstracted API layer over Hadoop, and it is widely used for application development because of its ease of development, job creation, and job scheduling.
- Specialized analytics databases, such as:
- Databases such as Netezza or Greenplum have the capability to scale out and are known for very fast data ingestion and refresh, which is a mandatory requirement for analytical models.
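To make the mapper/reducer split described above concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce API. It is only a sketch: the HDFS input and output paths (/data/input, /data/output) are hypothetical placeholders, not part of any particular deployment.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: transforms raw input lines into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: collates the mapper output into a smaller, aggregated dataset.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Hypothetical HDFS paths; replace with real input/output locations.
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```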
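Similarly, for the Hive bullet above, the following sketch issues a SQL-like HiveQL query through the HiveServer2 JDBC driver. The connection URL, credentials, and the table and column names (card_transactions, txn_date) are assumptions made purely for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the URL, database, and table below are hypothetical.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // Plain SQL-like HiveQL: aggregate transactions per day.
      ResultSet rs = stmt.executeQuery(
          "SELECT txn_date, COUNT(*) AS txns " +
          "FROM card_transactions " +
          "GROUP BY txn_date " +
          "ORDER BY txn_date");

      while (rs.next()) {
        System.out.println(rs.getString("txn_date") + " -> " + rs.getLong("txns"));
      }
    }
  }
}
```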
The Big Data analytics architecture
Now that we have skimmed through the Big Data technology stack and its components, the next step is to go through the generic architecture of analytical applications.
We will continue the discussion with reference to the following figure:
If you look at the diagram, there are four steps in the workflow of an analytical application, which in turn drive its design and architecture:
- Business solution building (dataset selection)
- Dataset processing (analytics implementation)
- Automated solution
- Measured analysis and optimization
Now, let's dive deeper into each segment to understand how it works.
Building business solutions
This is the first and most important step for any application. This is where the application architects and designers identify and decide upon the data sources that will provide the input data to the application for analytics. The data could come from a client dataset, a third party, or some kind of static/dimensional data (such as geo coordinates, postal codes, and so on). While designing the solution, the input data can be segmented into business-process-related data, business-solution-related data, or data for technical process building. Once the datasets are identified, we can move to the next step.
Dataset processing
By now, we understand the business use case and the dataset(s) associated with it. The next steps are data ingestion and processing. It's not quite that simple: we may want to make use of an ingestion process and, more often than not, architects end up creating an ETL (short for Extract, Transform, Load) pipeline. During the ETL step, filtering is executed so that we only apply processing to meaningful and relevant data. This filtering step is very important; it is where we attempt to reduce the volume so that we only have to analyze meaningful/valued data, and thus handle the velocity and veracity aspects. Once the data is filtered, the next step could be integration, where the filtered data from various sources reaches the landing data mart. The next step is transformation, where the data is converted to an entity-driven form, for instance, a Hive table, JSON, or a POJO, marking the completion of the ETL step. This makes the data ingested into the system available for actual processing; a minimal sketch of the filter-and-transform portion is shown below.
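As a minimal, hypothetical illustration of the filter-and-transform portion of such an ETL step, the sketch below takes raw comma-separated transaction records, drops malformed or zero-value rows, and converts the rest into a simple entity (POJO). The record layout and the filtering rule are assumptions, not part of any specific pipeline.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class EtlSketch {

  // A simple entity-driven form (POJO) that the transformed records map into.
  static class Transaction {
    final String cardId;
    final double amount;
    Transaction(String cardId, double amount) {
      this.cardId = cardId;
      this.amount = amount;
    }
    @Override
    public String toString() {
      return "Transaction{cardId=" + cardId + ", amount=" + amount + "}";
    }
  }

  public static void main(String[] args) {
    // Hypothetical raw feed: "cardId,amount"; some rows are malformed or zero-value noise.
    Stream<String> rawFeed = Stream.of(
        "C-1001,250.00",
        "C-1002,0.00",       // filtered out: no analytical value
        "garbage-row",       // filtered out: malformed
        "C-1003,78.25");

    List<Transaction> transformed = rawFeed
        .map(line -> line.split(","))
        .filter(fields -> fields.length == 2)              // filter: drop malformed rows
        .map(fields -> {
          try {
            return new Transaction(fields[0], Double.parseDouble(fields[1]));
          } catch (NumberFormatException e) {
            return null;                                   // unparsable amount
          }
        })
        .filter(txn -> txn != null && txn.amount > 0.0)    // filter: keep meaningful records only
        .collect(Collectors.toList());                     // transform complete: ready for the landing mart

    transformed.forEach(System.out::println);
  }
}
```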
Depending upon the use case and the duration for which a given dataset is to be analyzed, it's loaded into the analytical data mart. For instance, my landing data mart may have a year's worth of credit card transactions, but I just need one day's worth of data for analytics. Then, I would have a year's worth of data in the landing mart, but only one day's worth of data in the analytics mart. This segregation is extremely important because it helps in figuring out where I need real-time compute capability and which data my deep learning application operates upon.
Solution implementation
Now, we will implement various aspects of the solution and integrate them with the appropriate data mart. We can have the following:
- Analytical Engine: This executes various batch jobs, statistical queries, or cubes on the landing data mart to arrive at the projections and trends based on certain indexes and variances.
- Dashboard/Workbench: This presents close-to-real-time information on UX interfaces. These components generally operate on low-latency, close-to-real-time analytical data marts.
- Auto-learning synchronization mechanism: Perfect for advanced analytical applications, this captures patterns and evolves the data management methodologies. For example, if I am a mobile operator at a tourist destination, I might be more cautious about my resource usage during the day and at weekends; but over a period of time I may learn that I see a surge in roaming during vacations, so I can build these rules into my data mart and ensure that data is stored and structured in an analytics-friendly manner.
Presentation
Once the data is analyzed, the next and most important step in the life cycle of any application is the presentation/visualization of the results. Depending upon the target audience of end business users, the data visualization can be achieved using a custom-built UI presentation layer, business insight reports, dashboards, charts, graphs, and so on.
The requirement could vary from auto-refreshing UI widgets to canned reports and ad hoc queries.