Big data ecosystem

If you work in almost any mainstream industry today, chances are that you may have heard of some of the following terms and phrases:

  • Big data
  • Distributed, scalable, and elastic
  • On-premise versus the cloud
  • SQL versus NoSQL
  • Artificial intelligence, machine learning, and deep learning

But what do all these terms and phrases actually mean, how do they all fit together, and where do you start? The aim of this section is to answer all of those questions in a clear and concise manner.

Horizontal scaling

First of all, let's return to some of the data-centric problems that we described earlier. Given the huge explosion in the mass creation and consumption of data today, clearly we cannot simply keep adding CPUs, memory, and/or hard drives to a single machine (in other words, vertical scaling). If we did, there would very quickly come a point where migrating to more powerful hardware would lead to diminishing returns while incurring significant costs. Furthermore, the ability to scale would be physically bounded by the biggest machine available to us, thereby limiting the growth potential of an organization.

Horizontal scaling, of which sharding is an example, is the process by which we can increase or decrease the amount of computational resources available to us via the addition or removal of hardware and/or software. Typically, this would involve the addition (or removal) of servers or nodes to a cluster of nodes. Crucially, however, the cluster acts as a single logical unit at all times, meaning that it will continue to function and process requests regardless of whether resources are being added to it or taken away. The difference between horizontal and vertical scaling is illustrated in Figure 1.2:

Figure 1.2: Vertical scaling versus horizontal scaling

Distributed systems

Horizontal scaling allows organizations to become much more cost efficient when data and processing requirements grow beyond a certain point. But simply adding more machines to a cluster would not be of much value by itself. What we now need are systems that are capable of taking advantage of horizontal scalability and that work across multiple machines seamlessly, irrespective of whether the cluster contains one machine or 10,000 machines.

Distributed systems do precisely that—they work seamlessly across a cluster of machines and automatically deal with the addition (or removal) of resources from that cluster. Distributed systems can be broken down into the following types:

  • Distributed filesystems
  • Distributed databases
  • Distributed processing
  • Distributed messaging
  • Distributed streaming
  • Distributed ledgers

Distributed data stores

Let's return to the problems faced by a single-machine RDBMS. We have seen how sharding can be employed as one method to scale relational databases horizontally in order to optimize costs as data grows for small- to medium-sized databases. However, the issue with sharding is that each node acts in a standalone manner with no knowledge of the other nodes in the cluster, meaning that a custom broker is required to both partition the data across the shards and to process read and write requests.

Distributed data stores, on the other hand, work out of the box as a single logical unit spanning a cluster of nodes.

Note that a data store is just a general term used to describe any type of repository used to persist data. Distributed data stores extend this by storing data on more than one node, and often employ replication.

Client applications view the distributed data store as a single entity, meaning that no matter which node in the cluster physically handles the client request, the same results will be returned. Distributed filesystems, such as the Apache Hadoop Distributed File System (HDFS) discussed in the next section, belong to the class of distributed data stores and are used to store files in their raw format. When data needs to be modeled in some manner, then distributed databases can be used. Depending on the type of distributed database, it can either be deployed on top of a distributed filesystem or not.

Distributed filesystems

Think of the hard drive inside your desktop, laptop, smartphone, or other personal device you own. Files are written to and stored on local hard drives and retrieved as and when you need them. Your local operating system manages read and write requests to your local hard drive by maintaining a local filesystem—a means by which the operating system keeps track of how the disk is organized and where files are located.

As your personal data footprint grows, you take up more and more space on your local hard drive until it reaches its capacity. At that point, you may seek to purchase a larger capacity hard drive to replace the one inside your device, or you may seek to purchase an extra hard drive to complement your existing one. In the case of the latter, you personally manage which of your personal files reside on which hard drive, or perhaps use one of them to archive files you rarely use to free up space on your primary drive. Hopefully, you also maintain backups of your personal files should the worst happen and your device or primary hard drive malfunction!

A distributed filesystem (DFS) extends the notion of local filesystems, while offering a number of useful benefits. In a distributed filesystem within the context of our big data ecosystem, data is physically split across the nodes and disks in a cluster. Like distributed data stores in general, a distributed filesystem provides a layer of abstraction and manages read and write requests across the cluster itself, meaning that the physical split is invisible to requesting client applications, which view the distributed filesystem as one logical entity, just like a conventional local filesystem.

Furthermore, distributed filesystems provide useful benefits out of the box, including the following:

  • Data replication, where data can be configured to be automatically replicated across the cluster for fault tolerance in the event one or more of the nodes or disks should fail
  • Data integrity checking
  • The ability to persist huge files, typically gigabytes (GB) to terabytes (TB) in size, which would not normally be possible on conventional local filesystems

The HDFS is a well-known example of a distributed filesystem within the context of our big data ecosystem. In the HDFS, a master/slave architecture is employed, consisting of a single NameNode that manages the distributed filesystem, and multiple DataNodes, which typically reside on each node in the cluster and manage the physical disks attached to that node as well as where the data is physically persisted. Just as with traditional filesystems, HDFS supports standard filesystem operations, such as opening and closing files and directories. When a client application requests a file to be written to the HDFS, it is split into one or more blocks that are then mapped by the NameNode to the DataNodes, where they are physically persisted. When a client application requests a file to be read from the HDFS, the NameNode supplies the relevant block locations and the DataNodes serve the data to fulfill the request.

One of the core benefits of HDFS is that it provides fault tolerance inherently through its distributed architecture, as well as through data replication. Since typically there will be multiple nodes (potentially thousands) in an HDFS cluster, it is resilient to hardware failure as operations can be automatically offloaded to the healthy parts of the cluster while the non-functional hardware is being recovered or replaced. Furthermore, when a file is split into blocks and mapped by the NameNode to the DataNodes, these blocks can be configured to automatically replicate across the DataNodes, taking into account the topology of the HDFS cluster.

Therefore, if a failure did occur, for example, a disk failure on one of the DataNodes, data would still be available to client applications. The high-level architecture of an HDFS cluster is illustrated in Figure 1.3:

Figure 1.3: Apache Hadoop distributed filesystem high-level architecture

To learn more about the Apache Hadoop framework, please visit http://hadoop.apache.org/.
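To make these filesystem-style operations a little more concrete, the following Python sketch uses the third-party hdfs (HdfsCLI) package to talk to a NameNode over its WebHDFS interface; the endpoint, user, and paths are assumptions made purely for illustration and are not taken from this book's examples:

from hdfs import InsecureClient

# Connect to an (assumed) WebHDFS endpoint exposed by the NameNode
client = InsecureClient('http://namenode-host:9870', user='analyst')

# Standard filesystem operations: create a directory and write a file.
# Behind the scenes, the NameNode maps the file to blocks on the DataNodes.
client.makedirs('/data/raw')
client.write('/data/raw/example.txt', data='hello distributed world\n',
             encoding='utf-8', overwrite=True)

# List the directory and read the file back; the DataNodes serve the blocks
print(client.list('/data/raw'))
with client.read('/data/raw/example.txt', encoding='utf-8') as reader:
    print(reader.read())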

Distributed databases

Distributed filesystems, like conventional filesystems, are used to store files. In the case of distributed filesystems such as the HDFS, these files can be very large, but ultimately they are still just files. When data requires modeling, we need something more than a filesystem; we need a database.

Distributed databases, just like single-machine databases, allow us to model our data. Unlike single-machine databases, however, the data, and the data model itself, spans, and is preserved across, all the nodes in a cluster acting as a single logical database. This means that not only can we take advantage of the increased performance, throughput, fault tolerance, resilience, and cost-effectiveness offered by distributed systems, but we can also model our data and thereafter query that data efficiently, no matter how large it is or how complex the processing requirements are. Depending on the type of distributed database, it can either be deployed on top of a distributed filesystem (such as Apache HBase deployed on top of the HDFS) or not.

In our big data ecosystem, it is often the case that distributed filesystems such as the HDFS are used to host data lakes. A data lake is a centralized data repository where data is persisted in its original raw format, such as files and object BLOBs. This allows organizations to consolidate their disparate raw data estate, including structured and unstructured data, into a central repository with no predefined schema, while offering the ability to scale over time in a cost-effective manner.

Thereafter, in order to actually deliver business value and actionable insight from this vast repository of schema-less data, data processing pipelines are engineered to transform this raw data into meaningful data conforming to some sort of data model that is then persisted into serving or analytical data stores typically hosted by distributed databases. These distributed databases are optimized, depending on the data model and type of business application, to efficiently query the large volumes of data held within them in order to serve user-facing business intelligence (BI), data discovery, advanced analytics, and insights-driven applications and APIs.

Examples of distributed databases include the following:

Apache Cassandra is an example of a distributed database that employs a masterless architecture with no single point of failure, supporting high throughput when processing huge volumes of data. In Cassandra, there is no master copy of the data. Instead, data is automatically partitioned, based on partitioning keys and other features inherent to how Cassandra models and stores data, and replicated, based on a configurable replication factor, across other nodes in the cluster. Since the concept of master/slave does not exist, a gossip protocol is employed so that the nodes in the Cassandra cluster may dynamically learn about the state and health of other nodes.

In order to process read and write requests from a client application, Cassandra will automatically elect a coordinator node from the available nodes in the cluster, a process that is invisible to the client. To process write requests, the coordinator node will, based on the partitioning features of the underlying distributed data model employed by Cassandra, contact all applicable nodes where the write request and replicas should be persisted to. To process read requests, the coordinator node will contact one or more of the replica nodes where it knows the data in question has been written to, again based on the partitioning features of Cassandra. The underlying architecture employed by Cassandra can therefore be visualized as a ring, as illustrated in Figure 1.4. Note that although the topology of a Cassandra cluster can be visualized as a ring, that does not mean that a failure in one node results in the failure of the entire cluster. If a node becomes unavailable for whatever reason, Cassandra will simply continue to write to the other applicable nodes that should persist the requested data, while maintaining a queue of operations pertaining to the failed node. When the non-functional node is brought back online, Cassandra will automatically update it:

Figure 1.4: Cassandra topology illustrating a write request with a replication factor of 3

NoSQL databases

Relational Database Management Systems, such as Microsoft SQL Server, PostgreSQL, and MySQL, allow us to model our data in a structured manner across tables representing the entities in our data model; these entities may be identified by primary keys and linked to other entities via foreign keys. These tables are pre-defined, with a schema consisting of columns of various data types that represent the attributes of the entity in question.

For example, Figure 1.5 describes a very simple relational schema that could be utilized by an e-commerce website:

Figure 1.5: A simple relational database model for an e-commerce website

NoSQL, on the other hand, simply refers to the class of databases where data is not modeled in a conventional relational manner. So, if data is not modeled relationally, how is it modeled in NoSQL databases? The answer is that there are various types of NoSQL databases depending on the use case and business application in question. These various types are summarized in the following sub-sections.

It is a common, but mistaken, assumption that NoSQL is a synonym for distributed databases. In fact, there is an ever-increasing list of RDBMS vendors whose products are designed to be scalable and distributed to accommodate huge volumes of structured data. The mistaken assumption arose because, in real-world implementations, NoSQL databases are often used to persist huge amounts of structured, semi-structured, and unstructured data in a distributed manner, hence they have become synonymous with distributed databases. However, like relational databases, NoSQL databases are designed to work even on a single machine. It is the way that data is modeled that distinguishes relational, or SQL, databases from NoSQL databases.

Document databases

Document databases, such as Apache CouchDB and MongoDB, employ a document data model to store semi-structured and unstructured data. In this model, a document is used to encapsulate all the information pertaining to an object, usually in JavaScript Object Notation (JSON) format, meaning that a single document is self-describing. Since they are self-describing, different documents may have different schemas. For example, a document describing a movie item, as illustrated in the following JSON file, would have a different schema from a document describing a book item:

[
  {
    "title": "The Imitation Game",
    "year": 2014,
    "metadata": {
      "directors": ["Morten Tyldum"],
      "release_date": "2014-11-14T00:00:00Z",
      "rating": 8.0,
      "genres": ["Biography", "Drama", "Thriller"],
      "actors": ["Benedict Cumberbatch", "Keira Knightley"]
    }
  }
]

Because documents are self-contained representations of objects, they are particularly useful for data models in which individual objects are updated frequently, thereby avoiding the need to update the entire database schema, as would be required with relational databases. Therefore, document databases tend to be ideal for use cases involving catalogs of items, for example e-commerce websites, and content management systems such as blogging platforms.
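As a brief sketch of the document model in practice, the following Python snippet stores and queries documents with differing schemas using MongoDB's pymongo driver; the connection string, database, and collection names are assumptions made for illustration:

from pymongo import MongoClient

# Connect to an (assumed) local MongoDB instance
client = MongoClient('mongodb://localhost:27017')
items = client['catalog']['items']

# Each document is self-describing; a movie and a book can have different schemas
items.insert_one({
    'title': 'The Imitation Game',
    'year': 2014,
    'metadata': {'directors': ['Morten Tyldum'], 'rating': 8.0}
})
items.insert_one({
    'title': 'Machine Learning with Apache Spark',
    'type': 'book',
    'pages': 240
})

# Query by a nested attribute that only exists on some of the documents
for doc in items.find({'metadata.rating': {'$gte': 8.0}}):
    print(doc['title'])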

Columnar databases

Relational databases traditionally persist each row of data contiguously, meaning that each row will be stored in sequential blocks on disk. This type of database is referred to as row-oriented. For operations involving typical statistical aggregations, such as calculating the average of a particular attribute, the effect of row-oriented databases is that every attribute in each row is read during processing, regardless of whether it is relevant to the query or not. In general, row-oriented databases are best suited for transactional workloads, also known as online transaction processing (OLTP), where individual rows are frequently written to and where the emphasis is on processing a large number of relatively simple queries, such as short inserts and updates, quickly. Examples of use cases include retail and financial transactions where database schemas tend to be highly normalized.

On the other hand, columnar databases such as Apache Cassandra and Apache HBase are column-oriented, meaning that each column is persisted in sequential blocks on disk. The effect of column-oriented databases is that individual attributes can be accessed together as a group, rather than individually by row, thereby reducing disk I/O for analytical queries since the amount of data that is loaded from disk is reduced. For example, consider the following table:

Product ID | Name            | Category | Unit price
1001       | USB drive 64 GB | Storage  | 25.00
1002       | SATA HDD 1 TB   | Storage  | 50.00
1003       | SSD 256 GB      | Storage  | 60.00

In a row-oriented database, the data is persisted to disk as follows:

(1001, USB drive 64 GB, storage, 25.00), (1002, SATA HDD 1 TB, storage, 50.00), (1003, SSD 256 GB, storage, 60.00)

However, in a column-oriented database, the data is persisted to disk as follows:

(1001, 1002, 1003), (USB drive 64 GB, SATA HDD 1 TB, SSD 256 GB), (storage, storage, storage), (25.00, 50.00, 60.00)

In general, column-oriented databases are best suited for analytical workloads, also known as online analytical processing (OLAP), where the emphasis is on processing a low number of complex analytical queries typically involving aggregations. Examples of use cases include data mining and statistical analysis, where database schemas tend to be either denormalized or follow a star or snowflake schema design.
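The following in-memory Python sketch is only an analogy for the two physical layouts described above, not a representation of how any particular database lays out blocks on disk, but it illustrates why a column-oriented layout allows an aggregation to touch far less data:

# Row-oriented: each record is stored (and read) as a whole row
rows = [
    (1001, 'USB drive 64 GB', 'Storage', 25.00),
    (1002, 'SATA HDD 1 TB', 'Storage', 50.00),
    (1003, 'SSD 256 GB', 'Storage', 60.00),
]
# To average the unit price, every attribute of every row is still touched
avg_from_rows = sum(row[3] for row in rows) / len(rows)

# Column-oriented: each attribute is stored (and read) as its own sequence
columns = {
    'product_id': [1001, 1002, 1003],
    'name': ['USB drive 64 GB', 'SATA HDD 1 TB', 'SSD 256 GB'],
    'category': ['Storage', 'Storage', 'Storage'],
    'unit_price': [25.00, 50.00, 60.00],
}
# Only the unit_price column needs to be loaded for the same aggregation
avg_from_columns = sum(columns['unit_price']) / len(columns['unit_price'])

print(avg_from_rows, avg_from_columns)   # both 45.0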

Key-value databases

Key-value databases, such as Redis, Oracle Berkeley DB, and Voldemort, employ a simple key-value data model to store data as a collection of unique keys mapped to value objects. This is illustrated in the following table that maps session IDs for web applications to session data:

Key (session ID) | Value (session data)
ab2e66d47a04798  | {userId: "user1", ip: "75.100.144.28", date: "2018-09-28"}
62f6nhd47a04dshj | {userId: "user2", ip: "77.189.90.26", date: "2018-09-29"}
83hbnndtw3e6388  | {userId: "user3", ip: "73.43.181.201", date: "2018-09-30"}

Key-value data structures are found in many programming languages where they are commonly referred to as dictionaries or hash maps. Key-value databases extend these data structures through their ability to partition and scale horizontally across a cluster, thereby effectively providing huge distributed dictionaries. Key-value databases are particularly useful as a means to improve the performance and throughput of systems that are required to handle potentially millions of requests per second. Examples of use cases include popular e-commerce websites, storing session data for web applications, and facilitating caching layers.
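As a minimal sketch of this model using the redis-py client, the following snippet stores and retrieves one of the session records from the preceding table; the local Redis instance, key naming convention, and expiry time are assumptions made for illustration:

import json
import redis

# Connect to an (assumed) local Redis instance
r = redis.Redis(host='localhost', port=6379, db=0)

# Map a unique session ID key to its session data, expiring it after 30 minutes
session = {'userId': 'user1', 'ip': '75.100.144.28', 'date': '2018-09-28'}
r.setex('session:ab2e66d47a04798', 1800, json.dumps(session))

# Constant-time lookup by key
raw = r.get('session:ab2e66d47a04798')
if raw is not None:
    print(json.loads(raw))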

Graph databases

Graph databases, such as Neo4j and OrientDB, model data as a collection of vertices (also called nodes) linked together by one or more edges (also called relationships or links). In real-world graph implementations, vertices are often used to represent real-world entities such as individuals, organizations, vehicles, and addresses. Edges are then used to represent relationships between vertices.

Both vertices and edges can have an arbitrary number of key-value pairs, called properties, associated with them. For example, properties associated with an individual vertex may include a name and date of birth. Properties associated with an edge linking an individual vertex with another individual vertex may include the nature and length of the personal relationship. The collection of vertices, edges, and properties together form a data structure called a property graph. Figure 1.6 illustrates a simple property graph representing a small social network:

Figure 1.6: A simple property graph

Graph databases are employed in a wide variety of scenarios where the emphasis is on analyzing the relationships between objects rather than just the object's data attributes themselves. Common use cases include social network analysis, fraud detection, combating serious organized crime, customer recommendation systems, complex reasoning, pattern recognition, blockchain analysis, cyber security, and network intrusion detection.

Apache TinkerPop is an example of a graph computing framework that provides a layer of abstraction between the graph data model and the underlying mechanisms used to store and process graphs. For example, Apache TinkerPop can be used in conjunction with an underlying Apache Cassandra or Apache HBase database to store huge distributed graphs containing billions of vertices and edges partitioned across a cluster. A graph traversal language called Gremlin, a component of the Apache TinkerPop framework, can then be used to traverse and analyze the distributed graph using one of the Gremlin language variants including Gremlin Java, Gremlin Python, Gremlin Scala, and Gremlin JavaScript. To learn more about the Apache TinkerPop framework, please visit http://tinkerpop.apache.org/.
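The following sketch gives a flavor of the Gremlin Python language variant mentioned above; it assumes a Gremlin Server is running at the (hypothetical) endpoint shown, and the vertex labels and properties are invented for this example:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Connect to an (assumed) Gremlin Server endpoint
connection = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(connection)

# Add two person vertices and a 'knows' edge carrying a property
(g.addV('person').property('name', 'Alice').as_('a')
  .addV('person').property('name', 'Bob').as_('b')
  .addE('knows').from_('a').to('b').property('since', 2014)
  .iterate())

# Traverse the property graph: who does Alice know?
print(g.V().has('person', 'name', 'Alice').out('knows').values('name').toList())

connection.close()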

CAP theorem

As discussed previously, distributed data stores allow us to store huge volumes of data while providing the ability to horizontally scale as a single logical unit at all times. Inherent to many distributed data stores are the following features:

  • Consistency refers to the guarantee that every client has the same view of the data. In practice, this means that a read request to any node in the cluster should return the results of the most recent successful write request. Immediate consistency refers to the guarantee that the most recent successful write request should be immediately available to any client.
  • Availability refers to the guarantee that the system responds to every request made by a client, whether that request was successful or not. In practice, this means that every client request receives a response regardless of whether individual nodes are non-functional.
  • Partition tolerance refers to the guarantee of resilience given a failure in inter-node network communication. In other words, in the event that there is a network failure between a particular node and another set of nodes, referred to as a network partition, the system will continue to function. In practice, this means that the system should have the ability to replicate data across the functional parts of the cluster to cater for intermittent network failures and in order to guarantee that data is not lost. Thereafter, the system should heal gracefully once the partition has been resolved.

The CAP theorem states that a distributed system cannot simultaneously guarantee immediate consistency, availability, and partition tolerance; it can only ever offer two of the three at any one time. This is illustrated in Figure 1.7:

Figure 1.7: The CAP theorem

CA distributed systems offer immediate consistency and high availability, but are not tolerant to inter-node network failure, meaning that data could be lost. CP distributed systems offer immediate consistency and are resilient to network failure, with no data loss. However, they may not respond in the event of an inter-node network failure. AP distributed systems offer high availability and resilience to network failure with no data loss. However, read requests may not return the most recent data.

Distributed systems, such as Apache Cassandra, allow for the configuration of the level of consistency required. For example, let's assume we have provisioned an Apache Cassandra cluster with a replication factor of 3. In Apache Cassandra, a consistency configuration of ONE means that a write request is considered successful as soon as one copy of the data is persisted, without the need to wait for the other two replicas to be written. In this case, the system is said to be eventually consistent, as the other two replicas will eventually be persisted. A subsequent and immediate read request may either return the latest data if it is processed by the updated replica, or it may return outdated data if it is processed by one of the other two replicas that have yet to be updated (but will eventually be). In this scenario, Cassandra is an AP distributed system exhibiting eventual consistency and the tolerance of all but one of the replicas failing. It also provides the fastest performing system in this context.

A consistency configuration of ALL in Apache Cassandra means that a write request is considered successful only if all replicas are persisted successfully. A subsequent and immediate read request will always return the latest data. In this scenario, Cassandra is a CA distributed system exhibiting immediate consistency, but with no tolerance of failure. It also provides the slowest performing system in this context.

Finally, a consistency configuration of QUORUM in Apache Cassandra means that a write request is considered successful only when a strict majority of replicas are persisted successfully. A subsequent and immediate read request also using QUORUM consistency will wait until data from two replicas (in the case of a replication factor of 3) is received and, by comparing timestamps, will always return the latest data. In this scenario, Cassandra is also a CA distributed system exhibiting immediate consistency, but with the tolerance of a minority of the replicas failing. It also provides a median-performing system in this context.
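As a hedged sketch using the DataStax Python driver for Cassandra, the following snippet shows how the consistency level described above can be set per statement; the contact point, keyspace, table, and data are assumptions made for illustration:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Connect to an (assumed) Cassandra cluster and keyspace
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# QUORUM: the write succeeds once a strict majority of replicas acknowledge it
write = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, (1001, 'Jillur'))

# A QUORUM read of the same row is then guaranteed to return the latest value
read = SimpleStatement(
    "SELECT name FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(read, (1001,)).one())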

Ultimately, however, data loss is not an option for most business-critical systems in the real world, and a trade-off between performance, consistency, and availability is required. Therefore, the choice tends to come down to either CP or AP distributed systems, with the winner driven by business requirements.

Distributed search engines

Distributed search engines, such as Elasticsearch based on Apache Lucene, transform data into highly optimized data structures for quick and efficient searching and analysis. In Apache Lucene, data is indexed into documents containing one or more fields representing analyzed data attributes of various data types. A collection of documents forms an index, and it is this index that is searched when queries are processed, returning the relevant documents that fulfill the query. A suitable analogy is trying to find pages relating to a specific topic in a textbook: instead of searching every page one by one, the reader can use the index at the back of the book to find the relevant pages more quickly. To learn more about Apache Lucene, please visit http://lucene.apache.org/.

Elasticsearch extends Apache Lucene by offering the ability to partition and horizontally scale search indexes and analytical queries over a distributed cluster, coupled with a RESTful search engine and HTTP web interface for high-performance searching and analysis. To learn more about Elasticsearch, please visit https://www.elastic.co/products/elasticsearch.
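To make the indexing-and-searching workflow concrete, here is a minimal sketch using the official elasticsearch Python client (the older, body-style request syntax); the index name and document contents are assumptions made for illustration:

from elasticsearch import Elasticsearch

# Connect to an (assumed) local Elasticsearch node
es = Elasticsearch(['http://localhost:9200'])

# Index a document; its analyzed fields become searchable via the index
es.index(index='movies', id=1, body={
    'title': 'The Imitation Game',
    'genres': ['Biography', 'Drama', 'Thriller'],
    'rating': 8.0
})

# Full-text search against the index rather than scanning raw records
results = es.search(index='movies', body={
    'query': {'match': {'title': 'imitation'}}
})
for hit in results['hits']['hits']:
    print(hit['_source']['title'], hit['_score'])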

Distributed processing

We have seen how distributed data stores such as the HDFS and Apache Cassandra allow us to store and model huge volumes of structured, semi-structured, and unstructured data partitioned over horizontally scalable clusters providing fault tolerance, resilience, high availability, and consistency. But in order to provide actionable insights and to deliver meaningful business value, we now need to be able to process and analyze all that data.

Let's return to the traditional data processing scenario we described near the start of this chapter. Typically, the data transformation and analytical programming code written by an analyst, data engineer, or software engineer (for example, in SQL, Python, R, or SAS) would rely on the input data being physically moved to the remote processing server or machine where the code to be executed resided. This would often take the form of a programmatic query embedded inside the code itself, for example, a SQL statement via an ODBC or JDBC connection, or of flat files such as CSV and XML files moved to the local filesystem. Although this approach works fine for small- to medium-sized datasets, there is a physical limit bounded by the computational resources available to the single remote processing server. Furthermore, flat files such as CSV and XML files introduce an additional, and often unnecessary, intermediate data store that requires management and increases disk I/O.

The major problem with this approach, however, is that the data needs to be moved to the code every time a job is executed. This very quickly becomes impractical when dealing with increased data volumes and frequencies, such as the volumes and frequencies we associate with big data.

We therefore need another data processing and programming paradigm—one where code is moved to the data and that works across a distributed cluster. In other words, we require distributed processing!

The fundamental idea behind distributed processing is the concept of splitting a computational processing task into smaller tasks. These smaller tasks are then distributed across the cluster and process specific partitions of the data. Typically, the computational tasks are co-located on the same nodes as the data itself to increase performance and reduce network I/O. The results from each of the smaller tasks are then aggregated in some manner before the final result is returned.

MapReduce

MapReduce is an example of a distributed data processing paradigm capable of processing big data in parallel across a cluster of nodes. A MapReduce job splits a large dataset into independent chunks and consists of two stages—the first stage is the Map function that creates a map task for each range in the input, outputting a partitioned group of key-value pairs. The output of the map tasks then acts as input to reduce tasks, whose job it is to combine and condense the relevant partitions in order to solve the analytical problem. Before beginning the map stage, data is often sorted or filtered based on some condition pertinent to the analysis being undertaken. Similarly, the output of the reduce function may be subject to a finalization function to further analyze the data.

Let's consider a simple example to bring this rather abstract definition to life. The example that we will consider is that of a word count. Suppose that we have a text file containing millions of lines of text, and we wish to count the number of occurrences of each unique word in this text file as a whole. Figure 1.8 illustrates how this analysis may be undertaken using the MapReduce paradigm:

Figure 1.8: Word count MapReduce program

In this example, the original text file containing millions of lines of text is split up into its individual lines. Map tasks are applied to ranges of those individual lines, splitting them into individual word tokens, in this case, using a whitespace tokenizer, and thereafter emitting a collection of key-value pairs where the key is the word.

A shuffling process is undertaken that transfers the partitioned key-value pairs emitted by the map tasks to the reduce tasks. Sorting of the key-value pairs, grouped by key, is also undertaken. This helps to identify when a new reduce task should start. To reduce the amount of data transferred from the map tasks to the reduce tasks during shuffling, an optional combiner may be specified that implements a local aggregation function. In this example, a combiner is specified that sums, locally, the number of occurrences of each key or word for each map output.

The reduce tasks then take those partitioned key-value pairs and reduce those values that share the same key, outputting new (but unsorted) key-value pairs unique by key. In this example, the reduce tasks simply sum the number of occurrences of that key. The final output of the MapReduce job in this case is simply a count of the number of occurrences of each word across the whole original text file.

In this example, we used a simple text file that had been split up by a newline character that is then mapped to key-value pairs based on a whitespace tokenizer. But the same paradigm can easily be extended to distributed data stores, where large volumes of data have already been partitioned across a cluster, thereby allowing us to perform data processing on a huge scale.
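The following single-process Python sketch mimics the three phases of the word count job described above (map, shuffle-and-sort, and reduce) purely to make the flow of key-value pairs concrete; it is not a distributed implementation such as Hadoop MapReduce:

from itertools import groupby

lines = ['the quick brown fox', 'the lazy dog', 'the quick dog']

# Map stage: tokenize each line on whitespace and emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group the emitted pairs by key (an optional combiner
# would pre-aggregate counts per map output before this step)
mapped.sort(key=lambda pair: pair[0])

# Reduce stage: sum the values that share the same key
counts = {word: sum(count for _, count in pairs)
          for word, pairs in groupby(mapped, key=lambda pair: pair[0])}

print(counts)   # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}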

Apache Spark

Apache Spark is a well-known example of a general-purpose distributed processing engine capable of handling petabytes (PB) of data. Because it is a general-purpose engine, it is suited to a wide variety of use cases at scale, including the engineering and execution of Extract-Transform-Load (ETL) pipelines using its Spark SQL library, interactive analytics, stream processing using its Spark Streaming library, graph-based processing using its GraphX library, and machine learning using its MLlib library. We will be employing Apache Spark's machine learning library in later chapters. For now, however, it is important to get an overview of how Apache Spark works under the hood.

Apache Spark software services run in Java Virtual Machines (JVMs), but that does not mean Spark applications must be written in Java. In fact, Spark exposes its API and programming model to a variety of languages, including Java, Scala, Python, and R, any of which may be used to write a Spark application. In terms of its logical architecture, Spark employs a master/worker architecture as illustrated in Figure 1.9:

Figure 1.9: Apache Spark logical architecture

Every application written in Apache Spark consists of a Driver Program. The driver program is responsible for splitting a Spark application into tasks, which are then partitioned across the Worker nodes in the distributed cluster and scheduled for execution by the driver. The driver program also instantiates a SparkContext, which tells the application how to connect to the Spark cluster and its underlying services.

The worker nodes, also known as slaves, are where the computational processing physically occurs. Typically, Spark worker nodes are co-located on the same nodes as where the underlying data is also persisted to improve performance. Worker nodes spawn processes called Executors, and it is these executors that are responsible for executing the computational tasks and storing any locally-cached data. Executors communicate with the driver program in order to receive scheduled functions, such as map and reduce functions, which are then executed. The Cluster Manager is responsible for scheduling and allocating resources across the cluster and must therefore be able to communicate with every worker node, as well as the driver. The driver program requests executors from the cluster manager (since the cluster manager is aware of the resources available) so that it may schedule tasks.

Apache Spark is bundled with its own simple cluster manager which, when used, is referred to as Spark Standalone mode. Spark applications deployed to a standalone cluster will, by default, utilize all nodes in the cluster and are scheduled in a First-In-First-Out (FIFO) manner. Apache Spark also supports other cluster managers, including Apache Mesos and Apache Hadoop YARN, both of which are beyond the scope of this book.
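As a small illustrative sketch in PySpark, the following driver program creates a SparkSession (and hence a SparkContext) that connects either to the local machine or, with a different master URL, to a standalone cluster manager; the application name and master URLs shown are examples only:

from pyspark.sql import SparkSession

# The driver program builds a SparkSession, which wraps the SparkContext.
# 'local[*]' runs everything in a single JVM using all local cores; a
# standalone cluster would instead use a URL such as 'spark://master-host:7077'.
spark = (SparkSession.builder
         .appName('quick-start-example')
         .master('local[*]')
         .getOrCreate())

sc = spark.sparkContext
print(sc.master, sc.applicationId)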

RDDs, DataFrames, and datasets

So how does Spark store and partition data during its computational processing? Well, by default, Spark holds data in memory, which helps to make it such a quick processing engine. In fact, as of Spark 2.0 and onward, there are three sets of APIs used to hold data: resilient distributed datasets (RDDs), DataFrames, and Datasets.

RDDs

RDDs are an immutable and distributed collection of records partitioned across the worker nodes in a Spark cluster. They offer fault tolerance because, in the event of non-functional nodes or damaged partitions, RDD partitions can be recomputed; all the dependency information needed to recreate each of its partitions is stored by the RDD itself. They also provide consistency, since each partition is immutable. RDDs are commonly used in Spark today in situations where you do not need to impose a schema when processing the data, such as when unstructured data is processed. Operations may be executed on RDDs via a low-level API that provides two broad categories of operation:

  • Transformations: These are operations that return another RDD. Narrow transformations, for example, map operations, can be executed on arbitrary partitions of data without depending on other partitions. Wide transformations, for example, sorting, joining, and grouping, require the data to be redistributed across partitions, a process known as shuffling. Because data needs to be moved across the cluster, wide transformations requiring shuffling are expensive operations and should be minimized in Spark applications where possible.
  • Actions: These are computational operations that return a value back to the driver, not another RDD. RDDs are said to be lazily evaluated, meaning that transformations are only computed when an action is called, as shown in the sketch after this list.
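Here is a minimal PySpark sketch of those two categories of operation, assuming the SparkContext sc created earlier; the transformations build up lazily and nothing is computed until the final action is called:

# Narrow transformations: each output partition depends on a single input partition
numbers = sc.parallelize(range(1, 1001), numSlices=8)
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Wide transformation: aggregating by key forces a shuffle across the cluster
by_remainder = evens.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)

# Action: only now are the transformations above actually executed
print(by_remainder.collect())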

DataFrames

Like RDDs, DataFrames are an immutable and distributed collection of records partitioned across the worker nodes in a Spark cluster. However, unlike RDDs, the data is organized into named columns, conceptually similar to tables in relational databases and the tabular data structures found in other programming languages, such as Python and R. Because DataFrames offer the ability to impose a schema on distributed data, they can be queried using more familiar languages such as SQL, which makes them a popular and arguably easier data structure to work with and manipulate, as well as being more efficient than RDDs.

The main disadvantage of DataFrames however is that, similar to Spark SQL string queries, analysis errors are only caught at runtime and not during compilation. For example, imagine a DataFrame called df with the named columns firstname, lastname, and gender. Now imagine that we coded the following statement:

df.filter( df.age > 30 )

This statement attempts to filter the DataFrame based on a missing and unknown column called age. Using the DataFrame API, this error would not be caught at compile time but instead only at runtime, which could be costly and time-consuming if the Spark application in question involved multiple transformations and aggregations prior to this statement.
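The following PySpark sketch recreates that scenario with a hypothetical DataFrame, assuming the SparkSession spark from the earlier sketch; referencing the non-existent age column is not caught when the application is compiled or parsed, and only fails with an analysis error when the statement actually runs:

from pyspark.sql.functions import col

# A hypothetical DataFrame with only firstname, lastname, and gender columns
df = spark.createDataFrame(
    [('Jillur', 'Quddus', 'M'), ('Jane', 'Doe', 'F')],
    ['firstname', 'lastname', 'gender'])

df.filter(col('gender') == 'F').show()   # valid: the column exists

# The following line would fail only at runtime because no age column exists:
# df.filter(col('age') > 30).show()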

Datasets

Datasets extend DataFrames by providing type safety. This means that, in the preceding example of the missing column, the Dataset API will throw a compile-time error. In fact, a DataFrame is actually an alias for Dataset[Row], where a Row is an untyped object that you may see in Spark applications written in Java. However, because R and Python have no compile-time type safety, Datasets are not currently available to these languages.

There are numerous advantages to using the DataFrame and Dataset APIs over RDDs, including better performance and more efficient memory usage. The high-level APIs offered by DataFrames and Datasets also make it easier to perform standard operations such as filtering, grouping, and calculating statistical aggregations such as totals and averages. RDDs, however, are still useful because of the greater degree of control offered by their low-level API, including low-level transformations and actions. They also surface analysis errors at compile time and are well suited to unstructured data.

Jobs, stages, and tasks

Now that we know how Spark stores data during computational processing, let's return to its logical architecture to understand how Spark applications are logically broken down into smaller units for distributed processing.

Job

When an action is called in a Spark application, Spark will use a dependency graph to ascertain the datasets on which that action depends and thereafter formulate an execution plan. An execution plan is essentially a chain of datasets, beginning with the dataset furthest back all the way through to the final dataset, that are required to be computed in order to calculate the value to return to the driver program as a result of that action. This process is called a Spark job, with each job corresponding to one action.

Stage

If the Spark job and, hence, the action that resulted in the launching of that job, involves the shuffling of data (that is, the redistribution of data), then that job is broken down into stages. A new stage begins when network communication is required between the worker nodes. An individual stage is therefore defined as a collection of tasks processed by an individual executor with no dependency on other executors.

Tasks

Tasks are the smallest unit of execution in Spark, with a single task being executed on one executor; in other words, a single task cannot span multiple executors. All the tasks making up one stage share the same code to be executed, but act on different partitions of the data. The number of tasks that can be processed by an executor is bounded by the number of cores associated with that executor. Therefore, the total number of tasks that can be executed in parallel across an entire Spark cluster can be calculated by multiplying the number of cores per executor by the number of executors. This value then provides a quantifiable measure of the level of parallelism offered by your Spark cluster.
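To tie jobs, stages, and tasks together, consider the following PySpark sketch, which assumes the sc created earlier and a hypothetical input path. The single action at the end launches one job; the shuffle caused by reduceByKey splits that job into two stages; and each stage runs one task per partition, so a cluster with, say, 10 executors of 4 cores each could run up to 40 of those tasks in parallel:

# One action = one job; the shuffle caused by reduceByKey = two stages
lines = sc.textFile('hdfs:///data/sample_text.txt', minPartitions=8)  # hypothetical path
word_counts = (lines
               .flatMap(lambda line: line.split())   # stage 1: one task per partition
               .map(lambda word: (word, 1))          # still stage 1 (narrow)
               .reduceByKey(lambda a, b: a + b))     # shuffle boundary: stage 2

print(word_counts.take(10))   # the action that triggers the job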

In Chapter 2, Setting Up a Local Development Environment, we will discuss how to install, configure, and administer a single-node standalone Spark cluster for development purposes, as well as discussing some of the basic configuration options exposed by Spark. Then, in Chapter 3, Artificial Intelligence and Machine Learning, and onward, we will take advantage of Spark's machine learning library, MLlib, so that we may employ Spark as a distributed advanced analytics engine. To learn more about Apache Spark, please visit http://spark.apache.org/.

Distributed messaging

Continuing our journey through distributed systems, the next category that we will discuss is distributed messaging systems. Typically, real-world IT systems are, in fact, a collection of distinct applications, potentially written in different languages using different underlying technologies and frameworks, that are integrated with one another. In order for messages to be sent between distinct applications, developers could potentially code the consumption logic into each individual application. This is a bad idea however—what happens if the type of message sent by an upstream application changes? In this case, the consumption logic would have to be rewritten, relevant applications updated, and the whole system retested.

Messaging systems, such as Apache ActiveMQ and RabbitMQ, overcome this problem by providing a middleman called a message broker. Figure 1.10 illustrates how message brokers work at a high level:

Figure 1.10: Message broker high-level overview

At a high level, Producers are applications that generate and send messages required for the functionality of the system. The Message Broker receives these messages and stores them inside queue data structures or buffers. Consumer applications, which are applications designed to process messages, subscribe to the message broker. The message broker then delivers these messages to the Consumer applications which consume and process them. Note that a single application can be a producer, consumer, or both.

Distributed messaging systems, one use case of Apache Kafka, extend traditional messaging systems by being able to partition and scale horizontally, while offering high throughput, high performance, fault tolerance, and replication, like many other distributed systems. This means that messages are never lost, while offering the ability to load balance requests and to provide ordering guarantees. We will discuss Apache Kafka in more detail next, but in the context of a distributed streaming platform for real-time data.

Distributed streaming

Imagine processing the data stored in a traditional spreadsheet or text-based delimited files such as a CSV file. The type of processing that you will typically execute when using these types of data stores is referred to as batch processing. In batch processing, data is collated into some sort of group, in this case, the collection of lines in our spreadsheet or CSV file, and processed together as a group at some future time and date. Typically, these spreadsheets or CSV files will be refreshed with updated data at some juncture, at which point the same, or similar, processing will be undertaken, potentially all managed by some sort of schedule or timer. Traditionally, data processing systems would have been developed with batch processing in mind, including conventional data warehouses.

Today, however, batch processing alone is not enough. With the advent of the internet, social media, and more powerful technology, coupled with the demand for mass data consumption as soon as possible (ideally immediately), real-time data processing and analytics are no longer a luxury for many businesses but instead a necessity. Examples of use cases where real-time data processing is vital include processing financial transactions and real-time pricing, real-time fraud detection and combating serious organized crime, logistics, travel, robotics, and artificial intelligence.

Micro-batch processing extends standard batch processing by executing at smaller intervals (typically seconds or milliseconds) and/or on smaller batches of data. However, like batch processing, data is still processed a batch at a time.

Stream processing differs from micro-batch and batch processing in the fact that data processing is executed as and when individual data units arrive. Distributed streaming platforms, such as Apache Kafka, provide the ability to safely and securely move real-time data between systems and applications. Thereafter, distributed streaming engines, such as Apache Spark's Streaming library, Spark Streaming, and Apache Storm, allow us to process and analyze real-time data. In Chapter 8, Real-Time Machine Learning Using Apache Spark, we will discuss Spark Streaming in greater detail, where we will develop a real-time sentiment analysis model by combining Apache Kafka with Spark Streaming and Apache Spark's machine learning library, MLlib.

In the meantime, let's quickly take a look into how Apache Kafka works under the hood. Take a moment to think about what kind of things you would need to consider in order to engineer a real-time streaming platform:

  • Fault tolerance: The platform must not lose real-time streams of data and have some way to store them in the event of partial system failure.
  • Ordering: The platform must provide a means to guarantee that streams can be processed in the order that they are received, which is especially important to business applications where order is critical.
  • Reliability: The platform must provide a means to reliably and efficiently move streams between various distinct applications and systems.

Apache Kafka provides all of these guarantees through its distributed streaming logical architecture, as illustrated in Figure 1.11:

Figure 1.11: Apache Kafka logical architecture

In Apache Kafka, a topic refers to a stream of records belonging to a particular category. Kafka's Producer API allows producer applications to publish streams of records to one or more Kafka topics, and its Consumer API allows consumer applications to subscribe to one or more topics and thereafter receive and process the stream of records belonging to those topics. Topics in Kafka are said to be multi-subscriber, meaning that a single topic may have zero, one, or more consumers subscribed to it. Physically, a Kafka topic is stored as a sequence of ordered and immutable records that can only be appended to and are partitioned and replicated across the Kafka cluster, thereby providing scalability and fault tolerance for large systems. Kafka guarantees that producer messages are appended to a topic partition in the order that they are sent, with the producer application responsible for identifying which partition to assign records to, and that consumer applications can access records in the order in which they are persisted.
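The following sketch uses the kafka-python client to illustrate the Producer and Consumer APIs described above; the broker address, topic name, and record contents are assumptions made for this example:

from kafka import KafkaConsumer, KafkaProducer

# Producer: publish a stream of records to an (assumed) 'transactions' topic
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('transactions', key=b'user1', value=b'{"amount": 25.00}')
producer.flush()

# Consumer: subscribe to the topic and process records in the order persisted
consumer = KafkaConsumer('transactions',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)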

Kafka has become synonymous with real-time data because of its logical architecture and the guarantees that it provides when moving real-time streams of data between systems and applications. But Kafka can also be used as a stream processing engine in its own right via its Streams API, not just as a messaging system. By means of its Streams API, Kafka allows us to consume continual streams of data from input topics, process that data in some manner or other, and thereafter produce continual streams of data to output topics. In other words, Kafka allows us to transform input streams of data into output streams of data, thereby facilitating the engineering of real-time data processing pipelines in competition with other stream processing engines such as Apache Spark and Apache Storm.

In Chapter 8, Real-Time Machine Learning Using Apache Spark, we will use Apache Kafka to reliably move real-time streams of data from their source systems to Apache Spark. Apache Spark will then act as our stream processing engine of choice in conjunction with its machine learning library. In the meantime, however, to learn more about Apache Kafka, please visit https://kafka.apache.org/.

Distributed ledgers

To finish our journey into distributed systems, let's talk about a particular type of distributed system that could potentially form the basis of a large number of exciting cutting-edge technologies in the future. Distributed ledgers are a special class within distributed databases made famous recently by the prevalence of blockchain and subsequent cryptocurrencies such as Bitcoin.

Traditionally, when you make a purchase using your credit or debit card, the issuing bank acts as the centralized authority. As part of the transaction, a request is made to the bank to ascertain whether you have sufficient funds to complete the transaction. If you do, the bank keeps a record of the new transaction and deducts the amount from your balance, allowing you to complete the purchase. The bank keeps a record of this and all transactions on your account. If you ever wish to view your historic transactions and your current overall balance, you can access your account records online or via paper statements, all of which are managed by the trusted and central source—your bank.

Distributed ledgers, on the other hand, have no single trusted central authority. Instead, records are independently created and stored on separate nodes forming a distributed network, in other words, a distributed database, but data is never created by nor passed through a central authority or master node. Every node in the distributed network processes every transaction. If you make a purchase using a cryptocurrency such as Bitcoin, which is based on blockchain technology (one form of distributed ledger), the nodes vote on the update. Once a majority consensus is reached, the distributed ledger is updated and the latest version of the ledger is saved on each node separately.

As described, Blockchain is one form of a distributed ledger. As well as sharing the fundamental features of distributed ledgers, in blockchain, data is grouped into blocks that are secured using cryptography. Records in blockchain cannot be altered or deleted once persisted, but can only be appended to, making blockchain particularly well suited to use cases where maintaining a secure historical view is important, such as financial transactions and cryptocurrencies including Bitcoin.

Artificial intelligence and machine learning

We have discussed how distributed systems can be employed to store, model, and process huge amounts of structured, semi-structured, and unstructured data, while providing horizontal scalability, fault tolerance, resilience, high availability, consistency, and high throughput. However, other fields of study have become prevalent today, seemingly in conjunction with the rise of big data—artificial intelligence and machine learning.

But why have these fields of study, the underlying mathematical theories of which have been around for decades, and even centuries in some cases, risen to prominence at the same time as big data? The answer to this question lies in understanding the benefits offered by this new breed of technology.

Distributed systems allow us to consolidate, aggregate, transform, process, and analyze vast volumes of previously disparate data. The process of consolidating these disparate datasets allows us to infer insights and uncover hidden relationships that would have been impossible previously. Furthermore, cluster computing, such as that offered by distributed systems, exposes more powerful and numerous hardware and software working together as a single logical unit that can be assigned to solve complex computational tasks such as those inherent to artificial intelligence and machine learning. Today, by combining these features, we can efficiently run advanced analytical algorithms to ultimately provide actionable insights, the level and breadth of which have never been seen before in many mainstream industries.

Apache Spark's machine learning library, MLlib, and TensorFlow are examples of libraries that have been developed to allow us to quickly and efficiently engineer and execute machine learning algorithms as part of analytical processing pipelines.

In Chapter 3, Artificial Intelligence and Machine Learning, we will discuss some of the high-level concepts behind common artificial intelligence and machine learning algorithms, as well as the logical architecture behind Apache Spark's machine learning library MLlib. Thereafter, in Chapter 4, Supervised Learning Using Apache Spark, through to Chapter 8, Real-Time Machine Learning Using Apache Spark, we will develop advanced analytical models with MLlib using real-world use cases, while exploring their underlying mathematical theory.

To learn more about MLlib and TensorFlow, please visit https://spark.apache.org/mllib/ and https://www.tensorflow.org/ respectively.

Cloud computing platforms

Traditionally, many large organizations have invested in expensive data centers to house their business-critical computing systems. These data centers are integrated with their corporate networks, allowing users to access both the data stored in these centers and increased processing capacity. One of the main advantages of large organizations maintaining their own data centers is that of security—both data and processing capacity are kept on-premise under their control and administration within largely closed networks.

However, with the advent of big data and more accessible artificial intelligence and machine learning-driven analytics, the need to store, model, process, and analyze huge volumes of data requiring scalable hardware, software, and potentially distributed clusters containing hundreds or thousands of physical or virtual nodes quickly makes maintaining your own 24/7 data centers less and less cost-effective.

Cloud computing platforms, such as Amazon Web Services (AWS), Microsoft Azure, and the Google Cloud Platform (GCP), provide a means to offload some or all of an organization's data storage, management, and processing requirements to remote platforms managed by technology companies and that are accessible over the internet. Today, these cloud computing platforms offer much more than just a place to remotely store an organization's ever-increasing data estate. They offer scalable storage, computational processing, administration, data governance, and security software services, as well as access to artificial intelligence and machine learning libraries and frameworks. These cloud computing platforms also tend to offer a Pay-As-You-Go (PAYG) pricing model, meaning that you only pay for the storage and processing capacity that you actually use, which can also be easily scaled up or down depending on requirements.

Organizations that are wary of storing their sensitive data on remote platforms accessible over the internet tend to architect and engineer hybrid systems instead, whereby sensitive data remains on-premise but computational processing on anonymized or unattributable data is offloaded to the cloud.
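
As an illustration of this hybrid pattern, the following PySpark sketch hashes a direct identifier and drops personally identifiable columns on-premise before the anonymized data is offloaded to cloud object storage. The column names, file paths, and bucket are hypothetical, and the appropriate cloud storage connector and credentials would need to be configured:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.appName("anonymize-sketch").getOrCreate()

# Read sensitive data from an on-premise store (hypothetical path)
customers = spark.read.parquet("/on_prem/data/customers")

# Hash the customer identifier and drop direct identifiers before offloading
anonymized = (customers
    .withColumn("customer_id", sha2(col("customer_id").cast("string"), 256))
    .drop("name", "email"))

# Write the anonymized data to cloud object storage (hypothetical bucket)
anonymized.write.mode("overwrite").parquet("s3a://analytics-bucket/customers_anon/")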

A well-architected and well-engineered system should provide a layer of abstraction between its infrastructure and its end users, including data analysts and data scientists; whether the underlying data storage and processing infrastructure is on-premise or cloud-based should be invisible to these users. Furthermore, many of the distributed technologies and processing engines that we have discussed so far tend to be written in Java or C++, but expose their APIs or programming models in other languages, such as Python, Scala, and R. This makes them accessible to a wide range of end users and deployable on any machine that can run a JVM or compiled C++ code.

A significant number of the cloud services offered by cloud computing platforms are, in fact, commercial service wrappers built around open source technologies, with guaranteed availability and support. Therefore, once system administrators and end users become familiar with a particular class of technology, migrating to a cloud computing platform becomes a matter of configuration and performance tuning rather than learning an entirely new way to store, model, and process data. This is important, since training staff is a significant cost for many organizations; reusing underlying technologies and frameworks as much as possible is preferable to migrating to entirely new storage and processing paradigms.

Data insights platform

We have discussed the various systems, technologies, and frameworks available today to allow us to store, aggregate, manage, transform, process, and analyze vast volumes of structured, semi-structured, and unstructured data in both batch and real time in order to provide actionable insights and deliver real business value. We will conclude this chapter by discussing how all of these systems and technologies can fully integrate with one another to deliver a consolidated, high-performance, secure, and reliable data insights platform accessible to all parts of your organization.

Reference logical architecture

We can represent a data insights platform as a series of logical layers, where each layer provides a distinct functional capability. When we combine these layers, we form a reference logical architecture for a data insights platform, as illustrated in Figure 1.12:

Figure 1.12: Data insights platform reference logical architecture

The logical layers in this data insights platform reference architecture are described in further detail in the following sub-sections.

Data sources layer

The data sources layer represents the various disparate data stores, datasets, and other source data systems that will provide the input data to the data insights platform. These disparate source data providers may contain structured (for example, delimited text files, CSVs, and relational databases), semi-structured (for example, JSON and XML files), or unstructured (for example, images, videos, and audio) data.

Ingestion layer

The ingestion layer is responsible for consuming the source data, regardless of its format and frequency, and thereafter moving it either to a persistent data store or directly to a downstream data processing engine. The ingestion layer should be capable of supporting both batch data and stream-based event data. Examples of open source technologies used to implement the ingestion layer include the following (a minimal sketch follows the list):

  • Apache Sqoop
  • Apache Kafka
  • Apache Flume
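
The ingestion sketch below uses the kafka-python client to publish JSON events to a Kafka topic for downstream consumption; the broker address, topic name, and event fields are illustrative:

import json
from kafka import KafkaProducer

# Serialize each event dictionary as UTF-8 encoded JSON
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"))

event = {"sensor_id": "s-001", "temperature": 21.4,
         "timestamp": "2018-12-01T10:00:00Z"}

# Publish the event to the (illustrative) sensor-events topic and flush
producer.send("sensor-events", value=event)
producer.flush()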

Persistent data storage layer

The persistent data storage layer is responsible for consuming and persisting the raw source data provided by the ingestion layer. Little or no transformation takes place before the data is persisted, ensuring that it remains in its original format. A data lake is a class of persistent data store that is often implemented to store raw data in this manner. Example technologies used to implement the persistent data storage layer include the following (a short sketch follows the list):

  • Traditional network-based stores, such as Storage Area Networks (SAN) and Network Attached Storage (NAS)
  • Open source technologies, such as the Hadoop Distributed File System (HDFS)
  • Cloud-based technologies, such as AWS S3 and Azure Blob Storage
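
The sketch below lands a raw source file in the data lake unchanged, using PySpark to append it to a raw zone in HDFS. The paths are illustrative, and an s3a:// or equivalent cloud URI could be used instead, given the right connectors:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-landing-sketch").getOrCreate()

# Read the raw file exactly as delivered -- no schema enforcement, no transformation
raw = spark.read.text("/ingest/incoming/transactions_2018-12-01.csv")

# Persist it to the raw zone of the data lake, partitioned by ingest date
raw.write.mode("append").text(
    "hdfs:///datalake/raw/transactions/ingest_date=2018-12-01/")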

Data processing layer

The data processing layer is responsible for the transformation, enrichment, and validation of the raw data gathered either from the persistent data store or directly from the ingestion layer. It models the data according to downstream business and analytical requirements, and prepares it either for persistence in the serving data storage layer or for processing by data intelligence applications. Again, the data processing layer must be capable of processing both batch data and stream-based event data. Examples of open source technologies used to implement the data processing layer include the following (a streaming sketch follows the list):

  • Apache Hive
  • Apache Spark, including Spark Streaming (DStreams) and Structured Streaming
  • Apache Kafka
  • Apache Storm
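
As a sketch of stream processing in this layer, the following Structured Streaming job consumes the events published in the earlier ingestion sketch, parses the JSON payload, and computes a windowed average per sensor. The topic name and schema are assumptions carried over from that sketch, and the job would require the Spark Kafka connector package on its classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("processing-sketch").getOrCreate()

# Expected shape of each JSON event (an assumption for this sketch)
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("timestamp", TimestampType())])

# Consume the raw Kafka events and parse the JSON value column
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sensor-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*"))

# Compute a 5-minute windowed average temperature per sensor
averages = (events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"), col("sensor_id"))
    .agg(avg("temperature").alias("avg_temperature")))

query = averages.writeStream.outputMode("update").format("console").start()
query.awaitTermination()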

Serving data storage layer

The serving data storage layer is responsible for persisting the transformed, enriched, and validated data produced by the data processing layer in data stores that maintain the data model structures, so that the data is ready to serve downstream data intelligence and data insight applications. This minimizes or removes the need for further data transformations, since the data is already persisted in highly optimized structures relevant to the type of processing required. The types of data stores provisioned in the serving data storage layer depend on business and analytical requirements, and may include any of, or a combination of, the following (stated along with examples of open source implementations; a short sketch follows the list):

  • Relational databases, such as PostgreSQL and MySQL
  • Document databases, such as Apache CouchDB and MongoDB
  • Columnar databases, such as Apache Cassandra and Apache HBase
  • Key-value databases, such as Redis and Voldemort
  • Graph databases and frameworks, such as Apache TinkerPop
  • Search engines, such as Apache Lucene and Elasticsearch
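
As a sketch of populating a relational serving store, the following writes a processed dataset to PostgreSQL via Spark's JDBC data source. The connection details, table name, and source path are illustrative, and the PostgreSQL JDBC driver would need to be on Spark's classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serving-sketch").getOrCreate()

# Processed, analysis-ready data from the data processing layer (illustrative path)
processed = spark.read.parquet("hdfs:///datalake/processed/daily_sensor_averages/")

# Persist it to the relational serving store over JDBC
(processed.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://serving-db:5432/analytics")
    .option("dbtable", "daily_sensor_averages")
    .option("user", "analytics_writer")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save())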

Data intelligence layer

The data intelligence layer is responsible for executing advanced analytical pipelines, both predictive and prescriptive, on both the transformed batch data and stream-based event data. Advanced analytical pipelines may include artificial intelligence services such as image and speech analysis, cognitive computing, and complex reasoning, as well as machine learning models and natural language processing. Examples of open source advanced analytical libraries and frameworks used to implement the data intelligence layer include the following (a brief sketch follows the list):

  • Apache Spark MLlib (API accessible via Python, R, Java, and Scala)
  • TensorFlow (API accessible via Python, Java, C++, Go, and JavaScript, with third-party bindings available in C#, R, Haskell, Ruby, and Scala)
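
To complement the MLlib sketch shown earlier in this chapter, the following is a minimal TensorFlow (tf.keras) sketch that trains a small feed-forward classifier on randomly generated data, purely to illustrate the shape of a deep learning workload in this layer:

import numpy as np
import tensorflow as tf

# Synthetic data: 1,000 examples with 20 features and a binary label
features = np.random.rand(1000, 20).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

# A small feed-forward network for binary classification
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(features, labels, epochs=5, batch_size=32)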

Unified access layer

The unified access layer is responsible for providing access to both the serving data storage layer and the third-party APIs exposed by the data intelligence layer. The unified access layer should provide universal, scalable, and secure access to any downstream applications and systems that require it, and typically involves architecting and engineering APIs and/or implementing data federation and virtualization patterns. Examples of open source technologies used to implement the unified access layer include the following (a short sketch follows the list):

  • Spring framework
  • Apache Drill
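
As an example of unified access, the sketch below queries data through Apache Drill's REST interface using Python's requests library, assuming Drill's default HTTP port and query endpoint; the host, storage plugin, and table path are illustrative:

import requests

# Drill's REST query endpoint (the default port, 8047, is assumed)
DRILL_URL = "http://drill-host:8047/query.json"

payload = {
    "queryType": "SQL",
    "query": ("SELECT sensor_id, avg_temperature "
              "FROM dfs.analytics.`daily_sensor_averages` LIMIT 10")}

response = requests.post(DRILL_URL, json=payload)
response.raise_for_status()

# Print each returned row (the response body contains a "rows" list)
for row in response.json().get("rows", []):
    print(row)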

Data insights and reporting layer

The data insights and reporting layer is responsible for exposing the data insights platform to end users, including data analysts and data scientists. Data discovery, self-service business intelligence, search, visualization, data insights, and interactive analytical applications can all be provisioned in this layer, and can access both the transformed data in the serving data storage layer and the third-party APIs in the data intelligence layer, all via the unified access layer. The fundamental purpose of the data insights and reporting layer is to deliver actionable insights and business value derived from all the structured, semi-structured, and unstructured data available to the data insights platform, in both batch and real time. Examples of open source technologies used to implement the data insights and reporting layer, and that are also accessible to end users, include the following (a notebook-style sketch follows the list):

  • Apache Zeppelin
  • Jupyter Notebook
  • Elastic Kibana
  • BIRT Reporting
  • Custom JavaScript-based web applications for search and visualization
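
The following notebook-style sketch shows the sort of snippet a data analyst might run in Jupyter to pull an aggregate from the serving store and visualize it; the connection string, table, and column names are illustrative, and a PostgreSQL driver such as psycopg2 would be required:

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Connect to the relational serving store (illustrative credentials)
engine = create_engine(
    "postgresql://analytics_reader:********@serving-db:5432/analytics")

df = pd.read_sql_query(
    "SELECT sensor_id, avg_temperature FROM daily_sensor_averages", engine)

# Simple bar chart of average temperature per sensor
df.plot(kind="bar", x="sensor_id", y="avg_temperature", legend=False)
plt.ylabel("Average temperature")
plt.title("Average temperature by sensor")
plt.show()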

Examples of commercial technologies include the following business intelligence applications for creating dashboards and reporting:

  • Tableau
  • Microsoft Power BI
  • SAP Lumira
  • QlikView

Platform governance, management, and administration

Finally, since the data insights platform is designed to be accessible by all areas of an organization, and will store sensitive data from which actionable insights are generated, the systems it contains must be properly governed, managed, and administered. Therefore, the following additional logical layers are required in order to provision a secure, enterprise- and production-grade platform:

  • Governance and Security: This layer includes identity and access management (IDAM) and data governance tooling. Open source technologies used to implement this layer include the following:
    • Apache Knox (Hadoop Authentication Gateway)
    • Apache Metron (Security Analytics Framework)
    • Apache Ranger (Monitor Data Security)
    • OpenLDAP (Lightweight Directory Access Protocol Implementation)
  • Management, Administration, and Orchestration: This layer includes DevOps processes (such as version control, automated builds, and deployment), cluster monitoring and administration software, and scheduling and workflow management tooling. Open source technologies used to implement this layer include the following:
    • Jenkins (Automation Server)
    • Git (Version Control)
    • Apache Ambari (Administration, Monitoring, and Management)
    • Apache Oozie (Workflow Scheduling)
  • Network and Access Middleware: This layer handles network connectivity and communication, and includes network security, monitoring, and intrusion detection software.
  • Hardware and Software: This layer contains the physical storage, compute, and network infrastructure on which the data insights platform is deployed. The physical components may be on-premise, cloud-based, or a hybrid combination of the two.

Open source implementation

Figure 1.13 illustrates an example implementation of the reference data insights platform using only open source technologies and frameworks:

Figure 1.13: Example of open source implementation of the data insights platform