Overall architecture
Let's start with a high-level introduction to data architectures: what they do, why they're useful, when they should be used, and how Apache Spark fits in.
At their most general, modern data architectures have four basic components:
- Data Ingestion
- Data Lake
- Data Science
- Data Access
Let's introduce each of these now, so that we can go into more detail in the later chapters.
Data Ingestion
Traditionally, data is ingested under strict rules and formatted according to a predetermined schema. This process is known as Extract, Transform, Load (ETL), and is still a very common practice supported by a large array of commercial tools as well as some open source products.
The ETL approach favors performing up-front checks, which ensure data quality and schema conformance, in order to simplify follow-on online analytical processing. It is particularly suited to handling data with a specific set of characteristics, namely, those that relate to a classical entity-relationship model. However, it is not suitable for all scenarios.
During the big data revolution, demand exploded for structured, semi-structured, and unstructured data, leading to the creation of systems that had to handle data with a different set of characteristics. These came to be defined by the four Vs: Volume, Variety, Velocity, and Veracity (http://www.ibmbigdatahub.com/infographic/four-vs-big-data). Traditional ETL methods floundered under this new burden, either because they simply required too much time to process the vast quantities of data or because they were too rigid in the face of change, and so a different approach emerged: the schema-on-read paradigm. Here, data is ingested in its original form (or at least very close to it) and the details of normalization, validation, and so on are handled at the time of analytical processing.
This is typically referred to as Extract, Load, Transform (ELT), a reordering of the traditional approach.
This approach values the delivery of data in a timely fashion, delaying the detailed processing until it is absolutely required. In this way, a data scientist can gain access to the data immediately, searching for insight using a range of techniques not available with a traditional approach.
Although we only provide a high-level overview here, this approach is so important that we will explore it further throughout the book by implementing various schema-on-read techniques. We will assume the ELT method for data ingestion, that is to say, we encourage loading data at the user's convenience. This may be every n minutes, overnight, or during times of low usage. The data can then be checked for integrity, quality, and so forth by running batch processing jobs offline, again at the user's discretion.
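To make the schema-on-read idea concrete, here is a minimal sketch using Spark's DataFrame API. The paths and field names (userId, timestamp) are purely illustrative assumptions rather than part of any dataset used in this book: the raw JSON is landed as-is, and validation and normalization are deferred until the data is actually processed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("schema-on-read-sketch").getOrCreate()

// Extract and Load: land the raw JSON exactly as received; no schema is imposed at ingest time.
// The input path and field names below are illustrative placeholders.
val raw = spark.read.json("hdfs:///data/raw/events")

// Transform (only at analysis time): basic integrity checks and normalization,
// run as an offline batch job at the user's discretion.
val cleaned = raw
  .filter(col("userId").isNotNull)                      // simple quality check
  .withColumn("eventDate", to_date(col("timestamp")))   // normalize on demand

cleaned.write.mode("overwrite").parquet("hdfs:///data/curated/events")
```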
Data Lake
A data lake is a convenient, ubiquitous store of data. It is useful because it provides a number of key benefits, primarily:
- Reliable storage
- Scalable data processing capability
Let's take a brief look at each of these.
Reliable storage
There is a good choice of underlying storage implementations for a data lake; these include the Hadoop Distributed File System (HDFS), MapR-FS, and Amazon AWS S3.
Throughout the book, HDFS will be the assumed storage implementation. The authors use a distributed Spark setup, deployed on Yet Another Resource Negotiator (YARN) running inside a Hortonworks HDP environment, so HDFS is the technology used unless otherwise stated. If you are not familiar with any of these technologies, they are discussed later in this chapter.
In any case, it's worth knowing that Spark references HDFS locations natively, accesses local file locations via the file:// prefix, and references S3 locations via the s3a:// prefix.
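For illustration, the following sketch reads the same notional text file from each type of location; the paths are placeholders, and for the S3 case the hadoop-aws connector and valid credentials are assumed to be configured.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("storage-prefixes").getOrCreate()

// HDFS locations are referenced natively (an explicit hdfs:// URI also works).
val fromHdfs  = spark.read.textFile("hdfs:///data/input/sample.txt")

// Local filesystem locations use the file:// prefix.
val fromLocal = spark.read.textFile("file:///tmp/sample.txt")

// S3 locations use the s3a:// prefix (requires the hadoop-aws connector and credentials).
val fromS3    = spark.read.textFile("s3a://my-bucket/data/sample.txt")
```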
Scalable data processing capability
Clearly, Apache Spark will be our data processing platform of choice. In addition, as you may recall, Spark allows the user to execute code in their preferred environment, be that local, standalone, YARN, or Mesos, by configuring the appropriate cluster manager in the master URL. Incidentally, this can be done in any one of three locations, as illustrated in the sketch after this list:
- Using the --master option when issuing the spark-submit command
- Adding the spark.master property in the conf/spark-defaults.conf file
- Invoking the setMaster method on the SparkConf object
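By way of illustration, the sketch below uses the third option, setting the master programmatically on a SparkConf object; the equivalent spark-submit and spark-defaults.conf settings are noted in the comments, and local[*] is just one possible master value.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Option 1 (command line):        spark-submit --master yarn ...
// Option 2 (spark-defaults.conf): spark.master  yarn
// Option 3 (programmatic), shown here:
val conf = new SparkConf()
  .setAppName("master-config-sketch")
  .setMaster("local[*]")   // or "yarn", "spark://host:7077", "mesos://host:5050", etc.

val spark = SparkSession.builder().config(conf).getOrCreate()
```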
If you're not familiar with HDFS, or if you do not have access to a cluster, then you can run a local Spark instance using the local filesystem, which is useful for testing. However, beware that there are often bad behaviors that only appear when executing on a cluster. So, if you're serious about Spark, it's worth investing in a distributed cluster manager: why not try Spark's standalone cluster mode, or Amazon AWS EMR? For example, Amazon offers a number of affordable paths to cloud computing; you can explore the idea of spot instances at https://aws.amazon.com/ec2/spot/.
Data Science Platform
A data science platform provides services and APIs that enable effective data science to take place, including exploratory data analysis, machine learning model creation and refinement, image and audio processing, natural language processing, and text sentiment analysis.
This is the area where Spark really excels, and it forms the primary focus of the remainder of this book. Exploiting a robust set of native machine learning libraries, unsurpassed parallel graph processing capabilities, and a strong community, Spark provides truly scalable opportunities for data science.
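As a small, purely illustrative sketch of the kind of model building this enables, the snippet below fits a logistic regression with Spark's MLlib; the tiny dataset and column names are invented for the example.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()
import spark.implicits._

// A tiny, invented dataset: two numeric features and a binary label.
val training = Seq(
  (0.0, 1.1, 0.2),
  (1.0, 3.2, 1.9),
  (0.0, 0.9, 0.4),
  (1.0, 2.8, 2.1)
).toDF("label", "f1", "f2")

// Assemble the raw columns into the single feature vector that MLlib expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Fit the model and inspect its predictions on the training data.
val model = new LogisticRegression().fit(assembler.transform(training))
model.transform(assembler.transform(training)).select("label", "prediction").show()
```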
The remaining chapters will provide insight into each of these areas, including Chapter 6, Scraping Link-Based External Data, Chapter 7, Building Communities, and Chapter 8, Building a Recommendation System.
Data Access
Data in a data lake is most frequently accessed by data engineers and scientists using the Hadoop ecosystem tools, such as Apache Spark, Pig, Hive, Impala, or Drill. However, there are times when other users, or even other systems, need access to the data and the normal tools are either too technical or do not meet the demanding expectations of the user in terms of real-world latency.
In these circumstances, the data often needs to be copied into data marts or index stores so that it may be exposed to more traditional methods, such as a report or dashboard. This process, which typically involves creating indexes and restructuring data for low-latency access, is known as data egress.
Fortunately, Apache Spark has a wide variety of adapters and connectors into traditional databases, BI tools, and visualization and reporting software. Many of these will be introduced throughout the book.
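As one example of such a connector, the sketch below pushes a small aggregate out of the data lake into a relational data mart using Spark's built-in JDBC writer; the URL, table, credentials, and paths are placeholders for whatever reporting database you actually use, and the appropriate JDBC driver is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("egress-sketch").getOrCreate()

// Read a curated dataset from the data lake (placeholder path).
val events = spark.read.parquet("hdfs:///data/curated/events")

// Build a small, report-friendly aggregate suitable for low-latency access.
val daily = events.groupBy("eventDate").count()

// Egress: write the aggregate to a relational data mart via JDBC.
// Connection details and the target table are placeholders; the matching
// JDBC driver must be available on the Spark classpath.
daily.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://datamart-host:5432/reporting")
  .option("dbtable", "daily_event_counts")
  .option("user", "report_user")
  .option("password", "********")
  .mode("overwrite")
  .save()
```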