Mastering Spark for Data Science

Chapter 1.  The Big Data Science Ecosystem

As a data scientist, you'll no doubt be very familiar with handling files and processing perhaps even large amounts of data. However, as I'm sure you will agree, doing anything more than a simple analysis over a single type of data requires a method of organizing and cataloguing data so that it can be managed effectively. Indeed, this is a cornerstone of great data science. As data volume and complexity increase, a consistent and robust approach can be the difference between generalized success and over-fitted failure!

This chapter is an introduction to an approach and ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies: it introduces the environment and how to configure it appropriately, and also explains some of the non-functional considerations relevant to the overall data architecture. While there is little actual data science at this stage, it provides the essential platform that paves the way for success in the rest of the book.

In this chapter, we will cover the following topics:

  • Data management responsibilities
  • Data architecture
  • Companion tools

Introducing the Big Data ecosystem

Data management is of particular importance when the data is in flux: either constantly changing or being routinely produced and updated. What is needed in these cases is a way of storing, structuring, and auditing data that allows for the continuous processing and refinement of models and results.

Here, we describe how best to hold and organize your data so that it integrates with Apache Spark and related tools, within the context of a data architecture that is broad enough to cover everyday requirements.

Data management

Even if, in the medium term, you only intend to play around with a bit of data at home, without proper data management your efforts will more often than not escalate to the point where it is easy to lose track of where you are, and mistakes will happen. Taking the time to think about the organization of your data, and in particular its ingestion, is crucial. There's nothing worse than waiting for a long-running analytic to complete, collating the results, and producing a report, only to discover that you used the wrong version of the data, that the data is incomplete or has missing fields, or, even worse, that you deleted your results!

The bad news is that, despite its importance, data management is an area that is consistently overlooked in both commercial and non-commercial ventures, with precious few off-the-shelf solutions available. The good news is that it is much easier to do great data science using the fundamental building blocks that this chapter describes.

Data management responsibilities

When we think about data, it is easy to overlook the true extent of the scope of the areas we need to consider. Indeed, most data "newbies" think about the scope in this way:

  1. Obtain data
  2. Place the data somewhere (anywhere)
  3. Use the data
  4. Throw the data away

In reality, there are a large number of other considerations, and it is our responsibility to determine which of them apply to a given piece of work. The following data management building blocks assist in answering or tracking some important questions about the data:

  • File integrity
    • Is the data file complete?
    • How do you know?
    • Was it part of a set?
    • Is the data file correct?
    • Was it tampered with in transit?

  • Data integrity
    • Is the data as expected?
    • Are all of the fields present?
    • Is there sufficient metadata?
    • Is the data quality sufficient?
    • Has there been any data drift?

  • Scheduling
    • Is the data routinely transmitted?
    • How often does the data arrive?
    • Was the data received on time?
    • Can you prove when the data was received?
    • Does it require acknowledgement?

  • Schema management
    • Is the data structured or unstructured?
    • How should the data be interpreted?
    • Can the schema be inferred?
    • Has the data changed over time?
    • Can the schema be evolved from the previous version?

  • Version Management
    • What is the version of the data?
    • Is the version correct?
    • How do you handle different versions of the data?
    • How do you know which version you're using?

  • Security
    • Is the data sensitive?
    • Does it contain personally identifiable information (PII)?
    • Does it contain personal health information (PHI)?
    • Does it contain payment card information (PCI)?
    • How should I protect the data?
    • Who is entitled to read/write the data?
    • Does it require anonymization/sanitization/obfuscation/encryption?

  • Disposal
    • How do we dispose of the data?
    • When do we dispose of the data?

If, after all that, you are still not convinced, then before you go ahead and write that bash script using gawk and crontab, keep reading; you will soon see that there is a far quicker, more flexible, and safer method that allows you to start small and incrementally build commercial-grade ingestion pipelines!

The right tool for the job

Apache Spark is the emerging de facto standard for scalable data processing. At the time of writing this book, it is the most active Apache Software Foundation (ASF) project and has a rich variety of companion tools available. There are new projects appearing every day, many of which overlap in functionality. So it takes time to learn what they do and decide whether they are appropriate to use. Unfortunately, there's no quick way around this. Usually, specific trade-offs must be made on a case-by-case basis; there is rarely a one-size-fits-all solution. Therefore, the reader is encouraged to explore the available tools and choose wisely!

Various technologies are introduced throughout this book, and the hope is that they will give the reader a taste of some of the more useful and practical ones, to a level where they can start using them in their own projects. Furthermore, we hope to show that if code is written carefully, technologies can be interchanged through clever use of application programming interfaces (APIs), or higher-order functions in Spark Scala, even when a decision later proves to be incorrect.

Overall architecture

Let's start with a high-level introduction to data architectures: what they do, why they're useful, when they should be used, and how Apache Spark fits in.


At their most general, modern data architectures have four basic characteristics:

  • Data Ingestion
  • Data Lake
  • Data Science
  • Data Access

Let's introduce each of these now, so that we can go into more detail in the later chapters.

Data Ingestion

Traditionally, data is ingested under strict rules and formatted according to a predetermined schema. This process is known as Extract, Transform, Load (ETL), and is still a very common practice supported by a large array of commercial tools as well as some open source products.


The ETL approach favors performing up-front checks, which ensure data quality and schema conformance, in order to simplify follow-on online analytical processing. It is particularly suited to handling data with a specific set of characteristics, namely, those that relate to a classical entity-relationship model. However, it is not suitable for all scenarios.

During the big data revolution, there was an explosion of demand for structured, semi-structured, and unstructured data, leading to the creation of systems that were required to handle data with a different set of characteristics. These came to be defined by the four Vs: Volume, Variety, Velocity, and Veracity (http://www.ibmbigdatahub.com/infographic/four-vs-big-data). Traditional ETL methods floundered under this new burden, either because they simply required too much time to process the vast quantities of data or because they were too rigid in the face of change, and a different approach emerged: the schema-on-read paradigm. Here, data is ingested in its original form (or at least very close to it) and the details of normalization, validation, and so on are handled at the time of analytical processing.

This is typically referred to as Extract, Load, Transform (ELT), a reordering of the traditional acronym:

[Figure: Data Ingestion]

This approach values the delivery of data in a timely fashion, delaying the detailed processing until it is absolutely required. In this way, a data scientist can gain access to the data immediately, searching for insight using a range of techniques not available with a traditional approach.

Although we only provide a high-level overview here, this approach is so important that we will explore it further throughout the book by implementing various schema-on-read algorithms. We will assume the ELT method for data ingestion; that is to say, we encourage loading data at the user's convenience. This may be every n minutes, overnight, or during times of low usage. The data can then be checked for integrity, quality, and so forth by running batch processing jobs offline, again at the user's discretion.

Data Lake

A data lake is a convenient, ubiquitous store of data. It is useful because it provides a number of key benefits, primarily:

  • Reliable storage
  • Scalable data processing capability

Let's take a brief look at each of these.

Reliable storage

There is a good choice of underlying storage implementations for a data lake; these include the Hadoop Distributed File System (HDFS), MapR-FS, and Amazon S3.

Throughout the book, HDFS will be the assumed storage implementation. The authors also use a distributed Spark setup, deployed on Yet Another Resource Negotiator (YARN) running inside a Hortonworks HDP environment; therefore, HDFS is the technology used, unless otherwise stated. If you are not familiar with any of these technologies, they are discussed later in this chapter.

In any case, it's worth knowing that Spark references HDFS locations natively, accesses local file locations via the file:// prefix, and references S3 locations via the s3a:// prefix.
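
As a brief sketch (the paths here are purely illustrative), the same call works against each storage layer; only the URI scheme changes:

// Hypothetical paths; the URI scheme selects the filesystem implementation
val fromHdfs  = spark.sparkContext.textFile("hdfs:///data/events/events.txt")
val fromLocal = spark.sparkContext.textFile("file:///tmp/events.txt")
val fromS3    = spark.sparkContext.textFile("s3a://my-bucket/events/events.txt")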

Scalable data processing capability

Clearly, Apache Spark will be our data processing platform of choice. In addition, as you may recall, Spark allows the user to execute code in their preferred environment, be that local, standalone, YARN, or Mesos, by configuring the appropriate cluster manager through the master URL. Incidentally, this can be done in any one of three places (a minimal sketch of the third option follows the list):

  • Using the --master option when issuing the spark-submit command
  • Adding the spark.master property in the conf/spark-defaults.conf file
  • Invoking the setMaster method on the SparkConf object
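
The following is a minimal sketch of the third option; the application name is a placeholder and the master value shown is only one of several valid choices:

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to passing --master on the command line; "local[*]" is used here so the
// snippet runs anywhere, but "yarn", "spark://host:7077", or "mesos://host:5050" are
// equally valid depending on your cluster manager
val conf = new SparkConf()
  .setAppName("cluster-manager-example")
  .setMaster("local[*]")
val sc = new SparkContext(conf)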

If you're not familiar with HDFS, or if you do not have access to a cluster, then you can run a local Spark instance using the local filesystem, which is useful for testing. However, beware that there are often bad behaviors that only appear when executing on a cluster. So, if you're serious about Spark, it's worth investing in a distributed setup; why not try Spark standalone cluster mode, or Amazon EMR? For example, Amazon offers a number of affordable paths to cloud computing; you can explore the idea of spot instances at https://aws.amazon.com/ec2/spot/.

Data science platform

A data science platform provides services and APIs that enable effective data science to take place, including exploratory data analysis, machine learning model creation and refinement, image and audio processing, natural language processing, and text sentiment analysis.

This is the area where Spark really excels and forms the primary focus of the remainder of this book, exploiting a robust set of native machine learning libraries, unsurpassed parallel graph processing capabilities and a strong community. Spark provides truly scalable opportunities for data science.

The remaining chapters will provide insight into each of these areas, including Chapter 6, Scraping Link-Based External Data, Chapter 7, Building Communities, and Chapter 8, Building a Recommendation System.

Data Access

Data in a data lake is most frequently accessed by data engineers and scientists using the Hadoop ecosystem tools, such as Apache Spark, Pig, Hive, Impala, or Drill. However, there are times when other users, or even other systems, need access to the data and the normal tools are either too technical or do not meet the demanding expectations of the user in terms of real-world latency.

In these circumstances, the data often needs to be copied into data marts or index stores so that it may be exposed to more traditional methods, such as a report or dashboard. This process, which typically involves creating indexes and restructuring data for low-latency access, is known as data egress.

Fortunately, Apache Spark has a wide variety of adapters and connectors into traditional databases, BI tools, and visualization and reporting software. Many of these will be introduced throughout the book.

Data technologies

When Hadoop first started, the word Hadoop referred to the combination of HDFS and the MapReduce processing paradigm, as that was the outline of the original paper (http://research.google.com/archive/mapreduce.html). Since that time, a plethora of technologies have emerged to complement Hadoop, and with the development of Apache YARN we now see other processing paradigms, such as Spark, emerge.

Hadoop is now often used as a colloquialism for the entire big data software stack, so it would be prudent at this point to define the scope of that stack for this book. The typical data architecture, with a selection of the technologies we will visit throughout the book, is detailed as follows:

[Figure: Data technologies]

The relationships between these technologies form a dense topic, as there are complex interdependencies; for example, Spark depends on GeoMesa, which depends on Accumulo, which depends on Zookeeper and HDFS! Therefore, in order to manage these relationships, there are platforms available, such as Cloudera or Hortonworks HDP (http://hortonworks.com/products/sandbox/), which provide consolidated user interfaces and centralized configuration. The choice of platform is that of the reader; however, it is not recommended to install a few of the technologies independently at first and then move to a managed platform, as the version problems encountered will be very complex. It is therefore usually easier to start with a clean machine and make an upfront decision as to which direction to take.

All of the software we use in this book is platform-agnostic and therefore fits into the general architecture described earlier. It can be installed independently and is relatively straightforward to use in single- or multiple-server environments without a managed product.

The role of Apache Spark

In many ways, Apache Spark is the glue that holds these components together. It increasingly represents the hub of the software stack. It integrates with a wide variety of components but none of them are hard-wired. Indeed, even the underlying storage mechanism can be swapped out. Combining this feature with the ability to leverage different processing frameworks means the original Hadoop technologies effectively become components, rather than an imposing framework. The logical diagram of our architecture appears as follows:

[Figure: The role of Apache Spark]

As Spark has gained momentum and wide-scale industry acceptance, many of the original Hadoop implementations for various components have been refactored for Spark. Thus, to add further complexity to the picture, there are often several possible ways to programmatically leverage any particular component; not least the imperative and declarative versions depending upon whether an API has been ported from the original Hadoop Java implementation. We have attempted to remain as true as possible to the Spark ethos throughout the remaining chapters.

Companion tools

Now that we have established a technology stack to use, let's describe each of the components and explain why they are useful in a Spark environment. This part of the book is designed as a reference rather than a straight read; if you're familiar with most of the technologies, then you can refresh your knowledge and continue to the next chapter, Chapter 2, Data Acquisition.

Apache HDFS

The Hadoop Distributed File System (HDFS) is a distributed filesystem with built-in redundancy. It is optimized to work on three or more nodes by default (although one will work fine and the limit can be increased), which provides the ability to store data in replicated blocks. So not only is a file split into a number of blocks, but three copies of those blocks exist at any one time. This cleverly provides data redundancy (if one copy is lost, two others still exist), but also provides data locality. When a distributed job is run against HDFS, the system will not only attempt to gather all of the blocks required as input to that job, it will also attempt to use only the blocks that are physically close to the server running that job; it can therefore reduce network bandwidth usage by reading only the blocks on its local storage, or those on nodes close to itself. This is achieved in practice by allocating HDFS physical disks to nodes, and nodes to racks; blocks are written in a node-local, rack-local, and cluster-local manner. All instructions to HDFS are passed through a central server called the NameNode, which provides a possible central point of failure; there are various methods for providing NameNode redundancy.

Furthermore, in a multi-tenanted HDFS scenario, where many processes access the same file at the same time, load balancing can also be achieved through the use of multiple blocks; for example, if a file takes up one block, this block is replicated three times and can therefore potentially be read from three different physical locations concurrently. Although this may not seem like a big win, on clusters of hundreds or thousands of nodes network IO is often the single most limiting factor for a running job; the authors have certainly experienced times on multi-thousand-node clusters when jobs have had to wait hours to complete purely because the network bandwidth was maxed out by the large number of other threads calling for data.

If you are running a laptop, require data to be stored locally, or wish to use the hardware you already have, then HDFS is a good option.

Advantages

The following are the advantages of using HDFS:

  • Redundancy: Configurable replication of blocks provides tolerance for node and disk failure
  • Load balancing: Block replication means the same data can be accessed from different physical locations
  • Data locality: Analytics try to access the closest relevant physical block, reducing network IO
  • Data balance: An algorithm is available to rebalance data blocks when they become too clustered or fragmented
  • Flexible storage: If more space is needed, further disks and nodes can be added; however, this is not a hot process, so the cluster will require an outage to add these resources
  • No additional costs: No third-party costs are involved
  • Data encryption: Implicit encryption (when turned on)

Disadvantages

The following are the disadvantages:

  • The NameNode presents a central point of failure; to mitigate this, there are secondary and high-availability options available
  • A cluster requires basic administration and potentially some hardware effort

Installation

To use HDFS, we should decide whether to run Hadoop in local, pseudo-distributed, or fully distributed mode; for a single server, pseudo-distributed is useful, as analytics should translate directly from this machine to any Hadoop cluster. In any case, we should install Hadoop with at least the following components:

  • NameNode
  • Secondary NameNode (or High Availability NameNode)
  • DataNode

Hadoop can be downloaded from http://hadoop.apache.org/releases.html.

Spark needs to know the location of the Hadoop configuration, specifically the files hdfs-site.xml and core-site.xml. The directory containing them is then exposed via the HADOOP_CONF_DIR environment variable in your Spark configuration.

HDFS will then be available natively, so a file stored in HDFS at /user/local/dir/text.txt can be addressed in Spark simply as /user/local/dir/text.txt.
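
As a brief, hedged sketch (the Hadoop configuration directory below is an assumption that varies by distribution), the wiring looks like this:

// In conf/spark-env.sh, before launching Spark (path is an assumption, typical of HDP):
//   export HADOOP_CONF_DIR=/etc/hadoop/conf

// With HADOOP_CONF_DIR set, bare paths resolve against HDFS as the default filesystem
val text = spark.sparkContext.textFile("/user/local/dir/text.txt")
println(text.count())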

Amazon S3

S3 abstracts away the issues related to parallelism, storage restrictions, and security, allowing very large parallel read/write operations along with a great Service Level Agreement (SLA) for a very small cost. This is perfect if you need to get up and running quickly, can't store data locally, or don't know what your future storage requirements might be. It should be recognized that s3n and s3a utilize an object storage model, not file storage, and therefore there are some compromises:

  • Eventual consistency is where changes made by one application (creation, updates, and deletions) will not be visible until some undefined time, although most AWS regions now support read-after-write consistency.
  • s3n and s3a utilize nonatomic rename and delete operations; therefore, renaming or deleting large directories takes time proportional to the number of entries. However, target files can remain visible to other processes during this time, and indeed, until the eventual consistency has been resolved.

S3 can be accessed through command-line tools (such as s3cmd), via a web interface, and via APIs for most popular languages; it has native integration with Hadoop and Spark through some basic configuration.

Advantages

The following are the advantages:

  • Infinite storage capacity
  • No hardware considerations
  • Encryption available (user stored keys)
  • 99.9% availability
  • Redundancy

Disadvantages

The following are the disadvantages:

  • Cost to store and transfer data
  • No data locality
  • Eventual consistency
  • Relatively high latency

Installation

You can create an AWS account at https://aws.amazon.com/free/. Through this account, you will have access to S3 and will simply need to create some credentials.

The current S3 standard is s3a; to use it through Spark requires some changes to the Spark configuration:

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem 
spark.hadoop.fs.s3a.access.key=MyAccessKeyID 
spark.hadoop.fs.s3a.secret.key=MySecretKey

If using HDP, you may also need:

spark.driver.extraClassPath=${HADOOP_HOME}/extlib/hadoop-aws-currentversion.jar:${HADOOP_HOME}/ext/aws-java-sdk-1.7.4.jar

All S3 files will then be accessible within Spark using the prefix s3a:// to the S3 object reference:

val rdd = spark.sparkContext.textFile("s3a://user/dir/text.txt") 

We can also use the AWS credentials inline assuming that we have set spark.hadoop.fs.s3a.impl:

spark.sparkContext.textFile("s3a://AccessID:SecretKey@user/dir/file") 

However, this method will not accept the forward-slash character / in either of the keys. This is usually solved by obtaining another key from AWS (keep generating a new one until there are no forward-slashes present).
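
As an alternative, hedged sketch (not required by the book's setup), the same s3a credentials can be supplied programmatically on the Hadoop configuration at runtime instead of in spark-defaults.conf; the key values and bucket name are placeholders:

// Set the s3a credentials on the underlying Hadoop configuration (values are placeholders)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "MyAccessKeyID")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "MySecretKey")

val lines = spark.sparkContext.textFile("s3a://my-bucket/dir/text.txt")
println(lines.count())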

We can also browse the objects through the web interface located under the S3 tab in your AWS account.

Apache Kafka

Apache Kafka is a distributed message broker written in Scala and available under the Apache Software Foundation license. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The result is essentially a massively scalable publish-subscribe message queue, making it highly valuable for enterprise infrastructures that process streaming data.

Advantages

The following are the advantages:

  • Publish-subscribe messaging
  • Fault-tolerant
  • Guaranteed delivery
  • Replay messages on failure
  • Highly-scalable, shared-nothing architecture
  • Supports back pressure
  • Low latency
  • Good Spark-streaming integration
  • Simple for clients to implement

Disadvantages

The following are the disadvantages:

  • At-least-once semantics: exactly-once messaging cannot be provided due to the lack of a transaction manager (as yet)
  • Requires Zookeeper for operation

Installation

As Kafka is a pub-sub tool, its purpose is to accept messages from producers (publishers) and direct them to the relevant endpoints (subscribers). This is done using a broker, which is installed when implementing Kafka. Kafka is available through the Hortonworks HDP platform, or can be installed independently from http://kafka.apache.org/downloads.html.

Kafka uses Zookeeper to manage leader election (as Kafka can be distributed, thus allowing for redundancy). The quick-start guide found at the preceding link can be used to set up a single-node Zookeeper instance; it also provides a producer and a consumer client for publishing and subscribing to topics, which provide the mechanism for message handling.
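
To give a flavor of the Spark Streaming integration mentioned above, the following is a minimal, hedged sketch of consuming a Kafka topic using the spark-streaming-kafka-0-10 connector; the broker address, group id, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("kafka-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Broker, group id, and topic are illustrative placeholders
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "sketch-consumer",
  "auto.offset.reset" -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("news-feed"), kafkaParams))

// Print the number of messages received in each micro-batch
stream.map(_.value).count().print()

ssc.start()
ssc.awaitTermination()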

Apache Parquet

Since the inception of Hadoop, the idea of columnar file formats (as opposed to row-based ones) has been gaining increasing support. Parquet has been developed to take advantage of compressed, efficient columnar data representation and is designed with complex nested data structures in mind, taking its lead from the algorithms discussed in Google's Dremel paper (http://research.google.com/pubs/pub36632.html). Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow more encodings to be added as they are implemented. It has also been designed to provide compatibility throughout the Hadoop ecosystem and, like Avro, stores the data schema with the data itself.

Advantages

The following are the advantages:

  • Columnar storage
  • Highly storage efficient
  • Per column compression
  • Supports predicate pushdown
  • Supports column pruning
  • Compatible with other formats, for example, Avro
  • Read efficient, designed for partial data retrieval

Disadvantages

The following are the disadvantages:

  • Not good for random access
  • Potentially computationally intensive for writes

Installation

Parquet is natively available in Spark and can be accessed directly as follows:

// Requires an active SparkSession called spark and its implicits in scope
import spark.implicits._

val ds = Seq(1, 2, 3, 4, 5).toDS
ds.write.parquet("/data/numbers.parquet")
val fromParquet = spark.read.parquet("/data/numbers.parquet")
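
To illustrate the column pruning and predicate pushdown advantages listed above, here is a small, hedged sketch using a multi-column Dataset (continuing from the previous snippet, with spark.implicits._ in scope; the case class and paths are illustrative):

// A hypothetical schema with several columns, so that pruning has a visible effect
case class Reading(sensor: String, value: Double, ts: Long)

val readings = Seq(Reading("a", 1.5, 1L), Reading("b", 7.2, 2L)).toDS
readings.write.mode("overwrite").parquet("/data/readings.parquet")

// Only the 'value' column needs to be read from disk (column pruning), and the
// filter can be pushed down to the Parquet reader (predicate pushdown)
val hot = spark.read.parquet("/data/readings.parquet")
  .select("value")
  .filter("value > 5.0")
hot.explain()   // the physical plan should show the pushed-down filter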

Apache Avro

Apache Avro is a data serialization framework originally developed for Hadoop. It uses JSON for defining data types and protocols (although there is an alternative IDL) and serializes data in a compact binary format. Avro provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes, as well as from client programs to the Hadoop services. Another useful feature is its ability to store the data schema along with the data itself, so any Avro file can always be read without referencing external sources. Further, Avro supports schema evolution and therefore backward compatibility: Avro files written with an older schema version can be read with a newer schema version.

Advantages

The following are the advantages:

  • Schema evolution
  • Disk space savings
  • Supports schemas in JSON and IDL
  • Supports many languages
  • Supports compression

Disadvantages

The following are the disadvantages:

  • Requires schema to read and write data
  • Serialization computationally heavy

Installation

As we are using Scala, Spark, and Maven environments in this book, Avro can be imported as follows:

<dependency>   
   <groupId>org.apache.avro</groupId>   
   <artifactId>avro</artifactId>   
   <version>1.7.7</version> 
</dependency> 

It is then a matter of creating a schema and producing the Scala code to write data to Avro using the schema. This is explained in detail in Chapter 3, Input Formats and Schema.
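
As a small, hedged preview of that chapter (the schema, field names, and output path here are purely illustrative), writing a record with Avro's generic API looks roughly like this:

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// A hypothetical two-field schema defined in JSON
val schemaJson =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "body", "type": "string"}
    |]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

// Build a single record conforming to the schema
val record: GenericRecord = new GenericData.Record(schema)
record.put("id", 1L)
record.put("body", "hello avro")

// Write it to an Avro container file on the local filesystem (path is illustrative)
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("/tmp/events.avro"))
writer.append(record)
writer.close()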

Apache NiFi

Apache NiFi originated within the United States National Security Agency (NSA) and was released to open source in 2014 as part of its Technology Transfer Program. NiFi enables the production of scalable directed graphs of data routing and transformation within a simple user interface. It also supports data provenance, a wide range of prebuilt processors, and the ability to build new processors quickly and efficiently. It has prioritization, tunable delivery tolerance, and back-pressure features included, which allow the user to tune processors and pipelines for specific requirements, even allowing flow modification at runtime. All of this adds up to an incredibly flexible tool for building everything from one-off file-download data flows through to enterprise-grade ETL pipelines. It is often quicker to build a pipeline and download files with NiFi than it is to write a quick bash script; add in the feature-rich processors available for such tasks and it makes for a compelling proposition.

Advantages

The following are the advantages:

  • Wide range of processors
  • Hub and spoke architecture
  • Graphical User Interface (GUI)
  • Scalable
  • Simplifies parallel processing
  • Simplifies thread handling
  • Allows runtime modifications
  • Redundancy through clusters

Disadvantages

The following are the disadvantages:

  • No cross-cutting error handler
  • Expression language is only partially implemented
  • Flowfile version management lacking

Installation

Apache NiFi can be installed as part of the Hortonworks stack, where it is known as Hortonworks DataFlow. It is also available as a standalone install from Apache at https://nifi.apache.org/. There is an introduction to NiFi in Chapter 2, Data Acquisition.

Apache YARN

YARN is the principal component of Hadoop 2.0 that essentially allows Hadoop to plug in processing paradigms rather than being limited to just the original MapReduce. YARN consists of three main components: the ResourceManager, the NodeManager, and the per-application ApplicationMaster. It is out of the scope of this book to dive into YARN in detail; the main thing to understand is that if we are running a Hadoop cluster, then our Spark jobs can be executed on YARN in client mode, as follows:

spark-submit --class package.Class \
             --master yarn \
             --deploy-mode client [options] <app jar> [app options]

Advantages

The following are the advantages:

  • Supports Spark
  • Supports prioritized scheduling
  • Supports data locality
  • Job history archive
  • Works out of the box with HDP

Disadvantages

The following are the disadvantages:

  • No CPU resource control
  • No support for data lineage

Installation

YARN is installed as part of Hadoop; this could either be Hortonworks HDP, Apache Hadoop, or one of the other vendors. In any case, we should install Hadoop with at least the following components:

  • ResourceManager
  • NodeManager (1 or more)

To ensure that Spark can use YARN, it simply needs to know the location of yarn-site.xml, which is set using the YARN_CONF_DIR environment variable in your Spark configuration.

Apache Lucene

Lucene is an indexing and search library originally built in Java, but since ported to several other languages, including Python. Lucene has spawned a number of subprojects in its time, including Mahout, Nutch, and Tika. These have now become top-level Apache projects in their own right, while Solr has more recently joined as a subproject. Lucene has comprehensive capabilities, but is particularly known for its use in question-answering search engines and information-retrieval systems.

Advantages

The following are the advantages:

  • Highly efficient full-text searches
  • Scalable
  • Multilanguage support
  • Excellent out-of-the-box functionality

Disadvantages

The disadvantage is that databases are generally better suited to relational operations.

Installation

Lucene can be downloaded from https://lucene.apache.org/ if you wish to learn more and interact with the library directly.

When utilizing Lucene, we only really need to include lucene-core-<version>.jar in our project. For example, when using Maven:

<dependency> 
    <groupId>org.apache.lucene</groupId> 
    <artifactId>lucene-core</artifactId> 
    <version>6.1.0</version> 
</dependency> 
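
As a minimal, hedged sketch of indexing and searching with the library (this additionally assumes the lucene-analyzers-common and lucene-queryparser artifacts are on the classpath; the field name and text are illustrative):

import java.nio.file.Files
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, TextField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.FSDirectory

val dir = FSDirectory.open(Files.createTempDirectory("lucene-sketch"))
val analyzer = new StandardAnalyzer()

// Index a single document containing one full-text field
val writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))
val doc = new Document()
doc.add(new TextField("body", "apache spark for data science", Field.Store.YES))
writer.addDocument(doc)
writer.close()

// Search the field we just indexed
val searcher = new IndexSearcher(DirectoryReader.open(dir))
val hits = searcher.search(new QueryParser("body", analyzer).parse("spark"), 10)
println(s"hits: ${hits.totalHits}")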

Kibana

Kibana is an analytics and visualization platform that also provides charting and streaming data summarization. It uses Elasticsearch as its data source (which in turn uses Lucene) and can therefore leverage very powerful search and indexing capabilities at scale. Kibana can be used to visualize data in many different ways, including bar charts, histograms, and maps. We mention Kibana only briefly in this chapter, but it will be used extensively throughout this book.

Advantages

The following are the advantages:

  • Visualize data at scale
  • Intuitive interface to quickly develop dashboards

Disadvantages

The following are the disadvantages:

  • Only integrates with Elasticsearch
  • Kibana releases are tied to specific Elasticsearch versions

Installation

Kibana can easily be installed as a standalone application since it has its own web server. It can be downloaded from https://www.elastic.co/downloads/kibana. As Kibana requires Elasticsearch, this will also need to be installed; see the preceding link for more information. The Kibana configuration is handled in config/kibana.yml; if you have installed a standalone version of Elasticsearch, then no changes are required and it will work out of the box!

Elasticsearch

Elasticsearch is a web-based search engine based on Lucene (see the previous section). It provides a distributed, multitenant-capable, full-text search engine with schema-free JSON documents. It is built in Java but can be utilized from any language thanks to its HTTP web interface. This makes it particularly useful for transactions and/or data-intensive instructions that are to be displayed via web pages.

Advantages

The advantages are as follows:

  • Distributed
  • Schema free
  • HTTP interface

Disadvantages

The disadvantages are as follows:

  • Unable to perform distributed transactions
  • Lack of frontend tooling

Installation

Elasticsearch can be installed from https://www.elastic.co/downloads/elasticsearch. To access Elasticsearch from Spark (which communicates with the cluster via its REST API), we can import the Maven dependency:

<dependency> 
    <groupId>org.elasticsearch</groupId> 
    <artifactId>elasticsearch-spark_2.10</artifactId> 
    <version>2.2.0-m1</version> 
</dependency> 
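
With that dependency on the classpath, a minimal, hedged sketch of writing an RDD to an index looks like the following; the node address and the index/type names are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs to RDDs

// Node address and index/type names are illustrative
val conf = new SparkConf()
  .setAppName("es-sketch")
  .setMaster("local[*]")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

val docs = sc.makeRDD(Seq(
  Map("title" -> "first doc", "views" -> 10),
  Map("title" -> "second doc", "views" -> 3)
))
docs.saveToEs("articles/doc")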

There is also a great tool to help with administering Elasticsearch content: the Chrome extension Sense, available at https://chrome.google.com/webstore/category/extensions, with a further explanation at https://www.elastic.co/blog/found-sense-a-cool-json-aware-interface-to-elasticsearch. Alternatively, it is available for Kibana at https://www.elastic.co/guide/en/sense/current/installing.html.

Accumulo

Accumulo is a NoSQL database based on Google's Bigtable design and was originally developed by the United States National Security Agency, subsequently being released to the Apache community in 2011. Accumulo offers us the usual big data advantages, such as bulk loading and parallel reading, but also has some additional capabilities: iterators for efficient server-side and client-side pre-computation, data aggregation, and, most importantly, cell-level security. The security aspect of Accumulo makes it very useful for enterprise usage, as it enables flexible security in a multitenant environment. Accumulo is powered by Apache Zookeeper, in the same way as Kafka, and also leverages Apache Thrift (https://thrift.apache.org/), which enables a cross-language Remote Procedure Call (RPC) capability.
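
To make the cell-level security idea concrete, here is a minimal, hedged sketch using the Accumulo client API; the instance name, ZooKeeper quorum, credentials, table name, and visibility label are all placeholders:

import org.apache.accumulo.core.client.{BatchWriterConfig, ZooKeeperInstance}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.Mutation
import org.apache.accumulo.core.security.ColumnVisibility

// Connection details are illustrative
val instance = new ZooKeeperInstance("accumulo", "localhost:2181")
val connector = instance.getConnector("root", new PasswordToken("secret"))

if (!connector.tableOperations().exists("events")) {
  connector.tableOperations().create("events")
}

// Write a single cell protected by a visibility expression; only scans whose
// authorizations satisfy "analyst" will ever return this value
val writer = connector.createBatchWriter("events", new BatchWriterConfig())
val mutation = new Mutation("row-001")
mutation.put("meta", "source", new ColumnVisibility("analyst"), "newsfeed")
writer.addMutation(mutation)
writer.close()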

Advantages

The advantages are as follows:

  • Pure implementation of Google Bigtable
  • Cell level security
  • Scalable
  • Redundancy
  • Provides iterators for server-side computation

Disadvantages

The disadvantages are as follows:

  • Zookeeper not universally popular with DevOps
  • Not always the most efficient choice for bulk relational operations

Installation

Accumulo can be installed as part of the Hortonworks HDP release, or as a standalone instance from https://accumulo.apache.org/. The instance should then be configured using the installation documentation, found at the time of writing at https://accumulo.apache.org/1.7/accumulo_user_manual#_installation.

In Chapter 7, Building Communities, we demonstrate the use of Accumulo with Spark, along with some of its more advanced features, such as iterators and InputFormats. We also show how to move data between Elasticsearch and Accumulo.

Summary

In this chapter, we introduced the idea of data architecture and explained how to group responsibilities into capabilities that help manage data throughout its lifecycle. We explained that all data handling requires a level of due diligence, whether this is enforced by corporate rules or otherwise, and without this, analytics and their results can quickly become invalid.

Having scoped our data architecture, we have walked through the individual components and their respective advantages/disadvantages, explaining that our choices are based upon collective experience. Indeed, there are always options when it comes to choosing components and their individual features should always be carefully considered before any commitment.

In the next chapter, we will dive deeper into how to source and capture data. We will advise on how to bring data onto the platform and discuss aspects related to processing and handling data through a pipeline.

Key benefits

  • Develop and apply advanced analytical techniques with Spark
  • Learn how to tell a compelling story with data science using Spark’s ecosystem
  • Explore data at scale and work with cutting edge data science methods

Description

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Who is this book for?

This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting edge techniques. This book assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof of concept studies and built prototypes.

What you will learn

  • Learn the design patterns that integrate Spark into industrialized data science pipelines
  • See how commercial data scientists design scalable code and reusable code for data science services
  • Explore cutting-edge data science methods so that you can study trends and causality
  • Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
  • Find out how Spark can be used as a universal ingestion engine tool and as a web scraper
  • Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
  • Get to know the best practices when performing Extended Exploratory Data Analysis, commonly used in commercial data science teams
  • Study advanced Spark concepts, solution design patterns, and integration architectures
  • Demonstrate powerful data science pipelines

Product Details

Publication date: Mar 29, 2017
Length: 560 pages
Edition: 1st
Language: English
ISBN-13: 9781785882142
Vendor: Apache

Table of Contents

14 Chapters
1. The Big Data Science Ecosystem
2. Data Acquisition
3. Input Formats and Schema
4. Exploratory Data Analysis
5. Spark for Geographic Analysis
6. Scraping Link-Based External Data
7. Building Communities
8. Building a Recommendation System
9. News Dictionary and Real-Time Tagging System
10. Story De-duplication and Mutation
11. Anomaly Detection on Sentiment Analysis
12. TrendCalculus
13. Secure Data
14. Scalable Algorithms

Customer reviews

Rating distribution: 4 out of 5 (2 ratings)
5 star: 50% | 4 star: 0% | 3 star: 50% | 2 star: 0% | 1 star: 0%
Sumit Pal, May 25, 2017, 5 stars (Amazon verified review):
This book is for an intermediate to an expert level of knowledge on Spark, algorithms, and data science in general. Each of the authors of the book are experts and highly accomplished craftsmen in their respective fields. The in-depth coverage in the book in terms of coverage, depth, variety of algorithms and the pure fun, elegance of working with Spark and Scala code leaves nothing more to be desired from a book of this calibre. The code is well written and tested, and the explanations of the reasoning behind the code, why it is used and its appropriate usage as per the algorithm, make the book highly readable. I have read numerous books on Spark for data processing, streaming and machine learning, and this one stands out in terms of its organization and approach to solving problems in the data science space. I highly recommend the book. I have read the book two times (while doing technical reviewing, as I was the technical reviewer of the book) and again after it was published. I am hooked to reading it again. This book will not teach you Spark in terms of its basics, deployments, or performance tuning.
Amanda, Jan 12, 2018, 3 stars (Amazon verified review):
There is definitely a market for data science books that are aimed at intermediate/advanced users, and there is certainly a wealth of information contained within these pages. The examples were interesting enough to keep me engaged. There is the usual poor Packt editing and there were a few spelling mistakes to annoy the pedants among us. A word of caution though - don't buy this book thinking it will teach you how to use Kafka, Avro, NiFi, or Accumulo - you will need to be well versed in how to use these products and link them, as well as the usual Hadoop, Spark, and Scala, if you want to code the examples.