Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Machine Learning with Apache Spark Quick Start Guide

You're reading from   Machine Learning with Apache Spark Quick Start Guide Uncover patterns, derive actionable insights, and learn from big data using MLlib

Arrow left icon
Product type Paperback
Published in Dec 2018
Publisher Packt
ISBN-13 9781789346565
Length 240 pages
Edition 1st Edition
Languages
Concepts
Arrow right icon
Author (1):
Arrow left icon
Jillur Quddus Jillur Quddus
Author Profile Icon Jillur Quddus
Jillur Quddus
Arrow right icon
View More author details
Toc

A brief history of data

If you worked in the mainstream IT industry between the 1970s and early 2000s, it is likely that your organization's data was held either in text-based delimited files, spreadsheets, or nicely structured relational databases. In the case of the latter, data is modeled and persisted in pre-defined, and possibly related, tables representing the various entities found within your organization's data model, for example, according to employee or department. These tables contain rows of data across multiple columns representing the various attributes making up that entity; for example, in the case of employee, typical attributes include first name, last name, and date of birth.

Vertical scaling

As both your organization's data estate and the number of users requiring access to that data grew, high-performance remote servers would have been utilized, with access provisioned over the corporate network. These remote servers would typically either act as remote filesystems for file sharing or host relational database management systems (RDBMSes) in order to store and manage relational databases. As data requirements grew, these remote servers would have needed to scale vertically, meaning that additional CPU, memory, and/or hard disk space would have been installed. Typically, these relational databases would have stored anything between hundreds and potentially tens of millions of records.

Master/slave architecture

As a means of providing resilience and load balancing read requests, potentially, a master/slave architecture would have been employed whereby data is automatically copied from the master database server to physically distinct slave database server(s) utilizing near real-time replication. This technique requires that the master server be responsible for all write requests, while read requests could be offloaded and load balanced across the slaves, where each slave would hold a full copy of the master data. That way, if the master server ever failed for some reason, business-critical read requests could still be processed by the slaves while the master was being brought back online. This technique does have a couple of major disadvantages, however:

  • Scalability: The master server, by being solely responsible for processing write requests, limits the ability for the system to be scalable as it could quickly become a bottleneck.
  • Consistency and data loss: Since replication is near real-time, it is not guaranteed that the slaves would have the latest data at the point in time that the master server goes offline and transactions may be lost. Depending on the business application, either not having the latest data or losing data may be unacceptable.

Sharding

To increase throughput and overall performance, and as single machines reached their capacity to scale vertically in a cost-effective manner, it is possible that sharding would have been employed. This is one method of horizontal scaling whereby additional servers are provisioned and data is physically split over separate database instances residing on each of the machines in the cluster, as illustrated in Figure 1.1.

This approach would have allowed organizations to scale linearly to cater for increased data sizes while reusing existing database technologies and commodity hardware, thereby optimizing costs and performance for small- to medium-sized databases.

Crucially, however, these separate databases are standalone instances and have no knowledge of one another. Therefore, some sort of broker would be required that, based on a partitioning strategy, would keep track of where data was being written to for each write request and, thereafter, retrieve data from that same location for read requests. Sharding subsequently introduced further challenges, such as processing data queries, transformations, and joins that spanned multiple standalone database instances across multiple servers (without denormalizing data), thereby maintaining referential integrity and the repartitioning of data:

Figure 1.1: A simple sharding partitioning strategy

Data processing and analysis

Finally, in order to transform, process, and analyze the data sitting in these delimited text-based files, spreadsheets or relational databases, typically an analyst, data engineer or software engineer would have written some code.

This code, for example, could take the form of formulas or Visual Basic for Applications (VBA) for spreadsheets, or Structured Query Language (SQL) for relational databases, and would be used for the following purposes:

  • Loading data, including batch loading and data migration
  • Transforming data, including data cleansing, joins, merges, enrichment, and validation
  • Standard statistical aggregations, including computing averages, counts, totals, and pivot tables
  • Reporting, including graphs, charts, tables, and dashboards

To perform more complex statistical calculations, such as generating predictive models, advanced analysts could utilize more advanced programming languages, including Python, R, SAS, or even Java.

Crucially, however, this data transformation, processing, and analysis would have either been executed directly on the server in which the data was persisted (for example, SQL statements executed directly on the relational database server in competition with other business-as-usual read and write requests), or data would be moved over the network via a programmatic query (for example, an ODBC or JDBC connection), or via flat files (for example, CSV or XML files) to another remote analytical processing server. The code could then be executed on that data, assuming, of course, that the remote processing server had sufficient CPUs, memory and/or disk space in its single machine to execute the job in question. In other words, the data would have been moved to the code in some way or another.

Data becomes big

Fast forward to today—spreadsheets are still commonplace, and relational databases containing nicely structured data, whether partitioned across shards or not, are still very much relevant and extremely useful. In fact, depending on the use case, the data volumes, structure, and the computational complexity of the required processing, it could still be faster and more efficient to store and manage data via an RDBMS and process that data directly on the remote database server using SQL. And, of course, spreadsheets are still great for very small datasets and for simple statistical aggregations. What has changed, however, since the 1970s is the availability of more powerful and more cost-effective technology coupled with the introduction of the internet!

The internet has transformed the very essence of what we mean by data. Whereas before, data was thought of as text and numbers confined to spreadsheets or relational databases, it is now an organic and evolving asset in its own right being created and consumed on a mass scale by anyone that owns a smartphone, TV, or bank account. Data is being created every second around the world in virtually any format you can think of, from social media posts, images, videos, audio, and music to blog posts, online forums, articles, computer log files, and financial transactions. All of this structured, semi-structured, and unstructured data being created in both batche and real time can no longer be stored and managed by nicely organized, text-based delimited files, spreadsheets, or relational databases, nor can it all be physically moved to a remote processing server every time some analytical code is to be executed—a new breed of technology is required.

You have been reading a chapter from
Machine Learning with Apache Spark Quick Start Guide
Published in: Dec 2018
Publisher: Packt
ISBN-13: 9781789346565
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image