Hadoop Beginner's Guide

Hadoop Beginner's Guide: Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial.


Chapter 1. What It's All About

This book is about Hadoop, an open source framework for large-scale data processing. Before we get into the details of the technology and its use in later chapters, it is important to spend a little time exploring the trends that led to Hadoop's creation and its enormous success.

Hadoop was not created in a vacuum; instead, it exists due to the explosion in the amount of data being created and consumed and a shift that sees this data deluge arrive at small startups and not just huge multinationals. At the same time, other trends have changed how software and systems are deployed, using cloud resources alongside or even in preference to more traditional infrastructures.

This chapter will explore some of these trends and explain in detail the specific problems Hadoop seeks to solve and the drivers that shaped its design.

In the rest of this chapter we shall:

  • Learn about the big data revolution

  • Understand what Hadoop is and how it can extract value from data

  • Look into cloud computing and understand what Amazon Web Services provides

  • See how powerful the combination of big data processing and cloud computing can be

  • Get an overview of the topics covered in the rest of this book

So let's get on with it!

Big data processing


Look around at the technology we have today, and it's easy to come to the conclusion that it's all about data. As consumers, we have an increasing appetite for rich media, both in terms of the movies we watch and the pictures and videos we create and upload. We also, often without thinking, leave a trail of data across the Web as we perform the actions of our daily lives.

Not only is the amount of data being generated increasing, but the rate of increase is also accelerating. From emails to Facebook posts, from purchase histories to web links, there are large data sets growing everywhere. The challenge lies in extracting the most valuable aspects of this data; sometimes that means particular data elements, while at other times the focus is on identifying trends and relationships between pieces of data.

There's a subtle change occurring behind the scenes that is all about using data in more and more meaningful ways. Large companies have realized the value in data for some time and have been using it to improve the services they provide to their customers, that is, us. Consider how Google displays advertisements relevant to our web surfing, or how Amazon or Netflix recommend new products or titles that often match well to our tastes and interests.

The value of data

These corporations wouldn't invest in large-scale data processing if it didn't provide a meaningful return on the investment or a competitive advantage. There are several main aspects to big data that should be appreciated:

  • Some questions only give value when asked of sufficiently large data sets. Recommending a movie based on the preferences of another person is, in the absence of other factors, unlikely to be very accurate. Increase the number of people to a hundred and the chances increase slightly. Use the viewing history of ten million other people and the chances of detecting patterns that can be used to give relevant recommendations improve dramatically.

  • Big data tools often enable the processing of data on a larger scale and at a lower cost than previous solutions. As a consequence, it is often possible to perform data processing tasks that were previously prohibitively expensive.

  • The cost of large-scale data processing isn't just about financial expense; latency is also a critical factor. A system may be able to process as much data as is thrown at it, but if the average processing time is measured in weeks, it is likely not useful. Big data tools allow data volumes to be increased while keeping processing time under control, usually by matching the increased data volume with additional hardware.

  • Previous assumptions of what a database should look like or how its data should be structured may need to be revisited to meet the needs of the biggest data problems.

  • In combination with the preceding points, sufficiently large data sets and flexible tools allow previously unimagined questions to be answered.

Historically for the few and not the many

The examples discussed in the previous section have generally been seen in the form of innovations of large search engines and online companies. This is a continuation of a much older trend wherein processing large data sets was an expensive and complex undertaking, out of the reach of small- or medium-sized organizations.

Similarly, the broader approach of data mining has been around for a very long time but has never really been a practical tool outside the largest corporations and government agencies.

This situation may have been regrettable, but most smaller organizations were not at a disadvantage, as they rarely had access to data volumes that would have justified such an investment.

The increase in data is not limited to the big players anymore, however; many small and medium companies—not to mention some individuals—find themselves gathering larger and larger amounts of data that they suspect may have some value they want to unlock.

Before understanding how this can be achieved, it is important to appreciate some of these broader historical trends that have laid the foundations for systems such as Hadoop today.

Classic data processing systems

The fundamental reason that big data mining systems were rare and expensive is that scaling a system to process large data sets is very difficult; as we will see, it has traditionally been limited to the processing power that can be built into a single computer.

There are, however, two broad approaches to scaling a system as the size of the data increases, generally referred to as scale-up and scale-out.

Scale-up

In most enterprises, data processing has typically been performed on impressively large computers with impressively large price tags. As the size of the data grows, the approach is to move to a bigger server or storage array. Effective though this approach can be (and it remains in use today, as we'll describe later in this chapter), the cost of such hardware can easily be measured in hundreds of thousands or even millions of dollars.

The advantage of simple scale-up is that the architecture does not significantly change as the system grows. Though larger components are used, the basic relationship (for example, between database server and storage array) stays the same. For applications such as commercial database engines, the software handles the complexities of utilizing the available hardware, so in theory, increased scale is achieved by migrating the same software onto larger and larger servers. Note, though, that moving software onto ever more processors is never trivial; in addition, there are practical limits on just how big a single host can be, so at some point, scale-up cannot be extended any further.

The promise of a single architecture at any scale is also unrealistic. A scale-up system designed to handle data sets of 1 terabyte, 100 terabytes, or 1 petabyte may conceptually use larger versions of the same components, but the complexity of connecting them can range from cheap commodity parts to custom-engineered hardware as the scale increases.

Early approaches to scale-out

Instead of growing a system onto larger and larger hardware, the scale-out approach spreads the processing onto more and more machines. If the data set doubles, simply use two servers instead of a single double-sized one. If it doubles again, move to four hosts.

The obvious benefit of this approach is that purchase costs remain much lower than for scale-up. Server hardware costs tend to increase sharply with machine size; though a single host may cost $5,000, one with ten times the processing power may cost a hundred times as much. The downside is that we need to develop strategies for splitting our data processing across a fleet of servers, and the tools historically used for this purpose have proven to be complex.

As a consequence, deploying a scale-out solution has required significant engineering effort; the system developer often needs to handcraft the mechanisms for data partitioning and reassembly, not to mention the logic to schedule the work across the cluster and handle individual machine failures.

Limiting factors

These traditional approaches to scale-up and scale-out have not been widely adopted outside large enterprises, government, and academia. The purchase costs are often high, as is the effort to develop and manage the systems. These factors alone put them out of the reach of many smaller businesses. In addition, the approaches themselves have had several weaknesses that have become apparent over time:

  • As scale-out systems get large, or as scale-up systems deal with multiple CPUs, the complexity of managing concurrency becomes a significant difficulty. Effectively utilizing multiple hosts or CPUs is very hard, and implementing the strategies needed to maintain efficiency throughout execution of the desired workloads can entail enormous effort.

  • Hardware advances, often couched in terms of Moore's law, have begun to highlight discrepancies in system capability. CPU power has grown much faster than network or disk speeds; CPU cycles were once the most valuable resource in the system, but today that no longer holds. A modern CPU may execute vastly more operations per second than one from 20 years ago, while memory and, especially, hard disk speeds have improved by much smaller factors. It is quite easy to build a modern system with so much CPU power that the storage system simply cannot feed it data fast enough to keep the CPUs busy.

A different approach

From the preceding scenarios, a number of techniques have emerged that have been used successfully to ease the pain of scaling data processing systems to the sizes required by big data.

All roads lead to scale-out

As just hinted, taking a scale-up approach to scaling is not an open-ended tactic. There is a limit to the size of individual servers that can be purchased from mainstream hardware suppliers, and even more niche players can't offer an arbitrarily large server. At some point, the workload will increase beyond the capacity of the single, monolithic scale-up server, so then what? The unfortunate answer is that the best approach is to have two large servers instead of one. Then, later, three, four, and so on. Or, in other words, the natural tendency of scale-up architecture is—in extreme cases—to add a scale-out strategy to the mix. Though this gives some of the benefits of both approaches, it also compounds the costs and weaknesses; instead of very expensive hardware or the need to manually develop the cross-cluster logic, this hybrid architecture requires both.

As a consequence of this end-game tendency and the general cost profile of scale-up architectures, they are rarely used in the big data processing field and scale-out architectures are the de facto standard.

Tip

If your problem space involves data workloads with strong internal cross-references and a need for transactional integrity, big iron scale-up relational databases are still likely to be a great option.

Share nothing

Anyone with children will have spent considerable time teaching the little ones that it's good to share. This principle does not extend to data processing systems, where the idea applies to both data and hardware.

The conceptual view of a scale-out architecture in particular shows individual hosts, each processing a subset of the overall data set to produce its portion of the final result. Reality is rarely so straightforward. Instead, hosts may need to communicate between each other, or some pieces of data may be required by multiple hosts. These additional dependencies create opportunities for the system to be negatively affected in two ways: bottlenecks and increased risk of failure.

If a piece of data or individual server is required by every calculation in the system, there is a likelihood of contention and delays as the competing clients access the common data or host. If, for example, in a system with 25 hosts there is a single host that must be accessed by all the rest, the overall system performance will be bounded by the capabilities of this key host.

Worse still, if this "hot" server or storage system holding the key data fails, the entire workload will collapse in a heap. Earlier cluster solutions often demonstrated this risk; even though the workload was processed across a farm of servers, they often used a shared storage system to hold all the data.

Instead of sharing resources, the individual components of a system should be as independent as possible, allowing each to proceed regardless of whether others are tied up in complex work or are experiencing failures.

Expect failure

Implicit in the preceding tenets is that more hardware will be thrown at the problem with as much independence as possible. This is only achievable if the system is built with an expectation that individual components will fail, often regularly and with inconvenient timing.

Note

You'll often hear terms such as "five nines" (referring to 99.999 percent uptime or availability). Though this is absolute best-in-class availability, it is important to realize that the overall reliability of a system comprised of many such devices can vary greatly depending on whether the system can tolerate individual component failures.

Assume a server with 99 percent availability and a system that requires all five such hosts to function. The system availability is 0.99*0.99*0.99*0.99*0.99, which equates to roughly 95 percent availability. But if the individual servers are only rated at 95 percent, the system availability drops to a mere 77 percent.

Instead, if you build a system that only needs one of the five hosts to be functional at any given time, the system availability is well into five nines territory. Thinking about system uptime in relation to the criticality of each component can help focus on just what the system availability is likely to be.
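
To make the arithmetic behind these figures explicit, the same example can be written out as follows, assuming individual host failures are independent:

0.99^5 \approx 0.951 \quad (\text{about 95 percent, when all five hosts are required})

0.95^5 \approx 0.774 \quad (\text{about 77 percent, when all five hosts are required})

1 - (1 - 0.99)^5 = 1 - 10^{-10} \approx 0.9999999999 \quad (\text{when any one of the five will do})

A chain of hosts that must all be up is always less available than its weakest member, while a system that can keep running on any one of the five is far beyond five nines.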

Tip

If figures such as 99 percent availability seem a little abstract to you, consider it in terms of how much downtime that would mean in a given time period. For example, 99 percent availability equates to a downtime of just over 3.5 days a year or 7 hours a month. Still sound as good as 99 percent?

This approach of embracing failure is often one of the most difficult aspects of big data systems for newcomers to fully appreciate. This is also where the approach diverges most strongly from scale-up architectures. One of the main reasons for the high cost of large scale-up servers is the amount of effort that goes into mitigating the impact of component failures. Even low-end servers may have redundant power supplies, but in a big iron box, you will see CPUs mounted on cards that connect across multiple backplanes to banks of memory and storage systems. Big iron vendors have often gone to extremes to show how resilient their systems are by doing everything from pulling out parts of the server while it's running to actually shooting a gun at it. But if the system is built in such a way that instead of treating every failure as a crisis to be mitigated it is reduced to irrelevance, a very different architecture emerges.

Smart software, dumb hardware

If we wish to see a cluster of hardware used in as flexible a way as possible, providing hosting to multiple parallel workflows, the answer is to push the smarts into the software and away from the hardware.

In this model, the hardware is treated as a set of resources, and the responsibility for allocating hardware to a particular workload is given to the software layer. This allows hardware to be generic and hence both easier and less expensive to acquire, and the functionality to efficiently use the hardware moves to the software, where the knowledge about effectively performing this task resides.

Move processing, not data

Imagine you have a very large data set, say, 1000 terabytes (that is, 1 petabyte), and you need to perform a set of four operations on every piece of data in the data set. Let's look at different ways of implementing a system to solve this problem.

A traditional big iron scale-up solution would see a massive server attached to an equally impressive storage system, almost certainly using technologies such as fibre channel to maximize storage bandwidth. The system will perform the task but will become I/O-bound; even high-end storage switches have a limit on how fast data can be delivered to the host.

Alternatively, the processing approach of earlier cluster technologies might see a cluster of 1,000 machines, each holding 1 terabyte of data, divided into four quadrants, with each quadrant responsible for performing one of the four operations. The cluster management software would then coordinate the movement of data around the cluster to ensure each piece receives all four processing steps. Since each piece of data can have only one of the steps performed on the host where it resides, it must be streamed to the hosts in the other three quadrants, so we are in effect pushing 3 petabytes of data across the network to perform the processing.

Remembering that processing power has increased much faster than networking or disk technologies, are these really the best ways to address the problem? Recent experience suggests the answer is no, and that an alternative approach is to avoid moving the data and instead move the processing. Use a cluster as just mentioned, but don't segment it into quadrants; instead, have each of the thousand nodes perform all four processing stages on its locally held data. If you're lucky, you'll only have to stream the data from disk once, and the only things travelling across the network will be program binaries and status reports, both of which are dwarfed by the actual data set in question.
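
The arithmetic behind this comparison is worth writing out. With 1,000 hosts each holding 1 terabyte, and the four operations split across four quadrants of the cluster, each piece of data is processed locally for one stage and must travel to the three quadrants that do not hold it for the others:

1000 \times 1\,\text{TB} \times 3\ \text{remote stages} = 3\,\text{PB over the network (move the data)}

\text{program binaries} + \text{status reports} \ll 1\,\text{PB (move the processing)}

In the second case, each host streams its own terabyte from local disk once, and the network carries almost nothing of the data set itself.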

If a 1,000-node cluster sounds ridiculously large, think of some modern server form factors being utilized for big data solutions. These see single hosts with as many as twelve 1- or 2-terabyte disks in each. Because modern processors have multiple cores it is possible to build a 50-node cluster with a petabyte of storage and still have a CPU core dedicated to process the data stream coming off each individual disk.

Build applications, not infrastructure

When thinking of the scenario in the previous section, many people will focus on the questions of data movement and processing. But, anyone who has ever built such a system will know that less obvious elements such as job scheduling, error handling, and coordination are where much of the magic truly lies.

If we had to implement the mechanisms for determining where to execute processing, performing that processing, and combining all the subresults into the overall result, we wouldn't have gained much over the older model, where we had to explicitly manage data partitioning; we'd just be exchanging one difficult problem for another.

This touches on the most recent trend, which we'll highlight here: a system that handles most of the cluster mechanics transparently and allows the developer to think in terms of the business problem. Frameworks that provide well-defined interfaces that abstract all this complexity—smart software—upon which business domain-specific applications can be built give the best combination of developer and system efficiency.

Hadoop

The thoughtful (or perhaps suspicious) reader will not be surprised to learn that the preceding approaches are all key aspects of Hadoop. But we still haven't actually answered the question about exactly what Hadoop is.

Thanks, Google

It all started with Google, which in 2003 and 2004 released two academic papers describing Google technology: the Google File System (GFS) (http://research.google.com/archive/gfs.html) and MapReduce (http://research.google.com/archive/mapreduce.html). The two together provided a platform for processing data on a very large scale in a highly efficient manner.

Thanks, Doug

At the same time, Doug Cutting was working on the Nutch open source web search engine. He had been working on elements within the system that resonated strongly with the GFS and MapReduce papers once they were published. Doug started work on implementations of these Google systems, and Hadoop was soon born, first as a subproject of Lucene and soon after as its own top-level project within the Apache Software Foundation. At its core, therefore, Hadoop is an open source platform that provides implementations of both the MapReduce and GFS technologies and allows the processing of very large data sets across clusters of low-cost commodity hardware.

Thanks, Yahoo

Yahoo hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. In addition to often publicizing some of the largest Hadoop deployments in the world, Yahoo allowed Doug and other engineers to contribute to Hadoop while still in its employ, and it has contributed some of its own internally developed Hadoop improvements and extensions. Though Doug has now moved on to Cloudera (another prominent startup supporting the Hadoop community) and much of Yahoo's Hadoop team has been spun off into a startup called Hortonworks, Yahoo remains a major Hadoop contributor.

Parts of Hadoop

The top-level Hadoop project has many component subprojects, several of which we'll discuss in this book, but the two main ones are Hadoop Distributed File System (HDFS) and MapReduce. These are direct implementations of Google's own GFS and MapReduce. We'll discuss both in much greater detail, but for now, it's best to think of HDFS and MapReduce as a pair of complementary yet distinct technologies.

HDFS is a filesystem that can store very large data sets by scaling out across a cluster of hosts. It has specific design and performance characteristics; in particular, it is optimized for throughput instead of latency, and it achieves high availability through software replication instead of hardware redundancy.

MapReduce is a data processing paradigm that takes a specification of how the data will be input and output from its two stages (called map and reduce) and then applies this across arbitrarily large data sets. MapReduce integrates tightly with HDFS, ensuring that wherever possible, MapReduce tasks run directly on the HDFS nodes that hold the required data.

Common building blocks

Both HDFS and MapReduce exhibit several of the architectural principles described in the previous section. In particular:

  • Both are designed to run on clusters of commodity (that is, low-to-medium specification) servers

  • Both scale their capacity by adding more servers (scale-out)

  • Both have mechanisms for identifying and working around failures

  • Both provide many of their services transparently, allowing the user to concentrate on the problem at hand

  • Both have an architecture where a software cluster sits on the physical servers and controls all aspects of system execution

HDFS

HDFS is a filesystem unlike most you may have encountered before. It is not a POSIX-compliant filesystem, which basically means it does not provide the same guarantees as a regular filesystem. It is also a distributed filesystem, meaning that it spreads storage across multiple nodes; the lack of such an efficient distributed filesystem was a limiting factor in some earlier technologies. The key features are as follows (a short sketch of the filesystem API in use follows the list):

  • HDFS stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems.

  • HDFS is optimized for throughput over latency; it is very efficient at streaming read requests for large files but poor at seek requests for many small ones.

  • HDFS is optimized for workloads that are generally of the write-once and read-many type.

  • Each storage node runs a process called a DataNode that manages the blocks on that host, and these are coordinated by a master NameNode process running on a separate host.

  • Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and the HDFS NameNode constantly monitors reports sent by each DataNode to ensure that failures have not dropped any block below the desired replication factor. If this does happen, it schedules the addition of another copy within the cluster.
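
As promised above, here is a minimal sketch of the filesystem in use through Hadoop's Java FileSystem API. It is an illustration rather than a listing from this book: the NameNode address and paths are placeholders, and the configuration key shown (fs.defaultFS) is the newer name for what older Hadoop releases call fs.default.name.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode; the address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file. Behind the scenes, HDFS splits larger files into
        // big blocks and replicates each block across several DataNodes.
        Path file = new Path("/user/example/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read it back. The client asks the NameNode where the blocks live and
        // then streams the data directly from the DataNodes holding the replicas.
        System.out.println(fs.open(file).readUTF());
        fs.close();
    }
}

The same operations are available from the hadoop fs command-line tool; the point here is simply that a distributed, replicated filesystem is presented to the developer through an ordinary-looking file API.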

MapReduce

Though MapReduce as a technology is relatively new, it builds upon much fundamental work from both mathematics and computer science, particularly approaches that express an operation to be applied to each element in a set of data. Indeed, the individual concepts of functions called map and reduce come straight from functional programming languages, where they were applied to lists of input data.

Another key underlying concept is that of "divide and conquer", where a single problem is broken into multiple individual subtasks. This approach becomes even more powerful when the subtasks are executed in parallel; in a perfect case, a task that takes 1000 minutes could be processed in 1 minute by 1,000 parallel subtasks.

MapReduce is a processing paradigm that builds upon these principles; it provides a series of transformations from a source to a result data set. In the simplest case, the input data is fed to the map function and the resultant temporary data to a reduce function. The developer only defines the data transformations; Hadoop's MapReduce framework manages the process of applying these transformations to the data across the cluster in parallel. Though the underlying ideas may not be novel, a major strength of Hadoop is in how it has brought these principles together into an accessible and well-engineered platform.

Unlike traditional relational databases, which require structured data with well-defined schemas, MapReduce and Hadoop work best on semi-structured or unstructured data. Instead of data conforming to rigid schemas, the requirement is simply that the data be provided to the map function as a series of key/value pairs. The output of the map function is a set of other key/value pairs, and the reduce function performs aggregation to collect the final set of results.

Hadoop provides a standard specification (that is, an interface) for the map and reduce functions, and implementations of these are often referred to as mappers and reducers. A typical MapReduce job will comprise a number of mappers and reducers, and it is not unusual for several of these to be extremely simple. The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination.

This last point is possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system. Critically, from the perspective of the size of data, the same MapReduce job can be applied to data sets of any size hosted on clusters of any size. If the data is 1 gigabyte in size and on a single host, Hadoop will schedule the processing accordingly. Even if the data is 1 petabyte in size and hosted across one thousand machines, it still does likewise, determining how best to utilize all the hosts to perform the work most efficiently. From the user's perspective, the actual size of the data and cluster are transparent, and apart from affecting the time taken to process the job, they do not change how the user interacts with Hadoop.
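
Before moving on, it helps to see what a mapper and a reducer actually look like. The following is a minimal word-count sketch against the org.apache.hadoop.mapreduce Java API; it is an illustration rather than a listing from this book, and the class names are arbitrary (in practice, each class would live in its own source file).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: input key/value is (byte offset, line of text); output is (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit one (word, 1) pair per occurrence
            }
        }
    }
}

// Reducer: the framework groups the mapper output by key, so each call receives
// one word together with all of its counts, which are simply summed.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

A small driver class would then configure a Job with these two classes plus the input and output paths; everything else (splitting the input, scheduling tasks near the data, shuffling the intermediate key/value pairs, and retrying failures) is handled by the framework, which is exactly the division of labour described above.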

Better together

It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. HDFS can be used without MapReduce, as it is intrinsically a large-scale data storage platform. Though MapReduce can read data from non-HDFS sources, the nature of its processing aligns so well with HDFS that using the two together is by far the most common use case.

When a MapReduce job is executed, Hadoop needs to decide where to execute the code to process the data set most efficiently. If the MapReduce cluster hosts all pull their data from a single storage host or array, it largely doesn't matter where the code runs, as that shared storage system will become the point of contention. But if the storage system is HDFS, MapReduce can execute data processing on the nodes holding the data of interest, building on the principle that it is less expensive to move the processing than the data itself.

The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage it also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can, as far as possible, schedule tasks on the hosts where the required data resides, minimizing network traffic and maximizing performance.

Think back to our earlier example of how to process a four-step task on 1 petabyte of data spread across one thousand servers. The MapReduce model would (in a somewhat simplified and idealized way) perform the processing in a map function on each piece of data on a host where the data resides in HDFS and then reuse the cluster in the reduce function to collect the individual results into the final result set.

A part of the challenge with Hadoop is in breaking down the overall problem into the best combination of map and reduce functions. The preceding approach would only work if the four-stage processing chain could be applied independently to each data element in turn. As we'll see in later chapters, the answer is sometimes to use multiple MapReduce jobs where the output of one is the input to the next.

Common architecture

Both HDFS and MapReduce are, as mentioned, software clusters that display common characteristics:

  • Each follows an architecture where a cluster of worker nodes is managed by a special master/coordinator node

  • The master in each case (the NameNode for HDFS and the JobTracker for MapReduce) monitors the health of the cluster and handles failures, either by moving data blocks around or by rescheduling failed work

  • Processes on each server (DataNode for HDFS and TaskTracker for MapReduce) are responsible for performing work on the physical host, receiving instructions from the NameNode or JobTracker, and reporting health/progress status back to it

As a minor terminology point, we will generally use the terms host or server to refer to the physical hardware hosting Hadoop's various components. The term node will refer to the software component comprising a part of the cluster.

What it is and isn't good for

As with any tool, it's important to understand when Hadoop is a good fit for the problem in question. Much of this book will highlight its strengths, based on the previous broad overview of processing large data volumes, but it's also important to start appreciating at an early stage where it isn't the best choice.

The architecture choices made within Hadoop enable it to be the flexible and scalable data processing platform it is today. But, as with most architecture or design choices, there are consequences that must be understood. Primary amongst these is the fact that Hadoop is a batch processing system. When you execute a job across a large data set, the framework will churn away until the final results are ready. With a large cluster, answers across even huge data sets can be generated relatively quickly, but the fact remains that the answers are not generated fast enough to service impatient users. Consequently, Hadoop alone is not well suited to low-latency queries such as those received on a website, a real-time system, or a similar problem domain.

When Hadoop is running jobs on large data sets, the overhead of setting up the job, determining which tasks are run on each node, and all the other housekeeping activities that are required is a trivial part of the overall execution time. But, for jobs on small data sets, there is an execution overhead that means even simple MapReduce jobs may take a minimum of 10 seconds.

Note

Another member of the broader Hadoop family is HBase, an open-source implementation of another Google technology. This provides a (non-relational) database atop Hadoop that uses various means to allow it to serve low-latency queries.

But haven't Google and Yahoo both been among the strongest proponents of this method of computation, and aren't they all about websites where response time is critical? The answer is yes, and it highlights an important point: Hadoop should be incorporated into an organization or activity, or used in conjunction with other technologies, in a way that exploits the strengths of each. In a paper (http://research.google.com/archive/googlecluster.html), Google sketched how it utilized MapReduce at the time; after a web crawler retrieved updated webpage data, MapReduce processed the huge data set and produced the web index that a fleet of MySQL servers used to service end-user search requests.

Cloud computing with Amazon Web Services


The other technology area we'll explore in this book is cloud computing, in the form of several offerings from Amazon Web Services. But first, we need to cut through some hype and buzzwords that surround this thing called cloud computing.

Too many clouds

Cloud computing has become an overused term, arguably to the point that its overuse risks it being rendered meaningless. In this book, therefore, let's be clear what we mean—and care about—when using the term. There are two main aspects to this: a new architecture option and a different approach to cost.

A third way

We've talked about scale-up and scale-out as the options for scaling data processing systems. But our discussion thus far has taken for granted that the physical hardware that makes either option a reality will be purchased, owned, hosted, and managed by the organization doing the system development. The cloud computing we care about adds a third approach; put your application into the cloud and let the provider deal with the scaling problem.

It's not always that simple, of course. But for many cloud services, the model truly is this revolutionary. You develop the software according to some published guidelines or interfaces, deploy it onto the cloud platform, and allow the platform to scale the service based on demand, for a cost of course. Given the effort and expense usually involved in building systems that scale, this is often a compelling proposition.

Different types of costs

This approach to cloud computing also changes how system hardware is paid for. Because the cloud provider builds its platform to a size capable of hosting thousands or millions of clients, all users benefit from its economies of scale. As a user, not only do you get someone else to worry about difficult engineering problems such as scaling, but you also pay for capacity as it's needed and don't have to size the system for the largest possible workload. Instead, you gain the benefit of elasticity and use more or fewer resources as your workload demands.

An example helps illustrate this. Many companies' financial groups run end-of-month workloads to generate tax and payroll data, and often, much larger data crunching occurs at year end. If you were tasked with designing such a system, how much hardware would you buy? If you buy only enough to handle the day-to-day workload, the system may struggle at month end and will likely be in real trouble when the end-of-year processing rolls around. If you scale for the end-of-month workloads, the system will have idle capacity for most of the year and may still struggle with the end-of-year processing. If you size for the end-of-year workload, the system will have significant capacity sitting idle for the rest of the year. And considering the purchase cost of hardware in addition to the hosting and running costs (a server's electricity usage may account for a large part of its lifetime cost), you are basically wasting huge amounts of money.

The service-on-demand aspects of cloud computing allow you to start your application on a small hardware footprint and then scale it up and down as the year progresses. With a pay-for-use model, your costs follow your utilization and you have the capacity to process your workloads without having to buy enough hardware to handle the peaks.

A more subtle aspect of this model is that it greatly reduces the cost of entry for an organization launching an online service. We all know that a hot new service that fails to meet demand and suffers performance problems will find it hard to recover momentum and user interest. In the year 2000, for example, an organization wanting a successful launch needed to put in place, on launch day, enough capacity to meet the massive surge of user traffic it hoped for but couldn't be sure to expect. Once the costs of hosting facilities were taken into account, it would have been easy to spend millions on a product launch.

Today, with cloud computing, the initial infrastructure cost could literally be as low as a few tens or hundreds of dollars a month and that would only increase when—and if—the traffic demanded.

AWS – infrastructure on demand from Amazon

Amazon Web Services (AWS) is a set of such cloud computing services offered by Amazon. We will be using several of these services in this book.

Elastic Compute Cloud (EC2)

Amazon's Elastic Compute Cloud (EC2), found at http://aws.amazon.com/ec2/, is basically a server on demand. After registering with AWS and providing credit card details, you gain access to a dedicated virtual machine on which it's easy to run a variety of operating systems, including Windows and many variants of Linux.

Need more servers? Start more. Need more powerful servers? Change to one of the higher-specification (and higher-cost) types offered. Alongside this, EC2 offers a suite of complementary services, including load balancers, static IP addresses, high-performance additional virtual disk drives, and many more.
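
For a flavour of the programmatic interface, the sketch below launches a single instance using a recent version of the AWS SDK for Java (v1). The AMI ID and instance type are placeholders, credentials and region are assumed to be configured in the environment, and the web console or command-line tools achieve exactly the same thing.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

public class LaunchInstanceSketch {
    public static void main(String[] args) {
        // Credentials and region are picked up from the environment or default profile.
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Ask EC2 for a single virtual machine; the AMI ID and type are placeholders.
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-00000000")
                .withInstanceType("t2.micro")
                .withMinCount(1)
                .withMaxCount(1);

        RunInstancesResult result = ec2.runInstances(request);
        System.out.println("Launched: "
                + result.getReservation().getInstances().get(0).getInstanceId());
    }
}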

Simple Storage Service (S3)

Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a storage service that provides a simple key/value storage model. Using web, command-line, or programmatic interfaces, you can create and retrieve objects, which can be anything from text files to images to MP3s, within a simple two-level model: you create buckets, and buckets contain objects. Each bucket has a unique identifier, and within each bucket, every object is uniquely named. This simple strategy enables an extremely powerful service for which Amazon takes complete responsibility (for service scaling, in addition to reliability and availability of data).
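
The bucket/object model is easy to see in code. This is a rough sketch using a recent version of the AWS SDK for Java (v1); the bucket name and object key are placeholders, and the same operations are available from the web console and command-line interfaces.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3Sketch {
    public static void main(String[] args) {
        // Credentials and region come from the environment or default profile.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Buckets are globally unique containers; objects are named values within a bucket.
        String bucket = "example-hadoop-data";   // placeholder name
        if (!s3.doesBucketExistV2(bucket)) {
            s3.createBucket(bucket);
        }

        // Store and retrieve an object by its key.
        s3.putObject(bucket, "notes/hello.txt", "Hello, S3");
        String content = s3.getObjectAsString(bucket, "notes/hello.txt");
        System.out.println(content);
    }
}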

Elastic MapReduce (EMR)

Amazon's Elastic MapReduce (EMR), found at http://aws.amazon.com/elasticmapreduce/, is basically Hadoop in the cloud and builds atop both EC2 and S3. Once again, using any of the multiple interfaces (web console, CLI, or API), a Hadoop workflow is defined with attributes such as the number of Hadoop hosts required and the location of the source data. The Hadoop code implementing the MapReduce jobs is provided and the virtual go button is pressed.

In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster it creates on EC2, push the results back into S3, and terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually on a per-GB-stored and server-time-used basis), but the ability to access such data processing capability with no need for dedicated hardware is compelling.
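
To give a flavour of what pressing the "virtual go button" looks like programmatically, here is a rough sketch using the AWS SDK for Java (v1). Every name, path, and instance setting is a placeholder, and depending on the EMR release targeted, additional settings (such as a release label and IAM roles) are required before the request will be accepted.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrSketch {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The MapReduce job: a jar in S3 plus its input and output locations (placeholders).
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://example-bucket/jobs/wordcount.jar")
                .withArgs("s3://example-bucket/input/", "s3://example-bucket/output/");

        StepConfig step = new StepConfig("word-count", jarStep)
                .withActionOnFailure("TERMINATE_CLUSTER");

        // The transient Hadoop cluster EMR should create on EC2 and tear down afterwards.
        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(3)
                .withMasterInstanceType("m5.xlarge")
                .withSlaveInstanceType("m5.xlarge")
                .withKeepJobFlowAliveWhenNoSteps(false);

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("wordcount-flow")
                .withLogUri("s3://example-bucket/logs/")
                .withSteps(step)
                .withInstances(instances);
        // Newer EMR releases also require a release label and IAM service/instance roles.

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}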

What this book covers

In this book we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters.

Not only will we be looking at Hadoop as an engine for performing MapReduce processing, but we'll also explore how a Hadoop capability can fit into the rest of an organization's infrastructure and systems. We'll look at some of the common points of integration, such as getting data between Hadoop and a relational database and also how to make Hadoop look more like such a relational database.

A dual approach

In this book we will not be limiting our discussion to EMR or Hadoop hosted on Amazon EC2; we will be discussing both the building and the management of local Hadoop clusters (on Ubuntu Linux) in addition to showing how to push the processing into the cloud via EMR.

The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Though it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacity, sometimes out of concern about over-reliance on a single external provider, and, practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.

In some of the later chapters, where we discuss additional products that integrate with Hadoop, we'll only give examples on local clusters, as the products work in the same way regardless of where the cluster is deployed.

Summary


We learned a lot in this chapter about big data, Hadoop, and cloud computing.

Specifically, we covered the emergence of big data and how changes in the approach to data processing and system architecture bring within the reach of almost any organization techniques that were previously prohibitively expensive.

We also looked at the history of Hadoop and how it builds upon many of these trends to provide a flexible and powerful data processing platform that can scale to massive volumes. We also looked at how cloud computing provides another system architecture approach, one which exchanges large up-front costs and direct physical responsibility for a pay-as-you-go model and a reliance on the cloud provider for hardware provision, management and scaling. We also saw what Amazon Web Services is and how its Elastic MapReduce service utilizes other AWS services to provide Hadoop in the cloud.

We also discussed the aim of this book and its approach to exploration on both locally-managed and AWS-hosted Hadoop clusters.

Now that we've covered the basics and know where this technology is coming from and what its benefits are, we need to get our hands dirty and get things running, which is what we'll do in Chapter 2, Getting Hadoop Up and Running.


Key benefits

  • Learn tools and techniques that let you approach big data with relish and not fear
  • Shows how to build a complete infrastructure to handle your needs as your data grows
  • Hands-on examples in each chapter give the big picture while also giving direct experience

Description

Data is arriving faster than you can process it and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills. "Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real-world problems. Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems. While exploring different ways to develop applications to run on Hadoop, the book also covers tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection. In addition to examples on Hadoop clusters on Ubuntu, uses of cloud services such as Amazon EC2 and Elastic MapReduce are covered.

Who is this book for?

This book assumes no existing experience with Hadoop or cloud services. It assumes you have familiarity with a programming language such as Java or Ruby but gives you the needed background on the other topics.

What you will learn

  • The trends that led to Hadoop and cloud services, giving the background to know when to use the technology
  • Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
  • Developing applications to run on Hadoop with examples in Java and Ruby
  • How Amazon Web Services can be used to deliver a hosted Hadoop solution and how this differs from directly-managed environments
  • Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
  • How Flume can collect data from multiple sources and deliver it to Hadoop for processing
  • What other projects and tools make up the broader Hadoop ecosystem and where to go next
Product Details

Publication date: Feb 22, 2013
Length: 398 pages
Edition: 1st
Language: English
ISBN-13: 9781849517300




Table of Contents

11 Chapters
  1. What It's All About
  2. Getting Hadoop Up and Running
  3. Understanding MapReduce
  4. Developing MapReduce Programs
  5. Advanced MapReduce Techniques
  6. When Things Break
  7. Keeping Things Running
  8. A Relational View on Data with Hive
  9. Working with Relational Databases
  10. Data Collection with Flume
  11. Where to Go Next

Customer reviews

Rating: 3.7 out of 5 (13 ratings)
5 star: 15.4% | 4 star: 46.2% | 3 star: 30.8% | 2 star: 7.7% | 1 star: 0%

MarcTC, Feb 05, 2014, rated 5/5
I am very interested in Big Data and have read a lot of books about the subject. It was time for me to get hands-on, so I was looking for a beginner book for Hadoop. I already had knowledge of Big Data as a subject and knew the type of problems we are solving, but I didn't have any prior experience with the platform. I am using a Mac OS X machine and know my way around the Linux/Unix shell. I started working through the book. The first chapter is a great, short and sweet introduction to Hadoop, MapReduce and HDFS. It's not too lengthy; just the right length for a person like me who gets bored pretty quickly with a lot of information. From Chapter 2, the real hands-on work started. Although the instructions are mainly for Ubuntu Linux, it worked like a charm on the Mac, and I didn't have any of the issues that other reviewers mentioned. For the first example (the Pi calculation), I couldn't use the exact command given in the book simply because I was using the latest version of Hadoop, so the jar name was different. That's not something the author can fix; once you give the correct jar name, it works like a charm. I'm now moving along through the other chapters very quickly. It's a really good book if you just want to familiarize yourself with the product, which is exactly what the book is about. It's well written, easy to follow, and the author doesn't try to bore you to death. I recommend reading the book below if you first want to understand what Big Data is all about and see what type of issues it's trying to solve: http://www.amazon.com/gp/product/B009N08NKW/ref=cm_cr_ryp_prd_ttl_sol_39
Amazon Verified review
Cheng, Sep 24, 2014, rated 5/5
A very good book for the beginner
Amazon Verified review
Amazon Customer, Nov 08, 2014, rated 4/5
I like this book, it is very clear and has good examples to try out. I don't give it 5 stars because many of the examples have mistakes, many of them very obvious.
Amazon Verified review
Varun Muriyanat, Sep 29, 2013, rated 4/5
Very good for beginners. Gives both the big picture, as well as the start-up implementation for newbies. Excellent choice.
Amazon Verified review
Alexander Tarnowski Jun 03, 2013
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
I read this book "out of context", meaning that I didn't have an interesting problem solvable by MapReduce at hand and a dire need to learn Hadoop at the time of reading. Instead, I took time to read this book with the purpose of determining whether it's a good beginner book or not. All in all, I'd say that it is. The author really succeeds in creating a context for Hadoop and its ecosystem.From the second chapter and onwards, Hadoop is gradually introduced using very detailed instructions. The general format for doing this is by listing every single command the user needs to type and its output, so the book is full of terminal session listings. All such listings are followed by sections called "What just happened?" that explain in detail the purpose of the commands and their output. This is actually quite helpful for readers who understand what's happening from just looking at the session listing; such readers can safely skip these sections.The above approach should enable any reader, regardless of level of experience, to follow along and do the exercises or labs, which is a good thing for a beginner book. I have a remark about this though: the session dumps could have been proofread better! I can't say that I read them through a magnifying glass, but still I found quite a few errors.As for the contents, the book never shows the monster! In my opinion, the introductory chapter fails to actually establish a case for Hadoop and MapReduce. Yes, it's about big data, scaling and problems and so on, but I couldn't find a logical transition to Hadoop as a solution to these problems. Instead, chapter two illustrates the framework with a distributed calculation of pi and the word counting program (Hadoop's version of the "Hello world" program).In a later chapter, Hadoop is used to process a dataset with UFO sightings, and then, in a chapter on advanced techniques a graph problem is solved. Not until that chapter did I start getting a feeling for what kind of problems Hadoop and MapReduce should be used for. This is what I mean by "never showing the monster". Being an introductory text, I'd prefer the first or second chapter to describe some problems that are good candidates for the MapReduce paradigm, illustrate one of them, and then show how a distributed computation would help.That said, I may be off track here. This is a book on Hadoop, and not MapReduce in general, and I did say that I read it without having intricate MapReduce problems at hand. This is pretty much my only criticism. If a reader doesn't perceive this as a problem, then there's nothing to complain about. After reading the book, I feel that I have a very good feeling for what Hadoop does and what building blocks in its ecosystem to use.One more thing... Here and there the book contains examples of how to use Amazon's EMR. This didn't feel awfully important to me, but it provides an even more solid explanation of how to apply the framework to bigger problems and how it can be used in a cloud environment.To sum up: a good and comprehensive book on Hadoop that covers the framework and its ecosystem, verbose and easy to follow examples, and a structure that leaves the reader with a sense of getting the big picture.Minuses: could devote some more pages to the MapReduce paradigm and have its sample listings proofread better.
Amazon Verified review
Get free access to the Packt library of over 7,500 books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of the print book?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for interstate metro areas.
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 business days following dispatch, depending on the distance to the destination.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K. time will start printing on the next business day, so the estimated delivery times also start from the next day. Orders received after 5 PM U.K. time (in our internal systems) on a business day, or at any time over the weekend, will begin printing on the second business day after the order. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is customs duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to countries listed under the EU27 will not bear customs charges; these are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

For shipments to countries outside the EU27, customs duty or localized taxes may be applied by the recipient country. These charges must be paid by the customer and are not included in the shipping charges on the order.

How do I know my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin, and other factors such as the total invoice amount, weight, and other criteria applicable in your country.

For example:

  • If you live in Mexico and the declared value of your ordered items is over $50, you will have to pay an additional import tax of 19% (i.e., $9.50 on a $50 order) to the courier service in order to receive your package.
  • Whereas if you live in Turkey and the declared value of your ordered items is over €22, you will have to pay an additional import tax of 18% (i.e., €3.96 on a €22 order) to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing it. Simply contact customercare@packt.com with your order details or payment transaction ID. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on its way to you, you can contact us at customercare@packt.com when you receive it and use the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., where Packt Publishing agrees to replace your printed book because it arrived damaged or with a material defect); otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work, or is unacceptably late, please contact our Customer Relations Team at customercare@packt.com with the order number and issue details, as explained below:

  1. If you ordered an item (eBook, Video, or Print Book) incorrectly or accidentally, please contact the Customer Relations Team at customercare@packt.com within one hour of placing the order and we will replace or refund the item cost.
  2. Sadly, if your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you (i.e., during download), you should contact the Customer Relations Team within 14 days of purchase at customercare@packt.com, who will be able to resolve the issue for you.
  3. You will have a choice of replacement or refund for the problem items (damaged, defective, or incorrect).
  4. Once the Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are requesting a refund for only one book from an order of multiple items, we will refund the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receipt of the book with appropriate evidence of the damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made on a print-on-demand basis by Packt's professional book-printing partner.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on laws and regulations). A localized VAT fee is charged only to our European and UK customers on the eBooks, videos, and subscriptions that they buy. GST is charged to Indian customers for eBook and video purchases.

What payment methods can I use?

You can pay using the following methods:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal