Real-time Analytics with Storm and Cassandra: Solve real-time analytics problems effectively using Storm and Cassandra

By Shilpi Saxena
Book Mar 2015 220 pages 1st Edition

Chapter 1. Let's Understand Storm

In this chapter, you will get acquainted with problems that require distributed computing solutions and see how complex it can get to create and manage such solutions. We will look at the options available for distributed computation.

The topics that will be covered in the chapter are as follows:

  • Getting acquainted with a few problems that require distributed computing solutions

  • The complexity of existing solutions

  • Technologies offering real-time distributed computing

  • A high-level view of the various components of Storm

  • A quick peek into the internals of Storm

By the end of the chapter, you will be able to understand the real-time scenarios and applications of Apache Storm. You should be aware of solutions available in the market and reasons as to why Storm is still the best open source choice.

Distributed computing problems


Let's dive deep and identify some of the problems that require distributed solutions. In the world we live in today, we are attuned to the power of now, and that expectation is what generated the need for distributed real-time computing. Sectors such as banking, healthcare, and automotive manufacturing are hubs where real-time computing can either optimize or enhance existing solutions.

Real-time business solution for credit or debit card fraud detection

Let's get acquainted with the problem depicted in the following figure. When we make a transaction using plastic money and swipe our debit or credit card for payment, the bank has less than five seconds to validate or reject the transaction. In that time, the transaction details have to be encrypted and travel over a secure network from the servicing bank to the issuing bank; at the issuing bank, the entire fuzzy logic for accepting or declining the transaction has to be computed; and the result has to travel back over the secure network:

Real-time credit card fraud detection

Challenges such as network latency and delay can be optimized to some extent, but to complete the transaction described above in less than five seconds, one has to design an application that can churn through a considerable amount of data and generate results within 1 to 2 seconds.

Aircraft Communications Addressing and Reporting System

The Aircraft Communications Addressing and Reporting System (ACARS) demonstrates another typical use case that cannot be implemented without a reliable real-time processing system in place. These aircraft communication systems use satellite communication (SATCOM), and as per the following figure, they gather voice and packet data from all phases of flight in real time and are able to generate analytics and alerts on that data in real time.

Let's take the example from the preceding figure. If a flight encounters hazardous weather on a route, say, electrical storms, that information is sent through satellite links and voice or data gateways to the air traffic controller, which detects the hazard in real time and raises alerts to divert all other flights passing through that area.

Healthcare

Let's now look at another problem, this time from healthcare.

This is another very important domain, where real-time analytics over high-volume, high-velocity data equips healthcare professionals with accurate information in real time so that they can take informed, life-saving actions.

The preceding figure depicts the use case where doctors can take informed action to handle the medical situation of their patients. Data is collated from historic patient databases, drug databases, and patient records. Once the data is collected, it is processed, and live statistics and key parameters of the patient are plotted against the same collated data. This data can then be used to generate reports and alerts that aid healthcare professionals.

Other applications

There is a variety of other applications where the power of real-time computing can either optimize processes or help people make informed decisions. It has become a great utility and aid in the following industries:

  • Manufacturing: A real-time defect detection mechanism can help optimize production costs. Generally, in the manufacturing segment, QC is performed post-production, and at that stage an entire lot can be rejected because the same defect appears across the goods.

  • Transportation industry: Based on real-time traffic and weather data, transport companies can optimize their trade routes and save time and money.

  • Network optimization: Based on real-time network usage alerts, companies can design auto scale up and auto scale down systems for peak and off-peak hours.

Solutions for complex distributed use cases


Now that we understand the power that real-time solutions can bring to various industry verticals, let's explore what options we have for processing vast amounts of data generated at a very fast pace.

The Hadoop solution

The Hadoop solution is one of the solutions to solve the problems that require dealing with humongous volumes of data. It works by executing jobs in a clustered setup.

MapReduce is a programming paradigm where we process large data sets by using a mapper function that processes a key and value pair and thus generates intermediate output again in the form of a key-value pair. Then a reduce function operates on the mapper output and merges the values associated with the same intermediate key and generates a result.

The preceding figure demonstrates a simple word count job implemented with MapReduce, where:

  • There is a huge Big Data store, which can grow to petabytes or even zettabytes.

  • Input datasets or files are split into blocks of a configured size, and each block is replicated onto multiple nodes in the Hadoop cluster depending upon the replication factor.

  • Each mapper job counts the occurrences of each word in the data blocks allocated to it.

  • Once the mapper is done, the words (which are actually the keys) and their counts are stored in a local file on the mapper node. The reducer then starts the reduce function over this output.

  • Reducers combine the mapper output and the final results are generated.
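The word count flow described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, group, and reduce phases, not Hadoop code; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample blocks are made up for this example.

```python
from collections import defaultdict


def map_words(block):
    """Mapper: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group intermediate values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reducer: merge all values that share the same intermediate key."""
    return key, sum(values)


def word_count(blocks):
    intermediate = []
    for block in blocks:  # in a real cluster, each block goes to its own mapper
        intermediate.extend(map_words(block))
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input, already split into three blocks.
blocks = ["storm is fast", "hadoop is batch", "storm is real time"]
result = word_count(blocks)
print(result)
```

Every block has to be fully mapped before the reducers can produce a final count, which is one way to see why this model suits batch jobs rather than continuous streams.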

Hadoop, as we know, does provide a way to process humongous volumes of data and generate results from them, but it is predominantly a batch processing system and has almost no utility for real-time use cases.

A custom solution

Here we talk about a solution that was used in the social media world before we had a scalable framework such as Storm. A simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem by following the mechanism shown in the figure:

Here is the detailed information of how the preceding mechanism works:

  • The custom solution creates a fire hose, or queue, onto which all the tweets are pushed.

  • A set of worker nodes reads data from the queue, parses the messages, and maintains counts of tweets by each user. The solution is scalable, as we can increase the number of workers to handle more load in the system. However, the sharding algorithm that distributes the data randomly among these worker nodes must ensure that all workers receive an equal share of the data.

  • These workers push their first-level counts into the next set of queues.

  • A second level of workers picks up data from these queues. Here, the distribution of data among the workers is neither equal nor random. The load balancing or sharding logic has to ensure that tweets from the same user always go to the same worker, so that the counts are correct. For example, let's assume we have the users A, K, M, P, R, and L, and two workers, worker A and worker B. The tweets from users A, K, and M always go to worker A, and those from users P, R, and L go to worker B, so the tweet counts for A, K, and M are always maintained by worker A. Finally, these counts are dumped into the data store.
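The second-level routing rule above, where tweets from one user must always reach the same worker, can be sketched as follows. This is a minimal illustration, not the mechanism Twitter used: choosing the worker from a CRC32 hash of the user name is an assumption made for the example, and `worker_for` and `count_tweets` are hypothetical names.

```python
import zlib
from collections import Counter


def worker_for(user, num_workers):
    """Pick a worker from a stable hash, so a user always maps to one worker."""
    return zlib.crc32(user.encode("utf-8")) % num_workers


def count_tweets(tweets, num_workers=2):
    """Route each (user, text) tweet to its worker and count per user there."""
    workers = [Counter() for _ in range(num_workers)]
    for user, _text in tweets:
        workers[worker_for(user, num_workers)][user] += 1
    return workers


# Hypothetical stream of (user, text) pairs.
tweets = [("A", "hi"), ("K", "hello"), ("A", "again"), ("P", "storm"), ("A", "bye")]
workers = count_tweets(tweets)
for index, counts in enumerate(workers):
    print(f"worker {index}: {dict(counts)}")
```

A stable hash is used instead of Python's built-in `hash()`, because the built-in is randomized per process for strings and would send the same user to different workers across restarts.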

The queue-worker solution described in the preceding points works fine for our specific use case, but it has the following serious limitations:

  • It's very complex and specific to the use case

  • Redeployment and reconfiguration are huge tasks

  • Scaling is very tedious

  • The system is not fault tolerant

Licensed proprietary solutions

Having covered open source Hadoop and the custom queue-worker solution, let's discuss the licensed, proprietary solutions available in the market for distributed real-time processing.

A lot of big companies have invested in such products because they clearly see where the future of computing is heading. They foresee demand for such solutions in almost every vertical and domain, and they have developed products that let us do complex batch and real-time computing, but these come at a heavy license cost. A few such solutions come from the following companies:

  • IBM: IBM has developed InfoSphere Streams for real-time ingestion, analysis, and correlation of data.

  • Oracle: Oracle has a product called Real Time Decisions (RTD) that provides analysis, machine learning, and predictions in a real-time context.

  • GigaSpaces: GigaSpaces has come up with a product called XAP that provides in-memory computation to deliver real-time results.

Other real-time processing tools

There are a few other technologies with traits and features similar to Apache Storm, such as S4 from Yahoo, but S4 lacks guaranteed processing. Spark is essentially a batch processing system with micro-batching features that can be used for near-real-time processing.

A high-level view of various components of Storm


In this section, we will get you acquainted with various components of Storm, their role, and their distribution in a Storm cluster.

A Storm cluster has three sets of nodes (which could be co-located, but are generally distributed in clusters), which are as follows:

  • Nimbus

  • Zookeeper

  • Supervisor

The following figure shows the integration hierarchy of these nodes:

The detailed explanation of the integration hierarchy is as follows:

  • Nimbus node (master node, similar to the Hadoop JobTracker): This is the heart of the Storm cluster. You can say that this is the master daemon process that is responsible for the following:

    • Uploading and distributing various tasks across the cluster

    • Uploading and distributing the topology JAR files across various supervisors

    • Launching workers as per ports allocated on the supervisor nodes

    • Monitoring the topology execution and reallocating workers whenever necessary

    • Storm UI is also executed on the same node

  • Zookeeper nodes: Zookeepers can be regarded as the bookkeepers of the Storm cluster. Once the topology job has been submitted and distributed from the Nimbus node, the topology continues to execute even if Nimbus dies, because as long as the Zookeepers are alive, the workable state is maintained and logged by them. The main responsibility of this component is to maintain the operational state of the cluster and to restore that state if recovery from a failure is required. It is the coordinator of the Storm cluster.

  • Supervisor nodes: These are the main processing chambers in the Storm topology; all the action happens here. These are daemon processes that listen for and manage the work assigned to them. They communicate with Nimbus through Zookeeper, and they start and stop workers according to signals from Nimbus.

Delving into the internals of Storm


Now that we know which physical components are present in a Storm cluster, let's understand what happens inside them when a topology is submitted. Topology submission means that we have submitted a distributed job to Storm Nimbus for execution over the cluster of supervisors. The following steps are executed in the various Storm components when a Storm topology is executed:

  • The topology is submitted on the Nimbus node.

  • Nimbus uploads the code JAR files to all the supervisors and instructs the supervisors to launch workers as per the NumWorker configuration or the TOPOLOGY_WORKERS configuration defined in Storm.

  • Throughout this time, all the Storm nodes (Nimbus and supervisors) constantly coordinate with the Zookeeper cluster to maintain a log of workers and their activities.

The following figure depicts the topology and the distribution of the topology components, which are the same across clusters:

In our case, let's assume that our cluster consists of one Nimbus node, three Zookeepers in a Zookeeper cluster, and one supervisor node.

By default, four slots are allocated to each supervisor, so up to four workers can be launched per Storm supervisor node unless the configuration is changed.

Let's assume that the depicted topology is allocated four workers, and it has two bolts each with a parallelism of two and one spout with a parallelism of four. So in total, we have eight tasks to be distributed across four workers.
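The arithmetic in this example can be checked with a short sketch. It assumes one task per unit of parallelism and spreads them over the workers in simple round-robin order; that placement policy and the component names (`spout`, `bolt-1`, `bolt-2`) are assumptions for illustration, and Storm's own scheduler may place them differently.

```python
def distribute(components, num_workers):
    """Assign one task per unit of parallelism to workers in round-robin order."""
    tasks = [
        f"{name}-{index}"
        for name, parallelism in components.items()
        for index in range(parallelism)
    ]
    workers = [[] for _ in range(num_workers)]
    for position, task in enumerate(tasks):
        workers[position % num_workers].append(task)
    return workers


# One spout with a parallelism of four, two bolts with a parallelism of two each.
components = {"spout": 4, "bolt-1": 2, "bolt-2": 2}
total_tasks = sum(components.values())
workers = distribute(components, num_workers=4)

print("total tasks:", total_tasks)
for index, assigned in enumerate(workers):
    print(f"worker {index}: {assigned}")
```

With eight tasks and four workers, each worker ends up with two, which matches the distribution described in the text.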

So this is how the topology would be executed: two workers on each supervisor and two executors within each worker, as shown in the following figure:

Quiz time


Q.1. Try to phrase a problem statement around real-time analytics in the following domains:

  • Network optimization

  • Traffic management

  • Remote sensing

Summary


In this chapter, you have understood the need for distributed computing by exploring various use cases in different verticals and domains. We have also walked you through various solutions to handle these problems and why Storm is the best choice in the open source world. You have also been introduced to Storm components and the action behind the scenes when these components are at work.

In the next chapter, we will walk through the setup aspects, and you will become familiar with programming structures in Storm through simple topologies.

What you will learn

  • Integrate Storm applications with RabbitMQ for realtime analysis and processing of messages

  • Monitor highly distributed applications using Nagios

  • Integrate the Cassandra data store with Storm

  • Develop and maintain distributed Storm applications in conjunction with Cassandra and In Memory Database (memcache)

  • Build a Trident topology that enables realtime computing with Storm

  • Tune performance for Storm topologies based on the SLA and requirements of the application

  • Use Esper with the Storm framework for rapid development of applications
Product Details

Publication date : Mar 27, 2015
Length : 220 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781784395490
Vendor : Apache


Table of Contents

19 Chapters
Real-time Analytics with Storm and Cassandra
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Let's Understand Storm
2. Getting Started with Your First Topology
3. Understanding Storm Internals by Examples
4. Storm in a Clustered Mode
5. Storm High Availability and Failover
6. Adding NoSQL Persistence to Storm
7. Cassandra Partitioning, High Availability, and Consistency
8. Cassandra Management and Maintenance
9. Storm Management and Maintenance
10. Advance Concepts in Storm
11. Distributed Cache and CEP with Storm
Quiz Answers
Index

