Real-time processing
Now that we have talked so extensively about Big Data processing and Big Data persistence in the context of distributed, batch-oriented systems, the next obvious topic is real-time or near real-time processing. Batch-oriented Big Data processing works on huge datasets offline, whereas real-time stream processing is executed on the most current data, so we operate in the dimension of now or the immediate past; examples include credit card fraud detection, security monitoring, and so on. Latency is a key aspect of these analytics.
The two operative factors here are velocity and latency, and that's where Hadoop and related distributed batch processing systems fall short. They are designed to deliver results in batch mode and can't operate at millisecond (or lower) latencies. In use cases where we need accurate results within fractions of a second, for example, credit card fraud detection, business activity monitoring, and so on, we need a Complex Event Processing (CEP) engine to process data and derive results at lightning-fast speed.
Storm, initially a project from the house of Twitter, has graduated to the Apache family and was rechristened from Twitter Storm to Apache Storm. It was the brainchild of Nathan Marz and has since been adopted by distributions such as CDH, HDP, and so on.
Apache Storm is a highly scalable, distributed, fast, and reliable real-time computing system designed to process high-velocity data. Cassandra complements Storm's compute capability with lightning-fast reads and writes, and this is currently the best available combination of a data store with Storm. Storm lets the developer create a data flow model in which tuples flow continuously through a topology (a collection of processing components). Data can be ingested into Storm using distributed messaging queues such as Kafka, RabbitMQ, and so on. Trident is an abstraction API layered over Storm that adds micro-batching capabilities to it.
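To make the data flow model concrete, here is a minimal topology sketch, assuming Storm 1.x (classes under org.apache.storm) on the classpath. The spout and bolt are trivial placeholders of our own invention; in a real deployment the spout would typically be a Kafka spout and the bolt would hold the actual fraud check or Cassandra write.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class MinimalTopology {

    // Placeholder spout: emits a dummy transaction tuple every second.
    public static class TransactionSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("4111-xxxx", 42.50)); // illustrative values only
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("cardNumber", "amount"));
        }
    }

    // Placeholder bolt: this is where a fraud check (or a Cassandra write) would go.
    public static class FraudCheckBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("checking " + input.getStringByField("cardNumber"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output stream declared
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("transactions", new TransactionSpout(), 1);
        // Tuples with the same card number always reach the same bolt instance.
        builder.setBolt("fraud-check", new FraudCheckBolt(), 2)
               .fieldsGrouping("transactions", new Fields("cardNumber"));

        // Local mode for development; StormSubmitter would be used on a real cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("fraud-detection", new Config(), builder.createTopology());

        Utils.sleep(10000); // let the sketch run briefly, then shut down
        cluster.shutdown();
    }
}
```

The fields grouping is the important detail here: it guarantees that all transactions for a given card land on the same bolt task, so per-card state can be kept locally or in Cassandra.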
Let's take a closer look at a few real-time, real-world use cases in various industrial segments.
The telecoms or cellular arena
We are living in an era where cell phones are no longer merely calling devices. They have evolved from phones into smartphones, putting not just calls but also data, photographs, tracking, GPS, and so on into the hands of consumers. The data generated by cell phones is therefore no longer just call data; a typical CDR (short for Call Detail Record) captures voice, data, and SMS transactions. Voice and SMS transactions have existed for more than a decade and are predominantly structured, thanks to worldwide telecom standards and components such as CIBER, SMPP, the SMSC, and so on. However, the data or IP traffic flowing in and out of these smart devices is largely unstructured and high volume; it could be a music track, a picture, a tweet, or just about anything in the data dimension. CDR processing and billing is generally a batch job, but many other things happen in real time:
- Geo-tracking of the device: Have you noticed how quickly we get an SMS whenever we cross a state border?
- Usage and alerts: Have you noticed how accurately and efficiently you are alerted when you near your broadband consumption limit and prompted to top it up? (A minimal sketch of such an alerting component follows this list.)
- Prepaid mobile cards: If you have ever used a prepaid system, you must have been awed at the super-efficient charge-tracking system they have in place.
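Here is a hedged sketch of how such a usage alert could be produced inside a Storm topology, again assuming the Storm 1.x API. The field names (msisdn, bytesUsed), the 2 GB cap, and the bolt name are illustrative assumptions, not part of any real CDR schema; in production the running totals would live in Cassandra rather than in a per-task map.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class UsageAlertBolt extends BaseBasicBolt {
    private static final long LIMIT_BYTES = 2L * 1024 * 1024 * 1024; // 2 GB cap, illustrative

    // Per-task, in-memory running totals; a real deployment would back this with Cassandra.
    private final Map<String, Long> usageBySubscriber = new HashMap<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String subscriber = input.getStringByField("msisdn");
        long bytes = input.getLongByField("bytesUsed");

        long total = usageBySubscriber.merge(subscriber, bytes, Long::sum);

        // Emit an alert tuple only at the moment the running total crosses the cap;
        // a downstream bolt could then push the SMS or notification.
        if (total >= LIMIT_BYTES && total - bytes < LIMIT_BYTES) {
            collector.emit(new Values(subscriber, total));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("msisdn", "totalBytes"));
    }
}
```

Because the upstream grouping is by subscriber, each subscriber's counter is updated by exactly one bolt task, which keeps the threshold check simple and race-free.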
Transportation and logistics
Transportation and logistics is another segment that uses real-time analytics on vehicular data for logistics and intelligent traffic management. Here's an example from a McKinsey report that details how Big Data and real-time analytics help handle traffic congestion on a major highway near Tel Aviv, Israel. Here's what they actually do: toll receipts are monitored constantly, and during peak hours the toll prices are raised to avert congestion, which acts as a deterrent for drivers. Once congestion eases during off-peak hours, the toll rates are reduced again.
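The pricing rule itself can be very simple. The following is an illustrative sketch (not the actual model from the report): a component downstream of the toll-receipt stream counts vehicles per minute and switches between a base rate and a surge rate. All names and constants here are assumptions for illustration.

```java
// Illustrative dynamic-pricing rule: the busier the toll plaza over the last
// minute, the higher the rate charged.
public class DynamicTollPricer {
    private static final double BASE_RATE = 2.0;         // currency units, illustrative
    private static final double SURGE_RATE = 6.0;        // illustrative peak rate
    private static final int CONGESTION_THRESHOLD = 80;  // vehicles per minute, illustrative

    /**
     * Returns the toll to charge, given the vehicle count observed in the most
     * recent one-minute window (for example, computed by a bolt over the
     * toll-receipt stream).
     */
    public double priceFor(int vehiclesLastMinute) {
        if (vehiclesLastMinute >= CONGESTION_THRESHOLD) {
            return SURGE_RATE; // peak traffic: raise the toll to deter entry
        }
        return BASE_RATE;      // normal traffic: revert to the base rate
    }
}
```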
Many more use cases can be built around the data from checkpoints and tolls to manage traffic intelligently, prevent congestion, and make better use of public infrastructure.
The connected vehicle
An idea that was still in the realm of fiction until the last decade is now a reality in active use by consumers today. GPS and Google Maps are no longer news; they are well-established, heavily used features.
My car's control unit has telemetry devices that capture various KPIs, such as engine temperature, fuel consumption patterns, RPM, and so on, and all this information is used by the manufacturer for analysis. In some cases, the user can also set thresholds on these KPIs and receive alerts when they are crossed.
The financial sector
This is the sector emerging as the biggest consumer of real-time analytics, for very obvious reasons: the volume of data is huge and changes quickly, and the impact of analytics ultimately boils down to money. This sector needs real-time instruments for rapid and precise analysis of data from stock exchanges, financial institutions, market prices and fluctuations, and so on.