The world is shifting from private, dedicated data centers to on-demand computing in the cloud. This shift moves the onus of cost from the hands of IT companies to the hands of developers. As your data sizes start to rise, the computing cost grows linearly with it. We have found that using statistical algorithms gives us a 95 percent accuracy rate, is faster, and is a lot more beneficial than waiting for the exact results. The following are some common analytical queries that we have often come across in applications:
Frequently, statistical algorithms avoid storing the original data, replacing it with hashes that eliminate a lot of network. Let’s get into the details of some of these algorithms that can help answer queries similar to those mentioned previously.
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is suitable in cases when we need to quickly filter items that are present in a set.
HyperLogLog is an approximate technique for computing the number of distinct entries in a set (cardinality). It does this while using only a small amount of memory. For instance, to achieve 99 percent accuracy, it needs only 16 KB. In cases when we need to count distinct elements in a dataset spread across a Hadoop cluster, we could compute the hashes on different machines, build the bit index, and combine the bit index to compute the overall distinct elements. This eliminates the need of moving the data across the network and thus saves us a lot of time.
The Count–min sketch is a probabilistic sub-linear space streaming algorithm that can be used to summarize a data stream to obtain the frequency of elements. It allocates a fixed amount of space to store count information, which does not vary over time even as more and more counts are updated. Nevertheless, it is able to provide useful estimated counts, because the accuracy scales with the total sum of all the counts stored.
Spark is a faster execution engine that provides 10 times the performance over MapReduce when combined with these statistical algorithms. Using Spark with statistical algorithms gives us a huge benefit both in terms of cost and time savings. Spark gets most of its speed by constructing Directed Acyclic Graphs (DAGs) out of the job operations and uses memory to save intermediate data, thus making the reads faster. When using statistical algorithms, saving the hashes in memory makes the algorithms work much faster.
Let’s say we have a continuous stream of user log data coming every hour at a rate of 4.4 GB per hour, and we need to analyze the distinct IPs in the logs on a daily basis. At my old company, when MapReduce was used to process the data, it was taking about 6 hours to process one day’s worth of data at a size of 106 GB. We had an AWS cluster consisting of 50 spot instances and 4 on-demand instances running to perform the analysis at a cost of $150 per day.
Our system was then shifted to use Spark and HyperLogLog. This shift brought down the cost to $16.50 per day.
To summarize, we had a 3.1 TB stream of data processed every month at a cost of $495, which was costing about $4,500 on the original system using MapReduce without the statistical algorithm in place.
In the second part of this two-part blog series, we will discuss two tools in depth: Apache Spark and Apache Pig. We will take a look at how Pig combined with Spark makes existing ETL pipelines 100 times faster, and we will further our understanding of how statistical perspectives positively effect data analytics.
Praveen Rachabattuni is a tech lead at Sigmoid Analytics, a company that provides a real-time streaming and ETL framework on Apache Spark. Praveen is also a committer to Apache Pig.