HyperLogLog
HyperLogLog is a probabilistic algorithm for estimating the cardinality (the number of distinct elements) of a set with very low memory usage. It was introduced by Philippe Flajolet and his collaborators and is particularly useful when dealing with large datasets or when memory efficiency is a concern. The HyperLogLog algorithm approximates the cardinality of a set using a fixed amount of memory, regardless of the size of the set. It achieves this by exploiting the properties of hash functions and probabilistic counting.
The basic idea behind HyperLogLog is to hash each element of the set and determine the longest run of zeros in the binary representation of the hash values. The length of the longest run of zeros is used as an estimate of the cardinality. By averaging these estimates over multiple buckets, a more accurate cardinality estimate can be obtained.
Let’s understand this by considering an example. The problem statement is, “We need to find the count of unique visitors for a website.”
What are some naive solutions here? We can maintain a hashmap with the unique user ID as the key and, as the value, a count of how many times that particular user visited the website. This works fine for a low-scale website, but as the website scales up in terms of the number of visitors, the memory footprint grows linearly. So, for one billion visitors, we need 1 GB even if we represent each visitor by just 1 byte. Let’s explore how HyperLogLog comes to the rescue here.
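For reference, here’s a minimal sketch of this naive approach in Python (the user IDs and function names are made up for illustration):

```python
# Naive exact counting: a hashmap keyed by user ID.
# Memory grows linearly with the number of distinct visitors.
visit_counts = {}

def record_visit(user_id: str) -> None:
    visit_counts[user_id] = visit_counts.get(user_id, 0) + 1

def unique_visitors() -> int:
    return len(visit_counts)

record_visit("John_1275")
record_visit("Sarah1978")
record_visit("John_1275")  # repeat visit; doesn't add a new unique visitor
print(unique_visitors())   # 2
```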
We need to use randomness here. Let’s assume we have a perfect hash function that converts a username into a completely random binary number, with the guarantee that the same username always yields the same hash value.
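As a concrete stand-in for that perfect hash, here’s a sketch in Python that uses MD5 and keeps the low 30 bits (the 30-bit width and the helper name hash_to_bits are choices for illustration, not part of the algorithm):

```python
import hashlib

def hash_to_bits(username: str, bits: int = 30) -> str:
    """Hash a username and return its low `bits` bits as a binary string."""
    digest = int(hashlib.md5(username.encode()).hexdigest(), 16)
    return format(digest % (1 << bits), f"0{bits}b")

# The same username always yields the same bit pattern.
print(hash_to_bits("John_1275"))
```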
Let’s say we have 1 billion users. We need at least 30 bits to represent them:

u1(1,000,000,000) -> 111011100110101100101000000000
Now, let’s say we have the following:

hash("John_1275") = 111011100110101100101000001100
hash("David.raymond23") = 111011100110101100001000000010
hash("Sarah1978") = 100011100110101100101000000001
hash("John") = 111011100110101100101000001100
Let’s reframe the problem: “We need to count the unique number of random binary numbers.”
Let’s understand how we can do this by using an analogy of flipping a coin: we flip the coin until we get a T. Think about getting the sequence H, H, H, T. Getting exactly this sequence is quite unlikely, right? It has a ½ * ½ * ½ * ½ = 1/16 probability. So, on average, we must repeat the experiment 16 times before we see H, H, H, T.
To extend this, if someone tells us that the largest streak of leading heads (H) they got was L, this means that, approximately, they flipped the coin 2^(L+1) times. In the preceding example, L = 3, so 2^(3+1) = 16.
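We can check this relationship with a quick simulation (a rough sketch; the exact outcome varies from run to run):

```python
import random

def leading_heads() -> int:
    """Flip a fair coin until T appears; return the streak of leading Hs."""
    streak = 0
    while random.random() < 0.5:  # heads
        streak += 1
    return streak

# After roughly 2^(L+1) experiments, the longest observed streak
# of leading heads tends to be around L.
longest = max(leading_heads() for _ in range(16))
print(longest)  # typically around 3
```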
Let’s get back to our binary numbers (hashes of usernames):

hash("John_1275") = 111011100110101100101000001100
hash("David.raymond23") = 111011100110101100001000000010
hash("Sarah1978") = 100011100110101100101000000001
hash("John") = 111011100110101100101000001100
Instead of leading Hs, we will use ending 0s. In the preceding sample of four usernames, the longest run of 0s at the end of a hash is 2, so we likely have 2^(2+1) = 8 visitors.
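Here’s that estimate reproduced in Python (the hash values are copied from the sample above):

```python
def trailing_zeros(bits: str) -> int:
    """Count the 0s at the end of a binary string."""
    return len(bits) - len(bits.rstrip("0"))

hashes = [
    "111011100110101100101000001100",  # hash("John_1275")
    "111011100110101100001000000010",  # hash("David.raymond23")
    "100011100110101100101000000001",  # hash("Sarah1978")
    "111011100110101100101000001100",  # hash("John")
]

L = max(trailing_zeros(h) for h in hashes)  # 2
print(2 ** (L + 1))                         # 8 estimated visitors
```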
In a small sample set, this isn’t accurate, but does it become accurate at a large scale? Even at a high scale, the accuracy problem persists because there can be outliers: a single unlucky hash with a long run of ending 0s will skew the whole estimate.
To increase the accuracy, we can follow these steps (a code sketch follows the list):
- Split the incoming hash(usernames) into k buckets (in practice, based on a few bits of the hash itself).
- Calculate the maximum number of ending 0s in each of these buckets.
- Store these "ending 0" counters in the k buckets.
- Calculate L = the average of these k counters. Instead of taking the arithmetic mean, we take the harmonic mean (this is why it’s called HyperLogLog). The harmonic mean is better at ignoring outliers than the arithmetic mean:
  - The arithmetic mean of N numbers (n1, n2, …) is (n1 + n2 + n3 + …)/N.
  - The harmonic mean of N numbers (n1, n2, …) is N/(1/n1 + 1/n2 + 1/n3 + …).
- Estimate the final number of unique visitors as 2^(L+1).
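Putting the steps together, here’s a minimal sketch of this bucketed estimator in Python. It reuses the hypothetical hash_to_bits and trailing_zeros helpers from earlier, buckets each hash by its value modulo k, and skips empty buckets when averaging; a production HyperLogLog also applies bias-correction constants that are omitted here:

```python
import hashlib
from statistics import harmonic_mean

def hash_to_bits(username: str, bits: int = 30) -> str:
    digest = int(hashlib.md5(username.encode()).hexdigest(), 16)
    return format(digest % (1 << bits), f"0{bits}b")

def trailing_zeros(bits: str) -> int:
    return len(bits) - len(bits.rstrip("0"))

def estimate_unique(usernames, k: int = 10) -> float:
    """Simplified HyperLogLog-style estimate following the steps above."""
    counters = [0] * k
    for name in usernames:
        bits = hash_to_bits(name)
        bucket = int(bits, 2) % k              # route each hash to one of k buckets
        counters[bucket] = max(counters[bucket], trailing_zeros(bits))
    nonzero = [c for c in counters if c > 0]   # ignore empty buckets
    L = harmonic_mean(nonzero) if nonzero else 0
    return 2 ** (L + 1)

# A rough, order-of-magnitude estimate; this simplified formula is far
# coarser than a real HyperLogLog implementation.
print(estimate_unique(f"user{i}" for i in range(1000)))
```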
Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large datasets. HyperLogLog uses significantly less memory than this, at the cost of obtaining only an approximation of the cardinality.
HyperLogLog provides a relatively small memory footprint compared to traditional methods for exact counting, such as storing each element in a set.
We can estimate the space needed by the HyperLogLog counters with a quick calculation:

2^(L+1) = 1 billion visitors
L = log2(1,000,000,000), so at most 30 ending 0's
log2(30) ≈ 5 bits

5 bits can represent the number of ending 0s, so let’s round up to a byte. Even if we use k = 10 counters, the total is just 10 bytes.
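The same arithmetic in Python, as a back-of-the-envelope check:

```python
import math

visitors = 1_000_000_000
max_ending_zeros = math.ceil(math.log2(visitors))           # 30
bits_per_counter = math.ceil(math.log2(max_ending_zeros))   # 5
k = 10                                                      # number of buckets
print(max_ending_zeros, bits_per_counter)                   # 30 5
print(f"{k} counters x 1 byte = {k} bytes")                 # 10 bytes total
```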
HyperLogLog has found applications in various domains where cardinality estimation is important, such as database systems, network traffic analysis, web analytics, and big data processing. It is widely used in distributed systems and data streaming scenarios where memory is limited and fast approximate cardinality estimation is required.