HyperLogLog

HyperLogLog is a probabilistic algorithm that’s used for estimating the cardinality (or the number of distinct elements) of a set with very low memory usage. It was introduced by Philippe Flajolet and is particularly useful when dealing with large datasets or when memory efficiency is a concern. The HyperLogLog algorithm approximates the cardinality of a set by using a fixed amount of memory, regardless of the size of the set. It achieves this by exploiting the properties of hash functions and probabilistic counting.

The basic idea behind HyperLogLog is to hash each element of the set and look at the longest run of zeros in the binary representation of the hash values. The length of that longest run is used to estimate the cardinality. By averaging such estimates over many buckets (independent substreams of the hashed values), a more accurate cardinality estimate can be obtained.

Let’s understand this by considering an example. The problem statement is, “We need to find the count of unique visitors for a website.”

What would a naive solution look like? We could maintain a hashmap whose key is the unique user ID and whose value counts how many times that user has visited the website. This works fine for a low-traffic website, but as the number of visitors scales up, the memory footprint grows linearly. So, for one billion visitors, we need 1 GB even if we represent each visitor by just 1 byte. Let’s explore how HyperLogLog comes to the rescue here.
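
As a concrete illustration of this naive approach, here is a minimal Python sketch (the user IDs and the choice of a dictionary are hypothetical); the memory it holds grows with every new distinct user ID:

# Naive exact counting: memory grows linearly with the number of distinct visitors.
visit_counts = {}  # user_id -> number of visits

def record_visit(user_id: str) -> None:
    visit_counts[user_id] = visit_counts.get(user_id, 0) + 1

def unique_visitors() -> int:
    return len(visit_counts)

record_visit("John_1275")
record_visit("Sarah1978")
record_visit("John_1275")
print(unique_visitors())  # 2 - but every distinct user ID stays in memory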

We need to use randomness here. Let’s use a hash function to convert a username into a binary number, and assume the hash is perfectly random, so every output bit is equally likely to be 0 or 1, while the same username always yields the same hash value.
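
As a rough stand-in for such a hash function, the following sketch (an assumption for illustration; the book does not prescribe a specific hash) maps a username deterministically to a 30-bit binary string. Note that the bit patterns it produces will not match the illustrative values shown next:

import hashlib

def hash_to_bits(username: str, bits: int = 30) -> str:
    # MD5 is only a stand-in for a "perfectly random" hash here;
    # any well-mixed hash function would do for this illustration.
    digest = int(hashlib.md5(username.encode()).hexdigest(), 16)
    return format(digest % (1 << bits), f"0{bits}b")

print(hash_to_bits("John_1275"))        # the same username always yields the same bits
print(hash_to_bits("David.raymond23"))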

Let’s say we have 1 billion users. We need at least 30 bits to represent them:

u1(1,000,000,000) -> 111011100110101100101000000000

Now, let’s say we have the following:

hash("John_1275") = 111011100110101100101000001100

hash("David.raymond23") = 111011100110101100001000000010

hash("Sarah1978") = 100011100110101100101000000001

hash("John") -> 111011100110101100101000001100

Let’s reframe the problem: “We need to count the number of unique random binary numbers.”

Let’s understand how we can do this by using an analogy of flipping a coin: we keep flipping the coin until we get a tails (T). Think about getting the sequence H, H, H, T. The probability of getting it is quite low, right?

That’s a ½ * ½ * ½ * ½ = 1/16 probability. So, on average, we would need about 16 attempts before we see H, H, H, T.

To extend this, if someone tells us that the longest streak of leading heads (H) they got was L, it means that, approximately, they made 2^(L+1) coin-flip attempts. In the preceding example, L = 3, so 2^(3+1) = 16.
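
A quick simulation (a sketch of the analogy, not taken from the book; the function name is made up) makes the link between the longest leading-heads streak and the number of attempts concrete:

import random

def longest_leading_heads(num_attempts: int) -> int:
    # For each attempt, flip until the first tails and record how many
    # heads came before it; return the longest streak observed.
    best = 0
    for _ in range(num_attempts):
        streak = 0
        while random.random() < 0.5:  # heads with probability 1/2
            streak += 1
        best = max(best, streak)
    return best

L = longest_leading_heads(16)
print(L, 2 ** (L + 1))  # L is usually 3 or 4 here, so 2^(L+1) is on the order of 16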

Let’s get back to our binary numbers (hash of usernames):

hash("John_1275") = 111011100110101100101000001100

hash("David.raymond23") = 111011100110101100001000000010

hash("Sarah1978") = 100011100110101100101000000001

hash("John") = 111011100110101100101000001100

Instead of leading Hs, we will use ending 0s. In the preceding sample of four usernames, the longest run of 0s at the end of any hash is 2, so we likely have about 2^(2+1) = 8 visitors.
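
The same estimate can be reproduced directly from the sample bit strings (a small sketch; the helper name is made up for illustration):

def ending_zeros(bits: str) -> int:
    # Number of consecutive '0' characters at the end of the bit string.
    return len(bits) - len(bits.rstrip("0"))

samples = [
    "111011100110101100101000001100",  # hash("John_1275")
    "111011100110101100001000000010",  # hash("David.raymond23")
    "100011100110101100101000000001",  # hash("Sarah1978")
    "111011100110101100101000001100",  # hash("John")
]

L = max(ending_zeros(s) for s in samples)
print(L, 2 ** (L + 1))  # 2 and 8: the estimated number of unique visitors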

On a small sample like this, the estimate isn’t accurate, but does it become accurate at a large scale? Even at high scale, accuracy suffers because of outliers: a single unlucky hash with an unusually long run of ending 0s can throw the whole estimate off.

To increase the accuracy, we can follow these steps (a code sketch follows the list):

  1. Split the incoming hash(usernames) randomly into k buckets (for example, based on a few bits of each hash).
  2. In each bucket, track the maximum number of ending 0s seen.
  3. Store these "ending 0 counters" for the k buckets.
  4. Calculate L = the average of these k counters. Instead of taking the arithmetic mean, we take the harmonic mean (this is why it’s called HyperLogLog), because the harmonic mean is better at ignoring outliers. The arithmetic mean of N numbers n1, n2, … is (n1 + n2 + … + nN)/N, whereas the harmonic mean is N/(1/n1 + 1/n2 + … + 1/nN).
  5. Estimate the final number of unique visitors as 2^(L+1).
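
Putting the steps together, here is a minimal Python sketch of the bucketed estimator. The class name, the use of SHA-1 as the “random” hash, and the final multiplication by k (needed because each bucket only sees roughly 1/k of the values) are assumptions for illustration; production HyperLogLog instead takes the harmonic mean of the per-bucket estimates and applies bias-correction constants.

import hashlib

class HyperLogLogSketch:
    """Toy cardinality estimator following the steps above (no bias correction)."""

    def __init__(self, k: int = 16):
        self.k = k                # number of buckets
        self.counters = [0] * k   # max ending-0 count seen per bucket

    @staticmethod
    def _hash(value: str) -> int:
        # Stand-in for a "perfectly random" hash function.
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    @staticmethod
    def _ending_zeros(n: int) -> int:
        # Number of trailing zero bits in n (0 if n is 0).
        return (n & -n).bit_length() - 1 if n else 0

    def add(self, value: str) -> None:
        h = self._hash(value)
        bucket = h % self.k                      # step 1: route the hash to a bucket
        zeros = self._ending_zeros(h // self.k)  # steps 2-3: track max ending 0s
        self.counters[bucket] = max(self.counters[bucket], zeros)

    def estimate(self) -> float:
        # Step 4: harmonic mean of the non-empty counters.
        nonzero = [c for c in self.counters if c > 0]
        if not nonzero:
            return 0.0
        L = len(nonzero) / sum(1.0 / c for c in nonzero)
        # Step 5, scaled by k because each bucket sees only ~1/k of the stream.
        return self.k * 2 ** (L + 1)

sketch = HyperLogLogSketch()
for i in range(10_000):
    sketch.add(f"user_{i}")
print(round(sketch.estimate()))  # roughly the right order of magnitude for 10,000 uniques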

Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large datasets. HyperLogLog uses significantly less memory than this, at the cost of obtaining only an approximation of the cardinality.

HyperLogLog provides a relatively small memory footprint compared to traditional methods for exact counting, such as storing each element in a set.

We can estimate the space needed by the HyperLogLog counters with the following calculation:

2^(L+1) ≈ 1 billion visitors

L ≈ log2(1,000,000,000) ≈ 30, so a counter never needs to record more than about 30 ending 0s

ceil(log2(30)) = 5 bits

5 bits are enough to represent the number of ending 0s, so let’s round up to a byte per counter. Even if we use k = 10 counters, that is only about 10 bytes.
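
These back-of-the-envelope numbers can be checked with a couple of lines of Python (a trivial sketch; the variable names are made up):

import math

max_ending_zeros = math.ceil(math.log2(1_000_000_000))    # ~30 for a billion visitors
bits_per_counter = math.ceil(math.log2(max_ending_zeros)) # 5 bits store a value up to ~30
k = 10                                                     # number of counters
print(max_ending_zeros, bits_per_counter, k)               # 30 5 10 -> about 10 bytes at one byte per counter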

HyperLogLog has found applications in various domains where cardinality estimation is important, such as database systems, network traffic analysis, web analytics, and big data processing. It is widely used in distributed systems and data streaming scenarios where memory is limited and fast approximate cardinality estimation is required.
