# Bloom Filters and Beyond: Probabilistic Membership in Practice

Optimizing large systems is incredibly satisfying. Few things are more exciting than seeing a large baseline shift in your metrics after a clever new algorithm is deployed. Optimization can save money, make systems easier to operate, improve developer productivity, or sometimes unlock scale that wouldn’t otherwise be possible. I have written some Medium posts in the past about load-balancing optimizations. Today, I’d like to talk about Bloom filters and some of their close cousins, with real-world examples of applying them.

Many developers know about Bloom filters but feel intimidated about using them in production. As programmers, we are trained to write correct code and build things that work reliably. Using a data structure that’s only right most of the time can feel odd. In the world of distributed systems, however, as we will see in this post, a cheap way of being almost right can be much more valuable than an expensive way of being exact.

### Motivating Example

Consider a load-balancing scenario where a few servers allocate a continuous influx of containers to a vast pool of worker nodes, creating an oversubscription of resources. Every container request is tied to a control-plane entity, like a function in FaaS, where clustering too many similar function containers in one node may result in sub-optimal workloads due to their correlated resource usage.

It’s crucial to avoid duplicating the same function on a single worker. Having a real-time list of functions across ~10k rapidly changing worker nodes, each running ~1k functions, could resolve this, but fetching this information proves challenging due to the sheer volume of data that would need to be constantly transferred.

Now, let’s consider a different solution. Every few seconds, each worker node creates a ~1 kB Bloom filter incorporating all its active function names and transmits it to the placement servers. This filter aids in determining where to assign new functions, minimizing correlated workloads. This approach needs only a fraction of the data transfer of the previous one while avoiding the complexities of delta/anti-entropy protocols or database upkeep. Counting Bloom filters can also track the number of running copies per function if you can afford more memory.

### Probabilistic Membership

Probabilistic membership is a class of algorithms where we can perform the .contains() operation by retaining only a small fraction of the original data. On the downside, depending on how you design your data structure, you will see some false positives (.contains says yes, but the item isn’t in the set) but never any false negatives. While this basic model is already fairly useful, over the years, many variants have been proposed that add a lot more interesting features.

At a high level, the ADT — let’s call it ProbableSet — looks like this:

```python
class ProbableSet:

    def insert(self, key) -> None:
        # Must be implemented in all implementations
        pass

    def maybe_contains(self, key) -> bool:
        # Must be implemented in all implementations
        pass

    def delete(self, key) -> None:
        # The deleter must know the key was inserted
        # Only select implementations, like Counting BF and Cuckoo filter
        pass

    def get_count_if_contains(self, key) -> int:
        # Will return a wrong value if it's a false positive
        # Only select implementations, like Counting BF
        pass

    def count(self) -> int:
        # Returns the count of keys
        # Most implementations can approximate this, but it's easy enough
        # to track the actual number.
        pass
```

There are two classes of algorithms that are used to implement this ADT, namely 1) Bloom filter and its derivatives and 2) perfect hash function-based algorithms.

I won’t go into implementation details for all these as that can be easily Googled. I am rather going to talk about the key differences and when to use what.

#### Bloom filter derivatives

This class of algorithms uses an array of m bits (or counters), hashes each incoming key with k different hash functions, takes each hash mod m, and sets the corresponding bit (or increments the counter in some cases). The quality depends on how large m is relative to n (the cardinality of your key space) and on the chosen value of k. A good standard formula for choosing k is:

k = ln 2 * (m / n)

If you choose k like this, the relationship between m, n, and the false positive probability p is approximately:

m/n = -1.44 * log2(p)

So if you have ~1,000 keys and need a false positive probability of 0.01, you need about 9.5k bits (roughly 1.2 kB).
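As a sanity check, the arithmetic above can be packed into a tiny sizing helper (a sketch; `bloom_sizing` is an illustrative name, not a library function):

```python
import math

def bloom_sizing(n: int, p: float) -> tuple[int, int]:
    """Bits m and hash count k for n keys at false positive rate p."""
    m = math.ceil(-n * math.log2(p) * 1.44)  # m/n = -1.44 * log2(p)
    k = round(math.log(2) * m / n)           # k = ln 2 * (m / n)
    return m, k

print(bloom_sizing(1000, 0.01))  # (9568, 7): ~9.5k bits, 7 hash functions
```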

**Classic Bloom filters**

This data structure was invented in 1970 and still finds a lot of utility in practice. It is the most space-efficient structure covered here, but it only supports membership queries: you cannot delete, count, or expire items.
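To make the mechanics concrete, here is a minimal classic Bloom filter sketch. It uses `hashlib.md5` purely so the snippet is self-contained (a real implementation would use a cheaper hash like murmur, as discussed in the hash function section later), along with the double-hashing trick also covered there:

```python
import hashlib

class BloomFilter:
    """Minimal classic Bloom filter; a sketch, not production code."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed into bytes

    def _positions(self, key: str) -> list[int]:
        # Derive k indexes from one 128-bit digest via double hashing.
        d = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:], "little")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, key: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(key))

bf = BloomFilter(m=9600, k=7)
bf.insert("resize-image")
print(bf.maybe_contains("resize-image"))  # True
print(bf.maybe_contains("send-email"))    # almost certainly False
```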

**Counting Bloom filters**

This is a very versatile and perhaps my favorite data structure. It uses more bits per position in the array and stores the count of hash functions hitting that position instead of just 1 and 0. When checking just for existence, you look for non-zero instead of 1. When deleting an element, you decrement the number by 1. When checking the count, you look at all k positions and return the smallest of those numbers.

The counting aspect of this data structure makes it ideal for keeping frequencies of keys over a large stream of data when the cardinality of keys is relatively small, such as in metrics services. Essentially the same structure in the metrics world is called a Count-min sketch (it lays the k counters out as separate rows with one hash function per row, but the idea is identical). Requiring a count of two instead of one before declaring a hit is a hack sometimes used to improve the false positive rate of a classic Bloom filter, and it only needs 2*m memory instead of m.
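Here is a sketch of a counting filter with deletes and min-based counts, using one byte per counter for simplicity (method names follow the ADT above; the hash choice is illustrative):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: one saturating byte per counter."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counts = bytearray(m)

    def _positions(self, key: str) -> list[int]:
        d = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:], "little")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.counts[p] = min(self.counts[p] + 1, 255)  # saturate, don't wrap

    def delete(self, key: str) -> None:
        # Caller must know the key was actually inserted.
        for p in self._positions(key):
            if self.counts[p] > 0:
                self.counts[p] -= 1

    def maybe_contains(self, key: str) -> bool:
        return all(self.counts[p] for p in self._positions(key))

    def get_count_if_contains(self, key: str) -> int:
        # Min across the k positions; may overestimate on collisions.
        return min(self.counts[p] for p in self._positions(key))

cbf = CountingBloomFilter(m=4096, k=4)
for _ in range(3):
    cbf.insert("thumbnail")
print(cbf.get_count_if_contains("thumbnail"))  # 3, barring rare collisions
cbf.delete("thumbnail")
print(cbf.maybe_contains("thumbnail"))         # True, count is now 2
```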

**Sliding Bloom filters**

This is where you store two arrays instead of one, old and new. Each can be either a classic or a counting Bloom filter. Insertions and deletions always happen in the new array, while lookups use both (by adding or ORing them). When a certain condition is met, like X minutes passed or Y items inserted, we throw away the old array and the new one takes its place. This is useful when you run these over a stream of requests for a long duration.

You can also do LRU-style evictions in these by promoting items from old to new when they are “found.” There are many fun variants of this data structure.
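A minimal sketch of the rotation logic, assuming a classic filter on each side (one byte per bit position for simplicity) and an insert-count trigger:

```python
import hashlib

def _positions(key: str, m: int, k: int) -> list[int]:
    d = hashlib.md5(key.encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:], "little")
    return [(h1 + i * h2) % m for i in range(k)]

class SlidingBloomFilter:
    """Two classic filters: inserts go to `new`, lookups check both.

    After `max_inserts` insertions, the old array is dropped and the
    new one takes its place, bounding memory over an endless stream.
    """

    def __init__(self, m: int = 4096, k: int = 4, max_inserts: int = 1000):
        self.m, self.k, self.max_inserts = m, k, max_inserts
        self.new, self.old = bytearray(m), bytearray(m)
        self.inserted = 0

    def insert(self, key: str) -> None:
        for p in _positions(key, self.m, self.k):
            self.new[p] = 1
        self.inserted += 1
        if self.inserted >= self.max_inserts:  # rotation condition
            self.old, self.new = self.new, bytearray(self.m)
            self.inserted = 0

    def maybe_contains(self, key: str) -> bool:
        return all(self.new[p] or self.old[p]
                   for p in _positions(key, self.m, self.k))

sbf = SlidingBloomFilter(max_inserts=2)
sbf.insert("a")
sbf.insert("b")   # rotation happens here
sbf.insert("c")
print(sbf.maybe_contains("a"))  # True: still visible via the old array
```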

**Other variants**

Given how simple and flexible Bloom filters are, it’s no surprise that there’s a whole cottage industry devoted to improving them. Per this survey, there are 60+ known variants. If you think about this stuff for a while and devise a clever idea, chances are it already has a name.

#### Cuckoo filters

Cuckoo filters are based on cuckoo hashing, a perfect-hash-style algorithm. When you know all your hash table keys in advance (e.g., the keywords of a language when building an AST parser), you can run them through the cuckoo hashing algorithm to find a collision-free placement. This is done by creating two identically sized hash tables with separate hash functions; every key ends up in one of its two candidate slots.

If a slot is taken, you displace the current occupant and recursively rehash it until all keys are satisfied. The cuckoo filter works on the same principle, but instead of storing the keys themselves, it stores short hash fingerprints in whichever table has space; this means two keys with the same fingerprint will produce a false positive. Unlike Bloom filters, you control the false positive probability by increasing the fingerprint length.

One downside of cuckoo filters is that they have a hard capacity limit, so using them in streaming, long-lived use cases is tricky. They can be resized or supplemented, but implementations of that are non-trivial.
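For reference, a toy cuckoo filter can be sketched as follows. It uses 16-bit fingerprints, 4-slot buckets, and the partial-key XOR trick for the alternate bucket; `hashlib.md5` stands in for a proper fast hash, and the names are illustrative:

```python
import hashlib
import random

def _h(data: bytes) -> int:
    # Stand-in hash; a real implementation would use something like murmur3.
    return int.from_bytes(hashlib.md5(data).digest()[:8], "little")

class CuckooFilter:
    """Minimal cuckoo filter sketch.

    n_buckets must be a power of two so the XOR partial-key trick
    maps each bucket back to its alternate.
    """

    def __init__(self, n_buckets: int = 1024, bucket_size: int = 4,
                 max_kicks: int = 500):
        self.n = n_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets: list[list[int]] = [[] for _ in range(n_buckets)]

    def _fp_and_index(self, key: str) -> tuple[int, int]:
        h = _h(key.encode())
        fp = (h & 0xFFFF) or 1           # 16-bit fingerprint; 0 means "empty"
        return fp, (h >> 16) % self.n

    def _alt_index(self, i: int, fp: int) -> int:
        # Partial-key cuckoo hashing: alternate bucket derived from fp alone.
        return (i ^ _h(fp.to_bytes(2, "little"))) % self.n

    def insert(self, key: str) -> bool:
        fp, i1 = self._fp_and_index(key)
        for i in (i1, self._alt_index(i1, fp)):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a resident fingerprint and relocate it.
        i = i1
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                     # hit the hard capacity limit

    def maybe_contains(self, key: str) -> bool:
        fp, i1 = self._fp_and_index(key)
        return (fp in self.buckets[i1]
                or fp in self.buckets[self._alt_index(i1, fp)])

    def delete(self, key: str) -> bool:
        fp, i1 = self._fp_and_index(key)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("resize-image")
print(cf.maybe_contains("resize-image"))  # True
cf.delete("resize-image")
print(cf.maybe_contains("resize-image"))  # False
```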

### Note on Hash Functions

Computing so many hashes may also feel intimidating. However, it’s important to realize that not all hashes are computationally expensive. For data structure hashing, stay away from cryptographic hashes like SHA-256 unless you have a very good reason. My go-to hash function for these use cases is MurmurHash3, which is cheap to compute and has the properties we care about: uniform distribution and apparent randomness.

Moreover, we don’t even have to compute all k hashes. There’s some great research showing that we can, in fact, use a single 64-bit hash function, break its output into two 32-bit values, h1 and h2, and then “fake” as many hash functions as we want using the form h1 + i*h2. The code looks like the following snippet:

```python
import mmh3  # third-party: pip install mmh3

def k_hashes(word, k):
    # One 64-bit murmur hash, split into two 32-bit halves.
    h = mmh3.hash64(word, signed=False)[0]
    h1 = h & (2**32 - 1)  # low 32 bits
    h2 = h >> 32          # high 32 bits
    # "Fake" k hash functions with the h1 + i*h2 construction.
    # signed=False keeps everything non-negative, so no sign fix-up is needed.
    return [h1 + i * h2 for i in range(k)]
```

If using Java, Guava has a pretty good implementation of classic Bloom filters that uses Murmur hash with the trick above.

If using murmur3, beware that the versions optimized for 32-bit and 64-bit platforms yield different results for the same keys. If you share filters across a network of devices that includes both kinds of machines, choose one implementation and run it everywhere, even though that’s not optimal.

### More Examples

In addition to the example provided earlier, here are some more ideas to motivate you to try these wonderful data structures in your own work:

- Bloom filters act as concise stores of which records are accessed frequently. This is great for warming caches when deploying a new cache node. You may wrongly cache a small number of records but never under-cache. There is a great paper that describes such a system at Akamai.
- When maintaining a pool of independent cache nodes, one strategy for getting a good hit rate is to regularly sync a Bloom filter of each cache node’s contents to all clients. False positives can be dealt with using read-through or client-side back-off.
- Firewalls and blocklist checkers are a great use case for Bloom filters. Since packets or requests generally aren’t bad, the fact that Bloom filters cannot have false negatives is really handy. If you hit a positive, you can check against an authoritative source and cache the result locally.
- In distributed databases, a SQL join operation can benefit from Bloom filters. Instead of sending all your join keys to the other side, you only send a filter, and the other side will respond with everything that matches the filter. As long as the false positives are negligible, this saves time and network bandwidth. This technique is called a Bloom join.
- In web crawlers, Bloom filters can keep track of already visited URLs. They can be maintained by many servers separately and can be combined with an OR operation. False positives can hurt in this case, but since web crawlers can estimate the number of URLs to crawl fairly well, designing a low FP filter is possible.
- Bitcoin apparently uses Bloom filters to let lightweight clients fetch only the transaction data relevant to them. This is fitting, as blockchain transaction history is a set of hashes, and so is a Bloom filter!
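As an illustration of the Bloom join idea from the list above, here is a toy end-to-end sketch. Plain Python structures stand in for the two database sides, and the filter helpers are deliberately simplified:

```python
import hashlib

def positions(key, m: int = 4096, k: int = 4) -> list[int]:
    d = hashlib.md5(str(key).encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:], "little")
    return [(h1 + i * h2) % m for i in range(k)]

def build_filter(keys, m: int = 4096) -> bytearray:
    bits = bytearray(m)
    for key in keys:
        for p in positions(key, m):
            bits[p] = 1
    return bits

def maybe_contains(bits: bytearray, key) -> bool:
    return all(bits[p] for p in positions(key, len(bits)))

# Side A holds orders; side B holds users. Instead of shipping all of
# A's join keys to B, A ships a small filter and B returns candidates.
orders = [(1, "book"), (2, "lamp"), (1, "pen")]
users = {1: "ada", 2: "bob", 3: "carol", 4: "dan"}

f = build_filter(uid for uid, _ in orders)
candidates = {uid: name for uid, name in users.items()
              if maybe_contains(f, uid)}

# B ships only `candidates` back; A completes the exact join locally,
# which also discards any false positives.
joined = [(uid, item, candidates[uid])
          for uid, item in orders if uid in candidates]
print(joined)  # [(1, 'book', 'ada'), (2, 'lamp', 'bob'), (1, 'pen', 'ada')]
```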

Bloom Filters and Beyond: Probabilistic Membership in Practice was originally published in Better Programming on Medium.