Bloom filters
Last updated
Last updated
When you have a large set of structured data (identified by record IDs) stored in a set of data files, what is the most efficient way to know which file might contain our required data?
Reading each file would be slow as we have to read a lot of data from the disk.
We could build an index on each data file and store it in a separate index file. Each index file will be sorted on the record ID. If we want to search an ID in this index, the best we can do is a Binary Search.
A Bloom filter is a probabilistic data structure based on hashing designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. Elements are not added to the set but their hash is.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set, thus false positives are possible.
How does it work?
An empty Bloom filter is a bit-array of m
bits, all set to 0. We also have k
different hash functions, each of which maps a set element to one of the m
bit positions.
To add an element, feed it to the hash functions to get k
bit positions, and set the bits at these positions to 1.
To test if an element is in the set, feed it to the hash functions to get k
bit positions.
If any of the bits at these positions is 0, the element is definitely not in the set.
If all are 1, then the element may be in the set.
Choosing Hash Functions
The hash functions used in a Bloom filter should be independent and uniformly distributed. Some examples of independent enough include murmur, xxHash, the fnv series of hashes, and HashMix. The more hash functions you have, the slower your bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.
Choosing the length of a Bloom filter
The false positive rate of you filter can vary on the length of the filter. A larger filter will have less false positives, and a smaller one more.
To choose the size of a bloom filter, we:
Choose a ballpark value for n
Choose a value for m
Calculate the optimal value of k
Calculate the error rate for our chosen values of n, m, and k. If it's unacceptable, return to step 2 and change m; otherwise we're done.
Time
insertion
O(k)
search
O(k)
Space
space
O(m)
In BigTable (and Cassandra), any read operation has to read from all SSTables that make up a Tablet. If these SSTables are not in memory, the read operation may end up doing many disk accesses. To reduce the number of disk accesses, BigTable uses Bloom filters.
Bloom filters are created for SSTables (particularly for the locality groups). They help reduce the number of disk accesses by predicting if an SSTable may contain data corresponding to a particular row or column pair. For certain applications, a small amount of Tablet server memory used for storing Bloom filters drastically reduces the number of disk-seeks, thereby improving read performance.
Resources:
The rate of the false positives will be approximately where = number of elements you expect to insert, = number of bits in the bloom filter and = number of hash functions.
If we are using a bloom filter with bits and hash function, insertion and search will both take time. In both cases, we just need to run the input through all of the hash functions. Then we just check the output bits.
To truly understand the space complexity of the bloom filter, you have to first choose your parameters. You could make a bloom filter with and it would just be a hash table that ignores collisions. However, you would have a very large if you wanted to keep your false positive rate low. The space of the actual data structure (what holds the data) is simply .