probabilistic retrieval from a fuzzy set #
Bloom filters are motivated by the following question: how do we use a fixed amount of memory to check set-inclusion?
As always there are tradeoffs. The tradeoff in this case is that we lose certainty — but what in life is certain, anyway?
More formally, a Bloom filter is a probabilistic data structure which implements the following operations:
insert(x) — indicate that an item x has been added to the set.

query(x) — check whether an item x has been inserted into the Bloom filter previously.
Notably, the standard implementation of a Bloom filter does not implement an operation to remove an element from the structure. Since distinct items can hash to overlapping bits, erasing one item's bits would also corrupt the inclusion queries for other items that share those bits.
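In code, the interface is as simple as it sounds. Here is a hypothetical stub (a concrete implementation appears at the end of this post):

class BloomFilterInterface:
    """Hypothetical interface stub for a Bloom filter."""

    def insert(self, x):
        """Record that x has been added to the set."""

    def query(self, x):
        """Return True if x *may* have been inserted,
        False if it definitely has not been."""

    # Note: no remove(x). Items can share bits, so clearing one
    # item's bits could corrupt queries for other items.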
implementing a bloom filter with hash functions #
How do we implement Bloom filters? The short answer is with hash functions that map elements to addresses of a memory bank of bits.
Imagine you have m bits labeled 1 to m, and k hash functions $$ h_1, \ldots, h_k : X \to \{1,\ldots,m\} $$ chosen so that there is little-to-no correlation between the hash functions (this independence is important).
Then the way we insert
an element x is to compute all the hash values
$$
H(x) = \{ h_1(x), \ldots, h_k(x) \} \subset [m]
$$
and set each of those bits to 1 (a bit that is already 1 simply stays 1).
When we query for the existence of x in the set represented by the Bloom filter, we compute H(x) again, and check that all the bits are equal to 1; if there is even a single zero bit in the hashed bits H(x), we return false
for the query.
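To make that concrete, here is a toy walk-through with m = 10 bits and k = 2 hash functions (h1 and h2 below are made up purely for illustration):

m = 10
memory = [0] * m

h1 = lambda x: (3 * x + 1) % m  # hypothetical hash function
h2 = lambda x: (7 * x + 4) % m  # hypothetical hash function

# insert(42): compute H(42) = {h1(42), h2(42)} = {7, 8} and set those bits.
for h in (h1, h2):
    memory[h(42)] = 1

# query(42): both bits in H(42) are 1, so the filter answers True.
print(all(memory[h(42)] == 1 for h in (h1, h2)))  # True

# query(7): H(7) = {2, 3}, and those bits are still 0, so it answers False.
print(all(memory[h(7)] == 1 for h in (h1, h2)))   # False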
false positive rates #
It’s possible to get a false positive with Bloom filters (the filter says x is in the set, but it was never actually inserted).
It’s not possible to get false negatives (the filter says x is not in the set, but it’s actually been inserted).
What’s the chance of getting one of these false positives? Suppose we’re using k hash functions per element across m bits, and that the hash functions behave like independent, uniformly random maps. After inserting n elements, the probability that any particular bit is still 0 is $$ \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}, $$ so the probability that all k bits checked by a query land on (spuriously) set bits is approximately $$ \left( 1 - e^{-kn/m} \right)^k. $$
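A small helper makes it easy to plug in numbers (the function name and the example values are mine, purely for illustration):

import math

def false_positive_rate(n, m, k):
    """Approximate false positive rate after n insertions into
    an m-bit Bloom filter with k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# e.g. 1,000 insertions, 10,000 bits, 5 hash functions:
print(false_positive_rate(1000, 10000, 5))  # about 0.0094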
optimal settings for bloom filters #
Say we have a fixed budget of $m$ bits in our memory bank. How many hash functions should we be using to minimize the false positive rate?
(Watch this space; this is a topic for a future update.)
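In the meantime, here is a minimal numerical sketch: for a fixed n and m, we can simply scan candidate values of k and keep whichever one minimizes the false positive formula from the previous section (the example values of n and m are arbitrary):

import math

n, m = 1000, 10000  # example values, chosen arbitrarily
fpr = lambda k: (1 - math.exp(-k * n / m)) ** k

# Scan k = 1..20 and keep the k with the smallest false positive rate.
best_k = min(range(1, 21), key=fpr)
print(best_k, fpr(best_k))  # 7, about 0.0082 for these n and m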
implementing it in python #
In the code below, we assume we can construct a family of hash functions based on big prime numbers and modular arithmetic; see this link for the complicated details of how that’s possible. But in general, designing hash functions is hard, and it’s a vibrant field of algorithms research to design excellent hash functions for various purposes — for instance, cryptographic security, low collision rates, et cetera.
import random

class BloomFilter(object):
    """
    Bloom filter implementation.
    """
    # One big prime for the modular hash family (this particular
    # constant is an arbitrary choice; any sufficiently large prime works).
    BIGPRIME = 15485863

    def __init__(self, numhashes, numbits):
        self.numhashes = numhashes
        self.numbits = numbits
        self.hash_functions = []
        for _ in range(numhashes):
            coeff = random.randrange(1, self.BIGPRIME)
            offset = random.randrange(0, self.BIGPRIME)
            # Bind coeff and offset as default arguments so that each
            # lambda keeps its own constants; a plain closure over the loop
            # variable would see only the final iteration's values. hash(x)
            # maps any hashable key to an integer first.
            hk = (lambda x, a=coeff, b=offset:
                  ((a * hash(x) + b) % self.BIGPRIME) % self.numbits)
            self.hash_functions.append(hk)
        self.memory = [0 for _ in range(numbits)]

    def insert(self, x):
        # Set every bit addressed by H(x) to 1.
        bits = [hf(x) for hf in self.hash_functions]
        for bit in bits:
            self.memory[bit] = 1

    def query(self, x):
        # x may have been inserted only if every bit in H(x) is set.
        bits = [hf(x) for hf in self.hash_functions]
        return all(self.memory[b] == 1 for b in bits)
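Finally, a quick usage sketch. Since the hash functions are drawn at random, querying a key that was never inserted returns False only with high probability (false positives are possible):

bf = BloomFilter(numhashes=5, numbits=1000)
bf.insert("hello")

print(bf.query("hello"))    # always True
print(bf.query("goodbye"))  # False with high probability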