## probabilistic retrieval from a fuzzy set #

Bloom filters are motivated by the following question: *how do we use a fixed amount of memory to check set-inclusion?*

As always there are tradeoffs. The tradeoff in this case is that we lose certainty — but what in life is certain, anyway?

More formally, a Bloom filter is a *probabilistic data structure* that implements the following operations:

`insert(x)` — indicate that an item *x* has been added to the set.

`query(x)` — check whether an item *x* has been inserted into the Bloom filter previously.

Notably, the standard implementation of a Bloom filter does *not* implement an operation to remove an element from the structure, since erasing an item’s inclusion from the set would also corrupt the inclusion queries for other items.

## implementing a bloom filter with hash functions #

How do we implement Bloom filters? The short answer is with *hash functions* that map elements to addresses of a *memory bank* of bits.

Imagine you have *m* bits labeled 1 to *m*, and *k* hash functions
$$
h_i: X \to \{1,\ldots,m\}, \qquad i = 1,\ldots,k,
$$
chosen so that there is little-to-no correlation between the hash functions (this is important).

Then the way we `insert` an element *x* is to compute all the hash values
$$
H(x) = \{ h_1(x), \ldots, h_k(x) \} \subset [m]
$$
and set all of those bits to 1.

When we query for the existence of *x* in the set represented by the Bloom filter, we compute *H(x)* again and check that all the bits are equal to 1; if there is even a single zero bit among the hashed bits *H(x)*, we return `false` for the query.
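To make the mechanics concrete, here is a toy sketch of `insert` and `query` at the bit level. The sizes `M` and `K`, and the trick of salting a SHA-256 digest to simulate independent hash functions, are assumptions for illustration only, not the scheme used in the implementation later in this post:

```python
import hashlib

M, K = 16, 3  # toy sizes: 16 bits, 3 hash functions

def indices(x):
    # derive K pseudo-independent indices by salting a SHA-256 digest
    return [int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % M
            for i in range(K)]

bits = [0] * M

# insert("apple"): set each hashed bit to 1
for i in indices("apple"):
    bits[i] = 1

# query: present only if every hashed bit is 1
present = all(bits[i] for i in indices("apple"))       # always True
maybe_present = all(bits[i] for i in indices("pear"))  # almost surely False
```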

## false positive rates #

It’s possible to get a false positive with Bloom filters (the filter says *x* is in the set, but it was never actually inserted).

It’s not possible to get false negatives (the filter says *x* is not in the set, but it’s actually been inserted).

What’s the chance of getting one of these false positives? Well, if we’re using *k* hash functions per element across *m* bits, we can do some basic math to find that the probability of a false positive after inserting *n* elements is approximately:
$$
\biggl( 1 - \exp\Bigl(\frac{-kn}{m}\Bigr) \biggr)^k
$$
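This formula is straightforward to evaluate directly. Here is a small helper (a hypothetical name, not part of the implementation below) to see how the rate behaves:

```python
import math

def false_positive_rate(k, n, m):
    """Approximate false positive rate after inserting n elements
    into a Bloom filter with m bits and k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# e.g. 7 hash functions, 1,000 elements, 10,000 bits
rate = false_positive_rate(7, 1000, 10000)  # ≈ 0.0082, i.e. under 1%
```

Note that the rate grows with *n*: the more elements you insert into a fixed memory bank, the more ones accumulate and the more likely a query collides on all *k* bits.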

## optimal settings for bloom filters #

Say we have a fixed budget of *m* bits in our memory bank. How many hash functions *k* should we use to minimize the false positive rate?

(Watch this space; this is a topic for a future update.)

## implementing it in python #

In the code below, we assume we can construct a family of hash functions based on big prime numbers and modular arithmetic; see this link for the details of how that’s possible. But in general, designing hash functions is *hard*, and it’s a vibrant field of research to design excellent hash functions for various purposes: cryptographic security, low collision rates, and so on.

```
import random

# a big prime for the modular hash family; any prime larger than the
# universe of integer keys will do
BIGPRIME = 2**61 - 1

class BloomFilter(object):
    """
    Bloom filter implementation for integer keys.
    """
    def __init__(self, numhashes, numbits):
        self.numhashes = numhashes
        self.numbits = numbits
        self.hash_functions = []
        for k in range(numhashes):
            # draw random coefficients for the k-th hash function
            coeff = random.randrange(1, BIGPRIME)
            offset = random.randrange(BIGPRIME)
            # bind coeff/offset as default arguments: a plain closure over
            # the loop variables would make every lambda see only the final
            # values after the loop ends
            hk = (lambda x, a=coeff, b=offset:
                  ((a * x + b) % BIGPRIME) % numbits)
            self.hash_functions.append(hk)
        self.memory = [0 for _ in range(numbits)]

    def insert(self, x):
        bits = [hf(x) for hf in self.hash_functions]
        for bit in bits:
            self.memory[bit] = 1

    def query(self, x):
        bits = [hf(x) for hf in self.hash_functions]
        return all(self.memory[b] == 1 for b in bits)
```
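As a quick sanity check of the no-false-negatives guarantee, here is a self-contained sketch. It uses double hashing with Python’s built-in `hash` (a common variant, not the prime-based family above) so that it works on strings, and the class and parameter names are hypothetical:

```python
class TinyBloom:
    """Minimal Bloom filter using double hashing: h1 + i*h2 simulates
    k hash functions from two salted base hashes."""

    def __init__(self, num_hashes, num_bits):
        self.num_hashes = num_hashes
        self.num_bits = num_bits
        self.memory = [0] * num_bits

    def _indices(self, x):
        # two salted hashes combined as h1 + i*h2, reduced mod num_bits
        h1, h2 = hash(("a", x)), hash(("b", x))
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def insert(self, x):
        for i in self._indices(x):
            self.memory[i] = 1

    def query(self, x):
        return all(self.memory[i] for i in self._indices(x))

bf = TinyBloom(num_hashes=5, num_bits=1024)
for word in ["ant", "bee", "cat"]:
    bf.insert(word)

# every inserted item is reported present: no false negatives
inserted_ok = all(bf.query(w) for w in ["ant", "bee", "cat"])
```

Queries for items that were never inserted can still return `true`, but with 1,024 bits and only three insertions the false positive rate is vanishingly small.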