September 3, 2011
Posted by on
To implement a bloom filter, you need a bunch of hash functions. In naive implementations (and I’ve seen plenty), programmers pick out, say, five cryptographic hash functions. One problem with this is that the hash functions for bloom filters have different requirements than hash functions for cryptography–the latter tend to be more than is required for the former. What you want for bloom filters is something that’s very, very fast, while maintaining that basic desiderata for bloom filter hashing, uniform spread.
There’s a good paper that reminds us that you can easily simulate n hash functions by having just two hash functions around. This can be as simple as this function to create the ith hash of a key, given the results a and b of hashing a key with these two functions:
hash(i) = (a + b * i ) % m
where m is the maximum value of the hash (for example, the number of buckets in a bloom filter).
But here’s a good trick not really worth a paper–but it’s still a good trick. Typically, it’s totally reasonable to limit the size to under the maximum size of an unsigned 32-bit number. These days, at least, it’s probably cheaper to calculate a base hash function on unsigned 64-bit numbers. So, you can take the upper half and the lower half of the 64-bit hashed value and return them as two 32 bit numbers.
Voila! Double hashing with one hash function. And using FNV means you have a very cheap hash function to start with, so really this can be very, very fast.
I implemented this in my bloom filter code for go.