September 3, 2011
Posted by on
To implement a bloom filter, you need a bunch of hash functions. In naive implementations (and I’ve seen plenty), programmers pick out, say, five cryptographic hash functions. One problem with this is that the hash functions for bloom filters have different requirements than hash functions for cryptography–the latter tend to be more than is required for the former. What you want for bloom filters is something that’s very, very fast, while maintaining that basic desiderata for bloom filter hashing, uniform spread.
There’s a good paper that reminds us that you can easily simulate n hash functions by having just two hash functions around. This can be as simple as this function to create the ith hash of a key, given the results a and b of hashing a key with these two functions:
hash(i) = (a + b * i ) % m
where m is the maximum value of the hash (for example, the number of buckets in a bloom filter).
But here’s a good trick not really worth a paper–but it’s still a good trick. Typically, it’s totally reasonable to limit the size to under the maximum size of an unsigned 32-bit number. These days, at least, it’s probably cheaper to calculate a base hash function on unsigned 64-bit numbers. So, you can take the upper half and the lower half of the 64-bit hashed value and return them as two 32 bit numbers.
Voila! Double hashing with one hash function. And using FNV means you have a very cheap hash function to start with, so really this can be very, very fast.
I implemented this in my bloom filter code for go.
May 21, 2011
Posted by on
I’ve written my second Go package to implement Bloom filters. It can be found at:
It requires my bitset implementation:
May 11, 2011
Posted by on
I’ve started teaching myself the Go programming language. I’m enjoying most aspects of it. It feels like a really modern C–that is, low level enough to be fast, but with lots of the bells and whistles one comes to expect in a modern programming language. It has lots of data types, an interesting alternative to object-oriented programming called interfaces, a reasonably-sized library, support for concurrency baked into the language, etc. The Go home page says it well:
The Go programming language is an open source project to make programmers more productive. Go is expressive, concise, clean, and efficient. Its concurrency mechanisms make it easy to write programs that get the most out of multicore and networked machines, while its novel type system enables flexible and modular program construction. Go compiles quickly to machine code yet has the convenience of garbage collection and the power of run-time reflection. It’s a fast, statically typed, compiled language that feels like a dynamically typed, interpreted language.
It has a kind of persnickety feel to it that I kind of like. It practically insists on formatting code in certain style–a style that I used to use, but from which I have weaned myself away because I became convinced that a different way was better. In doing so, you can dispense with some programming punctuation, and it ends up having a bit of a flavor of Python or Ruby.
I’ve been working a bit on a library for bit sets–there isn’t one in the core library–and have found the community quite helpful. This, of course, is a good sign for a language. The code I’ve been working on, with lots of kibitzing from the veteran Go language people, can be found at https://github.com/willf/bitset. One thing I like is that, in addition to the core language information, the Go website includes an essay on writing Go effectively, and how to write code that might be considered for inclusion in the standard libraries. It even has a standard unit testing library.