A weblog by Will Fitzgerald

O(n) beats O(lg n)

Here’s a problem I faced at work: determine the effective top level domain of a URI, using the data from the Public Suffix List, but without using any data structures read into memory (i.e., a dynamically created hash table or even a static array). For example, the effective top level domain of “http://blog.entish.org” is “entish.org”, because it is a name one could register.
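To make the lookup concrete, here is a minimal sketch of the first step: turning a host name into the series of candidate suffixes to check against the rules. (The original code was C#; this is an illustrative C version, and the function name is mine.)

```c
#include <stddef.h>
#include <string.h>

/* Collect the candidate suffixes of a host into out[], longest first:
   "blog.entish.org" -> "entish.org" -> "org". Returns the count.
   Illustrative C sketch; the post's actual code was C#. */
static size_t candidates(const char *host, const char *out[], size_t max) {
    size_t n = 0;
    for (const char *p = host; p != NULL && n < max; ) {
        out[n++] = p;
        p = strchr(p, '.');   /* find the next label boundary */
        if (p != NULL) p++;   /* skip past the dot itself */
    }
    return n;
}
```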

So, there are about 4000 “positive” rules and about 100 “negative” rules. One way to approach this is to convert a host name into a series of keys (blog.entish.org, entish.org, org) and find the longest matching positive and negative rules. It’s simple enough to write a fast hash function (e.g., FNV-1a), and then write code that writes a function that does one of the following two things:

(1) switch (fnva(key)) { case h1: case h2: … case hn: return true; default: return false; }

(2) var h = fnva(key); if (h < hmid) { … }

where (1) is a giant switch statement, and (2) is a function that does a binary search via value comparisons. (1) is O(n), of course (because you might have to look through every value); (2) is O(lg n), as any good binary search is.
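Sketched in C (again, the original was C#, and the hash constants and case values below are made up for illustration — the real generator emits roughly 4000 cases), the two generated shapes look something like this, with FNV-1a as the hash:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 32-bit, using the standard offset basis and prime. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;         /* FNV offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;               /* FNV prime */
    }
    return h;
}

/* Shape (1): one case label per known rule hash. O(n) in principle,
   but the compiler can turn the constant cases into very cheap code. */
static bool match_switch(uint32_t h) {
    switch (h) {
        case 0x12d4b7e2u:             /* illustrative values only */
        case 0x7c08fa33u:
        case 0x9f3a1c01u:
            return true;
        default:
            return false;
    }
}

/* Shape (2): binary search over the same hashes, pre-sorted. O(lg n). */
static bool match_bsearch(uint32_t h) {
    static const uint32_t sorted[] = { 0x12d4b7e2u, 0x7c08fa33u, 0x9f3a1c01u };
    size_t lo = 0, hi = sizeof sorted / sizeof sorted[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == h) return true;
        if (sorted[mid] < h) lo = mid + 1;
        else hi = mid;
    }
    return false;
}
```

A lookup is then `match_switch(fnv1a(key))` or `match_bsearch(fnv1a(key))` for each candidate key.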

I was surprised to see that, on my test set of about 9000 hosts, the switch version was 4 times faster than the binary search version.

It’s fun to write code that writes code. It’s good to benchmark.



6 responses to “O(n) beats O(lg n)”

  1. Michael H. July 7, 2011 at 7:41 pm

    Nice! I’ve always(*) heard that there were cases like this, but it’s neat that you’ve encountered one in real life. Good to remember.

    (*) For some values of always which include “after having read the C/L/R/S algorithms book, and it might’ve been noted by Jon Bentley or Kernighan & Pike too”.

  2. Will Fitzgerald July 7, 2011 at 7:51 pm

    Actually, we’re seeing more and more things like this. As Dave Fayram wrote, “@willf Who was it who said something like, ‘Never underestimate the power of L1 cache’?” https://twitter.com/#!/KirinDave/status/89046920391180288

  3. tierliebierlieb July 8, 2011 at 8:45 am

    Hmm. Isn’t this just an example of why complexity classes / O-notations are too abstract in some cases? I think every CS student brings up that issue in the first class that discusses them:

    lg n + G, where G is Graham’s number, is still O(lg n), and epsilon * n is still O(n), even for a freakishly small epsilon (which admittedly isn’t how my math teacher would have phrased it). In practical problem spaces, an algorithm with a runtime like the latter will outperform the former.

    This additional cost becomes harder to see the more we move away from theoretical constructs and into practical implementations of these, which usually are hidden behind layers of libraries, frameworks, operating systems and specialized hardware.

    • Will Fitzgerald July 8, 2011 at 9:57 am

      I don’t think so, although your point is well taken in general. In this case, n is the same, but the switch statement, with n+1 branches, is being compiled into something the hardware is much faster at executing than the lg n comparisons.

  4. Kevin Clark July 9, 2011 at 8:53 pm

    Neat! What language? What’s the cost of < vs ==? Is the latter referential equality?

  5. Will Fitzgerald July 9, 2011 at 9:22 pm

    The language was C#. Data was all int (signed 32 bit integers), so the cost of < and == were pretty darn cheap.
