Will.Whim

A weblog by Will Fitzgerald

Monthly Archives: July 2011

On the meta-media

In the wake of the horrible killings in Norway, I saw several tweets and Facebook posts that were something like “Brown: terrorist. White: extremist. Got it.” or “It speaks volumes that the immediate assumption was that this was the work of Muslim terrorists.” In other words, people began quite quickly to comment on the media reaction to the story, criticizing (implicitly or explicitly) for being anti-Muslim. This seemed odd to me, because by the time I started to see stories about the killings, the stories themselves were tentative about what the motivations of the killer or killers were.

After some discussion with one of the posters of one of these comments, what became clear is that some tabloids and pundits were quick to label the motivations. The Sun (on July 22) called it an “Al-Queda Massacre” and Thomas Jocelyn of The Standard wrote, somewhat more circumspectly, “We don’t know if al Qaeda was directly responsible for today’s events, but in all likelihood the attack was launched by part of the jihadist hydra.”

The legitimate news organizations were bit cautious about assigning motivations; at least, the ones I was able to check. The New York Times, after Anders Behring Breivik was arrested, quoted police officials as saying he was a “right-wing extremist,” although later they pointed out the similarities of the bombing and killings to prior al-Qaeda attacks, including declared threats against Norway. The Associated Press, prior to the arrest, quoted acting national Police Chief Sveinung Sponheim as refusing to speculate as to motivation, but the article also noted the al-Qaeda threats.

Once more was known about Breivik, discussion has focused on what to call him; a “terrorist,” an “extremist,” a “Christian fundamentalist,” a “right-wing terrorist,” etc. And, to some extent, this is a useful discussion to have. But any such discussion should shrink, I think, in importance to the facts on the ground: the grieving families, justice, and prevention of such attacks in the future. Part of our media-saturated culture is to think that stories on the media reaction to the story are as or more important than the story itself. No one should be surprised at The Sun being a tabloid, or even a right-leaning pundit making a too quick (but, seriously, in the current case, not uneducated) guess that Islamist extremists were behind the attack. Why is this worth comment? We should, I think, be slower to speak; we need fewer pundits, and we need even fewer meta-pundits. Ironically, of course, this blog post is meta-meta punditry, but such is the rabbit hole we enter.

O(n) beats O(lg n)

Here’s a problem I faced at work: determine the effective top level domain of a URI, using the data from the Public Suffix List, but not using any data structures read into memory (ie, a dynamically created hash table or even a static array). For example, the effective top level domain of “http://blog.entish.org” is “entish.org”, because it is a name one could register.

So, there are about 4000 “positive” rules and about 100 “negative” rules. One way to approach this is to convert a host name into a series of keys (blog.entish.org, entish.org, org) and find the longest positive and negative matching rules. It’s simple enough to write a fast hash function (e.g., FNVa), and write code to write a function that does one of the following two things:

(1) switch (fnva(key)) {case h1: case h2: …. case hn: return true; default: return false}

(2) var h = fnva(key); if (n < hmid) if …}

where (1) is a giant switch statement, and (2) is a function which does a binary search of comparisons of values. (1) is O(n), of course (because you might have to look through every value. (2) is O(lg n) as any good binary search is.

I was surprised to see that, on my test set of about 9000 hosts, the switch version was 4 times faster than the binary search version.

It’s fun to write code that writes code. It’s good to benchmark.

 

Dynamic typing and testing

As kind of a 10-finger exercise, I decided to write a small inverted index library in Ruby today, since I had the day off. This is just a few thoughts about what it was like to write tests for this code, which can be found at https://github.com/willf/inverted_index_simple–not that I think you’ll write the next major search engine with this bit of code…

What struck me was that as I was writing unit tests, how many of the tests failed at first either because of undefined variables or dynamic typing problems. For example, in the method query_and (which returns the set intersection of all documents containing query terms), I originally wrote:

def query_and tokens
return Set.new if tokens.size == 0
sets = tokens.map{|token| query_token token}
sets[1..-1].inject(sets[0]){|reduction, x| redution & x}
end

The third line is only tested if there are at least three query terms, so it was important to write such a test. But what surprised me is that the error was not in the logic, but the literal typing–I misspelled “reduction” the second time.

Similarly, I think I wrote a version of the function that had

sets = tokens.map{|token| query_token tokens}

where I typed ‘tokens’ instead of ‘token’ by mistake.

With a statically typed language, neither of these errors are likely to make it past the static compiler. The Go language, for example, won’t even let you declare variables unless you use them–something which drives me mad, but is probably the right thing to do. (It would have caught the second error, even if ‘token’ and ‘tokens’ happened to be of the same type.)

The great loss of strict static typing comes when it is difficult to write ‘generic’ code (stacks, and queues, for example, of anything); most modern languages provide some sort of facility for this (and we’re still waiting for Go to provide this). The other great loss is when it really makes sense to have union types: functions that work on lists of ints, in addition to ints.