Will.Whim

A weblog by Will Fitzgerald

Category Archives: Science and Tech

On “Retiring a Great Interview Problem”

Daniel Tunkelang wrote an interesting blog post, "Retiring a Great Interview Problem", about an interview problem that he has, in the past, posed to interviewees, but which he has now retired, because someone posted the problem, and a solution, to an interview-preparation website. The problem is the segmentation problem (which he calls the word-break problem): given a string, segment it into reasonable words, if possible (for example, breaking "applepie" into "apple pie"). He describes the various gradations of good answers to this problem, and I guess I would say that his reasoning is sound: candidates who give better answers would probably make better programmers.

However, if I were asked this question in an interview, I am fairly confident that I would freeze up and not give a good answer. Partially, this is because of the stupidly artificial conversational model of the whiteboard exercise–what, program without the resources of the internet, while someone watches me and makes me elucidate my mental states? But the other reason I'd have a problem is that I would recognize this problem as one I have already coded and placed on GitHub (https://github.com/willf/microsoft_ngram), which is in turn based on Peter Norvig's section on segmentation in the book Beautiful Data (for which he posted Python code). That code does segmentation even better than Tunkelang's formulation of the problem requires, in that it provides the most probable segmentation (Tunkelang's formulation asked for any reasonable segmentation). I also know (or at least believe) that Norvig's code is based on similar code he wrote for his Artificial Intelligence textbook (here's the code) and that using memoization as a lazy person's method of doing dynamic programming is an old and efficient trick of Norvig's. This knowledge would basically drain me of any interest in reinventing this particular wheel.
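
For the curious, here is a minimal sketch, in Go, of that memoization-as-lazy-dynamic-programming trick, applied to Tunkelang's simpler "any reasonable segmentation" formulation. This is my own illustration, not the code in willf/microsoft_ngram or Norvig's; the two-word dictionary is a stand-in for a real word list or N-gram model.

package main

import "fmt"

// words is a stand-in dictionary; Norvig's version scores candidate
// segmentations with n-gram probabilities instead of a yes/no lookup.
var words = map[string]bool{"apple": true, "pie": true}

// memo caches results for suffixes already tried: memoization as a
// lazy person's dynamic programming.
var memo = map[string][]string{}

// segment returns some segmentation of s into dictionary words,
// or nil if none exists.
func segment(s string) []string {
    if s == "" {
        return []string{}
    }
    if cached, ok := memo[s]; ok {
        return cached
    }
    for i := 1; i <= len(s); i++ {
        prefix := s[:i]
        if words[prefix] {
            if rest := segment(s[i:]); rest != nil {
                result := append([]string{prefix}, rest...)
                memo[s] = result
                return result
            }
        }
    }
    memo[s] = nil
    return nil
}

func main() {
    fmt.Println(segment("applepie")) // [apple pie]
    fmt.Println(segment("applepi"))  // [] (nil: no segmentation found)
}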

In other words, I would fail this interview, although I have publicly demonstrated my ability to implement a more generally useful version. And Tunkelang, a vastly better programmer than I am, still got it wrong on the first try (he had an "off-by-one" error, which he later corrected). Granted, interviewers (or interviewers in the Google style) are generally more concerned about precision (avoiding hiring bad programmers) than recall (avoiding missing out on good programmers). But I continue to think that the whiteboard method systematically discriminates against certain classes of programmers. I suspect, in particular, that it discriminates against older candidates, which, prima facie, would be illegal under US law.

[Update, added later the same day]. Take the “I suspect” literally in the last paragraph. I’m still trying to articulate what I think about interviewing techniques of this type and discrimination, particularly age discrimination (the only kind I’m potentially subject to). And take prima facie literally, too–I doubt that, in a court of law, age discrimination could be proved based solely on these interviews.

O(n) beats O(lg n)

Here's a problem I faced at work: determine the effective top-level domain of a URI, using the data from the Public Suffix List, but without using any data structures read into memory (i.e., a dynamically created hash table or even a static array). For example, the effective top-level domain of "http://blog.entish.org" is "entish.org", because that is a name one could register.

So, there are about 4000 "positive" rules and about 100 "negative" rules. One way to approach this is to convert a host name into a series of keys (blog.entish.org, entish.org, org) and find the longest matching positive and negative rules. It's simple enough to write a fast hash function (e.g., FNV-1a) and to write code that writes a function which does one of the following two things:

(1) switch (fnva(key)) {case h1: case h2: …. case hn: return true; default: return false}

(2) var h = fnva(key); if (h < hmid) { … } else { … }

where (1) is a giant switch statement, and (2) is a function that does a binary search via nested comparisons of the hash values. (1) is O(n), of course (because you might have to look through every value), and (2) is O(lg n), as any good binary search is.
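
To make the two shapes concrete, here is a hand-written miniature in Go (my own sketch; the original code wasn't necessarily Go, and the hash constants are made up). The real generator would emit thousands of cases and comparisons derived from the Public Suffix List.

package main

import (
    "fmt"
    "hash/fnv"
)

// fnva hashes a key with 64-bit FNV-1a.
func fnva(key string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(key))
    return h.Sum64()
}

// isPositiveSwitch has the shape of (1): one case per known hash value.
func isPositiveSwitch(key string) bool {
    switch fnva(key) {
    case 0x1111111111111111, 0x5555555555555555, 0x9999999999999999:
        return true
    default:
        return false
    }
}

// isPositiveSearch has the shape of (2): the generator unrolls a binary
// search into nested comparisons, so no array is read into memory.
func isPositiveSearch(key string) bool {
    h := fnva(key)
    if h < 0x5555555555555555 {
        return h == 0x1111111111111111
    }
    if h > 0x5555555555555555 {
        return h == 0x9999999999999999
    }
    return true
}

func main() {
    fmt.Println(isPositiveSwitch("entish.org"), isPositiveSearch("entish.org"))
}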

I was surprised to see that, on my test set of about 9000 hosts, the switch version was 4 times faster than the binary search version.

It’s fun to write code that writes code. It’s good to benchmark.

 

Dynamic typing and testing

As kind of a 10-finger exercise, I decided to write a small inverted index library in Ruby today, since I had the day off. This is just a few thoughts about what it was like to write tests for this code, which can be found at https://github.com/willf/inverted_index_simple–not that I think you’ll write the next major search engine with this bit of code…

What struck me, as I was writing unit tests, was how many of the tests failed at first, either because of undefined variables or because of dynamic typing problems. For example, in the method query_and (which returns the set intersection of all documents containing the query terms), I originally wrote:

def query_and tokens
return Set.new if tokens.size == 0
sets = tokens.map{|token| query_token token}
sets[1..-1].inject(sets[0]){|reduction, x| redution & x}
end

The final line of the method body is only exercised when the query has multiple terms, so it was important to write such a test. But what surprised me is that the error was not in the logic, but in the literal typing–I misspelled "reduction" the second time.

Similarly, I think I wrote a version of the function that had

sets = tokens.map{|token| query_token tokens}

where I typed ‘tokens’ instead of ‘token’ by mistake.

With a statically typed language, neither of these errors is likely to make it past the compiler. The Go language, for example, won't even let you declare a variable unless you use it–something that drives me mad, but is probably the right thing to do. (It would have caught the second error, even if 'token' and 'tokens' happened to be of the same type.)
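
For what it's worth, here is a minimal Go sketch of the same situation (queryToken and queryAnd are just my rough translation of the Ruby above), showing where the compiler steps in:

package main

import "fmt"

// queryToken is a stand-in for the real per-term lookup.
func queryToken(token string) []string {
    return []string{token}
}

// queryAnd mirrors the Ruby method above, minus the set intersection.
func queryAnd(tokens []string) [][]string {
    sets := make([][]string, 0, len(tokens))
    for _, token := range tokens {
        // Misspelling the variable here (say, queryToken(tokn)) fails to
        // compile with an "undefined" error; passing tokens instead of
        // token would leave token unused, which the compiler also rejects.
        sets = append(sets, queryToken(token))
    }
    return sets
}

func main() {
    fmt.Println(queryAnd([]string{"dynamic", "typing"}))
}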

The great loss with strict static typing comes when it is difficult to write 'generic' code (stacks and queues of anything, for example); most modern languages provide some sort of facility for this (and we're still waiting for Go to provide it). The other great loss is when it really makes sense to have union types: functions that work on lists of ints, in addition to ints.

Bloom filters in Go

I’ve written my second Go package to implement Bloom filters. It can be found at:

https://github.com/willf/bloom

It requires my bitset implementation:

https://github.com/willf/bitset

Enjoy!
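
For readers who haven't met the data structure, here is a tiny self-contained illustration of what a Bloom filter does. This is my sketch, not the package's API; the real package builds on the bitset implementation above.

package main

import (
    "fmt"
    "hash/fnv"
)

// bloomSketch is a toy Bloom filter: k hash-derived bit positions are set
// on Add; Test can report false positives but never false negatives.
type bloomSketch struct {
    bits []uint64 // the underlying bit set
    m    uint64   // number of bits
    k    uint64   // number of hash functions
}

func newBloomSketch(m, k uint64) *bloomSketch {
    return &bloomSketch{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// locations derives k bit positions by double hashing two FNV variants.
func (b *bloomSketch) locations(data []byte) []uint64 {
    h1 := fnv.New64a()
    h1.Write(data)
    h2 := fnv.New64()
    h2.Write(data)
    a, c := h1.Sum64(), h2.Sum64()
    locs := make([]uint64, b.k)
    for i := uint64(0); i < b.k; i++ {
        locs[i] = (a + i*c) % b.m
    }
    return locs
}

// Add sets the k bits for data.
func (b *bloomSketch) Add(data []byte) {
    for _, loc := range b.locations(data) {
        b.bits[loc/64] |= 1 << (loc % 64)
    }
}

// Test reports whether data might have been added.
func (b *bloomSketch) Test(data []byte) bool {
    for _, loc := range b.locations(data) {
        if b.bits[loc/64]&(1<<(loc%64)) == 0 {
            return false
        }
    }
    return true
}

func main() {
    f := newBloomSketch(1024, 3)
    f.Add([]byte("entish.org"))
    // true, and (almost certainly) false:
    fmt.Println(f.Test([]byte("entish.org")), f.Test([]byte("example.com")))
}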

Baby steps with Go

I’ve started teaching myself the Go programming language. I’m enjoying most aspects of it. It feels like a really modern C–that is, low level enough to be fast, but with lots of the bells and whistles one comes to expect in a modern programming language. It has lots of data types, an interesting alternative to object-oriented programming called interfaces, a reasonably-sized library, support for concurrency baked into the language, etc. The Go home page says it well:

The Go programming language is an open source project to make programmers more productive. Go is expressive, concise, clean, and efficient. Its concurrency mechanisms make it easy to write programs that get the most out of multicore and networked machines, while its novel type system enables flexible and modular program construction. Go compiles quickly to machine code yet has the convenience of garbage collection and the power of run-time reflection. It’s a fast, statically typed, compiled language that feels like a dynamically typed, interpreted language.

It has a kind of persnickety feel to it that I kind of like. It practically insists on formatting code in a certain style–a style that I used to use, but from which I weaned myself away because I became convinced that a different way was better. In exchange, you can dispense with some programming punctuation, and the language ends up having a bit of the flavor of Python or Ruby.

I’ve been working a bit on a library for bit sets–there isn’t one in the core library–and have found the community quite helpful. This, of course, is a good sign for a language. The code I’ve been working on, with lots of kibitzing from the veteran Go language people, can be found at https://github.com/willf/bitset. One thing I like is that, in addition to the core language information, the Go website includes an essay on writing Go effectively, and how to write code that might be considered for inclusion in the standard libraries. It even has a standard unit testing library.

On “culturomics”

[Note: I got bogged down on reading and reporting on the culturomics paper a little too closely, and this is a reboot.]

Peter Norvig, one of the co-authors of the culturomics paper and the director of research at Google, was also a co-author of another significant article with the suggestive title "The Unreasonable Effectiveness of Data". The invention of the term "culturomics" suggests a scientific programme for attacking questions of culture that stresses statistical models based on large amounts of data, a programme that has been very successful both academically and commercially for linguistics and artificial intelligence, to say nothing of the informatics approaches of genomics and proteomics upon which the term culturomics is based. The slogan "every time I fire a linguist, the performance of my system goes up" (a slight misattribution of something Frederick Jelinek, a pioneering computational linguist, said) is another restatement of this approach. Among the bets and assumptions made by this approach are:

(1) The goal of science is better-engineered systems, which have practical, commercializable outcomes.
(2) Models must be empirically testable, with precise, independent, repeatable evaluation metrics and procedures.
(3) Simple quantitative models based on large amounts of data will perform better, in the senses of (1) and (2), than complex qualitative models based on small amounts of data.

Among the successes attributable to big data programmes are effective search engines, speech interfaces, and automated translation. Google's rigorous approach to big data affects nearly every aspect of its business: core search, for starters, but even more important are the big data approaches behind Google's ability to make money on its search and to decrease its operating costs.

Studies of culture are currently, for the most part, either done using complex, qualitative models or based on relatively small amounts of data. The Google N-gram data is, perhaps, an opening salvo in an attack on qualitative/small-data approaches to the study of culture, to be replaced with quantitative/big-data approaches. The quantitative/big-data programme has been "unreasonably effective" in overturning how researchers in linguistics, artificial intelligence, and cognitive science approach their fields and get their projects funded. The bet, here, is that the same will occur in other studies of culture.

There are many problems with the example experiments described in the culturomics paper. The experiments are often not described in enough detail to be replicable. Proof is often by example rather than by large-scale evaluation metrics. The proofs often resemble just-so stories (explanations without adequate controls) or unsurprising results (for example, that the Nazis suppressed and censored writers with whom they disagreed is borne out by the data). The scope of the experiments is often very limited (for example, the section on the "evolution of grammar" laughably describes only the changes occurring in a small subset of strong and weak verbs).

Because this is an overview paper, it may be that some of the important details are missing for reasons of space. Some of these things are addressed in the supporting materials, but by no means all. For example, something as basic as the method used for tokenization–how the successive strings of characters in the digital copies of the books are broken into countable units–is not defined well enough to be repeatable. How, for example, does the system tokenize "T.S. Eliot"? Is this tokenized the same way as "T. S. Eliot" or "TS Eliot"? Based on the sample N-gram viewer, it appears that, to find mentions of T.S. Eliot, the search string "TS Eliot" (similarly WH Auden, CS Lewis) must be used. The supplemental material gives many details, but in the end refers to proprietary tokenization routines used at Google.
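
As a toy illustration of why this matters (my own example, not Google's tokenizer), two equally reasonable tokenization choices already disagree on the name:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    text := "T.S. Eliot"

    // Whitespace-only tokenization keeps the periods:
    fmt.Printf("%q\n", strings.Fields(text)) // ["T.S." "Eliot"]

    // Splitting on anything that is not a letter or digit drops them:
    nonWord := regexp.MustCompile(`[^\p{L}\p{N}]+`)
    fmt.Printf("%q\n", nonWord.Split(text, -1)) // ["T" "S" "Eliot"]
}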

And yet, there are some useful ideas here. Because the N-gram data is time-stamped, looking at some kinds of time-varying changes is possible. The idea of measuring the half-life of changes is a powerful one, and the varying amounts of time it takes to fall to half-life are interesting in their analysis of "fame" (in reality, an analysis of name mentions). Seeing how some verbs are becoming strong in the face of a general tendency towards regularization is interesting. And the lexicographic estimates seem very valuable (if not very "culturomic" to me).

A danger in the approach of culturomics is that, by focusing on what can be processed on a large scale, and measured with precision, interesting scientific questions will be left unexplored, perhaps especially when those questions are not of obvious economic benefit. Engineers build better engines, not necessarily better theories.

Having said all of this, I remain optimistic about the release of the Google N-gram data, even as I resist the term and approach suggested by culturomics. Yes, Google needs to provide better descriptions of the data, and continue to clean up the data and metadata (as well as describe and report on what “cleaned up” means; some of this is, in fact, described in the supplementary data [4]), and to be much more transparent about access to the underlying documents, when permissible by law. It would be very useful for Google to provide algorithmic access to the data rather than just make the (very large) data sets available. But these data can be mined for interesting patterns and trends, and it will be interesting to see what researchers do with them. Let’s just call it time-stamped N-gram data, though, and eschew the term culturomics.

More thoughts on “culturomics” (part one)

I've now had a chance to read the Science Express article describing "culturomics," that is, the "Quantitative Analysis of Culture Using Millions of Digitized Books," recently published by researchers at Google and several high-profile academic and commercial institutions. The authors claim to have created a corpus of approximately four percent of all books ever published (over five million books), supplying time-stamped N-gram data based on 500 billion words in seven languages, primarily English, from the 1500s until roughly the present, with most of the text from recent decades. The "-omics" of culturomics is by analogy to genomics and proteomics: that is, high-speed analysis of large amounts of data. As someone who has done some work with N-grams on the web (with data provided by Bing, my employer, and earlier by Google), I find this work of great interest. Digitized books are, of course, a different animal from the web (and from queries made to web search engines), so the work is of interest for that reason as well. The addition of time-stamps makes some kinds of time-based analyses possible, too; this is a kind of data we have not had before (Bing's N-gram data, assuming older versions continue to be provided, might eventually allow this).

There have been a number of criticisms of the culturomics programme, many of them well founded. It is worthwhile to describe a few of these. First, there are problems with the meta-data associated with the books that Google has digitized. As a result, it is unclear how accurate the time-stamps are. This is not addressed in the article, although it has been a well-known problem. Certainly, they could have sampled the corpus and given estimates of the accuracy of the time-stamp data. Related to this is the lack of a careful description of the genres and dialects represented (partly a failure in the meta-data, again). Second, there are systematic errors in the text scans; this is especially true for older books, which often used typographic conventions and fonts not common today (and, one assumes, errors made due to language models based on modern texts rather than pre-modern ones). Consider, for example, the "long s" previously used in many contexts in English; this is often read as an "f" instead of a long s. Incidentally, according to the tokenization goals of the project, the right thing to do would be to record the long s, not regularize it to the modern, standard "s." Otherwise, it becomes more difficult to track the decline of the long s, except by guessing at OCR errors. The whole notion of what it means to constitute a countable thing in these corpora–that is, what the tokenization rules for generating the 1-grams are–is given short shrift in this article, although it is a fairly important issue.

Third, the presentation of the Google Labs Ngram Viewer has made it overly easy to tell "just so" stories. For example, consider the contrast between "Jesus" and "Christ" from 1700 to 2008. It's easy to tell this just-so story: people talked about Jesus or Christ before the revolution, but then not so much in the run-up to the War and its aftermath. But, with the Great Awakening, a large number of books were published, with "Christ" much more common than "Jesus." Over time, due to the increasing secularization of the United States, people wrote about Jesus or Christ less and less. The individualistic Evangelical explosion of the last thirty years has started to reverse the trend, with "Jesus" (a more personal name) becoming more popular than "Christ" (a less personal one). Natalia Cecire describes this, better and more succinctly, as "Words for Snowism". Cecire also views the Ngram Viewer as a guilty pleasure; as epistemic candy.

The Science Express article describes several experiments made with the Google Books data, and it is worth spending time examining them, because they give good hints as to what the data are likely to be good for and where the culturomics programme is likely to head.

The first set of experiments describes attempts at estimating the size of the English lexicon and its growth over time. The authors describe (qualitatively, for the most part) their sampling technique for determining whether a 1-gram token was an English word form: it had to appear more than once per billion words; a sample of these common potential word forms was then manually annotated as to whether they were truly English word forms or something else (like a number, a misspelling, or a foreign word). Sample sizes, procedure, inter-rater reliability, etc., were not reported–an important flaw, in my opinion. They show, for example, that the English vocabulary has increased by over 70% in the past 50 years, and contrast this with the size of printed dictionaries. This first set of experiments will be of great interest to lexicographers; indeed, it is just this gap that commercial enterprises like Wordnik are trying to fill. It is hard to see how this says much about "culture," except as fodder for lexicographical historiography or lexicographical evangelism: there are many words that are not in "the dictionary"–get used to it.

The second set of experiments purports to describe "the evolution of grammar" but, not surprisingly, only attacks a very small subset of lexical grammar: the change in use of strong and weak verbs in English. Given the time-stamped, word-form-based data, it is relatively simple to check the movement from "burnt" to "burned," and to compare and contrast this with other strong and weak verbs. One wishes for more description of how they reach their conclusions; for example, "high-frequency irregulars, which are more readily remembered, hold their ground better." This reasonable statement is "proved" by a single contrast between finded/found and dwelled/dwelt. Some of the conclusions concern dialect: there are differences between American English and British English. Unfortunately, the actual scope of the differences is "proved" by a single contrast between the use of burned/burnt in American and British English. Again, knowing the accuracy of dialect assignment, the list of verbs used, etc., would be very useful.

{to be continued}

Google’s Superbowl ad

I really enjoyed Google’s Superbowl ad, in which a love story is told as a series of search queries:

  • study abroad paris france
  • cafes near the louve (sic)
  • translate tu es très mignon
  • impress a french girl
  • chocolate shops paris france
  • what are truffles
  • who is truffaut
  • long distance relationship advice
  • jobs in paris
  • AA120
  • churches in paris
  • how to assemble a crib

I was curious how Google would do "in real life" on these queries, as well as how Bing (my employer) does. Not surprisingly, Google does well on all of them. I am pleased to report that Bing does well, too, although I have to admit that the specific results for the translation query and for "AA120" (a flight search) are not quite as succinctly presented (yet!) as Google's. But all of the general and "local" queries (even the one with "Louvre" misspelled) are every bit as good as Google's, and sometimes better presented. "Churches in paris," I think, is nicer–showing images of churches before the "local" results.

At this point, I’d claim that Bing really is as good or better than Google for general search–not based on this ad, of course, but from my daily use of both.

I also notice that the link clicked in the video for “how to impress a french girl” is now welcoming people who saw the Google ad.
