A weblog by Will Fitzgerald

Category Archives: Search technology

Dynamic typing and testing

As kind of a 10-finger exercise, I decided to write a small inverted index library in Ruby today, since I had the day off. This is just a few thoughts about what it was like to write tests for this code, which can be found at https://github.com/willf/inverted_index_simple–not that I think you’ll write the next major search engine with this bit of code…

What struck me was that as I was writing unit tests, how many of the tests failed at first either because of undefined variables or dynamic typing problems. For example, in the method query_and (which returns the set intersection of all documents containing query terms), I originally wrote:

def query_and tokens
return Set.new if tokens.size == 0
sets = tokens.map{|token| query_token token}
sets[1..-1].inject(sets[0]){|reduction, x| redution & x}

The third line is only tested if there are at least three query terms, so it was important to write such a test. But what surprised me is that the error was not in the logic, but the literal typing–I misspelled “reduction” the second time.

Similarly, I think I wrote a version of the function that had

sets = tokens.map{|token| query_token tokens}

where I typed ‘tokens’ instead of ‘token’ by mistake.

With a statically typed language, neither of these errors are likely to make it past the static compiler. The Go language, for example, won’t even let you declare variables unless you use them–something which drives me mad, but is probably the right thing to do. (It would have caught the second error, even if ‘token’ and ‘tokens’ happened to be of the same type.)

The great loss of strict static typing comes when it is difficult to write ‘generic’ code (stacks, and queues, for example, of anything); most modern languages provide some sort of facility for this (and we’re still waiting for Go to provide this). The other great loss is when it really makes sense to have union types: functions that work on lists of ints, in addition to ints.

What is missing? Watson, AI and search technology

All geekdom, and much that is outside our realm, is abuzz with news and discussion of the contest by IBM’s Jeopardy-playing computer, Watson, vs Ken Jennings and Brad Rutter. In last night’s game (the second half of a game started on Monday, Watson won decisively, answering many, many more questions than Jennings and Rutter combined. At the end of the first half, Rutter and Watson were tied. Lots of famous (and formerly famous) artificial intelligence researchers have weighed in on what this means for AI. I want to discuss one small point which highlights the difference between human intelligence in question answering and computerized question answering.

In Monday’s show, one of the Jeopardy “answers” was, “It was the anatomical oddity of U.S. gymnast George Eyser, who won a gold medal on the parallel bars in 1904.” The correct answer was something like, “What is a missing leg?” Apparently, the play sequence went like this: Ken Jennings buzzed in with “What is a missing hand?” Watson then buzzed in with “What is a leg?” This was first deemed correct by Alex Trebek, the host, but this judgment was reversed by the judges on review. I didn’t see the show (a Valentine’s Day dinner took precedent), but apparently the TV show edited out Trebek’s original decision. Because of the reversal, Watson was unable to give a “more specific answer,” and Rutter was unable to buzz in on Watson’s error.

It seems that Trebek was treating Watson’s answer as if a human had given it: If a human had said “What is a leg?” as a follow-up to the wrong question, “What is a missing hand?” it would make sense to treat this as having the same intention as “What is a missing leg?” But Watson doesn’t know what the other contestants are saying, and so it actually had no such context in which to give its answer. I think it is plausible that Trebek would have awarded Watson a correct answer if Watson had given its answer without the context Jennings’s question (or, perhaps, would have asked Watson to give a more specific answer), given the “anatomical oddity” context of the question.

People laughed when Watson gave a similar wrong answer to another of Jennings’s errors. Jennings answered “What are the ’20s?,” and then Watson said, “What is 1920s?” Interestingly, the press reports have pretty much all said that Watson gave the same answer as Jennings. But Watson’s answer, though it would have the same intent if given by a person, is different, both in its raw surface form and in its grammatical incorrectness. I don’t think the Jeopardy rules require grammatically correctness in the questions, for human players don’t have this kind of problem. They didn’t know, or remember in the game play, that Watson didn’t receive the other contestants’ answers.

Watson was penalized for getting the 1920s question wrong, and penalized for getting the leg question right, but in the wrong way. I find it fascinating that people–sometimes in real time, and sometimes only on deliberation–can navigate what it means to have correct and incorrect intentions with respect to human-made artifacts such as Watson, especially within the strong expectations set up by IBM and the Jeopardy game to treat the Watson system as an intentional agent. Most of Watson’s human-seeming qualities come from the expectations that get set up. For example, Watson uses templates such as “Let’s finish up <CATEGORY>, Alex” when there is only one answer left in the category). But the stronger expectations are set by giving it a human name, a synthesized voice, putting it between two humans, using the pronoun “he” when referring to the system, etc. But even given these expectations, people can notice, and react to, the breakdowns (and, in the case of the missing “missing,” a seeming non-breakdown).

Search engines, such as Google and Bing, have a feature often internally called “Answers,” that give an answer to an implied question right on the search results page. For example, enter “molecular weight of hydrogen” or “capital of Michigan” in Bing or Google, and you’ll get the answer without having to go to another page. No one confuses this with human intelligence, yet, at some level of analysis, this is what Watson is doing. Granted, search engines are not optimized for searches phrased as questions (if you ask Google today, “What was George Eyser’s problem?,” the caption will say “he ran out of the paper dishes on which to serve the ice…”; in this case, Bing does a better job, but it could have gone either way). But the extraction of facts and references from a very large number of documents and data sources, based on vague and ambiguous search queries is the essential job of search engines. For the most part, Watson is a search engine, specialized for Jeopardy, that has been given a human moniker and a few human mannerisms.

Is Bing cheating at search? Redux: It’s all in the clicks

[Disclaimer: I work for Bing as a senior research software developer, but I do not speak for Bing]

Matt Cutts has a thoughtful follow-up article on the controversy over whether Bing is copying Google’s search results. There has been a little too much heat and not enough light in the discussion, and I appreciate reflective thoughts, and his shout out to us Bing engineers. Someday, we’ll have forgotten all about this controversy. I predict about two and a half weeks.

I’ve been thinking some more about this, too, and I’d like to write a bit more about the clickstream data. I won’t describe specific ways that the data is collected, although Bing has made it clear that it comes though the use of opt-in data provided by users. And I won’t describe all the content or format of what the clickstream contains. But let’s abstract what is obviously there: information about the search engine used, the query, the presented URLs (and their order), and the URLs that are clicked by the user. So, you can imagine that there is this stream of click data coming from a browser (I suggest only search engines here, Google, Bing and SearchCo, which will be a strawman for any search engine not as good as Google or Bing):

<Bing, “fish sticks”, http://fishsticks.com @1, clicked>
<Bing “fish sticks”, http://fishlips.com @2, unclicked>
<Google, “fish sticks”, http://fishsticks.com @1, clicked>
<Google, “fish sticks”, http://fishplanks.com @2, clicked>
<SearchCo, “fish sticks”, http://fishlips,com @1, unclicked>
<SearchCo, “fish sticks”, http://firesticks.com @2, unclicked>

and thousands and millions more of these (I have no idea of the number, actually, but it’s “large”).

A first question is to ask how these events are captured; given the variety of ways search requests are encoded, something, some process will have to do the decoding. Matt Cutts seems to think this is a priori bad reverse engineering (I guess), but it is done billions of times a day by all the search engine companies; the term of art is scraping, but it’s just using a practice of converting data from one form (a form outside the original intent of the provider) into a usable clickstream form. We all do this, all the time.

A second thing to note, and I think this is very important, is that the “goodness” of Google with respect to the clickstream is not evidenced because there is necessarily a choice to favor Google a priori. Rather, it is the users’ clicking on the links which provides the real evidence. The fake clickstreams above indicate users preferred Google’s particular search results (as evidenced by both results being clicked) to Bing’s (only one URL clicked) or to SearchCo (no URL clicked). The point is this: it is not that Bing is targeting Google, but that the evidence given by the users signal the good results provided by Google. As a bare fact, it is not the triple <Search Engine, Query, URL> triple that matters, but the triple <Query, URL, User Evaluation> that matters.

In the natural course of things, Bing’s search engine results gets better because users say so; when people like what Google (or Bing, or even SearchCo) gives them, that eventually shows up as selection and ranking preferences at Bing. In the end, when Bing’s search results look similar to Google’s search results, it’s because Google does a good job, and Bing’s learning algorithms do a good job; the learning algorithm doesn’t even have to know where the result came from to begin with. Bing won’t look as much like SearchCo just because SearchCo (mythical company, of course) doesn’t do as well. Also, of course, Google has had a huge market share, and so the preponderance of data comes from Google (although I have no real idea of the market share for the opt-in data Bing receives).

In yet other words, it’s all in the clicks. Or, at least, mostly in the clicks.

Time for a reminder: I’m a Bing developer, but I don’t work on ranking or selection.

I personally agree with Matt that software companies in general should do a better job of being transparent about what opting in means. I really don’t know what this should look like, though. For example, I use Chrome for a lot for personal web use—it’s so fast—and I know this helps, at some level, competitors to the people who pay my salary. But I have very little knowledge of the specific ways this happens—and perhaps then it doesn’t really matter. Just as I am glad that there is now a choice in search engines, so also I am glad that there are three or more good choices for browsers—which incidentally improve the aligned search engines. It just doesn’t seem that important to say anything more than anonymous data will be used to improve products, with a way to find out more details of the matter.

Finally, I want to point out that, even though the pages that Matt presents do have the same unique query/URL pairs for Bing and Google (and how that could happen is described above, my previous post, and in others’ posts on the web), the content on the pages is not the same. The titles differ in at least one case (alas for Bing, I think Google’s is better), and the captions are different (in general, I think Bing’s are better—but then I’m biased, because that is a large part of the work that goes on around me). Bing suggests alternative segmentation/spelling for some of the synthetic queries, and “related searches.” So, whatever the merit of Matt’s case as to what counts as “copying,” it’s important to note that much of the content differs.


Is Bing cheating at search?

Disclaimer: I work for Bing as a senior research software developer, but I do not speak for Bing; I am writing this on my own time and behalf.

Google has accused Bing of cheating, of copying Google’s results (Danny Sullivan’s article which broke the news; Google’s official blog post with the accusation). Basically, Google set up a honeypot of synthetic (that is, unique, non-word) search keys which their results led to specifically selected (i.e., non-algorithmically), but real, websites. Later, these search queries, when issued on Bing, led to these selected results. Bing’s Harry Shum (the Microsoft VP in charge of Bing) had this to say in an official response to Danny Sullivan’s article:

We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.

When a search engine, like Bing’s or Google’s says the results are algorithmically chosen, this primarily means that machine learning algorithms are used for selecting a collection of results to potentially display, and then other machine learning algorithms are used for ranking the results for display (which goes first, second, etc.) Of course, there are a lot of other things that go on: algorithms are used for spelling correction, query alterations (for example, noticing that “SF” is a common way of saying “San Francisco”), query classification and answer type selection (for example, should “365/24/7” return a direct answer of 2.17261905?), and so on. But what Google is accusing Bing of doing is “cheating” by copying their answers directly from Google, which is to say, that the usual selection and ranking steps are being bypassed in favor of direct copying.

Both Google and Bing use machine learning models that use many features for selection and ranking. This is what was meant when Shum said Bing uses “over 1,000 different signals and features.” A subset of those features are “clickstream data we get from some of our customers, who opt-in to sharing anonymous data.”

The clickstream (requisite Wikipedia article) is the record of searches, displayed results, and selected results collected by “our customers,” as Shum said. Clickstream analysis (pioneered by Google, I say, admiringly) is an extremely powerful source of data for search result selection and ranking. It tells you, implicitly, what people think of the results presented to them. Given a ranked selection of A,B,C, they click on C and then B, but leave A alone. Given enough of these data, the machine learning models can learn to present C and B (in that order) and downrank A (if A is presented at all). And there is a lot of clickstream data, both in the direct logs to the search providers’ services, as well as the “opt-in” data mentioned by Shum. Obviously, Bing can’t inspect Google’s clickstream logs, but when customers allow it, Bing can use their searches made to Google to approximate this. I don’t know the details of what is collected (nor could I tell you, I suppose, if I did), but these are the data Shum is referring to.

Ok, so imagine you have a complicated function that takes “over 1,000 signals and features” as its input, to use for ranking or selection. But in some very specific cases, you only have only a few signals coming in; the rest are unknown. Typically, and algorithmically, the function will do what it can with the information it has. In the present case, if the only thing it knows is that several clicks on a particular website occurred when a set of users entered in queries (and never any other websites), the function will likely return that website; a completely reasonable thing to do. It should be noted that Google’s honeypot trap resulted in only about 10% of the results, which is consistent with the paucity of the data, the strength of the signals, and the time lags involved.

To my mind, this is not “copying,” but the natural result of algorithmic search. Google’s results are a kind of gold standard; that the machine learned models learn to do what Google does is not unexpected. One way for Bing to “beat Google” at search is to be as good as Google in algorithmic search; which implies being as good as Google first. I don’t think you’d get an argument from anyone that Microsoft Live—Bing’s predecessor—was not as good as Google in core search. But I think a lot of people are beginning to realize that Bing’s core results really are about as good as Google’s core results now. And this is a result, not of copying, but of all the engineering and hard work of identifying the “over 1000 signals and features” that go into ranking and selection.

Interestingly, Google’s accusation came out the evening before Matt Cutts (who helps Google maintain search quality and lead anti-spam efforts) and Harry Shum were to appear jointly at “Future of Search” roundtable discussion (along with the CEO of Blekko, a new search engine). Cutts accused Bing directly here, and Shum said more or less what he said in the blog post. But he said something else interesting. He said that Google’s honeypot was a new kind of query spam (and that he was glad that Google had identified it). Maybe this was just trying to get a dig in at Google’s most public anti-spam advocate. But there really is some truth in this. Having identified this kind of query trap, I suspect engineers at Bing will look for ways of turning this into an anti-feature, and these sorts of results will be learned away.

You could tell that Cutts, as mild mannered a person as you’re likely to meet, was really upset at the thought that Bing is unfairly copying Google’s results—in fact, he said as much at the round table. I don’t quite get this, and I’d like to understand it better. He knows better than most how these things work. When I first started working for Microsoft (after the acquisition of Powerset, a startup company I was part of), I was suspicious. What I found is a large group of dedicated search engineers who want to build the best possible search engine, measured and driven by data—a method modeled for us by Google. I guess we’d be upset if we thought a competitor were cheating by using our results. But, from my standpoint, Shum’s public statement about what Bing is doing fairly describes what is happening: Bing uses clickstream data in its models, which sometimes leads to similar selections and rankings as Google’s selections and rankings. Bing isn’t cheating or copying; it’s using data from customers (who have opted in) to improve its results.

Some Bing N-gram notes on tokenization

Brendan O’Connor asked me on Twitter what I knew about Bing N-gram tokenization conventions, and I said I would ask someone who knew. The N-gram team plans a future blog post on this, but here’s some things I was told that I could share (and I quote):

Here are a few things we can share right now:

  • Segment boundaries use the special symbols <s> and </s>, thus P(“<s> hello there </s>”) != P(“hello there”).
  • The GetConditionalProbability uses the last space character as the ‘word’ boundary, even if internally there are multiple tokens represented in that last ‘word’.  That is, GCP(“shown as-is”) is P(“as-is”|”shown”), not P(“is”|”shown as”).


New Bing NGram data

Bing (my employer, but here a different area) has announced new publicly available NGram data, current to April 2010. It includes 1 through 5-grams for title, anchor and body streams (that is, HTML page titles, text in anchor links, and overall HTML body text).

Computational lexicography on the cheap using Twitter

Let’s say you want to investigate the use of “tweet” as a verb (see “Tweet this” at Language Log), and you want to collect, oh, 10,000 examples or so and do some concordance work, for example:

What iss the most popular question then? Tweet    the answer and hopefully u may only get asked 500 times?
what is there to                         tweet    about this morning?
What is your biggest food weakness?      Tweet    @Thintervention for motivation! #thinterventionG

This is simple to do with a bash command line, perl, Ruby, the Tweetstream gem, and a spreadsheet program (or just plain old grep).

To download 10,000 tweets containing “tweet,” “tweets”, or “tweeting” and save them in a file called “tweet.tweets”:

> @client = TweetStream::Client.new('user','pass')
> File.open("tweet.tweets", "w+") do |f|
        n = 0
        @client.track('tweet','tweets','tweeting') do |s|
            n+= 1
           @client.stop if n >= 10000
           f.puts "#{s.text}"

When these are finished downloading, you can tab separate the contexts using perl, and sort on the right context:

> cat tweet.tweets | perl -pe 's/\b(tweet|tweets|tweeting)\b/\t$1\t/gi' |sort -f -k2,3 -t\t    > tweets.txt

You can then import this file into your speadsheet program and slice and dice to your heart’s content.

Note: it took longer to write this blog post than it did to collect the data. Analysis to follow, though!

Nip it in the butt

I’ve seen two references to “nip it in the butt” for “nip it in the bud” recently, which reminded me of the “avoid like the plaque” post.

It’s a little harder to get statistics for this based on the public Bing data, which only goes to four tokens. But here are the most common following words for “nip in the ___”:

bud:46.79%, air:45.60%, h: 2.53%, butt: 1.05%, taste: 0.95%, form: 0.35%, bahamas: 0.25%, nip: 0.25%, </s>: 0.08%, us: 0.04%, post: 0.04%, office: 0.04%, images: 0.04%, counter: 0.04%, evening: 0.04%, twin: 0.04%, muck: 0.04%, buttoutkast: 0.04%

Nip Nip in the H Section” is referenced in the Urban dictionary. Nip in the taste bud seems to be a standard ‘clever’ headline. Nip in the bahamas is probably about a “nipple slip”.

Avoid like the plaque

I noticed that the expression “avoid like the plaque” showing up in some search results I was doing. This is “plaque” with-a-queue, not “plague”-with-a-gee. This seems like an odd misspelling to me, and I was curious how often this occurs. The Bing Ngram data gives us the tools: given “avoid like the”, the word “plague” occurs 98.68% of the time (in the web body data). The word “plaque” is the second most common word, occurring 0.13% of the time.

Here are the top  words and percentages:

plague:98.68%, </s>: 0.48%, plaque: 0.13%, pl: 0.12%, great: 0.07%, swine: 0.06%, velvet: 0.04%, black: 0.02%, proverbial: 0.02%, bubonic: 0.02%, bubolic: 0.02%, ten: 0.01%, plagued: 0.01%, plauge: 0.01%, bird: 0.01%, nuclear: 0.01%, blight: 0.01%, clap: 0.01%, lague: 0.01%, plagu: 0.01%, pleague: 0.01%

</s> means “end of sentence.”

Using Ngram data for segmentation

I’ve updated the Microsoft NGram Ruby library to provide an example use: segmenting Twitter hashtags. Twitter hashtags have been used for some time to tag tweets according to users’ choice and whimsy. Coincidentally, my daytime boss has just conducted an interview with William Morgan–formerly of Powerset, but now at Twitter–about hash tags, in case you want to come up to speed on what they are. It’s a fun read in any case, including the origin story of hash tags.

Anyway, the Ruby library allows you to segment text; it uses the Bing unigram and bigram data to guess the mostly likely segmentation. Here are some hashtags in my timeline from today, and their segmentations:

#  > segment("bpcares")
#  => ["bp", "cares"]
#  > segment("Twitter")
#  => ["Twitter"]
#  > segment("writers")
#  => ["writers"]
#  > segment("iamwriting")
#  => ["i", "am", "writing"]
#  > segment("backchannel")
#  => ["back", "channel"]
#  > segment("tcot")
#  => ["tcot"]
#  > segment("vacationfallout")
#  => ["vacation", "fall", "out"]

The code closely follows Peter Norvig’s discussion of segmentation in his chapter “Natural Language Corpus Data” in the book Beautiful Data. The only differences are (1) using the Web-based data for unigram and bigram data, and (2) a small optimization (perhaps) of creating splits on text only when the first part of the split reaches a certain probability threshold. ([“vacationf” “allout”] is not a good split for “vacationfallout” because “vacationf” is very unlikely).

The code would be better if it batched the calls for probabilities rather than requesting them one-by-one. Norvig’s code also has the advantage of running off-line. The Bing data is more recent.

Anyway, enjoy the code: it can be found in my GitHub account:


It requires (in addition to the microsoft_ngram library and its dependencies) the memoize gem.