A weblog by Will Fitzgerald

Monthly Archives: February 2011

What is missing? Watson, AI and search technology

All geekdom, and much that is outside our realm, is abuzz with news and discussion of the contest by IBM’s Jeopardy-playing computer, Watson, vs Ken Jennings and Brad Rutter. In last night’s game (the second half of a game started on Monday, Watson won decisively, answering many, many more questions than Jennings and Rutter combined. At the end of the first half, Rutter and Watson were tied. Lots of famous (and formerly famous) artificial intelligence researchers have weighed in on what this means for AI. I want to discuss one small point which highlights the difference between human intelligence in question answering and computerized question answering.

In Monday’s show, one of the Jeopardy “answers” was, “It was the anatomical oddity of U.S. gymnast George Eyser, who won a gold medal on the parallel bars in 1904.” The correct answer was something like, “What is a missing leg?” Apparently, the play sequence went like this: Ken Jennings buzzed in with “What is a missing hand?” Watson then buzzed in with “What is a leg?” This was first deemed correct by Alex Trebek, the host, but this judgment was reversed by the judges on review. I didn’t see the show (a Valentine’s Day dinner took precedent), but apparently the TV show edited out Trebek’s original decision. Because of the reversal, Watson was unable to give a “more specific answer,” and Rutter was unable to buzz in on Watson’s error.

It seems that Trebek was treating Watson’s answer as if a human had given it: If a human had said “What is a leg?” as a follow-up to the wrong question, “What is a missing hand?” it would make sense to treat this as having the same intention as “What is a missing leg?” But Watson doesn’t know what the other contestants are saying, and so it actually had no such context in which to give its answer. I think it is plausible that Trebek would have awarded Watson a correct answer if Watson had given its answer without the context Jennings’s question (or, perhaps, would have asked Watson to give a more specific answer), given the “anatomical oddity” context of the question.

People laughed when Watson gave a similar wrong answer to another of Jennings’s errors. Jennings answered “What are the ’20s?,” and then Watson said, “What is 1920s?” Interestingly, the press reports have pretty much all said that Watson gave the same answer as Jennings. But Watson’s answer, though it would have the same intent if given by a person, is different, both in its raw surface form and in its grammatical incorrectness. I don’t think the Jeopardy rules require grammatically correctness in the questions, for human players don’t have this kind of problem. They didn’t know, or remember in the game play, that Watson didn’t receive the other contestants’ answers.

Watson was penalized for getting the 1920s question wrong, and penalized for getting the leg question right, but in the wrong way. I find it fascinating that people–sometimes in real time, and sometimes only on deliberation–can navigate what it means to have correct and incorrect intentions with respect to human-made artifacts such as Watson, especially within the strong expectations set up by IBM and the Jeopardy game to treat the Watson system as an intentional agent. Most of Watson’s human-seeming qualities come from the expectations that get set up. For example, Watson uses templates such as “Let’s finish up <CATEGORY>, Alex” when there is only one answer left in the category). But the stronger expectations are set by giving it a human name, a synthesized voice, putting it between two humans, using the pronoun “he” when referring to the system, etc. But even given these expectations, people can notice, and react to, the breakdowns (and, in the case of the missing “missing,” a seeming non-breakdown).

Search engines, such as Google and Bing, have a feature often internally called “Answers,” that give an answer to an implied question right on the search results page. For example, enter “molecular weight of hydrogen” or “capital of Michigan” in Bing or Google, and you’ll get the answer without having to go to another page. No one confuses this with human intelligence, yet, at some level of analysis, this is what Watson is doing. Granted, search engines are not optimized for searches phrased as questions (if you ask Google today, “What was George Eyser’s problem?,” the caption will say “he ran out of the paper dishes on which to serve the ice…”; in this case, Bing does a better job, but it could have gone either way). But the extraction of facts and references from a very large number of documents and data sources, based on vague and ambiguous search queries is the essential job of search engines. For the most part, Watson is a search engine, specialized for Jeopardy, that has been given a human moniker and a few human mannerisms.

Is Bing cheating at search? Redux: It’s all in the clicks

[Disclaimer: I work for Bing as a senior research software developer, but I do not speak for Bing]

Matt Cutts has a thoughtful follow-up article on the controversy over whether Bing is copying Google’s search results. There has been a little too much heat and not enough light in the discussion, and I appreciate reflective thoughts, and his shout out to us Bing engineers. Someday, we’ll have forgotten all about this controversy. I predict about two and a half weeks.

I’ve been thinking some more about this, too, and I’d like to write a bit more about the clickstream data. I won’t describe specific ways that the data is collected, although Bing has made it clear that it comes though the use of opt-in data provided by users. And I won’t describe all the content or format of what the clickstream contains. But let’s abstract what is obviously there: information about the search engine used, the query, the presented URLs (and their order), and the URLs that are clicked by the user. So, you can imagine that there is this stream of click data coming from a browser (I suggest only search engines here, Google, Bing and SearchCo, which will be a strawman for any search engine not as good as Google or Bing):

<Bing, “fish sticks”, http://fishsticks.com @1, clicked>
<Bing “fish sticks”, http://fishlips.com @2, unclicked>
<Google, “fish sticks”, http://fishsticks.com @1, clicked>
<Google, “fish sticks”, http://fishplanks.com @2, clicked>
<SearchCo, “fish sticks”, http://fishlips,com @1, unclicked>
<SearchCo, “fish sticks”, http://firesticks.com @2, unclicked>

and thousands and millions more of these (I have no idea of the number, actually, but it’s “large”).

A first question is to ask how these events are captured; given the variety of ways search requests are encoded, something, some process will have to do the decoding. Matt Cutts seems to think this is a priori bad reverse engineering (I guess), but it is done billions of times a day by all the search engine companies; the term of art is scraping, but it’s just using a practice of converting data from one form (a form outside the original intent of the provider) into a usable clickstream form. We all do this, all the time.

A second thing to note, and I think this is very important, is that the “goodness” of Google with respect to the clickstream is not evidenced because there is necessarily a choice to favor Google a priori. Rather, it is the users’ clicking on the links which provides the real evidence. The fake clickstreams above indicate users preferred Google’s particular search results (as evidenced by both results being clicked) to Bing’s (only one URL clicked) or to SearchCo (no URL clicked). The point is this: it is not that Bing is targeting Google, but that the evidence given by the users signal the good results provided by Google. As a bare fact, it is not the triple <Search Engine, Query, URL> triple that matters, but the triple <Query, URL, User Evaluation> that matters.

In the natural course of things, Bing’s search engine results gets better because users say so; when people like what Google (or Bing, or even SearchCo) gives them, that eventually shows up as selection and ranking preferences at Bing. In the end, when Bing’s search results look similar to Google’s search results, it’s because Google does a good job, and Bing’s learning algorithms do a good job; the learning algorithm doesn’t even have to know where the result came from to begin with. Bing won’t look as much like SearchCo just because SearchCo (mythical company, of course) doesn’t do as well. Also, of course, Google has had a huge market share, and so the preponderance of data comes from Google (although I have no real idea of the market share for the opt-in data Bing receives).

In yet other words, it’s all in the clicks. Or, at least, mostly in the clicks.

Time for a reminder: I’m a Bing developer, but I don’t work on ranking or selection.

I personally agree with Matt that software companies in general should do a better job of being transparent about what opting in means. I really don’t know what this should look like, though. For example, I use Chrome for a lot for personal web use—it’s so fast—and I know this helps, at some level, competitors to the people who pay my salary. But I have very little knowledge of the specific ways this happens—and perhaps then it doesn’t really matter. Just as I am glad that there is now a choice in search engines, so also I am glad that there are three or more good choices for browsers—which incidentally improve the aligned search engines. It just doesn’t seem that important to say anything more than anonymous data will be used to improve products, with a way to find out more details of the matter.

Finally, I want to point out that, even though the pages that Matt presents do have the same unique query/URL pairs for Bing and Google (and how that could happen is described above, my previous post, and in others’ posts on the web), the content on the pages is not the same. The titles differ in at least one case (alas for Bing, I think Google’s is better), and the captions are different (in general, I think Bing’s are better—but then I’m biased, because that is a large part of the work that goes on around me). Bing suggests alternative segmentation/spelling for some of the synthetic queries, and “related searches.” So, whatever the merit of Matt’s case as to what counts as “copying,” it’s important to note that much of the content differs.


Is Bing cheating at search?

Disclaimer: I work for Bing as a senior research software developer, but I do not speak for Bing; I am writing this on my own time and behalf.

Google has accused Bing of cheating, of copying Google’s results (Danny Sullivan’s article which broke the news; Google’s official blog post with the accusation). Basically, Google set up a honeypot of synthetic (that is, unique, non-word) search keys which their results led to specifically selected (i.e., non-algorithmically), but real, websites. Later, these search queries, when issued on Bing, led to these selected results. Bing’s Harry Shum (the Microsoft VP in charge of Bing) had this to say in an official response to Danny Sullivan’s article:

We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.

When a search engine, like Bing’s or Google’s says the results are algorithmically chosen, this primarily means that machine learning algorithms are used for selecting a collection of results to potentially display, and then other machine learning algorithms are used for ranking the results for display (which goes first, second, etc.) Of course, there are a lot of other things that go on: algorithms are used for spelling correction, query alterations (for example, noticing that “SF” is a common way of saying “San Francisco”), query classification and answer type selection (for example, should “365/24/7” return a direct answer of 2.17261905?), and so on. But what Google is accusing Bing of doing is “cheating” by copying their answers directly from Google, which is to say, that the usual selection and ranking steps are being bypassed in favor of direct copying.

Both Google and Bing use machine learning models that use many features for selection and ranking. This is what was meant when Shum said Bing uses “over 1,000 different signals and features.” A subset of those features are “clickstream data we get from some of our customers, who opt-in to sharing anonymous data.”

The clickstream (requisite Wikipedia article) is the record of searches, displayed results, and selected results collected by “our customers,” as Shum said. Clickstream analysis (pioneered by Google, I say, admiringly) is an extremely powerful source of data for search result selection and ranking. It tells you, implicitly, what people think of the results presented to them. Given a ranked selection of A,B,C, they click on C and then B, but leave A alone. Given enough of these data, the machine learning models can learn to present C and B (in that order) and downrank A (if A is presented at all). And there is a lot of clickstream data, both in the direct logs to the search providers’ services, as well as the “opt-in” data mentioned by Shum. Obviously, Bing can’t inspect Google’s clickstream logs, but when customers allow it, Bing can use their searches made to Google to approximate this. I don’t know the details of what is collected (nor could I tell you, I suppose, if I did), but these are the data Shum is referring to.

Ok, so imagine you have a complicated function that takes “over 1,000 signals and features” as its input, to use for ranking or selection. But in some very specific cases, you only have only a few signals coming in; the rest are unknown. Typically, and algorithmically, the function will do what it can with the information it has. In the present case, if the only thing it knows is that several clicks on a particular website occurred when a set of users entered in queries (and never any other websites), the function will likely return that website; a completely reasonable thing to do. It should be noted that Google’s honeypot trap resulted in only about 10% of the results, which is consistent with the paucity of the data, the strength of the signals, and the time lags involved.

To my mind, this is not “copying,” but the natural result of algorithmic search. Google’s results are a kind of gold standard; that the machine learned models learn to do what Google does is not unexpected. One way for Bing to “beat Google” at search is to be as good as Google in algorithmic search; which implies being as good as Google first. I don’t think you’d get an argument from anyone that Microsoft Live—Bing’s predecessor—was not as good as Google in core search. But I think a lot of people are beginning to realize that Bing’s core results really are about as good as Google’s core results now. And this is a result, not of copying, but of all the engineering and hard work of identifying the “over 1000 signals and features” that go into ranking and selection.

Interestingly, Google’s accusation came out the evening before Matt Cutts (who helps Google maintain search quality and lead anti-spam efforts) and Harry Shum were to appear jointly at “Future of Search” roundtable discussion (along with the CEO of Blekko, a new search engine). Cutts accused Bing directly here, and Shum said more or less what he said in the blog post. But he said something else interesting. He said that Google’s honeypot was a new kind of query spam (and that he was glad that Google had identified it). Maybe this was just trying to get a dig in at Google’s most public anti-spam advocate. But there really is some truth in this. Having identified this kind of query trap, I suspect engineers at Bing will look for ways of turning this into an anti-feature, and these sorts of results will be learned away.

You could tell that Cutts, as mild mannered a person as you’re likely to meet, was really upset at the thought that Bing is unfairly copying Google’s results—in fact, he said as much at the round table. I don’t quite get this, and I’d like to understand it better. He knows better than most how these things work. When I first started working for Microsoft (after the acquisition of Powerset, a startup company I was part of), I was suspicious. What I found is a large group of dedicated search engineers who want to build the best possible search engine, measured and driven by data—a method modeled for us by Google. I guess we’d be upset if we thought a competitor were cheating by using our results. But, from my standpoint, Shum’s public statement about what Bing is doing fairly describes what is happening: Bing uses clickstream data in its models, which sometimes leads to similar selections and rankings as Google’s selections and rankings. Bing isn’t cheating or copying; it’s using data from customers (who have opted in) to improve its results.