Disclaimer: I work for Bing as a senior research software developer, but I do not speak for Bing; I am writing this on my own time and behalf.
Google has accused Bing of cheating by copying Google’s results (Danny Sullivan’s article which broke the news; Google’s official blog post with the accusation). Basically, Google set up a honeypot of synthetic (that is, unique, non-word) search queries, for which Google’s results were made to point to specifically selected (i.e., not algorithmically chosen), but real, websites. Later, these same queries, when issued on Bing, led to the same selected results. Bing’s Harry Shum (the Microsoft VP in charge of Bing) had this to say in an official response to Danny Sullivan’s article:
We use over 1,000 different signals and features in our ranking algorithm. A small piece of that is clickstream data we get from some of our customers, who opt-in to sharing anonymous data as they navigate the web in order to help us improve the experience for all users.
When a search engine like Bing or Google says its results are algorithmically chosen, this primarily means that machine learning algorithms are used for selecting a collection of results to potentially display, and then other machine learning algorithms are used for ranking the results for display (which goes first, second, etc.). Of course, there are a lot of other things that go on: algorithms are used for spelling correction, query alterations (for example, noticing that “SF” is a common way of saying “San Francisco”), query classification and answer type selection (for example, should “365/24/7” return a direct answer of 2.17261905?), and so on. But what Google is accusing Bing of doing is “cheating” by copying their answers directly from Google, which is to say that the usual selection and ranking steps are being bypassed in favor of direct copying.
Both Google and Bing use machine learning models that draw on many features for selection and ranking. This is what was meant when Shum said Bing uses “over 1,000 different signals and features.” A subset of those features is the “clickstream data we get from some of our customers, who opt-in to sharing anonymous data.”
The clickstream (requisite Wikipedia article) is the record of searches, displayed results, and selected results collected from “our customers,” as Shum said. Clickstream analysis (pioneered by Google, I say, admiringly) is an extremely powerful source of data for search result selection and ranking. It tells you, implicitly, what people think of the results presented to them. Say that, given a ranked selection of A, B, C, they click on C and then B, but leave A alone. Given enough of these data, the machine learning models can learn to present C and B (in that order) and downrank A (if A is presented at all). And there is a lot of clickstream data, both in the direct logs of the search providers’ services and in the “opt-in” data mentioned by Shum. Obviously, Bing can’t inspect Google’s clickstream logs, but when customers allow it, Bing can use the searches they make on Google to approximate this. I don’t know the details of what is collected (nor could I tell you, I suppose, if I did), but these are the data Shum is referring to.
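To make the A, B, C example concrete, here is a toy sketch in Python. It is only an illustration of the idea, not how Bing or Google actually does it: the function name rerank_by_clicks, the session format, and the rule that earlier clicks count for more are all assumptions of mine, and a real system would feed click statistics into a learned model rather than sort on them directly.

```python
from collections import Counter

def rerank_by_clicks(sessions, shown):
    """Reorder displayed results by accumulated click credit.

    sessions: list of click sequences, one per user session (hypothetical format).
    shown: the results in the order they were originally displayed.
    """
    credit = Counter()
    for clicked in sessions:
        # Assumption: the first click in a session is the strongest vote,
        # so later clicks receive proportionally less credit.
        for position, result in enumerate(clicked):
            credit[result] += 1.0 / (position + 1)
    # sorted() is stable, so results with equal credit keep their displayed order.
    return sorted(shown, key=lambda result: -credit[result])

# Three sessions in which users saw A, B, C and never clicked on A.
sessions = [["C", "B"], ["C", "B"], ["C"]]
print(rerank_by_clicks(sessions, ["A", "B", "C"]))  # ['C', 'B', 'A']
```

With these made-up sessions, C accumulates the most credit, B some, and A none, so the order becomes C, B, A.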
OK, so imagine you have a complicated function that takes “over 1,000 signals and features” as its input, to use for ranking or selection. But in some very specific cases, you have only a few signals coming in; the rest are unknown. Typically, and algorithmically, the function will do what it can with the information it has. In the present case, if the only thing it knows is that a set of users entered a query and then clicked several times on one particular website (and never on any other), the function will likely return that website, which is a completely reasonable thing to do. It should be noted that Google’s honeypot trap worked for only about 10% of the synthetic queries, which is consistent with the paucity of the data, the strength of the signals, and the time lags involved.
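Here is a toy Python sketch of that situation. Everything in it is an assumption made for illustration: the three feature names, the weights, the example URLs, and the use of a simple weighted sum in place of a real learned model with over 1,000 inputs. The point is only that when a click signal is the sole input available, it decides the outcome.

```python
# Hypothetical weights for a tiny linear scorer; real rankers are learned models.
WEIGHTS = {"text_match": 2.0, "link_authority": 1.5, "click_rate": 1.0}

def score(features):
    # Assumption: a signal that is unknown for a page contributes nothing (0.0).
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())

def rank(candidates):
    """Return candidate URLs ordered from highest to lowest score."""
    return sorted(candidates, key=lambda url: score(candidates[url]), reverse=True)

# An ordinary query: many signals are known, and clicks are only one of them.
ordinary = {
    "a.example": {"text_match": 0.9, "link_authority": 0.8, "click_rate": 0.2},
    "b.example": {"text_match": 0.4, "link_authority": 0.3, "click_rate": 0.9},
}

# A synthetic non-word query: no text or link evidence exists for any page,
# so the only signal present is that some users clicked on one site.
synthetic = {
    "planted.example": {"click_rate": 1.0},
    "other.example": {},
}

print(rank(ordinary))   # a.example first, despite its lower click rate
print(rank(synthetic))  # planted.example first, on the click signal alone
```

For the ordinary query, the click signal is outweighed by everything else the scorer knows. For the synthetic query there is nothing else, so the clicked site comes out on top.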
To my mind, this is not “copying,” but the natural result of algorithmic search. Google’s results are a kind of gold standard; that the machine-learned models learn to do what Google does is not unexpected. One way for Bing to “beat Google” at search is to be better than Google in algorithmic search, which implies being as good as Google first. I don’t think you’d get an argument from anyone that Microsoft Live (Bing’s predecessor) was not as good as Google in core search. But I think a lot of people are beginning to realize that Bing’s core results really are about as good as Google’s core results now. And this is a result, not of copying, but of all the engineering and hard work of identifying the “over 1,000 signals and features” that go into ranking and selection.
Interestingly, Google’s accusation came out the evening before Matt Cutts (who helps Google maintain search quality and leads its anti-spam efforts) and Harry Shum were to appear jointly at a “Future of Search” roundtable discussion (along with the CEO of Blekko, a new search engine). Cutts accused Bing directly there, and Shum said more or less what he said in the blog post. But he said something else interesting. He said that Google’s honeypot was a new kind of query spam (and that he was glad that Google had identified it). Maybe this was just an attempt to get a dig in at Google’s most public anti-spam advocate. But there really is some truth in this. Now that this kind of query trap has been identified, I suspect engineers at Bing will look for ways of turning it into an anti-feature, and these sorts of results will be learned away.
You could tell that Cutts, as mild-mannered a person as you’re likely to meet, was really upset at the thought that Bing is unfairly copying Google’s results; in fact, he said as much at the roundtable. I don’t quite get this, and I’d like to understand it better. He knows better than most how these things work. When I first started working for Microsoft (after the acquisition of Powerset, a startup company I was part of), I was suspicious. What I found is a large group of dedicated search engineers who want to build the best possible search engine, measured and driven by data, a method modeled for us by Google. I guess we’d be upset if we thought a competitor were cheating by using our results. But, from my standpoint, Shum’s public statement about what Bing is doing fairly describes what is happening: Bing uses clickstream data in its models, which sometimes leads to selections and rankings similar to Google’s. Bing isn’t cheating or copying; it’s using data from customers (who have opted in) to improve its results.