Search engine index wars
August 16, 2005
Slashdot reported on an NCSA study, A Comparison of the Size of the Yahoo! and Google Indices. On the one hand, the researchers examined actual results returned by the search engines; on the other, they investigated only English-language searches. The latter makes a fair comparison very unlikely.
Jean Véronis has written a series of posts examining reported results from Google, Yahoo and MSN. One of the latest looks at the claim that Yahoo is indexing 19 billion pages. (There is also a version française). His research indicates that the results returned by Yahoo are at least (internally) consistent with a large increase in indexing size by Yahoo. It’s worth following links in the post above to previous studies Véronis has done.
Also, he has some evidence that Yahoo doesn’t index as deeply as Google does: a Google search for azoïque returns at least one result that the same Yahoo search does not, although Yahoo did index the document in question; the word “azoïque” appears further into the PDF document than Yahoo apparently goes. (The article is in French, which I read in an automatically translated version. I was amused that “azoïque” was translated as “azo,” which is the airport code for Kalamazoo, so perhaps “azoïque” means Kalamazooan.)
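To make the depth point concrete, here is a toy sketch in Python of how a per-document cutoff could hide a word that appears late in a file. It is not how Yahoo or Google actually work; the `build_index` function, the `max_words` parameter, the cutoff of 20 words and the two sample documents are all invented for illustration.

```python
def build_index(docs, max_words=None):
    """Map each lowercased word to the set of document ids containing it.

    max_words is a hypothetical per-document cutoff: any word that appears
    after the first max_words words of a document is never indexed.
    """
    index = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        if max_words is not None:
            words = words[:max_words]  # ignore everything past the cutoff
        for word in words:
            index.setdefault(word, set()).add(doc_id)
    return index


# Invented sample data: the long document mentions the word only at the very end.
docs = {
    "report.pdf": "introduction " * 50 + "azoïque",
    "note.txt": "colorant azoïque",
}

deep = build_index(docs)                   # indexes every word
shallow = build_index(docs, max_words=20)  # stops after 20 words per document

print(sorted(deep["azoïque"]))
print(sorted(shallow["azoïque"]))
```

In this sketch both indexes know about report.pdf (it still shows up under "introduction" in the shallow index), but only the deep one can return it for a search on "azoïque", which is the pattern Véronis describes.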
Update. Jean Véronis damningly writes about the NCSA study: original in French, auto-translated into English. Apparently the NCSA researchers didn’t notice that many of the results returned by Google were simply copies of the original English dictionary, or other dictionaries, from which they drew their words.
Update 2. Jean Véronis writes (in the comments) that he’s done an English version of the second Missing Pages in Yahoo post, and a third posting on the issue in French with an English version promised soon. He was also quoted in the New York Times (although the writer of the article didn’t cite the problems with the NCSA study).