Page duplicates on the web

The always interesting Jean Veronis posts about page duplicates in English and French.

He investigates how many pages are returned by a search for the new word “ségolisme” (for which he can control a bit for precision), and how many of these are duplicates due to RSS feeds, etc. A money quote:

So, what are all these pages that Google considers to be similar? On closer inspection we can see that they are mainly RSS versions of the same page, archived pages (a blog often presents the same post in an individual version and also in a weekly or monthly archive), versions of posts both with and without comments, and links such as “trackback” or “new comments”, which are also commonplace on blogs. In fact, the word ségolisme was used in a post on Agoravox, and for a time, the thousands of posts on this platform all automatically included the word ségolisme, not to mention the 459 comments, which are rendered on a separate dynamic page (“report an abuse”). On 1 July, the Agoravox website alone was responsible for 15,200 results of the 52,200 returned by Google. Yet a search on the site itself reveals that only 392 documents contain this word, most of them comments; one single post contains the word – the original post!
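To put the quoted figures in proportion, here is a minimal Python sketch of the arithmetic. The three numbers come straight from the quote; the variable names and the "inflation" label are my own shorthand, not Véronis's, and the ratio is only a rough indicator, since Google and the site's own search may not count documents the same way.

```python
# Figures taken from the quote above (snapshot of 1 July).
google_total = 52_200      # results Google returned for the word
google_agoravox = 15_200   # of those, results attributed to the Agoravox site
site_search_hits = 392     # documents found by Agoravox's own site search


def share(part: int, whole: int) -> float:
    """Fraction of `whole` accounted for by `part`."""
    return part / whole


def inflation(reported: int, actual: int) -> float:
    """How many times larger the reported count is than the actual count."""
    return reported / actual


agoravox_share = share(google_agoravox, google_total)
agoravox_inflation = inflation(google_agoravox, site_search_hits)

print(f"Agoravox share of Google results: {agoravox_share:.1%}")
print(f"Google count vs. the site's own count: {agoravox_inflation:.1f}x")
```

So one site accounts for roughly 29% of the reported results, and Google's count for that site is nearly 39 times what the site's own search finds.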

(Note, for example, how much of this post is itself a duplicate, and then look at the RSS and Atom feeds for the main page and the category pages…)

