A weblog by Will Fitzgerald

More thoughts on “culturomics” (part one)

I’ve now had a chance to read the Science Express article describing “culturomics,” that is, the “Quantitative Analysis of Culture Using Millions of Digitized Books,” recently published by researchers at Google and several high-profile academic and commercial institutions. The authors claim to have created a corpus drawn from approximately four percent of all books ever published (over five million books), supplying time-stamped ngram data based on 500 billion words in seven languages, primarily English, spanning from the 1500s to roughly the present, with most of the data coming from recent decades. The “-omics” of culturomics is by analogy to genomics and proteomics: that is, high-speed analysis of large amounts of data. As someone who has done some work with ngrams on the web (with data provided both by Bing, my employer, and earlier by Google), this work is of great interest to me. Digitized books are, of course, a different animal from the web (and from queries made to web search engines), so the corpus is interesting for that reason as well. The time-stamps also make certain time-based analyses possible; this is the kind of data we have not had before (Bing’s ngram data, assuming Bing continues to provide older versions, might eventually allow something similar).

There have been a number of criticisms of the culturomics programme, many of them well-founded, and it is worthwhile to describe a few. First, there are problems with the meta-data associated with the books that Google has digitized, so it is unclear how accurate the time-stamps are. The article does not address this, although it has been a well-known problem; certainly, the authors could have sampled the corpus and given estimates of the accuracy of the time-stamp data. Related to this is the lack of a careful description of the genres and dialects represented (again, partly a failure of the meta-data). Second, there are systematic errors in the text scans. This is especially true for older books, which often used typographic conventions and fonts not common today (and, one assumes, errors arise from OCR language models trained on modern texts rather than pre-modern ones). Consider, for example, the “long s” previously used in many contexts in English; it is often read as an “f” instead of a long s. Incidentally, given the tokenization goals of the project, the right thing to do would be to record the long s, not regularize it to the modern, standard “s”; otherwise, it becomes more difficult to track the decline of the long s, except by guessing at OCR errors. The whole notion of what constitutes a countable thing in these corpora (that is, what the tokenization rules for generating the 1-grams are) is given short shrift in the article, although it is a fairly important issue.
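To make the tokenization point concrete, here is a minimal sketch (in Python, with an invented sample sentence) of how folding the long s (“ſ”, U+017F) into a modern “s” at tokenization time erases exactly the signal one would need in order to track its decline:

```python
from collections import Counter

def tokenize(text, regularize_long_s=False):
    """Naive whitespace tokenizer; optionally folds the long s into 's'."""
    if regularize_long_s:
        text = text.replace("ſ", "s")
    return text.lower().split()

# An invented 18th-century-style sample; real scans would add OCR noise
# (e.g., "ſ" misread as "f") on top of this.
sample = "Congreſs ſhall make no law reſpecting an eſtabliſhment of religion"

distinct = Counter(tokenize(sample))                        # keeps "congreſs", "ſhall", ...
folded = Counter(tokenize(sample, regularize_long_s=True))  # "congress", "shall", ...
# Once folded, "congreſs" and "congress" are the same 1-gram, and the
# long s can no longer be counted (or dated) at all.
```

This is only an illustrative tokenizer, not the project’s actual pipeline, which is exactly the part the article leaves underspecified.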

Third, the presentation of the Google Labs Ngram Viewer has made it overly easy to tell “just so” stories. For example, consider the contrast between “Jesus” and “Christ” from 1700 to 2008. It is easy to tell this just-so story: people talked about Jesus or Christ before the revolution, but then not so much in the run-up to the War and its aftermath. But, with the Great Awakening, a large number of books were published, with “Christ” much more common than “Jesus.” Over time, due to the increasing secularization of the United States, people wrote about Jesus or Christ less and less. The individualistic Evangelical explosion of the last thirty years has started to reverse the trend, with “Jesus” (a more personal name) becoming more popular than “Christ” (a less personal name). Natalia Cecire describes this, better and more succinctly, as “Words for Snowism.” Cecire also views the Ngram Viewer as a guilty pleasure; as epistemic candy.

The Science Express article describes several experiments made with the Google Books data, and it is worth spending time examining them, because they give good hints as to what the data are likely to be good for, and where the culturomics programme is likely to head.

The first set of experiments describes attempts at estimating the size of the English lexicon and its growth over time. The authors describe (qualitatively, for the most part) their sampling technique for determining whether a 1-gram token was an English word form: it had to appear more than once per billion words, and a sample of these common candidate word forms was manually annotated as to whether they were truly English word forms or something else (such as a number, a misspelling, or a foreign word). Sample sizes, procedure, inter-rater reliability, and so on were not reported; an important flaw, in my opinion. They show, for example, that the English vocabulary has increased by over 70% in the past 50 years, and contrast this with the size of printed dictionaries. This first set of experiments will be of great interest to lexicographers; indeed, it is just this gap that commercial enterprises like Wordnik are trying to fill. It is hard to see how this says much about “culture,” except as fodder for lexicographical historiography or lexicographical evangelism: there are many words that are not in “the dictionary”; get used to it.
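As I read it, the frequency-threshold step can be sketched as follows (a hypothetical Python fragment with invented 1-gram counts; the actual pipeline, sample sizes, and annotation procedure are precisely what the article leaves unreported):

```python
TOTAL_TOKENS = 500_000_000_000   # the reported ~500 billion words
THRESHOLD = 1 / 1_000_000_000    # "more than once per billion"

# Invented 1-gram counts, for illustration only.
counts = {"the": 23_000_000_000, "culturomics": 1_200, "zxqv3": 3}

# Step 1: keep only sufficiently common 1-grams; rare strings never
# even reach the annotators.
candidates = [w for w, c in counts.items() if c / TOTAL_TOKENS > THRESHOLD]

# Step 2 (not shown): manually annotate a sample of the candidates as
# genuine English word forms vs. numbers, misspellings, foreign words, etc.,
# and extrapolate from the sample to an estimated lexicon size.
```

Note that the threshold itself shapes the result: any genuine but very rare word form is excluded by construction, so the estimate is of the common lexicon, not the whole of it.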

The second set of experiments purports to describe “the evolution of grammar” but, not surprisingly, attacks only a very small subset of lexical grammar: the change in use of strong and weak verbs in English. Given the time-stamped, word-form-based data, it is relatively simple to track the movement from “burnt” to “burned,” and to compare and contrast this with other strong and weak verbs. One wishes for more description of how they reach their conclusions; for example, the reasonable statement that “high-frequency irregulars, which are more readily remembered, hold their ground better” is “proved” by a single contrast between finded/found and dwelled/dwelt. Some of the conclusions depend on dialect: there are differences between American English and British English. Unfortunately, the actual scope of those differences is likewise “proved” by a single contrast, the use of burned/burnt in American and British English. Again, knowing the accuracy of dialect assignment, the list of verbs used, and so on would be very useful.
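The underlying computation here is straightforward; a sketch with invented per-year counts, in the general shape of the published (ngram, year, count) data:

```python
# Invented counts for illustration; the real dataset reports per-year
# match counts for each 1-gram.
burned = {1800: 40, 1850: 55, 1900: 70, 1950: 85, 2000: 95}
burnt  = {1800: 60, 1850: 45, 1900: 30, 1950: 15, 2000: 5}

def regular_share(year):
    """Fraction of past-tense 'burn' spelled with the regular (weak) form."""
    return burned[year] / (burned[year] + burnt[year])

trajectory = {y: regular_share(y) for y in sorted(burned)}
# A rising share would indicate regularization of the verb over time.
```

Running this over a stated list of verbs, with dialect-tagged counts and error estimates, is exactly the level of detail one wishes the article had spelled out.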

{to be continued}
