[Note: I got bogged down on reading and reporting on the culturomics paper a little too closely, and this is a reboot.]
Peter Norvig, one of the co-authors of the culturomics paper and the director of research at Google, also co-authored another significant article with the suggestive title “The Unreasonable Effectiveness of Data.” The invention of the term “culturomics” suggests a scientific programme for attacking questions of culture that stresses statistical models based on large amounts of data, a programme that has been very successful both academically and commercially for linguistics and artificial intelligence, to say nothing of the informatics approaches of genomics and proteomics upon which the term culturomics is based. The slogan “every time I fire a linguist, the performance of my system goes up” (a slight misattribution of something Frederick Jelinek, a pioneering computational linguist, said) restates the same stance. Among the bets and assumptions made by this approach are:
(1) The goal of science is better-engineered systems, which have practical, commercializable outcomes.
(2) Models must be empirically testable, with precise, independent, repeatable evaluation metrics and procedures.
(3) Simple quantitative models based on large amounts of data will perform better, in the senses of (1) and (2), than complex qualitative models based on small amounts of data.
The successes attributable to big data programmes include effective search engines, speech interfaces, and automated translation. Google’s rigorous approach to big data affects nearly every aspect of its business: core search for starters, but even more important are the big data approaches that let Google make money on its search and decrease its operating costs.
Studies of culture are currently, for the most part, either done using complex, qualitative models, or based on relatively small amounts of data. The Google N-gram data is, perhaps, an opening salvo in an attack on qualitative/small data approaches to studies of culture, which are to be replaced with quantitative/big data approaches. The quantitative/big data programme has been “unreasonably effective” in overturning how researchers in linguistics, artificial intelligence, and cognitive science approach their fields and get their projects funded. The bet, here, is that the same will occur in other studies of culture.
There are many problems with the example experiments described in the culturomics papers. The experiments are often not described in enough detail to be replicable. Proof is often by example rather than by large-scale evaluation metrics. The proofs often resemble just-so stories (explanations without adequate controls) or unsurprising results (for example, that Nazis suppressed and censored writers with whom they disagreed is borne out by the data). The scope of the experiments is often very limited (for example, the section on “evolution of grammar” laughably only describes the changes occurring in a small subset of strong/weak verbs).
Because this is an overview paper, it may be that some of the important details are missing for reasons of space. Some of these things are addressed in the supporting materials, but by no means all. For example, something as basic as the method used for tokenization (how the successive strings of characters in the digital copies of the books of the corpora are split into tokens) is not really defined well enough to be repeatable. How, for example, does the system tokenize “T.S. Eliot”? Is this tokenized the same way as “T. S. Eliot” or “TS Eliot”? Based on the sample N-gram viewer, it appears that, to find mentions of T.S. Eliot, the search string “TS Eliot” (similarly WH Auden, CS Lewis) must be used. The supplemental and related supplemental materials give many details, but in the end refer to proprietary tokenization routines used at Google.
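To make the tokenization worry concrete, here is a minimal sketch in Python. The three rules below are my own illustrative assumptions, not Google's routines (which are proprietary and not described here); the point is only that equally plausible rules turn the three spellings of the same name into different token sequences, and so into different N-grams.

```python
import re

# Three hypothetical tokenization rules. None of them is claimed to be
# what Google uses; they only illustrate how much the choice matters.

def split_punct(text):
    # Each run of letters/digits is a token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def strip_periods(text):
    # Delete periods first, then split on whitespace.
    return text.replace(".", "").split()

def whitespace_only(text):
    # Split on whitespace and leave punctuation attached to the words.
    return text.split()

variants = ["T.S. Eliot", "T. S. Eliot", "TS Eliot"]

for rule in (split_punct, strip_periods, whitespace_only):
    print(rule.__name__)
    for v in variants:
        print("   ", repr(v), "->", rule(v))
```

Under the second rule, “T.S. Eliot” and “TS Eliot” collapse into the same two tokens while “T. S. Eliot” becomes three; under the other two rules the groupings differ again. Without knowing which rule (if any of these) was applied, a count for a given search string cannot be reproduced.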
And yet, there are some useful ideas here. Because the N-gram data is time-stamped, looking at some kinds of time-varying changes is possible. The idea of measuring the half-life of changes is a powerful one, and the way half-lives vary is interesting in their analysis of “fame” (in reality, their analysis of name mentions). Seeing how some verbs are becoming strong in the face of a general tendency towards regularization is interesting. And the lexicographic estimates seem very valuable (if not very “culturomic” to me).
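As a rough illustration of what a half-life measurement on time-stamped N-gram data could look like, here is a small Python sketch. The definition used (years from the peak until the frequency first drops to half the peak value or below) is a simplifying assumption of mine and may differ from the paper's exact procedure; the function name and the numbers are invented for the example and are not real N-gram counts.

```python
def half_life(series):
    """Years from the peak until frequency first falls to half the peak or below.

    `series` maps year -> relative frequency of a name or phrase.
    Returns None if the frequency never falls that far after the peak.
    This is a simplified definition assumed for illustration.
    """
    peak_year = max(series, key=series.get)
    half_of_peak = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half_of_peak:
            return year - peak_year
    return None

# Invented toy data, not taken from the Google N-gram corpus.
toy_mentions = {1900: 1.0, 1910: 4.0, 1920: 8.0, 1930: 6.0, 1940: 3.5, 1950: 2.0}
print(half_life(toy_mentions))  # peak in 1920, at or below half by 1940 -> 20
```

Even this toy version raises the questions a replicable study would have to answer: how the series is smoothed, how the peak is chosen when there are several, and what happens to names that have not yet declined.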
A danger in the approach of culturomics is that, by focusing on what can be processed on a large scale, and measured with precision, interesting scientific questions will be left unexplored, perhaps especially when those questions are not of obvious economic benefit. Engineers build better engines, not necessarily better theories.
Having said all of this, I remain optimistic about the release of the Google N-gram data, even as I resist the term and approach suggested by culturomics. Yes, Google needs to provide better descriptions of the data, to continue to clean up the data and metadata (as well as describe and report on what “cleaned up” means; some of this is, in fact, described in the supplementary data), and to be much more transparent about access to the underlying documents, when permissible by law. It would be very useful for Google to provide algorithmic access to the data rather than just make the (very large) data sets available. But these data can be mined for interesting patterns and trends, and it will be interesting to see what researchers do with them. Let’s just call it time-stamped N-gram data, though, and eschew the term culturomics.