When I visited a doughnut shop today, I remembered a post by Mark Liberman about the disambiguation of complex nominals. The post points out that, when trying to determine automatically whether a phrase like “sickle cell anemia” is (say) sickle-type cell anemia or an anemia of sickle cells (that is, whether it’s [sickle] [cell anemia] or [sickle cell] [anemia]), you can do something very simple that works very well in practice: count the number of times “sickle cell” and “cell anemia” occur in a large body of text, and the more frequent pair is likely to be the right grouping. “Sickle cell” is much more common than “cell anemia” in Medline, so, based on counting alone (with no semantic understanding of sickle cells or anemia), the best grouping is [sickle cell] [anemia].
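To make the counting trick concrete, here is a minimal Python sketch. The function names (`bigram_counts`, `bracket`) and the four-sentence toy corpus are made up for illustration; they are not from Liberman's post, and a real run would count over something the size of Medline. Breaking ties in favor of the left grouping is an arbitrary choice here.

```python
from collections import Counter

def bigram_counts(texts):
    """Count every adjacent word pair across a collection of texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def bracket(w1, w2, w3, counts):
    """Group a three-word nominal around its more frequent adjacent pair."""
    left = counts[(w1, w2)]    # e.g. "sickle cell"
    right = counts[(w2, w3)]   # e.g. "cell anemia"
    if left >= right:          # ties go to the left grouping (an assumption)
        return f"[{w1} {w2}] [{w3}]"
    return f"[{w1}] [{w2} {w3}]"

# Hypothetical toy corpus, standing in for a large body of text.
toy_corpus = [
    "sickle cell disease is inherited",
    "sickle cell anemia affects red blood cells",
    "the sickle cell trait is common",
    "cell anemia is rarely written on its own",
]

counts = bigram_counts(toy_corpus)
print(bracket("sickle", "cell", "anemia", counts))  # [sickle cell] [anemia]
```

In this toy corpus “sickle cell” appears three times and “cell anemia” twice, so the left grouping wins, with no knowledge of what either phrase means.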
OK, so in the doughnut shop I saw signs for the various kinds of doughnuts, and one said “Boston cream.”
Hmm. Are these Boston-style cream doughnuts, or doughnuts made of Boston cream? [Boston] [cream doughnuts], or [Boston cream] [doughnuts]?
And what about Boston cream pie? Boston-style cream pie [Boston] [cream pie]? Or pie made of Boston cream [Boston cream] [pie]?
So, firing up the answer box (Google):
Boston cream: about 31,600
Cream pie: about 1,230,000
Cream doughnut(s)/cream donut(s): about 10,300
Apparently, it’s [Boston] [cream pie] and [Boston cream] [doughnuts].
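The comparison behind that conclusion can be sketched in a few lines of Python. The hit counts are the approximate figures quoted above; the `hits` dictionary and the `group` helper are illustrative names, and “cream doughnut” stands in for all four spelling variants that were counted together.

```python
# Approximate hit counts quoted above.
hits = {
    "boston cream": 31_600,
    "cream pie": 1_230_000,
    "cream doughnut": 10_300,  # doughnut(s)/donut(s) combined
}

def group(first, second, third, hits):
    """Bracket a three-word phrase around whichever adjacent pair has more hits."""
    left = f"{first} {second}"
    right = f"{second} {third}"
    if hits[left] > hits[right]:
        return f"[{left}] [{third}]"
    return f"[{first}] [{right}]"

print(group("boston", "cream", "pie", hits))       # [boston] [cream pie]
print(group("boston", "cream", "doughnut", hits))  # [boston cream] [doughnut]
```

“Cream pie” swamps “Boston cream,” while “Boston cream” beats “cream doughnut,” which is why the two phrases come out bracketed differently.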
At first, this seemed a counter-example to the counting trick: Shouldn’t they both parse the same way? On second thought, though, I realized that the sign said “Boston cream” not “Boston cream doughnuts,” and, really, a “Boston cream doughnut” is a doughnut made with “Boston cream pie cream,” which one might just call “Boston cream.”
I suspect this trick indicates (again, without doing anything but counting) that Boston cream pie predates Boston cream doughnuts.
I also saw a recipe for [sour cream] [doughnuts] along the way. That’s just wicked.