Ruby project to access Microsoft Ngram data
September 3, 2010
I am pleased to announce the availability of a Ruby library to access the Microsoft Ngram data. This data, collected in June 2009, currently includes 1- to 4-gram data for anchor text, 1- to 3-gram data for body text, 1- to 4-gram data for page titles, and 1- to 3-gram data for queries. See the Bing/MSR Ngram data page for general information. Although I am a Microsoft employee, this software is provided by me, not Microsoft.
Microsoft provides a SOAP API and a Python REST-based library, but this is (I think) the first Ruby library. You can get a copy on GitHub: http://github.com/willf/microsoft_ngram.
I hope, in the days to come, to write some example programs that show the power of this data resource for research. But, to whet the appetite, should we parse “Boston cream pie” as [Boston [cream pie]] or [[Boston cream] pie]? That is, is it a cream pie made in the Boston style, or a pie made with Boston cream? If the former, “cream pie” should be more frequent than “Boston cream”; if the latter, the opposite.
> MicrosoftNgram.new(:model => 'bing-body/jun09/2').jps(['boston cream', 'cream pie'])
=> [["boston cream", -7.231685], ["cream pie", -6.027882]]
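These are log probabilities, so the value closer to zero marks the more frequent phrase. Since “cream pie” (-6.027882) scores higher than “boston cream” (-7.231685), this model suggests [Boston [cream pie]] is the correct bracketing.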
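To make the comparison concrete, here is a minimal sketch in plain Ruby that applies the same rule to the two scores quoted above. It does not call the service or the library. The helpers bracket and probability_ratio are hypothetical names invented for this illustration, and the ratio calculation assumes the scores are base-10 logarithms, which the output itself does not state.

```ruby
# Joint log probabilities copied from the output quoted in the post
# (model bing-body/jun09/2); no network call is made here.
SCORES = {
  "boston cream" => -7.231685,
  "cream pie"    => -6.027882
}.freeze

# Hypothetical helper (not part of the microsoft_ngram library): given a
# three-word phrase and log probabilities for its two bigrams, return the
# bracketing that groups the more probable bigram together.
def bracket(words, scores)
  first, second, third = words
  left  = scores.fetch("#{first} #{second}".downcase)
  right = scores.fetch("#{second} #{third}".downcase)
  if right > left
    "[#{first} [#{second} #{third}]]"
  else
    "[[#{first} #{second}] #{third}]"
  end
end

# Assumption: the scores are base-10 logarithms, so a difference d between
# two log probabilities corresponds to a probability ratio of 10**d.
def probability_ratio(log_a, log_b)
  10.0**(log_a - log_b)
end

bracketing = bracket(%w[Boston cream pie], SCORES)
ratio = probability_ratio(SCORES["cream pie"], SCORES["boston cream"])

puts bracketing
puts format("'cream pie' is roughly %.0f times as probable as 'boston cream'", ratio)
```

Working in log space means the comparison is a simple greater-than test, and the gap between the two scores only needs exponentiating if you want a human-readable ratio.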