A weblog by Will Fitzgerald

Ruby project to access Microsoft Ngram data

I am pleased to announce the availability of a Ruby library to access the Microsoft Ngram data. This data currently includes 1- to 4-gram data for anchor text, 1- to 3-gram data for body text, 1- to 4-gram data for page titles, and 1- to 3-gram data for queries, all collected in June 2009. See the Bing/MSR Ngram data page for general information. Although I am a Microsoft employee, this software is provided by me, not Microsoft.

Microsoft provides a SOAP API and a Python REST-based library, but this is (I think) the first Ruby library. You can get a copy at GitHub: http://github.com/willf/microsoft_ngram.

I hope, in the days to come, to write some example programs that show the power of this data resource for research. But, to whet the appetite, should we parse “Boston cream pie” as [Boston [cream pie]] or [[Boston cream] pie]? That is, a cream pie made in the Boston style, or a pie made with Boston cream? If the former, “cream pie” should be more frequent than “Boston cream”; if the latter, the opposite.

> MicrosoftNgram.new(:model => 'bing-body/jun09/2').jps(['boston cream', 'cream pie'])
=> [["boston cream", -7.231685], ["cream pie", -6.027882]]
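These are log probabilities, so the less negative value marks the more frequent bigram. “cream pie” scores higher than “boston cream”, so this model suggests [Boston [cream pie]] is the correct bracketing.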
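
To make the comparison concrete, here is a minimal sketch that works only from the two numbers printed above, so it needs neither the library nor a network connection. It assumes the log probabilities are base 10 (the post does not say which base the service uses), and the variable names are mine, not part of the library.

```ruby
# Log probabilities copied verbatim from the output above.
scores = [["boston cream", -7.231685], ["cream pie", -6.027882]]

# The bigram with the higher (less negative) log probability is more frequent.
best_phrase, best_logp = scores.max_by { |_phrase, logp| logp }
_other_phrase, other_logp = scores.min_by { |_phrase, logp| logp }

# ASSUMPTION: the logs are base 10. Change log_base if the service
# reports natural logs instead.
log_base = 10.0

# A difference of logs is the log of a ratio, so exponentiating the
# difference gives how many times more probable the winner is.
ratio = log_base**(best_logp - other_logp)

# Map the winning inner bigram to the bracketing it supports.
bracketing =
  if best_phrase == "cream pie"
    "[Boston [cream pie]]"
  else
    "[[Boston cream] pie]"
  end

puts "#{best_phrase} is roughly #{ratio.round} times more probable; prefer #{bracketing}"
```

Under the base-10 assumption the gap of about 1.2 in log probability is more than an order of magnitude, which is why the bracketing decision is not a close call here.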
