Using Ngram data for segmentation
September 8, 2010
I’ve updated the Microsoft NGram Ruby library to provide an example use: segmenting Twitter hashtags. Hashtags have been used for some time to tag tweets according to users’ choice and whimsy. Coincidentally, my daytime boss has just conducted an interview with William Morgan (formerly of Powerset, now at Twitter) about hashtags, in case you want to come up to speed on what they are. It’s a fun read in any case, and includes the origin story of hashtags.
Anyway, the Ruby library allows you to segment text; it uses the Bing unigram and bigram data to guess the most likely segmentation. Here are some hashtags from my timeline today, with their segmentations:
# > segment("bpcares")
# => ["bp", "cares"]
# > segment("Twitter")
# => ["Twitter"]
# > segment("writers")
# => ["writers"]
# > segment("iamwriting")
# => ["i", "am", "writing"]
# > segment("backchannel")
# => ["back", "channel"]
# > segment("tcot")
# => ["tcot"]
# > segment("vacationfallout")
# => ["vacation", "fall", "out"]
The code closely follows Peter Norvig’s discussion of segmentation in his chapter “Natural Language Corpus Data” in the book Beautiful Data. The only differences are (1) it uses the Web-based Bing service for the unigram and bigram probabilities, and (2) a small (perhaps) optimization: a split of the text is considered only when the probability of its first part reaches a certain threshold. ([“vacationf”, “allout”] is not a good split for “vacationfallout” because “vacationf” is very unlikely.)
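To make the idea concrete, here is a minimal sketch of that Norvig-style recursion in Ruby. It uses a tiny hard-coded unigram table as a stand-in for the Bing service (the probabilities are illustrative, not real corpus values), and it shows the pruning optimization: splits whose first part falls below a probability threshold are discarded.

```ruby
# Stand-in for the Bing unigram data; values are made up for illustration.
UNIGRAMS = {
  "vacation" => 1e-5, "fall" => 1e-4, "out" => 1e-3, "fallout" => 1e-8,
  "i" => 1e-2, "am" => 1e-3, "writing" => 1e-4
}
UNIGRAMS.default = 0.0

THRESHOLD = 1e-9  # prune any split whose first part is less likely than this

# Returns [words, probability] for the best segmentation of text.
def segment(text, memo = {})
  return [[], 1.0] if text.empty?
  memo[text] ||= (1..text.length).filter_map { |i|
    first, rest = text[0...i], text[i..]
    p_first = UNIGRAMS[first]
    next if p_first < THRESHOLD         # the pruning optimization
    rest_words, p_rest = segment(rest, memo)
    [[first] + rest_words, p_first * p_rest]
  }.max_by { |_, p| p } || [[text], 0.0] # unknown text stays unsegmented
end

words, _prob = segment("vacationfallout")
# words => ["vacation", "fall", "out"]
```

With real Web-scale data the table lookup becomes a service call, but the shape of the recursion (and the memoization that makes it tractable) stays the same.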
The code would be better if it batched the calls for probabilities rather than requesting them one by one. Norvig’s code also has the advantage of running offline; on the other hand, the Bing data is more recent.
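One way to do that batching: enumerate every candidate substring up front, fetch all their probabilities in a single request, and then segment against the cached table. The sketch below is only an outline of that idea; `candidate_substrings` and `fetch_probs` are hypothetical helpers I’m inventing here, with `fetch_probs` standing in for one batched call to the ngram service.

```ruby
# All substrings of text up to max_len characters -- the only strings
# whose probabilities a segmentation of text could ever need.
def candidate_substrings(text, max_len = 12)
  (0...text.length).flat_map { |i|
    (1..[max_len, text.length - i].min).map { |len| text[i, len] }
  }.uniq
end

# Hypothetical stand-in: a real implementation would issue one batched
# request to the ngram service instead of this hard-coded table.
def fetch_probs(words)
  table = { "back" => 1e-4, "channel" => 1e-5, "backchannel" => 1e-9 }
  words.to_h { |w| [w, table.fetch(w, 0.0)] }
end

probs = fetch_probs(candidate_substrings("backchannel"))
```

A hashtag of length n has on the order of n * max_len candidate substrings, so one round trip replaces the many small requests the naive recursion would make.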
Anyway, enjoy the code: it can be found in my GitHub account.
It requires (in addition to the microsoft_ngram library and its dependencies) the memoize gem.