Will.Whim

A weblog by Will Fitzgerald

Using Ngram data for segmentation

I’ve updated the Microsoft NGram Ruby library to provide an example use: segmenting Twitter hashtags. Twitter hashtags have been used for some time to tag tweets according to users’ choice and whimsy. Coincidentally, my daytime boss has just conducted an interview with William Morgan–formerly of Powerset, but now at Twitter–about hash tags, in case you want to come up to speed on what they are. It’s a fun read in any case, including the origin story of hash tags.

Anyway, the Ruby library allows you to segment text; it uses the Bing unigram and bigram data to guess the mostly likely segmentation. Here are some hashtags in my timeline from today, and their segmentations:

#  > segment("bpcares")
#  => ["bp", "cares"]
#  > segment("Twitter")
#  => ["Twitter"]
#  > segment("writers")
#  => ["writers"]
#  > segment("iamwriting")
#  => ["i", "am", "writing"]
#  > segment("backchannel")
#  => ["back", "channel"]
#  > segment("tcot")
#  => ["tcot"]
#  > segment("vacationfallout")
#  => ["vacation", "fall", "out"]

The code closely follows Peter Norvig’s discussion of segmentation in his chapter “Natural Language Corpus Data” in the book Beautiful Data. The only differences are (1) using the Web-based data for unigram and bigram data, and (2) a small optimization (perhaps) of creating splits on text only when the first part of the split reaches a certain probability threshold. ([“vacationf” “allout”] is not a good split for “vacationfallout” because “vacationf” is very unlikely).

The code would be better if it batched the calls for probabilities rather than requesting them one-by-one. Norvig’s code also has the advantage of running off-line. The Bing data is more recent.

Anyway, enjoy the code: it can be found in my GitHub account:

http://github.com/willf/microsoft_ngram/blob/master/examples/segment.rb

It requires (in addition to the microsoft_ngram library and its dependencies) the memoize gem.

Advertisements

3 responses to “Using Ngram data for segmentation

  1. Daniel Lemire September 8, 2010 at 1:03 pm

    Come on! Give us Norvig’s algorithm!

    ;-)

  2. Will Fitzgerald September 8, 2010 at 1:18 pm

    Daniel,

    The algorithm is described in his ngram chapter, but essentially it is a Viterbi search through the possible segementations. It is O(nL^2), where n is the number of tokens in the string, and L is the length of the string.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: