Will.Whim

A weblog by Will Fitzgerald

Monthly Archives: July 2008

parallel line-oriented file processing

At work, I’ve been doing a lot of line-oriented file processing, for example, of the tab-separated value files produced by the Freebase project (downloads). This is similar in spirit to Tim Bray’s ‘wide finder’ project, and I’ve leveraged his popularity to find a useful utility created by Preston L. Bannister called “feed-workers” that implements the ‘map’ part of map-reduce (but over a large file, rather than a large set of files).

Initial tests look good; for example, a nearly threefold speedup in wall-clock time (4m20s versus 11m58s) on a processing loop over the 81 million lines in the Freebase tsv file.

$ time ./feed-workers -n 8 -r /usr/bin/ruby -s ~/just_names.rb /bfd/dv/freebase_download/current/freebase-datadump-quadruples.tsv > /tmp/n1

real 4m20.682s
user 13m52.671s
sys 0m42.477s

$ time cat /bfd/dv/freebase_download/current/freebase-datadump-quadruples.tsv | ruby ~/just_names.rb > /tmp/n2

real 11m58.470s
user 11m32.207s
sys 0m27.628s
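For the curious, here is a minimal sketch of the general idea of a 'map' over byte ranges of one large file. It is not how feed-workers itself is implemented (I have not reproduced its internals); the names `byte_ranges`, `each_line_in_range` and `parallel_map_lines` are made up for illustration, and it assumes a platform where Ruby's `fork` is available. Each worker handles exactly the lines that begin inside its byte range, so a line straddling a boundary is processed once, by the worker whose range contains its first byte.

```ruby
require 'tmpdir'

# Split a file of `size` bytes into at most `n` contiguous ranges [start, stop).
def byte_ranges(size, n)
  chunk = (size + n - 1) / n
  (0...n).map { |i| [i * chunk, [(i + 1) * chunk, size].min] }
         .reject { |start, stop| start >= stop }
end

# Yield every line whose first byte lies inside [start, stop).
def each_line_in_range(path, start, stop)
  File.open(path, 'rb') do |f|
    if start > 0
      # Step back one byte and read to the next newline. This consumes the
      # tail of a straddling line (owned by the previous range), or just the
      # newline itself when a line begins exactly at `start`.
      f.seek(start - 1)
      f.gets
    end
    while f.pos < stop && (line = f.gets)
      yield line
    end
  end
end

# Apply `block` to every line using forked workers; returns the mapped
# output concatenated in file order.
def parallel_map_lines(path, workers, &block)
  Dir.mktmpdir do |dir|
    ranges = byte_ranges(File.size(path), workers)
    pids = ranges.each_with_index.map do |(start, stop), i|
      fork do
        File.open(File.join(dir, "part-#{i}"), 'wb') do |out|
          each_line_in_range(path, start, stop) { |line| out.write(block.call(line)) }
        end
        exit!(0) # leave the child without running the parent's exit hooks
      end
    end
    pids.each { |pid| Process.wait(pid) }
    ranges.each_index.map { |i| File.binread(File.join(dir, "part-#{i}")) }.join
  end
end
```

Something like `parallel_map_lines(path, 8) { |line| line.split("\t").first + "\n" }` would then stand in for the kind of per-line script used above. Each worker writes to its own part file, which keeps the output in file order at the cost of one extra pass to concatenate the parts.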


O(log(N)) array insertion in Ruby

>> require 'bdb'
>> x = BDB::Btree.open('/tmp/foo.db', nil, 'w+', {'set_bt_compare' => lambda {|a,b| a.to_i <=> b.to_i}})
=> #
>> (0..9).to_a.sort_by{rand}.each{|i| x[i] = i};true
=> true
>> x.keys.map{|i| i.to_i}
=> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
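A small plain-Ruby sketch of why a numeric comparator matters, assuming (as the `to_i` calls above suggest) that the keys end up stored as strings and that the default Btree ordering is bytewise. With only the keys 0 through 9 the two orderings happen to agree, so this sketch uses 0 through 11; the names `by_number`, `bytewise` and `numeric` are just for illustration and no BDB calls are involved.

```ruby
# The same comparison as in the 'set_bt_compare' lambda: compare keys as integers.
by_number = lambda { |a, b| a.to_i <=> b.to_i }

# Keys as strings, in random order.
keys = (0..11).map(&:to_s).shuffle

# Default string ordering compares character by character, so "10" and "11"
# sort before "2".
bytewise = keys.sort

# Sorting with the numeric comparator restores the expected order.
numeric = keys.sort(&by_number)

puts bytewise.inspect
puts numeric.inspect
```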

A few personal notes …

It’s been too long since my last update, but life has been busy, especially with the Microsoft purchase. But a few notes:

  1. I enjoyed singing Sacred Harp at the Michiana and Kalamazoo annual singings this weekend, and James Nelson-Gingerich gave me the *first* copy of the print version of the 26th edition of the Harmonia Sacra for my work on the Harmonia Sacra website.
  2. I’m looking forward to a family reunion of all my brothers (five of us!) this coming weekend.
  3. I’ve been off to California a couple of times to meet about the Microsoft purchase, and I got my first “Microsoft Live!” tee-shirt.
  4. Summer in Michigan is a wonderful thing this year.
  5. We just had our one-year anniversary of living in ‘the new house.’ So I guess it isn’t the new house, especially since we finally sold the old one.

Thoughts on the Microsoft acquisition

(The usual disclaimers: my opinion only, not my current or future employers’.)

When Powerset began a couple of years ago, a lot of commentators called us (and still do call us) a would-be Google killer. This, despite repeated comments by senior staff that this wasn’t what we were about. As a company, Google is hard to beat. Our goal was audacious, but not that audacious. Our goal was to build a better search experience: to use natural language technology to provide better search results, by better understanding both web documents and user queries.

But natural language technology has always only been part of the mix. We have, from the beginning, seen ourselves as doing “keywords plus”; that is, we have always planned to do what the other search engines do (keyword search, link analysis, blah, blah, blah), but add on top of this signals coming from parsing and semantic understanding. For example, we’d like to do as good a job as Google (say) on queries like ‘powerset microsoft’, but do even better on queries such as ‘Who acquired Powerset?’ and ‘Which company did Microsoft just buy?’ and everything in between.

What I didn’t realize when I joined the company is how some of the same technology would create innovations in the user interface, too. Powerset’s ‘Factz’ are a nice addition to the standard search page, and our ‘snippets’ are the best in the business. When I first typed in ‘stars of BSG’ in the Powerset search box, I was floored by the beauty of the results.

So I think we met our audacious goal: a better search experience. Microsoft seems to think so; after all, they bought the company.

And here’s the thing: we were bought by Microsoft. Microsoft’s market cap is still 90 billion dollars greater than Google’s. If anyone is able to capitalize a little ol’ startup like Powerset to make us a big player in search, it’s Microsoft. In fact, it’s clear (to me at least) we have a new mission, which is just the old mission the pundits wrongly labeled us with at the start: As a search company, our mission is now to beat Google.

Interesting times ahead.

Who acquired Powerset?

microsoft acquires powerset