A weblog by Will Fitzgerald

parallel line-oriented file processing

At work, I’ve been doing a lot of line-oriented file processing, for example on the tab-separated value files produced by the Freebase project (downloads). This is similar in spirit to Tim Bray’s ‘wide finder’ project, and I’ve leveraged his popularity to find a useful utility by Preston L. Bannister called “feed-workers”, which implements the ‘map’ part of map-reduce (but over one large file, rather than a large set of files).

Initial tests look good: for example, a nearly threefold speedup in wall-clock time (about 4m21s versus 11m58s) on a processing loop over the 81 million lines in the Freebase TSV file.

$ time ./feed-workers -n 8 -r /usr/bin/ruby -s ~/just_names.rb /bfd/dv/freebase_download/current/freebase-datadump-quadruples.tsv > /tmp/n1

real 4m20.682s
user 13m52.671s
sys 0m42.477s

$ time cat /bfd/dv/freebase_download/current/freebase-datadump-quadruples.tsv | ruby ~/just_names.rb > /tmp/n2

real 11m58.470s
user 11m32.207s
sys 0m27.628s
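
For anyone who wants to check the arithmetic, here is a small sketch that turns the `time` figures above into a wall-clock speedup and a rough "average busy cores" number for the parallel run. The helper name `seconds` is made up for this example, and (user + sys) / real is only a rough proxy for how many cores were kept busy.

```ruby
# Convert a `time`-style duration such as "4m20.682s" to seconds.
# (`seconds` is a hypothetical helper written for this example.)
def seconds(t)
  m = t.match(/\A(\d+)m([\d.]+)s\z/) or raise ArgumentError, "bad duration: #{t}"
  m[1].to_i * 60 + m[2].to_f
end

# Figures copied from the two runs above.
parallel = { real: seconds("4m20.682s"),  user: seconds("13m52.671s"), sys: seconds("0m42.477s") }
serial   = { real: seconds("11m58.470s"), user: seconds("11m32.207s"), sys: seconds("0m27.628s") }

# Wall-clock speedup of the feed-workers run over the single-process run.
speedup = serial[:real] / parallel[:real]

# Rough proxy for how many cores were busy on average in the parallel run.
cores_busy = (parallel[:user] + parallel[:sys]) / parallel[:real]

puts format("wall-clock speedup: %.2fx", speedup)
puts format("average busy cores (parallel run): %.2f", cores_busy)
```

On these numbers the speedup comes out at roughly 2.76, with a bit over three cores busy on average, even though eight workers were requested.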

2 responses to “parallel line-oriented file processing”

  1. Daniel Lemire July 31, 2008 at 2:33 pm

    Is this with one hard drive or several?

    Because reading lines is pretty IO-bound. It seems to me that it is not very open to parallelization, since parallelization implies random seeks, which are going to take away IO bandwidth very fast.

  2. Preston L. Bannister December 6, 2008 at 10:53 pm

    Glad you found the tool useful (which after all was the intent of publishing in a public space).

    To Daniel – this is reading sequentially through a file (at the maximum possible rate allowed by the hardware and OS using C++ code) and throwing “chunks” (groups of lines) in turn down more than one pipe. This exactly suits the present, where we have tasks (especially in higher level interpreted languages) that cannot process data at raw-disk-read rates, and more than one physical CPU.
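
The idea described in the reply above (one sequential reader dealing chunks of whole lines to several worker pipes) can be sketched roughly as follows. This is not the feed-workers implementation, which is C++; it is a minimal Ruby illustration. The names `start_worker` and `feed_chunks`, the round-robin dealing, the chunk size, and the one-output-file-per-worker arrangement are all assumptions made for the example.

```ruby
require "rbconfig"
require "stringio"
require "tempfile"

# Start one worker process whose standard input is a pipe we can write to and
# whose standard output goes to `out_path`. (Hypothetical helper.)
def start_worker(command, out_path)
  reader, writer = IO.pipe
  pid = Process.spawn(*command, in: reader, out: out_path)
  reader.close # the parent only writes to the worker
  [writer, pid]
end

# Read `input` sequentially and deal chunks of whole lines, round-robin, to
# one worker per entry in `out_paths`. Writes block if a worker falls behind,
# which keeps the sketch simple.
def feed_chunks(input, command, out_paths, chunk_lines: 1000)
  workers = out_paths.map { |path| start_worker(command, path) }
  input.each_line.each_slice(chunk_lines).with_index do |chunk, i|
    writer, _pid = workers[i % workers.size]
    writer.write(chunk.join)
  end
ensure
  # Closing the pipe signals end-of-input; then wait for each worker to exit.
  workers&.each do |writer, pid|
    writer.close unless writer.closed?
    Process.wait(pid)
  end
end

if __FILE__ == $PROGRAM_NAME
  # Demo: three workers that upcase each line, fed four lines at a time.
  input = StringIO.new((1..10).map { |i| "line #{i}\n" }.join)
  outs = Array.new(3) { Tempfile.new("worker") }
  upcase_worker = [RbConfig.ruby, "-ne", "print $_.upcase"]
  feed_chunks(input, upcase_worker, outs.map(&:path), chunk_lines: 4)
  outs.each { |f| print File.read(f.path) }
end
```

Because each chunk is made of complete lines, every worker sees well-formed input, but output order across workers is not the input order, so this only suits processing where each line can be handled independently.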
