parallel line-oriented file processing
July 31, 2008
Posted by on
At work, I’ve been doing a lot of line-oriented file processing, for example, of the tabbed-separated value files produced by the Freebase project (downloads). This is similar in spirit to Tim Bray’s ‘wide finder’ project, and I’ve leveraged his popularity to find a useful utility created by Preston l. Bannister called “feed-workers” that implements the ‘map’ part of map-reduce (but over a large file, rather than a large set of files).
Initial tests look good; for example, a nearly 3 times speedup on a processing loop over the 81 million lines in the Freebase tsv file.
$ time ./feed-workers -n 8 -r /usr/bin/ruby -s ~/just_names.rb /bfd/dv/freebase_download/current/freebase-datadump-quadruples.tsv > /tmp/n1
$ time cat /bfd/dv/freebase_download/current/freebase-datadump-quadruples.tsv | ruby ~/just_names.rb > /tmp/n2