Let’s say you want to investigate the use of “tweet” as a verb (see “Tweet this” at Language Log). You’d like to collect, oh, 10,000 examples or so and do some concordance work on tweets like these:
What iss the most popular question then? Tweet the answer and hopefully u may only get asked 500 times?
what is there to tweet about this morning?
What is your biggest food weakness? Tweet @Thintervention for motivation! #thinterventionG
This is simple to do with a bash command line, Perl, Ruby, the TweetStream gem, and a spreadsheet program (or just plain old grep).
To download 10,000 tweets containing “tweet,” “tweets,” or “tweeting” and save them in a file called “tweet.tweets”:
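> require 'tweetstream'
> @client = TweetStream::Client.new('user','pass')
> File.open("tweet.tweets", "w+") do |f|
    n = 0
    @client.track('tweet','tweets','tweeting') do |s|
      f.puts s.text.gsub(/\s+/, ' ')  # write one tweet per line
      n += 1
      @client.stop if n >= 10000      # stop after 10,000 tweets
    end
  end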
When the download finishes, you can use perl to put tabs around the keyword, which splits each tweet into left context, keyword, and right context, and then sort on the keyword and right context:
> cat tweet.tweets | perl -pe 's/\b(tweet|tweets|tweeting)\b/\t$1\t/gi' | sort -f -k2,3 -t$'\t' > tweets.txt
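If you would rather stay in Ruby for this step, here is a rough sketch of the same idea. The helper names kwic_row and kwic_rows are made up for illustration, and unlike the perl one-liner it splits only on the first keyword in each tweet, so treat it as a starting point rather than an exact equivalent:

```ruby
# Matches the three tracked word forms, case-insensitively.
KEYWORD = /\b(tweet|tweets|tweeting)\b/i

# Hypothetical helper: split one tweet into [left context, keyword, right context]
# around the FIRST keyword match. Returns nil when the line has no keyword.
def kwic_row(line)
  m = KEYWORD.match(line)
  return nil unless m
  [m.pre_match, m[1], m.post_match]
end

# Hypothetical helper: build rows for all lines, then sort case-insensitively
# on the keyword and the right context (like sort -f -k2,3).
def kwic_rows(lines)
  rows = lines.map { |l| kwic_row(l.chomp) }.compact
  rows.sort_by { |_left, kw, right| [kw.downcase, right.downcase] }
end

# Usage: read tweet.tweets (if present) and write tab-separated rows to tweets.txt.
if __FILE__ == $PROGRAM_NAME && File.exist?("tweet.tweets")
  File.open("tweets.txt", "w") do |out|
    kwic_rows(File.readlines("tweet.tweets")).each { |row| out.puts row.join("\t") }
  end
end
```

Splitting on only the first match keeps every row at exactly three columns, which tends to be easier to handle in a spreadsheet.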
You can then import this file into your spreadsheet program and slice and dice to your heart’s content.
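As one example of slicing without a spreadsheet, here is a small Ruby sketch that tallies the word immediately following the keyword. It assumes tab-separated lines with the right context in the third column, and next_word_counts is a hypothetical name, not part of any library:

```ruby
# Hypothetical helper: count the first word of the right context (column 3)
# across tab-separated concordance lines. Returns [word, count] pairs,
# most frequent first, ties broken alphabetically.
def next_word_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    right = line.split("\t")[2]
    next if right.nil?                   # keyword was at the end of the tweet
    word = right[/[[:alnum:]@#']+/]      # first word-like token, keeping @ and #
    counts[word.downcase] += 1 if word
  end
  counts.sort_by { |word, n| [-n, word] }
end

# Usage, assuming tweets.txt exists in the current directory:
#   next_word_counts(File.readlines("tweets.txt")).first(20).each do |word, n|
#     puts "#{n}\t#{word}"
#   end
```

A preposition or determiner at the top of this list (“tweet about”, “tweet the”) is a quick hint about how often “tweet” is being used as a verb.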
Note: it took longer to write this blog post than it did to collect the data. Analysis to follow, though!