A weblog by Will Fitzgerald

Monthly Archives: November 2011

Computational Social Science on the cheap using Twitter

This is a followup to my post Computational lexicography on the cheap using Twitter, but more especially in response to Using off-the-shelf software for basic Twitter analysis.

The latter article shows how to use database software (MySQL and its implementation of the SQL language) to do basic Twitter analysis. The ‘basic analysis’ includes counts by hashtag, timelines, and word clouds. They analyse about 475k tweets.

But here’s the thing: all their analyses can be done more simply with plain text files and pipes of Unix commands (as most eloquently demonstrated in Unix for Poets, by Ken Church). In fact, several simple commands, ones I use every day, are powerful enough to do the kind of analyses they discuss.

Getting the data.

(You can skip over this if you have data already!)

Interestingly, they do not show how to get the tweets to begin with. My previous post discusses this, but it might be useful to show a simple Ruby program that collects Tweet data, especially since the method has changed slightly since my post. The biggest hurdle is setting up authentication to access Twitter’s data—discussed in full, here, but the crucial thing is that you have to register as a Twitter developer, register a Twitter application, and get special tokens. You create an application at the Twitter apps page; from that same location you generate the special tokens.

Here’s the Ruby script (also listed here).

require 'rubygems'
require 'tweetstream'
require 'date'

# Fill in the keys and tokens generated for your Twitter application
TweetStream.configure do |config|
  config.consumer_key = ''
  config.consumer_secret = ''
  config.oauth_token = ''
  config.oauth_token_secret = ''
  config.auth_method = :oauth
  config.parser   = :json_gem
end

# Change the words you want to track
TweetStream::Client.new.track('football', 'baseball', 'soccer', 'cricket') do |status|
  begin
    # The Tweet id
    id = status.id
    # The text of the tweet, with new lines (returns) replaced by spaces
    txt = status.text.gsub(/\n/," ")
    # The date of the tweet, printed out in a slightly more useful form 
    # for our purposes
    d = DateTime.parse(status.created_at).strftime("%Y-%m-%d\t%H:%M:%S")
    puts [id,txt,d].join("\t")
  rescue StandardError => e
    # Report a problem with one tweet and keep going
    puts "!!! Error: #{e.to_s}"
  end
end

With the proper keys and secrets, this gist will allow you to track keywords over time, and print out, in a tab-separated format, the tweet id, the text of the tweet, the date, and the time it was published (in UTC, or Greenwich, time). You could add additional columns, as described (by example) in the Twitter API.

The example here tracks mentions of football, baseball, soccer, and cricket, but obviously, these could be other keywords. Running this using this command:

ruby track_tweets.rb | tee nsports.tsv

will place tweets in the file ‘nsports.tsv’.
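Before analysing the file, it can be worth checking that every line really has the four tab-separated columns the script prints (id, text, date, time). Here is a minimal sketch of such a check, assuming that layout; the function name field_counts is made up for this example. A tweet whose text itself contains a tab would show up with more than four fields, since the script only replaces new lines.

```shell
# Hypothetical helper: for each number of tab-separated fields found,
# print "fields<TAB>lines". Assumes the 4-column layout described above.
field_counts() {
  awk -F'\t' '{ n[NF]++ } END { for (k in n) print k "\t" n[k] }' "$1" | sort -n
}

# Usage: field_counts nsports.tsv
```

If everything is well formed, the output is a single line starting with 4.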

Basic statistics

Counting the number of football, baseball, etc. mentions is easy:

$ grep -i football nsports.tsv | wc -l
$ grep -i baseball nsports.tsv | wc -l
$ grep -i soccer nsports.tsv | wc -l
$ grep -i cricket nsports.tsv | wc -l

Getting the number of lines in the file is just as easy:

$ cat nsports.tsv | wc -l
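The four keyword counts can also be produced in one go with a small loop. This is only a sketch: grep -c prints the number of matching lines, which is the same number you get by piping grep into wc -l, and the function name count_mentions is an invention for this example.

```shell
# Hypothetical helper: print "keyword<TAB>count" for each keyword given.
# Like the commands above, it counts matching lines, not individual mentions.
count_mentions() {
  file=$1
  shift
  for word in "$@"; do
    # grep -c still prints 0 when nothing matches
    printf '%s\t%s\n' "$word" "$(grep -ic -- "$word" "$file")"
  done
}

# Usage: count_mentions nsports.tsv football baseball soccer cricket
```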

The second analysis was to count who is retweeted the most, done by counting the username after the standard Twitter “RT ” (e.g. “rt @willf good stuff!”). The following pipeline of commands accomplishes this simply enough:

egrep -io "rt +@\w+" nsports.tsv | perl -pe "s/ +/ /g" | cut -f2 -d\  | sort | uniq -c | sort -rn | head

(This may be easier to copy from here). Each of these is a separate command, and the pipe symbol (|) indicates that the output from one command goes on to the next. Here’s what these commands do:

  1. egrep -io "rt +@\w+" nsports.tsv: searches through the tweets for the pattern RT, one or more spaces, @, and one or more ‘word’ characters. It prints only the matching parts (-o) and ignores differences in case (-i).
  2. perl -pe "s/ +/ /g": from time to time there is more than one space after the ‘RT’, so this replaces each run of one or more spaces with exactly one space.
  3. cut -f2 -d\  (the backslash escapes the space that serves as the delimiter): each line now looks like “RT @name”, and this command ‘cuts’ the second space-delimited field out of each line, leaving ‘@name’.
  4. sort | uniq -c | sort -rn: this is three commands, but I type them so frequently that it seems like one to me. The first sort groups identical names together so that uniq -c can count them, producing two columns: the count and the name. The final sort orders the lines numerically (-n) by that count, in reverse (-r), so the most retweeted names come first.
  5. head: this shows the first ten lines of its input.

This command pipeline should have no problem handling 475k lines.
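The same steps can be wrapped in a function and tried on a small file. The version below is a sketch with two deliberate changes from the pipeline above, both my own assumptions rather than anything from the article: tr -s ' ' stands in for the Perl call, and names are lower-cased before counting so that @WillF and @willf are tallied together. The function name top_retweeted is hypothetical, and the explicit character class [a-z0-9_] is used in place of \w. Like the original pattern, it will also match an "rt" that ends a longer word.

```shell
# Hypothetical helper: list the most retweeted accounts in a tweet file.
# $1 is the file; $2 is how many names to show (10 if omitted).
top_retweeted() {
  grep -Eio 'rt +@[a-z0-9_]+' "$1" |   # keep only the "RT @name" part
    tr -s ' ' |                         # squeeze repeated spaces into one
    cut -d' ' -f2 |                     # keep the @name field
    tr 'A-Z' 'a-z' |                    # fold case before counting
    sort | uniq -c | sort -rn |         # count, then rank by count
    head -n "${2:-10}"
}

# Usage: top_retweeted nsports.tsv 20
```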

The third analysis was to put the data in a format that can be used by Excel to create a graph, with counts by day. Because we printed the date and time in separate columns, with the date in column 3, we can simply do the cut, sort, uniq series:

cat nsports.tsv | cut -f3 | sort | uniq -c > for_excel.tsv

This will put the data into a format that Excel can read.
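One wrinkle: uniq -c pads each count with leading spaces and separates it from the date with a space rather than a tab. If your spreadsheet import stumbles on that, a short awk step can rewrite each line as date, tab, count. A sketch, with the function name counts_by_day chosen just for this example:

```shell
# Hypothetical helper: print "date<TAB>count" for each day in the file.
# Assumes the date is in column 3, as in the script's output.
counts_by_day() {
  cut -f3 "$1" | sort | uniq -c |
    awk '{ print $2 "\t" $1 }'   # uniq -c prints "count date"; swap and tab-separate
}

# Usage: counts_by_day nsports.tsv > for_excel.tsv
```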

Finally, the authors show how to create Wordle word graphs overall, and for the categories. I’m not a big fan of these as a data exploration tool, but notice you can use cut -f2 to get the text to paste into Wordle.
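For a plain-text alternative to a word cloud, the word-count recipe in the spirit of Unix for Poets can be applied to the text column. This is a rough sketch, not something from the article: the function name top_words is hypothetical, and the tokenization (runs of ASCII letters only) is deliberately crude, so hashtags, @names, and URLs get split into their letter runs.

```shell
# Hypothetical helper: show the most frequent words in the tweet text.
# $1 is the file; $2 is how many words to show (10 if omitted).
top_words() {
  cut -f2 "$1" |
    tr 'A-Z' 'a-z' |          # fold case
    tr -sc 'a-z' '\n' |       # turn every run of non-letters into one newline
    grep -v '^$' |            # drop any empty line left at the start
    sort | uniq -c | sort -rn |
    head -n "${2:-10}"
}

# Usage: top_words nsports.tsv 25
```

To look at one category, filter first (for example, grep -i cricket nsports.tsv > cricket.tsv) and run top_words on the smaller file.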

So, this is computational social science on the cheap using Twitter, using some basic Unix commands (cat, cut, sort, uniq, grep), with one tiny, tiny call to Perl. You can do this too–and it’s easier to learn than MySQL and SQL! Plus, you can easily read the text files that are created. All of this was done on a standard Mac, but any Unix machine, or Windows machine with the Cygwin tools installed, can do this as well.

Leroy Herron

When I was in junior high school at Burton Junior High School — that is, grades seven and eight — Mr Leroy Herron was a very important man in my life. He was a school counselor, and a coach for the basketball team. He was also the sponsor of the Human Relations Club, a club created to get black kids and white kids like me to learn more about what I now might call anti-racism, but then we mostly called non-discrimination. If I remember correctly, there were two white boys — Alan Kulevicz and me, and about a half dozen black girls. The school itself had a strong majority of white kids. I remember Mr Herron talking about how his son self-identified as “black,” while Mr Herron felt more comfortable, at that time, calling himself a Negro. If I recall correctly, African American, or Afro-American were also coming into vogue.

We once did a field trip to a school in Detroit where the students were all (or almost all) African American. I remember asking the principal how many of his staff were black, and how many were white. He had to stop and think, and he said that he didn’t primarily think of the teachers in racial terms. Since knowing whether someone was black or white was very important in my family, this came as a shock, and a new way of thinking.

Mr Herron loved sports, and he loved coaching. I wish I had been a decent ball player, but instead I just acted as the team’s manager. I don’t remember much about this experience, except I was at one point asked to keep score for the number of times players in the game showed “hustle,” and I had no idea how to do this, so I got razzed about it. I really was not a good manager (not as bad as I was as a baseball umpire, but that’s another story).

One time, I left school crying. I don’t know why now — I was probably being bullied for being smart and weak and unpopular in some way. We lived about a mile away from the school, and I usually walked. And Mr Herron left the school looking for me, and drove until he found me. I think that I refused his help then, but his act of looking out for me is something I remember forty years later.

The Macomb Daily (the local county paper) reported back in February of 2009 that Mr Herron died in a house fire at the age of 75. My youngest brother Steve mentioned this to me over the phone. Mr Herron eventually became an assistant superintendent of the Roseville schools. I assume that he brought his love for students, for sports, and for racial equality to that job as well.

I never caught his love for sports, but he began to open my eyes to the experiences of African Americans, and he began to turn me into a man, for which I will always be grateful.

I am a Wordnik

This week, I started as the lead engineer for Wordnik’s analytics platform. Except I get a little antsy about the term “engineer,” so I asked them to make my title “Lead, Analytics Platform.” It’s a real pleasure to work with the Wordnik team so far–super excited to be working with Tony Tam and Erin McKean, and also former Powersetters Colin Pollack and Robert Voyer. When Robert joined Wordnik over a year ago, I badgered him into getting me an interview–it’s only now that it’s come to fruition.

There were many good things about working at Bing and Microsoft, especially the large amounts of friendship I found there, and the large amounts of data I got to explore and understand. Still, it was a real joy to fire up a terminal session and start exercising my atrophied Unix muscles.

I’ll be spending most of my time in Silicon Valley/San Francisco with visits back to Michigan from time to time.

Let me end by pointing to Erin’s inspiring TED talk, which was the starting point of my path to Wordnik.