A weblog by Will Fitzgerald

Useful Scala introduction

I found the following Scala introduction, by Jason Baldridge, pretty useful, especially as it has a mild computational linguistics focus. It assumes no programming background, but that doesn’t get it the way too much.

Computational Social Science on the cheap using Twitter

This is a followup to my post Computational lexicography on the cheap using Twitter, but more especially in response to Using off-the-shelf software for basic Twitter analysis.

The later article shows how to use database software (MySQL and its implementation of the SQL language) to do basic Twitter analysis. The ‘basic analysis’ includes counts by hashtag, timelines, and word clouds. They analyse about 475k tweets.

But here’s the thing: all their analyses can be done more simply with simple text files and pipes of Unix commands (as most eloquently demonstarted in Unix for Poets, by Ken Church). In fact, several simple   commands—commands I use everyday—are powerful enough to do the kind of analyses they discuss.

Getting the data.

(You can skip over this if you have data already!)

Interestingly, they do not show how to get the tweets to begin with. My previous post discusses this, but it might be useful to show a simple Ruby program that collects Tweet data, especially since the method has changed slightly since my post. The biggest hurdle is setting up authentication to access Twitter’s data—discussed in full, here, but the crucial thing is that you have to register as a Twitter developer, register a Twitter application, and get special tokens. You create an application at the Twitter apps page; from that same location you generate the special tokens.

Here’s the Ruby script (also listed here).

require 'rubygems'
require 'tweetstream'
require 'date'

TweetStream.configure do |config|
  config.consumer_key = ''
  config.consumer_secret = ''
  config.oauth_token = ''
  config.oauth_token_secret = ''
  config.auth_method = :oauth
  config.parser   = :json_gem

# Change the words you want to track
TweetStream::Client.new.track('football', 'baseball', 'soccer', 'cricket') do |status|
    # The Tweet id
    id = status.id
    # The text of the tweet, with new lines (returns) replaced by spaces
    txt = status.text.gsub(/\n/," ")
    # The date of the tweet, printed out in a slightly more useful form 
    # for our purposes
    d = DateTime.parse(status.created_at).strftime("%Y-%m-%d\t%H:%M:%S")
    puts [id,txt,d].join("\t")
  rescue Exception => e
    puts "!!! Error: #{e.to_s}"

With the proper keys and secrets, this gist wlll allow you to track keywords over time, and print out, in a tab-separated format, the tweet id, the text of the tweet, the date, andthe time it was published (in UTC, or Greenwich, time). You could add additional columns, as described (by example) in the Twitter API.

The example here tracks mentions of football, baseball, soccer, and cricket, but obviously, these could be other keywords. Running this using this command:

ruby track_tweets.rb | tee nsports.tsv

will place tweets in the file ‘nsports.tsv’.

Basic statistics

Counting the number of football, baseball, etc. mentions is easy:

$ grep -i football nsports.tsv | wc -l
$ grep -i baseball nsports.tsv | wc -l
$ grep -i soccer nsports.tsv | wc -l
$ grep -i cricket nsports.tsv | wc -l

As well as getting the number of lines in the file:

$ cat nsports.tsv | wc -l

The second analysis was to count who is retweeted the most, done by counting the username after the  standard Twitter “RT ” (eg “rt @willf good stuff!”). The following pipeline of commands accomplishes this simply enough:

egrep -io "rt +@\w+" nsports.tsv | perl -pe "s/ +/ /g" | cut -f2 -d\  | sort | uniq -c | sort -rn | head

(This may be easier to copy from here). Each of these is a separate command, and the pipe symbol (|), indicates that the output from one command goes on to the next. Here’s what these commands do:

  1. egrep -io “rt +@\w+” nsports.tsv — searches through the tweets for the pattern RT space @ name, where there is one or more spaces, and one or more ‘word’ characters. It only prints the matching parts (-o), and ignores differences in case (-i).
  2. perl -pe “s/ +/ /g” — I noticed that from time to time, there is more than one space after the ‘RT’, so this substitutes one or more spaces with exactly one space.
  3. cut -f2 -d\  — Each line looks like “RT @name”, now, and this command ‘cuts’ the second field out of each line, with a delimiter of a space. This results in each line looking like ‘@name’.
  4. sort | uniq -c | sort -rn — this is three commands, but I type them so frequently, it seems like one to me. It sorts the text, so they can be counted with the uniq command, which produces two columns : the count and the name; we reverse sort (-r) on the first numeric field (-n)
  5. head — this shows the top ten lines from a file.

This command pipeline should have no problem handling 475k lines.

The third analysis was to put the data in a format that can be used by Excel to create a graph, with counts by day. Because we have printed the date and time in separate columns, with the date in column 3. So, we can simply do the cut, sort, uniq series:

cat nsports.tsv | cut -f3 | sort | uniq -c > for_excel.tsv

This will put the data into a format that Excel can read.

Finally, the authors show how to create Wordle word graphs overall, and for the categories. I’m not a big fan of these as a data exploration tool, but notice you can use cut -f2 to get the text to paste into Wordle.

So, this is computational social science on the cheap using Twitter, using some basic Unix commands (cat, cut, sort, uniq, grep), with one tiny, tiny call to Perl. You can do this too–and it’s easier to learn than MySQL and SQL! Plus, you can easily read the text files that are created. All of this was done on a standard Mac, but any Unix machine, or Windows machine with the Cygwin tools installed, can do this as well.

Leroy Herron

When I was in junior high school at Burton Junior High School — that is, grades seven and eight — Mr Leroy Herron was a very important man in my life. He was a school counselor, and a coach for the basketball team. He was also the sponsor of the Human Relations Club, a club created to get black kids and white kids like me to learn more about what I now might call anti-racism, but then we mostly called non-discrimination. If I remember correctly, there were two white boys — Alan Kulevicz and me, and about a half dozen black girls. The school itself had a strong majority of white kids. I remember Mr Herron talking about how his son self-identified as “black,” while Mr Herron felt more comfortable, at that time, calling himself a Negro. If I recall correctly, African American, or Afro-American were also coming into vogue.

We once did a field trip to a school in Detroit where the students were all (or almost all) African American. I remember asking the principal how many of his staff were black, and how many were white. He had to stop and think, and he said that he didn’t primarily think of the teachers in racial terms. Since knowing whether someone was black or white was very important in my family, this came as a shock, and a new way of thinking.

Mr Herron loved sports, and he loved coaching. I wish I had been a decent ball player, but instead I just acted as the team’s manager. I don’t remember much about this experience, except I was at one point asked to keep score for the number of times players in the game showed “hustle,” and I had no idea how to do this, so I got razzed about it. I really was not a good manager — not as bad as I was a baseball umpire, but that’s another story.

One time, I left school crying. I don’t know why now — I was probably being bullied for being smart and weak and unpopular in some way. We lived about a mile away from the school, and I usually walked. And Mr Herron left the school looking for me, and drove until he found me. I think that I refused his help then, but his act of looking out for me is something I remember forty years later.

The Macomb Daily (the local county paper) reported back in February of 2009 that Mr Herron died in a house fire at the age of 75. My youngest brother Steve mentioned this to me over the phone. Mr Herron eventually became an assistant superintendent of the Roseville schools. I assume that he brought his love for students, for sports, and for racial equality to that job as well.

I never caught his love for sports, but he began to open my eyes to the experiences of African Americans, and he began to turn me into a man, for which I will always be grateful.

I am a Wordnik

This week, I started as the lead engineer for Wordnik‘s analytics platform. Except I get a little antsy about the term “engineer,” so I asked them to make my title “Lead, Analytics Platform.” It’s a real pleasure to work with the Wordnik team so far–super excited to be working with Tony Tam and Erin McKean, and also former Powersetters Colin Pollack and Robert Voyer. When Robert joined Wordnik over a year ago, I badgered him into getting me an interview–it’s only now that it’s come to fruition.

There were many good things about working at Bing and Microsoft, especially the large amounts of friendship I found there, and the large amounts of data I got to explore and understand. Still, it was a real joy to fire up a terminal session and start exercising my atrophied Unix muscles.

I’ll be spending most of my time in Silicon Valley/San Francisco with visits back to Michigan from time to time.

Let me end by pointing to Erin’s inspiring TED talk, which was the starting point of my path to Wordnik.

Remembering Miss Mullens

It’s Ada Lovelace Day and we are encouraged to write about women who were significant in getting us involved in Science and Technology.

I remember my Junior High Geometry teacher, Miss Mullens. She was very, very short, kind of shy, but very funny — a classic geek, really, now that I think of it (geek, of course, being a term of praise here, not a negative thing).

Junior High school geometry, for me, was mostly about learning to do proofs — classic Elements of Euclid stuff. This was in the heyday of the “new math” movement, and I think — although this is a long time ago — that they emphasized thought processes over rote memorization. And I loved doing proofs, getting them right. I know I got a “A+” in the class. It was a great encouragement to me. I must have had this class in grade 9, because I didn’t do well in Algebra (8th grade) until the teacher — Mr Perkins — called me out on my laziness. Mr Perkins actually had a paddle and used it on students (this was in the late 60s). Miss Mullens was too small for that of course — but I would have done anything for her; her praise was enough.

Alas, I doubt if she’ll read these words — perhaps, even, her name was Mullins, or Mullen. It’s been a long time. But she was a great teacher, and I remember her fondly.

Producing n hash functions by hashing only once

To implement a bloom filter, you need a bunch of hash functions. In naive implementations (and I’ve seen plenty), programmers pick out, say, five cryptographic hash functions. One problem with this is that the hash functions for bloom filters have different requirements than hash functions for cryptography–the latter tend to be more than is required for the former. What you want for bloom filters is something that’s very, very fast, while maintaining that basic desiderata for bloom filter hashing, uniform spread.

There’s a good paper that reminds us that you can easily simulate n hash functions by having just two hash functions around. This can be as simple as this function to create the ith hash of a key, given the results a and b of hashing a key with these two functions:

hash(i) = (a + b * i ) % m

where m is the maximum value of the hash (for example, the number of buckets in a bloom filter).

But here’s a good trick not really worth a paper–but it’s still a good trick. Typically, it’s totally reasonable to limit the size to under the maximum size of an unsigned 32-bit number. These days, at least, it’s probably cheaper to calculate a base hash function on unsigned 64-bit numbers. So, you can take the upper half and the lower half of the 64-bit hashed value and return them as two 32 bit numbers.

Voila! Double hashing with one hash function. And using FNV means you have a very cheap hash function to start with, so really this can be very, very fast.

I implemented this in my bloom filter code for go.

For @mojombo.

On “Retiring a Great Interview Problem”

Daniel Tunkaleng wrote an interesting blog post, “Retiring a Great Interview Problem” on an interview problem that he has, in the past, posed to interviewees, but which he has now retired, because someone posted the problem, and a solution, to an interview problem website. The problem is the segmentation problem (which he calls the word-break problem): given a string, segment it into reasonable words, if possible (for example, breaking “applepie” into “apple pie”). He describes the various gradations of good answers to this problem, and I guess I would say that his reasoning is sound: candidates who give a better answer than a worse answer would probably make better programmers.

However, if I were asked this question in an interview, I am fairly confident that I would freeze up and not give a good answer. Partially, this is because of the stupidly artificial conversational model that occurs when one is participating in a white-board exercise–what, program without the resources of the internet, while someone watches me, and makes me elucidate my mental states? But the other reason I’d have a problem is that I would recognize this problem as one I have already coded and place on Github (https://github.com/willf/microsoft_ngram) which is in turn based on Peter Norvig’s section on segmentation in the book Beautiful Data (for which he posted Python code). This code does segmentation even better than the Tunkaleng’s formulation of the problem, in that it provides the most probable segmentation (Tunkaleng’s formulation asked for any reasonable segmentation). I also know (or at least believe) that Norvig’s code is based on similar code he wrote for his Artificial Intelligence textbook (here’s the code) and that using memoization as a lazy person’s method of doing dynamic programming is an old and efficient trick of Norvig’s. This knowledge would basically drain me of any interest in reinventing this particular wheel.

In other words, I would fail this interview, although I have publicly demonstrated my ability to implement a more generally useful version. Despite Tunkaleng being a vastly better programmer, he got it wrong on the first try (he had an “off-by-one” error, which he later corrected). Granted, interviewers (or interviewers in the Google style) are generally more concerned about precision (avoiding hiring bad programmers) than recall (avoiding missing out on good programmers). But I continue to think that the whiteboard method systematically discriminates against certain classes of programmers. I suspect, in particular, it discriminates against older candidates, which, prima facie would be illegal under US law.

[Update, added later the same day]. Take the “I suspect” literally in the last paragraph. I’m still trying to articulate what I think about interviewing techniques of this type and discrimination, particularly age discrimination (the only kind I’m potentially subject to). And take prima facie literally, too–I doubt that, in a court of law, age discrimination could be proved based solely on these interviews.

On the meta-media

In the wake of the horrible killings in Norway, I saw several tweets and Facebook posts that were something like “Brown: terrorist. White: extremist. Got it.” or “It speaks volumes that the immediate assumption was that this was the work of Muslim terrorists.” In other words, people began quite quickly to comment on the media reaction to the story, criticizing (implicitly or explicitly) for being anti-Muslim. This seemed odd to me, because by the time I started to see stories about the killings, the stories themselves were tentative about what the motivations of the killer or killers were.

After some discussion with one of the posters of one of these comments, what became clear is that some tabloids and pundits were quick to label the motivations. The Sun (on July 22) called it an “Al-Queda Massacre” and Thomas Jocelyn of The Standard wrote, somewhat more circumspectly, “We don’t know if al Qaeda was directly responsible for today’s events, but in all likelihood the attack was launched by part of the jihadist hydra.”

The legitimate news organizations were bit cautious about assigning motivations; at least, the ones I was able to check. The New York Times, after Anders Behring Breivik was arrested, quoted police officials as saying he was a “right-wing extremist,” although later they pointed out the similarities of the bombing and killings to prior al-Qaeda attacks, including declared threats against Norway. The Associated Press, prior to the arrest, quoted acting national Police Chief Sveinung Sponheim as refusing to speculate as to motivation, but the article also noted the al-Qaeda threats.

Once more was known about Breivik, discussion has focused on what to call him; a “terrorist,” an “extremist,” a “Christian fundamentalist,” a “right-wing terrorist,” etc. And, to some extent, this is a useful discussion to have. But any such discussion should shrink, I think, in importance to the facts on the ground: the grieving families, justice, and prevention of such attacks in the future. Part of our media-saturated culture is to think that stories on the media reaction to the story are as or more important than the story itself. No one should be surprised at The Sun being a tabloid, or even a right-leaning pundit making a too quick (but, seriously, in the current case, not uneducated) guess that Islamist extremists were behind the attack. Why is this worth comment? We should, I think, be slower to speak; we need fewer pundits, and we need even fewer meta-pundits. Ironically, of course, this blog post is meta-meta punditry, but such is the rabbit hole we enter.

O(n) beats O(lg n)

Here’s a problem I faced at work: determine the effective top level domain of a URI, using the data from the Public Suffix List, but not using any data structures read into memory (ie, a dynamically created hash table or even a static array). For example, the effective top level domain of “http://blog.entish.org” is “entish.org”, because it is a name one could register.

So, there are about 4000 “positive” rules and about 100 “negative” rules. One way to approach this is to convert a host name into a series of keys (blog.entish.org, entish.org, org) and find the longest positive and negative matching rules. It’s simple enough to write a fast hash function (e.g., FNVa), and write code to write a function that does one of the following two things:

(1) switch (fnva(key)) {case h1: case h2: …. case hn: return true; default: return false}

(2) var h = fnva(key); if (n < hmid) if …}

where (1) is a giant switch statement, and (2) is a function which does a binary search of comparisons of values. (1) is O(n), of course (because you might have to look through every value. (2) is O(lg n) as any good binary search is.

I was surprised to see that, on my test set of about 9000 hosts, the switch version was 4 times faster than the binary search version.

It’s fun to write code that writes code. It’s good to benchmark.


Dynamic typing and testing

As kind of a 10-finger exercise, I decided to write a small inverted index library in Ruby today, since I had the day off. This is just a few thoughts about what it was like to write tests for this code, which can be found at https://github.com/willf/inverted_index_simple–not that I think you’ll write the next major search engine with this bit of code…

What struck me was that as I was writing unit tests, how many of the tests failed at first either because of undefined variables or dynamic typing problems. For example, in the method query_and (which returns the set intersection of all documents containing query terms), I originally wrote:

def query_and tokens
return Set.new if tokens.size == 0
sets = tokens.map{|token| query_token token}
sets[1..-1].inject(sets[0]){|reduction, x| redution & x}

The third line is only tested if there are at least three query terms, so it was important to write such a test. But what surprised me is that the error was not in the logic, but the literal typing–I misspelled “reduction” the second time.

Similarly, I think I wrote a version of the function that had

sets = tokens.map{|token| query_token tokens}

where I typed ‘tokens’ instead of ‘token’ by mistake.

With a statically typed language, neither of these errors are likely to make it past the static compiler. The Go language, for example, won’t even let you declare variables unless you use them–something which drives me mad, but is probably the right thing to do. (It would have caught the second error, even if ‘token’ and ‘tokens’ happened to be of the same type.)

The great loss of strict static typing comes when it is difficult to write ‘generic’ code (stacks, and queues, for example, of anything); most modern languages provide some sort of facility for this (and we’re still waiting for Go to provide this). The other great loss is when it really makes sense to have union types: functions that work on lists of ints, in addition to ints.