A weblog by Will Fitzgerald

Category Archives: Science and Tech

A tiny rant about ‘tr’

The tr program converts a string of characters into another string of characters, using a very simple rule system. The standard example is tr [A-Z] [a-z], which converts uppercase letters to lowercase. But this works for simple “ASCII” strings only: tr, at least on many systems, understands Unicode, and so the standard example fails for converting, say, Russian or Czech. But tr also understands character classes, so the standard example should be written tr [:upper:] [:lower:].
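As a quick illustration, Ruby’s String#tr mimics the Unix command, and shows the same ASCII-only behavior for character ranges (this sketch assumes Ruby 2.4 or later, where String#downcase does Unicode case mapping):

```ruby
# A range like 'A-Z' covers only the ASCII uppercase letters.
"HELLO".tr('A-Z', 'a-z')   # => "hello"
"ČESKÝ".tr('A-Z', 'a-z')   # => "ČeskÝ" -- Č and Ý are left untouched

# A Unicode-aware case mapping handles the Czech letters too.
"ČESKÝ".downcase           # => "český"
```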

File under “boring post.”

Scala filters

A random Scala note.

Today, I wanted to apply a list of filters to each item in a list, and return just those that pass each of the filters.

For example, given a range of integers, return just those that are divisible by 2 and by 3.

Let’s start by defining a boolean function divides:

def divides(d: Int, i: Int) = i % d == 0

Note that divides(2,_:Int) defines a partially applied function testing divisibility by 2.

(divides(2,_:Int))(2) => true
(divides(2,_:Int))(3) => false

So we can create our filters like so:

val filters = divides(2,_:Int) :: divides(3,_:Int) :: Nil

or, equivalently:

val filters = List(divides(2,_:Int),divides(3,_:Int))

Now, we can simply use Scala’s filter and forall functions to filter a range of integers:

scala> Range(1,50).filter(x => filters.forall(f => f(x)))
res45: scala.collection.immutable.IndexedSeq[Int] =
  Vector(6, 12, 18, 24, 30, 36, 42, 48)

The filters could also be defined as a Set, but by creating them as a List, one can put the less expensive filters first.

Useful Scala introduction

Computational Social Science on the cheap using Twitter

This is a followup to my post Computational lexicography on the cheap using Twitter, but more especially in response to Using off-the-shelf software for basic Twitter analysis.

The latter article shows how to use database software (MySQL and its implementation of the SQL language) to do basic Twitter analysis. The ‘basic analysis’ includes counts by hashtag, timelines, and word clouds. They analyse about 475k tweets.

But here’s the thing: all their analyses can be done more simply with plain text files and pipelines of Unix commands (as most eloquently demonstrated in Unix for Poets, by Ken Church). In fact, several simple commands, ones I use every day, are powerful enough to do the kind of analyses they discuss.

Getting the data.

(You can skip over this if you have data already!)

Interestingly, they do not show how to get the tweets to begin with. My previous post discusses this, but it might be useful to show a simple Ruby program that collects Tweet data, especially since the method has changed slightly since my post. The biggest hurdle is setting up authentication to access Twitter’s data—discussed in full, here, but the crucial thing is that you have to register as a Twitter developer, register a Twitter application, and get special tokens. You create an application at the Twitter apps page; from that same location you generate the special tokens.

Here’s the Ruby script (also listed here).

require 'rubygems'
require 'tweetstream'
require 'date'

TweetStream.configure do |config|
  config.consumer_key = ''
  config.consumer_secret = ''
  config.oauth_token = ''
  config.oauth_token_secret = ''
  config.auth_method = :oauth
  config.parser = :json_gem
end

# Change the words you want to track
TweetStream::Client.new.track('football', 'baseball', 'soccer', 'cricket') do |status|
  begin
    # The Tweet id
    id = status.id
    # The text of the tweet, with new lines (returns) replaced by spaces
    txt = status.text.gsub(/\n/, " ")
    # The date of the tweet, printed out in a slightly more useful form
    # for our purposes
    d = DateTime.parse(status.created_at).strftime("%Y-%m-%d\t%H:%M:%S")
    puts [id, txt, d].join("\t")
  rescue Exception => e
    puts "!!! Error: #{e.to_s}"
  end
end
With the proper keys and secrets, this gist will allow you to track keywords over time, and print out, in a tab-separated format, the tweet id, the text of the tweet, the date, and the time it was published (in UTC, or Greenwich, time). You could add additional columns, as described (by example) in the Twitter API.

The example here tracks mentions of football, baseball, soccer, and cricket, but obviously, these could be other keywords. Running this using this command:

ruby track_tweets.rb | tee nsports.tsv

will place tweets in the file ‘nsports.tsv’.

Basic statistics

Counting the number of football, baseball, etc. mentions is easy:

$ grep -i football nsports.tsv | wc -l
$ grep -i baseball nsports.tsv | wc -l
$ grep -i soccer nsports.tsv | wc -l
$ grep -i cricket nsports.tsv | wc -l

As well as getting the number of lines in the file:

$ cat nsports.tsv | wc -l

The second analysis was to count who is retweeted the most, done by counting the username after the standard Twitter “RT” prefix (e.g., “rt @willf good stuff!”). The following pipeline of commands accomplishes this simply enough:

egrep -io "rt +@\w+" nsports.tsv | perl -pe "s/ +/ /g" | cut -f2 -d\  | sort | uniq -c | sort -rn | head

(This may be easier to copy from here.) Each of these is a separate command, and the pipe symbol (|) indicates that the output from one command goes on to the next. Here’s what these commands do:

  1. egrep -io “rt +@\w+” nsports.tsv — searches through the tweets for “rt”, followed by one or more spaces, an “@”, and one or more ‘word’ characters. It prints only the matching parts (-o), and ignores differences in case (-i).
  2. perl -pe “s/ +/ /g” — I noticed that from time to time, there is more than one space after the ‘RT’, so this substitutes one or more spaces with exactly one space.
  3. cut -f2 -d\  — Each line looks like “RT @name”, now, and this command ‘cuts’ the second field out of each line, with a delimiter of a space. This results in each line looking like ‘@name’.
  4. sort | uniq -c | sort -rn — this is three commands, but I type them so frequently that they seem like one to me. This sorts the lines so they can be counted with the uniq command, which produces two columns: the count and the name; we then reverse sort (-r) numerically (-n) on the count.
  5. head — this shows the first ten lines of its input.

This command pipeline should have no problem handling 475k lines.

The third analysis was to put the data in a format that Excel can use to create a graph of counts by day. Because we printed the date and time in separate columns, with the date in column 3, we can simply do the cut, sort, uniq series:

cat nsports.tsv | cut -f3 | sort | uniq -c > for_excel.tsv

This will put the data into a format that Excel can read.

Finally, the authors show how to create Wordle word graphs, overall and for each category. I’m not a big fan of these as a data exploration tool, but note that you can use cut -f2 to get the text to paste into Wordle.

So, this is computational social science on the cheap using Twitter, using some basic Unix commands (cat, cut, sort, uniq, grep), with one tiny, tiny call to Perl. You can do this too–and it’s easier to learn than MySQL and SQL! Plus, you can easily read the text files that are created. All of this was done on a standard Mac, but any Unix machine, or Windows machine with the Cygwin tools installed, can do this as well.

Producing n hash functions by hashing only once

To implement a bloom filter, you need a bunch of hash functions. In naive implementations (and I’ve seen plenty), programmers pick out, say, five cryptographic hash functions. One problem with this is that hash functions for bloom filters have different requirements than hash functions for cryptography; the latter tend to be more than is required for the former. What you want for bloom filters is something that’s very, very fast, while maintaining the basic desideratum for bloom filter hashing: uniform spread.

There’s a good paper that reminds us that you can easily simulate n hash functions by having just two hash functions around. This can be as simple as this function to create the ith hash of a key, given the results a and b of hashing a key with these two functions:

hash(i) = (a + b * i ) % m

where m is the maximum value of the hash (for example, the number of buckets in a bloom filter).

But here’s a good trick not really worth a paper–but it’s still a good trick. Typically, it’s totally reasonable to limit the size to under the maximum size of an unsigned 32-bit number. These days, at least, it’s probably cheaper to calculate a base hash function on unsigned 64-bit numbers. So, you can take the upper half and the lower half of the 64-bit hashed value and return them as two 32 bit numbers.

Voila! Double hashing with one hash function. And using FNV means you have a very cheap hash function to start with, so really this can be very, very fast.
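Here is a sketch of the whole trick in Ruby (the function names are mine; the FNV-1a constants are from the FNV specification):

```ruby
# 64-bit FNV-1a constants.
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME  = 0x100000001b3
MASK64       = 0xffffffffffffffff

# One pass of 64-bit FNV-1a over the bytes of the key.
def fnv1a_64(key)
  key.each_byte.reduce(FNV64_OFFSET) { |h, b| ((h ^ b) * FNV64_PRIME) & MASK64 }
end

# Split the single 64-bit hash into two 32-bit hashes, a and b.
def two_hashes(key)
  h = fnv1a_64(key)
  [h >> 32, h & 0xffffffff]
end

# Simulate the ith hash function: hash(i) = (a + b * i) % m.
def nth_hash(key, i, m)
  a, b = two_hashes(key)
  (a + b * i) % m
end
```

Each of nth_hash(key, 0, m) through nth_hash(key, k-1, m) can then serve as one of the k bucket indices for a bloom filter with m buckets.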

I implemented this in my bloom filter code for Go.

For @mojombo.

Ruby Google Chart API

A few years ago, I wrote some Ruby code to access the Google Chart API. I’ve (finally) put it up on GitHub.

This is part of an effort to get more of my open source projects up in a common space.


What is missing? Watson, AI and search technology

All geekdom, and much that is outside our realm, is abuzz with news and discussion of the contest between IBM’s Jeopardy-playing computer, Watson, and Ken Jennings and Brad Rutter. In last night’s game (the second half of a game started on Monday), Watson won decisively, answering many, many more questions than Jennings and Rutter combined. At the end of the first half, Rutter and Watson were tied. Lots of famous (and formerly famous) artificial intelligence researchers have weighed in on what this means for AI. I want to discuss one small point which highlights the difference between human intelligence in question answering and computerized question answering.

In Monday’s show, one of the Jeopardy “answers” was, “It was the anatomical oddity of U.S. gymnast George Eyser, who won a gold medal on the parallel bars in 1904.” The correct answer was something like, “What is a missing leg?” Apparently, the play sequence went like this: Ken Jennings buzzed in with “What is a missing hand?” Watson then buzzed in with “What is a leg?” This was first deemed correct by Alex Trebek, the host, but this judgment was reversed by the judges on review. I didn’t see the show (a Valentine’s Day dinner took precedence), but apparently the TV show edited out Trebek’s original decision. Because of the reversal, Watson was unable to give a “more specific answer,” and Rutter was unable to buzz in on Watson’s error.

It seems that Trebek was treating Watson’s answer as if a human had given it: If a human had said “What is a leg?” as a follow-up to the wrong question, “What is a missing hand?” it would make sense to treat this as having the same intention as “What is a missing leg?” But Watson doesn’t know what the other contestants are saying, and so it actually had no such context in which to give its answer. I think it is plausible that Trebek would have awarded Watson a correct answer if Watson had given its answer without the context of Jennings’s question (or, perhaps, would have asked Watson to give a more specific answer), given the “anatomical oddity” context of the question.

People laughed when Watson gave a similar wrong answer to another of Jennings’s errors. Jennings answered “What are the ’20s?,” and then Watson said, “What is 1920s?” Interestingly, the press reports have pretty much all said that Watson gave the same answer as Jennings. But Watson’s answer, though it would have the same intent if given by a person, is different, both in its raw surface form and in its grammatical incorrectness. I don’t think the Jeopardy rules require grammatical correctness in the questions, for human players don’t have this kind of problem. The reporters didn’t know, or didn’t remember from the game play, that Watson didn’t receive the other contestants’ answers.

Watson was penalized for getting the 1920s question wrong, and penalized for getting the leg question right, but in the wrong way. I find it fascinating that people–sometimes in real time, and sometimes only on deliberation–can navigate what it means to have correct and incorrect intentions with respect to human-made artifacts such as Watson, especially within the strong expectations set up by IBM and the Jeopardy game to treat the Watson system as an intentional agent. Most of Watson’s human-seeming qualities come from the expectations that get set up. For example, Watson uses templates such as “Let’s finish up <CATEGORY>, Alex” when there is only one answer left in the category. But the stronger expectations are set by giving it a human name, a synthesized voice, putting it between two humans, using the pronoun “he” when referring to the system, etc. But even given these expectations, people can notice, and react to, the breakdowns (and, in the case of the missing “missing,” a seeming non-breakdown).

Search engines, such as Google and Bing, have a feature often internally called “Answers,” which gives an answer to an implied question right on the search results page. For example, enter “molecular weight of hydrogen” or “capital of Michigan” in Bing or Google, and you’ll get the answer without having to go to another page. No one confuses this with human intelligence, yet, at some level of analysis, this is what Watson is doing. Granted, search engines are not optimized for searches phrased as questions (if you ask Google today, “What was George Eyser’s problem?,” the caption will say “he ran out of the paper dishes on which to serve the ice…”; in this case, Bing does a better job, but it could have gone either way). But the extraction of facts and references from a very large number of documents and data sources, based on vague and ambiguous search queries, is the essential job of search engines. For the most part, Watson is a search engine, specialized for Jeopardy, that has been given a human moniker and a few human mannerisms.

Whitelisting in Google Mail

Google Mail (Gmail) has been putting mail from LinkedIn contacts and groups into the spam folder. I had to research how to whitelist domains in Gmail, and this is what I learned:

  1. Click ‘Create a Filter’
  2. Enter the domain in the ‘From’ field
  3. Select the ‘Never send it to Spam’ option

And, hey presto: it’s been whitelisted.

    Some Bing N-gram notes on tokenization

    Brendan O’Connor asked me on Twitter what I knew about Bing N-gram tokenization conventions, and I said I would ask someone who knew. The N-gram team plans a future blog post on this, but here’s some things I was told that I could share (and I quote):

    Here are a few things we can share right now:

    • Segment boundaries use the special symbols <s> and </s>, thus P(“<s> hello there </s>”) != P(“hello there”).
    • The GetConditionalProbability uses the last space character as the ‘word’ boundary, even if internally there are multiple tokens represented in that last ‘word’.  That is, GCP(“shown as-is”) is P(“as-is”|”shown”), not P(“is”|”shown as”).


    New Bing NGram data

    Bing (my employer, but here a different area) has announced new publicly available NGram data, current to April 2010. It includes 1 through 5-grams for title, anchor and body streams (that is, HTML page titles, text in anchor links, and overall HTML body text).