Will.Whim

A weblog by Will Fitzgerald

Category Archives: Language

The Non-Chaos, or English Spelling Defended in Rhyme

Dearest creature in creation,
Study English pronunciation.
It’s more regular in its core
Than pundits, who focus on its more
Erratic ways, would have you believe.
Perhaps they simply cannot conceive
Of any system not based in Latin—
They would choose, I suppose, to flatten
All writing to “one form, one sound”
But, really, regularities abound.
Consider, how we pronounce the plural
Form of words; Imagine the neural
Work of reading “dogs” and “cats.”
Would you prefer “dogz”? That’s
Not right—that single ess for each
Is easier to read, to sound out, and to teach.
Or consider “heir/inherit”
To write “air” would be a demerit,
A signature failure, and a sign
Of a spelling system’s worse design.
Seriously, it would simply astonish
Anyone to think that “ghoti” sounds like “fish.”
Besides, English spans such colossal ages
And latitudes, I doubt such cages
Desired by fans of regularization
Could withstand the normal mutation
Of how language really adapts.
“Wind” and “hind” have rhymed or not, perhaps,
As, over time and place, each has adopted
A short I, sometimes a long I, co-opted
By real human beings. So “after tea and cakes and ices,”
Let us “force the moment to its crisis”—
Haters, they say, are going to hate; let them snivel
I have had enough of drivel,
Go ahead, enjoy your whine,
But English spelling is basically fine.

—Will Fitzgerald, January 2012

Distribution of tweet lengths

[Figure: % of English tweets by size (sample 50k)]

I get a very different distribution of tweets than Isaac Hepworth — no spikes at 28. My provisional guess is that his data is a bit wonky. My data here is (only) 50k English tweets from one day in 2007.
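
For what it’s worth, a distribution like this is simple to compute with basic Unix tools. Here is a sketch, assuming a tab-separated file tweets.tsv with the tweet text in column 2 (my layout here, not a standard one):

cut -f2 tweets.tsv | awk '{ print length($0) }' | sort -n | uniq -c

Each output line is a count followed by a tweet length; dividing the counts by the total gives the percentages charted above.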

[Figure: Isaac Hepworth's distribution]

Computational Social Science on the cheap using Twitter

This is a followup to my post Computational lexicography on the cheap using Twitter, but more especially in response to Using off-the-shelf software for basic Twitter analysis.

The latter article shows how to use database software (MySQL and its implementation of the SQL language) to do basic Twitter analysis. The ‘basic analysis’ includes counts by hashtag, timelines, and word clouds. They analyse about 475k tweets.

But here’s the thing: all their analyses can be done more simply with simple text files and pipes of Unix commands (as most eloquently demonstrated in Unix for Poets, by Ken Church). In fact, several simple commands—commands I use every day—are powerful enough to do the kind of analyses they discuss.

Getting the data

(You can skip over this if you have data already!)

Interestingly, they do not show how to get the tweets to begin with. My previous post discusses this, but it might be useful to show a simple Ruby program that collects Tweet data, especially since the method has changed slightly since my post. The biggest hurdle is setting up authentication to access Twitter’s data (discussed in full here), but the crucial thing is that you have to register as a Twitter developer, register a Twitter application, and get special tokens. You create an application at the Twitter apps page; from that same location you generate the special tokens.

Here’s the Ruby script (also listed here).

require 'rubygems'
require 'tweetstream'
require 'date'

TweetStream.configure do |config|
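  # Fill in these four strings with the credentials generated
  # for your application on the Twitter apps page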
  config.consumer_key = ''
  config.consumer_secret = ''
  config.oauth_token = ''
  config.oauth_token_secret = ''
  config.auth_method = :oauth
  config.parser   = :json_gem
end

# Change the words you want to track
TweetStream::Client.new.track('football', 'baseball', 'soccer', 'cricket') do |status|
  begin
    # The Tweet id
    id = status.id
    # The text of the tweet, with new lines (returns) replaced by spaces
    txt = status.text.gsub(/\n/," ")
    # The date of the tweet, printed out in a slightly more useful form 
    # for our purposes
    d = DateTime.parse(status.created_at).strftime("%Y-%m-%d\t%H:%M:%S")
    puts [id,txt,d].join("\t")
  rescue Exception => e
    puts "!!! Error: #{e.to_s}"
  end
end

With the proper keys and secrets, this gist will allow you to track keywords over time, and print out, in a tab-separated format, the tweet id, the text of the tweet, and the date and time it was published (in UTC, or Greenwich, time). You could add additional columns, as described (by example) in the Twitter API.

The example here tracks mentions of football, baseball, soccer, and cricket, but obviously these could be other keywords. Run it with this command:

ruby track_tweets.rb | tee nsports.tsv

will place tweets in the file ‘nsports.tsv’.
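
Each line of the file is tab-separated (the id, the text, the date, and the time), so it will look something like this invented example:

1234567890	Off to watch the football!	2012-01-15	19:30:00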

Basic statistics

Counting the number of football, baseball, etc. mentions is easy:

$ grep -i football nsports.tsv | wc -l
$ grep -i baseball nsports.tsv | wc -l
$ grep -i soccer nsports.tsv | wc -l
$ grep -i cricket nsports.tsv | wc -l

As well as getting the number of lines in the file:

$ cat nsports.tsv | wc -l
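
If you’d rather not type four separate greps, a small shell loop does all four counts at once. This is just a sketch over the same nsports.tsv file:

for sport in football baseball soccer cricket; do
  printf '%s\t' "$sport"          # the keyword, then a tab
  grep -ic "$sport" nsports.tsv   # -c counts matching lines; -i ignores case
done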

The second analysis was to count who is retweeted the most, done by counting the username after the standard Twitter “RT” (e.g., “rt @willf good stuff!”). The following pipeline of commands accomplishes this simply enough:

egrep -io "rt +@\w+" nsports.tsv | perl -pe "s/ +/ /g" | cut -f2 -d\  | sort | uniq -c | sort -rn | head

(This may be easier to copy from here.) Each of these is a separate command, and the pipe symbol (|) indicates that the output from one command goes on to the next. Here’s what these commands do:

  1. egrep -io "rt +@\w+" nsports.tsv — searches through the tweets for the pattern RT space @ name, where there is one or more spaces and one or more ‘word’ characters. It prints only the matching parts (-o), and ignores differences in case (-i).
  2. perl -pe "s/ +/ /g" — I noticed that from time to time there is more than one space after the ‘RT’, so this substitutes one or more spaces with exactly one space.
  3. cut -f2 -d\  — each line now looks like “RT @name”, and this command ‘cuts’ the second field out of each line, with a delimiter of a space. This results in each line looking like ‘@name’.
  4. sort | uniq -c | sort -rn — this is three commands, but I type them so frequently that they seem like one to me. The first sort puts the lines in order so they can be counted with the uniq command, which produces two columns: the count and the name; the final sort then reverse sorts (-r) numerically (-n), so the biggest counts come first.
  5. head — this shows the top ten lines of the output.

This command pipeline should have no problem handling 475k lines.

The third analysis was to put the data in a format that can be used by Excel to create a graph, with counts by day. Because we have printed the date and time in separate columns, with the date in column 3, we can simply do the cut, sort, uniq series:

cat nsports.tsv | cut -f3 | sort | uniq -c > for_excel.tsv

This will put the data into a format that Excel can read.
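
The first column of the output is the count and the second is the date, something like this (the counts here are invented):

  213 2012-01-15
  187 2012-01-16
  302 2012-01-17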

Finally, the authors show how to create Wordle word graphs, overall and for the categories. I’m not a big fan of these as a data exploration tool, but notice you can use cut -f2 to get the text to paste into Wordle.
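
On a Mac, for example, this puts all of the tweet text on the clipboard, ready to paste into Wordle (pbcopy is the standard Mac clipboard command):

cut -f2 nsports.tsv | pbcopy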

So, this is computational social science on the cheap using Twitter, using some basic Unix commands (cat, cut, sort, uniq, grep), with one tiny, tiny call to Perl. You can do this too–and it’s easier to learn than MySQL and SQL! Plus, you can easily read the text files that are created. All of this was done on a standard Mac, but any Unix machine, or Windows machine with the Cygwin tools installed, can do this as well.

On “culturomics”

[Note: I got bogged down on reading and reporting on the culturomics paper a little too closely, and this is a reboot.]

Peter Norvig, one of the co-authors of the culturomics paper and the director of research at Google, was also a co-author of another significant article with the suggestive title “The Unreasonable Effectiveness of Data”. The invention of the term “culturomics” suggests a scientific programme for attacking questions of culture that stresses statistical models based on large amounts of data, a programme that has been very successful both academically and commercially for linguistics and artificial intelligence, to say nothing of the informatics approaches of genomics and proteomics upon which the term culturomics is based. The slogan “every time I fire a linguist, the performance of my system goes up” (a slight misattribution of something Frederick Jelinek, a pioneering computational linguist, said) is another restatement of this. Among the bets and assumptions made by this approach are:

(1) The goal of science is better-engineered systems, which have practical, commercializable outcomes.
(2) Models must be empirically testable, with precise, independent, repeatable evaluation metrics and procedures.
(3) Simple quantitative models based on large amounts of data will perform better, in the senses of (1) and (2), than complex qualitative models based on small amounts of data.

The successes attributable to big data programmes include effective search engines, speech interfaces, and automated translation. Google’s rigorous approach to big data affects nearly every aspect of their business: core search for starters, but even more important are the big data approaches to how Google makes money on its search, as well as how it decreases its operating costs.

Studies of culture are currently, for the most part, either done using complex, qualitative models, or based on relatively small amounts of data. The Google N-gram data is, perhaps, an opening salvo in an attack on qualitative/small data approaches to studies of culture, to be replaced with quantitative/big data approaches. The quantitative/big data programme has been “unreasonably effective” in overturning how researchers in linguistics, artificial intelligence, and cognitive science approach their fields and get their projects funded. The bet, here, is that the same will occur in other culture studies.

There are many problems with the example experiments described in the culturomics paper. The experiments are often not described in enough detail to be replicable. Proof is often by example rather than by large-scale evaluation metrics. The proofs often resemble just-so stories (explanations without adequate controls) or unsurprising results (for example, that Nazis suppressed and censored writers with whom they disagreed is borne out by the data). The scope of the experiments is often very limited (for example, the section on “evolution of grammar” laughably describes only the changes occurring in a small subset of strong/weak verbs).

Because this is an overview paper, it may be that some of the important details are missing for reasons of space. Some of these things are addressed in the supporting materials, but by no means all. For example, something as basic as the methods used for tokenization—how the successive strings of characters in the digital copies of the books are broken into the tokens of the corpora—are not really defined well enough to be repeatable. How, for example, does the system tokenize “T.S. Eliot”? Is this tokenized the same way as “T. S. Eliot” or “TS Eliot”? Based on the sample N-gram viewer, it appears that, to find mentions of T.S. Eliot, the search string “TS Eliot” (similarly WH Auden, CS Lewis) must be used. The supplemental and related supplemental material give many details, but in the end refer to proprietary tokenization routines used at Google.
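
To see how much the choice of tokenizer matters, here is a toy illustration of my own (in the spirit of Unix for Poets; it has nothing to do with Google’s actual routines). Two slightly different token patterns yield different 1-grams for the same string:

$ echo "T.S. Eliot" | grep -oE "[A-Za-z.]+"
T.S.
Eliot
$ echo "T.S. Eliot" | grep -oE "[A-Za-z]+"
T
S
Eliot

Counts of “Eliot” are comparable under either scheme; counts of “T.S.” exist only under the first.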

And yet, there are some useful ideas here. Because the N-gram data is time-stamped, looking at some kinds of time-varying changes is possible. The idea of measuring the half-life of changes is a powerful one, and the varying amounts of time it takes for a change to fall to half-life are interesting in their analysis of “fame” (in reality, their analysis of name mentions). Seeing how some verbs are becoming strong in the face of a general tendency towards regularization is interesting. And the lexicographic estimates seem very valuable (if not very “culturomic” to me).

A danger in the approach of culturomics is that, by focusing on what can be processed on a large scale, and measured with precision, interesting scientific questions will be left unexplored, perhaps especially when those questions are not of obvious economic benefit. Engineers build better engines, not necessarily better theories.

Having said all of this, I remain optimistic about the release of the Google N-gram data, even as I resist the term and approach suggested by culturomics. Yes, Google needs to provide better descriptions of the data, and continue to clean up the data and metadata (as well as describe and report on what “cleaned up” means; some of this is, in fact, described in the supplementary data [4]), and to be much more transparent about access to the underlying documents, when permissible by law. It would be very useful for Google to provide algorithmic access to the data rather than just make the (very large) data sets available. But these data can be mined for interesting patterns and trends, and it will be interesting to see what researchers do with them. Let’s just call it time-stamped N-gram data, though, and eschew the term culturomics.

More thoughts on “culturomics” (part one)

I’ve now had a chance to read the Science Express article describing “culturomics,” that is, the “Quantitative Analysis of Culture Using Millions of Digitized Books,” recently published by researchers at Google and several high-profile academic and commercial institutions. The authors claim to have created a corpus of approximately four percent of all books ever published (over five million books), supplying time-stamped ngram data based on 500 billion words in seven languages (primarily English), from the 1500s until roughly the present (primarily more recently). The “-omics” of culturomics is by analogy to genomics and proteomics: that is, high-speed analysis of large amounts of data. As someone who has done some work with ngrams on the web (with data provided both by Bing, my employer, and earlier data provided by Google), I find this work of great interest. Digitized books are, of course, a different animal from the web (and from queries made to search engines on the web), and so the work is of interest for this reason as well. The addition of time-stamps makes some kinds of time-based analyses possible, too; this is the kind of data we have not had before (Bing’s ngram data, assuming they continue to provide older versions, might eventually do so).

There have been a number of criticisms of the culturomics programme, many of them well-founded. It is worthwhile to describe a few of these. First, there are problems with the meta-data associated with the books that Google has digitized. As a result, it is unclear how accurate the time-stamps are. This is not addressed in the article, although it has been a well-known problem. Certainly, they could have sampled the corpus and given estimates of the accuracy of the time-stamp data. Related to this is the lack of a careful description of the genres and dialects represented (partly a failure in the meta-data, again). Second, there are systematic errors in the text scans; this is especially true for older books, which often used typographic conventions and fonts not common today (and, one assumes, errors made due to language models based on modern texts rather than pre-modern ones). Consider, for example, the “long s” previously used in many contexts in English; this is often read as an “f” instead of a long s. Incidentally, according to the tokenization goals of the project, the right thing to do would be to record the long s, not regularize it to the modern, standard ‘s.’ Otherwise, it becomes more difficult to track the decline of the long s, except by guessing at OCR errors. The whole notion of what constitutes a countable thing in these corpora—that is, what the tokenization rules for generating the 1-grams are—is given short shrift in this article, although it is a fairly important issue.

Third, the presentation of the Google Labs ngram viewer has made it overly easy to tell “just so” stories. For example, consider the contrast between “Jesus” and “Christ” from 1700 to 2008. It’s easy to tell this just-so story: people talked about Jesus or Christ before the Revolution, but then not so much in the run-up to the war and its aftermath. But, with the Great Awakening, a large number of books were published, with “Christ” much more common than “Jesus.” Over time, due to increasing secularization of the United States, people wrote about Jesus or Christ less and less. The individualistic Evangelical explosion of the last thirty years has started to reverse the trend, with “Jesus” (a more personal name) becoming a more popular name than “Christ” (a less personal name). Natalia Cecire describes this, better and more succinctly, as “Words for Snowism.” Cecire also views the Ngram Viewer as a guilty pleasure, as epistemic candy.

The Science Express article describes several experiments made with the Google books data, and it is worth spending time examining these, because they give good hints as to what the data are likely to be good for, and where the culturomics programme is likely to head.

The first set of experiments describes attempts at estimating the size of the English lexicon, and its growth over time. The authors describe (qualitatively, for the most part) their sampling technique for determining whether a 1-gram token was an English word form: it had to appear more than once per billion words; a sample of these common potential word forms was manually annotated with respect to whether they were truly English word forms, or something else (like a number, a misspelling, or a foreign word). Sample sizes, procedure, inter-rater reliability, etc., were not reported; an important flaw, in my opinion. They show, for example, that the English vocabulary has been increasing by over 70% in the past 50 years, and contrast this to the size of printed dictionaries. This first set of experiments will be of great interest to lexicographers; indeed, it is just this gap that commercial enterprises like Wordnik are trying to fill. It is hard to see how this says much about “culture,” except as fodder for lexicographical historiography or lexicographical evangelism: there are many words that are not in ‘the dictionary’: get used to it.

The second set of experiments purports to describe “the evolution of grammar,” but, not surprisingly, only attacks a very small subset of lexical grammar: the change in use of strong and weak verbs in English. Given the time-stamped, word-form based data, it is relatively simple to check the movement from “burnt” to “burned,” and compare and contrast this to other strong and weak verbs. One wishes for more description of how they reached their conclusions, for example, that “high-frequency irregulars, which are more readily remembered, hold their ground better.” This reasonable statement is “proved” by a single contrast between finded/found and dwelled/dwelt. Some of the conclusions are based on dialect: there are differences between American English and British English. Unfortunately, the actual scope of the differences is “proved” by a single contrast between the use of burned/burnt in American and British English. Again, knowing the accuracy of dialect assignment, the list of verbs used, etc., would be very useful.

{to be continued}

Things that are stentorian

The first twenty-five things that are stentorian, according to the examples at Wordnik:

  1. tones [xxxxxxxxxxxxxxxxxxx]
  2. voices [xxxxxxxxxxxxxxx]
  3. commands [xxx]
  4. defenses [xx]
  5. styles [xx]
  6. rings [x]
  7. growls [x]
  8. moments [x]
  9. ways [x]
  10. jocks [x]
  11. phrases [x]
  12. resonances [x]
  13. screams [x]
  14. baritones [x]
  15. announcements [x]
  16. greetings [x]
  17. pronouncements [x]
  18. engines [x]
  19. insults [x]
  20. thickness [x]
  21. yelps [x]
  22. breathing [x]
  23. wheezing [x]
  24. barks [x]
  25. snores [x]

“Stentorian” derives from Stentor, a herald of the Greeks during the Trojan War. This post is a response to Robert L. Vaughn’s post on stentorian. I thought stentorian meant “in a grand rhetorical style,” but I think it does mean just “powerfully loud,” so Sacred Harp or black gospel music could be said to be sung in a typically stentorian manner. But I’m not quite sure: there are not many musical examples. Still, “The Stentorian Harp” would be a cool name for a shape note songbook.

World’s longest logogol

It was fun to review the world’s longest logogol (or palindrome) on Peter Norvig’s website today. A logogol is a word or phrase that has the same letters going forwards or going backwards. I think it’s cool that “logogol” is itself a logogol.

Text is a radio verb

Like tweet, text follows the pattern of verbs called “verbs of instrument of communication” in Beth Levin’s book, English Verb Classes and Alternations: A Preliminary Investigation. In Levin’s inventory, these verbs include

cable e-mail fax modem netmail phone radio relay satellite sign semaphore signal telecast telegraph telephone telex wire wireless

As I did for tweet, I downloaded tweets from Twitter; I just looked for occurrences of ‘texted’. Not surprisingly, it was harder to find examples of texted than tweet[ed], but I did find a fair number of examples in a short time. In sum, like tweet, text is a “radio” verb.

To show this, I will present each of Levin’s “properties” of radio verbs, along with her example verb cable side-by-side with text, and then several examples of text used in this way. There are two negative properties which require further discussion.

  1. Heather cabled the news./Heather texted the news.
    • Phonetics has taken over my life. I almost texted the word “cute” spelled “kjut”…smh. #nerdtweet
    • @vfcimyourgirlHL I texted it
  2. Heather cabled Sara./Heather texted Sara.
    • Ohh Goddd, I just texted Jamal Robinson haha smhh he’s gonna be like who tha fuckk
    • Lmfao I had accidently texted my tattoo guy :)
  3. Dative Alteration
    1. Heather cabled the news to Sara/Heather texted the news to Sara.
      • @strawberikisz93 Haha. I just texted that to you. Lol. I seriously cant wait. Lol.
      • My moms annoying. She texted this to me: “wat time r u leaving”. Would it kill her to type the fucking word?
    2. Heather cabled Sara the news./Heather texted Sara the news.
      • @Kirsty_Jedward I just texted you the answer to this question LOL
      • RT @rainnwilson: Brett Favre texted me explicit pictures of his enormous ego. (This was heavily retweeted)
  4. *Heather cabled to Sara. (See below)
  5. *Heather cabled the news at Sara. (See below)
  6. Heather cabled Sara about the situation./ Heather texted Sara about the situation.
    • Ok, that strange phone number that texted me about the #OKC #earthquake was a friend of mine (new cell number). I assume he’s okay.
    • @Trap_Legend I texted you about a photographer hit me back she’s really good n located here
  7. Sentential Complement with Optional Goal Object
    1. Heather cabled (Sara) that the party would be tonight. / Heather texted (Sara) that the party would be tonight.
    2. Heather cabled (Sara) when to send the package. / Heather texted (Sara) when to send the package.
    3. Heather cabled (Sara) to come. / Heather texted (Sara) to come.
      • Landlord texted mom that she owes $2000 or we are evicted. Ignoring the obvious that its illegal
      • Lauren just texted me to say that she’s at the park watching a corgi go down the slide over & over and didn’t include a photo #notacceptable
  8. Sentential Complement with Optional Goal _To_ Phrase
    1. Heather cabled (to Sara) that the party would be tonight. / Heather texted (to Sara) that the party would be tonight.
    2. Heather cabled (to Sara) when to send the package / Heather texted (to Sara) when to send the package
    3. Heather cabled (to Sara) to come. / Heather texted (to Sara) to come.
      • Texted coach I’m gunna b late got stopped he replies ‘Did you eye batt?” Lol
  9. Heather cabled for Sara to come./ Heather texted for Sara to come.
    • you couldve texted me back to say ok…smh.
    • @immaELAYEpeace you never texted me shirley to come over!
  10. Direct Speech
    1. Heather cabled (Sarah), “Come immediately.” / Heather texted (Sarah), “Come immediately.”
      • My mom thinks “LOL” means “Lots Of Love”. She texted me, “Your grandma had just died. LOL”
    2. Heather cabled (to Sarah), “Come immediately.” / Heather texted (to Sarah), “Come immediately.”
      • My moms annoying. She texted this to me: “wat time r u leaving”. Would it kill her to type the fucking word?
  11. Parenthetical Use of the Verb
    Given the informal register of most Twitter messages, I did not find any examples of parenthetical uses.

    1. The winner, Heather cabled (Sarah), would be announced tonight. / The winner, Heather texted (Sarah), would be announced tonight.

    2. The winner, Heather cabled (to Sarah), would be announced tonight. / The winner, Heather texted (to Sarah), would be announced tonight.
  12. Zero-related Nominal: a cable / a text (from a direct search)
    • @DrewSmooth did u get my text from last weekend
    • In the US, the average 13- to 17-year-old sends and receives 3,339 texts a month—more than 100 per day http://j.mp/bG1eEp
    • Okay. I woke up with no texts….. What happened? Usually i have about 16 not 0.

Regarding the negative cases (574. Heather cabled/texted to Sarah, 575. Heather cabled/texted at Sarah), I found examples of the former. Here are some examples of “texted to”:

  • @RajaThalyn ugh i texted…couldn’t resist it
  • RT @teensinschool: I wish there was a class where you just sat and talked and texted for a period.

I found no examples of “texted at.” For these, see my discussion of tweet.

Additional similarities

There are other similarities between text, tweet and the other verbs of instrument of communication brought out by the Twitter data.

  1. Use with _back_: Heather texted/cabled Sarah back.
  2. Use with adverbial phrases of frequency: Heather texted/cabled Sarah too much.
  3. Use with adverbial phrases of duration: Heather hasn’t texted/cabled Sarah for two days.
  4. Use with adverbial phrases of enumeration: Heather texted/cabled Sarah fifty times.
  5. Use with points in time: Heather texted/cabled Sarah at midnight.

Additional notes

The vast majority of tweets I looked at have a pronominal direct object as message recipient: [someone] texted you/him/her/me/u/yu. A rough estimate is that 80% of the tweets in my sample are of this format, and the great majority (95% or so) of these are forms of “me” and “you” (a sketch of how to make such an estimate follows the examples below). For example,

  • @ohuaintknow i texted uu too hoe
  • WWHHOOOOAAA!!!! dis chick jus texted me nd said “Yu wanna take me to halloween horror nights?” uhhh no bitch lmao
  • @KolorfulKisses2 I had texted you to tell ya
  • @puckzilla19 Well Santana’s mom texted me she is a sleep. Can I help with Rachel?
  • @ILoveTashae i texted you faggot
  • damn , like 4 ppl just texted me & asked what am I doing . . O_o
  • I THINK Taj texted me because I recognized her area code.
  • My mom just texted me: Hiiiii LOLOLOLOLOLOL!!!
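
As promised above, here is a rough way to make this kind of estimate with the tools from my earlier posts (a sketch; tweets.txt, one tweet per line, is a stand-in for my actual data file):

grep -icE 'texted +(you|him|her|me|u|yu)' tweets.txt

Dividing that count by the number of lines containing ‘texted’ at all gives the rough percentage.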

Summary

So, I would like to suggest that, like tweet, text is a radio verb; that is, one of Levin’s “Verbs of Instrument of Communication.” More analysis is required here, as for tweet. But if you have any comments, please write me below.

Clichés used in an Oxford Dictionary anti-clichés article

We all hate clichés. As this article on avoiding clichés from the Oxford Dictionaries says, they’re not always possible to avoid.

Here are some clichés used in the article’s text. These are not examples they give, but clichés they use in the body text:

  • Once you’ve spotted a cliché…
  • they’ve lost their impact
  • [they’ve] become stale
  • Some people just tune out
  • make a point [they may miss the point that you’re trying to make]
  • use [something] as a starting point
  • indispensable advice

At least they didn’t say “avoid clichés like the plague,” the clichéd anti-cliché joke.

Not surprisingly, the past tense of “tweet” is “tweeted”

I have 100,000 tweets which were sent on 25 March 2010. Of these, 2,522 had a token which matched the pattern ‘tw*t[*]’, which collected forms like ‘tweeted’ and ‘twittered’ (and forms like ‘twentysomething’ and ‘twilight’). I scanned these and found relatively clear uses of a past tense form of ‘tweet’ or ‘twitter’ (as a verb); there were only 21 of these. Of these 21, 20 were ‘tweeted,’ and the other was ‘twittered’ (it was actually ‘twittrd’, but I take that to be a misspelling). No ‘strong’ forms (twote, twitted, twat) emerged, as some have suggested they might.

Although I didn’t count the number of uses of ‘twitter’ as a verb, I didn’t see many instances in my quick scan. Based on this data–and more data really is needed–the past tense of ‘tweet’ is ‘tweeted.’
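
For the curious, this kind of search is easy to approximate with the Unix tools I have used in other posts. A sketch, assuming one tweet per line in a file tweets.txt, with a regular expression that is my stand-in for the pattern above:

tr ' ' '\n' < tweets.txt | grep -iE '^tw.+t' | sort | uniq -c | sort -rn | head -40

The top of that list is short enough to scan by hand for true past-tense forms.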