Will.Whim

A weblog by Will Fitzgerald

Monthly Archives: October 2010

Chris Biemann

Herr Doktor Biemann’s own graphical models
are ausgezeichnet,
If you are surmising that yours will do better
Likeyanknowjus  forgeddaboudit.

 

Some Bing N-gram notes on tokenization

Brendan O’Connor asked me on Twitter what I knew about Bing N-gram tokenization conventions, and I said I would ask someone who knew. The N-gram team plans a future blog post on this, but here are some things I was told that I could share (and I quote):

Here are a few things we can share right now:

  • Segment boundaries use the special symbols <s> and </s>, thus P(“<s> hello there </s>”) != P(“hello there”).
  • The GetConditionalProbability uses the last space character as the ‘word’ boundary, even if internally there are multiple tokens represented in that last ‘word’.  That is, GCP(“shown as-is”) is P(“as-is”|“shown”), not P(“is”|“shown as”).
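
To make the two conventions concrete, here is a minimal sketch in Ruby. It does not call the Bing service at all; `split_for_conditional` and `with_segment_markers` are hypothetical helper names that only mirror the rules quoted above (split at the last space, and wrap a segment in <s> and </s>).

```ruby
# Hypothetical helpers, not part of the Bing N-gram service.
# They only illustrate the two conventions quoted above.

# Split a phrase at its LAST space: everything before it is the context,
# everything after it is the 'word' being predicted.
def split_for_conditional(phrase)
  context, _space, word = phrase.rpartition(" ")
  { context: context, word: word }
end

# Add the segment boundary symbols around a phrase.
def with_segment_markers(phrase)
  "<s> #{phrase} </s>"
end

p split_for_conditional("shown as-is")
p with_segment_markers("hello there")
```

Under this reading, “shown as-is” is scored as the word “as-is” given the context “shown”, and a phrase with no space at all has an empty context.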

 

New Bing NGram data

Bing (my employer, though this comes from a different area than mine) has announced new publicly available NGram data, current to April 2010. It includes 1- through 5-grams for title, anchor and body streams (that is, HTML page titles, text in anchor links, and overall HTML body text).

Tech notes to self: unix sort on a specific field

given a tab separated file, want to sort on field n:

cat file.tsv | sort --key=n -t' '

with ‘Ctrl-V Tab’ between the quotes after -t (a literal tab character). In bash, -t$'\t' does the same thing without the literal tab.

Hey, and use -g for ‘general numeric’ sorting! It understands scientific notation, etc.
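
For when the data is already inside a script, here is a rough Ruby equivalent of sorting on field n with ‘general numeric’ ordering. This is a sketch, not a reimplementation of sort: `numeric_key` and `sort_tsv_on_field` are names made up for this note, and putting non-numeric fields first is an assumption of the sketch.

```ruby
# Turn a field into a sort key. Float() understands scientific notation
# such as "1e3". Non-numeric or missing fields sort first (an assumption).
def numeric_key(field)
  Float(field)
rescue ArgumentError, TypeError
  -Float::INFINITY
end

# Sort tab-separated lines on the 1-based field n.
def sort_tsv_on_field(lines, n)
  lines.sort_by { |line| numeric_key(line.chomp.split("\t", -1)[n - 1]) }
end

rows = ["b\t1e3", "a\t2.5", "c\t-7"]
puts sort_tsv_on_field(rows, 2)
```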

Text is a radio verb

Like tweet, text follows the pattern of verbs called “verbs of instruments of communication” in Beth Levin’s book, English Verb Classes and Alternations: A Preliminary Investigation. In Levin’s inventory, these verbs include

cable e-mail fax modem netmail phone radio relay satellite sign semaphore signal telecast telegraph telephone telex wire wireless

As I did for tweet, I downloaded tweets from Twitter; I just looked for occurrences of ‘texted’. Not surprisingly, it was harder to find examples of texted than tweet[ed], but I did find a fair number of examples in a short time. In sum, like tweet, text is a “radio” verb.

To show this, I will present each of Levin’s “properties” of radio verbs, along with her example verb cable side-by-side with text, and then several examples of text used in this way. There are two negative properties which require further discussion.

  1. Heather cabled the news./Heather texted the news.
    • Phonetics has taken over my life. I almost texted the word “cute” spelled “kjut”…smh. #nerdtweet
    • @vfcimyourgirlHL I texted it
  2. Heather cabled Sara./Heather texted Sara.
    • Ohh Goddd, I just texted Jamal Robinson haha smhh he’s gonna be like who tha fuckk
    • Lmfao I had accidently texted my tattoo guy :)
  3. Dative Alternation
    1. Heather cabled the news to Sara/Heather texted the news to Sara.
      • @strawberikisz93 Haha. I just texted that to you. Lol. I seriously cant wait. Lol.
      • My moms annoying. She texted this to me: “wat time r u leaving”. Would it kill her to type the fucking word?
    2. Heather cabled Sara the news./Heather texted Sara the news.
      • @Kirsty_Jedward I just texted you the answer to this question LOL
      • RT @rainnwilson: Brett Favre texted me explicit pictures of his enormous ego. (This was heavily retweeted)
  4. *Heather cabled to Sara. (See below)
  5. *Heather cabled the news at Sara. (See below)
  6. Heather cabled Sara about the situation./ Heather texted Sara about the situation.
    • Ok, that strange phone number that texted me about the #OKC #earthquake was a friend of mine (new cell number). I assume he’s okay.
    • @Trap_Legend I texted you about a photographer hit me back she’s really good n located here
  7. Sentential Complement with Optional Goal Object
    1. Heather cabled (Sara) that the party would be tonight. / Heather texted (Sara) that the party would be tonight.
    2. Heather cabled (Sara) when to send the package. / Heather texted (Sara) when to send the package.
    3. Heather cabled (Sara) to come. / Heather texted (Sara) to come.
      • Landlord texted mom that she owes $2000 or we are evicted. Ignoring the obvious that its illegal
      • Lauren just texted me to say that she’s at the park watching a corgi go down the slide over & over and didn’t include a photo #notacceptable
  8. Sentential Complement with Optional Goal _To_ Phrase
    1. Heather cabled (to Sara) that the party would be tonight. / Heather texted (to Sara) that the party would be tonight.
    2. Heather cabled (to Sara) when to send the package / Heather texted (to Sara) when to send the package
    3. Heather cabled (to Sara) to come. / Heather texted (to Sara) to come.
      • Texted coach I’m gunna b late got stopped he replies ‘Did you eye batt?” Lol
  9. Heather cabled for Sara to come./ Heather texted for Sara to come.
    • you couldve texted me back to say ok…smh.
    • @immaELAYEpeace you never texted me shirley to come over!
  10. Direct Speech
    1. Heather cabled (Sarah), “Come immediately.” / Heather texted (Sarah), “Come immediately.”
      • My mom thinks “LOL” means “Lots Of Love”. She texted me, “Your grandma had just died. LOL”
    2. Heather cabled (to Sarah), “Come immediately.” / Heather texted (to Sarah), “Come immediately.”
      • My moms annoying. She texted this to me: “wat time r u leaving”. Would it kill her to type the fucking word?
  11. Parenthetical Use of the Verb
    Given the informal register of most Twitter messages, I did not find any examples of parenthetical uses.

    1. The winner, Heather cabled (Sarah), would be announced tonight. / The winner, Heather texted (Sarah), would be announced tonight.
    2. The winner, Heather cabled (to Sarah), would be announced tonight. / The winner, Heather texted (to Sarah), would be announced tonight.
  12. Zero-related Nominal: a cable / a text (from a direct search)
    • @DrewSmooth did u get my text from last weekend
    • In the US, the average 13- to 17-year-old sends and receives 3,339 texts a month—more than 100 per day http://j.mp/bG1eEp
    • Okay. I woke up with no texts….. What happened? Usually i have about 16 not 0.

Regarding the negative cases (574. Heather cabled/texted to Sarah, 575. Heather cabled/texted at Sarah), I found examples of the former. Here are some examples of “texted to”:

  • @RajaThalyn ugh i texted…couldn’t resist it
  • RT @teensinschool: I wish there was a class where you just sat and talked and texted for a period.

I found no examples of “texted at.” For these, see my discussion of tweet.

Additional similarities

There are other similarities between text, tweet and the other verbs of instrument of communication brought out by the Twitter data.

  1. Use with _back_: Heather texted/cabled Sarah back.
  2. Use with adverbial phrases of frequency: Heather texted/cabled Sarah too much.
  3. Use with adverbial phrases of duration: Heather hasn’t texted/cabled Sarah for two days.
  4. Use with adverbial phrases of enumeration: Heather texted/cabled Sarah fifty times.
  5. Use with points in time: Heather texted/cabled Sarah at midnight.

Additional notes

The vast majority of tweets I looked at have a pronominal direct object as message recipient: [someone] texted you/him/her/me/u/yu. A rough estimate is that 80% of the tweets in my sample are of this format, and the great majority (95% or so) of these are forms of “me” and “you.” For example,

  • @ohuaintknow i texted uu too hoe
  • WWHHOOOOAAA!!!! dis chick jus texted me nd said “Yu wanna take me to halloween horror nights?” uhhh no bitch lmao
  • @KolorfulKisses2 I had texted you to tell ya
  • @puckzilla19 Well Santana’s mom texted me she is a sleep. Can I help with Rachel?
  • @ILoveTashae i texted you faggot
  • damn , like 4 ppl just texted me & asked what am I doing . . O_o
  • I THINK Taj texted me because I recognized her area code.
  • My mom just texted me: Hiiiii LOLOLOLOLOLOL!!!
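
A rough estimate like this can be checked mechanically. The sketch below is an assumption about how one might count, not how the 80% figure was actually produced: `pronoun_object_share` is a made-up name, the pronoun list is just the forms named above, and the four sample tweets are taken from this post. Note that it cannot tell object “her” from possessive “her” (as in “texted her mom”).

```ruby
# Pronoun forms listed in the post; spelling variants like "uu" are not covered.
PRONOUNS = %w[you him her me u yu].freeze
TEXTED_PRONOUN = /\btexted\s+(#{PRONOUNS.join("|")})\b/i

# Fraction of tweets containing "texted" in which the verb is
# immediately followed by one of the pronoun forms.
def pronoun_object_share(tweets)
  with_verb = tweets.select { |t| t =~ /\btexted\b/i }
  return 0.0 if with_verb.empty?
  hits = with_verb.count { |t| t =~ TEXTED_PRONOUN }
  hits.to_f / with_verb.size
end

sample = [
  "I THINK Taj texted me because I recognized her area code.",
  "@KolorfulKisses2 I had texted you to tell ya",
  "@vfcimyourgirlHL I texted it",
  "Landlord texted mom that she owes $2000 or we are evicted."
]
puts pronoun_object_share(sample)
```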

Summary

So, I would like to suggest that, like tweet, text is a radio verb; that is, one of Levin’s “Verbs of Instrument of Communication.” More analysis is required here, as it is for tweet. But if you have any comments, please write me below.

Clichés used in an Oxford Dictionary anti-clichés article

We all hate clichés. As this article on avoiding clichés from the Oxford Dictionaries says, they’re not always possible to avoid.

Here are some clichés used in the article’s text. These are not examples they give, but clichés they use in the body text:

  • Once you’ve spotted a cliché…
  • they’ve lost their impact
  • [they’ve] become stale
  • Some people just tune out
  • make a point [they may miss the point that you’re trying to make]
  • use [something] as a starting point
  • indispensable advice

At least they didn’t say “avoid clichés like the plague,” the clichéd anti-cliché joke.

Not surprisingly, the past tense of “tweet” is “tweeted”

I have 100,000 tweets which were sent on 25 March, 2010. Of these, 2,522 had a token which matched the pattern ‘tw*t[*]’, which collected forms like ‘tweeted’ and ‘twittered’ (and forms like ‘twentysomething’ and ‘twilight’). I scanned these for relatively clear uses of a past tense form of ‘tweet’ or ‘twitter’ (as a verb); there were only 21 of these. Of these 21, 20 were ‘tweeted,’ and the other was ‘twittered’ (it was actually ‘twittrd’, but I take that to be a misspelling). No ‘strong’ forms (twote, twitted, twat) emerged, as some have suggested they might.

Although I didn’t count the number of uses of ‘twitter’ as a verb, I didn’t see many instances in my quick scan. Based on this data–and more data really is needed–the past tense of ‘tweet’ is ‘tweeted.’
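
The filtering step can be sketched in Ruby. The regular expression is an assumed reading of the pattern ‘tw*t[*]’ (“tw”, anything, “t”, then optionally more); the post does not say what tool did the matching, and `candidate_tokens` and `tally_forms` are names invented for this sketch. The sample tweets are invented too.

```ruby
# Assumed regex equivalent of the glob-like pattern 'tw*t[*]'.
TW_T = /\Atw\w*t\w*\z/i

# Tokens in one tweet that match the pattern.
def candidate_tokens(tweet)
  tweet.scan(/\w+/).select { |tok| tok =~ TW_T }
end

# Count matching forms (lowercased) across many tweets.
def tally_forms(tweets)
  counts = Hash.new(0)
  tweets.each do |t|
    candidate_tokens(t).each { |tok| counts[tok.downcase] += 1 }
  end
  counts
end

# Invented sample data, for illustration only.
p tally_forms(["He tweeted it", "She Tweeted too", "twittrd lol"])
```

As in the post, the pattern over-collects (it also matches ‘twilight’ and ‘twentysomething’), so the candidates still need to be scanned by eye.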

Center embedding in the wild (sort of)

Around 9:10 into Community, Season 2, Episode 3, “The Psychology of Letting Go”:

Annie: Bitter much?

Britta: Say “Bitter much” much?

Annie: Say “Say ‘Bitter much’ much?” much?

Tweet is a Radio Verb

Geoffrey K Pullum, in Tweet this, a blog post at the inestimable Language Log, discusses the syntax of the neologism tweet and engagingly writes:

Twitter merely coined a verb meaning “send a message via Twitter”, but they didn’t specify what linguists call its subcategorization possibilities. They added the verb to the dictionary, but they didn’t specify its grammar. The verb tweet is gradually developing its own syntax according to what it means and what its users regard as its combinatory possibilities. That is a really interesting, though unintended, large-scale natural experiment in how syntactic change works. And it is running right now, every minute of every day.

The suggestion is that the syntactic characteristics of tweet are as yet unknown. This suggestion is taken up by the Economist’s language weblog, Johnson. Because tweet doesn’t pattern as say, write or tell, they suggest, we have a chance to watch linguistic evolution occur right before our eyes.

It would be foolish, of course, to suggest that tweet, in the sense suggested, is not new, and thus “evolutionary.” We didn’t tweet before Twitter, and now we do. But a comment on Pullum’s post by John Lawler suggested that tweet follows the pattern of verbs called “verbs of instruments of communication” in Beth Levin’s inestimable¹ verbal bestiary, English Verb Classes and Alternations: A Preliminary Investigation. In Levin’s inventory, these verbs include

cable e-mail fax modem netmail phone radio relay satellite sign semaphore signal telecast telegraph telephone telex wire wireless²

One way to think of these is as having the meaning “send a message via an x”, where x is the noun form of the verb. So, cable means send a message via a cable, fax means send a message via a fax, etc. Hmm, this looks familiar. A message sent via Twitter is a tweet, of course; so to tweet means send a message via a tweet.

The question is: does linguistic evidence support this? To begin to answer this, I downloaded a lot of tweets from Twitter. Not surprisingly, this is a good source of the use of tweet in the sense required. In sum, there is a lot of evidence that it has the syntactic properties of a “radio” verb, plus a few special features of its own.

To show this, I will present each of Levin’s “properties” of radio verbs, along with her example verb cable side-by-side with tweet, and then several examples of tweet used in this way. There are two negative properties which require further discussion.

  1. Heather cabled the news./Heather tweeted the news.
    • @DeVonna13 That name reminds me of one of my favorite Fleetwood Mac songs called Dreams. The lyrics I tweeted the other night r from it.
    • I don’t recall asking for that information to be tweeted. Grr. Annoying.
    • RT @x2nickjonas2: So I tweeted a lot of things and now they disappeared. :l
  2. Heather cabled Sara./Heather tweeted Sara.
    • Sure ok RT @OGmerv: @MissTasty25 I forgot what I tweeted u last night I was drunk
    • @TVDFANSIRELAND u tweeted the wrong ian lol u tweeted the one with no R !!
    • I take it was because I tweeted you x
  3. Dative Alternation
    1. Heather cabled the news to Sara/Heather tweeted the news to Sara.
      • Wow @Ali_R19 Tweeted Over 350 To @tomthewanted Some Dedicated Person!
      • @geoaubsmom I’m listening now…I think you need to keep tweeting this..has it been tweeted to Dina?
      • @apezz babe, think you might have tweeted that to me by accident instead of writing a reminder to yourself.
    2. Heather cabled Sara the news./Heather tweeted Sara the news.
      • tweet me anything i RT all tweets (but not stupid one’s) ;)
      • Need Spa recommendations for Spa Week? @michellejoni is standing by -just tweet her your city and what treatment you want, darling
  4. *Heather cabled to Sara. (See below)
  5. *Heather cabled the news at Sara. (See below)
  6. Heather cabled Sara about the situation./ Heather tweeted Sara about the situation.
    • Somebody thinks Tanika makes fake IDs then she tweeted her about it lol FEDS gon get y’all
    • @jessica__lasaga I don’t know if you noticed, but LOTS of people tweeted him about his addiction. Her tweet probably wasn’t any different.
  7. Sentential Complement with Optional Goal Object
    1. Heather cabled (Sara) that the party would be tonight. / Heather tweeted (Sara) that the party would be tonight.
    2. Heather cabled (Sara) when to send the package. / Heather tweeted (Sara) when to send the package.
    3. Heather cabled (Sara) to come. / Heather tweeted (Sara) to come.
      • ok. i now hate Jasmine V for a reason. i tweeted her so many times to help us trend #stopchildabuse but she ignored them. heartless bitch
  8. Sentential Complement with Optional Goal _To_ Phrase
    1. Heather cabled (to Sara) that the party would be tonight. / Heather tweeted (to Sara) that the party would be tonight.
    2. Heather cabled (to Sara) when to send the package / Heather tweeted (to Sara) when to send the package
    3. Heather cabled (to Sara) to come. / Heather tweeted (to Sara) to come.
      • @JessKlaibs_1D belle amie girls just tweeted that there not allowed to tweet anymore! :(
      • @MishtotheD Weird cos elvis tweeted that he has gone home??? Im confused haha
  9. Heather cabled for Sara to come./ Heather tweeted for Sara to come.
    • @mspillowtalk60 >I tweeted you the other day to see if you could get discount tickets for Magic Mountain
  10. Direct Speech
    1. Heather cabled (Sarah), “Come immediately.” / Heather tweeted (Sarah), “Come immediately.”
      • I tweeted “Heeeeh .. Going back home.” … then I got a reply on my tweet from you says “Add what ??” that’s it.
      • What would you think if someone tweeted “Using the potty.”?? Lol.
      • RT @BiebsSexySupras: #BieberFact justin once tweeted ‘SHAKE THAT LAFFY TAFFY’ but deleted caus he recieved ALOT of tweets by us pervy be …
    2. Heather cabled (to Sarah), “Come immediately.” / Heather tweeted (to Sarah), “Come immediately.”
      • #bieberfact: 1014 He tweeted once to Usher: “I’m sure it’s illegal to make love in the club”! RT if you want to do something illegal with J.
  11. Parenthetical Use of the Verb
    Given the informal register of most Twitter messages, I did not find any examples of parenthetical uses. 

    1. The winner, Heather cabled (Sarah), would be announced tonight. / The winner, Heather tweeted (Sarah), would be announced tonight.
    2. The winner, Heather cabled (to Sarah), would be announced tonight. / The winner, Heather tweeted (to Sarah), would be announced tonight.
  12. Zero-related Nominal: a cable / a tweet
    • These type of tweets makes my day.
    • If I am sharing my “first tweet with the world” shouldn’t it be much more profound that this???
    • @YasminTMB your last tweet was the funniest thing I have read all day haha

Regarding the negative cases (574. Heather cabled/tweeted to Sarah, 575. Heather cabled/tweeted at Sarah), I found examples of both. Here are some examples of “tweeted to”:

  • Me & My followers didn’t tweet a lot becuz of some of them haven’t tweeted to me just once so I’ve not even know they exist :P
  • @MidnaBella Amanda tweeted to the wrong person, fewl.
  • @LucyEdwards96 if you look at dans tweets he tweeted to me but i just want to make sure he’s okay and for him to know its from me :) xx

And some examples of “tweeted at”:

  • @silvsthesex no I was jus sayin I hadn’t seen u in my timeline ky33, not knowing u had tweeted at me
  • I tweeted at myself. I’m tired. Meant to tweet @Birdflaps
  • @Geektastic_Tim hey i just saw you tweeted at me last night… i am great. how are you!?

In the “tweet to” case, I’m not that sure that “cabled to/radioed to/faxed to” are that incorrect. Consider these variants (having formalized the register a bit):

  • My friends and I didn’t cable/radio/fax one another a lot—some of them haven’t cabled/radioed/faxed to me even once.
  • Amanda cabled/radioed/faxed to the wrong person.
  • If you look at Dan’s cables/radio messages/faxes, he cabled/radioed/faxed to me, but I just want to be sure he’s okay.

In the “tweet at” examples, I would suggest that when a broadcasting verb is used (that is, a medium of communication which is one-to-many), the “at” construction is more acceptable. “Heather cabled at Sarah” is odd, because a cable is physically delivered to a person. But imagine a large corporation sending scattershot cables or faxes to potential customers. Then, “Gizmatron cabled/faxed at their potential customers” seems more acceptable. The “tweeted at” examples, though, are odd because in each of the cases cited it is clear there is a one-to-one message implied. So, this is perhaps a special syntactic feature of tweet. Still, it is worth noting that Twitter is simultaneously a one-to-one medium and a one-to-many medium. The author may tweet to a specific person, but everyone (or all followers in the normal case) can see the tweet. So, this broadcasty nature of Twitter may allow more flexibility in the use of “tweet at.”

Additional similarities

There are other similarities between tweet and the other verbs of instrument of communication brought out by the Twitter data. Below, I have sanitized the data, but they are all backed by tweet examples. (Some of these have technical names, but I’ve spent too long on this post already.)

  1. Use with _back_: Heather tweeted/cabled Sarah back.
  2. Use with _about_: Heather tweeted/cabled Sarah about her condition.
  3. Use with adverbial phrases of frequency: Heather tweeted/cabled Sarah too much.
  4. Use with adverbial phrases of duration: Heather hasn’t tweeted/cabled Sarah for two days.
  5. Use with adverbial phrases of enumeration: Heather tweeted/cabled Sarah fifty times.
  6. Use with points in time: Heather tweeted/cabled Sarah at midnight.
  7. Usable as a filler in “The Revolution will not be [televised]”: The Revolution will not be tweeted/cabled.
  8. Usable as an adjective: It was the most tweeted/cabled event.
  9. Usable with inter-group reflexives: Heather and Sarah tweeted/cabled each other.

Summary

So, I would like to suggest that tweet is a radio verb; that is, one of Levin’s “Verbs of Instrument of Communication.” More analysis is required, of course—there’s a dissertation lurking in here, I’m sure. But if you have any comments, please write me below. Or feel free to tweet/cable/fax/phone/signal/sms/message/text/wire/email me. Unfortunately, you won’t be able to netmail or satellite me your responses.

¹ I know I’ve just used inestimable for a second time. It’s because, you see, both Language Log and English Verb Classes and Alternations are inestimable. Deal with it. If you care about language, you really need to read Language Log and Beth Levin’s book.

² Some of these verbs are obsolete now, of course, and I don’t recall ever seeing satellite used as a verb, but this doesn’t affect this discussion much.

Computational lexicography on the cheap using Twitter

Let’s say you want to investigate the use of “tweet” as a verb (see “Tweet this” at Language Log), and you want to collect, oh, 10,000 examples or so and do some concordance work, for example:

What iss the most popular question then? Tweet    the answer and hopefully u may only get asked 500 times?
what is there to                         tweet    about this morning?
What is your biggest food weakness?      Tweet    @Thintervention for motivation! #thinterventionG

This is simple to do with a bash command line, perl, Ruby, the Tweetstream gem, and a spreadsheet program (or just plain old grep).

To download 10,000 tweets containing “tweet,” “tweets”, or “tweeting” and save them in a file called “tweet.tweets”:

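> require 'rubygems'
> require 'tweetstream'
> @client = TweetStream::Client.new('user','pass')
> File.open("tweet.tweets", "w+") do |f|
      n = 0
      # track() streams statuses that match any of the keywords
      @client.track('tweet','tweets','tweeting') do |s|
          n += 1
          # ask the client to stop once 10,000 tweets have arrived
          @client.stop if n >= 10000
          # write one tweet per line
          f.puts s.text
      end
  end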

When these are finished downloading, you can tab separate the contexts using perl, and sort on the right context:

> cat tweet.tweets | perl -pe 's/\b(tweet|tweets|tweeting)\b/\t$1\t/gi' | sort -f -k2,3 -t$'\t' > tweets.txt

You can then import this file into your spreadsheet program and slice and dice to your heart’s content.
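
If you would rather stay in Ruby for the concordance step, the perl-and-sort pipeline can be approximated as below. This is a sketch under stated assumptions: `kwic_row`, `concordance` and `format_row` are names made up here, only the first keyword in each tweet is used (the perl one-liner marks every occurrence), and rows are sorted on the right context alone.

```ruby
KEYWORD = /\b(tweet|tweets|tweeting)\b/i

# Returns [left context, keyword, right context] for the first hit, or nil.
def kwic_row(text)
  m = KEYWORD.match(text)
  return nil unless m
  [m.pre_match.strip, m[1], m.post_match.strip]
end

# Build rows for every tweet with a hit, sorted on the right context.
def concordance(texts)
  rows = texts.map { |t| kwic_row(t) }.compact
  rows.sort_by { |_left, _kw, right| right.downcase }
end

# Right-align the left context so the keywords line up in a column.
def format_row(row, width = 40)
  left, kw, right = row
  format("%#{width}s  %-8s %s", left[-width, width] || left, kw, right)
end

tweets = [
  "what is there to tweet about this morning?",
  "Tweet the answer",
  "no keyword here"
]
concordance(tweets).each { |row| puts format_row(row) }
```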

Note: it took longer to write this blog post than it did to collect the data. Analysis to follow, though!