A weblog by Will Fitzgerald

Category Archives: Language

Center embedding in the wild (sort of)

Around 9:10 into Community, Season 2, Episode 3, “The Psychology of Letting Go”:

Annie: Bitter much?

Britta: Say “Bitter much” much?

Annie: Say “Say ‘Bitter much’ much?” much?


Tweet is a Radio Verb

Geoffrey K Pullum, in Tweet this, a blog post at the inestimable Language Log, discusses the syntax of the neologism tweet and engagingly writes:

Twitter merely coined a verb meaning “send a message via Twitter”, but they didn’t specify what linguists call its subcategorization possibilities. They added the verb to the dictionary, but they didn’t specify its grammar. The verb tweet is gradually developing its own syntax according to what it means and what its users regard as its combinatory possibilities. That is a really interesting, though unintended, large-scale natural experiment in how syntactic change works. And it is running right now, every minute of every day.

The suggestion is that the syntactic characteristics of tweet are as yet unknown. This suggestion is taken up by the Economist’s language weblog, Johnson. Because tweet doesn’t pattern as say, write or tell, they suggest, we have a chance to watch linguistic evolution occur right before our eyes.

It would be foolish, of course, to suggest that tweet, in the sense suggested, is not new, and thus “evolutionary.” We didn’t tweet before Twitter, and now we do. But a comment on Pullum’s post by John Lawler suggested that tweet follows the pattern of verbs called “Verbs of Instrument of Communication” in Beth Levin’s inestimable 1 verbal bestiary, English Verb Classes and Alternations: A Preliminary Investigation. In Levin’s inventory, these verbs include

cable e-mail fax modem netmail phone radio relay satellite sign semaphore signal telecast telegraph telephone telex wire wireless 2

One way to think of these is as having the meaning “send a message via an x”, where x is the noun form of the verb. So, cable means send a message via a cable, fax means send a message via a fax, etc. Hmm, this looks familiar. A message sent via Twitter is a tweet, of course; so to tweet means send a message via a tweet.

The question is: does linguistic evidence support this? To begin to answer this, I downloaded a lot of tweets from Twitter. Not surprisingly, this is a good source of the use of tweet in the sense required. In sum, there is a lot of evidence that it has the syntactic properties of a “radio” verb, plus a few special features of its own.

To show this, I will present each of Levin’s “properties” of radio verbs, along with her example verb cable side-by-side with tweet, and then several examples of tweet used in this way. There are two negative properties which require further discussion.

  1. Heather cabled the news./Heather tweeted the news.
    • @DeVonna13 That name reminds me of one of my favorite Fleetwood Mac songs called Dreams. The lyrics I tweeted the other night r from it.
    • I don’t recall asking for that information to be tweeted. Grr. Annoying.
    • RT @x2nickjonas2: So I tweeted a lot of things and now they disappeared. :l
  2. Heather cabled Sara./Heather tweeted Sara.
    • Sure ok RT @OGmerv: @MissTasty25 I forgot what I tweeted u last night I was drunk
    • @TVDFANSIRELAND u tweeted the wrong ian lol u tweeted the one with no R !!
    • I take it was because I tweeted you x
  3. Dative Alternation
    1. Heather cabled the news to Sara/Heather tweeted the news to Sara.
      • Wow @Ali_R19 Tweeted Over 350 To @tomthewanted Some Dedicated Person!
      • @geoaubsmom I’m listening now…I think you need to keep tweeting this..has it been tweeted to Dina?
      • @apezz babe, think you might have tweeted that to me by accident instead of writing a reminder to yourself.
    2. Heather cabled Sara the news./Heather tweeted Sara the news.
      • tweet me anything i RT all tweets (but not stupid one’s) ;)
      • Need Spa recommendations for Spa Week? @michellejoni is standing by -just tweet her your city and what treatment you want, darling
  4. *Heather cabled to Sara. (See below)
  5. *Heather cabled the news at Sara. (See below)
  6. Heather cabled Sara about the situation./ Heather tweeted Sara about the situation.
    • Somebody thinks Tanika makes fake IDs then she tweeted her about it lol FEDS gon get y’all
    • @jessica__lasaga I don’t know if you noticed, but LOTS of people tweeted him about his addiction. Her tweet probably wasn’t any different.
  7. Sentential Complement with Optional Goal Object
    1. Heather cabled (Sara) that the party would be tonight. / Heather tweeted (Sara) that the party would be tonight.
    2. Heather cabled (Sara) when to send the package. / Heather tweeted (Sara) when to send the package.
    3. Heather cabled (Sara) to come. / Heather tweeted (Sara) to come.
      • ok. i now hate Jasmine V for a reason. i tweeted her so many times to help us trend #stopchildabuse but she ignored them. heartless bitch
  8. Sentential Complement with Optional Goal _To_ Phrase
    1. Heather cabled (to Sara) that the party would be tonight. / Heather tweeted (to Sara) that the party would be tonight.
    2. Heather cabled (to Sara) when to send the package / Heather tweeted (to Sara) when to send the package
    3. Heather cabled (to Sara) to come. / Heather tweeted (to Sara) to come.
      • @JessKlaibs_1D belle amie girls just tweeted that there not allowed to tweet anymore! :(
      • @MishtotheD Weird cos elvis tweeted that he has gone home??? Im confused haha
  9. Heather cabled for Sara to come./ Heather tweeted for Sara to come.
    • @mspillowtalk60 >I tweeted you the other day to see if you could get discount tickets for Magic Mountain
  10. Direct Speech
    1. Heather cabled (Sarah), “Come immediately.” / Heather tweeted (Sarah), “Come immediately.”
      • I tweeted “Heeeeh .. Going back home.” … then I got a reply on my tweet from you says “Add what ??” that’s it.
      • What would you think if someone tweeted “Using the potty.”?? Lol.
      • RT @BiebsSexySupras: #BieberFact justin once tweeted ‘SHAKE THAT LAFFY TAFFY’ but deleted caus he recieved ALOT of tweets by us pervy be …
    2. Heather cabled (to Sarah), “Come immediately.” / Heather tweeted (to Sarah), “Come immediately.”
      • #bieberfact: 1014 He tweeted once to Usher: “I’m sure it’s illegal to make love in the club”! RT if you want to do something illegal with J.
  11. Parenthetical Use of the Verb
    Given the informal register of most Twitter messages, I did not find any examples of parenthetical uses. 

    1. The winner, Heather cabled (Sarah), would be announced tonight. / The winner, Heather tweeted (Sarah), would be announced tonight.
    2. The winner, Heather cabled (to Sarah), would be announced tonight. / The winner, Heather tweeted (to Sarah), would be announced tonight.
  12. Zero-related Nominal: a cable / a tweet
    • These type of tweets makes my day.
    • If I am sharing my “first tweet with the world” shouldn’t it be much more profound that this???
    • @YasminTMB your last tweet was the funniest thing I have read all day haha

Regarding the negative cases (4. Heather cabled/tweeted to Sarah, 5. Heather cabled/tweeted at Sarah), I found examples of both. Here are some examples of “tweeted to”:

  • Me & My followers didn’t tweet a lot becuz of some of them haven’t tweeted to me just once so I’ve not even know they exist :P
  • @MidnaBella Amanda tweeted to the wrong person, fewl.
  • @LucyEdwards96 if you look at dans tweets he tweeted to me but i just want to make sure he’s okay and for him to know its from me :) xx

And some examples of “tweeted at”:

  • @silvsthesex no I was jus sayin I hadn’t seen u in my timeline ky33, not knowing u had tweeted at me
  • I tweeted at myself. I’m tired. Meant to tweet @Birdflaps
  • @Geektastic_Tim hey i just saw you tweeted at me last night… i am great. how are you!?

In the “tweet to” case, I’m not that sure that “cabled to/radioed to/faxed to” are that incorrect. Consider these variants (having formalized the register a bit):

  • My friends and I didn’t cable/radio/fax one another a lot—some of them haven’t cabled/radioed/faxed to me even once.
  • Amanda cabled/radioed/faxed to the wrong person.
  • If you look at Dan’s cables/radio messages/faxes, he cabled/radioed/faxed to me, but I just want to be sure he’s okay.

In the “tweet at” examples, I would suggest that when a broadcasting verb is used (that is, a medium of communication which is one-to-many), the “at” construction is more acceptable. “Heather cabled at Sarah” is odd, because a cable is physically delivered to a person. But imagine a large corporation sending scattershot cables or faxes to potential customers. Then, “Gizmatron cabled/faxed at their potential customers” seems more acceptable. The “tweeted at” examples, though, are odd because in each of the cases cited it is clear there is a one-to-one message implied. So, this is perhaps a special syntactic feature of tweet. Still, it is worth noting that Twitter is simultaneously a one-to-one medium and a one-to-many medium. The author may tweet to a specific person, but everyone (or all followers in the normal case) can see the tweet. So, this broadcasty nature of Twitter may allow more flexibility in the use of “tweet at.”
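For anyone who wants to hunt for these two frames in their own pile of tweets, here is a minimal Ruby sketch. It is not the procedure I used; the helper name `frame_counts` is made up for illustration, and the sample lines are tweets quoted in this post. It only catches a preposition that comes directly after the verb, so “tweeted that to me” is deliberately not counted.

```ruby
# Sample lines are tweets quoted in this post; in practice you would read
# them from a file of downloaded tweets instead.
SAMPLE_TWEETS = [
  "@MidnaBella Amanda tweeted to the wrong person, fewl.",
  "I tweeted at myself. I’m tired. Meant to tweet @Birdflaps",
  "@apezz babe, think you might have tweeted that to me by accident",
  "RT @x2nickjonas2: So I tweeted a lot of things and now they disappeared. :l"
]

# Hypothetical helper: tally which preposition ("to" or "at") directly
# follows a form of the verb "tweet".
def frame_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line.scan(/\btweet(?:s|ed|ing)?\s+(to|at)\b/i).flatten.each do |prep|
      counts[prep.downcase] += 1
    end
  end
  counts
end

puts frame_counts(SAMPLE_TWEETS).inspect
```

A pattern this crude will also pick up false positives (“I tweeted at midnight” is a time expression, not an addressee), so the matches still need to be read by a human.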

Additional similarities

There are other similarities between tweet and the other verbs of instrument of communication brought out by the Twitter data. Below, I have sanitized the data, but they are all backed by tweet examples. (Some of these have technical names, but I’ve spent too long on this post already.)

  1. Use with _back_: Heather tweeted/cabled Sarah back.
  2. Use with _about_: Heather tweeted/cabled Sarah about her condition.
  3. Use with adverbial phrases of frequency: Heather tweeted/cabled Sarah too much.
  4. Use with adverbial phrases of duration: Heather hasn’t tweeted/cabled Sarah for two days.
  5. Use with adverbial phrases of enumeration: Heather tweeted/cabled Sarah fifty times.
  6. Use with points in time: Heather tweeted/cabled Sarah at midnight.
  7. Usable as a filler in “The Revolution will not be [televised]”: The Revolution will not be tweeted/cabled.
  8. Usable as an adjective: It was the most tweeted/cabled event.
  9. Usable with inter-group reflexives: Heather and Sarah tweeted/cabled each other.


So, I would like to suggest that tweet is a radio verb; that is, one of Levin’s “Verbs of Instrument of Communication.” More analysis is required, of course—there’s a dissertation lurking in here, I’m sure. But if you have any comments, please write me below. Or feel free to tweet/cable/fax/phone/signal/sms/message/text/wire/email me. Unfortunately, you won’t be able to netmail or satellite me your responses.

1 I know I’ve just used inestimable for a second time. It’s because, you see, both Language Log and English Verb Classes and Alternations are inestimable. Deal with it. If you care about language, you really need to read Language Log and Beth Levin’s book.

2 Some of these verbs are obsolete now, of course, and I don’t recall ever seeing satellite used as a verb, but this doesn’t affect this discussion much.

Computational lexicography on the cheap using Twitter

Let’s say you want to investigate the use of “tweet” as a verb (see “Tweet this” at Language Log), and you want to collect, oh, 10,000 examples or so and do some concordance work, for example:

What iss the most popular question then? Tweet    the answer and hopefully u may only get asked 500 times?
what is there to                         tweet    about this morning?
What is your biggest food weakness?      Tweet    @Thintervention for motivation! #thinterventionG

This is simple to do with a bash command line, perl, Ruby, the Tweetstream gem, and a spreadsheet program (or just plain old grep).

To download 10,000 tweets containing “tweet,” “tweets”, or “tweeting” and save them in a file called “tweet.tweets”:

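require 'rubygems'
require 'tweetstream'

@client = TweetStream::Client.new('user', 'pass')
File.open("tweet.tweets", "w+") do |f|
  n = 0
  # Track the three forms of the verb and write each matching tweet to the file.
  @client.track('tweet', 'tweets', 'tweeting') do |s|
    n += 1
    @client.stop if n >= 10000   # stop streaming once we have 10,000 tweets
    f.puts s.text
  end
end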

When these are finished downloading, you can tab separate the contexts using perl, and sort on the right context:

> cat tweet.tweets | perl -pe 's/\b(tweet|tweets|tweeting)\b/\t$1\t/gi' | sort -f -t $'\t' -k2,3 > tweets.txt

You can then import this file into your spreadsheet program and slice and dice to your heart’s content.
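If you would rather stay in Ruby for the concordance step, the same idea can be sketched as follows. This is an illustration, not what I ran: the `concordance` helper is a made-up name, it only splits on the first keyword in each line, and it sorts on the right context alone rather than on keyword plus right context.

```ruby
# Same keyword pattern as the perl one-liner, case-insensitive.
KEYWORD = /\b(tweet|tweets|tweeting)\b/i

# Hypothetical helper: split each line around the first form of "tweet" and
# return [left context, keyword, right context] triples, sorted
# case-insensitively on the right context. Lines without a keyword are dropped.
def concordance(lines)
  rows = lines.map do |line|
    match = KEYWORD.match(line)
    next unless match
    [match.pre_match.strip, match[1], match.post_match.strip]
  end
  rows.compact.sort_by { |_left, _keyword, right| right.downcase }
end

sample = [
  "what is there to tweet about this morning?",
  "What is your biggest food weakness? Tweet @Thintervention for motivation!",
  "Tweet the answer and hopefully u may only get asked 500 times?"
]

# Print the keyword in a fixed column so the contexts line up.
concordance(sample).each do |left, keyword, right|
  puts format("%40s  %-8s %s", left, keyword, right)
end
```

For 10,000 tweets this runs in well under a second, so there is no real reason to prefer one approach over the other beyond taste.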

Note: it took longer to write this blog post than it did to collect the data. Analysis to follow, though!

Obscene intensification of adjectives (a bit NSFW)

XKCD, a ‘webcomic of romance, sarcasm, math, and language’, presents a hand-drawn chart of the frequency with which ‘fucking adjective’ or ‘adjective as shit’ can be found in web search results.

Well, with the Bing Ngram data, we can provide more exact figures which don’t depend on all of the choice and ranking decisions made by a search engine to include or not a page in a result, and (in particular) the estimate of the number of pages on which the term appears.

So, I can say that ‘fucking free’ is the most common ‘fucking adjective’ pair (though one suspects we’re not talking about free-as-in-speech here); followed by hot, hard, young, hardcore, black, awesome, big, white, old, sexy, good, great, amazing, huge, blonde, naked, asian, nude, sweet, crazy, horny, and hotttttttttt. And the least likely are lap-straked, cortico-hypothalamic, cloven-hooved, cloven-footed, Malayo-Polynesian, most-favored-nation, and neo-Lamarkian.

Science marches on.

Republican Pledge To America (the stock phrase version)

This is the Republican Pledge to America, removing everything but the multiword stock phrases recognized by an (internal Microsoft) phrase detector. It makes a kind of poetry.

free people govern themselves unalienable rights woman can religious liberty man or woman consent of the governed endowed by their creator
first principles through hard declaration of independence destructive of these ends enshrined in the constitution
do not consent of the governed
their values striking down long standing
makes decisions out of touch
rising joblessness
like free our citizens speaking out founding principles common problems
urgent action people cannot
document we our founding keeping faith principles we
original intent tenth amendment united states powers not delegated
promote greater wider opportunity
traditional marriage our american faith based organizations
its actions
fellow citizens join us true faith and allegiance
has declined economic growth families and communities
town halls public squares spoken out phone calls
though these
have imposed backroom deals has supplanted behind closed doors
not enough behavior so have drafted their concerns
clearly different different approach
accountable government fiscal responsibility american values
puts forth powers that be the powers that be
wider opportunity economic recovery
economic uncertainty we offer taking steps
small businesses tax deduction 20 percent red tape tape factory washington dc small business health care
we cannot we offer stop out out of control
common sense our troops roll back government spending balance the budget
sustained effort has occurred occurred over hiring freeze federal employees programs we spending habits fulfilling our irresponsible behavior fannie mae troubled asset relief program
pushing off off our challenges we budget process entitlement programs
remember that president obama health care address our our growing thoroughly discredited we offer common sense lowering costs small businesses coverage they been thoroughly discredited we now know medical liability reform across state lines doctor patient relationship
proposing them ensure there not harder we offer missile defense guantanamo bay local jails act decisively working closely eliminate unnecessary spending above all else state and local
pride and dignity
heavy handed jobless claims our national we need heavy handed approach
economic uncertainty
decision making personal choices free people free market local governments bob mcdonnell one size fits all state and local governments
up against trillion dollar spending bill has made rallying cry has been nine percent far cry people were
dismal results president obama spending billions billions more raise taxes roughly half small business raising taxes economists agree obama is proposing
president john john f. f. kennedy balance the budget
policies such such as marriage penalty household income according to see its deloitte tax llp child tax credit cut in half alternative minimum tax
economic policies have pushed tax increases small businesses businesses must must have months so employers must under their every few months
tax increases spending sprees wake up abandon its turn things around
trillion dollar unemployment rate has climbed january 2009
we cannot working again
all tax tax increases currently scheduled their retirement small businesses middle class families
small businesses tax deduction small business 20 percent
red tape tape factory washington dc de facto the game small businesses cannot properly may harm
small business health care small businesses internal revenue any purchases 1099 reporting has determined the agency ill equipped handle all internal revenue service equipped to handle
spending spree out of control
raise taxes make it easy
fiscal discipline
up against spending habits they promised president obama
discretionary spending each year defense department homeland security has increased result we years our day just fight terrorism department of defense secure our border department of homeland security
summed up ronald reagan president ronald reagan
interest rates
walked away out of control
debt we urgent action spending habits bring down build long challenges we face pay down the debt
act immediately no reason us further wasteful and unnecessary wastes taxpayer money there is no reason
common sense our troops roll back spending spree balancing the budget paying down the debt
discretionary spending common sense last year even more were used growth we cutting discretionary spending
increased its its own small businesses significantly reducing
spending cuts house republicans runaway spending nine weeks save taxpayers
rightly outraged tens of billions once and for all troubled asset relief program
fannie mae freddie mac mortgage companies too many many high afford them
federal hiring hiring freeze small businesses federal employees public sector no longer
once created federal programs go away away even even if problem they never go away addressed this problem root out government waste
budget process focus on challenges we entitlement programs social security these programs reviewing them medicare and medicaid
every hour spent per per minute twice as much
budget projections man woman and child
does all assistance programs local governments non profit funding programs federal domestic assistance non profit organizations
crowding out one level most recent government spending ten years years than several percentage points
one thing health care president obama higher taxes small businesses doctor patient health care reform
up against health care through congress congress have
have announced laying off dropping their health care coast to coast laying off employees health care coverage
chief actuary medicaid services health care congressional budget office nonpartisan congressional budget office
social security law does his plan important thing social security and medicare
health care raise taxes middle class has conceded
chief actuary medicaid services services has has confirmed their current
million americans drop their their current forced to acknowledge
overwhelmingly opposed president obama health care taxpayer funds
health care
health care
health care care we take action will immediately take action
liability insurance rates have have distorted protect themselves often referred common sense lower costs medical liability reform medical liability insurance
health insurance into those those plans health care state in across state lines health insurance plans health care coverage
savings accounts health insurance health care these savings purchase over making it easier over the counter high deductible health plans
relationship we health care common sense doctor patient relationship
ensure access health care just because lower premiums number of uninsured
insurance coverage hyde amendment other instances health care health care providers
regulations have
employees may taxes levied health care billions of dollars
health care tax increases goods and services
health care committee have joint economic committee
often ignored backroom deals congress have than under nancy pelosi speaker nancy pelosi
all so once and for all
up against does whatever her tenure speaker pelosi the letter house rules while ignoring within her her own democratic leaders wrong direction using various despite having democratic majority since 1993 considered under house of representatives letter and spirit
health care speaker pelosi louise slaughter publicly discussed house democrats chairwoman louise slaughter
key provisions
americans believe released earlier this year
just plain house republicans real time highest priorities what goes on house of representatives first of its kind
before coming coming up legislation should interested parties no more hiding
adhere to too long too often has allowed massive deficit require each constitutional authority by what authority lack of respect
behavior so let any make it easier democrat or republican
legislative issues time we instead we one at a time
up significantly wasting time considered under so far
only two
relied heavily bring any
all that served us ensure our government has never apologize support our troops around the world
new york fort hood times square these attacks new york city ready and willing new york city subway
does not president dwight d. eisenhower
border security not just national security just war
provide our our troops more troop held up pork barrel
america we president obama his administration guantanamo bay fight against terrorist plots guantanamo bay detainees
foreign terrorists american citizens they have u.s. military such as military intelligence law enforcement
missile defense ballistic missiles intercontinental ballistic missiles
iranian regime harm our its own has declared nuclear capability sanctions against iran
take action enforcing our border patrol immigration laws illegal immigration drug cartels means we we need law enforcement all federal secure our borders mexican drug cartels enforce our immigration laws all hands on deck
christmas day homeland security visa applications after having department of homeland security
our founders the servant people not checks and balances
elected representatives bills passed dollars spent
quite different people went went there they sought their strong has over helped create margaret thatcher sense of purpose
our principles more accountable working again stop out health care we will stand out of control
time we against any
transparency and accountability
health care
union bosses
built through our constitution people we 111th congress bring these these reforms ask all good will our beliefs men and women
speak out

Avoid like the plaque

I noticed the expression “avoid like the plaque” showing up in some search results. This is “plaque” with-a-queue, not “plague”-with-a-gee. This seems like an odd misspelling to me, and I was curious how often it occurs. The Bing Ngram data gives us the tools: given “avoid like the”, the word “plague” occurs 98.68% of the time (in the web body data). The word “plaque” is the second most common word, occurring 0.13% of the time.

Here are the top words and percentages:

plague: 98.68%, </s>: 0.48%, plaque: 0.13%, pl: 0.12%, great: 0.07%, swine: 0.06%, velvet: 0.04%, black: 0.02%, proverbial: 0.02%, bubonic: 0.02%, bubolic: 0.02%, ten: 0.01%, plagued: 0.01%, plauge: 0.01%, bird: 0.01%, nuclear: 0.01%, blight: 0.01%, clap: 0.01%, lague: 0.01%, plagu: 0.01%, pleague: 0.01%

</s> means “end of sentence.”

Using Ngram data for segmentation

I’ve updated the Microsoft NGram Ruby library to provide an example use: segmenting Twitter hashtags. Twitter hashtags have been used for some time to tag tweets according to users’ choice and whimsy. Coincidentally, my daytime boss has just conducted an interview with William Morgan–formerly of Powerset, but now at Twitter–about hash tags, in case you want to come up to speed on what they are. It’s a fun read in any case, including the origin story of hash tags.

Anyway, the Ruby library allows you to segment text; it uses the Bing unigram and bigram data to guess the most likely segmentation. Here are some hashtags in my timeline from today, and their segmentations:

#  > segment("bpcares")
#  => ["bp", "cares"]
#  > segment("Twitter")
#  => ["Twitter"]
#  > segment("writers")
#  => ["writers"]
#  > segment("iamwriting")
#  => ["i", "am", "writing"]
#  > segment("backchannel")
#  => ["back", "channel"]
#  > segment("tcot")
#  => ["tcot"]
#  > segment("vacationfallout")
#  => ["vacation", "fall", "out"]

The code closely follows Peter Norvig’s discussion of segmentation in his chapter “Natural Language Corpus Data” in the book Beautiful Data. The only differences are (1) using the Web-based data for unigram and bigram data, and (2) a small optimization (perhaps) of creating splits on text only when the first part of the split reaches a certain probability threshold. ([“vacationf” “allout”] is not a good split for “vacationfallout” because “vacationf” is very unlikely.)
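To make the threshold idea concrete, here is a stripped-down sketch of that style of segmenter. It is not the library’s code: it uses unigrams only, the probabilities in `TOY_PROBS` are invented for illustration rather than taken from the Bing data, and the two constants are assumed values picked to make the example work.

```ruby
# Toy unigram probabilities, invented for illustration only; the real
# library asks the Bing n-gram service instead.
TOY_PROBS = {
  "vacation" => 1e-5, "fall" => 1e-4, "out" => 1e-3, "fallout" => 1e-8,
  "i" => 1e-2, "am" => 1e-3, "writing" => 1e-4
}
UNSEEN_LOG_PROB = -12.0  # assumed base penalty for strings not in the table
SPLIT_THRESHOLD = -9.0   # assumed cutoff: skip splits whose first part is too unlikely

def log_prob(word)
  prob = TOY_PROBS[word]
  # Unknown strings get less likely the longer they are.
  prob ? Math.log10(prob) : UNSEEN_LOG_PROB - word.length
end

# Returns [words, total log10 probability] for the best segmentation found.
# The memo hash caches results for suffixes that are reached more than once.
def segment(text, memo = {})
  return [[], 0.0] if text.empty?
  memo[text] ||= (1..text.length).map { |i|
    first, rest = text[0, i], text[i..-1]
    # The threshold test: do not recurse when the first part is very unlikely.
    # The unsplit whole is always kept so there is at least one candidate.
    next if log_prob(first) < SPLIT_THRESHOLD && !rest.empty?
    words, score = segment(rest, memo)
    [[first] + words, log_prob(first) + score]
  }.compact.max_by { |_words, score| score }
end

puts segment("vacationfallout").first.inspect
puts segment("iamwriting").first.inspect
```

With these toy numbers, “vacationf” falls below the cutoff and is never explored, which is exactly the pruning described above. How much work that saves on real data depends on where the threshold is set.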

The code would be better if it batched the calls for probabilities rather than requesting them one-by-one. Norvig’s code also has the advantage of running off-line. The Bing data is more recent.

Anyway, enjoy the code: it can be found in my GitHub account:


It requires (in addition to the microsoft_ngram library and its dependencies) the memoize gem.

Ruby project to access Microsoft Ngram data

I am pleased to announce the availability of a Ruby library to access the Microsoft Ngram data. This data currently includes 1,2,3,4 gram data for anchor text, 1,2,3 gram data for body text, 1,2,3,4 gram data for page titles, and 1,2,3 gram data for queries, collected in June 2009. See the Bing/MSR Ngram data page for general information. Although I am a Microsoft employee, this software is provided by me, not Microsoft.

Microsoft provides a SOAP API, and a Python REST-based library, but this is (I think) the first Ruby library. You can get a copy at Github.com: http://github.com/willf/microsoft_ngram.

I hope, in the days to come, to write some example programs that show the power of this data resource for research. But, to whet the appetite, should we parse “Boston cream pie” as [Boston [cream pie]] or [[Boston cream] pie]? That is, a cream pie made in the Boston style, or a pie made with Boston cream? If the former, “cream pie” should be more frequent than “Boston cream”; if the latter, the opposite.

> MicrosoftNgram.new(:model => 'bing-body/jun09/2').jps(['boston cream', 'cream pie'])
=> [["boston cream", -7.231685], ["cream pie", -6.027882]]
These are log probabilities. This model suggests [Boston [cream pie]] is the correct bracketing.
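Log probabilities are easier to compare once the difference is turned back into a ratio. The sketch below does that for the two scores above. It assumes the scores are base-10 logs, which is my assumption here and worth checking against the service documentation; `likelihood_ratio` is just an illustrative helper name.

```ruby
# Scores reported above for the two bigrams.
BOSTON_CREAM = -7.231685
CREAM_PIE    = -6.027882

# Assumption: the scores are base-10 logs. If they turn out to be natural
# logs, pass Math::E as the base instead.
def likelihood_ratio(log_a, log_b, base = 10.0)
  base**(log_a - log_b)
end

ratio = likelihood_ratio(CREAM_PIE, BOSTON_CREAM)
puts format('"cream pie" is about %.1f times as likely as "boston cream"', ratio)
```

Under the base-10 assumption, “cream pie” comes out roughly sixteen times as likely as “boston cream”, a comfortable margin in favor of [Boston [cream pie]].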


I enjoyed reading the Wikipedia page about its “lamest edit wars.” One of these edit wars was whether the article on what Americans call “aluminum” and what Brits call “aluminium” should have, as its fundamental title, “Aluminum,” or “Aluminium.” And, one of the arguments presented in favo(u)r of “Aluminum” was that more Google hits are available for the US spelling than the UK spelling. “Ghits” are notoriously unreliable (as are Bing hits and those of the other search engines), since the number of search results reported is subject to lots of factors, none of which is tied directly to the actual number of documents returned.

However, Bing (my employer) has recently provided programmatic access to its data on ngrams (frequency statistics based on the number of word tokens) found on web pages, query logs and anchor text (the data inside links). And I can safely say that the US spelling is much more frequently used. Here is the actual data, based on the June 2009 data release:

Source       P(Aluminum)  P(Aluminium)  Ratio US:UK
Body text    0.00852      0.00487       1.76
Anchor text  0.00727      0.00426       1.70
Query text   0.00974      0.00483       2.01

So, as a data point: “aluminum” is roughly 1.7 to 2 times as frequent as “aluminium” on the Web.

“buy a house, sell a home?” redux

A while ago, I wrote about whether there was evidence that people tend to “buy a home and sell a house,” and used Google’s n-gram data (based on web documents) to suggest this wasn’t the case. I happened to come across this post today, and wondered whether some of the query streams I have access to now might say something to this. I looked at how frequently “sell” or “buy” co-occur with “house” or “home” in a stream of about 36 million queries.

buy & home : 393
buy & house: 525

sell & home: 396
sell & house: 420

buy & home/buy & house: 0.75
sell & home/sell & house: 0.94
(buy & home + sell & home)/(buy & house + sell & house) : 0.83

Unlike the n-gram data, people are more likely to use “house” in queries that include “buy” or “sell,” (taken separately, or taken together). This may indicate that people searching for information on real estate tend to use “house,” while people advertising real estate tend to use “home” (“sell a home” was over 7.5x more likely than “sell a house” in the document-based ngram data). As far as people searching goes, they tend to “buy a house” and “sell a house.”
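The ratios are simple enough to recompute from the four counts. Here is a small sketch that does so; the counts are the ones listed above, and `home_to_house` is an illustrative helper name rather than anything I actually ran against the query stream.

```ruby
# Query co-occurrence counts from the figures above.
COUNTS = {
  ["buy", "home"]  => 393, ["buy", "house"]  => 525,
  ["sell", "home"] => 396, ["sell", "house"] => 420
}

# Ratio of "home" queries to "house" queries for the given verbs.
# A value below 1.0 means "house" is the more common word.
def home_to_house(verbs)
  home  = verbs.sum { |verb| COUNTS[[verb, "home"]] }
  house = verbs.sum { |verb| COUNTS[[verb, "house"]] }
  home.to_f / house
end

[["buy"], ["sell"], ["buy", "sell"]].each do |verbs|
  puts format("%-10s home/house = %.2f", verbs.join(" + "), home_to_house(verbs))
end
```

All three ratios come out below 1.0, which is the whole point: in queries, “house” wins for both verbs. With counts in the hundreds, though, the gap for “sell” is small enough that I would not lean on it very hard.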