A weblog by Will Fitzgerald

Monthly Archives: September 2010

Obscene intensification of adjectives (a bit NSFW)

XKCD, a ‘webcomic of romance, sarcasm, math, and language’, presents a hand-drawn chart of how often ‘fucking adjective’ or ‘adjective as shit’ turns up in web search results.

Well, with the Bing Ngram data, we can provide more exact figures, ones that don’t depend on a search engine’s decisions about which pages to include in a result and how to rank them, or (in particular) on its estimate of the number of pages on which a term appears.

So, I can say that ‘fucking free’ is the most common ‘fucking adjective’ pair (though one suspects we’re not talking about free-as-in-speech here); followed by hot, hard, young, hardcore, black, awesome, big, white, old, sexy, good, great, amazing, huge, blonde, naked, asian, nude, sweet, crazy, horny, and hotttttttttt. And the least likely are lap-straked, cortico-hypothalamic, cloven-hooved, cloven-footed, Malayo-Polynesian, most-favored-nation, and neo-Lamarkian.

Science marches on.

Atheists and agnostics know more about religion than the religious people. So what?

The Pew Forum on Religion and Public Life released a report on how much Americans know about some religious facts; they have an online quiz version of their survey that’s making the rounds.

NPR reported on this in their story “Survey: Atheists, Agnostics Know More About Religion Than Religious,” and the LA Times did as well in their story “Atheists, agnostics most knowledgeable about religion, survey says.” For some reason, the actual report from the Pew Forum is unavailable as I write, but many of the details can be found at the online quiz.

As their headlines indicate, both NPR and the LA Times find it remarkable that non-religious people know more answers to these questions about religious figures and demographics than religious people do. However, I find this completely unremarkable. Even a cursory reading of the details indicates that doing well on this test strongly correlates with educational level: the more education you have, the better you’ll do on this quiz. But this is also true of non-belief: the more education you have, the less likely you are to hold religious beliefs. “White Evangelical Protestants” (what most educated people think of when they think about religious Americans, I think) actually do better than average. The religious groups with the least education–Black Protestant (read: Black) and Latino Catholic (read: Latino)–do the worst. The religious group with the most education, Jews, does the best (I don’t know if Pew broke out non-theistic Jews from “Jews”).

One of the questions asked (in the online version) is which religious figure is most associated with the Great Awakening. This was an incredibly important series of events and movements in US history, and it has affected the shape of Protestant and other Christian practice and belief up to the present day. But very few Protestants that I know consider this figure anything more than a historical figure; he’s a matter of history, not religious belief or practice. If you’ve gone to a lot of school, you’re more likely to have run across his name (of the 15 online questions, this is the least well known). Do you know the answer? If so, it’s not unlikely you learned it in a (college level) history or American religion class.

Another one of the questions asks on what day the Jewish Sabbath begins. Most people get this wrong (except Jews, who live–or have the memory of living–in the duality of the Jewish and Western calendars). But this feels like a bit of a trick question, requiring a secular or Western Christian understanding of the week. I think “Saturday” is a perfectly reasonable correct answer to this question. I got it “correct,” but this seems to indicate more about my test-taking skills (watch out for trick questions!) than my nuanced understanding of my co-religionists.

In the online quiz, the only especially interesting result is the appalling misunderstanding about whether a teacher can read from the Bible “as literature.” On average, only 25% know this is just fine (in general, getting this right correlates positively with education, but no group gets it right more than 50% of the time).

Education correlates negatively with religious beliefs: this is not a new finding. In fact, it’s found right in the Bible (sort of), when Paul writes to an early church in Corinth:

Consider your own call … not many of you were wise by human standards, not many were powerful, not many were of noble birth (1 Corinthians 1:26).

It’s disappointing that NPR and the LA Times wrote such smug articles.

Republican Pledge To America (the stock phrase version)

This is the Republican Pledge to America, removing everything but the multiword stock phrases recognized by an (internal Microsoft) phrase detector. It makes a kind of poetry.

free people govern themselves unalienable rights woman can religious liberty man or woman consent of the governed endowed by their creator
first principles through hard declaration of independence destructive of these ends enshrined in the constitution
do not consent of the governed
their values striking down long standing
makes decisions out of touch
rising joblessness
like free our citizens speaking out founding principles common problems
urgent action people cannot
document we our founding keeping faith principles we
original intent tenth amendment united states powers not delegated
promote greater wider opportunity
traditional marriage our american faith based organizations
its actions
fellow citizens join us true faith and allegiance
has declined economic growth families and communities
town halls public squares spoken out phone calls
though these
have imposed backroom deals has supplanted behind closed doors
not enough behavior so have drafted their concerns
clearly different different approach
accountable government fiscal responsibility american values
puts forth powers that be the powers that be
wider opportunity economic recovery
economic uncertainty we offer taking steps
small businesses tax deduction 20 percent red tape tape factory washington dc small business health care
we cannot we offer stop out out of control
common sense our troops roll back government spending balance the budget
sustained effort has occurred occurred over hiring freeze federal employees programs we spending habits fulfilling our irresponsible behavior fannie mae troubled asset relief program
pushing off off our challenges we budget process entitlement programs
remember that president obama health care address our our growing thoroughly discredited we offer common sense lowering costs small businesses coverage they been thoroughly discredited we now know medical liability reform across state lines doctor patient relationship
proposing them ensure there not harder we offer missile defense guantanamo bay local jails act decisively working closely eliminate unnecessary spending above all else state and local
pride and dignity
heavy handed jobless claims our national we need heavy handed approach
economic uncertainty
decision making personal choices free people free market local governments bob mcdonnell one size fits all state and local governments
up against trillion dollar spending bill has made rallying cry has been nine percent far cry people were
dismal results president obama spending billions billions more raise taxes roughly half small business raising taxes economists agree obama is proposing
president john john f. f. kennedy balance the budget
policies such such as marriage penalty household income according to see its deloitte tax llp child tax credit cut in half alternative minimum tax
economic policies have pushed tax increases small businesses businesses must must have months so employers must under their every few months
tax increases spending sprees wake up abandon its turn things around
trillion dollar unemployment rate has climbed january 2009
we cannot working again
all tax tax increases currently scheduled their retirement small businesses middle class families
small businesses tax deduction small business 20 percent
red tape tape factory washington dc de facto the game small businesses cannot properly may harm
small business health care small businesses internal revenue any purchases 1099 reporting has determined the agency ill equipped handle all internal revenue service equipped to handle
spending spree out of control
raise taxes make it easy
fiscal discipline
up against spending habits they promised president obama
discretionary spending each year defense department homeland security has increased result we years our day just fight terrorism department of defense secure our border department of homeland security
summed up ronald reagan president ronald reagan
interest rates
walked away out of control
debt we urgent action spending habits bring down build long challenges we face pay down the debt
act immediately no reason us further wasteful and unnecessary wastes taxpayer money there is no reason
common sense our troops roll back spending spree balancing the budget paying down the debt
discretionary spending common sense last year even more were used growth we cutting discretionary spending
increased its its own small businesses significantly reducing
spending cuts house republicans runaway spending nine weeks save taxpayers
rightly outraged tens of billions once and for all troubled asset relief program
fannie mae freddie mac mortgage companies too many many high afford them
federal hiring hiring freeze small businesses federal employees public sector no longer
once created federal programs go away away even even if problem they never go away addressed this problem root out government waste
budget process focus on challenges we entitlement programs social security these programs reviewing them medicare and medicaid
every hour spent per per minute twice as much
budget projections man woman and child
does all assistance programs local governments non profit funding programs federal domestic assistance non profit organizations
crowding out one level most recent government spending ten years years than several percentage points
one thing health care president obama higher taxes small businesses doctor patient health care reform
up against health care through congress congress have
have announced laying off dropping their health care coast to coast laying off employees health care coverage
chief actuary medicaid services health care congressional budget office nonpartisan congressional budget office
social security law does his plan important thing social security and medicare
health care raise taxes middle class has conceded
chief actuary medicaid services services has has confirmed their current
million americans drop their their current forced to acknowledge
overwhelmingly opposed president obama health care taxpayer funds
health care
health care
health care care we take action will immediately take action
liability insurance rates have have distorted protect themselves often referred common sense lower costs medical liability reform medical liability insurance
health insurance into those those plans health care state in across state lines health insurance plans health care coverage
savings accounts health insurance health care these savings purchase over making it easier over the counter high deductible health plans
relationship we health care common sense doctor patient relationship
ensure access health care just because lower premiums number of uninsured
insurance coverage hyde amendment other instances health care health care providers
regulations have
employees may taxes levied health care billions of dollars
health care tax increases goods and services
health care committee have joint economic committee
often ignored backroom deals congress have than under nancy pelosi speaker nancy pelosi
all so once and for all
up against does whatever her tenure speaker pelosi the letter house rules while ignoring within her her own democratic leaders wrong direction using various despite having democratic majority since 1993 considered under house of representatives letter and spirit
health care speaker pelosi louise slaughter publicly discussed house democrats chairwoman louise slaughter
key provisions
americans believe released earlier this year
just plain house republicans real time highest priorities what goes on house of representatives first of its kind
before coming coming up legislation should interested parties no more hiding
adhere to too long too often has allowed massive deficit require each constitutional authority by what authority lack of respect
behavior so let any make it easier democrat or republican
legislative issues time we instead we one at a time
up significantly wasting time considered under so far
only two
relied heavily bring any
all that served us ensure our government has never apologize support our troops around the world
new york fort hood times square these attacks new york city ready and willing new york city subway
does not president dwight d. eisenhower
border security not just national security just war
provide our our troops more troop held up pork barrel
america we president obama his administration guantanamo bay fight against terrorist plots guantanamo bay detainees
foreign terrorists american citizens they have u.s. military such as military intelligence law enforcement
missile defense ballistic missiles intercontinental ballistic missiles
iranian regime harm our its own has declared nuclear capability sanctions against iran
take action enforcing our border patrol immigration laws illegal immigration drug cartels means we we need law enforcement all federal secure our borders mexican drug cartels enforce our immigration laws all hands on deck
christmas day homeland security visa applications after having department of homeland security
our founders the servant people not checks and balances
elected representatives bills passed dollars spent
quite different people went went there they sought their strong has over helped create margaret thatcher sense of purpose
our principles more accountable working again stop out health care we will stand out of control
time we against any
transparency and accountability
health care
union bosses
built through our constitution people we 111th congress bring these these reforms ask all good will our beliefs men and women
speak out

Nip it in the butt

I’ve seen two references to “nip it in the butt” for “nip it in the bud” recently, which reminded me of the “avoid like the plaque” post.

It’s a little harder to get statistics for this based on the public Bing data, which only goes to four tokens. But here are the most common following words for “nip in the ___”:

bud:46.79%, air:45.60%, h: 2.53%, butt: 1.05%, taste: 0.95%, form: 0.35%, bahamas: 0.25%, nip: 0.25%, </s>: 0.08%, us: 0.04%, post: 0.04%, office: 0.04%, images: 0.04%, counter: 0.04%, evening: 0.04%, twin: 0.04%, muck: 0.04%, buttoutkast: 0.04%

“Nip Nip in the H Section” is referenced in the Urban Dictionary. “Nip in the taste bud” seems to be a standard ‘clever’ headline. “Nip in the bahamas” is probably about a “nipple slip”.

Avoid like the plaque

I noticed the expression “avoid like the plaque” showing up in some search results I was looking at. This is “plaque”-with-a-queue, not “plague”-with-a-gee. This seems like an odd misspelling to me, and I was curious how often this occurs. The Bing Ngram data gives us the tools: given “avoid like the”, the word “plague” occurs 98.68% of the time (in the web body data). The word “plaque” is the second most common word, occurring 0.13% of the time.

Here are the top words and percentages:

plague:98.68%, </s>: 0.48%, plaque: 0.13%, pl: 0.12%, great: 0.07%, swine: 0.06%, velvet: 0.04%, black: 0.02%, proverbial: 0.02%, bubonic: 0.02%, bubolic: 0.02%, ten: 0.01%, plagued: 0.01%, plauge: 0.01%, bird: 0.01%, nuclear: 0.01%, blight: 0.01%, clap: 0.01%, lague: 0.01%, plagu: 0.01%, pleague: 0.01%

</s> means “end of sentence.”
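To put the two spellings side by side, here is a small Ruby sketch that works only from the percentages listed above (the first few entries are copied in by hand; nothing is fetched from the Ngram service). The variable names are mine.

```ruby
# Next-word percentages for "avoid like the ___", copied from the list above.
next_word_pct = {
  "plague" => 98.68,
  "</s>"   => 0.48,
  "plaque" => 0.13,
  "pl"     => 0.12
}

# How many times more often does "plague" follow than "plaque"?
ratio = next_word_pct["plague"] / next_word_pct["plaque"]

puts "plague outnumbers plaque roughly #{ratio.round} to 1"
```

So the misspelling is rare, but at web scale a ratio in the hundreds still leaves plenty of plaques to avoid.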

C# Lambdas are almost like C# compile-time macros

I was writing some instrumentation code in C# today that looked like this:


If I were using a language with (compile-time) code macros (like Clojure, Scheme or Lisp), I’d prefer to write something like:


But C# doesn’t have such a basic and useful thing, by design (bad design, I think, but still by design).

But C# does have first-class anonymous functions, so it’s possible to write methods that take one of these as a parameter. This allowed me to write something like:

Measuring("SomeFeature", () => { ComputeSomeFeature(); });
Measuring("AnotherFeature", () => { ComputeAnotherFeature(); });

This is almost not horrible. It was a little tricky to find out how to declare anonymous functions in C# code. But it turns out that .NET has a special type for anonymous functions which take no arguments and return no value; .NET calls one of these an Action.

So, my Measuring method looks something like this:

private void Measuring(string feature, Action f)

I did some simple timing on the ‘macro’ vs. the ‘non-macro’ version of these; basically, you pay the cost of an additional function call. I don’t know enough about .NET to know how much continuation state would get passed; this was adequate for my purposes, anyway. Perhaps a more knowledgeable commentator will comment.
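For comparison, the same wrap-the-work-in-a-function pattern can be sketched in Ruby, where a block plays the role of the C# Action. This is only an illustration of the idea, not the C# code from this post: the method body (timing with a monotonic clock and writing to standard error) and the name compute_some_feature are my assumptions.

```ruby
# Hypothetical Ruby analogue of Measuring(string feature, Action f).
# Assumption: "measuring" means timing the block and reporting the elapsed time.
def measuring(feature)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield # run the wrapped code
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  warn format("%s took %.6f s", feature, elapsed)
  result # pass the block's value through to the caller
end

# Stand-in for the real work being instrumented.
def compute_some_feature
  (1..1000).sum
end

measuring("SomeFeature") { compute_some_feature }
```

As in the C# version, the cost over inlined timing code is roughly one extra call per measurement.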

(This is a note to my future self; I’m sure to forget the details about getting around the limits of C# not having compile-time macros).

Using Ngram data for segmentation

I’ve updated the Microsoft NGram Ruby library to provide an example use: segmenting Twitter hashtags. Twitter hashtags have been used for some time to tag tweets according to users’ choice and whimsy. Coincidentally, my daytime boss has just conducted an interview with William Morgan–formerly of Powerset, but now at Twitter–about hashtags, in case you want to come up to speed on what they are. It’s a fun read in any case, including the origin story of hashtags.

Anyway, the Ruby library allows you to segment text; it uses the Bing unigram and bigram data to guess the most likely segmentation. Here are some hashtags in my timeline from today, and their segmentations:

#  > segment("bpcares")
#  => ["bp", "cares"]
#  > segment("Twitter")
#  => ["Twitter"]
#  > segment("writers")
#  => ["writers"]
#  > segment("iamwriting")
#  => ["i", "am", "writing"]
#  > segment("backchannel")
#  => ["back", "channel"]
#  > segment("tcot")
#  => ["tcot"]
#  > segment("vacationfallout")
#  => ["vacation", "fall", "out"]

The code closely follows Peter Norvig’s discussion of segmentation in his chapter “Natural Language Corpus Data” in the book Beautiful Data. The only differences are (1) using the Web-based data for unigram and bigram data, and (2) a small optimization (perhaps) of creating splits on text only when the first part of the split reaches a certain probability threshold. ([“vacationf” “allout”] is not a good split for “vacationfallout” because “vacationf” is very unlikely).
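The approach just described can be sketched offline in a few lines of Ruby. This is not the library’s code: it uses unigrams only, a tiny invented word-count table instead of the Bing data, a plain hash instead of the memoize gem, and a threshold value I picked arbitrarily. It is meant only to show the recursive split, the score, and the pruning of unlikely first parts.

```ruby
# Toy unigram counts, invented for illustration (NOT Bing data).
COUNTS = {
  "vacation" => 50, "fall" => 80, "out" => 200, "fallout" => 5,
  "i" => 500, "am" => 300, "writing" => 60, "back" => 150, "channel" => 40
}.freeze
TOTAL = COUNTS.values.sum.to_f

# Assumed pruning threshold: skip splits whose first part scores below this.
THRESHOLD = -8.0

# Log10 probability of one word; unseen words are penalized by length,
# in the style of Norvig's chapter.
def word_logprob(word)
  count = COUNTS[word]
  return Math.log10(count / TOTAL) if count
  Math.log10(10.0 / (TOTAL * 10**word.length))
end

# Most probable segmentation of text under the toy unigram model.
def segment(text, memo = {})
  return [] if text.empty?
  memo[text] ||= begin
    candidates = (1..text.length).map do |i|
      first = text[0, i]
      rest = text[i..-1]
      next if word_logprob(first) < THRESHOLD # the small optimization
      [first] + segment(rest, memo)
    end.compact
    candidates = [[text]] if candidates.empty?
    candidates.max_by { |words| words.sum { |w| word_logprob(w) } }
  end
end

p segment("vacationfallout") # => ["vacation", "fall", "out"]
p segment("iamwriting")      # => ["i", "am", "writing"]
```

With the threshold in place, a first part such as “vacationf” is dropped before any recursion on the remainder happens.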

The code would be better if it batched the calls for probabilities rather than requesting them one-by-one. Norvig’s code also has the advantage of running off-line. The Bing data is more recent.

Anyway, enjoy the code: it can be found in my GitHub account:


It requires (in addition to the microsoft_ngram library and its dependencies) the memoize gem.

Ruby project to access Microsoft Ngram data

I am pleased to announce the availability of a Ruby library to access the Microsoft Ngram data. This data currently includes 1,2,3,4 gram data for anchor text, 1,2,3 gram data for body text, 1,2,3,4 gram data for page titles, and 1,2,3 gram data for queries, collected in June 2009. See the Bing/MSR Ngram data page for general information. Although I am a Microsoft employee, this software is provided by me, not Microsoft.

Microsoft provides a SOAP API, and a Python REST-based library, but this is (I think) the first Ruby library. You can get a copy at Github.com: http://github.com/willf/microsoft_ngram.

I hope, in the days to come, to write some example programs that show the power of this data resource for research. But, to whet the appetite, should we parse “Boston cream pie” as [Boston [cream pie]] or [[Boston cream] pie]? That is, a cream pie made in the Boston style, or a pie made with Boston cream? If the former, “cream pie” should be more frequent than “Boston cream”; if the latter, the opposite.

> MicrosoftNgram.new(:model => 'bing-body/jun09/2').jps(['boston cream', 'cream pie'])
=> [["boston cream", -7.231685], ["cream pie", -6.027882]]
These are log probabilities. This model suggests [Boston [cream pie]] is the correct bracketing.
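To get a feel for the size of the gap, the two scores can be compared directly. The sketch below hard-codes the values returned above rather than calling the service, and it assumes the logs are base 10; if they are in another base, the ratio changes but the winner does not.

```ruby
# Joint log probabilities as returned above (bing-body/jun09/2 model).
scores = [["boston cream", -7.231685], ["cream pie", -6.027882]]

best, best_lp = scores.max_by { |_phrase, lp| lp }
_other, other_lp = scores.min_by { |_phrase, lp| lp }

# Assumption: base-10 logs, so a difference d means a factor of 10**d.
ratio = 10**(best_lp - other_lp)

puts "#{best} is about #{ratio.round}x more frequent than #{_other}"
```

A difference of about 1.2 in log probability is therefore more than an order of magnitude in raw frequency.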