[16] People Take Baths In Hotel Rooms

This post is the product of a heated debate.

At a recent conference, a colleague mentioned, much too matter-of-factly, that she took a bath in her hotel room. Not a shower. A bath. I had never heard of someone voluntarily bathing in a hotel room. I think bathing is preposterous. Bathing in a hotel room is lunacy.

I started asking people to estimate the percentage of people who had ever bathed in a hotel room. The few that admitted bathing in a hotel room guessed 15-20%. I guessed 4%, but then decided that number was way too high. A group of us asked the hotel concierge for his estimate. He said 60%, which I took as evidence that he lacked familiarity with numbers.

One of the participants in this conversation, Chicago professor Abigail Sussman, suggested testing this empirically. So we did that.

We asked 532 U.S.-resident MTurkers who had spent at least one night in a hotel within the last year to recall their most recent hotel stay (notes .pdf; materials .pdf; data .xls). We asked them a few questions about their stay, including whether they had showered, bathed, both, or, um, neither.

Here are the results, removing those who said their hotel room definitely did not have a bathtub (N = 442).

Ok, about 80% of people took a normal person’s approach to their last hotel stay. One in 20 didn’t bother cleaning themselves (respect), and 12.4% took a bath, including an incomprehensible subset who bathed but didn’t shower. Given that these data capture only their most recent hotel stay, the proportion of people bathing is at least an order of magnitude higher than I expected.


Gender Differences

If you had told me that 12.4% of people report having taken a bath during their last hotel stay, I’d have told you to include some men in your sample. Women have to be at least 5 times more likely to bathe. Right?


Women bathed more than men (15.6% vs. 10.8%), but only by a small, nonsignificant margin (p=.142). Also surprising is that women and men were equally likely to take a Pigpen approach to life.


What Predicts Hotel Bathing?

Hotel quality: People are more likely to bathe in higher quality hotels, and nobody bathes in a one-star hotel.


Perceptions of hotel cleanliness: People are more likely to bathe when they think hotels are cleaner, although almost 10% took a bath despite believing that hotel rooms are somewhat dirtier than the average home.


Others in the room: Sharing a room with more than 1 person really inhibits bathing, as it should, since it’s pretty inconsiderate to occupy the bathroom for the length of time that bathing requires. More than one in five people bathe when they are alone.


Bathing History & Intentions

We also asked people whether they had ever, as an adult, bathed in a hotel room. Making a mockery of my mockery, the concierge’s estimate was slightly closer to the truth than mine was, as fully one-third (33.3%) reported doing so. One in three.

I give up.

Finally, we asked those who had never bathed in a hotel room to report whether they’d ever consider doing so. Of the 66.7% who said they had never bathed in a hotel room, 64.1% said they’d never consider it.

So only 43% have both never bathed in a hotel room and say they would never consider it.
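The 43% figure is simply the product of the two percentages reported above. A minimal sketch of that arithmetic, using only the numbers in this post (the variable names are illustrative):

```python
# Share of respondents who have never bathed in a hotel room (reported above).
never_bathed = 0.667
# Share of those never-bathers who say they would never consider it.
never_consider_given_never = 0.641

# Joint share: never bathed AND would never consider it.
never_and_never = never_bathed * never_consider_given_never
print(f"{never_and_never:.1%}")  # prints 42.8%
```

Rounded to the nearest whole percent, that is the 43% quoted above.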

Those are my people.

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

[15] Citing Prospect Theory

Kahneman and Tversky’s (1979) Prospect Theory (.pdf), with its 9,206 citations, is the most cited article in Econometrica, the prestigious journal in which it appeared. In fact, it is more cited than any article published in any economics journal. [1]

Let’s break it down by year.

[Figure: Prospect Theory citations per year]

To be clear, this figure shows that just in 2013, Prospect Theory got about 700 citations.

Kahneman won the Nobel prize in Economics in 2002. This figure suggests a Nobel bump in citations. To examine whether the Nobel bump is real, I got citation data for other papers. I will get to that in about 20 seconds. Let’s not abandon Prospect Theory just yet.

Fan club.
Below we see which researchers and which journals have cited Prospect Theory the most. Leading the way is the late Duncan Luce with his 38 cites.


If you are wondering, Kahneman would be ranked 14th with 24 cites, and Tversky 15th with 23. Richard Thaler comes in 33rd place with 16, and Dan Ariely in 58th with 12.

How about journals?


I think the most surprising top-5 entry is Management Science; it only recently started its Behavioral Economics and Judgment & Decision Making departments.

Not drinking the Kool-Aid
The first article to cite Prospect Theory came out the same year, 1979, in Economics Letters (.pdf). It provided a rational explanation for risk attitudes differing for gains and losses. The story is perfect if one is willing to make ad-hoc assumptions about the irreversibility of decisions and if one is also willing to ignore the fact that Prospect Theory involves small-stakes decisions. Old school.

Correction: an earlier version of this post indicated the Journal of Political Economy did not cite Prospect Theory until 2005; in fact, it was cited already in 1987 (.html).

About that Nobel bump
The first figure in this blog suggests the Nobel led to more Prospect Theory cites. I thought I would look at other 1979 Econometrica papers as a “placebo” comparison. It turned out that they also showed a marked and sustained increase in the early 2000s. Hm?

I then realized that Heckman’s famous “Sample Selection as Specification Error” paper was also published in Econometrica in 1979 (good year!), and Heckman, it turns out, got the Nobel in 2000. My placebo was no good: whether the bump was real or spurious, it was expected to show the same pattern.

So I used Econometrica 1980. The figure below shows that deflating Prospect Theory cites by cites of all articles published in Econometrica in 1980, the same Nobel bump pattern emerges. Before the Nobel, Prospect Theory was getting about 40% as many cites per year as all 1980 Econometrica articles combined. Since then that has been rising; in 2013 they were nearly tied.

[Figure: ratio of Prospect Theory cites to cites of all 1980 Econometrica articles, by year]

Let’s take this out of sample: did other econ Nobel laureates get a bump in citations? I looked for laureates from different time periods and whose award I thought could be tied to a specific paper.

There seems to be something there, though the Coase cites started increasing a bit early and Akerlof’s a bit late. [2]

[Figure: citations over time for Coase (1960) and Akerlof (1970)]


  1. Only two psychology articles have more citations: Baron and Kenny’s paper introducing mediation (.pdf), with 21,746, and Bandura’s on Self-Efficacy (.html), with 9,879.
  2. The papers: Akerlof, 1970 (.pdf) & Coase, 1960 (.pdf)

[14] How To Win A Football Prediction Contest: Ignore Your Gut

This is a boastful tale of how I used psychology to dominate a football prediction contest.

Back in September, I was asked to represent my department – Operations and Information Management – in a Wharton School contest to predict NFL football game outcomes. Having always wanted a realistic chance to outperform Adam Grant at something, I agreed.

The contest involved making the same predictions that sports gamblers make. For each game, we predicted whether the superior team (the favorite) was going to beat the inferior team (the underdog) by more or less than the Las Vegas point spread. For example, when the very good New England Patriots played the less good Pittsburgh Steelers, we had to predict whether or not the Patriots would win by more than the 6.5-point point spread. We made 239 predictions across 16 weeks.

Contrary to popular belief, oddsmakers in Las Vegas don’t set point spreads in order to ensure that half of the money is wagered on the favorite and half the money is wagered on the underdog. Rather, their primary aim is to set accurate point spreads, ones that give the favorite (and underdog) a 50% chance to beat the spread. [1] Because Vegas is good at setting accurate spreads, it is very hard to perform better than chance when making these predictions. The only way to do it is to predict the NFL games better than Vegas does.

Enter Wharton professor Cade Massey and professional sports analyst Rufus Peabody. They’ve developed a statistical model that, for an identifiable subset of football games, outperforms Vegas. Their Massey-Peabody power rankings are featured in the Wall Street Journal, and from those rankings you can compute expected game outcomes. For example, their current rankings (shown below) say that the Broncos are 8.5 points better than the average team on a neutral field whereas the Seahawks are 8 points better. Thus, we can expect, on average, the Broncos to beat the Seahawks by 0.5 points if they were to play on a neutral field, as they will in Sunday’s Super Bowl. [2]
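To make the arithmetic concrete, here is a minimal sketch of turning two power ratings into an expected margin, and then into a pick against the spread. The function names and the pick rule (take the favorite only if the model's expected margin exceeds the spread) are illustrative assumptions, not Massey-Peabody's actual code; the 2.4-point home-field adjustment comes from footnote 2.

```python
HOME_FIELD_POINTS = 2.4  # home-field adjustment, per footnote 2


def expected_margin(rating_a, rating_b, a_is_home=None):
    """Expected points by which team A beats team B.

    a_is_home: True if A is at home, False if B is at home,
    None for a neutral field.
    """
    margin = rating_a - rating_b
    if a_is_home is True:
        margin += HOME_FIELD_POINTS
    elif a_is_home is False:
        margin -= HOME_FIELD_POINTS
    return margin


def pick_against_spread(model_margin_for_favorite, spread):
    """Hypothetical rule: back the favorite only if the model expects
    it to win by more than the point spread."""
    return "favorite" if model_margin_for_favorite > spread else "underdog"


# Broncos (8.5) vs. Seahawks (8.0) on a neutral field.
print(expected_margin(8.5, 8.0))  # 0.5
```

A margin that small is why the model treats the game as close to a coin flip.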


My approach to the contest was informed by two pieces of information.

First, my work with Leif (.pdf) has shown that naïve gamblers are biased when making these predictions – they predict favorites to beat the spread much more often than they predict underdogs to beat the spread. This is because people’s first impression about which team to bet on ignores the point spread and is thus based on a simpler prediction as to which team will win the game. Since the favorite is usually more likely to win, people’s first impressions tend to favor favorites. And because people rarely talk themselves out of these first impressions, they tend to predict favorites against the spread. This is true even though favorites don’t win against the spread more often than underdogs (paper 1, .pdf), and even when you manipulate the point spreads to make favorites more likely to lose (paper 2, .pdf). Intuitions for these predictions are just not useful.

Second, knowing that evidence-based algorithms are better forecasters than humans (.pdf), I used the Massey-Peabody algorithm for all my predictions.

So how did the results shake out? (Notes on Analyses; Data)

First, did my Wharton colleagues also show the bias toward favorites, a bias that would indicate that they are no more sophisticated than the typical gambler?

Yes. All of them predicted significantly more favorites than underdogs.


Second, how did I perform relative to the “competition?”

Since everyone loves a humble champion, let me just say that my victory is really a victory for Massey-Peabody. I don’t deserve all of the accolades. Really.

Yeah, for about the millionth time (see meta-analysis, .pdf), we see that statistical models outperform human forecasters. This is true even (especially?) when the humans are Wharton professors, students, and staff.

So, if you want to know who is going to win this Sunday’s Super Bowl, don’t ask me and don’t ask the bestselling author of Give and Take. Ask Massey-Peabody.

And they will tell you, unsatisfyingly, that the game is basically a coin flip.


  1. Vegas still makes money in the long run because gamblers have to pay a fee in order to bet.
  2. For any matchup involving home field advantage, give an additional 2.4 points to the home team.

[13] Posterior-Hacking

Many believe that while p-hacking invalidates p-values, it does not invalidate Bayesian inference. Many are wrong.

This blog post presents two examples from my new “Posterior-Hacking” (SSRN) paper showing that selective reporting invalidates Bayesian inference as much as it invalidates p-values.

Example 1. Chronological Rejuvenation experiment
In “False-Positive Psychology” (SSRN), Joe, Leif, and I ran experiments to demonstrate how easy p-hacking makes it to obtain statistically significant evidence for any effect, no matter how untrue. In Study 2 we “showed” that undergraduates randomly assigned to listen to the song “When I am 64” became 1.4 years younger (p<.05).

We obtained this absurd result by data-peeking, dropping a condition, and cherry-picking a covariate. p-hacking allowed us to fool Mr. p-value. Would it fool Mrs. Posterior also? If we take the selectively reported result and feed it to a Bayesian calculator, what happens?

The figure below shows traditional and Bayesian 95% confidence intervals for the above-mentioned 1.4 years-younger chronological rejuvenation effect. Both point just as strongly (or weakly) toward the absurd effect existing. [1]


When researchers p-hack they also posterior-hack

Example 2. Simulating p-hacks
Many Bayesian advocates propose concluding that an experiment suggests an effect exists if the data are at least three times more likely under the alternative than under the null hypothesis. This “Bayes factor>3” approach is philosophically different from, and mathematically more complex than, computing p-values, but it is in practice extremely similar to simply requiring p<.01 for statistical significance. I hence ran simulations assessing how p-hacking facilitates getting p<.01 vs. getting Bayes factor>3. [2]

I simulated difference-of-means t-tests p-hacked via data-peeking (getting n=20 per-cell, going to n=30 if necessary), cherry-picking among three dependent variables, dropping a condition, and dropping outliers. See R-code.

By adding 10 observations to samples of size n=20, a researcher can increase her false-positive rate from the nominal 1% to 1.7%. The probability of getting a Bayes factor >3 is a comparable 1.8%. Combined with other forms of p-hacking, the ease with which a false finding is obtained increases multiplicatively. A researcher willing to engage in any of the four forms of p-hacking has a 20.1% chance of obtaining p<.01, and a 20.8% chance of obtaining a Bayes factor >3.
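For intuition, the data-peeking step alone is easy to simulate. The sketch below is not the R code linked above. It is a simplified, standard-library-only stand-in that assumes normally distributed data with a known standard deviation of 1 and uses a z-test in place of the t-test. Under the null, it draws n=20 per cell, tests at p<.01, and adds 10 observations per cell only if the first test misses. All function names and defaults are illustrative.

```python
import random
from statistics import NormalDist, fmean


def z_test_p(a, b):
    """Two-sided p-value for a difference of means, assuming known SD = 1
    and equal cell sizes (a simplification of the t-test)."""
    n = len(a)
    z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=20000, n_first=20, n_extra=10, alpha=0.01, seed=1):
    """Return (fixed-n false-positive rate, data-peeking false-positive rate)
    when there is no true effect."""
    rng = random.Random(seed)
    fixed_hits = peek_hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n_first)]
        b = [rng.gauss(0, 1) for _ in range(n_first)]
        if z_test_p(a, b) < alpha:
            # Significant at the first look: counts under both strategies.
            fixed_hits += 1
            peek_hits += 1
            continue
        # Not significant: the peeking researcher adds observations and retests.
        a.extend(rng.gauss(0, 1) for _ in range(n_extra))
        b.extend(rng.gauss(0, 1) for _ in range(n_extra))
        if z_test_p(a, b) < alpha:
            peek_hits += 1
    return fixed_hits / n_sims, peek_hits / n_sims


if __name__ == "__main__":
    fixed_rate, peek_rate = simulate()
    print(f"fixed n: {fixed_rate:.3f}, with peeking: {peek_rate:.3f}")
```

Under these assumptions the fixed-n rate should land near the nominal 1% and the peeking rate somewhat higher, in the neighborhood of the figure reported above. Exact values depend on the seed and on the z-test simplification.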

When a researcher p-hacks, she also Bayes-factor-hacks.

Everyone needs disclosure
Andrew Gelman and colleagues, in their influential Bayesian textbook write:

“A naïve student of Bayesian inference might claim that because all inference is conditional on the observed data, it makes no difference how those data were collected, […] the essential flaw in the argument is that a complete definition of ‘the observed data’ should include information on how the observed values arose […]”
(p.203, 2nd edition)

Whether doing traditional or Bayesian statistics, without disclosure, we cannot evaluate evidence.


  1. The Bayesian confidence interval is the “highest density posterior interval”, computed using Kruschke’s BMLR (html).
  2. This equivalence is for the default-alternative, see Table 1 in Rouder et al, 2009 (HTML).

[12] Preregistration: Not just for the Empiro-zealots

I recently joined a large group of academics in co-authoring a paper looking at how political science, economics, and psychology are working to increase transparency in scientific publications. Psychology is leading, by the way.

Working on that paper (and the figure below) actually changed my mind about something. A couple of years ago, when Joe, Uri, and I wrote False Positive Psychology, we were not really advocates of preregistration (a la clinicaltrials.gov). We saw it as an implausible superstructure of unspecified regulation. Now I am an advocate. What changed?

Transparency in Scientific Reporting Figure

First, let me relate an anecdote originally told by Don Green (and related with more subtlety here). He described watching a research presentation that at one point emphasized a subtle three-way interaction. Don asked, “did you preregister that hypothesis?” and the speaker said “yes.” Don, as he relates it, was amazed. Here was this super complicated pattern of results, but it had all been predicted ahead of time. That is convincing. Then the speaker said, “No. Just kidding.” Don was less amazed.

The gap between those two reactions is the reason I am trying to start preregistering my experiments. I want people to be amazed.

The single most important scientific practice that Uri, Joe, and I have emphasized is disclosure (i.e., the top panel in the figure). Transparently disclose all manipulations, measures, exclusions, and sample size specification. We have been at least mildly persuasive, as a number of journals (e.g., Psychological Science, Management Science) are requiring such reporting.

Meanwhile, as a researcher, transparency creates a rhetorical problem. When I conduct experiments, for example, I typically collect a single measure that I see as the central test of my hypothesis. But, like any curious scientist, I sometimes measure some other stuff in case I can learn a bit more about what is happening. If I report everything, then my confirmatory measure is hard to distinguish from my exploratory measures. As outlined in the figure above, a reader might reasonably think, “Leif is p-hacking.” My only defense is to say, “no, that first measure was the critical one. These other ones were bonus.” When I read things like that I am often imperfectly convinced.

How can Leif the researcher be more convincing to Leif the reader? By saying something like, “The reason you can tell that the first measure was the critical one is because I said that publicly before I ran the study. Here, go take a look. I preregistered it.” (i.e., the left panel of the figure).

Note that this line of thinking is not even vaguely self-righteous. It isn’t pushy. I am not saying, “you have to preregister or else!” Heck, I am not even saying that you should; I am saying that I should. In a world of transparent reporting, I choose preregistration as a way to selfishly show off that I predicted the outcome of my study. I choose to preregister in the hopes that one day someone like Don Green will ask me, and that he will be amazed.

I am new to preregistration, so I am going to be making lots of mistakes. I am not going to wait until I am perfect (it would be a long wait). If you want to join me in trying to add preregistration to your research process, it is easy to get started. Go here, and open an account, set up a page for your project, and when you’re ready, preregister your study. There is even a video to help you out.


[11] “Exactly”: The Most Famous Framing Effect Is Robust To Precise Wording

In an intriguing new paper, David Mandel suggests that the most famous demonstration of framing effects – Tversky & Kahneman’s (1981) “Asian Disease Problem” – is caused by a linguistic artifact. His paper suggests that eliminating this artifact eliminates, or at least strongly reduces, the framing effect. Does it?

This is the perfect sort of paper for a replication: The original finding is foundational and the criticism is both novel and fundamental. We read Mandel’s paper because we care about the topic, and we replicated it because we cared about the outcome.

The Asian Disease Problem

Imagine that an Asian disease is expected to kill 600 people and you have to decide between two policies designed to combat the disease. The policies can be framed in terms of gains (people being saved) or in terms of losses (people dying).

Tversky and Kahneman found that whereas 72% of people given the gain frame chose Program A’s 200 of 600 certain lives saved, only 22% of people given the loss frame chose Program C’s 400 of 600 certain deaths. This result supports prospect theory: people will take risks to avoid losses but will avoid risks to protect gains.

Just a Linguistic Artifact?

David Mandel argues that when people read “200 people will be saved” or “400 people will die” they interpret it as “at least 200 people will be saved” or “at least 400 people will die”. That small difference in interpretation would switch the finding from irrational to sensible, as the certain option would potentially save more people in the gain frame (>200 of 600 will be saved) than in the loss frame (>400 of 600 will die). Mandel resolves the ambiguity by adding the word “exactly” (“exactly 200 people will be saved”). Because the word “exactly” makes it clear that no more than 200 will be saved, he predicts that including that word will eliminate the framing effect.
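Mandel's argument is easiest to see by writing out how many of the 600 people could end up saved under each reading of the certain option. This is only an illustrative sketch; the function and its arguments are hypothetical names, not anything from Mandel's materials.

```python
TOTAL = 600


def saved_range(frame, modifier):
    """Return (min, max) number of people saved under the certain option.

    frame: "gain" ("200 people will be saved") or
           "loss" ("400 people will die").
    modifier: "exactly" or "at least".
    """
    if modifier not in ("exactly", "at least"):
        raise ValueError(f"unknown modifier: {modifier}")
    if frame == "gain":
        low = 200
        high = 200 if modifier == "exactly" else TOTAL
    elif frame == "loss":
        # "At least 400 die" means at most 200 are saved.
        high = TOTAL - 400
        low = high if modifier == "exactly" else 0
    else:
        raise ValueError(f"unknown frame: {frame}")
    return low, high


for modifier in ("exactly", "at least"):
    print(modifier, saved_range("gain", modifier), saved_range("loss", modifier))
```

With “exactly,” both frames describe the same outcome (200 saved). With “at least,” the gain frame allows 200 to 600 saved while the loss frame allows 0 to 200, which is why preferring the certain option only in the gain frame would be sensible under that reading.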

In Study 2 (the study most faithful to the original), Mandel used Tversky and Kahneman’s wording and replicated their result (58% chose to save 200 people for certain; 26% chose to kill 400 people for certain). When he added “exactly,” that difference was reduced (59% vs. 43%). [1]

We replicated Mandel’s procedure. We showed mTurk workers the same scenario and asked the same questions. We collected ~2.5 times Mandel’s sample size; Mandel had ~38 per cell and we had ~98 per cell. (Following Mandel, we also included conditions with “at least” as a modifier; here are those results).

Unlike Mandel, we found a strong framing effect even with the use of the word “exactly” (p<.001) (materials; data):

For completeness, we should report that Mandel emphasized a different dependent variable – a continuous measure of preference. We measured that too and it also failed to replicate his result.

In sum, our replication suggests that Tversky and Kahneman’s (1981) framing effect is not caused by this linguistic artifact.

We Could Have Just Asked Uri

When we told Uri about all this, he told us that he conducts this experiment in his class each year and that he uses the word “exactly” in his materials. The experiment has replicated every single year. For example, in the past two years combined (N=250), he observed that 63% chose to save 200 lives for sure whereas only 25% chose to let 400 die for sure.

David Mandel Responds

As is our policy, we sent a draft of this post to David Mandel to offer him the chance to respond. Please check out David’s response below. We are very thankful he took the time to do this:

I welcome this replication experiment by Joseph Simmons and Leif Nelson. I think we all agree proper replications are important regardless of how the results turn out. They have kindly offered me 150 words to reply, but that would hardly get me started. There are many points to cover, both about the replication and some of the broader issues it sparks. Joe contacted me on the Friday before the post went live with the “unwelcome news” and it’s been a weekend of changed plans, but I wanted to have a reply ready when the post goes live. Here it is on my website [and here it is as a pdf]. I hope you read both. Feel free to email me if you have comments. Lastly, Joe invited me to comment on their post, but I haven’t since I cover what I might have otherwise recommended they alter in my reply. Readers can make up their own minds. 151, 152…


  1. However, compared to the condition with the original wording, this reduction was nonsignificant (p=.293).

[10] Reviewers are asking for it

Recent past and present
The leading empirical psychology journal, Psychological Science, will begin requiring authors to disclose flexibility in data collection and analysis starting in January 2014 (see editorial). The leading business school journal, Management Science, implemented a similar policy a few months ago.

Both policies closely mirror the recommendations we made in our 21 Word Solution piece, where we contrasted the level of disclosure in science vs. food (see reprint of Figure 3).

Our proposed 21 word disclosure statement was:

We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.

Etienne Lebel tested an elegant and simple implementation in his PsychDisclosure project. Its success contributed to Psych Science’s decision to implement disclosure requirements.

Starting Now
When reviewing for journals other than Psych Science and Management Science, what could reviewers do?

On the one hand, as reviewers we simply cannot do our jobs if we do not know fully what happened in the study we are tasked with evaluating.

On the other hand, requiring disclosure from an individual article one is reviewing risks authors taking such requests personally (reviewers are doubting them) and risks revealing our identity as reviewers.

A solution is a uniform disclosure request that large numbers of reviewers include in every review they write.

Together with Etienne Lebel, Don Moore, and Brian Nosek, we created a standardized request that we and many others have already begun using in all of our reviews. We hope you will start using it too. With many reviewers including it in their referee reports, the community norms will change:

I request that the authors add a statement to the paper confirming whether, for all experiments, they have reported all measures, conditions, data exclusions, and how they determined their sample sizes. The authors should, of course, add any additional text to ensure the statement is accurate. This is the standard reviewer disclosure request endorsed by the Center for Open Science [see http://osf.io/project/hadz3]. I include it in every review.


[9] Titleogy: Some facts about titles

Naming things is fun. Not sure why, but it is. I have collaborated in the naming of people, cats, papers, a blog, its posts, and in coining the term “p-hacking.” All were fun to do. So I thought I would write a Colada on titles.

To add color I collected some data. At the end what I wrote was quite boring, so I killed it, but the facts seemed worth sharing. Here they go, in mostly non-contextualized prose.

Cliché titles
I dislike titles with (unmodified) idioms. The figure below shows how frequent some of them are in the web-of-science archive.
Ironically, the most popular (I found), at 970 papers, is “What’s in a name?” …Lack of originality?

A colleague once shared his disapproval of the increase in the use of colons in titles. With this post as an excuse, I used Mozenda to scrape ~30,000 psychology paper titles published across 19 journals over 40 years, and computed the fraction including a colon. “Colleague was Wrong: Title Colonization Has Been Stable at about 63% Since the 1970s.” [1]
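The computation behind that number is simple. As a minimal sketch (not the Mozenda scrape itself), here is the colon fraction computed on a toy input: the titles of Colada posts [7] through [16], used purely for illustration.

```python
def colon_fraction(titles):
    """Fraction of titles that contain at least one colon."""
    if not titles:
        raise ValueError("need at least one title")
    return sum(":" in title for title in titles) / len(titles)


# Toy input: titles of Colada posts [7] through [16].
titles = [
    "People Take Baths In Hotel Rooms",
    "Citing Prospect Theory",
    "How To Win A Football Prediction Contest: Ignore Your Gut",
    "Posterior-Hacking",
    "Preregistration: Not just for the Empiro-zealots",
    "“Exactly”: The Most Famous Framing Effect Is Robust To Precise Wording",
    "Reviewers are asking for it",
    "Titleogy: Some facts about titles",
    "Adventures in the Assessment of Animal Speed and Morality",
    "Forthcoming in the American Economic Review: A Misdiagnosed Failure-to-Replicate",
]

print(colon_fraction(titles))  # 0.5
```

Ten blog titles are obviously no substitute for ~30,000 journal titles; the point is only to show what “fraction including a colon” means operationally.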

That factoid took a couple of hours to generate. Data in hand I figured I should answer more questions. Any sense of coherence in this piece disappears with the next pixel.

Have titles gotten longer over time? 
Yes. At about 1.5 characters per year (or a tweet a century).
note: controlling for journal fixed effects.

Three less obvious questions to ask
Question 1. What are the two highest scoring Scrabble words used in a Psychology title?
Hypnotizability (37 points) is used in several articles, it turns out. [2]
Ventriloquized (36 points) only in this paper.

Question 2. What is the most frequent last-word in a Psychology paper title?
(try guessing before reading the next line)

This is probably the right place to let you know the Colada has a Facebook page now 

Winner: 137 titles end with: “Tasks”
Runner up: 70 titles end with “Effect”

Question 3. What’s more commonly used in a Psychology title, “thinking” or “sex”?
Not close.

Sex: 407.
Thinking: 172.

Alright, that’s not totally fair: in psychology, sex often refers to gender rather than the activity. Moreover, thinking (172) is, as expected for academic papers, more common than doing (44).
But memory blows sex, thinking, and doing combined out of the water with 2008 instances; one in 15 psychology titles has the word memory in it.
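The Scrabble scores quoted under Question 1 can be checked with a few lines. This sketch uses the standard English-language tile values and, like the figures above, ignores board multipliers and bonuses.

```python
# Standard English Scrabble letter values.
TILE_VALUES = {
    **dict.fromkeys("aeilnorstu", 1),
    **dict.fromkeys("dg", 2),
    **dict.fromkeys("bcmp", 3),
    **dict.fromkeys("fhvwy", 4),
    "k": 5,
    **dict.fromkeys("jx", 8),
    **dict.fromkeys("qz", 10),
}


def scrabble_score(word):
    """Sum of face values of the letters, with no board bonuses."""
    return sum(TILE_VALUES[letter] for letter in word.lower())


print(scrabble_score("Hypnotizability"))  # 37
print(scrabble_score("Ventriloquized"))   # 36
```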


  1. I treated the Journal of Consumer Research as a psychology journal, a decision involving two debatable assumptions.
  2. Shane Frederick indicated via email that this is a vast underestimate that ignores tripling of points; Hypnotizability could get you 729 points.

[8] Adventures in the Assessment of Animal Speed and Morality

Animal Virtue Figure 1
In surveys, most people answer most questions. That is true regardless of whether or not questions are coherently constructed and reasonably articulated. That means that absurd questions still receive answers, and in part because humans are similar to one another, those answers can even look peculiarly consistent. I asked an absurd question and was rewarded with an entertaining answer.

Some years ago, with Tom Meyvis, I tried to develop a manipulation to create an association between speed and virtue. Our spartan publication history on the topic testifies to our (lack of) success. That doesn’t mean that the pilot data weren’t interesting for a different reason.

Participants saw a sequence of 20 animal photographs and rated each on one of two bipolar dimensions: speed or goodness. The former is straightforward. The latter could be best construed as an evaluation of moral worth. That is an absurd question. What sorts of answers did we receive?
Animal Virtue Figure 2
My Top 5 observations:

1. The Tortoise is the most moral animal. I anticipated more class-profiling, and a resulting ingroup bias for mammalia. Nope. Perhaps researchers should try an implicit measure?*

2. Aquatic race featuring: Jellyfish vs. Starfish vs. Walrus. Who wins? People give the jellyfish the edge. The starfish has no chance.

3. Nature documentaries frequently bandy about facts like, “hippopotami kill more people every year than heart disease.” My respondents overlooked that; Hippos are more moral than sloths (which nature documentaries never mention for their killing ability).

4. The orangutan is not just a mammal or just a primate, it is a great ape. Huge opportunity for some ingroup favoritism. Instead people favor the cheetah, walrus, and hippo (amongst others). Explain that.

5. Most animals are good. Our scale had a meaningful midpoint, yet all but three animals are above it. Who is bad? Hyena, Barracuda, and Jellyfish. The Jellyfish is worst. And deceptively fast. Perhaps a researcher could prime people with jellyfish and see if they cheat more on that matrices task?**

Perhaps some absurd questions have correct answers? I asked a pair of experts. Pieter Thomas Jefferson Johnson is an ecologist possibly best known for solving a major scientific problem before he was old enough to drink. Michael Jennions is a world renowned evolutionary biologist, known for many things, including this video (the link alone makes this post worthwhile). I asked them to rank the 20 animals for speed and morality. Their speed ratings are similar to each other (r = .91) and the novices (r = .87). Morality was trickier. Both said that any response would be random, or as Piet said, “I would probably tie them all in ranking”. But responses aren’t quite random. Michael rated based on the complexity of the central nervous system (complex = evil), whereas Pieter used “trophic level, followed by an inverse body mass index”. Despite very different approaches, they are mildly correlated with each other (r = .29). Experts and novices all agree on the virtue of the Tortoise, but Michael and Piet are just as fond of the lowly snail.
Animal Virtue Figure 3
*No they shouldn’t.

**Don’t run that study. I mean it.

[7] Forthcoming in the American Economic Review: A Misdiagnosed Failure-to-Replicate

In the paper “One Swallow Doesn’t Make A Summer: New Evidence on Anchoring Effects”, forthcoming in the AER, Maniadis, Tufano and List attempted to replicate a classic study in economics. The results were entirely consistent with the original and yet they interpreted them as a “failure to replicate.” What went wrong?

This post answers that question succinctly; our new paper has additional analyses.

Original results
In an article with >600 citations, Ariely, Loewenstein, and Prelec (2003) showed that people presented with high anchors (“Would you pay $70 for a box of chocolates?”) end up paying more than people presented with low anchors (“Would you pay $20 for a box of chocolates?”). They found this effect in five studies, but the AER replication reran only Study 2. In that study, participants gave their asking prices for aversive sounds that were 10, 30, or 60 seconds long, after a high (50¢), low (10¢), or no anchor.

Replication results

“comparing only the 10-cent and 50-cent anchor conditions, we find an effect size equal to 28.57 percent [the percentage difference between valuations], about half of what ALP found. The p-value […] was equal to 0.253” (p. 8).

So their evidence is unable to rule out the possibility that anchoring is a zero effect. But that is only part of the story. Does their evidence also rule out a sizable anchoring effect? It does not. Their evidence is consistent with an effect much larger than the original.

Fig1 Anchoring post

Those calculations use Maniadis et al.’s definition of effect size: % difference in valuations (as quoted above). An alternative is to divide the difference of means by the standard deviation (Cohen’s d). Using this metric the Original’s effect size is more markedly different from the Replication’s, d=.94 vs. d=.26. However, the 95% confidence interval for the Replication includes effects as big as d=.64, midway between medium and large effects. Whether we examine Maniadis et al.’s operationalization of effect size, then, or Cohen’s d, we arrive at the same conclusion: the Replication is too noisy to distinguish between a nonexistent and a sizable anchoring effect.
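To make the two effect-size definitions concrete, here is a minimal sketch that computes both on the same data. The valuations are made up for illustration (they are not from either study), and the confidence interval uses a common large-sample approximation for the standard error of d, which is an assumption of this sketch rather than the method behind the numbers above.

```python
import math
from statistics import mean, stdev

def percent_difference(high, low):
    """Percentage difference between mean valuations (high vs. low anchor)."""
    return 100.0 * (mean(high) - mean(low)) / mean(low)

def cohens_d(high, low):
    """Difference of means divided by the pooled standard deviation."""
    n1, n2 = len(high), len(low)
    pooled_var = ((n1 - 1) * stdev(high) ** 2
                  + (n2 - 1) * stdev(low) ** 2) / (n1 + n2 - 2)
    return (mean(high) - mean(low)) / math.sqrt(pooled_var)

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for d (large-sample standard error; an assumption)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical valuations in cents (NOT data from either study).
low_anchor = [5, 10, 10, 15, 20, 25, 30, 40]
high_anchor = [10, 10, 15, 20, 30, 35, 40, 50]

pct = percent_difference(high_anchor, low_anchor)
d = cohens_d(high_anchor, low_anchor)
lo, hi = d_confidence_interval(d, len(high_anchor), len(low_anchor))
print(f"percent difference: {pct:.1f}%")
print(f"Cohen's d: {d:.2f}, approx. 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a sample this small the interval is wide enough to contain both zero and a large effect, which is the sense in which a noisy estimate cannot tell the two apart.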

Why is the Replication so imprecise?
In addition to having 12% fewer participants, the Replication has nearly half of all valuations at ≤10¢. Even if anchoring had a large percentage effect, one that doubles WTA from 3¢ to 6¢, the tendency of participants to round both to 5¢ makes it undetectable. And there is the floor effect: valuations so close to $0 cannot drop. One way around this problem is to do something economists do all the time: express the effect size of one variable (How big is the impact of X on Z?) relative to the effect size of another (it is half the effect of Y on Z). Figure 2 shows that, in cents, the effects of both anchoring and duration are smaller in the replication, and that the relative effect of anchoring is comparable across studies.

Fig2 Anchoring post
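The rounding problem and the relative-effect workaround can both be shown with a few lines of arithmetic. This is only a sketch: the 3¢, 6¢, and 5¢ figures come from the example above, while the helper names and the effect sizes in cents are hypothetical and are not the values behind Figure 2.

```python
def round_to_nearest(value_cents, step=5):
    """Round a valuation to the nearest multiple of `step` cents."""
    return step * round(value_cents / step)

def relative_effect(effect_x, effect_y):
    """Effect of X on Z expressed as a fraction of the effect of Y on Z."""
    return effect_x / effect_y

# The example from the text: anchoring doubles WTA from 3 cents to 6 cents...
true_low, true_high = 3, 6
# ...but participants who round to the nearest 5 cents report the same number.
reported_low = round_to_nearest(true_low)
reported_high = round_to_nearest(true_high)
print(reported_low, reported_high)

# Hypothetical effects in cents (NOT the values behind Figure 2).
# effect_x: anchoring effect; effect_y: duration effect.
original = relative_effect(effect_x=20.0, effect_y=40.0)
replication = relative_effect(effect_x=2.0, effect_y=4.0)
print(original, replication)
```

In this toy case the absolute effects differ tenfold across the two studies, yet anchoring is half the size of the duration effect in both, so the relative metric is unaffected by valuations being compressed toward zero.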

The original paper had five studies: four were p<.01 and the fifth was p<.02. Submitting these p-values to p-curve lets us empirically examine the fear expressed by the replicators that the original finding is a false positive. The results strongly reject this possibility; selective reporting is an unlikely explanation for the original paper, p<.0001.
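For readers curious about the logic of that test, here is a simplified sketch of a Fisher-style right-skew test. It assumes exact p-values are available, whereas the text above reports only upper bounds, so the inputs below are hypothetical placeholders and the output will not reproduce the p<.0001 reported here. The function names are illustrative, and an actual p-curve analysis starts from the reported test statistics rather than rounded p-values.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square with even df (closed form)."""
    if df % 2 != 0:
        raise ValueError("df must be even")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def right_skew_test(p_values, alpha=0.05):
    """Fisher-style test that significant p-values are right-skewed.

    Under the null of no true effect, significant p-values are uniform
    on (0, alpha), so p / alpha is uniform on (0, 1).
    """
    significant = [p for p in p_values if p < alpha]
    pp_values = [p / alpha for p in significant]
    chi2 = -2.0 * sum(math.log(pp) for pp in pp_values)
    df = 2 * len(pp_values)
    return chi2, df, chi2_sf_even_df(chi2, df)

# Hypothetical p-values chosen to respect the reported bounds
# (four below .01, one below .02); NOT the exact values from the paper.
example_p_values = [0.001, 0.002, 0.003, 0.004, 0.012]
chi2, df, p_right_skew = right_skew_test(example_p_values)
print(f"chi2({df}) = {chi2:.2f}, p = {p_right_skew:.4f}")
```

A small p-value from this test means the significant results cluster near zero more than they would if there were no true effect, which is what selective reporting of null effects does not produce.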

Some successful replications
Every year Uri runs a replication of Ariely et al.’s Study 1 in his class. In an online survey at the beginning of the semester, students write down the last two digits of their social security number, indicate if they would pay that amount for something (this semester it was for a ticket to watch Jerry Seinfeld live on campus), and then indicate the most they would pay. Figure 3 has this year’s data:

Fig3 Anchoring post

We recently learned that SangSuk Yoon, Nathan Fong and Angelika Dimoka successfully replicated Ariely et al.’s Study 1 with real decisions (in contrast to this paper).

Concluding remark
We are not vouching for the universal replicability of Ariely et al. here. It is not difficult to imagine moderators (beyond floor effects) that attenuate anchoring. We are arguing that the forthcoming “failure-to-replicate” of anchoring in the AER is no such thing.

note: When we discuss others’ work at DataColada we ask them for feedback and offer them space to comment within the original post. Maniadis, Tufano, and List provided feedback only for our paper and did not send us comments to post here.
