[23] Ceiling Effects and Replications

A recent failure to replicate led to an attention-grabbing debate in psychology.

As you may expect from university professors, some of it involved data.  As you may not expect from university professors, much of it involved saying mean things that would get a child sent to the principal’s office (.pdf).

The hostility in the debate has obscured an interesting empirical question. This post aims to answer that interesting empirical question. [1]

Ceiling effect
The replication (.pdf) was pre-registered; it was evaluated and approved by peers, including the original authors, before being run. The predicted effect was not obtained, in two separate replication studies.

The sole issue of contention regarding the data (.xlsx) is that nearly twice as many respondents gave the highest possible answer in the replication as in the original study (about 41% vs about 23%).  In a forthcoming commentary (.pdf), the original author proposes a “ceiling effect” explanation: it is hard to increase something that is already very high.

I re-analyzed the original and replication data to assess this sensible concern.
My read is that the evidence is greatly inconsistent with the ceiling effect explanation.

The experiments
In the original paper (.pdf), participants rated six “dilemmas” involving moral judgments (e.g., How wrong  is it to keep money found in a lost wallet?). These judgments were predicted to become less harsh for people primed with cleanliness (Study 1) or who just washed their hands (Study 2).

The new analysis
In a paper with Joe and Leif (SSRN), we showed that a prominent failure to replicate in economics was invalidated by a ceiling effect. I use the same key analysis here. [2]

It consists of going beyond comparing means, examining instead all observations. The stylized figures below give the intuition. They plot the cumulative percentage of observations for each value of the dependent variable.

The first shows an effect across the board: there is a gap between the curves throughout.
The third shows the absence of an effect: the curves perfectly overlap.

Example Figure
The middle figure captures what a ceiling effect looks like. All values above 2 were brought down to 2 so the lines overlap there, but below the ceiling the gap is still easy to notice.
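To make the intuition concrete, here is a minimal Python sketch of the middle figure. The data are made up for illustration (they are not from either study), and the ceiling of 2 simply mirrors the stylized example: the treatment shifts every answer up by one point, and anything above 2 is recorded as 2.

```python
from collections import Counter

def cumulative_pct(values, support):
    """Cumulative percentage of observations at or below each value in `support`."""
    counts = Counter(values)
    n = len(values)
    running, out = 0, {}
    for v in support:
        running += counts.get(v, 0)
        out[v] = 100.0 * running / n
    return out

CEILING = 2  # as in the stylized middle figure

def censor(values):
    """Record every value above the ceiling as the ceiling itself."""
    return [min(x, CEILING) for x in values]

# Hypothetical data: the treatment adds exactly 1 point to every answer.
control = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4]
treatment = [x + 1 for x in control]

support = range(0, CEILING + 1)
cdf_control = cumulative_pct(censor(control), support)
cdf_treatment = cumulative_pct(censor(treatment), support)

for v in support:
    gap = cdf_control[v] - cdf_treatment[v]
    print(f"value <= {v}: control {cdf_control[v]:5.1f}%  "
          f"treatment {cdf_treatment[v]:5.1f}%  gap {gap:5.1f}")
```

At the ceiling both curves reach 100% and the gap vanishes, but below it the gap remains visible. That is the signature to look for in the real data.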

Let’s now look at real data. Study 1 first: [3]
Ori1  Rep1
It is easy to spot the effect in the original data.
It is just as easy to spot the absence of an effect in the replication.

Study 2 is more compelling:
Ori2 Rep2

In the Original the effect is largest in the 4-6 range. In the Replication about 60% of the data is in that range, far from the ceiling of 7. But still there is no gap between the lines.

Ceiling analysis by original author
In her forthcoming commentary (.pdf), effect size is computed as a percentage and shown to be smaller in scenarios with higher baseline levels (see her Figure 1). This is interpreted as evidence of a ceiling effect.
I don’t think that’s right.

Dividing something by increasingly larger numbers leads to increasingly smaller ratios, with or without a ceiling. Imagine the effect were constant, completely unaffected by ceiling effects. Say a 1 point increase in the morality scale in every scenario. This constant effect would be a smaller % in scenarios with a larger baseline; going from 2 to 3 is a 50% increase, whereas going from 9 to 10 only 11%. [4]

If a store-owner gives you $5 off any item, buying a $25 calculator gets you a 20% discount, while buying a $100 jacket gets you only a 5% discount. But there is no ceiling; you are getting $5 in both cases.
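The arithmetic is easy to check. The short Python sketch below uses only the numbers from the two examples above; the function name is mine.

```python
def pct_change(baseline, effect):
    """A fixed effect expressed as a percentage of the baseline."""
    return 100.0 * effect / baseline

# Constant 1-point effect on the morality scale, different baselines.
for baseline in (2, 9):
    print(f"{baseline} -> {baseline + 1}: {pct_change(baseline, 1):.0f}% increase")

# Constant $5 discount, different prices.
for price in (25, 100):
    print(f"${price} item: {pct_change(price, 5):.0f}% discount")
```

The effect (1 point, $5) never changes. Only the denominator does.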

To eliminate the arithmetic confound, I redid this analysis with effect size defined as the difference of means, rather than %, and there was no association between effect size and share of answers at boundary across scenarios (see calculations, .xlsx).

Ceiling analysis by replicators
In their rejoinder (.pdf), the replicators counter by dropping all observations at the ceiling and showing the results are still not significant.
I don’t think that’s right either.

Dropping observations at the boundary lowers power whether there is a ceiling effect or not, by a lot.  In simulations, I saw drops of 30 percentage points and more, say from 50% to 20% power (R Code). So not getting an effect this way does not support the absence of a ceiling effect problem.
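For readers who want to see the mechanism, here is a rough Python sketch of this kind of simulation. It is not the R code linked above: the 1-7 scale, the means (5.0 vs 5.6), the SD of 1.5, n=40 per cell and the normal-approximation test are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean, variance

def simulate_response(rng, mu, sd=1.5, low=1, high=7):
    """One answer on a 1-7 scale: a normal draw, rounded and clipped to the scale."""
    return min(high, max(low, round(rng.gauss(mu, sd))))

def significant(a, b, alpha=0.05):
    """Two-sided test on the difference of means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if se == 0:
        return False
    z = (mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def estimate_power(drop_ceiling, n=40, sims=2000, seed=1):
    """Share of simulated studies with p<.05, with or without dropping answers of 7."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        control = [simulate_response(rng, 5.0) for _ in range(n)]
        treated = [simulate_response(rng, 5.6) for _ in range(n)]
        if drop_ceiling:
            control = [x for x in control if x < 7]
            treated = [x for x in treated if x < 7]
        hits += significant(control, treated)
    return hits / sims

power_full = estimate_power(drop_ceiling=False)
power_dropped = estimate_power(drop_ceiling=True)
print(f"power with all observations:      {power_full:.2f}")
print(f"power after dropping answers of 7: {power_dropped:.2f}")
```

Dropping the top answers removes more observations from the condition with the higher mean, which shrinks both the sample and the observed difference.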

To formally take ceiling effects into account one can use the Tobit model (common in economics for censored data, see Wikipedia). A feature of this approach is that it allows analyzing the data at the scenario level, where the ceiling effect would actually be happening. I ran Tobits on all datasets. The replications still had tiny effect sizes (<1/20th size of original), with p-values>.8 (STATA code). [5]

Wide logo

Authors’ response
Our policy at DataColada is to give drafts of our post to authors whose work we cover before posting, asking for feedback and providing an opportunity to comment. This causes delays (see footnote 1) but avoids misunderstandings.

The replication authors, Brent Donnellan, Felix Cheung and David Johnson suggested minor modifications to analyses and writing. They are reflected in the version you just read.

The original author, Simone Schnall, suggested a few edits also, and asked me to include this comment from her:

Your analysis still does not acknowledge the key fact: There are significantly more extreme scores in the replication data (38.5% in Study 1, and 44.0% in Study 2) than in the original data. The Tobin analysis is a model-based calculation and makes certain assumptions; it is not based on the empirical data. In the presence of so many extreme scores a null result remains inconclusive.


Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

  1. This blogpost was drafted on Thursday May 29th and was sent to original and replication authors for feedback, offering also an opportunity to comment. The dialogue with Simone Schnall lasted until June 3rd, which is why it appears only today. In the interim Tal Yarkoni and Yoel Inbar, among others, posted their own independent analyses.
  2. Actually, in that paper it was a floor effect.
  3. The x-axis on these graphs had a typo that we were alerted to by Alex Perrone in August, 2014. The current version is correct.
  4. She actually divides by the share of observations at ceiling, but the same intuition and arithmetic apply.
  5. I treat the experiment as nested, with 6 repeated-measures for each participant, one per scenario.

[22] You know what’s on our shopping list

As part of an ongoing project with Minah Jung, a nearly perfect doctoral student, we asked  people to estimate the percentage of people who bought some common items in their last trip to the supermarket. For each of 18 items, we simply asked people (N = 397) to report whether they had bought it on their last trip to the store and also to estimate the percentage of other people who bought it [1].

Take a sample item: Laundry Detergent. Did you buy laundry detergent the last time you went to the store? What percentage of other people [2] do you think purchased laundry detergent? The correct answer is that 42% of people bought laundry detergent. If you’re like me, you see that number and say, “that’s crazy, no one buys laundry detergent.” If you’re like Minah, you say, “that’s crazy, everyone buys laundry detergent.” Minah had just bought laundry detergent, whereas I had not. Our biases are shared by others. People who bought detergent thought that 69% of others bought detergent whereas non-buyers thought that number was only 29%. Those are really different. We heavily emphasize our own behavior when estimating the behavior of others [3].
Grocery Shopping Figure 1
That effect, generally referred to as the false consensus effect (see classic paper .pdf), extends beyond estimates of detergent purchase likelihoods. All of the items (e.g., milk, crackers, etc.) showed a similar effect. The scatterplot below shows estimates for each of the products. The x-axis is the actual percentage of purchasers and the y-axis reports estimated percentages (so the identity line would be a perfectly accurate estimate).
Grocery Shopping Figure 2
For every single product, buyers gave a higher estimate than non-buyers; the false consensus effect is quite robust. People are biased. But a second observation gets its own chart. What happens if you just average the estimates from everyone?
Grocery Shopping Figure 3
That is a correlation of r = .95.

As a judgment and decision making researcher, one of my tasks is to identify idiosyncratic shortcomings in human thinking (e.g., the false consensus effect). Nevertheless, under the right circumstances, I can be entranced by accuracy. In this case, I marvel at the wisdom of crowds. Every person has a ton of error (e.g., “I have no idea whether you bought detergent”) and a solid amount of bias (e.g., “but since I didn’t buy detergent, you probably didn’t either.”). When we put all of that together, the error and the bias cancel out. What’s left over is astonishing amounts of signal.
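The detergent numbers show how this cancelling works. The sketch below pools the two biased estimates reported above, under the assumption (mine) that buyers make up 42% of the people giving estimates, matching the reported purchase rate.

```python
# Figures from the detergent example above.
share_buyers = 0.42      # actual share who bought detergent
est_buyers = 69.0        # buyers' average estimate (%)
est_nonbuyers = 29.0     # non-buyers' average estimate (%)

# Assumption: respondents are 42% buyers and 58% non-buyers, so the crowd's
# average estimate is the weighted average of the two groups.
crowd_estimate = share_buyers * est_buyers + (1 - share_buyers) * est_nonbuyers

print(f"buyers overshoot by      {est_buyers - 100 * share_buyers:+.1f} points")
print(f"non-buyers undershoot by {est_nonbuyers - 100 * share_buyers:+.1f} points")
print(f"pooled estimate: {crowd_estimate:.1f}% vs actual {100 * share_buyers:.0f}%")
```

Each group misses by double digits, yet under this assumption the pooled estimate lands within about four points of the truth.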

Minah and I could cheerfully use the same data to write one of two papers. The first could use a pervasive judgmental bias (18 out of 18 products show the effect!) to highlight the limitations of human thinking. A second paper could use the correlation (.95!) to highlight the efficiency of human thinking. Fortunately, this is a blog post, so I get to comfortably write about both.

Sometimes, even with judgmental shortcomings in the individual, there is still judgmental genius in the many.


  1. Truth be told, it was ever so slightly more complicated. We asked half the people to talk about purchases from their next shopping trip. To first approximation there are no differences between these conditions, so for the simplicity of verb tense I refer to the past.
  2. “Other people” was articulated as “other people who are also answering this question on mTurk.”
  3. In fact, you might recall from Colada[16] that Joe is rather publicly prone to this error.

[21] Fake-Data Colada

Recently, a psychology paper (.pdf) was flagged as possibly fraudulent based on statistical analyses (.pdf). The author defended his paper (.html), but the university committee investigating misconduct concluded it had occurred (.pdf).

In this post we present new and more intuitive versions of the analyses that flagged the paper as possibly fraudulent. We then rule out p-hacking among other benign explanations.

Excessive linearity
The whistleblowing report pointed out the suspicious paper had excessively linear results.
That sounds more technical than it is.

Imagine comparing the heights of kids in first, second, and third grade, with the hypothesis that higher grades have taller children. You get samples of n=20 kids in each grade, finding average heights of: 120 cms, 126 cms, and 130 cms. That’s an almost perfectly linear pattern: 2nd graders [126] are almost exactly between the other two groups [mean(120,130)=125].

The scrutinized paper has 12 studies with three conditions each. The Control was too close to the midpoint of the other two in all of them. It is not suspicious for the true effect to be linear. Nothing wrong with 2nd graders being 125 cm tall. But, real data are noisy, so even if the effect is truly and perfectly linear, small samples of 2nd graders won’t average 125 every time.

Our new analysis of excessive linearity
The original report estimated a less than 1 in 179 million chance that a single paper with 12 studies would lead to such perfectly linear results. Their approach was elegant (subjecting results from two F-tests to a third F-test) but a bit technical for the uninitiated.

We did two things differently:
(1)    Created a more intuitive measure of linearity, and
(2)    Ran simulations instead of relying on F-distributions.

Intuitive measure of linearity
For each study, we calculated how far the Control condition was from the midpoint of the other two. So if in one study the means were: Low=0, Control=61, High=100, our measure compares the midpoint, 50, to the 61 from the Control, and notes they differ by 11% of the High-Low distance. [1]
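As a quick check of the measure, here it is as a small Python function (the function name is mine; the formula is the one given in footnote 1), applied to the example above and to the heights example from earlier.

```python
def linearity_gap(low, control, high):
    """Distance of Control from the Low/High midpoint, as a share of High-Low."""
    return abs(((high + low) / 2 - control) / (high - low))

print(f"Low=0, Control=61, High=100: {linearity_gap(0, 61, 100):.0%}")
print(f"Heights 120, 126, 130 cm:    {linearity_gap(120, 126, 130):.0%}")
```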

Across the 12 studies, the Control conditions were on average just 2.3% away from the midpoint. We ran simulations to see how extreme that 2.3% was.

We drew samples from populations with means and standard deviations equal to those reported in the suspicious paper. Our simulated variables were discrete and bounded, as in the paper, and we assumed that the true mean of the Control was exactly midway between the other two. [2] We gave the reported data every benefit of the doubt.
(see R Code)
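The logic of that simulation can be sketched in a few lines of Python. This is not the linked R code: the 1-7 scale, the true means (3, 4, 5), the SD of 1.5, n=20 per cell and the 1,000 simulated papers are placeholder assumptions, whereas the actual analysis used the means and standard deviations reported in the paper and 100,000 simulations.

```python
import random

def linearity_gap(low, control, high):
    """Distance of Control from the Low/High midpoint, as a share of High-Low."""
    if high == low:
        return float("inf")  # guard added for this sketch; the measure is undefined here
    return abs(((high + low) / 2 - control) / (high - low))

def cell_mean(rng, mu, n, sd=1.5, lo=1, hi=7):
    """Sample mean of n discrete, bounded answers (rounded, clipped normal draws)."""
    draws = [min(hi, max(lo, round(rng.gauss(mu, sd)))) for _ in range(n)]
    return sum(draws) / n

def simulate_paper(rng, studies=12, n=20):
    """Average linearity gap across the studies of one simulated paper."""
    gaps = []
    for _ in range(studies):
        low = cell_mean(rng, 3.0, n)
        control = cell_mean(rng, 4.0, n)  # true Control mean exactly at the midpoint
        high = cell_mean(rng, 5.0, n)
        gaps.append(linearity_gap(low, control, high))
    return sum(gaps) / len(gaps)

rng = random.Random(21)
papers = [simulate_paper(rng) for _ in range(1000)]
share_as_linear = sum(g <= 0.023 for g in papers) / len(papers)
median_gap = sorted(papers)[len(papers) // 2]
print(f"median average gap across simulated papers: {median_gap:.3f}")
print(f"share with average gap <= 2.3%: {share_as_linear:.4f}")
```

Even with the true Control mean placed exactly at the midpoint, sampling error alone keeps the average gap well away from zero.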

Recall that in the suspicious paper the Control was off by just 2.3% from the midpoint of the other two conditions. How often did we observe such a perfectly linear result in our 100,000 simulations?



In real life, studies need to be p<.05 to be published. Could that explain it?

We redid the above chart including only the 45% of simulated papers in which all 12 studies were p<.05. The results changed so little that to save space we put the (almost identical) chart here.

A second witness. Excessive similarity across studies
The original report also noted very similar effect sizes across studies.
The results reported in the suspicious paper convey this:
Colada21_fig2

The F-values are not just surprisingly large, they are also surprisingly stable across studies.
Just how unlikely is that?

We computed the simplest measure of similarity we could think of: the standard deviation of F() across the 12 studies. In the suspicious paper, see figure above, SD(F)=SD(8.93, 9.15, 10.02…)=.866. We then computed SD(F) for each of the simulated papers.

How often did we observe such extreme similarity in our 100,000 simulations?



Two red flags
For each simulated paper we have two measures of excessive similarity: “Control is too close to High-Low midpoint” and “SD of F-values.” These proved uncorrelated in our simulations (r = .004), so they provide independent evidence of aberrant results; we have a conceptual replication of “these data are not real.” [3]

Alternative explanations
1.  Repeat subjects?
Some have speculated that perhaps some participants took part in more than one of the studies. Because of random assignment to condition, that wouldn’t help explain consistency in differences across conditions in different studies. Possibly it would make things worse; repeat participants would increase variability, as studies would differ in the mixture of experienced and inexperienced participants.

2. Recycled controls?
Others have speculated that perhaps the same control condition was used in multiple studies. But controls were different across studies; e.g., Study 2 involved listening to poems, Study 1 seeing letters.

3. Innocent copy-paste error?
Recent scandals in economics (.html) and medicine (.html) have involved copy-pasting errors before running analyses. Here so many separate experiments are involved, with the same odd patterns, that unintentional error seems implausible.

4. P-hacking?
To p-hack you need to drop participants, measures, or conditions.  The studies have the same dependent variables, parallel manipulations, same sample sizes and analysis. There is no room for selective reporting.

In addition, p-hacking leads to p-values just south of .05 (see our p-curve paper, SSRN). All p-values in the paper are smaller than p=.0008.  P-hacked findings do not reliably get this pedigree of p-values.

Actually, with n=20, not even real effects do.


  1. The measure=|((High+Low)/2  – Control)/(High-Low)|
  2. Thus, we don’t use the reported Control mean; our analysis is much more conservative than that.
  3. Note that the SD(F) simulation is not under the null that the F-values are the same, but rather, under the null that the Control is the midpoint. We also carried out 100,000 simulations under this other null and also never got SD(F) that small.

[20] We cannot afford to study effect size in the lab

Methods people often say  – in textbooks, task forces, papers, editorials, over coffee, in their sleep – that we should focus more on estimating effect sizes rather than testing for significance.

I am kind of a methods person, and I am kind of going to say the opposite.

Only kind of the opposite because it is not that we shouldn’t try to estimate effect sizes; it is that, in the lab, we can’t afford to.

The sample sizes needed to estimate effect sizes are too big for most researchers most of the time.

With n=20, forget it
The median sample size in published studies in Psychology is about n=20 per cell. [1] There have been many calls over the last few decades to report and discuss effect size in experiments. Does it make sense to push for effect size reporting when we run small samples? I don’t see how.

Arguably the lowest bar for claiming to care about effect size is to distinguish among Small, Medium, and Large effects. And with n=20 we can’t do even that.

Cheatsheet: I use Cohen’s d to index effect size. d is the number of standard deviations by which the means differ. Small is d=.2, Medium d=.5 and Large d=.8.

The figure below shows 95% confidence intervals surrounding Small, Medium and Large estimates when n=20 (see simple R Code).


Whatever effect we get, we will not be able to rule out effects of a different qualitative size.
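The intervals can be approximated without the figure. The Python sketch below uses the standard large-sample formula for the standard error of Cohen’s d with two equal cells; this is an assumption on my part, and the linked R code may compute the intervals differently (e.g., exactly, via the noncentral t).

```python
from statistics import NormalDist

def d_confidence_interval(d, n_per_cell, level=0.95):
    """Approximate CI for Cohen's d, two cells of equal size (large-sample SE)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = (2 / n_per_cell + d ** 2 / (4 * n_per_cell)) ** 0.5
    return d - z * se, d + z * se

for label, d in (("Small", 0.2), ("Medium", 0.5), ("Large", 0.8)):
    lo, hi = d_confidence_interval(d, n_per_cell=20)
    print(f"{label:<6} d={d}: 95% CI from {lo:+.2f} to {hi:+.2f}")
```

Under this approximation, the interval around a Small estimate reaches past d=.8, and the interval around a Large estimate dips below d=.2.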

Four-digit n’s
It is easy to bash n=20 (please do it often). But just how big an n do we need to study effect size?

I am about to show that the answer has four digits.

It will be rhetorically useful to consider a specific effect size. Let’s go with d=.5. You need n=64 per cell to detect this effect 80% of the time.

If you run the study with n=64, then you will get a confidence interval that will not include zero 80% of the time, but if your estimate is right on the money at d=.5, that confidence interval still will include effects smaller than Small (d<.2) and larger than Large (d>.8). So n=64 is fine for testing whether the effect exists, but not for estimating its size.

Properly powered studies teach you almost nothing about effect size. [2]

What if we go the extra mile, or three, and power it to 99.9%, running n=205 per cell? This study will almost always produce a significant effect, yet the expected confidence interval is massive, spanning a basically small effect (d=.3) to a basically large effect (d=.7).

To get the kind of confidence interval that actually gives confidence regarding effect size, one that is about 0.1 wide (roughly d=.45 to d=.55), we need n=3000 per cell. THREE-THOUSAND. (see simple R Code)  [3]
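A rough way to reproduce these numbers in Python, again assuming the large-sample standard error for Cohen’s d rather than whatever the linked R code uses:

```python
from math import ceil
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def ci_for_d(d, n_per_cell):
    """Approximate 95% CI for Cohen's d with two equal cells (large-sample SE)."""
    se = ((2 + d ** 2 / 4) / n_per_cell) ** 0.5
    return d - Z95 * se, d + Z95 * se

def n_for_width(d, total_width):
    """Smallest per-cell n whose approximate 95% CI is no wider than total_width."""
    return ceil((2 + d ** 2 / 4) * (2 * Z95 / total_width) ** 2)

for n in (64, 205, 1000, 3000):
    lo, hi = ci_for_d(0.5, n)
    print(f"n={n:>4} per cell: d from {lo:.2f} to {hi:.2f} (width {hi - lo:.2f})")

print("per-cell n for a CI 0.1 wide:", n_for_width(0.5, 0.1))
```

Because the width shrinks only with the square root of n, halving the interval takes four times the sample.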

In the lab, four-digit per-cell sample sizes are not affordable.

Advocating a focus on effect size estimation, then, implies advocating for either:
1)       Leaving the lab (e.g., mTurk, archival data). [4]
2)       Running within-subject designs.

Some may argue effect size is so important we ought to do these things.
But that’s a case to be made, not an implication to be ignored.

UPDATE 2014 05 08: A commentary on this post is available here


  1. Based on the degrees of freedom reported in thousands of test statistics I scraped from Psych Science and JPSP.
  2. Unless you properly power for a trivially small effect by running a gigantic sample.
  3. If you run n=1000 the expected confidence interval spans d=.41 and d=.59.
  4. One way to get big samples is to combine many small samples. Whether one should focus on effect size in meta-analysis is not something that seems controversial enough to be interesting to discuss.

[19] Fake Data: Mendel vs. Stapel

Diederik Stapel, Dirk Smeesters, and Lawrence Sanna published psychology papers with fake data. They each faked in their own idiosyncratic way; nevertheless, their data do share something in common. Real data are noisy. Theirs aren’t.

Gregor Mendel’s data also lack noise (yes, famous peas-experimenter Mendel). Moreover, in a mathematical sense, his data are just as lacking in noise. And yet, while there is no reasonable doubt that Stapel, Smeesters, and Sanna all faked, there is reasonable doubt that Mendel did. Why? Why does the same statistical anomaly make a compelling case against the psychologists but not Mendel?

Because Mendel, unlike the psychologists, had a motive. Mendel’s motive is his alibi.

Excessive similarity
To get a sense for what we are talking about, let’s look at the study that first suggested Smeesters was a fabricateur. Twelve groups of participants answered multiple-choice questions. Six were predicted to do well, six poorly.
semesters3

(See retracted paper .pdf)

Results are as predicted.  The lows, however, are too similarly low. Same with highs. In “Just Post It” (SSRN), I computed that this lack of noise had a ~21/100000 chance of occurring if the data were real (additional analyses lower the odds to minuscule).

Stapel and Sanna had data with the same problem. Results too similar to be true. Even if population means were identical, samples have sampling error. Lack of sampling error suggests lack of sampling.

How Mendel is like Stapel, Smeesters & Sanna
Mendel famously crossed plants and observed the share of baby-plants with a given trait. Sometimes he predicted 1/3 would show it, sometimes 1/4. He was, of course, right. The problem, first noted by Ronald Fisher in 1911 (yes, p-value co-inventor Fisher), is that Mendel’s data also, unlike real data, lacked sampling error. His data were too close to his predictions.


Recall how Smeesters’ data had 27/100000 chance if data were real?
Well, Gregor Mendel’s data had 7/100000 chance (See Fisher’s 1936 paper .pdf; especially Table V).

How Mendel is not like Stapel, Smeesters, Sanna
Mendel wanted his data to look like his theory. He had a motive for lacking noise. It made his theory look better.

Imagine Mendel runs an experiment and gets 27% instead of 33% of baby-plants with a trait. He may toss out a plant or two, he may re-run the experiment; he may p-hack. This introduces reasonable doubt. P-hacking is an alternative explanation for Mendel’s anomalous data. (See e.g., Pires & Branco’s 2010 paper suggesting Mendel p-hacked, .pdf).

Sanna, Smeesters and Stapel, in contrast, lack a motive. The similarity is unrelated to their hypotheses. P-hacking to get their results does not explain the degenerate results they get.

One way to think of this is that Mendel’s theory was rich enough to make point predictions, and his data are too similar to these. Psychologists seldom make point predictions; the fabricateurs had their means too similar to each other, not to theory.

Smeesters, for instance, did not need the low conditions to be similar to each other, just to be different from the high, so p-hacking wouldn’t get his data to lack noise. [1] Worse, p-hacking makes Smeesters’ data look more aberrant.

Before resigning his tenured position, Smeesters told the committee investigating him that he merely dropped extreme observations aiming for better results. If this were true, if he had p-hacked that way, his low-means would look too different from each other, not too similar.  [2]

If Mendel p-hacked, his data would look the way they look.
If Smeesters p-hacked, his data would look the opposite of the way they look.

This gives reasonable doubt to Mendel being a fabricateur, and eliminates reasonable doubt for Smeesters.

CODA: Stapel’s dissertation
Contradicting the committee that investigated him, Stapel has indicated that he did not fake his dissertation (e.g., in this NYTimes story .pdf).

Check out this table from a (retracted) paper based on that dissertation (.pdf).
JPSP1996

These are means from a between-subjects design with n<20 per cell. Such small samples produce so much noise as to make it very unlikely to observe means this similar to each other.

Most if not all studies in his dissertation look this way.



  1. Generic file-drawering of n.s. results will make means appear slightly surprisingly similar among the significant subset. But, not enough to raise red flags (see R code). Also, it will not explain the other anomalies in Smeesters data (e.g., lack of round numbers, negative correlations in valuations of similar items).
  2. See details in the Nonexplanations section of “Just Post It”, SSRN.

[18] MTurk vs. The Lab: Either Way We Need Big Samples

Back in May 2012, we were interested in the question of how many participants a typical between-subjects psychology study needs to have an 80% chance to detect a true effect. To answer this, you need to know the effect size for a typical study, which you can’t know from examining the published literature because it severely overestimates effect sizes (.pdf1; .pdf2; .pdf3).

To begin to answer this question, we set out to estimate some effects we expected to be very large, such as “people who like eggs report eating egg salad more often than people who don’t like eggs.” We did this assuming that the typical psychology study is probably investigating an effect no bigger than this. Thus, we reasoned that the sample size needed to detect this effect is probably smaller than the sample size psychologists typically need to detect the effects that they study.

We investigated a bunch of these “obvious” effects in a survey on amazon.com’s Mechanical Turk (N=697). The results are bad news for those who think 10-40 participants per cell is an adequate sample.

Turns out you need 47 participants per cell to detect that people who like eggs eat egg salad more often than those who dislike eggs. The finding that smokers think that smoking is less likely to kill someone requires 149 participants per cell. The irrefutable takeaway is that, to be appropriately powered, our samples must be a lot larger than they have been in the past, a point that we’ve made in a talk on “Life After P-Hacking” (slides).

Of course, “irrefutable” takeaways inevitably invite attempts at refutation. One thoughtful attempt is the suggestion that the effect sizes we observed were so small because we used MTurk participants, who are supposedly inattentive and whose responses are supposedly noisy. The claim is that these effect sizes would be much larger if we ran this survey in the Lab, and so samples in the Lab don’t need to be nearly as big as our MTurk investigation suggests.

MTurk vs. The Lab

Not having yet read some excellent papers investigating MTurk’s data quality (the quality is good; .pdf1; .pdf2; .pdf3), I ran nearly the exact same survey in Wharton’s Behavioral Lab (N=192), where mostly undergraduate participants are paid $10 to do an hour’s worth of experiments.

I then compared the effect sizes between MTurk and the Lab (materials .pdf; data .xls). [1] Turns out…

…MTurk and the Lab did not differ much. You need big samples in both.

Six of the 10 effects we studied were directionally smaller in the Lab sample: [2]

No matter what, you need ~50 per cell to detect that egg-likers eat egg salad more often. The one effect resembling something psychologists might actually care about – smokers think that smoking is less likely to kill someone – was actually quite a bit smaller in the Lab sample than in the MTurk sample: to detect this effect in our Lab would actually require many more participants than on MTurk (974 vs. 149 per cell).
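To translate these per-cell sample sizes into effect sizes, one can invert the usual power formula. The Python sketch below uses the normal approximation for a two-cell comparison at 80% power and two-sided α=.05; the implied d values are back-of-the-envelope figures derived from the sample sizes above, not numbers reported from the data.

```python
from math import ceil
from statistics import NormalDist

_norm = NormalDist()

def n_per_cell(d, power=0.80, alpha=0.05):
    """Per-cell n to detect effect size d (normal approximation, two cells)."""
    z = _norm.inv_cdf(1 - alpha / 2) + _norm.inv_cdf(power)
    return ceil(2 * (z / d) ** 2)

def implied_d(n, power=0.80, alpha=0.05):
    """Effect size for which n per cell gives the stated power (same approximation)."""
    z = _norm.inv_cdf(1 - alpha / 2) + _norm.inv_cdf(power)
    return z * (2 / n) ** 0.5

for label, n in (("egg salad", 47), ("smoking, MTurk", 149), ("smoking, Lab", 974)):
    print(f"{label:<15} n={n:>3} per cell -> d of about {implied_d(n):.2f}")
```

The normal approximation runs a participant or so below exact t-based calculations (it gives 63 rather than 64 per cell for d=.5), which does not matter at this level of precision.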

Four out of 10 effect sizes were directionally larger in the Lab, three of them involving gender differences:

So across the 10 items, some of the effects were bigger in the Lab sample and some were bigger in the MTurk sample.

Most of the effect sizes were very similar, and any differences that emerged almost certainly reflect differences in population rather than data quality. For example, gender differences in weight were bigger in the Lab because few overweight individuals visit our lab. The MTurk sample, by being more representative, had a larger variance and thus a smaller effect size than did the Lab sample. [3]


MTurk is not perfect. As with anything, there are limitations, especially the problem of nonnaïvete (.pdf), and since it is a tool that so many of us use, we should continue to monitor the quality of the data that it produces. With that said, the claim that MTurk studies require larger samples is based on intuitions unsupported by evidence.

So whether we are running our studies on MTurk or in the Lab, the irrefutable fact remains:

We need big samples. And 50 per cell is not big.


  1. To eliminate outliers, I trimmed open-ended responses below the 5th and above the 95th percentiles. This increases effect size estimates. If you don’t do this, you need even more participants for the open-ended items than the figures below suggest.
  2. For space considerations, here I report only the 10 of 12 effects that were significant in at least one of the samples; the .xls file shows the full results. The error bars are 95% confidence intervals.
  3. MTurk’s gender on weight effect size estimate more closely aligns with other nationally representative investigations (.pdf).

[17] No-way Interactions

This post shares a shocking and counterintuitive fact about studies looking at interactions where effects are predicted to get smaller (attenuated interactions).

I needed a working example and went with Fritz Strack et al.’s  (1988, .pdf) famous paper [933 Google cites], in which participants rated cartoons as funnier if they saw them while holding a pen with their lips (inhibiting smiles) vs. their teeth (facilitating them).

holding pens
The paper relies on a sensible and common tactic: Show the effect in Study 1. Then in Study 2 show that a moderator makes it go away or get smaller. Their Study 2 tested if the pen effect got smaller when it was held only after seeing the cartoons (but before rating them).

In hypothesis-testing terms the tactic is:

Study | Statistical Test | Example
#1 | Simple effect | People rate cartoons as funnier with pen held in their teeth vs. lips.
#2 | Two-way interaction | But less so if they hold pen after seeing cartoons.

This post’s punch line:
To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.

Power discussions get muddied by uncertainty about effect size. The punch line above (the “blue fact”) is free of this problem: whatever power Study 1 had, at least twice as many subjects are needed in Study 2, per cell, to maintain it. We know this because we are testing the reduction of that same effect.

Study 1 with the cartoons had n=31 per cell. [1] Study 2 hence needed to increase to at least n=62 per cell, but instead the authors decreased it to n=21. We should not make much of the fact that the interaction was not significant in Study 2.

(Strack et al. do, interpreting the n.s. result as accepting the null of no effect and hence as evidence for their theory.)

The math behind the blue fact is simple enough (see math derivations .pdf | R simulations| Excel Simulations).
Let’s focus on consequences.

A multiplicative bummer
Twice as many subjects per cell sounds bad. But it is worse than it sounds. If Study 1 is a simple two-cell design, Study 2 typically has at least four (2×2 design).
If Study 1 had 100 subjects total (n=50 per cell), Study 2 needs at least 50 x 2 x 4 = 400 subjects total.
If Study 2 instead tests a three-way interaction (attenuation of an attenuated effect), it needs N = 50 x 2 x 2 x 8 = 1600 subjects.

With between-subject designs, two-way interactions are ambitious. Three-ways are more like no-way.
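The arithmetic in this section can be sketched in a few lines of Python. The function name and the `order` parameter are illustrative and not taken from the linked materials; applying the same doubling rule to every added factor is an assumption that matches the two examples above.

```python
def total_subjects(study1_n_per_cell, order):
    """Minimum total N needed to keep Study 1's power.

    order=0: simple two-cell effect (Study 1 itself)
    order=1: two-way interaction (2x2 design)
    order=2: three-way interaction (2x2x2 design)
    Assumes each added moderator fully eliminates the effect it moderates.
    """
    per_cell = study1_n_per_cell * 2 ** order  # per-cell n doubles with each added factor
    cells = 2 ** (order + 1)                   # the number of cells doubles as well
    return per_cell * cells


for order, label in enumerate(["simple effect", "two-way", "three-way"]):
    print(label, total_subjects(50, order))
```

With n=50 per cell in Study 1 this prints 100, 400 and 1600, the totals given above.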

How bad is it to ignore this?
Running Study 2 with the same per-cell n as Study 1 lowers power by ~1/3.
If Study 1 had 80% power, Study 2 would have 51%.
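For readers who want to check the 51% figure, here is a minimal sketch using a normal approximation rather than the exact calculations in the linked derivations. It assumes equal cell sizes, equal variances, a two-sided test at alpha = .05, and a moderator that fully eliminates the effect; the function and parameter names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal


def interaction_power(study1_power, n_ratio=1.0, alpha=0.05):
    """Approximate power of the 2x2 interaction test in Study 2.

    study1_power: power of Study 1's two-cell comparison
    n_ratio: Study 2 per-cell n divided by Study 1 per-cell n
    """
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    # Expected z-statistic implied by Study 1's power
    # (ignoring the negligible chance of significance in the wrong direction).
    z_study1 = z_crit + _norm.inv_cdf(study1_power)
    # A difference of two differences has twice the variance of a single
    # difference, so at equal per-cell n the expected z shrinks by sqrt(2).
    z_study2 = z_study1 * sqrt(n_ratio / 2)
    return _norm.cdf(z_study2 - z_crit)


print(round(interaction_power(0.80), 2))             # same per-cell n: about 0.51
print(round(interaction_power(0.80, n_ratio=2), 2))  # doubled per-cell n: back to 0.8
```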

Why do you keep saying at least?
Because I have assumed the moderator eliminates the effect. If it merely reduces it, things get worse. Fast. If the effect drops by 70%, instead of 100%, you need FOUR times as many subjects in Study 2, again, per cell. If two-cell Study 1 has 100 total subjects, 2×2 Study 2 needs 800.
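The "twice" and "four times" figures follow from one formula under the same normal approximation: the interaction has twice the sampling variance of a simple effect, and the required n scales with the inverse square of the effect being tested. The sketch below is illustrative; `per_cell_multiplier` and `reduction` are hypothetical names, not taken from the linked derivations.

```python
def per_cell_multiplier(reduction):
    """How many times Study 1's per-cell n Study 2 needs to match its power.

    reduction: share of the original effect that the moderator removes
               (1.0 = effect eliminated, 0.7 = effect drops by 70%).
    """
    if not 0 < reduction <= 1:
        raise ValueError("reduction must be in (0, 1]")
    # Factor 2: variance of a difference of differences vs. a single difference.
    # Factor 1/reduction**2: the interaction effect is only `reduction` times
    # as large as the simple effect.
    return 2 / reduction ** 2


for reduction in (1.0, 0.7):
    m = per_cell_multiplier(reduction)
    print(f"effect reduced by {reduction:.0%}: {m:.2f}x subjects per cell")
```

A 70% reduction gives a multiplier just over 4, which is where the roughly 800 total subjects (50 x 4 x 4) comes from.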

How come so many interaction studies have worked?
In order of speculated likelihood:

1) p-hacking: many interactions are post-dicted (“Bummer, p=.14. Do a median split on father’s age… p=.048, nailed it!”) or, if predicted, obtained by dropping subjects, measures, or conditions.

2) Bad inferences: Very often people conclude an interaction ‘worked’ if one effect is p<.05 and the other isn’t. Bad reasoning allows underpowered studies to “work.”
(Gelman & Stern explain the fallacy .pdf, Nieuwenhuis et al. document that it is common .pdf)

3) Cross-overs: Some studies examine if an effect reverses rather than merely goes away; those may need only 30%-50% more subjects per cell.

4) Stuff happens: even if power is just 20%, 1 in 5 studies will work.

5) Bigger ns: Perhaps some interaction studies have run twice as many subjects per cell as Study 1s, or Study 1 was so high-powered that not doubling n still led to decent power.


(you can cite this blogpost using DOI: 10.15200/winn.142559.90552)

  1. Study 1 was a three-cell design, with a pen-in-hand control condition in the middle. Statistical power of a linear trend with three n=30 cells is virtually identical to a t-test on the high-vs-low cells with n=30. The blue fact applies to the cartoons paper all the same.

[16] People Take Baths In Hotel Rooms

This post is the product of a heated debate.

At a recent conference, a colleague mentioned, much too matter-of-factly, that she took a bath in her hotel room. Not a shower. A bath. I had never heard of someone voluntarily bathing in a hotel room. I think bathing is preposterous. Bathing in a hotel room is lunacy.

I started asking people to estimate the percentage of people who had ever bathed in a hotel room. The few that admitted bathing in a hotel room guessed 15-20%. I guessed 4%, but then decided that number was way too high. A group of us asked the hotel concierge for his estimate. He said 60%, which I took as evidence that he lacked familiarity with numbers.

One of the participants in this conversation, Chicago professor Abigail Sussman, suggested testing this empirically. So we did that.

We asked 532 U.S.-resident MTurkers who had spent at least one night in a hotel within the last year to recall their most recent hotel stay (notes .pdf; materials .pdf; data .xls). We asked them a few questions about their stay, including whether they had showered, bathed, both, or, um, neither.

Here are the results, removing those who said their hotel room definitely did not have a bathtub (N = 442).

Ok, about 80% of people took a normal person’s approach to their last hotel stay. One in 20 didn’t bother cleaning themselves (respect), and 12.4% took a bath, including an incomprehensible subset who bathed but didn’t shower. Given that these data capture only their most recent hotel stay, the proportion of people bathing is at least an order of magnitude higher than I expected.


Gender Differences

If you had told me that 12.4% of people report having taken a bath during their last hotel stay, I’d have told you to include some men in your sample. Women have to be at least 5 times more likely to bathe. Right?


Women bathed more than men (15.6% vs. 10.8%), but only by a small, nonsignificant margin (p=.142). Also surprising is that women and men were equally likely to take a Pigpen approach to life.


What Predicts Hotel Bathing?

Hotel quality: People are more likely to bathe in higher quality hotels, and nobody bathes in a one-star hotel.


Perceptions of hotel cleanliness: People are more likely to bathe when they think hotels are cleaner, although almost 10% took a bath despite believing that hotel rooms are somewhat dirtier than the average home.


Others in the room: Sharing a room with more than 1 person really inhibits bathing, as it should, since it’s pretty inconsiderate to occupy the bathroom for the length of time that bathing requires. More than one in five people bathe when they are alone.


Bathing History & Intentions

We also asked people whether they had ever, as an adult, bathed in a hotel room. Making a mockery of my mockery, the concierge’s estimate was slightly closer to the truth than mine was, as fully one-third (33.3%) reported doing so. One in three.

I give up.

Finally, we asked those who had never bathed in a hotel room to report whether they’d ever consider doing so. Of the 66.7% who said they had never bathed in a hotel room, 64.1% said they’d never consider it.

So only 43% have both never bathed in a hotel room and say they would never consider it.

Those are my people.

[15] Citing Prospect Theory

Kahneman and Tversky’s (1979) Prospect Theory (.pdf), with its 9,206 citations, is the most cited article in Econometrica, the prestigious journal in which it appeared. In fact, it is more cited than any article published in any economics journal. [1]

Let’s break it down by year.

[Figure: number of Prospect Theory citations per year]

To be clear, this figure shows that just in 2013, Prospect Theory got about 700 citations.

Kahneman won the Nobel prize in Economics in 2002. This figure suggests a Nobel bump in citations. To examine whether the Nobel bump is real, I got citation data for other papers. I will get to that in about 20 seconds. Let’s not abandon Prospect Theory just yet.

Fan club.
Below we see which researchers and which journals have cited Prospect Theory the most. Leading the way is the late Duncan Luce with his 38 cites.


If you are wondering, Kahneman would be ranked 14th with 24 cites, and Tversky 15th with 23. Richard Thaler comes in 33rd place with 16, and Dan Ariely in 58th with 12.

How about journals?


I think the most surprising member of the top 5 is Management Science; it only recently started its Behavioral Economics and Judgment & Decision Making departments.

Not drinking the Kool-Aid
The first article to cite Prospect Theory came out the same year, 1979, in Economics Letters (.pdf). It provided a rational explanation for risk attitudes differing for gains and losses. The story is perfect if one is willing to make ad-hoc assumptions about the irreversibility of decisions and if one is also willing to ignore the fact that Prospect Theory involves small stakes decisions.  Old school.

Correction: an earlier version of this post indicated the Journal of Political Economy did not cite Prospect Theory until 2005; in fact it was cited already in 1987 (.html).

About that Nobel bump
The first figure in this post suggests the Nobel led to more Prospect Theory cites. I thought I would look at other 1979 Econometrica papers as a “placebo” comparison. It turned out that they also showed a marked and sustained increase in the early 2000s. Hm?

I then realized that Heckman’s famous “Sample Selection as Specification Error” paper was also published in Econometrica in 1979 (good year!) and Heckman, it turns out, got the Nobel in 2000. My placebo was no good: whether the bump was real or spurious, it was expected to show the same pattern.

So I used Econometrica 1980. The figure below shows that when Prospect Theory cites are deflated by cites of all articles published in Econometrica in 1980, the same Nobel bump pattern emerges. Before the Nobel, Prospect Theory was getting about 40% as many cites per year as all 1980 Econometrica articles combined. Since then that ratio has been rising; in 2013 they were nearly tied.

[Figure: ratio of Prospect Theory cites to cites of all Econometrica 1980 articles, by year]

Let’s take this out of sample: did other econ Nobel laureates get a bump in citations? I looked for laureates from different time periods and whose award I thought could be tied to a specific paper.

There seems to be something there, though the Coase cites started increasing a bit early and Akerlof’s a bit late. [2]

[Figure: citations per year for Coase (1960) and Akerlof (1970)]

  1. Only two psychology articles have more citations: Baron and Kenny’s paper introducing mediation (.pdf) with 21,746, and Bandura’s on Self-Efficacy (.html) with 9,879.
  2. The papers: Akerlof, 1970 (.pdf) & Coase, 1960 (.pdf)

[14] How To Win A Football Prediction Contest: Ignore Your Gut

This is a boastful tale of how I used psychology to win, er, dominate a football prediction contest.

Back in September, I was asked to represent my department – Operations and Information Management – in a Wharton School contest to predict NFL football game outcomes. Having always wanted a realistic chance to outperform Adam Grant at something, I agreed.

The contest involved making the same predictions that sports gamblers make. For each game, we predicted whether the superior team (the favorite) was going to beat the inferior team (the underdog) by more or less than the Las Vegas point spread. For example, when the very good New England Patriots played the less good Pittsburgh Steelers, we had to predict whether or not the Patriots would win by more than the 6.5-point spread. We made 239 predictions across 16 weeks.

Contrary to popular belief, oddsmakers in Las Vegas don’t set point spreads in order to ensure that half of the money is wagered on the favorite and half the money is wagered on the underdog. Rather, their primary aim is to set accurate point spreads, ones that give the favorite (and underdog) a 50% chance to beat the spread. [1] Because Vegas is good at setting accurate spreads, it is very hard to perform better than chance when making these predictions. The only way to do it is to predict the NFL games better than Vegas does.

Enter Wharton professor Cade Massey and professional sports analyst Rufus Peabody. They’ve developed a statistical model that, for an identifiable subset of football games, outperforms Vegas. Their Massey-Peabody power rankings are featured in the Wall Street Journal, and from those rankings you can compute expected game outcomes. For example, their current rankings (shown below) say that the Broncos are 8.5 points better than the average team on a neutral field whereas the Seahawks are 8 points better. Thus, we can expect, on average, the Broncos to beat the Seahawks by 0.5 points if they were to play on a neutral field, as they will in Sunday’s Super Bowl. [2]
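To make the rankings-to-prediction step concrete, here is a small illustrative sketch. The ratings (8.5 and 8) and the 2.4-point home-field adjustment come from this post; the function names and the rule for picking against the spread are assumptions made for illustration, not a description of how Massey-Peabody or the contest picks actually work.

```python
HOME_FIELD_POINTS = 2.4  # home-field adjustment mentioned in footnote 2


def expected_margin(rating_a, rating_b, home_team=None):
    """Expected points by which team A beats team B.

    Ratings are points better than an average team on a neutral field.
    home_team is "a", "b", or None for a neutral field.
    """
    margin = rating_a - rating_b
    if home_team == "a":
        margin += HOME_FIELD_POINTS
    elif home_team == "b":
        margin -= HOME_FIELD_POINTS
    return margin


def pick_against_spread(favorite_margin, spread):
    """Hypothetical rule: back the favorite only if the model expects it
    to win by more than the point spread."""
    return "favorite" if favorite_margin > spread else "underdog"


# Broncos (8.5) vs. Seahawks (8.0) on a neutral field
print(expected_margin(8.5, 8.0))  # 0.5
```

Under this hypothetical rule, a favorite that the model expects to win by 4 points against a 6.5-point spread would be a pick for the underdog.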


My approach to the contest was informed by two pieces of information.

First, my work with Leif (.pdf) has shown that naïve gamblers are biased when making these predictions – they predict favorites to beat the spread much more often than they predict underdogs to beat the spread. This is because people’s first impression about which team to bet on ignores the point spread and is thus based on a simpler prediction as to which team will win the game. Since the favorite is usually more likely to win, people’s first impressions tend to favor favorites. And because people rarely talk themselves out of these first impressions, they tend to predict favorites against the spread. This is true even though favorites don’t win against the spread more often than underdogs (paper 1, .pdf), and even when you manipulate the point spreads to make favorites more likely to lose (paper 2, .pdf). Intuitions for these predictions are just not useful.

Second, knowing that evidence-based algorithms are better forecasters than humans (.pdf), I used the Massey-Peabody algorithm for all my predictions.

So how did the results shake out? (Notes on Analyses; Data)

First, did my Wharton colleagues also show the bias toward favorites, a bias that would indicate that they are no more sophisticated than the typical gambler?

Yes. All of them predicted significantly more favorites than underdogs.


Second, how did I perform relative to the “competition?”

Since everyone loves a humble champion, let me just say that my victory is really a victory for Massey-Peabody. I don’t deserve all of the accolades. Really.

Yeah, for about the millionth time (see meta-analysis, .pdf), we see that statistical models outperform human forecasters. This is true even (especially?) when the humans are Wharton professors, students, and staff.

So, if you want to know who is going to win this Sunday’s Super Bowl, don’t ask me and don’t ask the bestselling author of Give and Take. Ask Massey-Peabody.

And they will tell you, unsatisfyingly, that the game is basically a coin flip.

  1. Vegas still makes money in the long run because gamblers have to pay a fee in order to bet.
  2. For any matchup involving home field advantage, give an additional 2.4 points to the home team.