[47] Evaluating Replications: 40% Full ≠ 60% Empty

Last October, Science published the paper “Estimating the Reproducibility of Psychological Science” (.pdf), which reported the results of 100 replication attempts. Today it published a commentary by Gilbert et al. (.pdf) as well as a response by the replicators (.pdf).

The commentary makes two main points. First, because of sampling error, we should not expect all of the effects to replicate even if all of them were true. Second, differences in design between original studies and replication attempts may explain differences in results. Let’s start with the latter.[1]

Design differences
The commentators provide some striking examples of design differences. For example, they write, “An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon” (p. 1037).

People can debate whether such differences can explain the results (and in their reply, the replicators explain why they don’t think so). However, for readers to consider whether design differences matter, they first need to know those differences exist. I, for one, was unaware of them before reading Gilbert et al. (They are not mentioned in the 6-page Science article .pdf, nor in the 26-page supplement .pdf). [2]

This is not about pointing fingers, as I have also made this mistake: I did not sufficiently describe differences between original and replication studies in my Small Telescopes paper (see Colada [43]).

This is also not about taking a position on whether any particular difference is responsible for any particular discrepancy in results. I have no idea. Nor am I arguing design differences are a problem per se; in most cases they were even approved by the original authors.

This is entirely about improving the reporting of replications going forward. After reading the commentary I better appreciate the importance of prominently disclosing design differences. This better enables readers to consider the consequences of such differences, while encouraging replicators to anticipate and address, before publication, any concerns they may raise. [3]

Noisy results
I am also sympathetic to the commentators’ other concern, which is that sampling error may explain the low reproducibility rate. Their statistical analyses are not quite right, but neither are those by the replicators in the reproducibility project.

A study result can be imprecise enough to be consistent both with an effect existing and with it not existing. (See Colada[7] for a remarkable example from Economics). Clouds are consistent with rain, but also consistent with no rain. Clouds, like noisy results, are inconclusive.

The replicators interpreted inconclusive replications as failures, the commentators as successes. For instance, one of the analyses by the replicators considered replications as successful only if they obtained p<.05, effectively treating all inconclusive replications as failures. [4]

Both sets of authors examined whether the results from one study were within the confidence interval of the other, selectively ignoring sampling error of one or the other study.[5]

In particular, the replicators deemed a replication successful if the original finding was within the confidence interval of the replication. Among other problems, this approach leads most true effects to fail to replicate with sufficiently big replication samples.[6]

The commentators, in contrast, deemed replications successful if their estimate was within the confidence interval of the original. Among other problems, this approach leads too many false-positive findings to survive most replication efforts.[7]

For more on these problems with effect size comparisons, see p. 561 in “Small Telescopes” (.pdf).
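Both asymmetries are easy to see in a toy simulation. All numbers below (the true effect, the per-cell sample sizes, the just-significant original) are made up for illustration; the two rates it prints correspond to the problems footnotes 6 and 7 describe:

```python
import random

rng = random.Random(7)
sims = 20000

# (a) Replicators' criterion: original inside the replication's 95% CI.
# A true effect (d=.3), original n=20/cell, replication n=100000/cell.
n_o, n_r, true_d = 20, 100000, 0.3
se_o, se_r = (2 / n_o) ** 0.5, (2 / n_r) ** 0.5
published, fails = 0, 0
for _ in range(sims):
    d_o = rng.gauss(true_d, se_o)
    if d_o / se_o < 1.96:           # publication bias: only significant
        continue                    # originals get published (and inflated)
    published += 1
    d_r = rng.gauss(true_d, se_r)   # huge replication: tiny CI around truth
    fails += abs(d_o - d_r) > 1.96 * se_r
fail_rate_true = fails / published

# (b) Commentators' criterion: replication inside the original's 95% CI.
# A just-significant false positive (true d=0), same-sized replication.
d_o = 1.97 * se_o                   # a p ~ .049 original
survive_rate_false = sum(
    abs(rng.gauss(0, se_o) - d_o) < 1.96 * se_o for _ in range(sims)) / sims

print(round(fail_rate_true, 2), round(survive_rate_false, 2))
```

With these numbers, essentially every inflated-but-true published effect “fails” under criterion (a), and roughly half of the false positives “survive” under criterion (b).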

Accepting the null
Inconclusive replications are not failed replications.

For a replication to fail, the data must support the null. They must affirm the non-existence of a detectable effect. There are four main approaches to accepting the null (see Colada [42]). Two lend themselves particularly well to evaluating replications:

(i) Small Telescopes (.pdf): Test whether the replication rejects effects big enough to be detectable by the original study, and (ii) Bayesian evaluation of replications (.pdf).

These are philosophically and mathematically very different, but in practice they often agree. In Colada [42] I reported that for this very reproducibility project, the Small Telescopes and the Bayesian approach are correlated r = .91 overall, and r = .72 among replications with p>.05. Moreover, both find that about 30% of replications were inconclusive. (R Code).  [8],[9]

40% full is not 60% empty
The opening paragraph of the response by the replicators reads:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled”

They are saying the glass is 40% full.  They are not explicitly saying it is 60% empty. But readers may be forgiven for jumping to that conclusion, and they almost invariably have.  This opening paragraph would have been equally justified:
“[…] the Open Science Collaboration observed that the original result failed to replicate in ~30 of 100 studies sampled”

It would be much better to fully report:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled, failed to replicate in ~30, and that the remaining ~30 replications were inconclusive.”

1. Replications must be analyzed in ways that allow for results to be inconclusive, not just success/fail.
2. Design differences between original and replication should be prominently disclosed.


Author feedback.
I shared a draft of this post with Brian Nosek, Dan Gilbert and Tim Wilson, and invited them and their co-authors to provide feedback. I exchanged over 20 emails total with 7 of them. Their feedback greatly improved, and considerably lengthened, this post. Colada Co-host Joe Simmons provided lots of feedback as well.  I kept editing after getting feedback from all of them, so the version you just read is probably worse and surely different from the versions any of them commented on.

Concluding remarks
My views on the state of social science and what to do about it are almost surely much closer to those of the reproducibility team than to those of the authors of the commentary. But. A few months ago I came across a “Rationally Speaking” podcast (.htm) by Julia Galef (relevant part of transcript starts on page 7, .pdf) where she talks about debating with a “steel-man” version, as opposed to a straw-man version, of an argument. It changed how I approach disagreements. For example, the Gilbert et al. commentary opens with what appears to be an incorrectly calculated probability. One could straw-man argue against the commentary by focusing on that calculation. But the argument that probability is meant to support does not hinge on precisely estimating it. There are other weak links in the commentary, but its steel-man version, the one focusing on its strengths rather than weaknesses, did make me think harder about the issues at hand, and I ended up with what I think is an improved perspective on replications.

We are greatly indebted to the collaborative work of 100s of colleagues behind the reproducibility project, and to Brian Nosek for leading that gargantuan effort (as well as many other important efforts to improve the transparency and replicability of social science). This does not mean we should not try to improve on it or to learn from its shortcomings.




  1. The commentators actually focus on three issues: (1) (sampling) error, (2) statistical power, and (3) design differences. I treat (1) and (2) as the same problem []
  2. However, the 100 detailed study protocols are available online (.htm), and so people can identify them by reading those protocols. For instance, here (.htm) is the (8 page) protocol for the military vs honeymoon study. []
  3. Brandt et al (JESP 2014) understood the importance of this long before I did, see their ‘Replication Recipe’ paper .pdf []
  4. Any true effect can fail to replicate with a small enough sample, a point made in most articles making suggestions for conducting and evaluating replications, including Small Telescopes (.pdf). []
  5. The original paper reported 5 tests of reproducibility: (i) Is the replication p<.05? (ii) Is the original within the confidence interval of the replication? (iii) Does the replication team subjectively rate it as a success vs. a failure? (iv) Is the replication directionally smaller than the original? and (v) Is the average of original and replication significantly different from zero? In the post I focus only on (i) and (ii) because (iii) is not a statistic with evaluative properties (and, in any case, also does not include an ‘inconclusive bin’), and neither (iv) nor (v) measures reproducibility: (iv) measures publication bias (with lots of noise), and I couldn’t say what (v) measures. []
  6. Most true findings are inflated due to publication bias, so the unbiased estimate from the replication will eventually reject it. []
  7. For example, the prototypical p-hacked p=.049 finding has a confidence interval that nearly touches zero. To obtain a replication outside that confidence interval, therefore, we need to observe a negative estimate. If the true effect is zero, that will happen only 50% of the time, so about half of false-positive p=.049 findings would survive replication attempts []
  8. Alex Etz in his blog post did the Bayesian analyses long before I did and I used his summary dataset, as is, to run my analyses. See his PLOS ONE paper, .htm. []
  9. The Small Telescope approach finds that only 25% of replications conclusively failed to replicate, whereas the Bayesian approach says this number is about 37%. However, several of the disagreements come from results that barely accept or don’t accept the null, so the two agree more than these two figures suggest. In the last section of Colada[42] I explain what causes disagreements between the two. []

[43] Rain & Happiness: Why Didn’t Schwarz & Clore (1983) ‘Replicate’?

In my “Small Telescopes” paper, I introduced a new approach to evaluate replication results (SSRN). Among other examples, I described two studies as having failed to replicate the famous Schwarz and Clore (1983) finding that people report being happier with their lives when asked on sunny days.

Figure and text from Small Telescopes paper (SSRN)
I recently had an email exchange with a senior researcher (not involved in the original paper) who persuaded me I should have been more explicit regarding the design differences between the original and replication studies. If my paper weren’t published, I would add a discussion of such differences and would explain why I don’t believe they can explain the failures to replicate.

Because my paper is already published, I write this post instead.

The 1983 study
This study is so famous that a paper telling the story behind it (.pdf) has over 450 Google cites.  It is among the top-20 most cited articles published in JPSP and the most cited by either (superstar) author.

In the original study a research assistant called University of Illinois students either during the “first two sunny spring days after a long period of gray, overcast days”, or during two rainy days within a “period of low-hanging clouds and rain” (p. 298, .pdf).

She asked about life satisfaction and then current mood. At the beginning of the phone conversation, she either did not mention the weather, mentioned it in passing, or described it as being of interest to the study.

The reported finding is that “respondents were more satisfied with their lives on sunny than rainy days—but only when their attention was not drawn to the weather” (p. 298, .pdf).

Feddersen et al. (.pdf) matched weather data to the Australian Household Income Survey, which includes a question about life satisfaction. With 90,000 observations, the effect was basically zero.

There are at least three notable design differences between the original and replication studies:[1]

1. Smaller causes have smaller effects. The 1983 study focused on days on which weather was expected to have large mood effects; the Australian sample used the whole year. The first sunny day in spring is not like the 53rd sunny day of summer.

2. Already attributed. Respondents answered many questions in Australia before reporting their life-satisfaction, possibly misattributing mood to something else.

3. Noise. The representative sample is more diverse than a sample of college undergrads; thus the data are noisier and less likely to detectably exhibit any effect.

Often this is where discussions of failed replications end—with the enumeration of potential moderators, and the call for more and better data. I’ll try to use the data we already have to assess whether any of the differences are likely to matter.[2]

Design difference 1. Smaller causes.
If weather contrasts were critical for altering mood and hence possibly happiness, then the effect in the 1983 study should be driven by the first sunny day in spring, not the Nth rainy day. But a look at the bar chart above shows the opposite: People were NOT happier the first sunny day of spring; they were unhappier on the rainy days. Their description of those days again: ‘…and the rainy days we used were several days into a new period of low-hanging clouds and rain’ (p. 298, .pdf).

The days driving the effect, then, were similar to previous days. Because of how seasons work, most days in the replication studies presumably were also similar to the days that preceded them (sunny after sunny and rainy after rainy), and so on this point the replication does not seem different or problematic.

Second, Lucas and Lawless (JPSP 2014, .pdf) analyzed a large (N=1 million) US sample and also found no effect of weather on life satisfaction. Moreover, they explicitly assessed if unseasonably cloudy/sunny days, or days with sunshine that differed from recent days, were associated with bigger effects. They were not. (See their Table 3).

Third, the effect size Schwarz and Clore report is enormous: 1.7 points on a 1-10 scale. To put that in perspective: from other studies, we know that the life-satisfaction gap between people who got married vs. people who became widowed over the past year is about 1.5 on the same scale (see Figure 1, Lucas 2005 .pdf). Life vs. death is estimated as less impactful than precipitation. Even if the effect were smaller on days not as carefully selected as those by Schwarz and Clore, the ‘replications’ averaging across all days should still have detectable effects.

The large effect is particularly surprising considering it is the downstream effect of weather on mood, and that effect is really tiny (see Tal Yarkoni’s blog review of a few studies .htm).

Design difference  2. Already attributed.
This concern, recall, is that people answering many questions in a survey may misattribute their mood to earlier questions. This makes sense, but the concern applies to the original as well.

The phone call from Schwarz & Clore’s RA does not come immediately after the “mood induction” either; rather, participants get the RA’s phone call hours into a rainy vs. sunny day. Before the call they presumably made evaluations too, answering questions like “How are you and Lisa doing?” “How did History 101 go?” “Man, don’t you hate Champaign’s weather?” etc. Mood could have been misattributed to any of these earlier judgments in the original as well. Our participants’ experiences do not begin when we start collecting their data. [3]

Design difference 3. Noise.
This concern is that the more diverse sample in the replication makes it harder to detect any effect. If the replication were noisier, we would expect the dependent variable to have a higher standard deviation (SD). For life satisfaction, Schwarz and Clore got about SD=1.69; Feddersen et al., SD=1.52. So there is less noise in the replication. [4] Moreover, the replication has panel data and controls for individual differences via fixed effects. These account for 50% of the variance, so it has spectacularly less noise. [5]
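Footnote 4 notes the 1983 SDs were computed from the reported test statistics. For a two-sample t-test the algebra is a one-liner; the sketch below uses made-up numbers, not the actual statistics from either paper:

```python
import math

def pooled_sd(mean_diff, t, n1, n2):
    """Recover the pooled SD from a reported two-sample t-test.
    Since t = mean_diff / (sd * sqrt(1/n1 + 1/n2)), solve for sd."""
    return mean_diff / (t * math.sqrt(1 / n1 + 1 / n2))

# e.g., a reported difference of 1.7 with t=2.8 and 14 per cell (hypothetical)
print(round(pooled_sd(1.7, 2.8, 14, 14), 2))  # → 1.61
```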

Concluding bullet points.
– The existing data are overwhelmingly inconsistent with current weather affecting reported life satisfaction.
– This does not imply the theory behind Schwarz and Clore (1983), mood-as-information, is wrong.


Author feedback
I sent a draft of this post to Richard Lucas (.htm), who provided valuable feedback and additional sources. I also sent a draft to Norbert Schwarz (.htm) and Gerald Clore (.htm). They provided feedback that led me to clarify when I first identified the design differences between the original and replication studies (back in 2013, see footnotes 1 & 2). They turned down several invitations to comment within this post.



  1. The first two were mentioned in the first draft of my paper, but I unfortunately cut them out during a major revision, around May 2013. The third was proposed in February of 2013 in a small mailing list discussing the first talk I gave on my Small Telescopes paper []
  2. There is also the issue, as Norbert Schwarz pointed out to me in an email in May of 2013, that the 1983 study is not about weather nor life satisfaction, but about misattribution of mood. The ‘replications’ do not even measure mood. I believe we can meaningfully discuss whether the effect of rain on happiness replicates without measuring mood; in fact, the difficulty of manipulating mood via weather is one thing that makes the original finding surprising. []
  3. What one needs to explain the differences via the presence of other questions is that mood effects from weather replenish through the day, but not immediately. So on sunny days at 7AM I think my cat makes me happier than usual, and then at 10AM that my calculus teacher’s jokes are funnier than usual, but if the joke had been told at 7:15AM I would not have found it funny because I had already attributed my mood to the cat. This is possible. []
  4. Schwarz and Clore did not report SDs, but one can compute them off the reported test statistics. See Supplement 2 for Small Telescopes .pdf. []
  5. See Feddersen et al.’s Table A1, column 4 vs 3, .pdf []

[42] Accepting the Null: Where to Draw the Line?

We typically ask if an effect exists.  But sometimes we want to ask if it does not.

For example, how many of the “failed” replications in the recent reproducibility project published in Science (.pdf) suggest the absence of an effect?

Data have noise, so we can never say ‘the effect is exactly zero.’ We can only say ‘the effect is basically zero.’ What we do is draw a line close to zero and, if we are confident the effect is below the line, we accept the null.

[Whiteboard drawing: confidence intervals that do and do not include the line]

We can draw the line via Bayes or via p-values; it does not matter very much. The line is what really matters. How far from zero is it? What moves it up and down?

In this post I describe 4 ways to draw the line, and then pit the top-2 against each other.

Way 1. Absolutely small
The oldest approach draws the line based on absolute size. Say, diets leading to losing less than 2 pounds have an effect of basically zero. Economists do this often. For instance, a recent World Bank paper (.html) reads

“The impact of financial literacy on the average remittance frequency has a 95 percent confidence interval [−4.3%, +2.5%] …. We consider this a relatively precise zero effect, ruling out large positive or negative effects of training” (emphasis added)
(Dictionary note. Remittance: immigrants sending money home).

In much of behavioral science effects of any size can be of theoretical interest, and sample sizes are too small to obtain tight confidence intervals, making this approach unviable in principle and in practice [1].
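The Way-1 rule is mechanical: accept the null when the whole confidence interval falls inside a band around zero. A quick sketch (normal approximation; the ±.1 band and two-cell design are illustrative choices, not from the World Bank paper) shows why small samples make this unviable:

```python
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.975)

def ci_within(d_hat, n_per_cell, bound=0.1):
    """True if the 95% CI for Cohen's d lies entirely inside (-bound, bound)."""
    half = z95 * (2 / n_per_cell) ** 0.5   # approximate CI half-width for d
    return -bound < d_hat - half and d_hat + half < bound

# Even a point estimate of exactly zero needs a very large sample:
print(ci_within(0.0, 700), ci_within(0.0, 800))  # → False True
```

And since the estimate itself is noisy, the n needed in practice is larger still, in line with footnote 1’s n=1500 per cell.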

Way 2. Undetectably Small
In our first p-curve paper with Joe and Leif (SSRN), and in my “Small Telescopes” paper on evaluating replications (.pdf), we draw the line based on detectability.

We don’t draw the line where we stop caring about effects.
We draw the line where we stop being able to detect them.

Say an original study with n=50 finds people can feel the future. A replication with n=125 ‘fails,’ getting an effect estimate of d=0.01, p=.94. Data are noisy, so the confidence interval goes all the way up to d=.2. That’s a respectably big feeling-the-future effect we are not ruling out. So we cannot say the effect is absolutely small.

The original study, with just n=50, however, is unable to detect that small an effect (it would have <18% power). So we accept the null: the null that the effect is either zero or undetectably small by existing studies.
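The power claim in that example is easy to check with a normal approximation. The design is not spelled out above, so I assume a two-cell study with n=50 per cell; treat the numbers as illustrative:

```python
from statistics import NormalDist

Z = NormalDist()

def power_two_cell(d, n_per_cell, alpha=0.05):
    """Approximate two-sided power of a two-cell test of Cohen's d."""
    se = (2 / n_per_cell) ** 0.5
    crit = Z.inv_cdf(1 - alpha / 2)
    return 1 - Z.cdf(crit - d / se) + Z.cdf(-crit - d / se)

# Power of the original to detect the d=.2 the replication cannot rule out:
print(round(power_two_cell(0.2, 50), 2))  # → 0.17, i.e., under 18%
```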

Way 3. Smaller than expected in general
Bayesian hypothesis testing runs a horse race between two hypotheses:

Hypothesis 1 (null):              The effect is exactly zero.
Hypothesis 2 (alternative): The effect is one of those moderately sized ones [2].

When data clearly favor 1 over 2, we accept the null. The bigger the effects Hypothesis 2 includes, the further from zero we draw the line, and the more likely we are to accept the null [3].

The default Bayesian test, commonly used by Bayesian advocates in psychology, draws the line too far from zero (for my taste). Reasonably powered studies of moderately big effects wrongly accept the null of zero effect too often (see Colada[35]) [4].

Way 4. Smaller than expected this time
A new Bayesian approach to evaluate replications, by Verhagen and Wagenmakers (2014 .pdf), pits a different Hypothesis 2 against the null. Its Hypothesis 2 is what a Bayesian observer would predict for the replication after seeing the Original (with some assumed prior).

Similar to Way 3, the bigger the effect seen in the original, the bigger the effect we expect in the replication, and hence the further from zero we draw the line. Importantly, here the line moves based on what we observed in the original, not (only) on what we arbitrarily choose to consider reasonable to expect. The approach is the handsome cousin of testing whether effect size differs between original and replication.
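Under normal approximations, the Way-4 logic can be sketched in a few lines: take the original’s (flat-prior) posterior as the alternative and compare how well it, vs. the point null, predicts the replication estimate. This is a sketch of the logic only, not Verhagen and Wagenmakers’ exact test (which uses t likelihoods and a default prior); the per-cell n’s for two-cell designs are my assumption:

```python
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def replication_bf0(d_orig, n_orig, d_rep, n_rep):
    """Bayes factor for the null vs. 'what the original predicts',
    evaluated at the replication's estimate. Values > 1 favor the null."""
    se_o = (2 / n_orig) ** 0.5
    se_r = (2 / n_rep) ** 0.5
    m_null = normal_pdf(d_rep, se_r)                      # H0: d = 0
    m_orig = normal_pdf(d_rep - d_orig,                   # H2: posterior
                        (se_o ** 2 + se_r ** 2) ** 0.5)   # from the original
    return m_null / m_orig

# Study 7 below (d=2.14, ~49/cell) vs. its tiny replication (d=.26, ~7/cell):
print(replication_bf0(2.14, 49, 0.26, 7) > 1)      # → True: favors the null
# The N=40 vs. N=5000 hypothetical below:
print(replication_bf0(0.70, 20, 0.10, 2500) < 1)   # → True: favors an effect
```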

Small Telescope vs Expected This Time (Way 2 vs Way 4)
I compared the conclusions both approaches arrive at when applied to the 100 replications from that Science paper. The results are similar but far from equal, r = .9 across all replications, and r = .72 among n.s. ones (R Code). Focusing on situations where the two lead to opposite conclusions is useful to understand each better [5],[6].

In Study 7 in the Science paper,
The Original estimated a monstrous d=2.14 with N=99 participants total.
The Replication estimated a small d=0.26, with a minuscule N=14.

The Small Telescopes approach is irked by the small sample of the replication. Its wide confidence interval includes effects as big as d=1.14, giving the original >99% power. We cannot rule out detectable effects; the replication is inconclusive.

The Bayesian observer, in contrast, draws a line quite far from zero after seeing the massive Original effect size. The line, indeed, is at a remarkable d=.8. Replications with smaller effect size estimates, anything smaller than large, ‘support the null.’ Because the replication is d=.26, it strongly supports the null.

A hypothetical scenario where they disagree in the opposite direction (R Code):
Original.       N=40,       d=.7
Replication.  N=5000, d=.1

The Small Telescopes approach asks if the replication rejects an effect big enough to be detectable by the original. Yes. d=.1 cannot be studied with N=40. Null Accepted [7].

Interestingly, that small N=40 pushes the Bayesian in the opposite direction. An original with N=40 changes her beliefs about the effect very little, so d=.1 in the replication is not that surprising vs. the Original, but it is incompatible with d=0 given the large sample size; null rejected.

I find myself agreeing with the Small Telescopes’ line more than any other. But that’s a matter of taste, not fact.
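For concreteness, the Small Telescopes side of both verdicts above can be sketched with a normal approximation (per-cell n’s assumed for two-cell designs; this captures the logic, not the paper’s exact procedure):

```python
from statistics import NormalDist

Z = NormalDist()

def d33(n_orig_per_cell):
    """Smallest effect the original could detect with 33% power (approx.)."""
    se = (2 / n_orig_per_cell) ** 0.5
    return se * (Z.inv_cdf(0.975) + Z.inv_cdf(0.33))

def small_telescope(d_rep, n_rep_per_cell, n_orig_per_cell, alpha=0.05):
    """'failed' if the replication significantly rejects d33 (one-sided);
    otherwise the replication is 'inconclusive'."""
    se_r = (2 / n_rep_per_cell) ** 0.5
    z = (d_rep - d33(n_orig_per_cell)) / se_r
    return "failed" if Z.cdf(z) < alpha else "inconclusive"

print(small_telescope(0.26, 7, 49))      # Study 7: → inconclusive
print(small_telescope(0.10, 2500, 20))   # the N=5000 hypothetical: → failed
```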




  1. e.g., we need n=1500 per cell to have a confidence interval entirely within −.1 < d < .1 []
  2. The tests don’t formally assume the effects are moderately large, rather they assume distributions of effect size, say N(0,1). These distributions include tiny effects, even zero, but they also include very large effects, e.g., d>1 as probable possibilities.  It is hard to have intuitions for what assuming a distribution entails. So for brevity and clarity I just say they assume the effect is moderately large. []
  3. Bayesians don’t accept and reject hypotheses; instead, the evidence supports one or another hypothesis. I will use the term accept anyway. []
  4. This is fixable in principle, just define another alternative. If someone proposes a new Bayesian test, ask them “what line around zero is it drawing?”  Even without understanding Bayesian statistics you can evaluate if you like the line the test generates or not. []
  5. Alex Etz in a blogpost (.html) reported the Bayesian analysis of the 100 replications, I used some of his results here. []
  6. These are the Spearman correlations between the p-value testing the null that the original had at least 33% power and the Bayes factor described above. []
  7. Technically, it is the upper end of the confidence interval we consider when evaluating the power of the original sample; it goes up to d=.14, but I used d=.1 to keep things simpler []

[41] Falsely Reassuring: Analyses of ALL p-values

It is a neat idea. Get a ton of papers. Extract all p-values. Examine the prevalence of p-hacking by assessing if there are too many p-values near p=.05. Economists have done it [SSRN], as have psychologists [.html], and biologists [.html]. These charts with distributions of p-values come from those papers:

[Charts from those papers: distributions of p-values]

The dotted circles highlight the excess of .05s, but most p-values are way smaller, suggesting p-hacking happens but is not a first-order concern. That’s reassuring, but falsely reassuring [1],[2].

Bad Sampling.
There are several problems with looking at all p-values; here I focus on sampling [3].

If we want to know if researchers p-hack their results, we need to examine the p-values associated with their results, those they may want to p-hack in the first place. Samples, to be unbiased, must only include observations from the population of interest.

Most p-values reported in most papers are irrelevant for the strategic behavior of interest: covariates, manipulation checks, main effects in studies testing interactions, etc. Including them, we underestimate p-hacking and overestimate the evidential value of data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?” [4].
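Why would p-hacking pile p-values up just under .05 in the first place? A toy simulation, assuming optional stopping as the mechanism (one p-hacking mechanism of many; all numbers invented), shows the signature that gets diluted when irrelevant p-values are mixed in:

```python
import math
import random
from statistics import NormalDist

Z = NormalDist()
rng = random.Random(1)

def p_two_sided(z):
    return 2 * (1 - Z.cdf(abs(z)))

def p_hacked():
    """Null effect + optional stopping: peek every 5 subjects up to n=100,
    stopping at the first p<.05 (a toy model of p-hacking)."""
    xs = [rng.gauss(0, 1) for _ in range(10)]
    while True:
        p = p_two_sided(sum(xs) / math.sqrt(len(xs)))  # one-sample z-test
        if p < .05 or len(xs) >= 100:
            return p
        xs += [rng.gauss(0, 1) for _ in range(5)]

def p_true():
    """One honest test of a real effect (hypothetical d=.5, n=50)."""
    return p_two_sided(rng.gauss(0.5 * math.sqrt(50), 1))

def shares(ps):
    """Among significant p-values: share at p<=.01 vs. share in (.04,.05)."""
    sig = [p for p in ps if p < .05]
    return (sum(p <= .01 for p in sig) / len(sig),
            sum(p > .04 for p in sig) / len(sig))

print(shares([p_hacked() for _ in range(20000)]))  # more mass near .05
print(shares([p_true() for _ in range(20000)]))    # more mass near zero
```

The p-hacked significant results bunch near .05; the honestly significant ones bunch near zero. Mixing in hundreds of irrelevant p-values per paper drowns the first pattern in the second.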

A Demonstration.
In our first p-curve paper (SSRN) we analyzed p-values from experiments with results reported only with a covariate.

We believed researchers would report the analysis without the covariate if it were significant, thus we believed those studies were p-hacked. The resulting p-curve was left-skewed, so we were right.

Figure 2. p-curve for relevant p-values in experiments reported only with a covariate.

I went back to the papers we had analyzed and redid the analyses, only this time I did them incorrectly.

Instead of collecting only the (23) p-values one should select (we provide detailed directions for selecting p-values in our paper, SSRN), I proceeded the way the indiscriminate analysts of p-values proceed: I got ALL (712) p-values reported in those papers.

Figure 3. p-curve for all p-values reported in papers behind Figure 2

Figure 3 tells us that the things those papers were not studying were super true.
Figure 2 tells us the things they were studying were not.

Looking at all p-values is falsely reassuring.


Author feedback
I sent a draft of this post to the first author of the three papers with charts reprinted in Figure 1 and the paper from footnote 1. They provided valuable feedback that improved the writing and led to footnotes 2 & 4.



  1. The Econ and Psych papers were not meant to be reassuring, but they can be interpreted that way. For instance, a recent J of Econ Perspectives (.pdf) paper reads, “Brodeur et al. do find excess bunching, [but] their results imply that it may not be quantitatively as severe as one might have thought.” The PLOS Biology paper was meant to be reassuring. []
  2. The PLOS Biology paper had two parts. The first used the indiscriminate selection of p-values from articles in a broad range of journals and attempted to assess the prevalence and impact of p-hacking in the field as a whole. This part is fully invalidated by the problems described in this post. The second used p-values from a few published meta-analyses on sexual selection in evolutionary biology; this second part is by construction not representative of biology as a whole. In the absence of a p-curve disclosure table, where we know which p-value was selected from each study, it is not possible to evaluate the validity of this exercise. []
  3. For other problems see Dorothy Bishop’s recent paper [.html] []
  4. Brodeur et al. did painstaking work to exclude some irrelevant p-values, e.g., those explicitly described as control variables, but nevertheless left many in. To give a sense, they obtained an average of about 90 p-values from each paper. To give a concrete example, one of the papers in their sample is by Ferreira and Gyourko (.pdf). Via regression discontinuity it shows that a mayor’s political party does not predict policy. To demonstrate the importance of their design, Ferreira & Gyourko also report naive OLS regressions with highly significant but spurious and incorrect results that at face value contradict the paper’s thesis (see their Table II). These very small but irrelevant p-values were included in the sample by Brodeur et al. []

[40] Reducing Fraud in Science

Fraud in science is often attributed to incentives: we reward sexy-results→fraud happens. The solution, the argument goes, is to reward other things.  In this post I counter-argue, proposing three alternative solutions.

Problems with the Change the Incentives solution.
First, even if rewarding sexy-results caused fraud, it does not follow we should stop rewarding sexy-results. We should pit costs vs benefits. Asking questions with the most upside is beneficial.

Second, if we started rewarding unsexy stuff, a likely consequence is fabricateurs continuing to fake, now just unsexy stuff.  Fabricateurs want the lifestyle of successful scientists. [1] Changing incentives involves making our lifestyle less appealing. (Finally, a benefit to committee meetings). 

Third, the evidence for “liking sexy→fraud” is just not there. Like real research, most fake research is not sexy. Life-long fabricateur Diederik Stapel mostly published dry experiments with “findings” in line with the rest of the literature. That we attend to and remember the sexy fake studies is diagnostic of what we pay attention to, not what causes fraud.  

The evidence that incentives cause fraud comes primarily from self-reports, with fabricateurs saying “the incentives made me do it” (see e.g., Tijdink et al .pdf; or Stapel interviews). To me, the guilty saying “it’s not my fault” seems like weak evidence. What else could they say?
“I realized I was not cut-out for this; it was either faking some science or getting a job with less status”
I am kind of a psychopath, I had fun tricking everyone”
“A voice in my head told me to do it”

Similarly weak, to me, is the observation that fraud is more prevalent in top journals; we find fraud where we look for it. Fabricateurs faking articles that don’t get read don’t get caught….

It’s good for universities to ignore quantity of papers when hiring and promoting, good for journals to publish interesting questions with inconclusive answers. But that won’t help with fraud.

Solution 1. Retract without asking “are the data fake?”
We have a high bar for retracting articles, and a higher bar for accusing people of fraud. 
The latter makes sense. The former does not.

Retracting is not such a big deal: it just says “we no longer have confidence in the evidence.”

So many things can go wrong when collecting, analyzing and reporting data that this should be a relatively routine occurrence even in the absence of fraud. An accidental killing may not land the killer in prison, but the victim goes 6 ft under regardless. I’d propose a retraction doctrine like:

If something is discovered that would lead reasonable experts to believe the results did not originate in a study performed as described in a published paper, or to conclude the study was conducted with excessive sloppiness, the journal should retract the paper.   

Example 1. Analyses indicate published results are implausible for a study conducted as described (e.g., excessive linearity, implausibly similar means, or a covariate is impossibly imbalanced across conditions). Retract.

Example 2. Authors of a paper published in a journal that requires data sharing upon request, when asked for it, indicate they have “lost the data”. Retract. [2]

Example 3. Comparing original materials with posted data reveals important inconsistencies (e.g., scale ranges are 1-11 in the data but 1-7 in the original). Retract.

When journals reject original submissions it is not their job to figure out why the authors ran an uninteresting study or executed it poorly. They just reject it.

When journals lose confidence in the data behind a published article it is not their job to figure out why the authors published data in which confidence was eventually lost. They should just retract it.

Employers, funders, and co-authors can worry about why an author published untrustworthy data. 

Solution 2. Show receipts
Penn, my employer, reimburses me for expenses incurred at conferences.

However, I don’t get to just say “hey, I bought some tacos at that Kansas City conference, please deposit $6.16 into my checking account.” I need receipts. They trust me, but there is a paper trail in case of need.

When I submit the work I presented in Kansas City to a journal, in contrast, I do just say “hey, I collected the data this or that way.” No receipts.

The recent Science retraction, with canvassers & gay marriage, is a great example of the value of receipts. The statistical evidence suggested something was off, but the receipts-like paper trail helped a lot:

Author: “so and so ran the survey with such and such company”
Sleuths: “hello such and such company, can we talk with so and so about this survey you guys ran?”
Such and such company: “we don’t know any so and so, and we don’t have the capability to run the survey.”

Authors should provide as much documentation about how they ran their science as they do about what they ate at conferences: where exactly was the study run, at what time and day, which research assistant ran it (with contact information), how exactly were participants paid, etc.

We will trust everything researchers say. Until the need to verify arises.

Solution 3. Post data, materials and code
Had the raw data not been available, the recent Science retraction would probably not have happened. Stapel would probably not have gotten caught. The cases against Sanna and Smeesters would not have moved forward. To borrow from a recent paper with Joe and Leif:

Journals that do not increase data and materials posting requirements for publications are causally, if not morally, responsible for the continued contamination of the scientific record with fraud and sloppiness.  


Feedback from Ivan Oransky, co-founder of Retraction Watch
Ivan co-wrote an editorial in the New York Times on changing the incentives to reduce fraud (.pdf). I reached out to him to get feedback. He directed me to some papers on the evidence on incentives and fraud. I was unaware of, but also unpersuaded by, that evidence. This prompted me to add the last paragraph in the incentives section (where I am skeptical of that evidence).
Despite our different takes on the role of rewarding sexy-findings on fraud, Ivan is on board with the three non-incentive solutions proposed here.  I thank Ivan for the prompt response and useful feedback. (and for Retraction Watch!)



  1. I use the word fabricateur to refer to scientists who fabricate data. Fraudster is insufficiently specific (e.g., selling 10 bagels calling them a dozen is fraud too), and fabricator has positive meanings (e.g., people who make things). Fabricateur has a nice ring to it. []
  2. Every author publishing in an American Psychological Association journal agrees to share data upon request []

[39] Power Naps: When do Within-Subject Comparisons Help vs Hurt (yes, hurt) Power?

A recent Science-paper (.pdf) used a total sample size of N=40 to arrive at the conclusion that implicit racial and gender stereotypes can be reduced while napping. 

N=40 is a small sample for a between-subject experiment. One needs N=92 to reliably detect that men are heavier than women (SSRN). The study, however, was within-subject; for instance, its dependent variable, the Implicit Association Test (IAT), was contrasted within-participant before and after napping. [1]

Reasonable question: How much more power does subtracting baseline IAT give a study?
Surprising answer: it lowers power.

Design & analysis of napping study
Participants took the gender and race IATs, then trained for the gender IAT (while listening to one sound) and the race IAT (a different sound). Then everyone napped. While napping, one of the two sounds was played (to cue memory of the corresponding training, facilitating learning while sleeping). Then both IATs were taken again. Nappers were reported to be less biased in the cued IAT after the nap.

This is perhaps a good place to indicate that there are many studies with similar designs and sample sizes. The blogpost is about strengthening intuitions for within-subject designs, not criticizing the authors of the study.

Intuition for the power drop
Let’s simplify the experiment. No napping. No gender IAT. Everyone takes only the race IAT.

Half train before taking it, half don’t. To test if training works we could do
         Between-subject test: is the mean IAT different across conditions?

If before training everyone took a baseline race IAT, we could instead do
         Mixed design test: is the mean change in IAT different across conditions?

Subtracting baseline, going from between-subject to a mixed-design, has two effects: one good, one bad.

Good: Reduce between-subject differences. Some people have stronger racial associations than others. Subtracting baselines reduces those differences, increasing power.

Bad: Increase noise. The baseline is, after all, just an estimate. Subtracting baseline adds noise, reducing power.

Imagine the baseline was measured incorrectly. The computer recorded, instead of the IAT, the participant’s body temperature. IAT scores minus body temperature is a noisier dependent variable than just IAT scores, so we’d have less power.

If baseline is not quite as bad as body temperature, the consequence is not quite as bad, but same idea. Subtracting baseline adds the baseline’s noise.

We can be quite precise about this. Subtracting baseline only helps power if baseline is correlated r>.5 with the dependent variable, but it hurts if r<.5. [2]
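The simple math behind that .5 cutoff is a variance calculation: if baseline and outcome each have variance σ² and correlate r, the difference score has variance 2σ²(1 − r), which beats σ² only when r > .5. A quick simulation (mine, in Python rather than the post’s R; the data are made up) checks the claim:

```python
import math
import random
import statistics

random.seed(1)
r = 0.3          # before-after correlation
n = 100_000      # simulated participants

# pre and post share a latent trait so that corr(pre, post) = r and each has variance 1
pre, post = [], []
for _ in range(n):
    trait = random.gauss(0, 1)
    pre.append(math.sqrt(r) * trait + math.sqrt(1 - r) * random.gauss(0, 1))
    post.append(math.sqrt(r) * trait + math.sqrt(1 - r) * random.gauss(0, 1))

var_post = statistics.variance(post)                                 # ≈ 1
var_diff = statistics.variance([y - b for y, b in zip(post, pre)])   # ≈ 2(1 - r) = 1.4
```

With r = .3 the difference score carries about 40% more noise than the raw outcome: subtracting baseline added noise rather than removing it.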

See the simple math (.html). Or, just see the simple chart.
e.g., running n=20 per cell and subtracting baseline, when r=.3, lowers power enough that it is as if the sample had been n=15 instead of n=20. (R Code)

Before-After correlation for IAT
Subtracting baseline IAT will only help, then, if when people take it twice, their scores are correlated r>.5. Prior studies have found test-retest reliability of r = .4 for the racial IAT. [3]  Analyzing the posted data (.html) from this study, where manipulations take place between measures, I got r = .35. (For gender IAT I got r=.2) [4]

Aside: one can avoid the power-drop entirely if one controls for baseline in a regression/ANCOVA instead of subtracting it.  Moreover, controlling for baseline never lowers power. See bonus chart (.pdf). 
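The reason controlling never hurts: regressing the outcome on baseline leaves residual variance σ²(1 − r²), which is at most σ² for any r. A sketch (again Python with simulated data, standing in for the post’s R):

```python
import math
import random
import statistics

random.seed(2)
r = 0.3
n = 100_000

# Same data-generating setup as before: corr(pre, post) = r, each variance 1
pre, post = [], []
for _ in range(n):
    trait = random.gauss(0, 1)
    pre.append(math.sqrt(r) * trait + math.sqrt(1 - r) * random.gauss(0, 1))
    post.append(math.sqrt(r) * trait + math.sqrt(1 - r) * random.gauss(0, 1))

# Regress post on pre; with standardized scores the slope is just r
mean_pre, mean_post = statistics.fmean(pre), statistics.fmean(post)
cov = sum((b - mean_pre) * (y - mean_post) for b, y in zip(pre, post)) / (n - 1)
slope = cov / statistics.variance(pre)

# Residual variance after controlling for baseline: 1 - r^2, never above 1
resid_var = statistics.variance([y - slope * b for y, b in zip(post, pre)])
```

At r = .3 the residual variance is about .91, smaller than both the raw outcome’s 1 and the difference score’s 1.4.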

Within-subject manipulations
In addition to subtracting baseline, one may carry out the manipulation within-subject: every participant gets treatment and control. Indeed, in the napping study everyone had a cued and a non-cued IAT.

How much this helps depends again on the correlation of the within-subject measures: Does race IAT correlate with gender IAT?  The higher the correlation, the bigger the power boost. 

Note: Aurélien Allard, a PhD student in Moral Psychology at Paris 8 University, caught an error in the R Code used to generate this figure. He contacted me on 2016/11/02 and I updated the figure 2 days later. You can see the archived version of the post, with the incorrect figure, here.

When both measures are uncorrelated it is as if the study had twice as many subjects. This makes sense. r=0 is as if the data came from different people; asking two questions of n=20 is like asking one question of n=40. As r increases we have more power because we expect the two measures to be more and more similar, so any given difference is more and more statistically significant (R Code for chart) [5].
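The doubling claim follows from the standard errors (a sketch of the arithmetic; between-subject compares two independent groups of n each, within-subject measures the same n people twice):

```python
import math

sigma = 1.0  # common standard deviation of the measure

def se_between(n_per_cell):
    """Standard error of the difference between two independent group means."""
    return sigma * math.sqrt(2 / n_per_cell)

def se_within(n, r):
    """Standard error of the mean within-person difference, when the two
    measures correlate r (difference variance = 2 * sigma^2 * (1 - r))."""
    return sigma * math.sqrt(2 * (1 - r) / n)

# At r = 0, n = 20 people measured twice match a between design with
# 20 per cell (N = 40 total); at r > 0 within-subject does even better
```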

Race & gender IATs capture distinct mental associations, measured with a test of low reliability, so we may not expect a high correlation. At baseline, r(race,gender)=-.07, p=.66.  The within-subject manipulation, then, “only” doubled the sample size.

So, how big was the sample?
The Science-paper reports N=40 people total. The supplement explains that it actually combines two separate studies run months apart, each N=20. The analyses subtracted baseline IAT, lowering power, as if N=15. The manipulation was within-subject, doubling it, to N=30. To detect “men are heavier than women” one needs N=92. [6]

Author feedback
I shared an early draft of this post with the authors of the Science-paper. We had an extensive email exchange that led to clarifying some ambiguities in the writing. They also suggested I mention their results are robust to controlling instead of subtracting baseline.



  1. The IAT is the Implicit Association Test and assesses how strongly respondents associate, for instance, good things with Whites and bad things with Blacks; take a test (.html) []
  2. Two days after this post went live I learned, via Jason Kerwin, of this very relevant paper by David McKenzie (.pdf) arguing for economists to collect data from more rounds. David makes the same point about r>.5 for a gain in power from, in econ jargon, a diff-in-diff vs. the simple diff. []
  3. Bar-Anan & Nosek (2014, p. 676 .pdf); Lane et al. (2007, p.71 .pdf)  []
  4. That’s for post vs pre nap. In the napping study the race IAT is taken 4 times by every participant, resulting in 6 before-after correlations, ranging from r = -.047 to r = .53; simple average r = .3. []
  5. This ignores the impact that going from between to within subject design has on the actual effect itself. Effects can get smaller or larger depending on the specifics. []
  6. The idea of using men-vs-women weight as a benchmark is to give a heuristic reaction; effects big enough to be detectable by the naked eye require bigger samples than the ones we are used to seeing when studying surprising effects. For those skeptical of this heuristic, let’s use published evidence on the IAT as a benchmark. Lai et al (2014 .pdf) ran 17 interventions seeking to reduce IAT scores. The biggest effect among these 17 was d=.49. That effect size requires n=66 per cell, N=132 total, for 80% power (more than for men vs women weight). Moderating this effect through sleep, and moderating the moderation through cueing while sleeping, requires vastly larger samples to attain the same power. []

[36] How to Study Discrimination (or Anything) With Names; If You Must

Consider these paraphrased famous findings:
“Because his name resembles ‘dentist,’ Dennis became one” (JPSP, .pdf)
“Because the applicant was black (named Jamal instead of Greg) he was not interviewed” (AER, .pdf)
“Because the applicant was female (named Jennifer instead of John), she got a lower offer” (PNAS, .pdf)

Everything that matters (income, age, location, religion) correlates with people’s names, hence comparing people with different names involves comparing people with potentially different everything that matters.

This post highlights the problem and proposes three practical solutions. [1]

Jennifer was the #1 baby girl name between 1970 & 1984, while John has been a top-30 boy name for the last 120 years. Comparing reactions to profiles with these names pits mental associations about women in their late 30s/early 40s against those about men of unclear age.

More generally, close your eyes and think of Jennifers. Now do that for Johns.
Is gender the only difference between the two sets of people you considered?

Here is what Google did when I asked it to close its eyes: [2]

[Google image results for “Jennifer” and for “John”]

Johns vary more in age, appearance, affluence, and presidential ambitions. For somewhat harder data, I consulted a website where people rate names on various attributes:

Distinctively black names (e.g., Jamal and Lakisha) signal low socioeconomic status while typical White names do not (QJE .pdf).  Do people not want to hire Jamal because he is Black or because he is of low status?

Even if all distinctively Black names (and even Black people) were perceived as low status, and hence Jamal were an externally valid signal of Blackness, the contrast with Greg might nevertheless be low in internal validity, because the difference attributed to race could instead be the result of status (or some other confounding variable). This is addressable because some (most?) low-status people are not Black. We could compare Black names vs. low-status White names: say, Jamal with Bubba or Billy Bob, and Lakisha with Bambi or Billy Jean. This would allow assessing racial discrimination above and beyond status discrimination. [3]

Imagine reading a movie script where a Black drug dealer is being defended by a brilliant Black lawyer. One of these characters is named Greg, the other Jamal. The intuition that Greg is the lawyer’s name is the intuition behind the internal validity problem.

Solution 1. Stop using names
Probably the best solution is to stop using names to manipulate race and gender.  A recent paper (PNAS .pdf) examined gender discrimination using only pronouns (and found that academics in STEM fields favored females over males 2:1).

Solution 2. Choose many names
A great paper titled “Stimulus Sampling” (PSPB .pdf) argues convincingly for choosing many stimuli for any given manipulation to avoid stumbling on unforeseen confounds. Stimulus sampling would involve going beyond Jennifer vs. John, to using, say, 20 female vs. 20 male names. This helps with idiosyncratic confounds (e.g., age) but not with the systematic confound that most distinctively Black names signal low socioeconomic status. [4]

Solution 3. Choose control names actively
If one chooses to study names, then one needs to select control names that, if it weren’t for the scientific hypothesis of interest, would produce no difference from the target names (e.g., if it weren’t for racial discrimination, then people should like Jamal and this other name just as much).

I close with an example from a paper of mine where I attempted to generate proper control names to examine if people disproportionately marry others with similar names, e.g. Eric-Erica, because of implicit egotism: a preference for things that resemble the self. (JPSP .pdf)

We need control names that we would expect to marry Ericas just as frequently as Erics do in the absence of implicit egotism (e.g., of similar age, religion, income, class and location).  To find such names I looked at the relative frequency of wife names for every male name and asked “What male names have the most similar distribution of wife names to Erics?” [5].

The answer was: Joseph, Frank and Carl. We would expect these three names to marry Erica just as frequently as Eric does, if not for implicit egotism. And we would be right.
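The matching step can be sketched as follows (Python; all the wife-name counts below are made up for illustration): compute, for each male name, the distribution of wife first names, then rank candidate control names by how close their distribution is to Eric’s.

```python
# All wife-name counts below are hypothetical
wife_counts = {
    "Eric":   {"Mary": 30, "Linda": 25, "Susan": 20, "Karen": 25},
    "Joseph": {"Mary": 32, "Linda": 24, "Susan": 19, "Karen": 25},
    "Chad":   {"Mary": 5,  "Linda": 10, "Susan": 15, "Karen": 70},
}

def normalize(counts):
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}

def total_variation(p, q):
    """Distance between two name distributions (0 = identical, 1 = disjoint)."""
    names = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in names)

target = normalize(wife_counts["Eric"])
matches = sorted(
    (total_variation(target, normalize(dist)), name)
    for name, dist in wife_counts.items()
    if name != "Eric"
)
best_control = matches[0][1]  # the male name whose wives look most like Eric's
```

Any distributional distance would do here; total variation is just a simple choice.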

For the Jamal vs. Greg study, we could compare Jamal to non-Black names that have the most similar distribution of occupations, or of Zip Codes, or of criminal records.



Feedback from original authors:
I shared an early draft of this post with the authors of the Jamal vs. Greg and the Jennifer vs. John studies.

Sendhil Mullainathan, co-author of the former, indicated across a few emails he did not believe it was clear one should control for socioeconomic status differences in studies about race, because status and race are correlated in real life.

Corinne Moss-Racusin sent me a note she wrote with her co-authors of their PNAS study:

Thanks so much for contacting us about this interesting topic. We agree that these are thoughtful and important points, and have often grappled with them in our own research. The names we used (John and Jennifer) had been pretested and rated as equivalent on a number of dimensions including warmth, competence, likeability, intelligence, and typicality (Brescoll & Uhlmann, 2005 .pdf), but they were not rated for perceived age, as you highlight here. However, for our study in particular, age of the target should not have extensively impacted our results, because the age of both our targets could easily be inferred from the targets’ resume information that our participants were exposed to. Both the male and female targets (John and Jennifer respectively) were presented as recent college grads (with the same graduation year), and it is thus reasonable to assume that participants believed they were the same age, as recent college grads are almost always the same age (give or take a few years). Thus, although it is possible that age (and other potential variables) may indeed be confounded with gender across our manipulation, we nonetheless do not believe that choosing different male and female names that were equivalent for age would greatly impact our findings, given our design. That said, future research should still seek to replicate our key findings using different manipulations of target gender. Specifically, your suggestions (using only pronouns, and using multiple names) are particularly promising. We have also considered utilizing target pictures in the past, but have encountered issues relating to attractiveness and other confounds.



  1. Galen Bodenhausen read this post and told me about a paper on confounds in names used for gender research, from 1993(!) PsychBull .pdf []
  2. Based on the Jennifers and Johns I see, I suspect Google peeked at my cookies before closing its eyes, e.g., there are two Bay Area business school professors. Your results may differ. []
  3. Bertrand and Mullainathan write extensively about the socioeconomic confound and report a few null results that they interpret as suggesting it is not playing a large role (see their Section “V.B Potential Confounds”, .pdf). However, (1) the n.s. results of socioeconomic status are obtained with extremely noisy proxies and small samples, reducing the ability to conclude evidence of absence from the absence of evidence, and (2) these analyses seek to remedy the consequences of the name-confound rather than avoiding the confound from the get-go through experimental design. This post is about experimental design. []
  4. The Jamal paper used 9 different names per race/gender cell []
  5. To avoid biasing the test against implicit egotism, I excluded from the calculations male and female names starting with E_ []

[35] The Default Bayesian Test is Prejudiced Against Small Effects

When considering any statistical tool I think it is useful to answer the following two practical questions:

1. “Does it give reasonable answers in realistic circumstances?”
2. “Does it answer a question I am interested in?”

In this post I explain why, for me, when it comes to the default Bayesian test that’s starting to pop up in some psychology publications, the answer to both questions is “no.”

The Bayesian test
The Bayesian approach to testing hypotheses is neat and compelling. In principle. [1]

The p-value assesses only how incompatible the data are with the null hypothesis. The Bayesian approach, in contrast, assesses the relative compatibility of the data with a null vs an alternative hypothesis.

The devil is in choosing that alternative.  If the effect is not zero, what is it?

Bayesian advocates in psychology have proposed using a “default” alternative (Rouder et al 1999, .pdf). This default is used in the online (.html) and R based (.html) Bayes factor calculators. The original papers do warn attentive readers that the default can be replaced with alternatives informed by expertise or beliefs (see especially Dienes 2011 .pdf), but most researchers leave the default unchanged. [2]

This post is written with that majority of default-following researchers in mind. I explain why, for me, when running the default Bayesian test, the answer to Questions 1 & 2 is “no.”

Question 1. “Does it give reasonable answers in realistic circumstances?”
No. It is prejudiced against small effects

The null hypothesis is that the effect size (henceforth d) is zero. Ho: d = 0. What’s the alternative hypothesis? It can be whatever we want it to be, say, Ha: d = .5. We would then ask: are the data more compatible with d = 0 or are they more compatible with d = .5?

The default alternative hypothesis used in the Bayesian test is a bit more complicated. It is a distribution, so more like Ha: d~N(0,1). So we ask if the data are more compatible with zero or with d~N(0,1). [3]

That the alternative is a distribution makes it difficult to think about the test intuitively.  Let’s not worry about that. The key thing for us is that that default is prejudiced against small effects.

Intuitively (but not literally), that default means the Bayesian test ends up asking: “is the effect zero, or is it biggish?” When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero. [4]
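To make that concrete, here is a back-of-the-envelope version of the test (my sketch in Python, not the actual default test, which uses a t-likelihood rather than the normal one below). With a normal likelihood for the estimated effect and a normal prior on d, the marginal likelihood under the alternative is also normal, so the Bayes factor has a closed form:

```python
import math

def dnorm(x, sd):
    """Normal density at x, mean 0, standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf01(d_hat, n, prior_sd=0.707):
    """Bayes factor in favor of the null, given an observed effect d_hat
    estimated from n observations with standard error 1/sqrt(n)."""
    se = 1 / math.sqrt(n)
    like_h0 = dnorm(d_hat, se)                               # H0: d = 0
    like_ha = dnorm(d_hat, math.sqrt(se**2 + prior_sd**2))   # Ha: d ~ N(0, prior_sd^2)
    return like_h0 / like_ha

small_precise = bf01(0.04, 2500)  # z = 2, p ≈ .046, yet BF01 > 3: "supports" the null
biggish = bf01(0.1, 400)          # same z = 2, but the null is no longer favored 3:1
```

Holding the z-statistic fixed, shrinking the effect while growing the sample pushes the Bayes factor toward the null: a small-but-real effect loses the “is it zero or is it biggish?” comparison.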

Demo 1. Power at 50%

Let’s see how the test behaves as the effect size gets smaller (R Code).

The Bayesian test erroneously supports the null about 5% of the time when the effect is biggish, d=.64, but it does so five times more frequently when it is smallish, d=.28. The smaller the effect (for studies with a given level of power), the more likely we are to dismiss its existence. We are prejudiced against small effects. [5]

Note how as sample gets larger the test becomes more confident (smaller white area) and more wrong (larger red area).

Demo 2. Facebook
For a more tangible example consider the Facebook experiment (.html) that found that seeing images of friends who voted (see panel a below) increased voting by 0.39% (panel b).

While the null of a zero effect is rejected (p=.02), and hence the entire confidence interval for the effect is above zero, [6] the Bayesian test concludes VERY strongly in favor of the null, 35:1. (R Code)

Prejudiced against (in this case very) small effects.

Question 2. “Does it answer a question I am interested in?”
No. I am not interested in how well data support one elegant distribution.

When people run a Bayesian test they like writing things like
“The data support the null.”

But that’s not quite right. What they actually ought to write is
“The data support the null more than they support one mathematically elegant alternative hypothesis I compared it to”

Saying a Bayesian test “supports the null” in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.

We are constantly reminded that:
The probability of the data given the null is not the probability of the null

But let’s not forget that:
P(H0|D) / P(H1|D) ≠ P(H0|D)
The relative probability of the null over one mathematically elegant alternative is not the probability of the null either.

Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it. The default Bayesian test does not answer a question I would ask.



Feedback from Bayesian advocates:
I shared an early draft of this post with three Bayesian advocates. I asked for feedback and invited them to comment.

1. Andrew Gelman. Expressed “100% agreement” with my argument but thought I should make it clearer this is not the only Bayesian approach, e.g., he writes “You can spend your entire life doing Bayesian inference without ever computing these Bayesian Factors.” I made several edits in response to his suggestions, including changing the title.

2. Jeff Rouder. Provided additional feedback and also wrote a formal reply (.html). He begins by highlighting the importance of comparing p-values and Bayesian Factors when -as is the case in reality- we don’t know if the effect does or does not exist, and the paramount importance for science of subjecting specific predictions to data analysis (again, full reply: .html)

3. EJ Wagenmakers. Provided feedback on terminology, the poetic response that follows, and a more in-depth critique of confidence intervals (.pdf)

“In a desert of incoherent frequentist testing there blooms a Bayesian flower. You may not think it is a perfect flower. Its color may not appeal to you, and it may even have a thorn. But it is a flower, in the middle of a desert. Instead of critiquing the color of the flower, or the prickliness of its thorn, you might consider planting your own flower — with a different color, and perhaps without the thorn. Then everybody can benefit.”




  1. If you want to learn more about it I recommend Rouder et al. 1999 (.pdf), Wagenmakers 2007 (.pdf) and Dienes 2011 (.pdf) []
  2. e.g., Rouder et al (.pdf) write “We recommend that researchers incorporate information when they believe it to be appropriate […] Researchers may also incorporate expectations and goals for specific experimental contexts by tuning the scale of the prior on effect size” p.232 []
  3. The current default distribution is d~N(0,.707), the simulations in this post use that default []
  4. Again, Bayesian advocates are upfront about this, but one has to read their technical papers attentively. Here is an example in Rouder et al (.pdf) page 30: “it is helpful to recall that the marginal likelihood of a composite hypothesis is the weighted average of the likelihood over all constituent point hypotheses, where the prior serves as the weight. As [variance of the alternative hypothesis] is increased, there is greater relative weight on larger values of [the effect size] […] When these unreasonably large values […] have increasing weight, the average favors the null to a greater extent”.   []
  5. The convention is to say that the evidence clearly supports the null if the data are at least three times more likely when the null hypothesis is true than when the alternative hypothesis is, and vice versa. In the chart above I refer to data that do not clearly support the null nor the alternative as inconclusive. []
  6. note that the figure plots standard errors, not a confidence interval []

[34] My Links Will Outlive You

If you are like me, from time to time your papers include links to online references.

Because the internet changes so often, by the time readers follow those links, who knows if the cited content will still be there.

This blogpost shares a simple way to ensure your links live “forever.”  I got the idea from a recent New Yorker article [.html].

Content Rot
It is estimated that about 20%-30% of links referenced in papers are already dead and, like you and me, the remaining links aren’t getting any younger. [1]

I asked a research assistant to follow links in papers published in April of 2005 and April 2010 across four journals, to get a sense of what happens to links 5 and 10 years out. [2]


Perusing results I noticed that:

  • Links still alive tend to involve individual newspaper articles (these will die when that newspaper shuts down) and .pdf articles hosted in university servers (these will die when faculty move on to other institutions).
  • Links to pages whose information has changed involved things like websites with financial information for 2009 (now reporting 2014 data), or working papers now replaced with updated or published versions.
  • Dead links tended to involve websites by faculty and students now at different institutions, and now-defunct online organizations.

If you intend to give future readers access to the information you are accessing today, providing links seems like a terrible way to do that.

Making links “permanent” is actually easy. It involves saving the referenced material on WebArchive.org, a repository that saves individual internet pages “forever.”

Here is an example. The Cincinnati Post was a newspaper that started in 1881 and shut down in 2007. The newspaper had a website (www.cincypost.com). If you visit it today, your browser will show this:


The browser will show the same result if we follow any link to any story ever published by that newspaper.

Using the WebArchive, however, we can still read the subset of stories that were archived, for example, this October 2007 story on a fundraising event by then president George W. Bush (.html)

How to make your links “permanent”
1) Go to http://archive.org/web
2) Enter the URL of interest into the “Save Page Now” box


3) Copy-paste the resulting permanent link into your paper
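For what it’s worth, the permanent links the WebArchive hands back follow a predictable pattern, which is also why they preserve the original URL (a sketch; the timestamp below is hypothetical):

```python
def permanent_link(url, timestamp):
    """WebArchive permanent links embed a 14-digit timestamp (YYYYMMDDhhmmss)
    followed by the original URL."""
    return f"https://web.archive.org/web/{timestamp}/{url}"

# hypothetical timestamp, real post URL
link = permanent_link(
    "http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/",
    "20150301000000",
)
```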

Imagine writing an academic article in which you want to cite, say, Colada[33] “The Effect Size Does not Exist”. The URL is http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/

You could include that link in your paper, but eventually DataColada will die, and so will the content you are linking to. Someone reading your peer-reviewed Colada takedown in ninety years will have no way of knowing what you were talking about. But, if you copy-paste that URL into the WebArchive, you will save the post, and get a permanent link like this:


Done. Your readers can read Colada[33] long after DataColada.org is 6-feet-under.

PS: Note that WebArchive links include the original link. Were the original material to outlive WebArchive, readers could still see it. Archiving is a weakly dominating strategy.



  1. See “Related Work” section in this PlosONE article [.html] []
  2. I chose journals I read: The Journal of Consumer Research, Psychological Science, Management Science and The American Economic Review. Actually, I no longer read JCR articles, but that’s not 100% relevant. []

[33] “The” Effect Size Does Not Exist

Consider the robust phenomenon of anchoring, where people’s numerical estimates are biased towards arbitrary starting points. What does it mean to say “the” effect size of anchoring?

It surely depends on moderators like the domain of the estimate, expertise, and the perceived informativeness of the anchor. Alright, how about “the average” effect size of anchoring? That’s simple enough. Right? Actually, that’s where the problem of interest to this post arises. Computing the average requires answering the following unanswerable question: how much weight should each possible effect size get when computing “the average”?

Should we weight by number of studies? Imagined, planned, or executed? Or perhaps weight by how clean (free-of-confounds) each study is? Or by sample size?

Say anchoring effects are larger when estimating river lengths than door heights: does “the average” anchoring effect give all river studies combined 50% weight and all door studies the other 50%? If so, what do we do with canal-length studies, combine them with rivers or count them on their own?

If we weight by study rather than stimulus, “the average” effect gets larger as more river studies are conducted; and if we weight by sample size, “the average” gets smaller if we run more subjects in the door studies.
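The point is easy to see with made-up numbers. In this toy sketch (all effect sizes and sample sizes are hypothetical), three small river studies find d = .8 and one large door study finds d = .2; the three weighting schemes give three different “averages”:

```python
# Hypothetical studies: three river-length studies, one door-height study.
river_effects = [0.8, 0.8, 0.8]
door_effects = [0.2]
river_ns = [50, 50, 50]
door_ns = [500]

effects = river_effects + door_effects
ns = river_ns + door_ns

# Weight by study: each study counts equally.
by_study = sum(effects) / len(effects)  # (0.8*3 + 0.2) / 4 = 0.65

# Weight by domain: rivers combined get 50%, doors get the other 50%.
by_domain = (sum(river_effects) / len(river_effects)
             + sum(door_effects) / len(door_effects)) / 2  # 0.5

# Weight by sample size: the one big door study dominates.
by_n = sum(e * n for e, n in zip(effects, ns)) / sum(ns)  # 220/650 ≈ 0.34

print(by_study, by_domain, by_n)
```

Run more river studies and the study-weighted average climbs toward .8; run more subjects in the door study and the sample-size-weighted average sinks toward .2. Nothing in the data tells us which scheme is “the” right one.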


What about the impact of anchoring on perceived strawberry-jam viscosity? Nobody has studied that yet, but they could. Does “the average” anchoring effect size include it?

What about all the zero estimates one would get if the experiment were run in a room without any lights, or with confusing instructions? What about all the large effects one would get via demand effects or confounds? Does the average include these?

Studies aren’t random
We can think of the problem using a sampling framework: the studies we run are a sample of the studies we could run. Just not a random sample.

Cheat-sheet. Random sample: every member of the population is equally likely to be selected.

First, we cannot run studies randomly, because we don’t know the relative frequency of every possible study in the population of studies. We don’t know how many “door” vs “river” studies exist in this platonic universe, so we don’t know with what probability to run a door vs a river study.

Second, we don’t want to run studies randomly, we want studies that will provide new information, that are similar to those we have seen elsewhere, that will have higher rhetorical value in a talk or paper, that we find intrinsically interesting, that are less confounded, etc. [1]

What can we estimate?
Given a set of studies, we can ask what the average effect of those studies is. We have to worry, of course, about publication bias; p-curve is just the tool for that. If we apply p-curve to a set of studies, it tells us what effect we would expect to get if we ran those same studies again.

To generalize beyond the data requires judgment rather than statistics.
Judgment can account for non-randomly run studies in a way that statistics cannot.


  1. Running studies with a set instead of a single stimulus is nevertheless very important, but for construct rather than external validity. Running a set of stimuli reduces the risks of stumbling on the single confounded stimulus that works. Check out the excellent “Stimulus Sampling” paper by Wells and Windschitl (.pdf) []