[74] In Press at Psychological Science: A New ‘Nudge’ Supported by Implausible Data

Today Psychological Science issued a Corrigendum (.pdf) and an expression of concern (.pdf) for a paper originally posted online in May 2018 (.pdf). This post will spell out the data irregularities we uncovered that eventually led to the two postings from the journal today. We are not convinced that those postings are sufficient.

It is important to say at the outset that we have not identified who is responsible for the problems. In the correction, for example, the authors themselves make clear that they “do not have an explanation” for some peculiarities, in part because many other people handled the data between collection and reporting. This post is therefore not about who caused the problems [1].

R Code to reproduce all calculations and figures.

The history of the correction starts back in May, in a Shanghai journal club discussion Leif participated in while on sabbatical in China. Puzzled by a few oddities, four members of the group – Frank Yu (.htm), Leif, and two other anonymous researchers – went on to consider the original data posted by the authors (.htm) and identified several patterns that were objectively, instead of merely intuitively, problematic.

Most notably, the posted data has two classic markers of data implausibility:

(i) an anomalous distribution of last digits, and

(ii) excessively similar means.

Leif and his team first went to Uri for an independent assessment of the data; he concurred that the problems looked significant and added new analyses. Then, back in June, they contacted Steve Lindsay (.html), the editor of Psychological Science. In consultation with the editor, the authors then wrote a correction. We deemed this correction insufficient and drafted a blog post. We shared it with the authors and the editor. They asked us to wait while they considered our arguments further. We promised we would, and we did. Eventually they wrote an expression of concern, to be published alongside the Corrigendum, and they shared it with us. Today, six months after we first contacted the editor, we publish this post, in part because these responses (1) seem insufficient given the gravity of the irregularities, and (2) do not convey the irregularities clearly enough for readers to understand their gravity.

The basic design in the original paper.
Li (.htm), Sun (.htm), & Chen report three field experiments showing that the Decoy Effect – a classic finding from decision research [2] – can be used as a nudge to increase the use of hand-sanitizer by food factory workers.

In the experiments, the authors manipulate the set of sanitizer dispensers available, and measure the amount of sanitizer used, by weighing the dispensers at the end of each day. There is one observation per worker-day.

For example, in Experiment 1, some workers only had a spray dispenser, while others had two dispensers, both the spray dispenser and a squeeze-bottle:

The authors postulated that the squeeze-bottle sanitizer was objectively inferior to the spray, and that it would serve as a decoy. Thus, the authors predicted that workers would be more likely to use the spray dispenser when it was next to the squeeze bottle than when it was the only dispenser available [3].

Original results: Huge effects.
Across three studies, the presence of a decoy dispenser increased the use of the spray dispenser by more than 1 standard deviation on average (d = 1.06). That’s a large effect. Notably, in Study 2, only one participant in the control condition increased sanitizer use more than the participant who increased the least in the treatment. Almost non-overlapping distributions.

Problem 1. Inconsistency in scale precision.
The original article indicated that the experimenters used “an electronic scale accurate to 5 grams” (p.4). Such a scale could measure 15 grams, or 20 grams, but not 17 grams. Contradicting this description, the posted data has many observations (8.4% of them) that were not multiples of 5.

The correction states that scales accurate to 1, 2, and 3 grams may sometimes have been used instead of scales precise to 5 grams (we do not believe scales precise to 3 grams exist) [4].

Problem 2. Last digit in Experiment 1
But there is another odd thing about the data purportedly obtained with the more precise scales. The problem involves the frequency of the last digit in the number of grams (by last digit we mean, for example, the 8 in 2018).

In particular, the problem with those observations draws on a generalization of something called “Benford’s Law”, which tells us the last digit should be distributed (nearly) uniformly: there should be just about as many workers using sanitizer amounts that end in 3 grams (e.g., 23 or 43 grams) as in 4 (e.g., 24 or 44 grams), etc. But as we see below, the data looks nothing like the uniform distribution. (If you are not familiar with Benford’s law, read this footnote: [5]).

Fig 1. Histogram for last digit in Study 1 [6].

About this problem, the expression of concern reads:

This speculated behavior (one scale precise to 5 grams used in the morning, another precise to 1 or 2 grams in the afternoon, or vice versa) cannot explain the posted data. Even then a uniform distribution of last digits would be expected, not the bizarre prevalence of 3s and 7s that we see (R Code).
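Our analyses are in the R Code linked above. For readers who want to try the last-digit test on their own data, here is a minimal Python sketch of the same idea, using made-up illustrative data (the chi-square critical value for 9 degrees of freedom is hard-coded):

```python
from collections import Counter

def last_digit_chisq(values):
    """Chi-square statistic (9 df) testing whether the last digit
    of each integer gram value is uniform on 0-9."""
    counts = Counter(int(v) % 10 for v in values)
    n = len(values)
    expected = n / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))

CRIT_9DF_05 = 16.919  # chi-square critical value, 9 df, alpha = .05

# Clean illustrative data: every last digit appears equally often.
uniform_sample = [10 * k + d for d in range(10) for k in range(160)]
# "Heaped" illustrative data: only 3s and 7s as last digits, as in Fig 1.
heaped_sample = [10 * k + d for d in (3, 7) for k in range(800)]

print(last_digit_chisq(uniform_sample))  # 0.0
print(last_digit_chisq(heaped_sample) > CRIT_9DF_05)  # True
```

The `uniform_sample` and `heaped_sample` data are invented for illustration; with real measurements one would feed in the posted gram values directly.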

Problem 3. Last digit in Experiment 3.
Let’s look at the last digit again. In this study sanitizer use was measured for 80 participants over 40 days, with a scale sensitive to 1/100th of a gram. Here the expectation that the last digit be uniformly distributed is even more obvious.

Fig 2. Last digit for Study 3

To appreciate how implausible Fig 2 is, consider that it implies, for example, that workers would be 3 times as likely to use 45.56 grams of sanitizer, as they would be to use 45.53 grams [7].

About this problem, the expression of concern reads:

Problem 4. Implausibly similar means in Experiment 2
In Experiment 2, sanitizer use was measured daily for 40 participants for 40 days (20 days of baseline, 20 of treatment), all with a scale sensitive to 1/100th of a gram.

Recall that the manipulation was done at the room level. This figure, which was in the original article, shows the daily average use of sanitizer across the two rooms.

Treatment started on day 21. In days 1-20 the two rooms had extraordinarily similar means. Average sanitizer usage differed, on average, by just .19 grams across rooms. Moreover, across days, average sanitizer use was correlated at r = .94 across rooms.

To quantify how surprisingly similar the conditions were in the “before treatment” period, we conducted the following resampling test: we shuffle all 40 participants into two new groups of 20 (keeping all observations per worker fixed). We then compute daily means for each of the two groups (‘rooms’). We did this one million times and asked “How often do simulation results look as extreme as the paper’s?” The answer is “almost never”:

So, for example, the figure on the right shows that the correlation between means is on average about r = .7, rather than the r = .94 reported in the paper. Only 96 times in a million would we expect it to be .94 or higher.
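The resampling test itself is in our R Code; below is a hedged Python sketch of the same logic, run on made-up baseline data so it is self-contained. It shuffles the 40 workers into two arbitrary “rooms” and asks how often the rooms’ daily means end up as similar as the paper’s .19-gram average gap (the correlation version of the test works the same way):

```python
import random
import statistics

def room_similarity(baseline, rng):
    """Shuffle the 40 workers into two arbitrary 'rooms' of 20 and return
    the average absolute difference in the rooms' daily mean usage."""
    workers = list(baseline)
    rng.shuffle(workers)
    room_a, room_b = workers[:20], workers[20:]
    n_days = len(baseline[workers[0]])
    gaps = []
    for day in range(n_days):
        mean_a = statistics.mean(baseline[w][day] for w in room_a)
        mean_b = statistics.mean(baseline[w][day] for w in room_b)
        gaps.append(abs(mean_a - mean_b))
    return statistics.mean(gaps)

rng = random.Random(0)
# Made-up baseline data: worker -> 20 daily sanitizer amounts (grams).
baseline = {w: [rng.gauss(50, 10) for _ in range(20)] for w in range(40)}

null_gaps = [room_similarity(baseline, rng) for _ in range(2000)]
observed_gap = 0.19  # the paper's rooms differed by just .19 grams on average
p = sum(g <= observed_gap for g in null_gaps) / len(null_gaps)
print(f"Share of shuffles at least as similar as observed: {p:.4f}")
```

With these invented data, essentially no shuffle produces rooms as similar as the paper’s; the real test in the R Code shuffles the actual posted observations.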

We don’t think readers of the expression of concern would come away with sufficient information to appreciate the impossibility we shared with the authors and editor; all it says about it is:

Problem 5. Last digit… in Experiment 2
While less visually striking than for Experiments 1 and 3, the last digit is not uniformly distributed in this experiment either. With N = 1600 observations from a scale precise to 1/100th of a gram, we reject the uniform null: χ2(9) = 43.45, p < .0001; see histogram .png [8].

We appreciate that the authors acknowledge some of the problems we brought to their attention, and that they cannot assuage concerns because the data collection and management occurred at such a remove. On the other hand, as readers we are at a loss. Three experiments show unambiguous signs of problems with all of the reported data. How can we read the paper and interpret differences across conditions as meaningful while discounting those problems as meaningless? We think it might be warranted to take the opposite view: to see meaning in the long list of problems, and therefore to see the differences across conditions as meaningless.

We should maintain a very high burden of proof to conclude that any individual tampered with data.

But the burden of proof for dataset concerns should be considerably lower. We do not need to know the source of contamination in order to lose trust in the data.

Even after the correction, and the clarifications of the Expression of Concern, we still believe that these data do not deserve the trust of Psychological Science readers.


Author feedback
Our policy (.htm) is to share drafts of blog posts that discuss someone else’s work with them to solicit feedback. As mentioned above we contacted the authors and editor of Psych Science. They provided feedback on wording and asked that we wait while they revised the correction, which we did (for over 6 months).

Just before posting they gave us another round of suggestions and then Meng Li (htm) wrote a separate piece (.htm).

When all is said and done, the original authors have not yet provided benign mechanisms that could have generated the data they reported (neither the last digit pattern, nor the excessive similarity of means).



  1. It is also worth noting that this post is possible because the authors elected to post their data. []
  2. Basic background for the intrigued: The original demonstration is Huber, Payne, and Puto (1982 .pdf). Heath & Chatterjee (1995 .pdf) provide a good review of several studies. []
  3. Study 2 used a soaking basin as a decoy instead. []
  4. Footnote 1 in the correction reads:
  5. About 80 years ago, Benford (.pdf) noticed that with collections of numbers, the leading digit (the one furthest to the left) had a predictable pattern of occurrence: 1’s were more common than 2’s, which were more common than 3’s etc. A mathematical formula generalizing Benford’s law applies to digits further to the right in a different way: as one moves right, to the 2nd, 3rd, 4th digit, etc., those numbers should be distributed closer and closer to uniformly (i.e., 1 is just as common as 2, 3, 4, etc.). Because those predictions are derived mathematically, and observed empirically, violations of Benford’s law are a signal that something is wrong. Benford’s law has, for first and last digits, been used to detect fraud in accounting, elections, and science. See Wikipedia. []
  6. In this appendix (.pdf) we document that the uniform is indeed what you’d expect for these data, even though values on this variable have just 2 digits. []
  7. As an extra precaution, we analyzed other datasets with grams as the dependent variable. We found studies on (i) soup consumption, (ii) brood carcass, (iii) American bullfrog size, and (iv) decomposing bags. Last digit was uniform across the board (See details: .pdf). []
  8. This is perhaps a good place to tell you of an additional anecdotal problem: when preparing the first draft of this post, back in June, we noticed this odd row in Experiment 2. The 5 gram scale makes a surprising re-appearance on day 4, takes a break on day 9, but returns on day 10. []

[73] Don’t Trust Internal Meta-Analysis

Researchers have increasingly been using internal meta-analysis to summarize the evidence from multiple studies within the same paper. Much of the time, this involves computing the average effect size across the studies, and assessing whether that effect size is significantly different from zero.

At first glance, internal meta-analysis seems like a wonderful idea. It increases statistical power, improves precision, moves us away from focusing on single studies, incentivizes reporting non-significant results and emptying the file-drawer, etc.

When we looked closer, however, it became clear that this is the absolute worst thing the field can do. Internal meta-analysis is a remarkably effective tool for making untrue effects appear significant. It is p-hacking on steroids. So it is steroids on steroids.

We have a choice to make: to stop trusting internal meta-analysis or to burn the credibility of the field to the ground. The gory details are below, and in this paper: (SSRN).

Two Assumptions
To understand the problem, consider that the validity of internal meta-analysis rests entirely on two assumptions that are conspicuously problematic:

Assumption 1. All studies in the internal meta-analysis are completely free of p-hacking.
Assumption 2. The meta-analysis includes all valid studies.

Thinking about what it takes to meet these assumptions helps one realize how implausible they are:

Assumption 1 could be met if researchers perfectly pre-registered every study included in the internal meta-analysis, and if they did not deviate from any of their pre-registrations. Assumption 2 could be met if the results of any of the studies do not influence: (1) researchers’ decisions about whether to include those studies in the meta-analysis, and (2) researchers’ decisions about whether to run additional studies to potentially include in the meta-analysis [1].

It seems unlikely that either assumption will be perfectly met, and it turns out that meeting them even a little bit imperfectly is an evidential disaster.

Violating Assumption 1: A pinch of p-hacking is very bad
Using a few common forms of p-hacking can cause the false-positive rate for a single study to increase from the nominal 5% to over 60% (“False-Positive Psychology”; SSRN). That dramatic consequence is nothing compared to what p-hacking does to internal meta-analysis.

There is an intuition that aggregation cancels out error, but that intuition fails when all components share the same bias. P-hacking may be minimal and unintentional, but it is always biased in the same direction. As a result, imperceptibly small biases in individual studies intensify when studies are statistically aggregated.

This is illustrated in the single simulation depicted in Figure 1. In this simulation, we minimally p-hacked 20 studies of a nonexistent effect (d = 0). Specifically, in each study, researchers conducted two analyses instead of just one, and they reported the analysis that produced the more positive effect, even if it was not significant and even if it was of the wrong sign. That’s it. This level of p-hacking is so minimal that it did not cause even one of these 20 studies to be significant on its own. Nevertheless, the meta-analysis of these studies is super significant (and thus super wrong): d = .20, Z = 2.81, p = .0049 [2].

 Figure 1. Minimally p-hacking twenty studies of nonexistent effect leads to super significant meta-analysis.
R Code to reproduce figure.
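The figure’s simulation is in the linked R Code; the mechanism is easy to reproduce. Here is a stylized Python sketch of our own (assumptions: known sd of 1, the two analyses per study treated as independent, fixed-effect meta-analysis). Even this minimal selection makes most 20-study meta-analyses of a true null come out “significant”:

```python
import math
import random

rng = random.Random(2018)
N = 30  # participants per condition; the true effect is zero (d = 0)

def hacked_effect():
    """One study of a true null, minimally p-hacked: compute the effect
    under two arbitrary analysis choices, report the more positive one."""
    def one_d():
        treat = [rng.gauss(0, 1) for _ in range(N)]
        ctrl = [rng.gauss(0, 1) for _ in range(N)]
        return sum(treat) / N - sum(ctrl) / N  # mean difference; sd = 1
    return max(one_d(), one_d())

SE_D = math.sqrt(2 / N)  # standard error of each study's effect size

def meta_is_significant(k=20):
    """Fixed-effect meta-analysis of k minimally p-hacked null studies."""
    ds = [hacked_effect() for _ in range(k)]
    z = (sum(ds) / k) / (SE_D / math.sqrt(k))
    return z > 1.96

rate = sum(meta_is_significant() for _ in range(300)) / 300
print(f"Share of 20-study meta-analyses that are (falsely) significant: {rate:.0%}")
```

Because the same selection bias pushes every study in the same direction, averaging amplifies it rather than canceling it out: the false-positive rate of the meta-analysis lands far above the nominal 5%.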

That’s one simulation. Let’s see what we expect in general. The figure below shows false-positive rates for internal meta-analysis for the kind of minimal p-hacking that increases the false-positive rates of individual studies from 2.5% (for a directional hypothesis) to 6%, 7%, and 8%.

Figure 2. P-hacking exerts a much larger effect on internal meta-analysis than on individual studies.
R Code to reproduce figure.

Make sure you take a moment to breathe this in: If researchers barely p-hack, in a way that increases their single-study false-positive rate from 2.5% to a measly 8%, the probability that their 10-study meta-analysis will yield a false-positive finding is 83%! Don’t trust internal meta-analysis.

Violating Assumption 2: Partially emptying the file-drawer makes things worse
Because internal meta-analysis leaves fewer studies in the file drawer, we may expect it to at least directionally improve things. But it turns out that partially emptying the file-drawer almost surely makes things much worse.

The problem is that researchers do not select studies at random, but are more likely to choose to include studies that are supportive of their claims. At the root of this is the fact that what counts as a valid study for a particular project is ambiguous. When deciding which studies to include in an internal meta-analysis, we must determine whether a failed study did not work because of bad design or execution (in which case it does not belong in the meta-analysis) or whether it did not work despite being competently designed and executed (in which case it belongs in the meta-analysis). Only in the utopian world of a statistics textbook do all research projects consist of several studies that unambiguously belong together. In the real world, deciding which studies belong to a project is often a messy business, and those decisions are likely to be resolved in ways that help the researchers rather than harm them. So what happens when they do that?

Imagine a one-study paper with a barely significant (p=.049) result, and the file-drawer contains two similar studies, with the same sample size, that did not “work,” a p=.20 in the right direction and a p=.20 in the wrong direction. If both of these additional studies are reported and meta-analyzed, then the overall effect would still be non-significant. But if instead the researcher only partially emptied the file drawer, including only the right-direction p=.20 in the internal meta-analysis (perhaps because the effect in the wrong direction was identified as testing a different effect, or the product of an incorrect design, etc.), then the overall p-value would drop from p=.049 to p=.021 (R Code).
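The calculation above is in the linked R Code; an equivalent sketch using Stouffer’s method (combining same-size studies this way is numerically equivalent to the fixed-effect meta-analysis of their effect sizes, and gives about .02, essentially the post’s .021) looks like this:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def stouffer_p(studies):
    """Combine studies of equal size via Stouffer's method.
    studies: list of (two_sided_p, direction), direction = +1 or -1."""
    zs = [d * norm.inv_cdf(1 - p / 2) for p, d in studies]
    z = sum(zs) / sqrt(len(zs))
    return 2 * (1 - norm.cdf(abs(z)))

p_alone = stouffer_p([(.049, +1)])                       # the original study only
p_partial = stouffer_p([(.049, +1), (.20, +1)])          # add only the "right" failure
p_full = stouffer_p([(.049, +1), (.20, +1), (.20, -1)])  # add both failures
print(round(p_alone, 3))    # 0.049
print(round(p_full, 2))     # not significant once both failures are included
print(round(p_partial, 3))  # about .02: partial emptying helps the researcher
```

Note the asymmetry: reporting both file-drawered studies leaves the result non-significant, while reporting only the conveniently-directioned one strengthens it.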

Partially emptying file drawers turns weak false-positive evidence into strong false-positive evidence.

Imagine a researcher willing to file-drawer half of all attempted studies (again, because she can justify why half of them were ill-designed and thus should not be included). If she needed 5 (out of 10) individually significant studies to successfully publish her result, she would have a 1/451,398 chance of success. That’s low enough not to worry about the problem; through file drawering alone we will not get many false-positive conclusions. But if instead she just needed the internal meta-analysis of the five studies to be significant, then the probability of (false-positive) success is not merely five times higher. It is 146,795 times higher. Internal meta-analysis turns a 1 in 451,398 chance of a false-positive paper into a 146,795 in 451,398 (33%) chance of a false-positive paper (R Code).
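The 1-in-451,398 figure can be checked with a simple binomial calculation. This is a sketch under the assumption of a directional alpha of .025 per study (the exact figure, and the meta-analytic counterpart, come from the linked R Code):

```python
from math import comb

def p_enough_significant(k, n, alpha=0.025):
    """Chance that at least k of n independent studies of a null effect
    come out individually significant (directional test, alpha = .025)."""
    return sum(comb(n, j) * alpha**j * (1 - alpha)**(n - j)
               for j in range(k, n + 1))

p = p_enough_significant(5, 10)
print(f"1 in {round(1 / p):,}")  # on the order of 1 in 450,000
```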

False-positive internal meta-analyses are forever
And now for the bad part.

Although correcting a single false-positive finding is often very difficult, correcting a false-positive internal meta-analysis is disproportionately harder. Indeed, we don’t know how it can realistically be done.

Replicating one study won’t help much
The best way to try to correct a false-positive finding is to conduct a highly powered, well-designed exact replication, and to obtain a conclusive failure to replicate. An interesting challenge with internal meta-analysis is that many (or possibly all) of the original individual studies will be non-significant. How does one “replicate” or “fail to replicate” a study that never “worked” in the first place?

Leaving that aside, how should the result from the replication of one study be analyzed? The most “meta-analytical” thing to do is to add the replication to the existing set of studies. But then even the most convincing failure-to-replicate will be unlikely to alter the meta-analytic conclusion.

Building on the simulations of the 10-study meta-analyses in Figure 2, we found that adding a replication of the same sample size leaves the meta-analytic result significant 89% of the time. Most of the time a replication won’t change anything (R Code). [3]

Even replicating ALL studies is not good enough
Imagine you fail to replicate all 10 studies from Figure 2, and you re-run the meta-analyses now with 20 studies (i.e., the 10 originals and the 10 failures). If all of your replication attempts had the same sample size as those in the original studies, then 47% of false-positive internal meta-analyses would remain significant. So even if you went to the trouble of replicating every experiment, the meta-analysis would remain statistically significant almost half the time [4].

Keep in mind that all of this assumes something extremely unrealistic – that replicators could afford (or would bother) to replicate every single study in a meta-analysis and that others would judge all of those replication attempts to be of sufficient quality to count.

That is not going to happen. Once an internal meta-analysis is published, it will almost certainly never be refuted.

Internal meta-analysis makes false-positives easier to produce and harder to correct. Don’t do internal meta-analysis, and don’t trust any finding that is supported only by a statistically significant internal meta-analysis. Individual studies should be judged on their own merits.




  1. This second point is made by Ueno et al (2016) .pdf []
  2. Notice that whereas p-hacking tends to cause p-values of individual studies to be barely significant, it tends to cause p-values of meta-analyses to be VERY significant. Thus, even a very significant meta-analytic result may be a false-positive []
  3. If it has 2.5 times the original sample size, it is still significant 81% of the time. If instead of adding the replication, one dropped the original upon failing to replicate it, essentially replacing the original study with the failure to replicate, we still find that more than 70% of false-positive internal meta-analyses survive. []
  4. If you used 2.5 times the original sample size in all of your replication attempts, still 30% of meta-analyses would survive. []

[72] Metacritic Has A (File-Drawer) Problem

Metacritic.com scores and aggregates critics’ reviews of movies, music, and video games. The website provides a summary assessment of the critics’ evaluations, using a scale ranging from 0 to 100. Higher numbers mean that critics were more favorable.

In theory, this website is pretty awesome, seemingly leveraging the wisdom-of-crowds to give consumers the most reliable recommendations. After all, it’s surely better to know what a horde of reviewers thinks than to know what a single reviewer thinks.

But at least when it comes to music reviews, Metacritic is broken. I’ll explain how and why it is broken, I’ll propose a way to fix it, and I’ll show that the fix works.

Metacritic Is Broken

A few weeks ago, a fairly unknown “Scottish chamber pop band” named Modern Studies released an album that is not very good (Spotify .html). At about the same time, the nearly perfect band Beach House released a nearly perfect album (Spotify .html).

You might think these things are subjective, but in many cases they are really not. The Great Gatsby is objectively better than this blog post, and Beach House’s album is objectively better than Modern Studies’s album.

But what does Metacritic say?

So, yeah, Metacritic is broken.

If this were a one-off example, I wouldn’t be writing this post. It is not a one-off example. For example, Metacritic would lead you to believe that Fever Ray’s 2017 release (Metascore of 87) is almost as good as St. Vincent’s 2017 release (Metascore of 88), but St. Vincent’s album is, I don’t know, a trillion times better [1]. More recently, Metacritic rated an unspeakably bad album by Goat Girl as something that is worth your time (Metascore of 80). It is not worth your time.

So what’s going on?

What’s going on is publication bias. Music reviewers don’t publish a lot of negative reviews, especially of artists that are unknown. As evidence of this, consider that although Metascores theoretically range from 0-100, in practice only 16% of albums released in 2018 have received a Metascore below 70 [2]. This might be because reviewers don’t want to be mean to struggling artists. Or it might be because reviewers don’t like to spend their time reviewing bad albums. Or it might be for some other reason.

But whatever the reason, you have to correct for the fact that an album that gets just a few reviews is probably not a very good album.

How can we fix it?

What I’m going to propose is kind of stupid. I didn’t put in the effort to try to figure out the optimal way to correct for this kind of publication bias. Honestly, I don’t know that I could figure that out. So instead, I thought about it for about 19 seconds, and I came to the following three conclusions:

(1) We can approximate the number of missing reviews by subtracting the number of observed reviews from the maximum number of reviews another album received in the same year [3].

(2) We can assume that the missing reviewers would’ve given fairly poor reviews. Since it’s a nice round number, let’s say those missing reviews would average out to 70.

(3) Albums with metascores below 70 probably don’t need to be corrected at all, since reviewers already felt licensed to write negative reviews in these cases.

For 2018, the most reviews I observed for an album was 30. As you can see in the above figures, Beach House’s album received 27 reviews. Thus, my simple correction adds three reviews of 70, adjusting it (slightly) down from 81 to 79.9. Meanwhile, Modern Studies’s album received only 6 reviews, and so we would add 24 reviews of 70, resulting in a much bigger adjustment, from 86 to 73.2.

So now we have Beach House at 79.9 and Modern Studies at 73.2. That’s much better.
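The three rules above boil down to one line of arithmetic. Here is a Python sketch (the function name and defaults are mine; 30 is 2018’s maximum review count and 70 is the filler score from rule 2):

```python
def adjusted_metascore(score, n_reviews, max_reviews=30, filler=70):
    """Pad an album's review count up to the year's maximum with imputed
    reviews of `filler`, then re-average. Albums already below the filler
    score are left alone (rule 3 above)."""
    if score < filler:
        return score
    missing = max_reviews - n_reviews
    return (score * n_reviews + filler * missing) / max_reviews

print(adjusted_metascore(81, 27))  # Beach House: 79.9
print(adjusted_metascore(86, 6))   # Modern Studies: 73.2
```

The bluntness is the point: the fewer reviews an album received, the harder the imputed 70s pull its score down.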

But the true test of whether this algorithm works would be to see whether Metascores become more predictive of consumers’ music evaluations after applying the correction than before applying the correction. But to do that, you’d need to have consumers evaluate a bunch of different albums, while ensuring that there is no selection bias in their ratings. How in the world do you do that?

Is it fixed?

Well, it just so happens that for the past 5.5 years, Leif Nelson, Yoel Inbar, and I have been systematically evaluating newly released albums. We call it Album Club, and it works like this. Almost every week, one of us assigns an album for the three of us to listen to. After we have given it enough listens, we email each other with a short review. In each review we have to (1) rate the album on a scale ranging from 0-10, and (2) identify our favorite song on the record [4].

The albums that we assign are pretty diverse. For example, we’ve listened to pop stars like Taylor Swift, popular bands like Radiohead and Vampire Weekend, underrated singer/songwriters like Eleanor Friedberger, a 21-year-old country singer who sounds like a 65-year-old country singer, a (very good) “experimental rap trio”, and even a (deservedly) unpopular “improvisational psych trio from Brooklyn” (Spotify .html) [5]. Moreover, at least one of us seems not to try to choose albums that we are likely to enjoy. So, for our purposes, this is a pretty great dataset (data .xlsx; code .R).

So let’s start by taking a look at 2018. So far this year, we have rated 22 albums, and 19 of those have received Metascores [6]. In the graphs below, I am showing the relationship between our average rating of each album and (1) actual Metascores (left panel) and (2) adjusted Metascores (right panel).

The first thing to notice is that, unlike Metacritic, Leif, Yoel, and I tend to use the whole freaking scale. The second thing to notice is the point of this post: Metascores were more predictive of our evaluations when we adjusted them for publication bias (right panel) than when we did not (left panel).

Now, I got the idea for this post because of what I noticed about a few albums in 2018. Thus, the analyses of the 2018 data must be considered a purely exploratory, potentially p-hacked endeavor. It is important to do confirmatory tests using the other years in my dataset (2013-2017). So that’s what I did. First, let’s look at a picture of 2017:

That looks like a successful replication. Though Metacritic did correctly identify the excellence of the St. Vincent album, it was otherwise kind of a disaster in 2017. But once you correct for publication bias, it does a lot better.

You can see in these plots that the corrected Metacritic scores are closer together on the “Adjusted” chart, indicating that the technique I am employing to correct for publication bias reduces the variance in Metascores. Reducing variance usually lowers correlations. But in this case, it increased the correlation. I take that as additional evidence that publication bias is indeed a big problem for Metacritic.

So what about the other years? Let’s look at a bar chart this time:

The effect is not always big, but it is always in the right direction. Metascores are more predictive when you use my dumb, blunt method of correcting for publication bias in music reviews than when you don’t.

In sum, you should listen to the new Beach House album. But not to the new Modern Studies album.




  1. I am convinced that Fever Ray’s song “IDK About You” was the worst song released in 2017. I tried to listen to it again after typing that sentence, but was unable to. Still, Fever Ray is definitely not all bad. Their best song “I Had A Heart” (released in 2009) soundtracks a dark scene in Season 4 of “Breaking Bad” (.html). []
  2. 59 out of 307 []
  3. The same year is important, because the number of reviews has declined over time. I don’t know why. []
  4. In case you are interested, I’ve made a playlist of 25 of my favorite songs that I discovered because of Album Club: Spotify .html []
  5. Not to oversell the amount of diversity, it is understood that assigning a death metal album will get you kicked out of the Club. You can, however, assign songs that are *about* death metal: Spotify.html []
  6. Some albums that we assign are not reviewed by enough critics to qualify for a Metascore. This is true of 59 of the 261 albums in our dataset. This is primarily because Leif likes to assign albums by obscure high school bands from suburban Wisconsin, some of which are surprisingly good: Spotify .html. []

[71] The (Surprising?) Shape of the File Drawer

Let’s start with a question so familiar that you will have answered it before the sentence is even completed:

How many studies will a researcher need to run before finding a significant (p<.05) result? (If she is studying a non-existent effect and if she is not p-hacking.)

Depending on your sophistication, wariness about being asked a potentially trick question, or assumption that I am going to be writing about p-hacking, there might be some nuance to your answer, but my guess is that for most people the answer comes to mind with the clarity and precision of certainty. The researcher needs to run 20 studies. Right?

That was my answer. Or at least it used to be [1]. I was relying on a simple intuitive representation of the situation, something embarrassingly close to, “.05 is 1/20. Therefore, you need 20. Easy.” My intuition can be dumb.

For this next part to work well, I am going to recommend that you answer each of the following questions before moving on.

Imagine a bunch of researchers each studying a truly non-existent effect. Each person keeps running studies until one study succeeds (p<.05), file-drawering all the failures along the way. Now:

What is the average number of studies each researcher runs?

What is the median number of studies each researcher runs?

What is the modal number of studies each researcher runs?

Before I get to the correct answers to those questions, it is worth telling you a little about how other people answer them. It is difficult to get a full sense of expert perceptions, but it is relatively easy to get a sense of novice perceptions. With assistance from my outstanding lab manager, Chengyao Sun, I asked some people (N = 1536) to answer the same questions that I posed above. Actually, so as to make the questions slightly less unfamiliar, I asked them to consider a closely related (and mathematically identical) scenario. Respondents considered a group of people, each rolling a 20-sided die and rolling it until they rolled a 20; respondents estimated the mode, median, and mean [2]. How did those people answer?

I encourage you to go through that at your leisure, but I will draw attention to a few observations. People frequently give the same answer for all three questions [3], though there are slight overall differences: The median estimate for the mode was 11, the median for the average was 12, and the median for the median was 16. The most common response for all three was 10 and the second most common was 20. Five, 50, and 100 were also common answers. So there is some variability in perception, and certainly not everyone answers 20 for any or all questions, but that answer is common, and 10, a close conceptual cousin, was slightly more common. My guess is that you look at the chart and see that your answers were given by at least a few dozen others in my sample.

OK, so now that you have used your intuition to offer your best guess and you have seen some other people’s guesses, you might be curious about the correct answer for each. I asked myself (and answered) each question only to realize how crummy my intuition really was. The thing is, my original intuition (“20 studies!”) came polluted with an intuition about the distribution of outcomes as well. I think that I pictured a normal curve with 20 in the middle (see Figure 2).

Maybe you have a bit of that too? So that intuition tells us that 20 is the mode, 20 is the median, and 20 is the mean. Only one of those is right, and in most ways, it is the worst at summarizing the distribution.

The distribution is not normal, it is “geometric” [4]. I may have encountered that term in college, but I tried to learn about it for this post. The geometric distribution captures the critical sequential nature of this problem. Some researchers get lucky on the first try (5%). Of those who fail (95%), some succeed on the second try (5% of 95% = 4.75%). Of those who fail twice (90.25%), some succeed on the third try (5% of 90.25% = 4.5%). And so on.

Remember that hypothetical group of researchers running and file-drawering studies? Here is the expected distribution of the number of required studies.

That is really different from the napkin drawing. It takes 20 studies, on average, to get p<.05… but the average is a pretty mediocre way to characterize the central tendency of this distribution [5].

Let’s return to that initial question: Assuming that a researcher is studying a truly false finding, how many studies will that person need to run in order to find a significant (p<.05) result? Well, one could certainly say, “20 studies,” but they could choose to clarify, “… but most of the time they will need fewer. The most common outcome is that the researcher will succeed on the very first try. I dare you to try telling those people that they benefitted from file-drawering.”

It is interesting that we think in terms of the average here, since we do not in a similar domain. Consider this question: “A researcher is running a study. How many participants do they need to run to get a significant effect?” To answer that, someone would need to know how much statistical power the researcher was aiming for [6]. For whatever reason, when we talk about the file-drawering researcher we don’t ask, “How many studies would that person need to be ready to run to have an 80% chance of getting a significant result?” That answer, by the way, is 32. If the researcher listened to my initial answer, and only planned to run 20 studies, they would only have 64% power.
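Those two power numbers follow from a one-line formula: the chance that at least one of k studies of a null effect reaches p<.05 is 1 - (.95)^k. A quick check (a sketch in Python):

```python
import math

def chance_of_success(k, p=0.05):
    # Probability that at least one of k studies of a true null hits p < .05
    return 1 - (1 - p) ** k

# Planning to run (at most) 20 studies yields only ~64% "power":
print(round(chance_of_success(20), 2))  # 0.64

# Smallest number of planned studies that gives at least an 80% chance:
k80 = math.ceil(math.log(1 - 0.80) / math.log(1 - 0.05))
print(k80)  # 32
```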

For whatever reason, people [7] do not intuit the geometric distribution. In my sample, estimates for the median were higher than for the mean, a strong signal that people are not picturing the sharply skewed true distribution. The correct answer for the average (20), on the other hand, was quite frequently identified (I even identified it), but probably not because overall intuition was any good. The mode (1) is, in some ways, the ONLY question that is easy to answer if you are accurately bringing to mind the distribution in the last figure, but that answer was only identified by 3% of respondents, and was the 10th most common answer, losing out to peculiar answers like 12 or 6.
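Those answers (mode, median, and mean) are easy to verify directly; here is a quick sketch in Python, with p = .05 throughout:

```python
import math

p = 0.05  # chance that a single study of a null effect comes out significant

# P(first success on exactly the k-th attempt) = p * (1 - p)^(k - 1)
def pmf(k):
    return p * (1 - p) ** (k - 1)

mode = 1                                             # pmf(k) is largest at k = 1
median = math.ceil(math.log(0.5) / math.log(1 - p))  # smallest k with P(success by k) >= .5
mean = 1 / p

print(mode, median, mean)  # 1 14 20.0
```

So the single most common outcome is succeeding on the very first study, half of all researchers succeed within 14 studies, and the mean is dragged up to 20 by the long right tail.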

I think there are a few things that one could take from this. I was surprised to see how little I understood about the number of studies (or ineffective die rolls) file drawered away before a significant finding occurred by chance. I was partially comforted to learn that my lack of understanding was mirrored in the judgments of others. But I am also intrigued by the combination. A researcher who intuits the figure I drew on the napkin will feel like a study that succeeds in the first few tries is too surprising an outcome to be due to chance. If the true distribution came to mind, on the other hand, a quickly significant study would feel entirely consistent with chance, and that researcher would likely feel like a replication was in order. After all, how many studies will a researcher need to run before finding two consecutive significant (p<.05) results?



  1. Honestly, I would still be tempted to give that answer. But instead I would force the person asking the question to listen to me go on for another 10 minutes about the content of this post. All that person wanted was the answer to the damn question and now they are stuck listening to a short lecture on the geometric distribution and measures of central tendency. Then again, you didn’t even ask the question and you are being subjected to the same content. Sorry? []
  2. Actually, half the people imagined people trying to roll a 1. There are some differences between the responses of those groups, but they are small and beyond my comprehension. So I am just combining them here. []
  3. I also asked a question about 80% power, but I am ignoring that for now. []
  4. When I first posted this blog I referred to it as the negative binomial, but that distribution is about how many successes you expect rather than how long until you get the first success. Moreover, the geometric is about how many failures occur before the 1st success; what we really want is how many attempts, which is the geometric plus 1. My quite sincere thanks to Noah Silbert for pointing out the error in the original posting. []
  5. Actually, it is not even the average in that figure. I truncated the distribution at 100 studies, and the average in that range is only 19. One in 200 researchers would still have failed to find a significant effect even after running 100 studies. That person would be disappointed they didn’t just try a registered report. []
  6. They would actually need to know a heck of a lot more than that, but I’m keeping it simple here. []
  7. well, at least me and the 1,536 people I asked. []

[70] How Many Studies Have Not Been Run? Why We Still Think the Average Effect Does Not Exist

We have argued that, for most effects, it is impossible to identify the average effect (datacolada.org/33). The argument is subtle (but not statistical), and given the number of well-informed people who seem to disagree, perhaps we are simply wrong. This is my effort to explain why we think identifying the average effect is so hard. I am going to take a while to explain my perspective, but the boxed-text below highlights where I am eventually going.

When averaging is easy: Height at Berkeley.
First, let’s start with a domain where averaging is familiar, useful, and plausible. If I want to know the average height of a UC Berkeley student I merely need a random sample, and I can compute the average and have a good estimate. Good stuff.

My sense is that when people think that we should calculate the average effect size they are picturing something kind of like calculating average height: First sample (by collecting the studies that were run), then calculate (by performing a meta-analysis). When it comes to averaging effect sizes, I don’t think we can do anything particularly close to computing the “average” effect.

The effect of happiness on helpfulness is not like height
Let’s consider an actual effect size from psychology: the influence of positive emotion on helping behavior. The original paper studying this effect (or the first that I think of) manipulates whether or not a person unexpectedly finds a dime in a phone booth and then measures whether the person stops to help pick up some spilled papers (.pdf). When people have the $.10 windfall they help 88% of the time, whereas the others help only 4% of the time[1]. So that is the starting point, but it is only one study. The same paper, for example, contains another study manipulating whether people received a cookie and measures minutes volunteered to be a confederate for either a helping experiment, in one condition, or a distraction experiment, in another (a 2 x 2 design). Cookies increased minutes volunteered for helping (69 minutes vs. 16.7 minutes) and decreased minutes volunteered for the distraction experiment (20 minutes vs. 78.6 minutes) [2]. OK, so the meta-analyst can now average those effect sizes in some manner and conclude that they have identified an unbiased estimate of the average effect of positive emotion on helping behavior.

What about the effect of nickels on helpfulness?
However, that is surely not right, because those are not the only two studies investigating the effect of happiness on helpfulness. Perhaps, for example, there was an unreported study using nickels, rather than dimes, that did not get to p<.05. Researchers are more likely to tell you about a result, and journal editors are more likely to publish a result, if it is statistically significant. This publication bias is the main problem discussed by developers of meta-analytic tools, and there have been lots of efforts to find a way to correct for it, including p-curve. But what exactly are those corrections aiming for? What is the right set of studies to attempt to reconstruct?

The studies we see versus the studies we might see
Because we developed p-curve, we know which answer it is aiming for: The true average effect of the studies it includes [3].  So it gives an unbiased estimate of the dimes and cookies, but is indifferent to nickels. We are pretty comfortable owning that limitation – p-curve can only tell you about the true effect of the studies it includes. One could reasonably say at this point, “but wait, I am looking for the average effect of happiness on helping, so I want my average to include nickels as well.” This gets to the next point: What are the other studies that should be included?

Let’s assume that there really is a non-significant (p>.05) nickels study that was conducted. Would we find out about it? Sometimes. Perhaps the p-value is really close to .05, so the authors are comfortable reporting it in the paper? [4] Perhaps it creeps into a book chapter some time later and the p-values are not so closely scrutinized? Perhaps the experimenter is a heavy open-science advocate and writes a Python script that automatically posts all JASP output on PsyArXiv regardless of what it is? The problem is not whether we will see any non-significant findings; the problem is whether we would see all of them. No one believes that we would catch all of them, and presumably everyone believes that we would see a biased sample – namely, we would be more likely to see those studies which best serve the argument of the people presenting them. But we know very little about the specifics of that biasing. How likely are we to see a p = .06? Does it matter if that study is about nickels, helping behavior, or social psychology, or are non-significant findings more or less likely to be reported in different research areas? Those aren’t whimsical questions, because an unknown filter is impossible to correct for.

Remember the averaging problem at the beginning of this post – the average height of students at UC Berkeley – and think of how essential the sampling was for that exercise. If someone said that they averaged all the student heights in their Advanced Dutch Literature class, we would be concerned that the sample was not random, and since it likely has more Dutch people (who are peculiarly tall), we would worry about bias. But how biased? We have no idea. The same goes for the likelihood of seeing a non-significant nickels study. We know that we are less likely to see it, but we don’t know how much less likely [5]. It is really hard to integrate these into a true average.

But ok, what if we did see every single conducted study?
What if we did know the exact size of that bias? First: wow. Second, that wouldn’t be the only bias that affects the average, and it wouldn’t be the largest. The biggest bias is almost certainly in what studies researchers choose to conduct. Think back to the researchers choosing to use a dime in a phone booth. What if they had decided instead to measure helping behavior differently? Rather than seeing if people picked up papers, they instead observed whether people chose to spend the weekend cleaning the experimenter’s septic tank. That would still be helpful, so the true effect of such a study would indisputably be part of the true average effect of happiness on helping. But the researchers didn’t use that measure, perhaps because they were concerned that the effect would not be large enough to detect. Also, the researchers did not choose to manipulate happiness by leaving a briefcase of $100,000 in the phone booth. Not only would that be impractical, but that study is less likely to be conducted because it is not as compelling: the expected effect seems too obvious. It is not particularly exciting to say that people are more helpful when they are happy, but it is particularly exciting to show that a dime generates enough happiness to change helpfulness [6]. So the experiments people conduct are a tiny subset of the experiments that could be conducted, they are a biased subset (no one randomly generates an experimental design, nor should they), and those biases are entirely opaque. But if you want a true average, you need to know the exact magnitude of those biases.

So what all is included in an average effect size?
So now I return to that initial list of things that need to be included in the average effect size (reposted right here to avoid unnecessary scrolling):

That is a tall order. I don’t mind someone wanting that answer, and I fully acknowledge that p-curve does not deliver it. P-curve only hopes to deliver the average effect in (a).

If you want the “Big Average” effect (a, b, c, d, e, and f) then you need either access to the population of studies or the ability to perfectly estimate the biases that determine the size of each category. That is not me being dismissive or dissuasive, it is just the nature of averaging. We are so pessimistic about calculating that average effect size that we use the shorthand of saying that the average effect size does not exist. [7]

But that is a statement of the problem and an acknowledgment of our limitations. If someone has a way to handle the complications above, they would have at least three very vocal advocates.


  1. ! []
  2. !! []
  3. “True effect” is kind of conceptual, but in this case I think that there is some agreement on the operational definition of “true.” If you conducted the study again, you would expect, on average, the “true” result. So if, because of bias or error, the published cookie effect is unusually smaller or larger than the true underlying effect, you are still most interested in the best prediction of what would happen if you ran the study again. I am open to being convinced that there is a different definition of “true”, but I think this is a pretty uncontroversial one. []
  4. Actually, it is worth noting that the cookie experiment features one critical test with a t-value of 1.96. Given the implied df for that study, the p-value would be >.05, though it is reported as p<.05. The point is, those authors were willing to report a non-significant p-value. []
  5. Scientists, statisticians, psychologists, and probably postal workers, bobsledders, and pet hamsters have frequently bemoaned the absurdity of a hard cut-off of p<.05. Granted. But it does provide a side benefit for this selection-bias issue: If p>.05, we have no idea whether we will see it, but if p<.05, we know that the p-value hasn’t kept us from seeing it. []
  6. Or to quote the wonderful Prentice and Miller (1992), who in describing the cookie finding, say “the power of this demonstration derives in large part from the subtlety of the instigating stimulus… although mood effects might be interesting however heavy-handed the manipulation that produced them, the cookie study was perhaps made more interesting by its reliance on the minimalist approach.” p. 161. []
  7. It is worth noting that there is some variation between the three of us on the impracticality of calculating the average effect size. The most optimistic of us (me, probably) believes that under a very small number of circumstances – none of which are likely to occur in psychological research – the situation might be well-defined enough for the average effect to be understood and calculated. The most pessimistic of us think that even that limited set of circumstances is essentially empty. From that perspective, the average effect truly does not exist. []

[69] Eight things I do to make my open research more findable and understandable

It is now common for researchers to post original materials, data, and/or code behind their published research. That’s obviously great, but open research is often difficult to find and understand.

In this post I discuss 8 things I do, in my papers, code, and datafiles, to combat that.

1) Before all method sections, I include a paragraph overviewing the open research practices behind the paper. Like this:

2) Just before the end of the paper, I put the supplement’s table of contents, and the text reads something like, “An online supplement is available; Table 1 summarizes its contents.”

3) In tables and figure captions, I include links to code that reproduces them

4) I start my code by indicating authorship, last update, and contact info.

5) I then provide an outline of its structure. Like this:

Then, throughout the code, I use those same numbers so people can navigate it easily [1].
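In case a concrete example helps, here is a hypothetical version of such a header and outline (the names, dates, and sections below are invented for illustration):

```
# ------------------------------------------------------------------
#  Reproduces analyses in "Paper Title" (Author & Author, 2018)
#  Written by: A. Author (a.author@university.edu)
#  Last updated: 2018-06-01
#
#  Outline
#  1  Load and clean raw data
#  2  Descriptive statistics (Table 1)
#  3  Main analyses (Figures 2-3)
#  4  Robustness checks (Supplement)
# ------------------------------------------------------------------

# 1  Load and clean raw data
...

# 2  Descriptive statistics (Table 1)
...
```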

6) Rule-of-thumb: At least one comment per every 3 lines of code.

Even if something is easy to figure out, a comment will make reading code more efficient and less aversive. But most things are not so easy to figure out. Moreover, nobody understands your code as well as you do when you are writing it, including yourself 72 hours later.

When writing comments in code, it is useful to keep in mind who may actually read it; see the footnote for a longer discussion [2].

7) Codebook (very important). Best to have a simple stand-alone text file that looks like this, variable name followed by description that includes info on possible values and relevant collection details.
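For instance (the variables and coding details here are hypothetical):

```
condition   Assigned condition. 1 = found dime, 0 = control.
helped      Helped pick up the papers? 1 = yes, 0 = no. Coded by RA from video.
age         Self-reported age in years; blank = declined to answer.
started     Timestamp when the session began (EST), e.g., 2018-03-12 14:05:22.
```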

8) I post the rawest form of data that I am able/allowed to. All data cleaning is then done in code that is posted as well. When cleaning is extensive, I post both the raw and the cleaned datafiles.

Note: writing this post helped me realize I don’t always do all 8 in every paper. I will try to do so going forward.

In sum.
1. In paper: open-research statement
2. In paper: supplement’s table of contents
3. In figure captions: links to reproducible code
4. In code: contact info and description
5. In code: outline of program below
6. In code: At least one comment per every three lines
7. Data: post codebook (text file, variable name, description)
8. Data: post (also) rawest version of data possible



  1. I think this comes from learning BASIC as a kid (my first programming language), where all code went in numbered lines like
    10 PRINT “Hola Uri”
    20 GOTO 10. []
  2. Let’s think about who will be reading your code.
    One type of reader is someone learning how to use the programming language or statistical technique you used, help that person out and spell things out for them. Wouldn’t you have liked that when you were learning? So if you use a non-vanilla procedure, throw your reader a bone and explain in 10 words stuff they could learn if they read the 3 page help file they shouldn’t really be expected to read just to follow what you did. Throw in references and links to further reading when pertinent but make your code as self-contained as possible.

    Another type of reader is at least as sophisticated as you are, but does things differently from you, so cannot quite understand what you are doing (e.g., you use parallel loops, they vectorize). If they don’t quite understand what you did, they will be less likely to learn from your code, or to help you identify errors in it. What’s the point of posting it then? This is especially true in R, where there are 20 ways to do everything, and some really trivial stuff is a pain to do.

    Another type of reader lives in the future, say 5 years from today, when the approach, library, structure or even  programming language you use is not used any more. Help that person map what you did into the language/function/program of the future. Also, that person will one day be you.

    The cost of excessive commenting is a few minutes of your time typing text people may not read just to be thorough and prevent errors. That’s what we do most of our time anyway. []

[68] Pilot-Dropping Backfires (So Daryl Bem Probably Did Not Do It)

Uli Schimmack recently identified an interesting pattern in the data from Daryl Bem’s infamous “Feeling the Future” JPSP paper, in which he reported evidence for the existence of extrasensory perception (ESP; .pdf)[1]. In each study, the effect size is larger among participants who completed the study earlier (blogpost: .htm). Uli referred to this as the “decline effect.” Here is his key chart:

The y-axis represents the cumulative effect size, and the x-axis the order in which subjects participated.

The nine dashed blue lines represent each of Bem’s nine studies. The solid blue line represents the average effect across the nine studies. For the purposes of this post you can ignore the gray areas of the chart [2].

Uli’s analysis is ingenious, stimulating, and insightful, and the pattern he discovered is puzzling and interesting. We’ve enjoyed thinking about it. And in doing so, we have come to believe that Uli’s explanation for this pattern is ultimately incorrect, for reasons that are quite counter-intuitive (at least to us) [3].

Pilot dropping
Uli speculated that Bem did something that we will refer to as pilot dropping. In Uli’s words: “we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continues to collect more data to see whether the effect is real (…) the strong effect during the initial trials (…) is sufficient to maintain statistical significance  (…) as more participants are added” (.htm).

In our “False-Positive Psychology” paper (.pdf) we briefly mentioned pilot-dropping as a form of p-hacking (p. 1361), and so we were intrigued by the possibility that it explains Bem’s impossible results.

Pilot dropping can make false-positives harder to get
It is easiest to quantify the impact of pilot dropping on false-positives by computing how many participants you need to run before a successful (false-positive) result is expected.

Let’s say you want to publish a study with two between-subjects conditions and n=100 per condition (N=200 total). If you don’t p-hack at all, then on average you need to run 20 studies to obtain one false-positive finding [4]. With N=200 per study, that means you need an average of 4,000 participants to obtain one finding.

The effects of pilot-dropping are less straightforward to compute, and so we simulated it [5].

We considered a researcher who collects a “pilot” of, say, n = 25 per condition (we show later that the size of the pilot doesn’t matter much). If she gets a high p-value, the pilot is dropped. If she gets a low p-value, she keeps the pilot and adds the remaining subjects to get to n = 100 per condition (so she runs another n = 75 in this case).

How many subjects she ends up running depends on what threshold she selects for dropping the pilot. Two things are counter-intuitive.

First, the lower the threshold to continue with the study (e.g., p<.05 instead of p<.10), the more subjects she ends up running in total.

Second, she can easily end up running way more subjects than if she didn’t pilot-drop or p-hack at all.

This chart has the results (R Code):

Note that if pilots are dropped when they obtain p>.05, it takes about 50% more participants on average to get a single study to work (because you drop too many pilots, and still many full studies don’t work).

Moreover, Uli conjectured that Bem added observations only when obtaining a “strong effect”. If we operationalize strong effect as p<.01, we now need about N=18,000 for one study to work, instead of “only” 4,000.

With higher thresholds, pilot-dropping does help, but only a little (the blue line is never too far below 4,000). For example, dropping pilots using a threshold of p>.30 is near the ‘optimum,’ and the expected number of subjects is about 3400.

As mentioned, these results do not hinge on the size of the pilot, i.e., on the assumed n=25 (see charts .pdf).
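The gist of the simulation can be sketched in a few lines. This is an illustrative reimplementation, not the posted R code; to keep it fast it assumes z-tests on normal data with known variance, which closely approximates the t-test at these sample sizes:

```python
import math
import random
from statistics import NormalDist

Z = NormalDist()
Z_CRIT = Z.inv_cdf(0.975)  # |z| > 1.96  <=>  two-sided p < .05

def subjects_per_significant_study(threshold, n_pilot=25, n_total=100, attempts=50_000):
    """Average number of participants run per (false-positive) significant study,
    when pilots with p > threshold are dropped and survivors are topped up to
    n_total per condition. The true effect is zero throughout."""
    z_keep = Z.inv_cdf(1 - threshold / 2)  # pilot survives if |z_pilot| > z_keep
    subjects = successes = 0
    for _ in range(attempts):
        z_pilot = random.gauss(0, 1)       # standardized pilot result under the null
        if abs(z_pilot) <= z_keep:
            subjects += 2 * n_pilot        # pilot dropped: only 50 subjects spent
            continue
        subjects += 2 * n_total            # pilot kept: study run to completion
        z_added = random.gauss(0, 1)       # independent result of the added subjects
        # The final test pools pilot and added subjects, weighted by sample size:
        z_full = (math.sqrt(n_pilot / n_total) * z_pilot
                  + math.sqrt(1 - n_pilot / n_total) * z_added)
        if abs(z_full) > Z_CRIT:
            successes += 1
    return subjects / successes

random.seed(1)
print(round(subjects_per_significant_study(0.05)))  # roughly 6,000
print(round(subjects_per_significant_study(0.30)))  # close to 3,400 (near the optimum)
# Baseline with no pilot-dropping: 20 studies x 200 subjects = 4,000 on average.
```

The pooling step is the crux: a pilot that barely cleared the threshold contributes only sqrt(25/100) = half of its (inflated) z to the final test, so most “promising” pilots die anyway once the remaining subjects are added.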

What’s the intuition?
Pilot dropping has two effects.
(1) It saves subjects by cutting losses after a bad early draw.
(2) It costs subjects by interrupting a study that would have worked had it gone all the way.

For lower cutoffs, (2) is larger than (1).

What does explain the decline effect in this dataset?
We were primarily interested in the consequences of pilot dropping, but the discovery that pilot dropping is not very consequential does not bring us closer to understanding the patterns that Uli found in Bem’s data. One possibility is pilot-hacking, superficially similar to, but critically different from, pilot-dropping.

It would work like this: you run a pilot and you intensely p-hack it, possibly well past p=.05. Then you keep collecting more data and analyze them the same (or a very similar) way. That probably feels honest (regardless, it’s wrong). Unlike pilot dropping, pilot hacking would dramatically decrease the # of subjects needed for a false-positive finding, because way fewer pilots would be dropped thanks to p-hacking, and because you would start with a much stronger effect so more studies would end up surviving the added observations (e.g., instead of needing 20 attempts to get a pilot to get p<.05, with p-hacking one often needs only 1). Of course, just because pilot-hacking would produce a pattern like that identified by Uli, one should not conclude that’s what happened.

Alternative explanations for decline effects within study
1) Researchers may make a mistake when sorting the data (e.g., sorting by the dependent variable and not including the timestamp in their sort, thus creating a spurious association between time and effect) [6].

2) People who participate earlier in a study could plausibly show a larger effect than those that participate later; for example, if responsible students participate earlier and pay more attention to instructions (this is not a particularly plausible explanation for Bem, as precognition is almost certainly zero for everyone)  [7]

3) Researchers may put together a series of small experiments that were originally run separately and present them as “one study,” and (perhaps inadvertently) put within the compiled dataset studies that obtained larger effects first.

Pilot dropping is not a plausible explanation for Bem’s results in general nor for the pattern of decreasing effect size in particular. Moreover, because it backfires, it is not a particularly worrisome form of p-hacking.


Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded. We shared a draft with Daryl Bem and Uli Schimmack. Uli replied and suggested that we extend the analyses to smaller sample sizes for the full study. We did. The qualitative conclusion was the same. The posted R Code includes the more flexible simulations that accommodated his suggestion. We are grateful for Uli’s feedback.



  1. In this paper, Bem claimed that participants were affected by treatments that they received in the future. Since causation doesn’t work that way, and since some have failed to replicate Bem’s results, many scholars do not believe Bem’s conclusion. []
  2. The gray lines are simulated data when the true effect is d=.2 []
  3. To give a sense of how much we lacked the intuition, at least one of us was pretty convinced by Uli’s explanation. We conducted the simulations below not to make a predetermined point, but because we really did not know what to expect. []
  4. The median number of studies needed is about 14; there is a long tail []
  5. The key number one needs is the probability that the full study will work, conditional on having decided to run it after seeing the pilot. That’s almost certainly possible to compute with formulas, but why bother? []
  6. This does not require a true effect, as the overall effect behind the spurious association could have been p-hacked []
  7. Ebersole et al., in “Many Labs 3” (.pdf), find no evidence of a decline over the semester; but that’s a slightly different hypothesis. []

[67] P-curve Handles Heterogeneity Just Fine

A few years ago, we developed p-curve (see p-curve.com), a statistical tool that identifies whether or not a set of statistically significant findings contains evidential value, or whether those results are solely attributable to the selective reporting of studies or analyses. It also estimates the true average power of a set of significant findings [1].

A few methods researchers have published papers stating that p-curve is biased when it is used to analyze studies with different effect sizes (i.e., studies with “heterogeneous effects”). Since effect sizes in the real world are not identical across studies, this would mean that p-curve is not very useful.

In this post, we demonstrate that p-curve performs quite well in the presence of effect size heterogeneity, and we explain why the methods researchers have stated otherwise.

Basic setup
Most of this post consists of figures like this one, which report the results of 1,000 simulated p-curve analyses (R Code).

Each analysis contains 20 studies, and each study has its own effect size, its own sample size, and because these are drawn independently, its own statistical power. In other words, the 20 studies contain heterogeneity [2].

For example, to create this first figure, each analysis contained 20 studies. Each study had a sample size drawn at random from the orange histogram, a true effect size drawn at random from the blue histogram, and thus a level of statistical power drawn at random from the third histogram.

The studies’ statistical power ranged from 10% to 70%, and their average power was 41%. P-curve guessed that their average power was 40%. Not bad.
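To give a feel for how such a simulation is put together (a sketch with made-up distributions, not the posted R code): each study’s true power follows from its effect size d and per-cell sample size n, here via the usual normal approximation to the two-sample t-test:

```python
import random
from statistics import NormalDist

Z = NormalDist()
Z_CRIT = Z.inv_cdf(0.975)  # two-sided alpha = .05

def power(d, n):
    """Approximate power of a two-sample test with true effect d and n per cell."""
    ncp = d * (n / 2) ** 0.5                      # expected value of the z statistic
    return (1 - Z.cdf(Z_CRIT - ncp)) + Z.cdf(-Z_CRIT - ncp)

# 20 heterogeneous studies: each draws its own effect size and its own sample size
# (the uniform ranges below are invented; the post draws from the pictured histograms).
random.seed(1)
studies = [(random.uniform(0.2, 0.8), random.randrange(15, 36)) for _ in range(20)]
true_average_power = sum(power(d, n) for d, n in studies) / len(studies)
print(round(true_average_power, 2))  # the benchmark that p-curve should recover
```

P-curve’s estimate is then compared against `true_average_power`; the figures in this post plot that comparison across 1,000 such simulated analyses.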

But what if…?

1) But what if there is more heterogeneity in effect size?
Let’s increase heterogeneity so that the analyzed set of studies contains effect sizes ranging from d = 0 (null) to d = 1 (very large), probably pretty close to the entire range of plausible effect sizes in psychology [3].

The true average power is 42%. P-curve estimates 43%. Again, not bad.

2) But what if samples are larger?
Perhaps p-curve’s success is limited to analyses of studies that are relatively underpowered. So let’s increase sample size (and therefore power) and see what happens. In this simulation, we’ve increased the average sample size from 25 per cell to 50 per cell.

The true power is 69%, and p-curve estimates 68%. This is starting to feel familiar.

3) But what if the null is true for some studies?
In real life, many p-curves will include a few truly null effects that are nevertheless significant (i.e., false-positives).  Let’s now analyze 25 studies, including 5 truly null effects (d=0) that were false-positively significant.

The true power is 56%, and p-curve estimates 57%. This is continuing to feel familiar.
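That 56% figure can be checked with back-of-the-envelope arithmetic, assuming the 20 non-null studies keep roughly the 69% average power of the previous simulation and treating a truly null study's chance of significance as the 5% alpha level:

```python
# 20 real effects at ~69% power plus 5 true nulls that are
# "significant" only at the 5% false-positive rate
n_real, power_real = 20, 0.69
n_null, power_null = 5, 0.05
avg_power = (n_real * power_real + n_null * power_null) / (n_real + n_null)
print(round(avg_power, 2))  # → 0.56, matching the reported true power
```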

4) But what if sample size and effect size are not symmetrically distributed?

Maybe p-curve only works when sample and effect size are (unrealistically) symmetrically distributed. Let’s try changing that. First we skew the sample size, then we skew the effect size:

The true powers are 58% and 60%, and p-curve estimates 59% and 61%. This is persisting in feeling familiar.

5) But what if all studies are highly powered?
Let’s go back to the first simulation and increase the average sample size to 100 per cell. The true power is 93%, and p-curve estimates 94%.

It is clear that heterogeneity does not break or bias p-curve. On the contrary, p-curve does very well in the presence of heterogeneous effect sizes.

So why have others proposed that p-curve is biased in the presence of heterogeneous effects?

Reason 1:  Different definitions of p-curve’s goal.
van Aert, Wicherts, & van Assen (2016, .pdf) write that p-curve “overestimat[es] effect size under moderate-to-large heterogeneity” (abstract). McShane, Bockenholt, & Hansen (2016, .pdf) write that p-curve “falsely assume[s] homogeneity […] produc[ing] upward[ly] biased estimates of the population average effect size.” (p.736).

We believe that the readers of those papers would be very surprised by the results we depict in the figures above. How can we reconcile our results with what these authors are claiming?

The answer is that the authors of those papers assessed how well p-curve estimated something different from what it estimates (and what we have repeatedly stated that it estimates).

They assessed how well p-curve estimated the average effect sizes of all studies that could be conducted on the topic under investigation. But p-curve informs us “only” about the studies included in p-curve [4].

Imagine that an effect is much stronger for American than for Ukrainian participants. For simplicity, let’s say that all the Ukrainian studies are non-significant and thus excluded from p-curve, and that all the American studies are p<.05 and thus included in p-curve.

P-curve would recover the true average effect of the American studies. Those arguing that p-curve is biased are saying that it should recover the average effect of both the Ukrainian and American studies, even though no Ukrainian study was included in the analysis [5].
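The same selection logic can be seen numerically in a quick simulation (a Python sketch, not the authors' R code) of the scenario in footnote 5: true effects drawn from d ~ N(.5, .2), n = 20 per cell, keeping only the significant studies.

```python
import math
import random

random.seed(1)

def simulate_study(d_true, n=20, t_crit=2.024):
    """Simulate one two-cell study (n per cell).
    Returns (observed d, whether two-sided p < .05 with df = 38)."""
    g1 = [random.gauss(d_true, 1) for _ in range(n)]
    g2 = [random.gauss(0, 1) for _ in range(n)]
    m1, m2 = sum(g1) / n, sum(g2) / n
    v1 = sum((x - m1) ** 2 for x in g1) / (n - 1)
    v2 = sum((x - m2) ** 2 for x in g2) / (n - 1)
    d_hat = (m1 - m2) / math.sqrt((v1 + v2) / 2)   # observed effect size
    t = d_hat * math.sqrt(n / 2)
    return d_hat, abs(t) > t_crit

# True effects as in footnote 5: d ~ N(.5, .2), n = 20 per cell
significant = []
for _ in range(10000):
    d_true = random.gauss(0.5, 0.2)
    d_hat, sig = simulate_study(d_true)
    if sig:
        significant.append((d_true, d_hat))

mean_true = sum(d for d, _ in significant) / len(significant)
mean_obs = sum(d for _, d in significant) / len(significant)
# Selection raises both: the true d of the selected studies sits above the
# population mean of .5, and the observed d among them sits higher still
print(round(mean_true, 2), round(mean_obs, 2))
```

P-curve's target is the first number (the true average effect of the significant studies), not the population mean of .5.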

To be clear, these authors are not particularly idiosyncratic in their desire to estimate “the” overall effect.  Many meta-analysts write their papers as if that’s what they wanted to estimate. However…

•  We don’t think that the overall effect exists in psychology (DataColada[33]).
•  We don’t think that the overall effect is of interest to psychologists (DataColada[33]).
•  And we know of no tool that can credibly estimate it.

In any case, as a reader, here is your decision:
If you want to use p-curve analysis to assess the evidential value or the average power of a set of statistically significant studies, then you can do so without having to worry about heterogeneity [6].

If you instead want to assess something about a set of studies that are not analyzed by p-curve, including studies never observed or even conducted, do not run p-curve analysis. And good luck with that.

Reason 2: Outliers vs heterogeneity
Uli Schimmack, in a working paper (.pdf), reports that p-curve overestimates statistical power in the presence of heterogeneity. Just like us, and unlike the previously referenced authors, he is looking only at the studies included in p-curve. Why do we get different results?

It will be useful to look at a concrete simulation he has proposed, one in which p-curve does indeed do poorly (R Code):

Although p-curve overestimates power in this scenario, the culprit is not heterogeneity but rather the presence of outliers, namely several extremely highly powered studies. To see this, let’s look at a similarly heterogeneous set of studies, but one in which the maximum power is 80% instead of 100%.

In a nutshell, the overestimation with outliers occurs because power is a bounded variable, but p-curve estimates it based on an unbounded latent variable (the noncentrality parameter). It’s worth keeping in mind that a single outlier does not greatly bias p-curve. For example, if 20 studies are powered on average to 50%, adding one study powered to 95% increases true average power to 52%, and p-curve’s estimate to just 54%.
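A toy illustration of this boundedness point, under the simplifying (and admittedly crude) assumption that p-curve's estimate behaves like the power implied by the average noncentrality parameter:

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_from_ncp(ncp, z_crit=1.959964):
    """Two-sided power implied by a noncentrality parameter (normal approx.)."""
    return norm_cdf(ncp - z_crit) + norm_cdf(-ncp - z_crit)

# 20 studies at ~50% power plus one outlier at ~95% power
ncps = [1.960] * 20 + [3.605]
true_avg_power = sum(power_from_ncp(x) for x in ncps) / len(ncps)

# Simplification (assumption): treat the estimate as the power implied by
# the average ncp. Power is concave in the ncp over this range, so the
# implied power exceeds the true average power.
implied = power_from_ncp(sum(ncps) / len(ncps))
print(round(true_avg_power, 3), round(implied, 3))  # ~0.521 vs ~0.531
```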

This problem that Uli has identified is worth taking into account, and perhaps p-curve can be modified to prevent such bias [7]. But it is worth keeping in mind that this situation should be rare, as few literatures contain both (1) a substantial number of studies powered over 90% and (2) a substantial number of under-powered studies. Moreover, this is a somewhat inconsequential mistake. All it means is that p-curve will exaggerate how strong a truly (and obviously) strong literature actually is.

In Summary
•  P-curve is not biased by heterogeneity.
•  It is biased upwards in the presence of both (1) low powered studies, and (2) a large share of extremely highly powered studies.
•  P-curve tells us about the study designs it includes, not the study designs it excludes.


Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted all 7 authors of the three methods papers discussed above. Uli Schimmack declined to comment. Karsten Hansen and Blake McShane provided suggestions that led us to more precisely describe their analyses and to describe more detailed analyses in Footnote 5. Though our exchange with Karsten and Blake started and ended with, in their words, “fundamental disagreements about the nature of evaluating the statistical properties of an estimator,” the dialogue was friendly and constructive. We are very grateful to them, both for the feedback and for the tone of the discussion. (Interestingly, we disagree with them about the nature of our disagreement: we don’t think we disagree about how to examine the statistical properties of an estimator, but rather, about how to effectively communicate methods issues to a general audience).  Marcel van Assen, Robbie van Aert, and Jelte Wicherts disagreed with our belief that readers of their paper would be surprised by how well p-curve recovers average power in the presence of heterogeneity (as they think their paper explains this as well). Specifically, like us, they think p-curve performs well when making inferences about studies included in p-curve, but, unlike us, they think that readers of their paper would realize this. They are not persuaded by our arguments that the population effect size does not exist and is not of interest to psychologists, and they are troubled by the fact that p-curve does not recover this effect. They also proposed that an important share of studies may indeed have power near 100% (citing this paper: .htm). We are very grateful to them for their feedback and collegiality as well.



  1. P-curve can also be used to estimate average effect size rather than power (and, as Blake McShane and Karsten Hansen pointed out to us, when used in this fashion p-curve is virtually equivalent to the maximum likelihood procedure proposed by Hedges in 1984 (.pdf) ).  Here we focus on power rather than effect size because we don’t think “average effect size” is meaningful or of interest when aggregating across psychology experiments with different designs (see Colada[33]). Moreover, whereas power calculations only require that one knows the results of the test statistic of interest (e.g., F(1,230)=5.23), effect size calculations require one to also know how the study was defined, a fact that renders effect size estimations much more prone to human error (see page 676 of our article on p-curve and effect size estimation (.pdf) ). In any case, the point that we make in this post applies at least as strongly to an analysis of effect size as it does to an analysis of power: p-curve correctly recovers the true average effect size of the studies that it analyses, even when those studies contain different (i.e., heterogeneous) effect sizes. See Figure 2c in our article on p-curve and effect size estimation (.pdf) and Supplement 2 of that same paper (.pdf) []
  2. In real life, researchers are probably more likely to collect larger samples when studying smaller effects (see Colada[58]). This would necessarily reduce heterogeneity in power across studies. []
  3. To do this, we changed the blue histogram from d~N(.5,.05) to d~N(.5,.15). []
  4. We have always been transparent about this. For instance, when we described how to use p-curve for effect size estimation (.pdf) we wrote, “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve.” (p.667). []
  5. For a more quantitative example check out Supplement 2 (.pdf) of our p-curve and effect size paper. In the middle panel of Figure S2, we consider a scenario in which a researcher attempts to run an equal number of studies (with n = 20 per cell) testing either an effect size of d = .2 or an effect size of d = .6. Because it is necessarily easier to get significance when the effect size is larger than when the effect size is smaller, the share of significant d = .6 studies will necessarily be greater than the share of significant d = .2 studies, and thus p-curve will include more d = .6 studies than d = .2 studies. Because the d = .6 studies will be over-represented among all significant studies, the true average effect of the significant studies will be d = .53 rather than d = .4. P-curve correctly recovers this value (.53), but it is biased upwards if we expect it to guess d = .4. For an even more quantitative example, imagine the true average effect is d = .5 with a standard deviation of .2. If we study this with many n = 20 studies, the average observed significant effect will be d = .91, but the true average effect of those studies is d = .61, which is the number that p-curve would recover. It would not recover the true mean of the population (d = .5) but rather the true mean of the studies that were statistically significant (d = .61). In simulations, the true mean is known and this might look like a bias. In real life, the true mean is, well, meaningless, as it depends on arbitrary definitions of what constitutes the true population of all possible studies (R Code). []
  6. Again, for both practical and conceptual reasons, we would not advise you to estimate the average effect size, regardless of whether you use p-curve or any other tool. But this has nothing to do with the supposed inability of p-curve to handle heterogeneity. See footnote 1. []
  7. Uli has proposed using z-curve, a tool he developed, instead of p-curve.  While z-curve does not seem to be biased in scenarios with many studies with extreme high-power, it performs worse than p-curve in almost all other scenarios. For example, in the examples depicted graphically in this post, z-curve’s expected estimates are about 4 times further from the truth than are p-curve’s. []

[66] Outliers: Evaluating A New P-Curve Of Power Poses

In a forthcoming Psych Science paper, Cuddy, Schultz, & Fosse, hereafter referred to as CSF, p-curved 55 power-posing studies (.pdf | SSRN), concluding that they contain evidential value [1]. Thirty-four of those studies were previously selected and described as “all published tests” (p. 657) by Carney, Cuddy, & Yap (2015; .pdf). Joe and Uri p-curved those 34 studies and concluded that they lacked evidential value (.pdf | Colada[37]). The two p-curve analyses – Joe & Uri’s old p-curve and CSF’s new p-curve – arrive at different conclusions not because the different sets of authors used different sets of tools, but rather because they used the same tool to analyze different sets of data.

In this post we discuss CSF’s decision to include four studies with unusually small p-values (e.g., smaller than 1 in a quadrillion) in their analysis. The inclusion of these studies was sufficiently problematic that we stopped further evaluating their p-curve [2].

Several papers have replicated the effect of power posing on feelings of power and, as Joe and Uri reported in their Psych Science paper (.pdf, pg.4), a p-curve of those feelings-of-power effects suggests they contain evidential value. CSF interpret this as a confirmation of the central power-posing hypothesis, whereas we are reluctant to interpret it as such for reasons that are both psychological and statistical. Fleshing out the arguments on both sides may be interesting, but it is not the topic of this post.

Evaluating p-curves
Evaluating any paper is time consuming and difficult. Evaluating a p-curve paper – which is, in essence, a bundle of other papers – is necessarily more time consuming and more difficult.

We have, over time, found ways to do it more efficiently. We begin by preliminarily assessing three criteria. If the p-curve fails any of these criteria, we conclude that it is invalid and stop evaluating it. If the p-curve passes all three criteria, we evaluate the p-curve work more thoroughly.

Criterion 1: Study Selection Rule
Our first step is to verify that the authors followed a clear and reproducible study selection rule. CSF did not. That’s a problem, but it is not the focus of this post. Interested readers can check out this footnote: [3].

Criterion 2: Test Selection
Figure 4 (.pdf) in our first p-curve paper (SSRN) explains and summarizes which tests to select from the most common study designs. The second thing we do when evaluating a p-curve paper is to verify that those guidelines were followed, focusing on the designs that p-curvers most commonly handle incorrectly. For example, we look at interaction hypotheses to make sure that the right test is included, and we look to see whether omnibus tests are selected (they should almost never be; see Colada[60]). CSF selected some incorrect test results (e.g., their smallest p-value comes from an omnibus test). See “Outlier 1” below.

Criterion 3. Outliers
Next we sort studies by p-value to identify possible outliers, and we carefully read the papers containing an outlier result. We do this both because outliers exert a disproportionate effect on the results of p-curve, and because outliers are much more likely to represent the erroneous inclusion of a study or the erroneous selection of a test result. This post focuses on outliers.

This figure presents the distribution of p-values in CSF’s p-curve analysis (see their disclosure table .xlsx). As you can see, there are four outliers:

Outlier 1
CSF’s smallest p-value is from F(7, 140) = 19.47, approximately p = .00000000000000002, or 1 in 141 quadrillion. It comes from a 1993 experiment published in the journal The Arts in Psychotherapy (.pdf).
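Readers who want to verify magnitudes like this can compute the tail probability of an F statistic directly; scipy's `stats.f.sf(19.47, 7, 140)` does it in one call. Below is a stdlib-only sketch that instead numerically integrates the F density (the integration bounds are an assumption that works because this tail decays extremely fast).

```python
import math

def f_log_pdf(x, d1, d2):
    """Log density of the F distribution with (d1, d2) degrees of freedom."""
    lbeta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    return ((d1 / 2) * math.log(d1 / d2)
            + (d1 / 2 - 1) * math.log(x)
            - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
            - lbeta)

def f_sf(x, d1, d2, upper_pad=100, steps=20000):
    """P(F > x): composite Simpson integration of the F density over
    [x, x + upper_pad]; the remainder beyond that is negligible here."""
    upper = x + upper_pad
    h = (upper - x) / steps
    total = math.exp(f_log_pdf(x, d1, d2)) + math.exp(f_log_pdf(upper, d1, d2))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.exp(f_log_pdf(x + i * h, d1, d2))
    return total * h / 3

p = f_sf(19.47, 7, 140)
print(p)  # astronomically small, on the order of 10**-17 or below
```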

In this within-subject study (N = 24), each participant held three “open” and three “closed” body poses. At the beginning of the study, and then again after every pose, they rated themselves on eight emotions. The descriptions of the analyses are insufficiently clear to us (and to colleagues we sent the paper to), but as far as we can tell, the following things are true:

(1) Some effects are implausibly large. For example, Figure 1 in their paper (.pdf) suggests that the average change in happiness for those adopting the “closed” postures was ~24 points on a 0-24 scale. This could occur only if every participant was maximally happy at baseline and then maximally miserable after adopting every one of the 3 closed postures.

(2) The statistical analyses incorrectly treat multiple answers by the same participants as independent, across emotions and across poses.

(3) The critical test of an interaction between emotion valence and pose is not reported. Instead the authors report only an omnibus interaction: F(7, 140) = 19.47. Given the degrees-of-freedom of the test, we couldn’t figure out what hypothesis this analysis was testing, but regardless, no omnibus test examines the directional hypothesis of interest. Thus, it should not be included in a p-curve analysis.

Outlier 2
CSF’s second smallest p-value is from F(1,58)=85.9,  p = .00000000005, or 1 in 2 trillion. It comes from a 2016 study published in Biofeedback Magazine (.pdf). In that study, 33 physical therapists took turns in dyads, with one of them (the “tester”) pressing down on the other’s arm, and the other (the “subject”) attempting to resist that pressure.

The p-value selected by CSF compares subjective arm strength when the subject is standing straight (with back support) vs. slouching (without support). As the authors of the original article explain, however, that has nothing to do with any psychological consequences of power posing, but rather, with its mechanical consequences. In their words: “Obviously, the loss of strength relates to the change in the shoulder/body biomechanics and affects muscle activation recorded from the trapezius and medial and anterior deltoid when the person resists the downward applied pressure” (p. 68-69; emphasis added) [4].

Outlier 3
CSF’s third smallest p-value is from F(1,68)=26.25, p = .00000267, or 1 in ~370,000. It comes from a 2014 study published in Psychology of Women Quarterly (.pdf).

This paper explores two main hypotheses, one that is quite nonintuitive, and one that is fairly straightforward. The nonintuitive hypothesis predicts, among other things, that women who power pose while sitting on a throne will attempt more math problems when they are wearing a sweatshirt but fewer math problems when they are wearing a tank-top; the prediction is different for women sitting in a child’s chair instead of a throne [5].

CSF chose the p-value for the straightforward hypothesis, the prediction that people experience fewer positive emotions while slouching (“allowing your rib cage to drop and your shoulders to rotate forward”) than while sitting upright (“lifting your rib cage forward and pull[ing] your shoulders slightly backwards”).

Unlike the previous two outliers, one might be able to validly include this p-value in p-curve. But we have reservations, both about the inclusion of this study, and about the inclusion of this p-value.

First, we believe most people find power posing interesting because it affects what happens after posing, not what happens while posing. For example, in our opinion, the fact that slouching is more uncomfortable than sitting upright should not be taken as evidence for the power poses hypothesis.

Second, while the hypothesis is about mood, this study’s dependent variable is a principal component that combines mood with various other theoretically irrelevant variables that could be driving the effect, such as how “relaxed” or “amused” the participants were. We discuss two additional reservations in this footnote: [6].

Outlier 4
CSF’s fourth smallest p-value is from F(2,44)=13.689, p=.0000238, or 1 in 42,000. It comes from a 2015 study published in the Mediterranean Journal of Social Sciences (.pdf). Fifteen male Iranian students were all asked to hold the same pose for almost the entirety of each of nine 90-minute English instruction sessions, varying across sessions whether it was an open, ordinary, or closed pose. Although the entire class was holding the same position at the same time, and evaluating their emotions at the same time, and in front of all other students, the data were analyzed as if all observations were independent, artificially reducing the p-value.

Given how difficult and time consuming it is to thoroughly review a p-curve analysis or any meta-analysis (e.g., we spent hours evaluating each of the four studies discussed here), we preliminarily rely on three criteria to decide whether a more exhaustive evaluation is even warranted. CSF’s p-curve analysis did not satisfy any of the criteria. In its current form, their analysis should not be used as evidence for the effects of power posing, but perhaps a future revision might be informative.


Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted CSF and the authors of the four studies we reviewed.

Amy Cuddy responded to our email, but did not discuss any of the specific points we made in our post, or ask us to make any specific changes. Erik Peper, lead author of the second outlier study, helpfully noticed that we had the wrong publication date and briefly mentioned several additional articles of his own on how slouched positions affect emotions, memory, and energy levels (.pdf; .pdf; .pdf; html; html). We also received an email from the second author of the first outlier study; he had “no recommended changes.” He suggested that we try to contact the lead author but we were unable to find her current email address.



  1. When p-curve concludes that there is evidential value, it is simply saying that at least one of the analyzed findings was unlikely to have arisen from the mere combination of random noise and selective reporting. In other words, at least one of the studies would be expected to replicate if it were rerun. []
  2. After reporting the overall p-curve, CSF also split the 55 studies based on the type of dependent variable: (i) feelings of power, (ii) EASE (“Emotion, Affect, and Self-Evaluation”), and (iii) behavior or hormonal response (non-EASE). They find evidential value for the first two, but not the last. The p-curve for EASE includes all four of the studies described in this post. []
  3. To ensure that studies are not selected in a biased manner, and more generally to help readers and reviewers detect possible errors, the set of studies included in p-curve must be determined by a predetermined rule. The rule, in turn, should be concrete and precise enough that an independent set of researchers following the rule would generate the same, or virtually the same, set of studies. The rule, as described in CSF’s paper, lacks the requisite concreteness and precision. In particular, the paper lists 24 search terms (e.g., “power”, “dominance”) that were combined (but the combinations are not listed). The resulting hits were then “filter[ed] out based on title, then abstract, and then the study description in the full text” in an unspecified manner. (Supplement: https://osf.io/5xjav/ | our archived copy .txt). In sum, though the authors provide some information about how they generated their set of studies, neither the search queries nor the filters are specified precisely enough for someone else to reproduce them. Joe and Uri’s p-curve, on the other hand, followed a reproducible study selection rule: all studies that were cited by Carney et al. (2015) as evidence for power posing. []
  4. The paper also reports that upright vs. collapsed postures may affect emotions and the valence of memories, but these claims are supported by quotations rather than by statistics. The one potential exception is that the authors report a “negative correlation between perceived strength and severity of depression (r=-.4).” Given the sample size of the study, this is indicative of a p-value in the .03-.05 range. The critical effect of pose on feelings, however, is not reported. []
  5. The study (N = 80) employed a fully between-subjects 2(self-objectification: wearing a tanktop vs. wearing a sweatshirt) x 2(power/status: sitting in a “grandiose, carved wooden decorative antique throne” vs. a “small wooden child’s chair from the campus day-care facility”) x 2(pose: upright vs slumped) design. []
  6. First, for robustness, one would need to include in the p-curve the impact of posing on negative mood, which is also reported in the paper and which has a considerably larger p-value (F = 13.76 instead of 26.25). Second, the structure of the experiment is very complex, involving a three-way interaction which in turn hinges on a two-way reversing interaction and a two-way attenuated interaction. It is hard to know if the p-value distribution of the main effect is expected to be uniform under the null (a requirement of p-curve analysis) when the researcher is interested in these trickle-down effects. For example, it is hard to know whether p-hacking the attenuated interaction effect would cause the p-value associated with the main effect to be biased downwards. []

[65] Spotlight on Science Journalism: The Health Benefits of Volunteering

I want to comment on a recent article in the New York Times, but along the way I will comment on scientific reporting as well. I think that science reporters frequently fall short in assessing the evidence behind the claims they relay, but as I try to show, assessing evidence is not an easy task. I don’t want scientists to stop studying cool topics, and I don’t want journalists to stop reporting cool findings, but I will suggest that they should make it commonplace to get input from uncool data scientists and statisticians.

Science journalism is hard. Those journalists need to maintain a high level of expertise in a wide range of domains while being truly exceptional at translating that content in ways that are clear, sensible, and accurate. For example, it is possible that Ed Yong couldn’t run my experiments, but I certainly couldn’t write his articles. [1]

I was reminded about the challenges of science journalism when reading an article about the health benefits of being a volunteer. The journalist, Nicole Karlis, seamlessly connects interviews with recent victims, interviews with famous researchers, and personal anecdotes.

The article also cites some evidence in the form of three scientific findings. Like the journalist, I am not an expert in this area. The journalist’s profession requires her to float above the ugly complexities of the data, whereas my career is spent living amongst (and contributing to) those complexities. So I decided to look at those three papers.

OK, here are those references (the first two come together):

If you would like to see those articles for yourself, they can be found here (.html) and here (.html).

First the blood pressure finding. The original researchers analyze data from a longitudinal panel of 6,734 people who provided information about their volunteering and had their blood pressure measured. After adding a number of control variables [2], they look to see if volunteering has an influence on blood pressure. OK, how would you do that? 40.4% of respondents reported some volunteering. Perhaps they could be compared to the remaining 59.6%? Or perhaps one could estimate how blood pressure changes with each additional hour volunteered? The point is, there are a few ways to think about this. The authors found a difference only when comparing non-volunteers to the category of people who volunteered 200 hours or more. Their report:

“In a regression including the covariates, hours of volunteer work were related to hypertension risk (Figure 1). Those who had volunteered at least 200 hours in the past 12 months were less likely to develop hypertension than non-volunteers (OR=0.60; 95% CI:0.40–0.90). There was also a decrease in hypertension risk among those who volunteered 100–199 hours; however, this estimate was not statistically reliable (OR=0.78; 95% CI=0.48–1.27). Those who volunteered 1–49 and 50–99 hours had hypertension risk similar to that of non-volunteers (OR=0.95; 95% CI: 0.68–1.33 and OR=0.96; 95% CI: 0.65–1.41, respectively).”

So what I see is some evidence that is somewhat suggestive of the claim, but it is not overly strong. The 200-hour cut-off is arbitrary, and the effect is not obviously robust to other specifications. I am worried that we are seeing researchers choosing their favorite specification rather than the best specification. So, suggestive perhaps, but I wouldn’t be ready to cite this as evidence that volunteering is related to improved blood pressure.

The second finding is “volunteering is linked to… decreased mortality rates.” That paper analyzes data from a different panel of 10,317 people who report their volunteer behavior and whose deaths are recorded. Those researchers convey their finding in the following figure:

So first, that is an enormous effect. People who volunteered were about 50% less likely to die within four years. Taken at face value, that would suggest an effect seemingly on the order of normal person versus smoker + drives without a seatbelt + crocodile-wrangler-hobbyist. But recall that this is observational data and not an experiment, so we need to be worried about confounds. For example, perhaps the soon-to-be-deceased also lack the health to be volunteers? The original authors have that concern too, so they add some controls. How did that go?

That is not particularly strong evidence. The effects are still directionally right, and many statisticians would caution against focusing on p-values… but still, that is not overly compelling. I am not persuaded. [3]

What about the third paper referenced?

That one can be found here (.html).

Unlike the first two papers, that is not a link to a particular result, but rather to a preregistration. Readers of this blog are probably familiar, but preregistrations are the time-stamped analysis plans of researchers from before they ever collect any data. Preregistrations – in combination with experimentation – eliminate some of the concerns about selective reporting that inevitably follow other studies. We are huge fans of preregistration (.html, .html, .html). So I went and found the preregistered primary outcome on page 8:

Perfect. That outcome is (essentially) one of those mentioned in the NY Times. But things got more difficult for me at that point. This intervention was an enormous undertaking, with many measures collected over many years. Accordingly, though the primary outcome was specified here, a number of follow-up papers have investigated some of those alternative measures and analyses. In fact, the authors anticipate some of that by saying “rather than adjust p-values for multiple comparison, p-values will be interpreted as descriptive statistics of the evidence, and not as absolute indicators for a positive or negative result.” (p. 13). So they are saying that, outside of the mobility finding, p-values shouldn’t be taken quite at face value. This project has led to some published papers looking at the influence of the volunteerism intervention on school climate, Stroop performance, and hippocampal volume, amongst others. But the primary outcome – mobility – appears to be reported here (.html). [4]. What do they find?

Well, we have the multiple comparison concern again – whatever difference exists is only found at 24 months, but mobility has been measured every four months up until then. Also, this is only for women, whereas the original preregistration made no such specification. What happened to the men? The authors say, “Over a 24-month period, women, but not men, in the intervention showed increased walking activity compared to their sex-matched control groups.” So the primary outcome appears not to have been supported. Nevertheless, making interpretation a little challenging, the authors also say, “the results of this study indicate that a community-based intervention that naturally integrates activity in urban areas may effectively increase physical activity.” Indeed, it may, but it also may not. These data are not sufficient for us to make that distinction.

That’s it. I see three findings, all of which are intriguing to consider, but none of which are particularly persuasive. The journalist, who presumably has been unable to read all of the original sources, is reduced to reporting their claims. The readers, who are even more removed, take the journalist’s claims at face value: “if I volunteer then I will walk around better, lower my blood pressure, and live longer. Sweet.”

I think that we should expect a little more from science reporting. It might be too much for every journalist to dig up every link, but perhaps they should develop a norm of collecting feedback from those people who are informed enough to consider the evidence, but far enough outside the research area to lack any investment in a particular claim. There are lots of highly competent commentators ready to evaluate evidence independent of the substantive area itself.

There are frequent calls for journalists to turn away from the surprising and uncertain in favor of the staid and uncontroversial. I disagree – surprising stories are fun to read. I just think that journalists should add an extra level of scrutiny to ensure that we know that the fun stories are also true stories.

Author Feedback.
I shared a draft of this post with the contact author for each of the four papers I mention, as well as the journalist who had written about them. I heard back from one, Sara Konrath, who had some helpful suggestions including a reference to a meta-analysis (.html) on the topic.



  1. Obviously Mr. Yong could run my experiments better than me also, but I wanted to make a point. At least I can still teach college students better than him though. Just kidding, he would also be better at that. []
  2. average systolic blood pressure (continuous), average diastolic blood pressure (continuous), age (continuous), sex, self-reported race (Non-Hispanic White, Non-Hispanic Black, Hispanic, Non-Hispanic Other), education (less than high school, General Equivalency Diploma [GED], high school diploma, some college, college and above), marital status (married, annulled, never married, divorced, separated, widowed), employment status (employed/not employed), and self-reported history of diabetes (yes/no), cancer (yes/no), heart problems (yes/no), stroke (yes/no), or lung problems (yes/no). []
  3. It is worth noting that this paper, in particular, goes on to consider the evidence in other interesting ways. I highlight this portion because it was the fact being cited in the NYT article. []
  4. I think. It is really hard for me, as a novice in this area, to know if I have found all of the published findings from this original preregistration. If there is a different mobility finding elsewhere I couldn’t find it, but I will correct this post if it gets pointed out to me. []