[70] How Many Studies Have Not Been Run? Why We Still Think the Average Effect Does Not Exist

We have argued that, for most effects, it is impossible to identify the average effect (datacolada.org/33). The argument is subtle (but not statistical), and given the number of well-informed people who seem to disagree, perhaps we are simply wrong. This is my effort to explain why we think identifying the average effect is so hard. I am going to take a while to explain my perspective, but the boxed-text below highlights where I am eventually going.

When averaging is easy: Height at Berkeley.
First, let’s start with a domain where averaging is familiar, useful, and plausible. If I want to know the average height of a UC Berkeley student I merely need a random sample, and I can compute the average and have a good estimate. Good stuff.

My sense is that when people think that we should calculate the average effect size they are picturing something kind of like calculating average height: First sample (by collecting the studies that were run), then calculate (by performing a meta-analysis). When it comes to averaging effect sizes, I don’t think we can do anything particularly close to computing the “average” effect.

The effect of happiness on helpfulness is not like height
Let’s consider an actual effect size from psychology: the influence of positive emotion on helping behavior. The original paper studying this effect (or the first that I think of) manipulates whether or not a person unexpectedly finds a dime in a phone booth and then measures whether the person stops to help pick up some spilled papers (.pdf). When people have the $.10 windfall they help 88% of the time, whereas the others help only 4% of the time [1]. So that is the starting point, but it is only one study. The same paper, for example, contains another study that manipulates whether people receive a cookie and measures the number of minutes volunteered to be a confederate for either a helping experiment, in one condition, or a distraction experiment, in another (a 2 x 2 design). Cookies increased minutes volunteered for helping (69 minutes vs. 16.7 minutes) and decreased minutes volunteered for the distraction experiment (20 minutes vs. 78.6 minutes) [2]. OK, so the meta-analyst can now average those effect sizes in some manner and conclude that they have identified an unbiased estimate of the average effect of positive emotion on helping behavior.

What about the effect of nickels on helpfulness?
However, that is surely not right, because those are not the only two studies investigating the effect of happiness on helpfulness. Perhaps, for example, there was an unreported study using nickels, rather than dimes, that did not get to p<.05. Researchers are more likely to tell you about a result, and journal editors are more likely to publish a result, if it is statistically significant. That is publication bias, the problem that developers of meta-analytic tools discuss most. There have been lots of efforts to find a way to correct for it, including p-curve. But what exactly are those efforts aiming to correct? What is the right set of studies to attempt to reconstruct?

The studies we see versus the studies we might see
Because we developed p-curve, we know which answer it is aiming for: The true average effect of the studies it includes [3].  So it gives an unbiased estimate of the dimes and cookies, but is indifferent to nickels. We are pretty comfortable owning that limitation – p-curve can only tell you about the true effect of the studies it includes. One could reasonably say at this point, “but wait, I am looking for the average effect of happiness on helping, so I want my average to include nickels as well.” This gets to the next point: What are the other studies that should be included?

Let’s assume that there really is a non-significant (p>.05) nickels study that was conducted. Would we find out about it? Sometimes. Perhaps the p-value is really close to .05, so the authors are comfortable reporting it in the paper? [4] Perhaps it creeps into a book chapter some time later and the p-values are not so closely scrutinized? Perhaps the experimenter is a devoted open-science advocate and writes a Python script that automatically posts all JASP output on PsyArXiv regardless of what it is? The problem is not whether we will see any non-significant findings; the problem is whether we would see all of them. No one believes that we would catch all of them, and presumably everyone believes that we would see a biased sample – namely, we would be more likely to see those studies which best serve the argument of the people presenting them. But we know very little about the specifics of that biasing. How likely are we to see a p = .06? Does it matter if that study is about nickels, helping behavior, or social psychology, or are non-significant findings more or less likely to be reported in different research areas? Those aren’t whimsical questions either, because an unknown filter is impossible to correct for. Remember the averaging problem at the beginning of this post – the average height of students at UC Berkeley – and think of how essential the sampling was for that exercise. If someone said that they averaged all the student heights in their Advanced Dutch Literature class we would be concerned that the sample was not random, and since it likely has more Dutch people (who are peculiarly tall), we would worry about bias. But how biased? We have no idea. The same goes for the likelihood of seeing a non-significant nickels study. We know that we are less likely to see it, but we don’t know how much less likely [5]. It is really hard to integrate these into a true average.

But ok, what if we did see every single conducted study?
What if we did know the exact size of that bias? First: wow. Second, that wouldn’t be the only bias that affects the average, and it wouldn’t be the largest. The biggest bias is almost certainly in what studies researchers choose to conduct. Think back to the researchers choosing to use a dime in a phone booth. What if they had decided instead to measure helping behavior differently? Rather than seeing whether people picked up spilled papers, suppose they had instead observed whether people chose to spend the weekend cleaning the experimenter’s septic tank. That would still be helpful, so the true effect of such a study would indisputably be part of the true average effect of happiness on helping. But the researchers didn’t use that measure, perhaps because they were concerned that the effect would not be large enough to detect. Also, the researchers did not choose to manipulate happiness by leaving a briefcase of $100,000 in the phone booth. Not only would that be impractical, but that study is less likely to be conducted because it is not as compelling: the expected effect seems too obvious. It is not particularly exciting to say that people are more helpful when they are happy, but it is particularly exciting to show that a dime generates enough happiness to change helpfulness [6]. So the experiments people conduct are a tiny subset of the experiments that could be conducted, they are a biased subset (no one randomly generates an experimental design, nor should they), and those biases are entirely opaque. But if you want a true average, you need to know the exact magnitude of those biases.

So what all is included in an average effect size?
So now I return to that initial list of things that need to be included in the average effect size (reposted right here to avoid unnecessary scrolling):

That is a tall order. I don’t mind someone wanting that answer, and I fully acknowledge that p-curve does not deliver it. P-curve only hopes to deliver the average effect in (a).

If you want the “Big Average” effect (a, b, c, d, e, and f), then you need either access to the full population of studies or a way to perfectly estimate the biases that influence the size of each category. That is not me being dismissive or dissuasive; it is just the nature of averaging. We are so pessimistic about calculating that average effect size that we use the shorthand of saying that the average effect size does not exist [7].

But that is a statement of the problem and an acknowledgment of our limitations. If someone has a way to handle the complications above, they would have at least three very vocal advocates.


  1. ! []
  2. !! []
  3. “True effect” is kind of conceptual, but in this case I think that there is some agreement on the operational definition of “true.” If you conducted the study again, you would expect, on average, the “true” result. So if, because of bias or error, the published cookie effect is unusually smaller or larger than the true underlying effect, you are still most interested in the best prediction of what would happen if you ran the study again. I am open to being convinced that there is a different definition of “true”, but I think this is a pretty uncontroversial one. []
  4. Actually, it is worth noting that the cookie experiment features one critical test with a t-value of 1.96. Given the implied df for that study, the p-value would be >.05, though it is reported as p<.05. The point is, those authors were willing to report a non-significant p-value. []
  5. Scientists, statisticians, psychologists, and probably postal workers, bobsledders, and pet hamsters have frequently bemoaned the absurdity of a hard cut-off of p<.05. Granted. But it does provide a side benefit for this selection-bias issue: If p>.05, we have no idea whether we will see it, but if p<.05, we know that the p-value hasn’t kept us from seeing it. []
  6. Or to quote the wonderful Prentice and Miller (1992), who in describing the cookie finding, say “the power of this demonstration derives in large part from the subtlety of the instigating stimulus… although mood effects might be interesting however heavy-handed the manipulation that produced them, the cookie study was perhaps made more interesting by its reliance on the minimalist approach.” p. 161. []
  7. It is worth noting that there is some variation among the three of us on the impracticality of calculating the average effect size. The most optimistic of us (me, probably) believes that under a very small number of circumstances – none of which are likely to happen for psychological research – the situation might be well-defined enough for the average effect to be understood and calculated. The most pessimistic of us think that even that limited set of circumstances is essentially a non-existent set. From that perspective, the average effect truly does not exist. []

[69] Eight things I do to make my open research more findable and understandable

It is now common for researchers to post original materials, data, and/or code behind their published research. That’s obviously great, but open research is often difficult to find and understand.

In this post I discuss 8 things I do, in my papers, code, and datafiles, to combat that.

1) Before all method sections, I include a paragraph overviewing the open research practices behind the paper. Like this:

2) Just before the end of the paper, I put the supplement’s table of contents. And the text reads something like “An online supplement is available; Table 1 summarizes its contents.”

3) In tables and figure captions, I include links to code that reproduces them

4) I start my code by indicating authorship, last update, and contact info.
5) I then provide an outline of its structure. Like this:
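
A hypothetical sketch of such a header and outline in an R script; the file name, section labels, and contact placeholders are all invented:

# study1_analysis.R  --  reproduces all results reported in Study 1
# Written by:   [author name] ([contact email])
# Last updated: [date]
#
# Outline:
#   (1) Read and clean raw data
#   (2) Descriptive statistics
#   (3) Main analysis (Table 1)
#   (4) Figure 1
#   (5) Robustness checks reported in the online supplement

# (1) Read and clean raw data
#     (each section of the script below is then labeled with its number from the outline)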

Then, throughout the code I use those same numbers so people can navigate the code easily [1].

6) Rule of thumb: at least one comment for every 3 lines of code.

Even if something is easy to figure out, a comment will make reading code more efficient and less aversive. But most things are not so easy to figure out. Moreover, nobody understands your code as well as you do when you are writing it, including yourself 72 hours later.

When writing comments in code, it is useful to keep in mind who may actually read it; see the footnote for a longer discussion [2].

7) Codebook (very important). Best to have a simple stand-alone text file that looks like this: each variable name followed by a description that includes info on possible values and relevant collection details.
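
For instance, a codebook entry might look something like this (the variables are hypothetical):

cond      Experimental condition: 1 = treatment, 0 = control; assigned at random by the survey software
helped    Whether the participant helped: 1 = yes, 0 = no; coded from video by a hypothesis-blind research assistant
age       Self-reported age in years; open-ended text box shown at the end of the survey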

8) I post the rawest form of the data that I am able/allowed to post. All data cleaning is then done in code that is posted as well. When cleaning is extensive, I post both the raw and the cleaned datafiles.

Note: writing this post helped me realize I don’t always do all 8 in every paper. I will try to do so going forward.

In sum.
1. In paper: open-research statement
2. In paper: supplement’s table of contents
3. In figure captions: links to reproducible code
4. In code: contact info and description
5. In code: outline of program below
6. In code: At least one comment for every three lines
7. Data: post codebook (text file, variable name, description)
8. Data: post (also) rawest version of data possible



  1. I think this comes from learning BASIC as a kid (my first programming language), where all code went in numbered lines like
    10 PRINT “Hola Uri”
    20 GOTO 10. []
  2. Let’s think about who will be reading your code.
    One type of reader is someone learning how to use the programming language or statistical technique you used. Help that person out and spell things out for them. Wouldn’t you have liked that when you were learning? So if you use a non-vanilla procedure, throw your reader a bone and explain in 10 words stuff they could learn if they read the 3-page help file they shouldn’t really be expected to read just to follow what you did. Throw in references and links to further reading when pertinent, but make your code as self-contained as possible.

    Another type of reader is at least as sophisticated as you are, but does things differently from you, so cannot quite understand what you are doing (e.g., you parallelize loops, they vectorize). If they don’t quite understand what you did, they will be less likely to learn from your code, or to help you identify errors in it. What’s the point of posting it then? This is especially true in R, where there are 20 ways to do everything, and some really trivial stuff is a pain to do.

    Another type of reader lives in the future, say 5 years from today, when the approach, library, structure, or even programming language you use is no longer in use. Help that person map what you did into the language/function/program of the future. Also, that person will one day be you.

    The cost of excessive commenting is a few minutes of your time spent typing text that people may never read, just to be thorough and to prevent errors. That’s what we spend most of our time doing anyway. []

[68] Pilot-Dropping Backfires (So Daryl Bem Probably Did Not Do It)

Uli Schimmack recently identified an interesting pattern in the data from Daryl Bem’s infamous “Feeling the Future” JPSP paper, in which he reported evidence for the existence of extrasensory perception (ESP; .pdf)[1]. In each study, the effect size is larger among participants who completed the study earlier (blogpost: .htm). Uli referred to this as the “decline effect.” Here is his key chart:

The y-axis represents the cumulative effect size, and the x-axis the order in which subjects participated.

The nine dashed blue lines represent each of Bem’s nine studies. The solid blue line represents the average effect across the nine studies. For the purposes of this post you can ignore the gray areas of the chart [2].

Uli’s analysis is ingenious, stimulating, and insightful, and the pattern he discovered is puzzling and interesting. We’ve enjoyed thinking about it. And in doing so, we have come to believe that Uli’s explanation for this pattern is ultimately incorrect, for reasons that are quite counter-intuitive (at least to us). [3].

Pilot dropping
Uli speculated that Bem did something that we will refer to as pilot dropping. In Uli’s words: “we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continues to collect more data to see whether the effect is real (…) the strong effect during the initial trials (…) is sufficient to maintain statistical significance  (…) as more participants are added” (.htm).

In our “False-Positive Psychology” paper (.pdf) we barely mentioned pilot-dropping as a form of p-hacking (p. 1361), and so we were intrigued by the possibility that it explains Bem’s impossible results.

Pilot dropping can make false-positives harder to get
It is easiest to quantify the impact of pilot dropping on false-positives by computing how many participants you need to run before a successful (false-positive) result is expected.

Let’s say you want to publish a study with two between-subjects conditions and n=100 per condition (N=200 total). If you don’t p-hack at all, then on average you need to run 20 studies to obtain one false-positive finding [4]. With N=200 in each study, that means you need an average of 4,000 participants to obtain one finding.
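
As a side note, those two expectations (and the median mentioned in footnote 4) can be reproduced with a couple of lines of R:

1 / .05                      # expected number of studies until the first false-positive: 20
(1 / .05) * 200              # expected number of participants: 4,000
qgeom(.5, prob = .05) + 1    # median number of studies needed: 14 (see footnote 4)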

The effects of pilot-dropping are less straightforward to compute, and so we simulated it [5].

We considered a researcher who collects a “pilot” of, say, n = 25 per condition. (We show later that the size of the pilot doesn’t matter much.) If she gets a high p-value, the pilot is dropped. If she gets a low p-value, she keeps the pilot and adds the remaining subjects to get to 100 (so she runs another n=75 per condition in this case).

How many subjects she ends up running depends on what threshold she selects for dropping the pilot. Two things are counter-intuitive.

First, the lower the threshold to continue with the study (e.g., p<.05 instead of p<.10), the more subjects she ends up running in total.

Second, she can easily end up running way more subjects than if she didn’t pilot-drop or p-hack at all.

This chart has the results (R Code):

Note that if pilots are dropped when they obtain p>.05, it takes about 50% more participants on average to get a single study to work (because you drop too many pilots, and still many full studies don’t work).

Moreover, Uli conjectured that Bem added observations only when obtaining a “strong effect”. If we operationalize strong effect as p<.01, we now need about N=18,000 for one study to work, instead of “only” 4,000.

With higher thresholds, pilot-dropping does help, but only a little (the blue line is never too far below 4,000). For example, dropping pilots using a threshold of p>.30 is near the ‘optimum,’ and the expected number of subjects is about 3400.

As mentioned, these results do not hinge on the size of the pilot, i.e., on the assumed n=25 (see charts .pdf).
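
For readers who want to see the mechanics, here is a minimal sketch of that simulation; it assumes a two-cell design, two-sided t-tests, a true effect of zero, and a pilot of n = 25 per condition (the posted R Code is more flexible):

simulate_N_until_false_positive <- function(threshold, n_pilot = 25, n_full = 100) {
  total_n <- 0
  repeat {
    # draw a full study's worth of data per condition; the pilot is its first n_pilot observations
    x <- rnorm(n_full); y <- rnorm(n_full)
    p_pilot <- t.test(x[1:n_pilot], y[1:n_pilot])$p.value
    if (p_pilot > threshold) {
      total_n <- total_n + 2 * n_pilot       # unpromising pilot: drop it (those subjects are still spent)
    } else {
      total_n <- total_n + 2 * n_full        # promising pilot: complete the study (n = 100 per condition)
      if (t.test(x, y)$p.value < .05) return(total_n)   # the completed study "works": a false-positive
    }
  }
}
# expected subjects per false-positive when pilots are dropped at p > .05; compare to ~4,000 with no pilots
mean(replicate(1000, simulate_N_until_false_positive(threshold = .05)))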

What’s the intuition?
Pilot dropping has two effects.
(1) It saves subjects by cutting losses after a bad early draw.
(2) It costs subjects by interrupting a study that would have worked had it gone all the way.

For lower cutoffs, (2) is larger than (1).

What does explain the decline effect in this dataset?
We were primarily interested in the consequences of pilot dropping, but the discovery that pilot dropping is not very consequential does not bring us closer to understanding the patterns that Uli found in Bem’s data. One possibility is pilot-hacking, superficially similar to, but critically different from, pilot-dropping.

It would work like this: you run a pilot and you intensely p-hack it, possibly well past p=.05. Then you keep collecting more data and analyze them the same (or a very similar) way. That probably feels honest (regardless, it’s wrong). Unlike pilot dropping, pilot hacking would dramatically decrease the number of subjects needed for a false-positive finding, because way fewer pilots would be dropped thanks to p-hacking, and because you would start with a much stronger effect so more studies would end up surviving the added observations (e.g., instead of needing 20 attempts to get a pilot with p<.05, with p-hacking one often needs only 1). Of course, just because pilot-hacking would produce a pattern like that identified by Uli, one should not conclude that’s what happened.

Alternative explanations for decline effects within study
1) Researchers may make a mistake when sorting the data (e.g., sorting by the dependent variable and not including the timestamp in their sort, thus creating a spurious association between time and effect) [6].

2) People who participate earlier in a study could plausibly show a larger effect than those who participate later; for example, if responsible students participate earlier and pay more attention to instructions (this is not a particularly plausible explanation for Bem, as precognition is almost certainly zero for everyone) [7].

3) Researchers may put together a series of small experiments that were originally run separately and present them as “one study,” and (perhaps inadvertently) put within the compiled dataset studies that obtained larger effects first.

Pilot dropping is not a plausible explanation for Bem’s results in general nor for the pattern of decreasing effect size in particular. Moreover, because it backfires, it is not a particularly worrisome form of p-hacking.


Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded. We shared a draft with Daryl Bem and Uli Schimmack. Uli replied and suggested that we extend the analyses to smaller sample sizes for the full study. We did. The qualitative conclusion was the same. The posted R Code includes the more flexible simulations that accommodated his suggestion. We are grateful for Uli’s feedback.



  1. In this paper, Bem claimed that participants were affected by treatments that they received in the future. Since causation doesn’t work that way, and since some have failed to replicate Bem’s results, many scholars do not believe Bem’s conclusion []
  2. The gray lines are simulated data when the true effect is d=.2 []
  3. To give a sense of how much we lacked the intuition, at least one of us was pretty convinced by Uli’s explanation. We conducted the simulations below not to make a predetermined point, but because we really did not know what to expect. []
  4. The median number of studies needed is about 14; there is a long tail []
  5. The key number one needs is the probability that the full study will work, conditional on having decided to run it after seeing the pilot. That’s almost certainly possible to compute with formulas, but why bother? []
  6. This does not require a true effect, as the overall effect behind the spurious association could have been p-hacked []
  7. Ebersole et al., in “Many Labs 3” (.pdf), find no evidence of a decline over the semester; but that’s a slightly different hypothesis. []

[67] P-curve Handles Heterogeneity Just Fine

A few years ago, we developed p-curve (see p-curve.com), a statistical tool that identifies whether or not a set of statistically significant findings contains evidential value, or whether those results are solely attributable to the selective reporting of studies or analyses. It also estimates the true average power of a set of significant findings [1].

A few methods researchers have published papers stating that p-curve is biased when it is used to analyze studies with different effect sizes (i.e., studies with “heterogeneous effects”). Since effect sizes in the real world are not identical across studies, this would mean that p-curve is not very useful.

In this post, we demonstrate that p-curve performs quite well in the presence of effect size heterogeneity, and we explain why the methods researchers have stated otherwise.

Basic setup
Most of this post consists of figures like this one, which report the results of 1,000 simulated p-curve analyses (R Code).

Each analysis contains 20 studies, and each study has its own effect size, its own sample size, and because these are drawn independently, its own statistical power. In other words, the 20 studies contain heterogeneity [2].

For example, to create this first figure, each analysis contained 20 studies. Each study had a sample size drawn at random from the orange histogram, a true effect size drawn at random from the blue histogram, and thus a level of statistical power, which is depicted in the third histogram.

The studies’ statistical power ranged from 10% to 70%, and their average power was 41%. P-curve guessed that their average power was 40%. Not bad.
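
To make the setup concrete, here is a minimal sketch (not the posted R Code) of how one such set of significant, heterogeneous studies can be generated, and of the “true average power” that p-curve tries to recover. The per-cell sample sizes are drawn from an assumed range centered on 25, and the effect sizes follow the first figure’s blue histogram, d ~ N(.5, .05) (see footnote 3):

set.seed(1)
one_significant_study <- function() {
  repeat {
    n <- sample(20:30, 1)                  # per-cell sample size (assumed range, for illustration)
    d <- rnorm(1, mean = .5, sd = .05)     # this study's true effect size
    x <- rnorm(n, 0, 1); y <- rnorm(n, d, 1)
    test <- t.test(y, x, var.equal = TRUE)
    if (test$p.value < .05 && mean(y) > mean(x))      # keep only significant studies, as p-curve does
      return(power.t.test(n = n, delta = d)$power)    # the true power of this significant study
  }
}
true_powers <- replicate(20, one_significant_study())
mean(true_powers)   # the true average power of the 20 analyzed studies: the quantity p-curve estimates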

But what if…?

1) But what if there is more heterogeneity in effect size?
Let’s increase heterogeneity so that the analyzed set of studies contains effect sizes ranging from d = 0 (null) to d = 1 (very large), probably pretty close to the entire range of plausible effect sizes in psychology [3].

The true average power is 42%. P-curve estimates 43%. Again, not bad.

2) But what if samples are larger?
Perhaps p-curve’s success is limited to analyses of studies that are relatively underpowered. So let’s increase sample size (and therefore power) and see what happens. In this simulation, we’ve increased the average sample size from 25 per cell to 50 per cell.

The true power is 69%, and p-curve estimates 68%. This is starting to feel familiar.

3) But what if the null is true for some studies?
In real life, many p-curves will include a few truly null effects that are nevertheless significant (i.e., false-positives).  Let’s now analyze 25 studies, including 5 truly null effects (d=0) that were false-positively significant.

The true power is 56%, and p-curve estimates 57%. This is continuing to feel familiar.

4) But what if sample size and effect size are not symmetrically distributed?

Maybe p-curve only works when sample size and effect size are (unrealistically) symmetrically distributed. Let’s try changing that. First we skew the sample size, then we skew the effect size:

The true powers are 58% and 60%, and p-curve estimates 59% and 61%. This is persisting in feeling familiar.

5) But what if all studies are highly powered?
Let’s go back to the first simulation and increase the average sample size to 100 per cell.

The true power is 93%, and p-curve estimates 94%. It is clear that heterogeneity does not break or bias p-curve. On the contrary, p-curve does very well in the presence of heterogeneous effect sizes.

So why have others proposed that p-curve is biased in the presence of heterogeneous effects?

Reason 1:  Different definitions of p-curve’s goal.
van Aert, Wicherts, & van Assen (2016, .pdf) write that p-curve “overestimat[es] effect size under moderate-to-large heterogeneity” (abstract). McShane, Bockenholt, & Hansen (2016, .pdf) write that p-curve “falsely assume[s] homogeneity […] produc[ing] upward[ly] biased estimates of the population average effect size.” (p.736).

We believe that the readers of those papers would be very surprised by the results we depict in the figures above. How can we reconcile our results with what these authors are claiming?

The answer is that the authors of those papers assessed how well p-curve estimated something different from what it estimates (and what we have repeatedly stated that it estimates).

They assessed how well p-curve estimated the average effect sizes of all studies that could be conducted on the topic under investigation. But p-curve informs us “only” about the studies included in p-curve [4].

Imagine that an effect is much stronger for American than for Ukrainian participants. For simplicity, let’s say that all the Ukrainian studies are non-significant and thus excluded from p-curve, and that all the American studies are p<.05 and thus included in p-curve.

P-curve would recover the true average effect of the American studies. Those arguing that p-curve is biased are saying that it should recover the average effect of both the Ukrainian and American studies, even though no Ukrainian study was included in the analysis [5].

To be clear, these authors are not particularly idiosyncratic in their desire to estimate “the” overall effect.  Many meta-analysts write their papers as if that’s what they wanted to estimate. However…

•  We don’t think that the overall effect exists in psychology (DataColada[33]).
•  We don’t think that the overall effect is of interest to psychologists (DataColada[33]).
•  And we know of no tool that can credibly estimate it.

In any case, as a reader, here is your decision:
If you want to use p-curve analysis to assess the evidential value or the average power of a set of statistically significant studies, then you can do so without having to worry about heterogeneity [6].

If you instead want to assess something about a set of studies that are not analyzed by p-curve, including studies never observed or even conducted, do not run p-curve analysis. And good luck with that.

Reason 2: Outliers vs heterogeneity
Uli Schimmack, in a working paper (.pdf), reports that p-curve overestimates statistical power in the presence of heterogeneity. Just like us, and unlike the previously referenced authors, he is looking only at the studies included in p-curve. Why do we get different results?

It will be useful to look at a concrete simulation he has proposed, one in which p-curve does indeed do poorly (R Code):

Although p-curve overestimates power in this scenario, the culprit is not heterogeneity, but rather the presence of outliers, namely several extremely highly powered studies. To see this let’s look at similarly heterogeneous studies, but ones in which the maximum power is 80% instead of 100%.

In a nutshell, the overestimation with outliers occurs because power is a bounded variable, but p-curve estimates it based on an unbounded latent variable (the noncentrality parameter). It’s worth keeping in mind that a single outlier does not greatly bias p-curve. For example, if 20 studies are powered on average to 50%, adding one study powered to 95% increases true average power to 52% and p-curve’s estimate to just 54%.
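
Here is a stylized illustration of that bounded-vs-unbounded point, using a simple normal approximation rather than p-curve’s actual estimator: converting the average noncentrality parameter into power overshoots the average of the individual studies’ power, but only slightly when there is a single outlier.

z_crit <- qnorm(.975)                                     # two-sided test, alpha = .05
power_from_ncp <- function(ncp) 1 - pnorm(z_crit - ncp)   # approximate power, ignoring the tiny lower tail
ncp_50 <- z_crit                                          # a study with ~50% power has ncp equal to z_crit
ncp_95 <- z_crit + qnorm(.95)                             # a study with ~95% power
ncps <- c(rep(ncp_50, 20), ncp_95)                        # 20 studies at 50% power plus one 95% outlier
mean(power_from_ncp(ncps))                                # true average power: ~.52
power_from_ncp(mean(ncps))                                # power implied by the average ncp: ~.53, slightly too high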

This problem that Uli has identified is worth taking into account, and perhaps p-curve can be modified to prevent such bias [7]. But it is worth keeping in mind that this situation should be rare, as few literatures contain both (1) a substantial number of studies powered over 90% and (2) a substantial number of under-powered studies. Moreover, this is a somewhat inconsequential mistake. All it means is that p-curve will exaggerate how strong a truly (and obviously) strong literature actually is.

In Summary
•  P-curve is not biased by heterogeneity.
•  It is biased upwards in the presence of both (1) low-powered studies and (2) a large share of extremely highly powered studies.
•  P-curve tells us about the study designs it includes, not the study designs it excludes.


Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted all 7 authors of the three methods papers discussed above. Uli Schimmack declined to comment. Karsten Hansen and Blake McShane provided suggestions that led us to more precisely describe their analyses and to describe more detailed analyses in Footnote 5. Though our exchange with Karsten and Blake started and ended with, in their words, “fundamental disagreements about the nature of evaluating the statistical properties of an estimator,” the dialogue was friendly and constructive. We are very grateful to them, both for the feedback and for the tone of the discussion. (Interestingly, we disagree with them about the nature of our disagreement: we don’t think we disagree about how to examine the statistical properties of an estimator, but rather, about how to effectively communicate methods issues to a general audience).  Marcel van Assen, Robbie van Aert, and Jelte Wicherts disagreed with our belief that readers of their paper would be surprised by how well p-curve recovers average power in the presence of heterogeneity (as they think their paper explains this as well). Specifically, like us, they think p-curve performs well when making inferences about studies included in p-curve, but, unlike us, they think that readers of their paper would realize this. They are not persuaded by our arguments that the population effect size does not exist and is not of interest to psychologists, and they are troubled by the fact that p-curve does not recover this effect. They also proposed that an important share of studies may indeed have power near 100% (citing this paper: .htm). We are very grateful to them for their feedback and collegiality as well.



  1. P-curve can also be used to estimate average effect size rather than power (and, as Blake McShane and Karsten Hansen pointed out to us, when used in this fashion p-curve is virtually equivalent to the maximum likelihood procedure proposed by Hedges in 1984 (.pdf) ).  Here we focus on power rather than effect size because we don’t think “average effect size” is meaningful or of interest when aggregating across psychology experiments with different designs (see Colada[33]). Moreover, whereas power calculations only require that one knows the results of the test statistic of interest (e.g., F(1,230)=5.23), effect size calculations require one to also know how the study was defined, a fact that renders effect size estimations much more prone to human error (see page 676 of our article on p-curve and effect size estimation (.pdf) ). In any case, the point that we make in this post applies at least as strongly to an analysis of effect size as it does to an analysis of power: p-curve correctly recovers the true average effect size of the studies that it analyses, even when those studies contain different (i.e., heterogeneous) effect sizes. See Figure 2c in our article on p-curve and effect size estimation (.pdf) and Supplement 2 of that same paper (.pdf) []
  2. In real life, researchers are probably more likely to collect larger samples when studying smaller effects (see Colada[58]). This would necessarily reduce heterogeneity in power across studies []
  3. To do this, we changed the blue histogram from d~N(.5,.05) to d~N(.5,.15). []
  4. We have always been transparent about this. For instance, when we described how to use p-curve for effect size estimation (.pdf) we wrote, “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve.” (p.667). []
  5. For a more quantitative example check out Supplement 2 (.pdf) of our p-curve and effect size paper. In the middle panel of Figure S2, we consider a scenario in which a researcher attempts to run an equal number of studies (with n = 20 per cell) testing either an effect size of d = .2 or an effect size of d = .6. Because it is necessarily easier to get significance when the effect size is larger than when the effect size is smaller, the share of significant d = .6 studies will necessarily be greater than the share of significant d = .2 studies, and thus p-curve will include more d = .6 studies than d = .2 studies. Because the d = .6 studies will be over-represented among all significant studies, the true average effect of the significant studies will be d = .53 rather than d = .4. P-curve correctly recovers this value (.53), but it is biased upwards if we expect it to guess d = .4. For an even more quantitative example, imagine the true average effect is d = .5 with a standard deviation of .2. If we study this with many n=20 studies, the average observed significant effect will be d = .91, but the true average effect of those studies is d = .61, which is the number that p-curve would recover. It would not recover the true mean of the population (d = .5) but rather the true mean of the studies that were statistically significant (d = .61). In simulations, the true mean is known and this might look like a bias. In real life, the true mean is, well, meaningless, as it depends on arbitrary definitions of what constitutes the true population of all possible studies (R Code). []
  6. Again, for both practical and conceptual reasons, we would not advise you to estimate the average effect size, regardless of whether you use p-curve or any other tool. But this has nothing to do with the supposed inability of p-curve to handle heterogeneity. See footnote 1. []
  7. Uli has proposed using z-curve, a tool he developed, instead of p-curve. While z-curve does not seem to be biased in scenarios with many studies with extremely high power, it performs worse than p-curve in almost all other scenarios. For example, in the examples depicted graphically in this post, z-curve’s expected estimates are about 4 times further from the truth than are p-curve’s. []

[66] Outliers: Evaluating A New P-Curve Of Power Poses

In a forthcoming Psych Science paper, Cuddy, Schultz, & Fosse, hereafter referred to as CSF, p-curved 55 power-posing studies (.pdf | SSRN), concluding that they contain evidential value [1]. Thirty-four of those studies were previously selected and described as “all published tests” (p. 657) by Carney, Cuddy, & Yap (2015; .pdf). Joe and Uri p-curved those 34 studies and concluded that they lacked evidential value (.pdf | Colada[37]). The two p-curve analyses – Joe & Uri’s old p-curve and CSF’s new p-curve – arrive at different conclusions not because the different sets of authors used different sets of tools, but rather because they used the same tool to analyze different sets of data.

In this post we discuss CSF’s decision to include four studies with unusually small p-values (e.g., p < 1 in a quadrillion) in their analysis. The inclusion of these studies was sufficiently problematic that we stopped further evaluating their p-curve. [2].

Several papers have replicated the effect of power posing on feelings of power and, as Joe and Uri reported in their Psych Science paper (.pdf, pg.4), a p-curve of those feelings-of-power effects suggests they contain evidential value. CSF interpret this as a confirmation of the central power-posing hypothesis, whereas we are reluctant to interpret it as such for reasons that are both psychological and statistical. Fleshing out the arguments on both sides may be interesting, but it is not the topic of this post.

Evaluating p-curves
Evaluating any paper is time consuming and difficult. Evaluating a p-curve paper – which is in essence, a bundle of other papers – is necessarily more time consuming and more difficult.

We have, over time, found ways to do it more efficiently. We begin by preliminarily assessing three criteria. If the p-curve fails any of these criteria, we conclude that it is invalid and stop evaluating it. If the p-curve passes all three criteria, we evaluate the p-curve work more thoroughly.

Criterion 1: Study Selection Rule
Our first step is to verify that the authors followed a clear and reproducible study selection rule. CSF did not. That’s a problem, but it is not the focus of this post. Interested readers can check out this footnote: [3].

Criterion 2: Test Selection
Figure 4 (.pdf) in our first p-curve paper (SSRN) explains and summarizes which tests to select from the most common study designs. The second thing we do when evaluating a p-curve paper is to verify that the guidelines were followed by focusing on the subset of designs that are most commonly incorrectly treated by p-curvers. For example, we look at interaction hypotheses to make sure that the right test is included, and we look to see whether omnibus tests are selected (they should almost never be; see Colada[60]). CSF selected some incorrect test results (e.g., their smallest p-value comes from an omnibus test). See “Outlier 1” below.

Criterion 3. Outliers
Next we sort studies by p-value to identify possible outliers, and we carefully read the papers containing an outlier result. We do this both because outliers exert a disproportionate effect on the results of p-curve, and because outliers are much more likely to represent the erroneous inclusion of a study or the erroneous selection of a test result. This post focuses on outliers.

This figure presents the distribution of p-values in CSF’s p-curve analysis (see their disclosure table .xlsx). As you can see, there are four outliers:

Outlier 1
CSF’s smallest p-value is from F(7, 140) = 19.47, approximately p = .00000000000000002, or 1 in 141 quadrillion. It comes from a 1993 experiment published in the journal The Arts in Psychotherapy (.pdf).
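
For reference, a p-value of that magnitude can be recovered directly from the reported test statistic; in R, for example:

pf(19.47, df1 = 7, df2 = 140, lower.tail = FALSE)   # upper-tail p-value of the reported omnibus F(7, 140) = 19.47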

In this within-subject study (N = 24), each participant held three “open” and three “closed” body poses. At the beginning of the study, and then again after every pose, they rated themselves on eight emotions. The descriptions of the analyses are insufficiently clear to us (and to colleagues we sent the paper to), but as far as we can tell, the following things are true:

(1) Some effects are implausibly large. For example, Figure 1 in their paper (.pdf) suggests that the average change in happiness for those adopting the “closed” postures was ~24 points on a 0-24 scale. This could occur only if every participant was maximally happy at baseline and then maximally miserable after adopting every one of the 3 closed postures.

(2) The statistical analyses incorrectly treat multiple answers by the same participants as independent, across emotions and across poses.

(3) The critical test of an interaction between emotion valence and pose is not reported. Instead the authors report only an omnibus interaction: F(7, 140) = 19.47. Given the degrees-of-freedom of the test, we couldn’t figure out what hypothesis this analysis was testing, but regardless, no omnibus test examines the directional hypothesis of interest. Thus, it should not be included in a p-curve analysis.

Outlier 2
CSF’s second smallest p-value is from F(1,58)=85.9,  p = .00000000005, or 1 in 2 trillion. It comes from a 2016 study published in Biofeedback Magazine (.pdf). In that study, 33 physical therapists took turns in dyads, with one of them (the “tester”) pressing down on the other’s arm, and the other (the “subject”) attempting to resist that pressure.

The p-value selected by CSF compares subjective arm strength when the subject is standing straight (with back support) vs. slouching (without support). As the authors of the original article explain, however, that has nothing to do with any psychological consequences of power posing, but rather, with its mechanical consequences. In their words: “Obviously, the loss of strength relates to the change in the shoulder/body biomechanics and affects muscle activation recorded from the trapezius and medial and anterior deltoid when the person resists the downward applied pressure” (p. 68-69; emphasis added) [4].

Outlier 3
CSF’s third smallest p-value is from F(1,68)=26.25, p = .00000267, or 1 in ~370,000. It comes from a 2014 study published in Psychology of Women Quarterly (.pdf).

This paper explores two main hypotheses, one that is quite nonintuitive, and one that is fairly straightforward. The nonintuitive hypothesis predicts, among other things, that women who power pose while sitting on a throne will attempt more math problems when they are wearing a sweatshirt but fewer math problems when they are wearing a tank-top; the prediction is different for women sitting in a child’s chair instead of a throne [5].

CSF chose the p-value for the straightforward hypothesis, the prediction that people experience fewer positive emotions while slouching (“allowing your rib cage to drop and your shoulders to rotate forward”) than while sitting upright (“lifting your rib cage forward and pull[ing] your shoulders slightly backwards”).

Unlike the previous two outliers, one might be able to validly include this p-value in p-curve. But we have reservations, both about the inclusion of this study, and about the inclusion of this p-value.

First, we believe most people find power posing interesting because it affects what happens after posing, not what happens while posing. For example, in our opinion, the fact that slouching is more uncomfortable than sitting upright should not be taken as evidence for the power poses hypothesis.

Second, while the hypothesis is about mood, this study’s dependent variable is a principal component that combines mood with various other theoretically irrelevant variables that could be driving the effect, such as how “relaxed” or “amused” the participants were. We discuss two additional reservations in this footnote: [6].

Outlier 4
CSF’s fourth smallest p-value is from F(2,44)=13.689, p=.0000238, or 1 in 42,000. It comes from a 2015 study published in the Mediterranean Journal of Social Sciences (.pdf). Fifteen male Iranian students were all asked to hold the same pose for almost the entirety of each of nine 90-minute English instruction sessions, varying across sessions whether it was an open, ordinary, or closed pose. Although the entire class was holding the same position at the same time, and evaluating their emotions at the same time, and in front of all other students, the data were analyzed as if all observations were independent, artificially reducing the p-value.

Given how difficult and time consuming it is to thoroughly review a p-curve analysis or any meta-analysis (e.g., we spent hours evaluating each of the four studies discussed here), we preliminarily rely on three criteria to decide whether a more exhaustive evaluation is even warranted. CSF’s p-curve analysis did not satisfy any of the criteria. In its current form, their analysis should not be used as evidence for the effects of power posing, but perhaps a future revision might be informative.


Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted CSF and the authors of the four studies we reviewed.

Amy Cuddy responded to our email, but did not discuss any of the specific points we made in our post, or ask us to make any specific changes. Erik Peper, lead author of the second outlier study, helpfully noticed that we had the wrong publication date and briefly mentioned several additional articles of his own on how slouched positions affect emotions, memory, and energy levels (.pdf; .pdf; .pdf; html; html). We also received an email from the second author of the first outlier study; he had “no recommended changes.” He suggested that we try to contact the lead author but we were unable to find her current email address.



  1. When p-curve concludes that there is evidential value, it is simply saying that at least one of the analyzed findings was unlikely to have arisen from the mere combination of random noise and selective reporting. In other words, at least one of the studies would be expected to repeatedly replicate []
  2. After reporting the overall p-curve, CSF also split the 55 studies based on the type of dependent variable: (i) feelings of power, (ii) EASE (“Emotion, Affect, and Self-Evaluation”), and (iii) behavior or hormonal response (non-EASE). They find evidential value for the first two, but not the last. The p-curve for EASE includes all four of the studies described in this post. []
  3. To ensure that studies are not selected in a biased manner, and more generally to help readers and reviewers detect possible errors, the set of studies included in p-curve must be determined by a predetermined rule. The rule, in turn, should be concrete and precise enough that an independent set of researchers following the rule would generate the same, or virtually the same, set of studies. The rule, as described in CSF’s paper, lacks the requisite concreteness and precision. In particular, the paper lists 24 search terms (e.g., “power”, “dominance”) that were combined (but the combinations are not listed). The resulting hits were then “filter[ed] out based on title, then abstract, and then the study description in the full text” in an unspecified manner. (Supplement: https://osf.io/5xjav/ | our archived copy .txt). In sum, though the authors provide some information about how they generated their set of studies, neither the search queries nor the filters are specified precisely enough for someone else to reproduce them. Joe and Uri’s p-curve, on the other hand, followed a reproducible study selection rule: all studies that were cited by Carney et al. (2015) as evidence for power posing. []
  4. The paper also reports that upright vs. collapsed postures may affect emotions and the valence of memories, but these claims are supported by quotations rather than by statistics. The one potential exception is that the authors report a “negative correlation between perceived strength and severity of depression (r=-.4).” Given the sample size of the study, this is indicative of a p-value in the .03-.05 range. The critical effect of pose on feelings, however, is not reported. []
  5. The study (N = 80) employed a fully between-subjects 2(self-objectification: wearing a tanktop vs. wearing a sweatshirt) x 2(power/status: sitting in a “grandiose, carved wooden decorative antique throne” vs. a “small wooden child’s chair from the campus day-care facility”) x 2(pose: upright vs slumped) design. []
  6. First, for robustness, one would need to include in the p-curve the impact of posing on negative mood, which is also reported in the paper and which has a considerably larger p-value (F = 13.76 instead of 26.25). Second, the structure of the experiment is very complex, involving a three-way interaction which in turn hinges on a two-way reversing interaction and a two-way attenuated interaction. It is hard to know if the p-value distribution of the main effect is expected to be uniform under the null (a requirement of p-curve analysis) when the researcher is interested in these trickle-down effects. For example, it is hard to know whether p-hacking the attenuated interaction effect would cause the p-value associated with the main effect to be biased downwards. []

[65] Spotlight on Science Journalism: The Health Benefits of Volunteering

I want to comment on a recent article in the New York Times, but along the way I will comment on scientific reporting as well. I think that science reporters frequently fall short in assessing the evidence behind the claims they relay, but as I try to show, assessing evidence is not an easy task. I don’t want scientists to stop studying cool topics, and I don’t want journalists to stop reporting cool findings, but I will suggest that they should make it commonplace to get input from uncool data scientists and statisticians.

Science journalism is hard. Those journalists need to maintain a high level of expertise in a wide range of domains while being truly exceptional at translating that content in ways that are clear, sensible, and accurate. For example, it is possible that Ed Yong couldn’t run my experiments, but I certainly couldn’t write his articles. [1]

I was reminded about the challenges of science journalism when reading an article about the health benefits of being a volunteer. The journalist, Nicole Karlis, seamlessly connects interviews with recent victims, interviews with famous researchers, and personal anecdotes.

The article also cites some evidence in the form of three scientific findings. Like the journalist, I am not an expert in this area. The journalist’s profession requires her to float above the ugly complexities of the data, whereas my career is spent living amongst (and contributing to) those complexities. So I decided to look at those three papers.

OK, here are those references (the first two come together):

If you would like to see those articles for yourself, they can be found here (.html) and here (.html).

First the blood pressure finding. The original researchers analyze data from a longitudinal panel of 6,734 people who provided information about their volunteering and had their blood pressure measured. After adding a number of control variables [2], they look to see if volunteering has an influence on blood pressure. OK, how would you do that? 40.4% of respondents reported some volunteering. Perhaps they could be compared to the remaining 59.6%? Or perhaps there is a way to look at whether more hours volunteered predict lower blood pressure? The point is, there are a few ways to think about this. The authors found a difference only when comparing non-volunteers to the category of people who volunteered 200 hours or more. Their report:

“In a regression including the covariates, hours of volunteer work were related to hypertension risk (Figure 1). Those who had volunteered at least 200 hours in the past 12 months were less likely to develop hypertension than non-volunteers (OR=0.60; 95% CI:0.40–0.90). There was also a decrease in hypertension risk among those who volunteered 100–199 hours; however, this estimate was not statistically reliable (OR=0.78; 95% CI=0.48–1.27). Those who volunteered 1–49 and 50–99 hours had hypertension risk similar to that of non-volunteers (OR=0.95; 95% CI: 0.68–1.33 and OR=0.96; 95% CI: 0.65–1.41, respectively).”

So what I see is some evidence that is somewhat suggestive of the claim, but it is not overly strong. The 200-hour cut-off is arbitrary, and the effect is not obviously robust to other specifications. I am worried that we are seeing researchers choosing their favorite specification rather than the best specification. So, suggestive perhaps, but I wouldn’t be ready to cite this as evidence that volunteering is related to improved blood pressure.

The second finding is “volunteering is linked to… decreased mortality rates.” That paper analyzes data from a different panel of 10,317 people who report their volunteer behavior and whose deaths are recorded. Those researchers convey their finding in the following figure:

So first, that is an enormous effect. People who volunteered were about 50% less likely to die within four years. Taken at face value, that would suggest an effect seemingly on the order of normal person versus smoker + drives without a seatbelt + crocodile-wrangler-hobbyist. But recall that this is observational data and not an experiment, so we need to be worried about confounds. For example, perhaps the soon-to-be-deceased also lack the health to be volunteers? The original authors have that concern too, so they add some controls. How did that go?

That is not particularly strong evidence. The effects are still directionally right, and many statisticians would caution against focusing on p-values… but still, that is not overly compelling. I am not persuaded. [3]

What about the third paper referenced?

That one can be found here (.html).

Unlike the first two papers, that is not a link to a particular result, but rather to a preregistration. Readers of this blog are probably familiar with them: preregistrations are the time-stamped analysis plans that researchers write before they ever collect any data. Preregistrations – in combination with experimentation – eliminate some of the concerns about selective reporting that inevitably follow other studies. We are huge fans of preregistration (.html, .html, .html). So I went and found the preregistered primary outcome on page 8:

Perfect. That outcome is (essentially) one of those mentioned in the NY Times. But things got more difficult for me at that point. This intervention was an enormous undertaking, with many measures collected over many years. Accordingly, though the primary outcome was specified here, a number of follow-up papers have investigated some of those alternative measures and analyses. In fact, the authors anticipate some of that by saying “rather than adjust p-values for multiple comparison, p-values will be interpreted as descriptive statistics of the evidence, and not as absolute indicators for a positive or negative result.” (p. 13). So they are saying that, outside of the mobility finding, p-values shouldn’t be taken quite at face value. This project has led to some published papers looking at the influence of the volunteerism intervention on school climate, Stroop performance, and hippocampal volume, amongst others. But the primary outcome – mobility – appears to be reported here (.html). [4]. What do they find?

Well, we have the multiple comparison concern again – whatever difference exists is only found at 24 months, but mobility has been measured every four months up until then. Also, this is only for women, whereas the original preregistration made no such specification. What happened to the men? The authors say, “Over a 24-month period, women, but not men, in the intervention showed increased walking activity compared to their sex-matched control groups.” So the primary outcome appears not to have been supported. Nevertheless, making interpretation a little challenging, the authors also say, “the results of this study indicate that a community-based intervention that naturally integrates activity in urban areas may effectively increase physical activity.” Indeed, it may, but it also may not. These data are not sufficient for us to make that distinction.

That’s it. I see three findings, all of which are intriguing to consider, but none of which are particularly persuasive. The journalist, who presumably has been unable to read all of the original sources, is reduced to reporting their claims. The readers, who are even more removed, take the journalist’s claims at face value: “if I volunteer then I will walk around better, lower my blood pressure, and live longer. Sweet.”

I think that we should expect a little more from science reporting. It might be too much for every journalist to dig up every link, but perhaps they should develop a norm of collecting feedback from those people who are informed enough to consider the evidence, but far enough outside the research area to lack any investment in a particular claim. There are lots of highly competent commentators ready to evaluate evidence independent of the substantive area itself.

There are frequent calls for journalists to turn away from the surprising and uncertain in favor of the staid and uncontroversial. I disagree – surprising stories are fun to read. I just think that journalists should add an extra level of scrutiny to ensure that we know that the fun stories are also true stories.


Author Feedback.
I shared a draft of this post with the contact author for each of the four papers I mention, as well as the journalist who had written about them. I heard back from one, Sara Konrath, who had some helpful suggestions including a reference to a meta-analysis (.html) on the topic.




  1. Obviously Mr. Yong could run my experiments better than me also, but I wanted to make a point. At least I can still teach college students better than him though. Just kidding, he would also be better at that. []
  2. Average systolic blood pressure (continuous), average diastolic blood pressure (continuous), age (continuous), sex, self-reported race (Non-Hispanic White, Non-Hispanic Black, Hispanic, Non-Hispanic Other), education (less than high school, General Equivalency Diploma [GED], high school diploma, some college, college and above), marital status (married, annulled, never married, divorced, separated, widowed), employment status (employed/not employed), and self-reported history of diabetes (yes/no), cancer (yes/no), heart problems (yes/no), stroke (yes/no), or lung problems (yes/no). []
  3. It is worth noting that this paper, in particular, goes on to consider the evidence in other interesting ways. I highlight this portion because it was the fact being cited in the NYT article. []
  4. I think. It is really hard for me, as a novice in this area, to know if I have found all of the published findings from this original preregistration. If there is a different mobility finding elsewhere I couldn’t find it, but I will correct this post if it gets pointed out to me. []

[64] How To Properly Preregister A Study

P-hacking, the selective reporting of statistically significant analyses, continues to threaten the integrity of our discipline. P-hacking is inevitable whenever (1) a researcher hopes to find evidence for a particular result, (2) there is ambiguity about how exactly to analyze the data, and (3) the researcher does not perfectly plan out his/her analysis in advance. Although some mistakenly believe that accusations of p-hacking are tantamount to accusations of cheating, the truth is that accusations of p-hacking are nothing more than accusations of imperfect planning.

The best way to address the problem of imperfect planning is to plan more perfectly: to preregister your studies. Preregistrations are time-stamped documents in which researchers specify exactly how they plan to collect their data and to conduct their key confirmatory analyses. The goal of a preregistration is to make it easy to distinguish between planned, confirmatory analyses – those for which statistical significance is meaningful – and unplanned exploratory analyses – those for which statistical significance is not meaningful [1]. Because a good preregistration prevents researchers from p-hacking, it also protects them from suspicions of p-hacking [2].

In the past five years or so, preregistration has gone from being something that no psychologists did to something that many psychologists are doing. In our view, this wonderful development is the biggest reason to be optimistic about the future of our discipline.

But if preregistration is going to be the solution, then we need to ensure that it is done right. After casually reviewing several recent preregistration attempts in published papers, we noticed that there is room for improvement. We saw two kinds of problems.

Problem 1. Not enough information
For example, we saw one “preregistration” that was simply a time-stamped abstract of the project; it contained almost no details about how data were going to be collected and analyzed. Others failed to specify one or more critical aspects of the analysis: sample size, rules for exclusions, or how the dependent variable would be scored (in a case for which there were many ways to score it). These preregistrations are time-stamped, but they lack the other critical ingredient: precise planning.

To decide which information to include in your preregistration, it may be helpful to imagine a skeptical reader of your paper. Let’s call him Leif. Imagine that Leif is worried that p-hacking might creep into the analyses of even the best-intentioned researchers. The job of your preregistration is to set Leif’s mind at ease [3]. This means identifying all of the ways you could have p-hacked – choosing a different sample size, or a different exclusion rule, or a different dependent variable, or a different set of controls/covariates, or a different set of conditions to compare, or a different data transformation – and including all of the information that lets Leif know that these decisions were set in stone in advance. In other words, your job is to prevent Leif from worrying that you tried to run your critical analysis in more than one way.

This means that your preregistration needs to be sufficiently exhaustive and sufficiently specific. If you say, “We will exclude participants who are distracted,” Leif could think, “Right, but distracted how? Did you define ‘distracted’ in advance?” It is better to say, “We will exclude participants who incorrectly answered at least 2 out of our 3 comprehension checks.” If you say, “We will measure happiness,” Leif could think, “Right, but aren’t there a number of ways to measure it? I wonder if this was the only one they tried or if it was just the one they most wanted to report after the data came in?” So it’s better to say, “Our dependent variable is happiness, which we will measure by asking people ‘How happy do you feel right now?’ on a scale ranging from 1 (not at all happy) to 7 (extremely happy).”

If including something in a preregistration would make Leif less likely to wonder whether you p-hacked, then include it.

Problem 2. Too much information
A preregistration cannot allow readers and reviewers to distinguish between confirmatory and exploratory analyses if it is not easy to read or understand. Thus, a preregistration needs to be easy to read and understand. This means that it should contain only the information that is essential for the task at hand. We have seen many preregistrations that are just too long, containing large sections on theoretical background and on exploratory analyses, or lots of procedural details that on the one hand will definitely be part of the paper, and on the other, are not p-hackable. Don’t forget that you will publish the paper also, not just the preregistration; you don’t need to say in the preregistration everything that you will say in the paper. A hard-to-read preregistration makes preregistration less effective [4].

To decide which information to exclude from your preregistration, you can again imagine that a skeptical Leif is reading your paper, but this time you can ask, “If I leave this out, will Leif be more concerned that my results are attributable to p-hacking?”

For example, if you leave out the literature review from your preregistration, will Leif now be more concerned? Of course not, as your literature review does not affect how much flexibility you have in your key analysis. If you leave out how long people spent in the lab, how many different RAs you are using, why you think your hypothesis is interesting, or the description of your exploratory analyses, will Leif be more concerned? No, because none of those things affect the fact that your analyses are confirmatory.

If excluding something from a preregistration would not make Leif more likely to wonder whether you p-hacked, then you should exclude it.

The Takeaway
Thus, a good preregistration needs to have two features:

  1. It needs to specify exactly how the key confirmatory analyses will be conducted.
  2. It needs to be short and easy to read.

We designed AsPredicted.org with these goals in mind. The website poses a standardized set of questions asking you only to include what needs to be included, thus also making it obvious what does not need to be. The OSF offers lots of flexibility, but they also offer an AsPredicted template here: https://osf.io/fnsb/ [5].

Still, even on AsPredicted, it is possible to get it wrong, by, for example, not being specific enough in your answers to the questions it poses. This table provides an example of improper and proper answers to these questions.

Regardless of where your next preregistration is hosted, make it a good preregistration by including what should be included and excluding what should be excluded.



We would like to thank Stephen Lindsay and Simine Vazire for taking time out of their incredibly busy schedules to give us invaluable feedback on a previous version of this post.



  1. This is because conducting unplanned analyses necessarily inflates the probability that you will find a statistically significant relationship even if no relationship exists. []
  2. For good explanations of the virtues of preregistration, see Lindsay et al. (2016) <.html>, Moore (2016) <.pdf>, and van’t Veer & Giner-Sorolla (2016) <.pdf>. []
  3. Contrary to popular belief, the job of your pre-registration is NOT to show that your predictions were confirmed. Indeed, the critical aspect of pre-registration is not the prediction that you register – many good preregistrations pose questions (e.g., “We are testing whether eating Funyuns cures cancer”) rather than hypotheses (e.g., “We hypothesize that eating Funyuns cures cancer”) – but the analysis that you specify. In hindsight, perhaps our preregistration website should have been called AsPlanned rather than AsPredicted, although AsPredicted sounds better. []
  4. Even complex studies should have a simple and clear preregistration, one that allows a reader to casually differentiate between confirmation and exploration. Additional complexities could potentially be captured in other secondary planning documents, but because these are far less likely to be read, they shouldn’t obscure the core basics of the simple preregistration. []
  5. We recently updated the AsPredicted questions, and so this OSF template contains slightly different questions than the ones currently on AsPredicted. We advise readers who wish to use the OSF to answer the questions that are currently on https://AsPredicted.org. []

[63] “Many Labs” Overestimated The Importance of Hidden Moderators

Are hidden moderators a thing? Do experiments intended to be identical lead to inexplicably different results?

Back in 2014, the “Many Labs” project (.pdf) reported an ambitious attempt to answer these questions. More than 30 different labs ran the same set of studies and the paper presented the results side-by-side. They did not find any evidence that hidden moderators explain failures to replicate, but did conclude that hidden moderators play a large role in studies that do replicate.

Statistically savvy observers now cite the Many Labs paper as evidence that hidden moderators, “unobserved heterogeneity”, are a big deal. For example, McShane & Böckenholt (.pdf) cite only the Many Labs paper to justify this passage: “While accounting for heterogeneity has long been regarded as important in meta-analyses of sets of studies that consist of […] [conceptual] replications, there is mounting evidence […] this is also the case [with] sets of [studies] that use identical or similar materials.” (p.1050)

Similarly, van Aert, Wicherts, and van Assen (.pdf) conclude heterogeneity is something to worry about in meta-analysis by pointing out that “in 50% of the replicated psychological studies in the Many Labs Replication Project, heterogeneity was present” (p.718).

In this post I re-analyze the Many Labs data and conclude the authors substantially over-estimated the importance of hidden moderators in their data.

Aside:  This post was delayed a few weeks because I couldn’t reproduce some results in the Many Labs paper.  See footnote for details [1].

How do you measure hidden moderators?
In meta-analysis one typically tests for the presence of hidden moderators, “unobserved heterogeneity”, by comparing how much the dependent variable jumps around across studies to how much it jumps around within studies (this is analogous to ANOVA, if that helps). Intuitively, when the differences are bigger across studies than within, we conclude that there is a hidden moderator across studies.

This was the approach taken by Many Labs [2]. Specifically, they reported a statistic called I2 for each study. I2 measures the percentage of the variation across studies that is surprising, i.e., beyond what chance alone would produce. For example, if a meta-analysis has I2=40%, then 60% of the observed differences across studies are attributed to chance, and 40% are attributed to hidden moderators.

Aside: in my opinion the I2 is kind of pointless.
We want to know if heterogeneity is substantial in absolute, not relative, terms. No matter how inconsequential a hidden moderator is, as you increase the sample size of the underlying studies, you will decrease the variation due to chance, and thus increase I2. Any moderator, no matter how small, can approach I2=100% with a big enough sample. To me, saying that a particular design is rich in heterogeneity because it has I2=60% is a lot like saying that a particular person is rich because she has something that is 60% gold, without specifying if the thing is her earring or her life-sized statue of Donald Trump. But I don’t know very much about gold. (One could report, instead, the estimated standard deviation of effect size across studies, ‘tau’ / τ.) Most meta-analyses rely on I2, so this is in no way a criticism of the Many Labs paper in particular.

Criticisms of I2 along these lines have been made before, see e.g., Rucker et al (.pdf) or Borenstein et al (.pdf).
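To make the point concrete, here is a minimal simulation sketch (mine, not from the original post, using the metafor package and made-up numbers): a moderator that is tiny in absolute terms, sd(d)=.05 across labs, yields an I2 near 0% when each lab is small, but pushes I2 toward 100% as the labs get huge, even though tau barely moves.

    # Sketch (not from the original post): tiny absolute heterogeneity, large I2
    library(metafor)
    set.seed(1)
    k   <- 30                                   # number of labs
    tau <- .05                                  # true sd of d across labs (tiny)
    for (n in c(50, 500, 5000)) {               # participants per cell in each lab
      vi  <- rep(2 / n, k)                      # approximate sampling variance of d
      yi  <- rnorm(k, mean = .3, sd = sqrt(tau^2 + vi))   # observed d in each lab
      res <- rma(yi = yi, vi = vi)              # random-effects meta-analysis
      cat("n per cell =", n,
          "| I2 =", round(res$I2, 1), "%",
          "| tau-hat =", round(sqrt(res$tau2), 3), "\n")
    }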

I2 in Many Labs
Below is a portion of Table 3 in their paper:

For example, the first row shows that for the first anchoring experiment, 59.8% of the variation across labs is being attributed to chance, and the remaining 40.2% to some hidden moderator.

Going down the table, we see that just over half the studies have a significant I2. Notably, four of these involve questions where participants are given an anchor and then asked to generate an open-ended numerical estimate.

For some time I have wondered whether the strange distributions of responses that one tends to get with anchoring questions (bumpy, skewed, some potentially huge outliers; see histograms .png), or the data cleaning that such responses led the Many Labs authors to perform, may have biased upward the apparent role of hidden moderators for these variables.

For this post I looked into it, and it seems like the answer is: ay, squared.

Shuffling data.
In essence, I wanted to answer the question: “If there were no hidden moderators whatsoever in the anchoring questions in Many Labs, how likely would it be that the authors would find (false-positive) evidence for them anyway?”

To answer this question I could run simulations with made up data. But because the data were posted, there is a much better solution: run simulations with the real data.   Importantly, I run the simulations “under the null,” where the data are minimally modified to ensure there is in fact no hidden moderator, and then we assess if I2 manages to realize that (it does not). This is essentially a randomization/reallocation/permutation test [3].

The posted “raw” datafile (.csv | 38Mb) has more than 6000 rows, one for each participant, across all labs. The columns have the variables for each of the 13 studies. There is also a column indicating which lab the participant is from. To run my simulations I shuffle that “lab” column. That is, I randomly sort that column, and only that column, keeping everything else in the spreadsheet intact. This creates a placebo lab column which cannot be correlated with any of the effects. With the shuffled column the effects are all, by construction, homogeneous, because each observation is equally likely to originate in any “lab.” This means that variation within lab must be entirely comparable to variation across labs, and thus that the true I2 is equal to zero. When testing for heterogeneity in this shuffled dataset we are asking: are observations randomly labeled “Brian’s Lab” systematically different from those randomly labeled “Fred’s Lab”? Of course not.
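Here is a minimal sketch of one shuffle iteration (my reconstruction, not the R Code linked below; ml, lab, condition, and anchor1 are placeholder names rather than the actual Many Labs column names):

    # One shuffle iteration (sketch; data frame and column names are placeholders)
    library(metafor)
    shuffled_I2 <- function(ml) {
      ml$lab <- sample(ml$lab)                       # shuffle ONLY the lab column
      # per-"lab" cell means, SDs, and ns for one anchoring question
      agg <- do.call(rbind, lapply(split(ml, ml$lab), function(d) {
        hi <- d$anchor1[d$condition == "high"]
        lo <- d$anchor1[d$condition == "low"]
        data.frame(m1 = mean(hi), m2 = mean(lo),
                   sd1 = sd(hi),  sd2 = sd(lo),
                   n1 = length(hi), n2 = length(lo))
      }))
      dat <- escalc(measure = "SMD", m1i = m1, m2i = m2, sd1i = sd1, sd2i = sd2,
                    n1i = n1, n2i = n2, data = agg)
      rma(yi, vi, data = dat)$I2                     # true I2 is 0 by construction
    }
    # i2_dist <- replicate(1000, shuffled_I2(ml))    # distribution of false-positive I2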

This approach sounds kinda cool and kinda out-there. Indeed it is super cool, but it is as old as it gets. It is closely related to permutation tests, which were developed in the 1930s when hypothesis testing was just getting started [4].

I shuffled the dataset, conducted a meta-analysis on each of the four anchoring questions, computed I2, and repeated this 1000 times (R Code).  The first thing we can ask is how many of those meta-analyses led to a significant, false-positive, I2.  How often do they wrongly conclude there is a hidden moderator?

The answer should be, for each of the 4 anchoring questions: “5%”.
But the answers are: 20%, 49%, 47% and 46%.


The figures below show the distributions of I2 across the simulations.

Figure 1: For the 4 anchoring questions, the hidden-moderators test Many Labs used is invalid: high false-positive rate, inflated estimates of heterogeneity (R Code).

For example, for Anchoring 4, we see that the median estimate is that I2=38.4% of the variance is caused by hidden moderators, and 46% of the time it concludes there is statistically significant heterogeneity. Recall, the right answer is that there is zero heterogeneity, and only 5% of the time we should conclude otherwise. [5],[6],[7].

Validating the shuffle.
To reassure readers that the approach I take above is valid, I submitted normally distributed data to the shuffle test. I added three columns of made-up data to the Many Labs spreadsheet: In the first, the true anchoring effect was d=0 in all labs. In the second it was d=.5 in all labs. And in the third it was on average d=.5, but it varied across labs with sd(d)=.2 [8].
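A sketch of how such columns could be generated (again my reconstruction with placeholder names; footnote 8 describes the slightly different trick actually used, making the effect proportional to the length of the lab's name):

    # Made-up outcome columns for the shuffle-test validation (sketch only)
    set.seed(1)
    cond  <- as.numeric(ml$condition == "high")         # 0/1 treatment indicator
    labs  <- unique(ml$lab)
    lab_d <- setNames(rnorm(length(labs), mean = .5, sd = .2), labs)   # d varies by lab
    ml$dv_null  <- rnorm(nrow(ml))                                     # d = 0 everywhere
    ml$dv_homog <- rnorm(nrow(ml)) + .5 * cond                         # d = .5 everywhere
    ml$dv_heter <- rnorm(nrow(ml)) + lab_d[as.character(ml$lab)] * cond  # d ~ (.5, sd .2)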

Recall that because the shuffle test is done under the null, 5% of the results should be significant (false positives), and I should get a ton of I2=0% estimates, no matter which of the three true effects is simulated.

That’s exactly what happens.

Figure 2: for normally distributed data, ~5% false-positive rate, and lots of accurate I2=0% estimates (R Code).

I suspect I2 can tolerate fairly non-normal data, but anchoring (or perhaps the intensive way it was ‘cleaned’) was too much for it. I have not looked into which specific aspect of the data, or possibly the data cleaning, disconcerts I2 [9].

The authors of Many Labs saw the glass half full, concluding hidden moderators were present only in studies with big effect sizes. The statistically savvy authors of the opening paragraphs saw it mostly-empty, warning readers of hidden moderators everywhere. I see the glass mostly-full: the evidence that hidden moderators influenced large-effect anchoring studies appears to be spurious.

We should stop arguing hidden moderators are a thing based on the Many Labs paper.

Author feedback
Our policy is to share drafts of blog posts that discuss someone else’s work with them to solicit feedback (see our updated policy .htm). A constructive dialogue with several authors from Many Labs helped make progress with the reproducibility issues (see footnote 1) and improve the post more generally. They suggested I share the draft with Robbie van Aert, Jelte Wicherts and Marcel Van Assen and I did. We had an interesting discussion on the properties of randomization tests, shared some R Code back and forth, and they alerted me to the existence of the Rucker et al paper cited above.

In various email exchanges the term “hidden moderator” came up. I use the term literally: there are things we don’t see (hidden) that influence the size of the effect (moderators), so to me it is a synonym for unobserved (hidden) heterogeneity (moderators). Some authors were concerned that “hidden moderator” is a loaded term used to excuse failures to replicate. I revised the writing of the post taking that into account, hopefully making it absolutely clear that the Many Labs authors concluded that hidden moderators are not responsible for the studies that failed to replicate.



  1. The most important problem I had was that I could not reproduce the key results in Table 3 using the posted data. As the authors explained when I shared a draft of this post, a few weeks after first asking them about this, the study names are mismatched in that table, such that the results for one study are reported in the row for a different study.  A separate issue is that I tried to reproduce some extensive data cleaning performed on the anchoring questions but gave up when noticing large discrepancies between sample sizes described in the supplement and present in the data. The authors have uploaded a new supplement to the OSF where the sample sizes match the posted data files (though not the code itself that would allow reproducing the data cleaning). []
  2. they also compared lab to online, and US vs non-US, but these are obviously observable moderators []
  3. I originally called this “bootstrapping,” and many readers complained. Some were not aware that modifying the data so that the null is true can also be called bootstrapping. Check out section 3 in this “Introduction to the bootstrap world” by Boos (2003) .pdf. But others were aware of it and nevertheless objected because I keep the data fixed without resampling, so to them bootstrapping was misleading. “Permutation test” is not perfect because permutation tests are associated with trying all permutations. “Reallocation test” is not perfect because such tests usually involve swapping the treatment column, not the lab column. This is a purely semantic issue, as we all know what the test I ran consisted of. []
  4. Though note that here I randomize the data to see if a statistical procedure is valid for the data, not to adjust statistical significance []
  5. It is worth noting that the other four variables with statistically significant heterogeneity in Table 3 do not suffer from the same unusual distributions and/or data cleaning procedures as do the four anchoring questions. But, because they are quite atypical exemplars of psychology experiments, I would personally not base estimates of heterogeneity in psychology experiments in general on them. One is not an experiment, and two involve items that are more culturally dependent than most psychological constructs: reactions to George Washington and an anti-democracy speech. The fourth is barely significantly heterogeneous: p=.04. But this is just my opinion. []
  6. For robustness I re-ran the simulations keeping the number of observations in each condition constant within lab; the results were just as bad. []
  7. One could use these distributions to compute an adjusted p-value for heterogeneity; that is, compute how often we get a p-value as low as the one obtained by Many Labs, or an I2 as high as they obtained, among the shuffled datasets. But I do not do that here because such calculations should build in the data-cleaning procedures, and code for such data cleaning is not available. []
  8. In particular, I made the effect size proportional to the length of the lab’s name. So for the rows in the data where the lab is called “osu” the effect is d=.3, and where it is “Ithaca” the effect is d=.6. As it happens, the average lab-name length is about 5 characters, so this works out to an average of about d=.5, and the implied sd of d is .18, which I round to .2 in the figure title. []
  9. BTW, I noted that tau makes more sense conceptually than I2. But mathematically, for a given dataset, tau is a simple transformation of I2 (or rather, vice versa: I2 = tau2/(total variance)); thus if one is statistically invalid, so is the other. Tau is not a more robust statistic than I2 is. []

[62] Two-lines: The First Valid Test of U-Shaped Relationships

Can you have too many options in the menu, too many talented soccer players in a national team, or too many examples in an opening sentence? Social scientists often hypothesize u-shaped relationships like these, where the effect of x on y starts positive and becomes negative, or starts negative and becomes positive. Researchers rely almost exclusively on quadratic regressions to test if a u-shape is present (y = ax + bx2), typically asking if the b term is significant [1].

The problem is that quadratic regressions are not diagnostic regarding u-shapedness. Under realistic circumstances the quadratic has a 100% false-positive rate, for example, concluding that y=log(x) is u-shaped. In addition, under plausible circumstances, it can have 0% power: failing to diagnose a function that is u-shaped as such, even with an infinite sample size [2].
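As a minimal illustration (my own simulation, not the paper's), here is the naive version of the quadratic test declaring y = log(x) to be u-shaped:

    # Quadratic regression "finds" a u-shape in y = log(x) + noise (sketch)
    set.seed(1)
    x <- runif(5000, .1, 5)
    y <- log(x) + rnorm(5000, sd = .5)
    m <- lm(y ~ x + I(x^2))
    summary(m)$coefficients["I(x^2)", ]   # negative and "significant" quadratic term
    # naive conclusion: an inverted u; truth: a monotonically increasing curve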

Leif and I wrote Colada[27] on this problem a few years ago. We got started on a solution that I developed further in a recent working paper (SSRN). I believe it constitutes the first general and valid test of u-shaped relationships.

The test consists of estimating two regression lines, one for ‘low’ values of x, another for ‘high’ values of x. A u-shape is present if the two lines have slopes of opposite sign and each is individually significant. The difficult thing is setting the breakpoint: what counts as a low x vs. a high x? Most of the work in the project went into developing a procedure to set the breakpoint that would maximize power [3].

The solution is what I coined the “Robin Hood” procedure; it increases the statistical power of the u-shape test by setting a breakpoint that strengthens the weaker line at the expense of the stronger one.
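To make the logic concrete, here is a minimal sketch of an interrupted two-lines regression; it uses a simple median breakpoint as a stand-in for the actual Robin Hood procedure, which the paper derives:

    # Two-lines test via one interrupted regression, median breakpoint (sketch)
    set.seed(1)
    x  <- runif(1000, .1, 5)
    y  <- log(x) + rnorm(1000, sd = .5)     # concave, but NOT u-shaped
    xc <- median(x)                          # stand-in breakpoint (not Robin Hood)
    xlow  <- ifelse(x <  xc, x - xc, 0)      # carries the slope of the 'low' line
    xhigh <- ifelse(x >= xc, x - xc, 0)      # carries the slope of the 'high' line
    high  <- as.numeric(x >= xc)             # lets the two lines not meet (interrupted)
    m <- lm(y ~ xlow + xhigh + high)
    summary(m)$coefficients[c("xlow", "xhigh"), ]
    # a u-shape requires both slopes to be individually significant AND of
    # opposite sign; here both are positive, so the verdict is: no u-shape

The real test additionally sets the breakpoint with Robin Hood and, per the postscript below, uses robust standard errors; this sketch only shows the structure of the interrupted regression.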

The paper (SSRN) discusses its derivation and computation in detail. Here I want to focus on its performance.

To gauge the performance of this test, I ran a horse-race between the quadratic and the two-lines procedure. The image below previews the results.

The next figure has more details.

It reports the share of simulations that obtain a significant (p<.05) u-shape for many scenarios. For each scenario, I simulated a relationship that is not u-shaped (Panel A), or one that is u-shaped (Panel B). I changed a bunch of things across scenarios, such as the distribution of x, the level of noise, etc. The figure caption has all the details you may want.

Panel A shows the quadratic has a comically high false-positive rate. It sees u-shapes everywhere.

Panel B – now excluding the fundamentally invalid quadratic regression – shows that for the two-lines test, Robin Hood is the most powerful way to set the breakpoint. Interestingly, the worst way is what we had first proposed in Colada[27], splitting the two lines at the highest point identified by the quadratic regression. The next worst is to use 3 lines instead of 2 lines.

The next figure applies the two-lines procedure to data from two published psych papers in which a u-shaped hypothesis appears to be spuriously supported by the invalid quadratic regression test. In both cases the first, positive line is highly significant, but the supposed sign reversal for high xs is not close to significant.

It is important to state that the (false-positive) u-shaped hypothesis by Sterling et al. is incidental to their central thesis.

Fortunately the two-lines test does not refute all hypothesized u-shapes. A couple of weeks ago Rogier Kievit tweeted about a paper of his (.htm) which happened to involve testing a u-shaped hypothesis. Upon seeing their Figure 7 I replied:

Rogier’s reply:

Both u-shapes were significant via the two-lines procedure.

Researchers should continue to have u-shaped hypotheses, but they should stop using quadratic regressions to test them. The two-lines procedure, with its Robin Hood optimization, offers an alternative that is both valid and statistically powerful.

  • Full paper: SSRN.
  • Online app (with R code): .html
  • Email me questions or comments

P.S. (2017 11 02): After reading this post, Yair Heller contacted me, via Andrew Gelman, sharing an example where the two-lines test had an elevated false-positive rate (~20%). The problem involved heteroskedasticity: Yair considered a situation where the noise was greater in one segment (the flat one) than in the other. Fortunately it turned out to be quite easy to fix this problem: the two-lines test should be conducted by estimating regressions with “robust” standard errors, that is, standard errors that don’t assume the same level of noise throughout. The online app will soon be modified to compute robust standard errors, and the paper will reflect this change as well. Thanks to Yair & Andrew. (See new R code showing the problem and the solution.)


Frequently Asked Questions about two-lines test.
1. Wouldn’t running a diagnostic plot after the quadratic regression prevent researchers from mis-using it?
Leaving aside the issue that in practice diagnostic plots are almost never reported in papers (and thus presumably not run), it is important to note that diagnostic plots are not diagnostic, at least not of u-shapedness. They tell you whether you have a perfect model overall (which you never do), not whether you are making the right inference. Because true data are almost never exactly quadratic, diagnostic plots will look slightly off whether or not a relationship is u-shaped. See a concrete example (.pdf).

2. Wouldn’t running 3 lines be better, allowing for the possibility that there is a middle flat section in the U, thus increasing power?
Interestingly, this intuition is wrong: if you fit three lines instead of two, you will have a dramatic loss of power to detect u-shapes. In Panel B in the figure after the fallen horse, you can see the poorer performance of the 3-line solution.

3. Why fit two lines to the data, an obviously wrong model, and not a spline, kernel regression, generalized additive model, etc.?
Estimating two regression lines does not assume the true model has two lines. Regression lines are unbiased estimates of average slopes in a region, whether the slope is constant in that region or not. The specification error is thus inconsequential. To visually summarize what the data look like, a spline, kernel regression, etc., is indeed better (and the online app for the two-lines test reports such a line; see Rogier’s plots above). But these lines do not allow testing holistic properties of the data, such as whether a relationship is u-shaped. In addition, these models require that researchers set arbitrary parameters, such as how much to smooth the data; one arbitrary choice of smoothing may lead to an apparent u-shape while another does not. Flexible models are seldom the right choice for hypothesis testing.

4. Shouldn’t one force the two lines to meet? Surely there is no discontinuity in the true model.
No. It is imperative that the two lines be allowed not to meet; that is, one runs an “interrupted” rather than a “segmented” regression. This goes back to Question 3. We are not trying to fit the data overall as well as possible, but merely to compute an average slope within a set of x-values. If you don’t allow the two lines to be interrupted, you are no longer computing two separate means and can have a very high false-positive rate when detecting u-shapes (see example .png).



  1. At least three papers have proposed a slightly more sophisticated test of u-shapedness relying on quadratic regression (Lind & Mehlum .pdf | Miller et al. .pdf | Spiller et al. .pdf). All three propose, actually, a mathematically equivalent solution, and that solution is only valid if the true relationship is quadratic (and not, for example, y=log(x)). This assumption is strong, unjustified, and untestable, and if it is not met, the results are invalid. When I showcase the quadratic regression performing terribly in this post, I am relying on this more sophisticated test. The more commonly used test, simply looking at whether b is significant, fares worse. []
  2. The reason for the poor performance is that quadratic regressions assume the true relationship is, well, quadratic; if it is not (and why would it be?), anything can happen. This is in contrast to linear regression, which assumes linearity, but if linearity is not met the linear regression is nevertheless interpretable as an unbiased estimate of the average slope. []
  3. The two lines are estimated within a single interrupted regression for greater efficiency []

[61] Why p-curve excludes ps>.05

In a recent working paper, Carter et al (.pdf) proposed that one can better correct for publication bias by including not just p<.05 results, the way p-curve does, but also p>.05 results [1]. Their paper, currently under review, aimed to provide a comprehensive simulation study that compared a variety of bias-correction methods for meta-analysis.

Although the paper is well written and timely, the advice is problematic. Incorporating non-significant results into a tool designed to correct for publication bias requires one to make assumptions about how difficult it is to publish each possible non-significant result. For example, one has to make assumptions about how much more likely an author is to publish a p=.051 than a p=.076, or a p=.09 in the wrong direction than a p=.19 in the right direction, etc. If the assumptions are even slightly wrong, the tool’s performance becomes disastrous [2].

Assumptions and p>.05s
The desire to include p>.05 results in p-curve type analyses is understandable. Doing so would increase our sample sizes (of studies), rendering our estimates more precise. Moreover, we may be intrinsically interested in learning about studies that did not get to p<.05.

So why didn’t we do that when we developed p-curve? Because we wanted a tool that would work well in the real world.  We developed a good tool, because the perfect tool is unattainable.

While we know that the published literature generally does not discriminate among p<.05 results (e.g., p=.01 is not perceptibly easier to publish than is p=.02), we don’t know how much easier it is to publish some non-significant results rather than others.

The downside of p-curve focusing only on p<.05 is that p-curve can “only” tell us about the (large) subset of published results that are statistically significant. The upside is that p-curve actually works.

All p>.05 are not created equal
The simulations reported by Carter et al. assume that all p>.05 findings are equally likely to be published: a p=.051 in the right direction is as likely to be published as a p=.051 in the wrong direction. A p=.07 in the right direction is as likely to be published as a p=.97 in the right direction. If this does not sound implausible to you, we recommend re-reading this paragraph.

Intuitively it is easy to see how getting this assumption wrong will introduce bias. “Imagine” that a p=.06 is easier to publish than is a p=.76. A tool that assumes both results are equally likely to be published will be naively impressed when it sees many more p=.06s than p=.76s, and it will fallaciously conclude there is evidential value when there isn’t any.
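A quick sketch of that intuition (the publication probabilities below are made up purely for illustration): simulate studies in which the null is true, let the chance of publication depend on the p-value, and count what survives.

    # Differential publishability of p>.05 results, under the null (sketch)
    set.seed(1)
    p <- replicate(5000, t.test(rnorm(20), rnorm(20))$p.value)    # null is true
    pub_prob  <- ifelse(p < .05, 1, ifelse(p < .10, .30, .05))    # made-up probabilities
    published <- p[runif(length(p)) < pub_prob]
    sum(published > .05 & published < .10)    # many "marginal" results survive
    sum(published > .75 & published < .80)    # few clearly null results survive
    # a tool that assumes all p>.05 results are equally publishable reads this
    # pile-up just above .05 as evidence of a true effect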

A calibration
We ran simulations matching one of the setups considered by Carter et al., and assessed what happens if the publishability of p>.05 results deviated from their assumptions (R Code). The black bar in the figure below shows that if their fantastical assumption were true, the tool would do well, producing a false-positive rate of 5%. The other bars show that under some (slightly) more realistic circumstances, false-positives abound.

One must exclude p>.05
It is obviously not true that all p>.05s are equally publishable. But no alternative assumption is plausible. The mechanisms that influence the publication of p>.05 results are too unknowable, complex, and unstable from paper to paper, to allow one to make sensible assumptions or generate reasonable estimates. The probability of publication depends on the research question, on the authors’ and editors’ idiosyncratic beliefs and standards, on how strong other results in the paper are, on how important the finding is for the paper’s thesis, etc.  Moreover, comparing the 2nd and 3rd bar in the graph above, we see that even minor quantitative differences in a face-valid assumption make a huge difference.

P-curve is not perfect. But it makes minor and sensible assumptions, and is robust to realistic deviations from those assumptions. Specifically, it assumes that all p<.05 results are equally publishable regardless of their exact p-value. This captures how most researchers perceive publication bias to occur (at least in psychology). Its inferences about evidential value are robust to relatively large deviations from this assumption: e.g., if researchers start aiming for p<.045 instead of p<.05, or even p<.035, or even p<.025, p-curve analysis, as implemented in the online app (.htm), will falsely conclude there is evidential value when the null is true no more than 5% of the time. See our “Better P-Curves” paper (SSRN).

With p-curve we can determine whether a set of p<.05 results have evidential value, and what effect we may expect in a direct replication of those studies.  Those are not the only questions you may want to ask. For example, traditional meta-analysis tools ask what is the average effect of all of the studies that one could possibly run (whatever that means; see Colada[33]), not just those you observe. P-curve does not answer that question. Then again, no existing tool does. At least not even remotely accurately.

P-curve tells you “only” this: If I were to run these statistically significant studies again, what should I expect?


Author feedback.
We shared a draft of this post with Evan Carter, Felix Schönbrodt, Joe Hilgard and Will Gervais. We had an incredibly constructive and valuable discussion, sharing R Code back and forth and jointly editing segments of the post.

We made minor edits after posting, in response to readers’ feedback. The original version is archived here (.htm).



  1. When p-curve is applied to estimate effect size, it is extremely similar to the “one-parameter selection model” by Hedges (1984) (.pdf). []
  2. Their paper is nuanced in many sections, but their recommendations are not. For example, they write in the abstract, “we generally recommend that meta-analysis of data in psychology use the three-parameter selection model.” []