Data Colada

[67] P-curve Handles Heterogeneity Just Fine


Posted on January 8, 2018 by Joe, Leif, & Uri

A few years ago, we developed p-curve (see p-curve.com), a statistical tool that identifies whether or not a set of statistically significant findings contains evidential value, or whether those results are solely attributable to the selective reporting of studies or analyses. It also estimates the true average power of a set of significant findings [1].

A few methods researchers have published papers stating that p-curve is biased when it is used to analyze studies with different effect sizes (i.e., studies with "heterogeneous effects"). Since effect sizes in the real world are not identical across studies, this would mean that p-curve is not very useful.

In this post, we demonstrate that p-curve performs quite well in the presence of effect size heterogeneity, and we explain why the methods researchers have stated otherwise.

Basic setup
Most of this post consists of figures like this one, which report the results of 1,000 simulated p-curve analyses (R Code).

Each analysis contains 20 studies, and each study has its own effect size, its own sample size, and because these are drawn independently, its own statistical power. In other words, the 20 studies contain heterogeneity [2].

For example, to create this first figure, each analysis contained 20 studies. Each study had a sample size drawn at random from the orange histogram, a true effect size drawn at random from the blue histogram, and thus a level of statistical power drawn at random from the third histogram.

The studies' statistical power ranged from 10% to 70%, and their average power was 41%. P-curve guessed that their average power was 40%. Not bad.
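To make the setup concrete, here is a minimal sketch of how one such analysis can be simulated. This is not the post's actual code (that is in the linked R Code); the specific distributions below are assumptions chosen to roughly match the description above: two-cell designs, per-cell sample sizes drawn uniformly between 10 and 40 (averaging 25 per cell), and true effects drawn from d ~ N(.5, .05) (see footnote 3).

    # Minimal sketch of one simulated analysis (not the post's actual code; see the
    # linked R Code). Assumed: two-cell designs, per-cell n ~ Uniform(10, 40),
    # true effects d ~ N(.5, .05), two-sided two-sample t-tests at alpha = .05.
    set.seed(1)
    k <- 20                                # studies entering one p-curve analysis
    n <- round(runif(k, 10, 40))           # per-cell sample sizes (mean ~25)
    d <- rnorm(k, mean = .5, sd = .05)     # true effect sizes

    # True power of each study's t-test, via the noncentral t distribution
    df    <- 2 * n - 2
    ncp   <- d * sqrt(n / 2)
    tcrit <- qt(.975, df)
    power <- 1 - pt(tcrit, df, ncp) + pt(-tcrit, df, ncp)

    mean(power)    # the "true average power" that p-curve is asked to recover
    range(power)   # heterogeneity in power across the 20 studies

Each of the 1,000 analyses repeats a draw like this, feeds the statistically significant results to p-curve, and compares p-curve's estimate to the true average power.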

But what if…?

1) But what if there is more heterogeneity in effect size?
Let's increase heterogeneity so that the analyzed set of studies contains effect sizes ranging from d = 0 (null) to d = 1 (very large), probably pretty close to the entire range of plausible effect sizes in psychology [3].

The true average power is 42%. P-curve estimates 43%. Again, not bad.
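Concretely, per footnote 3, this amounts to widening the effect-size distribution in the sketch above while leaving everything else unchanged:

    # Wider heterogeneity (footnote 3): draw true effects from N(.5, .15) instead of
    # N(.5, .05), so effects now span roughly d = 0 to d = 1 across the 20 studies.
    d <- rnorm(20, mean = .5, sd = .15)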

2) But what if samples are larger?
Perhaps p-curve's success is limited to analyses of studies that are relatively underpowered. So let's increase sample size (and therefore power) and see what happens. In this simulation, we've increased the average sample size from 25 per cell to 50 per cell.

The true power is 69%, and p-curve estimates 68%. This is starting to feel familiar.

3) But what if the null is true for some studies?
In real life, many p-curves will include a few truly null effects that are nevertheless significant (i.e., false-positives).  Let's now analyze 25 studies, including 5 truly null effects (d=0) that were false-positively significant.

The true power is 56%, and p-curve estimates 57%. This is continuing to feel familiar.
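One way to see where a true average of roughly 56% can come from (assuming, on our part, that the 20 non-null studies are powered as in the previous simulation): a truly null effect is "powered" at exactly alpha = 5%, so averaging the five null studies in pulls the overall number down.

    # Back-of-the-envelope check (assumes the 20 non-null studies average ~69% power,
    # as in the previous simulation; a null study is significant 5% of the time)
    (20 * .69 + 5 * .05) / 25    # ~0.56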

4) But what if sample size and effect size are not symmetrically distributed?

Maybe p-curve only works when sample and effect size are (unrealistically) symmetrically distributed. Let's try changing that. First we skew the sample size, then we skew the effect size:

The true powers are 58% and 60%, and p-curve estimates 59% and 61%. This is persisting in feeling familiar.
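The exact skewed distributions are in the linked R Code. Purely as an illustration of the kind of change involved (not the post's actual distributions), one could right-skew the per-cell sample sizes like this:

    # Purely illustrative (not the post's actual distributions): right-skewed per-cell
    # sample sizes, still averaging ~25 per cell but with a long right tail
    n <- 10 + round(rgamma(20, shape = 2, scale = 7.5))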

5) But what if all studies are highly powered?
Let's go back to the first simulation and increase the average sample size to 100 per cell.

The true power is 93%, and p-curve estimates 94%. It is clear that heterogeneity does not break or bias p-curve. On the contrary, p-curve does very well in the presence of heterogeneous effect sizes.

So why have others proposed that p-curve is biased in the presence of heterogeneous effects?

Reason 1:  Different definitions of p-curve's goal.
van Aert, Wicherts, & van Assen (2016, .pdf) write that p-curve "overestimat[es] effect size under moderate-to-large heterogeneity" (abstract). McShane, Bockenholt, & Hansen (2016 .pdf) write that p-curve "falsely assume[s] homogeneity […] produc[ing] upward[ly] biased estimates of the population average effect size." (p.736).

We believe that the readers of those papers would be very surprised by the results we depict in the figures above. How can we reconcile our results with what these authors are claiming?

The answer is that the authors of those papers assessed how well p-curve estimated something different from what it estimates (and what we have repeatedly stated that it estimates).

They assessed how well p-curve estimated the average effect sizes of all studies that could be conducted on the topic under investigation. But p-curve informs us "only" about the studies included in p-curve [4].

Imagine that an effect is much stronger for American than for Ukrainian participants. For simplicity, let's say that all the Ukrainian studies are non-significant and thus excluded from p-curve, and that all the American studies are p<.05 and thus included in p-curve.

P-curve would recover the true average effect of the American studies. Those arguing that p-curve is biased are saying that it should recover the average effect of both the Ukrainian and American studies, even though no Ukrainian study was included in the analysis [5].
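Footnote 5 works through a quantitative version of this selection logic. Here is a small sketch of that arithmetic (equal numbers of d = .2 and d = .6 studies, n = 20 per cell; the only ingredients are each design's power and the true effects):

    # Selection arithmetic from footnote 5: equal numbers of d = .2 and d = .6 studies
    # are run (n = 20 per cell), but only the significant ones enter p-curve.
    n <- 20; df <- 2 * n - 2; tcrit <- qt(.975, df)
    pow <- function(d) 1 - pt(tcrit, df, d * sqrt(n / 2)) + pt(-tcrit, df, d * sqrt(n / 2))

    w <- c(pow(.2), pow(.6))       # share of each type that ends up significant
    sum(w * c(.2, .6)) / sum(w)    # true average effect of the *included* studies: ~.53

P-curve targets this .53, the average effect of the studies it actually sees; expecting it to return .4, the average of all studies run, is expecting it to estimate something it was never given.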

To be clear, these authors are not particularly idiosyncratic in their desire to estimate "the" overall effect.  Many meta-analysts write their papers as if that's what they wanted to estimate. However…

•  We don't think that the overall effect exists in psychology (DataColada[33]).
•  We don't think that the overall effect is of interest to psychologists (DataColada[33]).
•  And we know of no tool that can credibly estimate it.

In any case, as a reader, here is your decision:
If you want to use p-curve analysis to assess the evidential value or the average power of a set of statistically significant studies, then you can do so without having to worry about heterogeneity [6].

If you instead want to assess something about a set of studies that are not analyzed by p-curve, including studies never observed or even conducted, do not run p-curve analysis. And good luck with that.

Reason 2: Outliers vs heterogeneity
Uli Schimmack, in a working paper (.pdf), reports that p-curve overestimates statistical power in the presence of heterogeneity. Just like us, and unlike the previously referenced authors, he is looking only at the studies included in p-curve. Why do we get different results?

It will be useful to look at a concrete simulation he has proposed, one in which p-curve does indeed do poorly (R Code):

Although p-curve overestimates power in this scenario, the culprit is not heterogeneity, but rather the presence of outliers, namely several extremely highly powered studies. To see this, let's look at similarly heterogeneous studies, but ones in which the maximum power is 80% instead of 100%.

In a nutshell, the overestimation with outliers occurs because power is a bounded variable, but p-curve estimates it based on an unbounded latent variable (the noncentrality parameter). It's worth keeping in mind that a single outlier does not greatly bias p-curve. For example, if 20 studies are powered on average to 50%, adding one study powered to 95% increases true average power to 52%, and p-curve's estimate to just 54%.
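A rough way to see the direction of the bias (this is only an intuition, not p-curve's actual estimator): power is a bounded function of the noncentrality parameter, so a single noncentrality value that averages over low-powered studies and an extreme outlier maps to a power above the true average of the individual powers.

    # Intuition only (not p-curve's actual estimator): power is bounded at 1,
    # while the noncentrality parameter (ncp) is unbounded.
    df <- 38; tcrit <- qt(.975, df)
    pow <- function(ncp) 1 - pt(tcrit, df, ncp) + pt(-tcrit, df, ncp)

    ncps <- c(rep(2, 20), 5)    # 20 modestly powered studies plus one extreme outlier
    mean(pow(ncps))             # true average power of the 21 studies (~.52)
    pow(mean(ncps))             # power implied by the average ncp (~.55): too high

Without the outlier (all ncps near 2), the two numbers essentially coincide, which is why p-curve behaves well in the earlier simulations.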

This problem that Uli has identified is worth taking into account, and perhaps p-curve can be modified to prevent such bias [7]. But this situation should be rare, as few literatures contain both (1) a substantial number of studies powered over 90% and (2) a substantial number of under-powered studies. Moreover, this is a somewhat inconsequential mistake: all it means is that p-curve will exaggerate how strong a truly (and obviously) strong literature actually is.

In Summary
•  P-curve is not biased by heterogeneity.
•  It is biased upwards in the presence of both (1) low-powered studies, and (2) a large share of extremely highly powered studies.
•  P-curve tells us about the study designs it includes, not the study designs it excludes.

Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted all 7 authors of the three methods papers discussed above. Uli Schimmack declined to comment.

Karsten Hansen and Blake McShane provided suggestions that led us to more precisely describe their analyses and to describe more detailed analyses in Footnote 5. Though our exchange with Karsten and Blake started and ended with, in their words, "fundamental disagreements about the nature of evaluating the statistical properties of an estimator," the dialogue was friendly and constructive. We are very grateful to them, both for the feedback and for the tone of the discussion. (Interestingly, we disagree with them about the nature of our disagreement: we don't think we disagree about how to examine the statistical properties of an estimator, but rather, about how to effectively communicate methods issues to a general audience).

Marcel van Assen, Robbie van Aert, and Jelte Wicherts disagreed with our belief that readers of their paper would be surprised by how well p-curve recovers average power in the presence of heterogeneity (as they think their paper explains this as well). Specifically, like us, they think p-curve performs well when making inferences about studies included in p-curve, but, unlike us, they think that readers of their paper would realize this. They are not persuaded by our arguments that the population effect size does not exist and is not of interest to psychologists, and they are troubled by the fact that p-curve does not recover this effect. They also proposed that an important share of studies may indeed have power near 100% (citing this paper: .htm). We are very grateful to them for their feedback and collegiality as well.


Footnotes.

  1. P-curve can also be used to estimate average effect size rather than power (and, as Blake McShane and Karsten Hansen pointed out to us, when used in this fashion p-curve is virtually equivalent to the maximum likelihood procedure proposed by Hedges in 1984 (.pdf) ).  Here we focus on power rather than effect size because we don't think "average effect size" is meaningful or of interest when aggregating across psychology experiments with different designs (see Colada[33]). Moreover, whereas power calculations only require that one knows the results of the test statistic of interest (e.g., F(1,230)=5.23), effect size calculations require one to also know how the study was defined, a fact that renders effect size estimations much more prone to human error (see page 676 of our article on p-curve and effect size estimation (.pdf) ). In any case, the point that we make in this post applies at least as strongly to an analysis of effect size as it does to an analysis of power: p-curve correctly recovers the true average effect size of the studies that it analyses, even when those studies contain different (i.e., heterogeneous) effect sizes. See Figure 2c in our article on p-curve and effect size estimation (.pdf) and Supplement 2 of that same paper (.pdf) [↩]
  2. In real life, researchers are probably more likely to collect larger samples when studying smaller effects (see Colada[58]). This would necessarily reduce heterogeneity in power across studies [↩]
  3. To do this, we changed the blue histogram from d~N(.5,.05) to d~N(.5,.15). [↩]
  4. We have always been transparent about this. For instance, when we described how to use p-curve for effect size estimation (.pdf) we wrote, "Here is an intuitive way to think of p-curve's estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve." (p.667). [↩]
  5. For a more quantitative example, check out Supplement 2 (.pdf) of our p-curve and effect size paper. In the middle panel of Figure S2, we consider a scenario in which a researcher attempts to run an equal number of studies (with n = 20 per cell) testing either an effect size of d = .2 or an effect size of d = .6. Because it is necessarily easier to get significance when the effect size is larger than when the effect size is smaller, the share of significant d = .6 studies will be greater than the share of significant d = .2 studies, and thus p-curve will include more d = .6 studies than d = .2 studies. Because the d = .6 studies will be over-represented among all significant studies, the true average effect of the significant studies will be d = .53 rather than d = .4. P-curve correctly recovers this value (.53), but it is biased upwards if we expect it to guess d = .4. For an even more quantitative example, imagine the true average effect is d = .5 with a standard deviation of .2. If we study this with many n = 20 studies, the average observed significant effect will be d = .91, but the true average effect of those studies is d = .61, which is the number that p-curve would recover. It would not recover the true mean of the population (d = .5) but rather the true mean of the studies that were statistically significant (d = .61). In simulations, the true mean is known and this might look like a bias. In real life, the true mean is, well, meaningless, as it depends on arbitrary definitions of what constitutes the true population of all possible studies (R Code). [↩]
  6. Again, for both practical and conceptual reasons, we would not advise you to estimate the average effect size, regardless of whether you use p-curve or any other tool. But this has nothing to do with the supposed inability of p-curve to handle heterogeneity. See footnote 1. [↩]
  7. Uli has proposed using z-curve, a tool he developed, instead of p-curve. While z-curve does not seem to be biased in scenarios with many extremely high-powered studies, it performs worse than p-curve in almost all other scenarios. For example, in the examples depicted graphically in this post, z-curve's expected estimates are about 4 times further from the truth than are p-curve's. [↩]




Data Colada - All Content Licensed: CC-BY [Creative Commons]