[24] P-curve vs. Excessive Significance Test


Posted on June 27, 2014 (updated February 12, 2020) by Uri Simonsohn

In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significance Test with the (critically important) inferences one arrives at with p-curve.

The Many-Labs project is a collaboration of 36 labs around the world, each running replications of 13 published effects in psychology (paper: .pdf; data: .xlsx). [1]

One of the most replicable effects was the Asian Disease problem, a demonstration that people are risk-seeking for losses but risk-averse for gains; it was p<.05 in 31 of 36 labs (we also replicated it in Colada[11]).

Here I apply the Excessive Significance Test and p-curve to those 31 studies (summary table .xlsx).

How The Excessive Significance Test Works
It takes a set of studies (e.g., all studies in a paper) and asks whether too many are statistically significant. For example, say a paper has five studies, all p<.05. Imagine each obtained an effect size that would have given it 50% power. The probability that five out of five studies powered to 50% would all get p<.05 is .5*.5*.5*.5*.5=.03125. So we reject the null of full reporting, meaning that at least one null finding was not reported.
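
To make that arithmetic concrete, below is a minimal R sketch of the five-study example; the number of studies and the 50% power figure are just the illustrative values from the example above, not estimates from any real paper.

    # Excessive-significance arithmetic for the illustrative example above:
    # five reported studies, each assumed to have 50% power.
    k_reported <- 5      # number of significant studies reported
    power_each <- 0.50   # assumed power of each study

    # If every study conducted were reported, the chance that all five
    # come out p<.05 is the product of their powers.
    p_all_significant <- power_each^k_reported
    p_all_significant    # 0.03125 < .05, so reject the null of full reporting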

The excessive significance test was developed by Ioannidis and Trikalinos (.html). In psychology it has been popularized by Greg Francis (.html) and Ulrich Schimmack (.html). I have twice been invited to publish commentaries on Francis' use of the test: "It Does not Follow" (.htm) and "It Really Just Does Not Follow" (.htm).

How p-curve Works
P-curve is a tool that assesses whether, after accounting for p-hacking and file-drawering, a set of statistically significant findings has evidential value. It looks at the distribution of p-values and asks whether that distribution is what we would expect of a set of true findings. In a nutshell, you see more low (e.g., p<.025) than high (e.g., p>.025) significant p-values when an effect is true (for details see www.p-curve.com).
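
As a rough illustration, one of the tests behind p-curve can be sketched as a simple binomial test of how many statistically significant p-values fall below .025 (the full analysis at www.p-curve.com also includes continuous tests). The p-values below are made up purely for illustration.

    # Rough sketch of p-curve's binomial test; not the full p-curve app.
    # If the null were true, significant p-values would be uniform on (0, .05),
    # so about half should fall below .025; a true effect yields an excess of
    # low p-values.
    p_values <- c(.001, .004, .009, .017, .021, .032, .044)  # hypothetical significant results
    n_low    <- sum(p_values < .025)   # count of "low" significant p-values
    n_total  <- length(p_values)

    # One-sided test: are low p-values over-represented relative to 50%?
    binom.test(n_low, n_total, p = 0.5, alternative = "greater")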

Running both tests
The Excessive Significance Test takes the 31 studies that worked and spits out p=.03, rejecting the null that all studies were reported. It nails it: we know 5 studies were not “reported” and the test infers accordingly (R Code) [2].

This inference is pointless for two reasons.

First, we always know the answer to the question of whether all studies were published. The answer is always "No." Some people publish some null findings, but nobody publishes all null findings.

Second, it tells us about researcher behavior, not about the world, and we do science to learn about the world, not to learn about researcher behavior.

The question of interest is not “is there a null finding you are not telling me about?” The question of interest is “do these significant findings you are telling me about have truth value?”

P-curve takes the 31 studies and tells us that, taken as a whole, they do support the notion that gain vs. loss framing has an effect on risk preferences.

[Figure 1: p-curve of the 31 significant replications]

The figure (generated with the online app) shows that, consistent with a true effect, there are more low than high p-values among the 31 studies that worked.

The excessive significance test tells you only that the glass is not 100% full.
P-curve tells you whether it has enough water to quench your thirst.


  1. More data: https://osf.io/wx7ck/ [↩]
  2. Ulrich Schimmack (.html) proposes a variation in how the test is conducted, computing power based on each individual effect size rather than pooling. When done this way, the Excessive Significance Test is also significant, p=.01; see R Code link above. [↩]

