In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significance Test with the (critically important) inferences one arrives at with p-curve.
The Many-Labs project is a collaboration of 36 labs around the world, each running replications of 13 published effects in psychology (paper: .pdf; data: .xlsx). [1]
One of the most replicable effects was the Asian Disease problem, a demonstration of people being risk seeking for losses but risk averse for gains; it was p<.05 in 31 of 36 labs (we also replicated it in Colada[11]).
Here I apply the Excessive Significance Test and p-curve to those 31 studies (summary table .xlsx).
How The Excessive Significance Test Works
It takes a set of studies (e.g., all studies in a paper) and asks whether too many are statistically significant. For example, say a paper has five studies, all p<.05. Imagine each obtained an effect size that would have given it 50% power. The probability that five out of five studies powered to 50% would all get p<.05 is .5*.5*.5*.5*.5=.03125. So we reject the null of full reporting, meaning that at least one null finding was not reported.
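Here is that toy calculation in R; the five studies and the 50% power figure are just the hypotheticals from the example above, not estimates from any actual paper.

```r
# Toy Excessive Significance Test: five reported studies, all p < .05,
# each assumed to have had 50% power.
power     <- rep(.50, 5)    # assumed power of each reported study
p_all_sig <- prod(power)    # probability that all five come out significant
p_all_sig                   # 0.03125 < .05, so reject the null of full reporting
```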
The excessive significance test was developed by Ioannidis and Trikalinos (.html). In psychology it has been popularized by Greg Francis (.html) and Ulrich Schimmack (.html). I have twice been invited to publish commentaries on Francis' use of the test: "It Does not Follow" (.html) and "It Really Just Does Not Follow" (.html).
How p-curve Works
P-curve is a tool that assesses whether, after accounting for p-hacking and file-drawering, a set of statistically significant findings has evidential value. It looks at the distribution of p-values and asks whether that distribution is what we would expect of a set of true findings. In a nutshell, you see more low (e.g., p<.025) than high (e.g., p>.025) significant p-values when an effect is true (for details see www.p-curve.com).
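To make the "more low than high" intuition concrete, here is a rough sketch of a simple binomial version of that comparison in R. The p-values are made up for illustration, and the p-curve app relies on fuller tests described at p-curve.com, so treat this only as a sketch.

```r
# Sketch of the binomial "more low than high" comparison behind p-curve.
# Only statistically significant results enter; these p-values are made up.
p   <- c(.001, .004, .011, .018, .021, .032, .041)
low <- sum(p < .025)     # count of "low" significant p-values
n   <- length(p)
# If the effect were nonexistent, significant p-values would be uniform on
# (0, .05), so each would fall below .025 with probability .5.
binom.test(low, n, p = .5, alternative = "greater")
```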
Running both tests
The Excessive Significance Test takes the 31 studies that worked and spits out p=.03, rejecting the null that all studies run were reported. It nails it. We know 5 studies were not “reported” and the test infers accordingly (R Code) [2].
This inference is pointless for two reasons.
First, we always know the answer to the question of whether all studies were published. The answer is always "No." Some people publish some null findings, but nobody publishes all null findings.
Second, it tells us about researcher behavior, not about the world, and we do science to learn about the world, not to learn about researcher behavior.
The question of interest is not “is there a null finding you are not telling me about?” The question of interest is “do these significant findings you are telling me about have evidential value?”
P-curve takes the 31 studies and tells us that, taken as a whole, they support the notion that gain vs. loss framing has an effect on risk preferences.
The figure (generated with the online app) shows that, consistent with a true effect, there are more low than high p-values among the 31 studies that worked.
The excessive significance test tells you only that the glass is not 100% full.
P-curve tells you whether it has enough water to quench your thirst.
- More data: https://osf.io/wx7ck/ [↩]
- Ulrich Schimmack (.html) proposes a variation in how the test is conducted, computing power based on each individual effect size rather than pooling. When done this way, the Excessive Significance Test is also significant, p=.01; see R Code link above and the schematic sketch below. [↩]
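To see the difference between the pooled and per-study versions of the test, here is a schematic in R. The sample sizes and effect sizes are placeholders rather than the Many-Labs numbers, and a t-test power function stands in for the test actually used; see the R Code link for the real analysis.

```r
# Pooled vs. per-study power for the Excessive Significance Test (placeholders).
n_per_cell <- c(40, 55, 60)         # hypothetical per-study cell sizes
pooled_d   <- 0.50                  # hypothetical pooled effect size
study_d    <- c(0.35, 0.55, 0.60)   # hypothetical per-study effect sizes

# Pooled version: every study's power is evaluated at the pooled effect size.
pow_pooled <- sapply(n_per_cell, function(n) power.t.test(n = n, delta = pooled_d)$power)

# Per-study variant: each study's power is evaluated at its own effect size.
pow_study  <- mapply(function(n, d) power.t.test(n = n, delta = d)$power,
                     n_per_cell, study_d)

prod(pow_pooled)   # probability all studies significant (test p-value), pooled
prod(pow_study)    # probability all studies significant (test p-value), per-study
```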