In this post, we first consider how plausible it is for researchers to engage in more ambitious p-hacking (i.e., past the nominal significance level of p<.05). Then, we describe how we have modified p-curve (see app 4.0) to deal with this possibility.
Ambitious p-hacking is hard.
In “False-Positive Psychology” (SSRN), we simulated the consequences of four forms of p-hacking that were considered acceptable at the time. We found that the probability of obtaining a statistically significant result (p<.05) skyrocketed from the nominal 5% to 61%.
For a recently published paper, "Better P-Curves" (.pdf), we modified those simulations to see how hard it would be for p-hackers to keep going past .05. We found that p-hacking needs to increase exponentially to get smaller and smaller p-values. For instance, once a nonexistent effect has been p-hacked to p<.05, a researcher would need to attempt nine times as many analyses to achieve p<.01.
Moreover, as Panel B shows, because there is a limited number of alternative analyses one can do (96 in our simulations), ambitious p-hacking often fails.[1]
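To convey how effort escalates, here is a deliberately idealized sketch (not the paper's simulations): each p-hacking attempt is treated as an independent test of a nonexistent effect, and we count attempts until a threshold is crossed. Because real p-hacked analyses are correlated with one another, the numbers differ from those in the paper (under independence the effort ratio is about 5x rather than the 9x we report); the point is only the escalation.

```python
import random

def attempts_until(threshold, rng, cap=100_000):
    """Count null tests run until one first yields p < threshold."""
    for i in range(1, cap + 1):
        if rng.random() < threshold:  # null p-values are uniform on [0, 1]
            return i
    return cap

rng = random.Random(1)
sims = 20_000
mean_05 = sum(attempts_until(.05, rng) for _ in range(sims)) / sims
mean_01 = sum(attempts_until(.01, rng) for _ in range(sims)) / sims
print(mean_05, mean_01)  # ~20 attempts for p<.05, ~100 for p<.01
```

Even in this friendliest-possible setup, getting from p<.05 to p<.01 multiplies the required number of analyses several-fold.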
P-Curve and Ambitious p-hacking
P-curve is a tool that allows you to diagnose the evidential value of a set of statistically significant findings. It is simple: you plot the significant p-values of the statistical tests of interest to the original researchers, and you examine the shape of the resulting distribution. If your p-curve is significantly right-skewed, then the literature you are examining has evidential value. If it’s significantly flat or left-skewed, then it does not.
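The logic can be seen in a small simulation. The sketch below runs many two-sample t-tests with and without a true effect and keeps only the significant p-values; the effect and sample sizes are illustrative choices, not values from the post. Under the null the significant p-values are uniform (about half fall below .025); with a true effect they pile up near zero (right skew).

```python
import numpy as np
from scipy import stats

def significant_ps(effect, n=20, studies=20_000, seed=0):
    """Simulate two-sample t-tests; return the significant p-values."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(studies, n))
    b = rng.normal(effect, 1.0, size=(studies, n))
    p = stats.ttest_ind(a, b, axis=1).pvalue
    return p[p < .05]

for effect, label in [(0.0, "nonexistent effect"), (1.0, "true effect")]:
    ps = significant_ps(effect)
    # Share of significant p-values below .025:
    # ~.5 for the null (flat curve), well above .5 for a true effect.
    print(label, round(float(np.mean(ps < .025)), 2))
```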
In the absence of p-hacking, there is, by definition, a 5% chance of mistakenly observing a significantly right-skewed p-curve if one is in fact examining a literature full of nonexistent effects. Thus, p-curve’s false-positive rate is 5%.
However, when researchers p-hack trying to get p<.05, that probability drops quite a bit, because p-hacking causes p-curve to be left-skewed in expectation, making it harder to (mistakenly) observe a right-skew. Thus, literatures studying nonexistent effects through p-hacking have less than a 5% chance of obtaining a right-skewed p-curve.
But if researchers get ambitious and keep p-hacking past .05, the barely significant results start disappearing and so p-curve starts having a spurious right-skew. Intuitively, the ambitious p-hacker will eliminate the .04s and push past to get more .03s or .02s. The resulting p-curve starts to look artificially good.
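The spurious right skew is easy to see with a stylized “best of k tries” p-hacker who reports the minimum of k null p-values whenever it lands below .05. Taking k = 96 to match the cap on alternative analyses mentioned above, and assuming independent attempts for tractability (real p-hacked analyses are correlated, so treat the number as illustrative):

```python
def p_min_below(x, k=96):
    """P(minimum of k independent uniform p-values < x)."""
    return 1 - (1 - x) ** k

# Among "significant" minimum p-values, the share below .025.
# An honest null would give 0.50; here the pile-up of tiny p-values
# mimics the right skew of a true effect.
frac_small = p_min_below(.025) / p_min_below(.05)
print(round(frac_small, 2))  # 0.92
```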
Updated p-curve app, 4.0 (htm), is robust to ambitious p-hacking
In “Better P-Curves” (.pdf) we introduced a new test for evidential value that is much more robust to ambitious p-hacking. The new app incorporates it (it also computes confidence intervals for power estimates, among many other improvements; see summary (.htm)).
The new test focuses on the “half p-curve,” the distribution of p-values below .025. On the one hand, because the half p-curve does not include barely significant results, it has a lower probability of mistaking ambitious p-hacking for evidential value. On the other hand, dropping observations makes the half p-curve less powerful, so it has a higher chance of failing to recognize actual evidential value.
Fortunately, by combining the full and half p-curves into a single analysis, we obtain inferences that are robust to ambitious p-hacking with minimal loss of power.
The new test of evidential value:
A set of studies is said to contain evidential value if either the half p-curve’s right-skew test is significant at p<.05, or the right-skew tests of both the full and half p-curves are significant at p<.1. [2]
In the figure below we compare the performance of this new combination test with that of the full p-curve alone (the “old” test). The top three panels show that both tests are similarly powered to detect true effects. Only when the original research is underpowered (33% power) is the difference noticeable, and even then it seems acceptable. With just 5 p-values the new test still has more power than the underlying studies do.
The bottom panels show that moderately ambitious p-hacking fully invalidates the “old” test, but the new test is unaffected by it.[3]
We believe that these revisions to p-curve, incorporated in the updated app (.html), make it much harder to falsely conclude that a set of ambitiously p-hacked results contains evidential value. As a consequence, the incentives to ambitiously p-hack are even lower than they were before.
Footnotes.
- This is based on simulations of what we believe to be realistic combinations and levels of p-hacking. The results will vary depending on the types and levels of p-hacking. [↩]
- As with all cutoffs, these should be treated only as points of reference. A half p-curve with p=.051 provides nearly as much evidence as one with p=.049, and p<.001 on both tests is much stronger evidence than p=.099 on both. [↩]
- When the true effect is zero and researchers do not p-hack (an unlikely combination), the probability that the new test leads to concluding the studies contain evidential value is 6.2% instead of the nominal 5%. R Code: https://osf.io/mbw5g/ [↩]