[45] Ambitious P-Hacking and P-Curve 4.0

In this post, we first consider how plausible it is for researchers to engage in more ambitious p-hacking (i.e., past the nominal significance level of p<.05). Then, we describe how we have modified p-curve (see app 4.0) to deal with this possibility.

Ambitious p-hacking is hard.
In “False-Positive Psychology” (SSRN), we simulated the consequences of four (at the time acceptable) forms of p-hacking. We found that the probability of finding a statistically significant result (p<.05) skyrocketed from the nominal 5% to 61%.

[Figure 1]

For a recently published paper, “Better P-Curves” (.pdf), we modified those simulations to see how hard it would be for p-hackers to keep going past .05. We found that p-hacking needs to increase exponentially to get smaller and smaller p-values. For instance, once a nonexistent effect has been p-hacked to p<.05, a researcher would need to attempt nine times as many analyses to achieve p<.01.

[Figure 2]

Moreover, as Panel B shows, because there is a limited number of alternative analyses one can do (96 in our simulations), ambitious p-hacking often fails.[1]
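
To get a feel for why each step down in p-value costs so many more attempts, here is a back-of-the-envelope sketch in Python. It assumes every analysis is an independent test of a true null. That is not how the simulations in the paper work (there, the alternative analyses reuse the same data and are correlated), so it does not reproduce the nine-fold figure above. The function names are made up for this illustration.

```python
import math


def prob_at_least_one(alpha, attempts):
    """Chance that at least one of `attempts` independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** attempts


def attempts_needed(alpha, target=0.5):
    """Smallest number of independent null tests that gives at least a `target`
    chance of one p < alpha. Independence is an assumption of this sketch."""
    return math.ceil(math.log(1 - target) / math.log(1 - alpha))


for alpha in (0.05, 0.01):
    print(alpha, attempts_needed(alpha))
```

Even under this generous independence assumption, reaching p<.01 takes roughly five times as many attempts as reaching p<.05. With correlated analyses, as in the simulations, the cost reported above is steeper still.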

P-Curve and Ambitious p-hacking
P-curve is a tool that allows you to diagnose the evidential value of a set of statistically significant findings. It is simple: you plot the significant p-values of the statistical tests of interest to the original researchers, and you look at its shape. If your p-curve is significantly right-skewed, then the literature you are examining has evidential value. If it’s significantly flat or left-skewed, then it does not.
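
The plotting step can be sketched in a few lines of Python. This is only an illustration, not the app's code: the function name `p_curve_shares` and the binning convention (five bins with upper edges at .01 through .05) are assumptions.

```python
def p_curve_shares(p_values, upper_edges=(0.01, 0.02, 0.03, 0.04, 0.05)):
    """Share of significant p-values (p < .05) falling in each bin.

    Bin i holds p-values above the previous edge and at or below
    upper_edges[i]; this binning convention is an assumption of the sketch.
    """
    significant = [p for p in p_values if 0 < p < 0.05]
    counts = [0] * len(upper_edges)
    for p in significant:
        for i, edge in enumerate(upper_edges):
            if p <= edge:
                counts[i] += 1
                break
    total = len(significant)
    return [c / total if total else 0.0 for c in counts]


# The nonsignificant p = .2 is dropped before the shares are computed.
shares = p_curve_shares([0.001, 0.004, 0.012, 0.03, 0.2])
print(shares)  # [0.5, 0.25, 0.25, 0.0, 0.0]
```

A right-skewed p-curve puts most of its mass in the leftmost bins, as in this toy example. Whether the skew is significant is a separate question that requires a formal test.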

In the absence of p-hacking, there is, by definition, a 5% chance of mistakenly observing a significantly right-skewed p-curve if one is in fact examining a literature full of nonexistent effects. Thus, p-curve’s false-positive rate is 5%.

However, when researchers p-hack trying to get p<.05, that probability drops quite a bit, because p-hacking causes p-curve to be left-skewed in expectation, making it harder to (mistakenly) observe a right-skew. Thus, literatures studying nonexistent effects through p-hacking have less than a 5% chance of obtaining a right-skewed p-curve.

But if researchers get ambitious and keep p-hacking past .05, the barely significant results start disappearing and so p-curve starts having a spurious right-skew. Intuitively, the ambitious p-hacker will eliminate the .04s and push past to get more .03s or .02s. The resulting p-curve starts to look artificially good.

Updated p-curve app, 4.0 (.htm), is robust to ambitious p-hacking
In “Better P-Curves” (.pdf) we introduced a new test for evidential value that is much more robust to ambitious p-hacking. The new app incorporates it. It also computes confidence intervals for power estimates, among many other improvements; see summary (.htm).

The new test focuses on the “half p-curve,” the distribution of p-values that are p<.025. On the one hand, because half p-curve does not include barely significant results, it has a lower probability of mistaking ambitious p-hacking for evidential value. On the other hand, dropping observations makes the half p-curve less powerful, so it has a higher chance of failing to recognize actual evidential value.

Fortunately, by combining the full and half p-curves into a single analysis, we obtain inferences that are robust to ambitious p-hacking with minimal loss of power.

The new test of evidential value:
A set of studies is said to contain evidential value if either the half p-curve has a p<.05 right-skew test, or both the full and half p-curves have p<.1 right-skew tests. [2]
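
The decision rule can be written down directly. The Python sketch below pairs it with a simplified right-skew test so the example runs end to end. Two assumptions to flag: the skew test combines results with Stouffer's method, and it works from raw p-values (rescaled by the cutoff so they are uniform under the null of no effect) rather than from reported test statistics as the app does. The names `right_skew_p` and `has_evidential_value` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def right_skew_p(p_values, cutoff):
    """Stouffer-style right-skew test against the null of no effect.

    Keeps p-values below `cutoff` (.05 for the full p-curve, .025 for the
    half p-curve), rescales them to pp-values that are uniform under the
    null, and combines them. Small return values indicate right skew.
    """
    pp = [p / cutoff for p in p_values if 0 < p < cutoff]
    if not pp:
        return 1.0  # assumption: no usable p-values means no evidence of skew
    z = sum(_STD_NORMAL.inv_cdf(x) for x in pp) / sqrt(len(pp))
    return _STD_NORMAL.cdf(z)


def has_evidential_value(p_full, p_half):
    """Combination rule from the text: half p<.05, or both full and half p<.1."""
    return p_half < 0.05 or (p_full < 0.1 and p_half < 0.1)


p_values = [0.001, 0.002, 0.003, 0.011, 0.04]
p_full = right_skew_p(p_values, 0.05)
p_half = right_skew_p(p_values, 0.025)
print(has_evidential_value(p_full, p_half))
```

Note that a significant full p-curve alone is never enough under this rule; the half p-curve has to agree at least at the p<.1 level.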

In the figure below we compare the performance of this new combination test with that of the full p-curve alone (the “old” test). The top three panels show that both tests are similarly powered to detect true effects. Only when original research is underpowered at 33% is the difference noticeable, and even then it seems acceptable. With just 5 p-values the new test still has more power than the underlying studies do.

[Figure 3]

The bottom panels show that moderately ambitious p-hacking fully invalidates the “old” test, but the new test is unaffected by it.[3]

We believe that these revisions to p-curve, incorporated in the updated app (.html), make it much harder to falsely conclude that a set of ambitiously p-hacked results contains evidential value. As a consequence, the incentives to ambitiously p-hack are even lower than they were before.

Footnotes.

  1. This is based on simulations of what we believe to be realistic combinations and levels of p-hacking. The results will vary depending on the types and levels of p-hacking.
  2. As with all cutoffs, it only makes sense to use these as points of reference. A half p-curve with p=.051 is nearly as good as one with p=.049, and both tests with p<.001 are much stronger than both tests with p=.099.
  3. When the true effect is zero and researchers do not p-hack (an unlikely combination), the probability that the new test leads to concluding the studies contain evidential value is 6.2% instead of the nominal 5%. R Code: https://osf.io/mbw5g/