Data Colada

[49] P-Curve Won't Do Your Laundry, But Will Identify Replicable Findings


Posted on June 14, 2016 by Uri, Joe, & Leif

In a recent critique, Bruns and Ioannidis (PlosONE 2016 .pdf) proposed that p-curve makes mistakes when analyzing studies that have collected field/observational data. They write that in such cases:

"p-curves based on true effects and p-curves based on null-effects with p-hacking cannot be reliably distinguished" (abstract).

In this post we show, with examples involving sex, guns, and the supreme court, that the statement is incorrect. P-curve does reliably distinguish between null effects and non-null effects. The observational nature of the data isn't relevant.

The erroneous conclusion seems to arise from their imprecise use of terminology. Bruns & Ioannidis treat a false-positive finding and a confounded finding as the same thing.  But they are different things. The distinction is as straightforward as it is important.

Confound vs False-positive.
We present examples to clarify the distinction, but first let's speak conceptually.

A Confounded effect of X on Y is real, but the association arises because another (omitted) variable causes both X and Y. A new study of X on Y is expected to find that association again.

A False-positive effect of X on Y, in contrast, is not real. The apparent association between X and Y is entirely the result of sampling error. A new study of X on Y is not expected to find an association again.

Confounded effects are real and replicable, while false-positive effects are neither. Those are big differences, but Bruns & Ioannidis conflate them. For instance, they write:

"the estimated effect size may be different from zero due to an omitted-variable bias rather than due to a true effect" (p. 3; emphasis added).

Omitted-variable bias does not make a relationship untrue; it makes it un-causal.
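To make the distinction concrete before turning to real data, here is a minimal simulation sketch in R (entirely made-up variables, not the GSS): Z causes both X and Y, so the X-Y association is confounded but keeps showing up in fresh samples; W is pure noise, so a 'significant' X-W result would be a false positive and is not expected to come back.

```r
# Hedged sketch with simulated data (not the GSS). Z is an omitted variable
# that causes both X and Y; W is unrelated to everything.
set.seed(1)
one_study <- function(n = 2000) {
  z <- rbinom(n, 1, .5)             # omitted variable (think: gender)
  x <- rbinom(n, 1, .2 + .2 * z)    # X is partly caused by Z
  y <- rnorm(n, mean = 2 * z)       # Y is caused by Z, not by X
  w <- rnorm(n)                     # W is pure noise
  c(confounded = t.test(y ~ x)$p.value,   # X-Y: real but confounded
    noise      = t.test(w ~ x)$p.value)   # X-W: nothing there
}
p <- replicate(1000, one_study())   # 1,000 independent "studies"
rowMeans(p < .05)
# confounded ~ 1.00: the (non-causal) X-Y association replicates essentially every time
# noise      ~ 0.05: a significant X-W result is a false positive and does not replicate
```

The confounded association is every bit as replicable as a causal one; the noise association is significant only as often as chance allows.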

This is not just semantics, nor merely a discussion of "what do you mean by a true effect?"
We can learn something from examining replicable effects further (e.g., whether there is a confound and, if so, what it is; confounds are sometimes interesting). We cannot learn anything from examining non-replicable effects further.

This critical distinction between replicable and non-replicable effects can be informed by p-curve. Replicable results, whether causal or not, lead to right-skewed p-curves. False-positive, non-replicable effects lead to flat or left-skewed p-curves.
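Here is what that skew looks like in a quick sketch (simulated t-tests under assumed effect and sample sizes, not any real dataset): among significant results, a true effect piles its p-values up near zero, while a null effect spreads them evenly across the 0-.05 range.

```r
# Hedged sketch: distribution of significant p-values under a true effect vs. the null.
set.seed(2)
p_real <- replicate(5000, t.test(rnorm(50, mean = .5), rnorm(50))$p.value)  # d = .5
p_null <- replicate(5000, t.test(rnorm(50),            rnorm(50))$p.value)  # d = 0
bins <- seq(0, .05, by = .01)
table(cut(p_real[p_real < .05], bins))  # right-skewed: significant ps pile up in the lowest bin
table(cut(p_null[p_null < .05], bins))  # flat: roughly 20% of significant ps in each bin
```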

Causality
P-curve's inability to distinguish causal vs. confounded relationships is no more of a shortcoming than is its inability to fold laundry or file income tax returns. Identifying causal relationships is not something we can reasonably expect any statistical test to do [1].

When researchers try to assess causality through techniques such as instrumental variables, regression discontinuity, or randomized field experiments, they do so via superior designs, not via superior statistical tests. The Z, t, and F tests reported in papers that credibly establish causality are the same tests as those reported in papers that do not.

Correlation is not causation. Confusing the two is human error, not tool error.

To make things concrete we provide two examples. Both use General Social Survey (GSS) data, which is, of course, observational data.

Example 1. Shotguns and female partners (Confound)
With the full GSS, we identified the following confounded association: Shotgun owners report having had 1.9 more female sexual partners, on average, than do non-owners, t(14824)=10.58, p<.0001.  The omitted variable is gender.

33% of male respondents report owning a shotgun, whereas, um, 'only' 19% of women do.

Males, relative to females, also report having had a greater number of sexual encounters with females (means of 9.82 vs. 0.21).

Moreover, controlling for gender, the effect goes away (t(14823)=.68, p=.496) [2].
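For concreteness, here is a hedged R sketch of that check using simulated data loosely calibrated to the numbers above (it is not the GSS and not our analysis script; the code behind Figure 1 is linked below):

```r
# Hedged sketch with simulated data (not the GSS), loosely matched to the numbers
# in the text: ~19% vs. ~33% shotgun ownership, and many more reported female
# partners for men than for women.
set.seed(3)
n <- 15000
z <- rbinom(n, 1, .5)                    # gender (the omitted variable)
x <- rbinom(n, 1, .19 + .14 * z)         # shotgun ownership
y <- rpois(n, ifelse(z == 1, 9.8, 0.2))  # reported number of female partners
coef(summary(lm(y ~ x)))       # shotgun dummy: large t, tiny p
coef(summary(lm(y ~ x + z)))   # add the gender control: the shotgun effect vanishes
```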

So the relationship is confounded. It is real but not causal. Let's see what p-curve thinks of it. We use data from 1994 as the focal study, and create a p-curve using data from previous years (1989-1993) following a procedure similar to Bruns and Ioannidis (2016)  [3]. Panel A in Figure 1 shows the resulting right-skewed p-curve. It suggests the finding should replicate in subsequent years. Panel B shows that it does.

Figure 1. R code to reproduce this figure: https://osf.io/wn7ju/
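For readers who want the flavor of that procedure without opening the script, the sketch below mimics it on simulated stand-in data (the sizes and variable names are assumptions, not the GSS; footnote 3 describes the steps we actually followed).

```r
# Hedged sketch of the resampling procedure described in footnote 3, run on
# simulated stand-in data rather than the pooled 1989-1993 GSS samples.
set.seed(4)
n_pool <- 10000; n_focal <- 1500                   # assumed sizes
z <- rbinom(n_pool, 1, .5)
pool <- data.frame(x = rbinom(n_pool, 1, .19 + .14 * z),
                   y = rpois(n_pool, ifelse(z == 1, 9.8, 0.2)))
p_sub <- replicate(2000, {
  s <- pool[sample(n_pool, n_focal), ]             # subsample of the focal-study size
  coef(summary(lm(y ~ x, data = s)))["x", "Pr(>|t|)"]
})
hist(p_sub[p_sub < .05], breaks = seq(0, .05, .01),
     main = "Significant subsample ps: right-skewed (replicable, though confounded)")
```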

Example 2. Random numbers and the Supreme Court (false-positive)
With observational data it's hard to identify exactly zero effects because there is always the risk of omitted variables, selection bias, long and difficult-to-understand causal chains, etc.

To create a definitely false-positive finding we started with a predictor that could not possibly be expected to truly correlate with any variable: whether the random respondent ID was odd vs. even.

We then p-hacked an effect by running t-tests on every other variable in the 1994 GSS dataset for odd vs. even participants, arriving at 36 false-positive ps<.05. For its amusement value, we focused on the question asking participants how much confidence they have in the U.S. Supreme Court (1: a great deal, 2: only some, 3: hardly any).
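The p-hacking step is easy to mimic. Here is a hedged sketch on a fake data frame of pure-noise variables (the number of variables, the sample size, and the variable names are all assumptions; in the real 1994 GSS the same fishing expedition produced 36 hits).

```r
# Hedged sketch: t-test every (noise) variable on odd vs. even respondent ID and
# count how many come out "significant" by chance alone.
set.seed(5)
n <- 1500
fake94 <- data.frame(id = 1:n, replicate(800, rnorm(n)))  # 800 pure-noise variables
odd <- fake94$id %% 2 == 1
p_all <- sapply(fake94[-1], function(v) t.test(v[odd], v[!odd])$p.value)
sum(p_all < .05)   # ~5% of 800, i.e., roughly 40 false-positive "findings"
```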

Panel C in Figure 1 shows that, following the same procedure as for the previous example, the p-curve for this finding is flat, suggesting that the finding would not replicate in subsequent years. Panel D shows that it does not. Figure 1 demonstrates how p-curve successfully distinguishes between statistically significant studies that are vs. are not expected to replicate.

Punchline: p-curve can distinguish replicable from non-replicable findings. To distinguish correlational from causal findings, call an expert.

Note: this is a blog-adapted version of a formal reply we wrote and submitted to PlosONE, but since 2 months have passed and they have not sent it out to reviewers yet, we decided to Colada it and hope someday PlosONE generously decides to send our paper out for review.


Author feedback.
Our policy is to contact authors whose work we discuss to request feedback and give an opportunity to respond within our original post. We contacted Stephan Bruns and John Ioannidis. They didn't object to our distinction between confounded and false-positive findings, but proposed that "the ability of 'experts' to identify confounding is close to non-existent." See their full 3-page response (.pdf).

 



Footnotes.

  1. For what it's worth, we have acknowledged this in prior work. For example, in Simonsohn, Nelson, and Simmons (2014, p. 535) we wrote, "Just as an individual finding may be statistically significant even if the theory it tests is incorrect— because the study is flawed (e.g., due to confounds, demand effects, etc.)—a set of studies investigating incorrect theories may nevertheless contain evidential value precisely because that set of studies is flawed" (emphasis added). [↩]
  2. We are not claiming, of course, that the residual effect is exactly zero. That's untestable. [↩]
  3. In particular, we generated random subsamples (of the size of the 1994 sample), re-ran the regression predicting number of female sexual partners with the shotgun ownership dummy, and constructed a p-curve for the subset of statistically significant results that were obtained. This procedure is not really necessary: once we know the effect size and sample size, we know the non-centrality parameter of the distribution of the test statistic and can compute expected p-curves without simulations (see Supplement 1 in Simonsohn et al., 2014; a sketch of that computation appears below). But we did our best to follow the procedure of Bruns and Ioannidis. [↩]
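Here is the no-simulation computation that footnote 3 alludes to, sketched in R (our own hedged translation under assumed inputs, not the code from Supplement 1): given the test's degrees of freedom and noncentrality parameter, the expected p-curve is just the distribution of the two-tailed p-value conditional on it being below .05.

```r
# Hedged sketch: expected p-curve from the noncentral t distribution.
# P(two-tailed p < x) = P(|T| > t_crit(x)), where T ~ noncentral t(df, ncp).
expected_pcurve <- function(ncp, df, bins = seq(.01, .05, .01)) {
  p_less <- function(x) {
    crit <- qt(1 - x / 2, df)                      # two-tailed critical value for p = x
    pt(crit, df, ncp, lower.tail = FALSE) + pt(-crit, df, ncp)
  }
  cumulative <- sapply(bins, p_less)               # P(p < .01), P(p < .02), ...
  diff(c(0, cumulative)) / p_less(.05)             # share of significant ps in each .01-wide bin
}
round(expected_pcurve(ncp = 2, df = 100), 2)  # true effect: right-skewed
round(expected_pcurve(ncp = 0, df = 100), 2)  # null effect: .20 in every bin
```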
