Data Colada
[29] Help! Someone Thinks I p-hacked


Posted on October 22, 2014 by Uri Simonsohn

It has become more common to publicly speculate, upon noticing a paper with unusual analyses, that a reported finding was obtained via p-hacking. This post discusses how authors can persuasively respond to such speculations.

Examples of public speculation of p-hacking
Example 1. A Slate.com post by Andrew Gelman suspected p-hacking in a paper that collected data on 10 colors of clothing but analyzed red & pink as a single color [.html] (see the authors' response to the accusation: .html)

Example 2. An anonymous referee suspected p-hacking and recommended rejecting a paper, after noticing that participants with low values of the dependent variable had been dropped [.html]

Example 3. A statistics blog suspected p-hacking after noticing that a paper studying the number of hurricane deaths relied on the somewhat unusual negative-binomial regression [.html]

First, the wrong response
The most common & tempting response to concerns like these is also the wrong response: justifying what one did. Explaining, for instance, why it makes sense to collapse red with pink or to run a negative-binomial regression.

It is the wrong response because when we p-hack, we self-servingly choose among justifiable analyses. P-hacked findings are by definition justifiable. Unjustifiable research practices involve incompetence or fraud, not p-hacking.

Showing an analysis is justifiable does not inform the question of whether it was p-hacked.

Right Response #1.  "We decided in advance"
P-hacking involves the post-hoc selection of analyses to get p<.05. One way to address p-hacking concerns is to indicate that the analysis decisions were made ex ante.

A good way to do this is to just say so: "We decided to collapse red & pink before running any analyses."
A better way is with a more general and verifiable statement: "In all papers we collapse red & pink."
An even better way is: "We preregistered that we would collapse red & pink in this study" (see related Colada[12]: "Preregistration: Not Just for the Empiro-Zealots").

Right Response #2.  "We didn't decide in advance, but the results are robust"
Often we don't decide in advance. We don't think of outliers till we see them. What to do then? Show that the results don't hinge on how the problem is dealt with: show the results after dropping observations >2 SD, >2.5 SD, and >3 SD from the mean, after logging the dependent variable, after comparing medians, and after running a non-parametric test. If the conclusion is the same in most of these, tell the blogger to shut up.
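To make that robustness table concrete, here is a minimal sketch in Python of the kind of check described above. The data and names are hypothetical (this is not the analysis from any of the papers discussed): the same two-group comparison is re-run under each outlier rule, after a log transform, and with a non-parametric test, and the p-values are printed side by side.

```python
# A minimal sketch, not any author's code: hypothetical data and names throughout.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(100, 15, 80)    # hypothetical control-group scores
treatment = rng.normal(106, 15, 80)  # hypothetical treatment-group scores

def drop_outliers(x, cutoff_sd):
    """Keep observations within cutoff_sd standard deviations of the group mean."""
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) <= cutoff_sd]

def p_value(a, b, test="t"):
    """Two-sided p-value for the group comparison under the chosen test."""
    if test == "t":
        return stats.ttest_ind(a, b, equal_var=False).pvalue
    return stats.mannwhitneyu(a, b, alternative="two-sided").pvalue

analyses = {
    "no exclusions, t-test":         p_value(control, treatment),
    "drop >2 SD":                    p_value(drop_outliers(control, 2.0), drop_outliers(treatment, 2.0)),
    "drop >2.5 SD":                  p_value(drop_outliers(control, 2.5), drop_outliers(treatment, 2.5)),
    "drop >3 SD":                    p_value(drop_outliers(control, 3.0), drop_outliers(treatment, 3.0)),
    "log of dependent variable":     p_value(np.log(control), np.log(treatment)),
    "non-parametric (Mann-Whitney)": p_value(control, treatment, test="mannwhitney"),
}

for label, p in analyses.items():
    print(f"{label:32s} p = {p:.3f}")
```

If most rows tell the same story, the conclusion does not hinge on the particular choice that raised the concern.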

Right Response #3. "We didn't decide in advance, and the results are not robust. So we ran a direct replication."
Sometimes the result will only be there if you drop >2 SD, and it will not have occurred to you to do so till you saw the p=.24 without it. One possibility is that you are chasing noise. Another possibility is that you are right. The only way to tell these two apart is with a new study: run everything the same, and exclude again based on >2 SD.
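As a rough illustration (hypothetical data and names, not any author's actual analysis), the point of the new study is that the exclusion rule is now fixed before the data arrive, so only one analysis gets run:

```python
# A minimal sketch, assuming the >2 SD rule was committed to before the replication.
import numpy as np
from scipy import stats

CUTOFF_SD = 2.0  # fixed in advance, exactly as used in the original study

def apply_exclusion(x, cutoff_sd=CUTOFF_SD):
    """Keep observations within cutoff_sd standard deviations of the group mean."""
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) <= cutoff_sd]

# hypothetical replication data; in practice this is the newly collected sample
rng = np.random.default_rng(2)
new_control = rng.normal(100, 15, 120)
new_treatment = rng.normal(106, 15, 120)

result = stats.ttest_ind(apply_exclusion(new_control),
                         apply_exclusion(new_treatment),
                         equal_var=False)
print(f"replication, >2 SD exclusion: p = {result.pvalue:.3f}")
```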

If in your "replication" you now need a gender interaction for the >2SD exclusion to give you p<.05, it is not too late to read "False-Positive Psychology" (.html)

Cheers
If a blogger raises concerns of p-hacking, and you cannot provide any of the three responses above: buy the blogger a drink. She is probably right.
