[13] Posterior-Hacking

Many believe that while p-hacking invalidates p-values, it does not invalidate Bayesian inference. Many are wrong.

This blog post presents two examples from my new “Posterior-Hacking” (SSRN) paper showing  selective reporting invalidates Bayesian inference as much as it invalidates p-values.

Example 1. Chronological Rejuvenation experiment
In  “False-Positive Psychology” (SSRN), Joe, Leif and I run experiments to demonstrate how easy p-hacking makes it to obtain statistically significant evidence for any effect, no matter how untrue. In Study 2 we “showed” that undergraduates randomly assigned to listen to the song “When I am 64” became 1.4 years younger (p<.05).

We obtained this absurd result by data-peeking, dropping a condition, and cherry-picking a covariate. p-hacking allowed us to fool Mr. p-value. Would it fool Mrs. Posterior also? If we take the selectively reported result and feed it to a Bayesian calculator. What happens?

The figure below shows traditional and Bayesian 95% confidence intervals for the above mentioned 1.4 years-younger chronological rejuvenation effect.  Both point just as strongly (or weakly) toward the absurd effect existing. [1]


When researchers p-hack they also posterior-hack

Example 2. Simulating p-hacks
Many Bayesian advocates propose concluding an experiment suggests an effect exists if the data are at least three times more likely under the alternative than under the null hypothesis. This “Bayes factor>3” approach is philosophically different, and mathematically more complex than computing p-values, but it is in practice extremely similar to simply requiring p< .01 for statistical significance. I hence run simulations assessing how p-hacking facilitates getting p<.01 vs getting Bayes factor>3. [2]

I simulated difference-of-means t-tests p-hacked via data-peeking (getting n=20 per-cell, going to n=30 if necessary), cherry-picking among three dependent variables, dropping a condition, and dropping outliers. See R-code.

Adding 10 observations to samples of size n=20 a researcher can increase her false-positive rate from the nominal 1% to 1.7%. The probability of getting a Bayes factor >3 is a comparable 1.8%. Combined with other forms of p-hacking, the ease with which a false finding is obtained increases multiplicatively. A researcher willing to engage in any of the four forms of p-hacking, has a 20.1% chance of obtaining p<.01, and a 20.8% chance of obtaining a Bayes factor >3.

When a researcher p-hacks, she also Bayes-factor-hacks.

Everyone needs disclosure
Andrew Gelman and colleagues, in their influential Bayesian textbook write:

A naïve student of Bayesian inference might claim that because all inference is conditional on the observed data, it makes no difference how those data were collected, […] the essential flaw in the argument is that a complete definition of ‘the observed data’ should include information on how the observed values arose […]”
(p.203, 2nd edition)

Whether doing traditional or Bayesian statistics, without disclosure, we cannot evaluate evidence.

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

  1. The Bayesian confidence interval is the “highest density posterior interval”, computed using Kruschke’s BMLR (html). []
  2. This equivalence is for the default-alternative, see Table 1 in Rouder et al, 2009 (HTML).  []

[9] Titleogy: Some facts about titles

Naming things is fun. Not sure why, but it is. I have collaborated in the naming of people, cats, papers, a blog, its posts, and in coining the term “p-hacking.” All were fun to do. So I thought I would write a Colada on titles.

To add color I collected some data. At the end what I wrote was quite boring, so I killed it, but the facts seemed worth sharing. Here they go, in mostly non-contextualized prose.

Cliché titles
I dislike titles with (unmodified) idioms. The figure below shows how frequent some of them are in the web-of-science archive.
Ironically, the most popular (I found), at 970 papers, is “What’s in a name?” …Lack of originality?

A colleague once shared his disapproval of the increase in the use of colons in titles. With this post as an excuse, I used Mozenda to scrape ~30,000 psychology paper titles published across 19 journals over 40 years, and computed the fraction including a colon. “Colleague was Wrong: Title Colonization Has Been Stable at about 63% Since the 1970s.” [1]

That factoid took a couple of hours to generate. Data in hand I figured I should answer more questions. Any sense of coherence in this piece disappears with the next pixel.

Have titles gotten longer over time? 
Yes. At about 1.5 characters per year (or a tweet a century).
note: controlling for journal fixed effects.

Three less obvious questions to ask
Question 1. What are the two highest scoring Scrabble words used in a Psychology title?
Hypnotizability (37 points), is used in several articles it turns out. [2]
Ventriloquized (36 points) only in this paper.

Question 2. What is the most frequent last-word in a Psychology paper title?
(try guessing before reading the next line)

This is probably the right place to let you know the Colada has a Facebook page now 

Winner: 137 titles end with: “Tasks”
Runner up: 70 titles end with “Effect”

Question 3. What’s more commonly used in a Psychology title, “thinking” or “sex”?
Not close.

Sex: 407.
Thinking: 172.

Alright, that’s not totally fair, in psychology sex often refers to gender rather than the activity. Moreover, thinking (172) is, as expected for academic papers, more common than doing (44).
But memory blows sex, thinking, and doing combined out of the water with 2008 instances; one in 15 psychology titles has the word memory in them.

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

  1. I treated the Journal of Consumer Research as a psychology journal, a decision involving two debatable assumptions. []
  2. Shane Frederick indicated via email that this is a vast underestimate that ignores tripling of points; Hypnotizability could get you 729 points . []

[4] The Folly of Powering Replications Based on Observed Effect Size

It is common for researchers running replications to set their sample size assuming the effect size the original researchers got is correct. So if the original study found an effect-size of d=.73, the replicator assumes the true effect is d=.73, and sets sample size so as to have 90% chance, say, of getting a significant result.

This apparently sensible way to power replications is actually deeply misleading.

Why Misleading?
Because of publication bias. Given that (original) research is only publishable if it is significant, published research systematically overestimates effect size (Lane & Dunlap, 1978). For example, if sample size is n=20 per cell, and true effect size is d=.2, published studies will on average estimate the effect to be d=.78. The intuition is that overestimates are more likely to be significant than underestimates, and so more likely to be published.

If we systematically overestimate effect sizes in original work, then we systematically overestimate the power of replications that assume those effects are real.

Let’s consider some scenarios. If original research were powered to 50%, a highly optimistic benchmark (Button et al, 2013;Sedlmeier Gigerenzer, 1989), here is what it looks like:

So replications claiming 80% power actually have just 51% (Details | R code).

Ok. What if original research were powered at a more realistic level of, say, 35%:
The figures show that the extent of overclaiming depends on the power of the original study. Because nobody knows what that is, nobody knows how much power a replication claiming 80%, 90% or 95% really has.

A self-righteous counterargument
A replicator may say:

Well, if the original author underpowered her studies, then she is getting what she deserves when the replications fail; it is not my fault my replication is underpowered, it is hers. SHE SHOULD BE DOING POWER ANALYSIS!!!

Three problems.
1. Replications in particular and research in general are not about justice. We should strive to maximize learning, not schadenfreude.

2. The original researcher may have thought the effect was bigger than it is, she thought she had  80% power, but she had only 50%. It is not “fair” to “punish” her for not knowing the effect size she is studying. That’s precisely why she is studying it.

3. Even if all original studies had 80% power, most published estimates would be over-estimates, and so even if  all original studies had 80% power, most replications based on observed effects would overclaim power. For instance, one in five replications claiming 80% would actually have <50% power (R code).


What’s the alternative?
In a recent paper (“Evaluating Replication Results”) I put forward a different approach to thinking about replication results altogether. For a replication to fail it is not enough that p>.05 in it, we need to also conclude the effect is too small to have been detected in the original study (in effect, we need tight confidence intervals around 0). Underpowered replications will tend to fail to reject 0, be n.s., but will also tend to fail to reject big effects. In the new approach this result is considered as uninformative rather than as a “failure-to-replicate.” The paper also derives a simple rule for sample size to be properly powered for obtaining informative failures to replicate:  2.5 times the original sample size ensures 80% power for that test. That number is unaffected by publication bias, how original authors power their studies, and the study design (e.g., two-proportions vs. ANOVA).

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

[1] "Just Posting It" works, leads to new retraction in Psychology

The fortuitous discovery of new fake data.
For a project I worked on this past May, I needed data for variables as different from each other as possible. From the data-posting journal Judgment and Decision Making I downloaded data for ten, including one from a now retracted paper involving the estimation of coin sizes. I created a chart and inserted it into a paper that I sent to several colleagues, and into slides presented at an APS talk.

An anonymous colleague, “Larry,” saw the chart and, for not-entirely obvious reasons, became interested in the coin-size study. After downloading the publicly available data he noticed something odd (something I had not noticed): while each participant had evaluated four coins, the data contained only one column of estimates. The average? No, for all entries were integers; averages of four numbers are rarely integers. Something was off.

Interest piqued, he did more analyses leading to more anomalies. He shared them with the editor, who contacted the author. The author provided explanations. These were nearly as implausible as they were incapable of accounting for the anomalies. The retraction ensued.

Some of the anomalies
1. Contradiction with paper
Paper describes 0-10 integer scale, dataset has decimals and negative numbers.

2. Implausible correlations among emotion measures
Shame and embarrassment are intimately related emotions, and yet they are correlated negatively in the data r = -.27. Fear and anxiety: r = -.01. Real emotion ratings don’t exhibit these correlations.

3. Impossibly similar results
Fabricated data often exhibit a pattern of excessive similarity (e.g., very similar means across conditions). This pattern led to uncovering Sanna and Smeesters as fabricateurs (see “Just Post It” paper). Diederik Stapel’s data also exhibit excessive similarity, going back to his dissertation at least.

The coin-size paper also has excessive similarity. For example, coin-size estimates supposedly obtained from 49 individuals across two different experiments are almost identical:
Experiment 1 (n=25): 2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,6,7
Experiment 2 (n=24): 2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,_,6,6,6,6,6,6,6,7

Simulations drawing random samples from the data themselves (bootstrapping) show that it is nearly impossible to obtain such similar results. The hypothesis that these data came from random samples is rejected, p<.000025 (see R code, detailed explanation).

Who vs. which
These data are fake beyond reasonable doubt.  We don’t know, however, who faked them.
That question is of obvious importance to the authors of the paper and perhaps their home and granting institutions, but arguably not so much to the  research community more broadly. We should care, instead, about which data are fake.

If other journals followed the lead of Judgment and Decision Making and required data posting (its  editor Jon Baron, by the way,  started the data posting policy well before I wrote my “Just Post It”), we would have a much easier time identifying invalid data.  Some of the coin-size authors have  a paper in JESP, one in Psychological Science, and another with similar results  in Appetite.  If the data behind those papers were available, we would not need to speculate as to their validity.

Author’s response
When discussing the work of others, our policy here at Data Colada is to contact them before posting. We ask for feedback to avoid inaccuracies and misunderstandings, and  give authors space for commenting within our original blog post. The corresponding author of the retracted article,  Dr. Wen-Bin Chiou, wrote to me via email:

Although the data collection and data coding was done by my research assistant, I must be responsible for the issue.Unfortunately, the RA had left my lab last year and studied abroad. At this time, I cannot get the truth from him and find out what was really going wrong […] as to the decimal points and negative numbers, I recoded the data myself and sent the editor with the new dataset. I guess the problem does not exist in the new dataset. With regard to the impossible similar results, the RA sorted the coin-size estimate variable, producing the similar results. […]  Finally, I would like to thank Dr. Simonsohn for including my clarifications in this post.
[See unedited version]

Uri’s note: the similarity of data is based on the frequency of values across samples, not their order, so sorting does not explain  that the data are incompatible with random sampling.