Sometimes we selectively report the analyses we run to test a hypothesis. Other times we selectively report which hypotheses we tested. One popular way to p-hack hypotheses involves subgroups. Upon realizing that analyses of the entire sample do not produce a significant effect, we check whether analyses of various subsamples — women, or the young, or Republicans, or…
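To see why this inflates false positives, here is a quick simulation of the procedure (my own illustration, not code from the post): draw data with no true effect, test the full sample, then test within a handful of arbitrary subgroups, and count how often at least one comparison comes out "significant."

```python
# Illustrative simulation (not from the post): how often does "no effect overall,
# but check a few subgroups" yield at least one p < .05 when the true effect is zero?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n = 5000, 200           # number of simulated studies, total sample size per study
n_subgroups = 4                 # e.g., gender, age split, party, region (arbitrary labels)
false_positives = 0

for _ in range(n_sims):
    treat = rng.integers(0, 2, n)                  # random assignment to condition
    y = rng.normal(0, 1, n)                        # outcome with NO true effect
    groups = rng.integers(0, 2, (n, n_subgroups))  # arbitrary binary subgroup labels

    # one full-sample test, plus one test inside each level of each subgroup variable
    pvals = [stats.ttest_ind(y[treat == 1], y[treat == 0]).pvalue]
    for j in range(n_subgroups):
        for level in (0, 1):
            m = groups[:, j] == level
            pvals.append(stats.ttest_ind(y[m & (treat == 1)], y[m & (treat == 0)]).pvalue)
    false_positives += min(pvals) < .05

print(f"At least one 'significant' result in {false_positives / n_sims:.0%} of null studies")
```

The point of the exercise is that the familywise error rate across the nine tests sits far above the nominal 5%.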
[47] Evaluating Replications: 40% Full ≠ 60% Empty
Last October, Science published the paper “Estimating the Reproducibility of Psychological Science” (.htm), which reported the results of 100 replication attempts. Today it published a commentary by Gilbert et al. (.htm) as well as a response by the replicators (.htm). The commentary makes two main points. First, because of sampling error, we should not expect all of…
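That sampling-error point is, at bottom, a power calculation; the sketch below (with assumed effect sizes and replication sample sizes, not numbers from any of the papers) shows that even if every original effect were real, the expected share of replications reaching p < .05 is simply the replications' power.

```python
# Illustration (assumed numbers, not from the commentary or the reply): even if all
# original effects were real, only about power% of replications should hit p < .05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d, n_per_cell in [(0.2, 50), (0.4, 50), (0.4, 100)]:   # hypothetical true effects and replication n's
    power = analysis.power(effect_size=d, nobs1=n_per_cell, alpha=0.05)
    print(f"true d={d}, n={n_per_cell}/cell -> expected 'successful' replications: {power:.0%}")
```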
[43] Rain & Happiness: Why Didn’t Schwarz & Clore (1983) ‘Replicate’?
In my “Small Telescopes” paper, I introduced a new approach to evaluate replication results (SSRN). Among other examples, I described two studies as having failed to replicate the famous Schwarz and Clore (1983) finding that people report being happier with their lives when asked on sunny days. [Figure and text from Small Telescopes paper (SSRN)] I…
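For context, the Small Telescopes test asks whether the replication can rule out effects large enough to have given the original study reasonable power; below is a rough sketch of that calculation using the paper's 33% benchmark, with made-up sample sizes and a made-up replication estimate.

```python
# Sketch of the Small Telescopes logic: find d_33%, the effect size the original study
# had only 33% power to detect, then ask whether the replication's estimate is
# significantly smaller than that. All sample sizes and estimates here are made up.
from scipy import stats
from statsmodels.stats.power import TTestIndPower

n_orig = 30     # hypothetical original n per cell
n_rep = 120     # hypothetical replication n per cell

# effect size the original design had 33% power to detect
d33 = TTestIndPower().solve_power(nobs1=n_orig, power=1/3, alpha=0.05)

# replication "fails" in Small-Telescopes terms if its estimate is significantly below d_33%
d_rep = 0.05                      # hypothetical replication estimate of d
se_rep = (2 / n_rep) ** 0.5       # rough standard error of d with equal cells
z = (d_rep - d33) / se_rep
print(f"d_33% = {d33:.2f}; one-sided p that the replication effect < d_33%: {stats.norm.cdf(z):.3f}")
```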
[42] Accepting the Null: Where to Draw the Line?
We typically ask if an effect exists. But sometimes we want to ask if it does not. For example, how many of the “failed” replications in the recent reproducibility project published in Science (.pdf) suggest the absence of an effect? Data have noise, so we can never say ‘the effect is exactly zero.’ We can…
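One way to draw such a line, sketched below purely as an illustration (it is not necessarily the rule the post proposes), is to ask whether the estimate's confidence interval excludes every effect at least as large as a pre-specified smallest effect of interest.

```python
# Illustration (not necessarily the line the post draws): treat the null as 'accepted'
# when the confidence interval excludes every effect at least as large as some
# smallest effect of interest, d_min, chosen in advance.
import numpy as np
from scipy import stats

d_min = 0.3                      # hypothetical smallest effect size of interest
d_hat, se = 0.04, 0.09           # hypothetical estimate of d and its standard error

lo, hi = d_hat + np.array([-1, 1]) * stats.norm.ppf(0.95) * se   # 90% confidence interval
consistent_with_zero = lo < 0 < hi
rules_out_d_min = (hi < d_min) and (lo > -d_min)
print(f"90% CI = [{lo:.2f}, {hi:.2f}]; includes zero: {consistent_with_zero}; "
      f"rules out |d| >= {d_min}: {rules_out_d_min}")
```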
[41] Falsely Reassuring: Analyses of ALL p-values
It is a neat idea. Get a ton of papers. Extract all p-values. Examine the prevalence of p-hacking by assessing whether there are too many p-values near p=.05. Economists have done it [SSRN], as have psychologists [.html] and biologists [.html]. These charts with distributions of p-values come from those papers. The dotted circles highlight the excess of…
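The diagnostic itself is easy to reproduce: bin the p-values and check whether the bin just under .05 is overpopulated relative to its neighbors. The sketch below uses simulated p-values as a stand-in for values scraped from papers.

```python
# Minimal sketch of the diagnostic (simulated p-values stand in for ones scraped
# from real papers): is the bin just under .05 overpopulated relative to its neighbor?
import numpy as np

rng = np.random.default_rng(7)
# Fake corpus: mostly right-skewed p-values (as true effects produce), plus some
# extra mass just under .05 (as p-hacking produces)
p = np.concatenate([rng.beta(0.5, 5, 9000),
                    rng.uniform(0.040, 0.050, 300)])

bins = np.arange(0, 0.11, 0.01)                 # look at p-values from 0 to .10
counts, _ = np.histogram(p, bins=bins)
for lo_edge, c in zip(bins[:-1], counts):
    print(f"[{lo_edge:.2f}, {lo_edge + 0.01:.2f}): {c}")
print("More p-values in [.04,.05) than in [.03,.04)?", counts[4] > counts[3])
```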
[40] Reducing Fraud in Science
Fraud in science is often attributed to incentives: we reward sexy results → fraud happens. The solution, the argument goes, is to reward other things. In this post I counter-argue, proposing three alternative solutions. Problems with the “change the incentives” solution. First, even if rewarding sexy results caused fraud, it does not follow that we should stop rewarding sexy results. We…
[39] Power Naps: When do Within-Subject Comparisons Help vs Hurt (yes, hurt) Power?
A recent Science paper (.html) used a total sample size of N=40 to arrive at the conclusion that implicit racial and gender stereotypes can be reduced while napping. N=40 is a small sample for a between-subject experiment. One needs N=92 to reliably detect that men are heavier than women (SSRN). The study, however, was within-subject; for instance, its dependent…
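The between-subject benchmark is a standard power calculation, and the within-subject side turns on the correlation between the repeated measures; the sketch below works through both, taking d ≈ 0.59 (the effect size implied by the N=92 benchmark) and treating the correlations as hypothetical.

```python
# Two power facts behind the excerpt's point (everything besides N=92 is an assumption):
#  (1) between subjects: detecting d ~ 0.59 (the gender difference in weight implied by
#      the N=92 benchmark) with 80% power takes roughly 46 per cell.
#  (2) within subject, the effect size on difference scores is d / sqrt(2 * (1 - r)),
#      so power improves when the repeated measures correlate highly and can suffer
#      when the correlation is low or negative.
from statsmodels.stats.power import TTestIndPower, TTestPower

d = 0.59
n_between = TTestIndPower().solve_power(effect_size=d, power=0.8, alpha=0.05)
print(f"between-subjects: n per cell ~ {n_between:.0f} (total ~ {2 * n_between:.0f})")

for r in (0.0, 0.5, 0.9):                      # hypothetical test-retest correlations
    d_within = d / (2 * (1 - r)) ** 0.5        # effect size on the difference scores
    n_within = TTestPower().solve_power(effect_size=d_within, power=0.8, alpha=0.05)
    print(f"within-subject, r={r}: n ~ {n_within:.0f}")
```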
[36] How to Study Discrimination (or Anything) With Names; If You Must
Consider these paraphrased famous findings: “Because his name resembles ‘dentist,’ Dennis became one” (JPSP, .pdf); “Because the applicant was black (named Jamal instead of Greg), he was not interviewed” (AER, .pdf); “Because the applicant was female (named Jennifer instead of John), she got a lower offer” (PNAS, .pdf). Everything that matters (income, age, location, religion) correlates with…
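The underlying problem is a confound: a pair of names manipulates more than the one attribute it is meant to signal. The toy simulation below (entirely made up) shows a "race effect" emerging from names that differ only in the social class they connote.

```python
# Toy simulation (entirely made up): names signal race AND something else
# (say, perceived social class), so a Jamal-vs-Greg gap need not be a race effect.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
black_name = rng.integers(0, 2, n)                            # 1 = resume carries the 'black' name
perceived_class = -0.8 * black_name + rng.normal(0, 1, n)     # the names also differ in class signal

# callbacks depend ONLY on perceived class here: the true race effect is zero
callback = (0.5 * perceived_class + rng.normal(0, 1, n)) > 0.5

gap = callback[black_name == 1].mean() - callback[black_name == 0].mean()
print(f"callback gap attributed to 'race': {gap:+.3f} (true race effect is zero)")
```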
[35] The Default Bayesian Test is Prejudiced Against Small Effects
When considering any statistical tool I think it is useful to answer the following two practical questions: 1. “Does it give reasonable answers in realistic circumstances?” 2. “Does it answer a question I am interested in?” In this post I explain why, for me, when it comes to the default Bayesian test that's starting to…
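For readers who want to poke at it themselves, below is a self-contained sketch of the default (JZS) Bayes factor for a one-sample t-test, using the standard integral form with the conventional Cauchy prior scale r = sqrt(2)/2; the true effect size and sample size in the simulation are assumptions chosen to be "small but real."

```python
# Sketch: default (JZS) Bayes factor for a one-sample t-test, via the standard
# integral representation (Cauchy prior on the effect size, scale r). Then simulate
# studies with a SMALL but real effect and see how often the default test favors
# the null. The true effect size and sample size below are illustrative assumptions.
import numpy as np
from scipy import stats, integrate

def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    """Bayes factor for H1 (Cauchy(0, r) prior on delta) over H0 (delta = 0)."""
    v = n - 1
    null_lik = (1 + t**2 / v) ** (-(v + 1) / 2)

    def integrand(g):
        return ((1 + n * g) ** (-0.5)
                * (1 + t**2 / (v * (1 + n * g))) ** (-(v + 1) / 2)
                * r / np.sqrt(2 * np.pi) * g ** (-1.5) * np.exp(-r**2 / (2 * g)))

    alt_lik, _ = integrate.quad(integrand, 0, np.inf)
    return alt_lik / null_lik

rng = np.random.default_rng(5)
true_d, n, sims = 0.2, 50, 500          # small real effect, modest sample (assumptions)
favors_null = 0
for _ in range(sims):
    x = rng.normal(true_d, 1, n)
    t = stats.ttest_1samp(x, 0).statistic
    favors_null += jzs_bf10(t, n) < 1    # BF10 < 1: the data favor the null
print(f"share of studies where the default BF favors the null: {favors_null / sims:.0%}")
```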
[34] My Links Will Outlive You
If you are like me, from time to time your papers include links to online references. Because the internet changes so often, by the time readers follow those links, who knows if the cited content will still be there. This blog post shares a simple way to ensure your links live “forever.” I got the idea…
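The excerpt cuts off before the actual method, so the snippet below is only a generic illustration of the idea (snapshotting a cited URL so the citation survives link rot), using the Internet Archive's public Save Page Now endpoint rather than whatever service the post recommends.

```python
# Illustration only: the post's actual recommendation is cut off in this excerpt.
# One general way to keep a cited link alive is to request an archived snapshot
# (here via the Internet Archive's Save Page Now endpoint) and cite that copy too.
import requests

def archive_url(url):
    """Ask the Wayback Machine to snapshot `url`; return the address we end up at
    (typically the archived copy)."""
    resp = requests.get("https://web.archive.org/save/" + url, timeout=60)
    resp.raise_for_status()
    return resp.url

print(archive_url("http://datacolada.org/"))
```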