[53] What I Want Our Field To Prioritize

When I was a sophomore in college, I read a book by Carl Sagan called The Demon-Haunted World. By the time I finished it, I understood the difference between what is scientifically true and what is not. It was not obvious to me at the time: If a hypothesis is true, then you can use it to predict the future. If a hypothesis is false, then you can’t. Replicable findings are true precisely because you can predict that they will replicate. Non-replicable findings are not true precisely because you can’t. Truth is replicability. This lesson changed my life. I decided to try to become a scientist.

Although this lesson inspired me to pursue a career as a psychological scientist, for a long time I didn’t let it affect how I actually pursued that career. For example, during graduate school Leif Nelson and I investigated the hypothesis that people strive for outcomes that resemble their initials. For example, we set out to show that (not: test whether) people with an A or B initial get better grades than people with a C or D initial. After many attempts (we ran many analyses and we ran many studies), we found enough “evidence” for this hypothesis, and we published the findings in Psychological Science. At the time, we believed the findings and this felt like a success. Now we both recognize it as a failure.

The findings in that paper are not true. Yes, if you run the exact analyses we report on our same datasets, you will find significant effects. But they are not true because they would not replicate under specifiable conditions. History is about what happened. Science is about what happens next. And what happens next is that initials don’t affect your grades.

Inspired by discussions with Leif, I eventually (in 2010) reflected on what I was doing for a living, and I finally remembered that at some fundamental level a scientist’s #1 job is to differentiate what is true/replicable from what is not. This simple realization forever changed the way I conduct and evaluate research, and it is the driving force behind my desire for a more replicable science. If you accept this premise, then life as a scientist becomes much easier and more straightforward. A few things naturally follow.

First, it means that replicability is not merely a consideration, but the most important consideration. Of course I also care about whether findings are novel or interesting or important or generalizable, or whether the authors of an experiment are interpreting their findings correctly. But none of those considerations matter if the finding is not replicable. Imagine I claim that eating Funyuns® cures cancer. This hypothesis is novel and interesting and important, but those facts don’t matter if it is untrue. Concerns about replicability must trump all other concerns. If there is no replicability, there is no finding, and if there is no finding, there is no point assessing whether it is novel, interesting, or important. [1] Thus, more than any other attribute, journal editors and reviewers should use attributes that are diagnostic of replicability (e.g., statistical power and p-values) as a basis for rejecting papers. (Thank you, Simine Vazire, for taking steps in this direction at SPPS <.pdf>). [2]

Second, it means that the best way to prevent others from questioning the integrity of your research is to publish findings that you know to be replicable under specifiable conditions. You should be able to predict that if you do exactly X, then you will get Y. Your method section should be a recipe for getting an effect, specifying exactly which ingredients are sufficient to produce it. Of course, the best way to know that your finding replicates is to replicate it yourself (and/or to tie your hands by pre-registering your exact key analysis). This is what I now do (particularly after I obtain a p > .01 result), and I sleep a lot better because of it.

Third, it means that if someone fails to replicate your past work, you have two options. You can either demonstrate that the finding does replicate under specifiable/pre-registered conditions or you can politely tip your cap to the replicators for discovering that one of your published findings is not likely to be true. If you believe that your finding is replicable but don’t have the resources to run the replication, then you can pursue a third option: Specify the exact conditions under which you predict that your effect will emerge. This allows others with more resources to test that prediction. If you can’t specify testable circumstances under which your effect will emerge, then you can’t use your finding to predict the future, and, thus, you can’t say that it is true.

Andrew Meyer and his colleagues recently published several highly powered failures to reliably replicate my and Leif’s finding (.pdf; see Study 13) that disfluent fonts change how people predict sporting events (.pdf; see Table A6). We stand by the central claims of our paper, as we have replicated the main findings many times. But Meyer et al. showed that we should not  – and thus we do not – stand by the findings of Study 13. Their evidence that it doesn’t consistently replicate (20 games; 12,449 participants) is much better than our evidence that it does (2 games; 181 participants), and we can look back on our results and see that they are not convincing (most notably, p = .03). As a result, all we can do is to acknowledge that the finding is unlikely to be true. Meyer et al.’s paper wasn’t happy news, of course, but accepting their results was so much less stressful than mounting a protracted, evidence-less defense of a finding that we are not confident would replicate. Having gone that route before, I can tell you that this one was about a million times less emotionally punishing, in addition to being more scientific. It is a comfort to know that I will no longer defend my own work in that way. I’ll either show you’re wrong, or I’ll acknowledge that you’re right.

Fourth, it means advocating for policies and actions that enhance the replicability of our science. I believe that the #1 job of the peer review process is to assess whether a finding is replicable, and that we can all do this better if we know exactly what the authors did in their study, and if we have access to their materials and data. I also believe that every scientist has a conflict of interest – we almost always want the evidence to come out one way rather than another – and that those conflicts of interest lead even the best of us to analyze our data in a way that makes us more likely to draw our preferred conclusions. I still catch myself p-hacking analyses that I did not pre-register. Thus, I am in favor of policies and actions that make it harder/impossible for us to do that, including incentives for pre-registration, the move toward including exact replications in published papers, and the use of methods for checking that our statistical analyses are accurate and that our results are unlikely to have been p-hacked (e.g., because the study was highly powered).

I am writing all of this because it’s hard to resolve a conflict when you don’t know what the other side wants. I honestly don’t know what those who are resistant to change want, but at least now they know what I want. I want to be in a field that prioritizes replicability over everything else. Maybe those who are resistant to change believe this too, and their resistance is about the means (e.g., public criticism) rather than the ends. Or maybe they don’t believe this, and think that concerns about replicability should take a back seat to something else. It would be helpful for those who are resistant to change to articulate their position. What do you want our field to prioritize, and why?

  1. I sometimes come across the argument that a focus on replicability will increase false-negatives. I don’t think that is true. If a field falsely believes that Funyuns will cure cancer, then the time and money that may have been spent discovering true cures will instead be spent studying the Funyun Hypothesis. True things aren’t discovered when resources are allocated to studying false things. In this way, false-positives cause false-negatives. []
  2. At this point I should mention that although I am an Associate Editor at SPPS, what I write here does not reflect journal policy. []

[52] Menschplaining: Three Ideas for Civil Criticism

As bloggers, commentators, reviewers, and editors, we often criticize the work of fellow academics. In this post I share three ideas to be more civil and persuasive when doing so.

But first: should we comment publicly in the first place?
One of the best-known social psychologists, Susan Fiske (.htm), last week circulated a draft of an invited opinion piece (.pdf) in which she called academics who critically discuss published research on social media and blogs a long list of names, including ‘self-appointed data police’ [1].

I think data-journalist is a more accurate metaphor than data-police. Like journalists, and unlike police officers, (academic) bloggers have neither physical nor legal powers; they merely exercise free speech, sharing analyses and non-binding opinions that are valued by the people who choose to read them (in contrast, we are not free to ignore the police). Like journalists’, bloggers’ power hinges on being interesting, right, and persuasive.

Importantly, unlike journalists, most academic bloggers have training in the subject matter similar to that of the original authors whose work they discuss, and they inhabit the same social and professional circles as well. So bloggers are elite journalists: more qualified and better incentivized to be right [2].

Notorious vs influential
Bloggers and other commentators, as Susan Fiske reminds us, can fall into the temptation of seeking more attention by acting out, say, using colorful language to make vague and unsubstantiated accusations. But the consequences are borne by the commentators themselves.

Acting out ends up hurting commentators more than those commented on. Being loud makes you notorious, not influential (think Ann Coulter). Moreover, when you have a good argument, acting out is counterproductive: it distracts from the argument. You become less persuasive. Only those who already agree with you will respond to your writing. Academics who make a living within academia have no incentive to be notorious.

Despite the incentives to be civil, there is certainly room in public discussions for greater civility. If the president of APS had asked me to write a non-peer-reviewed article on this topic, I would have skipped the name-calling and gotten to the following three ideas.

Idea 1. Don’t label, describe
It is tempting to label the arguments we critique, saying about them things like ‘faulty logic,’ ‘invalid analyses,’ ‘unwarranted conclusions.’ These terms sound specific, but are ultimately empty phrases that cannot be evaluated by readers. When we label, all we are really saying is “Listen, I am so upset about this, I am ready to throw some colorful terms around.”

An example from my own (peer-reviewed & published) writing makes me cringe every time:
Their rebuttal is similarly lacking in diligence. The specific empirical concerns it raised are contradicted by evidence, logic, or both.
What a douche.

Rather than vague but powerful-sounding labels, “lacking diligence,” “contradicted by evidence,” it is better to describe the rationale for those labels. What additional analyses should they have run, and how do those analyses contradict their conclusions? I should’ve written:
The rebuttal identifies individual examples that intuitively suggest my analyses were too conservative, but, on the one hand, closer examination shows the examples are not actually conservative, and on the other, the removal of those examples leaves the results unchanged.

Now readers know what to look for to decide if they agree with me. The labels become redundant; we can drop them.

Idea 2. Don’t speculate about motives
We often assume our counterparts have bad intentions: a hidden agenda, an ulterior nefarious motive. They do this because they are powerful, or if not, then because they are powerless. For instance, I once wrote:
“They then, perhaps disingenuously, argued that …”
Jerk.

There are two problems with speculating about motives. First, it is delusional to think we know why someone did something just by seeing what they did, especially when it is in our interest to believe their intentions are not benign. We don’t know why people do what they do, and it is too easy to assume they do it for reasons that would make us happier or holier. Second, intentions are irrelevant. If someone publishes a critique of p-curve because they hate Joe’s, Leif’s, and/or my guts, all that matters is whether they are right or wrong; so when discussing the critique, all we should focus on is whether it is right or wrong.

Idea 3. Reach out
Probably the single thing that has helped me improve the most in the civility department is our policy on this blog of contacting authors whose work we discuss before making things public. It is amazing how much it helps: after receiving the feedback, by addressing things that tick people off that we would never have guessed, and before, by anticipating remarks that may be irritating and dropping them.

I obviously cannot do this as a reviewer or editor; in those cases I still apply Ideas 1 & 2. Also, as a heuristic check on tone, I imagine I am going to dinner with the authors and their parents that night.

1) I disagree with the substance of Susan Fiske’s piece. Academics discussing research in blogs and social media are elite data-journalists playing an indispensable role in modern science: disseminating knowledge, detecting and correcting errors, and facilitating open discussions.
2) I object to its tone: calling colleagues terrorists, among many other things, increases the notoriety of our writing but reduces its influence. (It’s also disrespectful.)
3) I shared three ideas to improve civility in public discourse: Don’t label, don’t infer motives, and reach out.


Author feedback.
I shared a draft of this post with Susan Fiske who suggested I make clear the document that circulated was a draft which she will revise before publication. I edited the writing to reflect this.


  1. Her piece included the following terms to describe bloggers or their actions: (1) Mob, (2) Online vigilantes, (3) Self-appointed data police, (4) Personal ferocity, (5) crashing people, (6) Unmoderated attacks, (7) Unaccountable bullies, (8) Adversarial viciousness, (9) Methodological terrorists, (10) Dangerous minority, (11) Destructo-critics, (12) They attack the person (oops), (13) Self-appointed critics. []
  2. and less interesting, and worse at writing, and with less exciting social lives, but with more stable jobs. []

[51] Greg vs. Jamal: Why Didn’t Bertrand and Mullainathan (2004) Replicate?

Bertrand & Mullainathan (2004, .pdf) is one of the best known and most cited American Economic Review (AER) papers [1]. It reports a field experiment in which resumes given typically Black names (e.g., Jamal and Lakisha) received fewer callbacks than those given typically White names (e.g., Greg and Emily). This finding is interpreted as evidence of racial discrimination in labor markets.

Deming, Yuchtman, Abulafi, Goldin, and Katz (2016, .pdf) published a paper in the most recent issue of AER. It also reports an audit study in which resumes were sent to recruiters. While its goal was to estimate the impact of for-profit degrees, the authors also manipulated the race of job applicants via names. They did not replicate the (secondary to them) famous finding, one that at least three other studies had replicated (.ssrn | .pdf | .pdf). Why? What was different in this replication? [2]

Small Telescope Test 
The new study had more than twice the sample size (N = 10,484) of the old study (N = 4,870). But it also included Hispanic names; the sample of White and Black names was nevertheless about 50% larger than in the old study.

Callback rates in the two studies were (coincidentally) almost identical: 8.2% in both. In the famous study, Whites were 3 percentage points (pp) more likely to receive a callback than non-Whites. In the new study they were 0.4 pp less likely than non-Whites, not significantly different from 0, χ2(1)=.61, p=.44 [3].

The small-telescopes test (SSRN) asks whether the replication result rules out effects big enough to be detectable by the original. The confidence interval around the new estimate spans from 1.3 pp against Whites to 0.5 pp in favor of Whites. With the original sample size, an effect as small as that upper end, +0.5 pp, would imply a meager 9% statistical power. The new study’s results are inconsistent with effects big enough to be detectable by the older study, so we accept the null. (For more on accepting the null: DataColada[42].)
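For readers who want to check that power figure, here is a minimal sketch in R; the per-group sample size (~2,435 resumes per race) and the 8.2% baseline callback rate are approximations rather than the exact numbers in the posted data:

# Approximate power of the original design (N = 4,870, roughly 2,435 resumes per race)
# to detect a 0.5 pp difference around the 8.2% baseline callback rate.
power.prop.test(n = 2435, p1 = .082, p2 = .087, sig.level = .05)
# the reported power comes out at roughly 9%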

Why no effect of race?
Deming et al. list a few differences between their study and Bertrand & Mullainathan’s, but do not empirically explore their role [4]:
[Image: quoted passages from Deming et al. listing the differences]
In addition to these differences, the new study used a different set of names. They are not listed in the paper, but the first author quickly provided them upon request.

That names thing is key for this post.

In DataColada[36], I wrote about a possible Socioeconomic Status (SES) confound in Bertrand and Mullainathan (2004) [5]. Namely, their Black names (e.g., Rasheed, Leroy, Jamal) seem low SES, while the White names (e.g., Greg, Matthew, Todd) do not. The Black names from the new study (e.g., Malik, Darius, and Andre) do not seem to share this confound. Below is the picture I used to summarize Colada[36]; I modified it (with green font) to capture why I am interested in the new study’s Black names [6].

[Figure: summary image from Colada[36], annotated in green]
But I am a Chilean Jew with a PhD; my own views on whether a particular name hints at SES may not be terribly representative of those of an everyday American. So I collected some data.

Study 1
In this study I randomly paired a Black name used in the old study with a Black name used in the new study and N=202 MTurkers answered this question (Qualtrics survey file and data: .htm):
[Figure: Study 1 question]
Each respondent saw one random pair of names, in counterbalanced order. The results were more striking than I expected [7].
[Figure: Study 1 results]
Note: Tyrone was used in both studies so I did not include it.

Study 2 
One possible explanation for Study 1’s result (and the failure to replicate in the recent AER article) is that the names used in the new study were not perceived to be Black names. In Study 2, I asked N=201 MTurkers to determine the race and gender of some of the names from the old and new studies [8].
[Figure: Study 2 results]
Indeed the Black names in the new study were perceived as somewhat less Black than those used in the old one. Nevertheless, they were vastly more likely to be perceived as Black than were the control names. Based on this figure alone, we would not expect discrimination against Black names to disappear in the new study. But it did.

In Sum.
The lower callback rates for Jamal and Lakisha in the classic 2004 AER paper, and the successful replications mentioned earlier, are as consistent with racial as with SES discrimination. The SES account also parsimoniously explains this one failure to replicate the effect. But this conclusion is tentative at best; we are comparing studies that differ on many dimensions (and the new study had some noteworthy glitches; read footnote 4). To test racial discrimination in particular, and name effects in general, we need the same study to orthogonally manipulate race and SES, or at least to use names pretested to differ only on the dimension of interest. I don’t think any audit study has done that.


Author feedback.
Our policy is to contact authors whose work we discuss to request feedback and give an opportunity to respond within our original post. I contacted the 7 authors behind the two articles and received feedback from 5 of them. I hope to have successfully incorporated their suggestions; they focused on clearer use of “replication” terminology, and broader representation of the relevant literature.


  1. According to Web-of-science (WOS), its 682 citations put it as the 9th most cited article published since 2000. Aside: The WOS should report an age-adjusted citation index. []
  2. For a review of field experiments on discrimination see Bertrand & Duflo (.pdf) []
  3. I carried out these calculations using the posted data (.dta | .csv). The new study reports regression results with multiple covariates and on subsamples. In particular, in Tables 3 and 4 they report results separately for jobs that do and do not require college degrees. For jobs without required degrees, White males did significantly worse: they were 4 percentage points (pp) less likely to be called back. For jobs that do require a college degree, the point estimate is a -1.5 pp effect (still anti-White bias), but with a confidence interval that includes +2.5 pp (pro-White), in the ballpark of the 3 pp of the famous study, and large enough to be detectable with its sample size. So for jobs without degree requirements, a conclusive failure to replicate; for those that require college, an inconclusive result. The famous study investigated jobs that did not require a college degree. []
  4. Deming et al. surprisingly do not break down results for Hispanics vs. Blacks in their article; I emailed the first author and he explained that, due to a glitch in their computer program, they could not differentiate Hispanic from Black resumes. This is not mentioned in their paper. This glitch generates another possible explanation: Blacks may have been discriminated against just as in the original, but Hispanics may have been favored by a larger amount, so that when the two are collapsed into a single group, the negative effect is not observable. []
  5. Fryer & Levitt (2004) have a QJE paper on the origin and consequences of distinctively Black names (.pdf) []
  6. The opening section mentions three articles that replicate Bertrand and Mullainathan’s finding of Black names receiving fewer callbacks. The Black names in those papers also seem low SES to me but I did not include them in my data-collection. For instance, the names include  Latoya, Tanisha, DeAndre, DeShawn, and Reginald. They do not include any of the high SES Black names from Deming et al. []
  7. It would be most useful to compare SES between White and Black names, but I did not include White names in this study worried respondents would think “I am not going to tell you that I think White names are higher SES than Black names.” Possibly my favorite Mike Norton study (.pdf) documents people are OK making judgments between two White or two Black faces, but reluctant between a White and a Black one []
  8. From Bertrand & Mullainathan I chose 3 Black names, the one their participants rated the Blackest, the median, and the least Black; see their Table A1. []

[50] Teenagers in Bikinis: Interpreting Police-Shooting Data

The New York Times, on Monday, showcased (.htm) an NBER working paper (.pdf) that proposed that “blacks are 23.8 percent less likely to be shot at by police relative to whites.” (p.22)

The paper involved a monumental data-collection effort to address an important societal question. The analyses are rigorous, clever, and transparently reported. Nevertheless, I do not believe the above conclusion is justified by the evidence. Relying on additional results reported in the paper, I show here that the data are consistent with police shootings being biased against Blacks, but are too noisy to support a confident conclusion either way [1],[2].

Teenagers in bikinis
As others have noted [3], an interesting empirical challenge for interpreting the shares of Whites vs. Blacks shot by police while being arrested is that biased officers, those overestimating the threat posed by a Black civilian, will arrest less dangerous Blacks on average. They will arrest those posing a real threat, but also some not posing a real threat, resulting in lower average threat among those arrested by biased officers [4].

For example, a biased officer may be more likely to perceive a Black teenager in a bikini as a physical threat (YouTube) than a non-biased officer would, lowering the average threat of his arrestees. If teenagers in bikinis, in turn, are less likely to be shot by police than armed criminals are, racial bias will cause a smaller share of Black arrestees to be shot. A spurious association showing no bias precisely because there is bias.
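To see how this selection mechanism can, by itself, produce a lower shooting rate for Black arrestees, here is a minimal simulation sketch in R; every number in it is made up for illustration and none comes from the paper:

set.seed(1)
n <- 1e6
threat <- runif(n)                               # latent threat of each civilian encountered
race <- sample(c("Black", "White"), n, replace = TRUE)
# Biased arrests: Whites are arrested only above a high threat threshold,
# Blacks above a lower one (i.e., less-threatening Blacks get arrested too).
arrested <- ifelse(race == "White", threat > .8, threat > .6)
# Shootings here depend only on actual threat; there is no racial bias at this stage.
shot <- arrested & (runif(n) < .02 * threat^4)
# Share of arrestees shot, by race: Blacks come out *less* likely to be shot,
# even though the only racial bias in the simulation is against them (at the arrest stage).
tapply(shot[arrested], race[arrested], mean)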

A closer look at the table behind the result that Blacks are 23.8% less likely to be shot, leads me to suspect the finding is indeed spurious.

[Table 5 from the paper]
Let’s focus on the red rectangle (the other columns don’t control for threat posed by the arrestee). It reports odds ratios for Black relative to White arrestees being shot, controlling for more and more variables. The numbers tell us how many Blacks are shot for every White who is. The first number, .762, is where the result that Blacks are 23.8% less likely to be shot comes from (1-.762=.238). It controls for nothing: criminals and teenagers in bikinis are placed in the same pool.

The highlighted Row 4 shows what happens when we control for, among other things, how much of a threat the arrestee posed (namely, whether s/he drew a weapon). The odds ratio jumps from  .76 to 1.1. The evidence suggesting discrimination in favor of Blacks disappears, exactly what you expect if the result is driven by selection bias (by metaphorical teenagers in bikinis lowering the average threat of arrestees).

Given how noisy the results are (big standard errors; see next point), I don’t read much into the fact that the estimate goes above 1.0 (suggesting discrimination against Blacks). I do make much of the fact that the estimate is so unstable, and that it moves dramatically in the direction predicted by the “it is driven by selection bias” explanation.

Noisy estimates
The above discussion took the estimates at face value, but they have very large standard errors, to the point they provide virtually no signal. For example:

Row 4 is compatible with Blacks being 50% less likely to be shot than Whites, but
Row 4 is compatible with Blacks being 80% more likely to be shot than Whites.

These results do not justify updating our beliefs on the matter one way or the other.

How threatening was the person shot at?
Because the interest in the topic is sparked by videos showing Black civilians killed by police officers despite posing no obvious threat to them, I would define the research question as follows:

When a police officer interacts with a civilian, is a Black civilian more likely to be shot than a White civilian is, for a given level of actual threat to the police officer and the public?

The better we can measure and take into account threat, the better we can answer that research question.

The NBER paper includes analyses that answer this question better than the analyses covered by The New York Times do. For instance, Table 8 (.png) focuses on civilians shot by police and asks: Did they have a weapon?  If there is bias against Blacks, we expect fewer of them to have had a weapon when shot, and that’s what the table reports [5].

14.9% of White civilians shot by White officers did not have a weapon.
19.0% of Black civilians shot by White officers did not have a weapon.

The observed difference is 4.1 percentage points, or about 1/3 the baseline (a larger effect size than the 23.8% behind the NY Times story). As before the estimates are noisy and not statistically significant.

When big effect-size estimates are not statistically significant, we don’t learn that the effect is zero; we learn that the sample is too small and the results are inconclusive: not newsworthy.

Ideas for more precise estimates
One solution is larger samples. Obviously, but sometimes hard to achieve.

Collecting additional proxies for threat could help too: for example, the arrestee’s criminal record, the reason for the arrest, the origin of the officer-civilian interaction (e.g., routine traffic stop vs. responding to a 911 call), and what kind of weapon the civilian had and whether it was within easy reach.

The data used for the NBER article includes long narrative accounts written by the police about the interactions. These could be stripped of race identifying information and raters be asked to subjectively evaluate the threat level right before the shooting takes place.

Finally, I’d argue we should expect not just a main effect of threat, one to be controlled for with a covariate, but an interaction. In high-threat situations the use of force may be unambiguously appropriate. Racial bias may play a larger role in lower-threat situations.


Author feedback.
Our policy is to contact authors whose work we discuss to request feedback and give an opportunity to respond within our original post. Roland Fryer, the Harvard economist author of the NBER article, generously and very promptly responded providing valuable feedback that I hope to have been able to adequately incorporate, including the last paragraphs with constructive suggestions. (I am especially grateful given how many people must be contacting him right after the New York Times articles came out.) 

PS: Josh Miller (.htm) from Bocconi had a similar set of reactions that are discussed in today’s blogpost by Andrew Gelman (.htm).


  1. The paper and supplement add to nearly 100 pages, by necessity I focus on the subset of analyses most directly relevant to the 23.8% result. []
  2. The result most inconsistent with my interpretation of the data is reported in Table 6 (.png), page 24 in the paper, comparing the share of police officers who self-reported, after the fact, whether they shot before or after being attacked. The results show an unbelievably large difference favoring Blacks; officers self-report being 44% less likely to shoot before being attacked by Black vs. White arrestees. []
  3. See e.g. these tweets by political scientist Matt Blackwell .pdf []
  4. The intuition behind this selection bias is commonly relied on to test for discrimination. It dates back at least to Becker (1957) “The Economics of Discrimination”; it’s been used in empirical papers examining discrimination in real estate, bank loans, traffic stops, teaching evaluations, etc. []
  5. The table reports vast differences in the behavior of White and Black officers; I suspect this means the analyses need to include more controls. []

[49] P-Curve Won’t Do Your Laundry, But Will Identify Replicable Findings

In a recent critique, Bruns and Ioannidis (PlosONE 2016 .pdf) proposed that p-curve makes mistakes when analyzing studies that have collected field/observational data. They write that in such cases:

“p-curves based on true effects and p-curves based on null-effects with p-hacking cannot be reliably distinguished” (abstract).

In this post we show, with examples involving sex, guns, and the supreme court, that the statement is incorrect. P-curve does reliably distinguish between null effects and non-null effects. The observational nature of the data isn’t relevant.

The erroneous conclusion seems to arise from their imprecise use of terminology. Bruns & Ioannidis treat a false-positive finding and a confounded finding as the same thing.  But they are different things. The distinction is as straightforward as it is important.

Confound vs False-positive.
We present examples to clarify the distinction, but first let’s speak conceptually.

A Confounded effect of X on Y is real, but the association arises because another (omitted) variable causes both X and Y. A new study of X on Y is expected to find that association again.

A False-positive effect of X on Y, in contrast, is not real. The apparent association between X and Y is entirely the result of sampling error. A new study of X on Y is not expected to find an association again.

Confounded effects are real and replicable, while false-positive effects are neither. Those are big differences, but Bruns & Ioannidis conflate them. For instance, they write:

“the estimated effect size may be different from zero due to an omitted-variable bias rather than due to a true effect” (p. 3; emphasis added).

Omitted-variable bias does not make a relationship untrue; it makes it un-causal.

This is not just semantics, nor merely a matter of “what do you mean by a true effect?”
We can learn something from examining replicable effects further (e.g., learn whether there is a confound and what it is; confounds are sometimes interesting). We cannot learn anything from examining non-replicable effects further.

This critical distinction between replicable and non-replicable effects  can be informed by p-curve. Replicable results, whether causal or not, lead to right-skewed p-curves. False-positive, non-replicable effects lead to flat or left-skewed p-curves.
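A quick simulation illustrates the point; this is only a sketch with arbitrary numbers (n = 100 per study, a made-up slope), not a re-analysis of anything in the paper:

set.seed(1)
one_p <- function(r) {                  # p-value from one study (n = 100) where y depends on x with slope r
  x <- rnorm(100); y <- r * x + rnorm(100)
  cor.test(x, y)$p.value
}
p_true <- replicate(5000, one_p(.3))    # a real (possibly confounded) association
p_null <- replicate(5000, one_p(0))     # no association at all
# Keep only the 'published' (significant) results and look at their distribution:
hist(p_true[p_true < .05], breaks = seq(0, .05, .01))  # right-skewed: most significant p-values fall below .01
hist(p_null[p_null < .05], breaks = seq(0, .05, .01))  # flat: significant p-values spread roughly evenly across bins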

P-curve’s inability to distinguish causal vs. confounded relationships is no more of a shortcoming than is its inability to fold laundry or file income tax returns. Identifying causal relationships is not something we can reasonably expect any statistical test to do [1].

When researchers try to assess causality through techniques such as instrumental variables, regression discontinuity, or randomized field experiments, they do so via superior designs, not via superior statistical tests. The Z, t, and F tests reported in papers that credibly establish causality are the same tests as those reported in papers that do not.

Correlation is not causation. Confusing the two is human error, not tool error.

To make things concrete we provide two examples. Both use General Social Survey (GSS) data, which is, of course, observational data.

Example 1. Shotguns and female partners (Confound)
With the full GSS, we identified the following confounded association: Shotgun owners report having had 1.9 more female sexual partners, on average, than do non-owners, t(14824)=10.58, p<.0001.  The omitted variable is gender.

33% of Male respondents report owning a shotgun, whereas, um, ‘only’ 19% of Women do.

Males, relative to females, also report having had a greater number of sexual encounters with females (means of 9.82 vs. 0.21).

Moreover, controlling for gender, the effect goes away (t(14823)=.68, p=.496) [2].

So the relationship is confounded. It is real but not causal. Let’s see what p-curve thinks of it. We use data from 1994 as the focal study, and create a p-curve using data from previous years (1989-1993) following a procedure similar to Bruns and Ioannidis (2016)  [3]. Panel A in Figure 1 shows the resulting right-skewed p-curve. It suggests the finding should replicate in subsequent years. Panel B shows that it does.

[Figure 1: Panels A-D]
R Code to reproduce this figure: https://osf.io/v4spq/

Example 2. Random numbers and the Supreme Court (false-positive)
With observational data it’s hard to identify exactly zero effects because there is always the risk of omitted variables, selection bias, long and difficult-to-understand causal chains, etc.

To create a definitely false-positive finding we started with a predictor that could not possibly be expected to truly correlate with any variable: whether the random respondent ID was odd vs. even.

We then p-hacked an effect by running t-tests on every other variable in the 1994 GSS dataset for odd vs. even participants, arriving at 36 false-positive ps<.05. For its amusement value, we focused on the question asking participants how much confidence they have in the U.S. Supreme Court (1: a great deal, 2: only some, 3: hardly any).
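That hunting step is easy to mimic in simulation. The sketch below uses made-up survey variables rather than the GSS, just to show how many “findings” a random predictor serves up by chance:

set.seed(2)
n_resp <- 2000; n_vars <- 500
odd_id <- sample(c(TRUE, FALSE), n_resp, replace = TRUE)        # stand-in for odd vs. even respondent ID
survey <- matrix(rnorm(n_resp * n_vars), nrow = n_resp)         # unrelated survey responses
pvals  <- apply(survey, 2, function(y) t.test(y[odd_id], y[!odd_id])$p.value)
sum(pvals < .05)    # roughly 5% of 500, i.e., ~25 'significant' variables to choose a paper topic from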

Panel C in Figure 1 shows that, following the same procedure as for the previous example, the p-curve for this finding is flat, suggesting that the finding would not replicate in subsequent years. Panel D shows that it does not. Figure 1 demonstrates how p-curve successfully distinguishes between statistically significant studies that are vs. are not expected to replicate.

Punchline: p-curve can distinguish replicable from non-replicable findings. To distinguish correlational from causal findings, call an expert.

Note: this is a blog-adapted version of a formal reply we wrote and submitted to PlosONE, but since 2 months have passed and they have not sent it out to reviewers yet, we decided to Colada it and hope someday PlosONE generously decides to send our paper out for review.


Author feedback.
Our policy is to contact authors whose work we discuss to request feedback and give an opportunity to respond within our original post. We contacted Stephan Bruns and John Ioannidis. They didn’t object to our distinction between confounded and false-positive findings, but proposed that “the ability of ‘experts’ to identify confounding is close to non-existent.” See their full 3-page response (.pdf).



  1. For what it’s worth, we have acknowledged this in prior work. For example, in Simonsohn, Nelson, and Simmons (2014, p. 535) we wrote, “Just as an individual finding may be statistically significant even if the theory it tests is incorrect— because the study is flawed (e.g., due to confounds, demand effects, etc.)—a set of studies investigating incorrect theories may nevertheless contain evidential value precisely because that set of studies is flawed” (emphasis added). []
  2. We are not claiming, of course, that the residual effect is exactly zero. That’s untestable. []
  3. In particular, we generated random subsamples (of the size of the 1994 sample), re-ran the regression predicting number of female sexual partners with the shotgun ownership dummy, and constructed a p-curve for the subset of statistically significant results that were obtained. This procedure is not really necessary. Once we know the effect size and sample size we know the non-centrality parameter of the distribution for the test-statistic and can compute expected p-curves without simulations (see Supplement 1 in Simonsohn et al., 2014), but we did our best to follow the procedures by Bruns and Ioannidis. []

[48] P-hacked Hypotheses Are Deceivingly Robust

Sometimes we selectively report the analyses we run to test a hypothesis.
Other times we selectively report which hypotheses we tested.

One popular way to p-hack hypotheses involves subgroups. Upon realizing analyses of the entire sample do not produce a significant effect, we check whether analyses of various subsamples — women, or the young, or republicans, or extroverts — do.  Another popular way is to get an interesting dataset first, and figure out what to test with it second [1].


For example, a researcher gets data from a spelling bee competition and asks: Is there evidence of gender discrimination? How about race? Peer-effects? Saliency? Hyperbolic discounting? Weather? Yes! Then s/he writes a paper titled “Weather & (Spelling) Bees” as if that were the only hypothesis tested [2]. The probability of obtaining at least one p<.05 when testing all these hypotheses is 26% rather than the nominal 5% [3].

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks [4].

Example: Odd numbers and the horoscope
To demonstrate the problem I conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined whether respondents who were randomly assigned an odd respondent ID (1, 3, 5, …) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01 (STATA code) [5].

[Table: regression results]
People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.
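That deceiving robustness is easy to reproduce in simulation. The sketch below uses made-up variables rather than the GSS, and the covariate names (age, educ, income) are arbitrary:

set.seed(7)
n <- 1500
d <- data.frame(odd_id = rbinom(n, 1, .5),                 # random predictor, true effect is zero
                age = rnorm(n), educ = rnorm(n), income = rnorm(n))
# Hunt for an outcome that is (spuriously) related to odd_id:
repeat {
  d$y <- .5 * d$age + rnorm(n)                             # outcome driven by a covariate, not by odd_id
  if (summary(lm(y ~ odd_id, data = d))$coefficients["odd_id", 4] < .05) break
}
summary(lm(y ~ odd_id, data = d))$coefficients["odd_id", ]                        # the false positive
summary(lm(y ~ odd_id + age + educ + income, data = d))$coefficients["odd_id", ]  # the 'effect' survives the added covariates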

How to deal with p-hacked hypotheses?
Replications are the obvious way to tease apart true from false positives. Direct replications, testing the same prediction in new studies, are often not feasible with observational data.  In experimental psychology it is common to instead run conceptual replications, examining new hypotheses based on the same underlying theory.  We should do more of this in non-experimental work. One big advantage is that with rich data sets we can often run conceptual replications on the same data.

To do a conceptual replication, we start from the theory behind the hypothesis, say “odd numbers prompt use of less traditional sources of information” and test new hypotheses. For example, this theory may predict that odd numbered respondents are more likely to read blogs instead of academic articles, read nutritional labels from foreign countries, or watch niche TV shows [6].

Conceptual replications should be statistically independent from the original (under the null). [7]
That is to say, if an effect we observe is false-positive, the probability that the conceptual replication obtains p<.05 should be 5%. An example that would violate this would be testing if respondents with odd numbers are more likely to consult tarot readers. If by chance many superstitious individuals received an odd number by the GSS, they will both read the horoscope and consult tarot readers more often. Not independent under the null, hence not a good conceptual replication with the same data.

A closely related alternative is also commonly used in experimental psychology: moderation. Does the effect get smaller/larger when the theory predicts it should?

For example, I once examined how the price of infant carseats sold on eBay responded to a new safety rating by Consumer Reports (CR), and to its retraction (surprisingly, the retraction was completely effective, .pdf). A referee noted that if the effects  were indeed caused by CR information, they should be stronger for new carseats, as CR advises against buying used ones. If I had a false-positive in my hands we would not expect moderation to work (it did).

1. With field data it’s easy to p-hack hypotheses.
2. The resulting false-positive findings will be robust to alternative specifications.
3. Tools common in experimental psychology, conceptual replications and testing moderation, are viable solutions.


  1. As with most forms of p-hacking, selectively reporting hypotheses typically does not involve willful deception. []
  2. I chose weather and spelling bee as an arbitrary example. Any resemblance to actual papers is seriously unintentional. []
  3. (1-.95^6)=.2649 []
  4. Robustness tests may help with the selective reporting of hypotheses if a spurious finding is obtained due to specification error rather than sampling error. []
  5. This finding is necessarily false-positive because ID numbers are assigned after the opportunity to read the horoscope has passed, and respondents are unaware of the number they have been assigned to; but see Bem (2011 .htm) []
  6. This opens the door to more selective reporting as a researcher may attempt many conceptual replications and report only the one(s) that worked. By virtue of using the same dataset to test a fixed theory, however, this is relatively easy to catch/correct if reviewers and readers have access to the set of variables available to the researcher and hence can at least partially identify the menu of conceptual replications available. []
  7. Red font clarification added after tweet from Sanjay Srivastava .htm []

[47] Evaluating Replications: 40% Full ≠ 60% Empty

Last October, Science published the paper “Estimating the Reproducibility of Psychological Science” (.pdf), which reported the results of 100 replication attempts. Today it published a commentary by Gilbert et al. (.pdf) as well as a response by the replicators (.pdf).

The commentary makes two main points. First, because of sampling error, we should not expect all of the effects to replicate even if all of them were true. Second, differences in design between original studies and replication attempts may explain differences in results. Let’s start with the latter.[1]

Design differences
The commentators provide some striking examples of design differences. For example, they write, “An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon” (p. 1037).

People can debate whether such differences can explain the results (and in their reply, the replicators explain why they don’t think so). However, for readers to consider whether design differences matter, they first need to know those differences exist. I, for one, was unaware of them before reading Gilbert et al. (They are not mentioned in the 6-page Science article .pdf, nor in the 26-page supplement .pdf). [2]

This is not about pointing fingers, as I have also made this mistake: I did not sufficiently describe differences between original and replication studies  in my Small Telescopes paper (see Colada [43]).

This is also not about taking a position on whether any particular difference is responsible for any particular discrepancy in results. I have no idea. Nor am I arguing that design differences are a problem per se; in most cases they were even approved by the original authors.

This is entirely about improving the reporting of replications going forward. After reading the commentary I better appreciate the importance of prominently disclosing design differences. This better enables readers to consider the consequences of such differences, while encouraging replicators to anticipate and address, before publication, any concerns they may raise. [3]

Noisy results
I am also sympathetic to the commentators’ other concern, which is that sampling error may explain the low reproducibility rate. Their statistical analyses are not quite right, but neither are those by the replicators in the reproducibility project.

A study result can be imprecise enough to be consistent both with an effect existing and with it not existing. (See Colada[7] for a remarkable example from Economics). Clouds are consistent with rain, but also consistent with no rain. Clouds, like noisy results, are inconclusive.

The replicators interpreted inconclusive replications as failures, the commentators as successes. For instance, one of the analyses by the replicators considered replications as successful only if they obtained p<.05, effectively treating all inconclusive replications as failures. [4]

Both sets of authors examined whether the results from one study were within the confidence interval of the other, selectively ignoring the sampling error of one study or the other.[5]

In particular, the replicators deemed a replication successful if the original finding was within the confidence interval of the replication. Among other problems this approach leads most true effects to fail to replicate with sufficiently big replication samples.[6]

The commentators, in contrast, deemed replications successful if their estimate was within the confidence interval of the original. Among other problems, this approach leads too many false-positive findings to survive most replication efforts.[7]

For more on these problems with effect size comparisons, see p. 561 in “Small Telescopes” (.pdf).

Accepting the null
Inconclusive replications are not failed replications.

For a replication to fail, the data must support the null. They must affirm the non-existence of a detectable effect. There are four main approaches to accepting the null (see Colada [42]). Two lend themselves particularly well to evaluating replications:

(i) Small Telescopes (.pdf): Test whether the replication rejects effects big enough to be detectable by the original study, and (ii) Bayesian evaluation of replications (.pdf).

These are philosophically and mathematically very different, but in practice they often agree. In Colada [42] I reported that for this very reproducibility project, the Small Telescopes and the Bayesian approach are correlated r = .91 overall, and r = .72 among replications with p>.05. Moreover, both find that about 30% of replications were inconclusive. (R Code).  [8],[9]
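For concreteness, here is a minimal sketch of the Small Telescopes logic in R for a simple two-cell design; the sample sizes and the replication estimate are hypothetical, and the standard error is the usual approximation for Cohen’s d:

n_orig <- 30; n_rep <- 120                    # per-cell sample sizes (hypothetical)
# d33: the effect size the original study had 33% power to detect
d33 <- power.t.test(n = n_orig, power = 1/3, sig.level = .05)$delta
# The replication conclusively fails if its 90% CI excludes d33
d_rep  <- .10                                 # replication's effect-size estimate (hypothetical)
se_rep <- sqrt(2 / n_rep)                     # approximate SE of Cohen's d with equal cells
ci90   <- d_rep + c(-1, 1) * qnorm(.95) * se_rep
ci90[2] < d33                                 # TRUE -> "accept the null" by the Small Telescopes test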

40% full is not 60% empty
The opening paragraph of the response by the replicators reads:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled”

They are saying the glass is 40% full.  They are not explicitly saying it is 60% empty. But readers may be forgiven for jumping to that conclusion, and they almost invariably have.  This opening paragraph would have been equally justified:
“[…] the Open Science Collaboration observed that the original result failed to replicate in ~30 of 100 studies sampled”

It would be much better to fully report:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled, failed to replicate in ~30, and that the remaining ~30 replications were inconclusive.”

1. Replications must be analyzed in ways that allow for results to be inconclusive, not just success/fail.
2. Design differences between original and replication should be prominently disclosed.


Author feedback.
I shared a draft of this post with Brian Nosek, Dan Gilbert and Tim Wilson, and invited them and their co-authors to provide feedback. I exchanged over 20 emails total with 7 of them. Their feedback greatly improved, and considerably lengthened, this post. Colada Co-host Joe Simmons provided lots of feedback as well.  I kept editing after getting feedback from all of them, so the version you just read is probably worse and surely different from the versions any of them commented on.

Concluding remarks
My views on the state of social science and what to do about it are almost surely much closer to those of the reproducibility team than to those of the authors of the commentary. But. A few months ago I came across a “Rationally Speaking” podcast (.htm) by Julia Galef (relevant part of transcript starts on page 7, .pdf) where she talks about debating with a “steel-man” version, as opposed to a straw-man version, of an argument. It changed how I approach disagreements. For example, the Gilbert et al. commentary opens with what appears to be an incorrectly calculated probability. One could straw-man argue against the commentary by focusing on that calculation. But the argument that probability is meant to support does not hinge on precisely estimating it. There are other weak links in the commentary, but its steel-man version, the one focusing on its strengths rather than its weaknesses, did make me think better about the issues at hand, and I ended up with what I think is an improved perspective on replications.

We are greatly indebted to the collaborative work of 100s of colleagues behind the reproducibility project, and to Brian Nosek for leading that gargantuan effort (as well as many other important efforts to improve the transparency and replicability of social science). This does not mean we should not try to improve on it or to learn from its shortcomings.



  1. The commentators  actually focus on three issues: (1) (Sampling) error, (2) Statistical power, and (3) Design differences. I treat (1) and (2) as the same problem []
  2. However, the 100 detailed study protocols are available online (.htm), and so people can identify them by reading those protocols. For instance, here (.htm) is the (8 page) protocol for the military vs honeymoon study. []
  3. Brandt et al (JESP 2014) understood the importance of this long before I did, see their ‘Replication Recipe’ paper .pdf []
  4. Any true effect can fail to replicate with a small enough sample, a point made in most articles making suggestions for conducting and evaluating replications, including Small Telescopes (.pdf). []
  5. The original paper reported 5 tests of reproducibility: (i) Is the replication p<.05?, (ii) Is the original within the confidence interval of the replication?, (iii) Does the replication team subjectively rate it as successful vs failure? (iv) Is the replication directionally smaller than the original? and (v) Is the average of original and replication significantly different from zero? In the post I focus only on (i) and (ii) because: (iii)  is not a statistic with evaluative properties (but in any case, also does not include an ‘inconclusive bin’), and neither (iv) nor (v) measure reproducibility.  (iv) Measures publication bias (with lots of noise), and I couldn’t say what (v) measures. []
  6. Most true findings are inflated due to publication bias, so the unbiased estimate from the replication will eventually reject it []
  7. For example, the prototypically p-hacked p=.049 finding has a confidence interval that nearly touches zero. To obtain a replication outside that confidence interval, therefore, we need to observe a negative estimate. If the true effect is zero, that will happen only 50% of the time, so about half of false-positive p=.049 findings would survive replication attempts. []
  8. Alex Etz in his blog post did the Bayesian analyses long before I did and I used his summary dataset, as is, to run my analyses. See his PLOS ONE paper, .htm. []
  9. The Small Telescope approach finds that only 25% of replications conclusively failed to replicate, whereas the Bayesian approach says this number is about 37%. However, several of the disagreements come from results that barely accept or don’t accept the null, so the two agree more than these two figures suggest. In the last section of Colada[42] I explain what causes disagreements between the two. []

[46] Controlling the Weather

Behavioral scientists have put forth evidence that the weather affects all sorts of things, including the stock market, restaurant tips, car purchases, product returns, art prices, and college admissions.

It is not easy to properly study the effects of weather on human behavior. This is because weather is (obviously) seasonal, as is much of what people do. This means that any investigation of the relation between weather and behavior must properly control for seasonality.

For example, in the U.S., Google searches for “fireworks” correlate positively with temperature throughout the year, but only because July 4th is in the summer. This is a seasonal effect, not a weather effect.
Almost every weather paper tries to control for seasonality. This post shows they don’t control enough.

How do they do it?
To answer this question, we gathered a sample of 10 articles that used weather as a predictor. [1]
[Table 1]
In economics, business, statistics, and psychology, authors use monthly and occasionally weekly controls to account for seasonality. For instance they ask, “Does how cold it was when a coat was bought predict if it was returned, controlling for the month of the year in which it was purchased?”

That’s not enough.
The figures below show the average daily temperature in Philadelphia, along with the estimates provided by monthly (left panel) and weekly (right panel) fixed effects. These figures remind us that the weather does not jump discretely from month to month or week to week. Rather, weather, like earth, moves continuously. This means that seasonal confounds, which are continuous, will survive discrete (monthly or weekly) controls.

[Figure: average daily temperature in Philadelphia with monthly (left) and weekly (right) fixed-effect estimates]
The vertical distance between the blue lines (monthly/weekly dummies) captures the residual seasonality confound. For example, during March (just left of the ‘100 day’ tick), the monthly dummy assigns 44 degrees to every March day, but temperature systematically fluctuates within March, from a long-term average of 39 degrees on March 1st to a long-term average of 50 degrees on March 31st. This is a seasonally confounded 11-degree difference that is entirely unaccounted for by monthly dummies.

The confounded effect of seasonality that survives weekly dummies is roughly 1/4 that size.

Fixing it.
The easy solution is to control for the historical average of the weather variable of interest for each calendar date.[2]

For example, when using how cold January 24, 2013 was to predict whether a coat bought that day was eventually returned, we include as a covariate the historical average temperature for January 24th  (in that city).[3]
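To make the suggested control concrete, here is a minimal sketch in Python (pandas + statsmodels). The file names and column names (date, temp, returned) are hypothetical stand-ins for whatever purchase and weather data one actually has; the point is simply how the historical daily average is built and entered as a covariate.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical inputs: a long daily temperature history for the city, and one row
# per coat purchase with the purchase date and whether the coat was later returned (0/1).
weather = pd.read_csv("philly_weather_history.csv", parse_dates=["date"])
purchases = pd.read_csv("coat_purchases.csv", parse_dates=["date"])

# Historical average temperature for each calendar date (e.g., all Jan 24ths on record)
weather["month_day"] = weather["date"].dt.strftime("%m-%d")
hist_avg = (weather.groupby("month_day")["temp"]
                   .mean()
                   .rename("hist_avg_temp")
                   .reset_index())

# Attach the historical average to each purchase and include it as a covariate
purchases["month_day"] = purchases["date"].dt.strftime("%m-%d")
purchases = purchases.merge(hist_avg, on="month_day", how="left")
model = smf.logit("returned ~ temp + hist_avg_temp", data=purchases).fit()
print(model.summary())

With that covariate in the model, the coefficient on temp captures deviations from the seasonal norm for that calendar date rather than the seasonal cycle itself.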

Demonstrating the easy fix
To demonstrate how well this works, we analyze a correlation that is entirely due to a seasonal confound: the number of daylight hours in Bangkok, Thailand (sunset – sunrise), and the temperature that same day in Philadelphia (data: .dta | .csv). Colder days in Philadelphia tend to be shorter days in Bangkok, not because coldness in one place shortens the day in the other (or vice versa), but because seasonal patterns influence both variables. Properly controlling for seasonality should eliminate any association between these variables.

Using day duration in Bangkok as the dependent variable and temperature in Philly as the predictor, we threw in monthly and then weekly dummies to control for the seasonal confound. Neither technique fully succeeded, as same-day temperature survived as a significant predictor. (STATA .do)

Table 2 shows the results. Using monthly and weekly dummy variables made it seem like, over and above the effects of seasonality, colder days are more likely to be shorter. Controlling for the historical average daily temperature, in contrast, showed, correctly, that seasonality is the sole driver of this relationship.
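For readers who prefer something other than Stata, here is a sketch of the same check in Python. It assumes the posted dataset has been saved locally with columns named date, temp (Philadelphia), and day_duration (Bangkok); those names are assumptions, and the historical average is computed within the sample rather than from a longer weather history.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bangkok_philly.csv", parse_dates=["date"])   # hypothetical local copy of the posted data
df["month"] = df["date"].dt.month
df["week"] = df["date"].dt.isocalendar().week.astype(int)
df["month_day"] = df["date"].dt.strftime("%m-%d")
# Long-run average temperature for each calendar date (here computed within the sample)
df["hist_avg_temp"] = df.groupby("month_day")["temp"].transform("mean")

specs = [
    ("monthly dummies",    "day_duration ~ temp + C(month)"),
    ("weekly dummies",     "day_duration ~ temp + C(week)"),
    ("historical average", "day_duration ~ temp + hist_avg_temp"),
]
for label, formula in specs:
    fit = smf.ols(formula, data=df).fit()
    print(f"{label:20s} temp: b = {fit.params['temp']:+.4f}, p = {fit.pvalues['temp']:.3f}")

If the data behave as described above, temperature should remain “significant” in the first two specifications and become indistinguishable from zero in the third.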


Original author feedback:
We shared a draft of this post with authors from all 10 papers in Table 1 and we heard back from 5 of them. Their feedback led to correcting errors in Table 1, changing the title of the post, and fixing the day-duration example (Table 2). Devin Pope, moreover, conducted our suggested analysis on his convertible purchases (QJE) paper and shared the results with us: the finding is robust to our suggested additional control. Devin also thought it was valuable to highlight that while the historical temperature average is a better control for weather-based seasonality, reducing bias, weekly/monthly dummies help with noise from other seasonal factors such as holidays. We agreed. Best practice, in our view, is to include time dummies at the finest granularity the data permit, to reduce noise, and to include the daily historical average, to reduce the seasonal confound of weather variation.


  1. Uri created the list by starting with the most well-cited observational weather paper he knew – Hirshleifer & Shumway – and then selected papers citing it in the Web of Science and published in journals he recognized. []
  2. Another option is to use daily dummies. That option can easily be worse: it can lower statistical power by throwing away data. First, daily fixed effects can only be applied to data with at least two observations per calendar date. Second, this approach ignores historical weather data that precede the dependent variable; for example, if the sales data used in the analyses cover 2013-2015, daily fixed effects force us to ignore weather data from any prior year. Lastly, it ‘costs’ 365 degrees of freedom (don’t forget leap years), instead of 1. []
  3. Uri has two weather papers. They both use this approach to account for seasonality. []

[45] Ambitious P-Hacking and P-Curve 4.0

In this post, we first consider how plausible it is for researchers to engage in more ambitious p-hacking (i.e., past the nominal significance level of p<.05). Then, we describe how we have modified p-curve (see app 4.0) to deal with this possibility.

Ambitious p-hacking is hard.
In “False-Positive Psychology” (SSRN), we simulated the consequences of four (at the time acceptable) forms of p-hacking. We found that the probability of finding a statistically significant result (p<.05) skyrocketed from the nominal 5% to 61%.

For a recently published paper, “Better P-Curves” (.pdf), we modified those simulations to see how hard it would be for p-hackers to keep going past .05. We found that p-hacking needs to increase exponentially to get smaller and smaller p-values. For instance, once a nonexistent effect has been p-hacked to p<.05, a researcher would need to attempt nine times as many analyses to achieve p<.01.


Moreover, as Panel B shows, because there is a limited number of alternative analyses one can do (96 in our simulations), ambitious p-hacking often fails.[1]
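The flavor of these simulations is easy to reproduce. The sketch below simulates just one of the four forms of p-hacking (optional stopping on a nonexistent effect), with made-up settings, so the exact numbers will not match the paper's; it merely illustrates how much harder it is to drag a false positive below .01 than below .05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def p_hack_until(alpha, n_start=20, n_step=10, n_max=200, sims=5000):
    """Fraction of null studies that eventually reach p < alpha by repeatedly adding
    n_step observations per cell and re-testing (two-sample t-test, true effect = 0)."""
    successes = 0
    for _ in range(sims):
        a = rng.normal(size=n_start)
        b = rng.normal(size=n_start)
        while True:
            if stats.ttest_ind(a, b).pvalue < alpha:
                successes += 1
                break
            if len(a) >= n_max:
                break                                   # out of p-hacking budget
            a = np.concatenate([a, rng.normal(size=n_step)])
            b = np.concatenate([b, rng.normal(size=n_step)])
    return successes / sims

print("reach p<.05:", p_hack_until(0.05))
print("reach p<.01:", p_hack_until(0.01))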

P-Curve and Ambitious p-hacking
P-curve is a tool that allows you to diagnose the evidential value of a set of statistically significant findings. It is simple: you plot the significant p-values of the statistical tests of interest to the original researchers, and you look at the shape of that distribution. If your p-curve is significantly right-skewed, then the literature you are examining has evidential value. If it’s significantly flat or left-skewed, then it does not.

In the absence of p-hacking, there is, by definition, a 5% chance of mistakenly observing a significantly right-skewed p-curve if one is in fact examining a literature full of nonexistent effects. Thus, p-curve’s false-positive rate is 5%.
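That 5% property is easy to verify by simulation. The sketch below uses a simplified right-skew test in the spirit of p-curve, Stouffer's method applied to "pp-values" (significant p-values rescaled by the .05 cutoff); it is an illustration, not the exact test implemented in the app.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def right_skew_p(sig_pvalues, cutoff=0.05):
    """Stouffer test that significant p-values pile up near zero (right skew)."""
    pp = np.asarray(sig_pvalues) / cutoff          # uniform on (0,1) if the effect is nil
    z = stats.norm.ppf(pp)                         # very negative when p-values are tiny
    return stats.norm.cdf(z.sum() / np.sqrt(len(pp)))

# Under the null (flat p-curve: significant p-values uniform on (0, .05)),
# the test should conclude "right skew" about 5% of the time.
false_positives = 0
for _ in range(10000):
    ps = rng.uniform(0, 0.05, size=20)             # 20 significant results, no true effect
    false_positives += right_skew_p(ps) < 0.05
print("false-positive rate:", false_positives / 10000)   # ~ .05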

However, when researchers p-hack trying to get p<.05, that probability drops quite a bit, because p-hacking causes p-curve to be left-skewed in expectation, making it harder to (mistakenly) observe a right-skew. Thus, literatures studying nonexistent effects through p-hacking have less than a 5% chance of obtaining a right-skewed p-curve.

But if researchers get ambitious and keep p-hacking past .05, the barely significant results start disappearing and so p-curve starts having a spurious right-skew. Intuitively, the ambitious p-hacker will eliminate the .04s and push past to get more .03s or .02s. The resulting p-curve starts to look artificially good.

Updated p-curve app, 4.0 (htm), is robust to ambitious p-hacking
In “Better P-Curves” (.pdf) we introduced a new test for evidential value that is much more robust to ambitious p-hacking. The new app incorporates it (among many other improvements, the app also computes confidence intervals for power estimates; see the summary (.htm)).

The new test focuses on the “half p-curve,” the distribution of p-values that are p<.025. On the one hand, because the half p-curve does not include barely significant results, it has a lower probability of mistaking ambitious p-hacking for evidential value. On the other hand, dropping observations makes the half p-curve less powerful, so it has a higher chance of failing to recognize actual evidential value.

Fortunately, by combining the full and half p-curves into a single analysis, we obtain inferences that are robust to ambitious p-hacking with minimal loss of power.

The new test of evidential value:
A set of studies is said to contain evidential value if either the half p-curve has a p<.05 right-skew test, or both the full and half p-curves have p<.1 right-skew tests. [2]
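Here is a minimal sketch of that decision rule, reusing the simplified Stouffer-style right-skew test from the previous sketch (redefined so the snippet is self-contained); the app's actual tests differ in their details.

import numpy as np
from scipy import stats

def right_skew_p(sig_pvalues, cutoff):
    pp = np.asarray(sig_pvalues) / cutoff
    z = stats.norm.ppf(pp)
    return stats.norm.cdf(z.sum() / np.sqrt(len(pp)))

def has_evidential_value(pvalues):
    """pvalues: the significant (p<.05) p-values of the tests of interest."""
    pvalues = np.asarray(pvalues)
    full = pvalues[pvalues < 0.05]
    half = pvalues[pvalues < 0.025]
    if len(half) == 0:
        return False                                 # nothing for the half p-curve to use
    p_half = right_skew_p(half, cutoff=0.025)
    p_full = right_skew_p(full, cutoff=0.05)
    return (p_half < 0.05) or (p_full < 0.1 and p_half < 0.1)

print(has_evidential_value([0.0001, 0.0004, 0.001, 0.003, 0.02]))   # -> True  (strongly right-skewed set)
print(has_evidential_value([0.049, 0.041, 0.024, 0.044, 0.033]))    # -> False (mostly barely significant results)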

In the figure below we compare the performance of this new combination test with that of the full p-curve alone (the “old” test). The top three panels show that both tests are similarly powered to detect true effects. Only when original research is underpowered at 33% is the difference noticeable, and even then it seems acceptable. With just 5 p-values the new test still has more power than the underlying studies do.


The bottom panels show that moderately ambitious p-hacking fully invalidates the “old” test, but the new test is unaffected by it.[3]

We believe that these revisions to p-curve, incorporated in the updated app (.html), make it much harder to falsely conclude that a set of ambitiously p-hacked results contains evidential value. As a consequence, the incentives to ambitiously p-hack are even lower than they were before.



  1. This is based on simulations of what we believe to be realistic combinations and levels of p-hacking. The results will vary depending on the types and levels of p-hacking. []
  2. As with all cutoffs, it only makes sense to use these as points of reference. A half p-curve with p=.051 is nearly as good as one with p=.049, and having both tests at p<.001 is much stronger than having both at p=.099. []
  3. When the true effect is zero and researchers do not p-hack (an unlikely combination), the probability that the new test leads to concluding the studies contain evidential value is 6.2% instead of the nominal 5%. R Code: https://osf.io/mbw5g/  []

[44] AsPredicted: Pre-registration Made Easy

Pre-registering a study consists of leaving a written record of how it will be conducted and analyzed. Very few researchers currently pre-register their studies. Maybe it's because pre-registering is annoying. Maybe it's because researchers don't want to tie their own hands. Or maybe it's because researchers see no benefit to pre-registering. This post addresses these three possible causes. First, we introduce AsPredicted.org, a new website that makes pre-registration as simple as possible. We then show that pre-registrations don't actually tie researchers' hands; they tie reviewers' hands, providing selfish benefits to authors who pre-register. [1]

The best introduction is arguably the home page itself: [screenshot of the AsPredicted.org home page, November 2015]

No matter how easy pre-registering becomes, not pre-registering is always easier.  What benefits outweigh the small cost?

Benefit 1. No more self-censoring
In part by choice, and in part because some journals (and reviewers) now require it, more and more researchers are writing papers that properly disclose how their studies were run; they are disclosing all experimental conditions, all measures collected, any data exclusions, etc.

Disclosure is good. It appropriately increases one's skepticism of post-hoc analytic decisions. But it also increases one's skepticism of totally reasonable ex-ante decisions, for the two are sometimes confused. Imagine you collect, and properly disclose, one primary dependent variable and two exploratory measures, only to get hammered by Reviewer 2, who writes:

This study is obviously p-hacked. The authors collected three measures and only used one as a dependent variable. Reject.

When authors worry that they will be accused of reporting only the best of three measures, they may decide to collect only a single measure. Pre-registration frees authors to collect all three, while assuaging any concerns about being accused of p-hacking.

You don’t tie your hands with pre-registration. You tie Reviewer 2’s.

In case you skipped the third blue box above: [image of the third blue box from the AsPredicted.org home page]

Benefit 2. Go ahead, data peek
Data peeking, where one decides whether to get more data after analyzing the data, is usually a big no-no. It invalidates p-values and (several aspects of) Bayesian inference. [2]  But if researchers pre-register how they will data peek, it becomes kosher again.

For example, you can pre-register, “In line with Frick (1986 .pdf) we will check the data after every 20 observations per cell, stopping whenever p<.01 or p>.36,” or “In line with Pocock (1977 .pdf), we will collect up to 60 observations per cell, in batches of 20, and stop early if p<.022.”
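To see why pre-registered peeking is kosher, one can simulate the first rule above on a nonexistent effect and check the overall false-positive rate it produces. The cap on sample size and the number of simulations below are assumptions made only for the sketch.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def coast_style_run(batch=20, max_per_cell=400):
    """Returns True if the run ends with p < .01 (a false positive, since the true effect is 0)."""
    a = rng.normal(size=0)
    b = rng.normal(size=0)
    while len(a) < max_per_cell:
        a = np.concatenate([a, rng.normal(size=batch)])
        b = np.concatenate([b, rng.normal(size=batch)])
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.01:
            return True          # stop and "publish"
        if p > 0.36:
            return False         # stop and abandon
    return False                 # ran out of budget without a decision

sims = 5000
fp_rate = np.mean([coast_style_run() for _ in range(sims)])
print("overall false-positive rate:", fp_rate)   # roughly .05 per Frick (1986); a finite cap makes it slightly lower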

Lakens (2014 .pdf) gives an accessible introduction to legalized data-peeking for psychologists.

Benefit 3. Bolster credibility of odd analyses
Sometimes, the best way to analyze the data is difficult to sell to readers. Maybe you want to run a negative binomial regression, or apply an arcsine transformation, or drop half the sample because the observations are not independent. You think about it for hours, ask your stat-savvy friends, and then decide that the weird way to analyze your data is actually the right way to analyze your data. Reporting the weird (but correct!) analysis opens you up to accusations of p-hacking. But not if you pre-register it. "We will analyze the data with an arcsine transformation." Done. Reviewer 2 can't call you a p-hacker.


  1. More flexible options for pre-registration are offered by the Open Science Framework and the Social Science Registry, where authors can write up documents in any format, covering any aspect of their design or analysis, and without any character limits. See pre-registration instructions for the OSF here, and for the Social Science Registry here. []
  2. In particular, if authors peek at their data seeking a given Bayes Factor, they increase the odds they will find support for the alternative hypothesis even if the null is true – see Colada [13] – and they obtain biased estimates of effect size. []