A forthcoming paper in Psych Methods (.pdf) had a set of coders evaluate 300 pre-registrations in terms of how informative they were about several study attributes (e.g., hypotheses, analyses, DVs). The authors analyzed the subjective codings and concluded that many pre-registrations in psychology, especially those relying on the AsPredicted template, provide insufficient information [1,2].

[1] The authors also evaluated the extent to which pre-registrations have become more informative over time, and the extent to which published studies deviate from pre-registrations. They do not find significant effects for either.

[2] The Psych Methods paper does not differentiate OSF pre-registrations that relied on the AsPredicted template from actual AsPredicted.org pre-registrations. In this post I also treat them as one set, although I believe they should be treated separately, because AsPredicted (the platform) incorporates many features the AsPredicted template on the OSF does not. At AsPredicted.org: (i) questions include examples of how to answer them, (ii) timestamps are unambiguous and pre-registrations can never be modified, (iii) it is clear which co-authors approved the pre-registration, (iv) algorithms check for similarity across pre-registrations and flag those with excessive similarity, and (v) pre-registrations for similar studies pre-registered within days of each other are 'bundled', so that all pre-registrations in the bundle include links to the others in it.
Central to this entire enterprise is an understanding of what constitutes a good vs a bad pre-registration, and I seem to disagree with the authors on that. Pre-registrations should only contain information that helps demarcate confirmatory vs exploratory statistical analyses (i.e., that would help a reader identify harking and p-hacking), and should generally avoid other information [3]. For example, including a list of all dependent variables collected or all stimuli used helps demarcate confirmatory vs exploratory analysis, but providing justifications for the inclusion of each of them does not. Succinctness is key because a pre-registration is only useful if it's read alongside the published paper it accompanies, and unnecessarily verbose documents are seldom read (think: terms of service disclaimers).
The Psych Methods authors naturally pre-registered their study. Nothing demonstrates their different view on what makes for a good pre-registration better than the 6247 words they put in theirs (.pdf). As a benchmark, that’s about 1000 more words than the most cited paper I have coauthored, "False-Positive Psychology" (.htm).
Joe, Leif, and I already shared our thoughts on how to properly pre-register a study (see Colada[64]). That post still represents our view on the topic, so I won't revisit it here.
Instead, here I evaluate the Psych Methods article as an empirical piece. The authors collected data, analyzed it, and arrived at a few conclusions. Reading their paper, I found shortcomings that challenge some of those conclusions, especially the AsPredicted vs OSF comparison.
Problem 1: Hidden confound in the OSF vs AsPredicted comparison
A confound arises when an association between two variables is causally driven by a third variable. A hidden confound occurs when a paper does not include enough information for readers to figure out there is a confound, and readers need to dig deep into the data or materials to -maybe- catch it (see e.g., Colada[85] "The Problem of Hidden Confounds").
The hidden confound in this particular paper is the following: the samples of AsPredicted pre-registrations and of OSF pre-registrations were constructed differently, in a way that biases the results in favor of the OSF. Specifically, the OSF pre-registrations came primarily (85% of them) from submissions to the "pre-registration challenge" (.htm), a contest organized back in 2017 by the Center for Open Science (the folks behind the OSF), where authors could earn $1000 if they published a study with a pre-registration that passed a screening by the COS organizers, who checked that the pre-registrations contained enough information (giving authors an opportunity to revise when they didn't) (.htm). Most of the OSF sample, in other words, was selected based on the dependent variable. The AsPredicted sample, in contrast, comes from regularly published papers, with no selection for informativeness.
This confound is not mentioned in the paper (or in the 6247-word pre-registration).
All the paper says about the pre-registration challenge origin of the data is the following:
All we learn as readers is that there are two sources of data; we don't learn that they differ systematically in terms of selection on the dependent variable, nor that the OSF sample comes mostly from one source and the AsPredicted sample exclusively from the other.
Comparing submissions relying on two different templates, without any controls for what kinds of authors and studies are being pre-registered, has little hope of identifying an interpretable causal effect. But with a hidden confound that selects on the dependent variable for only one sample, that hope drops to zero.
Problem 2: Pre-registrations do not fall out of coconut trees
"You think you just fell out of a coconut tree? You exist in the context of all in which you live and what came before you." Kamala Harris, May 10th, 2023 (YouTube)
One way in which pre-registrations become unnecessarily verbose is by including background information that is available elsewhere (e.g., in the accompanying paper). Pre-registrations are not meant to be stand-alone documents; they are literal appendices. À la coconut meme, pre-registrations exist in the context of the papers that reference them, and of the expertise and literature of the community of researchers who read them.
The Psych Methods article, however, evaluated pre-registrations in isolation. For example, they write in their instructions to the pre-registration evaluators that, even for replication studies, "When the authors specify a part of the preregistration by referring to a different paper, please do not [count this]. . . The information should be contained within the preregistration itself (and possibly the supplementary materials)." (page 4 | .pdf).
Here is an example of ignoring context when evaluating informativeness. In behavioral science, almost every published statistical test is a two-sided test with a significance level of 5%. But the Psych Methods authors dinged pre-registrations as uninformative if they did not explicitly state whether they were computing p-values or Bayes factors, did not stipulate one- vs two-sided tests, or did not stipulate the alpha level.
Example: The worst-rated AsPredicted pre-registration is pretty informative.
Let's look at one pre-registration in detail to see how strict the Psych Methods authors were in their evaluations. Specifically, let's look at the worst AsPredicted-template pre-registration in the sample (.htm). It was given a score of 8 out of 100 in informativeness [4]. (The authors don't use the term informativeness: in their pre-registration they called it "strictness"; in their paper they changed the term to "producibility".)

[4] The authors used a 2-point scale; this pre-registration got a 0.16 out of 2, which is arithmetically equivalent to 8 out of 100.
One may expect an 8 out of 100 to be an absolute disaster. Something like "we will run an experiment with random assignment, measure happiness, and analyze the data with frequentist statistics". But that worst pre-registration stipulates quite a bit more than that, and so goes a long way toward helping readers identify potential p-hacking or harking (again, the only goal of pre-registration). Specifically, it stipulates the following 5 things:
- The hypothesis is that circling "we" vs "I" pronouns improves spatial recall, especially among Whites
- The DV is "number of spatially identified items"
- There are 2 conditions: circling "I" vs "we"
- Analysis: 2 x 2 GLM followed by simple effects for the interaction
- The sample size aims to be n = 130
That is a lot of things to stipulate, and that is just my 49-word summary of the more detailed pre-registration. For example, those 49 words eliminate all four forms of p-hacking we simulated in our False-Positive Psychology paper: data peeking [5], selective use of a covariate, dropping a condition, and cherry-picking among dependent variables. The 49 words also eliminate all but 1 of the degrees-of-freedom an author acknowledged exploiting for the infamous power posing study in an unusual p-hacking 'confession' (.pdf).

[5] There is some flexibility in sample size here, but small deviations from n=130 have a very small impact on the false-positive rate. We simulated n=20 vs n=30, a 50% increase in sample size; the smaller the % change in n, the smaller the impact on the false-positive rate.
To be clear: it's not a perfect pre-registration. For example, it does not stipulate whether there will be exclusions, there is some wiggle room for sample size, and the analysis is not described sufficiently clearly. But if this is the worst AsPredicted pre-registration in the sample, that's reassuring. The average AsPredicted pre-registration has a score 5 times higher than this one.
The present isn't perfect
I have focused here on problems I see with the Psych Methods paper, but I do buy into its overarching goal: improving pre-registration informativeness. The OSF and AsPredicted were designed years ago, before anybody involved in their design (yours truly included) had any real experience pre-registering their studies, or reading the pre-registrations of others. In the years since, and across the more than 100,000 pre-registrations completed in that time, opportunities for improving this first generation of pre-registration platforms have surely arisen. In a future Colada post we will propose changes to both the OSF and AsPredicted to improve pre-registration informativeness. We believe ease-of-use, standardized information, and succinctness are absolute musts. But we also believe that minor tech twists can have big returns with minimal additional costs for authors or readers.
If you have suggestions you would like us to consider for improving the OSF or AsPredicted pre-registration platforms you can submit them here [Google Poll].
Author feedback
Our policy (.htm) is to share drafts of blog posts with authors whose work we discuss, in order to solicit suggestions for things we should change prior to posting, and to invite them to write a response that we link to at the end of the post. I shared my concerns with the first author of the Psych Methods paper prior to writing the post, and then shared a draft of the post with him. He suggested two wording changes which I followed. He also provided, on behalf of himself and the co-authors, a response (.pdf), which identifies four contributions their paper makes separate from the issues I discuss here.