A forthcoming paper in the Quarterly Journal of Economics (QJE), "A Cognitive View of Policing" (htm), reports results from a field experiment showing that teaching police officers to "consider different ways of interpreting situations they encounter" led to "reductions in use of force, [and] discretionary arrests" (abstract).
In this post I explain why, having spent a few hours comparing the pre-registration with the published results, I think the findings are not really statistically significant. As a preview, two findings described as focal in the paper (but not in the pre-registration) have p = .048 and p = .051.
First, true praise
I believe this is exactly the kind of study we should strive to have more of in social science. The intervention is directly informed by well-established psychological findings and was applied in a context of obvious societal relevance. I admire the authors' choice of research question, study design, and logistically complex and expensive implementation. I would love it if half of social science studies were like this one, and our best journals published the work even if the results were inconclusive (as I suspect these results actually are).
But
We should demand that findings from one-off policy field experiments be maximally credible, for two main reasons:
(1) it is nearly impossible for the research community to check them, and
(2) they could directly influence policy.
In terms of (1), the research community is unlikely to check the validity of this particular paper because (i) it is prohibitively costly for other researchers to team up with a police department just to run a replication, and (ii) the authors indicated the data were proprietary and did not share any of it.
So, neither replication nor reanalysis of existing data, science's usual defenses against wrong conclusions, is available here.
The pre-registration
The paper links, in the authors' note on page 1, to a document (htm) in the "Social Science Registry", a platform where economists act 'as if' they were pre-registering their studies. Like many documents on that platform, this paper's 'registration' was written after the analyses were run. But fortunately the authors did not rely on this document as a pre-registration. Instead, when introducing their data (page 13), they point to "pre-analysis plans", documents posted to the OSF and linked in footnote 9. I will refer to these documents as the 'pre-registration' (.htm). 1 The document, dated August 10th 2021, indicates that data collection began on March 2nd 2020; but in email exchanges while preparing this post, the authors clarified that "all analyses were conducted after writing the pre-analysis plans".
A flexible pre-registration
The pre-registration (see annotated .pdf) identifies 10 "Primary Outcomes" (DVs) and at least 33 "Secondary Outcomes", for a total of at least 43 dependent variables. 2 I say "at least" because the wording is at times ambiguous about whether the authors committed to analyzing a set of listed measures as a single index summing over all of them, or whether they also allowed for analyses of the individual metrics. In my calculations I assume they committed to analyzing only the total sum.
Among those 10 primary DVs, let's start with the first three, which involve the level of force used during arrests; force is categorized as Level 1, 2, or 3 in the original police data. The authors combined these levels to build three alternative dependent variables:
DV1. Arrests with any use of force (level 1, 2, or 3)
DV2. Arrests with lower levels (1 or 2)
DV3. Arrests with higher levels (2 or 3)
This leaves some wiggle room for p-hacking of course.
In addition, the pre-registration does not specify the time period over which the outcomes will be measured; after a police officer goes through the randomly assigned training, when will potential effects be assessed? The pre-registration reads: "We will analyze these outcomes for up to 12 months after . . . [and] relevant sub-periods to see if the effects change over time" (emphasis added).
This leaves lots of wiggle room for p-hacking.
So, with this wiggle room in mind, let's think about that key p=.051 result mentioned earlier. It was obtained using DV2 over the full 12 months. OK, so that's somewhat arbitrary, but it makes sense. It makes sense because the operationalization uses lots of data, all 12 months, increasing power, and excludes Level 3 use of force, which may be too extreme to be affected by the training.
So far so good, but, wait, I just lied. For rhetorical purposes. That specification is not the one behind the showcased p=.051 in the paper. In reality that specification, DV2 with 12 months, gets p=.152 and is reported in a distant Table B29 in the supplement. I lied to demonstrate that we can easily justify an analysis different from the one the paper treats as 'primary'.
The actual key specification behind that p=.051 is another one: it includes all levels of force (DV1) measured only during the first six months. OK, so that's also somewhat arbitrary, but it also makes sense. It makes sense because one may expect the training to have a bigger effect in the beginning, and because, to gain power within this shorter six-month window, it's better to include all forms of use of force. So, fine.
But, wait, I again lied for rhetorical purposes.
That second specification is also not the one behind the showcased p=.051. The results for that one, which is also consistent with the pre-registration, are not reported anywhere.
OK, henceforth no more lies, I promise. What is the actual 'focal' specification? The p=.051 is obtained using only the first four months of data and only Level 1 and Level 2 of force: excluding data after month 4, and excluding Level 3 incidents. This, obviously, is also defensible. Hundreds of choices are defensible. That's how p-hacking works: we cherry-pick among defensible specifications. That's why good researchers with good intentions p-hack; it seems reasonable while you are doing it. I think these are excellent researchers with excellent intentions, and I think they p-hacked. I am not saying they are cheaters, I am saying they are human.
This table presents a (non-exhaustive) set of combinations of just the 10 primary DVs and the possible time horizons that are compatible with the pre-registration, showing the p-value whenever I found it in the paper or supplement.
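To get a rough sense of how much this flexibility matters, here is a minimal back-of-the-envelope sketch; it is my own illustration, not an analysis of the authors' data, and it treats each defensible specification as an independent test of a true null effect.

```python
# Back-of-the-envelope sketch (my illustration, not the authors' analysis):
# if a pre-registration is compatible with k defensible specifications, and
# we treat each as an independent test of a true null effect, the chance
# that at least one of them comes out p < .05 grows quickly with k.
alpha = 0.05
for k in (1, 5, 10, 43, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>3} defensible specifications -> P(at least one p < .05) = {p_any:.2f}")
```

In reality the specifications overlap heavily (the same arrests enter multiple DVs and time windows), so they are far from independent and the true rate is lower than these numbers suggest; but the qualitative point stands: with dozens of defensible specifications, stumbling on something near p = .05 is not surprising.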
Table III in the paper, the "Key" outcomes
The main results table in the paper, the one with the findings mentioned in the abstract, is Table III. I (obviously) annotated the copy below.
The top row in the table is the p=.051 discussed in detail above.
The second row shows that the single DV on discretionary arrests gets p=.048 after four months. We don't know how correlated these two DVs are, so we don't know how informative it is to (kind of) get an effect not only in row 1 but also (kind of) in row 2. However, counterintuitively, this second result can only harm credibility, it cannot help it; see footnote. 3 Ironically, because both p-values are close to .05, we have a catch-22. If the DVs are statistically independent, uncorrelated, it is really problematic that both are close to p=.05; that really should not happen with true effects, so the two near-.05 p-values provide additional evidence of p-hacking, and of the absence of a detectable effect. If the DVs are highly dependent, on the other hand, if they are correlated, then we don't get much additional information from the second DV. So this 2nd DV getting p=.05 can only hurt credibility, or leave it unchanged, but cannot help it!
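The footnote's claim that two p-values hovering near .05 "really should not happen with true effects" is easy to check with a quick simulation. Below is a minimal sketch under assumptions I am choosing purely for illustration: two independent two-sided z-tests of a real effect, each with roughly 80% power.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sim = 1_000_000

# Two independent two-sided z-tests of a true effect; a mean shift of ~2.8
# standard errors gives roughly 80% power at alpha = .05 (my assumption).
z = rng.normal(loc=2.8, scale=1.0, size=(n_sim, 2))
p = 2 * norm.sf(np.abs(z))

both_near_05 = np.mean(np.all((p > 0.04) & (p < 0.06), axis=1))
both_clearly_sig = np.mean(np.all(p < 0.01, axis=1))

print(f"P(both p-values land in [.04, .06]): {both_near_05:.4f}")
print(f"P(both p-values land below .01):     {both_clearly_sig:.4f}")
```

Under these assumptions, both p-values landing in the immediate neighborhood of .05 happens well under 1% of the time, whereas both being comfortably significant happens in roughly a third of simulations. The exact numbers move with the assumed power, but two independent tests of real effects rarely both flirt with the threshold.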
The third row is more surprising. It features an outcome the authors do not mention anywhere in the pre-registration: "number of days officers were off-duty due to injuries". Despite its being exploratory, the authors discuss it at length and prominently in the paper (e.g., in the abstract) and use it in the discussion section to make cost/benefit calculations for their training, writing that "even the benefits [$1057] from reduced officer injuries alone exceed the cost of training" (p.39).
I should note this variable is not significant in months 5-8 (p=.658) and, in months 9-12, has an effect 60% as big but with the opposite sign, p=.162 (source: their Figure II).
Aside on q-values in footnote. 4 I consider the q-values insufficient because the "families" are being defined after the fact and are way too small. Notably, they are defined for a given time period (e.g., 4 months) and include a handful of variables instead of all the variables that could have been used. See a recent Psych Methods paper on this (htm). One way to think of it: a q-value is valid only if you define the families in your pre-registration.
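To see why the family definition matters so much, here is a minimal sketch with made-up p-values (mine, not the paper's): the same p = .048 gets a q-value of about .05 when grouped with just two other tests, but a far larger one when grouped into a 43-outcome family in which most tests come up empty.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values chosen only to illustrate the family-definition point.
small_family = [0.048, 0.051, 0.030]

# Pretend the full pre-registered family had 43 outcomes, with the other 40
# tests coming up null (p-values drawn between .05 and 1).
rng = np.random.default_rng(1)
big_family = small_family + list(rng.uniform(0.05, 1.0, size=40))

for family in (small_family, big_family):
    _, q_values, _, _ = multipletests(family, method="fdr_bh")
    print(f"family of {len(family):>2} tests -> q-value for p = .048 is {q_values[0]:.2f}")
```

The specific numbers depend on my made-up inputs; the point is that a false-discovery-rate correction is only as meaningful as the family it is computed over, and a family assembled after seeing the results can be made as forgiving as one likes.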
To be clear, the authors do caveat that officer injuries was not pre-registered. But, as I argued in a previous post on another field experiment with police officers (Colada[101]), transparency makes research evaluable, not credible. When a pre-registration includes 43 outcome variables, and the paper prominently discusses one that is not in that set, it does not seem outlandish to worry about p-hacking.
I close with ideas for improving this paper (behind the green button) and for improving economics journals.
How to improve the QJE and other econ journals.
It's easy to understand authors fooling themselves into believing the analysis that worked ex post is the one that made sense ex ante; we all do this as authors. Motivated reasoning is a hell of a drug. But it's harder to see why reviewers fall for it. Harder to see why the QJE allows a paper where most pre-registered outcomes go unreported, allows the pre-registration to be relegated to a link in footnote 9, and places no demand that the flexibility in the analysis be acknowledged and seriously taken into account.
I would suggest economics journals, including the QJE, start doing the following:
1) Requiring that experiments be actually pre-registered (before data collection begins)
2) Offering registered reports for expensive-to-replicate field experiments like this one
(check out what non-econ journals do; see, e.g., Nature Human Behavior, htm)
3) Requiring that all pre-registered analyses be reported in an easy-to-read document that references the pre-registration, so that readers can easily evaluate where the pre-registration was and was not followed.
Note: of course it's OK to deviate from pre-registrations; we just need to empower readers to evaluate the decisions to do so.
Author feedback
Our policy (.htm) is to share drafts of blog posts with authors whose work we discuss, in order to solicit suggestions for things we should change prior to posting. I emailed the authors of the QJE paper, and they suggested changes to a paragraph in the post, which I made to their satisfaction. Later they provided a response (pdf). In it they explain that they chose 4 months as the key evaluation period because that's when they collected additional data from the police officers (in an in-person assessment), they point out that they excluded Level 3 incidents of violence because such incidents are very rare, and they note that they state in the paper that the officer-injury measure was not pre-registered.
They do not report or discuss the analyses I suggested behind the green button.
Again, a link to their response (pdf).