[59] PET-PEESE Is Not Like Homeopathy

PET-PEESE is a meta-analytical tool that seeks to correct for publication bias. In a footnote in my previous post (.htm), I referred to it as the homeopathy of meta-analysis. That was unfair and inaccurate.

Unfair because, in the style of our President, I just called PET-PEESE a name instead of describing what I believed was wrong with it. I deviated from one of my rules for ‘menschplaining’ (.htm): “Don’t label, describe.”

Inaccurate because skeptics of homeopathy merely propose that it is ineffective, not harmful. But my argument is not that PET-PEESE is merely ineffective; I believe it is also harmful. It doesn’t just fail to correct for publication bias, it adds substantial bias where none exists.

note: A few hours after this blog went live, James Pustejovsky (.htm) identified a typo in the R Code which affects some results. I have already updated the code and figures below. (I archived the original post: .htm).

PET-PEESE in a nutshell
Tom Stanley (.htm), later joined by Hristos Doucouliagos, developed PET-PEESE in a series of papers that have each accumulated 100-400 Google cites (.pdf | .pdf). The procedure consists of running a meta-regression: a regression in which studies are the unit of analysis, with effect size as the dependent variable and its variance as the key predictor [1]. The clever insight by Stanley & Doucouliagos is that the intercept of this regression is the effect we would expect in the absence of noise, and thus our estimate of the publication-bias-corrected true effect [2].
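For concreteness, here is a minimal sketch in R of what the procedure amounts to. This is my own illustration, not Stanley & Doucouliagos’s code, and it uses a simple two-sided test on PET’s intercept for the conditional rule described in footnote 2 (exact conventions vary):

```r
# Minimal sketch of PET and PEESE meta-regressions (illustrative, not the original code)
# d: study-level effect sizes (e.g., Cohen's d); v: their estimated variances
pet.peese <- function(d, v) {
  se    <- sqrt(v)
  pet   <- lm(d ~ se, weights = 1 / v)   # PET: predictor is the standard error (footnote 1)
  peese <- lm(d ~ v,  weights = 1 / v)   # PEESE: predictor is the variance
  # Conditional rule (footnote 2): if PET's intercept is significant, report PEESE's intercept
  p.pet <- summary(pet)$coefficients["(Intercept)", "Pr(>|t|)"]
  if (p.pet < .05) unname(coef(peese)["(Intercept)"]) else unname(coef(pet)["(Intercept)"])
}
```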

PET-PEESE in Psychology
PET-PEESE was developed with the meta-analysis of economics papers in mind (regressions with non-standardized effects). It is possible that some of the problems identified here, which concern meta-analyses of standardized effect sizes (Cohen’s d), do not extend to such settings [3].

Psychologists have started using PET-PEESE recently; for instance, in meta-analyses about religious primes (.pdf), working memory training (.htm), and the personality of computer whizzes (.htm). Probably the most famous example is Carter et al.’s meta-analysis of ego depletion, published in JEP:G (.pdf).

In this post I share simulation results suggesting we should not take PET-PEESE estimates, at least of psychological research, very seriously. It arrives at wholly invalid estimates under too many plausible circumstances. Statistical tools need to be generally valid, or at least valid under predictable circumstances. PET-PEESE, to my understanding, is neither [4].

Results
Let’s start with a baseline case for which PET-PEESE does OK: there is no publication bias, every study examines the exact same effect size, and sample sizes are distributed uniformly between n=12 and n=120 per cell. Below we see that when the true effect is d=0, PET-PEESE correctly estimates it as d̂=0, and as d gets larger, d̂ gets larger (R Code).
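The simulations behind this and the following figures follow roughly this logic (a reconstruction under stated assumptions; the posted R Code is the authoritative version): simulate k studies, compute each study’s observed d and its variance, and feed them to the pet.peese() sketch above.

```r
# Baseline scenario: no publication bias, one true effect, n ~ Uniform(12, 120) per cell
simulate.meta <- function(k = 100, d.true = .4, n.min = 12, n.max = 120) {
  n <- round(runif(k, n.min, n.max))                      # per-cell sample sizes
  d <- sapply(n, function(ni) {
    x <- rnorm(ni, mean = 0); y <- rnorm(ni, mean = d.true)
    (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)     # observed Cohen's d
  })
  v <- 2 / n + d^2 / (4 * n)                              # approximate variance of d
  data.frame(d = d, v = v, n = n)
}
set.seed(1)
meta <- simulate.meta(d.true = .4)
pet.peese(meta$d, meta$v)   # corrected estimate; the post shows PET-PEESE does OK here
```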

About 2 years ago, Will Gervais evaluated PET-PEESE in a thorough blog post (.htm) (which I have cited in papers a few times). He found that in the presence of publication bias PET-PEESE did not perform well, but that in the absence of publication bias it at least did not make things worse. The simulations depicted above are not that different from his.

Recently, however, and by happenstance, I realized that Gervais got lucky with the simulations (or I guess PET-PEESE got lucky) [5]. If we deviate slightly from the specifics of that ideal scenario, in any of several directions, PET-PEESE no longer performs well even in the absence of publication bias.

For example, imagine that sample sizes don’t go all the way up to n=120 per cell; instead, they go up to only n=50 per cell (as is commonly the case with lab studies) [6]:

A surprisingly consequential assumption involves the symmetry of the distribution of sample sizes across studies. If there are more small-n than large-n studies, or vice versa, PET-PEESE’s performance suffers quite a bit. For example, if sample sizes look like this:

then PET-PEESE looks like this:


Micro-appendix

1) It looks worse if there are more big-n than small-n studies (.png).
2) Even if studies have n=12 to n=120, there is noticeable bias if n is skewed across studies (.png).

Real meta-analyses, I believe, are likely to have skewed n distributions. For example, this is what the distribution looked like in that ego-depletion paper (note: it plots total N, not per-cell n):

So far we have assumed all studies have the exact same effect size; say, all studies in the d=.4 bin are exactly d=.4. In real life different studies have different effects. For example, a meta-analysis of ego depletion may include studies with stronger and weaker manipulations that lead to, say, d=.5 and d=.3 respectively. On average the effect may be d=.4, but it moves around. Let’s see what happens if across studies the effect size has a standard deviation of SD=.2.
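A sketch of the change (again my reconstruction, with hypothetical parameters matching the text): each study now draws its own true effect from a normal distribution with mean .4 and SD .2.

```r
# Heterogeneous effects: each study's true d is drawn from N(.4, .2)
simulate.meta.het <- function(k = 100, d.mean = .4, d.sd = .2, n.min = 12, n.max = 50) {
  n      <- round(runif(k, n.min, n.max))
  d.true <- rnorm(k, d.mean, d.sd)                        # study-specific true effects
  d <- mapply(function(ni, di) {
    x <- rnorm(ni); y <- rnorm(ni, mean = di)
    (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)
  }, n, d.true)
  data.frame(d = d, v = 2 / n + d^2 / (4 * n), n = n)
}
```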

Micro-appendix
3) If big-n studies are more common than small-n studies: .png
4) If n goes from n=12 to n=120 instead of topping out at n=50: .png

Most troubling scenario
Finally, here is what happens when there is publication bias (only p<.05 results are observed):


Micro-appendix
With publication bias:
5) If n goes up to n=120: .png
6) If n is uniform from n=12 to n=50: .png
7) If d is homogeneous, SD(d)=0: .png

It does not seem prudent to rely on PET-PEESE, in any way, for analyzing psychological research. It’s an invalid tool under too many scenarios.



Author feedback.
Our policy is to share early drafts of our posts with authors whose work we discuss. I shared this post with the creators of PET-PEESE, and also with others familiar with it: Will Gervais, Daniel Lakens, Joe Hilgard, Evan Carter, Mike McCullough, and Bob Reed. Their feedback helped me identify an important error in my R Code, avoid some statements that seemed unfair, and become aware of the recent SPPS paper by Tom Stanley (see footnote 4). During this process I also learned, to my dismay, that people seem to believe, incorrectly, that p-curve is invalidated by heterogeneity of effect size. A future post will discuss this issue; impatient readers can check out our p-curve papers, especially Figure 1 in our first paper (here) and Figure S2 in our second (here), which already address it, though evidently not compellingly enough.

Last but not least, everyone I contacted was offered an opportunity to reply within this post. Both Tom Stanley (.pdf), and Joe Hilgard (.pdf) did.



Footnotes.

  1. Actually, that’s just PEESE; PET uses the standard error as the predictor []
  2. With PET-PEESE one runs both regressions. If PET is significant, one uses PEESE; if PET is not significant, one uses PET (!). []
  3. Though a working paper by Alinaghi and Reed suggests PET-PEESE performs poorly there as well (.pdf) []
  4. I shared an early draft of this post with various peers, including Daniel Lakens and Stanley himself. They both pointed me to a recent paper in SPPS by Stanley (.pdf). It identifies conditions under which PET-PEESE gives bad results. The problems I identify here are different, and much more general, than those identified there. Moreover, the results presented here seem to directly contradict the conclusions of the SPPS paper. For instance, Stanley proposes that if the observed heterogeneity across studies is I²<80% we should trust PET-PEESE; and yet in none of the simulations I present here, with utterly invalid results, is I²>80%. Thus I would suggest readers not follow that advice. Stanley (.pdf) also points out that when there are 20 or fewer studies PET-PEESE should not be used; all my simulations assume 100 studies, and the results do not improve with a smaller sample of studies. []
  5. In particular, when preparing Colada[58] I simulated meta-analyses where, instead of choosing sample size at random, as the funnel plot assumes, researchers choose larger samples to study smaller effects. I found truly spectacularly poor performance by PET-PEESE, much worse than trim-and-fill. Thinking about it, I realized that if researchers do any sort of power calculation, even intuitive or based on experience, then a symmetric distribution of effect sizes leads to an asymmetric distribution of sample sizes. See this illustrative figure (R Code):

    So it seemed worth checking if asymmetry alone, even if researchers were to set sample size at random, led to worse performance for PET-PEESE. And it did. []
  6. e.g., using the degrees of freedom of t-tests scraped from published studies as data, back in 2010 the median n in Psych Science was about 18, and around 85% of studies had n<50 []

[58] The Funnel Plot is Invalid Because of This Crazy Assumption: r(n,d)=0

The funnel plot is a beloved meta-analysis tool. It is typically used to answer the question of whether a set of studies exhibits publication bias. That’s a bad question because we always know the answer: it is “obviously yes.” Some researchers publish some null findings, but nobody publishes them all. It is also a bad question because the answer is inconsequential (see Colada[55]). But the focus of this post is that the funnel plot gives an invalid answer to that question. The funnel plot is a valid tool only if all researchers set sample size randomly [1].

What is the funnel plot?
The funnel plot is a scatter-plot with individual studies as dots. A study’s effect size is represented on the x-axis, and its precision is represented on the y-axis. For example, the plot below, from  a 2014 Psych Science paper (.pdf), shows a subset of studies on the cognitive advantage of bilingualism.

The key question people ask when staring at funnel plots is: Is this thing symmetric?

If we observed all studies (i.e., if there was no publication bias), then we would expect the plot to be symmetric because studies with noisier estimates (those lower on the y-axis) should spread symmetrically on either side of the more precise estimates above them. Publication bias kills the symmetry because researchers who preferentially publish significant results will be more likely to drop the imprecisely estimated effects that are close to zero (because they are p > .05), but not those far from zero (because they are p < .05). Thus, the dots in the bottom left (but not in the bottom right) will be missing.

The authors of this 2014 Psych Science paper concluded that publication bias is present in this literature based in part on how asymmetric the above funnel plot is (and in part on their analysis of publication outcomes of conference abstracts).

The assumption
The problem is that the predicted symmetry hinges on an assumption about how sample size is set: that there is no relationship between the effect size being studied, d, and the sample size used to study it, n. Thus, it hinges on the assumption that r(n, d) = 0.

The assumption is false if researchers use larger samples to investigate effects that are harder to detect, for example, if they increase sample size when they switch from measuring an easier-to-influence attitude to a more difficult-to-influence behavior. It is also false if researchers simply adjust sample size of future studies based on how compelling the results were in past studies. If this happens, then r(n,d)<0 [2].

Returning to the bilingualism example, that funnel plot we saw above includes quite different studies; some studied how well young adults play Simon, others at what age people got Alzheimer’s. The funnel plot above is diagnostic of publication bias only if the sample sizes researchers use to study these disparate outcomes are in no way correlated with effect size. If more difficult-to-detect effects lead to bigger samples, the funnel plot is no longer diagnostic [3].

A calibration
To get a quantitative sense of how serious the problem can be, I ran some simulations (R Code).

I generated 100 studies, each with a true effect size drawn from d~N(.6,.15). Researchers don’t know the true effect size, but they guess it; I assume their guesses correlate .6 with the truth, so r(d,dguess)=.6. Using dguess they set n for 80% power. There is no publication bias; all studies are reported [4].
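Here is a sketch of that simulation (my reconstruction; the post’s R Code is the authoritative version). The rescaling of the guesses and the floor at d=.1 are my own assumptions, made only to keep sample sizes finite:

```r
# Researchers guess the effect size (imperfectly) and power each study at 80% for their guess
set.seed(1)
k       <- 100
d.true  <- rnorm(k, mean = .6, sd = .15)
noise   <- rnorm(k)
d.guess <- .6 + .15 * (.6 * scale(d.true)[, 1] + sqrt(1 - .6^2) * noise)  # r(d, d.guess) ~ .6
d.guess <- pmax(d.guess, .1)                                              # avoid absurd n
n  <- sapply(d.guess, function(dg) ceiling(power.t.test(delta = dg, power = .80)$n))
se <- sqrt(2 / n)                               # approximate standard error of d
d.obs <- rnorm(k, mean = d.true, sd = se)       # observed effects; all get "published"
plot(d.obs, 1 / se, xlab = "Observed effect size (d)", ylab = "Precision (1/SE)")
```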

The result: a massively asymmetric funnel plot.

That’s just one simulated meta-analysis; here is an image with 100 of them: (.png).

That funnel plot asymmetry above does not tell us “There is publication bias.”
That funnel plot asymmetry above tells us “These researchers are putting some thought into their sample sizes.”

Wait, what about trim and fill?
If you know your meta-analysis tools, you know the most famous tool to correct for publication bias is trim-and-fill, a technique that is entirely dependent on the funnel plot. In particular, it deletes real studies (trims) and adds fabricated ones (fills) to force the funnel plot to be symmetric. Predictably, it gets it wrong. For the simulations above, where mean(d)=.6, trim-and-fill incorrectly “corrects” the point estimate downward by over 20%, to d̂=.46, because it forces symmetry onto a literature that should not have it (see R Code) [5].
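To see the mechanics, one can apply trim-and-fill to the simulated studies above. The sketch below uses the metafor package, which is an assumption on my part (the post’s own R Code may implement it differently):

```r
# Trim-and-fill on the simulated (bias-free but asymmetric) meta-analysis
library(metafor)
res <- rma(yi = d.obs, sei = se, method = "REML")  # random-effects meta-analysis
trimfill(res)                                      # imputes "missing" studies to force symmetry
```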

Bottom line.
Stop using funnel plots to diagnose publication bias.
And stop using trim-and-fill and other procedures that rely on funnel plots to correct for publication bias.


Authors feedback.
Our policy is to share early drafts of our post with authors whose work we discuss. This post is not about the bilingual meta-analysis paper, but it did rely on it, so I contacted the first author, Angela De Bruin. She suggested some valuable clarifications regarding her work, which I attempted to incorporate (she also indicated she is interested in running p-curve analysis on follow-up work she is pursuing).



Footnotes.

  1. By “randomly” I mean orthogonally to true effect size, so that the expected correlation between sample size and effect size is zero: r(n,d)=0. []
  2. The problem that asymmetric funnel plots may arise from r(d,n)<0 is mentioned in some methods papers (see e.g., Lau et al. .pdf), but it is usually ignored by funnel-plot users. Perhaps in part because the problem is described as a theoretical possibility, a caveat, when it is a virtual certainty, a deal-breaker. It also doesn’t help that so many sources that explain funnel plots don’t disclose this problem, e.g., the Cochrane handbook for meta-analysis .htm. []
  3. Causality can also go the other way: Given the restriction of a smaller sample, researchers may measure more obviously impacted variables. []
  4. To give you a sense of what assuming r(d,dguess)=.6 implies for researchers’ ability to figure out the sample size they need: for the simulations described here, researchers would set a sample size that is on average off by 38%; for example, the researcher needs n=100, but she runs n=138, or runs n=62. So not super accurate (R Code). []
  5. This post was modified on April 7th, you can see an archived copy of the original version here []

[57] Interactions in Logit Regressions: Why Positive May Mean Negative

Of all economics papers published this century, the 10th most cited appeared in Economics Letters, a journal with an impact factor of 0.5. It makes an inconvenient and counterintuitive point: the sign of the estimate (b̂) of an interaction in a logit/probit regression need not correspond to the sign of its effect on the dependent variable (Ai & Norton 2003, .pdf; 1467 cites).

That is to say, if you run a logit regression like y=logit(b1x1+b2x2+b3x1x2) and get b̂3=.5, a positive interaction estimate, it is possible (and quite likely) that for many observations the impact of the interaction on the dependent variable is negative; that is, that as x1 gets larger, the impact of x2 on y gets smaller.

This post provides an intuition for that reversal, and discusses when it actually matters.

side note: Many economists run “linear probability models” (OLS) instead of logits, to avoid this problem. But that does not fix this problem, it just hides it. I may write about that in a future post.

Buying a house (no math)
Let’s say your decision to buy a house depends on two independent factors: (i) how much you like it, and (ii) how good an investment it is.

Unbounded scale. If the house decision were on an unbounded scale, say how much to pay for it, liking and investment value would remain independent. If you like the house enough to pay $200k, and in addition it would give you $50k in profits, you’d pay $250k; if the profits were $80k instead of $50k, you’d pay $280k. Two main effects, no interaction [1].

Bounded scale. Now consider, instead of dollars paid, measuring how probable it is that you buy the house: a bounded dependent variable (0-1). Imagine you love the house (Point C in the figure below). Given that enthusiasm, a small increase or drop in how good an investment it is doesn’t affect the probability much. If you felt lukewarm, in contrast (Point B), a moderate increase in investment quality could make a difference. And at Point A, moderate changes again don’t matter much.

Key intuition: when the dependent variable is bounded [2], the impact of every independent variable moves it closer to or further from that bound, and hence impacts how flat the curve is, how sensitive the dependent variable is to changes in any other variable. Every variable, then, has an interactive effect on all other variables, even if they are not meaningfully related to one another and even if interaction effects are not included in the regression equation.

Mechanical vs conceptual interactions
I call interactions that arise from the non-linearity of the model “mechanical” interactions, and those that arise from variables actually influencing each other “conceptual” interactions.

In life, most conceptual interactions are zero: how much you like the color of the kitchen in a house does not affect how much you care about roomy closets, the natural light in the living room, or the age of the AC system. But, in logit regressions, EVERY mechanical interaction is ≠0; if you love the kitchen enough that you really want to buy the house, you are far to the right in the figure above and so all other attributes now matter less: closets, AC system and natural light all now have less detectable effects on your decision.

In a logit regression, the b̂s one estimates only capture conceptual interactions. When one computes “marginal effects”, when one goes beyond the b̂ to ask how much the dependent variable changes as we change a predictor, one adds the mechanical interaction effect.

Ai and Norton’s point, then, is that the coefficient may be positive (b̂>0, a positive conceptual interaction) while the marginal effect is negative (conceptual + mechanical is negative).

Let’s take this to logit land
Let
y: probability of buying the house
x1: how much you like it
x2: how good an investment it is

and,
y= logit(b1x1+b2x2)  [3]
(note: there is no interaction in the true model, no x1x2 term)

Below I plot that true model, y on x2, keeping x1 constant at x1=0 (R Code for all plots in post).


We are interested in the interaction of x1 with x2: in how x2 affects the impact of x1 on y. Let’s add a new line to the figure, keeping x1 fixed at x1=1 instead of x1=0.


For any given investment value, say x2=0, you are more likely to buy the house if you like it more (dashed red vs solid black line). The vertical distance between lines is the impact of x1=1 vs x1=0; one can already see that around the extremes the gap is smaller, so the effect of x1 gets smaller when x2 is very big or very small.

Below I add arrows that quantify the vertical gaps at specific x2 values. For example, when x2=-2, going from x1=0 to x1=1 increases the probability of purchase by 15%, and by 23% when x2=-1 [4]

The difference across arrows captures how the impact of x1 changes as we change x2: the interaction. The bottom chart, under the brackets, shows the results. Recall there is no conceptual interaction here (the model is y=logit(x1+x2)), so those interactions, +.08 and -.08 respectively, are purely mechanical.
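These numbers can be verified with a few lines of R (my own sketch, not the post’s code):

```r
# Purely mechanical interaction: the true model has no x1*x2 term
p   <- function(x1, x2) plogis(x1 + x2)        # y = logit(x1 + x2)
gap <- function(x2) p(1, x2) - p(0, x2)        # effect of x1 going 0 -> 1, at a given x2
c(gap(-2), gap(-1))                            # ~ .15 and ~ .23, the arrows in the figure
gap(-1) - gap(-2)                              # ~ +.08
gap(1)  - gap(0)                               # ~ -.08
```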

Now: the sign reversal
So far we assumed x1 and x2 were not conceptually related. The figure below shows what happens when they are: y=logit(x1+x2+0.25x1x2). Despite the conceptual interaction being b3=.25 > 0, the total effect of the interaction is negative for high values of x2 (e.g., from x2=1 to x2=2 it is -.08); the mechanical interaction dominates.
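Again a quick check in R (my own sketch):

```r
# Positive conceptual interaction (b3 = +.25), yet a negative total interaction at high x2
p2   <- function(x1, x2) plogis(x1 + x2 + .25 * x1 * x2)
gap2 <- function(x2) p2(1, x2) - p2(0, x2)
gap2(-1) - gap2(-2)   # ~ +.11
gap2(2)  - gap2(1)    # ~ -.08: the mechanical interaction dominates
```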


What to do about this?

Ai & Norton propose not focusing on the point estimate, b̂3=.25, at all. Instead, they propose computing how much the dependent variable changes with a change in the underlying variables, the marginal effect of the interaction, the one that combines conceptual and mechanical; doing that for every data point; and reporting the average.

In another Economics Letters paper, Greene (2010; .pdf) [5] argues that averaging the interaction is kind of meaningless. He has a point: ask yourself how informative it is to tell a reader that the average of the interaction effects depicted above, +.11 and -.08, is +.015. He suggests plotting the marginal effect for every value instead.

But, such graphs will combine conceptual and mechanical interactions. Do we actually want to do that? It depends on whether we have a basic-research or applied-research question.

What is the research question?
Imagine a researcher examining the benefits of text-messaging the parents of students who miss a homework, and suppose the researcher is interested in whether messages are less beneficial for high-GPA students (that is, in the interaction: message*GPA).

An applied research question may be:

“How likely is a student to get an A in this class if we text-message his parents when he misses a homework?”

For that question, yes, we need to include the mechanical interaction to be accurate. If high GPA students were going to get an A anyway, then the text-message will not increase the probability for them. The ceiling effect is real and should be taken into account. So we need the marginal effect.

A (slightly more) basic-research question may be:

“How likely is a student to get more academically involved in this class if we text-message his parents when he misses a homework?”

Here grades are just a proxy, a proxy for involvement; if high GPA students were getting an A anyway, but thanks to the text-message will become more involved, we want to know that. We do not want the marginal effect on grades, we want the conceptual interaction, we want b̂.

In sum: When asking conceptual or basic-research questions, if b̂ and the marginal effects disagree, go with b̂.



Authors feedback.
Our policy is to contact authors whose work we discuss, asking to suggest changes and reply within our blog if they wish. I shared a draft with Chunrong Ai & Edward Norton. Edward replied indicating he appreciated the post and suggested I tell readers about another article of his, further delving into this issue (.pdf)



Footnotes.

  1. What really matters is a linear vs. non-linear scale rather than bounded vs. not, but bounded provides the intuition more clearly. []
  2. As mentioned before, the key is non-linear rather than bounded []
  3. the logit model is y = e^(b1x1+b2x2) / (1 + e^(b1x1+b2x2)). []
  4. percentage points, I know, but it’s a pain to write that every time. []
  5. The author of that “Greene” econometrics textbook used in Econ PhD programs (.htm) []

[56] TWARKing: Test-Weighting After Results are Known

In the last class of the semester I hold a “town-hall” meeting: an open discussion about how to improve the course (content, delivery, grading, etc.). I follow up with a required online poll to “vote” on proposed changes [1].

Grading in my class is old-school. Two tests, each worth 40%, and homeworks worth 20% (graded mostly on a 1/0 completion scale). The downside of this model is that those who do poorly early on get demotivated. Also, a bit of bad luck in a test hurts a lot. During the latest town hall, the idea of having multiple quizzes and dropping the worst was popular. One problem with this model is that students can blow off a quiz entirely. After the town hall I thought about why students loved the drop-1 idea and whether I could capture the same psychological benefit with a smaller pedagogical loss.

I came up with TWARKing: assigning test weights after results are known [2]. With TWARKing, instead of each test counting 40% for every student, whichever test an individual student did better on gets more weight; so if Julie does better on Test 1 than on Test 2, then Julie’s Test 1 gets 45% and her Test 2 gets 35%, but if Jason did better on Test 2, then Jason’s Test 2 gets 45% [3]. Dropping a quiz becomes a special case of TWARKing: the worst gets 0% weight.
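In code, the rule is trivial; here is a toy sketch (the 45/35/20 weights are the ones from the example above):

```r
# TWARKed course grade: the better test gets the larger weight
twark <- function(test1, test2, hw, w.hi = .45, w.lo = .35, w.hw = .20) {
  w.hi * max(test1, test2) + w.lo * min(test1, test2) + w.hw * hw
}
twark(90, 70, 100)   # 90*.45 + 70*.35 + 100*.20 = 85 (vs. 84 with fixed 40/40/20 weights)
```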

It polls well
I expected TWARKing to do well in the online poll but was worried students would fall prey to competition-neglect, so I wrote a long question stacking the deck against TWARKing:
[Figure: the poll question]

[Figure: poll results]

70% of students were in favor and only 15% against (N=92; only 3 students did not complete the poll).

The poll is not anonymous, so I looked at how TWARKing attitudes are correlated with actual performance.

[Figure: TWARKing support vs. performance (Panels A-C)]

Panel A shows that students who are doing better like TWARKing less, but the effect is not as strong as I would have expected: students liking it 5/5 perform, on average, in the bottom 40%; those liking it 2/5 are in the top 40%.

Panel B shows that students with more uneven performance do like TWARKing more, but the effect is small and unimpressive (Spearman’s r=.21, p=.044).

For Panel C I recomputed the final grades as if TWARKing had been implemented this semester and checked whether the change in ranking correlated with support for TWARKing. It did not. Maybe it was asking too much for this to work, as students did not yet know their Test 2 scores.

My read is that students cannot anticipate whether it will help or hurt them, and they generally like it all the same.

TWARKing could be pedagogically superior.
Tests serve two main roles: motivating students and measuring performance. I think TWARKing could be better on both fronts.

Better measurement. My tests tend to include insight-type questions: students either nail them or fail them. It is hard to get lucky on my tests, I think, hard to get a high score despite not knowing the material. But it is easy, unfortunately, to get unlucky: to get no points on a topic you had a decent understanding of [4]. Giving more weight to the higher-scoring test is hence giving more weight to the more accurate of the two tests, so it could improve the overall validity of the grade. A student who gets a 90 and a 70 is, I presume, better than one getting 80 on both tests.

This reminded me of what Shugan & Mitra (2009, .pdf) label the “Anna Karenina effect” in their under-appreciated paper (11 Google cites). Their Anna Karenina effect (there are a few; each different from the other) occurs when less favorable outcomes carry less information than more favorable ones; in those situations, measures other than the average, e.g., the max, perform better for out-of-sample prediction. [5]

To get an intuition for this Anna Karenina effect, think about what contains more information: a marathon runner’s best or worst running time? A researcher’s most or least cited paper?

Note that one can TWARK within a test, weighting each student’s highest-scored answers more. I will.

Motivation. After doing very poorly on a test, it must be very motivating to feel that if you study hard you can make that bad performance count less. I speculate that with TWARKing, students who underperform on Test 1 are less likely to be demotivated for Test 2 (I will test this next semester, but without random assignment…). TWARKing has the magical psychological property that the gains are very concrete (every single student gets a higher average with TWARKing than without, and they see that), while the losses are abstract and unverifiable (you don’t see the students who benefited more than you did, leading to a net loss in ranking).

Bottom line
Students seem to really like TWARKing.
It may make things better for measurement.
It may improve motivation.

A free happiness boost.





Footnotes.

  1. Like Brexit, the poll in OID290 is not binding []
  2. Obviously the name is inspired by ‘HARKing’: hypothesizing after results are known.  The similarity to Twerking, in contrast, is unintentional, and, given the sincerity of the topic, probably unfortunate. []
  3. I presume someone already does this; I am not claiming novelty []
  4. Students can still get lucky if I happen to ask on a topic they prepared better for. []
  5. They provide calibrations with real data in sports, academia and movie ratings. Check the paper out. []

[55] The file-drawer problem is unfixable, and that’s OK

The “file-drawer problem” consists of researchers not publishing their p>.05 studies (Rosenthal 1979 .pdf).
P-hacking consists of researchers not reporting their p>.05 analyses for a given study.

P-hacking is easy to stop. File-drawering is nearly impossible to stop.
Fortunately, while p-hacking is a real problem, file-drawering is not.

Consequences of p-hacking vs file-drawering
With p-hacking it’s easy to get p<.05 [1]. Run 1 study, p-hack a bit, and it will eventually “work,” whether or not the effect is real. In “False-Positive Psychology” we showed that a bit of p-hacking gets you p<.05 with more than a 60% chance (SSRN).

With file-drawering, in contrast, when there is no real effect, only 1 in 20 studies works. It’s hard to be a successful researcher with such a low success rate [2]. It’s also hard to fool oneself that the effect of interest is real when 19 in 20 studies fail. There are only so many hidden moderators we can talk ourselves into. Moreover, papers typically have multiple studies. A four-study paper would require file-drawering 76 failed studies. Nuts.
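The arithmetic behind that last number (a back-of-the-envelope sketch):

```r
# With a null effect, a study "works" with probability .05 (1 in 20)
p.work <- .05
(1 / p.work - 1) * 4   # failed studies to file-drawer for a 4-study paper: 19 * 4 = 76
```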

File-drawering entire studies is not really a problem, which is good news, because the solution for the file-drawer is not really a solution [3].

Study registries: The non-solution to the file-drawer problem
Like genitals & generals, study registries & pre-registrations sound similar but mean different things.

A study registry is a public repository where authors report all studies they run. A pre-registration is a document authors create before running one study, to indicate how that given study will be run. Pre-registration intends to solve p-hacking. Study registries intend to solve the file-drawer problem.

Study registries sound great, until you consider what needs to happen for them to make a difference.

How the study registry is supposed to work
You are reading a paper and get to Study 1. It shows X. You put the paper down, visit the registry, search for the set of all other studies examining X or things similar to X (so maybe search by author, then by keyword, then by dependent variable, then by topic, then by manipulation), then decide which subset of the studies you found are actually relevant for the Study 1 in front of you (e.g., actually studying X, with a similarly clean design, competent enough execution, comparable manipulation and dependent variable, etc.). Then you tabulate the results of those studies found in the registry, and use the meta-analytical statistical tool of your choice  to combine those results with the one from the study still sitting in front of you.  Now you may proceed to reading Study 2.

Sorry, I probably made it sound much easier than it actually is. In real life, researchers don’t comply with registries the way they are supposed to. The studies found in the registry almost surely will lack the info you need to ‘correct’ the paper you are reading. A year after being completed, about 90% of studies registered in ClinicalTrials.gov do not have their results uploaded to the database (NEJM, 2015 .pdf). Even for the subset of trials where posting results is ‘mandatory’, it does not happen (BMJ, 2012 .pdf), and when results are uploaded, they are often incomplete and inconsistent with the results in the published paper (Ann Int Medicine 2014 .pdf). This sounds bad, but in social science it would be way worse; in medicine the registry is legally required, for us it would be voluntary. Our registries would only include the subset of studies some social scientists choose to register (the rest remain in the file-drawer…).

Study registries in social science fall short of fixing an inconsequential problem, the file-drawer, and they are burdensome to comply with and to use.

Pre-registration: the solution to p-hacking
Fixing p-hacking is easy: authors disclose how sample size was set & all measures, conditions, and exclusions (“False Positive Psychology” SSRN). No ambiguity, no p-hacking.

For experiments, the best way to disclose is with pre-registrations. A pre-registration consists of writing down what one wants to do before one does it. In addition to the disclosure items above, one specifies the hypothesis of interest and the focal statistical analysis. The pre-registration is then appended to studies that get written up (and file-drawered with those that don’t). Its role is to demarcate planned from unplanned analyses. One can still explore, but now readers know one was exploring.

Pre-registration is an almost perfect fix for p-hacking, and it can be extremely easy to comply with and use.

In AsPredicted it takes 5 minutes to create a pre-registration, half a minute to read it (see sample .pdf). If you pre-register and never publish the study, you can keep your AsPredicted private forever (it’s about p-hacking, not the file-drawer). Over 1000 people created AsPredicteds in 2016.

Summary
– The file-drawer is not really a problem, and study registries don’t come close to fixing it.
– P-hacking is a real problem. Easy-to-create and easy-to-evaluate pre-registrations all but eliminate it.


Uri’s note: post was made public by mistake when uploading the 1st draft.  I did not receive feedback from people I was planning to contact and made several edits after posting. Sorry.



Footnotes.

  1. With p-hacking it is also easy to get a Bayes Factor >3; see “Posterior Hacking,” http://DataColada.org/13. []
  2. it’s actually 1 in 40, since we usually make directional predictions and rely on two-sided tests []
  3. p-curve is a statistical remedy to the file-drawer problem, and it does work (.pdf) []

[54] The 90x75x50 heuristic: Noisy & Wasteful Sample Sizes In The “Social Science Replication Project”

An impressive team of researchers is engaging in an impressive task: Replicate 21 social science experiments published in Nature and Science in 2010-2015 (.htm).

The task requires making many difficult decisions, including what sample sizes to use. The authors’ current plan is a simple rule: set n for the replication so that it would have 90% power to detect an effect 75% as large as the original effect-size estimate. If “it fails” (p>.05), try again, powering for an effect 50% as big as the original.

In this post I examine the statistical properties of this “90-75-50” heuristic, concluding it is probably not the best solution available. It is noisy and wasteful [1].
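To make the two rules concrete, here is a sketch of the sample sizes they imply for a hypothetical original study with n=50 per cell and an observed effect of d=.4 (my own illustration; I assume the second replication is again powered at 90%):

```r
# 90-75-50 heuristic vs. Small Telescopes, for a hypothetical original (n = 50, observed d = .4)
d.obs  <- .4
n.orig <- 50
n.rep1 <- ceiling(power.t.test(delta = .75 * d.obs, power = .90)$n)  # 90% power for 75% of d.obs
n.rep2 <- ceiling(power.t.test(delta = .50 * d.obs, power = .90)$n)  # if it "fails": 50% of d.obs
n.small.telescopes <- 2.5 * n.orig                                   # Small Telescopes: 2.5x original
c(rep1 = n.rep1, rep2 = n.rep2, small.telescopes = n.small.telescopes)
```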

Noisy n.
It takes a huge sample to precisely estimate an effect size (ballpark: n=3000 per cell; see DataColada[20]). Typical experiments, with much smaller ns, provide extremely noisy estimates of effect size; sample-size calculations for replications based on such estimates are extremely noisy as well.

As a calibration let’s contrast 90-75-50 with the “Small-Telescopes” approach (.pdf), which requires replications to have 2.5 times the original sample size to ensure 80% power to accept the null. Zero noise.

The figure below illustrates. It considers an original study that was powered at 50% with a sample size of 50 per cell. What sample size will that original study recommend for the first replication (powered at 90% for 75% of the observed effect)? The answer is a wide distribution of sample sizes, reflecting the wide distribution of effect-size estimates the original could produce [2]. Again, this is the recommendation for replicating the exact same study, with the same true effect and same underlying power; the variance you see in the replication recommendation purely reflects sampling error in the original study (R Code).

We can think of this figure as the roulette wheel being used to set the replication’s sample size.

The average sample size recommendations of both procedures are similar: n=125 for the Small Telescopes approach vs. n=133 for 90-75-50. But the heuristic has lots of noise: the standard deviation of its recommendations is 50 observations, more than 1/3 of its average recommendation of 133 [3].

Waste
The 90-75-50 heuristic throws good money after bad, escalating commitment to studies that have already accepted the null. Consider an original study that is a false positive with n=20. Given the distribution of (p<.05) possible original effect-size estimates, 90-75-50 will on average recommend n=67 per cell for the first replication, and when that one fails (which it will with 97.5% probability, because the original is a false positive), it will run a second replication, now with n=150 participants per cell (R Code).

From the “Small Telescopes” paper (.pdf) we know that if 2.5 times the original (n=20) were run in the first replication, n=50, we would already have an 80% chance of accepting the null. So in the vast majority of cases, when replicating it with n=67, we will already have accepted the null; why throw another n=150 at it? That dramatic explosion of sample size for false-positive original findings is about the same for any original n, such that:

False-positive original findings lead to replications with about 12 times as many subjects per-cell when relying on 90-75-50

If the false-positive original was p-hacked, it’s worse. The original p-value will be close to p=.05, meaning a smaller estimated original effect size and hence even larger replication sample size. For instance, if the false-positive original got p=.049, 90-75-50 will trigger replications with 14 times the original sample size (R Code).

Rejecting the null
So far we have focused on power and wasted observations when accepting the null. What if the null is false? The figure below shows power for rejecting the null. We see that if the original study had even mediocre power, say 40%, the gains from going beyond 2.5 times the original sample are modest. The Small Telescopes approach provides reasonable power to accept and also to reject the null (R Code).

[Figure: power to reject the null]

Better solution.
Given the purpose (and budget) of this replication effort, the Small-Telescopes recommendation could be increased to 3.5n instead of 2.5n, giving nearly 90% power to accept the null [4].

The Small Telescopes approach requires fewer participants overall than 90-75-50 does, is unaffected by statistical noise, and paves the way to a much-needed “Do we accept the null?” mindset for interpreting ‘failed’ replications.



Author feedback.
Our policy is to contact authors whose work we discuss, asking them to suggest changes and to reply within our blog if they wish. I shared a draft with several of the authors behind the Social Science Replication Project and discussed it with a few of them. They helped me clarify the depiction of their sample-size selection heuristic, prompted me to drop a discussion I had involving biased power estimates for the replications, and prompted me, indirectly, to add the calculations and discussion of waste included in the post you just read. Their response was prompt and valuable.



Footnotes.

  1. The data-peeking involved in the 2nd replication inflates false-positives a bit, from 5% to about 7%, but since replications involve directional predictions, if they use two-sided tests, it’s fine. []
  2. The calculations behind the figure work as follows. One begins with the true effect size, the one giving the original sample 50% power. Then one computes how likely each possible significant effect-size estimate is, that is, the distribution of possible effect-size estimates for the original (this comes straight from the non-central distribution). Then one computes, for each effect-size estimate, the sample-size recommendation for the replication that the 90-75-50 heuristic would produce, that is, one based on an effect 75% as big as the estimate. Since we know how likely each estimate is, we know how likely each recommendation is, and that is what is plotted. []
  3. How noisy the 90-75-50 heuristic recommendation is depends primarily on the power of the original study and not the specific sample and effect sizes behind such power. If the original study has 50% power, the SD of the recommendation over the average recommendation is ~37% (e.g., 50/133) whether the original had n=50, n=200 or n=500. If underlying power is 80%, the ratio is ~46% for those same three sample sizes. See Section (5) in the R Code []
  4. Could also do the test half-way, after 1.75n, ending study if already conclusive; using a slightly stricter p-value cutoff to maintain desired false-positive rates; hi there @lakens []

[52] Menschplaining: Three Ideas for Civil Criticism

As bloggers, commentators, reviewers, and editors, we often criticize the work of fellow academics. In this post I share three ideas to be more civil and persuasive when doing so.

But first: should we comment publicly in the first place?
One of the best-known social psychologists, Susan Fiske (.htm), last week circulated a draft of an invited opinion piece (.pdf) in which she called academics who critically discuss published research in social media and blogs a long list of names, including ‘self-appointed data police’ [1].

I think data-journalist is a more accurate metaphor than data-police. Like journalists, and unlike police officers, (academic) bloggers don’t have physical or legal powers; they merely exercise free speech, sharing analyses and non-binding opinions that are valued by the people who choose to read them (in contrast, we are not free to ignore the police). Like journalists’, bloggers’ power hinges on being interesting, right, and persuasive.

Importantly, unlike journalists, most academic bloggers have training in the subject matter similar to that of the original authors whose work they discuss, and they inhabit their social and professional circles as well. So bloggers are elite journalists: more qualified and better incentivized to be right [2].

Notorious vs influential
Bloggers and other commentators, as Susan Fiske reminds us, can fall into the temptation of getting more attention by acting out, say by using colorful language to make vague and unsubstantiated accusations. But the consequences are internalized.

Acting out ends up hurting commentators more than those commented on. Being loud makes you notorious, not influential (think Ann Coulter). Moreover, when you have a good argument, acting out is counterproductive, it distracts from it.  You become less persuasive. Only those who already agree with you will respond to your writing. Academics who make a living within academia have no incentive to be notorious.

Despite the incentives to be civil, there is certainly room in public discussions for greater civility. If the president of APS had asked me to write a non-peer-reviewed article on this topic, I would have skipped the name-calling and gotten to the following three ideas.

Idea 1. Don’t label, describe
It is tempting to label the arguments we critique, saying about them things like ‘faulty logic,’ ‘invalid analyses,’ ‘unwarranted conclusions.’ These terms sound specific, but are ultimately empty phrases that cannot be evaluated by readers. When we label, all we are really saying is “Listen, I am so upset about this, I am ready to throw some colorful terms around.”

An example from my own (peer-reviewed & published) writing makes me cringe every time:
Their rebuttal is similarly lacking in diligence. The specific empirical concerns it raised are contradicted by evidence, logic, or both.
What a douche.

Rather than vague but powerful-sounding labels, “lacking diligence,” “contradicted by evidence,” it is better to describe the rationale for those labels. What additional analyses should they have run, and how do those analyses contradict their conclusions? I should’ve written:
The rebuttal identifies individual examples that intuitively suggest my analyses were too conservative, but, on the one hand, closer examination shows the examples are not actually conservative, and on the other, the removal of those examples leaves the results unchanged.

Now readers know what to look for to decide if they agree with me. Labels become redundant; we can drop them.

Idea 2. Don’t speculate about motives
We often assume our counterparts have bad intentions. A hidden agenda, an ulterior nefarious motive. They do this because they are powerful, or if not, because they are powerless. For instance, I once wrote:
“They then, perhaps disingenuously, argued that…”
Jerk.

Two problems with speculating about motives. First, it is delusional to think we know why someone did something just from seeing what they did, especially when it is in our interest to believe their intentions are not benign. We don’t know why people do what they do, and it is too easy to assume they do it for reasons that would make us happier or holier. Second, intentions are irrelevant. If someone publishes a critique of p-curve because they hate Joe’s, Leif’s, and/or my guts, all that matters is whether they are right or wrong; so when discussing the critique, all we should focus on is whether it is right or wrong.

Idea 3. Reach out
Probably the single thing that has helped me improve the most in the civility department is our policy on this blog of contacting authors whose work we discuss before making things public. It is amazing how much it helps, both after receiving the feedback, by addressing things that tick people off that we would have never guessed, and before, by anticipating remarks that may be irritating and dropping them.

I obviously cannot do this as a reviewer or editor; in those cases I still apply Ideas 1 & 2. Also, as a heuristic check on tone, I imagine I am going to dinner with the authors and their parents that night.

Summary.
1) I disagree with the substance of Susan Fiske’s piece. Academics discussing research in blogs and social media are elite data-journalists playing an indispensable role in modern science: disseminating knowledge, detecting and correcting errors, and facilitating open discussions.
2) I object to its tone. Calling colleagues terrorists, among many other things, increases the notoriety of our writing but reduces its influence. (It’s also disrespectful.)
3) I shared three ideas to improve civility in public discourse: don’t label, don’t infer motives, and reach out.



Author feedback.
I shared a draft of this post with Susan Fiske who suggested I make clear the document that circulated was a draft which she will revise before publication. I edited the writing to reflect this.



Footnotes.

  1. Her piece included the following terms to describe bloggers or their actions: (1) Mob, (2) Online vigilantes, (3) Self-appointed data police, (4) Personal ferocity, (5) crashing people, (6) Unmoderated attacks, (7) Unaccountable bullies, (8) Adversarial viciousness, (9) Methodological terrorists, (10) Dangerous minority, (11) Destructo-critics, (12) They attack the person (oops), (13) Self-appointed critics. []
  2. and less interesting, and worse at writing, and with less exciting social lives, but with more stable jobs. []

[51] Greg vs. Jamal: Why Didn’t Bertrand and Mullainathan (2004) Replicate?

Bertrand & Mullainathan (2004, .pdf) is one of the best-known and most-cited American Economic Review (AER) papers [1]. It reports a field experiment in which resumes given typically Black names (e.g., Jamal and Lakisha) received fewer callbacks than those given typically White names (e.g., Greg and Emily). This finding is interpreted as evidence of racial discrimination in labor markets.

Deming, Yuchtman, Abulafi, Goldin, and Katz (2016, .pdf) published a paper in the most recent issue of the AER. It also reports an audit study in which resumes were sent to recruiters. While their goal was to estimate the impact of for-profit degrees, they also manipulated the race of job applicants via names. They did not replicate the (secondary to them) famous finding that at least three other studies had replicated (.ssrn | .pdf | .pdf). Why? What was different in this replication? [2]

Small Telescope Test 
The new study had more than twice the sample size (N = 10,484) of the old study (N = 4,870). Although it also included Hispanic names, the sample for White and Black names was nevertheless about 50% larger than in the old study.

Callback rates in the two studies were (coincidentally) almost identical: 8.2% in both. In the famous study, Whites were 3 percentage points (pp) more likely to receive a callback than non-Whites. In the new study they were 0.4 pp less likely than non-Whites, not significantly different from zero, χ2(1)=.61, p=.44 [3].

The small-telescopes test (SSRN) asks whether the replication result rules out effects big enough to be detectable by the original. The confidence interval around the new estimate spans from 1.3 pp against Whites to 0.5 pp in favor of Whites. For the original sample size, an effect as small as that upper end, +0.5 pp, would yield a meager 9% statistical power. The new study’s results are inconsistent with effects big enough to be detectable by the older study, so we accept the null. (For more on accepting the null: DataColada[42].)
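Here is a sketch of that power calculation (my own, with approximate inputs: I assume the original’s ~4,870 resumes were split evenly across White and Black names and use the 8.2% callback base rate):

```r
# Power of the original design to detect a +0.5 percentage-point pro-White difference
power.prop.test(n = 4870 / 2, p1 = .082, p2 = .087)$power   # ~ 9%
```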

Why no effect of race?
Deming et al. list a few differences between their study and Bertrand & Mullainathan’s, but do not empirically explore their role [4]:
[Figure: quoted passages from Deming et al. listing differences between the two studies]
In addition to these differences, the new study used a different set of names. They are not listed in the paper, but the first author quickly provided them upon request.

That names thing is key for this post.

In DataColada[36], I wrote about a possible socioeconomic status (SES) confound in Bertrand and Mullainathan (2004) [5]. Namely, their Black names (e.g., Rasheed, Leroy, Jamal) seem low SES, while their White names (e.g., Greg, Matthew, Todd) do not. The Black names in the new study (e.g., Malik, Darius, and Andre) do not seem to share this confound. Below is the picture I used to summarize Colada[36], modified (in green font) to capture why I am interested in the new study’s Black names [6].

[Figure: summary graphic from Colada[36], annotated with the new study’s Black names]
But I am a Chilean Jew with a PhD; my own views on whether a particular name hints at SES may not be terribly representative of those of the everyday American. So I collected some data.

Study 1
In this study I randomly paired a Black name used in the old study with a Black name used in the new study, and N=202 MTurkers answered this question (Qualtrics survey file and data: .htm):
[Figure: Study 1 question]
Each respondent saw one random pair of names, in counterbalanced order. The results were more striking than I expected [7].
[Figure: Study 1 results]
Note: Tyrone was used in both studies so I did not include it.

Study 2 
One possible explanation for Study 1’s result (and the failure to replicate in the recent AER article) is that the names used in the new study were not perceived to be Black names. In Study 2, I asked N=201 MTurkers to determine the race and gender of some of the names from the old and new studies [8].
[Figure: Study 2 results]
Indeed the Black names in the new study were perceived as somewhat less Black than those used in the old one. Nevertheless, they were vastly more likely to be perceived as Black than were the control names. Based on this figure alone, we would not expect discrimination against Black names to disappear in the new study. But it did.

In Sum.
The lower callback rates for Jamal and Lakisha in the classic 2004 AER paper, and in the successful replications mentioned earlier, are as consistent with racial discrimination as with SES discrimination. The SES account also parsimoniously explains this one failure to replicate the effect. But this conclusion is tentative at best; we are comparing studies that differ on many dimensions (and the new study had some noteworthy glitches; read footnote 4). To test racial discrimination in particular, and name effects in general, we need the same study to orthogonally manipulate these, or at least to use names pretested to differ only on the dimension of interest. I don’t think any audit study has done that.


Author feedback.
Our policy is to contact authors whose work we discuss to request feedback and give an opportunity to respond within our original post. I contacted the 7 authors behind the two articles and received feedback from 5 of them. I hope to have successfully incorporated their suggestions; they focused on clearer use of “replication” terminology, and broader representation of the relevant literature.




Footnotes.

  1. According to Web of Science (WOS), its 682 citations make it the 9th most cited article published since 2000. Aside: the WOS should report an age-adjusted citation index. []
  2. For a review of field experiments on discrimination see Bertrand & Duflo (.pdf) []
  3. I carried out these calculations using the posted data (.dta | .csv). The new study reports regression results with multiple covariates and on subsamples. In particular, in Tables 3 and 4 they report results separately for jobs that do and do not require college degrees. For jobs without required degrees, White males did significantly worse: they were 4 percentage points (pp) less likely to be called back. For jobs that do require a college degree, the point estimate is a -1.5 pp effect (still an anti-White bias), but with a confidence interval that includes +2.5 pp (pro-White), in the ballpark of the 3 pp of the famous study, and large enough to be detectable with its sample size. So for jobs without degree requirements, a conclusive failure to replicate; for those that require college, an inconclusive result. The famous study investigated jobs that did not require a college degree. []
  4. Deming et al. surprisingly do not break down results for Hispanics vs. Blacks in their article; I emailed the first author and he explained that, due to a glitch in their computer program, they could not differentiate Hispanic from Black resumes. This is not mentioned in their paper. The glitch generates another possible explanation: Blacks may have been discriminated against just as in the original, but Hispanics were discriminated in favor of by a larger amount, so that when the two are collapsed into a single group, the negative effect is not observable. []
  5. Fryer & Levitt (2004) have a QJE paper on the origin and consequences of distinctively Black names (.pdf) []
  6. The opening section mentions three articles that replicate Bertrand and Mullainathan’s finding of Black names receiving fewer callbacks. The Black names in those papers also seem low SES to me but I did not include them in my data-collection. For instance, the names include  Latoya, Tanisha, DeAndre, DeShawn, and Reginald. They do not include any of the high SES Black names from Deming et al. []
  7. It would be most useful to compare SES perceptions between White and Black names, but I did not include White names in this study because I worried respondents would think “I am not going to tell you that I think White names are higher SES than Black names.” Possibly my favorite Mike Norton study (.pdf) documents that people are OK making judgments between two White or two Black faces, but reluctant to judge between a White and a Black one. []
  8. From Bertrand & Mullainathan I chose 3 Black names, the one their participants rated the Blackest, the median, and the least Black; see their Table A1. []

[50] Teenagers in Bikinis: Interpreting Police-Shooting Data

The New York Times, on Monday, showcased (.htm) an NBER working paper (.pdf) that proposed that “blacks are 23.8 percent less likely to be shot at by police relative to whites.” (p.22)

The paper involved a monumental data-collection effort to address an important societal question. The analyses are rigorous, clever, and transparently reported. Nevertheless, I do not believe the above conclusion is justified by the evidence. Relying on additional results reported in the paper, I show here that the data are consistent with police shootings being biased against Blacks, but too noisy to conclude confidently either way [1],[2].

Teenagers in bikinis
As others have noted [3], an interesting empirical challenge for interpreting the shares of Whites vs. Blacks shot by police while being arrested is that biased officers, those overestimating the threat posed by a Black civilian, will arrest less dangerous Blacks on average. They will arrest those posing a real threat, but also some not posing a real threat, resulting in lower average threat among those arrested by biased officers [4].

For example, a biased officer may be more likely to perceive a Black teenager in a bikini as a physical threat (YouTube) than a non-biased officer would, lowering the average threat of his arrestees. If teenagers in bikinis, in turn, are less likely to be shot by police than armed criminals are, racial bias will cause a smaller share of Black arrestees to be shot. A spurious association showing no bias precisely because there is bias.

A closer look at the table behind the result that Blacks are 23.8% less likely to be shot leads me to suspect the finding is indeed spurious.

Let’s focus on the red rectangle in the paper’s Table 5 (the other columns don’t control for the threat posed by the arrestee). It reports odds ratios for Black relative to White arrestees being shot, controlling for more and more variables. The numbers tell us how many Blacks are shot for every White who is. The first number, .762, is where the result that Blacks are 23.8% less likely to be shot comes from (1-.762=.238). It controls for nothing: criminals and teenagers in bikinis are placed in the same pool.

The highlighted Row 4 shows what happens when we control for, among other things, how much of a threat the arrestee posed (namely, whether s/he drew a weapon). The odds ratio jumps from .76 to 1.1. The evidence suggesting discrimination in favor of Blacks disappears; exactly what you would expect if the result is driven by selection bias (by metaphorical teenagers in bikinis lowering the average threat of arrestees).

Given how noisy the results are, with big standard errors (see next point), I don’t read much into the fact that the estimate goes above 1.0 (suggesting discrimination against Blacks). I do make much of the fact that the estimate is so unstable, moving dramatically in the direction predicted by the “it is driven by selection bias” explanation.

Noisy estimates
The above discussion took the estimates at face value, but they have very large standard errors, to the point that they provide virtually no signal. For example:

Row 4 is compatible with Blacks being 50% less likely to be shot than Whites, but
Row 4 is compatible with Blacks being 80% more likely to be shot than Whites.

These results do not justify updating our beliefs on the matter one way or the other.
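
To see how an odds ratio and its standard error translate into such a wide range, here is a minimal R sketch. The 1.1 matches Row 4, but the standard error is a hypothetical stand-in chosen for illustration; the paper reports the actual one.

```r
# 95% CI for an odds ratio, computed on the log-odds scale
or        <- 1.1    # Row 4 point estimate (Blacks vs. Whites shot | arrested)
se_log_or <- 0.30   # hypothetical standard error of log(OR), for illustration only
ci <- exp(log(or) + c(-1, 1) * qnorm(.975) * se_log_or)
round(ci, 2)        # spans from well below to well above 1: compatible with
                    # Blacks being much less, or much more, likely to be shot
```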

How threatening was the person shot at?
Because the interest in the topic is sparked by videos showing Black civilians killed by police officers despite posing no obvious threat to them, I would define the research question as follows:

When a police officer interacts with a civilian, is a Black civilian more likely to be shot than a White civilian is, for a given level of actual threat to the police officer and the public?

The better we can measure and take into account threat, the better we can answer that research question.

The NBER paper includes analyses that answer this question better than the analyses covered by The New York Times do. For instance, Table 8 (.png) focuses on civilians shot by police and asks: Did they have a weapon? If there is bias against Blacks, we expect fewer of them to have had a weapon when shot, and that’s what the table reports [5].

14.9% of White civilians shot by White officers did not have a weapon.
19.0% of Black civilians shot by White officers did not have a weapon.

The observed difference is 4.1 percentage points, or about 1/3 of the baseline (a larger effect size than the 23.8% behind the NY Times story). As before, the estimates are noisy and not statistically significant.

When big effect size estimates are not statistically significant, we don’t learn that the effect is zero; we learn that the sample is too small and the results are inconclusive. Not newsworthy.
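
To see how a gap of that size can still be statistically inconclusive, here is a minimal R sketch with made-up cell counts; the paper’s actual counts are not reproduced here.

```r
# Two-proportion test with hypothetical sample sizes:
# 15 of 100 White civilians and 19 of 100 Black civilians shot without a weapon
prop.test(x = c(15, 19), n = c(100, 100))
# With cells this small, p is far above .05 and the confidence interval for the
# difference is wide: the data cannot distinguish "no gap" from "a large gap"
```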

Ideas for more precise estimates
One solution is larger samples. Obvious, but sometimes hard to achieve.

Collecting additional proxies for threat could help too: for example, the arrestee’s criminal record, the reason for the arrest, the origin of the officer-civilian interaction (e.g., routine traffic stop vs. responding to a 911 call), what kind of weapon the civilian had, and whether it was within easy reach.

The data used for the NBER article include long narrative accounts written by the police about the interactions. These could be stripped of race-identifying information, and raters asked to subjectively evaluate the threat level right before the shooting took place.

Finally, I’d argue we expect not just a main effect of threat, one to be controlled for with a covariate, but an interaction. In high-threat situations the use of force may be unambiguously appropriate. Racial bias may play a larger role in lower-threat situations.
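
In regression terms this amounts to testing a race × threat interaction. Here is a minimal sketch on simulated data, with variables and effect sizes that are entirely hypothetical rather than the paper’s.

```r
# Simulated, purely hypothetical data: one row per officer-civilian interaction
set.seed(1)
n      <- 5000
black  <- rbinom(n, 1, .3)      # civilian race
threat <- runif(n)              # rated threat level (0 = none, 1 = extreme)
# build in a bias that operates mainly at low threat levels
p_shot <- plogis(-4 + 3 * threat + black * (1 - threat))
shot   <- rbinom(n, 1, p_shot)
summary(glm(shot ~ black * threat, family = binomial))
# a negative black:threat coefficient indicates the race gap is concentrated
# in low-threat situations, the pattern conjectured above
```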



Author feedback.
Our policy is to contact authors whose work we discuss, to request feedback and give them an opportunity to respond within our original post. Roland Fryer, the Harvard economist who authored the NBER article, generously and very promptly responded, providing valuable feedback that I hope to have adequately incorporated, including the constructive suggestions in the last paragraphs. (I am especially grateful given how many people must be contacting him right after the New York Times article came out.)

PS: Josh Miller (.htm) from Bocconi had a similar set of reactions that are discussed in today’s blogpost by Andrew Gelman (.htm).




Footnotes.

  1. The paper and supplement add up to nearly 100 pages; by necessity I focus on the subset of analyses most directly relevant to the 23.8% result. []
  2. The results most inconsistent with my interpretation of the data are reported in Table 6 (.png), page 24 in the paper, which compares the share of police officers who self-reported, after the fact, having shot before vs. after being attacked. The results show an unbelievably large difference favoring Blacks: officers self-report being 44% less likely to shoot before being attacked when the arrestee is Black rather than White. []
  3. See, e.g., these tweets by political scientist Matt Blackwell (.pdf) []
  4. The intuition behind this selection bias is commonly relied on to test for discrimination. It dates back at least to Becker (1957) “The Economics of Discrimination”; it’s been used in empirical papers examining discrimination in real estate, bank loans, traffic stops, teaching evaluations, etc. []
  5. The table reports vast differences in the behavior of White and Black officers; I suspect this means the analyses need to include more controls. []

[48] P-hacked Hypotheses Are Deceivingly Robust

Sometimes we selectively report the analyses we run to test a hypothesis.
Other times we selectively report which hypotheses we tested.

One popular way to p-hack hypotheses involves subgroups. Upon realizing analyses of the entire sample do not produce a significant effect, we check whether analyses of various subsamples — women, or the young, or Republicans, or extroverts — do. Another popular way is to get an interesting dataset first, and figure out what to test with it second [1].


For example, a researcher gets data from a spelling bee competition and asks: Is there evidence of gender discrimination? How about race? Peer-effects? Saliency? Hyperbolic discounting? Weather? Yes! Then s/he writes a paper titled “Weather & (Spelling) Bees” as if that were the only hypothesis tested [2]. The probability of obtaining at least one p<.05 when testing all these hypotheses is 26% rather than the nominal 5% [3].
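
Footnote 3 has the arithmetic; the sketch below also verifies it by simulation, assuming (as the footnote does) six independent tests of true null hypotheses.

```r
# Probability of at least one p < .05 across six independent tests of true nulls
1 - .95^6   # analytic: ~.265

# Simulation check: six two-sample t-tests on pure noise, repeated many times
set.seed(1)
mean(replicate(10000,
  any(replicate(6, t.test(rnorm(50), rnorm(50))$p.value) < .05)))
# ~.26
```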

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem here lies with the hypothesis itself, robustness checks do not address it [4].

Example: Odd numbers and the horoscope
To demonstrate the problem I conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I might motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined whether respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01 (STATA code) [5].

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.
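
The posted STATA code has the actual specifications. Purely to illustrate why covariates cannot rescue us here, the R sketch below, with made-up variables rather than the GSS’s, shows that covariates orthogonal to a randomly assigned predictor raise R2 while leaving its coefficient essentially unchanged.

```r
# A spurious 'treatment' (odd ID) is random, so covariates that genuinely
# predict the outcome increase R2 but barely move the treatment coefficient
set.seed(5)
n         <- 1500
odd_id    <- rbinom(n, 1, .5)                  # randomly assigned
female    <- rbinom(n, 1, .55)
age       <- rnorm(n, 45, 15)
horoscope <- rbinom(n, 1, plogis(-1.5 + .8 * female - .01 * age))
m1 <- lm(horoscope ~ odd_id)                   # bare specification
m2 <- lm(horoscope ~ odd_id + female + age)    # 'robustness check'
round(c(bare = coef(m1)["odd_id"], controls = coef(m2)["odd_id"]), 3)
round(c(R2_bare = summary(m1)$r.squared, R2_controls = summary(m2)$r.squared), 3)
```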

How to deal with p-hacked hypotheses?
Replications are the obvious way to tease apart true from false positives. Direct replications, testing the same prediction in new studies, are often not feasible with observational data. In experimental psychology it is common to instead run conceptual replications, examining new hypotheses based on the same underlying theory. We should do more of this in non-experimental work. One big advantage is that with rich data sets we can often run conceptual replications on the same data.

To do a conceptual replication, we start from the theory behind the hypothesis, say “odd numbers prompt use of less traditional sources of information,” and test new hypotheses. For example, this theory may predict that odd-numbered respondents are more likely to read blogs instead of academic articles, read nutritional labels from foreign countries, or watch niche TV shows [6].

Conceptual replications should be statistically independent from the original (under the null).[7]
That is to say, if an effect we observe is a false positive, the probability that the conceptual replication obtains p<.05 should be 5%. An example that would violate this would be testing whether respondents with odd numbers are more likely to consult tarot readers. If by chance many superstitious individuals received an odd number from the GSS, they would both read the horoscope and consult tarot readers more often. Not independent under the null, hence not a good conceptual replication with the same data.
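
Here is a minimal simulation of that violation, with made-up variables and effect sizes: a latent “superstition” trait drives both horoscope reading and tarot use, so in samples where the odd-ID/horoscope test happens to reach p<.05, the odd-ID/tarot test “replicates” noticeably more often than 5%, while an outcome unrelated to superstition stays near the nominal rate.

```r
# Conditional false-positive rates for dependent vs. independent
# 'conceptual replications' (purely simulated data)
set.seed(10)
one_sample <- function(n = 500) {
  odd_id       <- rbinom(n, 1, .5)                              # random; no true effect
  superstition <- rnorm(n)                                      # latent common cause
  horoscope    <- rbinom(n, 1, plogis(-1 + 2 * superstition))
  tarot        <- rbinom(n, 1, plogis(-1.5 + 2 * superstition)) # shares the cause
  blogs        <- rbinom(n, 1, .3)                              # independent outcome
  c(p_horo  = summary(lm(horoscope ~ odd_id))$coefficients[2, 4],
    p_tarot = summary(lm(tarot ~ odd_id))$coefficients[2, 4],
    p_blogs = summary(lm(blogs ~ odd_id))$coefficients[2, 4])
}
p   <- t(replicate(5000, one_sample()))
sig <- p[p[, "p_horo"] < .05, ]     # keep samples where the original 'worked'
colMeans(sig[, c("p_tarot", "p_blogs")] < .05)
# tarot 'replicates' well above the nominal 5%; blogs stays near 5%
```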

Moderation
A closely related alternative is also commonly used in experimental psychology: moderation. Does the effect get smaller/larger when the theory predicts it should?

For example, I once examined how the prices of infant carseats sold on eBay responded to a new safety rating by Consumer Reports (CR), and to its retraction (surprisingly, the retraction was completely effective, .pdf). A referee noted that if the effects were indeed caused by CR information, they should be stronger for new carseats, as CR advises against buying used ones. If I had a false positive on my hands, we would not expect the moderation to work (it did).
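
As a sketch of what such a moderation test looks like, with simulated data and hypothetical variable names rather than the carseat study’s:

```r
# The retraction's effect is built in only for new carseats; the interaction
# term carries that prediction
set.seed(7)
n         <- 2000
new_seat  <- rbinom(n, 1, .5)    # 1 = new carseat listing
post_retr <- rbinom(n, 1, .5)    # 1 = listed after the retraction
price     <- 60 + 10 * new_seat + 8 * post_retr * new_seat + rnorm(n, 0, 15)
summary(lm(price ~ post_retr * new_seat))
# a false-positive main effect would give no reason to expect a reliable
# post_retr:new_seat interaction
```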

Summary
1. With field data it’s easy to p-hack hypotheses.
2. The resulting false-positive findings will be robust to alternative specifications.
3. Tools common in experimental psychology, conceptual replications and testing moderation, are viable solutions.





Footnotes.

  1. As with most forms of p-hacking, selectively reporting hypotheses typically does not involve willful deception. []
  2. I chose weather and spelling bee as an arbitrary example. Any resemblance to actual papers is seriously unintentional. []
  3. (1-.95^6)=.2649 []
  4. Robustness tests may help with the selective reporting of hypotheses if a spurious finding is obtained due to specification error rather than sampling error. []
  5. This finding is necessarily a false positive because ID numbers are assigned after the opportunity to read the horoscope has passed, and respondents are unaware of the number they have been assigned; but see Bem (2011 .htm) []
  6. This opens the door to more selective reporting as a researcher may attempt many conceptual replications and report only the one(s) that worked. By virtue of using the same dataset to test a fixed theory, however, this is relatively easy to catch/correct if reviewers and readers have access to the set of variables available to the researcher and hence can at least partially identify the menu of conceptual replications available. []
  7. Red font clarification added after tweet from Sanjay Srivastava (.htm) []