[50] Teenagers in Bikinis: Interpreting Police-Shooting Data

The New York Times, on Monday, showcased (.htm) an NBER working paper (.pdf) that proposed that “blacks are 23.8 percent less likely to be shot at by police relative to whites.” (p.22)

The paper involved a monumental data collection effort to address an important societal question. The analyses are rigorous, clever, and transparently reported. Nevertheless, I do not believe the above conclusion is justified by the evidence. Relying on additional results reported in the paper, I show here that the data are consistent with police shootings being biased against Blacks, but too noisy to confidently conclude either way [1],[2].

Teenagers in bikinis
As others have noted [3], an interesting empirical challenge for interpreting the shares of Whites vs Blacks shot by police while being arrested is that biased officers, those overestimating the threat posed by a Black civilian, will arrest less dangerous Blacks on average. They will arrest those posing a real threat, but also some not posing a real threat, resulting in lower average threat among those arrested by biased officers [4].

For example, a biased officer may be more likely to perceive a Black teenager in a bikini as a physical threat (YouTube) than a non-biased officer would, lowering the average threat of his arrestees. If teenagers in bikinis, in turn, are less likely to be shot by police than armed criminals are, racial bias will cause a smaller share of Black arrestees to be shot. A spurious association showing no bias precisely because there is bias.

A closer look at the table behind the result that Blacks are 23.8% less likely to be shot leads me to suspect the finding is indeed spurious.

Table 5
Let’s focus on the red rectangle (the other columns don’t control for the threat posed by the arrestee). It reports odds ratios for Black relative to White arrestees being shot, controlling for more and more variables. The numbers tell us how many Blacks are shot for every White who is. The first number, .762, is where the result that Blacks are 23.8% less likely to be shot comes from (1-.762=.238). It controls for nothing: criminals and teenagers in bikinis are placed in the same pool.

The highlighted Row 4 shows what happens when we control for, among other things, how much of a threat the arrestee posed (namely, whether s/he drew a weapon). The odds ratio jumps from  .76 to 1.1. The evidence suggesting discrimination in favor of Blacks disappears, exactly what you expect if the result is driven by selection bias (by metaphorical teenagers in bikinis lowering the average threat of arrestees).

Given how noisy the results are, with big standard errors (see next point), I don’t read much into the fact that the estimate goes over 1.0 (showing discrimination against Blacks). I do make much of the fact that the estimate is so unstable, and that it moves dramatically in the direction predicted by the “it is driven by selection bias” explanation.

Noisy estimates
The above discussion took the estimates at face value, but they have very large standard errors, to the point they provide virtually no signal. For example:

Row 4 is compatible with Blacks being 50% less likely to be shot than Whites, but
Row 4 is compatible with Blacks being 80% more likely to be shot than Whites.

These results do not justify updating our beliefs on the matter one way or the other.
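To make concrete how little signal an interval that wide carries, here is a minimal R sketch that works backwards from the two statements above, treating them as the approximate endpoints of a 95% confidence interval on the odds-ratio scale. The numbers are back-of-envelope values implied by that reading, not figures taken from the paper’s tables.

```r
# Treat "50% less likely" and "80% more likely" as rough 95% CI endpoints
# for the Black/White odds ratio, i.e., roughly [0.5, 1.8].
ci <- c(0.5, 1.8)

# Implied standard error of log(odds ratio) and the interval's geometric center
se_log <- diff(log(ci)) / (2 * 1.96)   # ~0.33
center <- exp(mean(log(ci)))           # ~0.95
c(se_log = se_log, center = center)

# A log-scale SE of ~0.33 means the data cannot distinguish "Blacks are shot
# half as often as Whites" from "Blacks are shot 80% more often."
```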

How threatening was the person shot at?
Because the interest in the topic is sparked by videos showing Black civilians killed by police officers despite posing no obvious threat to them, I would define the research question as follows:

When a police officer interacts with a civilian, is a Black civilian more likely to be shot than a White civilian is, for a given level of actual threat to the police officer and the public?

The better we can measure and take into account threat, the better we can answer that research question.

The NBER paper includes analyses that answer this question better than the analyses covered by The New York Times do. For instance, Table 8 (.png) focuses on civilians shot by police and asks: Did they have a weapon?  If there is bias against Blacks, we expect fewer of them to have had a weapon when shot, and that’s what the table reports [5].

14.9% of White civilians shot by White officers did not have a weapon.
19.0% of Black civilians shot by White officers did not have a weapon.

The observed difference is 4.1 percentage points, or about 1/3 of the baseline (a larger effect size than the 23.8% behind the NY Times story). As before, the estimates are noisy and not statistically significant.

When big effect-size estimates are not statistically significant, we don’t learn that the effect is zero; we learn that the sample is too small and the results inconclusive, and hence not newsworthy.
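As a rough illustration of why a 4.1-percentage-point gap can nevertheless be statistically inconclusive, here is a two-sample proportion test in R. The counts below are invented to approximately reproduce the reported percentages; the paper’s actual cell sizes are not used, so this is a sketch of the logic rather than a re-analysis.

```r
# Hypothetical counts chosen only to match the reported rates (~19.0% vs ~14.9%);
# the paper's actual sample sizes may well differ.
unarmed <- c(black = 38, white = 30)    # civilians shot who had no weapon
shot    <- c(black = 200, white = 201)  # civilians shot, total

prop.test(unarmed, shot)
# With cells of this (assumed) size, the 4.1-point gap is far from significant:
# the confidence interval for the difference comfortably includes zero.
```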

Ideas for more precise estimates
One solution is larger samples. Obvious, but sometimes hard to achieve.

Collecting additional proxies for threat could help too: for example, the arrestee’s criminal record, the reason for the arrest, the origin of the officer-civilian interaction (e.g., routine traffic stop vs responding to a 911 call), what kind of weapon the civilian had and whether it was within easy reach, etc.

The data used for the NBER article include long narrative accounts written by the police about the interactions. These could be stripped of race-identifying information, and raters could be asked to subjectively evaluate the threat level right before the shooting takes place.

Finally, I’d argue we expect not just a main effect of threat, one to be controlled for with a covariate, but an interaction. In high-threat situations the use of force may be unambiguously appropriate. Racial bias may play a larger role in lower-threat situations.



Author feedback.
Our policy is to contact authors whose work we discuss to request feedback and give an opportunity to respond within our original post. Roland Fryer, the Harvard economist author of the NBER article, generously and very promptly responded providing valuable feedback that I hope to have been able to adequately incorporate, including the last paragraphs with constructive suggestions. (I am especially grateful given how many people must be contacting him right after the New York Times articles came out.) 

PS: Josh Miller (.htm) from Bocconi had a similar set of reactions that are discussed in today’s blogpost by Andrew Gelman (.htm).




Footnotes.

  1. The paper and supplement add up to nearly 100 pages; by necessity I focus on the subset of analyses most directly relevant to the 23.8% result. []
  2. The result most inconsistent with my interpretation of the data is reported in Table 6 (.png), page 24 in the paper, comparing the share of police officers who self-reported, after the fact, whether they shot before or after being attacked. The results show an unbelievably large difference favoring Blacks; officers self-report being 44% less likely to shoot before being attacked by Black vs White arrestees. []
  3. See e.g. these tweets by political scientist Matt Blackwell .pdf []
  4. The intuition behind this selection bias is commonly relied on to test for discrimination. It dates back at least to Becker (1957) “The Economics of Discrimination”; it’s been used in empirical papers examining discrimination in real estate, bank loans, traffic stops, teaching evaluations, etc. []
  5. The table reports vast differences in the behavior of White and Black officers; I suspect this means the analyses need to include more controls. []

[48] P-hacked Hypotheses Are Deceivingly Robust

Sometimes we selectively report the analyses we run to test a hypothesis.
Other times we selectively report which hypotheses we tested.

One popular way to p-hack hypotheses involves subgroups. Upon realizing analyses of the entire sample do not produce a significant effect, we check whether analyses of various subsamples — women, or the young, or republicans, or extroverts — do.  Another popular way is to get an interesting dataset first, and figure out what to test with it second [1].


For example, a researcher gets data from a spelling bee competition and asks: Is there evidence of gender discrimination? How about race? Peer-effects? Saliency? Hyperbolic discounting? Weather? Yes! Then s/he writes a paper titled “Weather & (Spelling) Bees” as if that were the only hypothesis tested [2]. The probability of obtaining at least one p<.05 when testing all six of these hypotheses is 26% rather than the nominal 5% [3].
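Footnote 3 gives the closed form behind that 26%; a quick simulation sketch in R confirms it, under the simplifying assumption (also made by the closed form) that the six tests are independent:

```r
set.seed(1)
# Closed form: chance of at least one p < .05 among 6 independent tests of true nulls
1 - 0.95^6   # ~.265

# Simulation: a researcher tests 6 true-null hypotheses and reports any that "work"
hits <- replicate(10000, {
  p <- replicate(6, t.test(rnorm(30), rnorm(30))$p.value)  # 6 independent null effects
  any(p < .05)
})
mean(hits)   # ~.26
```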

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks [4].

Example: Odd numbers and the horoscope
To demonstrate the problem I conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined whether respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01 (STATA code) [5]

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.
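To show the shape of such a specification, here is a self-contained sketch in R: a linear probability model of horoscope reading on an odd-ID dummy, with covariates added in a second column. The data frame and variable names (gss2010, reads_horoscope, etc.) are fake stand-ins, not the GSS’s actual variable names, and the original analysis was run in Stata.

```r
set.seed(1)
# Fake stand-in data (placeholder names; the GSS codes these variables differently)
n <- 2000
gss2010 <- data.frame(
  id   = 1:n,
  age  = sample(18:89, n, replace = TRUE),
  sex  = sample(c("m", "f"), n, replace = TRUE),
  educ = sample(8:20, n, replace = TRUE),
  reads_horoscope = rbinom(n, 1, .3)
)
gss2010$odd_id <- gss2010$id %% 2 == 1   # randomly assigned odd respondent ID

m1 <- lm(reads_horoscope ~ odd_id, data = gss2010)                     # column 1
m2 <- lm(reads_horoscope ~ odd_id + age + sex + educ, data = gss2010)  # add covariates

# A coefficient that barely moves when covariates are added is no evidence that
# the hypothesis wasn't p-hacked in the first place.
sapply(list(m1, m2), function(m) coef(m)["odd_idTRUE"])
```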

How to deal with p-hacked hypotheses?
Replications are the obvious way to tease apart true from false positives. Direct replications, testing the same prediction in new studies, are often not feasible with observational data.  In experimental psychology it is common to instead run conceptual replications, examining new hypotheses based on the same underlying theory.  We should do more of this in non-experimental work. One big advantage is that with rich data sets we can often run conceptual replications on the same data.

To do a conceptual replication, we start from the theory behind the hypothesis, say “odd numbers prompt use of less traditional sources of information” and test new hypotheses. For example, this theory may predict that odd numbered respondents are more likely to read blogs instead of academic articles, read nutritional labels from foreign countries, or watch niche TV shows [6].

Conceptual replications should be statistically independent from the original (under the null).[7]
That is to say, if an effect we observe is a false positive, the probability that the conceptual replication obtains p<.05 should be 5%. An example that would violate this would be testing whether respondents with odd numbers are more likely to consult tarot readers. If by chance many superstitious individuals received an odd number from the GSS, they would both read the horoscope and consult tarot readers more often. Not independent under the null, hence not a good conceptual replication with the same data.

Moderation
A closely related alternative is also commonly used in experimental psychology: moderation. Does the effect get smaller/larger when the theory predicts it should?

For example, I once examined how the price of infant carseats sold on eBay responded to a new safety rating by Consumer Reports (CR), and to its retraction (surprisingly, the retraction was completely effective, .pdf). A referee noted that if the effects  were indeed caused by CR information, they should be stronger for new carseats, as CR advises against buying used ones. If I had a false-positive in my hands we would not expect moderation to work (it did).

Summary
1. With field data it’s easy to p-hack hypotheses.
2. The resulting false-positive findings will be robust to alternative specifications
3. Tools common in experimental psychology, conceptual replications and testing moderation, are viable solutions.





Footnotes.

  1. As with most forms of p-hacking, selectively reporting hypotheses typically does not involve willful deception. []
  2. I chose weather and spelling bee as an arbitrary example. Any resemblance to actual papers is seriously unintentional. []
  3. (1-.95^6)=.2649 []
  4. Robustness tests may help with the selective reporting of hypotheses if a spurious finding is obtained due to specification rather than sampling error. []
  5. This finding is necessarily false-positive because ID numbers are assigned after the opportunity to read the horoscope has passed, and respondents are unaware of the number they have been assigned to; but see Bem (2011 .htm) []
  6. This opens the door to more selective reporting as a researcher may attempt many conceptual replications and report only the one(s) that worked. By virtue of using the same dataset to test a fixed theory, however, this is relatively easy to catch/correct if reviewers and readers have access to the set of variables available to the researcher and hence can at least partially identify the menu of conceptual replications available. []
  7. Red font clarification added after tweet from Sanjay Srivastava .htm []

[47] Evaluating Replications: 40% Full ≠ 60% Empty

Last October, Science published the paper “Estimating the Reproducibility of Psychological Science” (.pdf), which reported the results of 100 replication attempts. Today it published a commentary by Gilbert et al. (.pdf) as well as a response by the replicators (.pdf).

The commentary makes two main points. First, because of sampling error, we should not expect all of the effects to replicate even if all of them were true. Second, differences in design between original studies and replication attempts may explain differences in results. Let’s start with the latter.[1]

Design differences
The commentators provide some striking examples of design differences. For example, they write, “An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon” (p. 1037).

People can debate if such differences can explain the results (and in their reply, the replicators explain why they don’t think so). However, for readers to consider whether design differences matter, they first need to know those differences exist. I, for one, was unaware of them before reading Gilbert et al. (They are not mentioned in the 6 page Science article .pdf, nor 26 page supplement .pdf). [2]

This is not about pointing fingers, as I have also made this mistake: I did not sufficiently describe differences between original and replication studies  in my Small Telescopes paper (see Colada [43]).

This is also not about taking a position on whether any particular difference is responsible for any particular discrepancy in results. I have no idea. Nor am I arguing design differences are a problem per se; in most cases they were even approved by the original authors.

This is entirely about improving the reporting of replications going forward. After reading the commentary I better appreciate the importance of prominently disclosing design differences. This better enables readers to consider the consequences of such differences, while encouraging replicators to anticipate and address, before publication, any concerns they may raise. [3]

Noisy results
I am also sympathetic to the commentators’ other concern, which is that sampling error may explain the low reproducibility rate. Their statistical analyses are not quite right, but neither are those by the replicators in the reproducibility project.

A study result can be imprecise enough to be consistent both with an effect existing and with it not existing. (See Colada[7] for a remarkable example from Economics). Clouds are consistent with rain, but also consistent with no rain. Clouds, like noisy results, are inconclusive.

The replicators interpreted inconclusive replications as failures, the commentators as successes. For instance, one of the analyses by the replicators considered replications as successful only if they obtained p<.05, effectively treating all inconclusive replications as failures. [4]

Both sets of authors examined whether the results from one study were within the confidence interval of the other, selectively ignoring sampling error of one or the other study.[5]

In particular, the replicators deemed a replication successful if the original finding was within the confidence interval of the replication. Among other problems this approach leads most true effects to fail to replicate with sufficiently big replication samples.[6]

The commentators, in contrast, deemed replications successful if their estimate was within the confidence interval of the original. Among other problems, this approach leads too many false-positive findings to survive most replication efforts.[7]

For more on these problems with effect size comparisons, see p. 561 in “Small Telescopes” (.pdf).
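To see concretely why each one-directional confidence-interval criterion misbehaves, here is a small simulation sketch in R. The sample sizes, effect sizes, and the normal approximation are my own assumptions for illustration; the resulting percentages just reproduce the logic of footnotes 6 and 7, they are not estimates of anything in the reproducibility project.

```r
set.seed(1)
# A two-cell study summarized by its estimate, 95% CI, and p-value (normal approximation)
sim_study <- function(n, d) {
  x <- rnorm(n); y <- rnorm(n, mean = d)
  est <- mean(y) - mean(x); se <- sqrt(var(x)/n + var(y)/n)
  c(est = est, lo = est - 1.96 * se, hi = est + 1.96 * se, p = 2 * pnorm(-abs(est / se)))
}
run <- function(k, n, d) t(replicate(k, sim_study(n, d)))

## Commentators' criterion: is the replication estimate inside the original's CI?
## False positives (true d = 0) published with p just under .05 survive roughly
## half the time, as in footnote 7.
orig <- run(50000, n = 20, d = 0)
rep1 <- run(50000, n = 100, d = 0)
pub  <- orig[, "p"] < .05 & orig[, "p"] > .04 & orig[, "est"] > 0
mean(rep1[pub, "est"] > orig[pub, "lo"] & rep1[pub, "est"] < orig[pub, "hi"])  # ~.45

## Replicators' criterion: is the original estimate inside the replication's CI?
## True effects (d = .4) published with p < .05 are inflated, so a big replication's
## narrow CI excludes most of them, as in footnote 6.
orig2 <- run(20000, n = 20, d = 0.4)
rep2  <- run(20000, n = 1000, d = 0.4)
pub2  <- orig2[, "p"] < .05 & orig2[, "est"] > 0
mean(orig2[pub2, "est"] > rep2[pub2, "lo"] & orig2[pub2, "est"] < rep2[pub2, "hi"])  # very few "replicate"
```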

Accepting the null
Inconclusive replications are not failed replications.

For a replication to fail, the data must support the null. They must affirm the non-existence of a detectable effect. There are four main approaches to accepting the null (see Colada [42]). Two lend themselves particularly well to evaluating replications:

(i) Small Telescopes (.pdf): Test whether the replication rejects effects big enough to be detectable by the original study, and (ii) Bayesian evaluation of replications (.pdf).

These are philosophically and mathematically very different, but in practice they often agree. In Colada [42] I reported that for this very reproducibility project, the Small Telescopes and the Bayesian approach are correlated r = .91 overall, and r = .72 among replications with p>.05. Moreover, both find that about 30% of replications were inconclusive. (R Code).  [8],[9]

40% full is not 60% empty
The opening paragraph of the response by the replicators reads:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled”

They are saying the glass is 40% full.  They are not explicitly saying it is 60% empty. But readers may be forgiven for jumping to that conclusion, and they almost invariably have.  This opening paragraph would have been equally justified:
“[…] the Open Science Collaboration observed that the original result failed to replicate in ~30 of 100 studies sampled”

It would be much better to fully report:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled, failed to replicate in ~30, and that the remaining ~30 replications were inconclusive.”

Summary
1. Replications must be analyzed in ways that allow for results to be inconclusive, not just success/fail
2. Design differences between original and replication should be prominently disclosed.



Author feedback.
I shared a draft of this post with Brian Nosek, Dan Gilbert and Tim Wilson, and invited them and their co-authors to provide feedback. I exchanged over 20 emails total with 7 of them. Their feedback greatly improved, and considerably lengthened, this post. Colada Co-host Joe Simmons provided lots of feedback as well.  I kept editing after getting feedback from all of them, so the version you just read is probably worse and surely different from the versions any of them commented on.


Concluding remarks
My views on the state of social science and what to do about it are almost surely much closer to those of the reproducibility team than to those of the authors of the commentary. But. A few months ago I came across a “Rationally Speaking” podcast (.htm) by Julia Galef (relevant part of transcript starts on page 7, .pdf) where she talks about debating with a “steel-man” version, as opposed to a straw-man, of an argument. It changed how I approach disagreements. For example, the Gilbert et al commentary opens with what appears to be an incorrectly calculated probability. One could straw-man argue against the commentary by focusing on that calculation. But the argument that probability is meant to support does not hinge on precisely estimating it. There are other weak links in the commentary, but its steel-man version, the one focusing on its strengths rather than weaknesses, did make me think better about the issues at hand, and I ended up with what I think is an improved perspective on replications.

We are greatly indebted to the collaborative work of 100s of colleagues behind the reproducibility project, and to Brian Nosek for leading that gargantuan effort (as well as many other important efforts to improve the transparency and replicability of social science). This does not mean we should not try to improve on it or to learn from its shortcomings.


 



Footnotes.

  1. The commentators  actually focus on three issues: (1) (Sampling) error, (2) Statistical power, and (3) Design differences. I treat (1) and (2) as the same problem []
  2. However, the 100 detailed study protocols are available online (.htm), and so people can identify them by reading those protocols. For instance, here (.htm) is the (8 page) protocol for the military vs honeymoon study. []
  3. Brandt et al (JESP 2014) understood the importance of this long before I did, see their ‘Replication Recipe’ paper .pdf []
  4. Any true effect can fail to replicate with a small enough sample, a point made in most articles making suggestions for conducting and evaluating replications, including Small Telescopes (.pdf). []
  5. The original paper reported 5 tests of reproducibility: (i) Is the replication p<.05?, (ii) Is the original within the confidence interval of the replication?, (iii) Does the replication team subjectively rate it as successful vs failure? (iv) Is the replication directionally smaller than the original? and (v) Is the average of original and replication significantly different from zero? In the post I focus only on (i) and (ii) because: (iii)  is not a statistic with evaluative properties (but in any case, also does not include an ‘inconclusive bin’), and neither (iv) nor (v) measure reproducibility.  (iv) Measures publication bias (with lots of noise), and I couldn’t say what (v) measures. []
  6. Most true findings are inflated due to publication bias, so the unbiased estimate from the replication will eventually reject it []
  7. For example, the prototypically p-hacked p=.049 finding has a confidence interval that nearly touches zero. To obtain a replication outside that confidence interval, therefore, we need to observe a negative estimate. If the true effect is zero, that will happen only 50% of the time, so about half of false-positive p=.049 findings would survive replication attempts []
  8. Alex Etz in his blog post did the Bayesian analyses long before I did and I used his summary dataset, as is, to run my analyses. See his PLOS ONE paper, .htm. []
  9. The Small Telescope approach finds that only 25% of replications conclusively failed to replicate, whereas the Bayesian approach says this number is about 37%. However, several of the disagreements come from results that barely accept or don’t accept the null, so the two agree more than these two figures suggest. In the last section of Colada[42] I explain what causes disagreements between the two. []

[43] Rain & Happiness: Why Didn’t Schwarz & Clore (1983) ‘Replicate’ ?

In my “Small Telescopes” paper, I introduced a new approach to evaluate replication results (SSRN). Among other examples, I described two studies as having failed to replicate the famous Schwarz and Clore (1983) finding that people report being happier with their lives when asked on sunny days.

Figure and text from Small Telescopes paper (SSRN)
Small Telescopes quotes
I recently had an email exchange with a senior researcher (not involved in the original paper) who persuaded me I should have been more explicit regarding the design differences between the original and replication studies.  If my paper weren’t published I would add a discussion of such differences and would explain why I don’t believe these can explain the failures to replicate.  

Because my paper is already published, I write this post instead.

The 1983 study
This study is so famous that a paper telling the story behind it (.pdf) has over 450 Google cites.  It is among the top-20 most cited articles published in JPSP and the most cited by either (superstar) author.

In the original study a research assistant called University of Illinois students either during the “first two sunny spring days after a long period of gray, overcast days”, or during two rainy days within a “period of low-hanging clouds and rain” (p. 298, .pdf).

She asked about life satisfaction and then current mood. At the beginning of the phone conversation, she either did not mention the weather, mentioned it in passing, or described it as being of interest to the study.

The reported finding is that “respondents were more satisfied with their lives on sunny than rainy days—but only when their attention was not drawn to the weather” (p.298, .pdf)
‘Replication’
Feddersen et al. (.pdf) matched weather data to the Australian Household Income Survey, which includes a question about life satisfaction. With 90,000 observations, the effect was basically zero.

There are at least three notable design differences between the original and replication studies:[1]

1. Smaller causes have smaller effects. The 1983 study focused on days on which weather was expected to have large mood effects; the Australian sample used the whole year. The first sunny day in spring is not like the 53rd sunny day of summer.

2. Already attributed. Respondents answered many questions in Australia before reporting their life-satisfaction, possibly misattributing mood to something else.

3. Noise. The representative sample is more diverse than a sample of college undergrads is; thus the data are noisier, less likely to detectably exhibit any effect.

Often this is where discussions of failed replications end—with the enumeration of potential moderators, and the call for more and better data. I’ll try to use the data we already have to assess whether any of the differences are likely to matter.[2]

Design difference 1. Smaller causes.
If weather contrasts were critical for altering mood and hence possibly happiness, then the effect in the 1983 study should be driven by the first sunny day in spring, not the Nth rainy day. But a look at the bar chart above shows the opposite: People were NOT happier on the first sunny day of spring; they were unhappier on the rainy days. Their description of these days again: ‘and the rainy days we used were several days into a new period of low-hanging clouds and rain’ (p. 298, .pdf).

The days driving the effect, then, were similar to previous days. Because of how seasons work, most days in the replication studies presumably were also similar to the days that preceded them (sunny after sunny and rainy after rainy), and so on this point the replication does not seem different or problematic.

Second, Lucas and Lawless (JPSP 2014, .pdf) analyzed a large (N=1 million) US sample and also found no effect of weather on life satisfaction. Moreover, they explicitly assessed if unseasonably cloudy/sunny days, or days with sunshine that differed from recent days, were associated with bigger effects. They were not. (See their Table 3).

Third, the effect size Schwarz and Clore report is enormous: 1.7 points in a 1-10 scale. To put that in perspective, from other studies, we know that the life satisfaction gap between people who got married vs. people who became widows over the past year is about 1.5 on the same scale (see Figure 1, Lucas 2005 .pdf). Life vs. death are estimated as less impactful than precipitation. Even if the effect were smaller on days not as carefully selected as those by Schwarz and Clore, the ‘replications’ averaging across all days should still have detectable effects.

The large effect is particularly surprising considering it is the downstream effect of weather on mood, and that effect is really tiny (see Tal Yarkoni’s blog review of a few studies .htm)

Design difference  2. Already attributed.
This concern, recall, is that people answering many questions in a survey may misattribute their mood to earlier questions. This makes sense, but the concern applies to the original as well.

The phone call from Schwarz & Clore’s RA does not come immediately after the “mood induction” either; rather, participants get the RA’s phone call hours into a rainy vs sunny day. Before the call they presumably made evaluations too, answering questions like “How are you and Lisa doing?” “How did History 101 go?” “Man, don’t you hate Champaign’s weather?” etc. Mood could have been misattributed to any of these earlier judgments in the original as well. Our participants’ experiences do not begin when we start collecting their data. [3]

Design difference 3. Noise.
This concern is that the more diverse sample in the replication makes it harder to detect any effect. If the replication were noisier, we might expect the dependent variable to have a higher standard deviation (SD). For life satisfaction Schwarz and Clore got about SD=1.69; Feddersen et al., SD=1.52. So there is less noise in the replication. [4] Moreover, the replication has panel data and controls for individual differences via fixed effects. These account for 50% of the variance, so they have spectacularly less noise. [5]
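Footnote 4 notes that Schwarz and Clore did not report SDs but that one can back them out of the reported test statistics. As a sketch of that kind of calculation (the inputs below are placeholders, not the 1983 paper’s reported values), for a two-group comparison the pooled SD follows from the means, the t statistic, and the cell sizes:

```r
# Pooled SD implied by a reported two-sample t-test:
# t = (m1 - m2) / (sd_pooled * sqrt(1/n1 + 1/n2))  =>  solve for sd_pooled
sd_from_t <- function(m1, m2, t, n1, n2) abs(m1 - m2) / (abs(t) * sqrt(1/n1 + 1/n2))

# Placeholder inputs, just to show the mechanics (NOT the 1983 paper's numbers):
sd_from_t(m1 = 7.5, m2 = 5.8, t = 2.9, n1 = 14, n2 = 14)
```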

Concluding bullet points.
– The existing data are overwhelmingly inconsistent with current weather affecting reported life satisfaction.
– This does not imply the theory behind Schwarz and Clore (1983), mood-as-information, is wrong.


Author feedback
I sent a draft of this post to Richard Lucas (.htm) who provided valuable feedback and additional sources. I also sent a draft to Norbert Schwarz (.htm) and Gerald Clore (.htm). They provided feedback that led me to clarify when I first identified the design differences between the original and replication studies (back in 2013, see footnotes 1&2).  They turned down several invitations to comment within this post.




Footnotes.

  1. The first two were mentioned in the first draft of my paper but I unfortunately cut them out during a major revision, around May 2013. The third was proposed in February of 2013 in a small mailing list discussing the first talk I gave on my Small Telescopes paper []
  2. There is also the issue, as Norbert Schwarz pointed out to me in an email in May of 2013, that the 1983 study is not about weather nor life satisfaction, but about misattribution of mood. The ‘replications’ do not even measure mood. I believe we can meaningfully discuss whether the effect of rain on happiness replicates without measuring mood; in fact, the difficulty of manipulating mood via weather is one thing that makes the original finding surprising. []
  3. What one needs to explain the differences via the presence of other questions is that mood effects from weather replenish through the day, but not immediately. So on sunny days at 7AM I think my cat makes me happier than usual, and then at 10AM that my calculus teacher jokes are funnier than usual, but if the joke had been told at 7.15AM I would not have found it funny because I had already attributed my mood to the cat. This is possible. []
  4. Schwarz and Clore did not report SDs, but one can compute them off the reported test statistics. See Supplement 2 for Small Telescopes .pdf. []
  5. See Feddersen et al.’s Table A1, column 4 vs 3, .pdf []

[42] Accepting the Null: Where to Draw the Line?

We typically ask if an effect exists.  But sometimes we want to ask if it does not.

For example, how many of the “failed” replications in the recent reproducibility project published in Science (.pdf) suggest the absence of an effect?

Data have noise, so we can never say ‘the effect is exactly zero.’  We can only say ‘the effect is basically zero.’ What we do is draw a line close to zero and if we are confident the effect is below the line, we accept the null.
Drawing on a whiteboard: confidence intervals that do and do not include the line.
We can draw the line via Bayes or via p-values; it does not matter very much. The line is what really matters. How far from zero is it? What moves it up and down?

In this post I describe 4 ways to draw the line, and then pit the top-2 against each other.

Way 1. Absolutely small
The oldest approach draws the line based on absolute size. Say, diets leading to losing less than 2 pounds have an effect of basically zero. Economists do this often. For instance, a recent World Bank paper (.html) reads

“The impact of financial literacy on the average remittance frequency has a 95 percent confidence interval [−4.3%, +2.5%] …. We consider this a relatively precise zero effect, ruling out large positive or negative effects of training” (emphasis added)
(Dictionary note. Remittance: immigrants sending money home).

In much of behavioral science effects of any size can be of theoretical interest, and sample sizes are too small to obtain tight confidence intervals, making this approach unviable in principle and in practice [1].

Way 2. Undetectably Small
In our first p-curve paper with Joe and Leif (SSRN), and in my “Small Telescopes” paper on evaluating replications (.pdf), we draw the line based on detectability.

We don’t draw the line where we stop caring about effects.
We draw the line where we stop being able to detect them.

Say an original study with n=50 finds people can feel the future. A replication with n=125 ‘fails,’ getting an effect estimate of d=0.01, p=.94. Data are noisy, so the confidence interval goes all the way up to d=.2. That’s a respectably big feeling-the-future effect we are not ruling out. So we cannot say the effect is absolutely small.

The original study, with just n=50, however, is unable to detect that small an effect (it would have <18% power). So we accept the null: the null that the effect is either zero, or undetectably small by existing studies.
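That power figure is easy to check with a standard power calculation. Here is a sketch in base R, under my assumption that the stylized n=50 refers to a two-cell design with 50 participants per cell; the numbers are not tied to any particular study.

```r
# Power of the original (assumed: two cells of n = 50) to detect the d = .2
# that the replication cannot rule out:
power.t.test(n = 50, delta = 0.2, sd = 1, sig.level = .05)$power   # ~.17, i.e., <18%

# Sample needed to detect d = .2 with 80% power:
power.t.test(delta = 0.2, sd = 1, sig.level = .05, power = .80)$n  # ~394 per cell
```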

Way 3. Smaller than expected in general
Bayesian hypothesis testing runs a horse race between two hypotheses:

Hypothesis 1 (null):              The effect is exactly zero.
Hypothesis 2 (alternative): The effect is one of those moderately sized ones [2].

When data clearly favor 1 more than 2, we accept the null. The bigger the effects Hypothesis 2 includes, the further from zero we draw the line, the more likely we accept the null [3].

The default Bayesian test, commonly used by Bayesian advocates in psychology, draws the line too far from zero (for my taste). Reasonably powered studies of moderately big effects wrongly accept the null of zero effect too often (see Colada[35]) [4].

Way 4. Smaller than expected this time
A new Bayesian approach to evaluate replications, by Verhagen and Wagenmakers (2014 .pdf), pits a different Hypothesis 2 against the null. Its Hypothesis 2 is what a Bayesian observer would predict for the replication after seeing the Original (with some assumed prior).

Similar to Way 3, the bigger the effect seen in the original, the bigger the effect we expect in the replication, and hence the further from zero we draw the line. Importantly, here the line moves based on what we observed in the original, not (only) on what we arbitrarily choose to consider reasonable to expect. The approach is the handsome cousin of testing whether effect size differs between original and replication.

Small Telescope vs Expected This Time (Way 2 vs Way 4)
I compared the conclusions both approaches arrive at when applied to the 100 replications from that Science paper. The results are similar but far from equal, r = .9 across all replications, and r = .72 among n.s. ones (R Code). Focusing on situations where the two lead to opposite conclusions is useful to understand each better [5],[6].

In Study 7 in the Science paper,
The Original estimated a monstrous d=2.14 with N=99 participants total.
The Replication estimated a small    d=0.26, with a minuscule N=14.

The Small Telescopes approach is irked by the small sample of the replication. Its wide confidence interval includes effects as big as d =1.14, giving the original >99% power. We cannot rule out detectable effects, the replication is inconclusive.

The Bayesian observer, in contrast, draws a line quite far from zero after seeing the massive Original effect size. The line, indeed, is at a remarkable d=.8. Replications with smaller effect-size estimates, anything smaller than large, ‘support the null.’ Because the replication is d=.26, it strongly supports the null.

A hypothetical scenario where they disagree in the opposite direction (R Code),
Original.       N=40,       d=.7
Replication.  N=5000, d=.1

The Small Telescopes approach asks if the replication rejects an effect big enough to be detectable by the original. Yes. d=.1 cannot be studied with N=40. Null Accepted [7].

Interestingly, that small N=40 pushes the Bayesian in the opposite direction. An original with N=40 changes her beliefs about the effect very little, so d=.1 in the replication is not that surprising vs. the Original, but it is incompatible with d=0 given the large sample size: null rejected.

I find myself agreeing with the Small Telescopes’ line more than any other. But that’s a matter of taste, not fact.





Footnotes.

  1. e.g., we need n=1500 per cell to have a confidence interval entirely within d<.1 and d>-.1 []
  2. The tests don’t formally assume the effects are moderately large, rather they assume distributions of effect size, say N(0,1). These distributions include tiny effects, even zero, but they also include very large effects, e.g., d>1 as probable possibilities.  It is hard to have intuitions for what assuming a distribution entails. So for brevity and clarity I just say they assume the effect is moderately large. []
  3. Bayesians don’t accept and reject hypotheses, instead, the evidence supports one or another hypothesis. I will use the term accept anyway. []
  4. This is fixable in principle, just define another alternative. If someone proposes a new Bayesian test, ask them “what line around zero is it drawing?”  Even without understanding Bayesian statistics you can evaluate if you like the line the test generates or not. []
  5. Alex Etz in a blogpost (.html) reported the Bayesian analysis of the 100 replications, I used some of his results here. []
  6. These are the Spearman correlations between the p-value testing the null that the original had at least 33% power and the Bayes Factor described above. []
  7. Technically it is the upper end of the confidence interval we consider when evaluating the power of the original sample; it goes up to d=.14, but I used d=.1 to keep things simpler []

[41] Falsely Reassuring: Analyses of ALL p-values

It is a neat idea. Get a ton of papers. Extract all p-values. Examine the prevalence of p-hacking by assessing if there are too many p-values near p=.05. Economists have done it [SSRN], as have psychologists [.html], and biologists [.html]. These charts with distributions of p-values come from those papers:


The dotted circles highlight the excess of .05s, but most p-values are way smaller, suggesting  p-hacking happens but is not a first order concern. That’s reassuring, but falsely reassuring [1],[2].

Bad Sampling.
There are several problems with looking at all p-values; here I focus on sampling [3].

If we want to know if researchers p-hack their results, we need to examine the p-values associated with their results, those they may want to p-hack in the first place. Samples, to be unbiased, must only include observations from the population of interest.

Most p-values reported in most papers are irrelevant to the strategic behavior of interest: covariates, manipulation checks, main effects in studies testing interactions, etc. Including them, we underestimate p-hacking and overestimate the evidential value of the data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?” [4].
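A minimal simulation sketch of the dilution problem (all numbers made up): generate p-values that really were p-hacked, via optional stopping on a true null, then pool them with a pile of irrelevant p-values from strong true effects, and watch the telltale bunching just under .05 wash out.

```r
set.seed(123)

# p-hack a true null via optional stopping: add subjects until p < .05 or give up
phack_once <- function(n0 = 20, n.add = 10, max.tries = 5) {
  x <- rnorm(n0); y <- rnorm(n0)
  for (i in 1:max.tries) {
    p <- t.test(x, y)$p.value
    if (p < .05) break
    x <- c(x, rnorm(n.add)); y <- c(y, rnorm(n.add))
  }
  p
}

focal <- replicate(3000, phack_once())
focal <- focal[focal < .05]                        # the p-hacked "findings" that get reported

# Irrelevant p-values (covariates, manipulation checks, etc.) from strong true effects
other  <- replicate(6000, t.test(rnorm(30, mean = 1), rnorm(30))$p.value)
pooled <- c(focal, other[other < .05])

# Share of significant p-values falling just under .05 (the p-hacking fingerprint):
mean(focal  > .04)   # sizeable among the focal, p-hacked results
mean(pooled > .04)   # much smaller once diluted with irrelevant p-values
```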

A Demonstration.
In our first p-curve paper (SSRN) we analyzed p-values from experiments with results reported only with a covariate.

We believed researchers would have reported the analysis without the covariate had it been significant; thus we believed those studies were p-hacked. The resulting p-curve was left-skewed, so we were right.

Figure 2. p-curve for relevant p-values in experiments reported only with a covariate.

I went back to the papers we had analyzed and redid the analyses, only this time I did them incorrectly.

Instead of collecting only the (23) p-values one should select (we provide detailed directions for selecting p-values in our paper, SSRN), I proceeded the way the indiscriminate analysts of p-values proceed. I got ALL (712) p-values reported in those papers.

Figure 3. p-curve for all p-values reported in papers behind Figure 2

Figure 3 tells us that the things those papers were not studying were super true.
Figure 2 tells us that the ones they were studying were not.

Looking at all p-values is falsely reassuring.



Author feedback
I sent a draft of this post to the first author of the three papers with charts reprinted in Figure 1 and the paper from footnote 1. They provided valuable feedback that improved the writing and led to footnotes 2 & 4.




Footnotes.

  1. The Econ and Psych papers were not meant to be reassuring, but they can be interpreted that way. For instance, a recent J of Econ Perspectives (.pdf) paper reads “Brodeur et al. do find excess bunching, [but] their results imply that it may not be quantitatively as severe as one might have thought” The PLOS Biology paper was meant to be reassuring. []
  2. The PLOS Biology paper had two parts. The first used the indiscriminate selection of p-values from articles in a broad range of journals and attempted to assess the prevalence and impact of p-hacking in the field as a whole. This part is fully invalidated by the problems described in this post. The second used p-values from a few published-metaanalyses on sexual selection in evolutionary biology; this second part is by construction not representative of biology as a whole. In the absence of a p-curve disclosure table, where we know which p-value was selected from each study, it is not possible to evaluate the validity of this exercise. []
  3. For other problems see Dorothy Bishop’s recent paper [.html] []
  4. Brodeur et al. did painstaking work to exclude some irrelevant p-values, e.g., those explicitly described as control variables, but nevertheless left many in. To give a sense, they obtained an average of about 90 p-values from each paper. To give a concrete example, one of the papers in their sample is by Ferreira and Gyourko (.pdf). Via regression discontinuity it shows that a mayor’s political party does not predict policy. To demonstrate the importance of their design, Ferreira & Gyourko also report naive OLS regressions with highly significant but spurious and incorrect results that at face value contradict the paper’s thesis (see their Table II). These very small but irrelevant p-values were included in the sample by Brodeur et al. []

[40] Reducing Fraud in Science

Fraud in science is often attributed to incentives: we reward sexy-results→fraud happens. The solution, the argument goes, is to reward other things.  In this post I counter-argue, proposing three alternative solutions.

Problems with the Change the Incentives solution.
First, even if rewarding sexy-results caused fraud, it does not follow we should stop rewarding sexy-results. We should pit costs vs benefits. Asking questions with the most upside is beneficial.

Second, if we started rewarding unsexy stuff, a likely consequence is fabricateurs continuing to fake, now just unsexy stuff.  Fabricateurs want the lifestyle of successful scientists. [1] Changing incentives involves making our lifestyle less appealing. (Finally, a benefit to committee meetings). 

Third, the evidence for “liking sexy→fraud” is just not there. Like real research, most fake research is not sexy. Life-long fabricateur Diederik Stapel mostly published dry experiments with “findings” in line with the rest of the literature. That we attend to and remember the sexy fake studies is diagnostic of what we pay attention to, not what causes fraud.  

The evidence that incentives cause fraud comes primarily from self-reports, with fabricateurs saying “the incentives made me do it” (see e.g., Tijdink et al .pdf; or Stapel interviews). To me, the guilty saying “it’s not my fault” seems like weak evidence. What else could they say?
“I realized I was not cut-out for this; it was either faking some science or getting a job with less status”
I am kind of a psychopath, I had fun tricking everyone”
“A voice in my head told me to do it”

Similarly weak, to me, is the observation that fraud is more prevalent in top journals; we find fraud where we look for it. Fabricateurs faking articles that don’t get read don’t get caught….

It’s good for universities to ignore quantity of papers when hiring and promoting, good for journals to publish interesting questions with inconclusive answers. But that won’t help with fraud.

Solution 1. Retract without asking “are the data fake?”
We have a high bar for retracting articles, and a higher bar for accusing people of fraud. 
The latter makes sense. The former does not.

Retracting is not such a big deal; it just says “we no longer have confidence in the evidence.”

So many things can go wrong when collecting, analyzing and reporting data that this should be a relatively routine occurrence even in the absence of fraud. An accidental killing may not land the killer in prison, but the victim goes 6 ft under regardless. I’d propose a  retraction doctrine like:

If something is discovered that would lead reasonable experts to believe the results did not originate in a study performed as described in a published paper, or to conclude the study was conducted with excessive sloppiness, the journal should retract the paper.   

Example 1. Analyses indicate published results are implausible for a study conducted as described (e.g., excessive linearity, implausibly similar means, or a covariate is impossibly imbalanced across conditions). Retract.

Example 2. Authors of a paper published in a journal that requires data sharing upon request, when asked for it, indicate to have “lost the data”.  Retract. [2]

Example 3. Comparing original materials with posted data reveals important inconsistencies (e.g., scale ranges are 1-11 in the data but 1-7 in the original). Retract.

When journals reject original submissions it is not their job to figure out why the authors ran an uninteresting study or executed it poorly. They just reject it.

When journals lose confidence in the data behind a published article it is not their job to figure out why the authors published data in which confidence was eventually lost. They should just retract it.

Employers, funders, and co-authors can worry about why an author published untrustworthy data. 

Solution 2. Show receipts
Penn, my employer, reimburses me for expenses incurred at conferences.

However, I don’t get to just say “hey, I bought some tacos at that Kansas City conference, please deposit $6.16 into my checking account.” I need receipts. They trust me, but there is a paper trail in case of need.

When I submit the work I presented in Kansas City to a journal, in contrast, I do just say “hey, I collected the data this or that way.” No receipts.

The recent Science retraction, with canvassers & gay marriage, is a great example of the value of receipts. The statistical evidence suggested something was off, but the receipts-like paper trail helped a lot:

Author: “so and so ran the survey with such and such company”
Sleuths: “hello such and such company, can we talk with so and so about this survey you guys ran?”
Such and such company: “we don’t know any so and so, and we don’t have the capability to run the survey.”

Authors should provide as much documentation about how they run their science as they do about what they eat at conferences: where exactly was the study run, on what day and at what time, which research assistant ran it (with contact information), how exactly were participants paid, etc.

We will trust everything researchers say. Until the need to verify arises.

Solution 3. Post data, materials and code
Had the raw data not been available, the recent Science retraction would probably not have happened. Stapel would probably not have gotten caught. The cases against Sanna and Smeesters would not have moved forward. To borrow from a recent paper with Joe and Leif:

Journals that do not increase data and materials posting requirements for publications are causally, if not morally, responsible for the continued contamination of the scientific record with fraud and sloppiness.  



Feedback from Ivan Oransky, co-founder of Retraction Watch
Ivan co-wrote an editorial in the New York Times on changing the incentives to reduce fraud (.pdf). I reached out to him to get feedback. He directed me to some papers on the evidence linking incentives and fraud. I was unaware of, but also unpersuaded by, that evidence. This prompted me to add the last paragraph in the incentives section (where I am skeptical of that evidence).
Despite our different takes on the role of rewarding sexy-findings on fraud, Ivan is on board with the three non-incentive solutions proposed here.  I thank Ivan for the prompt response and useful feedback. (and for Retraction Watch!)



Footnotes.

  1. I use the word fabricateur to refer to scientists who fabricate data. Fraudster is insufficiently specific (e.g., selling 10 bagels calling them a dozen is fraud too), and fabricator has positive meanings (e.g., people who make things). Fabricateur has a nice ring to it. []
  2. Every author publishing in an American Psychological Association journal agrees to share data upon request []

[39] Power Naps: When do Within-Subject Comparisons Help vs Hurt (yes, hurt) Power?

A recent Science-paper (.pdf) used a total sample size of N=40 to arrive at the conclusion that implicit racial and gender stereotypes can be reduced while napping. 

N=40 is a small sample for a between-subject experiment. One needs N=92 to reliably detect that men are heavier than women (SSRN). The study, however, was within-subject; for instance, its dependent variable, the Implicit Association Test (IAT), was contrasted within-participant before and after napping. [1]

Reasonable question: How much more power does subtracting baseline IAT give a study?
Surprising answer: it lowers power.

Design & analysis of napping study
Participants took the gender and race IATs, then trained for the gender IAT (while listening to one sound) and the race IAT (different sound). Then everyone naps.  While napping one of the two sounds is played (to cue memory of the corresponding training, facilitating learning while sleeping). Then both IATs are taken again. Nappers were reported to be less biased in the cued IAT after the nap.

This is perhaps a good place to indicate that there are many studies with similar designs and sample sizes. The blogpost is about strengthening intuitions for within-subject designs, not criticizing the authors of the study.

Intuition for the power drop
Let’s simplify the experiment. No napping. No gender IAT. Everyone takes only the race IAT.

Half train before taking it, half don’t. To test if training works we could do
         Between-subject test: is the mean IAT different across conditions?

If before training everyone took a baseline race IAT, we could instead do
         Mixed design test: is the mean change in IAT different across conditions?

Subtracting baseline, going from between-subject to a mixed-design, has two effects: one good, one bad.

Good: Reduce between-subject differences. Some people have stronger racial associations than others. Subtracting baselines reduces those differences, increasing power.

Bad: Increase noise. The baseline is, after all, just an estimate. Subtracting baseline adds noise, reducing power.

Imagine the baseline was measured incorrectly. The computer recorded, instead of the IAT, the participant’s body temperature. IAT scores minus body temperature is a noisier dependent variable than just IAT scores, so we’d have less power.

If baseline is not quite as bad as body temperature, the consequence is not quite as bad, but same idea. Subtracting baseline adds the baseline’s noise.

We can be quite precise about this. Subtracting baseline only helps power if baseline is correlated r>.5 with the dependent variable, but it hurts if r<.5. [2]

See the simple math (.html). Or, just see the simple chart.
e.g., running n=20 per cell and subtracting baseline, when r=.3, lowers power enough that it is as if the sample had been n=15 instead of n=20. (R Code)
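The arithmetic behind that example fits in a few lines of R. Assuming pre and post scores have equal variance and that the treatment effect is the same on the change score as on the post score, Var(post - pre) = 2*sigma^2*(1 - r), so subtracting baseline is equivalent to running a post-only study with n/(2*(1 - r)) participants per cell:

```r
# Effective per-cell n of a change-score (post - pre) analysis, relative to
# analyzing post scores alone, when the pre-post correlation is r.
# Since Var(post - pre) = 2*sigma^2*(1 - r), noise breaks even at r = .5.
effective_n <- function(n, r) n / (2 * (1 - r))

effective_n(20, r = 0.30)  # ~14.3: with r = .3, subtracting baseline is like running n = 15
effective_n(20, r = 0.50)  # 20: break-even, subtracting baseline neither helps nor hurts
effective_n(20, r = 0.75)  # 40: with a reliable baseline, subtraction doubles effective n
```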

Before-After correlation for  IAT
Subtracting baseline IAT will only help, then, if when people take it twice, their scores are correlated r>.5. Prior studies have found test-retest reliability of r = .4 for the racial IAT. [3]  Analyzing the posted data (.html) from this study, where manipulations take place between measures, I got r = .35. (For gender IAT I got r=.2) [4]

Aside: one can avoid the power-drop entirely if one controls for baseline in a regression/ANCOVA instead of subtracting it.  Moreover, controlling for baseline never lowers power. See bonus chart (.pdf). 

Within-subject manipulations
In addition to subtracting baseline, one may carry out the manipulation within-subject: every participant gets treatment and control. Indeed, in the napping study everyone had a cued and a non-cued IAT.

How much this helps depends again on the correlation of the within-subject measures: Does race IAT correlate with gender IAT?  The higher the correlation, the bigger the power boost. 

note: Aurélien Allard, a PhD student in Moral Psychology at Paris 8 University, caught an error in the R Code used to generate this figure. He contacted me on 2016/11/02 and I updated the figure 2 days later. You can see the archived version of the post, with the incorrect figure, here.

When both measures are uncorrelated it is as if the study had twice as many subjects. This makes sense: r=0 is as if the data came from different people, so asking two questions of n=20 people is like asking one question of n=40. As r increases we have more power because we expect the two measures to be more and more similar, so any given difference is more and more statistically significant (R Code for chart) [5].
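Under the same simplifying assumptions as before (equal variances, and the manipulation having the same effect within as between subjects, which footnote 5 notes need not hold), a within-subject design with n participants has the power of a between-subject design with 2n/(1 - r) participants in total, which is what the doubling at r = 0 reflects:

```r
# Total between-subject N matching the power of a within-subject design with n
# participants, when the two within-subject measures correlate r.
equivalent_between_N <- function(n, r) 2 * n / (1 - r)

equivalent_between_N(20, r = 0)     # 40: uncorrelated measures, like doubling the sample
equivalent_between_N(20, r = -.07)  # ~37: roughly the baseline race-gender IAT correlation
equivalent_between_N(20, r = 0.5)   # 80: the more correlated the measures, the bigger the boost
```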

Race & gender IATs capture distinct mental associations, measured with a test of low reliability, so we may not expect a high correlation. At baseline, r(race,gender)=-.07, p=.66.  The within-subject manipulation, then, “only” doubled the sample size.

So, how big was the sample?
The Science-paper reports N=40 people total. The supplement explains that this actually combines two separate studies run months apart, each with N=20. The analyses subtracted baseline IAT, lowering power, as if N=15. The manipulation was within-subject, doubling it, to N=30. To detect “men are heavier than women” one needs N=92. [6]

Author feedback
I shared an early draft of this post with the authors of the Science-paper. We had an extensive email exchange that led to clarifying some ambiguities in the writing. They also suggested I mention their results are robust to controlling instead of subtracting baseline.



Footnotes.

  1. The IAT is the Implicit Association Test and assesses how strongly respondents associate, for instance, good things with Whites and bad things with Blacks; take a test (.html) []
  2. Two days after this post went live I learned, via Jason Kerwin, of this very relevant paper by David McKenzie (.pdf) arguing for economists to collect data from more rounds. David makes the same point about r>.5 for a gain in power from, in econ jargon, a diff-in-diff vs. the simple diff. []
  3. Bar-Anan & Nosek (2014, p. 676 .pdf); Lane et al. (2007, p.71 .pdf)  []
  4. That’s for post- vs. pre-nap. In the napping study the race IAT is taken 4 times by every participant, resulting in 6 before-after correlations, ranging from r = -.047 to r = .53; simple average r = .3. []
  5. This ignores the impact that going from between to within subject design has on the actual effect itself. Effects can get smaller or larger depending on the specifics. []
  6. The idea of using men-vs-women weight as a benchmark is to provide an intuitive reference point; effects big enough to be detectable by the naked eye require bigger samples than the ones we are used to seeing when studying surprising effects. For those skeptical of this heuristic, let’s use published evidence on the IAT as a benchmark. Lai et al. (2014 .pdf) ran 17 interventions seeking to reduce IAT scores. The biggest effect among these 17 was d=.49. That effect size requires n=66 per cell, N=132 total, for 80% power (more than for men vs. women weight). Moderating this effect through sleep, and moderating the moderation through cueing while sleeping, requires vastly larger samples to attain the same power. []

[36] How to Study Discrimination (or Anything) With Names; If You Must

Consider these paraphrased famous findings:
“Because his name resembles ‘dentist,’ Dennis became one” (JPSP, .pdf)
“Because the applicant was black (named Jamal instead of Greg) he was not interviewed” (AER, .pdf)
“Because the applicant was female (named Jennifer instead of John), she got a lower offer” (PNAS, .pdf)

Everything that matters (income, age, location, religion) correlates with people’s names, hence comparing people with different names involves comparing people who potentially differ on everything that matters.

This post highlights the problem and proposes three practical solutions. [1]

Gender
Jennifer was the #1 baby girl name between 1970 & 1984, while John has been a top-30 boy name for the last 120 years. Comparing reactions to profiles with these names pits mental associations about women in their late 30s or early 40s against those about men of unclear age.

More generally, close your eyes and think of Jennifers. Now do that for Johns.
Is gender the only difference between the two sets of people you considered?

Here is what Google did when I asked it to close its eyes: [2]

[Google image results for “Jennifer” and for “John”]

Johns vary more in age, appearance, affluence, and presidential ambitions. For somewhat harder data, I consulted a website where people rate names on various attributes. [Figure: attribute ratings for “John” vs. “Jennifer”]

Race
Distinctively Black names (e.g., Jamal and Lakisha) signal low socioeconomic status while typical White names do not (QJE .pdf). Do people not want to hire Jamal because he is Black or because he is of low status?

Even if all distinctively Black names (and even Black people) were perceived as low status, and hence Jamal were an externally valid signal of Blackness, the contrast with Greg might nevertheless be low in internal validity, because the difference attributed to race could instead be the result of status (or some other confounding variable). This is addressable because some (most?) low-status people are not Black. We could compare Black names vs. low-status White names: say, Jamal with Bubba or Billy Bob, and Lakisha with Bambi or Billy Jean. This would allow assessing racial discrimination above and beyond status discrimination. [3]

Imagine reading a movie script where a Black drug dealer is being defended by a brilliant Black lawyer. One of these characters is named Greg, the other Jamal. The intuition that Greg is the lawyer’s name is the intuition behind the internal validity problem.

Solution 1. Stop using names
Probably the best solution is to stop using names to manipulate race and gender.  A recent paper (PNAS .pdf) examined gender discrimination using only pronouns (and found that academics in STEM fields favored females over males 2:1).

Solution 2. Choose many names
A great paper titled “Stimulus Sampling” (PSPB .pdf) argues convincingly for choosing many stimuli for any given manipulation to avoid stumbling on unforeseen confounds. Stimulus sampling would involve going beyond Jennifer vs. John, to using, say, 20 female vs. 20 male names. This helps with idiosyncratic confounds (e.g., age) but not with the systematic confound that most distinctively Black names signal low socioeconomic status. [4]

Solution 3. Choose control names actively
If one chooses to study names, then one needs to select control names that, absent the scientific hypothesis of interest, would produce no difference from the target names (e.g., if it weren’t for racial discrimination, people should like Jamal and the control name just as much).

I close with an example from a paper of mine where I attempted to generate proper control names to examine if people disproportionately marry others with similar names, e.g. Eric-Erica, because of implicit egotism: a preference for things that resemble the self. (JPSP .pdf)

We need control names that we would expect to marry Ericas just as frequently as Erics do in the absence of implicit egotism (e.g., of similar age, religion, income, class and location).  To find such names I looked at the relative frequency of wife names for every male name and asked “What male names have the most similar distribution of wife names to Erics?” [5].

The answer was: Joseph, Frank and Carl. We would expect these three names to marry Erica just as frequently as Eric does, if not for implicit egotism. And we would be right.

For the Jamal vs. Greg study, we could compare Jamal to non-Black names that have the most similar distribution of occupations, or of Zip Codes, or of criminal records.
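
Here is a minimal R sketch of that matching idea. The data frame marriages (with columns husband, wife, count) is hypothetical, and the similarity measure (total-variation distance) is my choice for illustration, not necessarily the one used in the paper; the same logic applies to matching Jamal on occupations or Zip Codes.

# Minimal sketch: find male names whose distribution of wife names most resembles Eric's.
# `marriages` is a hypothetical data frame with columns: husband, wife, count.
library(dplyr)
library(tidyr)

wife_shares <- marriages %>%
  group_by(husband, wife) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  group_by(husband) %>%
  mutate(share = count / sum(count)) %>%        # each husband name's wife-name distribution
  select(husband, wife, share) %>%
  pivot_wider(names_from = wife, values_from = share, values_fill = 0)

mat <- as.matrix(wife_shares[, -1])
rownames(mat) <- wife_shares$husband

# Total-variation distance between Eric's wife-name distribution and every other name's
dist_to_eric <- apply(mat, 1, function(p) sum(abs(p - mat["Eric", ])) / 2)
head(sort(dist_to_eric[names(dist_to_eric) != "Eric"]), 3)  # candidate control names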



Feedback from original authors:
I shared an early draft of this post with the authors of the Jamal vs. Greg and the Jennifer vs. John studies.

Sendhil Mullainathan, co-author of the former, indicated across a few emails he did not believe it was clear one should control for socioeconomic status differences in studies about race, because status and race are correlated in real life.

Corinne Moss-Racusin sent me a note she wrote with her co-authors of their PNAS study:

Thanks so much for contacting us about this interesting topic. We agree that these are thoughtful and important points, and have often grappled with them in our own research. The names we used (John and Jennifer) had been pretested and rated as equivalent on a number of dimensions including warmth, competence, likeability, intelligence, and typicality (Brescoll & Uhlmann, 2005 .pdf), but they were not rated for perceived age, as you highlight here. However, for our study in particular, age of the target should not have extensively impacted our results, because the age of both our targets could easily be inferred from the targets’ resume information that our participants were exposed to. Both the male and female targets (John and Jennifer respectively) were presented as recent college grads (with the same graduation year), and it is thus reasonable to assume that participants believed they were the same age, as recent college grads are almost always the same age (give or take a few years). Thus, although it is possible that age (and other potential variables) may indeed be confounded with gender across our manipulation, we nonetheless do not believe that choosing different male and female names that were equivalent for age would greatly impact our findings, given our design. That said, future research should still seek to replicate our key findings using different manipulations of target gender. Specifically, your suggestions (using only pronouns, and using multiple names) are particularly promising. We have also considered utilizing target pictures in the past, but have encountered issues relating to attractiveness and other confounds.


Footnotes.

  1. Galen Bodenhausen read this post and told me about a paper on confounds in names used for gender research, from 1993(!): PsychBull .pdf []
  2. Based on the Jennifers and Johns I see, I suspect Google peeked at my cookies before closing its eyes; e.g., there are two Bay Area business school professors. Your results may differ. []
  3. Bertrand and Mullainathan write extensively about the socioeconomic confound and report a few null results that they interpret as suggesting it is not playing a large role (see their section “V.B Potential Confounds”, .pdf). However, (1) the n.s. results for socioeconomic status are obtained with extremely noisy proxies and small samples, reducing the ability to conclude evidence of absence from the absence of evidence, and (2) these analyses seek to remedy the consequences of the name confound rather than avoiding the confound from the get-go through experimental design. This post is about experimental design. []
  4. The Jamal paper used 9 different names per race/gender cell []
  5. To avoid biasing the test against implicit egotism, I excluded from the calculations male and female names starting with E_ []

[35] The Default Bayesian Test is Prejudiced Against Small Effects

When considering any statistical tool I think it is useful to answer the following two practical questions:

1. “Does it give reasonable answers in realistic circumstances?”
2. “Does it answer a question I am interested in?”

In this post I explain why, for me, when it comes to the default Bayesian test that’s starting to pop up in some psychology publications, the answer to both questions is no.

The Bayesian test
The Bayesian approach to testing hypotheses is neat and compelling. In principle. [1]

The p-value assesses only how incompatible the data are with the null hypothesis. The Bayesian approach, in contrast, assesses the relative compatibility of the data with a null vs an alternative hypothesis.

The devil is in choosing that alternative.  If the effect is not zero, what is it?

Bayesian advocates in psychology have proposed using a “default” alternative (Rouder et al 1999, .pdf). This default is used in the online (.html) and R based (.html) Bayes factor calculators. The original papers do warn attentive readers that the default can be replaced with alternatives informed by expertise or beliefs (see especially Dienes 2011 .pdf), but most researchers leave the default unchanged. [2]

This post is written with that majority of default-following researchers in mind. I explain why, for me, when running the default Bayesian test, the answer to Questions 1 & 2 is “no”.

Question 1. “Does it give reasonable answers in realistic circumstances?”
No. It is prejudiced against small effects

The null hypothesis is that the effect size (henceforth d) is zero. Ho: d = 0. What’s the alternative hypothesis? It can be whatever we want it to be, say, Ha: d = .5. We would then ask: are the data more compatible with d = 0 or are they more compatible with d = .5?

The default alternative hypothesis used in the Bayesian test is a bit more complicated. It is a distribution, so more like Ha: d~N(0,1). So we ask if the data are more compatible with zero or with d~N(0,1). [3]

That the alternative is a distribution makes it difficult to think about the test intuitively.  Let’s not worry about that. The key thing for us is that that default is prejudiced against small effects.

Intuitively (but not literally), that default means the Bayesian test ends up asking: “is the effect zero, or is it biggish?” When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero. [4]

Demo 1. Power at 50%

Let’s see how the test behaves as the effect size gets smaller (R Code; Figure 1). The Bayesian test erroneously supports the null about 5% of the time when the effect is biggish, d=.64, but it does so five times more frequently when it is smallish, d=.28. The smaller the effect (for studies with a given level of power), the more likely we are to dismiss its existence. We are prejudiced against small effects. [5]

Note how, as the sample gets larger, the test becomes more confident (smaller white area in Figure 1) and more wrong (larger red area).
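
A minimal sketch in the spirit of Demo 1 (it is not the original R Code, and exact rates will depend on the seed and on how n is chosen): it runs many studies powered at 50%, computes the default Bayes factor with the BayesFactor package’s default scale (rscale = sqrt(2)/2 ≈ .707), and counts how often the null is “supported” by the conventional 3:1 cutoff.

# Minimal sketch: how often the default Bayesian t-test "supports the null" (BF10 < 1/3)
# when a true effect exists but the study has only 50% power.
library(BayesFactor)
set.seed(1)
support_null_rate <- function(d, sims = 1000) {
  n <- ceiling(power.t.test(delta = d, power = .50)$n)   # per-cell n that gives 50% power
  mean(replicate(sims, {
    x <- rnorm(n, mean = d); y <- rnorm(n)
    bf10 <- extractBF(ttestBF(x = x, y = y, rscale = sqrt(2) / 2))$bf
    bf10 < 1/3
  }))
}
support_null_rate(d = .64)   # biggish effect: the null is rarely "supported"
support_null_rate(d = .28)   # small effect: the null is "supported" far more often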

Demo 2. Facebook
For a more tangible example consider the Facebook experiment (.html) that found that seeing images of friends who voted (see panel a below) increased voting by 0.39% (panel b). [Figure: Facebook experiment, panels a and b] While the null of a zero effect is rejected (p=.02), and hence the entire confidence interval for the effect is above zero, [6] the Bayesian test concludes VERY strongly in favor of the null, 35:1. (R Code)

Prejudiced against (in this case very) small effects.
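
A sketch of that flip using BayesFactor’s ttest.tstat with a just-significant t statistic and deliberately huge, made-up group sizes (the actual Facebook analysis involved proportions and tens of millions of users, so the odds below will not match the 35:1 above; the point is only the direction):

# Minimal sketch: a just-significant result (t ~ 2.3, p ~ .02) in an enormous sample
# yields a default Bayes factor that favors the null.
library(BayesFactor)
bf10 <- ttest.tstat(t = 2.3, n1 = 1e5, n2 = 1e5, rscale = sqrt(2) / 2, simple = TRUE)
bf10       # Bayes factor for the alternative over the null; well below 1
1 / bf10   # odds in favor of the null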

Question 2. “Does it answer a question I am interested in?”
No. I am not interested in how well data support one elegant distribution.

 When people run a Bayesian test they like writing things like
“The data support the null.”

But that’s not quite right. What they actually ought to write is
“The data support the null more than they support one mathematically elegant alternative hypothesis I compared it to”

Saying a Bayesian test “supports the null” in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.

We are constantly reminded that:
P(D|H0) ≠ P(H0)
The probability of the data given the null is not the probability of the null

But let’s not forget that:
P(H0|D) / P(H1|D)  ≠ P(H0)
The relative probability of the null over one mathematically elegant alternative is not the probability of the null either.

Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it. The default Bayesian test does not answer a question I would ask.



Feedback from Bayesian advocates:
I shared an early draft of this post with three Bayesian advocates. I asked for feedback and invited them to comment.

1. Andrew Gelman  Expressed “100% agreement” with my argument but thought I should make it clearer this is not the only Bayesian approach, e.g., he writes “You can spend your entire life doing Bayesian inference without ever computing these Bayesian Factors.” I made several edits in response to his suggestions, including changing the title.

2. Jeff Rouder  Provided additional feedback and also wrote a formal reply (.html). He begins by highlighting the importance of comparing p-values and Bayes factors when, as is the case in reality, we don’t know whether the effect exists, and the paramount importance for science of subjecting specific predictions to data analysis (again, full reply: .html)

3. EJ Wagenmakers Provided feedback on terminology, the poetic response that follows, and a more in-depth critique of confidence intervals (.pdf)

“In a desert of incoherent frequentist testing there blooms a Bayesian flower. You may not think it is a perfect flower. Its color may not appeal to you, and it may even have a thorn. But it is a flower, in the middle of a desert. Instead of critiquing the color of the flower, or the prickliness of its thorn, you might consider planting your own flower — with a different color, and perhaps without the thorn. Then everybody can benefit.”




Footnotes.

  1. If you want to learn more about it I recommend Rouder et al. 1999 (.pdf), Wagenmakers 2007 (.pdf) and Dienes 2011 (.pdf) []
  2. e.g., Rouder et al (.pdf) write “We recommend that researchers incorporate information when they believe it to be appropriate […] Researchers may also incorporate expectations and goals for specific experimental contexts by tuning the scale of the prior on effect size” p.232 []
  3. The current default distribution is d~N(0,.707); the simulations in this post use that default. []
  4. Again, Bayesian advocates are upfront about this, but one has to read their technical papers attentively. Here is an example in Rouder et al (.pdf) page 30: “it is helpful to recall that the marginal likelihood of a composite hypothesis is the weighted average of the likelihood over all constituent point hypotheses, where the prior serves as the weight. As [variance of the alternative hypothesis] is increased, there is greater relative weight on larger values of [the effect size] […] When these unreasonably large values […] have increasing weight, the average favors the null to a greater extent”.   []
  5. The convention is to say that the evidence clearly supports the null if the data are at least three times more likely when the null hypothesis is true than when the alternative hypothesis is, and vice versa. In the chart above I refer to data that clearly support neither the null nor the alternative as inconclusive. []
  6. Note that the figure plots standard errors, not a confidence interval. []