[36] How to Study Discrimination (or Anything) With Names; If You Must

Consider these paraphrased famous findings:
“Because his name resembles ‘dentist,’ Dennis became one” (JPSP, .pdf)
“Because the applicant was black (named Jamal instead of Greg) he was not interviewed” (AER, .pdf)
“Because the applicant was female (named Jennifer instead of John), she got a lower offer” (PNAS, .pdf)

Everything that matters (income, age, location, religion) correlates with people’s names, hence comparing people with different names involves comparing people with potentially different everything that matters.

This post highlights the problem and proposes three practical solutions. [1]

Gender
Jennifer was the #1 baby girl name between 1970 & 1984, while John has been a top-30 boy name for the last 120 years. Comparing reactions to profiles with these names pits mental associations about women in their late 30s/early 40s against those about men of unclear age.

More generally, close your eyes and think of Jennifers. Now do that for Johns.
Is gender the only difference between the two sets of people you considered?

Here is what Google did when I asked it to close its eyes: [2]

[Image: Google image-search results for “Jennifer” and for “John”]

Johns vary more in age, appearance, affluence, and presidential ambitions. For somewhat harder data, I consulted a website where people rate names on various attributes:

[Figure: attribute ratings for “John” vs. “Jennifer”]

Race
Distinctively Black names (e.g., Jamal and Lakisha) signal low socioeconomic status, while typical White names do not (QJE .pdf). Do people not want to hire Jamal because he is Black or because he is of low status?

Even if all distinctively Black names (and even Black people) were perceived as low status, and hence Jamal were an externally valid signal of Blackness, the contrast with Greg might nevertheless be low in internal validity, because the difference attributed to race could instead be the result of status (or some other confounding variable). This is addressable because some (most?) low-status people are not Black. We could compare Black names vs. low-status White names: say Jamal with Bubba or Billy Bob, and Lakisha with Bambi or Billy Jean. This would allow assessing racial discrimination above and beyond status discrimination. [3]

[Image: Greg vs. Jamal]

Imagine reading a movie script in which a Black drug dealer is being defended by a brilliant Black lawyer. One of these characters is named Greg, the other Jamal. The intuition that Greg is the lawyer’s name is the intuition behind the internal validity problem.

Solution 1. Stop using names
Probably the best solution is to stop using names to manipulate race and gender.  A recent paper (PNAS .pdf) examined gender discrimination using only pronouns (and found that academics in STEM fields favored females over males 2:1).

Solution 2. Choose many names
A great paper titled “Stimulus Sampling” (PSPB .pdf) argues convincingly for choosing many stimuli for any given manipulation to avoid stumbling on unforeseen confounds. Stimulus sampling would involve going beyond Jennifer vs. John, to using, say, 20 female vs. 20 male names. This helps with idiosyncratic confounds (e.g., age) but not with the systematic confound that most distinctively Black names signal low socioeconomic status. [4]

Solution 3. Choose control names actively
If one chooses to study names, then one needs to select control names that, were it not for the scientific hypothesis of interest, would produce no difference from the target names (e.g., if it weren’t for racial discrimination, people should like Jamal and the control name just as much).

I close with an example from a paper of mine where I attempted to generate proper control names to examine if people disproportionately marry others with similar names, e.g. Eric-Erica, because of implicit egotism: a preference for things that resemble the self. (JPSP .pdf)

We need control names that we would expect to marry Ericas just as frequently as Erics do in the absence of implicit egotism (e.g., of similar age, religion, income, class and location).  To find such names I looked at the relative frequency of wife names for every male name and asked “What male names have the most similar distribution of wife names to Erics?” [5].

The answer was: Joseph, Frank and Carl. We would expect these three names to marry Erica just as frequently as Eric does, if not for implicit egotism. And we would be right.

For the Jamal vs. Greg study, we could compare Jamal to non-Black names that have the most similar distribution of occupations, or of Zip Codes, or of criminal records.
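Here is a minimal sketch, in R, of how one might operationalize that matching step. It is not the paper’s actual code, and the data frame `marriages` (with columns husband and wife) is hypothetical.

# Sketch: for every husband name, compute the distribution of wife names,
# then return the husband names whose distribution is closest to the target's.
find_control_names <- function(marriages, target = "Eric", k = 3) {
  # rows = husband names, columns = wife names, entries = share of wives with each name
  wife_dist <- prop.table(table(marriages$husband, marriages$wife), margin = 1)
  # total variation distance between each husband name's wife distribution and the target's
  dist_to_target <- apply(wife_dist, 1, function(p) sum(abs(p - wife_dist[target, ])) / 2)
  sort(dist_to_target[names(dist_to_target) != target])[1:k]
}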


Feedback from original authors:
I shared an early draft of this post with the authors of the Jamal vs. Greg and the Jennifer vs. John studies.

Sendhil Mullainathan, co-author of the former, indicated across a few emails that he did not believe it was clear one should control for socioeconomic status differences in studies about race, because status and race are correlated in real life.

Corinne Moss-Racusin sent me a note she wrote with her co-authors of their PNAS study:

Thanks so much for contacting us about this interesting topic. We agree that these are thoughtful and important points, and have often grappled with them in our own research. The names we used (John and Jennifer) had been pretested and rated as equivalent on a number of dimensions including warmth, competence, likeability, intelligence, and typicality (Brescoll & Uhlmann, 2005 .pdf), but they were not rated for perceived age, as you highlight here. However, for our study in particular, age of the target should not have extensively impacted our results, because the age of both our targets could easily be inferred from the targets’ resume information that our participants were exposed to. Both the male and female targets (John and Jennifer respectively) were presented as recent college grads (with the same graduation year), and it is thus reasonable to assume that participants believed they were the same age, as recent college grads are almost always the same age (give or take a few years). Thus, although it is possible that age (and other potential variables) may indeed be confounded with gender across our manipulation, we nonetheless do not believe that choosing different male and female names that were equivalent for age would greatly impact our findings, given our design. That said, future research should still seek to replicate our key findings using different manipulations of target gender. Specifically, your suggestions (using only pronouns, and using multiple names) are particularly promising. We have also considered utilizing target pictures in the past, but have encountered issues relating to attractiveness and other confounds.


Footnotes.

  1. Galen Bodenhausen read this post and told me about a paper on confounds in names used for gender research, from 1993(!) PsychBull .pdf []
  2. Based on the Jennifers and Johns I see, I suspect Google peeked at my cookies before closing its eyes, e.g., there are two Bay Area business school professors. Your results may differ. []
  3. Bertrand and Mullainathan write extensively about the socioeconomic confound and report a few null results that they interpret as suggesting it is not playing a large role (see their Section “V.B Potential Confounds”, .pdf). However, (1) the n.s. results for socioeconomic status are obtained with extremely noisy proxies and small samples, reducing the ability to conclude evidence of absence from the absence of evidence, and (2) these analyses seek to remedy the consequences of the name confound rather than avoiding the confound from the get-go through experimental design. This post is about experimental design. []
  4. The Jamal paper used 9 different names per race/gender cell []
  5. To avoid biasing the test against implicit egotism, I excluded from the calculations male and female names starting with E_ []

[35] The Default Bayesian Test is Prejudiced Against Small Effects

When considering any statistical tool I think it is useful to answer the following two practical questions:

1. “Does it give reasonable answers in realistic circumstances?”
2. “Does it answer a question I am interested in?”

In this post I explain why, for me, when it comes to the default Bayesian test that’s starting to pop up in some psychology publications, the answer to both questions is no.

The Bayesian test
The Bayesian approach to testing hypotheses is neat and compelling. In principle. [1]

The p-value assesses only how incompatible the data are with the null hypothesis. The Bayesian approach, in contrast, assesses the relative compatibility of the data with a null vs an alternative hypothesis.

The devil is in choosing that alternative.  If the effect is not zero, what is it?

Bayesian advocates in psychology have proposed using a “default” alternative (Rouder et al 1999, .pdf). This default is used in the online (.html) and R-based (.html) Bayes factor calculators. The original papers do warn attentive readers that the default can be replaced with alternatives informed by expertise or beliefs (see especially Dienes 2011 .pdf), but most researchers leave the default unchanged. [2]

This post is written with that majority of default-following researchers in mind. I explain why, for me, when running the default Bayesian test, the answer to Questions 1 & 2 is “no”.

Question 1. “Does it give reasonable answers in realistic circumstances?”
No. It is prejudiced against small effects

The null hypothesis is that the effect size (henceforth d) is zero, H0: d = 0. What’s the alternative hypothesis? It can be whatever we want it to be, say, Ha: d = .5. We would then ask: are the data more compatible with d = 0 or are they more compatible with d = .5?

The default alternative hypothesis used in the Bayesian test is a bit more complicated. It is a distribution, so more like Ha: d~N(0,1). So we ask if the data are more compatible with zero or with d~N(0,1). [3]

That the alternative is a distribution makes it difficult to think about the test intuitively.  Let’s not worry about that. The key thing for us is that that default is prejudiced against small effects.

Intuitively (but not literally), that default means the Bayesian test ends up asking: “is the effect zero, or is it biggish?” When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero. [4]

Demo 1. Power at 50%

Let’s see how the test behaves as the effect size gets smaller (R Code).

[Figure 1: simulation results showing how often the default test supports the null, supports the alternative, or is inconclusive, across effect sizes and sample sizes]

The Bayesian test erroneously supports the null about 5% of the time when the effect is biggish, d=.64, but it does so five times more frequently when it is smallish, d=.28. The smaller the effect (for studies with a given level of power), the more likely we are to dismiss its existence. We are prejudiced against small effects. [5]

Note how, as the sample gets larger, the test becomes more confident (smaller white area) and more wrong (larger red area).
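For readers who want to poke at this themselves, here is a minimal sketch of the kind of simulation behind Demo 1. It is not the post’s R Code: it assumes the BayesFactor package with its default prior scale, and the sample sizes are chosen only for illustration rather than calibrated to exactly 50% power.

# Sketch: how often does the default Bayes factor "support the null"
# (BF01 > 3, the convention in footnote 5) when a true effect exists?
library(BayesFactor)

prop_supporting_null <- function(d, n_per_cell, nsim = 1000) {
  mean(replicate(nsim, {
    x <- rnorm(n_per_cell, mean = 0)
    y <- rnorm(n_per_cell, mean = d)
    bf10 <- extractBF(ttestBF(x = x, y = y))$bf  # default prior scale, r = sqrt(2)/2
    (1 / bf10) > 3                               # BF01 > 3: "clear support" for the null
  }))
}

set.seed(1)
prop_supporting_null(d = 0.64, n_per_cell = 25)   # biggish effect
prop_supporting_null(d = 0.28, n_per_cell = 100)  # smallish effect: null "supported" more often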

Demo 2. Facebook
For a more tangible example consider the Facebook experiment (.html) that found that seeing images of friends who voted (see panel a below) increased voting by 0.39% (panel b).

[Figure: the Facebook experiment’s manipulation (panel a) and its effect on voting (panel b)]

While the null of a zero effect is rejected (p=.02) and hence the entire confidence interval for the effect is above zero, [6] the Bayesian test concludes VERY strongly in favor of the null, 35:1 (R Code).

Prejudiced against (in this case very) small effects.

Question 2. “Does it answer a question I am interested in?”
No. I am not interested in how well data support one elegant distribution.

 When people run a Bayesian test they like writing things like
“The data support the null.”

But that’s not quite right. What they actually ought to write is
“The data support the null more than they support one mathematically elegant alternative hypothesis I compared it to”

Saying a Bayesian test “supports the null” in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.

We are constantly reminded that:
P(D|H0)≠P(H0)
The probability of the data given the null is not the probability of the null

But let’s not forget that:
P(H0|D) / P(H1|D)  ≠ P(H0)
The relative probability of the null over one mathematically elegant alternative is not the probability of the null either.

Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it. The default Bayesian test does not answer a question I would ask.

Wide logo

 


Feedback from Bayesian advocates:
I shared an early draft of this post with three Bayesian advocates. I asked for feedback and invited them to comment.

1. Andrew Gelman  Expressed “100% agreement” with my argument but thought I should make it clearer this is not the only Bayesian approach, e.g., he writes “You can spend your entire life doing Bayesian inference without ever computing these Bayesian Factors.” I made several edits in response to his suggestions, including changing the title.

2. Jeff Rouder  Provided additional feedback and also wrote a formal reply (.html). He begins by highlighting the importance of comparing p-values and Bayes factors when, as is the case in reality, we don’t know whether the effect exists, and the paramount importance for science of subjecting specific predictions to data analysis (again, full reply: .html).

3. EJ Wagenmakers Provided feedback on terminology, the poetic response that follows, and a more in-depth critique of confidence intervals (.pdf)

“In a desert of incoherent frequentist testing there blooms a Bayesian flower. You may not think it is a perfect flower. Its color may not appeal to you, and it may even have a thorn. But it is a flower, in the middle of a desert. Instead of critiquing the color of the flower, or the prickliness of its thorn, you might consider planting your own flower — with a different color, and perhaps without the thorn. Then everybody can benefit.”




Footnotes.

  1. If you want to learn more about it I recommend Rouder et al. 1999 (.pdf), Wagenmakers 2007 (.pdf) and Dienes 2011 (.pdf) []
  2. e.g., Rouder et al (.pdf) write “We recommend that researchers incorporate information when they believe it to be appropriate […] Researchers may also incorporate expectations and goals for specific experimental contexts by tuning the scale of the prior on effect size” p.232 []
  3. The current default distribution is d~N(0,.707), the simulations in this post use that default []
  4. Again, Bayesian advocates are upfront about this, but one has to read their technical papers attentively. Here is an example in Rouder et al (.pdf) page 30: “it is helpful to recall that the marginal likelihood of a composite hypothesis is the weighted average of the likelihood over all constituent point hypotheses, where the prior serves as the weight. As [variance of the alternative hypothesis] is increased, there is greater relative weight on larger values of [the effect size] […] When these unreasonably large values […] have increasing weight, the average favors the null to a greater extent”.   []
  5. The convention is to say that the evidence clearly supports the null if the data are at least three times more likely when the null hypothesis is true than when the alternative hypothesis is, and vice versa. In the chart above I refer to data that do not clearly support the null nor the alternative as inconclusive. []
  6. note that the figure plots standard errors, not a confidence interval []

[34] My Links Will Outlive You

If you are like me, from time to time your papers include links to online references.

Because the internet changes so often, by the time readers follow those links, who knows if the cited content will still be there.

This blogpost shares a simple way to ensure your links live “forever.”  I got the idea from a recent New Yorker article [.html].

Content Rot
It is estimated that about 20%-30% of links referenced in papers are already dead and, like you and me, the remaining links aren’t getting any younger. [1]

I asked a research assistant to follow links in papers published in April of 2005 and April 2010 across four journals, to get a sense of what happens to links 5 and 10 years out. [2]

[Figure: status of the sampled links 5 and 10 years after publication]

Perusing results I noticed that:

  • Links still alive tend to involve individual newspaper articles (these will die when that newspaper shuts down) and .pdf articles hosted in university servers (these will die when faculty move on to other institutions).
  • Links to pages whose information has changed involved things like websites with financial information for 2009 (now reporting 2014 data), or working papers now replaced with updated or published versions.
  • Dead links tended to involve websites by faculty and students now at different institutions, and now-defunct online organizations.

If you intend to give future readers access to the information you are accessing today, providing links seems like a terrible way to do that.

Solution
Making links “permanent” is actually easy. It involves saving the referenced material on WebArchive.org, a repository that saves individual internet pages “forever.”

Here is an example. The Cincinnati Post was a newspaper that started in 1881 and shut down in 2007. The newspaper had a website (www.cincypost.com). If you visit it today, your browser will show this:

[Screenshot: the error page a browser shows today when visiting www.cincypost.com]

The browser will show the same result if we follow any link to any story ever published by that newspaper.

Using the WebArchive, however, we can still read the subset of stories that were archived, for example, this October 2007 story on a fundraising event by then president George W. Bush (.html)

How to make your links “permanent”
1) Go to http://archive.org/web
2) Enter the URL of interest into the “Save Page Now” box

[Screenshot: the “Save Page Now” box on archive.org/web]

3) Copy-paste the resulting permanent link into your paper (a scripted version is sketched below)
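If you archive links often, you can script the same steps. Below is a hedged sketch in R, not an official WebArchive API wrapper: it assumes the httr package and that requesting web.archive.org/save/<url> still triggers “Save Page Now” the way it did when this post was written.

# Sketch: trigger "Save Page Now" and return the URL of the resulting snapshot.
library(httr)

archive_url <- function(url) {
  resp <- GET(paste0("https://web.archive.org/save/", url))
  stop_for_status(resp)
  resp$url  # after redirects, this points at the archived copy
}

archive_url("http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/")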

Example
Imagine writing an academic article in which you want to cite, say, Colada[33] “The Effect Size Does not Exist”. The URL is http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/

You could include that link in your paper, but eventually DataColada will die, and so will the content you are linking to. Someone reading your peer-reviewed Colada takedown in ninety years will have no way of knowing what you were talking about. But, if you copy-paste that URL into the WebArchive, you will save the post, and get a permanent link like this:

http://web.archive.org/web/20150221162234/http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/

Done. Your readers can read Colada[33] long after DataColada.org is 6-feet-under.

PS: Note that WebArchive links include the original link. Were the original material to outlive WebArchive, readers could still see it. Archiving is a weakly dominating strategy.




  1. See “Related Work” section in this PlosONE article [.html] []
  2. I chose journals I read: The Journal of Consumer Research, Psychological Science, Management Science and The American Economic Review. Actually, I no longer read JCR articles, but that’s not 100% relevant. []

[33] “The” Effect Size Does Not Exist

Consider the robust phenomenon of anchoring, where people’s numerical estimates are biased towards arbitrary starting points. What does it mean to say “the” effect size of anchoring?

It surely depends on moderators like the domain of the estimate, expertise, and perceived informativeness of the anchor. Alright, how about “the average” effect size of anchoring? That’s simple enough. Right? Actually, that’s where the problem of interest to this post arises. Computing the average requires answering the following unanswerable question: how much weight should each possible effect size get when computing “the average” effect size?

Should we weight by number of studies? Imagined, planned, or executed? Or perhaps weight by how clean (free-of-confounds) each study is? Or by sample size?

Say anchoring effects are larger when estimating river lengths than when estimating door heights: does “the average” anchoring effect give all river studies combined 50% of the weight and all door studies the other 50%? If so, what do we do with canal-length studies: combine them with the river studies or count them on their own?

If we weight by study rather than stimulus, “the average” effect gets larger as more river studies are conducted, and if we weight by sample size, “the average” gets smaller if we run more subjects in the door studies.


What about the impact of anchoring on perceived strawberry-jam viscosity? Nobody has studied that yet, but they could. Does “the average” anchoring effect size include this one?

What about all the zero estimates one would get if the experiment was done in a room without any lights or with confusing instructions?  What about all the large effects one would get via demand effects or confounds? Does the average include these?

Studies aren’t random
We can think of the problem using a sampling framework: the studies we run are a sample of the studies we could run. Just not a random sample.

Cheat-sheet. Random sample: every member of the population is equally likely to be selected.

First, we cannot run studies randomly, because we don’t know the relative frequency of every possible study in the population of studies. We don’t know how many “door” vs “river” studies exist in this platonic universe, so we don’t know with what probability to run a door vs a river study.

Second, we don’t want to run studies randomly, we want studies that will provide new information, that are similar to those we have seen elsewhere, that will have higher rhetorical value in a talk or paper, that we find intrinsically interesting, that are less confounded, etc. [1]

What can we estimate?
Given a set of studies, we can ask what is the average effect of those studies. We have to worry, of course, about publication bias; p-curve is just the tool for that. If we apply p-curve to a set of studies, it tells us what effect we expect to get if we run those same studies again.

To generalize beyond the data requires judgment rather than statistics.
Judgment can account for non-randomly run studies in a way that statistics cannot.




  1. Running studies with a set instead of a single stimulus is nevertheless very important, but for construct rather than external validity. Running a set of stimuli reduces the risks of stumbling on the single confounded stimulus that works. Check out the excellent “Stimulus Sampling” paper by Wells and Windschitl (.pdf) []

[32] Spotify Has Trouble With A Marketing Research Exam

This is really just a post-script to Colada [2], where I described a final exam question I gave in my MBA marketing research class. Students got a year’s worth of iTunes listening data for one person –me– and were asked: “What songs would this person put on his end-of-year Top 40?” I compared that list to the actual top-40 list. Some students did great, but many made the rookie mistake of failing to account for the fact that older songs (e.g., those released in January) had more opportunity to be listened to than did newer songs (e.g., those released in November).
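For concreteness, here is a minimal sketch, in R, of the exposure adjustment those students missed; the data frame `plays` (with hypothetical columns song, n_plays, and date_added) is made up for illustration.

# Sketch: rank songs by plays per day of availability rather than raw play
# counts, so songs added in November are not penalized relative to January.
rank_songs <- function(plays, year_end = as.Date("2014-12-31")) {
  plays$days_available <- as.numeric(year_end - as.Date(plays$date_added)) + 1
  plays$plays_per_day  <- plays$n_plays / plays$days_available
  plays[order(-plays$plays_per_day), ]
}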

I was reminded of this when I recently received an email from Spotify (my chosen music provider) that read:

[Image: Spotify’s email announcing my “top song” of 2014]

First, Spotify, rather famously, does not make listening-data particularly public, [1] so any acknowledgement that they are assessing my behavior is kind of exciting. Second, that song, Inauguration [Spotify link], is really good. On the other hand, despite my respect for the hard working transistors inside the Spotify preference-detection machine, that song is not my “top song” of 2014. [2]

The thing is, “Inauguration” came out in January. Could Spotify be making the same rookie mistake as some of my MBA students?

Following Spotify’s suggestion, I decided to check out the rest of their assessment of my 2014 musical preferences. Spotify offered a ranked listing of my Top 100 songs from 2014. Basically, without even being asked, Spotify said “hey, I will take that final exam of yours.” So without even being asked I said, “hey, I will grade that answer of yours.” How did Spotify do?

Poorly. Spotify thinks I really like music from January and February.

Here is their data:

[Figure: Spotify’s Top 100 songs for me, plotted by the date each song was added; red circles mark songs that made my actual Top 40]

Each circle is a song; the red ones are those which I included in my actual Top 40 list.

If I were grading this student, I would definitely have some positive things to say. “Dear Spotify Preference-Detection Algorithm, Nice job identifying eight of my 40 favorite songs. In particular, the song that you have ranked second overall, is indeed in my top three.” On the other hand, I would also probably say something like, “That means that your 100 guesses still missed 32 of my favorites. Your top 40 only included five of mine. If you’re wondering where those other songs are hiding, I refer you to the entirely empty right half of the above chart. Of your Top 100, a full 97 were songs added before July 1. I like the second half of the year just as much as the first.” Which is merely to say that the Spotify algorithm has room for improvement. Hey, who doesn’t?

Actually, in preparing this post, I was surprised to learn that, if anything, I have a strong bias toward songs released later in the year. This bias could reflect my tastes, or alternatively a bias in the industry (see this post in a music blog on the topic, .html). I looked at when Grammy-winning songs are released and learned that they are slightly biased toward the second half of the year [3]. The figure below shows the distributions (with the correlation between month and count).

[Figure: release-month distributions for my Top 40 and for Grammy-winning songs, with the correlation between month and count]

I have now learned how to link my Spotify listening behavior to Last.fm. A year from now perhaps I will get emails from two different music-distribution computers and I can compare them head-to-head? In the meantime, I will probably just listen to the forty best songs of 2014 [link to my Spotify playlist].




  1. OK, “famously” is overstated, but even a casual search will reveal that there are many users who want more of their own listening data. Also, “not particularly public” is not the same as “not at all public.” For example, they apparently share all kinds of data with Walt Hickey at FiveThirtyEight (.html). I am envious of Mr. Hickey. []
  2. My top song of 2014 is one of these (I don’t rank my Top 40): The Black and White Years – Embraces, Modern Mod – January, or Perfume Genius – Queen []
  3. I also learned that “Little Green Apples” won in the same year that “Mrs. Robinson” and “Hey Jude” were nominated. Grammy voters apparently fail a more basic music preference test. []

[31] Women are taller than men: Misusing Occam’s Razor to lobotomize discussions of alternative explanations

Most scientific studies document a pattern for which the authors provide an explanation. The job of readers and reviewers is to examine whether that pattern is better explained by alternative explanations.

When alternative explanations are offered, it is common for authors to acknowledge that although, yes, each study has potential confounds, no single alternative explanation can account for all studies. Only the author’s favored explanation can parsimoniously do so.

This is a rhetorically powerful line. Parsimony is a good thing, so arguments that include parsimony-claims feel like good arguments. Nevertheless, such arguments are actually kind of silly.

(Don’t know the term Occam’s Razor? It states that among competing hypotheses, the one with the fewest assumptions should be selected. Wikipedia )

Women are taller than men
A paper could read something like this:

While the lay intuition is that human males are taller than their female counterparts, in this article we show this perception is erroneous, referring to it as “malevation bias.”

In Study 1, we found that (male) actor Tom Cruise is reliably shorter than his (female) partners.

In Study 2 we found that (female) elementary school teachers were much taller than their (mostly male) students.

In Study 3 we found that female basketball players are reliably taller than male referees.

The silly Occam’s razor argument

Across three studies we found that women were taller than men. Although each study is imperfect (for example, an astute reviewer suggested that age differences between teachers and students may explain Study 2), the only single explanation that’s consistent with the totality of the evidence is that women are in general indeed taller than men.

Parsimony favors different alternative explanations
One way to think of the misuse of parsimony to explain a set of studies is that the set is not representative of the world. The results were not randomly selected, they were chosen by the author to make a point.

Parsimony should be judged looking at all evidence, not only the selectively collected and selectively reported subset.

For instance, although the age confound with height is of limited explanatory value when we only consider Studies 1-3 (it only accounts for Study 2), it has great explanatory power in general. Age accounts for most of the variation in height we see in the world.

If three alternative explanations are needed to explain a paper, but each of those explanations accounts for a lot more evidence in the world than the novel explanation proposed by the author to explain her three studies, Occam’s razor should be used to shave off the single new narrow theory, rather than the three existing general theories.

How to deal with alternative explanations then?
Conceptual replications help examine the generalizability of a finding. As the examples above show, they do not help assess if a confound is responsible for a finding, because we can have a different confound in each conceptual replication. [1]

Three ways to deal with concerns that Confound A accounts for Study X:

1) Test additional predictions Confound A makes for Study X.

2) Run a new study designed to examine if Confound A is present in Study X.

3) Run a new study that’s just like Study X, lacking only Confound A.

Running an entirely different Study Y is not a solution for Study X. An entirely different Study Y says “Given the identified confounds with Study X we have decided to give up and start from scratch with Study Y”. And Study Y better be able to stand on its own.




  1. Conceptual replications also don’t help diagnose false-positives; check out the excellent Pashler and Harris (2012) .pdf []

[30] Trim-and-Fill is Full of It (bias)

Statistically significant findings are much more likely to be published than non-significant ones (no citation necessary). Because overestimated effects are more likely to be statistically significant than are underestimated effects, this means that most published effects are overestimates. Effects are smaller – often much smaller – than the published record suggests.

For meta-analysts the gold standard procedure to correct for this bias, with >1700 Google cites, is called Trim-and-Fill (Duval & Tweedie 2000, .pdf). In this post we show Trim-and-Fill generally does not work.

What is Trim-and-Fill?
When you have effect size estimates from a set of studies, you can plot those estimates with effect size on the x-axis and a measure of precision (e.g., sample size or standard error) on the y-axis. In the absence of publication bias this chart is symmetric: noisy estimates are sometimes too big and sometimes too small. In the presence of publication bias the small estimates are missing. Trim-and-Fill deletes (i.e., trims) some of those large-effect studies and adds (i.e., fills) small-effect studies, so that the plot is symmetric. The average effect size in this synthetic set of studies is Trim-and-Fill’s “publication bias corrected” estimate.

What is Wrong With It?
A known limitation of Trim-and-Fill is that it can correct for publication bias that does not exist, underestimating effect sizes (see e.g., Terrin et al 2003, .pdf). A less known limitation is that it generally does not correct for the publication bias that does exist, overestimating effect sizes.

The chart below shows the results of simulations we conducted for our just published “P-Curve and Effect Size” paper (SSRN). We simulated large meta-analyses aggregating studies comparing two means, with sample sizes ranging from 10-70, for five different true effect sizes. The chart plots true effect sizes against estimated effect sizes in a context in which we only observe significant (publishable) findings (R Code for this [Figure 2b] and all other results in our paper).

[Figure 2b: true effect size vs. estimated effect size when naively averaging only significant studies, when applying Trim-and-Fill, and when applying p-curve]

Start with the blue line at the top. That line shows what happens when you simply average only the statistically significant findings–that is, only the findings that would typically be observed in the published literature. As we might expect, those effect size estimates are super biased.

The black line shows what happens when you “correct” for this bias using Trim-and-Fill. Effect size estimates are still super biased, especially when the effect is nonexistent or small.

Aside: p-curve nails it.
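To see the pattern on your own machine, here is a hedged, much-simplified sketch in R (it is not the paper’s R Code): simulate a literature in which only p < .05 results get “published,” then compare the naive average with the Trim-and-Fill estimate. It assumes the metafor package; the true effect and sample sizes are arbitrary.

# Sketch: publication bias driven by p < .05, then a Trim-and-Fill "correction".
library(metafor)

set.seed(1)
d_true <- 0.2; n <- 20                     # true effect and per-cell sample size (arbitrary)
published <- NULL
while (is.null(published) || nrow(published) < 50) {   # keep "running studies" until 50 are published
  x <- rnorm(n); y <- rnorm(n, mean = d_true)
  if (t.test(y, x, var.equal = TRUE)$p.value < .05) {  # the publication filter
    d  <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d
    vd <- 2 / n + d^2 / (4 * n)                              # approximate variance of d
    published <- rbind(published, data.frame(yi = d, vi = vd))
  }
}

naive <- rma(yi, vi, data = published)     # average of the published effects
tf    <- trimfill(naive)                   # Trim-and-Fill "correction"
round(c(true = d_true, naive = coef(naive), trimfill = coef(tf)), 2)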

We were wrong
Trim-and-Fill assumes that studies with relatively smaller effects are not published (e.g., that out of 20 studies attempted, the 3 obtaining the smallest effect size are not publishable). In most fields, however, publication bias is governed by p-values rather than effect size (e.g., out of 20 studies only those with p<.05 are publishable).

Until a few weeks ago we thought that this incorrect assumption led to Trim-and-Fill’s poor performance. For instance, in our paper (SSRN) we wrote

“when the publication process suppresses nonsignificant findings, Trim-and-Fill is woefully inadequate as a corrective technique.” (p.667)

For this post we conducted additional analyses and learned that Trim-and-Fill performs poorly even when its assumptions are met–that is, even when only small-effect studies go unpublished (R Code). Trim-and-Fill seems to work well only when few studies are missing, that is, where there is little bias to be corrected. In situations when a correction is most needed, Trim-and-Fill does not correct nearly enough.

Two Recommendations
1) Stop using Trim-and-Fill in meta-analyses.
2) Stop treating published meta-analyses with a Trim-and-Fill “correction” as if they have corrected for publication bias. They have not.




Author response:
Our policy at Data Colada is to contact authors whose work we cover, offering an opportunity to provide feedback and to comment within our original post. Trim-and-Fill was originally created by Sue Duval and the late Richard Tweedie. We contacted Dr. Duval and exchanged a few emails, but she provided neither feedback nor a response.

[29] Help! Someone Thinks I p-hacked

It has become more common to publicly speculate, upon noticing a paper with unusual analyses, that a reported finding was obtained via p-hacking. This post discusses how authors can persuasively respond to such speculations.

Examples of public speculation of p-hacking
Example 1. A Slate.com post by Andrew Gelman suspected p-hacking in a paper that collected data on 10 colors of clothing, but analyzed red & pink as a single color [.html] (see authors’ response to the accusation .html)

Example 2. An anonymous referee suspected p-hacking and recommended rejecting a paper, after noticing participants with low values of the dependent variable were dropped [.html]

Example 3. A statistics blog suspected p-hacking after noticing a paper studying number of hurricane deaths relied on the somewhat unusual Negative-Binomial Regression [.html]

First, the wrong response
The most common & tempting response to concerns like these is also the wrong response: justifying what one did. Explaining, for instance, why it makes sense to collapse red with pink or to run a negative-binomial.

It is the wrong response because when we p-hack, we self-servingly choose among justifiable analyses. P-hacked findings are by definition justifiable. Unjustifiable research practices involve incompetence or fraud, not p-hacking.

Showing an analysis is justifiable does not inform the question of whether it was p-hacked.

Right Response #1.  “We decided in advance”
P-hacking involves post-hoc selection of analyses to get p<.05. One way to address p-hacking concerns is to indicate analysis decisions were made ex-ante.

A good way to do this is to just say so: “We decided to collapse red & pink before running any analyses.”
A better way is with a more general and verifiable statement: “In all papers we collapse red & pink.”
An even better way is: “We preregistered that we would collapse red & pink in this study” (see related Colada[12]: “Preregistration: Not Just for the Empiro-Zealots”).

Right Response #2.  “We didn’t decide in advance, but the results are robust”
Often we don’t decide in advance. We don’t think of outliers till we see them. What to do then? Show that the results don’t hinge on how the problem is dealt with: show the results dropping >2SD, >2.5SD, and >3SD, logging the dependent variable, comparing medians, and running a non-parametric test. If the conclusion is the same in most of these, tell the blogger to shut up.
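As a concrete illustration, here is a minimal sketch of such a robustness battery in R; the variables `dv` and `condition` are hypothetical, and the particular tests are just examples of the kind of set one might report.

# Sketch: report the same comparison under several defensible analysis choices.
robustness_pvalues <- function(dv, condition) {
  drop_outliers <- function(k) abs(dv - mean(dv)) / sd(dv) < k   # keep values within k SDs
  p_at <- function(keep) t.test(dv[keep] ~ condition[keep])$p.value
  c(sd2       = p_at(drop_outliers(2)),
    sd2.5     = p_at(drop_outliers(2.5)),
    sd3       = p_at(drop_outliers(3)),
    log_dv    = t.test(log(dv) ~ condition)$p.value,    # assumes dv > 0
    rank_test = wilcox.test(dv ~ condition)$p.value)    # non-parametric comparison
}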

Right Response #3. “We didn’t decide in advance, and the results are not robust. So we ran a direct replication.”
Sometimes the result will only be there if you drop >2SD and it will not have occurred to you to do so till you saw the p=.24 without it. One possibility is that you are chasing noise. Another possibility is that you are right. The one way to tell these two apart is with a new study. Run everything the same, exclude again based on >2SD.

If in your “replication” you now need a gender interaction for the >2SD exclusion to give you p<.05, it is not too late to read “False-Positive Psychology” (.html)

Cheers
If a blogger raises concerns of p-hacking, and you cannot provide any of the three responses above: buy the blogger a drink. She is probably right.




[28] Confidence Intervals Don’t Change How We Think about Data

Some journals are thinking of discouraging authors from reporting p-values and encouraging or even requiring them to report confidence intervals instead. Would our inferences be better, or even just different, if we reported confidence intervals instead of p-values?

One possibility is that researchers become less obsessed with the arbitrary significant/not-significant dichotomy. We start paying more attention to effect size. We start paying attention to precision. A step in the right direction.

Another possibility is that researchers forced to report confidence intervals will use them as if they were p-values and will only ask “Does the confidence interval include 0?” In this world confidence intervals are worse than p-values, because p=.012, p=.0002, p=.049 all become p<.05. Our analyses become more dichotomous. A step in the wrong direction.

How to test this?
To empirically assess the consequences of forcing researchers to replace p-values with confidence intervals we could randomly impose the requirement on some authors and see what happens.

That’s hard to pull off for a blog post.  Instead, I exploit a quirk in how “mediation analysis” is now reported in psychology. In particular, the statistical program everyone uses to run mediation reports confidence intervals rather than p-values.  How are researchers analyzing those confidence intervals?

Sample: 10 papers
I went to Web-of-Science and found the ten most recent JPSP articles (.html) citing the Preacher and Hayes (2004) article that provided the statistical programs that everyone runs (.pdf).

All ten of them used confidence intervals as dichotomous p-values; none discussed effect size or precision. None discussed the percentage of the effect that was mediated. One even accepted the null of no mediation because the confidence interval included 0 (it also included large effects).

 


This sample suggests confidence intervals do not change how we think of data.

If people don’t care about effect size here…
Unlike other effect-size estimates in the lab, effect-size in mediation is intrinsically valuable.

No one asks how much more hot sauce subjects pour for a confederate to consume after watching a film that made them angry, but we do ask how much of that effect is mediated by anger; ideally all of it. [1]

Change the question before you change the answer
If we want researchers to care about effect size and precision, then we have to persuade researchers that effect size and precision are important.

I have not been persuaded yet. Effect size matters outside the lab for sure. But in the lab not so clear. Our theories don’t make quantitative predictions, effect sizes in the lab are not particularly indicative of how important a phenomenon is outside the lab, and to study effect size with even moderate precision we need  samples too big to plausibly be run in the lab (see Colada[20]). [2]

My talk at a recent conference (SESP) focused on how research questions should shape the statistical tools we choose to run and report. Here are the slides. (.pptx). This post is an extension of Slide #21.




  1. In practice we do not measure things perfectly, so going for 100% mediation is too ambitious []
  2. I do not have anything against reporting confidence intervals alongside p-values. They will probably be ignored by most readers, but a few will be happy to see them, and it is generally good to make people happy (Though it is worth pointing out that one can usually easily compute confidence intervals from test results).  Descriptive statistics more generally, e.g., means and SDs, should always be reported to catch errors, facilitate meta-analyses, and just generally better understand the results. []

[27] Thirty-somethings are Shrinking and Other U-Shaped Challenges

A recent Psych Science (.pdf) paper found that sports teams can perform worse when they have too much talent.

For example, in Study 3 they found that NBA teams with a higher percentage of talented players win more games, but that teams with the highest levels of talented players win fewer games.

The hypothesis is easy enough to articulate, but pause for a moment and ask yourself, “How would you test it?”

This post shows the most commonly used test is incorrect, and suggests a simple alternative.

What test would you run?
If you are like everyone we talked to over the last several weeks, you would run a quadratic regression (y = β0 + β1x + β2x²), check whether β2 is significant, and check whether plotting the resulting equation yields the predicted u-shape.

We browsed a dozen or so papers testing u-shapes in economics and in psychology and that is also what they did.

That’s also what the Too-Much-Talent paper did. For instance, these are the results they report for the basketball and soccer studies: a fitted inverted u-shaped curve with a statistically significant x². [1]

[Figure 1: fitted inverted u-shaped curves for the basketball and soccer studies]

Everybody is wrong
Relying on the quadratic is super problematic because it sees u-shapes everywhere, even in cases where a true u-shape is not present. For instance:

[Figure 2: examples in which a quadratic regression yields a significant x² even though no true u-shape is present]

The source of the problem is that regressions work hard to get as close as possible to data (blue dots), but are indifferent to implied shapes.

A U-shaped relationship will (eventually) imply a significant quadratic, but a significant quadratic does not imply a U-shaped relationship. [2]

First, plot the raw data.
Figure 2 shows how plotting the data prevents obviously wrong answers. Plots, however, are necessary but not sufficient for good inferences. They may have too little or too much data, becoming Rorschach tests. [3]

[Figure 3: two raw-data scatterplots with fitted quadratic curves]

These charts are somewhat suggestive of a u-shape, but it is hard to tell whether the quadratic is just chasing noise. As social scientists interested in summarizing a mass of data, we want to write sentences like: “As predicted, the relationship was u-shaped, p=.002.

Those charts don’t let us do that.

A super simple solution
When testing inverted u-shapes we want to assess whether:
At first more x leads to more y, but eventually more x leads to less y.

If that’s what we want to assess, maybe that’s what we should test. Here is an easy way to do that, building on the quadratic regression everyone is already running.

1) Run the quadratic regression.
2) Find the point where the resulting u-shape maxes out.
3) Now run a linear regression up to that point, and another from that point onwards.
4) Test whether the second line is negative and significant.

More detailed step-by-step instructions (.html). [4]
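Here is a minimal sketch of those four steps in R. It is not the step-by-step code linked above, and the variable names (x, y) are hypothetical.

# Sketch of the two-lines idea: use the quadratic only to locate the split
# point, then fit and test a separate line on each side of it.
two_lines <- function(x, y) {
  quad  <- lm(y ~ x + I(x^2))                      # step 1: quadratic regression
  b     <- coef(quad)
  x_max <- -b[["x"]] / (2 * b[["I(x^2)"]])         # step 2: where the fitted parabola maxes out
  d     <- data.frame(x = x, y = y)
  left  <- lm(y ~ x, data = d[d$x <= x_max, ])     # step 3: line up to the split point...
  right <- lm(y ~ x, data = d[d$x >= x_max, ])     #         ...and line from it onwards
  list(split_point  = x_max,
       first_slope  = summary(left)$coefficients["x", ],
       second_slope = summary(right)$coefficients["x", ])  # step 4: negative and significant?
}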

One demonstration
We contacted the authors of the Too-Much-Talent paper and they proposed running the two-lines test on all three of their data sets. Aside: we think that’s totally great and admirable.
They emailed us the results of those analyses, and we all agreed to include their analyses in this post.
[Figure: two-lines results for the baseball, basketball, and soccer data sets]

The paper had predicted and documented the lack of a u-shape for Baseball. The first figure is consistent with that result.

The paper had predicted and documented an inverted u-shape in Basketball and Soccer. The Basketball results are as predicted (first slope is positive, p<.001, second slope negative, p = .026). The Soccer results were more ambiguous (first slope is significantly positive, p<.001, but the second slope is not significant, p=.53).

The authors provided a detailed discussion of these and additional new analyses (.pdf).

We thank them for their openness, responsiveness, and valuable feedback.

Another demonstration
The most cited paper studying u-shapes we found (Aghion et al, QJE 2005, .pdf) examines the impact of competition on innovation.  Figure 3b above is the key figure in that paper. Here it is with two lines instead (STATA code .do; raw data .zip):

[Figure: the Aghion et al. competition-and-innovation data re-plotted with two fitted lines]

The second line is significantly negatively sloped, z=-3.75, p<.0001.

If you are like us, you think the p-value from that second line adds value to the eye-ball test of the published chart, and surely to the nondiagnostic p-value from the x² in the quadratic regression.

If you see a problem with the two lines, or know of a better solution, please email Uri and/or Leif




  1. Talent was operationalized in soccer as belonging to a top-25 soccer team (e.g., Manchester United) and in basketball as being top-third of the NBA in Estimated Wins Added (EWA), and results were shown to be robust to defining top-20% and top-40%. []
  2. Lind and Mehlum (2010, .pdf) propose a way to formally test for the u-shape itself within a quadratic (and a few other specifications), and Miller et al (2013 .pdf) provide analytical techniques for calculating thresholds where effects differ from zero for quadratic models. However, these tools should only be utilized when the researcher is confident about functional form, for they can lead to mistaken inferences when the assumptions are wrong. For example, if applied to y=log(x), one would, for sufficiently dispersed x-es, incorrectly conclude the relationship has an inverted u-shape, when it obviously does not. We shared an early draft of this post with the authors of both methods papers and they provided valuable feedback already reflected in this longest of footnotes. []
  3. One could plot fitted nonparametric functions for these, via splines or kernel regressions, but the results are quite sensitive to researcher degrees-of-freedom (e.g., bandwidth choice, # of knots) and also do not provide a formal test of a functional form []
  4. We found one paper that implemented something similar to this approach: Ungemach et al, Psych Science, 2011, Study 2 (.pdf), though they identify the split point with theory rather than a quadratic regression. More generally, there are other ways to find the point where the two lines are split, and their relative performance is worth exploring.  []