[35] The Default Bayesian Test is Prejudiced Against Small Effects

When considering any statistical tool I think it is useful to answer the following two practical questions:

1. “Does it give reasonable answers in realistic circumstances?”
2. “Does it answer a question I am interested in?”

In this post I explain why, for me, when it comes to the default Bayesian test that’s starting to pop up in some psychology publications, the answer to both questions is no.

The Bayesian test
The Bayesian approach to testing hypotheses is neat and compelling. In principle.1

The p-value assesses only how incompatible the data are with the null hypothesis. The Bayesian approach, in contrast, assesses the relative compatibility of the data with a null vs an alternative hypothesis.

The devil is in choosing that alternative.  If the effect is not zero, what is it?

Bayesian advocates in psychology have proposed using a “default” alternative (Rouder et al 1999, .pdf). This default is used in the online (.html) and R-based (.html) Bayes factor calculators. The original papers do warn attentive readers that the default can be replaced with alternatives informed by expertise or beliefs (see especially Dienes 2011 .pdf), but most researchers leave the default unchanged.2

This post is written with that majority of default-following researchers in mind. I explain why, for me, when running the default Bayesian test, the answer to both Questions 1 & 2 is “no.”

Question 1. “Does it give reasonable answers in realistic circumstances?”
No. It is prejudiced against small effects

The null hypothesis is that the effect size (henceforth d) is zero, Ho: d = 0. What’s the alternative hypothesis? It can be whatever we want it to be, say, Ha: d = .5. We would then ask: are the data more compatible with d = 0, or are they more compatible with d = .5?

The default alternative hypothesis used in the Bayesian test is a bit more complicated. It is a distribution, so more like Ha: d~N(0,1). So we ask if the data are more compatible with zero or with d~N(0,1).3

That the alternative is a distribution makes it difficult to think about the test intuitively. Let’s not worry about that. The key thing for us is that the default is prejudiced against small effects.
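To see the arithmetic of comparing a point null to a distribution, here is a rough sketch of mine in Python (the posts’ actual code is in R, and the real default test works with t-statistics; I use a normal approximation instead, so the function name `bf01` and the closed form below are my own simplification). Treating the observed effect as normally distributed around the true effect, the marginal likelihood under Ha: d~N(0, .707) has a closed form:

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def bf01(d_hat, se, tau=0.707):
    """Bayes factor favoring H0: d = 0 over Ha: d ~ N(0, tau^2),
    treating the estimate d_hat as ~ N(d, se^2).  Under Ha the
    marginal distribution of d_hat is N(0, se^2 + tau^2)."""
    return normal_pdf(d_hat, 0, se) / normal_pdf(d_hat, 0, sqrt(se**2 + tau**2))
```

For instance, a tiny but just-significant estimate (d_hat = .0023 with se = .001, so p ≈ .02) yields a bf01 around 50: strong “support” for the null under this approximation, precisely because the estimate is small relative to the d~N(0, .707) alternative.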

Intuitively (but not literally), that default means the Bayesian test ends up asking: “is the effect zero, or is it biggish?” When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero.4

Demo 1. Power at 50%

Let’s see how the test behaves as the effect size gets smaller (R Code):

[Figure 1]

The Bayesian test erroneously supports the null about 5% of the time when the effect is biggish, d=.64, but it does so five times more frequently when it is smallish, d=.28. The smaller the effect (for studies with a given level of power), the more likely we are to dismiss its existence. We are prejudiced against small effects.5

Note how as the sample gets larger the test becomes more confident (smaller white area) and more wrong (larger red area).
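The pattern in the figure can be reproduced in miniature. The post’s simulations are in R; the sketch below is my own Python analogue using a one-sample normal approximation rather than the actual default t-test, so the exact percentages will differ from the figure, but the prejudice against small effects at matched power shows up all the same:

```python
import random
from math import sqrt, pi, exp

def normal_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def bf01(d_hat, se, tau=0.707):
    """Bayes factor for H0: d = 0 vs Ha: d ~ N(0, tau^2), normal approximation."""
    return normal_pdf(d_hat, 0, se) / normal_pdf(d_hat, 0, sqrt(se**2 + tau**2))

def share_supporting_null(d, n, reps=20000, seed=1):
    """Share of simulated one-sample studies (true effect d, sample size n)
    with BF01 > 3, the conventional cutoff for 'clear support' for the null."""
    rng = random.Random(seed)
    se = 1 / sqrt(n)
    return sum(bf01(rng.gauss(d, se), se) > 3 for _ in range(reps)) / reps

# Both designs have ~50% power (d * sqrt(n) is about 1.96):
biggish  = share_supporting_null(d=.64, n=10)   # biggish effect, small sample
smallish = share_supporting_null(d=.28, n=50)   # smallish effect, large sample
```

In this sketch the smallish-effect design “supports the null” far more often than the biggish one, despite the two designs having identical power.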

Demo 2. Facebook
For a more tangible example consider the Facebook experiment (.html) that found that seeing images of friends who voted (see panel a below) increased voting by 0.39% (panel b).

[Figure: Facebook experiment, panels a and b]

While the null of a zero effect is rejected (p=.02) and hence the entire confidence interval for the effect is above zero,6 the Bayesian test concludes VERY strongly in favor of the null, 35:1. (R Code)

Prejudiced against (in this case very) small effects.
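The same closed-form normal sketch from above shows how this happens with a huge sample. The standardized magnitude below is invented for illustration (the post’s R code computes the actual 35:1 figure from the study’s statistics): a minuscule effect whose z-statistic is 2.33, i.e., two-sided p ≈ .02.

```python
from math import sqrt, pi, exp

def normal_pdf(x, mu, sd):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def bf01(d_hat, se, tau=0.707):
    """Bayes factor for H0: d = 0 vs Ha: d ~ N(0, tau^2), normal approximation."""
    return normal_pdf(d_hat, 0, se) / normal_pdf(d_hat, 0, sqrt(se**2 + tau**2))

# Hypothetical Facebook-scale numbers: a minuscule standardized effect,
# estimated very precisely, that is nonetheless significant at p ~= .02.
d_hat = 0.004
se = d_hat / 2.33
print(bf01(d_hat, se))  # far above 3: 'very strong' support for the null
```

A significant p-value and a Bayes factor strongly favoring the null, from the same data: the small-but-precise effect is even less compatible with the biggish default alternative than with zero.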

Question 2. “Does it answer a question I am interested in?”
No. I am not interested in how well data support one elegant distribution.

When people run a Bayesian test they like writing things like:
“The data support the null.”

But that’s not quite right. What they actually ought to write is:
“The data support the null more than they support one mathematically elegant alternative hypothesis I compared it to.”

Saying a Bayesian test “supports the null” in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.

We are constantly reminded that:
The probability of the data given the null is not the probability of the null

But let’s not forget that:
P(H0|D) / P(H1|D)  ≠ P(H0|D)
The relative probability of the null over one mathematically elegant alternative is not the probability of the null either.
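For concreteness, here is a small sketch of mine converting a Bayes factor into a posterior probability for the null. Note what the conversion requires: assuming that H0 and this one particular Ha exhaust the possibilities, which is exactly the assumption the argument above rejects.

```python
def posterior_prob_h0(bf01, prior_odds=1.0):
    """P(H0|D) from BF01 and prior odds P(H0)/P(H1), valid only
    if H0 and this single Ha are the only hypotheses in play."""
    post_odds = bf01 * prior_odds
    return post_odds / (1 + post_odds)

# A 3:1 Bayes factor is 'clear support' by convention, yet even
# granting the default alternative it leaves P(H0|D) at just .75.
print(posterior_prob_h0(3.0))  # 0.75
```

And hypotheses never put into the comparison get zero weight by construction, so this “P(H0|D)” is still only the probability of the null relative to that one alternative.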

Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it. The default Bayesian test does not answer a question I would ask.



Feedback from Bayesian advocates:
I shared an early draft of this post with three Bayesian advocates. I asked for feedback and invited them to comment.

1. Andrew Gelman expressed “100% agreement” with my argument but thought I should make it clearer this is not the only Bayesian approach; e.g., he writes “You can spend your entire life doing Bayesian inference without ever computing these Bayesian Factors.” I made several edits in response to his suggestions, including changing the title.

2. Jeff Rouder provided additional feedback and also wrote a formal reply (.html). He begins by highlighting the importance of comparing p-values and Bayes factors when, as is the case in reality, we don’t know if the effect does or does not exist, and the paramount importance for science of subjecting specific predictions to data analysis (again, full reply: .html).

3. EJ Wagenmakers provided feedback on terminology, the poetic response that follows, and a more in-depth critique of confidence intervals (.pdf):

“In a desert of incoherent frequentist testing there blooms a Bayesian flower. You may not think it is a perfect flower. Its color may not appeal to you, and it may even have a thorn. But it is a flower, in the middle of a desert. Instead of critiquing the color of the flower, or the prickliness of its thorn, you might consider planting your own flower — with a different color, and perhaps without the thorn. Then everybody can benefit.”




  1. If you want to learn more about it I recommend Rouder et al. 1999 (.pdf), Wagenmakers 2007 (.pdf) and Dienes 2011 (.pdf) []
  2. e.g., Rouder et al (.pdf) write “We recommend that researchers incorporate information when they believe it to be appropriate […] Researchers may also incorporate expectations and goals for specific experimental contexts by tuning the scale of the prior on effect size” p.232 []
  3. The current default distribution is d~N(0,.707), the simulations in this post use that default []
  4. Again, Bayesian advocates are upfront about this, but one has to read their technical papers attentively. Here is an example in Rouder et al (.pdf) page 30: “it is helpful to recall that the marginal likelihood of a composite hypothesis is the weighted average of the likelihood over all constituent point hypotheses, where the prior serves as the weight. As [variance of the alternative hypothesis] is increased, there is greater relative weight on larger values of [the effect size] […] When these unreasonably large values […] have increasing weight, the average favors the null to a greater extent”.   []
  5. The convention is to say that the evidence clearly supports the null if the data are at least three times more likely when the null hypothesis is true than when the alternative hypothesis is, and vice versa. In the chart above I refer to data that do not clearly support the null nor the alternative as inconclusive. []
  6. note that the figure plots standard errors, not a confidence interval []

[34] My Links Will Outlive You

If you are like me, from time to time your papers include links to online references.

Because the internet changes so often, by the time readers follow those links, who knows if the cited content will still be there.

This blogpost shares a simple way to ensure your links live “forever.”  I got the idea from a recent New Yorker article [.html].

Content Rot
It is estimated that about 20%-30% of links referenced in papers are already dead and, like you and me, the remaining links aren’t getting any younger.1

I asked a research assistant to follow links in papers published in April of 2005 and April 2010 across four journals, to get a sense of what happens to links 5 and 10 years out.2


Perusing results I noticed that:

  • Links still alive tend to involve individual newspaper articles (these will die when that newspaper shuts down) and .pdf articles hosted in university servers (these will die when faculty move on to other institutions).
  • Links to pages whose information has changed involved things like websites with financial information for 2009 (now reporting 2014 data), or working papers now replaced with updated or published versions.
  • Dead links tended to involve websites by faculty and students now at different institutions, and now-defunct online organizations.

If you intend to give future readers access to the information you are accessing today, providing links seems like a terrible way to do that.

Making links “permanent” is actually easy. It involves saving the referenced material on WebArchive.org, a repository that saves individual internet pages “forever.”

Here is an example. The Cincinnati Post was a newspaper that started in 1881 and shut down in 2007. The newspaper had a website (www.cincypost.com). If you visit it today, your browser will show this:


The browser will show the same result if we follow any link to any story ever published by that newspaper.

Using the WebArchive, however, we can still read the subset of stories that were archived, for example, this October 2007 story on a fundraising event by then president George W. Bush (.html)

How to make your links “permanent”
1) Go to http://archive.org/web
2) Enter the URL of interest into the “Save Page Now” box


3) Copy-paste the resulting permanent link into your paper
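For scripted workflows, those steps can also be done programmatically. The Wayback Machine exposes a “Save Page Now” endpoint at web.archive.org/save/; exact response format and rate limits are not guaranteed, so treat this as a sketch (the function names are mine):

```python
SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_now_url(url):
    """URL that, when fetched (e.g. with urllib.request.urlopen),
    asks the Wayback Machine to snapshot `url`."""
    return SAVE_ENDPOINT + url

def archived_url(original_url, timestamp):
    """Shape of the permanent links the archive hands back: note
    that the original URL survives inside the archived one."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"
```

That embedded original URL is what makes archiving costless: even if the archive itself disappeared, readers could recover the address you cited.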

Imagine writing an academic article in which you want to cite, say, Colada[33] “The Effect Size Does not Exist”. The URL is http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/

You could include that link in your paper, but eventually DataColada will die, and so will the content you are linking to. Someone reading your peer-reviewed Colada takedown in ninety years will have no way of knowing what you were talking about. But, if you copy-paste that URL into the WebArchive, you will save the post, and get a permanent link like this:


Done. Your readers can read Colada[33] long after DataColada.org is 6-feet-under.

PS: Note that WebArchive links include the original link. Were the original material to outlive WebArchive, readers could still see it. Archiving is a weakly dominant strategy.



  1. See “Related Work” section in this PlosONE article [.html] []
  2. I chose journals I read: The Journal of Consumer Research, Psychological Science, Management Science and The American Economic Review. Actually, I no longer read JCR articles, but that’s not 100% relevant. []

[33] “The” Effect Size Does Not Exist

Consider the robust phenomenon of anchoring, where people’s numerical estimates are biased towards arbitrary starting points. What does it mean to say “the” effect size of anchoring?

It surely depends on moderators like the domain of the estimate, expertise, and perceived informativeness of the anchor. Alright, how about “the average” effect-size of anchoring? That’s simple enough. Right? Actually, that’s where the problem of interest to this post arises. Computing the average requires answering the following unanswerable question: How much weight should each possible effect-size get when computing “the average” effect size?

Should we weight by number of studies? Imagined, planned, or executed? Or perhaps weight by how clean (free-of-confounds) each study is? Or by sample size?

Say anchoring effects are larger when estimating river lengths than door heights, does “the average” anchoring effect give all river studies combined 50% weight and all door studies the other 50%? If so, what do we do with canal-length studies, combine them with rivers or count them on their own?

If we weight by study rather than by stimulus, “the average” effect gets larger as more river studies are conducted, and if we weight by sample size “the average” gets smaller if we run more subjects in the door studies.
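A toy example (all numbers invented) makes the point: the very same set of studies yields three different “average” effects depending on the weighting rule.

```python
# (effect size, sample size) for invented anchoring studies:
rivers = [(0.8, 50), (0.7, 50), (0.9, 50)]  # three river-length studies
doors  = [(0.3, 400)]                       # one big door-height study
studies = rivers + doors

# Weight each study equally:
by_study  = sum(d for d, n in studies) / len(studies)
# Weight by sample size:
by_sample = sum(d * n for d, n in studies) / sum(n for _, n in studies)
# Give each domain 50%:
by_domain = (sum(d for d, _ in rivers) / len(rivers) +
             sum(d for d, _ in doors) / len(doors)) / 2

print(by_study, by_sample, by_domain)  # three different 'averages'
```

Run more river studies and `by_study` rises; add subjects to the door study and `by_sample` falls, without any change in the underlying phenomenon.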


What about the impact of anchoring on perceived strawberry-jam viscosity? Nobody has yet studied that, but they could. Does “the average” anchoring effect-size include this one?

What about all the zero estimates one would get if the experiment were run in a room without any lights, or with confusing instructions? What about all the large effects one would get via demand effects or confounds? Does the average include these?

Studies aren’t random
We can think of the problem using a sampling framework: the studies we run are a sample of the studies we could run. Just not a random sample.

Cheat-sheet. Random sample: every member of the population is equally likely to be selected.

First, we cannot run studies randomly, because we don’t know the relative frequency of every possible study in the population of studies. We don’t know how many “door” vs “river” studies exist in this platonic universe, so we don’t know with what probability to run a door vs a river study.

Second, we don’t want to run studies randomly, we want studies that will provide new information, that are similar to those we have seen elsewhere, that will have higher rhetorical value in a talk or paper, that we find intrinsically interesting, that are less confounded, etc.1

What can we estimate?
Given a set of studies, we can ask what the average effect of those studies is. We have to worry, of course, about publication bias; p-curve is just the tool for that. If we apply p-curve to a set of studies, it tells us what effect we expect to get if we ran those same studies again.

To generalize beyond the data requires judgment rather than statistics.
Judgment can account for non-randomly run studies in a way that statistics cannot.



  1. Running studies with a set instead of a single stimulus is nevertheless very important, but for construct rather than external validity. Running a set of stimuli reduces the risks of stumbling on the single confounded stimulus that works. Check out the excellent “Stimulus Sampling” paper by Wells and Windschitl (.pdf) []

[32] Spotify Has Trouble With A Marketing Research Exam

This is really just a post-script to Colada [2], where I described a final exam question I gave in my MBA marketing research class. Students got a year’s worth of iTunes listening data for one person –me– and were asked: “What songs would this person put on his end-of-year Top 40?” I compared that list to the actual top-40 list. Some students did great, but many made the rookie mistake of failing to account for the fact that older songs (e.g., those released in January) had more opportunity to be listened to than did newer songs (e.g., those released in November).
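The fix for that rookie mistake is simple: normalize play counts by how long each song was available. A toy sketch with invented song names and numbers:

```python
# (song, release month, total plays) -- invented listening data.
plays = [("January Song", 1, 120), ("June Song", 6, 80), ("November Song", 11, 30)]

def plays_per_month(release_month, total_plays, last_month=12):
    """Plays divided by the months the song was available during the year."""
    return total_plays / (last_month - release_month + 1)

# Raw counts rank the January release first; exposure-adjusted
# counts reverse the ranking.
by_raw      = sorted(plays, key=lambda s: s[2], reverse=True)
by_adjusted = sorted(plays, key=lambda s: plays_per_month(s[1], s[2]), reverse=True)
```

Here the November release averages 15 plays per month against the January release’s 10, despite a quarter of the raw plays.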

I was reminded of this when I recently received an email from Spotify (my chosen music provider) that read:

spotify figure 1

First, Spotify, rather famously, does not make listening-data particularly public,1 so any acknowledgement that they are assessing my behavior is kind of exciting. Second, that song, Inauguration [Spotify link], is really good. On the other hand, despite my respect for the hard working transistors inside the Spotify preference-detection machine, that song is not my “top song” of 2014.2

The thing is, “Inauguration” came out in January. Could Spotify be making the same rookie mistake as some of my MBA students?

Following Spotify’s suggestion, I decided to check out the rest of their assessment of my 2014 musical preferences. Spotify offered a ranked listing of my Top 100 songs from 2014. Basically, without even being asked, Spotify said “hey, I will take that final exam of yours.” So without even being asked I said, “hey, I will grade that answer of yours.” How did Spotify do?

Poorly. Spotify thinks I really like music from January and February.

Here is their data:

spotify figure 2

Each circle is a song; the red ones are those I included in my actual Top 40 list.

If I were grading this student, I would definitely have some positive things to say. “Dear Spotify Preference-Detection Algorithm, Nice job identifying eight of my 40 favorite songs. In particular, the song that you have ranked second overall is indeed in my top three.” On the other hand, I would also probably say something like, “That means that your 100 guesses still missed 32 of my favorites. Your top 40 only included five of mine. If you’re wondering where those other songs are hiding, I refer you to the entirely empty right half of the above chart. Of your Top 100, a full 97 were songs added before July 1. I like the second half of the year just as much as the first.” Which is merely to say that the Spotify algorithm has room for improvement. Hey, who doesn’t?

Actually, in preparing this post, I was surprised to learn that, if anything, I have a strong bias toward songs released later in the year. This bias could reflect my tastes, or alternatively a bias in the industry (see this post in a music blog on the topic, .html). I looked at when Grammy-winning songs are released and learned that they are slightly biased toward the second half of the year.3 The figure below shows the distributions (with the correlation between month and count).

spotify figure 3

I have now learned how to link my Spotify listening behavior to Last.fm. A year from now perhaps I will get emails from two different music-distribution computers and I can compare them head-to-head? In the meantime, I will probably just listen to the forty best songs of 2014 [link to my Spotify playlist].



  1. OK, “famously” is overstated, but even a casual search will reveal that there are many users who want more of their own listening data. Also, “not particularly public” is not the same as “not at all public.” For example, they apparently share all kinds of data with Walt Hickey at FiveThirtyEight (.html). I am envious of Mr. Hickey. []
  2. My top song of 2014 is one of these (I don’t rank my Top 40): The Black and White Years – Embraces, Modern Mod – January, or Perfume Genius – Queen []
  3. I also learned that “Little Green Apples” won in the same year that “Mrs. Robinson” and “Hey Jude” were nominated. Grammy voters apparently fail a more basic music preference test. []

[31] Women are taller than men: Misusing Occam’s Razor to lobotomize discussions of alternative explanations

Most scientific studies document a pattern for which the authors provide an explanation. The job of readers and reviewers is to examine whether that pattern is better explained by alternative explanations.

When alternative explanations are offered, it is common for authors to acknowledge that although, yes, each study has potential confounds, no single alternative explanation can account for all studies. Only the author’s favored explanation can parsimoniously do so.

This is a rhetorically powerful line. Parsimony is a good thing, so arguments that include parsimony-claims feel like good arguments. Nevertheless, such arguments are actually kind of silly.

(Don’t know the term Occam’s Razor? It states that among competing hypotheses, the one with the fewest assumptions should be selected. Wikipedia )

Women are taller than men
A paper could read something like this:

While the lay intuition is that human males are taller than their female counterparts, in this article we show that this perception is erroneous, a bias we term “malevation bias.”

In Study 1, we found that (male) actor Tom Cruise is reliably shorter than his (female) partners. [Figure 1]

In Study 2 we found that (female) elementary school teachers were much taller than their (mostly male) students. [Figure 2]

In Study 3 we found that female basketball players are reliably taller than male referees. [Figure 3]

The silly Occam’s razor argument

Across three studies we found that women were taller than men. Although each study is imperfect (for example, an astute reviewer suggested that age differences between teachers and students may explain Study 2), the only single explanation that is consistent with the totality of the evidence is that women are in general indeed taller than men.

Parsimony favors different alternative explanations
One way to think of the misuse of parsimony to explain a set of studies is that the set is not representative of the world. The results were not randomly selected, they were chosen by the author to make a point.

Parsimony should be judged looking at all evidence, not only the selectively collected and selectively reported subset.

For instance, although the age confound with height is of limited explanatory value when we only consider Studies 1-3 (it only accounts for Study 2), it has great explanatory power in general. Age accounts for most of the variation in height we see in the world.

If three alternative explanations are needed to explain a paper, but each of those explanations accounts for a lot more evidence in the world than the novel explanation proposed by the author to explain her three studies, Occam’s razor should be used to shave off the single new narrow theory, rather than the three existing general theories.

How to deal with alternative explanations then?
Conceptual replications help examine the generalizability of a finding. As the examples above show, they do not help assess if a confound is responsible for a finding, because we can have a different confound in each conceptual replication.1

Three ways to deal with concerns that Confound A accounts for Study X:

1) Test additional predictions Confound A makes for Study X.

2) Run a new study designed to examine if Confound A is present in Study X.

3) Run a new study that’s just like Study X, lacking only Confound A.

Running an entirely different Study Y is not a solution for Study X. An entirely different Study Y says “Given the identified confounds with Study X we have decided to give up and start from scratch with Study Y”. And Study Y better be able to stand on its own.



  1. Conceptual replications also don’t help diagnose false-positives; check out the excellent Pashler and Harris (2012) .pdf []

[30] Trim-and-Fill is Full of It (bias)

Statistically significant findings are much more likely to be published than non-significant ones (no citation necessary). Because overestimated effects are more likely to be statistically significant than are underestimated effects, this means that most published effects are overestimates. Effects are smaller – often much smaller – than the published record suggests.

For meta-analysts the gold standard procedure to correct for this bias, with >1700 Google cites, is called Trim-and-Fill (Duval & Tweedie 2000, .pdf). In this post we show Trim-and-Fill generally does not work.

What is Trim-and-Fill?
When you have effect size estimates from a set of studies, you can plot those estimates with effect size on the x-axis and a measure of precision (e.g., sample size or standard error) on the y-axis. In the absence of publication bias this chart is symmetric: noisy estimates are sometimes too big and sometimes too small. In the presence of publication bias the small estimates are missing. Trim-and-Fill deletes (i.e., trims) some of those large-effect studies and adds (i.e., fills) small-effect studies, so that the plot is symmetric. The average effect size in this synthetic set of studies is Trim-and-Fill’s “publication bias corrected” estimate.

What is Wrong With It?
A known limitation of Trim-and-Fill is that it can correct for publication bias that does not exist, underestimating effect sizes (see e.g., Terrin et al 2003, .pdf). A less known limitation is that it generally does not correct for the publication bias that does exist, overestimating effect sizes.

The chart below shows the results of simulations we conducted for our just published “P-Curve and Effect Size” paper (SSRN). We simulated large meta-analyses aggregating studies comparing two means, with sample sizes ranging from 10-70, for five different true effect sizes. The chart plots true effect sizes against estimated effect sizes in a context in which we only observe significant (publishable) findings (R Code for this [Figure 2b] and all other results in our paper).


Start with the blue line at the top. That line shows what happens when you simply average only the statistically significant findings–that is, only the findings that would typically be observed in the published literature. As we might expect, those effect size estimates are super biased.
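That “average only the significant studies” bias is easy to reproduce. Below is a minimal sketch of mine, using a normal approximation to the two-sample t-test rather than the paper’s actual R code, so exact numbers will differ from the figure:

```python
import random
from math import sqrt

def mean_of_significant(true_d, n_per_cell, reps=5000, seed=7):
    """Average effect estimate across simulated two-sample studies
    that reached p < .05 (two-tailed), i.e. the studies a literature
    with publication bias would show us."""
    rng = random.Random(seed)
    se = sqrt(2 / n_per_cell)  # SE of the mean difference in d units
    sig = []
    while len(sig) < reps:
        d_hat = rng.gauss(true_d, se)
        if abs(d_hat / se) > 1.96:  # statistically significant
            sig.append(d_hat)
    return sum(sig) / len(sig)

print(mean_of_significant(0.2, 20))  # far above the true 0.2
```

With a small true effect and small cells, only the (over)estimates large enough to clear significance survive, so the published average lands far above the truth; with a large, well-powered effect nearly every study is significant and the bias mostly disappears.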

The black line shows what happens when you “correct” for this bias using Trim-and-Fill. Effect size estimates are still super biased, especially when the effect is nonexistent or small.

Aside: p-curve nails it.

We were wrong
Trim-and-Fill assumes that studies with relatively smaller effects are not published (e.g., that out of 20 studies attempted, the 3 obtaining the smallest effect size are not publishable). In most fields, however, publication bias is governed by p-values rather than effect size (e.g., out of 20 studies only those with p<.05 are publishable).

Until a few weeks ago we thought that this incorrect assumption led to Trim-and-Fill’s poor performance. For instance, in our paper (SSRN) we wrote

“when the publication process suppresses nonsignificant findings, Trim-and-Fill is woefully inadequate as a corrective technique.” (p.667)

For this post we conducted additional analyses and learned that Trim-and-Fill performs poorly even when its assumptions are met–that is, even when only small-effect studies go unpublished (R Code). Trim-and-Fill seems to work well only when few studies are missing, that is, when there is little bias to be corrected. In situations when a correction is most needed, Trim-and-Fill does not correct nearly enough.

Two Recommendations
1) Stop using Trim-and-Fill in meta-analyses.
2) Stop treating published meta-analyses with a Trim-and-Fill “correction” as if they have corrected for publication bias. They have not.



Author response:
Our policy at Data Colada is to contact authors whose work we cover, offering an opportunity to provide feedback and to comment within our original post. Trim-and-Fill was originally created by Sue Duval and the late Richard Tweedie.  We contacted Dr. Duval and exchanged a few emails but she did not provide feedback nor a response.

[29] Help! Someone Thinks I p-hacked

It has become more common to publicly speculate, upon noticing a paper with unusual analyses, that a reported finding was obtained via p-hacking. This post discusses how authors can persuasively respond to such speculations.

Examples of public speculation of p-hacking
Example 1. A Slate.com post by Andrew Gelman suspected p-hacking in a paper that collected data on 10 colors of clothing, but analyzed red & pink as a single color [.html] (see authors’ response to the accusation .html)

Example 2. An anonymous referee suspected p-hacking and recommended rejecting a paper, after noticing participants with low values of the dependent variable were dropped [.html]

Example 3. A statistics blog suspected p-hacking after noticing a paper studying number of hurricane deaths relied on the somewhat unusual Negative-Binomial Regression [.html]

First, the wrong response
The most common & tempting response to concerns like these is also the wrong response: justifying what one did. Explaining, for instance, why it makes sense to collapse red with pink or to run a negative-binomial.

It is the wrong response because when we p-hack, we self-servingly choose among justifiable analyses. P-hacked findings are by definition justifiable. Unjustifiable research practices involve incompetence or fraud, not p-hacking.

Showing an analysis is justifiable does not inform the question of whether it was p-hacked.

Right Response #1.  “We decided in advance”
P-hacking involves post-hoc selection of analyses to get p<.05. One way to address p-hacking concerns is to indicate analysis decisions were made ex-ante.

A good way to do this is to just say so: “We decided to collapse red & pink before running any analyses.”

A better way is with a more general and verifiable statement: “In all papers we collapse red & pink.”

An even better way is: “We preregistered that we would collapse red & pink in this study” (see related Colada[12]: “Preregistration: Not Just for the Empiro-Zealots“).

Right Response #2.  “We didn’t decide in advance, but the results are robust”
Often we don’t decide in advance. We don’t think of outliers till we see them. What to do then? Show the results don’t hinge on how the problem is dealt with: show results dropping >2SD, >2.5SD, and >3SD, logging the dependent variable, comparing medians, and running a non-parametric test. If the conclusion is the same in most of these, tell the blogger to shut up.
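A robustness table of the kind Response #2 asks for can be scripted. Here is a minimal sketch on simulated data; the dataset, the seed, and the cutoffs are all invented for illustration:

```python
import random
from math import sqrt

# Invented data: 200 observations per condition, true difference 0.4.
rng = random.Random(3)
control   = [rng.gauss(0.0, 1) for _ in range(200)]
treatment = [rng.gauss(0.4, 1) for _ in range(200)]

def drop_beyond_sd(xs, cutoff):
    """Exclude observations more than `cutoff` SDs from the group mean."""
    m = sum(xs) / len(xs)
    sd = sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))
    return [x for x in xs if abs(x - m) <= cutoff * sd]

def mean_diff(a, b):
    return sum(b) / len(b) - sum(a) / len(a)

diffs = {}
for cutoff in (2, 2.5, 3):
    a = drop_beyond_sd(control, cutoff)
    b = drop_beyond_sd(treatment, cutoff)
    diffs[cutoff] = mean_diff(a, b)
    print(f">{cutoff} SD rule: diff = {diffs[cutoff]:.2f}")
```

If the difference survives every defensible rule, the p-hacking concern loses force; if it appears under only one rule, see Response #3 below.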

Right Response #3. “We didn’t decide in advance, and the results are not robust. So we run a direct replication.”
Sometimes the result will only be there if you drop >2SD, and it will not have occurred to you to do so till you saw the p=.24 without it. One possibility is that you are chasing noise. Another possibility is that you are right. The only way to tell these two apart is with a new study. Run everything the same, and exclude again based on >2SD.

If in your “replication” you now need a gender interaction for the >2SD exclusion to give you p<.05, it is not too late to read “False-Positive Psychology” (.html)

If a blogger raises concerns of p-hacking, and you cannot provide any of the three responses above: buy the blogger a drink. She is probably right.



[28] Confidence Intervals Don’t Change How We Think about Data

Some journals are thinking of discouraging authors from reporting p-values and encouraging or even requiring them to report confidence intervals instead. Would our inferences be better, or even just different, if we reported confidence intervals instead of p-values?

One possibility is that researchers become less obsessed with the arbitrary significant/not-significant dichotomy. We start paying more attention to effect size. We start paying attention to precision. A step in the right direction.

Another possibility is that researchers forced to report confidence intervals will use them as if they were p-values and will only ask “Does the confidence interval include 0?” In this world confidence intervals are worse than p-values, because p=.012, p=.0002, p=.049 all become p<.05. Our analyses become more dichotomous. A step in the wrong direction.
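The worry is easy to make concrete: read dichotomously, a confidence interval carries less information than the p-value it replaces. A small sketch (normal approximation; the estimates are hypothetical):

```python
def ci95(est, se):
    """Normal-approximation 95% confidence interval."""
    return (est - 1.96 * se, est + 1.96 * se)

def excludes_zero(ci):
    """The dichotomous reading: does the interval exclude zero?"""
    lo, hi = ci
    return lo > 0 or hi < 0

# Estimates whose p-values are roughly .012, .0002, and .049 all
# collapse into the same answer when the CI is read as a test:
for z in (2.5, 3.7, 1.97):
    print(excludes_zero(ci95(z * 0.1, 0.1)))  # True every time
```

Three very different strengths of evidence, one undifferentiated “significant.”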

How to test this?
To empirically assess the consequences of forcing researchers to replace p-values with confidence intervals we could randomly impose the requirement on some authors and see what happens.

That’s hard to pull off for a blog post.  Instead, I exploit a quirk in how “mediation analysis” is now reported in psychology. In particular, the statistical program everyone uses to run mediation reports confidence intervals rather than p-values.  How are researchers analyzing those confidence intervals?

Sample: 10 papers
I went to Web-of-Science and found the ten most recent JPSP articles (.html) citing the Preacher and Hayes (2004) article that provided the statistical programs that everyone runs (.pdf).

All ten of them used confidence intervals as dichotomous p-values; none discussed effect size or precision. None discussed the percentage of the effect that was mediated. One even accepted the null of no mediation because the confidence interval included 0 (it also included large effects).



This sample suggests confidence intervals do not change how we think of data.

If people don’t care about effect size here…
Unlike other effect-size estimates in the lab, effect-size in mediation is intrinsically valuable.

No one asks how much more hot sauce subjects pour for a confederate to consume after watching a film that made them angry, but we do ask how much of that effect is mediated by anger; ideally all of it.1

Change the question before you change the answer
If we want researchers to care about effect size and precision, then we have to persuade researchers that effect size and precision are important.

I have not been persuaded yet. Effect size matters outside the lab for sure. But in the lab not so clear. Our theories don’t make quantitative predictions, effect sizes in the lab are not particularly indicative of how important a phenomenon is outside the lab, and to study effect size with even moderate precision we need  samples too big to plausibly be run in the lab (see Colada[20]).2

My talk at a recent conference (SESP) focused on how research questions should shape the statistical tools we choose to run and report. Here are the slides. (.pptx). This post is an extension of Slide #21.


  1. In practice we do not measure things perfectly, so going for 100% mediation is too ambitious []
  2. I do not have anything against reporting confidence intervals alongside p-values. They will probably be ignored by most readers, but a few will be happy to see them, and it is generally good to make people happy (Though it is worth pointing out that one can usually easily compute confidence intervals from test results).  Descriptive statistics more generally, e.g., means and SDs, should always be reported to catch errors, facilitate meta-analyses, and just generally better understand the results. []

[27] Thirty-somethings are Shrinking and Other U-Shaped Challenges

A recent Psych Science (.pdf) paper found that sports teams can perform worse when they have too much talent.

For example, in Study 3 they found that NBA teams with a higher percentage of talented players win more games, but that teams with the highest levels of talented players win fewer games.

The hypothesis is easy enough to articulate, but pause for a moment and ask yourself, “How would you test it?”

This post shows the most commonly used test is incorrect, and suggests a simple alternative.

What test would you run?
If you are like everyone we talked to over the last several weeks, you would run a quadratic regression (y = β0 + β1x + β2x²), check whether β2 is significant, and whether plotting the resulting equation yields the predicted u-shape.

We browsed a dozen or so papers testing u-shapes in economics and in psychology and that is also what they did.

That’s also what the Too-Much-Talent paper did. For instance, these are the results they report for the basketball and soccer studies: a fitted inverted u-shaped curve with a statistically significant x².1

[Figure 1: fitted inverted u-shaped curves for the basketball and soccer studies]

Everybody is wrong
Relying on the quadratic is super problematic because it sees u-shapes everywhere, even in cases where a true u-shape is not present. For instance:

[Figure 2: significant quadratics fitted to relationships with no true u-shape]

The source of the problem is that regressions work hard to get as close as possible to data (blue dots), but are indifferent to implied shapes.

A U-shaped relationship will (eventually) imply a significant quadratic, but a significant quadratic does not imply a U-shaped relationship.2
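This failure is easy to reproduce. The sketch below (our own simulation, not from the paper) generates data from y = log(x), a relationship that only ever increases, and shows that the standard quadratic regression nonetheless delivers a “significant” negative x² term:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=500)
y = np.log(x) + rng.normal(0, 0.1, size=500)  # monotonically increasing, no u-shape

# OLS fit of y = b0 + b1*x + b2*x^2
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
t_b2 = beta[2] / np.sqrt(cov[2, 2])

print(f"b2 = {beta[2]:.6f}, t = {t_b2:.1f}")  # b2 is negative and statistically significant
```

The quadratic term is significant because log(x) is concave, not because the relationship ever turns down; a researcher who stops at “β2 is significant” would wrongly declare an inverted u-shape.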

First, plot the raw data.
Figure 2 shows how plotting the data prevents obviously wrong answers. Plots, however, are necessary but not sufficient for good inferences. They may have too little or too much data, becoming Rorschach tests.3


These charts are somewhat suggestive of a u-shape, but it is hard to tell whether the quadratic is just chasing noise. As social scientists interested in summarizing a mass of data, we want to write sentences like: “As predicted, the relationship was u-shaped, p=.002.”

Those charts don’t let us do that.

A super simple solution
When testing inverted u-shapes we want to assess whether:
At first more x leads to more y, but eventually more x leads to less y.

If that’s what we want to assess, maybe that’s what we should test. Here is an easy way to do so, building on the quadratic regression everyone is already running.

1) Run the quadratic regression.
2) Find the point where the resulting u-shape maxes out.
3) Now run a linear regression up to that point, and another from that point onwards.
4) Test whether the second line is negative and significant.

More detailed step-by-step instructions (.html).4
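The four steps above can be sketched in code as follows (a minimal illustration on simulated data; the function names are ours):

```python
import numpy as np

def slope_and_t(x, y):
    """OLS slope of y on x and its t-statistic."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

def two_lines(x, y):
    # 1) run the quadratic regression
    X = np.column_stack([np.ones_like(x), x, x**2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    # 2) find where the fitted parabola maxes out
    x_peak = -b[1] / (2 * b[2])
    # 3) separate linear regressions before and after that point
    left = x <= x_peak
    s1, t1 = slope_and_t(x[left], y[left])
    s2, t2 = slope_and_t(x[~left], y[~left])
    # 4) an inverted u needs s1 positive and s2 negative, both significant
    return (s1, t1), (s2, t2), x_peak

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 1000)
y = -(x - 5) ** 2 + rng.normal(0, 1, 1000)  # true inverted u, peaking at x=5
(s1, t1), (s2, t2), x_peak = two_lines(x, y)
print(f"split at x={x_peak:.2f}; slopes {s1:.2f} (t={t1:.1f}) and {s2:.2f} (t={t2:.1f})")
```

With a genuine inverted u in the data, the first line comes out significantly positive and the second significantly negative; with data like Figure 2’s, the second line would fail.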

One demonstration
We contacted the authors of the Too-Much-Talent paper and they proposed running the two-lines test on all three of their data sets (an aside: we think that’s totally great and admirable). They emailed us the results of those analyses, and we all agreed to include their analyses in this post.

The paper had predicted and documented the lack of a u-shape for Baseball. The first figure is consistent with that result.

The paper had predicted and documented an inverted u-shape in Basketball and Soccer. The Basketball results are as predicted (first slope is positive, p<.001; second slope negative, p=.026). The Soccer results were more ambiguous (first slope is significantly positive, p<.001, but the second slope is not significant, p=.53).

The authors provided a detailed discussion of these and additional new analyses (.pdf).

We thank them for their openness, responsiveness, and valuable feedback.

Another demonstration
The most cited paper studying u-shapes we found (Aghion et al, QJE 2005, .pdf) examines the impact of competition on innovation.  Figure 3b above is the key figure in that paper. Here it is with two lines instead (STATA code .do; raw data .zip):


The second line is significantly negatively sloped, z=-3.75, p<.0001.

If you are like us, you think the p-value from that second line adds value to the eye-ball test of the published chart, and surely to the nondiagnostic p-value from the x² in the quadratic regression.

If you see a problem with the two lines, or know of a better solution, please email Uri and/or Leif.


  1. Talent was operationalized in soccer as belonging to a top-25 soccer team (e.g., Manchester United) and in basketball as being top-third of the NBA in Estimated Wins Added (EWA), and results were shown to be robust to defining top-20% and top-40%. []
  2. Lind and Mehlum (2010, .pdf), propose a way to formally test for the u-shape itself within a quadratic (and a few other specifications) and Miller et al (2013 .pdf)  provide analytical techniques for calculating thresholds where effects differ from zero for quadratics models. However, these tools should only be utilized when the researcher is confident about functional form, for they can lead to mistaken inferences when the assumptions are wrong. For example, if applied to y=log(x), one would, for sufficiently dispersed x-es, incorrectly conclude the relationship has an inverted u-shape, when it obviously does not. We shared an early draft of this post with the authors of both methods papers and they provided valuable feedback already reflected in this longest of footnotes. []
  3. One could plot fitted nonparametric functions for these, via splines or kernel regressions, but the results are quite sensitive to researcher degrees-of-freedom (e.g., bandwidth choice, # of knots) and also do not provide a formal test of a functional form []
  4. We found one paper that implemented something similar to this approach: Ungemach et al, Psych Science, 2011, Study 2 (.pdf), though they identify the split point with theory rather than a quadratic regression. More generally, there are other ways to find the point where the two lines are split, and their relative performance is worth exploring.  []

[26] What If Games Were Shorter?

The smaller your sample, the less likely your evidence is to reveal the truth. You might already know this, but most people don’t (.pdf), or at least they don’t appropriately apply it (.pdf). (See, for example, nearly every inference ever made by anyone). My experience trying to teach this concept suggests that it’s best understood using concrete examples.

So let’s consider this question: What if sports games were shorter?

Most NFL football games feature a matchup between one team that is expected to win – the favorite – and one that is not – the underdog. A full-length NFL game consists of four 15-minute quarters.1 After four quarters, favorites outscore their underdog opponents about 63% of the time.2 Now what would happen to the favorites’ chances of winning if the games were shortened to 1, 2, or 3 quarters?

In this post, I’ll tell you what happens and then I’ll tell you what people think happens.

What If Sports Games Were Shorter?

I analyzed 1,008 games across four NFL seasons (2009-2012; data .xls). Because smaller samples are less likely to reveal true differences between the teams, the favorites’ chances of winning (vs. losing or being tied) increase as game length increases.3
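The direction of this result can be reproduced with a toy model (entirely our own invention, not the post’s NFL data): give the favorite a small per-quarter scoring edge, add noise, and count how often its cumulative margin is positive.

```python
import random

random.seed(42)
MU, SIGMA, N_GAMES = 1.0, 6.0, 20000  # assumed per-quarter margin: mean 1, sd 6

def win_rate(quarters, n_games=N_GAMES):
    """Share of simulated games the favorite leads after the given number of quarters."""
    wins = sum(
        sum(random.gauss(MU, SIGMA) for _ in range(quarters)) > 0
        for _ in range(n_games)
    )
    return wins / n_games

for quarters in (1, 2, 3, 4):
    print(f"{quarters} quarter(s): favorite ahead in {win_rate(quarters):.1%} of games")
```

With these assumed parameters the favorite leads after four quarters in roughly 63% of games, close to the empirical rate above, but after only one quarter in just about 57%: shrinking the sample of play shrinks the favorite’s edge.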

Reality is more likely to deviate from true expectations when samples are smaller. We can see this again in an analysis of point differences. For each NFL game, well-calibrated oddsmakers predict how many points the favorite will win by. Plotting these expected point differences against actual point differences reveals that the relationship between expectation and reality tightens as game length increases:

Sample sizes affect the likelihood that reality will deviate from an average expectation.

But sample sizes do not affect what our average expectation should be. If a coin is known to turn up heads 60% of the time, then, regardless of whether the coin will be flipped 10 times or 100,000 times, our best guess is that heads will turn up 60% of time. The error around 60% will be greater for 10 flips than for 100,000 flips, but the average expectation will remain constant.
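A two-line check of that arithmetic (illustrative numbers only): the expected share of heads is 0.60 at any sample size, while the standard error of the observed share shrinks with the square root of the number of flips.

```python
import math

P = 0.60  # known probability of heads
for n in (10, 100_000):
    se = math.sqrt(P * (1 - P) / n)  # standard error of the observed proportion
    print(f"n={n:>7}: expected share of heads = {P:.2f}, SE = {se:.4f}")
```

The expectation is identical in both rows; only the spread around it differs, by a factor of 100.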

To see this in the football data, I computed point differences after each quarter, and then scaled them to a full-length game. For example, if the favorite was up by 3 points after one quarter, I scaled that to a 12-point advantage after 4 quarters. We can plot the difference between expected and actual point differences after each quarter.

The dots are consistently near the red line on the above graph, indicating that the average outcome aligns with expectations regardless of game length. However, as the progressively decreasing error bars show, the deviation from expectation is greater for shorter games than for longer ones.

Do People Know This?

I asked MTurk NFL fans to consider an NFL game in which the favorite was expected to beat the underdog by 7 points in a full-length game. I elicited their beliefs about sample size in a few different ways (materials .pdf; data .xls).

Some were asked to give the probability that the better team would be winning, losing, or tied after 1, 2, 3, and 4 quarters. If you look at the average win probabilities, their judgments look smart.

But this graph is super misleading, because the fact that the average prediction is wise masks the fact that the average person is not. Of the 204 participants sampled, only 26% assigned the favorite a higher probability to win at 4 quarters than at 3 quarters than at 2 quarters than at 1 quarter. About 42% erroneously said, at least once, that the favorite’s chances of winning would be greater for a shorter game than for a longer game.

How good people are at this depends on how you ask the question, but no matter how you ask it they are not very good.

I asked 106 people to indicate whether shortening an NFL game from four quarters to two quarters would increase, decrease, or have no effect on the favorite’s chance of winning. And I asked 103 people to imagine NFL games that vary in length from 1 quarter to 4 quarters, and to indicate which length would give the favorite the best chance to win.

The modal participant believed that game length would not matter. Only 44% correctly said that shortening the game would reduce the favorite’s chances, and only 33% said that the favorite’s chances would be better after 4 quarters than after 3, 2, or 1.

Even though most people get this wrong there are ways to make the consequences of sample size more obvious. It is easy for students to realize that they have a better chance of beating LeBron James in basketball if the game ends after 1 point than after 10 points. They also know that an investment portfolio with one stock is riskier than one with ten stocks.

What they don’t easily see is that these specific examples reflect a general principle. Whether you want to know which candidate to hire, which investment to make, or which team to bet on, the smaller your sample, the less you know.


  1. If the game is tied, the teams play up to 15 additional minutes of overtime. []
  2. 7% of games are tied after four quarters, and, in my sample, favorites won 57% of those in overtime; thus favorites win about 67% of games overall []
  3. Note that it is not that the favorite is more likely to be losing after one quarter; it is more likely to be losing or tied. []