[27] Thirty-somethings are Shrinking and Other U-Shaped Challenges

A recent Psych Science (.pdf) paper found that sports teams can perform worse when they have too much talent.

For example, in Study 3 they found that NBA teams with a higher percentage of talented players win more games, but that teams with the highest percentages of talented players win fewer games.

The hypothesis is easy enough to articulate, but pause for a moment and ask yourself, “How would you test it?”

This post shows the most commonly used test is incorrect, and suggests a simple alternative.

What test would you run?
If you are like everyone we talked to over the last several weeks, you would run a quadratic regression (y = β0 + β1x + β2x²), check whether β2 is significant, and check whether plotting the resulting equation yields the predicted u-shape.
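In R, that standard test looks something like the following sketch (x, y, and dat are placeholder names of ours, not anything from the paper):

    # The usual u-shape test: fit y = b0 + b1*x + b2*x^2
    # and check the sign and significance of b2.
    quad_fit <- lm(y ~ x + I(x^2), data = dat)
    summary(quad_fit)          # inspect the coefficient on I(x^2)
    coef(quad_fit)["I(x^2)"]   # negative and significant is read as an inverted u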

We browsed a dozen or so papers testing u-shapes in economics and in psychology, and that is also what they did.

That’s also what the Too-Much-Talent paper did. For instance, these are the results they report for the basketball and soccer studies: a fitted inverted u-shaped curve with a statistically significant x². [1]

[Figure 1: the fitted inverted u-shaped curves reported for the basketball and soccer studies]

Everybody is wrong
Relying on the quadratic is super problematic because it sees u-shapes everywhere, even in cases where a true u-shape is not present. For instance:

[Figure 2: examples of data with no true u-shape for which the quadratic regression nevertheless fits a u-shape]

The source of the problem is that regressions work hard to get as close as possible to data (blue dots), but are indifferent to implied shapes.

A U-shaped relationship will (eventually) imply a significant quadratic, but a significant quadratic does not imply a U-shaped relationship. [2]
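A quick simulation of our own makes the point concrete (footnote 2 makes the same point analytically): the relationship y = log(x) increases everywhere, yet a quadratic regression will happily return a “significant inverted u.”

    # Our simulated example: y always increases in x, but the quadratic term
    # still comes out negative and highly significant.
    set.seed(1)
    x <- runif(1000, 1, 100)
    y <- log(x) + rnorm(1000, sd = 0.5)
    summary(lm(y ~ x + I(x^2)))   # I(x^2) is negative and p << .05, yet y never decreases in x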

First, plot the raw data.
Figure 2 shows how plotting the data prevents obviously wrong answers. Plots, however, are necessary but not sufficient for good inferences. They may have too little or too much data, becoming Rorschach tests. [3]

[Figure 3: two datasets whose plots are suggestive, but not conclusive, of a u-shape; panel (b) is the key figure from Aghion et al.]

These charts are somewhat suggestive of a u-shape, but it is hard to tell whether the quadratic is just chasing noise. As social scientists interested in summarizing a mass of data, we want to write sentences like: “As predicted, the relationship was u-shaped, p=.002.”

Those charts don’t let us do that.

A super simple solution
When testing inverted u-shapes we want to assess whether:
At first more x leads to more y, but eventually more x leads to less y.

If that’s what we want to assess, maybe that’s what we should test. Here is an easy way to do that, one that builds on the quadratic regression everyone is already running.

1) Run the quadratic regression.
2) Find the point where the resulting u-shape maxes out.
3) Now run a linear regression up to that point, and another from that point onwards.
4) Test whether the second line is negative and significant.

More detailed step-by-step instructions (.html). [4]
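For concreteness, here is a minimal R sketch of those four steps (it is not the code behind the linked instructions; dat, x, and y are placeholder names):

    # 1) Quadratic regression
    quad <- lm(y ~ x + I(x^2), data = dat)
    b <- coef(quad)

    # 2) The fitted inverted u peaks at x = -b1 / (2*b2)
    x_max <- -b["x"] / (2 * b["I(x^2)"])

    # 3) One linear regression up to that point, another from that point onwards
    left  <- lm(y ~ x, data = subset(dat, x <= x_max))
    right <- lm(y ~ x, data = subset(dat, x >  x_max))

    # 4) For an inverted u: first slope positive, second slope negative and significant
    summary(left)$coefficients["x", ]
    summary(right)$coefficients["x", ]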

One demonstration
We contacted the authors of the Too-Much-Talent paper and they proposed running the two-lines test on all three of their data sets. Aside: we think that’s totally great and admirable.
They emailed us the results of those analyses, and we all agreed to include them in this post.
[Figure: two-lines tests for the baseball, basketball, and soccer datasets]

The paper had predicted and documented the lack of a u-shape for Baseball. The first figure is consistent with that result.

The paper had predicted and documented an inverted u-shape in Basketball and Soccer. The Basketball results are as predicted (the first slope is positive, p<.001, and the second slope is negative, p=.026). The Soccer results were more ambiguous (the first slope is significantly positive, p<.001, but the second slope is not significant, p=.53).

The authors provided a detailed discussion of these and additional new analyses (.pdf).

We thank them for their openness, responsiveness, and valuable feedback.

Another demonstration
The most cited paper studying u-shapes we found (Aghion et al, QJE 2005, .pdf) examines the impact of competition on innovation.  Figure 3b above is the key figure in that paper. Here it is with two lines instead (STATA code .do; raw data .zip):

[Figure: Aghion et al.’s data re-plotted with the two-lines fit]

The second line is significantly negatively sloped, z=-3.75, p<.0001.

If you are like us, you think the p-value from that second line adds value beyond the eye-ball test of the published chart, and certainly beyond the nondiagnostic p-value from the x² in the quadratic regression.

If you see a problem with the two lines, or know of a better solution, please email Uri and/or Leif.


  1. Talent was operationalized in soccer as belonging to a top-25 soccer team (e.g., Manchester United) and in basketball as being in the top third of the NBA in Estimated Wins Added (EWA); the results were shown to be robust to defining talent as top-20% and top-40%.
  2. Lind and Mehlum (2010, .pdf) propose a way to formally test for the u-shape itself within a quadratic (and a few other specifications), and Miller et al. (2013, .pdf) provide analytical techniques for calculating thresholds where effects differ from zero in quadratic models. However, these tools should only be used when the researcher is confident about the functional form, for they can lead to mistaken inferences when that assumption is wrong. For example, if applied to y=log(x), one would, for sufficiently dispersed x’s, incorrectly conclude the relationship has an inverted u-shape, when it obviously does not. We shared an early draft of this post with the authors of both methods papers and they provided valuable feedback already reflected in this longest of footnotes.
  3. One could plot fitted nonparametric functions for these, via splines or kernel regressions, but the results are quite sensitive to researcher degrees-of-freedom (e.g., bandwidth choice, number of knots) and also do not provide a formal test of a functional form.
  4. We found one paper that implemented something similar to this approach: Ungemach et al., Psych Science, 2011, Study 2 (.pdf), though they identify the split point with theory rather than a quadratic regression. More generally, there are other ways to find the point where the two lines are split, and their relative performance is worth exploring.

[21] Fake-Data Colada

Recently, a psychology paper (.pdf) was flagged as possibly fraudulent based on statistical analyses (.pdf). The author defended his paper (.html), but the university committee investigating misconduct concluded it had occurred (.pdf).

In this post we present new and more intuitive versions of the analyses that flagged the paper as possibly fraudulent. We then rule out p-hacking, among other benign explanations.

Excessive linearity
The whistleblowing report pointed out the suspicious paper had excessively linear results.
That sounds more technical than it is.

Imagine comparing the heights of kids in first, second, and third grade, with the hypothesis that higher grades have taller children. You get samples of n=20 kids in each grade, finding average heights of: 120 cms, 126 cms, and 130 cms. That’s an almost perfectly linear pattern: 2nd graders [126] are almost exactly midway between the other two groups [mean(120,130)=125].

The scrutinized paper has 12 studies with three conditions each. The Control was too close to the midpoint of the other two in all of them. It is not suspicious for the true effect to be linear. Nothing wrong with 2nd graders being 125 cm tall. But real data are noisy, so even if the effect is truly and perfectly linear, small samples of 2nd graders won’t average 125 every time.

Our new analysis of excessive linearity
The original report estimated a less than 1 in 179 million chance that a single paper with 12 studies would lead to such perfectly linear results. Their approach was elegant (subjecting results from two F-tests to a third F-test) but a bit technical for the uninitiated.

We did two things differently:
(1) Created a more intuitive measure of linearity, and
(2) Ran simulations instead of relying on F-distributions.

Intuitive measure of linearity
For each study, we calculated how far the Control condition was from the midpoint of the other two. So if in one study the means were: Low=0, Control=61, High=100, our measure compares the midpoint, 50, to the 61 from the Control, and notes they differ by 11% of the High-Low distance. [1]
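In code, that measure (footnote 1) is just the following (a sketch of ours, shown with the example numbers above):

    # Distance of the Control mean from the Low-High midpoint,
    # as a fraction of the High-Low distance
    midpoint_deviation <- function(low, control, high) {
      abs(((high + low) / 2 - control) / (high - low))
    }

    midpoint_deviation(low = 0, control = 61, high = 100)   # 0.11, i.e., 11%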

Across the 12 studies, the Control conditions were on average just 2.3% away from the midpoint. We ran simulations to see how extreme that 2.3% was.

Simulations
We drew samples from populations with means and standard deviations equal to those reported in the suspicious paper. Our simulated variables were discrete and bounded, as in the paper, and we assumed that the true mean of the Control was exactly midway between the other two. [2] We gave the reported data every benefit of the doubt.
(see R Code)
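The gist of one such simulation, in a stripped-down sketch of our own (the linked R Code is the real thing; the means, standard deviations, and 1-7 scale bounds below are placeholders, not the paper’s values):

    # One simulated study: three conditions of n = 20, the Control's true mean
    # set exactly midway between Low and High, responses discrete and bounded.
    simulate_study <- function(n = 20, mu_low = 2, mu_high = 4, sd = 1.5) {
      mu_control <- (mu_low + mu_high) / 2              # benefit of the doubt: truly linear
      clamp <- function(z) pmin(pmax(round(z), 1), 7)   # discrete, bounded scale (assumed 1-7)
      low     <- clamp(rnorm(n, mu_low, sd))
      control <- clamp(rnorm(n, mu_control, sd))
      high    <- clamp(rnorm(n, mu_high, sd))
      abs(((mean(high) + mean(low)) / 2 - mean(control)) / (mean(high) - mean(low)))
    }

    # One simulated "paper" = 12 such studies; average the deviation across them
    mean(replicate(12, simulate_study()))

Repeating that for 100,000 simulated papers gives the distribution against which the observed 2.3% is judged.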

Results
Recall that in the suspicious paper the Control was off by just 2.3% from the midpoint of the other two conditions. How often did we observe such a perfectly linear result in our 100,000 simulations?

Never.

[Figure: distribution of the average midpoint deviation across the 100,000 simulated papers; none reaches the observed 2.3%]

In real life, studies need to be p<.05 to be published. Could that explain it?

We redid the above chart including only the 45% of simulated papers in which all 12 studies were p<.05. The results changed so little that, to save space, we put the (almost identical) chart here.


A second witness: excessive similarity across studies
The original report also noted very similar effect sizes across studies.
The results reported in the suspicious paper convey this:

[Figure: the F-values reported for each of the 12 studies]

The F-values are not just surprisingly large; they are also surprisingly stable across studies.
Just how unlikely is that?

We computed the simplest measure of similarity we could think of: the standard deviation of F() across the 12 studies. In the suspicious paper (see figure above), SD(F) = SD(8.93, 9.15, 10.02, …) = .866. We then computed SD(F) for each of the simulated papers.

How often did we observe such extreme similarity in our 100,000 simulations?

Never.

[Figure: distribution of SD(F) across the 100,000 simulated papers; none is as small as the observed .866]

Two red flags
For each simulated paper we have two red-flag measures: “Control is too close to the High-Low midpoint” and “SD of F-values.” These proved uncorrelated in our simulations (r = .004), so they provide independent evidence of aberrant results; we have a conceptual replication of “these data are not real.” [3]

Alternative explanations
1.  Repeat subjects?
Some have speculated that perhaps some participants took part in more than one of the studies. Because of random assignment to condition, that wouldn’t help explain the consistency of differences across conditions in different studies. If anything, it would make things worse: repeat participants would increase variability, as studies would differ in their mixture of experienced and inexperienced participants.

2. Recycled controls?
Others have speculated that perhaps the same control condition was used in multiple studies. But the controls were different across studies (e.g., Study 2 involved listening to poems, Study 1 seeing letters).

3. Innocent copy-paste error?
Recent scandals in economics (.html) and medicine (.html) have involved copy-paste errors made before running analyses. Here, so many separate experiments are involved, all with the same odd patterns, that an unintentional error seems implausible.

4. P-hacking?
To p-hack you need to drop participants, measures, or conditions. These studies have the same dependent variables, parallel manipulations, and the same sample sizes and analyses; there is no room for selective reporting.

In addition, p-hacking leads to p-values just south of .05 (see our p-curve paper, SSRN). All p-values in the paper are smaller than p=.0008.  P-hacked findings do not reliably get this pedigree of p-values.

Actually, with n=20, not even real effects do.


  1. The measure = |((High + Low)/2 − Control) / (High − Low)|.
  2. Thus, we don’t use the reported Control mean; our analysis is much more conservative than that.
  3. Note that the SD(F) simulation is not run under the null that the F-values are the same, but rather under the null that the Control is the midpoint. We also carried out 100,000 simulations under that other null and, again, never got an SD(F) that small.