[29] Help! Someone Thinks I p-hacked

It has become more common to publicly speculate, upon noticing a paper with unusual analyses, that a reported finding was obtained via p-hacking. This post discusses how authors can persuasively respond to such speculations.

Examples of public speculation of p-hacking
Example 1. A Slate.com post by Andrew Gelman suspected p-hacking in a paper that collected data on 10 colors of clothing, but analyzed red & pink as a single color [.html] (see authors’ response to the accusation .html)

Example 2. An anonymous referee suspected p-hacking and recommended rejecting a paper, after noticing participants with low values of the dependent variable were dropped [.html]

Example 3. A statistics blog suspected p-hacking after noticing a paper studying number of hurricane deaths relied on the somewhat unusual Negative-Binomial Regression [.html]

First, the wrong response
The most common & tempting response to concerns like these is also the wrong response: justifying what one did. Explaining, for instance, why it makes sense to collapse red with pink or to run a negative-binomial regression.

It is the wrong response because when we p-hack, we self-servingly choose among justifiable analyses. P-hacked findings are by definition justifiable. Unjustifiable research practices involve incompetence or fraud, not p-hacking.

Showing an analysis is justifiable does not inform the question of whether it was p-hacked.

Right Response #1.  “We decided in advance”
P-hacking involves post-hoc selection of analyses to get p<.05. One way to address p-hacking concerns is to indicate that analysis decisions were made ex ante.

A good way to do this is to just say so: “We decided to collapse red & pink before running any analyses.”
A better way is with a more general and verifiable statement: “In all papers we collapse red & pink.”
An even better way is: “We preregistered that we would collapse red & pink in this study” (see related Colada[12]: “Preregistration: Not Just for the Empiro-Zealots”).

Right Response #2.  “We didn’t decide in advance, but the results are robust”
Often we don’t decide in advance. We don’t think of outliers till we see them. What to do then? Show that the results don’t hinge on how the problem is dealt with: show the results dropping >2SD, >2.5SD, >3SD, logging the dependent variable, comparing medians, and running a non-parametric test. If the conclusion is the same in most of these, tell the blogger to shut up.
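For concreteness, here is a minimal R sketch of those robustness checks. The data frame `d`, with a dependent variable y and a two-level factor cond, is hypothetical; adapt the names to your own data.

```r
# A minimal sketch of the robustness checks listed above (hypothetical variable names).
z <- as.vector(scale(d$y))                  # standardized DV, used for outlier cutoffs
for (k in c(2, 2.5, 3))                     # drop observations beyond k SDs
  print(t.test(y ~ cond, data = d[abs(z) <= k, ]))
print(t.test(log(y) ~ cond, data = d))      # log the dependent variable (assumes y > 0)
print(wilcox.test(y ~ cond, data = d))      # non-parametric (rank-based) test
```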

Right Response #3. “We didn’t decide in advance, and the results are not robust. So we run a direct replication.”
Sometimes the result will only be there if you drop >2SD and it will not have occurred to you to do so till you saw the p=.24 without it. One possibility is that you are chasing noise. Another possibility is that you are right. The one way to tell these two apart is with a new study. Run everything the same, exclude again based on >2SD.

If in your “replication” you now need a gender interaction for the >2SD exclusion to give you p<.05, it is not too late to read “False-Positive Psychology” (.html)

Cheers
If a blogger raises concerns of p-hacking, and you cannot provide any of the three responses above: buy the blogger a drink. She is probably right.


[28] Confidence Intervals Don’t Change How We Think about Data

Some journals are thinking of discouraging authors from reporting p-values and encouraging or even requiring them to report confidence intervals instead. Would our inferences be better, or even just different, if we reported confidence intervals instead of p-values?

One possibility is that researchers become less obsessed with the arbitrary significant/not-significant dichotomy. We start paying more attention to effect size. We start paying attention to precision. A step in the right direction.

Another possibility is that researchers forced to report confidence intervals will use them as if they were p-values and will only ask “Does the confidence interval include 0?” In this world confidence intervals are worse than p-values, because p=.012, p=.0002, p=.049 all become p<.05. Our analyses become more dichotomous. A step in the wrong direction.

How to test this?
To empirically assess the consequences of forcing researchers to replace p-values with confidence intervals we could randomly impose the requirement on some authors and see what happens.

That’s hard to pull off for a blog post.  Instead, I exploit a quirk in how “mediation analysis” is now reported in psychology. In particular, the statistical program everyone uses to run mediation reports confidence intervals rather than p-values.  How are researchers analyzing those confidence intervals?

Sample: 10 papers
I went to Web-of-Science and found the ten most recent JPSP articles (.html) citing the Preacher and Hayes (2004) article that provided the statistical programs that everyone runs (.pdf).

All ten of them used confidence intervals as dichotomous p-values; none discussed effect size or precision. None discussed the percentage of the effect that was mediated. One even accepted the null of no mediation because the confidence interval included 0 (it also included large effects).

 

[Figure 1]

This sample suggests confidence intervals do not change how we think of data.

If people don’t care about effect size here…
Unlike other effect-size estimates in the lab, effect-size in mediation is intrinsically valuable.

No one asks how much more hot sauce subjects pour for a confederate to consume after watching a film that made them angry, but we do ask how much of that effect is mediated by anger; ideally all of it.1

Change the question before you change the answer
If we want researchers to care about effect size and precision, then we have to persuade researchers that effect size and precision are important.

I have not been persuaded yet. Effect size matters outside the lab for sure, but in the lab it is not so clear. Our theories don’t make quantitative predictions, effect sizes in the lab are not particularly indicative of how important a phenomenon is outside the lab, and to study effect size with even moderate precision we need samples too big to plausibly be run in the lab (see Colada[20]).2

My talk at a recent conference (SESP) focused on how research questions should shape the statistical tools we choose to run and report. Here are the slides. (.pptx). This post is an extension of Slide #21.


  1. In practice we do not measure things perfectly, so going for 100% mediation is too ambitious []
  2. I do not have anything against reporting confidence intervals alongside p-values. They will probably be ignored by most readers, but a few will be happy to see them, and it is generally good to make people happy (Though it is worth pointing out that one can usually easily compute confidence intervals from test results).  Descriptive statistics more generally, e.g., means and SDs, should always be reported to catch errors, facilitate meta-analyses, and just generally better understand the results. []

[27] Thirty-somethings are Shrinking and Other U-Shaped Challenges

A recent Psych Science (.pdf) paper found that sports teams can perform worse when they have too much talent.

For example, in Study 3 they found that NBA teams with a higher percentage of talented players win more games, but that teams with the highest levels of talented players win fewer games.

The hypothesis is easy enough to articulate, but pause for a moment and ask yourself, “How would you test it?”

This post shows the most commonly used test is incorrect, and suggests a simple alternative.

What test would you run?
If you are like everyone we talked to over the last several weeks, you would run a quadratic regression (y = β0 + β1x + β2x²), check whether β2 is significant, and whether plotting the resulting equation yields the predicted u-shape.

We browsed a dozen or so papers testing u-shapes in economics and in psychology and that is also what they did.

That’s also what the Too-Much-Talent paper did. For instance, these are the results they report for the basketball and soccer studies: a fitted inverted u-shaped curve with a statistically significant x².1

[Figure 1: fitted inverted-u curves for the basketball and soccer studies]

Everybody is wrong
Relying on the quadratic is super problematic because it sees u-shapes everywhere, even in cases where a true u-shape is not present. For instance:

[Figure 2: significant quadratics fit to data with no true u-shape]

The source of the problem is that regressions work hard to get as close as possible to data (blue dots), but are indifferent to implied shapes.

A U-shaped relationship will (eventually) imply a significant quadratic, but a significant quadratic does not imply a U-shaped relationship.2

First, plot the raw data.
Figure 2 shows how plotting the data prevents obviously wrong answers. Plots, however, are necessary but not sufficient for good inferences. They may have too little or too much data, becoming Rorschach tests.3

[Figure 3: two raw-data plots; panel (b) is the key figure from Aghion et al. (QJE 2005)]

These charts are somewhat suggestive of a u-shape, but it is hard to tell whether the quadratic is just chasing noise. As social scientists interested in summarizing a mass of data, we want to write sentences like: “As predicted, the relationship was u-shaped, p=.002.”

Those charts don’t let us do that.

A super simple solution
When testing inverted u-shapes we want to assess whether:
At first more x leads to more y, but eventually more x leads to less y.

If that’s what we want to assess, maybe that’s what we should test. Here is an easy way to do that, one that builds on the quadratic regression everyone is already running.

1) Run the quadratic regression.
2) Find the point where the resulting u-shape maxes out.
3) Now run a linear regression up to that point, and another from that point onwards.
4) Test whether the second line is negative and significant.

More detailed step-by-step instructions (.html).4
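For readers who want to see the logic in code, here is a minimal R sketch of those four steps. It is not the step-by-step code linked above; the data frame `d` with columns x and y is hypothetical.

```r
# A minimal sketch of the two-lines approach described above (not the linked code).
two_lines <- function(d) {
  quad  <- lm(y ~ x + I(x^2), data = d)              # 1) quadratic regression
  b     <- coef(quad)
  x_max <- -b["x"] / (2 * b["I(x^2)"])               # 2) where the fitted curve maxes out
  left  <- lm(y ~ x, data = subset(d, x <= x_max))   # 3) line up to that point...
  right <- lm(y ~ x, data = subset(d, x >  x_max))   #    ...and another from that point on
  list(breakpoint = x_max,
       slope1 = summary(left)$coefficients["x", ],
       slope2 = summary(right)$coefficients["x", ])  # 4) is slope2 negative and significant?
}
```

For an inverted u-shape, the first slope should be significantly positive and the second significantly negative.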

One demonstration
We contacted the authors of the Too-Much-Talent paper and they proposed running the two-lines test on all three of their data sets. Aside: we think that’s totally great and admirable.
They emailed us the results of those analyses, and we all agreed to include their analyses in this post.
[Figure: two-lines tests on the Too-Much-Talent datasets (baseball, basketball, soccer)]

The paper had predicted and documented the lack of a u-shape for Baseball. The first figure is consistent with that result.

The paper had predicted and documented an inverted u-shape in Basketball and Soccer. The Basketball results are as predicted (first slope is positive, p<.001, second slope negative, p = .026). The Soccer results were more ambiguous (first slope is significantly positive, p<.001, but the second slope is not significant, p=.53).

The authors provided a detailed discussion of these and additional new analyses (.pdf).

We thank them for their openness, responsiveness, and valuable feedback.

Another demonstration
The most cited paper studying u-shapes we found (Aghion et al, QJE 2005, .pdf) examines the impact of competition on innovation.  Figure 3b above is the key figure in that paper. Here it is with two lines instead (STATA code .do; raw data .zip):

[Figure: Aghion et al. data with two fitted lines]

The second line is significantly negatively sloped, z=-3.75, p<.0001.

If you are like us, you think the p-value from that second line adds value to the eye-ball test of the published chart, and surely to the nondiagnostic p-value from the x² in the quadratic regression.

If you see a problem with the two lines, or know of a better solution, please email Uri and/or Leif.


  1. Talent was operationalized in soccer as belonging to a top-25 soccer team (e.g., Manchester United) and in basketball as being in the top third of the NBA in Estimated Wins Added (EWA); results were shown to be robust to defining the cutoff as top-20% or top-40%. []
  2. Lind and Mehlum (2010, .pdf), propose a way to formally test for the u-shape itself within a quadratic (and a few other specifications) and Miller et al (2013 .pdf) provide analytical techniques for calculating thresholds where effects differ from zero for quadratic models. However, these tools should only be utilized when the researcher is confident about functional form, for they can lead to mistaken inferences when the assumptions are wrong. For example, if applied to y=log(x), one would, for sufficiently dispersed x-es, incorrectly conclude the relationship has an inverted u-shape, when it obviously does not. We shared an early draft of this post with the authors of both methods papers and they provided valuable feedback already reflected in this longest of footnotes. []
  3. One could plot fitted nonparametric functions for these, via splines or kernel regressions, but the results are quite sensitive to researcher degrees-of-freedom (e.g., bandwidth choice, # of knots) and also do not provide a formal test of a functional form []
  4. We found one paper that implemented something similar to this approach: Ungemach et al, Psych Science, 2011, Study 2 (.pdf), though they identify the split point with theory rather than a quadratic regression. More generally, there are other ways to find the point where the two lines are split, and their relative performance is worth exploring.  []

[26] What If Games Were Shorter?

The smaller your sample, the less likely your evidence is to reveal the truth. You might already know this, but most people don’t (.pdf), or at least they don’t appropriately apply it (.pdf). (See, for example, nearly every inference ever made by anyone). My experience trying to teach this concept suggests that it’s best understood using concrete examples.

So let’s consider this question: What if sports games were shorter?

Most NFL football games feature a matchup between one team that is expected to win – the favorite – and one that is not – the underdog. A full-length NFL game consists of four 15-minute quarters.1 After four quarters, favorites outscore their underdog opponents about 63% of the time.2 Now what would happen to the favorites’ chances of winning if the games were shortened to 1, 2, or 3 quarters?

In this post, I’ll tell you what happens and then I’ll tell you what people think happens.

What If Sports Games Were Shorter?

I analyzed 1,008 games across four NFL seasons (2009-2012; data .xls). Because smaller samples are less likely to reveal true differences between the teams, the favorites’ chances of winning (vs. losing or being tied) increase as game length increases.3

Reality is more likely to deviate from true expectations when samples are smaller. We can see this again in an analysis of point differences. For each NFL game, well-calibrated oddsmakers predict how many points the favorite will win by. Plotting these expected point differences against actual point differences shows that the correspondence between expectation and reality strengthens with game length.

Sample sizes affect the likelihood that reality will deviate from an average expectation.

But sample sizes do not affect what our average expectation should be. If a coin is known to turn up heads 60% of the time, then, regardless of whether the coin will be flipped 10 times or 100,000 times, our best guess is that heads will turn up 60% of time. The error around 60% will be greater for 10 flips than for 100,000 flips, but the average expectation will remain constant.
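A quick simulation makes the coin example concrete (a sketch, not part of the original analysis):

```r
# The expected proportion of heads is 60% at every sample size,
# but the spread around 60% shrinks as the number of flips grows.
set.seed(1)
for (flips in c(10, 1000, 100000)) {
  props <- rbinom(10000, size = flips, prob = .6) / flips   # 10,000 simulated samples
  cat(flips, "flips: mean =", round(mean(props), 3),
      " SD =", round(sd(props), 4), "\n")
}
```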

To see this in the football data, I computed point differences after each quarter, and then scaled them to a full-length game. For example, if the favorite was up by 3 points after one quarter, I scaled that to a 12-point advantage after 4 quarters. We can plot the difference between expected and actual point differences after each quarter.

The dots are consistently near the red line on the above graph, indicating that the average outcome aligns with expectations regardless of game length. However, as the progressively decreasing error bars show, the deviation from expectation is greater for shorter games than for longer ones.

Do People Know This?

I asked MTurk NFL fans to consider an NFL game in which the favorite was expected to beat the underdog by 7 points in a full-length game. I elicited their beliefs about sample size in a few different ways (materials .pdf; data .xls).

Some were asked to give the probability that the better team would be winning, losing, or tied after 1, 2, 3, and 4 quarters. If you look at the average win probabilities, their judgments look smart.

But this graph is super misleading, because the fact that the average prediction is wise masks the fact that the average person is not. Of the 204 participants sampled, only 26% assigned the favorite a higher probability to win at 4 quarters than at 3 quarters than at 2 quarters than at 1 quarter. About 42% erroneously said, at least once, that the favorite’s chances of winning would be greater for a shorter game than for a longer game.

How good people are at this depends on how you ask the question, but no matter how you ask it they are not very good.

I asked 106 people to indicate whether shortening an NFL game from four quarters to two quarters would increase, decrease, or have no effect on the favorite’s chance of winning. And I asked 103 people to imagine NFL games that vary in length from 1 quarter to 4 quarters, and to indicate which length would give the favorite the best chance to win.

The modal participant believed that game length would not matter. Only 44% correctly said that shortening the game would reduce the favorite’s chances, and only 33% said that the favorite’s chances would be better after 4 quarters than after 3, 2, or 1.

Even though most people get this wrong there are ways to make the consequences of sample size more obvious. It is easy for students to realize that they have a better chance of beating LeBron James in basketball if the game ends after 1 point than after 10 points. They also know that an investment portfolio with one stock is riskier than one with ten stocks.

What they don’t easily see is that these specific examples reflect a general principle. Whether you want to know which candidate to hire, which investment to make, or which team to bet on, the smaller your sample, the less you know.


  1. If the game is tied, the teams play up to 15 additional minutes of overtime. []
  2. 7% of games are tied after four quarters, and, in my sample, favorites won 57% of those in overtime; thus favorites win about 67% of games overall []
  3. Note that it is not that the favorite is more likely to be losing after one quarter; it is more likely to be losing or tied. []

[25] Maybe people actually enjoy being alone with their thoughts

Recently Science published a paper concluding that people do not like sitting quietly by themselves (.html). The article received press coverage, that press coverage received blog coverage, which received twitter coverage, which received meaningful head-nodding coverage around my department. The bulk of that coverage (e.g., 1, 2, and 3) focused on the tenth study in the eleven-study article. In that study, lots of people preferred giving themselves electric shocks to being alone in a room (one guy shocked himself 190 times). I was more intrigued by the first nine studies, all of which were very similar to each other.1

Opposite inference
The reason I write this post is that upon analyzing the data for those studies, I arrived at an inference opposite the authors’. They write things like:

Participants typically did not enjoy spending 6 to 15 minutes in a room by themselves with nothing to do but think. (abstract)

It is surprisingly difficult to think in enjoyable ways even in the absence of competing external demands. (p.75, 2nd column)

The untutored mind does not like to be alone with itself (last phrase)

But the raw data point in the opposite direction: people reported enjoying thinking.

Three measures
In the studies, people sit in a room for a while and then answer a few questions when they leave, including how enjoyable, how boring, and how entertaining the thinking period was, on 1-9 scales (anchored at 1 = “not at all”, 5 = “somewhat”, 9 = “extremely”). Across the nine studies, 663 people rated the experience of thinking; the overall mean for these three variables was M=4.94, SD=1.83, not significantly different from 5, the midpoint of the scale, t(662)=.9, p=.36. The 95% confidence interval for the mean is tight, 4.8 to 5.1. Which is to say, people endorse the midpoint of the scale composite: “somewhat boring, somewhat entertaining, and somewhat enjoyable.”
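That comparison is just a one-sample t-test against the scale midpoint. A minimal sketch, assuming a vector `composite` holding each of the 663 participants’ mean rating across the three items (the name is hypothetical; the authors posted the raw data):

```r
# Test the composite rating against the scale midpoint of 5 (hypothetical vector name).
t.test(composite, mu = 5)   # yields the t(662), p-value, and 95% CI discussed above
```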

Five studies had means below the midpoint, four had means above it.

I see no empirical support for the core claim that “participants typically did not enjoy spending 6 to 15 minutes in a room by themselves.”2

Focusing on enjoyment
Because the paper’s inferences are about enjoyment I now focus on the question that directly measured enjoyment. It read “how much did you enjoy sitting in the room and thinking?” 1 = “not at all enjoyable” to 5 = “somewhat enjoyable” to 9 = “extremely enjoyable”. That’s it. OK, so what sort of pattern would you expect after reading “participants typically did not enjoy spending 6 to 15 minutes in a room by themselves with nothing to do but think.”?

Rather than entirely rely on your (or my) interpretations, I asked a group of people (N=50) to specifically estimate the distribution of responses that would lead to that claim.3 Here is what they guessed:

[Figure 1: the guessed distribution of responses]

And now, with that in mind, let’s take a look at the distribution that the authors observed on that measure:4

[Figure 2: the observed distribution of enjoyment ratings]

Out of 663 participants, MOST (69.6%) said that the experience was somewhat enjoyable or better.5

If I were trying out a new manipulation and wanted to ensure that participants typically DID enjoy it, I would be satisfied with the distribution above. I would infer people typically enjoy being alone in a room with nothing to do but think.

It is still interesting
The thing is, though that inference is rather directly in opposition to the authors’, it is not any less interesting. In fact, it highlights value in manipulations they mostly gloss over. In those initial studies, the authors try a number of manipulations which compare the basic control condition to one in which people were directed to fantasize during the thinking period. Despite strong and forceful manipulations (e.g., Participants chose and wrote about the details of activities that would be fun to think about, and then were told to spend the thinking period considering either those activities, or if they wanted, something that was more pleasant or entertaining), there were never any significant differences. People in the control condition enjoyed the experience just as much as the fantasy conditions.6 People already know how to enjoy their thoughts. Instructing them how to fantasize does not help. Finally, if readers think that the electric shock finding is interesting conditional on the (I think, erroneous) belief that it is not enjoyable to be alone in thought, then the finding is surely even more interesting if we instead take the data at face value: Some people choose to self-administer an electric shock despite enjoying sitting alone with their thoughts.

Authors’ response
Our policy at DataColada is to give drafts of our post to authors whose work we cover before posting, asking for feedback and providing an opportunity to comment. Tim Wilson was very responsive in providing feedback and suggesting changes to previous drafts. Furthermore, he offered the response below.

We thank Professor Nelson for his interest in our work and for offering to post a response.  Needless to say we disagree with Prof. Nelson’s characterization of our results, but because it took us a bit more than the allotted 150 words to explain why, we have posted our reply here.


  1. Excepting Study 8, for which I will consider only the control condition. Study 11 was a forecasting study. []
  2. The condition from Study 8 where people were asked to engage in external activities rather than think is, obviously, not included in this overall average. []
  3. I asked 50 mTurk workers to imagine that 100 people had tried a new experience and that their assessments were characterized as “participants typically did not enjoy the experience”. They then estimated, given that description, how many people responded with a 1, a 2, etc. Data. []
  4. The authors made all of their data publicly available. That is entirely fantastic and has made this continuing discussion possible. []
  5. The pattern is similar focusing on the subset of conditions with no other interventions. Out of 240 participants in the control conditions, 65% chose the midpoint or above. []
  6. OK, a caveat here to point out that the absence of statistical significance should not be interpreted as accepting the null. Nevertheless, with more than 600 participants, they really don’t find a hint of an effect, the confidence interval for the mean enjoyment is (4.8 to 5.1). Their fantasy manipulations might not be a true null, but they certainly are not producing a truly large effect. []

[24] P-curve vs. Excessive Significance Test

In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significant Test, with the (critically important) inferences one arrives at with p-curve.

The many-labs project is a collaboration of 36 labs around the world, each running a replication of 13 published effects in psychology (paper: pdf; data: xlsx).1

One of the most replicable effects was the Asian Disease problem, a demonstration of people being risk seeking for losses but risk averse for gains; it was p<.05 in 31 of 36 labs (we also replicated it  in Colada[11]).

Here I apply the Excessive Significance Test and p-curve to those 31 studies (summary table .xlsx).

How The Excessive Significance Test Works
It takes a set of studies (e.g., all studies in a paper) and asks whether too many are statistically significant. For example, say a paper has five studies, all p<.05. Imagine each obtained an effect size that would have given it 50% power. The probability that five out of five studies powered to 50% would all get p<.05 is .5*.5*.5*.5*.5=.03125. So we reject the null of full reporting, meaning that at least one null finding was not reported.
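The arithmetic behind that example takes only a couple of lines of R (a sketch of the logic, not the full test):

```r
# Probability that all k reported studies are significant, given each study's estimated power.
est_power <- rep(.50, 5)   # five studies, each estimated to have 50% power
prod(est_power)            # = .03125, so we reject the null of full reporting
```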

The excessive significance test was developed by Ioannidis and Trikalinos (.pdf). In psychology it has been popularized by Greg Francis (.html) and Ulrich Schimmack (html). I have twice been invited to publish commentaries on Francis’ use of the test: “It Does not Follow” (.pdf) and “It Really Just Does not Follow” (.pdf)

How p-curve Works
P-curve is a tool that assesses whether, after accounting for p-hacking and file-drawering, a set of statistically significant findings has evidential value. It looks at the distribution of p-values and asks whether that distribution is what we would expect of a set of true findings. In a nutshell, you see more low (e.g., p<.025) than high (e.g., p>.025) significant p-values when an effect is true (for details see www.p-curve.com).

Running both tests
The Excessive Significance Test takes the 31 studies that worked and spits out p=.03: rejecting the null that all studies were reported. It nails it. We know 5 studies were not “reported” and the test infers accordingly. (R Code)2

This inference is pointless for two reasons.

First, we always know the answer to the question of whether all studies were published. The answer is always “No.” Some people publish some null findings, but nobody publishes all null findings.

Second, it tells us about researcher behavior, not about the world, and we do science to learn about the world, not to learn about researcher behavior.

The question of interest is not “is there a null finding you are not telling me about?” The question of interest is “do these significant findings you are telling me about have truth value?”

P-curve takes the 31 studies and tells us that taken as a whole the studies do support the notion that gain vs loss framing has an effect on risk preferences.

[Figure: p-curve of the 31 significant Asian Disease studies]

The figure (generated with the online app) shows that consistent with a true effect, there are more low than high p-values among the 31 studies that worked.

The excessive significance test tells you only that the glass is not 100% full.
P-curve tells you whether it has enough water to quench your thirst.


  1. More data: https://osf.io/wx7ck/ []
  2. Ulrich Schimmack (.pdf) proposes a variation in how the test is conducted, computing power based on each individual effect size rather than pooling. When done this way, the Excessive Significance Test is also significant, p=.01; see R Code link above []

[23] Ceiling Effects and Replications

A recent failure to replicate led to an attention-grabbing debate in psychology.

As you may expect from university professors, some of it involved data.  As you may not expect from university professors, much of it involved saying mean things that would get a child sent to the principal’s office (.pdf).

The hostility in the debate has obscured an interesting empirical question. This post aims to answer that interesting empirical question.1

Ceiling effect
The replication (.pdf) was pre-registered; it was evaluated and approved by peers, including the original authors, before being run. The predicted effect was not obtained, in two separate replication studies.

The sole issue of contention regarding the data (.xlsx) is that nearly twice as many respondents gave the highest possible answer in the replication as in the original study (about 41% vs. about 23%). In a forthcoming commentary (.pdf), the original author proposes a “ceiling effect” explanation: it is hard to increase something that is already very high.

I re-analyzed the original and replication data to assess this sensible concern.
My read is that the evidence is greatly inconsistent with the ceiling effect explanation.

The experiments
In the original paper (.pdf), participants rated six “dilemmas” involving moral judgments (e.g., How wrong  is it to keep money found in a lost wallet?). These judgments were predicted to become less harsh for people primed with cleanliness (Study 1) or who just washed their hands (Study 2).

The new analysis
In a paper with Joe and Leif (SSRN), we showed that a prominent failure to replicate in economics was invalidated by a ceiling effect. I use the same key analysis here.2

It consists of going beyond comparing means, examining instead all observations. The stylized figures below give the intuition. They plot the cumulative percentage of observations for each value of the dependent variable.

The first shows an effect across the board: there is a gap between the curves throughout.
The third shows the absence of an effect: the curves perfectly overlap.

[Figure: three stylized cumulative-percentage plots]
The middle figure captures what a ceiling effect looks like. All values above 2 were brought down to 2, so the lines overlap there, but below the ceiling the gap is still easy to notice.
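Plots like these are easy to produce. A minimal ggplot2 sketch, assuming a data frame `d` with a rating column and a condition column (hypothetical names, not the original analysis script):

```r
# Cumulative percentage of observations at each value of the DV, one line per condition.
library(ggplot2)
ggplot(d, aes(rating, colour = condition)) +
  stat_ecdf(geom = "step") +
  scale_y_continuous("Cumulative % of observations", labels = scales::percent)
```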

Let’s now look at real data. Study 1 first:3
[Figures: Study 1, original and replication]
It is easy to spot the effect in the original data.
It is just as easy to spot the absence of an effect in the replication.

Study 2 is more compelling:
[Figures: Study 2, original and replication]

In the Original the effect is largest in the 4-6 range. In the Replication about 60% of the data is in that range, far from the ceiling of 7. But still there is no gap between the lines.

Ceiling analysis by original author
In her forthcoming commentary (.pdf), effect size is computed as a percentage and shown to be smaller in scenarios with higher baseline levels (see her Figure 1). This is interpreted as evidence of a ceiling effect.
I don’t think that’s right.

Dividing something by increasingly larger numbers leads to increasingly smaller ratios, with or without a ceiling. Imagine the effect were constant, completely unaffected by ceiling effects; say a 1-point increase in the morality scale in every scenario. This constant effect would be a smaller % in scenarios with a larger baseline: going from 2 to 3 is a 50% increase, whereas going from 9 to 10 is only an 11% increase.4

If a store-owner gives you $5 off any item, buying a $25 calculator gets you a 20% discount, buying a $100 jacket gets you only a 5% discount. But there is no ceiling, you are getting $5 in both cases.

To eliminate the arithmetic confound, I redid this analysis with effect size defined as the difference of means, rather than %, and there was no association between effect size and share of answers at boundary across scenarios (see calculations, .xlsx).

Ceiling analysis by replicators
In their rejoinder (.pdf), the replicators counter by dropping all observations at the ceiling and showing the results are still not significant.
I don’t think that’s right either.

Dropping observations at the boundary lowers power whether there is a ceiling effect or not, by a lot.  In simulations, I saw drops of 30% and more, say from 50% to 20% power (R Code). So not getting an effect this way does not support the absence of a ceiling effect problem.

Tobit
To formally take ceiling effects into account one can use the Tobit model (common in economics for censored data, see Wikipedia). A feature of this approach is that it allows analyzing the data at the scenario level, where the ceiling effect would actually be happening. I ran Tobits on all datasets. The replications still had tiny effect sizes (<1/20th the size of the original), with p-values > .8 (STATA code).5
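The post links STATA code; for R users, a minimal sketch of the same idea, treating responses at the scale maximum of 7 as right-censored, might look like this. The data frame `d` and its column names are hypothetical, and this simple version ignores the scenario-level nesting described in footnote 5.

```r
# Tobit regression with right-censoring at the scale ceiling of 7 (hypothetical names).
library(AER)                                   # tobit() is a wrapper around survreg()
m <- tobit(rating ~ condition, right = 7, data = d)
summary(m)                                     # condition coefficient and its p-value
```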

Authors’ response
Our policy at DataColada is to give drafts of our post to authors whose work we cover before posting, asking for feedback and providing an opportunity to comment. This causes delays (see footnote 1) but avoids misunderstandings.

The replication authors, Brent Donnellan, Felix Cheung, and David Johnson, suggested minor modifications to analyses and writing. They are reflected in the version you just read.

The original author, Simone Schnall, suggested a few edits also, and asked me to include this comment from her:

Your analysis still does not acknowledge the key fact: There are significantly more extreme scores in the replication data (38.5% in Study 1, and 44.0% in Study 2) than in the original data. The Tobin analysis is a model-based calculation and makes certain assumptions; it is not based on the empirical data. In the presence of so many extreme scores a null result remains inconclusive.

 


  1. This blogpost was drafted on Thursday May 29th and was sent to original and replication authors for feedback, offering also an opportunity to comment. The dialogue with Simone Schnall lasted until June 3rd, which is why it appears only today. In the interim Tal Yarkoni and Yoel Inbar, among others, posted their own independent analyses. []
  2. Actually, in that paper it was a floor effect []
  3. The x-axis on these graphs had a typo that we were alerted to by Alex Perrone in August, 2014. The current version is correct []
  4. She actually divides by the share of observations at ceiling, but the same intuition and arithmetic apply. []
  5. I treat the experiment as nested, with 6 repeated-measures for each participant, one per scenario []

[22] You know what’s on our shopping list

As part of an ongoing project with Minah Jung, a nearly perfect doctoral student, we asked people to estimate the percentage of people who bought some common items in their last trip to the supermarket. For each of 18 items, we simply asked people (N = 397) to report whether they had bought it on their last trip to the store and also to estimate the percentage of other people who bought it1.

Take a sample item: Laundry Detergent. Did you buy laundry detergent the last time you went to the store? What percentage of other people2 do you think purchased laundry detergent? The correct answer is that 42% of people bought laundry detergent. If you’re like me, you see that number and say, “that’s crazy, no one buys laundry detergent.” If you’re like Minah, you say, “that’s crazy, everyone buys laundry detergent.” Minah had just bought laundry detergent, whereas I had not. Our biases are shared by others. People who bought detergent thought that 69% of others bought detergent whereas non-buyers thought that number was only 29%. Those are really different. We heavily emphasize our own behavior when estimating the behavior of others3.
[Figure 1: detergent estimates by buyers vs. non-buyers]
That effect, generally referred to as the false consensus effect (see classic paper .pdf), extends beyond estimates of detergent purchase likelihoods. All of the items (e.g., milk, crackers, etc.) showed a similar effect. The scatterplot below shows estimates for each of the products. The x-axis is the actual percentage of purchasers and the y-axis reports estimated percentages (so the identity line would be a perfectly accurate estimate).
[Figure 2: estimated vs. actual purchase rates for each product, split by buyers and non-buyers]
For every single product, buyers gave a higher estimate than non-buyers; the false consensus effect is quite robust. People are biased. But a second observation gets its own chart. What happens if you just average the estimates from everyone?
[Figure 3: average estimates vs. actual purchase rates]
That is a correlation of r = .95.

As a judgment and decision making researcher, one of my tasks is to identify idiosyncratic shortcomings in human thinking (e.g., the false consensus effect). Nevertheless, under the right circumstances, I can be entranced by accuracy. In this case, I marvel at the wisdom of crowds. Every person has a ton of error (e.g., “I have no idea whether you bought detergent”) and a solid amount of bias (e.g., “but since I didn’t buy detergent, you probably didn’t either.”). When we put all of that together, the error and the bias cancel out. What’s left over is astonishing amounts of signal.
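Both observations fall out of a couple of grouped means. A minimal sketch, assuming a long-format data frame `d` with one row per person-item and columns item, bought (0/1), and estimate (0-100); the names are hypothetical.

```r
# Per-item comparison of buyers' vs. non-buyers' estimates, plus the crowd average.
library(dplyr)
by_item <- d %>%
  group_by(item) %>%
  summarise(actual     = 100 * mean(bought),            # true purchase rate
            est_buyers = mean(estimate[bought == 1]),    # false-consensus gap:
            est_nonbuy = mean(estimate[bought == 0]),    #   buyers > non-buyers
            est_all    = mean(estimate))                 # everyone's average estimate
cor(by_item$est_all, by_item$actual)                     # the wisdom-of-crowds correlation
```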

Minah and I could cheerfully use the same data to write one of two papers. The first could use a pervasive judgmental bias (18 out of 18 products show the effect!) to highlight the limitations of human thinking. A second paper could use the correlation (.95!) to highlight the efficiency of human thinking. Fortunately, this is a blog post, so I get to comfortably write about both.

Sometimes, even with judgmental shortcomings in the individual, there is still judgmental genius in the many.


  1. Truth be told, it was ever so slightly more complicated. We asked half the people to talk about purchases from their next shopping trip. To first approximation there are no differences between these conditions, so for the simplicity of verb tense I refer to the past. []
  2. “Other people” was articulated as “other people who are also answering this question on mTurk.” []
  3. In fact, you might recall from Colada[16] that Joe is rather publicly prone to this error. []

[21] Fake-Data Colada

Recently, a psychology paper (.pdf) was flagged as possibly fraudulent based on statistical analyses (.pdf). The author defended his paper (.html), but the university committee investigating misconduct concluded it had occurred (.pdf).

In this post we present new and more intuitive versions of the analyses that flagged the paper as possibly fraudulent. We then rule out p-hacking among other benign explanations.

Excessive linearity
The whistleblowing report pointed out the suspicious paper had excessively linear results.
That sounds more technical than it is.

Imagine comparing the heights of kids in first, second, and third grade, with the hypothesis that higher grades have taller children. You get samples of n=20 kids in each grade, finding average heights of 120 cm, 126 cm, and 130 cm. That’s almost a perfectly linear pattern: 2nd graders [126] are almost exactly between the other two groups [mean(120,130)=125].

The scrutinized paper has 12 studies with three conditions each. The Control was too close to the midpoint of the other two in all of them. It is not suspicious for the true effect to be linear. Nothing wrong with 2nd graders being 125 cm tall. But, real data are noisy, so even if the effect is truly and perfectly linear, small samples of 2nd graders won’t average 125 every time.

Our new analysis of excessive linearity
The original report estimated a less than 1 in 179 million chance that a single paper with 12 studies would lead to such perfectly linear results. Their approach was elegant (subjecting results from two F-tests to a third F-test) but a bit technical for the uninitiated.

We did two things differently:
(1) Created a more intuitive measure of linearity, and
(2) Ran simulations instead of relying on F-distributions.

Intuitive measure of linearity
For each study, we calculated how far the Control condition was from the midpoint of the other two. So if in one study the means were: Low=0, Control=61, High=100, our measure compares the midpoint, 50, to the 61 from the Control, and notes they differ by 11% of the High-Low distance.1

Across the 12 studies, the Control conditions were on average just 2.3% away from the midpoint. We ran simulations to see how extreme that 2.3% was.

Simulations
We drew samples from populations with means and standard deviations equal to those reported in the suspicious paper. Our simulated variables were discrete and bounded, as in the paper, and we assumed that the true mean of the Control was exactly midway between the other two.2 We gave the reported data every benefit of the doubt.
(see R Code)
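To make the procedure concrete, here is a stripped-down sketch of that simulation. It is not the posted R code: the means, SD, and scale bounds below are placeholders rather than the paper's values, and it runs 10,000 rather than 100,000 papers to keep it quick.

```r
# Measure from footnote 1: how far the Control mean is from the High-Low midpoint.
linearity <- function(low, control, high)
  abs(((high + low) / 2 - control) / (high - low))

one_study <- function(n = 20, mu = c(2, 3.5, 5), sd = 1.5, bounds = c(1, 7)) {
  m <- sapply(mu, function(true_mean) {
    x <- round(rnorm(n, true_mean, sd))                 # discrete, as in the paper
    mean(pmin(pmax(x, bounds[1]), bounds[2]))           # bounded, as in the paper
  })
  linearity(m[1], m[2], m[3])                           # one study's % distance
}

set.seed(1)
sims <- replicate(1e4, mean(replicate(12, one_study())))  # average across 12 studies
mean(sims <= .023)                                        # papers as linear as the observed 2.3%
```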

Results
Recall that in the suspicious paper the Control was off by just 2.3% from the midpoint of the other two conditions. How often did we observe such a perfectly linear result in our 100,000 simulations?

Never.

[Figure: distribution of the average distance from the midpoint across 100,000 simulated papers]

In real life, studies need to be p<.05 to be published. Could that explain it?

We redid the above chart including only the 45% of simulated papers in which all 12 studies were p<.05. The results changed so little that to save space we put the (almost identical) chart here.


A second witness. Excessive similarity across studies
The original report also noted very similar effect sizes across studies.
The results reported in the suspicious paper convey this:
[Figure: F-statistics reported across the 12 studies]

The F-values are not just surprisingly large, they are also surprisingly stable across studies.
Just how unlikely is that?

We computed the simplest measure of similarity we could think of: the standard deviation of F() across the 12 studies. In the suspicious paper, see figure above, SD(F)=SD(8.93, 9.15, 10.02…)=.866. We then computed SD(F) for each of the simulated papers.

How often did we observe such extreme similarity in our 100,000 simulations?

Never.

[Figure: distribution of SD(F) across 100,000 simulated papers]

Two red flags
For each simulated paper we have two measures of excessive similarity: “Control is too close to High-Low midpoint” and “SD of F-values.” These proved uncorrelated in our simulations (r = .004), so they provide independent evidence of aberrant results; we have a conceptual replication of “these data are not real.”3

Alternative explanations
1.  Repeat subjects?
Some have speculated that perhaps some participants took part in more than one of the studies. Because of random assignment to condition, that wouldn’t help explain consistency in differences across conditions in different studies. Possibly it would make things worse; repeat participants would increase variability, as studies would differ in the mixture of experienced and inexperienced participants.

2. Recycled controls?
Others have speculated that perhaps the same control condition was used in multiple studies. But controls were different across studies; e.g., Study 2 involved listening to poems, Study 1 seeing letters.

3. Innocent copy-paste error?
Recent scandals in economics (.html) and medicine (.html) have involved copy-pasting errors before running analyses. Here so many separate experiments are involved, with the same odd patterns, that unintentional error seems implausible.

4. P-hacking?
To p-hack you need to drop participants, measures, or conditions.  The studies have the same dependent variables, parallel manipulations, same sample sizes and analysis. There is no room for selective reporting.

In addition, p-hacking leads to p-values just south of .05 (see our p-curve paper, SSRN). All p-values in the paper are smaller than p=.0008.  P-hacked findings do not reliably get this pedigree of p-values.

Actually, with n=20, not even real effects do.


  1. The measure = |((High+Low)/2 − Control)/(High−Low)| []
  2. Thus, we don’t use the reported Control mean; our analysis is much more conservative than that []
  3. Note that the SD(F) simulation is not under the null that the F-values are the same, but rather, under the null that the Control is the midpoint. We also carried out 100,000 simulations under this other null and also never got SD(F) that small []

[20] We cannot afford to study effect size in the lab

Methods people often say  – in textbooks, task forces, papers, editorials, over coffee, in their sleep – that we should focus more on estimating effect sizes rather than testing for significance.

I am kind of a methods person, and I am kind of going to say the opposite.

Only kind of the opposite because it is not that we shouldn’t try to estimate effect sizes; it is that, in the lab, we can’t afford to.

The sample sizes needed to estimate effect sizes are too big for most researchers most of the time.

With n=20, forget it
The median sample size in published studies in Psychology is about n=20 per cell.1 There have been many calls over the last few decades to report and discuss effect size in experiments. Does it make sense to push for effect size reporting when we run small samples? I don’t see how.

Arguably the lowest bar for claiming to care about effect size is to distinguish among Small, Medium, and Large effects. And with n=20 we can’t do even that.

Cheatsheet: I use Cohen’s d to index effect size. d is by how many standard deviations the means differ. Small is d=.2, Medium d=.5 and Large d=.8.

The figure below shows 95% confidence intervals surrounding Small, Medium and Large estimates when n=20 (see simple R Code).

[Figure: 95% confidence intervals around Small, Medium, and Large estimates with n=20 per cell]

Whatever effect we get, we will not be able to rule out effects of a different qualitative size.

Four-digit n’s
It is easy to bash n=20 (please do it often). But just how big an n do we need to study effect size?

I am about to show that the answer has four-digits.

It will be rhetorically useful to consider a specific effect size. Let’s go with d=.5. You need n=64 per cell to detect this effect 80% of the time.

If you run the study with n=64, then you will get a confidence interval that will not include zero 80% of the time, but if your estimate is right on the money at d=.5, that confidence interval still will include effects smaller than Small (d<.2) and larger than Large (d>.8). So n=64 is fine for testing whether the effect exists, but not for estimating its size.

Properly powered studies teach you almost nothing about effect size.2

[Figure: expected 95% confidence intervals for d=.5]
What if we go the extra mile, or three, and power it to 99.9%, running n=205 per cell? This study will almost always produce a significant effect, yet the expected confidence interval is massive, spanning a basically small effect (d=.3) to a basically large effect (d=.7).

To get the kind of confidence interval that actually gives confidence regarding effect size, one that spans say ±0.1, we need n=3000 per cell. THREE-THOUSAND (see simple R Code).3
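A back-of-the-envelope version of that calculation, using a common large-sample approximation to the standard error of Cohen’s d (the post’s linked code may use a different method, so treat this as a sketch):

```r
# Approximate 95% CI for d with two independent groups of n each.
ci_d <- function(d, n) {
  se <- sqrt(2 / n + d^2 / (4 * n))          # SE(d) when n1 = n2 = n
  round(c(lower = d - 1.96 * se, upper = d + 1.96 * se), 2)
}
rbind(n20 = ci_d(.5, 20), n64 = ci_d(.5, 64), n205 = ci_d(.5, 205),
      n1000 = ci_d(.5, 1000), n3000 = ci_d(.5, 3000))
```

With this approximation, only around n=3000 does the interval narrow to roughly .45 to .55.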

In the lab, four-digit per-cell sample sizes are not affordable.

Advocating a focus on effect size estimation, then, implies advocating for either:
1) Leaving the lab (e.g., mTurk, archival data).4
2) Running within-subject designs.

Some may argue effect size is so important we ought to do these things.
But that’s a case to be made, not an implication to be ignored.

UPDATE 2014 05 08: A commentary on this post is available here


  1. Based on the degrees of freedom reported in thousands of test statistics I scraped from Psych Science and JPSP []
  2. Unless you properly power for a trivially small effect by running a gigantic sample []
  3. If you run n=1000 the expected confidence interval spans d=.41 and d=.59 []
  4. One way to get big samples is to combine many small samples. Whether one should focus on effect size in meta-analysis is not something that seems controversial enough to be interesting to discuss []