[61] Why p-curve excludes ps>.05

In a recent working paper, Carter et al. (.pdf) proposed that one can better correct for publication bias by including not just p<.05 results, the way p-curve does, but also p>.05 results [1]. Their paper, currently under review, aimed to provide a comprehensive simulation study that compared a variety of bias-correction methods for meta-analysis.

Although the paper is well written and timely, the advice is problematic. Incorporating non-significant results into a tool designed to correct for publication bias requires one to make assumptions about how difficult it is to publish each possible non-significant result. For example, one has to make assumptions about how much more likely an author is to publish a p=.051 than a p=.076, or a p=.09 in the wrong direction than a p=.19 in the right direction, etc. If the assumptions are even slightly wrong, the tool’s performance becomes disastrous [2].

Assumptions and p>.05s
The desire to include p>.05 results in p-curve type analyses is understandable. Doing so would increase our sample sizes (of studies), rendering our estimates more precise. Moreover, we may be intrinsically interested in learning about studies that did not get to p<.05.

So why didn’t we do that when we developed p-curve? Because we wanted a tool that would work well in the real world.  We developed a good tool, because the perfect tool is unattainable.

While we know that the published literature generally does not discriminate among p<.05 results (e.g., p=.01 is not perceptibly easier to publish than is p=.02), we don’t know how much easier it is to publish some non-significant results rather than others.

The downside of p-curve focusing only on p<.05 is that p-curve can “only” tell us about the (large) subset of published results that are statistically significant. The upside is that p-curve actually works.

All p>.05 are not created equal
The simulations reported by Carter et al. assume that all p>.05 findings are equally likely to be published: a p=.051 in the right direction is as likely to be published as a p=.051 in the wrong direction. A p=.07 in the right direction is as likely to be published as a p=.97 in the right direction. If this does not sound implausible to you, we recommend re-reading this paragraph.

Intuitively it is easy to see how getting this assumption wrong will introduce bias. “Imagine” that a p=.06 is easier to publish than is a p=.76. A tool that assumes both results are equally likely to be published will be naively impressed when it sees many more p=.06s than p=.76s, and it will fallaciously conclude there is evidential value when there isn’t any.
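
To make the intuition concrete, here is a minimal simulation sketch (our own illustration for this post, not Carter et al.'s code, and not the calibration reported below). It generates studies of a truly nonexistent effect, publishes every p<.05 result, and publishes p>.05 results with a probability that, plausibly, declines with the p-value and favors the predicted direction. The selection rule and all of its numbers are made up.

```r
# Sketch: what do *published* nonsignificant p-values look like under a
# plausible (made-up) selection rule? True effect is zero throughout.
set.seed(123)
k <- 20000                       # studies attempted
n <- 50                          # observations per cell
x <- matrix(rnorm(k * n), k, n)  # control cells
y <- matrix(rnorm(k * n), k, n)  # treatment cells
se     <- sqrt(apply(x, 1, var) / n + apply(y, 1, var) / n)
t_stat <- (rowMeans(y) - rowMeans(x)) / se
p_two  <- 2 * pt(-abs(t_stat), df = 2 * n - 2)   # approximate two-sided p-values

right_dir <- t_stat > 0
pub_prob  <- ifelse(p_two < .05, 1,                      # significant: always published
             ifelse(right_dir, .5 * (1 - p_two), .02))   # nonsignificant: sliding scale
published <- runif(k) < pub_prob

hist(p_two[published & p_two > .05], breaks = 20,
     xlab = "two-sided p-value",
     main = "Published p > .05 under one plausible selection rule")
# Far from uniform: the published nonsignificant p-values pile up just above
# .05, which a method assuming they are all equally publishable will misread.
```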

A calibration
We ran simulations matching one of the setups considered by Carter et al., and assessed what happens if the publishability of p>.05 results deviated from their assumptions (R Code). The black bar in the figure below shows that if their fantastical assumption were true, the tool would do well, producing a false-positive rate of 5%. The other bars show that under some (slightly) more realistic circumstances, false-positives abound.

One must exclude p>.05
It is obviously not true that all p>.05s are equally publishable. But no alternative assumption is plausible. The mechanisms that influence the publication of p>.05 results are too unknowable, complex, and unstable from paper to paper, to allow one to make sensible assumptions or generate reasonable estimates. The probability of publication depends on the research question, on the authors’ and editors’ idiosyncratic beliefs and standards, on how strong other results in the paper are, on how important the finding is for the paper’s thesis, etc.  Moreover, comparing the 2nd and 3rd bar in the graph above, we see that even minor quantitative differences in a face-valid assumption make a huge difference.

P-curve is not perfect. But it makes minor and sensible assumptions, and it is robust to realistic deviations from them. Specifically, it assumes that all p<.05 results are equally publishable regardless of their exact p-value. This captures how most researchers perceive publication bias to occur (at least in psychology). Its inferences about evidential value are robust to relatively large deviations from this assumption: even if researchers start aiming for p<.045 instead of p<.05, or even p<.035, or even p<.025, p-curve analysis, as implemented in the online app (.htm), will falsely conclude there is evidential value when the null is true no more than 5% of the time (see our “Better P-Curves” paper (SSRN)).

Conclusion
With p-curve we can determine whether a set of p<.05 results have evidential value, and what effect we may expect in a direct replication of those studies.  Those are not the only questions you may want to ask. For example, traditional meta-analysis tools ask what is the average effect of all of the studies that one could possibly run (whatever that means; see Colada[33]), not just those you observe. P-curve does not answer that question. Then again, no existing tool does. At least not even remotely accurately.

P-curve tells you “only” this: If I were to run these statistically significant studies again, what should I expect?



Author feedback.
We shared a draft of this post with Evan Carter, Felix Schönbrodt, Joe Hilgard and Will Gervais. We had an incredibly constructive and valuable discussion, sharing R Code back and forth and jointly editing segments of the post.

We made minor edits after posting, responding to readers’ feedback. The original version is archived here (.htm).



Footnotes.

  1. When p-curve is applied to estimate effect size, it is extremely similar to the “one-parameter selection model” of Hedges 1984 (.pdf). []
  2. Their paper is nuanced in many sections, but their recommendations are not. For example, they write in the abstract, “we generally recommend that meta-analysis of data in psychology use the three-parameter selection model.” []

[49] P-Curve Won’t Do Your Laundry, But Will Identify Replicable Findings

In a recent critique, Bruns and Ioannidis (PlosONE 2016 .pdf) proposed that p-curve makes mistakes when analyzing studies that have collected field/observational data. They write that in such cases:

"p-curves based on true effects and p-curves based on null-effects with p-hacking cannot be reliably distinguished" (abstract).

In this post we show, with examples involving sex, guns, and the supreme court, that the statement is incorrect. P-curve does reliably distinguish between null effects and non-null effects. The observational nature of the data isn’t relevant.

The erroneous conclusion seems to arise from their imprecise use of terminology. Bruns & Ioannidis treat a false-positive finding and a confounded finding as the same thing.  But they are different things. The distinction is as straightforward as it is important.

Confound vs False-positive.
We present examples to clarify the distinction, but first let’s speak conceptually.

A Confounded effect of X on Y is real, but the association arises because another (omitted) variable causes both X and Y. A new study of X on Y is expected to find that association again.

A False-positive effect of X on Y, in contrast, is not real. The apparent association between X and Y is entirely the result of sampling error. A new study of X on Y is not expected to find an association again.

Confounded effects are real and replicable, while false-positive effects are neither. Those are big differences, but Bruns & Ioannidis conflate them. For instance, they write:

"the estimated effect size may be different from zero due to an omitted-variable bias rather than due to a true effect." (p. 3; emphasis added).

Omitted-variable bias does not make a relationship untrue; it makes it un-causal.

This is not just semantics, nor merely a discussion of “what do you mean by a true effect?”
We can learn something from examining replicable effects further (e.g., learn if there is a confound and what it is; confounds are sometimes interesting). We cannot learn anything from examining non-replicable effects further.

This critical distinction between replicable and non-replicable effects  can be informed by p-curve. Replicable results, whether causal or not, lead to right-skewed p-curves. False-positive, non-replicable effects lead to flat or left-skewed p-curves.

Causality
P-curve’s inability to distinguish causal vs. confounded relationships is no more of a shortcoming than is its inability to fold laundry or file income tax returns. Identifying causal relationships is not something we can reasonably expect any statistical test to do [1].

When researchers try to assess causality through techniques such as instrumental variables, regression discontinuity, or randomized field experiments, they do so via superior designs, not via superior statistical tests. The Z, t, and F tests reported in papers that credibly establish causality are the same tests as those reported in papers that do not.

Correlation is not causation. Confusing the two is human error, not tool error.

To make things concrete we provide two examples. Both use General Social Survey (GSS) data, which is, of course, observational data.

Example 1. Shotguns and female partners (Confound)
With the full GSS, we identified the following confounded association: Shotgun owners report having had 1.9 more female sexual partners, on average, than do non-owners, t(14824)=10.58, p<.0001.  The omitted variable is gender.

33% of Male respondents report owning a shotgun, whereas, um, ‘only’ 19% of Women do.

Males, relative to females, also report having had a greater number of sexual encounters with females (means of 9.82 vs. 0.21).

Moreover, controlling for gender, the effect goes away (t(14823)=.68, p=.496) [2].

So the relationship is confounded. It is real but not causal. Let’s see what p-curve thinks of it. We use data from 1994 as the focal study, and create a p-curve using data from previous years (1989-1993) following a procedure similar to Bruns and Ioannidis (2016)  [3]. Panel A in Figure 1 shows the resulting right-skewed p-curve. It suggests the finding should replicate in subsequent years. Panel B shows that it does.

[Figure 1] R Code to reproduce this figure: https://osf.io/v4spq/
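
For readers who want to see the logic without downloading the GSS, here is a minimal self-contained sketch (ours; it is not the GSS and not the code behind Figure 1, and every number in it is made up to mimic the structure of the example). Gender drives both shotgun ownership and the number of partners, so the association is real and replicable, and a p-curve built from significant results in random subsamples, the procedure described in footnote 3, comes out right-skewed.

```r
# Fabricated data with the same structure as the example: a confounded,
# real, replicable association (all numbers invented for illustration).
set.seed(1)
N <- 15000
male     <- rbinom(N, 1, .45)
shotgun  <- rbinom(N, 1, ifelse(male == 1, .33, .19))  # gender -> shotgun ownership
partners <- rpois(N, ifelse(male == 1, 3, 0.5))        # gender -> number of partners

summary(lm(partners ~ shotgun))          # confounded association: highly significant
summary(lm(partners ~ shotgun + male))   # controlling for gender, it goes away

# Footnote 3's procedure, in miniature: p-curve of significant results from
# random subsamples (subsamples kept modest so the curve has visible spread).
ps <- replicate(1000, {
  i <- sample(N, 300)
  coef(summary(lm(partners[i] ~ shotgun[i])))[2, 4]   # p-value on the shotgun dummy
})
hist(ps[ps < .05], breaks = seq(0, .05, .01),
     xlab = "significant p-values",
     main = "p-curve of a confounded (real) effect: right-skewed")
```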

Example 2. Random numbers and the Supreme Court (false-positive)
With observational data it’s hard to identify exactly zero effects because there is always the risk of omitted variables, selection bias, long and difficult-to-understand causal chains, etc.

To create a definitely false-positive finding we started with a predictor that could not possibly be expected to truly correlate with any variable: whether the random respondent ID was odd vs. even.

We then p-hacked an effect by running t-tests on every other variable in the 1994 GSS dataset for odd vs. even participants, arriving at 36 false-positive ps<.05. For its amusement value, we focused on the question asking participants how much confidence they have in the U.S. Supreme Court (1: a great deal, 2: only some, 3: hardly any).
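
That hunt is easy to reproduce with made-up data (a sketch of the idea, not the GSS analysis itself): pit an arbitrary odd/even indicator against many unrelated variables and keep whatever crosses p<.05.

```r
# Sketch: hunting for false positives with an arbitrary predictor
# (fabricated variables standing in for the GSS items).
set.seed(2)
N      <- 3000
odd_id <- rbinom(N, 1, .5)                               # "odd vs. even respondent ID"
items  <- as.data.frame(matrix(rnorm(N * 500), N, 500))  # 500 unrelated survey items

pvals <- sapply(items, function(v) t.test(v ~ odd_id)$p.value)
sum(pvals < .05)   # about 5% (~25 items) come out 'significant' by chance alone
```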

Panel C in Figure 1 shows that, following the same procedure as for the previous example, the p-curve for this finding is flat, suggesting that the finding would not replicate in subsequent years. Panel D shows that it does not. Figure 1 demonstrates how p-curve successfully distinguishes between statistically significant studies that are vs. are not expected to replicate.

Punchline: p-curve can distinguish replicable from non-replicable findings. To distinguish correlational from causal findings, call an expert.

Note: this is a blog-adapted version of a formal reply we wrote and submitted to PlosONE, but since 2 months have passed and they have not sent it out to reviewers yet, we decided to Colada it and hope someday PlosONE generously decides to send our paper out for review.


Author feedback.
Our policy is to contact authors whose work we discuss to request feedback and give an opportunity to respond within our original post. We contacted Stephan Bruns and John Ioannidis. They didn’t object to our distinction between confounded and false-positive findings, but propose that “the ability of ‘experts’ to identify confounding is close to non-existent.” See their full 3-page response (.pdf).

 




Footnotes.

  1. For what it’s worth, we have acknowledged this in prior work. For example, in Simonsohn, Nelson, and Simmons (2014, p. 535) we wrote, “Just as an individual finding may be statistically significant even if the theory it tests is incorrect— because the study is flawed (e.g., due to confounds, demand effects, etc.)—a set of studies investigating incorrect theories may nevertheless contain evidential value precisely because that set of studies is flawed” (emphasis added). []
  2. We are not claiming, of course, that the residual effect is exactly zero. That’s untestable. []
  3. In particular, we generated random subsamples (of the size of the 1994 sample), re-ran the regression predicting number of female sexual partners with the shotgun ownership dummy, and constructed a p-curve for the subset of statistically significant results that were obtained. This procedure is not really necessary. Once we know the effect size and sample size we know the non-centrality parameter of the distribution for the test-statistic and can compute expected p-curves without simulations (see Supplement 1 in Simonsohn et al., 2014), but we did our best to follow the procedures by Bruns and Ioannidis. []

[45] Ambitious P-Hacking and P-Curve 4.0

In this post, we first consider how plausible it is for researchers to engage in more ambitious p-hacking (i.e., past the nominal significance level of p<.05). Then, we describe how we have modified p-curve (see app 4.0) to deal with this possibility.

Ambitious p-hacking is hard.
In “False-Positive Psychology” (SSRN), we simulated the consequences of four (at the time acceptable) forms of p-hacking. We found that the probability of finding a statistically significant result (p<.05) skyrocketed from the nominal 5% to 61%.

For a recently published paper, “Better P-Curves” (.pdf), we modified those simulations to see how hard it would be for p-hackers to keep going past .05. We found that p-hacking needs to increase exponentially to get smaller and smaller p-values. For instance, once a nonexistent effect has been p-hacked to p<.05, a researcher would need to attempt nine times as many analyses to achieve p<.01.

[Figure: results of the ambitious p-hacking simulations]

Moreover, as Panel B shows, because there is a limited number of alternative analyses one can do (96 in our simulations), ambitious p-hacking often fails.[1]
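
The flavor of that result is easy to reproduce with a single, simple form of p-hacking, optional stopping, which is only one of the strategies simulated in the paper (so the numbers will not match the figure above; this is just an illustration):

```r
# Sketch: with a null effect, how often does optional stopping reach p < .05
# vs. p < .01 within a fixed budget of data-peeks?
set.seed(3)
phack_once <- function(alpha, batches = 10, batch_n = 10) {
  x <- y <- numeric(0)
  for (b in 1:batches) {
    x <- c(x, rnorm(batch_n)); y <- c(y, rnorm(batch_n))  # true effect is zero
    if (t.test(x, y)$p.value < alpha) return(TRUE)        # stop at 'success'
  }
  FALSE
}
mean(replicate(2000, phack_once(.05)))  # p-hacking to p < .05 succeeds fairly often
mean(replicate(2000, phack_once(.01)))  # reaching p < .01 succeeds far less often
```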

P-Curve and Ambitious p-hacking
P-curve is a tool that allows you to diagnose the evidential value of a set of statistically significant findings. It is simple: you plot the significant p-values of the statistical tests of interest to the original researchers, and you look at the shape of the resulting curve. If your p-curve is significantly right-skewed, then the literature you are examining has evidential value. If it’s significantly flat or left-skewed, then it does not.

In the absence of p-hacking, there is, by definition, a 5% chance of mistakenly observing a significantly right-skewed p-curve if one is in fact examining a literature full of nonexistent effects. Thus, p-curve’s false-positive rate is 5%.

However, when researchers p-hack trying to get p<.05, that probability drops quite a bit, because p-hacking causes p-curve to be left-skewed in expectation, making it harder to (mistakenly) observe a right-skew. Thus, literatures studying nonexistent effects through p-hacking have less than a 5% chance of obtaining a right-skewed p-curve.

But if researchers get ambitious and keep p-hacking past .05, the barely significant results start disappearing and so p-curve starts having a spurious right-skew. Intuitively, the ambitious p-hacker will eliminate the .04s and push past to get more .03s or .02s. The resulting p-curve starts to look artificially good.

The updated p-curve app, 4.0 (.htm), is robust to ambitious p-hacking
In “Better P-Curves” (.pdf) we introduced a new test for evidential value that is much more robust to ambitious p-hacking. The new app incorporates it (it also computes confidence intervals for power estimates, among many other improvements, see summary (.htm)).

The new test focuses on the “half p-curve,” the distribution of p-values that are p<.025. On the one hand, because half p-curve does not include barely significant results, it has a lower probability of mistaking ambitious p-hacking for evidential value. On the other hand, dropping observations makes the half p-curve less powerful, so it has a higher chance of failing to recognize actual evidential value.

Fortunately, by combining the full and half p-curves into a single analysis, we obtain inferences that are robust to ambitious p-hacking with minimal loss of power.

The new test of evidential value:
A set of studies is said to contain evidential value if either the half p-curve has a p<.05 right-skew test, or both the full and half p-curves have p<.1 right-skew tests. [2]
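
In code, the decision rule itself is just a pair of comparisons. The sketch below takes as inputs the right-skew test p-values that a p-curve analysis reports for the full (p<.05) and half (p<.025) curves; it does not re-implement the skew tests themselves.

```r
# The combination rule from Better P-Curves, given the right-skew test
# p-values for the full and half p-curves (computed elsewhere, e.g. by the app).
has_evidential_value <- function(p_full_skew, p_half_skew) {
  p_half_skew < .05 | (p_full_skew < .10 & p_half_skew < .10)
}

has_evidential_value(p_full_skew = .08, p_half_skew = .03)  # TRUE  (half test alone)
has_evidential_value(p_full_skew = .09, p_half_skew = .07)  # TRUE  (both < .10)
has_evidential_value(p_full_skew = .20, p_half_skew = .06)  # FALSE
```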

In the figure below we compare the performance of this new combination test with that of the full p-curve alone (the “old” test). The top three panels show that both tests are similarly powered to detect true effects. Only when original research is underpowered at 33% is the difference noticeable, and even then it seems acceptable. With just 5 p-values the new test still has more power than the underlying studies do.

[Figure: performance of the new combination test vs. the old full-curve-only test]

The bottom panels show that moderately ambitious p-hacking fully invalidates the “old” test, but the new test is unaffected by it.[3]

We believe that these revisions to p-curve, incorporated in the updated app (.html), make it much harder to falsely conclude that a set of ambitiously p-hacked results contains evidential value. As a consequence, the incentives to ambitiously p-hack are even lower than they were before.





Footnotes.

  1. This is based on simulations of what we believe to be realistic combinations and levels of p-hacking. The results will vary depending on the types and levels of p-hacking. []
  2. As with all cutoffs, it only makes sense to use these as points of reference. A half p-curve with p=.051 is nearly as good as with p=.049, and both tests with p<.001 is much stronger than both tests with p=.099. []
  3. When the true effect is zero and researchers do not p-hack (an unlikely combination), the probability that the new test leads to concluding the studies contain evidential value is 6.2% instead of the nominal 5%. R Code: https://osf.io/mbw5g/  []

[44] AsPredicted: Pre-registration Made Easy

Pre-registering a study consists of leaving a written record of how it will be conducted and analyzed. Very few researchers currently pre-register their studies. Maybe it’s because pre-registering is annoying. Maybe it’s because researchers don’t want to tie their own hands. Or maybe it’s because researchers see no benefit to pre-registering.  This post addresses these three possible causes. First, we introduce AsPredicted.org, a new website that makes pre-registration as simple as possible. We then show that pre-registrations don’t actually tie researchers’ hands, they tie reviewers’ hands, providing selfish benefits to authors who pre-register. [1]

AsPredicted.org
The best introduction is arguably the home-page itself: [screenshot of the AsPredicted.org home page]

No matter how easy pre-registering becomes, not pre-registering is always easier.  What benefits outweigh the small cost?

Benefit 1. No more self-censoring
In part by choice, and in part because some journals (and reviewers) now require it, more and more researchers are writing papers that properly disclose how their studies were run; they are disclosing all experimental conditions, all measures collected, any data exclusions, etc.

Disclosure is good. It appropriately increases one’s skepticism of post-hoc analytic decisions. But it also increases one’s skepticism of totally reasonable ex-ante decisions, for the two are sometimes confused. Imagine you collect and properly disclose that you measured one primary dependent variable and two exploratory variables,  only to get hammered by Reviewer 2, who writes:

This study is obviously p-hacked. The authors collected three measures and only used one as a dependent variable. Reject.

When authors worry that they will be accused of reporting only the best of three measures, they may decide to only collect a single measure. Preregistration frees authors to collect all three, while assuaging any concerns of being accused of p-hacking.

You don’t tie your hands with pre-registration. You tie Reviewer 2’s.

In case you skipped the third blue box above:
[screenshot of that box from the AsPredicted.org home page]

Benefit 2. Go ahead, data peek
Data peeking, where one decides whether to get more data after analyzing the data, is usually a big no-no. It invalidates p-values and (several aspects of) Bayesian inference. [2]  But if researchers pre-register how they will data peek, it becomes kosher again.

For example, you can pre-register, “In line with Frick (1986 .pdf) we will check data after every 20 observations per-cell, stopping whenever p<.01 or p>.36,”  or “In line with Pocock (1977 .pdf), we will collect up to 60 observations per-cell, in batches of 20, and stop early if p<.022.”
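
If you want to convince yourself (or Reviewer 2) that a pre-registered peeking rule keeps the false-positive rate in check, it takes a few lines to simulate. The sketch below checks the second rule quoted above, batches of 20 per cell, up to 60, stopping early at p<.022, under a true null effect:

```r
# Sketch: false-positive rate of the Pocock-style rule quoted above,
# simulated under a true null effect.
set.seed(4)
one_experiment <- function() {
  x <- y <- numeric(0)
  for (batch in 1:3) {
    x <- c(x, rnorm(20)); y <- c(y, rnorm(20))       # batches of 20 per cell
    if (t.test(x, y)$p.value < .022) return(TRUE)    # pre-registered early stop
  }
  FALSE                                              # never crossed the threshold
}
mean(replicate(10000, one_experiment()))
# Comes out close to .05: the peeking was planned, so inference stays valid.
```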

Lakens (2014 .pdf) gives an accessible introduction to legalized data-peeking for psychologists.

Benefit 3. Bolster credibility of odd analyses
Sometimes, the best way to analyze the data is difficult to sell to readers. Maybe you want to do a negative binomial regression, or do an arcsine transformation, or drop half the sample because the observations are not independent. You think about it for hours, ask your stat-savvy friends, and then decide that the weird way to analyze your data is actually the right way to analyze your data. Reporting the weird (but correct!) analysis opens you up to accusations of p-hacking. But not if you pre-register it. “We will analyze the data with an arc-sine transformation.” Done. Reviewer 2 can’t call you a p-hacker.




Footnotes.

  1. More flexible options for pre-registration are offered by the Open Science Framework and the Social Science Registry, where authors can write up documents in any format, covering any aspect of their design or analysis, and without any character limits. See pre-registration instructions for the OSF here, and for the Social Science Registry here. []
  2. In particular, if authors peek at their data seeking a given Bayes Factor, they increase the odds they will find support for the alternative hypothesis even if the null is true – see Colada [13] – and they obtain biased estimates of effect size. []

[30] Trim-and-Fill is Full of It (bias)

Statistically significant findings are much more likely to be published than non-significant ones (no citation necessary). Because overestimated effects are more likely to be statistically significant than are underestimated effects, this means that most published effects are overestimates. Effects are smaller – often much smaller – than the published record suggests.

For meta-analysts the gold standard procedure to correct for this bias, with >1700 Google cites, is called Trim-and-Fill (Duval & Tweedie 2000, .pdf). In this post we show Trim-and-Fill generally does not work.

What is Trim-and-Fill?
When you have effect size estimates from a set of studies, you can plot those estimates with effect size on the x-axis and a measure of precision (e.g., sample size or standard error) on the y-axis. In the absence of publication bias this chart is symmetric: noisy estimates are sometimes too big and sometimes too small. In the presence of publication bias the small estimates are missing. Trim-and-Fill deletes (i.e., trims) some of those large-effect studies and adds (i.e., fills) small-effect studies, so that the plot is symmetric. The average effect size in this synthetic set of studies is Trim-and-Fill’s “publication bias corrected” estimate.

What is Wrong With It?
A known limitation of Trim-and-Fill is that it can correct for publication bias that does not exist, underestimating effect sizes (see e.g., Terrin et al 2003, .pdf). A less known limitation is that it generally does not correct for the publication bias that does exist, overestimating effect sizes.

The chart below shows the results of simulations we conducted for our just published “P-Curve and Effect Size” paper (SSRN). We simulated large meta-analyses aggregating studies comparing two means, with sample sizes ranging from 10-70, for five different true effect sizes. The chart plots true effect sizes against estimated effect sizes in a context in which we only observe significant (publishable) findings (R Code for this [Figure 2b] and all other results in our paper).

[Figure: true vs. estimated effect size under publication bias (Figure 2b of the paper)]

Start with the blue line at the top. That line shows what happens when you simply average only the statistically significant findings–that is, only the findings that would typically be observed in the published literature. As we might expect, those effect size estimates are super biased.

The black line shows what happens when you “correct” for this bias using Trim-and-Fill. Effect size estimates are still super biased, especially when the effect is nonexistent or small.

Aside: p-curve nails it.
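
A stripped-down version of this comparison is easy to run with the metafor package (a sketch under simplifying assumptions, not the code behind the figure; that R Code is linked above). We assume a true effect of zero and a publication rule that keeps only significant, right-direction studies:

```r
# Sketch: naive average vs. Trim-and-Fill when only significant,
# right-direction results of a nonexistent effect are "published".
# Assumes the metafor package is installed.
library(metafor)

set.seed(5)
k     <- 4000                               # studies attempted
n     <- sample(10:70, k, replace = TRUE)   # per-cell sample sizes
se_d  <- sqrt(2 / n)                        # approximate standard error of Cohen's d
d_hat <- rnorm(k, mean = 0, sd = se_d)      # true effect is zero
p     <- 2 * pnorm(-abs(d_hat / se_d))

published <- p < .05 & d_hat > 0            # assumed publication rule (~100 studies)

naive <- mean(d_hat[published])                            # heavily overestimates zero
res   <- rma(yi = d_hat[published], sei = se_d[published]) # random-effects meta-analysis
tf    <- trimfill(res)                                     # Trim-and-Fill "correction"

round(c(true = 0, naive = naive, trim_and_fill = as.numeric(coef(tf))), 2)
# The corrected estimate moves only part of the way back toward zero.
```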

We were wrong
Trim-and-Fill assumes that studies with relatively smaller effects are not published (e.g., that out of 20 studies attempted, the 3 obtaining the smallest effect size are not publishable). In most fields, however, publication bias is governed by p-values rather than effect size (e.g., out of 20 studies only those with p<.05 are publishable).

Until a few weeks ago we thought that this incorrect assumption led to Trim-and-Fill’s poor performance. For instance, in our paper (SSRN) we wrote:

“when the publication process suppresses nonsignificant findings, Trim-and-Fill is woefully inadequate as a corrective technique.” (p.667)

For this post we conducted additional analyses and learned that Trim-and-Fill performs poorly even when its assumptions are met–that is, even when only small-effect studies go unpublished (R Code). Trim-and-Fill seems to work well only when few studies are missing, that is, where there is little bias to be corrected. In situations when a correction is most needed, Trim-and-Fill does not correct nearly enough.

Two Recommendations
1) Stop using Trim-and-Fill in meta-analyses.
2) Stop treating published meta-analyses with a Trim-and-Fill “correction” as if they have corrected for publication bias. They have not.




Author response:
Our policy at Data Colada is to contact authors whose work we cover, offering an opportunity to provide feedback and to comment within our original post. Trim-and-Fill was originally created by Sue Duval and the late Richard Tweedie.  We contacted Dr. Duval and exchanged a few emails but she did not provide feedback nor a response.

[10] Reviewers are asking for it

Recent past and present
The leading empirical psychology journal, Psychological Science, will begin requiring authors to disclose flexibility in data collection and analysis starting in January 2014 (see editorial). The leading business school journal, Management Science, implemented a similar policy a few months ago.

Both policies closely mirror the recommendations we made in our 21 Word Solution piece, where we contrasted the level of disclosure in science vs. food (see reprint of Figure 3).

[Reprint of Figure 3 from the 21 Word Solution paper]
Our proposed 21 word disclosure statement was:

We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.

Etienne Lebel tested an elegant and simple implementation in his PsychDisclosure project. Its success contributed to Psych Science‘s decision to implement disclosure requirements.

Starting Now
When reviewing for journals other than Psych Science and Management Science, what could reviewers do?

On the one hand, as reviewers we simply cannot do our jobs if we do not know fully what happened in the study we are tasked with evaluating.

On the other hand, requiring disclosure from an individual article one is reviewing risks authors taking such requests personally (reviewers are doubting them) and risks revealing our identity as reviewers.

A solution is a uniform disclosure request that large numbers of reviewers request for every paper they review.

Together with Etienne Lebel, Don Moore, and Brian Nosek, we created a standardized request that we and many others have already begun using in all of our reviews. We hope you will start using it too. With many reviewers including it in their referee reports, the community norms will change:

I request that the authors add a statement to the paper confirming whether, for all experiments, they have reported all measures, conditions, data exclusions, and how they determined their sample sizes. The authors should, of course, add any additional text to ensure the statement is accurate. This is the standard reviewer disclosure request endorsed by the Center for Open Science [see http://osf.io/project/hadz3]. I include it in every review.



[7] Forthcoming in the American Economic Review: A Misdiagnosed Failure-to-Replicate

In the paper “One Swallow Doesn’t Make A Summer: New Evidence on Anchoring Effects”, forthcoming in the AER, Maniadis, Tufano and List attempted to replicate a classic study in economics. The results were entirely consistent with the original and yet they interpreted them as a “failure to replicate.” What went wrong?

This post answers that question succinctly; our new paper has additional analyses.

Original results
In an article with >600 citations, Ariely, Loewenstein, and Prelec (2003) showed that people presented with high anchors (“Would you pay $70 for a box of chocolates?”) end up paying more than people presented with low anchors (“Would you pay $20 for a box of chocolates?”). They found this effect in five studies, but the AER replication reran only Study 2. In that study, participants gave their asking prices for aversive sounds that were 10, 30, or 60 seconds long, after a high (50¢), low (10¢), or no anchor.

Replication results

“comparing only the 10-cent and 50-cent anchor conditions, we find an effect size equal to 28.57 percent [the percentage difference between valuations], about half of what ALP found. The p-value […] was equal to 0.253” (p. 8).

So their evidence is unable to rule out the possibility that anchoring is a zero effect. But that is only part of the story. Does their evidence also rule out a sizable anchoring effect? It does not. Their evidence is consistent with an effect much larger than the original.

[Figure 1]

Those calculations use Maniadis et al.’s definition of effect size: % difference in valuations (as quoted above). An alternative is to divide the difference of means by the standard deviation (Cohen’s d). Using this metric the Replication’s effect size is more markedly different from the Original’s, d=.94 vs. d=.26. However, the 95% confidence interval for the Replication includes effects as big as d=.64, midway between medium and large effects. Whether we examine Maniadis et al.’s operationalization of effect size, then, or Cohen’s d, we arrive at the same conclusion: the Replication is too noisy to distinguish between a nonexistent and a sizable anchoring effect.
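
For readers who want to reproduce that kind of reasoning, the normal-approximation confidence interval for Cohen's d takes two lines. The cell sizes below are made up purely for illustration; they are not the replication's, so the resulting interval is not its actual CI.

```r
# Approximate 95% CI for Cohen's d from summary statistics.
d_ci <- function(d, n1, n2) {
  se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))  # large-sample SE of d
  d + c(lower = -1, upper = 1) * qnorm(.975) * se
}
d_ci(d = .26, n1 = 30, n2 = 30)   # hypothetical cell sizes
# With cells this small the interval runs from about -0.25 to 0.77:
# it includes zero *and* effects larger than medium.
```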

Why is the Replication so imprecise?
The Replication not only has 12% fewer participants; nearly half of all of its valuations are ≤10¢. Even if anchoring had a large percentage effect, one that doubles WTA from 3¢ to 6¢, the tendency of participants to round both to 5¢ makes it undetectable. And there is the floor effect: valuations so close to $0 cannot drop. One way around this problem is to do something economists do all the time: express the effect size of one variable (How big is the impact of X on Z?) relative to the effect size of another (it is half the effect of Y on Z). Figure 2 shows that, in cents, both the effect of anchoring and the effect of duration are smaller in the replication, and that the relative effect of anchoring is comparable across studies.

[Figure 2]

p-curve
The original paper had five studies: four were p<.01, the fifth p<.02. When we submit these p-values to p-curve, we can empirically examine the fear expressed by the replicators that the original finding is a false positive. The results strongly reject this possibility; selective reporting is an unlikely explanation for the original paper, p<.0001.

Some successful replications
Every year Uri runs a replication of Ariely et al.’s Study 1 in his class. In an online survey at the beginning of the semester, students write down the last two digits of their social-security-number, indicate if they would pay that amount for something (this semester it was for a ticket to watch Jerry Seinfeld live on campus), and then indicate the most they would pay. Figure 3 has this year’s data:

[Figure 3]

We recently learned that SangSuk Yoon, Nathan Fong and Angelika Dimoka successfully replicated Ariely et al.’s Study 1 with real decisions (in contrast to this paper).

Concluding remark
We are not vouching for the universal replicability of Ariely et al. here. It is not difficult to imagine moderators (beyond floor effects) that attenuate anchoring. We are arguing that the forthcoming “failure-to-replicate” anchoring in the AER is no such thing.

note: When we discuss others’ work at DataColada we ask them for feedback and offer them space to comment within the original post. Maniadis, Tufano, and List provided feedback only for our paper and did not send us comments to post here.

