[68] Pilot-Dropping Backfires (So Daryl Bem Probably Did Not Do It)

Uli Schimmack recently identified an interesting pattern in the data from Daryl Bem’s infamous “Feeling the Future” JPSP paper, in which he reported evidence for the existence of extrasensory perception (ESP; .pdf)[1]. In each study, the effect size is larger among participants who completed the study earlier (blogpost: .htm). Uli referred to this as the “decline effect.” Here is his key chart:

The y-axis represents the cumulative effect size, and the x-axis the order in which subjects participated.

The nine dashed blue lines represent each of Bem’s nine studies. The solid blue line represents the average effect across the nine studies. For the purposes of this post you can ignore the gray areas of the chart [2].

Uli’s analysis is ingenious, stimulating, and insightful, and the pattern he discovered is puzzling and interesting. We’ve enjoyed thinking about it. And in doing so, we have come to believe that Uli’s explanation for this pattern is ultimately incorrect, for reasons that are quite counter-intuitive (at least to us). [3].

Pilot dropping
Uli speculated that Bem did something that we will refer to as pilot dropping. In Uli’s words: “we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continues to collect more data to see whether the effect is real (…) the strong effect during the initial trials (…) is sufficient to maintain statistical significance  (…) as more participants are added” (.htm).

In our “False-Positive Psychology” paper (.pdf) we only briefly mentioned pilot-dropping as a form of p-hacking (p. 1361), and so we were intrigued by the possibility that it explains Bem’s impossible results.

Pilot dropping can make false-positives harder to get
It is easiest to quantify the impact of pilot dropping on false-positives by computing how many participants you need to run before a successful (false-positive) result is expected.

Let’s say you want to publish a study with two between-subjects conditions and n=100 per condition (N=200 total). If you don’t p-hack at all, then on average you need to run 20 studies to obtain one false-positive finding [4]. With N=200 in each study, that means you need an average of 4,000 participants to obtain one false-positive finding.
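Those numbers are easy to check. Here is a quick back-of-the-envelope calculation (ours, separate from the posted R Code):

```r
# With alpha = .05 and no p-hacking, the number of studies needed for one
# false positive is geometrically distributed with mean 1/.05 = 20.
alpha <- .05
N_per_study <- 200
mean_studies   <- 1 / alpha                      # 20 studies on average
mean_subjects  <- mean_studies * N_per_study     # 4,000 participants on average
median_studies <- qgeom(.5, prob = alpha) + 1    # ~14 studies (the long tail pulls the mean up)
c(mean_studies, mean_subjects, median_studies)
```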

The effects of pilot-dropping are less straightforward to compute, and so we simulated it [5].

We considered a researcher who collects a “pilot” of, say, n = 25 per condition. (We show later that the size of the pilot doesn’t matter much.) If she gets a high p-value, the pilot is dropped. If she gets a low p-value, she keeps the pilot and adds the remaining subjects to get to n = 100 per condition (so she runs another n = 75 per condition in this case).

How many subjects she ends up running depends on what threshold she selects for dropping the pilot. Two things are counter-intuitive.

First, the lower the threshold to continue with the study (e.g., p<.05 instead of p<.10), the more subjects she ends up running in total.

Second, she can easily end up running way more subjects than if she didn’t pilot-drop or p-hack at all.
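To make the procedure concrete, here is a stripped-down sketch of the simulation logic (our own simplified version; the posted R Code below is more flexible):

```r
# Pilot dropping under a true effect of zero (two-sample t-test).
# Run a pilot of n_pilot per condition; if its p-value is below `threshold`,
# complete the study to n_full per condition; count a "success" if the full
# study ends up with p < .05. Returns the expected number of subjects run
# per successful (false-positive) study.
simulate_pilot_dropping <- function(threshold, n_pilot = 25, n_full = 100, n_sims = 10000) {
  subjects <- 0
  successes <- 0
  for (i in seq_len(n_sims)) {
    a <- rnorm(n_pilot)
    b <- rnorm(n_pilot)
    subjects <- subjects + 2 * n_pilot
    if (t.test(a, b)$p.value < threshold) {            # keep the pilot, finish the study
      a <- c(a, rnorm(n_full - n_pilot))
      b <- c(b, rnorm(n_full - n_pilot))
      subjects <- subjects + 2 * (n_full - n_pilot)
      if (t.test(a, b)$p.value < .05) successes <- successes + 1
    }
  }
  subjects / successes
}

simulate_pilot_dropping(threshold = .05)   # drop pilots with p > .05
simulate_pilot_dropping(threshold = .01)   # drop pilots unless the effect is "strong"
simulate_pilot_dropping(threshold = .30)   # a lenient rule, near the optimum
```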

This chart has the results (R Code):

Note that if pilots are dropped when they obtain p>.05, it takes about 50% more participants on average to get a single study to work (because you drop too many pilots, and still many full studies don’t work).

Moreover, Uli conjectured that Bem added observations only when obtaining a “strong effect”. If we operationalize strong effect as p<.01, we now need about N=18,000 for one study to work, instead of “only” 4,000.

With higher thresholds, pilot-dropping does help, but only a little (the blue line is never too far below 4,000). For example, dropping pilots using a threshold of p>.30 is near the ‘optimum,’ and the expected number of subjects is about 3,400.

As mentioned, these results do not hinge on the size of the pilot, i.e., on the assumed n=25 (see charts .pdf).

What’s the intuition?
Pilot dropping has two effects.
(1) It saves subjects by cutting losses after a bad early draw.
(2) It costs subjects by interrupting a study that would have worked had it gone all the way.

For lower cutoffs, (2) outweighs (1), so pilot dropping ends up increasing the expected number of subjects.

What does explain the decline effect in this dataset?
We were primarily interested in the consequences of pilot dropping, but the discovery that pilot dropping is not very consequential does not bring us closer to understanding the patterns that Uli found in Bem’s data. One possibility is pilot-hacking, superficially similar to, but critically different from, pilot-dropping.

It would work like this: you run a pilot and you intensely p-hack it, possibly well past p=.05. Then you keep collecting more data and analyze them the same (or a very similar) way. That probably feels honest (regardless, it’s wrong). Unlike pilot dropping, pilot hacking would dramatically decrease the number of subjects needed for a false-positive finding, because way fewer pilots would be dropped thanks to p-hacking, and because you would start with a much stronger effect, so more studies would end up surviving the added observations (e.g., instead of needing 20 attempts to get a pilot to p<.05, with p-hacking one often needs only 1). Of course, just because pilot-hacking would produce a pattern like that identified by Uli, one should not conclude that’s what happened.

Alternative explanations for decline effects within study
1) Researchers may make a mistake when sorting the data (e.g., sorting by the dependent variable and not including the timestamp in their sort, thus creating a spurious association between time and effect) [6].

2) People who participate earlier in a study could plausibly show a larger effect than those who participate later; for example, if responsible students participate earlier and pay more attention to instructions (this is not a particularly plausible explanation for Bem, as precognition is almost certainly zero for everyone) [7].

3) Researchers may put together a series of small experiments that were originally run separately and present them as “one study,” and (perhaps inadvertently) put within the compiled dataset studies that obtained larger effects first.

Summary
Pilot dropping is not a plausible explanation for Bem’s results in general nor for the pattern of decreasing effect size in particular. Moreover, because it backfires, it is not a particularly worrisome form of p-hacking.



Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded. We shared a draft with Daryl Bem and Uli Schimmack. Uli replied and suggested that we extend the analyses to smaller sample sizes for the full study. We did. The qualitative conclusion was the same. The posted R Code includes the more flexible simulations that accommodated his suggestion. We are grateful for Uli’s feedback.



Footnotes.

  1. In this paper, Bem claimed that participants were affected by treatments that they received in the future. Since causation doesn’t work that way, and since some have failed to replicate Bem’s results, many scholars do not believe Bem’s conclusion []
  2. The gray lines are simulated data when the true effect is d=.2 []
  3. To give a sense of how much we lacked the intuition, at least one of us was pretty convinced by Uli’s explanation. We conducted the simulations below not to make a predetermined point, but because we really did not know what to expect. []
  4. The median number of studies needed is about 14; there is a long tail []
  5. The key number one needs is the probability that the full study will work, conditional on having decided to run it after seeing the pilot. That’s almost certainly possible to compute with formulas, but why bother? []
  6. This does not require a true effect, as the overall effect behind the spurious association could have been p-hacked []
  7. Ebersole et al., in “Many Labs 3” (.pdf), find no evidence of a decline over the semester; but that’s a slightly different hypothesis. []

[67] P-curve Handles Heterogeneity Just Fine

A few years ago, we developed p-curve (see p-curve.com), a statistical tool that identifies whether or not a set of statistically significant findings contains evidential value, or whether those results are solely attributable to the selective reporting of studies or analyses. It also estimates the true average power of a set of significant findings [1].

A few methods researchers have published papers stating that p-curve is biased when it is used to analyze studies with different effect sizes (i.e., studies with “heterogeneous effects”). Since effect sizes in the real world are not identical across studies, this would mean that p-curve is not very useful.

In this post, we demonstrate that p-curve performs quite well in the presence of effect size heterogeneity, and we explain why the methods researchers have stated otherwise.

Basic setup
Most of this post consists of figures like this one, which report the results of 1,000 simulated p-curve analyses (R Code).

Each analysis contains 20 studies, and each study has its own effect size, its own sample size, and because these are drawn independently, its own statistical power. In other words, the 20 studies contain heterogeneity [2].

For example, to create this first figure, each analysis contained 20 studies. Each study had a sample size drawn at random from the orange histogram, a true effect size drawn at random from the blue histogram, and thus a level of statistical power drawn at random from the third histogram.

The studies’ statistical power ranged from 10% to 70%, and their average power was 41%. P-curve guessed that their average power was 40%. Not bad.
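For concreteness, here is a minimal sketch of how one such set of heterogeneous studies is generated and what “true average power” means (the distributions below are illustrative stand-ins for the histograms; the exact distributions, and the p-curve estimation itself, are in the posted R Code):

```r
# 20 heterogeneous studies: each has its own per-cell n, its own true d,
# and therefore its own power for a two-sample t-test at alpha = .05.
set.seed(1)
k <- 20
n <- round(rnorm(k, mean = 25, sd = 7))   # per-cell sample sizes (stand-in for the orange histogram)
d <- rnorm(k, mean = .5, sd = .05)        # true effect sizes (stand-in for the blue histogram)
power <- mapply(function(n_i, d_i)
  power.t.test(n = n_i, delta = d_i, sd = 1, sig.level = .05)$power, n, d)
mean(power)   # the "true average power" that p-curve is asked to recover
```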

But what if…?

1) But what if there is more heterogeneity in effect size?
Let’s increase heterogeneity so that the analyzed set of studies contains effect sizes ranging from d = 0 (null) to d = 1 (very large), probably pretty close to the entire range of plausible effect sizes in psychology [3].

The true average power is 42%. P-curve estimates 43%. Again, not bad.

2) But what if samples are larger?
Perhaps p-curve’s success is limited to analyses of studies that are relatively underpowered. So let’s increase sample size (and therefore power) and see what happens. In this simulation, we’ve increased the average sample size from 25 per cell to 50 per cell.

The true power is 69%, and p-curve estimates 68%. This is starting to feel familiar.

3) But what if the null is true for some studies?
In real life, many p-curves will include a few truly null effects that are nevertheless significant (i.e., false-positives).  Let’s now analyze 25 studies, including 5 truly null effects (d=0) that were false-positively significant.

The true power is 56%, and p-curve estimates 57%. This is continuing to feel familiar.
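A quick arithmetic check of that number (assuming the 20 non-null studies keep the roughly 69% average power from the previous simulation, and noting that the “power” of a truly null study is simply alpha = 5%):

```r
(20 * .69 + 5 * .05) / 25   # about .56: the true average power of the 25 analyzed studies
```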

4) But what if sample size and effect size are not symmetrically distributed?

Maybe p-curve only works when sample and effect size are (unrealistically) symmetrically distributed. Let’s try changing that. First we skew the sample size, then we skew the effect size:

The true powers are 58% and 60%, and p-curve estimates 59% and 61%. This is persisting in feeling familiar.

5) But what if all studies are highly powered?
Let’s go back to the first simulation and increase the average sample size to 100 per cell. The true power is 93%, and p-curve estimates 94%. It is clear that heterogeneity does not break or bias p-curve. On the contrary, p-curve does very well in the presence of heterogeneous effect sizes.

So why have others proposed that p-curve is biased in the presence of heterogeneous effects?

Reason 1:  Different definitions of p-curve’s goal.
van Aert, Wicherts, & van Assen (2016, .pdf) write that p-curve “overestimat[es] effect size under moderate-to-large heterogeneity” (abstract). McShane, Bockenholt, & Hansen (2016, .pdf) write that p-curve “falsely assume[s] homogeneity […] produc[ing] upward[ly] biased estimates of the population average effect size.” (p.736).

We believe that the readers of those papers would be very surprised by the results we depict in the figures above. How can we reconcile our results with what these authors are claiming?

The answer is that the authors of those papers assessed how well p-curve estimated something different from what it estimates (and what we have repeatedly stated that it estimates).

They assessed how well p-curve estimated the average effect sizes of all studies that could be conducted on the topic under investigation. But p-curve informs us “only” about the studies included in p-curve [4].

Imagine that an effect is much stronger for American than for Ukrainian participants. For simplicity, let’s say that all the Ukrainian studies are non-significant and thus excluded from p-curve, and that all the American studies are p<.05 and thus included in p-curve.

P-curve would recover the true average effect of the American studies. Those arguing that p-curve is biased are saying that it should recover the average effect of both the Ukrainian and American studies, even though no Ukrainian study was included in the analysis [5].

To be clear, these authors are not particularly idiosyncratic in their desire to estimate “the” overall effect.  Many meta-analysts write their papers as if that’s what they wanted to estimate. However…

•  We don’t think that the overall effect exists in psychology (DataColada[33]).
•  We don’t think that the overall effect is of interest to psychologists (DataColada[33]).
•  And we know of no tool that can credibly estimate it.

In any case, as a reader, here is your decision:
If you want to use p-curve analysis to assess the evidential value or the average power of a set of statistically significant studies, then you can do so without having to worry about heterogeneity [6].

If you instead want to assess something about a set of studies that are not analyzed by p-curve, including studies never observed or even conducted, do not run p-curve analysis. And good luck with that.

Reason 2: Outliers vs heterogeneity
Uli Schimmack, in a working paper (.pdf), reports that p-curve overestimates statistical power in the presence of heterogeneity. Just like us, and unlike the previously referenced authors, he is looking only at the studies included in p-curve. Why do we get different results?

It will be useful to look at a concrete simulation he has proposed, one in which p-curve does indeed do poorly (R Code):

Although p-curve overestimates power in this scenario, the culprit is not heterogeneity, but rather the presence of outliers, namely several extremely highly powered studies. To see this let’s look at similarly heterogeneous studies, but ones in which the maximum power is 80% instead of 100%.

In a nutshell, the overestimation with outliers occurs because power is a bounded variable, but p-curve estimates it from an unbounded latent variable (the noncentrality parameter). It’s worth keeping in mind that a single outlier does not greatly bias p-curve. For example, if 20 studies are powered on average to 50%, adding one study powered to 95% increases true average power to 52%, and p-curve’s estimate to just 54%.
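Here is a stylized illustration of that bounded-vs-unbounded point (ours; it mimics the problem rather than reproducing p-curve’s actual estimation routine):

```r
# Power is capped at 1, but the noncentrality parameter (ncp) is not, so an
# extremely powered "outlier" pulls a single latent ncp up by more than it
# pulls true average power up. Two-sample t-test with n = 20 per cell (df = 38).
df_t  <- 38
tcrit <- qt(.975, df_t)
power_from_ncp <- function(ncp) pt(tcrit, df_t, ncp = ncp, lower.tail = FALSE)

ncp_typical <- 2.0   # roughly 50% power
ncp_outlier <- 5.2   # roughly 99.9% power
mean(sapply(c(ncp_typical, ncp_outlier), power_from_ncp))   # true average power: ~.75
power_from_ncp(mean(c(ncp_typical, ncp_outlier)))           # power at the average ncp: ~.94
```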

This problem that Uli has identified is worth taking into account, and perhaps p-curve can be modified to prevent such bias [7]. But it is worth keeping in mind that this situation should be rare, as few literatures contain both (1) a substantial number of studies powered over 90% and (2) a substantial number of under-powered studies. Moreover, this is a somewhat inconsequential mistake. All it means is that p-curve will exaggerate how strong a truly (and obviously) strong literature actually is.

In Summary
•  P-curve is not biased by heterogeneity.
•  It is biased upwards in the presence of both (1) low powered studies, and (2) a large share of extremely highly powered studies.
•  P-curve tells us about the study designs it includes, not the study designs it excludes.



Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted all 7 authors of the three methods papers discussed above. Uli Schimmack declined to comment. Karsten Hansen and Blake McShane provided suggestions that led us to more precisely describe their analyses and to describe more detailed analyses in Footnote 5. Though our exchange with Karsten and Blake started and ended with, in their words, “fundamental disagreements about the nature of evaluating the statistical properties of an estimator,” the dialogue was friendly and constructive. We are very grateful to them, both for the feedback and for the tone of the discussion. (Interestingly, we disagree with them about the nature of our disagreement: we don’t think we disagree about how to examine the statistical properties of an estimator, but rather, about how to effectively communicate methods issues to a general audience).  Marcel van Assen, Robbie van Aert, and Jelte Wicherts disagreed with our belief that readers of their paper would be surprised by how well p-curve recovers average power in the presence of heterogeneity (as they think their paper explains this as well). Specifically, like us, they think p-curve performs well when making inferences about studies included in p-curve, but, unlike us, they think that readers of their paper would realize this. They are not persuaded by our arguments that the population effect size does not exist and is not of interest to psychologists, and they are troubled by the fact that p-curve does not recover this effect. They also proposed that an important share of studies may indeed have power near 100% (citing this paper: .htm). We are very grateful to them for their feedback and collegiality as well.


Footnotes.

  1. P-curve can also be used to estimate average effect size rather than power (and, as Blake McShane and Karsten Hansen pointed out to us, when used in this fashion p-curve is virtually equivalent to the maximum likelihood procedure proposed by Hedges in 1984 (.pdf) ).  Here we focus on power rather than effect size because we don’t think “average effect size” is meaningful or of interest when aggregating across psychology experiments with different designs (see Colada[33]). Moreover, whereas power calculations only require that one knows the results of the test statistic of interest (e.g., F(1,230)=5.23), effect size calculations require one to also know how the study was defined, a fact that renders effect size estimations much more prone to human error (see page 676 of our article on p-curve and effect size estimation (.pdf) ). In any case, the point that we make in this post applies at least as strongly to an analysis of effect size as it does to an analysis of power: p-curve correctly recovers the true average effect size of the studies that it analyses, even when those studies contain different (i.e., heterogeneous) effect sizes. See Figure 2c in our article on p-curve and effect size estimation (.pdf) and Supplement 2 of that same paper (.pdf) []
  2. In real life, researchers are probably more likely to collect larger samples when studying smaller effects (see Colada[58]). This would necessarily reduce heterogeneity in power across studies []
  3. To do this, we changed the blue histogram from d~N(.5,.05) to d~N(.5,.15). []
  4. We have always been transparent about this. For instance, when we described how to use p-curve for effect size estimation (.pdf) we wrote, “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve.” (p.667). []
  5. For a more quantitative example check out Supplement 2 (.pdf) of our p-curve and effect size paper. In the middle panel of Figure S2, we consider a scenario in which a researcher attempts to run an equal number of studies (with n = 20 per cell) testing either an effect size of d = .2 or an effect size of d = .6. Because it is necessarily easier to get significance when the effect size is larger than when the effect size is smaller, the share of significant d = .6 studies will necessarily be greater than the share of significant d = .2 studies, and thus p-curve will include more d = .6 studies than d = .2 studies. Because the d = .6 studies will be over-represented among all significant studies, the true average effect of the significant studies will be d = .53 rather than d = .4. P-curve correctly recovers this value (.53), but it is biased upwards if we expect it to guess d = .4. For an even more quantitative example, imagine the true average effect is d = .5 with a standard deviation of .2. If we study this with many n=20 studies, the average observed significant effect will be d = .91, but the true average effect of those studies is d = .61, which is the number that p-curve would recover. It would not recover the true mean of the population (d = .5) but rather the true mean of the studies that were statistically significant (d = .61). In simulations, the true mean is known and this might look like a bias. In real life, the true mean is, well, meaningless, as it depends on arbitrary definitions of what constitutes the true population of all possible studies (R Code). []
  6. Again, for both practical and conceptual reasons, we would not advise you to estimate the average effect size, regardless of whether you use p-curve or any other tool. But this has nothing to do with the supposed inability of p-curve to handle heterogeneity. See footnote 1. []
  7. Uli has proposed using z-curve, a tool he developed, instead of p-curve.  While z-curve does not seem to be biased in scenarios with many studies with extreme high-power, it performs worse than p-curve in almost all other scenarios. For example, in the examples depicted graphically in this post, z-curve’s expected estimates are about 4 times further from the truth than are p-curve’s. []

[66] Outliers: Evaluating A New P-Curve Of Power Poses

In a forthcoming Psych Science paper, Cuddy, Schultz, & Fosse, hereafter referred to as CSF, p-curved 55 power-posing studies (.pdf | SSRN), concluding that they contain evidential value [1]. Thirty-four of those studies were previously selected and described as “all published tests” (p. 657) by Carney, Cuddy, & Yap (2015; .pdf). Joe and Uri p-curved those 34 studies and concluded that they lacked evidential value (.pdf | Colada[37]). The two p-curve analyses – Joe & Uri’s old p-curve and CSF’s new p-curve – arrive at different conclusions not because the different sets of authors used different sets of tools, but rather because they used the same tool to analyze different sets of data.

In this post we discuss CSF’s decision to include four studies with unusually small p-values (e.g., p < 1 in a quadrillion) in their analysis. The inclusion of these studies was sufficiently problematic that we stopped further evaluating their p-curve. [2].

Aside:
Several papers have replicated the effect of power posing on feelings of power and, as Joe and Uri reported in their Psych Science paper (.pdf, pg.4), a p-curve of those feelings-of-power effects suggests they contain evidential value. CSF interpret this as a confirmation of the central power-posing hypothesis, whereas we are reluctant to interpret it as such for reasons that are both psychological and statistical. Fleshing out the arguments on both sides may be interesting, but it is not the topic of this post.

Evaluating p-curves
Evaluating any paper is time consuming and difficult. Evaluating a p-curve paper – which is in essence, a bundle of other papers – is necessarily more time consuming and more difficult.

We have, over time, found ways to do it more efficiently. We begin by preliminarily assessing three criteria. If the p-curve fails any of these criteria, we conclude that it is invalid and stop evaluating it. If the p-curve passes all three criteria, we evaluate the p-curve work more thoroughly.

Criterion 1: Study Selection Rule
Our first step is to verify that the authors followed a clear and reproducible study selection rule. CSF did not. That’s a problem, but it is not the focus of this post. Interested readers can check out this footnote: [3].

Criterion 2: Test Selection
Figure 4 (.pdf) in our first p-curve paper (SSRN) explains and summarizes which tests to select from the most common study designs. The second thing we do when evaluating a p-curve paper is to verify that the guidelines were followed by focusing on the subset of designs that are most commonly incorrectly treated by p-curvers. For example, we look at interaction hypotheses to make sure that the right test is included, and we look to see whether omnibus tests are selected (they should almost never be; see Colada[60]). CSF selected some incorrect test results (e.g., their smallest p-value comes from an omnibus test). See “Outlier 1” below.

Criterion 3. Outliers
Next we sort studies by p-value to identify possible outliers, and we carefully read the papers containing an outlier result. We do this both because outliers exert a disproportionate effect on the results of p-curve, and because outliers are much more likely to represent the erroneous inclusion of a study or the erroneous selection of a test result. This post focuses on outliers.

This figure presents the distribution of p-values in CSF’s p-curve analysis (see their disclosure table .xlsx). As you can see, there are four outliers:

Outlier 1
CSF’s smallest p-value is from F(7, 140) = 19.47, approximately p = .00000000000000002, or 1 in 141 quadrillion. It comes from a 1993 experiment published in the journal The Arts in Psychotherapy (.pdf).

In this within-subject study (N = 24), each participant held three “open” and three “closed” body poses. At the beginning of the study, and then again after every pose, they rated themselves on eight emotions. The descriptions of the analyses are insufficiently clear to us (and to colleagues we sent the paper to), but as far as we can tell, the following things are true:

(1) Some effects are implausibly large. For example, Figure 1 in their paper (.pdf) suggests that the average change in happiness for those adopting the “closed” postures was ~24 points on a 0-24 scale. This could occur only if every participant was maximally happy at baseline and then maximally miserable after adopting every one of the 3 closed postures.

(2) The statistical analyses incorrectly treat multiple answers by the same participants as independent, across emotions and across poses.

(3) The critical test of an interaction between emotion valence and pose is not reported. Instead the authors report only an omnibus interaction: F(7, 140) = 19.47. Given the degrees-of-freedom of the test, we couldn’t figure out what hypothesis this analysis was testing, but regardless, no omnibus test examines the directional hypothesis of interest. Thus, it should not be included in a p-curve analysis.

Outlier 2
CSF’s second smallest p-value is from F(1,58)=85.9,  p = .00000000005, or 1 in 2 trillion. It comes from a 2016 study published in Biofeedback Magazine (.pdf). In that study, 33 physical therapists took turns in dyads, with one of them (the “tester”) pressing down on the other’s arm, and the other (the “subject”) attempting to resist that pressure.

The p-value selected by CSF compares subjective arm strength when the subject is standing straight (with back support) vs. slouching (without support). As the authors of the original article explain, however, that has nothing to do with any psychological consequences of power posing, but rather, with its mechanical consequences. In their words: “Obviously, the loss of strength relates to the change in the shoulder/body biomechanics and affects muscle activation recorded from the trapezius and medial and anterior deltoid when the person resists the downward applied pressure” (p. 68-69; emphasis added) [4].

Outlier 3
CSF’s third smallest p-value is from F(1,68)=26.25, p = .00000267, or 1 in ~370,000. It comes from a 2014 study published in Psychology of Women Quarterly (.pdf).

This paper explores two main hypotheses, one that is quite nonintuitive, and one that is fairly straightforward. The nonintuitive hypothesis predicts, among other things, that women who power pose while sitting on a throne will attempt more math problems when they are wearing a sweatshirt but fewer math problems when they are wearing a tank-top; the prediction is different for women sitting in a child’s chair instead of a throne [5].

CSF chose the p-value for the straightforward hypothesis, the prediction that people experience fewer positive emotions while slouching (“allowing your rib cage to drop and your shoulders to rotate forward”) than while sitting upright (“lifting your rib cage forward and pull[ing] your shoulders slightly backwards”).

Unlike the previous two outliers, one might be able to validly include this p-value in p-curve. But we have reservations, both about the inclusion of this study, and about the inclusion of this p-value.

First, we believe most people find power posing interesting because it affects what happens after posing, not what happens while posing. For example, in our opinion, the fact that slouching is more uncomfortable than sitting upright should not be taken as evidence for the power poses hypothesis.

Second, while the hypothesis is about mood, this study’s dependent variable is a principal component that combines mood with various other theoretically irrelevant variables that could be driving the effect, such as how “relaxed” or “amused” the participants were. We discuss two additional reservations in this footnote: [6].

Outlier 4
CSF’s fourth smallest p-value is from F(2,44)=13.689, p=.0000238, or 1 in 42,000. It comes from a 2015 study published in the Mediterranean Journal of Social Sciences (.pdf). Fifteen male Iranian students were all asked to hold the same pose for almost the entirety of each of nine 90-minute English instruction sessions, varying across sessions whether it was an open, ordinary, or closed pose. Although the entire class was holding the same position at the same time, and evaluating their emotions at the same time, and in front of all other students, the data were analyzed as if all observations were independent, artificially reducing the p-value.
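For reference, the four p-values discussed above can be recomputed from the reported F tests with a couple of lines of R (using the standard upper-tail F conversion):

```r
# Outliers 1-4: F statistics and degrees of freedom as reported above.
F_vals <- c(19.47, 85.9, 26.25, 13.689)
df1    <- c(7, 1, 1, 2)
df2    <- c(140, 58, 68, 44)
sort(pf(F_vals, df1, df2, lower.tail = FALSE))   # the four smallest p-values in CSF's p-curve
```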

Conclusion
Given how difficult and time consuming it is to thoroughly review a p-curve analysis or any meta-analysis (e.g., we spent hours evaluating each of the four studies discussed here), we preliminarily rely on three criteria to decide whether a more exhaustive evaluation is even warranted. CSF’s p-curve analysis did not satisfy any of the criteria. In its current form, their analysis should not be used as evidence for the effects of power posing, but perhaps a future revision might be informative.



Author feedback.
Our policy (.htm) is to share, prior to publication, drafts of posts with original authors whose work we discuss, asking them to identify anything that is unfair, inaccurate, misleading, snarky, or poorly worded.

We contacted CSF and the authors of the four studies we reviewed.

Amy Cuddy responded to our email, but did not discuss any of the specific points we made in our post, or ask us to make any specific changes. Erik Peper, lead author of the second outlier study, helpfully noticed that we had the wrong publication date and briefly mentioned several additional articles of his own on how slouched positions affect emotions, memory, and energy levels (.pdf; .pdf; .pdf; html; html). We also received an email from the second author of the first outlier study; he had “no recommended changes.” He suggested that we try to contact the lead author but we were unable to find her current email address.



Footnotes

  1. When p-curve concludes that there is evidential value, it is simply saying that at least one of the analyzed findings was unlikely to have arisen from the mere combination of random noise and selective reporting. In other words, at least one of the studies would be expected to repeatedly replicate []
  2. After reporting the overall p-curve, CSF also split the 55 studies based on the type of dependent variable: (i) feelings of power, (ii) EASE (“Emotion, Affect, and Self-Evaluation”), and (iii) behavior or hormonal response (non-EASE). They find evidential value for the first two, but not the last. The p-curve for EASE includes all four of the studies described in this post. []
  3. To ensure that studies are not selected in a biased manner, and more generally to help readers and reviewers detect possible errors, the set of studies included in p-curve must be determined by a predetermined rule. The rule, in turn, should be concrete and precise enough that an independent set of researchers following the rule would generate the same, or virtually the same, set of studies. The rule, as described in CSF’s paper, lacks the requisite concreteness and precision. In particular, the paper lists 24 search terms (e.g., “power”, “dominance”) that were combined (but the combinations are not listed). The resulting hits were then “filter[ed] out based on title, then abstract, and then the study description in the full text” in an unspecified manner. (Supplement: https://osf.io/5xjav/ | our archived copy .txt). In sum, though the authors provide some information about how they generated their set of studies, neither the search queries nor the filters are specified precisely enough for someone else to reproduce them. Joe and Uri’s p-curve, on the other hand, followed a reproducible study selection rule: all studies that were cited by Carney et al. (2015) as evidence for power posing. []
  4. The paper also reports that upright vs. collapsed postures may affect emotions and the valence of memories, but these claims are supported by quotations rather than by statistics. The one potential exception is that the authors report a “negative correlation between perceived strength and severity of depression (r=-.4).” Given the sample size of the study, this is indicative of a p-value in the .03-.05 range. The critical effect of pose on feelings, however, is not reported. []
  5. The study (N = 80) employed a fully between-subjects 2(self-objectification: wearing a tanktop vs. wearing a sweatshirt) x 2(power/status: sitting in a “grandiose, carved wooden decorative antique throne” vs. a “small wooden child’s chair from the campus day-care facility”) x 2(pose: upright vs slumped) design. []
  6. First, for robustness, one would need to include in the p-curve the impact of posing on negative mood, which is also reported in the paper and which has a considerably larger p-value (F = 13.76 instead of 26.25). Second, the structure of the experiment is very complex, involving a three-way interaction which in turn hinges on a two-way reversing interaction and a two-way attenuated interaction. It is hard to know if the p-value distribution of the main effect is expected to be uniform under the null (a requirement of p-curve analysis) when the researcher is interested in these trickle-down effects. For example, it is hard to know whether p-hacking the attenuated interaction effect would cause the p-value associated with the main effect to be biased downwards. []

[64] How To Properly Preregister A Study

P-hacking, the selective reporting of statistically significant analyses, continues to threaten the integrity of our discipline. P-hacking is inevitable whenever (1) a researcher hopes to find evidence for a particular result, (2) there is ambiguity about how exactly to analyze the data, and (3) the researcher does not perfectly plan out his/her analysis in advance. Although some mistakenly believe that accusations of p-hacking are tantamount to accusations of cheating, the truth is that accusations of p-hacking are nothing more than accusations of imperfect planning.

The best way to address the problem of imperfect planning is to plan more perfectly: to preregister your studies. Preregistrations are time-stamped documents in which researchers specify exactly how they plan to collect their data and to conduct their key confirmatory analyses. The goal of a preregistration is to make it easy to distinguish between planned, confirmatory analyses – those for which statistical significance is meaningful – and unplanned exploratory analyses – those for which statistical significance is not meaningful [1]. Because a good preregistration prevents researchers from p-hacking, it also protects them from suspicions of p-hacking [2].

In the past five years or so, preregistration has gone from being something that no psychologists did to something that many psychologists are doing. In our view, this wonderful development is the biggest reason to be optimistic about the future of our discipline.

But if preregistration is going to be the solution, then we need to ensure that it is done right. After casually reviewing several recent preregistration attempts in published papers, we noticed that there is room for improvement. We saw two kinds of problems.

Problem 1. Not enough information
For example, we saw one “preregistration” that was simply a time-stamped abstract of the project; it contained almost no details about how data were going to be collected and analyzed. Others failed to specify one or more critical aspects of the analysis: sample size, rules for exclusions, or how the dependent variable would be scored (in a case for which there were many ways to score it). These preregistrations are time-stamped, but they lack the other critical ingredient: precise planning.

To decide which information to include in your preregistration, it may be helpful to imagine a skeptical reader of your paper. Let’s call him Leif. Imagine that Leif is worried that p-hacking might creep into the analyses of even the best-intentioned researchers. The job of your preregistration is to set Leif’s mind at ease [3]. This means identifying all of the ways you could have p-hacked – choosing a different sample size, or a different exclusion rule, or a different dependent variable, or a different set of controls/covariates, or a different set of conditions to compare, or a different data transformation – and including all of the information that lets Leif know that these decisions were set in stone in advance. In other words, your job is to prevent Leif from worrying that you tried to run your critical analysis in more than one way.

This means that your preregistration needs to be sufficiently exhaustive and sufficiently specific. If you say, “We will exclude participants who are distracted,” Leif could think, “Right, but distracted how? Did you define “distracted” in advance?” It is better to say, “We will exclude participants who incorrectly answered at least 2 out of our 3 comprehension checks.” If you say, “We will measure happiness,” Leif could think, “Right, but aren’t there a number of ways to measure it? I wonder if this was the only one they tried or if it was just the one they most wanted to report after the data came in?” So it’s better to say, “Our dependent variable is happiness, which we will measure by asking people ‘How happy do you feel right now?’ on a scale ranging from 1 (not at all happy) to 7 (extremely happy).”

If including something in a preregistration would make Leif less likely to wonder whether you p-hacked, then include it.

Problem 2. Too much information
A preregistration cannot allow readers and reviewers to distinguish between confirmatory and exploratory analyses if it is not easy to read or understand. Thus, a preregistration needs to be easy to read and understand. This means that it should contain only the information that is essential for the task at hand. We have seen many preregistrations that are just too long, containing large sections on theoretical background and on exploratory analyses, or lots of procedural details that, on the one hand, will definitely be part of the paper, and, on the other, are not p-hackable. Don’t forget that you will also publish the paper, not just the preregistration; you don’t need to say in the preregistration everything that you will say in the paper. A hard-to-read preregistration makes preregistration less effective [4].

To decide which information to exclude in your preregistration, you can again imagine that a skeptical Leif is reading your paper, but this time you can ask, “If I leave this out, will Leif be more concerned that my results are attributable to p-hacking?”

For example, if you leave out the literature review from your preregistration, will Leif now be more concerned? Of course not, as your literature review does not affect how much flexibility you have in your key analysis. If you leave out how long people spent in the lab, how many different RAs you are using, why you think your hypothesis is interesting, or the description of your exploratory analyses, will Leif be more concerned? No, because none of those things affect the fact that your analyses are confirmatory.

If excluding something from a preregistration would not make Leif more likely to wonder whether you p-hacked, then you should exclude it.

The Takeaway
Thus, a good preregistration needs to have two features:

  1. It needs to specify exactly how the key confirmatory analyses will be conducted.
  2. It needs to be short and easy to read.

We designed AsPredicted.org with these goals in mind. The website poses a standardized set of questions that ask you to include only what needs to be included, thus also making it obvious what does not need to be. The OSF offers lots of flexibility, but it also offers an AsPredicted template here: https://osf.io/fnsb/ [5].

Still, even on AsPredicted, it is possible to get it wrong, for example by not being specific enough in your answers to the questions it poses. This table provides examples of wrong and proper answers to these questions.


Regardless of where your next preregistration is hosted, make it a good preregistration by including what should be included and excluding what should be excluded.

 



Feedback
We would like to thank Stephen Lindsay and Simine Vazire for taking time out of their incredibly busy schedules to give us invaluable feedback on a previous version of this post.



Footnotes.

  1. This is because conducting unplanned analyses necessarily inflates the probability that you will find a statistically significant relationship even if no relationship exists. []
  2. For good explanations of the virtues of preregistration, see Lindsay et al. (2016) <.html>, Moore (2016) <.pdf>, and van’t Veer & Giner-Sorolla (2016) <.pdf>. []
  3. Contrary to popular belief, the job of your pre-registration is NOT to show that your predictions were confirmed. Indeed, the critical aspect of pre-registration is not the prediction that you register – many good preregistrations pose questions (e.g., “We are testing whether eating Funyuns cures cancer”) rather than hypotheses (e.g., “We hypothesize that eating Funyuns cures cancer”) – but the analysis that you specify. In hindsight, perhaps our preregistration website should have been called AsPlanned rather than AsPredicted, although AsPredicted sounds better. []
  4. Even complex studies should have a simple and clear preregistration, one that allows a reader to casually differentiate between confirmation and exploration. Additional complexities could potentially be captured in other secondary planning documents, but because these are far less likely to be read, they shouldn’t obscure the core basics of the simple preregistration. []
  5. We recently updated the AsPredicted questions, and so this OSF template contains slightly different questions than the ones currently on AsPredicted. We advise readers who wish to use the OSF to answer the questions that are currently on https://AsPredicted.org. []