[46] Controlling the Weather

Behavioral scientists have put forth evidence that the weather affects all sorts of things, including the stock market, restaurant tips, car purchases, product returns, art prices, and college admissions.

It is not easy to properly study the effects of weather on human behavior. This is because weather is (obviously) seasonal, as is much of what people do. This means that any investigation of the relation between weather and behavior must properly control for seasonality.

For example, in the U.S., Google searches for “fireworks” correlate positively with temperature throughout the year, but only because July 4th is in the summer. This is a seasonal effect, not a weather effect.
Almost every weather paper tries to control for seasonality. This post shows they don’t control enough.

How do they do it?
To answer this question, we gathered a sample of 10 articles that used weather as a predictor. 1
Table 1 lists them. In economics, business, statistics, and psychology, authors use monthly and occasionally weekly controls to account for seasonality. For instance, they ask, “Does how cold it was when a coat was bought predict if it was returned, controlling for the month of the year in which it was purchased?”

That’s not enough.
The figures below show the average daily temperature in Philadelphia, along with the estimates provided by monthly (left panel) and weekly (right panel) fixed effects. These figures remind us that the weather does not jump discretely from month to month or week to week. Rather, weather, like earth, moves continuously. This means that seasonal confounds, which are continuous, will survive discrete (monthly or weekly) controls.

The vertical distance between the blue lines (monthly/weekly dummies) captures the residual seasonality confound. For example, during March (just left of the ‘100 day’ tick), the monthly dummy assigns 44 degrees to every March day, but temperature systematically fluctuates within March, from a long-term average of 39 degrees on March 1st to a long-term average of 50 degrees on March 31st. This is a seasonally confounded 11-degree difference that is entirely unaccounted for by monthly dummies.

The confounded effect of seasonality that survives weekly dummies is roughly 1/4 that size.
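To get a feel for these magnitudes, here is a rough sketch in R (a stylized sinusoidal year, not actual Philadelphia data) of how much of the seasonal temperature cycle survives monthly vs. weekly dummies:

```r
# Stylized long-run daily average temperature: a sinusoid, not real data
doy      <- 1:365
avg_temp <- 55 - 25 * cos(2 * pi * doy / 365)

month <- rep(1:12, times = c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31))
week  <- pmin((doy - 1) %/% 7 + 1, 52)

# The part of the seasonal cycle that each set of dummies fails to absorb
residual_monthly <- avg_temp - ave(avg_temp, month)
residual_weekly  <- avg_temp - ave(avg_temp, week)

sd(residual_monthly)  # seasonal confound left over by monthly dummies (a few degrees)
sd(residual_weekly)   # roughly 1/4 as large, as noted above
```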

Fixing it.
The easy solution is to control for the historical average of the weather variable of interest for each calendar date.2

For example, when using how cold January 24, 2013 was to predict whether a coat bought that day was eventually returned, we include as a covariate the historical average temperature for January 24th  (in that city).3

Demonstrating the easy fix
To demonstrate how well this works, we analyze a correlation that is entirely due to a seasonal confound: the number of daylight hours in Bangkok, Thailand (sunset – sunrise), and the temperature that same day in Philadelphia. Colder days in Philadelphia tend to be shorter days in Bangkok, but not because coldness in one place shortens the day in the other (nor vice versa), but because seasonal patterns influence both variables. Properly controlling for seasonality should eliminate the association between these variables.

Using day duration in Bangkok as the dependent variable and temperature in Philly as the predictor, we threw in monthly and then weekly dummies to control for the seasonal confound. Neither technique fully succeeded, as same-day temperature survived as a significant predictor.

Table 2 shows the results. Thus, using monthly and weekly dummy variables made it seem like, over and above the effects of seasonality, colder days are more likely to be shorter. However, controlling for the historical average daily temperature showed, correctly, that seasonality is the sole driver of this relationship.
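A minimal simulation along the same lines (made-up numbers, not the post’s actual Bangkok/Philadelphia data) illustrates both the failure of month dummies and the fix:

```r
set.seed(1)
days     <- seq(as.Date("2006-01-01"), as.Date("2015-12-31"), by = "day")
doy      <- as.integer(format(days, "%j"))   # day of year
month    <- factor(format(days, "%m"))       # for monthly dummies
cal_date <- format(days, "%m-%d")            # calendar date, for historical averages

# Two series driven only by the calendar, plus independent noise (no causal link)
temp_philly <- 55 - 25 * cos(2 * pi * doy / 365.25) + rnorm(length(days), sd = 8)
day_bangkok <- 12 + 0.6 * sin(2 * pi * (doy - 80) / 365.25) + rnorm(length(days), sd = 0.05)

# Historical average temperature for each calendar date (here, within-sample)
hist_avg <- ave(temp_philly, cal_date)

# Same-day temperature stays spuriously 'significant' with month dummies...
summary(lm(day_bangkok ~ temp_philly + month))$coefficients["temp_philly", ]

# ...but should drop to roughly zero once the historical average is controlled for
summary(lm(day_bangkok ~ temp_philly + hist_avg))$coefficients["temp_philly", ]
```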


Original author feedback:
We shared a draft of this post with authors from all 10 papers from Table 1 and we heard back from 5 of them. Their feedback led to correcting errors in Table 1, changing the title of the post, and fixing the day-duration example (Table 2). Devin Pope, moreover, conducted our suggested analysis on his convertible purchases (QJE) paper and shared the results with us. The finding is robust to our suggested additional control. Devin thought it was valuable to highlight that while historic temperature average is a better control for weather-based seasonality, reducing bias, weekly/monthly dummies help with noise from other seasonal factors such as holidays. We agreed. Best practice, in our view, would be to include time dummies to the granularity permitted by the data to reduce noise, and to include the daily historic average to reduce the seasonal confound of weather variation.




Footnotes.
  1. Uri created the list by starting with the most well-cited observational weather paper he knew – Hirshleifer & Shumway – and then selected papers citing it in the Web of Science and published in journals he recognized. []
  2. Another option is to use daily dummies. This can easily be worse. It can lower statistical power by throwing away data. First, one can only apply daily fixed effects to data with at least two observations per calendar date. Second, this approach ignores historical weather data that precedes the dependent variable. For example, if using sales data from 2013-2015 in the analyses, the daily fixed effects force us to ignore weather data from any prior year. Lastly, it ‘costs’ 365 degrees-of-freedom (don’t forget leap year), instead of 1. []
  3. Uri has two weather papers. They both use this approach to account for seasonality. []

[45] Ambitious p-hacking and p-curve 4.0

In this post, we first consider how plausible it is for researchers to engage in more ambitious p-hacking (i.e., past the nominal significance level of p<.05). Then, we describe how we have modified p-curve (see app 4.0) to deal with this possibility.

Ambitious p-hacking is hard.
In “False-Positive Psychology” (SSRN), we simulated the consequences of four (at the time acceptable) forms of p-hacking. We found that the probability of finding a statistically significant result (p<.05) skyrocketed from the nominal 5% to 61%.

For a recently published paper, “Better P-Curves” (.pdf), we modified those simulations to see how hard it would be for p-hackers to keep going past .05. We found that p-hacking needs to increase exponentially to get smaller and smaller p-values. For instance, once a nonexistent effect has been p-hacked to p<.05, a researcher would need to attempt nine times as many analyses to achieve p<.01.

[Figure: simulation results from “Better P-Curves”; Panel B shows how often ambitious p-hacking fails.]

Moreover, as Panel B shows, because there is a limited number of alternative analyses one can do (96 in our simulations), ambitious p-hacking often fails.1

P-Curve and Ambitious p-hacking
P-curve is a tool that allows you to diagnose the evidential value of a set of statistically significant findings. It is simple: you plot the significant p-values of the statistical tests of interest to the original researchers, and you look at the shape of the resulting distribution. If your p-curve is significantly right-skewed, then the literature you are examining has evidential value. If it’s significantly flat or left-skewed, then it does not.

In the absence of p-hacking, there is, by definition, a 5% chance of mistakenly observing a significantly right-skewed p-curve if one is in fact examining a literature full of nonexistent effects. Thus, p-curve’s false-positive rate is 5%.

However, when researchers p-hack trying to get p<.05, that probability drops quite a bit, because p-hacking causes p-curve to be left-skewed in expectation, making it harder to (mistakenly) observe a right-skew. Thus, literatures studying nonexistent effects through p-hacking have less than a 5% chance of obtaining a right-skewed p-curve.

But if researchers get ambitious and keep p-hacking past .05, the barely significant results start disappearing and so p-curve starts having a spurious right-skew. Intuitively, the ambitious p-hacker will eliminate the .04s and push past to get more .03s or .02s. The resulting p-curve starts to look artificially good.

Updated p-curve app, 4.0 (htm), is robust to ambitious p-hacking
In “Better P-Curves” (.pdf) we introduced a new test for evidential value that is much more robust to ambitious p-hacking. The new app incorporates it (it also computes confidence intervals for power estimates, among many other improvements, see summary (.htm)).

The new test focuses on the “half p-curve,” the distribution of p-values that are p<.025. On the one hand, because half p-curve does not include barely significant results, it has a lower probability of mistaking ambitious p-hacking for evidential value. On the other hand, dropping observations makes the half p-curve less powerful, so it has a higher chance of failing to recognize actual evidential value.

Fortunately, by combining the full and half p-curves into a single analysis, we obtain inferences that are robust to ambitious p-hacking with minimal loss of power.

The new test of evidential value:
A set of studies is said to contain evidential value if either the half p-curve has a p<.05 right-skew test, or both the full and half p-curves have p<.1 right-skew tests. 2
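In code, the decision rule is just a couple of comparisons. A minimal sketch (the inputs are assumed to be the right-skew test p-values reported by the app; this is not the app’s own code):

```r
# Evidential value if the half p-curve is right-skewed at p < .05,
# or both the full and half p-curves are right-skewed at p < .10
has_evidential_value <- function(p_full, p_half) {
  p_half < .05 | (p_full < .10 & p_half < .10)
}

has_evidential_value(p_full = .04, p_half = .03)  # TRUE: half p-curve alone suffices
has_evidential_value(p_full = .08, p_half = .09)  # TRUE: both tests below .10
has_evidential_value(p_full = .04, p_half = .20)  # FALSE
```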

In the figure below we compare the performance of this new combination test with that of the full p-curve alone (the “old” test). The top three panels show that both tests are similarly powered to detect true effects. Only when original research is underpowered at 33% is the difference noticeable, and even then it seems acceptable. With just 5 p-values the new test still has more power than the underlying studies do.


The bottom panels show that moderately ambitious p-hacking fully invalidates the “old” test, but the new test is unaffected by it.3

We believe that these revisions to p-curve, incorporated in the updated app (.html), make it much harder to falsely conclude that a set of ambitiously p-hacked results contains evidential value. As a consequence, the incentives to ambitiously p-hack are even lower than they were before.



Footnotes.
  1. This is based on simulations of what we believe to be realistic combinations and levels of p-hacking. The results will vary depending on the types and levels of p-hacking. []
  2. As with all cutoffs, it only makes sense to use these as points of reference. A half p-curve with p=.051 is nearly as good as one with p=.049, and having both tests at p<.001 is much stronger than having both at p=.099. []
  3. When the true effect is zero and researchers do not p-hack (an unlikely combination), the probability that the new test leads to concluding the studies contain evidential value is 6.2% instead of the nominal 5%. R Code: https://osf.io/mbw5g/  []

[44] AsPredicted: Pre-registration made easy

Pre-registering a study consists of leaving a written record of how it will be conducted and analyzed. Very few researchers currently pre-register their studies. Maybe it’s because pre-registering is annoying. Maybe it’s because researchers don’t want to tie their own hands. Or maybe it’s because researchers see no benefit to pre-registering. This post addresses these three possible causes. First, we introduce AsPredicted.org, a new website that makes pre-registration as simple as possible. We then show that pre-registrations don’t actually tie researchers’ hands; they tie reviewers’ hands, providing selfish benefits to authors who pre-register. 1

AsPredicted.org
The best introduction is arguably the home-page itself: [screenshot of the AsPredicted.org home page, November 30, 2015]

No matter how easy pre-registering becomes, not pre-registering is always easier.  What benefits outweigh the small cost?

Benefit 1. No more self-censoring
In part by choice, and in part because some journals (and reviewers) now require it, more and more researchers are writing papers that properly disclose how their studies were run; they are disclosing all experimental conditions, all measures collected, any data exclusions, etc.

Disclosure is good. It appropriately increases one’s skepticism of post-hoc analytic decisions. But it also increases one’s skepticism of totally reasonable ex-ante decisions, for the two are sometimes confused. Imagine you collect and properly disclose that you measured one primary dependent variable and two exploratory variables,  only to get hammered by Reviewer 2, who writes:

This study is obviously p-hacked. The authors collected three measures and only used one as a dependent variable. Reject.

When authors worry that they will be accused of reporting only the best of three measures, they may decide to only collect a single measure. Preregistration frees authors to collect all three, while assuaging any concerns of being accused of p-hacking.

You don’t tie your hands with pre-registration. You tie Reviewer 2’s.

In case you skipped the third blue box above: [screenshot of the “what if…?” box from the AsPredicted home page]

Benefit 2. Go ahead, data peek
Data peeking, where one decides whether to get more data after analyzing the data, is usually a big no-no. It invalidates p-values and (several aspects of) Bayesian inference. 2  But if researchers pre-register how they will data peek, it becomes kosher again.

For example, you can pre-register, “In line with Frick (1986 .pdf) we will check data after every 20 observations per-cell, stopping whenever p<.01 or p>.36,”  or “In line with Pocock (1977 .pdf), we will collect up to 60 observations per-cell, in batches of 20, and stop early if p<.022.”

Lakens (2014 .pdf) gives an accessible introduction to legalized data-peeking for psychologists.
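As a sanity check on why pre-registered peeking is kosher, here is a quick simulation sketch (not from the post) of the Pocock-style rule quoted above, under a true null effect:

```r
# Batches of 20 per cell, up to 60 per cell, stop whenever p < .022
set.seed(1)
one_run <- function() {
  a <- b <- numeric(0)
  for (batch in 1:3) {
    a <- c(a, rnorm(20))   # condition A, null is true
    b <- c(b, rnorm(20))   # condition B
    if (t.test(a, b)$p.value < .022) return(TRUE)  # peek and stop if 'significant'
  }
  FALSE
}
mean(replicate(10000, one_run()))  # overall false-positive rate stays near .05
```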

Benefit 3. Bolster credibility of odd analyses
Sometimes, the best way to analyze the data is difficult to sell to readers. Maybe you want to do a negative binomial regression, or do an arcsine transformation, or drop half the sample because the observations are not independent. You think about it for hours, ask your stat-savvy friends, and then decide that the weird way to analyze your data is actually the right way to analyze your data. Reporting the weird (but correct!) analysis opens you up to accusations of p-hacking. But not if you pre-register it. “We will analyze the data with an arc-sine transformation.” Done. Reviewer 2 can’t call you a p-hacker.




Footnotes.
  1. More flexible options for pre-registration are offered by the Open Science Framework and the Social Science Registry, where authors can write up documents in any format, covering any aspect of their design or analysis, and without any character limits. See pre-registration instructions for the OSF here, and for the Social Science Registry here. []
  2. In particular, if authors peek at their data seeking a given Bayes Factor, they increase the odds they will find support for the alternative hypothesis even if the null is true – see Colada [13] – and they obtain biased estimates of effect size. []

[43] Rain & Happiness: Why Didn’t Schwarz & Clore (1983) ‘Replicate’ ?

In my “Small Telescopes” paper, I introduced a new approach to evaluate replication results (SSRN). Among other examples, I described two studies as having failed to replicate the famous Schwarz and Clore (1983) finding that people report being happier with their lives when asked on sunny days.

[Figure and text from the Small Telescopes paper (SSRN)]
I recently had an email exchange with a senior researcher (not involved in the original paper) who persuaded me I should have been more explicit regarding the design differences between the original and replication studies.  If my paper weren’t published I would add a discussion of such differences and would explain why I don’t believe these can explain the failures to replicate.  

Because my paper is already published, I write this post instead.

The 1983 study
This study is so famous that a paper telling the story behind it (.pdf) has over 450 Google cites.  It is among the top-20 most cited articles published in JPSP and the most cited by either (superstar) author.

In the original study a research assistant called University of Illinois students either during the “first two sunny spring days after a long period of gray, overcast days”, or during two rainy days within a “period of low-hanging clouds and rain” (p. 298, .pdf).

She asked about life satisfaction and then current mood. At the beginning of the phone conversation, she either did not mention the weather, mentioned it in passing, or described it as being of interest to the study.

The reported finding is that “respondents were more satisfied with their lives on sunny than rainy days—but only when their attention was not drawn to the weather” (p. 298, .pdf).

[Bar chart of the original study’s results]

‘Replication’
Feddersen et al. (.pdf) matched weather data to the Australian Household Income Survey, which includes a question about life satisfaction. With 90,000 observations, the effect was basically zero.

There are at least three notable design differences between the original and replication studies:1

1. Smaller causes have smaller effects. The 1983 study focused on days on which weather was expected to have large mood effects; the Australian sample used the whole year. The first sunny day in spring is not like the 53rd sunny day of summer.

2. Already attributed. Respondents answered many questions in Australia before reporting their life-satisfaction, possibly misattributing mood to something else.

3. Noise. The representative sample is more diverse than a sample of college undergrads is; thus the data are noisier, less likely to detectably exhibit any effect.

Often this is where discussions of failed replications end—with the enumeration of potential moderators, and the call for more and better data. I’ll try to use the data we already have to assess whether any of the differences are likely to matter.2

Design difference 1. Smaller causes.
If weather contrasts were critical for altering mood and hence possibly happiness, then the effect in the 1983 study should be driven by the first sunny day in spring, not the Nth rainy day.  But a look at the bar chart above shows the opposite: People were NOT happier the first sunny day of spring; they were unhappier on the rainy days. Their description of these days again: ‘and the rainy days we used were several days into a new period of low-hanging clouds and rain.’ (p. 298, .pdf)

The days driving the effect, then, were similar to previous days. Because of how seasons work, most days in the replication studies presumably were also similar to the days that preceded them (sunny after sunny and rainy after rainy), and so on this point the replication does not seem different or problematic.

Second, Lucas and Lawless (JPSP 2014, .pdf) analyzed a large (N=1 million) US sample and also found no effect of weather on life satisfaction. Moreover, they explicitly assessed if unseasonably cloudy/sunny days, or days with sunshine that differed from recent days, were associated with bigger effects. They were not. (See their Table 3).

Third, the effect size Schwarz and Clore report is enormous: 1.7 points in a 1-10 scale. To put that in perspective, from other studies, we know that the life satisfaction gap between people who got married vs. people who became widows over the past year is about 1.5 on the same scale (see Figure 1, Lucas 2005 .pdf). Life vs. death are estimated as less impactful than precipitation. Even if the effect were smaller on days not as carefully selected as those by Schwarz and Clore, the ‘replications’ averaging across all days should still have detectable effects.

The large effect is particularly surprising considering it is the downstream effect of weather on mood, and that effect is really tiny (see Tal Yarkoni’s blog review of a few studies .htm)

Design difference  2. Already attributed.
This concern, recall, is that people answering many questions in a survey may misattribute their mood to earlier questions. This makes sense, but the concern applies to the original as well.

The phone-call from Schwarz & Clore’s RA does not come immediately after the “mood induction” either; rather, participants get the RA’s phone call hours into a rainy vs. sunny day.  Before the call they presumably made evaluations too, answering questions like “How are you and Lisa doing?” “How did History 101 go?” “Man, don’t you hate Champaign’s weather?” etc. Mood could have been misattributed to any of these earlier judgments in the original as well. Our participants’ experiences do not begin when we start collecting their data. 3

Design difference 3. Noise.
This concern is that the more diverse sample in the replication makes it harder to detect any effect. If the replication were noisier, we may expect the dependent variable to have a higher standard deviation (SD). For life satisfaction, Schwarz and Clore got about SD=1.69 and Feddersen et al. got SD=1.52. So there is less noise in the replication. 4 Moreover, the replication has panel data and controls for individual differences via fixed effects. These account for 50% of the variance, so they have spectacularly less noise. 5

Concluding bullet points.
– The existing data are overwhelmingly inconsistent with current weather affecting reported life satisfaction.
– This does not imply the theory behind Schwarz and Clore (1983), mood-as-information, is wrong.


Author feedback
I sent a draft of this post to Richard Lucas (.htm) who provided valuable feedback and additional sources. I also sent a draft to Norbert Schwarz (.htm) and Gerald Clore (.htm). They provided feedback that led me to clarify when I first identified the design differences between the original and replication studies (back in 2013, see footnotes 1&2).  They turned down several invitations to comment within this post.




Footnotes.
  1. The first two were mentioned in the first draft of my paper but I unfortunately cut them out during a major revision, around May 2013. The third was proposed in February of 2013 in a small mailing list discussing the first talk I gave of my Small Telescopes paper. []
  2. There is also the issue, as Norbert Schwarz pointed out to me in an email in May of 2013, that the 1983 study is not about weather nor life satisfaction, but about misattribution of mood. The ‘replications’ do not even measure mood. I believe we can meaningfully discuss whether the effect of rain on happiness replicates without measuring mood; in fact, the difficulty of manipulating mood via weather is one thing that makes the original finding surprising. []
  3. What one needs to explain the differences via the presence of other questions is that mood effects from weather replenish through the day, but not immediately. So on sunny days at 7AM I think my cat makes me happier than usual, and then at 10AM that my calculus teacher’s jokes are funnier than usual, but if the joke had been told at 7:15AM I would not have found it funny because I had already attributed my mood to the cat. This is possible. []
  4. Schwarz and Clore did not report SDs, but one can compute them off the reported test statistics. See Supplement 2 for Small Telescopes .pdf. []
  5. See Feddersen et al.’s Table A1, column 4 vs. 3, .pdf  []

[42] Accepting the null: Where to draw the line?

We typically ask if an effect exists.  But sometimes we want to ask if it does not.

For example, how many of the “failed” replications in the recent reproducibility project published in Science (.pdf) suggest the absence of an effect?

Data have noise, so we can never say ‘the effect is exactly zero.’  We can only say ‘the effect is basically zero.’ What we do is draw a line close to zero and if we are confident the effect is below the line, we accept the null.
[Whiteboard sketch: confidence intervals that do and do not include the line]

We can draw the line via Bayes or via p-values; it does not matter very much. The line is what really matters. How far from zero is it? What moves it up and down?

In this post I describe 4 ways to draw the line, and then pit the top 2 against each other.

Way 1. Absolutely small
The oldest approach draws the line based on absolute size. Say, diets leading to losing less than 2 pounds have an effect of basically zero. Economists do this often. For instance, a recent World Bank paper (.html) reads

“The impact of financial literacy on the average remittance frequency has a 95 percent confidence interval [−4.3%, +2.5%] …. We consider this a relatively precise zero effect, ruling out large positive or negative effects of training” (emphasis added)
(Dictionary note. Remittance: immigrants sending money home).

In much of behavioral science effects of any size can be of theoretical interest, and sample sizes are too small to obtain tight confidence intervals, making this approach unviable in principle and in practice. 1

Way 2. Undetectably Small
In our first p-curve paper with Joe and Leif (SSRN), and in my “Small Telescopes” paper on evaluating replications (.pdf), we draw the line based on detectability.

We don’t draw the line where we stop caring about effects.
We draw the line where we stop being able to detect them.

Say an original study with n=50 finds people can feel the future. A replication with n=125 ‘fails,’ getting an effect estimate of d=0.01, p=.94. Data are noisy, so the confidence interval goes all the way up to d=.2. That’s a respectably big feeling-the-future effect we are not ruling out. So we cannot say the effect is absolutely small.
The original study, with just n=50, however, is unable to detect that small an effect (it would have <18% power). So we accept the null, the null that the effect is either zero, or undetectably small by existing studies.
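A back-of-envelope version of that power calculation, using the numbers above (and assuming a simple two-cell design with n per cell):

```r
ci_upper_d <- 0.20   # largest effect the replication's confidence interval leaves open
n_original <- 50     # original study's sample size per cell (as assumed above)

# How much power did the original have to detect an effect that small?
power.t.test(n = n_original, delta = ci_upper_d, sd = 1, sig.level = .05)$power
# ~0.17 (<18%): too small for the original to have detected,
# so we accept the null of 'zero or undetectably small'
```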

Way 3. Smaller than expected in general
Bayesian hypothesis testing runs a horse race between two hypotheses:

Hypothesis 1 (null):              The effect is exactly zero.
Hypothesis 2 (alternative): The effect is one of those moderately sized ones. 2

When data clearly favor 1 more than 2, we accept the null. The bigger the effects Hypothesis 2 includes, the further from zero we draw the line, and the more likely we are to accept the null. 3

The default Bayesian test, commonly used by Bayesian advocates in psychology, draws the line too far from zero (for my taste). Reasonably powered studies of moderately big effects wrongly accept the null of zero effect too often (see Colada[35]). 4

Way 4. Smaller than expected this time
A new Bayesian approach to evaluate replications, by Verhagen and Wagenmakers (2014 .pdf), pits a different Hypothesis 2 against the null. Its Hypothesis 2 is what a Bayesian observer would predict for the replication after seeing the Original (with some assumed prior).

Similar to Way 3 the bigger the effect seen in the original is, the bigger the effect we expect in the replication, and hence the further from zero we draw the line. Importantly, here the line moves based on what we observed in the original, not (only) on what we arbitrarily choose to consider reasonable to expect. The approach is the handsome cousin of testing if effect size differs between original and replication.

Small Telescope vs Expected This Time (Way 2 vs Way 4)
I compared the conclusions both approaches arrive at when applied to the 100 replications from that Science paper. The results are similar but far from equal, r = .9 across all replications, and r = .72 among n.s. ones (R Code). Focusing on situations where the two lead to opposite conclusions is useful to understand each better. 5, 6

In Study 7 in the Science paper,
The Original estimated a monstrous d=2.14 with N=99 participants total.
The Replication estimated a small    d=0.26, with a minuscule N=14.

The Small Telescopes approach is irked by the small sample of the replication. Its wide confidence interval includes effects as big as d=1.14, giving the original >99% power. We cannot rule out detectable effects; the replication is inconclusive.

The Bayesian observer, in contrast, draws a line quite far from zero after seeing the massive Original effect size. The line, indeed, is at a remarkable d=.8. Replications with smaller effect size estimates (anything smaller than large) ‘support the null.’ Because the replication got d=.26, it strongly supports the null.

A hypothetical scenario where they disagree in the opposite direction (R Code),
Original.       N=40,       d=.7
Replication.  N=5000, d=.1

The Small Telescopes approach asks if the replication rejects an effect big enough to be detectable by the original. Yes. d=.1 cannot be studied with N=40. Null Accepted.  7

Interestingly, that small N=40 pushes the Bayesian in the opposite direction. An original with N=40 changes her beliefs about the effect very little, so d=.1 in the replication is not that surprising vs. the Original, but it is incompatible with d=0 given the large sample size: null rejected.

I find myself agreeing with the Small Telescopes’ line more than any other. But that’s a matter of taste, not fact.





Footnotes.
  1. e.g., we need n=1500 per cell to have a confidence interval entirely within d<.1 and d>-.1 []
  2. The tests don’t formally assume the effects are moderately large, rather they assume distributions of effect size, say N(0,1). These distributions include tiny effects, even zero, but they also include very large effects, e.g., d>1 as probable possibilities.  It is hard to have intuitions for what assuming a distribution entails. So for brevity and clarity I just say they assume the effect is moderately large. []
  3. Bayesians don’t accept and reject hypotheses; instead, the evidence supports one or another hypothesis. I will use the term accept anyway. []
  4. This is fixable in principle, just define another alternative. If someone proposes a new Bayesian test, ask them “what line around zero is it drawing?”  Even without understanding Bayesian statistics you can evaluate if you like the line the test generates or not. []
  5. Alex Etz in a blogpost (.html) reported the Bayesian analysis of the 100 replications; I used some of his results here. []
  6. These are the Spearman correlations between the p-value testing the null that the original had at least 33% power and the Bayes Factor described above. []
  7. Technically it is the upper end of the confidence interval we consider when evaluating the power of the original sample; it goes up to d=.14. I used d=.1 to keep things simpler. []

[41] Falsely Reassuring: Analyses of ALL p-values

It is a neat idea. Get a ton of papers. Extract all p-values. Examine the prevalence of p-hacking by assessing if there are too many p-values near p=.05. Economists have done it [SSRN], as have psychologists [.html], and biologists [.html]. These charts with distributions of p-values come from those papers:

[Charts: distributions of p-values from the three papers]

The dotted circles highlight the excess of .05s, but most p-values are way smaller, suggesting p-hacking happens but is not a first-order concern. That’s reassuring, but falsely reassuring. 1, 2

Bad Sampling.
There are several problems with looking at all p-values; here I focus on sampling. 3

If we want to know if researchers p-hack their results, we need to examine the p-values associated with their results, those they may want to p-hack in the first place. Samples, to be unbiased, must only include observations from the population of interest.

Most p-values reported in most papers are irrelevant for the strategic behavior of interest. Covariates, manipulation checks, main effects in studies testing interactions, etc. By including them we underestimate p-hacking and we overestimate the evidential value of data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?” 4

A Demonstration.
In our first p-curve paper (SSRN) we analyzed p-values from experiments with results reported only with a covariate.

We believed researchers would report the analysis without the covariate if it were significant; thus, we believed those studies were p-hacked. The resulting p-curve was left-skewed, so we were right.

Figure 2. p-curve for relevant p-values in experiments reported only with a covariate.

I went back to the papers we had analyzed and redid the analyses, only this time I did them incorrectly.

Instead of collecting only the (23) p-values one should select (we provide detailed directions for selecting p-values in our paper, SSRN), I proceeded the way the indiscriminate analysts of p-values proceed: I got ALL (712) p-values reported in those papers.

Figure 3. p-curve for all p-values reported in papers behind Figure 2

Figure 3 tells us that the things those papers were not studying were super true.
Figure 2 tells us that the things they were studying were not.

Looking at all p-values is falsely reassuring.
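A toy simulation (made-up p-values, not the data behind Figures 2 and 3) shows the dilution mechanism in miniature: a handful of relevant p-values bunched just under .05 is invisible once pooled with hundreds of irrelevant, highly significant ones.

```r
set.seed(1)
relevant   <- rbeta(23, 5, 1) * .05                   # crude stand-in: p-hacked, bunched near .05
irrelevant <- 2 * pnorm(-abs(rnorm(689, mean = 4)))   # e.g., manipulation checks with huge effects
sig        <- function(p) p[p < .05]                  # p-curve only uses significant results

hist(sig(relevant),                breaks = seq(0, .05, .01))  # left-skewed: looks p-hacked
hist(sig(c(relevant, irrelevant)), breaks = seq(0, .05, .01))  # pooled: looks right-skewed
```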




Author feedback
I sent a draft of this post to the first author of the three papers with charts reprinted in Figure 1 and the paper from footnote 1. They provided valuable feedback that improved the writing and led to footnotes 2 & 4.




Footnotes.
  1. The Econ and Psych papers were not meant to be reassuring, but they can be interpreted that way. For instance, a recent J of Econ Perspectives (.pdf) paper reads “Brodeur et al. do find excess bunching, [but] their results imply that it may not be quantitatively as severe as one might have thought”. The PLOS Biology paper was meant to be reassuring. []
  2. The PLOS Biology paper had two parts. The first used the indiscriminate selection of p-values from articles in a broad range of journals and attempted to assess the prevalence and impact of p-hacking in the field as a whole. This part is fully invalidated by the problems described in this post. The second used p-values from a few published meta-analyses on sexual selection in evolutionary biology; this second part is by construction not representative of biology as a whole. In the absence of a p-curve disclosure table, where we know which p-value was selected from each study, it is not possible to evaluate the validity of this exercise. []
  3. For other problems see Dorothy Bishop’s recent paper [.html] []
  4. Brodeur et al. did painstaking work to exclude some irrelevant p-values, e.g., those explicitly described as control variables, but nevertheless left many in. To give a sense of scale, they obtained an average of about 90 p-values from each paper. To give a concrete example, one of the papers in their sample is by Ferreira and Gyourko (.pdf). Via regression discontinuity it shows that a mayor’s political party does not predict policy. To demonstrate the importance of their design, Ferreira & Gyourko also report naive OLS regressions with highly significant but spurious and incorrect results that at face value contradict the paper’s thesis (see their Table II). These very small but irrelevant p-values were included in the sample by Brodeur et al. []

[40] Reducing Fraud in Science

Fraud in science is often attributed to incentives: we reward sexy-results→fraud happens. The solution, the argument goes, is to reward other things.  In this post I counter-argue, proposing three alternative solutions.

Problems with the Change the Incentives solution.
First, even if rewarding sexy-results caused fraud, it does not follow we should stop rewarding sexy-results. We should pit costs vs benefits. Asking questions with the most upside is beneficial.

Second, if we started rewarding unsexy stuff, a likely consequence is fabricateurs continuing to fake, now just unsexy stuff.  Fabricateurs want the lifestyle of successful scientists. 1 Changing incentives involves making our lifestyle less appealing. (Finally, a benefit to committee meetings). 

Third, the evidence for “liking sexy→fraud” is just not there. Like real research, most fake research is not sexy. Life-long fabricateur Diederik Stapel mostly published dry experiments with “findings” in line with the rest of the literature. That we attend to and remember the sexy fake studies is diagnostic of what we pay attention to, not what causes fraud.  

The evidence that incentives cause fraud comes primarily from self-reports, with fabricateurs saying “the incentives made me do it” (see e.g., Tijdink et al .pdf; or Stapel interviews).  To me, the guilty saying “it’s not my fault” seems like weak evidence. What else could they say?
“I realized I was not cut-out for this; it was either faking some science or getting a job with less status”
“I am kind of a psychopath; I had fun tricking everyone”
“A voice in my head told me to do it”

Similarly weak, to me, is the observation that fraud is more prevalent in top journals; we find fraud where we look for it. Fabricateurs faking articles that don’t get read don’t get caught….

It’s good for universities to ignore quantity of papers when hiring and promoting, good for journals to publish interesting questions with inconclusive answers. But that won’t help with fraud.

Solution 1. Retract without asking “are the data fake?”
We have a high bar for retracting articles, and a higher bar for accusing people of fraud. 
The latter makes sense. The former does not.

Retracting is not such a big deal, it just says “we no longer have confidence in the evidence.” 

So many things can go wrong when collecting, analyzing and reporting data that this should be a relatively routine occurrence even in the absence of fraud. An accidental killing may not land the killer in prison, but the victim goes 6 ft under regardless. I’d propose a  retraction doctrine like:

If something is discovered that would lead reasonable experts to believe the results did not originate in a study performed as described in a published paper, or to conclude the study was conducted with excessive sloppiness, the journal should retract the paper.   

Example 1. Analyses indicate published results are implausible for a study conducted as described (e.g., excessive linearity, implausibly similar means, or a covariate is impossibly imbalanced across conditions). Retract.

Example 2. Authors of a paper published in a journal that requires data sharing upon request, when asked for it, indicate that they have “lost the data”.  Retract. 2

Example 3. Comparing original materials with posted data reveals important inconsistencies (e.g., scales ranges are 1-11 in the data but 1-7 in the original). Retract.

When journals reject original submissions it is not their job to figure out why the authors ran an uninteresting study or executed it poorly. They just reject it.

When journals lose confidence in the data behind a published article it is not their job to figure out why the authors published data in which confidence was eventually lost. They should just retract it.

Employers, funders, and co-authors can worry about why an author published untrustworthy data. 

Solution 2. Show receipts
Penn, my employer, reimburses me for expenses incurred at conferences.

However, I don’t get to just say “hey, I bought some tacos in that Kansas City conference, please deposit $6.16 onto my checking account.” I need receipts.  They trust me, but there is a paper trail in case of need.

When I submit the work I presented in Kansas City to a journal, in contrast, I do just say “hey, I collected the data this or that way.” No receipts.

The recent Science retraction, with canvassers & gay marriage, is a great example of the value of receipts. The statistical evidence suggested something was off, but the receipts-like paper trail helped a lot:

Author: “so and so ran the survey with such and such company”
Sleuths: “hello such and such company, can we talk with so and so about this survey you guys ran?”
Such and such company: “we don’t know any so and so, and we don’t have the capability to run the survey.”

Authors should provide as much documentation about how they run their science as they do about what they eat at conferences: where exactly was the study run, at what time and day, which research assistant ran it (with contact information), how exactly were participants paid, etc.

We will trust everything researchers say. Until the need to verify arises.

Solution 3. Post data, materials and code
Had the raw data not been available, the recent Science retraction would probably not have happened. Stapel would probably not have gotten caught. The cases against Sanna and Smeesters would not have moved forward.  To borrow from a recent paper with Joe and Leif:

Journals that do not increase data and materials posting requirements for publications are causally, if not morally, responsible for the continued contamination of the scientific record with fraud and sloppiness.  



Feedback from Ivan Oransky, co-founder of Retraction Watch
Ivan co-wrote an editorial in the New York Times on changing the incentives to reduce fraud (.pdf). I reached out to him to get feedback. He directed me to some papers on the evidence of incentives and fraud. I was unaware of, but also unpersuaded by, that evidence. This prompted me to add the last paragraph in the incentives section (where I am skeptical of that evidence).
Despite our different takes on the role of rewarding sexy-findings on fraud, Ivan is on board with the three non-incentive solutions proposed here.  I thank Ivan for the prompt response and useful feedback. (and for Retraction Watch!)



Footnotes.

  1. I use the word fabricateur to refer to scientists who fabricate data. Fraudster is insufficiently specific (e.g., selling 10 bagels calling them a dozen is fraud too), and fabricator has positive meanings (e.g., people who make things). Fabricateur has a nice ring to it. []
  2. Every author publishing in an American Psychological Association journal agrees to share data upon request []

[39] Power Naps: When do within-subject comparisons help vs hurt (yes, hurt) power?

A recent Science-paper (.pdf) used a total sample size of N=40 to arrive at the conclusion that implicit racial and gender stereotypes can be reduced while napping. 

N=40 is a small sample for a between-subject experiment. One needs N=92 to reliably detect that men are heavier than women (SSRN). The study, however, was within-subject, for instance, its dependent variable, the Implicit Association Test (IAT), was contrasted within-participant before and after napping. 1

Reasonable question: How much more power does subtracting baseline IAT give a study?
Surprising answer: it lowers power.

Design & analysis of napping study
Participants took the gender and race IATs, then trained for the gender IAT (while listening to one sound) and the race IAT (different sound). Then everyone naps.  While napping one of the two sounds is played (to cue memory of the corresponding training, facilitating learning while sleeping). Then both IATs are taken again. Nappers were reported to be less biased in the cued IAT after the nap.

This is perhaps a good place to indicate that there are many studies with similar designs and sample sizes. The blogpost is about strengthening intuitions for within-subject designs, not criticizing the authors of the study.

Intuition for the power drop
Let’s simplify the experiment. No napping. No gender IAT. Everyone takes only the race IAT.

Half train before taking it, half don’t. To test if training works we could do
         Between-subject test: is the mean IAT different across conditions?

If before training everyone took a baseline race IAT, we could instead do
         Mixed design test: is the mean change in IAT different across conditions?

Subtracting baseline, going from between-subject to a mixed-design, has two effects: one good, one bad.

Good: Reduce between-subject differences. Some people have stronger racial associations than others. Subtracting baselines reduces those differences, increasing power.

Bad: Increase noise. The baseline is, after all, just an estimate. Subtracting baseline adds noise, reducing power.

Imagine the baseline was measured incorrectly. The computer recorded, instead of the IAT, the participant’s body temperature. IAT scores minus body temperature is a noisier dependent variable than just IAT scores, so we’d have less power.

If baseline is not quite as bad as body temperature, the consequence is not quite as bad, but same idea. Subtracting baseline adds the baseline’s noise.

We can be quite precise about this. Subtracting baseline only helps power if baseline is correlated r>.5 with the dependent variable, but it hurts if r<.5. 2

See the simple math (.html). Or, just see the simple chart.
e.g., running n=20 per cell and subtracting baseline, when r=.3, lowers power enough that it is as if the sample had been n=15 instead of n=20. (R Code)
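Here is the back-of-envelope version of that cutoff (a sketch, not the post’s R code): subtracting a baseline that correlates r with the outcome changes the error variance from σ² to 2σ²(1 − r), so the subtraction only pays off when r > .5.

```r
# Effective per-cell sample size after subtracting a baseline that correlates r with the DV
effective_n <- function(n, r) n / (2 * (1 - r))

effective_n(20, r = .3)  # ~14: subtracting this baseline hurts (the example above)
effective_n(20, r = .5)  # 20: break-even
effective_n(20, r = .7)  # ~33: now subtracting the baseline helps
```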

Before-After correlation for  IAT
Subtracting baseline IAT will only help, then, if when people take it twice, their scores are correlated r>.5. Prior studies have found test-retest reliability of r = .4 for the racial IAT. 3  Analyzing the posted data (.html) from this study, where manipulations take place between measures, I got r = .35. (For gender IAT I got r=.2) 4

Aside: one can avoid the power-drop entirely if one controls for baseline in a regression/ANCOVA instead of subtracting it.  Moreover, controlling for baseline never lowers power. See bonus chart (.pdf). 

Within-subject manipulations
In addition to subtracting baseline, one may carry out the manipulation within-subject: every participant gets treatment and control. Indeed, in the napping study everyone had a cued and a non-cued IAT.

How much this helps depends again on the correlation of the within-subject measures: Does race IAT correlate with gender IAT?  The higher the correlation, the bigger the power boost. 

When both measures are uncorrelated it is as if the study had twice as many subjects. This makes sense: r=0 is as if the data came from different people, so asking two questions of n=20 people is like asking one question of n=40. As r increases we have more power because we expect the two measures to be more and more similar, so any given difference is more and more statistically significant. 5 (R Code for chart)

Race & gender IATs capture distinct mental associations, measured with a test of low reliability, so we may not expect a high correlation. At baseline, r(race,gender)=-.07, p=.66.  The within-subject manipulation, then, “only” doubled the sample size.
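The same style of approximation (again a sketch, not the paper’s computation) gives the within-subject boost: n participants measured in both conditions match a between-subject study with roughly 2n/(1 − r) participants total.

```r
effective_N_between <- function(n, r) 2 * n / (1 - r)

effective_N_between(20, r = 0)     # 40: uncorrelated measures ~ twice the subjects
effective_N_between(20, r = -.07)  # ~37: roughly the race/gender IAT case above
effective_N_between(20, r = .5)    # 80: highly correlated measures help a lot
```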

So, how big was the sample?
The Science-paper reports N=40 people total. The supplement explains that it actually combines two separate studies run months apart, each with N=20. The analyses subtracted baseline IAT, lowering power, as if N=15. The manipulation was within-subject, doubling it, to N=30. To detect “men are heavier than women” one needs N=92. 6

Author feedback
I shared an early draft of this post with the authors of the Science-paper. We had an extensive email exchange that led to clarifying some ambiguities in the writing. They also suggested I mention that their results are robust to controlling for baseline instead of subtracting it.



Footnotes.

  1. The IAT is the Implicit Association Test and assesses how strongly respondents associate, for instance, good things with Whites and bad things with Blacks; take a test (.html) []
  2. Two days after this post went live I learned, via Jason Kerwin, of this very relevant paper by David McKenzie (.pdf) arguing for economists to collect data from more rounds. David makes the same point about r>.5 for a gain in power from, in econ jargon, a diff-in-diff vs. the simple diff. []
  3. Bar-Anan & Nosek (2014, p. 676 .pdf); Lane et al. (2007, p.71 .pdf)  []
  4. That’s for post vs. pre nap. In the napping study the race IAT is taken 4 times by every participant, resulting in 6 before-after correlations, ranging from r = -.047 to r = .53; simple average r = .3. []
  5. This ignores the impact that going from between to within subject design has on the actual effect itself. Effects can get smaller or larger depending on the specifics. []
  6. The idea of using men-vs-women weight as a benchmark is to give a heuristic reaction; effects big enough to be detectable by the naked eye require bigger samples than the ones we are used to seeing when studying surprising effects. For those skeptical of this heuristic, let’s use published evidence on the IAT as a benchmark. Lai et al (2014 .pdf) ran 17 interventions seeking to reduce IAT scores. The biggest effect among these 17 was d=.49. That effect size requires n=66 per cell, N=132 total, for 80% power (more than for men vs women weight). Moderating this effect through sleep, and moderating the moderation through cueing while sleeping, requires vastly larger samples to attain the same power. []

[38] A Better Explanation Of The Endowment Effect

It’s a famous study. Give a mug to a random subset of a group of people. Then ask those who got the mug (the sellers) to tell you the lowest price they’d sell the mug for, and ask those who didn’t get the mug (the buyers) to tell you the highest price they’d pay for the mug. You’ll find that sellers’ minimum selling prices exceed buyers’ maximum buying prices by a factor of 2 or 3 (.pdf).

This famous finding, known as the endowment effect, is presumed to have a famous cause: loss aversion. Just as loss aversion maintains that people dislike losses more than they like gains, the endowment effect seems to show that people put a higher price on losing a good than on gaining it. The endowment effect seems to perfectly follow from loss aversion.

But a 2012 paper by Ray Weaver and Shane Frederick convincingly shows that loss aversion is not the cause of the endowment effect (.pdf). Instead, “the endowment effect is often better understood as the reluctance to trade on unfavorable terms,” in other words “as an aversion to bad deals.” 1

This paper changed how I think about the endowment effect, and so I wanted to write about it.

A Reference Price Theory Of The Endowment Effect

Weaver and Frederick’s theory is simple: Selling and buying prices reflect two concerns. First, people don’t want to sell the mug for less, or buy the mug for more, than their own value of it. Second, they don’t want to sell the mug for less, or buy the mug for more, than the market price. This is because people dislike feeling like a sucker. 2

To see how this produces the endowment effect, imagine you are willing to pay $1 for the mug and you believe it usually sells for $3. As a buyer, you won’t pay more than $1, because you don’t want to pay more than it’s worth to you. But as a seller, you don’t want to sell for as little as $1, because you’ll feel like a chump selling it for much less than it is worth. 3 Thus, because there’s a gap between people’s perception of the market price and their valuation of the mug, there’ll be a large gap between selling ($3) and buying ($1) prices.

Weaver and Frederick predict that the endowment effect will arise whenever market prices differ from valuations.

However, when market prices are not different from valuations, you shouldn’t see the endowment effect. For example, if people value a mug at $2 and also think that its market price is $2, then both buyers and sellers will price it at $2.

And this is what Weaver and Frederick find. Repeatedly. There is no endowment effect when valuations are equal to perceived market prices. Wow.

Just to be sure, I ran a within-subjects hypothetical study that is much inferior to Weaver and Frederick’s between-subjects incentivized studies, and, although my unusual design produced some unusual results, I found strong support for their hypothesis (full description .pdf; data .xls). Most importantly, I found that people who gave higher selling prices than buying prices for the same good were much more likely to say they did this because they wanted to avoid a bad deal than because of loss aversion.

In fact, whereas 82.5% of participants endorsed at least one bad-deal reason, only 18.8% of participants endorsed at least one loss-aversion reason. 4

I think Weaver and Frederick’s evidence makes it difficult to consider loss aversion the best explanation of the endowment effect. Loss aversion can’t explain why the endowment effect is so sensitive to the difference between market prices and valuations, and it certainly can’t explain why the effect vanishes when market prices and valuations converge. 5

Weaver and Frederick’s theory is simple, plausible, supported by the data, and doesn’t assume that people treat losses differently than gains. It just assumes that, when setting prices, people consider both their valuations and market prices, and dislike feeling like a sucker.



Author feedback.
I shared an early draft of this post with Shane Frederick. Although he opted not to comment publicly, during our exchange I did learn of an unrelated short (and excellent) piece that he wrote that contains a pretty awesome footnote (.html).

Footnotes.
  1. Even if you don’t read Weaver and Frederick’s paper, I strongly advise you to read Footnote 10. []
  2. Thaler (1985) called this “transaction utility” (.pdf). Technically Weaver and Frederick’s theory is about “reference prices” rather than “market prices”, but since market prices are the most common/natural reference price I’m going to use the term market prices. []
  3. Maybe because you got the mug for free, you’d be willing to sell it for a little bit less than the market price – perhaps $2 rather than $3. Even so, if the gap between market prices and valuations is large enough, there’ll still be an endowment effect []
  4. For a similar result, see Brown 2005 (.pdf). []
  5. Loss aversion is not the only popular account. According to an “ownership” account of the endowment effect (.pdf), owning a good makes you like it more, and thus price it higher, than not owning it. Although this mechanism may account for some of the effect (the endowment effect may be multiply determined), it cannot explain all the effects Weaver and Frederick report. Nor can it easily account for why the endowment effect is observed in hypothetical studies, when people simply imagine being buyers or sellers. []

[37] Power Posing: Reassessing The Evidence Behind The Most Popular TED Talk

A recent paper in Psych Science (.pdf) reports a failure to replicate the study that inspired a TED Talk that has been seen 25 million times. 1  The talk invited viewers to do better in life by assuming high-power poses, just like Wonder Woman’s below, but the replication found that power-posing was inconsequential.

[Image: Wonder Woman in a power pose]

If an original finding is a false positive then its replication is likely to fail, but a failed replication need not imply that the original was a false positive. In this post we try to figure out why the replication failed.

Original study
Participants in the original study took an expansive “high-power” pose, or a contractive “low-power” pose (Psych Science 2010, .pdf). The power-posing participants were reported to have felt more powerful, sought more risk, and had higher testosterone and lower cortisol levels. In the replication, power posing affected self-reported power (the manipulation check), but did not impact behavior or hormonal levels. 2 The key point of the TED Talk, that power poses “can significantly change the outcomes of your life” (minute 20:10; video; transcript .html), was not supported.

Was The Replication Sufficiently Precise?
Whenever a replication fails, it is important to assess how precise the estimates are. A noisy estimate may be consistent with no effect (i.e., not significant) but also consistent with effects large enough so as to not challenge the original study.

Only precisely estimated non-significant results contradict an original finding (see Colada[7] for an example of a strikingly imprecise replication).

Here are the original and replication confidence intervals for the risk-taking measure: 3

[Figure: original and replication confidence intervals for risk taking]

This figure shows that the replication is precise enough to be informative.

The upper end of the confidence interval is d=.06, meaning that the data are consistent with zero and inconsistent with power poses affecting risk-taking by more than 6% of a standard deviation. We can put in perspective how small that upper bound is by noting that, with just n=21 per cell, the original study would have had a meager 5.6% chance of detecting it (i.e., <6% statistical power).

Thus, even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it. (For more on this approach to thinking about replications check out Uri’s Small Telescopes paper (.pdf)).
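To make the power arithmetic concrete, here is a minimal Python sketch (our illustration, not the original analysis) using the statsmodels power routines: it computes the power of a two-cell, n=21-per-cell design to detect d=.06, and, following the Small Telescopes benchmark described in footnote 3, the effect size such a design would detect 33% of the time. The function choices and rounding are ours.

# Minimal power sketch for a two-cell design with n = 21 per cell (illustrative only)
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of a two-sided t-test with n = 21 per cell to detect d = .06
# (the upper bound of the replication's confidence interval);
# this should come out around 5-6%, the "meager" power cited above.
power_upper_bound = analysis.power(effect_size=0.06, nobs1=21, alpha=0.05,
                                   ratio=1.0, alternative='two-sided')
print(f"Power to detect d = .06 with n = 21 per cell: {power_upper_bound:.3f}")

# "Small effect" in the Small Telescopes sense: the effect size that would
# give the original sample size 33% power (see footnote 3).
d_33 = analysis.solve_power(nobs1=21, alpha=0.05, power=0.33,
                            ratio=1.0, alternative='two-sided')
print(f"Effect size giving n = 21 per cell 33% power: d = {d_33:.2f}")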

The same is true for the effects of power poses on hormones:

[Figure: confidence intervals for the effects of power posing on testosterone and cortisol, original study vs. replication]
This implies that there is either a moderator or the original is a false positive.

Moderators
In their response (.pdf), the original authors reviewed every published study on power posing they could locate (they found 33), seeking potential moderators, factors that may affect whether power posing works. For example, they speculated that power posing may not have worked in the replication because participants were told that power poses could influence behavior. (One apparent implication of this hypothesis is that watching the power poses TED talk would make power poses ineffective).

The authors also list all differences in design they noted between the original and replication studies (see their nice summary .pdf).

One approach would be to run a series of studies to systematically manipulate these hypothesized moderators to see whether they matter.

But before spending valuable resources on that, it is necessary to first establish whether there is reason to believe, based on the published literature, that power posing is ever effective. Might it instead be the case that the original findings are false positives? 4

P-curve
It may seem that 30-some successful studies are enough to conclude the effect is real. However, we need to account for selective reporting. If studies only get published when they show an effect, the fact that all the published evidence shows an effect is not diagnostic.

P-curve is just the tool for this. It tells us whether we can rule out selective reporting as the sole explanation for a set of p<.05 findings (see p-curve paper .pdf).

We conducted a p-curve analysis on the 33 studies the original authors cited in their reply as evidence that power posing works. The studies come from many labs around the world.

If power posing were a true effect, we should see a curve that is right-skewed, tall on the left and low on the right. If power posing had no effect, we would expect a flat curve.
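That intuition is easy to verify with a quick simulation (a sketch of our own, with arbitrary effect and sample sizes, not an analysis of the actual 33 studies): among studies that happen to cross the p<.05 threshold, a true effect piles p-values up near zero, whereas a null effect leaves them roughly uniform between 0 and .05.

# Simulate many two-cell studies and look only at the significant ones (illustrative only)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant_pvalues(d, n_per_cell=21, n_studies=20000):
    """Simulate many two-cell studies with true effect d; keep p-values < .05."""
    ps = []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n_per_cell)
        treated = rng.normal(d, 1.0, n_per_cell)
        p = stats.ttest_ind(treated, control).pvalue
        if p < 0.05:
            ps.append(p)
    return np.array(ps)

for d, label in [(0.0, "null effect -> flat p-curve"),
                 (0.5, "true effect -> right-skewed p-curve")]:
    ps = significant_pvalues(d)
    # Share of significant p-values falling in each .01-wide bin, as in a p-curve plot
    shares = np.histogram(ps, bins=[0, .01, .02, .03, .04, .05])[0] / len(ps)
    print(label, np.round(shares, 2))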

[Figure: illustrative right-skewed (“good”) vs. flat (“bad”) p-curves]
The actual p-curve for power posing is flattish, definitely not right-skewed.

[Figure: p-curve of the 33 published power-posing studies]

Note: you should ignore p-curve results that do not include a disclosure table; here is ours (.xlsx).

Consistent with the replication motivating this post, p-curve indicates that either power posing overall has no effect, or the effect is too small for the existing samples to have meaningfully studied it. Note that there are perfectly benign explanations for this: e.g., labs that ran studies that worked wrote them up; labs that ran studies that didn't, didn't. 5

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.



Author feedback.
We shared an early draft of this post with the authors of the original and failed replication studies. We also shared it with Joe Cesario, author of a few power-posing studies with whom we had discussed this literature a few months ago.

Note: It is our policy not to comment; the authors get the last word.

Amy Cuddy (.html), co-author of the original study and TED Talk speaker, provided useful suggestions for clarifications and modifications that led to several revisions, including a new title. She also sent this note:

I’m pleased that people are interested in discussing the research on the effects of adopting expansive postures. I hope, as always, that this discussion will help to deepen our understanding of this and related phenomena, and clarify directions for future research. Given that our quite thorough response to the Ranehill et al. study has already been published in Psychological Science, I will direct you to that (.html). I respectfully disagree with the interpretations and conclusions of Simonsohn et al., but I’m considering these issues very carefully and look forward to further progress on this important topic.

Roberto Weber (.html), co-author of the failed replication study, sent Uri a note, co-signed by all of his co-authors, which he allowed us to post (see it here: .pdf). They make four points, which we quote in an attempt to summarize:

(1) none of the 33 published studies, other than the original and our replication, study the effect […] on hormones
(2) even within [the original] the hormone result seems to be presented inconsistently
(3) regarding differences in designs [from Table 2 in their response], […] the evidence does not support many of these specific examples as probable moderators in this instance
(4) we employed an experimenter-blind design […] this might be the factor that most plausibly serves as a moderator of the effect

Again, full response: .pdf.

Joe Cesario (.html):

Excellent post, very clear and concise. Our lab has also been concerned with some of the power pose claims–less about direct replicability and more about the effectiveness in more realistic situations (motivating our Cesario & McDonald paper and another paper in prep with David Johnson). These concerns also led us to contact a number of the most cited power pose researchers to invite them for a multi-site, collaborative replication and extension project. Several said they could not commit the resources to such an endeavor, but several did agree in principle. Although progress has stalled, we are hopeful about such a project moving forward. This would be a useful and efficient way of providing information about replicability as well as potential moderation of the effects.



Footnotes.
  1. This TED talk is ranked 2nd in overall downloads and 1st in per-year downloads. The talk also shared a very touching personal story of how the speaker overcame an immense personal challenge; this post is concerned exclusively with the science part of the talk []
  2. The original authors consider “feelings of power” to be a manipulation check rather than an outcome measure. In their most recent paper they write, “as a manipulation check, participants reported how dominant, in control, in charge, powerful, and like a leader they felt on a 5-point scale” (.pdf) []
  3. The definition of Small Effect in the figure corresponds to an effect that would give the original sample size 33% power, see Uri’s Small Telescopes paper (.pdf) []
  4. ***See Simine Vazire’s excellent post on why it’s wrong to always ask authors to explain failures to replicate by proposing moderators rather than concluding that the original is a false-positive .html (the *** is a gift for Simine)  []
  5. Because the replication obtained a significant effect of power posing on the manipulation check, self-reported power, we constructed a separate p-curve including only the 7 manipulation check results. The resulting p-curve was directionally right-skewed (p=.075). We interpret this as further consistency between p-curve and the replication results. If from studies reporting both manipulation checks and the effect of interest we select only the manipulation check, and from other studies we select the effect of interest (something we believe is not reasonable, but report anyway in case a reader disagrees and for the sake of robustness), the overall conclusions do not change. The overall p-curve still looks flat, does not conclude there is evidential value (Z=.84, p=.199), and does conclude the curve is flatter than 33% (Z=2.05, p=.0203); see the illustrative sketch below. []
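For readers curious how a p-curve gets summarized by a single Z statistic like the ones in this footnote, here is a rough, purely illustrative sketch (our simplification with made-up p-values, not the p-curve app's exact computation): each significant p-value is rescaled to its probability conditional on being significant under the null, converted to a z-score, and the z-scores are combined Stouffer-style. A strongly right-skewed curve yields a very negative combined Z; a flat curve yields a Z near zero.

# Stouffer-style aggregation of a set of significant p-values (illustrative only)
import numpy as np
from scipy import stats

# Hypothetical significant p-values, made up for illustration
significant_ps = np.array([0.003, 0.012, 0.021, 0.034, 0.041, 0.044, 0.048])

pp = significant_ps / 0.05                 # conditional p-values under the null
z = stats.norm.ppf(pp)                     # probit transform of each pp-value
stouffer_z = z.sum() / np.sqrt(len(z))     # combined test statistic
p_right_skew = stats.norm.cdf(stouffer_z)  # small if the curve is right-skewed

print(f"Combined Z = {stouffer_z:.2f}, right-skew p = {p_right_skew:.3f}")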