[46] Controlling the Weather

Behavioral scientists have put forth evidence that the weather affects all sorts of things, including the stock market, restaurant tips, car purchases, product returns, art prices, and college admissions.

It is not easy to properly study the effects of weather on human behavior. This is because weather is (obviously) seasonal, as is much of what people do. This means that any investigation of the relation between weather and behavior must properly control for seasonality.

For example, in the U.S., Google searches for “fireworks” correlate positively with temperature throughout the year, but only because July 4th is in the summer. This is a seasonal effect, not a weather effect.
Almost every weather paper tries to control for seasonality. This post shows they don’t control enough.

How do they do it?
To answer this question, we gathered a sample of 10 articles that used weather as a predictor. [1]
[Table 1: the 10 sampled articles and the seasonality controls they use]

In economics, business, statistics, and psychology, authors use monthly and occasionally weekly controls to account for seasonality. For instance, they ask: “Does how cold it was when a coat was bought predict whether it was returned, controlling for the month of the year in which it was purchased?”
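In regression terms, that typical specification amounts to something like the minimal Stata sketch below (illustrative only; the variable names returned, temp, and purchase_date are assumptions, not taken from any of the papers):

    * Common approach: predict returns from same-day temperature,
    * controlling for seasonality only with month-of-year dummies
    gen month = month(purchase_date)    // 1-12, from a Stata date variable
    regress returned temp i.month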

That’s not enough.
The figures below show the average daily temperature in Philadelphia, along with the estimates provided by monthly (left panel) and weekly (right panel) fixed effects. These figures remind us that the weather does not jump discretely from month to month or week to week. Rather, weather, like the earth, moves continuously. This means that seasonal confounds, which are continuous, will survive discrete (monthly or weekly) controls.

[Figure: average daily temperature in Philadelphia, with monthly (left) and weekly (right) fixed-effect estimates overlaid]

The vertical distance between the blue lines (monthly/weekly dummies) captures the residual seasonality confound. For example, during March (just left of the ‘100 day’ tick), the monthly dummy assigns 44 degrees to every March day, but temperature systematically fluctuates within March, from a long-term average of 39 degrees on March 1st to a long-term average of 50 degrees on March 31st. This is a seasonally confounded 11-degree difference that is entirely unaccounted for by monthly dummies.

The confounded effect of seasonality that survives weekly dummies is roughly 1/4 that size.

Fixing it.
The easy solution is to control for the historical average of the weather variable of interest for each calendar date.[2]

For example, when using how cold January 24, 2013 was to predict whether a coat bought that day was eventually returned, we include as a covariate the historical average temperature for January 24th (in that city).[3]
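As a rough sketch of what that looks like in practice (again with assumed variable names, and assuming a separate file, hist_temps.dta, that holds each city’s long-run average temperature for every calendar date):

    * Merge in the long-run average temperature for each city and calendar date,
    * then add it as a covariate alongside the usual month dummies
    gen month = month(purchase_date)
    gen day   = day(purchase_date)
    merge m:1 city month day using hist_temps, keep(match) nogenerate
    regress returned temp hist_avg_temp i.month   // temp now reflects deviations from that date's norm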

Demonstrating the easy fix
To demonstrate how well this works, we analyze a correlation that is entirely due to a seasonal confound: the number of daylight hours in Bangkok, Thailand (sunset – sunrise), and the temperature that same day in Philadelphia (data: .dta | .csv). Colder days in Philadelphia tend to be shorter days in Bangkok, not because coldness in one place shortens the day in the other (or vice versa), but because seasonal patterns influence both variables. Properly controlling for seasonality should eliminate the association between these variables.

Using day duration in Bangkok as the dependent variable and temperature in Philly as the predictor, we threw in monthly and then weekly dummies to control for the seasonal confound. Neither technique fully succeeded, as same-day temperature survived as a significant predictor. (STATA .do)
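The linked .do file has the actual analysis; a minimal sketch of the three specifications, with assumed variable names (dayhours for Bangkok day duration, temp for Philadelphia temperature, date for the calendar date), would look like this:

    gen month      = month(date)
    gen week       = week(date)
    gen dayofmonth = day(date)
    * long-run average Philadelphia temperature for each calendar date,
    * computed here across the years in the sample
    bysort month dayofmonth: egen hist_temp = mean(temp)

    regress dayhours temp i.month       // monthly dummies: temp survives as a significant predictor
    regress dayhours temp i.week        // weekly dummies: temp still survives
    regress dayhours temp hist_temp     // historical-average control: the spurious association disappears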

[Table 2: day duration in Bangkok regressed on same-day Philadelphia temperature, under each seasonal control]

Thus, using monthly and weekly dummy variables made it seem as if, over and above the effects of seasonality, colder days are more likely to be shorter. However, controlling for the historical average daily temperature showed, correctly, that seasonality is the sole driver of this relationship.


Original author feedback:
We shared a draft of this post with the authors of all 10 papers in Table 1 and heard back from 5 of them. Their feedback led to correcting errors in Table 1, changing the title of the post, and fixing the day-duration example (Table 2). Devin Pope, moreover, conducted our suggested analysis on his convertible-purchases (QJE) paper and shared the results with us. The finding is robust to our suggested additional control. Devin thought it was valuable to highlight that while the historical temperature average is a better control for weather-based seasonality, reducing bias, weekly/monthly dummies help with noise from other seasonal factors such as holidays. We agreed. Best practice, in our view, would be to include time dummies at the finest granularity permitted by the data to reduce noise, and to include the daily historical average to reduce the seasonal confound of weather variation.




Footnotes.

  1. Uri created the list by starting with the most well-cited observational weather paper he knew – Hirshleifer & Shumway – and then selected papers citing it in the Web of Science and published in journals he recognized.
  2. Another solution is to use daily dummies. This option can easily be worse, because it can lower statistical power by throwing away data. First, daily fixed effects can only be applied to data with at least two observations per calendar date. Second, this approach ignores historical weather data that precedes the dependent variable; for example, if the analysis uses sales data from 2013-2015, daily fixed effects force us to ignore weather data from any prior year. Lastly, it ‘costs’ 365 degrees of freedom (don’t forget leap year), instead of 1.
  3. Uri has two weather papers. They both use this approach to account for seasonality.

[37] Power Posing: Reassessing The Evidence Behind The Most Popular TED Talk

A recent paper in Psych Science (.pdf) reports a failure to replicate the study that inspired a TED Talk that has been seen 25 million times. [1]  The talk invited viewers to do better in life by assuming high-power poses, just like Wonder Woman’s below, but the replication found that power-posing was inconsequential.

[Image: Wonder Woman in an expansive power pose]

If an original finding is a false positive then its replication is likely to fail, but a failed replication need not imply that the original was a false positive. In this post we try to figure out why the replication failed.

Original study
Participants in the original study took an expansive “high-power” pose or a contractive “low-power” pose (Psych Science 2010, .pdf).

[Image: the high-power and low-power poses used in the original study]

The power-posing participants reportedly felt more powerful, sought more risk, and had higher testosterone and lower cortisol levels. In the replication, power posing affected self-reported power (the manipulation check), but did not impact behavior or hormone levels. [2] The key point of the TED Talk, that power poses “can significantly change the outcomes of your life” (minute 20:10; video; transcript .html), was not supported.

Was The Replication Sufficiently Precise?
Whenever a replication fails, it is important to assess how precise the estimates are. A noisy estimate may be consistent with no effect (i.e., not significant) but also with effects large enough that it does not challenge the original study.

Only precisely estimated non-significant results contradict an original finding (see Colada[7] for an example of a strikingly imprecise replication).

Here are the original and replication confidence intervals for the risk-taking measure: [3]

[Figure: original and replication confidence intervals for the effect of power posing on risk taking]

This figure shows that the replication is precise enough to be informative.

The upper end of the confidence interval is d=.06, meaning that the data are consistent with zero and inconsistent with power poses affecting risk-taking by more than 6% of a standard deviation. We can put in perspective how small that upper bound is by noting that, with just n=21 per cell, the original study would have had a meager 5.6% chance of detecting it (i.e., <6% statistical power).
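That power figure is easy to check (a sketch assuming a two-sample t-test with n=21 per cell, d=.06, and a two-sided alpha of .05):

    * Power of the original design to detect an effect of d = .06
    power twomeans 0 .06, sd(1) n1(21) n2(21)
    * power comes out just under 6%, i.e., barely above the 5% false-positive rate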

Thus, even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it. (For more on this approach to thinking about replications check out Uri’s Small Telescopes paper (.pdf)).

The same is true for the effects of power poses on hormones:

[Figure: original and replication confidence intervals for the effects on testosterone and cortisol]

This implies that there is either a moderator or the original is a false positive.

Moderators
In their response (.pdf), the original authors reviewed every published study on power posing they could locate (they found 33), seeking potential moderators, factors that may affect whether power posing works. For example, they speculated that power posing may not have worked in the replication because participants were told that power poses could influence behavior. (One apparent implication of this hypothesis is that watching the power poses TED talk would make power poses ineffective).

The authors also list all differences in design they noted between the original and replication studies (see their nice summary .pdf).

One approach would be to run a series of studies to systematically manipulate these hypothesized moderators to see whether they matter.

But before spending valuable resources on that, it is necessary to first establish whether there is reason to believe, based on the published literature, that power posing is ever effective. Might it instead be the case that the original findings are false positives? [4]

P-curve
It may seem that 30-some successful studies are enough to conclude the effect is real. However, we need to account for selective reporting. If studies only get published when they show an effect, the fact that all the published evidence shows an effect is not diagnostic.

P-curve is just the tool for this. It tells us whether we can rule out selective reporting as the sole explanation for a set of p<.05 findings (see p-curve paper .pdf).

We conducted a p-curve analysis on the 33 studies that the original authors cited in their reply as evidence that power posing works. They come from many labs around the world.

If power posing were a true effect, we should see a curve that is right-skewed, tall on the left and low on the right. If power posing had no effect, we would expect a flat curve.
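A small simulation conveys that intuition (a sketch, not the p-curve app itself; the shift of 2.8 in the true-effect case is an arbitrary choice corresponding to roughly 80% power):

    * Distribution of significant (p < .05) p-values under a null vs. a true effect
    clear
    set seed 123
    set obs 100000
    gen z_null = rnormal(0, 1)                // test statistic when the true effect is zero
    gen z_true = rnormal(2.8, 1)              // test statistic when a true effect exists
    gen p_null = 2 * normal(-abs(z_null))     // two-sided p-values
    gen p_true = 2 * normal(-abs(z_true))
    histogram p_null if p_null < .05, bin(5)  // flat: under the null, significant p-values are uniform
    histogram p_true if p_true < .05, bin(5)  // right-skewed: under a true effect, small p-values pile up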

[Figure: expected p-curves for a true effect (right-skewed) vs. a nonexistent effect (flat)]

The actual p-curve for power posing is flattish, definitely not right-skewed.

[Figure: observed p-curve for the 33 power-posing studies]

Note: you should ignore p-curve results that do not include a disclosure table; here is ours (.xlsx).

Consistent with the replication motivating this post, p-curve indicates that either power-posing overall has no effect, or the effect is too small for the existing samples to have meaningfully studied it. Note that there are perfectly benign explanations for this: e.g., labs that ran studies that worked wrote them up; labs that ran studies that didn’t, didn’t. [5]

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.



Author feedback.
We shared an early draft of this post with the authors of the original and failed replication studies. We also shared it with Joe Cesario, author of a few power-posing studies with whom we had discussed this literature a few months ago.

Note: It is our policy not to comment; they get the last word.

Amy Cuddy (.html), co-author of the original study and TED talk speaker, provided useful suggestions for clarifications and modifications that led to several revisions, including a new title. She also sent this note:

I’m pleased that people are interested in discussing the research on the effects of adopting expansive postures. I hope, as always, that this discussion will help to deepen our understanding of this and related phenomena, and clarify directions for future research. Given that our quite thorough response to the Ranehill et al. study has already been published in Psychological Science, I will direct you to that (.html). I respectfully disagree with the interpretations and conclusions of Simonsohn et al., but I’m considering these issues very carefully and look forward to further progress on this important topic.

Roberto Weber (.html), co-author of the failed replication study, sent Uri a note, co-signed with all co-authors, which he allowed us to post (see it here .pdf). They make four points; we quote, attempting to summarize:

(1) none of the 33 published studies, other than the original and our replication, study the effect […] on hormones
(2) even within [the original] the hormone result seems to be presented inconsistently
(3) regarding differences in designs [from Table 2 in their response], […] the evidence does not support many of these specific examples as probable moderators in this instance
(4) we employed an experimenter-blind design […] this might be the factor that most plausibly serves as a moderator of the effect

Again, full response: .pdf.

Joe Cesario (.html):

Excellent post, very clear and concise. Our lab has also been concerned with some of the power pose claims–less about direct replicability and more about the effectiveness in more realistic situations (motivating our Cesario & McDonald paper and another paper in prep with David Johnson). These concerns also led us to contact a number of the most cited power pose researchers to invite them for a multi-site, collaborative replication and extension project. Several said they could not commit the resources to such an endeavor, but several did agree in principle. Although progress has stalled, we are hopeful about such a project moving forward. This would be a useful and efficient way of providing information about replicability as well as potential moderation of the effects.




Footnotes.

  1. This TED talk is ranked 2nd in overall downloads and 1st in per-year downloads. The talk also shared a very touching personal story of how the speaker overcame an immense personal challenge; this post is concerned exclusively with the science part of the talk.
  2. The original authors consider “feelings of power” to be a manipulation check rather than an outcome measure. In their most recent paper they write, “as a manipulation check, participants reported how dominant, in control, in charge, powerful, and like a leader they felt on a 5-point scale” (.pdf).
  3. The definition of Small Effect in the figure corresponds to an effect that would give the original sample size 33% power; see Uri’s Small Telescopes paper (.pdf).
  4. ***See Simine Vazire’s excellent post on why it’s wrong to always ask authors to explain failures to replicate by proposing moderators rather than concluding that the original is a false positive (.html). (The *** is a gift for Simine.)
  5. Because the replication obtained a significant effect of power posing on the manipulation check, self-reported power, we constructed a separate p-curve including only the 7 manipulation-check results. The resulting p-curve was directionally right-skewed (p=.075). We interpret this as further consistency between p-curve and the replication results. If from studies reporting both manipulation checks and the effect of interest we select only the manipulation check, and from other studies we select the effect of interest (something we believe is not reasonable but report anyway in case a reader disagrees and for the sake of robustness), the overall conclusions do not change. The overall p-curve still looks flat, does not conclude there is evidential value (Z=.84, p=.199), and does conclude the curve is flatter than 33% (Z=2.05, p=.0203).