In my "Small Telescopes" paper, I introduced a new approach to evaluate replication results (SSRN). Among other examples, I described two studies as having failed to replicate the famous Schwarz and Clore (1983) finding that people report being happier with their lives when asked on sunny days.
Figure and text from Small Telescopes paper (SSRN)
I recently had an email exchange with a senior researcher (not involved in the original paper) who persuaded me I should have been more explicit regarding the design differences between the original and replication studies. If my paper weren't published I would add a discussion of such differences and would explain why I don't believe these can explain the failures to replicate.
Because my paper is already published, I write this post instead.
The 1983 study
This study is so famous that a paper telling the story behind it (.pdf) has over 450 Google cites. It is among the top-20 most cited articles published in JPSP and the most cited by either (superstar) author.
In the original study a research assistant called University of Illinois students either during the "first two sunny spring days after a long period of gray, overcast days", or during two rainy days within a "period of low-hanging clouds and rain" (p. 298, .pdf).
She asked about life satisfaction and then current mood. At the beginning of the phone conversation, she either did not mention the weather, mentioned it in passing, or described it as being of interest to the study.
The reported finding is that "respondents were more satisfied with their lives on sunny than rainy days—but only when their attention was not drawn to the weather" (p. 298; .pdf).
Feddersen et al. (.pdf) matched weather data to the Australian Household Income Survey, which includes a question about life satisfaction. With 90,000 observations, the effect was basically zero.
There are at least three notable design differences between the original and replication studies: 
1. Smaller causes have smaller effects. The 1983 study focused on days on which weather was expected to have large mood effects; the Australian sample used the whole year. The first sunny day in spring is not like the 53rd sunny day of summer.
2. Already attributed. Respondents in the Australian survey answered many questions before reporting their life satisfaction, possibly misattributing their mood to something else.
3. Noise. The representative sample is more diverse than a sample of college undergrads; the data are therefore noisier, making any effect harder to detect.
Often this is where discussions of failed replications end—with the enumeration of potential moderators, and the call for more and better data. I'll try to use the data we already have to assess whether any of the differences are likely to matter.
Design difference 1. Smaller causes.
If weather contrasts were critical for altering mood and hence possibly happiness, then the effect in the 1983 study should be driven by the first sunny day in spring, not the Nth rainy day. But a look at the bar chart above shows the opposite: People were NOT happier the first sunny day of spring; they were unhappier on the rainy days. Their description of these days again: 'and the rainy days we used were several days into a new period of low-hanging clouds and rain.' (p. 298, .pdf)
The days driving the effect, then, were similar to previous days. Because of how seasons work, most days in the replication studies presumably were also similar to the days that preceded them (sunny after sunny and rainy after rainy), and so on this point the replication does not seem different or problematic.
Second, Lucas and Lawless (JPSP 2014, .pdf) analyzed a large (N=1 million) US sample and also found no effect of weather on life satisfaction. Moreover, they explicitly assessed whether unseasonably cloudy/sunny days, or days with sunshine that differed from recent days, were associated with bigger effects. They were not (see their Table 3).
Third, the effect size Schwarz and Clore report is enormous: 1.7 points on a 1-10 scale. To put that in perspective, from other studies, we know that the life satisfaction gap between people who got married vs. people who became widows over the past year is about 1.5 on the same scale (see Figure 1, Lucas 2005 .html). Life vs. death are estimated as less impactful than precipitation. Even if the effect were smaller on days not as carefully selected as those by Schwarz and Clore, the 'replications' averaging across all days should still have detectable effects.
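To make "detectable" concrete, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions of mine, not on figures from either paper: observations are treated as independent, they are split evenly between sunny and rainy days, the two life-satisfaction scales are treated as comparable, and the conventional 5% alpha and 80% power are used. The SD of 1.52 and the roughly 90,000 observations are the Feddersen et al. numbers quoted in this post; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_difference(sd, n_total, alpha=0.05, power=0.80):
    """Smallest true difference in means a two-sided z-test would detect
    with the given power, assuming two equal-sized independent groups."""
    n_per_group = n_total / 2
    # Standard error of the difference between two group means.
    se = sd * sqrt(2 / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * se


# Assumed inputs: SD and N as quoted in this post for Feddersen et al.
mde = min_detectable_difference(sd=1.52, n_total=90_000)
original_effect = 1.7  # points, as reported by Schwarz and Clore

print(f"Minimum detectable difference: {mde:.3f} scale points")
print(f"Original effect is about {original_effect / mde:.0f} times larger")
```

Under these simplifying assumptions the minimum detectable difference comes out at roughly 0.03 scale points, so even an effect one tenth the size of the original 1.7 would sit well above it. The panel structure of the real data violates the independence assumption, so this is only an order-of-magnitude check.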
The large effect is particularly surprising considering it is the downstream effect of weather on mood, and that effect is really tiny (see Tal Yarkoni's blog review of a few studies .html).
Design difference 2. Already attributed.
This concern, recall, is that people answering many questions in a survey may misattribute their mood to earlier questions. This makes sense, but the concern applies to the original as well.
The phone call from Schwarz & Clore's RA does not come immediately after the "mood induction" either; rather, participants get the RA's phone call hours into a rainy vs. sunny day. Before the call they presumably made evaluations too, answering questions like "How are you and Lisa doing?" "How did History 101 go?" "Man, don't you hate Champaign's weather?" etc. Mood could have been misattributed to any of these earlier judgments in the original as well. Our participants' experiences do not begin when we start collecting their data.
Design difference 3. Noise.
This concern is that the more diverse sample in the replication makes it harder to detect any effect. If the replication were noisier, we would expect the dependent variable to have a higher standard deviation (SD). For life satisfaction, Schwarz and Clore got about SD=1.69; Feddersen et al., SD=1.52. So there is less noise in the replication. Moreover, the replication has panel data and controls for individual differences via fixed effects. These account for 50% of the variance, so the replication has spectacularly less noise.
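The arithmetic behind that comparison can be sketched as follows. This is a rough illustration that assumes the 50% share explained by fixed effects applies directly to the variance of life satisfaction with SD=1.52; the residual SD it prints is my own back-of-the-envelope figure, not one reported by Feddersen et al., and the function name is hypothetical.

```python
from math import sqrt

sd_original = 1.69           # Schwarz and Clore, backed out from test statistics
sd_replication = 1.52        # Feddersen et al.
share_fixed_effects = 0.50   # share of variance absorbed by individual fixed effects


def residual_sd(sd, share_explained):
    """SD left over once a given share of the variance is accounted for."""
    return sd * sqrt(1 - share_explained)


sd_within = residual_sd(sd_replication, share_fixed_effects)
variance_ratio = sd_within**2 / sd_original**2

print(f"Residual SD in the replication: {sd_within:.2f}")
print(f"Residual variance relative to the original: {variance_ratio:.0%}")
```

Under that assumption the residual SD is about 1.07, which leaves the replication with roughly 40% of the variance the original study had to contend with.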
Concluding bullet points.
– The existing data are overwhelmingly inconsistent with current weather affecting reported life satisfaction.
– This does not imply the theory behind Schwarz and Clore (1983), mood-as-information, is wrong.
I sent a draft of this post to Richard Lucas (.htm) who provided valuable feedback and additional sources. I also sent a draft to Norbert Schwarz (.htm) and Gerald Clore (.htm). They provided feedback that led me to clarify when I first identified the design differences between the original and replication studies (back in 2013, see footnotes 1&2). They turned down several invitations to comment within this post.
- The first two were mentioned in the first draft of my paper but I unfortunately cut them out during a major revision, around May 2013. The third was proposed in February of 2013 in a small mailing list discussing the first talk I gave of my Small Telescopes paper. [↩]
- There is also the issue, as Norbert Schwarz pointed out to me in an email in May of 2013, that the 1983 study is not about weather nor life satisfaction, but about misattribution of mood. The 'replications' do not even measure mood. I believe we can meaningfully discuss whether the effect of rain on happiness replicates without measuring mood; in fact, the difficulty of manipulating mood via weather is one thing that makes the original finding surprising. [↩]
- What one needs to explain the differences via the presence of other questions is that mood effects from weather replenish through the day, but not immediately. So on sunny days at 7AM I think my cat makes me happier than usual, and then at 10AM that my calculus teacher's jokes are funnier than usual, but if the joke had been told at 7:15AM I would not have found it funny because I had already attributed my mood to the cat. This is possible. [↩]
- Schwarz and Clore did not report SDs, but one can compute them off the reported test statistics. See Supplement 2 for Small Telescopes .pdf. [↩]
- See R2 in Feddersen et al's Table A1, column 4 vs 3, .pdf [↩]