[48] P-hacked Hypotheses Are Deceivingly Robust

Sometimes we selectively report the analyses we run to test a hypothesis.
Other times we selectively report which hypotheses we tested.

One popular way to p-hack hypotheses involves subgroups. Upon realizing analyses of the entire sample do not produce a significant effect, we check whether analyses of various subsamples — women, or the young, or Republicans, or extroverts — do. Another popular way is to get an interesting dataset first, and figure out what to test with it second [1].


For example, a researcher gets data from a spelling bee competition and asks: Is there evidence of gender discrimination? How about race? Peer-effects? Saliency? Hyperbolic discounting? Weather? Yes! Then s/he writes a paper titled “Weather & (Spelling) Bees” as if that were the only hypothesis tested [2]. The odds of obtaining at least one p<.05 when testing these six hypotheses are 26% rather than the nominal 5% [3].

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because here the problem lies with the hypothesis itself, robustness checks do not address it [4].

Example: Odd numbers and the horoscope
To demonstrate the problem I conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I might motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined if respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows that this implausible hypothesis was supported by the data, p<.01 (STATA code). [5]

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.
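To see the shape of such a specification in code, here is a minimal sketch in R. Everything in it is hypothetical: the file name, the variable names (horoscope, id, age, sex, educ, region), and the use of a linear probability model are placeholders rather than the actual GSS variables or the STATA code linked above.

# Sketch of an odd-ID specification (all names are placeholders, not actual GSS variables)
library(haven)                                   # read_dta() reads Stata files
gss <- read_dta("gss2010.dta")                   # hypothetical file name
gss$odd_id <- as.numeric(gss$id %% 2 == 1)       # respondent ID is odd (randomly assigned)
m1 <- lm(horoscope ~ odd_id, data = gss)                              # column 1: no covariates
m2 <- lm(horoscope ~ odd_id + age + sex + educ + region, data = gss)  # add covariates
summary(m1); summary(m2)                         # compare the odd_id estimate across columns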

How to deal with p-hacked hypotheses?
Replications are the obvious way to tease apart true from false positives. Direct replications, testing the same prediction in new studies, are often not feasible with observational data.  In experimental psychology it is common to instead run conceptual replications, examining new hypotheses based on the same underlying theory.  We should do more of this in non-experimental work. One big advantage is that with rich data sets we can often run conceptual replications on the same data.

To do a conceptual replication, we start from the theory behind the hypothesis, say “odd numbers prompt use of less traditional sources of information” and test new hypotheses. For example, this theory may predict that odd numbered respondents are more likely to read blogs instead of academic articles, read nutritional labels from foreign countries, or watch niche TV shows [6].

Conceptual replications should be statistically independent from the original (under the null). [7]
That is to say, if an effect we observe is a false positive, the probability that the conceptual replication obtains p<.05 should be 5%. An example that would violate this would be testing if respondents with odd numbers are more likely to consult tarot readers. If by chance many superstitious individuals were assigned an odd number by the GSS, they will both read the horoscope and consult tarot readers more often. Not independent under the null, hence not a good conceptual replication with the same data.
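A quick simulation makes the point concrete. This is a sketch with made-up data, not the GSS: “superstition” is an invented trait that drives both outcomes, and the odd ID has no true effect on anything.

# Tarot is not a valid conceptual replication of horoscope on the same data:
# the two tests are dependent under the null (simulated data, not the GSS)
set.seed(123)
one_dataset <- function(n = 1000) {
  odd_id       <- rbinom(n, 1, .5)                          # randomly assigned; true effect = 0
  superstition <- rnorm(n)                                   # invented trait driving both outcomes
  horoscope    <- rbinom(n, 1, plogis(3 * superstition))
  tarot        <- rbinom(n, 1, plogis(3 * superstition - 1))
  c(p_horoscope = summary(lm(horoscope ~ odd_id))$coefficients[2, 4],
    p_tarot     = summary(lm(tarot ~ odd_id))$coefficients[2, 4])
}
p <- replicate(2000, one_dataset())
mean(p["p_tarot", ] < .05)                                   # unconditionally: about 5%, as it should be
mean(p["p_tarot", p["p_horoscope", ] < .05] < .05)           # given a horoscope false positive: well above 5%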

Moderation
A closely related alternative is also commonly used in experimental psychology: moderation. Does the effect get smaller/larger when the theory predicts it should?

For example, I once examined how the price of infant carseats sold on eBay responded to a new safety rating by Consumer Reports (CR), and to its retraction (surprisingly, the retraction was completely effective, .pdf). A referee noted that if the effects  were indeed caused by CR information, they should be stronger for new carseats, as CR advises against buying used ones. If I had a false-positive in my hands we would not expect moderation to work (it did).

Summary
1. With field data it’s easy to p-hack hypotheses.
2. The resulting false-positive findings will be robust to alternative specifications.
3. Tools common in experimental psychology, conceptual replications and testing moderation, are viable solutions.



Footnotes.
  1. As with most forms of p-hacking, selectively reporting hypotheses typically does not involve willful deception. []
  2. I chose weather and spelling bee as an arbitrary example. Any resemblance to actual papers is seriously unintentional. []
  3. (1-.95^6)=.2649 []
  4. Robustness tests may help with the selective reporting of hypotheses if a spurious finding is obtained due to specification rather than sampling error. []
  5. This finding is necessarily false-positive because ID numbers are assigned after the opportunity to read the horoscope has passed, and respondents are unaware of the number they have been assigned to; but see Bem (2011 .htm) []
  6. This opens the door to more selective reporting as a researcher may attempt many conceptual replications and report only the one(s) that worked. By virtue of using the same dataset to test a fixed theory, however, this is relatively easy to catch/correct if reviewers and readers have access to the set of variables available to the researcher and hence can at least partially identify the menu of conceptual replications available. []
  7. Red font clarification added after tweet from Sanjay Srivastava .htm []

[47] Evaluating Replications: 40% Full ≠ 60% Empty

Last October, Science published the paper “Estimating the Reproducibility of Psychological Science” (.pdf), which reported the results of 100 replication attempts. Today it published a commentary by Gilbert et al. (.pdf) as well as a response by the replicators (.pdf).

The commentary makes two main points. First, because of sampling error, we should not expect all of the effects to replicate even if all of them were true. Second, differences in design between original studies and replication attempts may explain differences in results. Let’s start with the latter.[1]

Design differences
The commentators provide some striking examples of design differences. For example, they write, “An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon” (p. 1037).

People can debate if such differences can explain the results (and in their reply, the replicators explain why they don’t think so). However, for readers to consider whether design differences matter, they first need to know those differences exist. I, for one, was unaware of them before reading Gilbert et al. (They are not mentioned in the 6 page Science article .pdf, nor 26 page supplement .pdf). [2]

This is not about pointing fingers, as I have also made this mistake: I did not sufficiently describe differences between original and replication studies  in my Small Telescopes paper (see Colada [43]).

This is also not about taking a position on whether any particular difference is responsible for any particular discrepancy in results. I have no idea. Nor am I arguing design differences are a problem per se; in most cases they were even approved by the original authors.

This is entirely about improving the reporting of replications going forward. After reading the commentary I better appreciate the importance of prominently disclosing design differences. This better enables readers to consider the consequences of such differences, while encouraging replicators to anticipate and address, before publication, any concerns they may raise. [3]

Noisy results
I am also sympathetic to the commentators’ other concern, which is that sampling error may explain the low reproducibility rate. Their statistical analyses are not quite right, but neither are those by the replicators in the reproducibility project.

A study result can be imprecise enough to be consistent both with an effect existing and with it not existing. (See Colada[7] for a remarkable example from Economics). Clouds are consistent with rain, but also consistent with no rain. Clouds, like noisy results, are inconclusive.

The replicators interpreted inconclusive replications as failures, the commentators as successes. For instance, one of the analyses by the replicators considered replications as successful only if they obtained p<.05, effectively treating all inconclusive replications as failures. [4]

Both sets of authors examined whether the results from one study were within the confidence interval of the other, selectively ignoring sampling error of one or the other study.[5]

In particular, the replicators deemed a replication successful if the original finding was within the confidence interval of the replication. Among other problems this approach leads most true effects to fail to replicate with sufficiently big replication samples.[6]

The commentators, in contrast, deemed replications successful if their estimate was within the confidence interval of the original. Among other problems, this approach leads too many false-positive findings to survive most replication efforts.[7]

For more on these problems with effect size comparisons, see p. 561 in “Small Telescopes” (.pdf).
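A small simulation helps see both problems. This is a sketch with stylized numbers (invented true effects and per-cell sample sizes, and a normal approximation for the sampling distribution of d), not the reproducibility-project data.

# Sketch of the two confidence-interval criteria (stylized numbers, normal approximation for d)
set.seed(1)
se <- function(n) sqrt(2 / n)                               # approx. s.e. of Cohen's d, n per cell

# Criterion: is the original inside the replication's CI? (see footnote 6)
d_true <- .3; n_orig <- 20; n_rep <- 5000
d_orig <- rnorm(20000, d_true, se(n_orig))
d_orig <- d_orig[d_orig / se(n_orig) > 1.96]                # publication bias: only significant originals
d_rep  <- rnorm(length(d_orig), d_true, se(n_rep))
mean(abs(d_orig - d_rep) < 1.96 * se(n_rep))                # ~0: a true effect almost always "fails"

# Criterion: is the replication inside the original's CI? (see footnote 7)
n_orig <- 50; n_rep <- 200
d_orig <- 1.965 * se(n_orig)                                # a just-significant (p ~ .049) false positive
d_rep  <- rnorm(20000, 0, se(n_rep))                        # the true effect is zero
mean(abs(d_rep - d_orig) < 1.96 * se(n_orig))               # ~.5: half of false positives "survive"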

Accepting the null
Inconclusive replications are not failed replications.

For a replication to fail, the data must support the null. They must affirm the non-existence of a detectable effect. There are four main approaches to accepting the null (see Colada [42]). Two lend themselves particularly well to evaluating replications:

(i) Small Telescopes (.pdf): Test whether the replication rejects effects big enough to be detectable by the original study, and (ii) Bayesian evaluation of replications (.pdf).

These are philosophically and mathematically very different, but in practice they often agree. In Colada [42] I reported that for this very reproducibility project, the Small Telescopes and the Bayesian approach are correlated r = .91 overall, and r = .72 among replications with p>.05. Moreover, both find that about 30% of replications were inconclusive. (R Code).  [8],[9]

40% full is not 60% empty
The opening paragraph of the response by the replicators reads:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled”

They are saying the glass is 40% full.  They are not explicitly saying it is 60% empty. But readers may be forgiven for jumping to that conclusion, and they almost invariably have.  This opening paragraph would have been equally justified:
“[…] the Open Science Collaboration observed that the original result failed to replicate in ~30 of 100 studies sampled”

It would be much better to fully report:
“[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled, failed to replicate in ~30, and that the remaining ~30 replications were inconclusive.”

Summary
1. Replications must be analyzed in ways that allow for results to be inconclusive, not just success/fail
2. Design differences between original and replication should be prominently disclosed.



Author feedback.
I shared a draft of this post with Brian Nosek, Dan Gilbert and Tim Wilson, and invited them and their co-authors to provide feedback. I exchanged over 20 emails total with 7 of them. Their feedback greatly improved, and considerably lengthened, this post. Colada Co-host Joe Simmons provided lots of feedback as well.  I kept editing after getting feedback from all of them, so the version you just read is probably worse and surely different from the versions any of them commented on.


Concluding remarks
My views on the state of social science and what to do about it are almost surely much closer to those of the reproducibility team than to those of the authors of the commentary. But. A few months ago I came across a “Rationally Speaking” podcast (.htm) by Julia Galef (relevant part of transcript starts on page 7, .pdf) where she talks about debating with a “steel-man,” as opposed to a straw-man, version of an argument. It changed how I approach disagreements. For example, the Gilbert et al. commentary opens with what appears to be an incorrectly calculated probability. One could straw-man argue against the commentary by focusing on that calculation. But the argument that probability is meant to support does not hinge on precisely estimating it. There are other weak links in the commentary, but its steel-man version, the one focusing on its strengths rather than its weaknesses, did make me think better about the issues at hand, and I ended up with what I think is an improved perspective on replications.

We are greatly indebted to the collaborative work of 100s of colleagues behind the reproducibility project, and to Brian Nosek for leading that gargantuan effort (as well as many other important efforts to improve the transparency and replicability of social science). This does not mean we should not try to improve on it or to learn from its shortcomings.


 



Footnotes.
  1. The commentators  actually focus on three issues: (1) (Sampling) error, (2) Statistical power, and (3) Design differences. I treat (1) and (2) as the same problem []
  2. However, the 100 detailed study protocols are available online (.htm), and so people can identify them by reading those protocols. For instance, here (.htm) is the (8 page) protocol for the military vs honeymoon study. []
  3. Brandt et al (JESP 2014) understood the importance of this long before I did, see their ‘Replication Recipe’ paper .pdf []
  4. Any true effect can fail to replicate with a small enough sample, a point made in most articles making suggestions for conducting and evaluating replications, including Small Telescopes (.pdf). []
  5. The original paper reported 5 tests of reproducibility: (i) Is the replication p<.05?, (ii) Is the original within the confidence interval of the replication?, (iii) Does the replication team subjectively rate it as successful vs failure? (iv) Is the replication directionally smaller than the original? and (v) Is the average of original and replication significantly different from zero? In the post I focus only on (i) and (ii) because: (iii)  is not a statistic with evaluative properties (but in any case, also does not include an ‘inconclusive bin’), and neither (iv) nor (v) measure reproducibility.  (iv) Measures publication bias (with lots of noise), and I couldn’t say what (v) measures. []
  6. Most true findings are inflated due to publication bias, so the unbiased estimate from the replication will eventually reject it []
  7. For example, the prototypically p-hacked p=.049 finding has a confidence interval that nearly touches zero. To obtain a replication outside that confidence interval, therefore, we need to observe a negative estimate. If the true effect is zero, that will happen only 50% of the time, so about half of false-positive p=.049 findings would survive replication attempts []
  8. Alex Etz in his blog post did the Bayesian analyses long before I did and I used his summary dataset, as is, to run my analyses. See his PLOS ONE paper, .htm. []
  9. The Small Telescope approach finds that only 25% of replications conclusively failed to replicate, whereas the Bayesian approach says this number is about 37%. However, several of the disagreements come from results that barely accept or don’t accept the null, so the two agree more than these two figures suggest. In the last section of Colada[42] I explain what causes disagreements between the two. []

[46] Controlling the Weather

Behavioral scientists have put forth evidence that the weather affects all sorts of things, including the stock market, restaurant tips, car purchases, product returns, art prices, and college admissions.

It is not easy to properly study the effects of weather on human behavior. This is because weather is (obviously) seasonal, as is much of what people do. This means that any investigation of the relation between weather and behavior must properly control for seasonality.

For example, in the U.S., Google searches for “fireworks” correlate positively with temperature throughout the year, but only because July 4th is in the summer. This is a seasonal effect, not a weather effect.
Almost every weather paper tries to control for seasonality. This post shows they don’t control enough.

How do they do it?
To answer this question, we gathered a sample of 10 articles that used weather as a predictor. [1]
In economics, business, statistics, and psychology, authors use monthly and occasionally weekly controls to account for seasonality. For instance they ask, “Does how cold it was when a coat was bought predict if it was returned, controlling for the month of the year in which it was purchased?”

That’s not enough.
The figures below show the average daily temperature in Philadelphia, along with the estimates provided by monthly (left panel) and weekly (right panel) fixed effects. These figures remind us that the weather does not jump discretely from month to month or week to week. Rather, weather, like earth, moves continuously. This means that seasonal confounds, which are continuous, will survive discrete (monthly or weekly) controls.

The vertical distance between the blue lines (monthly/weekly dummies) captures the residual seasonality confound. For example, during March (just left of the ‘100 day’ tick), the monthly dummy assigns 44 degrees to every March day, but temperature systematically fluctuates within March, from a long-term average of 39 degrees on March 1st to a long-term average of 50 degrees on March 31st. This is a seasonally confounded 11-degree difference that is entirely unaccounted for by monthly dummies.

The confounded effect of seasonality that survives weekly dummies is roughly 1/4 that size.

Fixing it.
The easy solution is to control for the historical average of the weather variable of interest for each calendar date.[2]

For example, when using how cold January 24, 2013 was to predict whether a coat bought that day was eventually returned, we include as a covariate the historical average temperature for January 24th  (in that city).[3]
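In code, the fix is one extra covariate. Here is a minimal sketch in R, assuming a data frame d with columns date, temp, and an outcome returned; these names are placeholders, and in a real application the historical average would ideally come from long-term weather records rather than from the estimation sample itself.

# Add the historical daily average of the weather variable as a covariate (placeholder names)
d$doy      <- as.numeric(format(d$date, "%j"))        # calendar day of year (1-366)
d$hist_avg <- ave(d$temp, d$doy)                      # average temperature for that calendar day
summary(lm(returned ~ temp + hist_avg, data = d))     # weather effect net of the seasonal norm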

Demonstrating the easy fix
To demonstrate how well this works, we analyze a correlation that is entirely due to a seasonal confound: the number of daylight hours in Bangkok, Thailand (sunset – sunrise), and the temperature that same day in Philadelphia (data: .dta | .csv). Colder days in Philadelphia tend to be shorter days in Bangkok, but not because coldness in one place shortens the day in the other (nor vice versa), but because seasonal patterns influence both variables. Properly controlling for seasonality should eliminate any association between these variables.

Using day duration in Bangkok as the dependent variable and temperature in Philly as the predictor, we threw in monthly and then weekly dummies to control for the seasonal confound. Neither technique fully succeeded, as same-day temperature survived as a significant predictor. (STATA .do)

Thus, using monthly and weekly dummy variables made it seem like, over and above the effects of seasonality, colder days are more likely to be shorter. However, controlling for the historical average daily temperature showed, correctly, that seasonality is the sole driver of this relationship.
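For readers who prefer R to STATA, the same logic can be sketched with simulated data. Sinusoidal “seasons” stand in for the real Bangkok and Philadelphia series, so the numbers are invented and only the qualitative pattern should match the table above.

# Two series that share only a seasonal cycle (simulated data, not the posted .dta/.csv)
set.seed(1)
dates <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
doy   <- as.numeric(format(dates, "%j"))                                  # day of year
month <- factor(format(dates, "%m"))
daylight_bkk <- 12 + 0.7 * sin(2 * pi * (doy - 80) / 365.25)              # deterministic seasonal cycle
temp_phl     <- 55 + 25 * sin(2 * pi * (doy - 110) / 365.25) + rnorm(length(doy), 0, 8)

summary(lm(daylight_bkk ~ temp_phl + month))       # monthly dummies: temperature spuriously "survives"
hist_avg <- ave(temp_phl, doy)                     # historical average temperature for each calendar day
summary(lm(daylight_bkk ~ temp_phl + hist_avg))    # the temperature coefficient collapses toward zero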


Original author feedback:
We shared a draft of this post with authors from all 10 papers from Table 1 and we heard back from 5 of them. Their feedback led to correcting errors in Table 1, changing the title of the post, and fixing the day-duration example (Table 2). Devin Pope, moreover, conducted our suggested analysis on his convertible purchases (QJE) paper and shared the results with us. The finding is robust to our suggested additional control. Devin thought it was valuable to highlight that while historic temperature average is a better control for weather-based seasonality, reducing bias, weekly/monthly dummies help with noise from other seasonal factors such as holidays. We agreed. Best practice, in our view, would be to include time dummies to the granularity permitted by the data to reduce noise, and to include the daily historic average to reduce the seasonal confound of weather variation.




Footnotes.
  1. Uri created the list by starting with the most well-cited observational weather paper he knew – Hirshleifer & Shumway – and then selected papers citing it in the Web of Science and published in journals he recognized. []
  2. Another is to use daily dummies. This option can easily be worse. It can lower statistical power by throwing away data. First, one can only apply daily fixed effects to data with at least two observations per calendar date. Second, this approach ignores historical weather data that precedes the dependent variable. For example, if using sales data from 2013-2015 in the analyses, the daily fixed effects force us to ignore weather data from any prior year. Lastly, it ‘costs’ 365 degrees-of-freedom (don’t forget leap year), instead of 1. []
  3. Uri has two weather papers. They both use this approach to account for seasonality. []

[45] Ambitious P-Hacking and P-Curve 4.0

In this post, we first consider how plausible it is for researchers to engage in more ambitious p-hacking (i.e., past the nominal significance level of p<.05). Then, we describe how we have modified p-curve (see app 4.0) to deal with this possibility.

Ambitious p-hacking is hard.
In “False-Positive Psychology” (SSRN), we simulated the consequences of four (at the time acceptable) forms of p-hacking. We found that the probability of finding a statistically significant result (p<.05) skyrocketed from the nominal 5% to 61%.

For a recently published paper, “Better P-Curves” (.pdf), we modified those simulations to see how hard it would be for p-hackers to keep going past .05. We found that p-hacking needs to increase exponentially to get smaller and smaller p-values. For instance, once a nonexistent effect has been p-hacked to p<.05, a researcher would need to attempt nine times as many analyses to achieve p<.01.


Moreover, as Panel B shows, because there is a limited number of alternative analyses one can do (96 in our simulations), ambitious p-hacking often fails.[1]
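One way to build intuition for why is to simulate a single, simple form of p-hacking: repeatedly adding observations to a two-cell study of a nonexistent effect and stopping at the first “significant” result. The sketch below is just that one strategy, with made-up batch sizes and a capped menu of looks; it is not the paper’s simulations, which combine several forms of p-hacking.

# One form of p-hacking (data peeking) on a true effect of zero
set.seed(1)
reaches <- function(target, batch = 10, max_looks = 15) {
  x <- y <- numeric(0)
  for (look in 1:max_looks) {                        # a limited menu of analyses
    x <- c(x, rnorm(batch)); y <- c(y, rnorm(batch))
    if (t.test(x, y)$p.value < target) return(TRUE)  # stop at the first "significant" result
  }
  FALSE                                              # ran out of options before getting there
}
mean(replicate(2000, reaches(.05)))                  # share of p-hackers who ever reach p < .05
mean(replicate(2000, reaches(.01)))                  # far fewer ever reach p < .01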

P-Curve and Ambitious p-hacking
P-curve is a tool that allows you to diagnose the evidential value of a set of statistically significant findings. It is simple: you plot the significant p-values of the statistical tests of interest to the original researchers, and you look at the shape of that distribution. If your p-curve is significantly right-skewed, then the literature you are examining has evidential value. If it’s significantly flat or left-skewed, then it does not.

In the absence of p-hacking, there is, by definition, a 5% chance of mistakenly observing a significantly right-skewed p-curve if one is in fact examining a literature full of nonexistent effects. Thus, p-curve’s false-positive rate is 5%.

However, when researchers p-hack trying to get p<.05, that probability drops quite a bit, because p-hacking causes p-curve to be left-skewed in expectation, making it harder to (mistakenly) observe a right-skew. Thus, literatures studying nonexistent effects through p-hacking have less than a 5% chance of obtaining a right-skewed p-curve.

But if researchers get ambitious and keep p-hacking past .05, the barely significant results start disappearing and so p-curve starts having a spurious right-skew. Intuitively, the ambitious p-hacker will eliminate the .04s and push past to get more .03s or .02s. The resulting p-curve starts to look artificially good.

Updated p-curve app, 4.0 (htm), is robust to ambitious p-hacking
In “Better P-Curves” (.pdf) we introduced a new test for evidential value that is much more robust to ambitious p-hacking. The new app incorporates it (it also computes confidence intervals for power estimates, among many other improvements, see summary (.htm)).

The new test focuses on the “half p-curve,” the distribution of p-values that are p<.025. On the one hand, because half p-curve does not include barely significant results, it has a lower probability of mistaking ambitious p-hacking for evidential value. On the other hand, dropping observations makes the half p-curve less powerful, so it has a higher chance of failing to recognize actual evidential value.

Fortunately, by combining the full and half p-curves into a single analysis, we obtain inferences that are robust to ambitious p-hacking with minimal loss of power.

The new test of evidential value:
A set of studies is said to contain evidential value if either the half p-curve has a p<.05 right-skew test, or both the full and half p-curves have p<.1 right-skew tests. [2]
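In code, the rule can be sketched as follows. This is a stripped-down illustration that takes the studies’ two-sided p-values as given and uses the Stouffer method on their conditional “pp-values”; the actual app works from the reported test statistics and does considerably more (power estimation, flatness tests, etc.).

# Sketch of the combination test
stouffer_right_skew <- function(p, cutoff) {
  p <- p[p < cutoff]
  if (length(p) == 0) return(NA_real_)
  pp <- p / cutoff                         # under H0, p-values below the cutoff are uniform on (0,1)
  z  <- sum(qnorm(pp)) / sqrt(length(pp))  # very negative z = right skew
  pnorm(z)                                 # p-value of the right-skew test
}
has_evidential_value <- function(p) {
  half <- stouffer_right_skew(p, .025)     # half p-curve: only p < .025
  full <- stouffer_right_skew(p, .05)      # full p-curve: all p < .05
  half < .05 || (half < .1 && full < .1)
}
has_evidential_value(c(.001, .002, .01, .03, .04))   # an example set of significant p-values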

In the figure below we compare the performance of this new combination test with that of the full p-curve alone (the “old” test). The top three panels show that both tests are similarly powered to detect true effects. Only when original research is underpowered at 33% is the difference noticeable, and even then it seems acceptable. With just 5 p-values the new test still has more power than the underlying studies do.


The bottom panels show that moderately ambitious p-hacking fully invalidates the “old” test, but the new test is unaffected by it.[3]

We believe that these revisions to p-curve, incorporated in the updated app (.html), make it much harder to falsely conclude that a set of ambitiously p-hacked results contains evidential value. As a consequence, the incentives to ambitiously p-hack are even lower than they were before.



Footnotes.
  1. This is based on simulations of what we believe to be realistic combinations and levels of p-hacking. The results will vary depending on the types and levels of p-hacking. []
  2. As with all cutoffs, it only makes sense to use these as points of reference. A half p-curve with p=.051 is nearly as good as with p=.049, and both tests with p<.001 is much stronger than both tests with p=.099. []
  3. When the true effect is zero and researchers do not p-hack (an unlikely combination), the probability that the new test leads to concluding the studies contain evidential value is 6.2% instead of the nominal 5%. R Code: https://osf.io/mbw5g/  []

[44] AsPredicted: Pre-registration Made Easy

Pre-registering a study consists of leaving a written record of how it will be conducted and analyzed. Very few researchers currently pre-register their studies. Maybe it’s because pre-registering is annoying. Maybe it’s because researchers don’t want to tie their own hands. Or maybe it’s because researchers see no benefit to pre-registering. This post addresses these three possible causes. First, we introduce AsPredicted.org, a new website that makes pre-registration as simple as possible. We then show that pre-registrations don’t actually tie researchers’ hands; they tie reviewers’ hands, providing selfish benefits to authors who pre-register. [1]

AsPredicted.org
The best introduction is arguably the home-page itself:
[Screenshot: AsPredicted.org home page]

No matter how easy pre-registering becomes, not pre-registering is always easier.  What benefits outweigh the small cost?

Benefit 1. No more self-censoring
In part by choice, and in part because some journals (and reviewers) now require it, more and more researchers are writing papers that properly disclose how their studies were run; they are disclosing all experimental conditions, all measures collected, any data exclusions, etc.

Disclosure is good. It appropriately increases one’s skepticism of post-hoc analytic decisions. But it also increases one’s skepticism of totally reasonable ex-ante decisions, for the two are sometimes confused. Imagine you collect and properly disclose that you measured one primary dependent variable and two exploratory variables,  only to get hammered by Reviewer 2, who writes:

This study is obviously p-hacked. The authors collected three measures and only used one as a dependent variable. Reject.

When authors worry that they will be accused of reporting only the best of three measures, they may decide to only collect a single measure. Preregistration frees authors to collect all three, while assuaging any concerns of being accused of p-hacking.

You don’t tie your hands with pre-registration. You tie Reviewer 2’s.

In case you skipped the third blue box above:
[Screenshot: the “what if” box from the AsPredicted.org home page]

Benefit 2. Go ahead, data peek
Data peeking, where one decides whether to get more data after analyzing the data, is usually a big no-no. It invalidates p-values and (several aspects of) Bayesian inference. [2]  But if researchers pre-register how they will data peek, it becomes kosher again.

For example, you can pre-register, “In line with Frick (1986 .pdf) we will check data after every 20 observations per-cell, stopping whenever p<.01 or p>.36,”  or “In line with Pocock (1977 .pdf), we will collect up to 60 observations per-cell, in batches of 20, and stop early if p<.022.”

Lakens (2014 .pdf) gives an accessible introduction to legalized data-peeking for psychologists.
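Here is a small simulation sketch of why a pre-registered peeking rule keeps inference valid. It assumes a two-cell design with a true effect of zero and three looks, after 20, 40, and 60 observations per cell; it is an illustration, not code from Frick, Pocock, or Lakens.

# Naive peeking at p < .05 inflates false positives; a Pocock-style boundary keeps them near 5%
set.seed(1)
peek <- function(alpha_each, batch = 20, looks = 3) {
  x <- y <- numeric(0)
  for (look in 1:looks) {
    x <- c(x, rnorm(batch)); y <- c(y, rnorm(batch))      # true effect is zero
    if (t.test(x, y)$p.value < alpha_each) return(TRUE)   # stop early and declare an effect
  }
  FALSE
}
mean(replicate(5000, peek(.05)))     # ~.11: peeking at the nominal .05 every time
mean(replicate(5000, peek(.022)))    # ~.05: pre-registered Pocock-style boundary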

Benefit 3. Bolster credibility of odd analyses
Sometimes, the best way to analyze the data is difficult to sell to readers. Maybe you want to do a negative binomial regression, or do an arcsine transformation, or drop half the sample because the observations are not independent. You think about it for hours, ask your stat-savvy friends, and then decide that the weird way to analyze your data is actually the right way to analyze your data. Reporting the weird (but correct!) analysis opens you up to accusations of p-hacking. But not if you pre-register it. “We will analyze the data with an arc-sine transformation.” Done. Reviewer 2 can’t call you a p-hacker.


Footnotes.
  1. More flexible options for pre-registration are offered by the Open Science Framework and the Social Science Registry, where authors can write up documents in any format, covering any aspect of their design or analysis, and without any character limits. See pre-registration instructions for the OSF here , and for the Social Science Registry here. []
  2. In particular, if authors peek at their data seeking a given Bayes Factor, they increase the odds they will find support for the alternative hypothesis even if the null is true – see Colada [13] – and they obtain biased estimates of effect size. []

[43] Rain & Happiness: Why Didn’t Schwarz & Clore (1983) ‘Replicate’ ?

In my “Small Telescopes” paper, I introduced a new approach to evaluate replication results (SSRN). Among other examples, I described two studies as having failed to replicate the famous Schwarz and Clore (1983) finding that people report being happier with their lives when asked on sunny days.

Figure and text from Small Telescopes paper (SSRN)
I recently had an email exchange with a senior researcher (not involved in the original paper) who persuaded me I should have been more explicit regarding the design differences between the original and replication studies.  If my paper weren’t published I would add a discussion of such differences and would explain why I don’t believe these can explain the failures to replicate.  

Because my paper is already published, I write this post instead.

The 1983 study
This study is so famous that a paper telling the story behind it (.pdf) has over 450 Google cites.  It is among the top-20 most cited articles published in JPSP and the most cited by either (superstar) author.

In the original study a research assistant called University of Illinois students either during the “first two sunny spring days after a long period of gray, overcast days”, or during two rainy days within a “period of low-hanging clouds and rain” (p. 298, .pdf).

She asked about life satisfaction and then current mood. At the beginning of the phone conversation, she either did not mention the weather, mentioned it in passing, or described it as being of interest to the study.

The reported finding is that “respondents were more satisfied with their lives on sunny than rainy days—but only when their attention was not drawn to the weather” (p.298, .pdf)
‘Replication’
Feddersen et al. (.pdf) matched weather data to the Australian Household Income Survey, which includes a question about life satisfaction. With 90,000 observations, the effect was basically zero.

There are at least three notable design differences between the original and replication studies:[1]

1. Smaller causes have smaller effects. The 1983 study focused on days on which weather was expected to have large mood effects; the Australian sample used the whole year. The first sunny day in spring is not like the 53rd sunny day of summer.

2. Already attributed. Respondents answered many questions in Australia before reporting their life-satisfaction, possibly misattributing mood to something else.

3. Noise. The representative sample is more diverse than a sample of college undergrads is; thus the data are noisier, less likely to detectably exhibit any effect.

Often this is where discussions of failed replications end—with the enumeration of potential moderators, and the call for more and better data. I’ll try to use the data we already have to assess whether any of the differences are likely to matter.[2]

Design difference 1. Smaller causes.
If weather contrasts were critical for altering mood and hence possibly happiness, then the effect in the 1983 study should be driven by the first sunny day in spring, not the Nth rainy day. But a look at the bar chart above shows the opposite: People were NOT happier the first sunny day of spring; they were unhappier on the rainy days. Their description of these days again: ‘and the rainy days we used were several days into a new period of low-hanging clouds and rain.’ (p. 298, .pdf)

The days driving the effect, then, were similar to previous days. Because of how seasons work, most days in the replication studies presumably were also similar to the days that preceded them (sunny after sunny and rainy after rainy), and so on this point the replication does not seem different or problematic.

Second, Lucas and Lawless (JPSP 2014, .pdf) analyzed a large (N=1 million) US sample and also found no effect of weather on life satisfaction. Moreover, they explicitly assessed if unseasonably cloudy/sunny days, or days with sunshine that differed from recent days, were associated with bigger effects. They were not. (See their Table 3).

Third, the effect size Schwarz and Clore report is enormous: 1.7 points on a 1-10 scale. To put that in perspective, from other studies, we know that the life satisfaction gap between people who got married vs. people who became widows over the past year is about 1.5 on the same scale (see Figure 1, Lucas 2005 .pdf). Life vs. death is estimated as less impactful than precipitation. Even if the effect were smaller on days not as carefully selected as those by Schwarz and Clore, the ‘replications’ averaging across all days should still have detectable effects.

The large effect is particularly surprising considering it is the downstream effect of weather on mood, and that effect is really tiny (see Tal Yarkoni’s blog review of a few studies .htm)

Design difference  2. Already attributed.
This concern, recall, is that people answering many questions in a survey may misattribute their mood to earlier questions. This makes sense, but the concern applies to the original as well.

The phone-call from Schwarz & Clore’s RA does not come immediately after the “mood induction” either; rather, participants get the RA’s phone call hours into a rainy vs sunny day. Before the call they presumably made evaluations too, answering questions like “How are you and Lisa doing?” “How did History 101 go?” “Man, don’t you hate Champaign’s weather?” etc. Mood could have been misattributed to any of these earlier judgments in the original as well. Our participants’ experiences do not begin when we start collecting their data. [3]

Design difference 3. Noise.
This concern is that the more diverse sample in the replication makes it harder to detect any effect. If the replication were noisier, we might expect the dependent variable to have a higher standard deviation (SD). For life-satisfaction, Schwarz and Clore got about SD=1.69; Feddersen et al., SD=1.52. So there is less noise in the replication. [4] Moreover, the replication has panel data and controls for individual differences via fixed effects. These account for 50% of the variance, so they have spectacularly less noise. [5]

Concluding bullet points.
– The existing data are overwhelmingly inconsistent with current weather affecting reported life satisfaction.
– This does not imply the theory behind Schwarz and Clore (1983), mood-as-information, is wrong.


Author feedback
I sent a draft of this post to Richard Lucas (.htm) who provided valuable feedback and additional sources. I also sent a draft to Norbert Schwarz (.htm) and Gerald Clore (.htm). They provided feedback that led me to clarify when I first identified the design differences between the original and replication studies (back in 2013, see footnotes 1&2).  They turned down several invitations to comment within this post.




Footnotes.
  1. The first two were mentioned in the first draft of my paper but I unfortunately cut them out during a major revision, around May 2013. The third was proposed in February of 2013 in a small mailing list discussing the first talk I gave of my Small Telescopes paper []
  2. There is also the issue, as Norbert Schwarz pointed out to me in an email in May of 2013, that the 1983 study is not about weather nor life satisfaction, but about misattribution of mood. The ‘replications’ do not even measure mood. I believe we can meaningfully discuss whether the effect of rain on happiness replicates without measuring mood; in fact, the difficulty of manipulating mood via weather is one thing that makes the original finding surprising. []
  3. What one needs to explain the differences via the presence of other questions is that mood effects from weather replenish through the day, but not immediately. So on sunny days at 7AM I think my cat makes me happier than usual, and then at 10AM that my calculus teacher jokes are funnier than usual, but if the joke had been told at 7.15AM I would not have found it funny because I had already attributed my mood to the cat. This is possible. []
  4. Schwarz and Clore did not report SDs, but one can compute them off the reported test statistics. See Supplement 2 for Small Telescopes .pdf. []
  5. See R2 in Feddersen et al’s Table A1, column 4 vs 3, .pdf  []

[42] Accepting the Null: Where to Draw the Line?

We typically ask if an effect exists.  But sometimes we want to ask if it does not.

For example, how many of the “failed” replications in the recent reproducibility project published in Science (.pdf) suggest the absence of an effect?

Data have noise, so we can never say ‘the effect is exactly zero.’  We can only say ‘the effect is basically zero.’ What we do is draw a line close to zero and if we are confident the effect is below the line, we accept the null.
[Whiteboard drawing: confidence intervals that do and do not include the line]
We can draw the line via Bayes or via p-values; it does not matter very much. The line is what really matters. How far from zero is it? What moves it up and down?

In this post I describe 4 ways to draw the line, and then pit the top-2 against each other.

Way 1. Absolutely small
The oldest approach draws the line based on absolute size. Say, diets leading to losing less than 2 pounds have an effect of basically zero. Economists do this often. For instance, a recent World Bank paper (.html) reads

“The impact of financial literacy on the average remittance frequency has a 95 percent confidence interval [−4.3%, +2.5%] …. We consider this a relatively precise zero effect, ruling out large positive or negative effects of training” (emphasis added)
(Dictionary note. Remittance: immigrants sending money home).

In much of behavioral science effects of any size can be of theoretical interest, and sample sizes are too small to obtain tight confidence intervals, making this approach unviable in principle and in practice. [1]

Way 2. Undetectably Small
In our first p-curve paper with Joe and Leif (SSRN), and in my “Small Telescopes” paper on evaluating replications (.pdf), we draw the line based on detectability.

We don’t draw the line where we stop caring about effects.
We draw the line where we stop being able to detect them.

Say an original study with n=50 finds people can feel the future. A replication with n=125 ‘fails,’ getting an effect estimate of d=0.01, p=.94. Data are noisy, so the confidence interval goes all the way up to d=.2. That’s a respectably big feeling-the-future effect we are not ruling out. So we cannot say the effect is absolutely small.
The original study, with just n=50, however, is unable to detect that small an effect (it would have <18% power). So we accept the null, the null that the effect is either zero, or undetectably small by existing studies.
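The power numbers behind this example can be sketched with the pwr package, assuming n=50 and n=125 are per-cell sample sizes of a simple two-cell design (that assumption is mine, not part of the example).

# Power calculations behind the "undetectably small" line (pwr package from CRAN)
library(pwr)
pwr.t.test(n = 50, d = .2, sig.level = .05)$power    # ~.17: the original could not detect d = .2
pwr.t.test(n = 50, power = 1/3, sig.level = .05)$d   # ~.30: smallest effect giving the original 33% power
# The replication's confidence interval tops out around d = .2, below what the original
# could detect, so we accept the null of a zero-or-undetectably-small effect.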

Way 3. Smaller than expected in general
Bayesian hypothesis testing runs a horse race between two hypotheses:

Hypothesis 1 (null):              The effect is exactly zero.
Hypothesis 2 (alternative): The effect is one of those moderately sized ones. [2]

When data clearly favor 1 more than 2, we accept the null. The bigger the effects Hypothesis 2 includes, the further from zero we draw the line, the more likely we accept the null. [3]

The default Bayesian test, commonly used by Bayesian advocates in psychology, draws the line too far from zero (for my taste). Reasonably powered studies of moderately big effects wrongly accept the null of zero effect too often (see Colada[35]). [4]

Way 4. Smaller than expected this time
A new Bayesian approach to evaluate replications, by Verhagen and Wagenmakers (2014 .pdf), pits a different Hypothesis 2 against the null. Its Hypothesis 2 is what a Bayesian observer would predict for the replication after seeing the Original (with some assumed prior).

Similar to Way 3, the bigger the effect seen in the original, the bigger the effect we expect in the replication, and hence the further from zero we draw the line. Importantly, here the line moves based on what we observed in the original, not (only) on what we arbitrarily choose to consider reasonable to expect. The approach is the handsome cousin of testing if effect size differs between original and replication.

Small Telescope vs Expected This Time (Way 2 vs Way 4)
I compared the conclusions both approaches arrive at when applied to the 100 replications from that Science paper. The results are similar but far from equal, r = .9 across all replications, and r = .72 among n.s. ones (R Code). Focusing on situations where the two lead to opposite conclusions is useful to understand each better. [5], [6]

In Study 7 in the Science paper,
The Original estimated a monstrous d=2.14 with N=99 participants total.
The Replication estimated a small    d=0.26, with a minuscule N=14.

The Small Telescopes approach is irked by the small sample of the replication. Its wide confidence interval includes effects as big as d=1.14, giving the original >99% power. We cannot rule out detectable effects; the replication is inconclusive.

The Bayesian observer, in contrast, draws a line quite far from zero after seeing the massive Original effect size. The line, indeed, is at a remarkable d=.8. Replications with smaller effect size estimates, anything smaller than large, ‘support the null.’ Because the replication is d=.26, it strongly supports the null.

A hypothetical scenario where they disagree in the opposite direction (R Code),
Original.       N=40,       d=.7
Replication.  N=5000, d=.1

The Small Telescopes approach asks if the replication rejects an effect big enough to be detectable by the original. Yes. d=.1 cannot be studied with N=40. Null Accepted.  [7]

Interestingly, that small N=40 pushes the Bayesian in the opposite direction. An original with N=40 changes her beliefs about the effect very little, so d=.1 in the replication is not that surprising vs. the Original, but it is incompatible with d=0 given the large sample size; null rejected.

I find myself agreeing with the Small Telescopes’ line more than any other. But that’s a matter of taste, not fact.



Footnotes.
  1. e.g., we need n=1500 per cell to have a confidence interval entirely within d<.1 and d>-.1 []
  2. The tests don’t formally assume the effects are moderately large, rather they assume distributions of effect size, say N(0,1). These distributions include tiny effects, even zero, but they also include very large effects, e.g., d>1 as probable possibilities.  It is hard to have intuitions for what assuming a distribution entails. So for brevity and clarity I just say they assume the effect is moderately large. []
  3. Bayesians don’t accept and reject hypotheses, instead, the evidence supports one or another hypothesis. I will use the term accept anyway. []
  4. This is fixable in principle, just define another alternative. If someone proposes a new Bayesian test, ask them “what line around zero is it drawing?”  Even without understanding Bayesian statistics you can evaluate if you like the line the test generates or not. []
  5. Alex Etz in a blogpost (.html) reported the Bayesian analysis of the 100 replications; I used some of his results here []
  6. These are the Spearman correlations between the p-value testing the null that the original had at least 33% power, and the Bayes Factor described above []
  7. Technically it is the upper end of the confidence interval we consider when evaluating the power of the original sample; it goes up to d=.14; I used d=.1 to keep things simpler []

[41] Falsely Reassuring: Analyses of ALL p-values

It is a neat idea. Get a ton of papers. Extract all p-values. Examine the prevalence of p-hacking by assessing if there are too many p-values near p=.05. Economists have done it [SSRN], as have psychologists [.html], and biologists [.html]. These charts with distributions of p-values come from those papers:


The dotted circles highlight the excess of .05s, but most p-values are way smaller, suggesting  p-hacking happens but is not a first order concern. That’s reassuring, but falsely reassuring. [1] , [2]

Bad Sampling.
There are several problems with looking at all p-values; here I focus on sampling. [3]

If we want to know if researchers p-hack their results, we need to examine the p-values associated with their results, those they may want to p-hack in the first place. Samples, to be unbiased, must only include observations from the population of interest.

Most p-values reported in most papers are irrelevant for the strategic behavior of interest: covariates, manipulation checks, main effects in studies testing interactions, etc. Including them, we underestimate p-hacking and overestimate the evidential value of the data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?” [4]

A Demonstration.
In our first p-curve paper (SSRN) we analyzed p-values from experiments with results reported only with a covariate.

We believed researchers would report the analysis without the covariate if it were significant; thus we believed those studies were p-hacked. The resulting p-curve was left-skewed, so we were right.

Figure 2. p-curve for relevant p-values in experiments reported only with a covariate.

I went back to the papers we had analyzed and redid the analyses, only this time I did them incorrectly.

Instead of collecting only the (23) p-values one should select (we provide detailed directions for selecting p-values in our paper, SSRN), I proceeded the way the indiscriminate analysts of p-values proceed: I got ALL (712) p-values reported in those papers.

Figure 3. p-curve for all p-values reported in papers behind Figure 2

Figure 3 tells us that the things those papers were not studying were super true.
Figure 2 tells us the ones they were studying were not.

Looking at all p-values is falsely reassuring.



Author feedback
I sent a draft of this post to the first authors of the three papers with charts reprinted in Figure 1 and of the paper from footnote 1. They provided valuable feedback that improved the writing and led to footnotes 2 & 4.




Footnotes.
  1. The Econ and Psych papers were not meant to be reassuring, but they can be interpreted that way. For instance, a recent J of Econ Perspectives (.pdf) paper reads “Brodeur et al. do find excess bunching, [but] their results imply that it may not be quantitatively as severe as one might have thought”. The PLOS Biology paper was meant to be reassuring. []
  2. The PLOS Biology paper had two parts. The first used the indiscriminate selection of p-values from articles in a broad range of journals and attempted to assess the prevalence and impact of p-hacking in the field as a whole. This part is fully invalidated by the problems described in this post. The second used p-values from a few published meta-analyses on sexual selection in evolutionary biology; this second part is by construction not representative of biology as a whole. In the absence of a p-curve disclosure table, where we know which p-value was selected from each study, it is not possible to evaluate the validity of this exercise. []
  3. For other problems see Dorothy Bishop’s recent paper [.html] []
  4. Brodeur et al. did painstaking work to exclude some irrelevant p-values, e.g., those explicitly described as control variables, but nevertheless left many in. To give a sense, they obtained an average of about 90 p-values from each paper. To give a concrete example, one of the papers in their sample is by Ferreira and Gyourko (.pdf). Via regression discontinuity it shows that a mayor’s political party does not predict policy. To demonstrate the importance of their design, Ferreira & Gyourko also report naive OLS regressions with highly significant but spurious and incorrect results that at face value contradict the paper’s thesis (see their Table II). These very small but irrelevant p-values were included in the sample by Brodeur et al. []

[40] Reducing Fraud in Science

Fraud in science is often attributed to incentives: we reward sexy-results→fraud happens. The solution, the argument goes, is to reward other things.  In this post I counter-argue, proposing three alternative solutions.

Problems with the Change the Incentives solution.
First, even if rewarding sexy-results caused fraud, it does not follow we should stop rewarding sexy-results. We should pit costs vs benefits. Asking questions with the most upside is beneficial.

Second, if we started rewarding unsexy stuff, a likely consequence is fabricateurs continuing to fake, now just unsexy stuff.  Fabricateurs want the lifestyle of successful scientists. [1] Changing incentives involves making our lifestyle less appealing. (Finally, a benefit to committee meetings). 

Third, the evidence for “liking sexy→fraud” is just not there. Like real research, most fake research is not sexy. Life-long fabricateur Diederik Stapel mostly published dry experiments with “findings” in line with the rest of the literature. That we attend to and remember the sexy fake studies is diagnostic of what we pay attention to, not what causes fraud.  

The evidence that incentives cause fraud comes primarily from self-reports, with fabricateurs saying “the incentives made me do it” (see e.g., Tijdink et al .pdf; or Stapel interviews). To me, the guilty saying “it’s not my fault” seems like weak evidence. What else could they say?
“I realized I was not cut-out for this; it was either faking some science or getting a job with less status”
I am kind of a psychopath, I had fun tricking everyone”
“A voice in my head told me to do it”

Similarly weak, to me, is the observation that fraud is more prevalent in top journals; we find fraud where we look for it. Fabricateurs faking articles that don’t get read don’t get caught….

It’s good for universities to ignore quantity of papers when hiring and promoting, good for journals to publish interesting questions with inconclusive answers. But that won’t help with fraud.

Solution 1. Retract without asking “are the data fake?”
We have a high bar for retracting articles, and a higher bar for accusing people of fraud. 
The latter makes sense. The former does not.

Retracting is not such a big deal, it just says “we no longer have confidence in the evidence.” 

So many things can go wrong when collecting, analyzing and reporting data that this should be a relatively routine occurrence even in the absence of fraud. An accidental killing may not land the killer in prison, but the victim goes 6 ft under regardless. I’d propose a  retraction doctrine like:

If something is discovered that would lead reasonable experts to believe the results did not originate in a study performed as described in a published paper, or to conclude the study was conducted with excessive sloppiness, the journal should retract the paper.   

Example 1. Analyses indicate published results are implausible for a study conducted as described (e.g., excessive linearity, implausibly similar means, or a covariate is impossibly imbalanced across conditions). Retract.

Example 2. Authors of a paper published in a journal that requires data sharing upon request, when asked for it, indicate they have “lost the data.” Retract. [2]

Example 3. Comparing original materials with posted data reveals important inconsistencies (e.g., scales ranges are 1-11 in the data but 1-7 in the original). Retract.

When journals reject original submissions it is not their job to figure out why the authors ran an uninteresting study or executed it poorly. They just reject it.

When journals lose confidence in the data behind a published article it is not their job to figure out why the authors published data whose confidence was eventually lost. They should just retract it.

Employers, funders, and co-authors can worry about why an author published untrustworthy data. 

Solution 2. Show receipts
Penn, my employer, reimburses me for expenses incurred at conferences.

However, I don’t get to just say “hey, I bought some tacos in that Kansas City conference, please deposit $6.16 onto my checking account.” I need receipts.  They trust me, but there is a paper trail in case of need.

When I submit the work I presented in Kansas City to a journal, in contrast, I do just say “hey, I collected the data this or that way.” No receipts.

The recent Science retraction, with canvassers & gay marriage, is a great example of the value of receipts. The statistical evidence suggested something was off, but the receipts-like paper trail helped a lot:

Author: “so and so ran the survey with such and such company”
Sleuths: “hello such and such company, can we talk with so and so about this survey you guys ran?”
Such and such company: “we don’t know any so and so, and we don’t have the capability to run the survey.”

Authors should provide as much documentation about how they run their science as they do about what they eat at conferences: where exactly was the study run, at what time and day, which research assistant ran it (with contact information), how exactly were participants paid, etc.

We will trust everything researchers say. Until the need to verify arises.

Solution 3. Post data, materials and code
Had the raw data not been available, the recent Science retraction would probably not have happened. Stapel would probably not have gotten caught. The cases against Sanna and Smeesters would not have moved forward. To borrow from a recent paper with Joe and Leif:

Journals that do not increase data and materials posting requirements for publications are causally, if not morally, responsible for the continued contamination of the scientific record with fraud and sloppiness.  



Feedback from Ivan Oransky, co-founder of Retraction Watch
Ivan co-wrote an editorial in the New York Times on changing the incentives to reduce fraud (.pdf). I reached out to him to get feedback. He directed me to some papers on the evidence linking incentives and fraud. I was unaware of, but also unpersuaded by, that evidence. This prompted me to add the last paragraph in the incentives section (where I am skeptical of that evidence).
Despite our different takes on the role that rewarding sexy findings plays in fraud, Ivan is on board with the three non-incentive solutions proposed here. I thank Ivan for the prompt response and useful feedback (and for Retraction Watch!).



Footnotes.

  1. I use the word fabricateur to refer to scientists who fabricate data. Fraudster is insufficiently specific (e.g., selling 10 bagels calling them a dozen is fraud too), and fabricator has positive meanings (e.g., people who make things). Fabricateur has a nice ring to it. []
  2. Every author publishing in an American Psychological Association journal agrees to share data upon request []

[39] Power Naps: When do Within-Subject Comparisons Help vs Hurt (yes, hurt) Power?

A recent Science-paper (.pdf) used a total sample size of N=40 to arrive at the conclusion that implicit racial and gender stereotypes can be reduced while napping. 

N=40 is a small sample for a between-subject experiment. One needs N=92 to reliably detect that men are heavier than women (SSRN). The study, however, was within-subject: for instance, its dependent variable, the Implicit Association Test (IAT), was contrasted within-participant before and after napping. [1]
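For a rough sense of where the N=92 benchmark comes from: with a two-tailed .05 test and 80% power, 46 people per cell implies an effect size of about d=.59. A minimal R check of mine under that assumption (not code from the SSRN piece):

    # assumes d = .59, the effect size implied by needing ~46 per cell for 80% power
    power.t.test(delta = .59, sd = 1, sig.level = .05, power = .80)
    # n ~ 46 per group, i.e., N ~ 92 total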

Reasonable question: How much more power does subtracting baseline IAT give a study?
Surprising answer: it lowers power.

Design & analysis of napping study
Participants took the gender and race IATs, then trained for the gender IAT (while listening to one sound) and the race IAT (different sound). Then everyone naps.  While napping one of the two sounds is played (to cue memory of the corresponding training, facilitating learning while sleeping). Then both IATs are taken again. Nappers were reported to be less biased in the cued IAT after the nap.

This is perhaps a good place to note that many studies share similar designs and sample sizes. This blog post is about strengthening intuitions for within-subject designs, not about criticizing the authors of the study.

Intuition for the power drop
Let’s simplify the experiment. No napping. No gender IAT. Everyone takes only the race IAT.

Half train before taking it, half don’t. To test if training works we could do
         Between-subject test: is the mean IAT different across conditions?

If before training everyone took a baseline race IAT, we could instead do
         Mixed design test: is the mean change in IAT different across conditions?

Subtracting baseline, going from between-subject to a mixed-design, has two effects: one good, one bad.

Good: Reduce between-subject differences. Some people have stronger racial associations than others. Subtracting baselines reduces those differences, increasing power.

Bad: Increase noise. The baseline is, after all, just an estimate. Subtracting baseline adds noise, reducing power.

Imagine the baseline was measured incorrectly. The computer recorded, instead of the IAT, the participant’s body temperature. IAT scores minus body temperature is a noisier dependent variable than just IAT scores, so we’d have less power.

If baseline is not quite as bad as body temperature, the consequence is not quite as bad, but same idea. Subtracting baseline adds the baseline’s noise.

We can be quite precise about this. Subtracting baseline only helps power if baseline is correlated r>.5 with the dependent variable, but it hurts if r<.5. [2]

See the simple math (.html). Or, just see the simple chart. [Figure 1: the effect of subtracting baseline, as a function of the before-after correlation r]
e.g., running n=20 per cell and subtracting baseline, when r=.3, lowers power enough that it is as if the sample had been n=15 instead of n=20. (R Code)
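To make that concrete, here is a minimal sketch in R (mine, not the post’s linked R Code; the true effect d=.5 is arbitrary). It uses the fact that subtracting baseline changes the outcome’s variance from σ² to var(post - pre) = 2σ²(1-r), which is larger than σ² whenever r<.5:

    d <- .5                                                        # illustrative true effect
    power.t.test(n = 20, delta = d, sd = 1)$power                  # post only, n = 20 per cell
    power.t.test(n = 20, delta = d, sd = sqrt(2 * (1 - .3)))$power # subtract baseline, r = .3
    power.t.test(n = 15, delta = d, sd = 1)$power                  # post only, n = 15: about the same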

Before-after correlation for the IAT
Subtracting baseline IAT will only help, then, if when people take it twice, their scores are correlated r>.5. Prior studies have found test-retest reliability of r = .4 for the racial IAT. [3]  Analyzing the posted data (.html) from this study, where manipulations take place between measures, I got r = .35. (For gender IAT I got r=.2) [4]

Aside: one can avoid the power-drop entirely if one controls for baseline in a regression/ANCOVA instead of subtracting it.  Moreover, controlling for baseline never lowers power. See bonus chart (.pdf). 
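A small simulation sketch of that aside (my own illustration, with arbitrary n=20, r=.3, and d=.5): compare the raw post-score test, the difference-score test, and a regression controlling for baseline:

    set.seed(1)
    sim <- function(n = 20, r = .3, d = .5) {
      pre  <- rnorm(2 * n)                                 # baseline, unaffected by condition
      post <- r * pre + sqrt(1 - r^2) * rnorm(2 * n)       # post measure with cor(pre, post) = r
      cond <- rep(0:1, each = n)                           # between-subject treatment indicator
      post <- post + d * cond                              # treatment shifts the post measure
      c(post_only = t.test(post ~ cond)$p.value,           # ignore baseline
        diff      = t.test((post - pre) ~ cond)$p.value,   # subtract baseline
        ancova    = summary(lm(post ~ cond + pre))$coefficients["cond", 4])  # control for it
    }
    p <- replicate(5000, sim())
    rowMeans(p < .05)  # power: ancova is at least as high as post_only; diff is lowest at r = .3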

Within-subject manipulations
In addition to subtracting baseline, one may carry out the manipulation within-subject: every participant gets treatment and control. Indeed, in the napping study everyone had a cued and a non-cued IAT.

How much this helps depends again on the correlation of the within-subject measures: Does race IAT correlate with gender IAT?  The higher the correlation, the bigger the power boost. 

[Figure 2: the power boost from a within-subject manipulation, as a function of the correlation between the two measures]
When both measures are uncorrelated it is as if the study had twice as many subjects. This makes sense: r=0 is as if the data came from different people; asking two questions of n=20 is like asking one question of n=40. As r increases we have more power because we expect the two measures to be more and more similar, so any given difference is more and more statistically significant. [5] (R Code for chart)

Race & gender IATs capture distinct mental associations, measured with a test of low reliability, so we may not expect a high correlation. At baseline, r(race,gender)=-.07, p=.66.  The within-subject manipulation, then, “only” doubled the sample size.
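A quick check of the r=0 case (again a sketch, with an arbitrary effect of d=.4): a paired contrast on n=20 people with uncorrelated measures has roughly the same power as a between-subject comparison with n=20 per cell, N=40 total:

    d <- .4                                                               # illustrative effect size
    power.t.test(n = 20, delta = d, sd = sqrt(2), type = "paired")$power  # within-subject, r = 0
    power.t.test(n = 20, delta = d, sd = 1, type = "two.sample")$power    # between-subject, N = 40
    # roughly the same (the small gap reflects the paired test's fewer degrees of freedom);
    # as r grows, the sd of the paired difference, sqrt(2*(1-r)), shrinks and within-subject pulls ahead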

So, how big was the sample?
The Science-paper reports N=40 people total. The supplement explains that this actually combines two separate studies run months apart, each with N=20. The analyses subtracted baseline IAT, lowering power, as if N=15. The manipulation was within-subject, doubling that, to N=30. To detect “men are heavier than women” one needs N=92. [6]
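One way to reproduce that accounting (a back-of-envelope approximation of mine, treating effective sample size as inversely proportional to the variance of the dependent variable):

    n <- 20                            # starting point in the accounting above
    r <- .3                            # before-after IAT correlation
    n_subtract <- n / (2 * (1 - r))    # subtracting baseline: ~14.3, roughly the 15 above
    n_effective <- 2 * n_subtract      # within-subject doubling: ~29, roughly the 30 above
    n_effective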

Author feedback
I shared an early draft of this post with the authors of the Science-paper. We had an extensive email exchange that led to clarifying some ambiguities in the writing. They also suggested I mention their results are robust to controlling instead of subtracting baseline.



Footnotes.

  1. The IAT is the Implicit Association Test and assesses how strongly respondents associate, for instance, good things with Whites and bad things with Blacks; take a test (.html) []
  2. Two days after this post went live I learned, via Jason Kerwin, of this very relevant paper by David McKenzie (.pdf) arguing for economists to collect data from more rounds. David makes the same point about r>.5 for a gain in power from, in econ jargon, a diff-in-diff vs. the simple diff. []
  3. Bar-Anan & Nosek (2014, p. 676 .pdf); Lane et al. (2007, p.71 .pdf)  []
  4. That’s for post- vs. pre-nap. In the napping study the race IAT is taken 4 times by every participant, resulting in 6 before-after correlations, ranging from r = -.047 to r = .53; simple average r = .3. []
  5. This ignores the impact that going from a between- to a within-subject design has on the actual effect itself. Effects can get smaller or larger depending on the specifics. []
  6. The idea of using men-vs-women weight as a benchmark is to give a heuristic reference point; effects big enough to be detectable by the naked eye require bigger samples than the ones we are used to seeing when studying surprising effects. For those skeptical of this heuristic, let’s use published evidence on the IAT as a benchmark. Lai et al (2014 .pdf) ran 17 interventions seeking to reduce IAT scores. The biggest effect among these 17 was d=.49. That effect size requires n=66 per cell, N=132 total, for 80% power (more than for men vs women weight). Moderating this effect through sleep, and moderating the moderation through cueing while sleeping, requires vastly larger samples to attain the same power. []