[41] Falsely Reassuring: Analyses of ALL p-values

It is a neat idea. Get a ton of papers. Extract all p-values. Examine the prevalence of p-hacking by assessing if there are too many p-values near p=.05. Economists have done it [SSRN], as have psychologists [.html], and biologists [.html]. These charts with distributions of p-values come from those papers:

Figure 1. Distributions of p-values from the three papers (dotted circles highlight the excess of .05s).

The dotted circles highlight the excess of .05s, but most p-values are way smaller, suggesting p-hacking happens but is not a first-order concern. That’s reassuring, but falsely reassuring.1,2

Bad Sampling.
There are several problems with looking at all p-values; here I focus on sampling.3

If we want to know if researchers p-hack their results, we need to examine the p-values associated with their results, those they may want to p-hack in the first place. Samples, to be unbiased, must only include observations from the population of interest.

Most p-values reported in most papers are irrelevant for the strategic behavior of interest: covariates, manipulation checks, main effects in studies testing interactions, etc. Including them leads us to underestimate p-hacking and to overestimate the evidential value of the data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?”4
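To see how much that dilution matters, here is a toy R simulation (mine, not from any of the papers discussed): each “paper” contributes one p-hacked focal test (a true null tested repeatedly with optional stopping) plus twenty incidental tests of large, real, but irrelevant effects. The focal p-values alone show no sign of evidential value; pooled with the incidental ones, the distribution looks healthy.

```r
# Toy simulation: p-hacked focal tests get masked when pooled with many
# irrelevant, highly significant p-values (manipulation checks, main effects, etc.)
set.seed(1)

# Focal test: a true null, p-hacked via optional stopping (peek every 10 subjects)
hacked.p <- function(start = 10, step = 10, max.n = 50) {
  x <- rnorm(start); y <- rnorm(start)
  p <- t.test(x, y)$p.value
  while (p > .05 && length(x) < max.n) {
    x <- c(x, rnorm(step)); y <- c(y, rnorm(step))
    p <- t.test(x, y)$p.value
  }
  p
}

focal      <- replicate(500, hacked.p())                  # one focal result per "paper"
incidental <- replicate(500 * 20,                         # twenty irrelevant tests per "paper"
                        t.test(rnorm(30, .8), rnorm(30))$p.value)

sig <- function(p) p[p < .05]
hist(sig(focal),                breaks = seq(0, .05, .01))  # focal tests only: not right-skewed
hist(sig(c(focal, incidental)), breaks = seq(0, .05, .01))  # all p-values: strongly right-skewed
```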

A Demonstration.
In our first p-curve paper (SSRN) we analyzed p-values from experiments with results reported only with a covariate.

We believed researchers would report the analysis without the covariate if it were significant, thus we believed those studies were p-hacked. The resulting p-curve was left-skewed, so we were right.

Figure 2. p-curve for relevant p-values in experiments reported only with a covariate.

I went back to the papers we had analyzed and redid the analyses, only this time I did them incorrectly.

Instead of collecting only the (23) p-values one should select (we provide detailed directions for selecting p-values in our paper, SSRN), I proceeded the way the indiscriminate analysts of p-values proceed: I got ALL (712) p-values reported in those papers.

Figure 3. p-curve for all p-values reported in papers behind Figure 2

Figure 3 tells us that the things those papers were not studying were super true.
Figure 2 tells us that the things they were studying were not.

Looking at all p-values is falsely reassuring.




Author feedback
I sent a draft of this post to the first author of the three papers with charts reprinted in Figure 1 and the paper from footnote 1. They provided valuable feedback that improved the writing and led to footnotes 2 & 4.




Footnotes.

  1. The Econ and Psych papers were not meant to be reassuring, but they can be interpreted that way. For instance, a recent J of Econ Perspectives (.pdf) paper reads “Brodeur et al. do find excess bunching, [but] their results imply that it may not be quantitatively as severe as one might have thought”. The PLOS Biology paper was meant to be reassuring. []
  2. The PLOS Biology paper had two parts. The first used the indiscriminate selection of p-values from articles in a broad range of journals and attempted to assess the prevalence and impact of p-hacking in the field as a whole. This part is fully invalidated by the problems described in this post. The second used p-values from a few published meta-analyses on sexual selection in evolutionary biology; this second part is by construction not representative of biology as a whole. In the absence of a p-curve disclosure table, where we know which p-value was selected from each study, it is not possible to evaluate the validity of this exercise. []
  3. For other problems see Dorothy Bishop’s recent paper [.html] []
  4. Brodeur et al. did painstaking work to exclude some irrelevant p-values, e.g., those explicitly described as control variables, but nevertheless left many in. To give a sense, they obtained an average of about 90 p-values from each paper. To give a concrete example, one of the papers in their sample is by Ferreira and Gyourko (.pdf). Via regression discontinuity it shows that a mayor’s political party does not predict policy. To demonstrate the importance of their design, Ferreira & Gyourko also report naive OLS regressions with highly significant but spurious and incorrect results that at face value contradict the paper’s thesis (see their Table II). These very small but irrelevant p-values were included in the sample by Brodeur et al. []

[40] Reducing Fraud in Science

Fraud in science is often attributed to incentives: we reward sexy-results→fraud happens. The solution, the argument goes, is to reward other things.  In this post I counter-argue, proposing three alternative solutions.

Problems with the Change the Incentives solution.
First, even if rewarding sexy-results caused fraud, it does not follow we should stop rewarding sexy-results. We should pit costs vs benefits. Asking questions with the most upside is beneficial.

Second, if we started rewarding unsexy stuff, a likely consequence is fabricateurs continuing to fake, now just unsexy stuff.  Fabricateurs want the lifestyle of successful scientists.1 Changing incentives involves making our lifestyle less appealing. (Finally, a benefit to committee meetings). 

Third, the evidence for “liking sexy→fraud” is just not there. Like real research, most fake research is not sexy. Life-long fabricateur Diederik Stapel mostly published dry experiments with “findings” in line with the rest of the literature. That we attend to and remember the sexy fake studies is diagnostic of what we pay attention to, not what causes fraud.  

The evidence that incentives cause fraud comes primarily from self-reports, with fabricateurs saying “the incentives made me do it” (see e.g., Tijdink et al .pdf; or Stapel interviews).  To me, the guilty saying “it’s not my fault” seems like weak evidence. What else could they say?
“I realized I was not cut-out for this; it was either faking some science or getting a job with less status”
“I am kind of a psychopath; I had fun tricking everyone”
“A voice in my head told me to do it”

Similarly weak, to me, is the observation that fraud is more prevalent in top journals; we find fraud where we look for it. Fabricateurs faking articles that don’t get read don’t get caught….

It’s good for universities to ignore quantity of papers when hiring and promoting, good for journals to publish interesting questions with inconclusive answers. But that won’t help with fraud.

Solution 1. Retract without asking “are the data fake?”
We have a high bar for retracting articles, and a higher bar for accusing people of fraud. 
The latter makes sense. The former does not.

Retracting is not such a big deal, it just says “we no longer have confidence in the evidence.” 

So many things can go wrong when collecting, analyzing and reporting data that this should be a relatively routine occurrence even in the absence of fraud. An accidental killing may not land the killer in prison, but the victim goes 6 ft under regardless. I’d propose a  retraction doctrine like:

If something is discovered that would lead reasonable experts to believe the results did not originate in a study performed as described in a published paper, or to conclude the study was conducted with excessive sloppiness, the journal should retract the paper.   

Example 1. Analyses indicate published results are implausible for a study conducted as described (e.g., excessive linearity, implausibly similar means, or a covariate is impossibly imbalanced across conditions). Retract.

Example 2. Authors of a paper published in a journal that requires data sharing upon request, when asked for it, indicate that they have “lost the data”.  Retract.2

Example 3. Comparing original materials with posted data reveals important inconsistencies (e.g., scale ranges are 1-11 in the data but 1-7 in the original). Retract.

When journals reject original submissions it is not their job to figure out why the authors ran an uninteresting study or executed it poorly. They just reject it.

When journals lose confidence in the data behind a published article it is not their job to figure out why the authors published data in which confidence was eventually lost. They should just retract it.

Employers, funders, and co-authors can worry about why an author published untrustworthy data. 

Solution 2. Show receipts
Penn, my employer, reimburses me for expenses incurred at conferences.

However, I don’t get to just say “hey, I bought some tacos at that Kansas City conference, please deposit $6.16 into my checking account.” I need receipts. They trust me, but there is a paper trail in case of need.

When I submit the work I presented in Kansas City to a journal, in contrast, I do just say “hey, I collected the data this or that way.” No receipts.

The recent Science retraction, with canvassers & gay marriage, is a great example of the value of receipts. The statistical evidence suggested something was off, but the receipt-like paper trail helped a lot:

Author: “so and so ran the survey with such and such company”
Sleuths: “hello such and such company, can we talk with so and so about this survey you guys ran?”
Such and such company: “we don’t know any so and so, and we don’t have the capability to run the survey.”

Authors should provide as much documentation about how they run their science as they do about what they eat at conferences: where exactly was the study run, at what time and day, which research assistant ran it (with contact information), how exactly were participants paid, etc.

We will trust everything researchers say. Until the need to verify arises.

Solution 3. Post data, materials and code
Had the raw data not been available, the recent Science retraction would probably not have happened. Stapel would probably not have gotten caught. The cases against Sanna and Smeesters would not have moved forward. To borrow from a recent paper with Joe and Leif:

Journals that do not increase data and materials posting requirements for publications are causally, if not morally, responsible for the continued contamination of the scientific record with fraud and sloppiness.  



Feedback from Ivan Oransky, co-founder of Retraction Watch
Ivan co-wrote an editorial in the New York Times on changing the incentives to reduce fraud (.pdf). I reached out to him to get feedback. He directed me to some papers on the evidence of incentives and fraud. I was unaware of, but also unpersuaded by, that evidence. This prompted me to add the last paragraph in the incentives section (where I am skeptical of that evidence).
Despite our different takes on the role of rewarding sexy-findings on fraud, Ivan is on board with the three non-incentive solutions proposed here.  I thank Ivan for the prompt response and useful feedback. (and for Retraction Watch!)



Footnotes.

  1. I use the word fabricateur to refer to scientists who fabricate data. Fraudster is insufficiently specific (e.g., selling 10 bagels calling them a dozen is fraud too), and fabricator has positive meanings (e.g., people who make things). Fabricateur has a nice ring to it. []
  2. Every author publishing in an American Psychological Association journal agrees to share data upon request []

[39] Power Naps: When do within-subject comparisons help vs hurt (yes, hurt) power?

A recent Science-paper (.pdf) used a total sample size of N=40 to arrive at the conclusion that implicit racial and gender stereotypes can be reduced while napping. 

N=40 is a small sample for a between-subject experiment. One needs N=92 to reliably detect that men are heavier than women (SSRN). The study, however, was within-subject, for instance, its dependent variable, the Implicit Association Test (IAT), was contrasted within-participant before and after napping.1

Reasonable question: How much more power does subtracting baseline IAT give a study?
Surprising answer: it lowers power.

Design & analysis of napping study
Participants took the gender and race IATs, then trained for the gender IAT (while listening to one sound) and the race IAT (different sound). Then everyone naps.  While napping one of the two sounds is played (to cue memory of the corresponding training, facilitating learning while sleeping). Then both IATs are taken again. Nappers were reported to be less biased in the cued IAT after the nap.

This is perhaps a good place to indicate that there are many studies with similar designs and sample sizes. The blogpost is about strengthening intuitions for within-subject designs, not criticizing the authors of the study.

Intuition for the power drop
Let’s simplify the experiment. No napping. No gender IAT. Everyone takes only the race IAT.

Half train before taking it, half don’t. To test if training works we could do
         Between-subject test: is the mean IAT different across conditions?

If before training everyone took a baseline race IAT, we could instead do
         Mixed design test: is the mean change in IAT different across conditions?

Subtracting baseline, going from between-subject to a mixed-design, has two effects: one good, one bad.

Good: Reduce between-subject differences. Some people have stronger racial associations than others. Subtracting baselines reduces those differences, increasing power.

Bad: Increase noise. The baseline is, after all, just an estimate. Subtracting baseline adds noise, reducing power.

Imagine the baseline was measured incorrectly. The computer recorded, instead of the IAT, the participant’s body temperature. IAT scores minus body temperature is a noisier dependent variable than just IAT scores, so we’d have less power.

If baseline is not quite as bad as body temperature, the consequence is not quite as bad, but same idea. Subtracting baseline adds the baseline’s noise.

We can be quite precise about this. Subtracting baseline only helps power if baseline is correlated r>.5 with the dependent variable, but it hurts if r<.5.2

See the simple math (.html). Or just see the simple chart:
[Figure: power with vs. without subtracting baseline, as a function of the before-after correlation r]
e.g., running n=20 per cell and subtracting baseline, when r=.3, lowers power enough that it is as if the sample had been n=15 instead of n=20. (R Code)
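Here is a minimal simulation of that point (mine, not the linked R Code): it estimates the power of a two-sample t-test on the raw post-test vs. on the change score, for a before-after correlation of r = .3 and of r = .7.

```r
# Power of analyzing the raw DV vs. the change score (DV minus baseline),
# for a true effect of d = .5 and n = 20 per cell
set.seed(1)
power.sim <- function(n = 20, d = .5, r = .3, nsim = 5000) {
  p.raw <- p.chg <- numeric(nsim)
  for (i in 1:nsim) {
    base.t <- rnorm(n); post.t <- r * base.t + sqrt(1 - r^2) * rnorm(n) + d
    base.c <- rnorm(n); post.c <- r * base.c + sqrt(1 - r^2) * rnorm(n)
    p.raw[i] <- t.test(post.t, post.c)$p.value
    p.chg[i] <- t.test(post.t - base.t, post.c - base.c)$p.value
  }
  c(raw = mean(p.raw < .05), change.score = mean(p.chg < .05))
}
power.sim(r = .3)  # Var(post - base) = 2(1 - r) > 1: subtracting baseline hurts
power.sim(r = .7)  # 2(1 - r) < 1: subtracting baseline helps
```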

Before-After correlation for  IAT
Subtracting baseline IAT will only help, then, if, when people take it twice, their scores are correlated r>.5. Prior studies have found test-retest reliability of r = .4 for the racial IAT.3 Analyzing the posted data (.html) from this study, where manipulations take place between measures, I got r = .35 (for the gender IAT, r = .2).4

Aside: one can avoid the power-drop entirely if one controls for baseline in a regression/ANCOVA instead of subtracting it.  Moreover, controlling for baseline never lowers power. See bonus chart (.pdf). 
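A bare-bones illustration of that aside (my own code, not the bonus chart’s), on one simulated dataset with r = .3:

```r
# Three ways to analyze the same data: raw DV, change score, and ANCOVA
set.seed(2)
n <- 20; r <- .3; d <- .5
treat <- rep(1:0, each = n)
base  <- rnorm(2 * n)
post  <- r * base + sqrt(1 - r^2) * rnorm(2 * n) + d * treat
chg   <- post - base

t.test(post ~ treat)$p.value                                        # raw DV
t.test(chg ~ treat)$p.value                                         # change score (noisier when r < .5)
summary(lm(post ~ treat + base))$coefficients["treat", "Pr(>|t|)"]  # ANCOVA: controls for baseline
```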

Within-subject manipulations
In addition to subtracting baseline, one may carry out the manipulation within-subject: every participant gets treatment and control. Indeed, in the napping study everyone had a cued and a non-cued IAT.

How much this helps depends again on the correlation of the within-subject measures: Does race IAT correlate with gender IAT?  The higher the correlation, the bigger the power boost. 

[Figure: power boost from the within-subject manipulation, as a function of the correlation between the two measures]
When both measures are uncorrelated it is as if the study had twice as many subjects. This makes sense: r=0 is as if the data came from different people, so asking two questions of n=20 is like asking one question of n=40. As r increases we have more power because we expect the two measures to be more and more similar, so any given difference is more and more statistically significant.5 (R Code for chart)
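A quick sanity check of the “twice as many subjects” claim, using power.t.test (my numbers, not the post’s chart):

```r
d <- .5
power.t.test(n = 20, delta = d, sd = 1, type = "two.sample")$power               # between: n = 20 per cell (N = 40)
power.t.test(n = 20, delta = d, sd = sqrt(2 * (1 - 0)),  type = "paired")$power  # within, r = 0: about the same
power.t.test(n = 20, delta = d, sd = sqrt(2 * (1 - .5)), type = "paired")$power  # within, r = .5: a further boost
```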

Race & gender IATs capture distinct mental associations, measured with a test of low reliability, so we may not expect a high correlation. At baseline, r(race,gender)=-.07, p=.66.  The within-subject manipulation, then, “only” doubled the sample size.

So, how big was the sample?
The Science-paper reports N=40 people total. The supplement explains that it actually combines two separate studies run months apart, each with N=20. The analyses subtracted baseline IAT, lowering power, as if N=15. The manipulation was within subject, doubling it, to N=30. To detect “men are heavier than women” one needs N=92.6

Author feedback
I shared an early draft of this post with the authors of the Science-paper. We had an extensive email exchange that led to clarifying some ambiguities in the writing. They also suggested I mention their results are robust to controlling instead of subtracting baseline.



Footnotes.

  1. The IAT is the Implicit Association Test and assesses how strongly respondents associate, for instance, good things with Whites and bad things with Blacks; take a test (.html) []
  2. Two days after this post went live I learned, via Jason Kerwin, of this very relevant paper by David McKenzie (.pdf) arguing for economists to collect data from more rounds. David makes the same point about r>.5 for a gain in power from, in econ jargon, a diff-in-diff vs. the simple diff. []
  3. Bar-Anan & Nosek (2014, p. 676 .pdf); Lane et al. (2007, p.71 .pdf)  []
  4. That’s for post vs. pre nap. In the napping study the race IAT is taken 4 times by every participant, resulting in 6 before-after correlations, ranging from r = -.047 to r = .53; simple average r = .3. []
  5. This ignores the impact that going from between to within subject design has on the actual effect itself. Effects can get smaller or larger depending on the specifics. []
  6. The idea of using men-vs-women weight as a benchmark is to give a heuristic reaction; effects big enough to be detectable by the naked eye require bigger samples than the ones we are used to seeing when studying surprising effects. For those skeptical of this heuristic, let’s use published evidence on the IAT as a benchmark. Lai et al (2014 .pdf) ran 17 interventions seeking to reduce IAT scores. The biggest effect among these 17 was d=.49. That effect size requires n=66 per cell, N=132 total, for 80% power (more than for men vs women weight). Moderating this effect through sleep, and moderating the moderation through cueing while sleeping, requires vastly larger samples to attain the same power. []

[38] A Better Explanation Of The Endowment Effect

It’s a famous study. Give a mug to a random subset of a group of people. Then ask those who got the mug (the sellers) to tell you the lowest price they’d sell the mug for, and ask those who didn’t get the mug (the buyers) to tell you the highest price they’d pay for the mug. You’ll find that sellers’ minimum selling prices exceed buyers’ maximum buying prices by a factor of 2 or 3 (.pdf).

This famous finding, known as the endowment effect, is presumed to have a famous cause: loss aversion. Just as loss aversion maintains that people dislike losses more than they like gains, the endowment effect seems to show that people put a higher price on losing a good than on gaining it. The endowment effect seems to perfectly follow from loss aversion.

But a 2012 paper by Ray Weaver and Shane Frederick convincingly shows that loss aversion is not the cause of the endowment effect (.pdf). Instead, “the endowment effect is often better understood as the reluctance to trade on unfavorable terms,” in other words “as an aversion to bad deals.”1

This paper changed how I think about the endowment effect, and so I wanted to write about it.

A Reference Price Theory Of The Endowment Effect

Weaver and Frederick’s theory is simple: Selling and buying prices reflect two concerns. First, people don’t want to sell the mug for less, or buy the mug for more, than their own value of it. Second, they don’t want to sell the mug for less, or buy the mug for more, than the market price. This is because people dislike feeling like a sucker.2

To see how this produces the endowment effect, imagine you are willing to pay $1 for the mug and you believe it usually sells for $3. As a buyer, you won’t pay more than $1, because you don’t want to pay more than it’s worth to you. But as a seller, you don’t want to sell for as little as $1, because you’ll feel like a chump selling it for much less than it is worth.3 Thus, because there’s a gap between people’s perception of the market price and their valuation of the mug, there’ll be a large gap between selling ($3) and buying ($1) prices:

Weaver and Frederick predict that the endowment effect will arise whenever market prices differ from valuations.

However, when market prices are not different from valuations, you shouldn’t see the endowment effect. For example, if people value a mug at $2 and also think that its market price is $2, then both buyers and sellers will price it at $2:

And this is what Weaver and Frederick find. Repeatedly. There is no endowment effect when valuations are equal to perceived market prices. Wow.
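One way to see the logic is to write it down as a toy decision rule (my stylization of the bad-deal account, not Weaver and Frederick’s model): never trade on terms worse than either your own valuation or the perceived market price.

```r
# Toy pricing rules under the bad-deal account
selling.price <- function(valuation, market) pmax(valuation, market)  # won't sell below either
buying.price  <- function(valuation, market) pmin(valuation, market)  # won't pay above either

selling.price(1, 3); buying.price(1, 3)  # valuation $1, market $3: a $3 vs. $1 gap
selling.price(2, 2); buying.price(2, 2)  # valuation = market = $2: no gap, no endowment effect
```

Loss aversion never enters; the gap appears only when valuations and perceived market prices diverge.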

Just to be sure, I ran a within-subjects hypothetical study that is much inferior to Weaver and Frederick’s between-subjects incentivized studies, and, although my unusual design produced some unusual results, I found strong support for their hypothesis (full description .pdf; data .xls). Most importantly, I found that people who gave higher selling prices than buying prices for the same good were much more likely to say they did this because they wanted to avoid a bad deal than because of loss aversion:

In fact, whereas 82.5% of participants endorsed at least one bad-deal reason, only 18.8% of participants endorsed at least one loss-aversion reason.4

I think Weaver and Frederick’s evidence makes it difficult to consider loss aversion the best explanation of the endowment effect. Loss aversion can’t explain why the endowment effect is so sensitive to the difference between market prices and valuations, and it certainly can’t explain why the effect vanishes when market prices and valuations converge.5

Weaver and Frederick’s theory is simple, plausible, supported by the data, and doesn’t assume that people treat losses differently than gains. It just assumes that, when setting prices, people consider both their valuations and market prices, and dislike feeling like a sucker.



Author feedback.
I shared an early draft of this post with Shane Frederick. Although he opted not to comment publicly, during our exchange I did learn of an unrelated short (and excellent) piece that he wrote that contains a pretty awesome footnote (.html).

  1. Even if you don’t read Weaver and Frederick’s paper, I strongly advise you to read Footnote 10. []
  2. Thaler (1985) called this “transaction utility” (.pdf). Technically Weaver and Frederick’s theory is about “reference prices” rather than “market prices”, but since market prices are the most common/natural reference price I’m going to use the term market prices. []
  3. Maybe because you got the mug for free, you’d be willing to sell it for a little bit less than the market price – perhaps $2 rather than $3. Even so, if the gap between market prices and valuations is large enough, there’ll still be an endowment effect []
  4. For a similar result, see Brown 2005 (.pdf). []
  5. Loss aversion is not the only popular account. According to an “ownership” account of the endowment effect (.pdf), owning a good makes you like it more, and thus price it higher, than not owning it. Although this mechanism may account for some of the effect (the endowment effect may be multiply determined), it cannot explain all the effects Weaver and Frederick report. Nor can it easily account for why the endowment effect is observed in hypothetical studies, when people simply imagine being buyers or sellers. []

[37] Power Posing: Reassessing The Evidence Behind The Most Popular TED Talk

A recent paper in Psych Science (.pdf) reports a failure to replicate the study that inspired a TED Talk that has been seen 25 million times.1  The talk invited viewers to do better in life by assuming high-power poses, just like Wonder Woman’s below, but the replication found that power-posing was inconsequential.

[Image: Wonder Woman power pose]
If an original finding is a false positive then its replication is likely to fail, but a failed replication need not imply that the original was a false positive. In this post we try to figure out why the replication failed.

Original study
Participants in the original study took an expansive “high-power” pose, or a contractive “low-power” pose (Psych Science 2010, .pdf). [Image: the high-power and low-power poses] The power-posing participants were reported to have felt more powerful, sought more risk, had higher testosterone levels, and lower cortisol levels. In the replication, power posing affected self-reported power (the manipulation check), but did not impact behavior or hormonal levels.2 The key point of the TED Talk, that power poses “can significantly change the outcomes of your life” (minute 20:10; video; transcript .html), was not supported.

Was The Replication Sufficiently Precise?
Whenever a replication fails, it is important to assess how precise the estimates are. A noisy estimate may be consistent with no effect (i.e., not significant) but also consistent with effects large enough so as to not challenge the original study.

Only precisely estimated non-significant results contradict an original finding (see Colada[7] for an example of a strikingly imprecise replication).

Here are the original and replication confidence intervals for the risk-taking measure:3

[Figure: confidence intervals for the effect of power posing on risk taking, original vs. replication]

This figure shows that the replication is precise enough to be informative.

The higher end of the confidence interval is d=.06, meaning that the data are consistent with zero and inconsistent with power-poses affecting risk-taking by more than 6% of a standard deviation. We can put in perspective how small that upper bound is by realizing that, with just n=21 per cell, the original study would have a meager 5.6% chance of detecting it (i.e., <6% statistical power).

Thus, even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it. (For more on this approach to thinking about replications check out Uri’s Small Telescopes paper (.pdf)).
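For the curious, the power figure quoted above is easy to check in R (assuming a standard two-sample t-test):

```r
# Power of the original design (n = 21 per cell) to detect the replication's
# upper-bound effect of d = .06
power.t.test(n = 21, delta = .06, sd = 1, sig.level = .05)$power  # about .055, i.e., under 6% power
```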

The same is true for the effects of power poses on hormones:

[Figure: the same comparison for the effects of power posing on hormones]
This implies that there is either a moderator or the original is a false-positive.

Moderators
In their response (.pdf), the original authors reviewed every published study on power posing they could locate (they found 33), seeking potential moderators, factors that may affect whether power posing works. For example, they speculated that power posing may not have worked in the replication because participants were told that power poses could influence behavior. (One apparent implication of this hypothesis is that watching the power poses TED talk would make power poses ineffective).

The authors also list all differences in design they noted between the original and replication studies (see their nice summary .pdf).

One approach would be to run a series of studies to systematically manipulate these hypothesized moderators to see whether they matter.

But before spending valuable resources on that, it is necessary to first establish whether there is reason to believe, based on the published literature, that power posing is ever effective. Might it be instead the case that the original findings are false-positive?4

P-curve
It may seem that 30-some successful studies is enough to conclude the effect is real. However, we need to account for selective reporting. If studies only get published when they show an effect, the fact that all the published evidence shows an effect is not diagnostic.

P-curve is just the tool for this. It tells us whether we can rule out selective reporting as the sole explanation for a set of p<.05 findings (see p-curve paper .pdf).

We conducted a p-curve analysis on the 33 studies the original authors cited as evidence that power posing works in their reply. They come from many labs around the world.

If power posing were a true effect, we should see a curve that is right-skewed, tall on the left and low on the right. If power posing had no effect, we would expect a flat curve.
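If the logic of those expected shapes is unfamiliar, a small simulation (in the spirit of the p-curve paper, but not its code) makes it concrete: significant p-values pile up at the low end when an effect is real and are uniform when it is not.

```r
# Distribution of significant p-values for a real effect (d = .5) vs. a null effect
set.seed(1)
sig.p <- function(d, n = 20, nsim = 2e4) {
  p <- replicate(nsim, t.test(rnorm(n, d), rnorm(n))$p.value)
  p[p < .05]
}
hist(sig.p(d = .5), breaks = seq(0, .05, .01))  # right-skewed: many more p-values below .01 than just under .05
hist(sig.p(d = 0),  breaks = seq(0, .05, .01))  # flat: uniform, as expected under the null
```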

[Figure: examples of a right-skewed (evidential value) p-curve and a flat (no evidential value) p-curve]
The actual p-curve for power posing is flattish, definitely not right-skewed.

[Figure: the p-curve for the 33 power-posing studies]

Note: you should ignore p-curve results that do not include a disclosure table; here is ours (.xlsx) .

Consistent with the replication motivating this post, p-curve indicates that either power-posing overall has no effect, or the effect is too small for the existing samples to have meaningfully studied it. Note that there are perfectly benign explanations for this: e.g., labs that ran studies that worked wrote them up; labs that ran studies that didn’t, didn’t.5

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.



Author feedback.
We shared an early draft of this post with the authors of the original and failed replication studies. We also shared it with Joe Cesario, author of a few power-posing studies with whom we had discussed this literature a few months ago.

Note: it is our policy not to comment further; they get the last word.

Amy Cuddy (.html), co-author of the original study and TED talk speaker, provided useful suggestions for clarifications and modifications that led to several revisions, including a new title. She also sent this note:

I’m pleased that people are interested in discussing the research on the effects of adopting expansive postures. I hope, as always, that this discussion will help to deepen our understanding of this and related phenomena, and clarify directions for future research. Given that our quite thorough response to the Ranehill et al. study has already been published in Psychological Science, I will direct you to that,(.html). I respectfully disagree with the interpretations and conclusions of Simonsohn et al., but I’m considering these issues very carefully and look forward to further progress on this important topic.

Roberto Weber (.html), co-author of the failed replication study, sent Uri a note, co-signed by all co-authors, which he allowed us to post (see it here .pdf). They make four points, which we quote, attempting to summarize:

(1) none of the 33 published studies, other than the original and our replication, study the effect […] on hormones
(2) even within [the original] the hormone result seems to be presented inconsistently
(3) regarding differences in designs [from Table 2 in their response], […] the evidence does not support many of these specific examples as probable moderators in this instance
(4) we employed an experimenter-blind design […] this might be the factor that most plausibly serves as a moderator of the effect”

Again, full response: .pdf.

Joe Cesario (.html):

Excellent post, very clear and concise. Our lab has also been concerned with some of the power pose claims–less about direct replicability and more about the effectiveness in more realistic situations (motivating our Cesario & McDonald paper and another paper in prep with David Johnson). These concerns also led us to contact a number of the most cited power pose researchers to invite them for a multi-site, collaborative replication and extension project. Several said they could not commit the resources to such an endeavor, but several did agree in principle. Although progress has stalled, we are hopeful about such a project moving forward. This would be a useful and efficient way of providing information about replicability as well as potential moderation of the effects.




Footnotes.

  1. This TED talk is ranked 2nd in overall downloads and 1st in per-year downloads. The talk also shared a very touching personal story of how the speaker overcame an immense personal challenge; this post is concerned exclusively with the science part of the talk []
  2. The original authors consider “feelings of power” to be a manipulation check rather than an outcome measure. In their most recent paper they write, “as a manipulation check, participants reported how dominant, in control, in charge, powerful, and like a leader they felt on a 5-point scale” (.pdf) []
  3. The definition of Small Effect in the figure corresponds to an effect that would give the original sample size 33% power, see Uri’s Small Telescopes paper (.pdf) []
  4. ***See Simine Vazire’s excellent post on why it’s wrong to always ask authors to explain failures to replicate by proposing moderators rather than concluding that the original is a false-positive .html (the *** is a gift for Simine)  []
  5. Because the replication obtained a significant effect of power posing on the manipulation check, self-reported power, we constructed a separate p-curve including only the 7 manipulation check results. The resulting p-curve was directionally  right-skewed (p=.075). We interpret this as further consistency between p-curve and the replication results. If from studies reporting both manipulation checks and the effect of interest we select only the manipulation check, and from other studies we select the effect of interest (something we believe is not reasonable but report anyway in case a reader disagrees and for the sake of robustness), the overall conclusions do not change. The overall p-curve still looks flat, does not conclude there is evidential value (Z=.84, p=.199), and does conclude the curve is flatter than 33% (Z=2.05, p=.0203) []

[36] How to Study Discrimination (or Anything) With Names; If You Must

Consider these paraphrased famous findings:
“Because his name resembles ‘dentist,’ Dennis became one” (JPSP, .pdf)
“Because the applicant was black (named Jamal instead of Greg) he was not interviewed” (AER, .pdf)
“Because the applicant was female (named Jennifer instead of John), she got a lower offer” (PNAS, .pdf)

Everything that matters (income, age, location, religion) correlates with people’s names, hence comparing people with different names involves comparing people with potentially different everything that matters.

This post highlights the problem and proposes three practical solutions.1

Gender
Jennifer was the #1 baby girl name between 1970 & 1984, while John has been a top-30 boy name for the last 120 years. Comparing reactions to profiles with these names pits mental associations about women in their late 30s/early 40s against those of men of unclear age.

More generally, close your eyes and think of Jennifers. Now do that for Johns.
Is gender the only difference between the two sets of people you considered?

Here is what Google did when I asked it to close its eyes:2

[Images: Google image-search results for “Jennifer” and for “John”]

Johns vary more in age, appearance, affluence, and presidential ambitions. For somewhat harder data, I consulted a website where people rate names on various attributes:
[Figure: ratings of “John” vs. “Jennifer” on various attributes]

Race
Distinctively Black names (e.g., Jamal and Lakisha) signal low socioeconomic status while typical White names do not (QJE .pdf). Do people not want to hire Jamal because he is Black or because he is of low status?

Even if all distinctively Black names (and even Black people) were perceived as low status, and hence Jamal were an externally valid signal of Blackness, the contrast with Greg might nevertheless be low in internal validity, because the difference attributed to race could instead be the result of status (or some other confounding variable). This is addressable because some (most?) low-status people are not Black. We could compare Black names vs. low-status White names: say, Jamal with Bubba or Billy Bob, and Lakisha with Bambi or Billy Jean. This would allow assessing racial discrimination above and beyond status discrimination.3

Imagine reading a movie script where a Black drug dealer is being defended by a brilliant Black lawyer. One of these characters is named Greg, the other Jamal. The intuition that Greg is the lawyer’s name is the intuition behind the internal validity problem.

Solution 1. Stop using names
Probably the best solution is to stop using names to manipulate race and gender.  A recent paper (PNAS .pdf) examined gender discrimination using only pronouns (and found that academics in STEM fields favored females over males 2:1).

Solution 2. Choose many names
A great paper titled “Stimulus Sampling” (PSPB .pdf) argues convincingly for choosing many stimuli for any given manipulation to avoid stumbling on unforeseen confounds. Stimulus sampling would involve going beyond Jennifer vs. John, to using, say, 20 female vs. 20 male names. This helps with idiosyncratic confounds (e.g., age) but not with the systematic confound that most distinctively Black names signal low socioeconomic status.4

Solution 3. Choose control names actively
If one chooses to study names, then one needs to select control names that, if it weren’t for the scientific hypothesis of interest, would produce no difference from the target names (e.g., if it weren’t for racial discrimination, then people should like Jamal and the control name just as much).

I close with an example from a paper of mine where I attempted to generate proper control names to examine if people disproportionately marry others with similar names, e.g. Eric-Erica, because of implicit egotism: a preference for things that resemble the self. (JPSP .pdf)

We need control names that we would expect to marry Ericas just as frequently as Erics do in the absence of implicit egotism (e.g., of similar age, religion, income, class and location).  To find such names I looked at the relative frequency of wife names for every male name and asked “What male names have the most similar distribution of wife names to Erics?”5.
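In code, the idea looks roughly like this (a sketch with a hypothetical `marriages` data frame; it is not the paper’s actual procedure, which among other things excluded E-names, see footnote 5):

```r
# For each male name, how similar is its distribution of wife names to Eric's?
# `marriages` is assumed to have one row per couple, with columns husband and wife.
wife.share <- prop.table(table(marriages$husband, marriages$wife), margin = 1)

dist.to.eric <- apply(wife.share, 1, function(row) sum((row - wife.share["Eric", ])^2))
head(sort(dist.to.eric), 4)  # Eric itself first, then the best-matched control names
```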

The answer was: Joseph, Frank and Carl. We would expect these three names to marry Erica just as frequently as Eric does, if not for implicit egotism. And we would be right.

For the Jamal vs. Greg study, we could compare Jamal to non-Black names that have the most similar distribution of occupations, or of Zip Codes, or of criminal records.


 


Feedback from original authors:
I shared an early draft of this post with the authors of the Jamal vs. Greg, and Jennifer vs. John study.

Sendhil Mullainathan, co-author of the former, indicated across a few emails he did not believe it was clear one should control for socioeconomic status differences in studies about race, because status and race are correlated in real life.

Corinne Moss-Racusin sent me a note she wrote with her co-authors of their PNAS study:

Thanks so much for contacting us about this interesting topic. We agree that these are thoughtful and important points, and have often grappled with them in our own research. The names we used (John and Jennifer) had been pretested and rated as equivalent on a number of dimensions including warmth, competence, likeability, intelligence, and typicality (Brescoll & Uhlmann, 2005 .pdf), but they were not rated for perceived age, as you highlight here. However, for our study in particular, age of the target should not have extensively impacted our results, because the age of both our targets could easily be inferred from the targets’ resume information that our participants were exposed to. Both the male and female targets (John and Jennifer respectively) were presented as recent college grads (with the same graduation year), and it is thus reasonable to assume that participants believed they were the same age, as recent college grads are almost always the same age (give or take a few years). Thus, although it is possible that age (and other potential variables) may indeed be confounded with gender across our manipulation, we nonetheless do not believe that choosing different male and female names that were equivalent for age would greatly impact our findings, given our design. That said, future research should still seek to replicate our key findings using different manipulations of target gender. Specifically, your suggestions (using only pronouns, and using multiple names) are particularly promising. We have also considered utilizing target pictures in the past, but have encountered issues relating to attractiveness and other confounds.


Footnotes.

  1. Galen Bodenhausen read this post and told me about a paper on confounds in names used for gender research, from 1993(!) PsychBull .pdf []
  2. Based on the Jennifers and Johns I see, I suspect Google peeked at my cookies before closing its eyes; e.g., there are two Bay Area business school professors. Your results may differ. []
  3. Bertrand and Mullainathan write extensively about the socioeconomic confound and report a few null results that they interpret as suggesting it is not playing a large role (see their Section “V.B Potential Confounds”, .pdf). However, (1) the n.s. results of socioeconomic status are obtained with extremely noisy proxies and small samples, reducing the ability to conclude evidence of absence from the absence of evidence, and on the other, (2) these analyses seek to remedy the consequences of the name-confound rather than avoiding the confound from the get-go through experimental design. This post is about experimental design. []
  4. The Jamal paper used 9 different names per race/gender cell []
  5. To avoid biasing the test against implicit egotism, I excluded from the calculations male and female names starting with E_ []

[35] The Default Bayesian Test is Prejudiced Against Small Effects

When considering any statistical tool I think it is useful to answer the following two practical questions:

1. “Does it give reasonable answers in realistic circumstances?”
2. “Does it answer a question I am interested in?”

In this post I explain why, for me, when it comes to the default Bayesian test that’s starting to pop up in some psychology publications, the answer to both questions is no.

The Bayesian test
The Bayesian approach to testing hypotheses is neat and compelling. In principle.1

The p-value assesses only how incompatible the data are with the null hypothesis. The Bayesian approach, in contrast, assesses the relative compatibility of the data with a null vs an alternative hypothesis.

The devil is in choosing that alternative.  If the effect is not zero, what is it?

Bayesian advocates in psychology have proposed using a “default” alternative (Rouder et al 1999, .pdf). This default is used in the online (.html) and R based (.html) Bayes factor calculators. The original papers do warn attentive readers that the default can be replaced with alternatives informed by expertise or beliefs (see especially Dienes 2011 .pdf), but most researchers leave the default unchanged.2

This post is written with that majority of default-following researchers in mind. I explain why, for me, when running the default Bayesian test, the answer to Questions 1 & 2 is “no.”

Question 1. “Does it give reasonable answers in realistic circumstances?”
No. It is prejudiced against small effects

The null hypothesis is that the effect size (henceforth d) is zero. Ho: d = 0. What’s the alternative hypothesis? It can be whatever we want it to be, say, Ha: d = .5. We would then ask: are the data more compatible with d = 0 or are they more compatible with d = .5?

The default alternative hypothesis used in the Bayesian test is a bit more complicated. It is a distribution, so more like Ha: d~N(0,1). So we ask if the data are more compatible with zero or with d~N(0,1).3

That the alternative is a distribution makes it difficult to think about the test intuitively.  Let’s not worry about that. The key thing for us is that that default is prejudiced against small effects.

Intuitively (but not literally), that default means the Bayesian test ends up asking: “is the effect zero, or is it biggish?” When the effect is neither, when it’s small, the Bayesian test ends up concluding (erroneously) it’s zero.4

Demo 1. Power at 50%

Let’s see how the test behaves as the effect size gets smaller (R Code):

[Figure: share of simulated studies (all powered at 50%) in which the default test supports the null, supports the alternative, or is inconclusive, for a biggish (d=.64) vs. a smallish (d=.28) effect]

The Bayesian test erroneously supports the null about 5% of the time when the effect is biggish, d=.64, but it does so five times more frequently when it is smallish, d=.28. The smaller the effect (for studies with a given level of power), the more likely we are to dismiss its existence. We are prejudiced against small effects.5

Note how as sample gets larger the test becomes more confident (smaller white area) and more wrong (larger red area).

Demo 2. Facebook
For a more tangible example, consider the Facebook experiment (.html) that found that seeing images of friends who voted (see panel a below) increased voting by 0.39% (panel b).
[Figure: panels (a) and (b) from the Facebook voting experiment]
While the null of a zero effect is rejected (p=.02), and hence the entire confidence interval for the effect is above zero,6 the Bayesian test concludes VERY strongly in favor of the null, 35:1. (R Code)

Prejudiced against (in this case very) small effects.
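To see the mechanics, here is a stripped-down version in R (my own simplification: a normal prior with SD .707 standing in for the default, and stylized numbers rather than the Facebook study’s):

```r
# Bayes factor for H0: d = 0 vs. H1: d ~ N(0, prior.sd^2), given an estimate
# d.hat with standard error se (under H1 the estimate is marginally N(0, se^2 + prior.sd^2))
bf01 <- function(d.hat, se, prior.sd = .707) {
  dnorm(d.hat, 0, se) / dnorm(d.hat, 0, sqrt(se^2 + prior.sd^2))
}

d.hat <- .0007; se <- .0003   # a tiny effect, estimated very precisely
2 * pnorm(-abs(d.hat / se))   # p ~ .02: the null is rejected
bf01(d.hat, se)               # >> 1: the test nonetheless "supports the null"
```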

Question 2. “Does it answer a question I am interested in?”
No. I am not interested in how well data support one elegant distribution.

 When people run a Bayesian test they like writing things like
“The data support the null.”

But that’s not quite right. What they actually ought to write is
“The data support the null more than they support one mathematically elegant alternative hypothesis I compared it to”

Saying a Bayesian test “supports the null” in absolute terms seems as fallacious to me as interpreting the p-value as the probability that the null is false.

We are constantly reminded that:
P(D|H0)≠P(H0)
The probability of the data given the null is not the probability of the null

But let’s not forget that:
P(H0|D) / P(H1|D)  ≠ P(H0)
The relative probability of the null over one mathematically elegant alternative is not the probability of the null either.

Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it. The default Bayesian test does not answer a question I would ask.


 


Feedback from Bayesian advocates:
I shared an early draft of this post with three Bayesian advocates. I asked for feedback and invited them to comment.

1. Andrew Gelman: Expressed “100% agreement” with my argument but thought I should make it clearer this is not the only Bayesian approach; e.g., he writes “You can spend your entire life doing Bayesian inference without ever computing these Bayesian Factors.” I made several edits in response to his suggestions, including changing the title.

2. Jeff Rouder: Provided additional feedback and also wrote a formal reply (.html). He begins by highlighting the importance of comparing p-values and Bayes factors when, as is the case in reality, we don’t know if the effect does or does not exist, and the paramount importance for science of subjecting specific predictions to data analysis (again, full reply: .html).

3. EJ Wagenmakers: Provided feedback on terminology, the poetic response that follows, and a more in-depth critique of confidence intervals (.pdf).

“In a desert of incoherent frequentist testing there blooms a Bayesian flower. You may not think it is a perfect flower. Its color may not appeal to you, and it may even have a thorn. But it is a flower, in the middle of a desert. Instead of critiquing the color of the flower, or the prickliness of its thorn, you might consider planting your own flower — with a different color, and perhaps without the thorn. Then everybody can benefit.”




Footnotes.

  1. If you want to learn more about it I recommend Rouder et al. 1999 (.pdf), Wagenmakers 2007 (.pdf) and Dienes 2011 (.pdf) []
  2. e.g., Rouder et al (.pdf) write “We recommend that researchers incorporate information when they believe it to be appropriate […] Researchers may also incorporate expectations and goals for specific experimental contexts by tuning the scale of the prior on effect size” p.232 []
  3. The current default distribution is d~N(0,.707), the simulations in this post use that default []
  4. Again, Bayesian advocates are upfront about this, but one has to read their technical papers attentively. Here is an example in Rouder et al (.pdf) page 30: “it is helpful to recall that the marginal likelihood of a composite hypothesis is the weighted average of the likelihood over all constituent point hypotheses, where the prior serves as the weight. As [variance of the alternative hypothesis] is increased, there is greater relative weight on larger values of [the effect size] […] When these unreasonably large values […] have increasing weight, the average favors the null to a greater extent”.   []
  5. The convention is to say that the evidence clearly supports the null if the data are at least three times more likely when the null hypothesis is true than when the alternative hypothesis is, and vice versa. In the chart above I refer to data that do not clearly support the null nor the alternative as inconclusive. []
  6. note that the figure plots standard errors, not a confidence interval []

[34] My Links Will Outlive You

If you are like me, from time to time your papers include links to online references.

Because the internet changes so often, by the time readers follow those links, who knows if the cited content will still be there.

This blogpost shares a simple way to ensure your links live “forever.”  I got the idea from a recent New Yorker article [.html].

Content Rot
It is estimated that about 20%-30% of links referenced in papers are already dead and, like you and me, the remaining links aren’t getting any younger.1

I asked a research assistant to follow links in papers published in April of 2005 and April 2010 across four journals, to get a sense of what happens to links 5 and 10 years out.2

[Figure: share of links that are still live, changed, or dead, 5 and 10 years after publication]

Perusing results I noticed that:

  • Links still alive tend to involve individual newspaper articles (these will die when that newspaper shuts down) and .pdf articles hosted on university servers (these will die when faculty move on to other institutions).
  • Links to pages whose information has changed involved things like websites with financial information for 2009 (now reporting 2014 data), or working papers now replaced with updated or published versions.
  • Dead links tended to involve websites by faculty and students now at different institutions, and now-defunct online organizations.

If you intend to give future readers access to the information you are accessing today, providing links seems like a terrible way to do that.

Solution
Making links “permanent” is actually easy. It involves saving the referenced material on WebArchive.org, a repository that saves individual internet pages “forever.”

Here is an example. The Cincinnati Post was a newspaper that started in 1881 and shut down in 2007. The newspaper had a website (www.cincypost.com). If you visit it today, your browser will show this:

[Image: the error message shown when visiting www.cincypost.com today]

The browser will show the same result if we follow any link to any story ever published by that newspaper.

Using the WebArchive, however, we can still read the subset of stories that were archived, for example, this October 2007 story on a fundraising event by then president George W. Bush (.html)

How to make your links “permanent”
1) Go to http://archive.org/web
2) Enter the URL of interest into the “Save Page Now” box

[Image: the “Save Page Now” box on archive.org/web]

3) Copy-paste the resulting permanent link into your paper
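If you archive links often, you can script these steps. Here is a small R sketch (not part of the post’s workflow) that uses the Wayback Machine’s Save Page Now endpoint; the response header it relies on may change, so treat the details as an assumption to verify:

```r
library(httr)

# Ask the Wayback Machine to save a page and return the archived address
archive_url <- function(url) {
  r <- GET(paste0("https://web.archive.org/save/", url))
  stop_for_status(r)
  loc <- headers(r)[["content-location"]]  # archived path, e.g., /web/<timestamp>/<url>
  if (!is.null(loc)) paste0("https://web.archive.org", loc) else r$url
}

archive_url("http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/")
```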

Example
Imagine writing an academic article in which you want to cite, say, Colada[33] “The Effect Size Does not Exist”. The URL is http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/

You could include that link in your paper, but eventually DataColada will die, and so will the content you are linking to. Someone reading your peer-reviewed Colada takedown in ninety years will have no way of knowing what you were talking about. But, if you copy-paste that URL into the WebArchive, you will save the post, and get a permanent link like this:

http://web.archive.org/web/20150221162234/http://datacolada.org/2015/02/09/33-the-effect-size-does-not-exist/

Done. Your readers can read Colada[33] long after DataColada.org is 6-feet-under.

PS: Note that WebArchive links include the original link. Were the original material to outlive WebArchive, readers could still see it. Archiving is a weakly dominating strategy.




  1. See “Related Work” section in this PlosONE article [.html] []
  2. I chose journals I read: The Journal of Consumer Research, Psychological Science, Management Science and The American Economic Review. Actually, I no longer read JCR articles, but that’s not 100% relevant. []

[33] “The” Effect Size Does Not Exist

Consider the robust phenomenon of anchoring, where people’s numerical estimates are biased towards arbitrary starting points. What does it mean to say “the” effect size of anchoring?

It surely depends on moderators like domain of the estimate, expertise, and perceived informativeness of the anchor. Alright, how about “the average” effect size of anchoring? That’s simple enough. Right? Actually, that’s where the problem of interest to this post arises. Computing the average requires answering the following unanswerable question: How much weight should each possible effect size get when computing “the average” effect size?

Should we weight by number of studies? Imagined, planned, or executed? Or perhaps weight by how clean (free-of-confounds) each study is? Or by sample size?

Say anchoring effects are larger when estimating river lengths than door heights: does “the average” anchoring effect give all river studies combined 50% weight and all door studies the other 50%? If so, what do we do with canal-length studies: combine them with rivers or count them on their own?

If we weight by study rather than stimulus, “the average” effect gets larger as more river studies are conducted, and if we weight by sample size “the average” gets smaller if we run more subjects in the door studies.
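A made-up numerical example (the effect sizes and sample sizes are invented purely to illustrate the point):

```r
# Four hypothetical anchoring studies: three river studies, one large door study
studies <- data.frame(domain = c("river", "river", "river", "door"),
                      d      = c(.8, .8, .8, .3),
                      n      = c(50, 50, 50, 500))

mean(studies$d)                                # weight each study equally: .675
weighted.mean(studies$d, studies$n)            # weight by sample size:     ~.42
mean(tapply(studies$d, studies$domain, mean))  # weight each domain equally: .55
```

Same studies, three different “average” effects.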


What about the impact of anchoring on perceived strawberry-jam viscosity? Nobody has studied that yet, but they could; does “the average” anchoring effect size include this one?

What about all the zero estimates one would get if the experiment was done in a room without any lights or with confusing instructions?  What about all the large effects one would get via demand effects or confounds? Does the average include these?

Studies aren’t random
We can think of the problem using a sampling framework: the studies we run are a sample of the studies we could run. Just not a random sample.

Cheat-sheet. Random sample: every member of the population is equally likely to be selected.

First, we cannot run studies randomly, because we don’t know the relative frequency of every possible study in the population of studies. We don’t know how many “door” vs “river” studies exist in this platonic universe, so we don’t know with what probability to run a door vs a river study.

Second, we don’t want to run studies randomly, we want studies that will provide new information, that are similar to those we have seen elsewhere, that will have higher rhetorical value in a talk or paper, that we find intrinsically interesting, that are less confounded, etc.1

What can we estimate?
Given a set of studies, we can ask what is the average effect of those studies. We have to worry, of course, about publication bias; p-curve is just the tool for that. If we apply p-curve to a set of studies, it tells us what effect we expect to get if we run those same studies again.

To generalize beyond the data requires judgment rather than statistics.
Judgment can account for non-randomly run studies in a way that statistics cannot.




  1. Running studies with a set instead of a single stimulus is nevertheless very important, but for construct rather than external validity. Running a set of stimuli reduces the risks of stumbling on the single confounded stimulus that works. Check out the excellent “Stimulus Sampling” paper by Wells and Windschitl (.pdf) []

[32] Spotify Has Trouble With A Marketing Research Exam

This is really just a post-script to Colada [2], where I described a final exam question I gave in my MBA marketing research class. Students got a year’s worth of iTunes listening data for one person –me– and were asked: “What songs would this person put on his end-of-year Top 40?” I compared that list to the actual top-40 list. Some students did great, but many made the rookie mistake of failing to account for the fact that older songs (e.g., those released in January) had more opportunity to be listened to than did newer songs (e.g., those released in November).
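For the record, the fix for that rookie mistake is a one-liner: rank songs by plays per day of availability rather than by raw play counts. A sketch in R (the `listens` data frame and its columns are hypothetical):

```r
# Adjust for exposure: songs added in November had fewer days to accumulate plays
listens$days.available <- as.numeric(as.Date("2014-12-31") - listens$date.added) + 1
listens$plays.per.day  <- listens$plays / listens$days.available
top40 <- head(listens[order(-listens$plays.per.day), ], 40)
```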

I was reminded of this when I recently received an email from Spotify (my chosen music provider) that read:

[Image: Spotify email announcing my “top song” of 2014]

First, Spotify, rather famously, does not make listening-data particularly public,1 so any acknowledgement that they are assessing my behavior is kind of exciting. Second, that song, Inauguration [Spotify link], is really good. On the other hand, despite my respect for the hard working transistors inside the Spotify preference-detection machine, that song is not my “top song” of 2014.2

The thing is, “Inauguration” came out in January. Could Spotify be making the same rookie mistake as some of my MBA students?

Following Spotify’s suggestion, I decided to check out the rest of their assessment of my 2014 musical preferences. Spotify offered a ranked listing of my Top 100 songs from 2014. Basically, without even being asked, Spotify said “hey, I will take that final exam of yours.” So without even being asked I said, “hey, I will grade that answer of yours.” How did Spotify do?

Poorly. Spotify thinks I really like music from January and February.

Here is their data:

[Figure: Spotify’s Top 100 plotted by the date each song was added; my actual Top 40 songs shown in red]

Each circle is a song; the red ones are those which I included in my actual Top 40 list.

If I were grading this student, I would definitely have some positive things to say. “Dear Spotify Preference-Detection Algorithm, Nice job identifying eight of my 40 favorite songs. In particular, the song that you have ranked second overall, is indeed in my top three.” On the other hand, I would also probably say something like, “That means that your 100 guesses still missed 32 of my favorites. Your top 40 only included five of mine. If you’re wondering where those other songs are hiding, I refer you to the entirely empty right half of the above chart. Of your Top 100, a full 97 were songs added before July 1. I like the second half of the year just as much as the first.” Which is merely to say that the Spotify algorithm has room for improvement. Hey, who doesn’t?

Actually, in preparing this post, I was surprised to learn that, if anything, I have a strong bias toward songs released later in the year. This bias could reflect my tastes, or alternatively a bias in the industry (see this post in a music blog on the topic, .html). I looked at when Grammy-winning songs are released and learned that they are slightly biased toward the second half of the year.3 The figure below shows the distributions (with the correlation between month and count).

[Figure: release-month distributions for my Top 40 and for Grammy-winning songs, with the correlation between month and count]

I have now learned how to link my Spotify listening behavior to Last.fm. A year from now perhaps I will get emails from two different music-distribution computers and I can compare them head-to-head? In the meantime, I will probably just listen to the forty best songs of 2014 [link to my Spotify playlist].




  1. OK, “famously” is overstated, but even a casual search will reveal that there are many users who want more of their own listening data. Also, “not particularly public” is not the same as “not at all public.” For example, they apparently share all kinds of data with Walt Hickey at FiveThirtyEight (.html). I am envious of Mr. Hickey. []
  2. My top song of 2014 is one of these (I don’t rank my Top 40): The Black and White Years – Embraces, Modern Mod – January, or Perfume Genius – Queen []
  3. I also learned that “Little Green Apples” won in the same year that “Mrs. Robinson” and “Hey Jude” were nominated. Grammy voters apparently fail a more basic music preference test. []