[8] Adventures in the Assessment of Animal Speed and Morality

Animal Virtue Figure 1
In surveys, most people answer most questions. That is true regardless of whether or not questions are coherently constructed and reasonably articulated. That means that absurd questions still receive answers, and in part because humans are similar to one another, those answers can even look peculiarly consistent. I asked an absurd question and was rewarded with an entertaining answer.

Some years ago, with Tom Meyvis, I tried to develop a manipulation to create an association between speed and virtue. Our spartan publication history on the topic testifies to our (lack of) success. That doesn’t mean that the pilot data weren’t interesting for a different reason.

Participants saw a sequence of 20 animal photographs and rated each on one of two bipolar dimensions: speed or goodness. The former is straightforward. The latter could be best construed as an evaluation of moral worth. That is an absurd question. What sorts of answers did we receive?
Animal Virtue Figure 2
My Top 5 observations:

1. The Tortoise is the most moral animal. I anticipated more class-profiling, and a resulting ingroup bias for mammalia. Nope. Perhaps researchers should try an implicit measure?*

2. Aquatic race featuring: Jellyfish vs. Starfish vs. Walrus. Who wins? People give the jellyfish the edge. The starfish has no chance.

3. Nature documentaries frequently bandy about facts like, “hippopotami kill more people every year than heart disease.” My respondents overlooked that: hippos are more moral than sloths (which nature documentaries never mention for their killing ability).

4. The orangutan is not just a mammal or just a primate, it is a great ape. Huge opportunity for some ingroup favoritism. Instead people favor the cheetah, walrus, and hippo (amongst others). Explain that.

5. Most animals are good. Our scale had a meaningful midpoint, yet all but three animals are above it. Who is bad? Hyena, Barracuda, and Jellyfish. The Jellyfish is worst. And deceptively fast. Perhaps a researcher could prime people with jellyfish and see if they cheat more on that matrices task?**

Perhaps some absurd questions have correct answers? I asked a pair of experts. Pieter Thomas Jefferson Johnson is an ecologist possibly best known for solving a major scientific problem before he was old enough to drink. Michael Jennions is a world-renowned evolutionary biologist, known for many things, including this video (the link alone makes this post worthwhile). I asked them to rank the 20 animals for speed and morality. Their speed ratings are similar to each other (r = .91) and to the novices’ (r = .87). Morality was trickier. Both said that any response would be random, or as Piet said, “I would probably tie them all in ranking”. But responses aren’t quite random. Michael rated based on the complexity of the central nervous system (complex = evil), whereas Pieter used “trophic level, followed by an inverse body mass index”. Despite very different approaches, they are mildly correlated with each other (r = .29). Experts and novices all agree on the virtue of the Tortoise, but Michael and Piet are just as fond of the lowly snail.
Animal Virtue Figure 3
*No they shouldn’t.

**Don’t run that study. I mean it.

[7] Forthcoming in the American Economic Review: A Misdiagnosed Failure-to-Replicate

In the paper “One Swallow Doesn’t Make A Summer: New Evidence on Anchoring Effects”, forthcoming in the AER, Maniadis, Tufano and List attempted to replicate a classic study in economics. The results were entirely consistent with the original and yet they interpreted them as a “failure to replicate.” What went wrong?

This post answers that question succinctly; our new paper has additional analyses.

Original results
In an article with >600 citations, Ariely, Loewenstein, and Prelec (2003) showed that people presented with high anchors (“Would you pay $70 for a box of chocolates?”) end up paying more than people presented with low anchors (“Would you pay $20 for a box of chocolates?”). They found this effect in five studies, but the AER replication reran only Study 2. In that study, participants gave their asking prices for aversive sounds that were 10, 30, or 60 seconds long, after a high (50¢), low (10¢), or no anchor.

Replication results

“comparing only the 10-cent and 50-cent anchor conditions, we find an effect size equal to 28.57 percent [the percentage difference between valuations], about half of what ALP found. The p-value […] was equal to 0.253” (p. 8).

So their evidence is unable to rule out the possibility that anchoring is a zero effect. But that is only part of the story. Does their evidence also rule out a sizable anchoring effect? It does not. Their evidence is consistent with an effect much larger than the original.

Fig1 Anchoring post

Those calculations use Maniadis et al.’s definition of effect size: % difference in valuations (as quoted above). An alternative is to divide the difference of means by the standard deviation (Cohen’s d). Using this metric the Replication’s effect size differs more markedly from the Original’s: d=.26 vs. d=.94. However, the 95% confidence interval for the Replication includes effects as big as d=.64, midway between medium and large effects. Whether we examine Maniadis et al.’s operationalization of effect size, then, or Cohen’s d, we arrive at the same conclusion: the Replication is too noisy to distinguish between a nonexistent and a sizable anchoring effect.
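
To make the two effect-size definitions concrete, here is a minimal Python sketch. The function names and the valuations are made up for illustration and are not data from either study. Taking the percentage difference relative to the low-anchor mean, and using the pooled standard deviation for Cohen's d, are both assumptions about the details.

```python
from math import sqrt
from statistics import mean, variance


def percent_difference(low, high):
    """Percentage difference in mean valuations, relative to the low-anchor mean (assumed baseline)."""
    return 100 * (mean(high) - mean(low)) / mean(low)


def cohens_d(low, high):
    """Difference of means divided by the pooled standard deviation."""
    n1, n2 = len(low), len(high)
    pooled_var = ((n1 - 1) * variance(low) + (n2 - 1) * variance(high)) / (n1 + n2 - 2)
    return (mean(high) - mean(low)) / sqrt(pooled_var)


# Hypothetical valuations in cents, for illustration only.
low_anchor = [5, 10, 10, 20, 30]
high_anchor = [10, 10, 20, 30, 50]

print(round(percent_difference(low_anchor, high_anchor), 1))
print(round(cohens_d(low_anchor, high_anchor), 2))
```

The two metrics can diverge: a large percentage difference between very small valuations can still be a small d when the valuations are noisy.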

Why is the Replication so imprecise?
The Replication has 12% fewer participants than the Original and, in addition, nearly half of all its valuations are ≤10¢. Even if anchoring had a large percentage effect, one that doubles WTA from 3¢ to 6¢, the tendency of participants to round both to 5¢ makes it undetectable. And there is the floor effect: valuations so close to $0 cannot drop. One way around this problem is to do something economists do all the time: Express the effect size of one variable (How big is the impact of X on Z?) relative to the effect size of another (it is half the effect of Y on Z). Figure 2 shows that, in cents, the effects of both anchoring and duration are smaller in the replication, and that the relative effect of anchoring is comparable across studies.

Fig2 Anchoring post

The original paper had five studies: four were p<.01 and the fifth p<.02. When we submit these p-values to p-curve we can empirically examine the fear expressed by the replicators that the original finding is a false positive. The results strongly reject this possibility; selective reporting is an unlikely explanation for the original paper, p<.0001.

Some successful replications
Every year Uri runs a replication of Ariely et al.’s Study 1 in his class. In an online survey at the beginning of the semester, students write down the last two digits of their Social Security number, indicate if they would pay that amount for something (this semester it was for a ticket to watch Jerry Seinfeld live on campus), and then indicate the most they would pay. Figure 3 has this year’s data:

Fig3 Anchoring post

We recently learned that SangSuk Yoon, Nathan Fong and Angelika Dimoka successfully replicated Ariely et al.’s Study 1 with real decisions (in contrast to this paper).

Concluding remark
We are not vouching for the universal replicability of Ariely et al. here. It is not difficult to imagine moderators (beyond floor effects) that attenuate anchoring. We are arguing that the forthcoming “failure-to-replicate” anchoring in the AER is no such thing.

Note: When we discuss others’ work at DataColada we ask them for feedback and offer them space to comment within the original post. Maniadis, Tufano, and List provided feedback only for our paper and did not send us comments to post here.


[6] Samples Can’t Be Too Large

Reviewers, and even associate editors, sometimes criticize studies for being “overpowered” – that is, for having sample sizes that are too large. (Recently, the between-subjects sample sizes under attack were about 50-60 per cell, just a little larger than you need to have an 80% chance to detect that men weigh more than women).

This criticism never makes sense.

The rationale for it is something like this: “With such large sample sizes, even trivial effect sizes will be significant. Thus, the effect must be trivial (and we don’t care about trivial effect sizes).”

But if this is the rationale, then the criticism is ultimately targeting the effect size rather than the sample size.  A person concerned that an effect “might” be trivial because it is significant with a large sample can simply compute the effect size, and then judge whether it is trivial.

(As an aside: Assume you want an 80% chance to detect a between-subjects effect. You need about 6,000 per cell for a “trivial” effect, say d=.05, and still about 250 per cell for a meaningful “small” effect, say d=.25. We don’t need to worry that studies with 60 per cell will make trivial effects be significant).
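
The per-cell numbers in this aside can be checked with a short Python sketch. It uses the standard normal approximation for a two-sided, two-sample comparison at alpha = .05, which is an assumption; an exact t-based calculation gives slightly larger values. The function name `n_per_cell` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_cell(d, power=0.80, alpha=0.05):
    """Approximate per-cell n to detect effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# d=.05 ("trivial"), d=.25 ("small"), d=.59 (gender and weight, as in this post)
for d in (0.05, 0.25, 0.59):
    print(d, n_per_cell(d))
```

Because the required n scales with 1/d², halving the effect size quadruples the sample needed, which is why 60 per cell is nowhere near enough to make a trivial effect significant.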

It is OK to criticize a study for having a small effect size. But it is not OK to criticize a study for having a large sample size. This is because sample sizes do not change effect sizes. If I were to study the effect of gender on weight with 40 people or with 400 people, I would, on average, estimate the same effect size (d ~= .59). Collecting 360 additional observations does not decrease my effect size (though, happily, it does increase the precision of my effect size estimate, and that increased precision better enables me to tell whether an effect size is in fact trivial).

Our field suffers from a problem of underpowering. When we underpower our studies, we either suffer the consequences of a large file drawer of failed studies (bad for us) or we are motivated to p-hack in order to find something to be significant (bad for the field). Those who criticize studies for being overpowered are using a nonsensical argument to reinforce exactly the wrong methodological norms.

If someone wants to criticize trivial effect sizes, they can compute them and, if they are trivial, criticize them. But they should never criticize samples for being too large.

We are an empirical science. We collect data, and use those data to learn about the world. For an empirical science, large samples are good. It is never worse to have more data.


[5] The Consistency of Random Numbers

What’s your favorite number between 1 and 100? Now, think of a random number between 1 and 100. My goal for this post is to compare those two responses.

Number preferences feel random. They aren’t. “Random” numbers also feel random. Those aren’t random either. I collected some data, found a pair of austere academic papers, and one outstanding blog post. I will tell you about all of them.

First, the data I collected. I (along with Hannah Perfecto, one of my excellent doctoral students) asked one group of people to generate a random number between 1 and 100. Another group reported their favorite number between 1 and 100. That’s it.

We know a little about preferences. People like their birthday numbers, for example. They pursue round numbers. In preparing this post, I learned of a simmering literature on single-digit number preferences, suggesting that in both 1971 and in 1988 people liked the number 7. (Aside: Someone should write the number preference equivalent of the Princeton Trilogy. In fact, why not move beyond preferences to other attributes? For example, are even numbers more warm or more competent?*). As far as I can tell, less is known about how people generate random numbers. Do people choose the same numbers at random as they choose as their favorites?

The figures tell the whole story, but words are useful. Consider four notable numbers. Consistent with past research, people like the number 7. Inconsistent with horror movie titlers and hotel floor number assigners, people also like the number 13. The number 42 has an entirely wonderful Wikipedia entry, suggesting that its consequence goes beyond Jackie Robinson and Douglas Adams. Perhaps the Data Colada can add a small footnote to its mystique? Finally, the number 69 also has a Wikipedia entry, though it is far less vivid than you’re anticipating. On the random side there are fewer obvious winners (three way tie between 5, 67, and 69).

numbers frequencies

How about some other patterns? First of all, the two sets are highly, but imperfectly, correlated at r = .48. Random numbers are larger than favorite numbers (Ms = 46.9 vs. 30.7), t(565) = 7.01, p

numbers correlation

These tendencies are partially reflected in the numeric codes people choose for debit cards and their ilk. PIN numbers are a mix of preference and random, and consistent with the data we collected, a brilliant analysis of leaked PIN numbers reveals birthday liking (numbers below 32) and repeated numbers (like multiples of 11). Figure 3 reproduces a chart of 4-digit PIN codes. It will take 30 seconds to orient yourself, but then you will spend five minutes savoring it.

numbers PIN

My favorite number is just about the most arbitrary preference possible. My “random” number is more arbitrary. But neither is arbitrary at all.

* Hypothesis: More warm. Odd numbers are wicked competent.


[4] The Folly of Powering Replications Based on Observed Effect Size

It is common for researchers running replications to set their sample size assuming the effect size the original researchers got is correct. So if the original study found an effect size of d=.73, the replicator assumes the true effect is d=.73, and sets sample size so as to have a 90% chance, say, of getting a significant result.

This apparently sensible way to power replications is actually deeply misleading.

Why Misleading?
Because of publication bias. Given that (original) research is only publishable if it is significant, published research systematically overestimates effect size (Lane & Dunlap, 1978). For example, if sample size is n=20 per cell, and true effect size is d=.2, published studies will on average estimate the effect to be d=.78. The intuition is that overestimates are more likely to be significant than underestimates, and so more likely to be published.
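
The example in the paragraph above can be approximated with a small simulation. This is a sketch under stated assumptions, not the analysis behind the cited figure: scores are normal with unit variance, a study is "published" only when it is significant in the predicted direction, and the critical value 2.024 is the two-sided .05 cutoff for a t-test with 38 degrees of freedom. The function names are hypothetical.

```python
import random
from math import sqrt


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def mean_published_d(true_d=0.2, n=20, sims=20000, t_crit=2.024, seed=1):
    """Average observed d among simulated studies that reach significance."""
    rng = random.Random(seed)
    published = []
    for _ in range(sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_d, 1.0) for _ in range(n)]
        m0, v0 = mean_and_variance(control)
        m1, v1 = mean_and_variance(treatment)
        d = (m1 - m0) / sqrt((v0 + v1) / 2)  # observed Cohen's d (equal n)
        t = d * sqrt(n / 2)                  # two-sample t statistic
        if t > t_crit:                       # assumed publication filter
            published.append(d)
    return sum(published) / len(published)


print(round(mean_published_d(), 2))
```

With n=20 per cell an observed d has to exceed roughly .64 to be significant, so every "published" estimate in the simulation is more than three times the true d=.2.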

If we systematically overestimate effect sizes in original work, then we systematically overestimate the power of replications that assume those effects are real.

Let’s consider some scenarios. If original research were powered to 50%, a highly optimistic benchmark (Button et al., 2013; Sedlmeier & Gigerenzer, 1989), here is what it looks like:

So replications claiming 80% power actually have just 51% (Details | R code).

Ok. What if original research were powered at a more realistic level of, say, 35%:

The figures show that the extent of overclaiming depends on the power of the original study. Because nobody knows what that is, nobody knows how much power a replication claiming 80%, 90% or 95% really has.

A self-righteous counterargument
A replicator may say:

Well, if the original author underpowered her studies, then she is getting what she deserves when the replications fail; it is not my fault my replication is underpowered, it is hers. SHE SHOULD BE DOING POWER ANALYSIS!!!

Three problems.
1. Replications in particular and research in general are not about justice. We should strive to maximize learning, not schadenfreude.

2. The original researcher may have thought the effect was bigger than it is: she thought she had 80% power, but she had only 50%. It is not “fair” to “punish” her for not knowing the effect size she is studying. That’s precisely why she is studying it.

3. Even if all original studies had 80% power, most published estimates would be over-estimates, and so even if all original studies had 80% power, most replications based on observed effects would overclaim power. For instance, one in five replications claiming 80% would actually have <50% power (R code).


What’s the alternative?
In a recent paper (“Evaluating Replication Results”) I put forward a different approach to thinking about replication results altogether. For a replication to fail it is not enough that p>.05 in it; we also need to conclude that the effect is too small to have been detected in the original study (in effect, we need tight confidence intervals around 0). Underpowered replications will tend to fail to reject 0 (be n.s.), but will also tend to fail to reject big effects. In the new approach this result is considered uninformative rather than a “failure-to-replicate.” The paper also derives a simple rule for the sample size needed to obtain informative failures to replicate: 2.5 times the original sample size ensures 80% power for that test. That number is unaffected by publication bias, how original authors power their studies, and the study design (e.g., two-proportions vs. ANOVA).


[3] A New Way To Increase Charitable Donations: Does It Replicate?

A new paper finds that people will donate more money to help 20 people if you first ask them how much they would donate to help 1 person.

This Unit Asking Effect (Hsee, Zhang, Lu, & Xu, 2013, Psychological Science) emerges because donors are naturally insensitive to the number of individuals needing help. For example, Hsee et al. observed that if you ask different people how much they’d donate to help either 1 needy child or 20 needy children, you get virtually the same answer. But if you ask the same people to indicate how much they’d donate to 1 child and then to 20 children, they realize that they should donate more to help 20 than to help 1, and so they increase their donations.

If true, then this is a great example of how one can use psychology to design effective interventions.

The paper reports two field experiments and a study that solicited hypothetical donations (Study 1). Because it was easy, I attempted to replicate the latter. (Here at Data Colada, we report all of our replication attempts, no matter the outcome).

I ran two replications, a “near replication” using materials that I developed based on the authors’ description of their methods (minus a picture of a needy schoolchild) and then an “exact replication” using the authors’ exact materials. (Thanks to Chris Hsee and Jiao Zhang for providing those).

In the original study, people were asked how much they’d donate to help a kindergarten principal buy Christmas gifts for her 20 low-income pupils. There were four conditions, but I only ran the three most interesting conditions:


The original study had ~45 participants per cell. To be properly powered, replications should have ~2.5 times the original sample size. I (foolishly) collected only ~100 per cell in my near replication, but corrected my mistake in the exact replication (~150 per cell). Following Hsee et al., I dropped responses more than 3 SD from the mean, though there was a complication in the exact replication that required a judgment call. My studies used MTurk participants; theirs used participants from “a nationwide online survey service.”

Here are the results of the original (some means and SEs are guesses) and my replications (full data).

I successfully replicated the Unit Asking Effect, as defined by Unit Asking vs. Control; it was marginal (p=.089) in the smaller-sampled near replication and highly significant (p< .001) in the exact replication.

There were some differences. First, my effect sizes (d=.24 and d=.48) were smaller than theirs (d=.88). Second, whereas they found that, across conditions, people were insensitive to whether they were asked to donate to 1 child or 20 children (the white $15 bar vs. the gray $18 bar), I found a large difference in my near replication and a smaller but significant difference in the exact replication. This sensitivity is important, because if people do give lower donations for 1 child than for 20, then they might anchor on those lower amounts, which could diminish the Unit Asking Effect.

In sum, my studies replicated the Unit Asking Effect.


[2] Using Personal Listening Habits to Identify Personal Music Preferences

Not everything at Data Colada is as serious as fraudulent data. This post is way less serious than that. This post is about music and teaching.

As part of their final exam, my students analyze a data set. For a few years that data set has been a collection of my personal listening data from iTunes over the previous year. The data set has about 500 rows, with each reporting a song from that year, when I purchased it, how many times I listened to it, and a handful of other pieces of information. The students predict the songs I will include on my end-of-year “Leif’s Favorite Songs” compact disc. (Note to the youth: compact discs were physical objects that look a lot like Blu-Ray discs. We used to put them in machines to hear music.) So the students are meant to combine regressions and intuitions to make predictions. I grade them based on how many songs they correctly predict. I love this assignment.

The downside, as my TA tells me, is that my answer key is terrible. The problem is that I am encumbered both by my (slightly) superior statistical sense and my (substantially) superior sense of my own intentions and preferences. You see, a lot goes into the construction of a good mix tape (Note to the youth: tapes were like CD’s, except if you wanted to hear track 1 and then track 8 you were SOL.) I expected my students to account for that. “Ah look,” I am picturing, “he listened a lot to Pumped Up Kicks. But that would be an embarrassing pick. On the other hand, he skipped this Gil Scott-Heron remix a lot, but you know that’s going on there.” They don’t do that. They pick the songs I listen to a lot.

But then they miss certain statistical realities. When it comes to grading, the single biggest differentiator is whether or not a student accounts for how long a song has been in the playlist (see the scatterplot of 2011, below). If you don’t account for it, then you think that all of my favorite songs were released in the first couple of months. A solid 50% of students think that I have a mad crush on January music. The other half try to account for it. Some calculate a “listens per day” metric, while others use a standardization procedure of one type or another. I personally use a method that essentially accounts for the likelihood that a song will come up, and therefore heavily discounts the very early tracks and weighs the later tracks all about the same. You may ask, “wait, why are you analyzing your own data?” No good explanation. I will say though, I almost certainly change my preferences based on these analyses – I change them away from what my algorithm predicts. That is bad for the assignment. I am not a perfect teacher.
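
For readers who want to see what a “listens per day” metric might look like, here is a minimal Python sketch. The songs, play counts, and dates are invented for illustration; the actual data set's column names and the students' exact formulas are not known, so everything named here is hypothetical.

```python
from datetime import date


def listens_per_day(play_count, purchased, as_of):
    """Plays divided by days in the library (at least one day, to avoid dividing by zero)."""
    days = max((as_of - purchased).days, 1)
    return play_count / days


# Hypothetical rows: (title, play count, purchase date).
songs = [
    ("January track", 60, date(2011, 1, 10)),
    ("November track", 20, date(2011, 11, 1)),
]
as_of = date(2011, 12, 15)

# Rank by rate rather than by raw play count.
ranked = sorted(songs, key=lambda s: listens_per_day(s[1], s[2], as_of), reverse=True)
for title, plays, bought in ranked:
    print(title, round(listens_per_day(plays, bought, as_of), 2))
```

Ranking by raw play count would put the January track first; ranking by rate reverses the order, which is the correction the better-scoring students make.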

I don’t think that I will use this assignment anymore since I no longer listen to iTunes. Now I use Spotify. (Note to the old: Spotify is like a musical science fiction miracle that you will never understand. I don’t.)
Leif's Song Scatterplot

[1] "Just Posting It" works, leads to new retraction in Psychology

The fortuitous discovery of new fake data.
For a project I worked on this past May, I needed data for variables as different from each other as possible. From the data-posting journal Judgment and Decision Making I downloaded data for ten, including one from a now retracted paper involving the estimation of coin sizes. I created a chart and inserted it into a paper that I sent to several colleagues, and into slides presented at an APS talk.

An anonymous colleague, “Larry,” saw the chart and, for not-entirely obvious reasons, became interested in the coin-size study. After downloading the publicly available data he noticed something odd (something I had not noticed): while each participant had evaluated four coins, the data contained only one column of estimates. The average? No, for all entries were integers; averages of four numbers are rarely integers. Something was off.

Interest piqued, he did more analyses leading to more anomalies. He shared them with the editor, who contacted the author. The author provided explanations. These were nearly as implausible as they were incapable of accounting for the anomalies. The retraction ensued.

Some of the anomalies
1. Contradiction with paper
Paper describes 0-10 integer scale, dataset has decimals and negative numbers.

2. Implausible correlations among emotion measures
Shame and embarrassment are intimately related emotions, and yet they are correlated negatively in the data (r = -.27). Fear and anxiety: r = -.01. Real emotion ratings don’t exhibit these correlations.

3. Impossibly similar results
Fabricated data often exhibit a pattern of excessive similarity (e.g., very similar means across conditions). This pattern led to uncovering Sanna and Smeesters as fabricateurs (see “Just Post It” paper). Diederik Stapel’s data also exhibit excessive similarity, going back to his dissertation at least.

The coin-size paper also has excessive similarity. For example, coin-size estimates supposedly obtained from 49 individuals across two different experiments are almost identical:
Experiment 1 (n=25): 2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,6,7
Experiment 2 (n=24): 2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,_,6,6,6,6,6,6,6,7

Simulations drawing random samples from the data themselves (bootstrapping) show that it is nearly impossible to obtain such similar results. The hypothesis that these data came from random samples is rejected, p<.000025 (see R code, detailed explanation).
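
The logic of that bootstrap can be sketched in a few lines of Python. This is an illustration only, not the posted R code: the similarity statistic (the summed absolute difference in how often each value appears in the two samples) and the resampling from the pooled 49 observations are assumptions, so the share it prints need not match the p-value reported above.

```python
import random
from collections import Counter

# Coin-size estimates as listed in the post.
exp1 = [2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7]
exp2 = [2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7]


def frequency_gap(a, b):
    """Sum over values of |count in a - count in b| (assumed similarity statistic)."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[v] - cb[v]) for v in set(ca) | set(cb))


def bootstrap_share(a, b, sims=20000, seed=1):
    """Share of resampled pairs at least as similar as the observed pair."""
    rng = random.Random(seed)
    pool = a + b
    observed = frequency_gap(a, b)
    hits = 0
    for _ in range(sims):
        sim_a = rng.choices(pool, k=len(a))  # draw with replacement
        sim_b = rng.choices(pool, k=len(b))
        if frequency_gap(sim_a, sim_b) <= observed:
            hits += 1
    return observed, hits / sims


observed, share = bootstrap_share(exp1, exp2)
print(observed, share)
```

The statistic depends only on how often each value occurs, not on the order of the rows, so sorting a column cannot produce or remove this kind of similarity.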

Who vs. which
These data are fake beyond reasonable doubt. We don’t know, however, who faked them. That question is of obvious importance to the authors of the paper and perhaps their home and granting institutions, but arguably not so much to the research community more broadly. We should care, instead, about which data are fake.

If other journals followed the lead of Judgment and Decision Making and required data posting (its editor Jon Baron, by the way, started the data posting policy well before I wrote my “Just Post It”), we would have a much easier time identifying invalid data. Some of the coin-size authors have a paper in JESP, one in Psychological Science, and another with similar results in Appetite. If the data behind those papers were available, we would not need to speculate as to their validity.

Author’s response
When discussing the work of others, our policy here at Data Colada is to contact them before posting. We ask for feedback to avoid inaccuracies and misunderstandings, and give authors space for commenting within our original blog post. The corresponding author of the retracted article, Dr. Wen-Bin Chiou, wrote to me via email:

Although the data collection and data coding was done by my research assistant, I must be responsible for the issue. Unfortunately, the RA had left my lab last year and studied abroad. At this time, I cannot get the truth from him and find out what was really going wrong […] as to the decimal points and negative numbers, I recoded the data myself and sent the editor with the new dataset. I guess the problem does not exist in the new dataset. With regard to the impossible similar results, the RA sorted the coin-size estimate variable, producing the similar results. […] Finally, I would like to thank Dr. Simonsohn for including my clarifications in this post.
[See unedited version]

Uri’s note: the similarity of data is based on the frequency of values across samples, not their order, so sorting does not explain why the data are incompatible with random sampling.