[3] A New Way To Increase Charitable Donations: Does It Replicate?

A new paper finds that people will donate more money to help 20 people if you first ask them how much they would donate to help 1 person.

This Unit Asking Effect (Hsee, Zhang, Lu, & Xu, 2013, Psychological Science) emerges because donors are naturally insensitive to the number of individuals needing help. For example, Hsee et al. observed that if you ask different people how much they’d donate to help either 1 needy child or 20 needy children, you get virtually the same answer. But if you ask the same people to indicate how much they’d donate to 1 child and then to 20 children, they realize that they should donate more to help 20 than to help 1, and so they increase their donations.

If true, then this is a great example of how one can use psychology to design effective interventions.

The paper reports two field experiments and a study that solicited hypothetical donations (Study 1). Because it was easy, I attempted to replicate the latter. (Here at Data Colada, we report all of our replication attempts, no matter the outcome).

I ran two replications, a “near replication” using materials that I developed based on the authors’ description of their methods (minus a picture of a needy schoolchild) and then an “exact replication” using the authors’ exact materials. (Thanks to Chris Hsee and Jiao Zhang for providing those).

In the original study, people were asked how much they'd donate to help a kindergarten principal buy Christmas gifts for her 20 low-income pupils. There were four conditions, but I ran only the three most interesting ones: a Control condition (participants were asked how much they'd donate to help the 20 children), a condition asking about a single child, and the Unit Asking condition (participants were first asked about 1 child and then about the 20).

The original study had ~45 participants per cell. To be properly powered, replications should have ~2.5 times the original sample size. I (foolishly) collected only ~100 per cell in my near replication, but corrected my mistake in the exact replication (~150 per cell). Following Hsee et al., I dropped responses more than 3 SD from the mean, though there was a complication in the exact replication that required a judgment call. My studies used MTurk participants; theirs used participants from “a nationwide online survey service.”
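For concreteness, here is a minimal sketch of that exclusion rule in R; the column names are placeholders of mine, not those in the posted data file.

# Drop responses more than 3 standard deviations from the mean.
exclude_outliers <- function(x, cutoff = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  x[!is.na(z) & abs(z) <= cutoff]
}

# Applied within each condition of a data frame `d` with (hypothetical)
# columns `donation` and `condition`:
# trimmed <- lapply(split(d$donation, d$condition), exclude_outliers)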

Here are the results of the original (some means and SEs are guesses) and my replications (full data).

I successfully replicated the Unit Asking Effect, as defined by Unit Asking vs. Control: the effect was marginal (p=.089) in the smaller near replication and highly significant (p<.001) in the exact replication.

There were some differences. First, my effect sizes (d=.24 and d=.48) were smaller than theirs (d=.88). Second, whereas they found that, across conditions, people were insensitive to whether they were asked to donate to 1 child or 20 children (the white $15 bar vs. the gray $18 bar), I found a large difference in my near replication and a smaller but significant difference in the exact replication. This sensitivity is important, because if people do give lower donations for 1 child than for 20, then they might anchor on those lower amounts, which could diminish the Unit Asking Effect.
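For readers who want to see the mechanics, here is a minimal sketch in R of the comparison and effect-size calculation, run on simulated donation amounts rather than the actual replication data:

set.seed(1)
control     <- rnorm(150, mean = 18, sd = 20)   # simulated Control donations
unit_asking <- rnorm(150, mean = 28, sd = 22)   # simulated Unit Asking donations

t.test(unit_asking, control)   # Welch two-sample t-test, Unit Asking vs. Control

cohens_d <- function(x, y) {   # Cohen's d with a pooled standard deviation
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / pooled_sd
}
cohens_d(unit_asking, control)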

In sum, my studies replicated the Unit Asking Effect.

 

[2] Using Personal Listening Habits to Identify Personal Music Preferences

Not everything at Data Colada is as serious as fraudulent data. This post is way less serious than that. This post is about music and teaching.

As part of their final exam, my students analyze a data set. For a few years that data set has been a collection of my personal listening data from iTunes over the previous year. It has about 500 rows, one per song from that year, recording when I purchased the song, how many times I listened to it, and a handful of other pieces of information. The students predict the songs I will include on my end-of-year "Leif's Favorite Songs" compact disc. (Note to the youth: compact discs were physical objects that look a lot like Blu-Ray discs. We used to put them in machines to hear music.) So the students are meant to combine regressions and intuitions to make predictions. I grade them based on how many songs they correctly predict. I love this assignment.

The downside, as my TA tells me, is that my answer key is terrible. The problem is that I am encumbered both by my (slightly) superior statistical sense and my (substantially) superior sense of my own intentions and preferences. You see, a lot goes into the construction of a good mix tape. (Note to the youth: tapes were like CDs, except if you wanted to hear track 1 and then track 8 you were SOL.) I expected my students to account for that. "Ah look," I picture them reasoning, "he listened a lot to Pumped Up Kicks. But that would be an embarrassing pick. On the other hand, he skipped this Gil Scott-Heron remix a lot, but you know that's going on there." They don't do that. They pick the songs I listen to a lot.

But then they miss certain statistical realities. When it comes to grading, the single biggest differentiator is whether a student accounts for how long a song has been in the playlist (see the scatterplot of 2011, below). If you don't account for it, you conclude that all of my favorite songs were purchased in the first couple of months of the year. A solid 50% of students think that I have a mad crush on January music. The other half try to account for it. Some calculate a "listens per day" metric, while others use a standardization procedure of one type or another. I personally use a method that essentially accounts for the likelihood that a song will come up, and therefore heavily discounts the very early tracks and weights the later tracks all about the same. You may ask, "wait, why are you analyzing your own data?" No good explanation. I will say, though, that I almost certainly change my preferences based on these analyses, shifting them away from what my algorithm predicts. That is bad for the assignment. I am not a perfect teacher.
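As a rough illustration of the simplest of those fixes, here is a listens-per-day calculation in base R on a made-up three-song data set (the real exam data are not included here):

songs <- data.frame(
  title     = c("Song A", "Song B", "Song C"),
  purchased = as.Date(c("2011-01-15", "2011-06-01", "2011-11-20")),
  listens   = c(120, 60, 25)
)

year_end <- as.Date("2011-12-31")
songs$days_in_playlist <- as.numeric(year_end - songs$purchased)
songs$listens_per_day  <- songs$listens / songs$days_in_playlist

songs[order(-songs$listens_per_day), ]   # rank by adjusted rather than raw listens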

I don’t think that I will use this assignment anymore since I no longer listen to iTunes. Now I use Spotify. (Note to the old: Spotify is like a musical science fiction miracle that you will never understand. I don’t.)
Leif's Song Scatterplot

[1] "Just Posting It" works, leads to new retraction in Psychology

The fortuitous discovery of new fake data.
For a project I worked on this past May, I needed data on variables as different from each other as possible. From the data-posting journal Judgment and Decision Making I downloaded data on ten such variables, including one from a now-retracted paper involving the estimation of coin sizes. I created a chart and inserted it into a paper that I sent to several colleagues, and into slides presented at an APS talk.

An anonymous colleague, "Larry," saw the chart and, for not entirely obvious reasons, became interested in the coin-size study. After downloading the publicly available data, he noticed something odd (something I had not noticed): although each participant had evaluated four coins, the data contained only one column of estimates. The average? No, for all the entries were integers, and averages of four numbers are rarely integers. Something was off.
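A quick simulation makes the point (the estimate values below are made up, not taken from the dataset): the average of four integer estimates is itself an integer only when their sum is divisible by 4, which happens only about a quarter of the time for any single participant, so an entire column of integer averages would be astronomically unlikely.

set.seed(1)
sums <- replicate(1e5, sum(sample(2:7, 4, replace = TRUE)))  # four made-up integer estimates
p_integer_avg <- mean(sums %% 4 == 0)   # about .25 under this assumption
p_integer_avg^50   # chance that, say, 50 participants all have integer averages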

Interest piqued, he ran more analyses, which surfaced more anomalies. He shared them with the editor, who contacted the author. The author provided explanations that were nearly as implausible as they were incapable of accounting for the anomalies. The retraction ensued.

Some of the anomalies
1. Contradiction with paper
The paper describes a 0-10 integer scale, but the dataset contains decimals and negative numbers.

2. Implausible correlations among emotion measures
Shame and embarrassment are intimately related emotions, and yet in the data they are correlated negatively, r = -.27. Fear and anxiety: r = -.01. Real emotion ratings don't exhibit these correlations.
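Here is a minimal sketch of that kind of sanity check in R, using simulated ratings and hypothetical variable names rather than the retracted dataset: closely related emotions rated by the same people should correlate positively.

set.seed(1)
n <- 120
distress      <- rnorm(n)                      # shared "how bad did you feel" component
shame         <- distress + rnorm(n, sd = 0.7)
embarrassment <- distress + rnorm(n, sd = 0.7)
cor(shame, embarrassment)                      # strongly positive, as in genuine ratings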

3. Impossibly similar results
Fabricated data often exhibit a pattern of excessive similarity (e.g., very similar means across conditions). This pattern is what led to the uncovering of Sanna and Smeesters as fabricateurs (see the "Just Post It" paper). Diederik Stapel's data also exhibit excessive similarity, going back at least to his dissertation.

The coin-size paper also has excessive similarity. For example, coin-size estimates supposedly obtained from 49 individuals across two different experiments are almost identical:
Experiment 1 (n=25): 2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,6,7
Experiment 2 (n=24): 2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,_,6,6,6,6,6,6,6,7

Simulations drawing random samples from the data themselves (bootstrapping) show that it is nearly impossible to obtain such similar results. The hypothesis that these data came from random samples is rejected, p<.000025 (see R code, detailed explanation).
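The linked R code is the authoritative analysis; the following is only a rough sketch of the bootstrapping logic, using a simple similarity measure of my own (the total difference between the two frequency tables):

exp1 <- c(2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,6,7)  # n = 25
exp2 <- c(2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5,6,6,6,6,6,6,6,7)    # n = 24

# Similarity measure: total absolute difference between the two frequency tables
similarity_gap <- function(a, b) {
  values <- sort(unique(c(a, b)))
  sum(abs(tabulate(match(a, values), length(values)) -
          tabulate(match(b, values), length(values))))
}
observed_gap <- similarity_gap(exp1, exp2)   # = 1: the tables differ by a single response

set.seed(1)
pooled <- c(exp1, exp2)
boot_gaps <- replicate(1e5, {
  a <- sample(pooled, length(exp1), replace = TRUE)
  b <- sample(pooled, length(exp2), replace = TRUE)
  similarity_gap(a, b)
})
mean(boot_gaps <= observed_gap)   # share of bootstrap pairs at least this similar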

Who vs. which
These data are fake beyond reasonable doubt. We don't know, however, who faked them.
That question is of obvious importance to the authors of the paper and perhaps their home and granting institutions, but arguably not so much to the research community more broadly. We should care, instead, about which data are fake.

If other journals followed the lead of Judgment and Decision Making and required data posting (its editor Jon Baron, by the way, started the data-posting policy well before I wrote my "Just Post It"), we would have a much easier time identifying invalid data. Some of the coin-size authors have a paper in JESP, one in Psychological Science, and another with similar results in Appetite. If the data behind those papers were available, we would not need to speculate as to their validity.

Author’s response
When discussing the work of others, our policy here at Data Colada is to contact them before posting. We ask for feedback to avoid inaccuracies and misunderstandings, and we give authors space to comment within the original blog post. The corresponding author of the retracted article, Dr. Wen-Bin Chiou, wrote to me via email:

Although the data collection and data coding was done by my research assistant, I must be responsible for the issue. Unfortunately, the RA had left my lab last year and studied abroad. At this time, I cannot get the truth from him and find out what was really going wrong […] as to the decimal points and negative numbers, I recoded the data myself and sent the editor with the new dataset. I guess the problem does not exist in the new dataset. With regard to the impossible similar results, the RA sorted the coin-size estimate variable, producing the similar results. […] Finally, I would like to thank Dr. Simonsohn for including my clarifications in this post.
[See unedited version]

Uri's note: the similarity analysis is based on the frequency of values across samples, not their order, so sorting does not explain why the data are incompatible with random sampling.