[72] Metacritic Has A (File-Drawer) Problem

Metacritic.com scores and aggregates critics’ reviews of movies, music, and video games. The website provides a summary assessment of the critics’ evaluations, using a scale ranging from 0 to 100. Higher numbers mean that critics were more favorable.

In theory, this website is pretty awesome, seemingly leveraging the wisdom-of-crowds to give consumers the most reliable recommendations. After all, it’s surely better to know what a horde of reviewers thinks than to know what a single reviewer thinks.

But at least when it comes to music reviews, metacritic is broken. I’ll explain how/why it is broken, I’ll propose a way to fix it, and I’ll show that the fix works.

Metacritic Is Broken

A few weeks ago, a fairly unknown “Scottish chamber pop band” named Modern Studies released an album that is not very good (Spotify .html). At about the same time, the nearly perfect band Beach House released a nearly perfect album (Spotify .html).

You might think these things are subjective, but in many cases they are really not. The Great Gatsby is objectively better than this blog post, and Beach House’s album is objectively better than Modern Studies’s album.

But what does Metacritic say?

So, yeah, metacritic is broken.

If this were a one-off example, I wouldn’t be writing this post. It is not a one-off example. For example, Metacritic would lead you to believe that Fever Ray’s 2017 release (Metascore of 87) is almost as good as St. Vincent’s 2017 release (Metascore of 88), but St. Vincent’s album is, I don’t know, a trillion times better [1]. More recently, Metacritic rated an unspeakably bad album by Goat Girl as something that is worth your time (Metascore of 80). It is not worth your time.

So what’s going on?

What’s going on is publication bias. Music reviewers don’t publish a lot of negative reviews, especially of artists that are unknown. As evidence of this, consider that although Metascores theoretically range from 0-100, in practice only 16% of albums released in 2018 have received a Metascore below 70 [2]. This might be because reviewers don’t want to be mean to struggling artists. Or it might be because reviewers don’t like to spend their time reviewing bad albums. Or it might be for some other reason.

But whatever the reason, you have to correct for the fact that an album that gets just a few reviews is probably not a very good album.

How can we fix it?

What I’m going to propose is kind of stupid. I didn’t put in the effort to try to figure out the optimal way to correct for this kind of publication bias. Honestly, I don’t know that I could figure that out. So instead, I thought about it for about 19 seconds, and I came to the following three conclusions:

(1) We can approximate the number of missing reviews by subtracting the number of observed reviews from the maximum number of reviews another album received in the same year [3].

(2) We can assume that the missing reviewers would’ve given fairly poor reviews. Since it’s a nice round number, let’s say those missing reviews would average out to 70.

(3) Albums with metascores below 70 probably don’t need to be corrected at all, since reviewers already felt licensed to write negative reviews in these cases.

For 2018, the most reviews I observed for an album is 30. As you can see in the above figures, Beach House’s album received 27 reviews. Thus, my simple correction adds three reviews of 70, adjusting it (slightly) down from 81 to 79.9. Meanwhile, Modern Studies’s album received only 6 reviews, and so we would add 24 reviews of 70, resulting in a much bigger adjustment, from 86 to 73.2.

So now we have Beach House at 79.9 and Modern Studies at 73.2. That’s much better.
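The three conclusions above fit in a few lines of code. This is only a sketch of the arithmetic as described in this post: the function and parameter names are mine, and it assumes the displayed Metascore can be treated as a simple average of the observed reviews, which may not be exactly how Metacritic computes it.

```python
def adjusted_metascore(metascore, n_reviews, max_reviews, fill_score=70.0):
    """Pad an album's reviews up to max_reviews with assumed reviews of fill_score.

    Assumption: metascore is treated as the plain mean of n_reviews reviews.
    """
    # Conclusion (3): albums already below the fill score are left uncorrected.
    if metascore < fill_score:
        return float(metascore)
    # Conclusion (1): missing reviews = most-reviewed album's count minus this album's count.
    n_missing = max(max_reviews - n_reviews, 0)
    # Conclusion (2): every missing review is assumed to score fill_score.
    total_points = metascore * n_reviews + fill_score * n_missing
    return total_points / (n_reviews + n_missing)


if __name__ == "__main__":
    # The two 2018 albums discussed above; 30 is the most reviews observed that year.
    print(round(adjusted_metascore(81, 27, 30), 1))  # Beach House
    print(round(adjusted_metascore(86, 6, 30), 1))   # Modern Studies
```

Run on the two albums above, this reproduces the 79.9 and 73.2 from the text.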

But the true test of whether this algorithm works would be to see whether Metascores become more predictive of consumers’ music evaluations after applying the correction than before applying the correction. But to do that, you’d need to have consumers evaluate a bunch of different albums, while ensuring that there is no selection bias in their ratings. How in the world do you do that?

Is it fixed?

Well, it just so happens that for the past 5.5 years, Leif Nelson, Yoel Inbar, and I have been systematically evaluating newly released albums. We call it Album Club, and it works like this. Almost every week, one of us assigns an album for the three of us to listen to. After we have given it enough listens, we email each other with a short review. In each review we have to (1) rate the album on a scale ranging from 0-10, and (2) identify our favorite song on the record [4].

The albums that we assign are pretty diverse. For example, we’ve listened to pop stars like Taylor Swift, popular bands like Radiohead and Vampire Weekend, underrated singer/songwriters like Eleanor Friedberger, a 21-year-old country singer who sounds like a 65-year-old country singer, a (very good) “experimental rap trio”, and even a (deservedly) unpopular “improvisational psych trio from Brooklyn” (Spotify .html) [5]. Moreover, at least one of us seems not to try to choose albums that we are likely to enjoy. So, for our purposes, this is a pretty great dataset (data .xlsx; code .R).

So let’s start by taking a look at 2018. So far this year, we have rated 22 albums, and 19 of those have received Metascores [6]. In the graphs below, I am showing the relationship between our average rating of each album and (1) actual Metascores (left panel) and (2) adjusted Metascores (right panel).

The first thing to notice is that, unlike Metacritic, Leif, Yoel, and I tend to use the whole freaking scale. The second thing to notice is the point of this post: Metascores were more predictive of our evaluations when we adjusted them for publication bias (right panel) than when we did not (left panel).

Now, I got the idea for this post because of what I noticed about a few albums in 2018. Thus, the analyses of the 2018 data must be considered a purely exploratory, potentially p-hacked endeavor. It is important to do confirmatory tests using the other years in my dataset (2013-2017). So that’s what I did. First, let’s look at a picture of 2017:

That looks like a successful replication. Though Metacritic did correctly identify the excellence of the St. Vincent album, it was otherwise kind of a disaster in 2017. But once you correct for publication bias, it does a lot better.

You can see in these plots that the corrected Metacritic scores are closer together on the “Adjusted” chart, indicating that the technique I am employing to correct for publication bias reduces the variance in Metascores. Reducing variance usually lowers correlations. But in this case, it increased the correlation. I take that as additional evidence that publication bias is indeed a big problem for Metacritic.

So what about the other years? Let’s look at a bar chart this time:

The effect is not always big, but it is always in the right direction. Metascores are more predictive when you use my dumb, blunt method of correcting for publication bias in music reviews than when you don’t.

In sum, you should listen to the new Beach House album. But not to the new Modern Studies album.


  1. I am convinced that Fever Ray’s song “IDK About You” was the worst song released in 2017. I tried to listen to it again after typing that sentence, but was unable to. Still, Fever Ray is definitely not all bad. Their best song “I Had A Heart” (released in 2009) soundtracks a dark scene in Season 4 of “Breaking Bad” (.html).
  2. 49 out of 307
  3. The same year is important, because the number of reviews has declined over time. I don’t know why.
  4. In case you are interested, I’ve made a playlist of 25 of my favorite songs that I discovered because of Album Club: Spotify .html
  5. Not to oversell the amount of diversity, it is understood that assigning a death metal album will get you kicked out of the Club. You can, however, assign songs that are *about* death metal: Spotify .html
  6. Some albums that we assign are not reviewed by enough critics to qualify for a Metascore. This is true of 59 of the 261 albums in our dataset. This is primarily because Leif likes to assign albums by obscure high school bands from suburban Wisconsin, some of which are surprisingly good: Spotify .html.

[53] What I Want Our Field To Prioritize

When I was a sophomore in college, I read a book by Carl Sagan called The Demon-Haunted World. By the time I finished it, I understood the difference between what is scientifically true and what is not. It was not obvious to me at the time: If a hypothesis is true, then you can use it to predict the future. If a hypothesis is false, then you can’t. Replicable findings are true precisely because you can predict that they will replicate. Non-replicable findings are not true precisely because you can’t. Truth is replicability. This lesson changed my life. I decided to try to become a scientist.

Although this lesson inspired me to pursue a career as a psychological scientist, for a long time I didn’t let it affect how I actually pursued that career. For example, during graduate school Leif Nelson and I investigated the hypothesis that people strive for outcomes that resemble their initials. For example, we set out to show that (not: test whether) people with an A or B initial get better grades than people with a C or D initial. After many attempts (we ran many analyses and we ran many studies), we found enough “evidence” for this hypothesis, and we published the findings in Psychological Science. At the time, we believed the findings and this felt like a success. Now we both recognize it as a failure.

The findings in that paper are not true. Yes, if you run the exact analyses we report on our same datasets, you will find significant effects. But they are not true because they would not replicate under specifiable conditions. History is about what happened. Science is about what happens next. And what happens next is that initials don’t affect your grades.

Inspired by discussions with Leif, I eventually (in 2010) reflected on what I was doing for a living, and I finally remembered that at some fundamental level a scientist’s #1 job is to differentiate what is true/replicable from what is not. This simple realization forever changed the way I conduct and evaluate research, and it is the driving force behind my desire for a more replicable science. If you accept this premise, then life as a scientist becomes much easier and more straightforward. A few things naturally follow.

First, it means that replicability is not merely a consideration, but the most important consideration. Of course I also care about whether findings are novel or interesting or important or generalizable, or whether the authors of an experiment are interpreting their findings correctly. But none of those considerations matter if the finding is not replicable. Imagine I claim that eating Funyuns® cures cancer. This hypothesis is novel and interesting and important, but those facts don’t matter if it is untrue. Concerns about replicability must trump all other concerns. If there is no replicability, there is no finding, and if there is no finding, there is no point assessing whether it is novel, interesting, or important. [1] Thus, more than any other attribute, journal editors and reviewers should use attributes that are diagnostic of replicability (e.g., statistical power and p-values) as a basis for rejecting papers. (Thank you, Simine Vazire, for taking steps in this direction at SPPS <.pdf>). [2]

Second, it means that the best way to prevent others from questioning the integrity of your research is to publish findings that you know to be replicable under specifiable conditions. You should be able to predict that if you do exactly X, then you will get Y. Your method section should be a recipe for getting an effect, specifying exactly which ingredients are sufficient to produce it. Of course, the best way to know that your finding replicates is to replicate it yourself (and/or to tie your hands by pre-registering your exact key analysis). This is what I now do (particularly after I obtain a p > .01 result), and I sleep a lot better because of it.

Third, it means that if someone fails to replicate your past work, you have two options. You can either demonstrate that the finding does replicate under specifiable/pre-registered conditions or you can politely tip your cap to the replicators for discovering that one of your published findings is not likely to be true. If you believe that your finding is replicable but don’t have the resources to run the replication, then you can pursue a third option: Specify the exact conditions under which you predict that your effect will emerge. This allows others with more resources to test that prediction. If you can’t specify testable circumstances under which your effect will emerge, then you can’t use your finding to predict the future, and, thus, you can’t say that it is true.

Andrew Meyer and his colleagues recently published several highly powered failures to reliably replicate my and Leif’s finding (.pdf; see Study 13) that disfluent fonts change how people predict sporting events (.pdf; see Table A6). We stand by the central claims of our paper, as we have replicated the main findings many times. But Meyer et al. showed that we should not – and thus we do not – stand by the findings of Study 13. Their evidence that it doesn’t consistently replicate (20 games; 12,449 participants) is much better than our evidence that it does (2 games; 181 participants), and we can look back on our results and see that they are not convincing (most notably, p = .03). As a result, all we can do is to acknowledge that the finding is unlikely to be true. Meyer et al.’s paper wasn’t happy news, of course, but accepting their results was so much less stressful than mounting a protracted, evidence-less defense of a finding that we are not confident would replicate. Having gone that route before, I can tell you that this one was about a million times less emotionally punishing, in addition to being more scientific. It is a comfort to know that I will no longer defend my own work in that way. I’ll either show you’re wrong, or I’ll acknowledge that you’re right.

Fourth, it means advocating for policies and actions that enhance the replicability of our science. I believe that the #1 job of the peer review process is to assess whether a finding is replicable, and that we can all do this better if we know exactly what the authors did in their study, and if we have access to their materials and data. I also believe that every scientist has a conflict of interest – we almost always want the evidence to come out one way rather than another – and that those conflicts of interest lead even the best of us to analyze our data in a way that makes us more likely to draw our preferred conclusions. I still catch myself p-hacking analyses that I did not pre-register. Thus, I am in favor of policies and actions that make it harder/impossible for us to do that, including incentives for pre-registration, the move toward including exact replications in published papers, and the use of methods for checking that our statistical analyses are accurate and that our results are unlikely to have been p-hacked (e.g., because the study was highly powered).

I am writing all of this because it’s hard to resolve a conflict when you don’t know what the other side wants. I honestly don’t know what those who are resistant to change want, but at least now they know what I want. I want to be in a field that prioritizes replicability over everything else. Maybe those who are resistant to change believe this too, and their resistance is about the means (e.g., public criticism) rather than the ends. Or maybe they don’t believe this, and think that concerns about replicability should take a back seat to something else. It would be helpful for those who are resistant to change to articulate their position. What do you want our field to prioritize, and why?

  1. I sometimes come across the argument that a focus on replicability will increase false-negatives. I don’t think that is true. If a field falsely believes that Funyuns will cure cancer, then the time and money that may have been spent discovering true cures will instead be spent studying the Funyun Hypothesis. True things aren’t discovered when resources are allocated to studying false things. In this way, false-positives cause false-negatives.
  2. At this point I should mention that although I am an Associate Editor at SPPS, what I write here does not reflect journal policy.

[38] A Better Explanation Of The Endowment Effect

It’s a famous study. Give a mug to a random subset of a group of people. Then ask those who got the mug (the sellers) to tell you the lowest price they’d sell the mug for, and ask those who didn’t get the mug (the buyers) to tell you the highest price they’d pay for the mug. You’ll find that sellers’ minimum selling prices exceed buyers’ maximum buying prices by a factor of 2 or 3 (.pdf).

This famous finding, known as the endowment effect, is presumed to have a famous cause: loss aversion. Just as loss aversion maintains that people dislike losses more than they like gains, the endowment effect seems to show that people put a higher price on losing a good than on gaining it. The endowment effect seems to perfectly follow from loss aversion.

But a 2012 paper by Ray Weaver and Shane Frederick convincingly shows that loss aversion is not the cause of the endowment effect (.pdf). Instead, “the endowment effect is often better understood as the reluctance to trade on unfavorable terms,” in other words “as an aversion to bad deals.” [1]

This paper changed how I think about the endowment effect, and so I wanted to write about it.

A Reference Price Theory Of The Endowment Effect

Weaver and Frederick’s theory is simple: Selling and buying prices reflect two concerns. First, people don’t want to sell the mug for less, or buy the mug for more, than their own value of it. Second, they don’t want to sell the mug for less, or buy the mug for more, than the market price. This is because people dislike feeling like a sucker. [2]

To see how this produces the endowment effect, imagine you are willing to pay $1 for the mug and you believe it usually sells for $3. As a buyer, you won’t pay more than $1, because you don’t want to pay more than it’s worth to you. But as a seller, you don’t want to sell for as little as $1, because you’ll feel like a chump selling it for much less than it is worth. [3] Thus, because there’s a gap between people’s perception of the market price and their valuation of the mug, there’ll be a large gap between selling ($3) and buying ($1) prices:

Weaver and Frederick predict that the endowment effect will arise whenever market prices differ from valuations.

However, when market prices are not different from valuations, you shouldn’t see the endowment effect. For example, if people value a mug at $2 and also think that its market price is $2, then both buyers and sellers will price it at $2:

And this is what Weaver and Frederick find. Repeatedly. There is no endowment effect when valuations are equal to perceived market prices. Wow.
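The two mug examples above can be written down as a toy model. To be clear, this is my own simplification of the verbal description in this post (a buyer pays at most the lower of the two prices, a seller demands at least the higher), not Weaver and Frederick’s formal model, and the function names are made up for illustration.

```python
def buying_price(valuation, market_price):
    # A buyer won't pay more than the mug is worth to them,
    # and won't pay more than they think it sells for elsewhere.
    return min(valuation, market_price)


def selling_price(valuation, market_price):
    # A seller won't accept less than the mug is worth to them,
    # and won't accept less than they think it sells for elsewhere.
    return max(valuation, market_price)


def endowment_gap(valuation, market_price):
    # Selling price minus buying price; zero means no endowment effect.
    return selling_price(valuation, market_price) - buying_price(valuation, market_price)


if __name__ == "__main__":
    print(endowment_gap(valuation=1, market_price=3))  # the $1 vs. $3 example
    print(endowment_gap(valuation=2, market_price=2))  # valuation equals market price
```

In this sketch the gap is simply the absolute difference between the valuation and the perceived market price, so it disappears exactly when the two coincide.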

Just to be sure, I ran a within-subjects hypothetical study that is much inferior to Weaver and Frederick’s between-subjects incentivized studies, and, although my unusual design produced some unusual results, I found strong support for their hypothesis (full description .pdf; data .xls). Most importantly, I found that people who gave higher selling prices than buying prices for the same good were much more likely to say they did this because they wanted to avoid a bad deal than because of loss aversion:

In fact, whereas 82.5% of participants endorsed at least one bad-deal reason, only 18.8% of participants endorsed at least one loss-aversion reason. [4]

I think Weaver and Frederick’s evidence makes it difficult to consider loss aversion the best explanation of the endowment effect. Loss aversion can’t explain why the endowment effect is so sensitive to the difference between market prices and valuations, and it certainly can’t explain why the effect vanishes when market prices and valuations converge. [5]

Weaver and Frederick’s theory is simple, plausible, supported by the data, and doesn’t assume that people treat losses differently than gains. It just assumes that, when setting prices, people consider both their valuations and market prices, and dislike feeling like a sucker.

Author feedback.
I shared an early draft of this post with Shane Frederick. Although he opted not to comment publicly, during our exchange I did learn of an unrelated short (and excellent) piece that he wrote that contains a pretty awesome footnote (.html).

  1. Even if you don’t read Weaver and Frederick’s paper, I strongly advise you to read Footnote 10.
  2. Thaler (1985) called this “transaction utility” (.pdf). Technically Weaver and Frederick’s theory is about “reference prices” rather than “market prices”, but since market prices are the most common/natural reference price I’m going to use the term market prices.
  3. Maybe because you got the mug for free, you’d be willing to sell it for a little bit less than the market price – perhaps $2 rather than $3. Even so, if the gap between market prices and valuations is large enough, there’ll still be an endowment effect.
  4. For a similar result, see Brown 2005 (.pdf).
  5. Loss aversion is not the only popular account. According to an “ownership” account of the endowment effect (.pdf), owning a good makes you like it more, and thus price it higher, than not owning it. Although this mechanism may account for some of the effect (the endowment effect may be multiply determined), it cannot explain all the effects Weaver and Frederick report. Nor can it easily account for why the endowment effect is observed in hypothetical studies, when people simply imagine being buyers or sellers.

[26] What If Games Were Shorter?

The smaller your sample, the less likely your evidence is to reveal the truth. You might already know this, but most people don’t (.pdf), or at least they don’t appropriately apply it (.pdf). (See, for example, nearly every inference ever made by anyone.) My experience trying to teach this concept suggests that it’s best understood using concrete examples.

So let’s consider this question: What if sports games were shorter?

Most NFL football games feature a matchup between one team that is expected to win – the favorite – and one that is not – the underdog. A full-length NFL game consists of four 15-minute quarters. [1] After four quarters, favorites outscore their underdog opponents about 63% of the time. [2] Now what would happen to the favorites’ chances of winning if the games were shortened to 1, 2, or 3 quarters?

In this post, I’ll tell you what happens and then I’ll tell you what people think happens.

What If Sports Games Were Shorter?

I analyzed 1,008 games across four NFL seasons (2009-2012; data .xls). Because smaller samples are less likely to reveal true differences between the teams, the favorites’ chances of winning (vs. losing or being tied) increase as game length increases. [3]
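To see why this has to happen, here is a toy model, not my analysis of the actual games. It assumes each quarter’s point margin is an independent normal draw with the same mean and standard deviation, it ignores ties, and the numbers plugged in at the bottom are hypothetical rather than estimated from the NFL data.

```python
from math import sqrt
from statistics import NormalDist


def favorite_lead_probability(quarters, mean_per_quarter, sd_per_quarter):
    """Chance the favorite is ahead after `quarters` quarters in the toy model."""
    if quarters <= 0:
        raise ValueError("quarters must be positive")
    # The expected margin grows with the number of quarters...
    expected_margin = quarters * mean_per_quarter
    # ...but the noise grows only with its square root.
    sd_margin = sqrt(quarters) * sd_per_quarter
    # Probability that the cumulative margin is above zero.
    return 1.0 - NormalDist(mu=expected_margin, sigma=sd_margin).cdf(0.0)


if __name__ == "__main__":
    # Hypothetical favorite: +1 point per quarter on average, SD of 6 points per quarter.
    for quarters in (1, 2, 3, 4):
        share = favorite_lead_probability(quarters, mean_per_quarter=1.0, sd_per_quarter=6.0)
        print(quarters, round(share, 3))
```

Under these assumptions the favorite’s chances rise with every added quarter, because the signal accumulates faster than the noise.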

Reality is more likely to deviate from true expectations when samples are smaller. We can see this again in an analysis of point differences. For each NFL game, well-calibrated oddsmakers predict how many points the favorite will win by. Plotting these expected point differences against actual point differences reveals how the relationship between expectation and reality strengthens with game length:

Sample sizes affect the likelihood that reality will deviate from an average expectation.

But sample sizes do not affect what our average expectation should be. If a coin is known to turn up heads 60% of the time, then, regardless of whether the coin will be flipped 10 times or 100,000 times, our best guess is that heads will turn up 60% of the time. The error around 60% will be greater for 10 flips than for 100,000 flips, but the average expectation will remain constant.
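The coin example can be made concrete with the textbook standard error of a proportion, sqrt(p(1 - p)/n). This is just a small sketch with names I made up; it assumes independent flips.

```python
from math import sqrt


def expected_share_and_error(p_heads, n_flips):
    """Return the expected share of heads and its standard error for n_flips flips."""
    # The best guess for the share of heads does not depend on the number of flips.
    expected_share = p_heads
    # The typical miss around that guess shrinks with the square root of the sample.
    standard_error = sqrt(p_heads * (1 - p_heads) / n_flips)
    return expected_share, standard_error


if __name__ == "__main__":
    for n_flips in (10, 100_000):
        share, error = expected_share_and_error(0.6, n_flips)
        print(n_flips, share, round(error, 4))
```

The expected share stays at 0.6 for both sample sizes, while the standard error is about 0.15 for 10 flips and about 0.0015 for 100,000 flips.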

To see this in the football data, I computed point differences after each quarter, and then scaled them to a full-length game. For example, if the favorite was up by 3 points after one quarter, I scaled that to a 12-point advantage after 4 quarters. We can plot the difference between expected and actual point differences after each quarter.

The dots are consistently near the red line on the above graph, indicating that the average outcome aligns with expectations regardless of game length. However, as the progressively decreasing error bars show, the deviation from expectation is greater for shorter games than for longer ones.
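For anyone who wants to redo this with their own data, the scaling step is simple proportional extrapolation. The sketch below uses hypothetical function names and is not the code behind the graph.

```python
def scale_to_full_game(point_difference, quarters_played, total_quarters=4):
    """Extrapolate a margin after some quarters to a full-length game."""
    if quarters_played <= 0:
        raise ValueError("quarters_played must be positive")
    return point_difference * total_quarters / quarters_played


def deviation_from_expectation(point_difference, quarters_played, expected_margin):
    """Scaled actual margin minus the oddsmakers' expected full-game margin."""
    return scale_to_full_game(point_difference, quarters_played) - expected_margin


if __name__ == "__main__":
    # The example from the text: up by 3 after one quarter scales to a 12-point advantage.
    print(scale_to_full_game(3, 1))
    # The same 3-point lead, for a favorite expected to win by 7 over a full game.
    print(deviation_from_expectation(3, 1, expected_margin=7))
```

Averaging these deviations across games for each quarter gives the dots; their spread gives the error bars.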

Do People Know This?

I asked MTurk NFL fans to consider an NFL game in which the favorite was expected to beat the underdog by 7 points in a full-length game. I elicited their beliefs about sample size in a few different ways (materials .pdf; data .xls).

Some were asked to give the probability that the better team would be winning, losing, or tied after 1, 2, 3, and 4 quarters. If you look at the average win probabilities, their judgments look smart.

But this graph is super misleading, because the fact that the average prediction is wise masks the fact that the average person is not. Of the 204 participants sampled, only 26% assigned the favorite a higher probability to win at 4 quarters than at 3 quarters than at 2 quarters than at 1 quarter. About 42% erroneously said, at least once, that the favorite’s chances of winning would be greater for a shorter game than for a longer game.

How good people are at this depends on how you ask the question, but no matter how you ask it they are not very good.

I asked 106 people to indicate whether shortening an NFL game from four quarters to two quarters would increase, decrease, or have no effect on the favorite’s chance of winning. And I asked 103 people to imagine NFL games that vary in length from 1 quarter to 4 quarters, and to indicate which length would give the favorite the best chance to win.

The modal participant believed that game length would not matter. Only 44% correctly said that shortening the game would reduce the favorite’s chances, and only 33% said that the favorite’s chances would be better after 4 quarters than after 3, 2, or 1.

Even though most people get this wrong, there are ways to make the consequences of sample size more obvious. It is easy for students to realize that they have a better chance of beating LeBron James in basketball if the game ends after 1 point than after 10 points. They also know that an investment portfolio with one stock is riskier than one with ten stocks.

What they don’t easily see is that these specific examples reflect a general principle. Whether you want to know which candidate to hire, which investment to make, or which team to bet on, the smaller your sample, the less you know.

  1. If the game is tied, the teams play up to 15 additional minutes of overtime.
  2. 7% of games are tied after four quarters, and, in my sample, favorites won 57% of those in overtime; thus favorites win about 67% of games overall.
  3. Note that it is not that the favorite is more likely to be losing after one quarter; it is more likely to be losing or tied.

[18] MTurk vs. The Lab: Either Way We Need Big Samples

Back in May 2012, we were interested in the question of how many participants a typical between-subjects psychology study needs to have an 80% chance to detect a true effect. To answer this, you need to know the effect size for a typical study, which you can’t know from examining the published literature because it severely overestimates effect sizes (.pdf1; .pdf2; .pdf3).

To begin to answer this question, we set out to estimate some effects we expected to be very large, such as “people who like eggs report eating egg salad more often than people who don’t like eggs.” We did this assuming that the typical psychology study is probably investigating an effect no bigger than this. Thus, we reasoned that the sample size needed to detect this effect is probably smaller than the sample size psychologists typically need to detect the effects that they study.

We investigated a bunch of these “obvious” effects in a survey on amazon.com’s Mechanical Turk (N=697). The results are bad news for those who think 10-40 participants per cell is an adequate sample.

Turns out you need 47 participants per cell to detect that people who like eggs eat egg salad more often than those who dislike eggs. The finding that smokers think that smoking is less likely to kill someone requires 149 participants per cell. The irrefutable takeaway is that, to be appropriately powered, our samples must be a lot larger than they have been in the past, a point that we’ve made in a talk on “Life After P-Hacking” (slides).
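If you want to run this kind of calculation for an effect size of your own, here is a rough sketch. It uses the standard normal-approximation formula for a two-sided, two-cell comparison of means, n per cell ≈ 2(z₁₋α/₂ + z_power)² / d², which comes out slightly below what an exact t-test calculation gives. It is not necessarily the procedure behind the numbers above, and the function name is mine.

```python
from math import ceil
from statistics import NormalDist


def participants_per_cell(effect_size_d, power=0.80, alpha=0.05):
    """Approximate n per cell to detect a standardized mean difference d."""
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    standard_normal = NormalDist()
    # Critical value for a two-sided test at the chosen alpha.
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    # Quantile corresponding to the desired power.
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


if __name__ == "__main__":
    # Conventional "small", "medium", and "large" values of Cohen's d.
    for d in (0.2, 0.5, 0.8):
        print(d, participants_per_cell(d))
```

Because the required sample scales with 1/d², halving the effect size roughly quadruples the number of participants you need.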

Of course, “irrefutable” takeaways inevitably invite attempts at refutation. One thoughtful attempt is the suggestion that the effect sizes we observed were so small because we used MTurk participants, who are supposedly inattentive and whose responses are supposedly noisy. The claim is that these effect sizes would be much larger if we ran this survey in the Lab, and so samples in the Lab don’t need to be nearly as big as our MTurk investigation suggests.

MTurk vs. The Lab

Not having yet read some excellent papers investigating MTurk’s data quality (the quality is good; .pdf1; .pdf2; .pdf3), I ran nearly the exact same survey in Wharton’s Behavioral Lab (N=192), where mostly undergraduate participants are paid $10 to do an hour’s worth of experiments.

I then compared the effect sizes between MTurk and the Lab (materials .pdf; data .xls). [1] Turns out…

…MTurk and the Lab did not differ much. You need big samples in both.

Six of the 10 effects we studied were directionally smaller in the Lab sample: [2]

No matter what, you need ~50 per cell to detect that egg-likers eat egg salad more often. The one effect resembling something psychologists might actually care about – smokers think that smoking is less likely to kill someone – was actually quite a bit smaller in the Lab sample than in the MTurk sample: to detect this effect in our Lab would actually require many more participants than on MTurk (974 vs. 149 per cell).

Four out of 10 effect sizes were directionally larger in the Lab, three of them involving gender differences:

So across the 10 items, some of the effects were bigger in the Lab sample and some were bigger in the MTurk sample.

Most of the effect sizes were very similar, and any differences that emerged almost certainly reflect differences in population rather than data quality. For example, gender differences in weight were bigger in the Lab because few overweight individuals visit our lab. The MTurk sample, by being more representative, had a larger variance and thus a smaller effect size than did the Lab sample. [3]


MTurk is not perfect. As with anything, there are limitations, especially the problem of nonnaïveté (.pdf), and since it is a tool that so many of us use, we should continue to monitor the quality of the data that it produces. With that said, the claim that MTurk studies require larger samples is based on intuitions unsupported by evidence.

So whether we are running our studies on MTurk or in the Lab, the irrefutable fact remains:

We need big samples. And 50 per cell is not big.

  1. To eliminate outliers, I trimmed open-ended responses below the 5th and above the 95th percentiles. This increases effect size estimates. If you don’t do this, you need even more participants for the open-ended items than the figures below suggest.
  2. For space considerations, here I report only the 10 of 12 effects that were significant in at least one of the samples; the .xls file shows the full results. The error bars are 95% confidence intervals.
  3. MTurk’s gender on weight effect size estimate more closely aligns with other nationally representative investigations (.pdf).

[16] People Take Baths In Hotel Rooms

This post is the product of a heated debate.

At a recent conference, a colleague mentioned, much too matter-of-factly, that she took a bath in her hotel room. Not a shower. A bath. I had never heard of someone voluntarily bathing in a hotel room. I think bathing is preposterous. Bathing in a hotel room is lunacy.

I started asking people to estimate the percentage of people who had ever bathed in a hotel room. The few that admitted bathing in a hotel room guessed 15-20%. I guessed 4%, but then decided that number was way too high. A group of us asked the hotel concierge for his estimate. He said 60%, which I took as evidence that he lacked familiarity with numbers.

One of the participants in this conversation, Chicago professor Abigail Sussman, suggested testing this empirically. So we did that.

We asked 532 U.S.-resident MTurkers who had spent at least one night in a hotel within the last year to recall their most recent hotel stay (notes .pdf; materials .pdf; data .xls). We asked them a few questions about their stay, including whether they had showered, bathed, both, or, um, neither.

Here are the results, removing those who said their hotel room definitely did not have a bathtub (N = 442).

Ok, about 80% of people took a normal person’s approach to their last hotel stay. One in 20 didn’t bother cleaning themselves (respect), and 12.4% took a bath, including an incomprehensible subset who bathed but didn’t shower. Given that these data capture only their most recent hotel stay, the proportion of people bathing is at least an order of magnitude higher than I expected.


Gender Differences

If you had told me that 12.4% of people report having taken a bath during their last hotel stay, I’d have told you to include some men in your sample. Women have to be at least 5 times more likely to bathe. Right?


Women bathed more than men (15.6% vs. 10.8%), but only by a small, nonsignificant margin (p=.142). Also surprising is that women and men were equally likely to take a Pigpen approach to life.


What Predicts Hotel Bathing?

Hotel quality: People are more likely to bathe in higher quality hotels, and nobody bathes in a one-star hotel.


Perceptions of hotel cleanliness: People are more likely to bathe when they think hotels are cleaner, although almost 10% took a bath despite believing that hotel rooms are somewhat dirtier than the average home.


Others in the room: Sharing a room with more than 1 person really inhibits bathing, as it should, since it’s pretty inconsiderate to occupy the bathroom for the length of time that bathing requires. More than one in five people bathe when they are alone.


Bathing History & Intentions

We also asked people whether they had ever, as an adult, bathed in a hotel room. Making a mockery of my mockery, the concierge’s estimate was slightly closer to the truth than mine was, as fully one-third (33.3%) reported doing so. One in three.

I give up.

Finally, we asked those who had never bathed in a hotel room to report whether they’d ever consider doing so. Of the 66.7% who said they had never bathed in a hotel room, 64.1% said they’d never consider it.

So only 43% have both never bathed in a hotel room and say they would never consider it.

Those are my people.

[14] How To Win A Football Prediction Contest: Ignore Your Gut

This is a boastful tale of how I used psychology to dominate a football prediction contest.

Back in September, I was asked to represent my department – Operations and Information Management – in a Wharton School contest to predict NFL football game outcomes. Having always wanted a realistic chance to outperform Adam Grant at something, I agreed.

The contest involved making the same predictions that sports gamblers make. For each game, we predicted whether the superior team (the favorite) was going to beat the inferior team (the underdog) by more or less than the Las Vegas point spread. For example, when the very good New England Patriots played the less good Pittsburgh Steelers, we had to predict whether or not the Patriots would win by more than the 6.5-point spread. We made 239 predictions across 16 weeks.

Contrary to popular belief, oddsmakers in Las Vegas don’t set point spreads in order to ensure that half of the money is wagered on the favorite and half the money is wagered on the underdog. Rather, their primary aim is to set accurate point spreads, ones that give the favorite (and the underdog) a 50% chance to beat the spread. [1] Because Vegas is good at setting accurate spreads, it is very hard to perform better than chance when making these predictions. The only way to do it is to predict the NFL games better than Vegas does.

Enter Wharton professor Cade Massey and professional sports analyst Rufus Peabody. They’ve developed a statistical model that, for an identifiable subset of football games, outperforms Vegas. Their Massey-Peabody power rankings are featured in the Wall Street Journal, and from those rankings you can compute expected game outcomes. For example, their current rankings (shown below) say that the Broncos are 8.5 points better than the average team on a neutral field whereas the Seahawks are 8 points better. Thus, we can expect, on average, the Broncos to beat the Seahawks by 0.5 points if they were to play on a neutral field, as they will in Sunday’s Super Bowl. [2]
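The arithmetic behind turning power rankings into an expected game outcome can be sketched in a few lines. This is only an illustration of the subtraction described above plus the home-field adjustment from the footnote. The function names are invented here, and the rule for picking a side against the spread is an assumption, not a description of how Massey-Peabody or the contest entries actually worked.

```python
HOME_FIELD_POINTS = 2.4  # home-field adjustment given in the post's footnote


def expected_margin(rating_a, rating_b, home=None):
    """Expected points by which team A beats team B.

    Ratings are points better than an average team on a neutral field.
    `home` is "a", "b", or None for a neutral site.
    """
    margin = rating_a - rating_b
    if home == "a":
        margin += HOME_FIELD_POINTS
    elif home == "b":
        margin -= HOME_FIELD_POINTS
    return margin


def pick_against_spread(favorite_margin, spread):
    """Pick the favorite only if it is expected to win by more than the spread.

    This decision rule is an assumption made for illustration.
    """
    return "favorite" if favorite_margin > spread else "underdog"


broncos, seahawks = 8.5, 8.0  # ratings quoted in the post
neutral = expected_margin(broncos, seahawks)
print(f"Neutral field: Broncos by {neutral:.1f}")

hypothetical_spread = 3.0  # made-up number, not the actual Super Bowl line
print(pick_against_spread(neutral, hypothetical_spread))
```

Because the model is said to beat Vegas only for an identifiable subset of games, a real betting rule would presumably be more selective than this one.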


My approach to the contest was informed by two pieces of information.

First, my work with Leif (.pdf) has shown that naïve gamblers are biased when making these predictions – they predict favorites to beat the spread much more often than they predict underdogs to beat the spread. This is because people’s first impression about which team to bet on ignores the point spread and is thus based on a simpler prediction as to which team will win the game. Since the favorite is usually more likely to win, people’s first impressions tend to favor favorites. And because people rarely talk themselves out of these first impressions, they tend to predict favorites against the spread. This is true even though favorites don’t win against the spread more often than underdogs (paper 1, .pdf), and even when you manipulate the point spreads to make favorites more likely to lose (paper 2, .pdf). Intuitions for these predictions are just not useful.

Second, knowing that evidence-based algorithms are better forecasters than humans (.pdf), I used the Massey-Peabody algorithm for all my predictions.

So how did the results shake out? (Notes on Analyses; Data)

First, did my Wharton colleagues also show the bias toward favorites, a bias that would indicate that they are no more sophisticated than the typical gambler?

Yes. All of them predicted significantly more favorites than underdogs.


Second, how did I perform relative to the “competition?”

Since everyone loves a humble champion, let me just say that my victory is really a victory for Massey-Peabody. I don’t deserve all of the accolades. Really.

Yeah, for about the millionth time (see meta-analysis, .pdf), we see that statistical models outperform human forecasters. This is true even (especially?) when the humans are Wharton professors, students, and staff.

So, if you want to know who is going to win this Sunday’s Super Bowl, don’t ask me and don’t ask the bestselling author of Give and Take. Ask Massey-Peabody.

And they will tell you, unsatisfyingly, that the game is basically a coin flip.

  1. Vegas still makes money in the long run because gamblers have to pay a fee in order to bet.
  2. For any matchup involving home field advantage, give an additional 2.4 points to the home team.

[6] Samples Can’t Be Too Large

Reviewers, and even associate editors, sometimes criticize studies for being “overpowered” – that is, for having sample sizes that are too large. (Recently, the between-subjects sample sizes under attack were about 50-60 per cell, just a little larger than you need to have an 80% chance to detect that men weigh more than women.)

This criticism never makes sense.

The rationale for it is something like this: “With such large sample sizes, even trivial effect sizes will be significant. Thus, the effect must be trivial (and we don’t care about trivial effect sizes).”

But if this is the rationale, then the criticism is ultimately targeting the effect size rather than the sample size. A person concerned that an effect “might” be trivial because it is significant with a large sample can simply compute the effect size, and then judge whether it is trivial.

(As an aside: Assume you want an 80% chance to detect a between-subjects effect. You need about 6,000 per cell for a “trivial” effect, say d=.05, and still about 250 per cell for a meaningful “small” effect, say d=.25. We don’t need to worry that studies with 60 per cell will make trivial effects significant.)
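The per-cell numbers in that aside can be approximated with a short calculation. The sketch below assumes a two-sided test at alpha = .05 and uses the normal approximation to the two-sample t-test, so its answers land near, not exactly on, what dedicated power software reports. The function name `n_per_cell` is chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_cell(d, alpha=0.05, power=0.80):
    """Approximate n per cell needed to detect effect size d.

    Normal approximation for a two-sided, two-sample comparison:
    n = 2 * (z_alpha + z_power)^2 / d^2.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Effect sizes mentioned in the post: trivial, small, and gender on weight.
for d in (0.05, 0.25, 0.59):
    print(f"d = {d}: about {n_per_cell(d)} per cell")
```

The required sample size grows with 1/d², which is why halving the effect size roughly quadruples the number of participants needed.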

It is OK to criticize a study for having a small effect size. But it is not OK to criticize a study for having a large sample size. This is because sample sizes do not change effect sizes. If I were to study the effect of gender on weight with 40 people or with 400 people, I would, on average, estimate the same effect size (d ~= .59). Collecting 360 additional observations does not decrease my effect size (though, happily, it does increase the precision of my effect size estimate, and that increased precision better enables me to tell whether an effect size is in fact trivial).
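A small simulation can illustrate the point that sample size changes precision, not the effect size itself. The sketch below assumes normally distributed outcomes with equal variances and a true d of .59, split evenly between two groups. Those distributional choices, the seed, and the number of simulated studies are all assumptions made for the illustration.

```python
import random
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd


def simulate(n_total, true_d=0.59, reps=1000, seed=1):
    """Return the mean and SD of d estimates across simulated studies."""
    rng = random.Random(seed)
    per_cell = n_total // 2
    estimates = []
    for _ in range(reps):
        men = [rng.gauss(true_d, 1) for _ in range(per_cell)]
        women = [rng.gauss(0, 1) for _ in range(per_cell)]
        estimates.append(cohens_d(men, women))
    return mean(estimates), stdev(estimates)


# Total sample sizes from the example in the text: 40 people vs. 400 people.
results = {n_total: simulate(n_total) for n_total in (40, 400)}
for n_total, (avg_d, spread) in results.items():
    print(f"N = {n_total}: average d = {avg_d:.2f}, SD of estimates = {spread:.2f}")
```

The average estimate should sit close to .59 at both sample sizes, while the spread of estimates should be noticeably larger with 40 people than with 400.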

Our field suffers from a problem of underpowering. When we underpower our studies, we either suffer the consequences of a large file drawer of failed studies (bad for us) or we are motivated to p-hack in order to find something to be significant (bad for the field). Those who criticize studies for being overpowered are using a nonsensical argument to reinforce exactly the wrong methodological norms.

If someone wants to criticize trivial effect sizes, they can compute them and, if they are trivial, criticize them. But they should never criticize samples for being too large.

We are an empirical science. We collect data, and use those data to learn about the world. For an empirical science, large samples are good. It is never worse to have more data.

[3] A New Way To Increase Charitable Donations: Does It Replicate?

A new paper finds that people will donate more money to help 20 people if you first ask them how much they would donate to help 1 person.

This Unit Asking Effect (Hsee, Zhang, Lu, & Xu, 2013, Psychological Science) emerges because donors are naturally insensitive to the number of individuals needing help. For example, Hsee et al. observed that if you ask different people how much they’d donate to help either 1 needy child or 20 needy children, you get virtually the same answer. But if you ask the same people to indicate how much they’d donate to 1 child and then to 20 children, they realize that they should donate more to help 20 than to help 1, and so they increase their donations.

If true, then this is a great example of how one can use psychology to design effective interventions.

The paper reports two field experiments and a study that solicited hypothetical donations (Study 1). Because it was easy, I attempted to replicate the latter. (Here at Data Colada, we report all of our replication attempts, no matter the outcome.)

I ran two replications, a “near replication” using materials that I developed based on the authors’ description of their methods (minus a picture of a needy schoolchild) and then an “exact replication” using the authors’ exact materials. (Thanks to Chris Hsee and Jiao Zhang for providing those.)

In the original study, people were asked how much they’d donate to help a kindergarten principal buy Christmas gifts for her 20 low-income pupils. There were four conditions, but I only ran the three most interesting conditions:


The original study had ~45 participants per cell. To be properly powered, replications should have ~2.5 times the original sample size. I (foolishly) collected only ~100 per cell in my near replication, but corrected my mistake in the exact replication (~150 per cell). Following Hsee et al., I dropped responses more than 3 SD from the mean, though there was a complication in the exact replication that required a judgment call. My studies used MTurk participants; theirs used participants from “a nationwide online survey service.”
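The 3 SD exclusion rule mentioned above can be sketched as follows. This is a minimal version that assumes a single pass using the sample mean and sample standard deviation of all responses. Whether the original authors computed the cutoff within each condition or across conditions is not stated here, and the donation amounts below are made up.

```python
from statistics import mean, stdev


def drop_outliers(values, cutoff_sd=3.0):
    """Keep only values within `cutoff_sd` sample SDs of the sample mean."""
    center = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - center) <= cutoff_sd * spread]


# Made-up donation amounts in dollars, for illustration only.
donations = [10, 15, 20, 25, 30] * 4 + [500]
kept = drop_outliers(donations)
print(f"kept {len(kept)} of {len(donations)} responses")
```

Note that the mean and SD are computed with the extreme value still in the data, so a single pass like this is conservative: a very large response inflates the SD it is judged against.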

Here are the results of the original (some means and SEs are guesses) and my replications (full data).

I successfully replicated the Unit Asking Effect, as defined by Unit Asking vs. Control; it was marginal (p=.089) in the smaller-sampled near replication and highly significant (p<.001) in the exact replication.

There were some differences. First, my effect sizes (d=.24 and d=.48) were smaller than theirs (d=.88). Second, whereas they found that, across conditions, people were insensitive to whether they were asked to donate to 1 child or 20 children (the white $15 bar vs. the gray $18 bar), I found a large difference in my near replication and a smaller but significant difference in the exact replication. This sensitivity is important, because if people do give lower donations for 1 child than for 20, then they might anchor on those lower amounts, which could diminish the Unit Asking Effect.

In sum, my studies replicated the Unit Asking Effect.