[65] Spotlight on Science Journalism: The Health Benefits of Volunteering

I want to comment on a recent article in the New York Times, but along the way I will comment on scientific reporting as well. I think that science reporters frequently fall short in assessing the evidence behind the claims they relay, but as I try to show, assessing evidence is not an easy task. I don’t want scientists to stop studying cool topics, and I don’t want journalists to stop reporting cool findings, but I will suggest that they should make it commonplace to get input from uncool data scientists and statisticians.

Science journalism is hard. Those journalists need to maintain a high level of expertise in a wide range of domains while being truly exceptional at translating that content in ways that are clear, sensible, and accurate. For example, it is possible that Ed Yong couldn’t run my experiments, but I certainly couldn’t write his articles. [1]

I was reminded about the challenges of science journalism when reading an article about the health benefits of being a volunteer. The journalist, Nicole Karlis, seamlessly connects interviews with recent victims, interviews with famous researchers, and personal anecdotes.

The article also cites some evidence in the form of three scientific findings. Like the journalist, I am not an expert in this area. The journalist’s profession requires her to float above the ugly complexities of the data, whereas my career is spent living amongst (and contributing to) those complexities. So I decided to look at those three papers.

OK, here are those references (the first two come together):

If you would like to see those articles for yourself, they can be found here (.html) and here (.html).

First, the blood pressure finding. The original researchers analyze data from a longitudinal panel of 6,734 people who provided information about their volunteering and had their blood pressure measured. After adding a number of control variables [2], they look to see whether volunteering has an influence on blood pressure. OK, how would you do that? 40.4% of respondents reported some volunteering. Perhaps they could be compared to the remaining 59.6%? Or perhaps you could look at whether blood pressure decreases with each additional hour volunteered? The point is, there are a few ways to think about this. The authors found a difference only when comparing non-volunteers to the category of people who volunteered 200 hours or more. Their report:

“In a regression including the covariates, hours of volunteer work were related to hypertension risk (Figure 1). Those who had volunteered at least 200 hours in the past 12 months were less likely to develop hypertension than non-volunteers (OR=0.60; 95% CI:0.40–0.90). There was also a decrease in hypertension risk among those who volunteered 100–199 hours; however, this estimate was not statistically reliable (OR=0.78; 95% CI=0.48–1.27). Those who volunteered 1–49 and 50–99 hours had hypertension risk similar to that of non-volunteers (OR=0.95; 95% CI: 0.68–1.33 and OR=0.96; 95% CI: 0.65–1.41, respectively).”

So what I see is some evidence that is somewhat suggestive of the claim, but it is not overly strong. The 200-hour cut-off is arbitrary, and the effect is not obviously robust to other specifications. I am worried that we are seeing researchers choosing their favorite specification rather than the best specification. So, suggestive perhaps, but I wouldn’t be ready to cite this as evidence that volunteering is related to improved blood pressure.
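To make the specification worry concrete, here is a minimal sketch in Python. The data are simulated and the column names (hours, hypertension, age) are made up; this is not the authors' code, just an illustration of how the "any volunteering", "continuous hours", and "hours category" specifications could be run side by side to see whether the result hinges on one cut-off.

```python
# A hedged sketch of comparing specifications; simulated data, hypothetical columns.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 6734
df = pd.DataFrame({
    "hours": rng.choice([0, 25, 75, 150, 250], size=n, p=[0.596, 0.15, 0.10, 0.08, 0.074]),
    "age": rng.normal(67, 10, n),
    "hypertension": rng.integers(0, 2, n),  # placeholder outcome
})

# Specification 1: any volunteering vs. none.
df["any_vol"] = (df["hours"] > 0).astype(int)
m1 = smf.logit("hypertension ~ any_vol + age", data=df).fit(disp=0)

# Specification 2: hours as a continuous predictor.
m2 = smf.logit("hypertension ~ hours + age", data=df).fit(disp=0)

# Specification 3: the paper's categories (1-49, 50-99, 100-199, 200+ vs. none).
df["hours_cat"] = pd.cut(df["hours"], bins=[-1, 0, 49, 99, 199, np.inf],
                         labels=["none", "1-49", "50-99", "100-199", "200+"])
m3 = smf.logit("hypertension ~ C(hours_cat, Treatment('none')) + age", data=df).fit(disp=0)

# Odds ratios per specification; a robust effect should not depend on the cut-off chosen.
for name, m in [("any vs. none", m1), ("continuous hours", m2), ("categories", m3)]:
    print(name, np.exp(m.params).round(2).to_dict())
```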

The second finding is “volunteering is linked to… decreased mortality rates.” That paper analyzes data from a different panel of 10,317 people who report their volunteer behavior and whose deaths are recorded. Those researchers convey their finding in the following figure:

So first, that is an enormous effect. People who volunteered were about 50% less likely to die within four years. Taken at face value, that would suggest an effect on the order of being a normal person versus being a smoker who drives without a seatbelt and wrangles crocodiles as a hobby. But recall that this is observational data and not an experiment, so we need to be worried about confounds. For example, perhaps the soon-to-be-deceased also lack the health to be volunteers? The original authors have that concern too, so they add some controls. How did that go?

That is not particularly strong evidence. The effects are still directionally right, and many statisticians would caution against focusing on p-values… but still, that is not overly compelling. I am not persuaded. [3]
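Here is a toy sketch of that confounding worry (simulated data, hypothetical variable names; nothing here comes from the actual panel): when baseline health drives both volunteering and four-year mortality, the unadjusted odds ratio looks impressive and the covariate-adjusted one does not.

```python
# A minimal confounding illustration; not the authors' analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 10_317
health = rng.normal(size=n)                              # baseline health (the confound)
volunteer = rng.binomial(1, 1 / (1 + np.exp(-health)))   # healthier people volunteer more
died = rng.binomial(1, 1 / (1 + np.exp(1.5 + health)))   # healthier people die less
df = pd.DataFrame({"died": died, "volunteer": volunteer, "health": health})

unadj = smf.logit("died ~ volunteer", data=df).fit(disp=0)
adj = smf.logit("died ~ volunteer + health", data=df).fit(disp=0)

print(f"unadjusted OR for volunteering: {np.exp(unadj.params['volunteer']):.2f}")
print(f"adjusted OR for volunteering:   {np.exp(adj.params['volunteer']):.2f}")
```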

What about the third paper referenced?

That one can be found here (.html).

Unlike the first two papers, that is not a link to a particular result, but rather to a preregistration. Readers of this blog are probably familiar, but preregistrations are the time-stamped analysis plans of researchers from before they ever collect any data. Preregistrations – in combination with experimentation – eliminate some of the concerns about selective reporting that inevitably follow other studies. We are huge fans of preregistration (.html, .html, .html). So I went and found the preregistered primary outcome on page 8:

Perfect. That outcome is (essentially) one of those mentioned in the NY Times. But things got more difficult for me at that point. This intervention was an enormous undertaking, with many measures collected over many years. Accordingly, though the primary outcome was specified here, a number of follow-up papers have investigated some of those alternative measures and analyses. In fact, the authors anticipate some of that by saying “rather than adjust p-values for multiple comparison, p-values will be interpreted as descriptive statistics of the evidence, and not as absolute indicators for a positive or negative result.” (p. 13). So they are saying that, outside of the mobility finding, p-values shouldn’t be taken quite at face value. This project has led to some published papers looking at the influence of the volunteerism intervention on school climate, Stroop performance, and hippocampal volume, amongst others. But the primary outcome – mobility – appears to be reported here (.html). [4]. What do they find?

Well, we have the multiple comparison concern again – whatever difference exists is only found at 24 months, but mobility has been measured every four months up until then. Also, this is only for women, whereas the original preregistration made no such specification. What happened to the men? The authors say, “Over a 24-month period, women, but not men, in the intervention showed increased walking activity compared to their sex-matched control groups.” So the primary outcome appears not to have been supported. Nevertheless, making interpretation a little challenging, the authors also say, “the results of this study indicate that a community-based intervention that naturally integrates activity in urban areas may effectively increase physical activity.” Indeed, it may, but it also may not. These data are not sufficient for us to make that distinction.
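To make the multiple-comparison point concrete, here is a small sketch using hypothetical per-wave p-values (six measurement waves, every four months out to 24). The adjustment shown is the familiar Bonferroni correction, which I am using only as an illustration; it is not something the authors committed to.

```python
# Illustrative only: made-up p-values for six follow-up waves.
from statsmodels.stats.multitest import multipletests

waves = [4, 8, 12, 16, 20, 24]                    # months of follow-up
p_values = [0.64, 0.41, 0.33, 0.22, 0.18, 0.03]   # hypothetical per-wave p-values

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for month, p, p_adj, r in zip(waves, p_values, p_adjusted, reject):
    print(f"{month:>2} months: p = {p:.2f}, adjusted p = {p_adj:.2f}, significant: {r}")
```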

That’s it. I see three findings, all of which are intriguing to consider, but none of which are particularly persuasive. The journalist, who presumably has been unable to read all of the original sources, is reduced to reporting their claims. The readers, who are even more removed, take the journalist’s claims at face value: “if I volunteer then I will walk around better, lower my blood pressure, and live longer. Sweet.”

I think that we should expect a little more from science reporting. It might be too much for every journalist to dig up every link, but perhaps they should develop a norm of collecting feedback from those people who are informed enough to consider the evidence, but far enough outside the research area to lack any investment in a particular claim. There are lots of highly competent commentators ready to evaluate evidence independent of the substantive area itself.

There are frequent calls for journalists to turn away from the surprising and uncertain in favor of the staid and uncontroversial. I disagree – surprising stories are fun to read. I just think that journalists should add an extra level of scrutiny to ensure that we know that the fun stories are also true stories.


Author Feedback.
I shared a draft of this post with the contact author for each of the four papers I mention, as well as the journalist who had written about them. I heard back from one, Sara Konrath, who had some helpful suggestions including a reference to a meta-analysis (.html) on the topic.




Footnotes.

  1. Obviously Mr. Yong could also run my experiments better than I could, but I wanted to make a point. At least I can still teach college students better than he can. Just kidding, he would be better at that too. []
  2. average systolic blood pressure (continuous), average diastolic blood pressure (continuous), age (continuous), sex, self-reported race (Non-Hispanic White, Non-Hispanic Black, Hispanic, Non-Hispanic Other), education (less than high school, General Equivalency Diploma [GED], high school diploma, some college, college and above), marital status (married, annulled, never married, divorced, separated, widowed), employment status (employed/not employed), and self-reported history of diabetes (yes/no), cancer (yes/no), heart problems (yes/no), stroke (yes/no), or lung problems (yes/no). []
  3. It is worth noting that this paper, in particular, goes on to consider the evidence in other interesting ways. I highlight this portion because it was the fact being cited in the NYT article. []
  4. I think. It is really hard for me, as a novice in this area, to know if I have found all of the published findings from this original preregistration. If there is a different mobility finding elsewhere I couldn’t find it, but I will correct this post if it gets pointed out to me. []

[32] Spotify Has Trouble With A Marketing Research Exam

This is really just a post-script to Colada [2], where I described a final exam question I gave in my MBA marketing research class. Students got a year’s worth of iTunes listening data for one person –me– and were asked: “What songs would this person put on his end-of-year Top 40?” I compared that list to the actual top-40 list. Some students did great, but many made the rookie mistake of failing to account for the fact that older songs (e.g., those released in January) had more opportunity to be listened to than did newer songs (e.g., those released in November).

I was reminded of this when I recently received an email from Spotify (my chosen music provider) that read:

spotify figure 1

First, Spotify, rather famously, does not make listening-data particularly public, [1] so any acknowledgement that they are assessing my behavior is kind of exciting. Second, that song, Inauguration [Spotify link], is really good. On the other hand, despite my respect for the hard working transistors inside the Spotify preference-detection machine, that song is not my “top song” of 2014. [2]

The thing is, “Inauguration” came out in January. Could Spotify be making the same rookie mistake as some of my MBA students?

Following Spotify’s suggestion, I decided to check out the rest of their assessment of my 2014 musical preferences. Spotify offered a ranked listing of my Top 100 songs from 2014. Basically, without even being asked, Spotify said “hey, I will take that final exam of yours.” So without even being asked I said, “hey, I will grade that answer of yours.” How did Spotify do?

Poorly. Spotify thinks I really like music from January and February.

Here is their data:

spotify figure 2

Each circle is a song; the red ones are those which I included in my actual Top 40 list.

If I were grading this student, I would definitely have some positive things to say. “Dear Spotify Preference-Detection Algorithm, Nice job identifying eight of my 40 favorite songs. In particular, the song that you have ranked second overall is indeed in my top three.” On the other hand, I would also probably say something like, “That means that your 100 guesses still missed 32 of my favorites. Your top 40 only included five of mine. If you’re wondering where those other songs are hiding, I refer you to the entirely empty right half of the above chart. Of your Top 100, a full 97 were songs added before July 1. I like the second half of the year just as much as the first.” Which is merely to say that the Spotify algorithm has room for improvement. Hey, who doesn’t?
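If you want to grade this exam yourself, the computation is simple. Here is a sketch with stand-in data (the song names, dates, and overlap are made up); with the real lists you would just count the overlap with my actual Top 40 and the share of Spotify's guesses added before July 1.

```python
# A grading sketch on placeholder data, not the real listening history.
import pandas as pd

spotify_top100 = pd.DataFrame({
    "song": [f"song_{i}" for i in range(100)],
    "date_added": pd.date_range("2014-01-01", periods=100, freq="3D"),
})
my_top40 = {f"song_{i}" for i in range(0, 80, 10)}  # pretend 8 songs overlap

hits = spotify_top100["song"].isin(my_top40).sum()
early = (spotify_top100["date_added"] < "2014-07-01").sum()

print(f"Spotify's Top 100 contained {hits} of my Top 40.")
print(f"{early} of its 100 guesses were added before July 1.")
```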

Actually, in preparing this post, I was surprised to learn that, if anything, I have a strong bias toward songs released later in the year. This bias could reflect my tastes, or alternatively a bias in the industry (see this post in a music blog on the topic, .html). I looked at when Grammy-winning songs are released and learned that they are slightly biased toward the second half of the year [3]. The figure below shows the distributions (with the correlation between month and count).

spotify figure 3

I have now learned how to link my Spotify listening behavior to Last.fm. A year from now perhaps I will get emails from two different music-distribution computers and I can compare them head-to-head? In the meantime, I will probably just listen to the forty best songs of 2014 [link to my Spotify playlist].


  1. OK, “famously” is overstated, but even a casual search will reveal that there are many users who want more of their own listening data. Also, “not particularly public” is not the same as “not at all public.” For example, they apparently share all kinds of data with Walt Hickey at FiveThirtyEight (.html). I am envious of Mr. Hickey. []
  2. My top song of 2014 is one of these (I don’t rank my Top 40): The Black and White Years – Embraces, Modern Mod – January, or Perfume Genius – Queen []
  3. I also learned that “Little Green Apples” won in the same year that “Mrs. Robinson” and “Hey Jude” were nominated. Grammy voters apparently fail a more basic music preference test. []

[25] Maybe people actually enjoy being alone with their thoughts

Recently Science published a paper concluding that people do not like sitting quietly by themselves (.html). The article received press coverage; that press coverage received blog coverage, which received Twitter coverage, which received meaningful head-nodding coverage around my department. The bulk of that coverage (e.g., 1, 2, and 3) focused on the tenth study in the eleven-study article. In that study, lots of people preferred giving themselves electric shocks to being alone in a room (one guy shocked himself 190 times). I was more intrigued by the first nine studies, all of which were very similar to each other. [1]

Opposite inference
The reason I write this post is that upon analyzing the data for those studies, I arrived at an inference opposite the authors’. They write things like:

Participants typically did not enjoy spending 6 to 15 minutes in a room by themselves with nothing to do but think. (abstract)

It is surprisingly difficult to think in enjoyable ways even in the absence of competing external demands. (p.75, 2nd column)

The untutored mind does not like to be alone with itself (last phrase)

But the raw data point in the opposite direction: people reported to enjoy thinking.

Three measures
In the studies, people sit in a room for a while and then answer a few questions when they leave, including how enjoyable, how boring, and how entertaining the thinking period was, on 1-9 scales (anchored at 1 = “not at all”, 5 = “somewhat”, 9 = “extremely”). Across the nine studies, 663 people rated the experience of thinking; the overall mean for these three variables was M=4.94, SD=1.83, not significantly different from 5, the midpoint of the scale, t(662)=.9, p=.36. The 95% confidence interval for the mean is tight, 4.8 to 5.1. Which is to say, people endorse the midpoint of the scale composite: “somewhat boring, somewhat entertaining, and somewhat enjoyable.”
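For anyone who wants to check this kind of thing at home, here is a sketch of the midpoint test. The ratings below are simulated to roughly match the reported mean and SD; the authors' actual data are posted publicly, so you could swap those in.

```python
# A sketch of testing the composite rating against the scale midpoint of 5.
# The composite values are simulated, not the authors' data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
composite = np.clip(rng.normal(4.94, 1.83, 663), 1, 9)  # per-person composite, 1-9 scale

t, p = stats.ttest_1samp(composite, popmean=5)
ci = stats.t.interval(0.95, df=len(composite) - 1,
                      loc=composite.mean(), scale=stats.sem(composite))
print(f"M = {composite.mean():.2f}, t({len(composite) - 1}) = {t:.2f}, p = {p:.2f}")
print(f"95% CI for the mean: [{ci[0]:.2f}, {ci[1]:.2f}]")
```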

Five studies had means below the midpoint, four had means above it.

I see no empirical support for the core claim that “participants typically did not enjoy spending 6 to 15 minutes in a room by themselves.” [2]

Focusing on enjoyment
Because the paper’s inferences are about enjoyment I now focus on the question that directly measured enjoyment. It read “how much did you enjoy sitting in the room and thinking?” 1 = “not at all enjoyable” to 5 = “somewhat enjoyable” to 9 = “extremely enjoyable”. That’s it. OK, so what sort of pattern would you expect after reading “participants typically did not enjoy spending 6 to 15 minutes in a room by themselves with nothing to do but think.”?

Rather than entirely rely on your (or my) interpretations, I asked a group of people (N=50) to specifically estimate the distribution of responses that would lead to that claim. [3] Here is what they guessed:

Figure 1

And now, with that in mind, let’s take a look at the distribution that the authors observed on that measure: [4]

Figure 2

Out of 663 participants, MOST (69.6%) said that the experience was somewhat enjoyable or better. [5]

If I were trying out a new manipulation and wanted to ensure that participants typically DID enjoy it, I would be satisfied with the distribution above. I would infer people typically enjoy being alone in a room with nothing to do but think.

It is still interesting
The thing is, though that inference is rather directly in opposition to the authors’, it is not any less interesting. In fact, it highlights value in manipulations they mostly gloss over. In those initial studies, the authors try a number of manipulations which compare the basic control condition to one in which people were directed to fantasize during the thinking period. Despite strong and forceful manipulations (e.g., Participants chose and wrote about the details of activities that would be fun to think about, and then were told to spend the thinking period considering either those activities, or if they wanted, something that was more pleasant or entertaining), there were never any significant differences. People in the control condition enjoyed the experience just as much as the fantasy conditions. [6] People already know how to enjoy their thoughts. Instructing them how to fantasize does not help. Finally, if readers think that the electric shock finding is interesting conditional on the (I think, erroneous) belief that it is not enjoyable to be alone in thought, then the finding is surely even more interesting if we instead take the data at face value: Some people choose to self-administer an electric shock despite enjoying sitting alone with their thoughts.

Authors’ response
Our policy at DataColada is to give drafts of our posts to authors whose work we cover before posting, asking for feedback and providing an opportunity to comment. Tim Wilson was very responsive in providing feedback and suggesting changes to previous drafts. Furthermore, he offered the response below.

We thank Professor Nelson for his interest in our work and for offering to post a response.  Needless to say we disagree with Prof. Nelson’s characterization of our results, but because it took us a bit more than the allotted 150 words to explain why, we have posted our reply here.


  1. Excepting Study 8, for which I will consider only the control condition. Study 11 was a forecasting study. []
  2. The condition from Study 8 where people were asked to engage in external activities rather than think is, obviously, not included in this overall average. []
  3. I asked 50 mTurk workers to imagine that 100 people had tried a new experience and that their assessments were characterized as “participants typically did not enjoy the experience”. They then estimated, given that description, how many people responded with a 1, a 2, etc. Data. []
  4. The authors made all of their data publicly available. That is entirely fantastic and has made this continuing discussion possible. []
  5. The pattern is similar focusing on the subset of conditions with no other interventions. Out of 240 participants in the control conditions, 65% chose the midpoint or above. []
  6. OK, a caveat here to point out that the absence of statistical significance should not be interpreted as accepting the null. Nevertheless, with more than 600 participants, they really don’t find a hint of an effect; the confidence interval for the mean enjoyment is (4.8 to 5.1). Their fantasy manipulations might not be a true null, but they certainly are not producing a truly large effect. []

[22] You know what’s on our shopping list

As part of an ongoing project with Minah Jung, a nearly perfect doctoral student, we asked people to estimate the percentage of people who bought some common items in their last trip to the supermarket. For each of 18 items, we simply asked people (N = 397) to report whether they had bought it on their last trip to the store and also to estimate the percentage of other people who bought it [1].

Take a sample item: Laundry Detergent. Did you buy laundry detergent the last time you went to the store? What percentage of other people [2] do you think purchased laundry detergent? The correct answer is that 42% of people bought laundry detergent. If you’re like me, you see that number and say, “that’s crazy, no one buys laundry detergent.” If you’re like Minah, you say, “that’s crazy, everyone buys laundry detergent.” Minah had just bought laundry detergent, whereas I had not. Our biases are shared by others. People who bought detergent thought that 69% of others bought detergent whereas non-buyers thought that number was only 29%. Those are really different. We heavily emphasize our own behavior when estimating the behavior of others [3].
Grocery Shopping Figure 1
That effect, generally referred to as the false consensus effect (see classic paper .pdf), extends beyond estimates of detergent purchase likelihoods. All of the items (e.g., milk, crackers, etc.) showed a similar effect. The scatterplot below shows estimates for each of the products. The x-axis is the actual percentage of purchasers and the y-axis reports estimated percentages (so the identity line would be a perfectly accurate estimate).
Grocery Shopping Figure 2
For every single product, buyers gave a higher estimate than non-buyers; the false consensus effect is quite robust. People are biased. But a second observation gets its own chart. What happens if you just average the estimates from everyone?
Grocery Shopping Figure 3
That is a correlation of r = .95.
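For the curious, here is roughly how both numbers get computed, as a sketch on simulated responses (item names, the per-item purchase rates, and the size of the buyer/non-buyer skew are all made up): split each item's estimates by whether the estimator bought it, then correlate the overall average estimate with the actual purchase rate.

```python
# False consensus + wisdom of crowds, sketched on simulated survey responses.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
rows = []
for item in [f"item_{i}" for i in range(18)]:
    true_rate = rng.uniform(0.1, 0.9)
    for _ in range(397):
        bought = rng.random() < true_rate
        # buyers skew estimates up, non-buyers down, plus noise
        estimate = np.clip(100 * true_rate + (20 if bought else -20) + rng.normal(0, 15), 0, 100)
        rows.append({"item": item, "bought": bought, "estimate": estimate})
df = pd.DataFrame(rows)

by_item = df.groupby("item").agg(actual=("bought", "mean"),
                                 mean_estimate=("estimate", "mean"))
split = df.groupby(["item", "bought"])["estimate"].mean().unstack()

print(split.round(1))  # buyer vs. non-buyer estimates per item (false consensus)
print("r between average estimate and actual rate:",
      np.corrcoef(by_item["actual"] * 100, by_item["mean_estimate"])[0, 1].round(2))
```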

As a judgment and decision making researcher, one of my tasks is to identify idiosyncratic shortcomings in human thinking (e.g., the false consensus effect). Nevertheless, under the right circumstances, I can be entranced by accuracy. In this case, I marvel at the wisdom of crowds. Every person has a ton of error (e.g., “I have no idea whether you bought detergent”) and a solid amount of bias (e.g., “but since I didn’t buy detergent, you probably didn’t either.”). When we put all of that together, the error and the bias cancel out. What’s left over is an astonishing amount of signal.

Minah and I could cheerfully use the same data to write one of two papers. The first could use a pervasive judgmental bias (18 out of 18 products show the effect!) to highlight the limitations of human thinking. A second paper could use the correlation (.95!) to highlight the efficiency of human thinking. Fortunately, this is a blog post, so I get to comfortably write about both.

Sometimes, even with judgmental shortcomings in the individual, there is still judgmental genius in the many.


  1. Truth be told, it was ever so slightly more complicated. We asked half the people to talk about purchases from their next shopping trip. To a first approximation there are no differences between these conditions, so for the simplicity of verb tense I refer to the past. []
  2. “Other people” was articulated as “other people who are also answering this question on mTurk.” []
  3. In fact, you might recall from Colada[16] that Joe is rather publicly prone to this error. []

[12] Preregistration: Not just for the Empiro-zealots

I recently joined a large group of academics in co-authoring a paper looking at how political science, economics, and psychology are working to increase transparency in scientific publications. Psychology is leading, by the way.

Working on that paper (and the figure below) actually changed my mind about something. A couple of years ago, when Joe, Uri, and I wrote False Positive Psychology, we were not really advocates of preregistration (a la clinicaltrials.gov). We saw it as an implausible superstructure of unspecified regulation. Now I am an advocate. What changed?

Transparency in Scientific Reporting Figure

First, let me relate an anecdote originally told by Don Green (and related with more subtlety here). He described watching a research presentation that at one point emphasized a subtle three-way interaction. Don asked, “did you preregister that hypothesis?” and the speaker said “yes.” Don, as he relates it, was amazed. Here was this super complicated pattern of results, but it had all been predicted ahead of time. That is convincing. Then the speaker said, “No. Just kidding.” Don was less amazed.

The gap between those two reactions is the reason I am trying to start preregistering my experiments. I want people to be amazed.

The single most important scientific practice that Uri, Joe, and I have emphasized is disclosure (i.e., the top panel in the figure). Transparently disclose all manipulations, measures, exclusions, and sample size specification. We have been at least mildly persuasive, as a number of journals (e.g., Psychological Science, Management Science) are requiring such reporting.

Meanwhile, transparency creates a rhetorical problem for me as a researcher. When I conduct experiments, for example, I typically collect a single measure that I see as the central test of my hypothesis. But, like any curious scientist, I sometimes measure some other stuff in case I can learn a bit more about what is happening. If I report everything, then my confirmatory measure is hard to distinguish from my exploratory measures. As outlined in the figure above, a reader might reasonably think, “Leif is p-hacking.” My only defense is to say, “no, that first measure was the critical one. These other ones were bonus.” When I read things like that I am often imperfectly convinced.

How can Leif the researcher be more convincing to Leif the reader? By saying something like, “The reason you can tell that the first measure was the critical one is because I said that publicly before I ran the study. Here, go take a look. I preregistered it.” (i.e., the left panel of the figure).

Note that this line of thinking is not even vaguely self-righteous. It isn’t pushy. I am not saying, “you have to preregister or else!” Heck, I am not even saying that you should; I am saying that I should. In a world of transparent reporting, I choose preregistration as a way to selfishly show off that I predicted the outcome of my study. I choose to preregister in the hopes that one day someone like Don Green will ask me, and that he will be amazed.

I am new to preregistration, so I am going to be making lots of mistakes. I am not going to wait until I am perfect (it would be a long wait). If you want to join me in trying to add preregistration to your research process, it is easy to get started. Go here, open an account, set up a page for your project, and, when you’re ready, preregister your study. There is even a video to help you out.



[8] Adventures in the Assessment of Animal Speed and Morality

Animal Virtue Figure 1
In surveys, most people answer most questions. That is true regardless of whether or not questions are coherently constructed and reasonably articulated. That means that absurd questions still receive answers, and in part because humans are similar to one another, those answers can even look peculiarly consistent. I asked an absurd question and was rewarded with an entertaining answer.

Some years ago, with Tom Meyvis, I tried to develop a manipulation to create an association between speed and virtue. Our spartan publication history on the topic testifies to our (lack of) success. That doesn’t mean that the pilot data weren’t interesting for a different reason.

Participants saw a sequence of 20 animal photographs and rated each on one of two bipolar dimensions: speed or goodness. The former is straightforward. The latter could be best construed as an evaluation of moral worth. That is an absurd question. What sorts of answers did we receive?
Animal Virtue Figure 2
My Top 5 observations:

1. The Tortoise is the most moral animal. I anticipated more class-profiling, and a resulting ingroup bias for Mammalia. Nope. Perhaps researchers should try an implicit measure?*

2. Aquatic race featuring: Jellyfish vs. Starfish vs. Walrus. Who wins? People give the jellyfish the edge. The starfish has no chance.

3. Nature documentaries frequently bandy about facts like, “hippopotami kill more people every year than heart disease.” My respondents overlooked that; hippos are more moral than sloths (which nature documentaries never mention for their killing ability).

4. The orangutan is not just a mammal or just a primate; it is a great ape. Huge opportunity for some ingroup favoritism. Instead people favor the cheetah, walrus, and hippo (amongst others). Explain that.

5. Most animals are good. Our scale had a meaningful midpoint, yet all but three animals are above it. Who is bad? Hyena, Barracuda, and Jellyfish. The Jellyfish is worst. And deceptively fast. Perhaps a researcher could prime people with jellyfish and see if they cheat more on that matrices task?**

Perhaps some absurd questions have correct answers? I asked a pair of experts. Pieter Thomas Jefferson Johnson is an ecologist possibly best known for solving a major scientific problem before he was old enough to drink. Michael Jennions is a world renowned evolutionary biologist, known for many things, including this video (the link alone makes this post worthwhile). I asked them to rank the 20 animals for speed and morality. Their speed ratings are similar to each other (r = .91) and to the novices’ (r = .87). Morality was trickier. Both said that any response would be random, or as Piet said, “I would probably tie them all in ranking”. But responses aren’t quite random. Michael rated based on the complexity of the central nervous system (complex = evil), whereas Pieter used “trophic level, followed by an inverse body mass index”. Despite very different approaches, their rankings are mildly correlated with each other (r = .29). Experts and novices all agree on the virtue of the Tortoise, but Michael and Piet are just as fond of the lowly snail.
Animal Virtue Figure 3
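For anyone who wants to run the agreement checks themselves, here is a sketch: the rankings below are random placeholders rather than the real expert or novice rankings, but the rank-correlation machinery is the same.

```python
# Rank-correlation sketch with placeholder rankings of the 20 animals.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
n_animals = 20
novice_rank = rng.permutation(n_animals)     # stand-in for the averaged novice ranking
expert_michael = rng.permutation(n_animals)  # stand-in for one expert's ranking
expert_pieter = rng.permutation(n_animals)   # stand-in for the other expert's ranking

r_experts, _ = spearmanr(expert_michael, expert_pieter)
r_novice, _ = spearmanr(novice_rank, expert_michael)
print(f"expert vs. expert: r = {r_experts:.2f}; novice vs. expert: r = {r_novice:.2f}")
```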
*No they shouldn’t.

**Don’t run that study. I mean it.

[5] The Consistency of Random Numbers

What’s your favorite number between 1 and 100? Now, think of a random number between 1 and 100. My goal for this post is to compare those two responses.

Number preferences feel random. They aren’t. “Random” numbers also feel random. Those aren’t random either. I collected some data, found a pair of austere academic papers, and one outstanding blog post. I will tell you about all of them.

First, the data I collected. I (along with Hannah Perfecto, one of my excellent doctoral students) asked one group of people to generate a random number between 1 and 100. Another group reported their favorite number between 1 and 100. That’s it.

We know a little about preferences. People like their birthday numbers, for example. They pursue round numbers. In preparing this post, I learned of a simmering literature on single-digit number preferences, suggesting that in both 1971 and in 1988 people liked the number 7. (Aside: Someone should write the number preference equivalent of the Princeton Trilogy. In fact, why not move beyond preferences to other attributes? For example, are even numbers more warm or more competent?*). As far as I can tell, less is known about how people generate random numbers. Do people choose the same numbers at random as they choose as their favorites?

The figures tell the whole story, but words are useful. Consider four notable numbers. Consistent with past research, people like the number 7. Inconsistent with horror movie titlers and hotel floor number assigners, people also like the number 13. The number 42 has an entirely wonderful Wikipedia entry, suggesting that its consequence goes beyond Jackie Robinson and Douglas Adams. Perhaps the Data Colada can add a small footnote to its mystique? Finally, the number 69 also has a Wikipedia entry, though it is far less vivid than you’re anticipating. On the random side there are fewer obvious winners (a three-way tie between 5, 67, and 69).

numbers frequencies

How about some other patterns? First of all, the two sets are highly, but imperfectly, correlated at r = .48. Random numbers are larger than favorite numbers (Ms = 46.9 vs. 30.7), t(565) = 7.01, p < .001.

numbers correlation
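Here is a sketch of those two comparisons on simulated responses (the group sizes are stand-ins chosen only to match the degrees of freedom above): tally how often each number from 1 to 100 is named in each condition, correlate the two tallies, and compare the means.

```python
# Favorite vs. "random" numbers: frequency correlation and mean comparison,
# computed on simulated responses rather than the data we collected.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
favorites = rng.integers(1, 101, size=283)  # hypothetical group sizes
randoms = rng.integers(1, 101, size=284)

fav_counts = np.bincount(favorites, minlength=101)[1:]   # how often 1..100 was named
rand_counts = np.bincount(randoms, minlength=101)[1:]

r = np.corrcoef(fav_counts, rand_counts)[0, 1]
t, p = stats.ttest_ind(randoms, favorites)
print(f"frequency correlation r = {r:.2f}")
print(f"random M = {randoms.mean():.1f}, favorite M = {favorites.mean():.1f}, "
      f"t({len(randoms) + len(favorites) - 2}) = {t:.2f}, p = {p:.3f}")
```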

These tendencies are partially reflected in the numeric codes people choose for debit cards and their ilk. PIN codes are a mix of preference and randomness, and consistent with the data we collected, a brilliant analysis of leaked PIN codes reveals birthday liking (numbers below 32) and repeated numbers (like multiples of 11). Figure 3 reproduces a chart of 4-digit PIN codes. It will take 30 seconds to orient yourself, but then you will spend five minutes savoring it.

numbers PIN

My favorite number is just about the most arbitrary preference possible. My “random” number is more arbitrary. But neither is arbitrary at all.

* Hypothesis: More warm. Odd numbers are wicked competent.



[2] Using Personal Listening Habits to Identify Personal Music Preferences

Not everything at Data Colada is as serious as fraudulent data. This post is way less serious than that. This post is about music and teaching.

As part of their final exam, my students analyze a data set. For a few years that data set has been a collection of my personal listening data from iTunes over the previous year. The data set has about 500 rows, with each reporting a song from that year, when I purchased it, how many times I listened to it, and a handful of other pieces of information. The students predict the songs I will include on my end-of-year “Leif’s Favorite Songs” compact disc. (Note to the youth: compact discs were physical objects that look a lot like Blu-Ray discs. We used to put them in machines to hear music.) So the students are meant to combine regressions and intuitions to make predictions. I grade them based on how many songs they correctly predict. I love this assignment.

The downside, as my TA tells me, is that my answer key is terrible. The problem is that I am encumbered both by my (slightly) superior statistical sense and my (substantially) superior sense of my own intentions and preferences. You see, a lot goes into the construction of a good mix tape (Note to the youth: tapes were like CDs, except if you wanted to hear track 1 and then track 8 you were SOL.) I expected my students to account for that. “Ah look,” I am picturing, “he listened a lot to Pumped Up Kicks. But that would be an embarrassing pick. On the other hand, he skipped this Gil Scott-Heron remix a lot, but you know that’s going on there.” They don’t do that. They pick the songs I listen to a lot.

But then they miss certain statistical realities. When it comes to grading, the single biggest differentiator is whether or not a student accounts for how long a song has been in the playlist (see the scatterplot of 2011, below). If you don’t account for it, then you think that all of my favorite songs were released in the first couple of months. A solid 50% of students think that I have a mad crush on January music. The other half try to account for it. Some calculate a “listens per day” metric, while others use a standardization procedure of one type or another. I personally use a method that essentially accounts for the likelihood that a song will come up, and therefore heavily discounts the very early tracks and weights the later tracks all about the same. You may ask, “wait, why are you analyzing your own data?” No good explanation. I will say though, I almost certainly change my preferences based on these analyses – I change them away from what my algorithm predicts. That is bad for the assignment. I am not a perfect teacher.
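For students (or algorithms) wondering what the simplest version of that adjustment looks like, here is a minimal sketch with made-up songs and column names: divide each song's listen count by the number of days it has been in the library before ranking.

```python
# "Listens per day" adjustment, sketched on three hypothetical songs.
import pandas as pd

songs = pd.DataFrame({
    "song": ["A", "B", "C"],
    "date_added": pd.to_datetime(["2011-01-15", "2011-06-01", "2011-11-20"]),
    "listens": [120, 70, 30],
})
year_end = pd.Timestamp("2011-12-31")

songs["days_in_library"] = (year_end - songs["date_added"]).dt.days
songs["listens_per_day"] = songs["listens"] / songs["days_in_library"]

# Ranking on raw listens favors the January song; ranking on listens_per_day does not.
print(songs.sort_values("listens_per_day", ascending=False))
```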

I don’t think that I will use this assignment anymore since I no longer listen to iTunes. Now I use Spotify. (Note to the old: Spotify is like a musical science fiction miracle that you will never understand. I don’t.)
Leif's Song Scatterplot