Data Colada

[124] "Complexity": 75% of participants missed comprehension questions in AER paper critiquing Prospect Theory


Posted on March 14, 2025 (updated March 31, 2025) by Uri Simonsohn

Kahneman and Tversky’s (1979) “Prospect Theory” article is the most cited paper in the history of economics, and it won Kahneman the Nobel Prize in 2002. Among other things, it predicts that people are risk seeking for unlikely gains (e.g., they pay more than $1 for a 1% chance of $100) but risk averse for unlikely losses (e.g., they pay more than $1 to avoid 1% chance of losing $100).
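For readers who want the mechanics: under cumulative prospect theory, a simple gain prospect "p chance of $x" is valued at w(p)·v(x), where the inverted-S weighting function w overweights small probabilities. Here is a minimal sketch of the gains side, using the Tversky and Kahneman (1992) functional forms with their median parameter estimates; treat the specific numbers as illustrative, not as anything reported in the paper discussed below.

```python
# Cumulative prospect theory valuation of "p chance of $x" (gains side),
# using Tversky & Kahneman's (1992) functional forms and median estimates.

def w(p, gamma=0.61):
    # Inverted-S probability weighting: overweights small p, underweights large p.
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def certainty_equivalent(p, x, alpha=0.88):
    # The sure dollar amount whose value v(CE) = CE**alpha equals w(p) * x**alpha.
    return (w(p) * x**alpha) ** (1 / alpha)

print(certainty_equivalent(0.10, 25))  # ~$3.70 > $2.50 EV: risk seeking for unlikely gains
print(certainty_equivalent(0.90, 25))  # ~$17.00 < $22.50 EV: risk averse for likely gains
```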

These patterns have been replicated in dozens, possibly hundreds, of studies.

A just-published American Economic Review (AER) paper claims that Prospect Theory gives an incorrect explanation for these patterns. The author proposes that the patterns are not driven by how people think about probabilities or outcomes, but instead by "complexity", the mental "difficulty of valuing a disaggregated object" (p.3791). Indeed, the abstract says that "much of the behavior motivating our most important behavioral theories of risk derive from complexity-driven mistakes". Exactly how "complexity" impacts valuations in general, or how it leads to the same predictions as prospect theory, is not discussed in the paper. [1]

The AER paper reports on five similar experiments in which participants said how much they valued different lotteries (e.g., 10% chance of $25), as well as how much they valued something called riskless "mirrors". Mirrors are prospects assumed to have similar complexity, but without any risk (e.g., getting 10% of $25 for sure, so $2.50 for sure, but expressed in a 'complex' way).

The paper's first main finding is that valuations for lotteries and mirrors are "virtually identical" (p. 3797). That is, despite mirrors having no risk, people value them as if they were lotteries. This is consistent with the "complexity" hypothesis.

With Daniel Banki, Robert Walatka, and George Wu, we recently posted to SSRN a commentary on this paper (htm). In this post I share some of our analyses. Any views expressed here, and the writing style, are my own, not necessarily theirs. But all the analyses are joint work.

OK, let's not bury the lede. The experiment confused many participants. 75% of them erred in the comprehension questions; for the remaining 25%, median valuations of lotteries and mirrors were very different. Mirrors were priced at expected value, lotteries in line with prospect theory.

Figure 1. Differences between expected value and median valuations.
[2025-03-23/31: dropped incorrect statement from caption/edited right header]

Boxes
The experiments implemented both mirrors and lotteries by asking participants to consider sets of imaginary boxes. For example, the 10% of $25 lottery and mirror showed participants 100 boxes where 10 contained $25 and 90 contained $0. For lotteries, the participants were told to imagine opening a random box and getting whatever was in it. For mirrors, participants were paid "the sum of the rewards in all of the boxes, weighted by the total number of boxes" (p. 3790).
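As a minimal sketch of the two payment rules (my paraphrase of the design, not the paper's code):

```python
import random

boxes = [25] * 10 + [0] * 90  # 100 boxes: 10 contain $25, 90 contain $0

# Lottery: open one box at random and keep its contents ($25 with probability 10%).
lottery_payoff = random.choice(boxes)

# Mirror: receive the sum of the rewards in all the boxes divided by the number
# of boxes, i.e., exactly the lottery's expected value ($2.50), with no risk.
mirror_payoff = sum(boxes) / len(boxes)
```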

Value elicitation
In addition to understanding the lotteries and mirrors, participants had to understand how to communicate their valuations. In psychology we usually ask hypotheticals straight up: "how much do you value this thing?", but in econ they like including incentives, which can complicate things. The screenshot below shows instructions for a mirror involving losses. [2],[3]

I have a hard time imagining the person who struggles to understand "10% chance of $25", yet has no difficulty interpreting these instructions.

Imagine you wanted to assess the impact of carrying a brick on how well people can run, and in your study people needed to carry the brick on top of a heavy backpack. 

You need an empty backpack to study the impact of carrying a brick.
Similarly, I think you need a simple experiment to study the impact of lottery complexity.

Did people understand how to provide their valuations? We don’t know because there were no comprehension questions about it. But there were comprehension questions about the mirrors and lotteries. Did people understand them? Many didn’t.

Comprende?
Participants were asked 4 multiple-choice comprehension questions about a 50:50 lottery that paid $16 or $0 (and another 4 questions about the corresponding mirror). Participants were asked the probability of getting different payoffs (e.g., of getting $16). Each question had only 3 possible answers. If participants provided the wrong answer, they could try again until they provided the correct one. People made tons of errors.

Fig 2. Distribution of number of errors per participant

That's not very promising, but it's possible that participants got these questions wrong, learned from seeing the correct answer, and by the time they started valuing boxes for real they had figured out how everything worked. Let's see.

There is something economists call "first-order stochastic dominance." It means that people should pay more for strictly better things. So people should pay more for a 90% chance of $25 than for a 10% chance of $25. For short I will call this “being coherent”.  If people misunderstood the boxes, we may expect some of them to be incoherent. And they were.
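In code, the coherence check is just a within-participant comparison. A minimal sketch, with made-up valuations and hypothetical column names:

```python
import pandas as pd

# Hypothetical data: one row per participant.
df = pd.DataFrame({
    "value_10pct_of_25": [4.0, 12.0, 3.5],   # valuations of "10% chance of $25"
    "value_90pct_of_25": [20.0, 8.0, 22.0],  # valuations of "90% chance of $25"
})

# "90% of $25" first-order stochastically dominates "10% of $25", so valuing
# the 10% lottery above the 90% lottery is an incoherent response.
df["incoherent"] = df["value_10pct_of_25"] > df["value_90pct_of_25"]
print(df["incoherent"].mean())  # share of incoherent participants
```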

In Figure 3 we compare the share of participants who were incoherent in previously published papers vs. in the studies in the AER paper. In earlier experiments, about 3% of participants were incoherent; in the AER paper, 21% were. While people with the most comprehension errors were the most incoherent, even the best participants in the AER paper, those with zero comprehension errors, were more often incoherent (12.4%) than participants overall in every previous study (all below 11.4%).

Fig 3. Incoherent responses (e.g., valuing 10% of $25 more than 90% of $25)

If the comprehension questions in the AER paper were intended as training, the data suggest the training was insufficient.

Share of participants showing effects
The figure at the beginning of the post – the one showing that the AER findings go away when looking at participants without errors – depicts medians. We obtain a similar pattern when analyzing the share of participants showing effects. Figure 4 shows the share of participants exhibiting the fourfold pattern predicted by prospect theory, and the share who value the prospects at expected value. As you can see, the majority of participants with no comprehension errors valued the lotteries in the way that prospect theory predicts and valued mirrors at their expected value. In other words, the people who best understood the task provided valuations consistent with prospect theory and inconsistent with complexity. [4]

 
Fig 4. Share of participants valuing mirrors at Expected Value and lotteries in line with Prospect Theory

Why do confused participants show prospect theory behavior with mirrors?
First, some participants probably thought the mirrors were lotteries, which can of course cause them to provide similar valuations of lotteries and mirrors. (We have evidence of this in the paper but no space here.)

Second, and less obviously and more interestingly, this can arise because of regression to the mean.

Imagine a clueless participant, someone who never understood the instructions or someone who stopped paying attention after, say, 5 minutes of providing dollar values for sets of imaginary boxes. What would this clueless person do when facing this table?

A participant who chooses randomly or capriciously will tend to give answers away from the extremes of the scale. This will pull median and (especially) mean valuations towards the midpoint of the scale (henceforth, "regression to the mean" for short). [5]

When the dependent variable is how much people value prospects, regression to the mean creates spurious evidence in line with prospect theory. When people answer randomly for 10% chance of $25, they overvalue it, because the “right” valuation is $2.50, and the scale mostly contains values that are higher than that. When people answer randomly for 90% chance of $25, they undervalue it, because the “right” valuation is $22.50 and the scale mostly contains values that are lower than that. Thus, random or careless responding will produce the same pattern predicted by prospect theory.
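A quick simulation makes the mechanism concrete; the only real assumption here is the range of the response scale (I use $0 to $25, the lottery's span of possible payoffs):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 clueless participants each click a random row of a $0-$25 price list.
random_valuations = rng.uniform(0, 25, 10_000)
midpoint = random_valuations.mean()  # ~$12.50, regardless of the lottery shown

print(midpoint - 2.50)   # apparent OVER-valuation of "10% of $25" (EV = $2.50)
print(midpoint - 22.50)  # apparent UNDER-valuation of "90% of $25" (EV = $22.50)
```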

One can design a study where regression to the mean works against prospect theory. Kahneman and Tversky did so decades ago, relying on binary choice. Like this:

What do you prefer?
  a) $2.50 for sure
  b) 10% chance of $25

That’s what an empty backpack looks like.

In this case, random responding biases the estimate towards 50:50, and thus away from prospect theory predictions. The "Asian Disease problem", for example, the most famous instantiation of the kind of risk preference reversals studied in this AER paper, relies on simple binary choice. [6]
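A back-of-the-envelope calculation shows the attenuation; the 70% share of attentive participants preferring the risky option is purely an illustrative assumption:

```python
# Suppose 70% of attentive participants truly prefer the risky option (the
# prospect theory prediction for an unlikely gain), and some fraction of
# the sample answers the binary question at random (50:50).
def observed_risky_share(share_random, true_share=0.70):
    return (1 - share_random) * true_share + share_random * 0.50

for share_random in (0.0, 0.25, 0.50):
    print(share_random, observed_risky_share(share_random))
# 0.70, 0.65, 0.60: random responding drags the estimate toward 50%,
# away from (not toward) the prospect theory pattern.
```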

George Wu just posted on SSRN the results from a binary-choice experiment contrasting lotteries with mirrors, finding big differences in how mirrors and lotteries are valued (.htm). The design and analyses would take some time to explain, so here I just highlight an interesting result:


Fig 5. Highlighted results from new study by George Wu, available on SSRN (paper: htm | his Figure 4: pdf)

Though the results aren't perfect for prospect theory (e.g., only half the people took the 10% lottery), with lotteries we do see more risk seeking for unlikely gains than for likely gains, while with mirrors there is no difference. Notably: (1) random responding cannot explain the lottery results, and (2) lotteries are treated quite differently from mirrors.

Regression to the mean is probably behind two other results
The AER paper highlights two correlations as key results: (1) the effect size for lotteries and mirrors is correlated across participants, which is interpreted as showing that the same mechanism ('complexity') drives both effects; and (2) more sophisticated people (e.g., STEM majors, high CRT people) show smaller effects, which is interpreted as showing the mechanism involves making cognitive mistakes smart people avoid.

Regression to the mean may offer a simpler explanation.

To see how regression to the mean is at play, let’s look at the raw valuations of mirrors and lotteries for 10% of $25.


Fig 6. Correlation in lotteries and mirrors seems driven by regression to the mean

Starting from the left, we see that the median participant with zero comprehension errors valued mirrors at expected value and lotteries above it, as predicted by prospect theory. As we move right, both medians go up, way up. The AER paper suggests that we should think of that variation as showing higher and higher valuations of the prospect, but that variation confounds noise (regression to the mean) with signal (the true underlying valuation of mirrors and lotteries). (FWIW, I'd say that valuing a 10% chance of $25 at $11 is not consistent with prospect theory; it's too high an over-valuation.)

If we compute the vertical difference in valuations between mirrors and lotteries, we partial out some of that randomness (econ friends, think: diff-in-diff instead of diff). Participants with no errors show a pattern consistent with prospect theory, but the other participants do not.
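In code, the diff-in-diff logic is a within-participant difference; the numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical valuations of "10% of $25", as a lottery and as a mirror.
df = pd.DataFrame({
    "n_errors":      [0, 0, 4, 4],            # comprehension errors per participant
    "value_lottery": [4.5, 4.0, 12.0, 11.0],
    "value_mirror":  [2.5, 2.5, 11.0, 12.0],
})

# Subtracting the mirror valuation removes noise common to both tasks,
# isolating the lottery-specific gap that prospect theory predicts.
df["pt_gap"] = df["value_lottery"] - df["value_mirror"]
print(df.groupby("n_errors")["pt_gap"].median())
# Positive gap (~$1.75) for zero-error participants; ~$0 for the high-error group.
```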

A similar logic applies to the correlations with cognitive sophistication (e.g., STEM or CRT).

The AER paper claims that smarter participants show less prospect theory behavior, but that’s because the analysis confounds noise (regression to the mean) with signal (true differences in valuations).

I will report results for the CRT because it is fun to talk about (see Frederick, any date). So, let's look at the same figure as before, but now with CRT errors on the x-axis.

Fig 7. Folks with perfect CRT scores behave in line with prospect theory only for lotteries.

Starting on the left we see that the median participant with a perfect CRT score values mirrors at expected value and the lottery above it, as predicted by prospect theory. The lottery/mirror gap vanishes for participants with lower CRT scores.

FWIW, that median of $4.50 for the high CRT folks matches (proportionally) the median that Tversky and Kahneman (1992) report for their 10% lottery.

In other words, the smartest participants in this difficult-to-understand experiment behave perfectly in line with the typical participant in Tversky & Kahneman's easy-to-understand experiment, and in line with prospect theory.

In Figure 9 of our paper we generalize these results to all prospects and six alternative measures of cognitive sophistication.

Two things need to be true.
The AER paper argues that past studies showing support for prospect theory actually reflect participants making mistakes because of lottery complexity. Two things would need to be true about the data in the AER paper for this inference to be justified, and neither is.

First, it needs to be true that the level of confusion in the AER studies is comparable to the level of confusion in past studies. But Figure 3 shows it is not: incoherent respondents are 6 times more common in the AER studies.

Second, it needs to be true that people who understood the lotteries do not show behavior in line with prospect theory. But Figures 1 and 5-7 of this post show this is not true either. The more people understood the experiment, the more they treated lotteries, and not mirrors, in line with prospect theory predictions.



Disclaimer: two economists who read the SSRN paper thought we should explicitly indicate no misconduct is suspected. To be clear: there is zero concern of academic misconduct. To avoid other possible misreadings I want to also indicate I am not suggesting the AER paper insufficiently controlled for weather variables (Colada[46]), I am not suggesting it should have included interaction controls (Colada[80]), and I am not suggesting the AER author takes baths in hotel rooms (Colada[16]). 


Author feedback
Our policy (.htm) is to share drafts of blog posts with authors whose work we discuss, in order to solicit suggestions for things we should change prior to posting.

We had sent the SSRN paper to Ryan Oprea, the author of the AER paper, before posting it there, and I sent a draft of this post to him as well, about a month ago. Ryan and I exchanged many emails over the past few weeks. I want to start with some bad news for Devin Pope, who I have now demoted (in my mental ranking) to being the second nicest economist.

I summarize below what I believe are the key points of agreement and disagreement with Ryan. I shared this summary with him, and he provided comments, which I interleave below behind links like this one:

See Ryan's comment

While it is never truly “fun” to have your work criticized, Uri managed to make our exchange surprisingly close to fun.  We had a great (and sometimes very entertaining) exchange and I think we really tried to listen to one another over the course of a long series of emails.  I maintain that a number of our disagreements are semantic and am of course disappointed that I failed to entirely convince him of this, but I remain grateful to him for his patience, curiosity and good faith during our exchange on this comment and post.

You can also read his full separate response to the post here: pdf


Three points we agree on
1. Some participants in experiments find it too hard or unpleasant to work through what expressing their true preferences (or beliefs) would entail, and instead do something else; that something else may be an artifactual heuristic, responding randomly, etc. Ryan frames this as complexity, as in, the complexity involved in providing the answer expected by the experimenter gets in the way. I will refer to this as measurement error.

See Ryan's comment

I agree.   What some people in the literature have hoped we are measuring with these valuation tasks is people’s rationally expressed “tastes” for risk.   But a lot of other people in the literature (stretching back to the beginning of prospect theory) have thought we might be measuring something else entirely – the effects of the (potentially noisy) cognitive shortcuts people use in valuation because rationally valuing lotteries is hard or personally costly (“complex”).   As I read the literature, these are both traditional interpretations of prospect theory, and neither is really the “orthodox” interpretation.   Relative to that first interpretation (that we’re measuring rational tastes), the second (that we’re measuring not fully rational behavior) constitutes a kind of measurement error – a set of drivers of valuations that confound attempts to measure rational tastes for lotteries. I think of the mirror as an attempt to (at least to some extent) measure that “measurement error” and study whether it takes on the distinctive shape of prospect theory.


2. With some experimental designs, measurement error can masquerade as the effect of interest to the researchers, e.g., people may over-pay for a 10% chance of $100 not because they value it more than $10, but because they didn't think it through and just chose the midpoint of the scale, say  $50, pushing the mean above the expected value of $10. This would constitute spurious evidence of the phenomenon of interest, probability weighting, and it would just be measurement error.

See Ryan's comment

I agree but with one interesting caveat.  There is a long tradition of interpreting prospect theory as resulting to some degree from this kind of imprecise behavior.    For instance, Tversky and Kahneman in their 1992 paper (one of the two most seminal papers in the development of the theory) directly point out that patterns like prospect theory look a lot like perceptual distortions that are closely related to what Uri is describing here.  So it isn’t clear that this kind of measurement error suggests probability weighting is “spurious” – just that probability weighting might reflect at least to some degree something other than people’s rationally expressed taste for risk.


3. Experimenters in the past have paid insufficient attention to the potential role of measurement error in their findings.

See Ryan's comment

I don’t have a strong opinion on how to judge the history here (as I said there is a long literature in psychology and economics stretching back to the 1960s that openly contemplates the possibility that risky choice may be shaped by something other than people’s rationally expressed tastes), but I definitely think understanding the sometimes predictable ways people do things that aren’t narrowly rational (e.g., express their “true” tastes for things like risk when asked) is interesting and worthy of more study.


What we disagree on
While I agree with points 1-3 as theoretical concerns, I think it is ultimately an empirical question whether measurement error accounts for a substantial share of the empirical patterns attributed to prospect theory in past studies. While Ryan's paper proposes that his data show that past results are mostly or entirely the result of measurement error, I believe his data do not show that. I believe they show that measurement error plays a huge role in his studies, and actually suggest that measurement error played a much smaller role, if any, in past studies.

See Ryan's comment

I think this is a reasonable concern that will only be fully resolved by future experiments that study iterations on the AER paper’s design.   The authors of the comment benchmark the AER study against a relatively small set of studies that differed in a number of ways from mine in order to draw this conclusion (I think this list of benchmarked studies should be expanded to get a fuller picture).  For instance, as the comment notes, previous studies (unlike the AER study) often group and arrange valuation tasks in such a way as to make it plausibly a lot easier to articulate consistent valuations from task-to-task – something that could aid people in making artificially consistent decisions from task to task and avoid FOSD violations.   More importantly (as my formal response to the comment argues), the results reported in the AER paper don’t actually change qualitatively in the most important respects when we restrict to the subjects who make the fewest FOSD violations which suggests this might not be as important as it might seem for interpreting the findings.  Consistent with this, the only other published study I know of that studies “mirrors” (Vieider 2024) finds low FOSD violations but nonetheless finds very similar results to those in the AER paper – again, suggesting that FOSD violations may not be too directly related to the AER paper’s results.  Time will tell.  Regardless, in many ways the most important observation in the study is that when people do make mistakes in these types of valuation tasks, those mistakes tend to look like the classical patterns of prospect theory and that’s an important thing for us to know.


The key issue in my mind is that Ryan's studies are much more confusing than past studies were, and thus produced lots more measurement error. If we think of his studies as a sample used to estimate measurement error in past studies, we should think of it as an unrepresentative sample. This is not just because when I read Ryan's instructions they seem confusing to me, and not just because mirrors are a weird contraption that I believe participants did not understand and many confused with lotteries, but because his data objectively show higher levels of confusion than in past studies.

See Ryan's comment

As I explain in more detail in my formal response to the comment (see the link), there are two problems with this conclusion in my view.  One is that the evidence from the training questions included in the experiment (the centerpiece of this line of criticism) actually show that people learn a lot over the course of the questions (and importantly these questions were actually included not to measure confusion but to train away confusion in subjects before they enter the experiment), with error rates dropping substantially from question to question.   To use early errors in these questions as a measure of confusion is a little like assuming a cure failed because of the initial severity of the disease.  By the end, only a small fraction of the subjects who make overall errors continue to make errors that are even consistent with confusion of mirrors for lotteries, which suggests we don’t have a lot of basis to conclude (at least based on errors in these questions) that people are confused about payoffs upon entering the experiment.  The other problem is, as I’ll argue below, there are lots of reasons to make inconsistent decisions other than being confused about the rules of the experiment.


In his data, for instance, participants valued prospects incoherently (e.g., paying more for 10% chance of $25 than for 90% chance of it), about 6 times as often as they did in past studies. I don't think the behavior of participants in a categorically more confusing study can inform the interpretation of behavior from participants in past studies.

See Ryan's comment

It is important to emphasize that there are a lot of reasons other than confusion about the experiment for subjects to make inconsistent decisions in difficult tasks like these. One is that expressing one's valuations in these types of tasks is difficult, and many people would rather do something simpler (and less consistent) instead — which is how I interpret the data in the AER paper. Evidence of noisy or inconsistent behavior is far from diagnostic of the conclusion that subjects are confused because it is equally consistent with the possibility that people don't like doing complex things. And, again, mistakes in mirrors continue to look systematically prospect theoretic (and average mistakes continue to predict average behavior in lotteries from subject-to-subject) even when we cut out subjects who most severely violate FOSD. So it really isn't clear that whatever drives these FOSD violations is confounding the AER paper's results.


Moreover, focusing on participants who presumably have the least measurement error in Ryan's data (e.g., those with perfect scores on the comprehension questions or on the CRT), we find substantive evidence in line with prospect theory for lotteries, but not for mirrors. This in my mind speaks actively against the hypothesis that measurement error explains evidence of risk preferences in line with prospect theory.

See Ryan's comment

But in my view, this conclusion is based on the authors agreeing with the main idea of the AER paper: that imperfectly rational behavior arises similarly in mirrors and lotteries and that we can use the former to better understand what is happening in the latter. To reach their conclusion, the authors difference out valuations in mirrors from valuations of lotteries, accepting the core premise of the AER paper that there are similar mistakes occurring in both types of tasks. And what they find when they do this (as I emphasize in my formal response to the comment) is that the size of the average prospect theoretic behavior that survives this decomposition is much smaller (even for the most attentive subjects) than the raw lottery valuations alone would suggest. That's evidence that a lot of what we are measuring in these kinds of lottery valuation tasks might be the effects of subjects' use of cognitive shortcuts (the same ones that they similarly use in mirrors) rather than their rational expressions of their tastes for risk. And that is really the main idea of the AER paper. Because of this, I interpret this point of disagreement as instead a point of qualified agreement.


Shortly before this blog post went live, Ryan shared with us his response to our SSRN paper (.pdf).


Footnotes

  1. The paper indicates that complexity leads to suboptimal decisions, but this does not make directional predictions: behaving opposite to prospect theory is also suboptimal, as is behaving randomly. [↩]
  2. I edited the screenshot a bit. The example I had access to was for 4 boxes, so I edited it to read 100 boxes, the more common interface. I also cut the middle part of the range to make the screenshot shorter. [↩]
  3. One of the five experiments had, instead of this price list, an open-ended elicitation ("Becker-DeGroot-Marschak"). I actually found the BDM more confusing, especially for losses, because it required navigating a triple negative: you had to pay to avoid losing money in the boxes you did not choose. [↩]
  4. The fourfold pattern involves being risk seeking for unlikely gains and likely losses, and risk averse for likely gains and unlikely losses. [↩]
  5. Participants clicked on a single row, and the computer filled in the answers for all the other rows; participants did not have to make a selection in each row. [↩]
  6. In conversations with the AER author, he pointed out a paper in Management Science (htm) comparing lotteries with mirrors relying on binary choice. That paper follows closely the design of the AER paper: there are boxes, mirrors, and lotteries. It had comprehension questions very similar to the ones that 75% of participants got wrong in the AER paper, but performance data for those questions was not collected for the Management Science paper, so we cannot examine the impact of misunderstood instructions on that study. In addition, that experiment used non-rounded dollar amounts and probabilities (e.g., a lottery for 88% of $18), almost surely increasing measurement error as participants make rounding errors computing expected values. [↩]
