When we design experiments, we have to decide how to generate and select the stimuli that we use to test our hypotheses. In a forthcoming JPSP article, “Stimulus Sampling Reimagined” (htm), we propose that for at least 60 years we have been thinking about stimulus selection in experiments in the wrong way [1]. Specifically, with Andres Montealegre (Yale postdoc) and Ioannis Evangelidis (fellow prof at Esade, Barcelona) we propose that stimuli should be chosen not with an eye towards external validity, towards ensuring the effect size of chosen stimuli resembles that of unchosen stimuli, but instead, with an eye towards internal validity, towards ensuring the chosen stimuli have effects because of the hypothesized reason.
This distinction is not a pedantic exercise in philosophy of science; it is instead a very practical matter. Aiming for internal instead of external validity changes how many stimuli we choose, how we choose them, and how we analyze the resulting data. So, it changes everything.
Specifically, past methods papers on stimulus selection, which have focused on external validity, have argued for using large numbers of stimuli (dozens and even hundreds of them; see e.g., Judd et al 2012), have paid limited (or no) attention to how authors should go about choosing stimuli (implicitly assuming any sample of stimuli is a random/representative sample) [2], and have considered increasingly mathematically sophisticated ways to compute "the average" effect across stimuli (ANOVA→Mixed-Models→Maximal Mixed-Models→Bayesian Maximal Mixed-Models (I'm serious) …).
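In case that jargon is opaque: a "maximal" mixed-model is one with random intercepts and random condition slopes for both participants and stimuli. A minimal sketch in lme4 syntax, with hypothetical variable names (dv, condition, participant, stimulus, and a made-up data frame d; none of these come from any specific paper), would look something like this:

# Hypothetical sketch of a "maximal" mixed-model: random intercepts and
# condition slopes for both participants and stimuli. All names are made up.
library(lme4)
m <- lmer(dv ~ condition + (1 + condition | participant) + (1 + condition | stimulus), data = d)
summary(m)  # the fixed effect of 'condition' is "the average" effect across stimuli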
A focus on internal validity changes all that. Five to ten stimuli can be more than enough in many situations, how those stimuli are chosen is absolutely key, and what matters most is not "the average" effect, but the extent to which different stimuli yield consistent vs inconsistent results. In our paper we propose relying on "stimulus plots" to analyze that variation in results across stimuli.
The main goal of stimulus plots is to identify possible confounds and/or moderators. In this post I focus on those stimulus plots, presenting our reanalysis of data from two recent psychology papers.
In a future post I will focus on "Mix-and-Match", our proposed method for choosing stimuli for psychology experiments.
OK, let's see some data.
Example 1. When is Snitching OK?
In Study 4 of a 2022 JPSP paper, the authors examined whether revealing someone else’s secret transgression is more acceptable when the transgression is intentional. The study had 20 vignette pairs. One pair, for example, involving the use of illegal drugs, was:
- Intentional: Ross brought illegal party drugs to a party, which he then took when he got there.
- Unintentional: Ross went to a party, and although he had decided beforehand he would not take any illegal party drugs, a friend offered him some, and in the heat of the moment, he said yes.
As is currently customary for experiments with multiple stimuli, the paper reports only overall results averaging across stimuli: M1=2.55 vs M2=3.20, p<.001 (using a mixed-model, not that it matters).
The most tempting thing when seeing comparisons of means is to not think about the underlying stimuli at all, and just think of the study as having generated the two condition means one can compare. The second most tempting thing to do is to imagine that all stimuli show essentially the same effect. The third most tempting thing to do is to imagine that the different stimuli all have slightly different effects, but that they are all symmetrically centered around the overall mean (mixed-models assume this, for instance). But there is no reason, at all, for that to be the case [3]. Each stimulus is different; they could have totally different true effects. So we are pitching a fourth thing to do: don't assume, look at the data.
So aggregating across 20 scenarios the means are M=2.55 vs M=3.20, but what's behind these means? Do we observe a consistent effect across scenarios? Do we see some reversals where people snitch less on intentional acts? Do we see outliers where one scenario is doing all the action? Do we see a bunch of zero effects and a bunch of medium size effects? Obviously the answers to these questions matter, and yet most papers don't try to answer them. Let's start answering them [4].
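To make concrete the kind of computation involved, here is a minimal sketch assuming long-format data in a data frame d with hypothetical columns scenario, condition ("intentional" vs "unintentional"), and rating (none of these names come from the original paper):

# Sketch: per-scenario condition means and their differences (the per-stimulus effects).
# The data frame 'd' and its column names are assumptions for illustration only.
means <- aggregate(rating ~ scenario + condition, data = d, FUN = mean)
wide  <- reshape(means, idvar = "scenario", timevar = "condition", direction = "wide")
wide$effect <- wide$rating.intentional - wide$rating.unintentional  # sign depends on how the DV is coded
wide[order(wide$effect), ]  # stimuli sorted from smallest to largest effect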
The stimulus plot below has two panels. The first shows the means for each stimulus pair; the second shows the difference between those means (i.e., the effects). The second panel also shows the expected range of variation of observed effects across scenarios if all scenarios had the same true effect. This allows us to visually assess whether there is more variation than expected by chance (if your intuition is that the expected line should be flat, see footnote [5]).
Fig 1. Stimulus Plots for JPSP (2022) – Study 4
The first plot shows that nearly half the stimuli did not produce the hypothesized effect: intentional and unintentional secret acts were deemed similarly revealable. That does not necessarily invalidate the main conclusion (indeed, 11 of the 20 stimuli are individually significant), but it does warrant a deeper exploration of the design and results. For example, the figure drew our attention to "harm", the largest effect. John cut himself intentionally (to "deal with his emotional pain"), vs unintentionally ("while chopping vegetables"). We wondered whether the large effect may arise because respondents wished to help John with his self-cutting problems rather than to punish him. We see the motivation to help John as a confound. Conversely, the smallest effect involves Kathy surprising her husband with opera tickets intentionally ("kept this a surprise for months") or unintentionally ("had forgotten to put it on their shared calendar"). We wondered if the directional reversal arose because the action just isn't immoral at all, and intentionality may make it a more positive act.
This is of course speculation. The kind of speculation stimulus plots are meant to generate (and not just with the largest vs smallest effect). In studies with multiple stimuli we should explore the stimulus-level results. This may lead us to revise or refine our theorizing, and lead us to modify or build on study designs. Perhaps the next study drops a possibly confounded stimulus, or explores the candidate moderator suggested by the stimulus plot.
Example 2. Posing While Black
In Study 3 of a 2023 JPSP paper, the authors examined differences in how Black vs White people are perceived when power posing. Undergraduates (n=105) chose potential partners for an upcoming task. Each participant saw 20 sets of 4 photographs of different people (crossing race and pose within each set) and chose one partner from each set. The key finding is that White partners were chosen more often when power posing but Black partners were not: "poses did not influence participants' willingness to interact with Black targets" (p.59, bold added). Any given potential partner was evaluated in both expansive and constrictive poses, allowing the calculation of stimulus-level results. Does Black potential partner #19 show an effect? Figure 2 has the answer. He does. While he was chosen 47% of the time when in a constrictive pose, he was chosen only 13% of the time in an expansive pose. A massive, over 30 percentage point swing.
Fig 2. Stimulus Plots for Black Partners in Study 3 of a JPSP 2023 paper
More generally, in the first panel we see that while on average there is no effect of power posing on choosing a Black partner (23% vs 22%), the means hide substantive heterogeneity. About half the stimuli are individually significant. Indeed, the average absolute effect is larger for Black (12.7%) vs White (11.0%) potential partners, contradicting the paper’s claim that "poses did not influence participants' willingness to interact with Black targets".
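The same kind of tabulation works for binary choice data. A rough sketch, again with made-up names (a data frame d with columns target, pose coded "expansive"/"constrictive", and chosen coded 0/1):

# Sketch: per-target choice rates by pose, their differences, and the average
# absolute effect. The data frame 'd' and its column names are hypothetical.
rates <- aggregate(chosen ~ target + pose, data = d, FUN = mean)
wide  <- reshape(rates, idvar = "target", timevar = "pose", direction = "wide")
wide$effect <- wide$chosen.expansive - wide$chosen.constrictive
mean(abs(wide$effect))  # average absolute effect across targets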
Perhaps examining the photographs one could understand these results, identifying a moderator or confound, but the photos are not included in the paper and were not shared with us after we requested them a couple of times. We don’t think these results are interpretable as currently reported in the JPSP paper.
Our paper includes a third example.
Summary
Ignoring stimulus results represents a missed opportunity to learn from our data. Reporting only the average effect across a set of stimuli can be uninformative, it can be misleading, and it is never sufficient. Stimulus Plots are an efficient way to report descriptive and inferential statistics for studies with multiple stimuli, and to make sure overall means are interpretable. Our "stimulus" R Package produces stimulus plots with a single line of code [6].
In a future post I will cover our proposed approach to choosing stimuli: "Mix-and-Match". If you can't wait, or you love .pdfs, you can read our paper now.
Author feedback
Our policy (.htm) is to share drafts of blog posts with authors whose work we discuss, in order to solicit suggestions for things we should change prior to posting. Authors of the two papers discussed here acted as reviewers when our paper was under peer review (one for JPSP, one for a different journal), so it did not seem necessary to reach out to them again. One reviewer was very positive; both provided useful feedback.
Footnotes.
- We think the first paper proposing stimuli should be chosen to generalize to unchosen stimuli is by Coleman (1964), but the argument was more influentially made by Clark (1973). [↩]
- A few papers have proposed systematically exploring the full space of possible stimuli (Baribault et al., 2018; DeKay et al., 2022), and there is an older debate about choosing stimuli to mimic their frequency in everyday experiences (Brunswik, 1955); I will say more about this in the post about experimental design with "Mix-and-Match". [↩]
- The central limit theorem tells us the mean of (basically) any distribution is approximately normally distributed, but here we are talking about the underlying things being averaged. Average income is normally distributed, income is not. Average city size is normally distributed, city size is not. The average effect in a set of stimuli is normally distributed, but the effects of the individual stimuli are not; there is no reason for the true effects of a set of stimuli to fall symmetrically on either side of their true mean. [↩]
- Mixed-models can be used to compute the effects of each stimulus; in fact, under the hood the models do just that, as those estimates are what the overall mean and its standard error are based on (the default output of mixed-models does not include them, and they are seldom reported in papers). Also, and more importantly, there is a difference between accounting for variation and exploring that variation. Estimating a mixed-model is neither necessary nor sufficient to explore that variation. [↩]
- Imagine you have 5 people flipping 20 coins each. While we expect each of them to get 50% heads on average, we wouldn't expect all 5 to get exactly 10 heads and 10 tails. Via math or simulations you can compute how many heads the person who gets the fewest heads should get, the 2nd fewest, etc. (a short simulation sketch appears after these footnotes). That will be an increasing curve, like the blue line in the second stimulus plot figure. The slope is determined by sampling error: by how big the sample is for a given stimulus, and how much variance its DV has. [↩]
- For now it is available from GitHub. You can install it with groundhog like this:
library(groundhog)                                   # load the groundhog package
groundhog.library("urisohn/stimulus", "2025-05-01")  # install & load 'stimulus' from GitHub, as of 2025-05-01
[↩]
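As promised in footnote 5, here is a minimal simulation sketch of that coin-flipping logic (it is not our paper's code; it just illustrates the expected increasing curve):

# Footnote 5, simulated: 5 people each flip 20 fair coins; sort each set of
# head-counts and average across many simulated "studies" to get the expected curve.
set.seed(1)
nsims  <- 10000
heads  <- matrix(rbinom(nsims * 5, size = 20, prob = 0.5), nrow = nsims, ncol = 5)
sorted <- t(apply(heads, 1, sort))  # within each simulation: fewest to most heads
colMeans(sorted)  # an increasing curve, even though every "person" has the same true 50% rate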