Consider the robust phenomenon of anchoring, where people’s numerical estimates are biased towards arbitrary starting points. What does it mean to say “the” effect size of anchoring?
It surely depends on moderators like the domain of the estimate, expertise, and the perceived informativeness of the anchor. Alright, how about “the average” effect size of anchoring? That’s simple enough. Right? Actually, that’s where the problem of interest to this post arises. Computing the average requires answering the following unanswerable question: How much weight should each possible effect size get when computing “the average” effect size?
Should we weight by number of studies? Imagined, planned, or executed? Or perhaps weight by how clean (free-of-confounds) each study is? Or by sample size?
Say anchoring effects are larger when estimating river lengths than door heights. Does “the average” anchoring effect give all river studies combined 50% weight and all door studies the other 50%? If so, what do we do with canal-length studies: combine them with rivers or count them on their own?
If we weight by study rather than stimulus, “the average” effect gets larger as more river studies are conducted, and if we weight by sample size, “the average” gets smaller as we run more subjects in the door studies.
What about the impact of anchoring on perceived strawberry-jam viscosity? Nobody has yet studied that, but they could. Does “the average” anchoring effect size include this one?
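To make the weighting problem concrete, here is a minimal sketch in Python. Every number in it is made up for illustration: the effect sizes (0.8 for rivers, 0.2 for doors), the sample sizes, and the function names are assumptions, not estimates from any real anchoring study.

```python
# Hypothetical studies: effect sizes (d) and sample sizes (n) are invented.
studies = [
    {"stimulus": "river", "d": 0.8, "n": 50},
    {"stimulus": "river", "d": 0.8, "n": 50},
    {"stimulus": "river", "d": 0.8, "n": 50},
    {"stimulus": "door", "d": 0.2, "n": 50},
]


def mean_by_study(studies):
    """Every study gets the same weight."""
    return sum(s["d"] for s in studies) / len(studies)


def mean_by_stimulus(studies):
    """Every stimulus gets the same weight, however many studies used it."""
    by_stimulus = {}
    for s in studies:
        by_stimulus.setdefault(s["stimulus"], []).append(s["d"])
    stimulus_means = [sum(ds) / len(ds) for ds in by_stimulus.values()]
    return sum(stimulus_means) / len(stimulus_means)


def mean_by_sample_size(studies):
    """Every subject gets the same weight."""
    total_n = sum(s["n"] for s in studies)
    return sum(s["d"] * s["n"] for s in studies) / total_n


def with_door_n(studies, n):
    """Return a copy in which each door study has n subjects."""
    return [dict(s, n=n) if s["stimulus"] == "door" else s for s in studies]


if __name__ == "__main__":
    print("by study:      ", round(mean_by_study(studies), 3))
    print("by stimulus:   ", round(mean_by_stimulus(studies), 3))
    bigger_door = with_door_n(studies, 500)
    print("by sample size:", round(mean_by_sample_size(bigger_door), 3))
```

With these invented inputs, weighting by study gives 0.65, weighting by stimulus gives 0.50, and weighting by sample size after running 500 subjects in the door study gives about 0.34. Same studies, three different “averages,” and nothing in the data says which weights are the right ones.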
What about all the zero estimates one would get if the experiment were done in a room without any lights, or with confusing instructions? What about all the large effects one would get via demand effects or confounds? Does the average include these?
Studies aren’t random
We can think of the problem using a sampling framework: the studies we run are a sample of the studies we could run. Just not a random sample.
Cheat-sheet. Random sample: every member of the population is equally likely to be selected.
First, we cannot run studies randomly, because we don’t know the relative frequency of every possible study in the population of studies. We don’t know how many “door” vs “river” studies exist in this platonic universe, so we don’t know with what probability to run a door vs a river study.
Second, we don’t want to run studies randomly; we want studies that will provide new information, that are similar to those we have seen elsewhere, that will have higher rhetorical value in a talk or paper, that we find intrinsically interesting, that are less confounded, etc.
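A small sketch of what non-random selection does to an average. It pretends, counterfactually, that we know the shares of door and river studies in the population of possible studies, which is exactly what the first point says we cannot know. The shares, effect sizes, and selection weights below are all hypothetical.

```python
# All numbers are invented for illustration.
population_share = {"door": 0.9, "river": 0.1}   # unknowable in practice
true_effect = {"door": 0.2, "river": 0.8}        # hypothetical effect sizes
selection_weight = {"door": 1.0, "river": 5.0}   # researchers favor river studies


def population_mean(share, effect):
    """Average effect if every possible study were equally likely to be run."""
    return sum(share[k] * effect[k] for k in share)


def expected_mean_of_run_studies(share, effect, weight):
    """Average effect among the studies that actually get run, when each
    kind of study is chosen in proportion to share * weight."""
    norm = sum(share[k] * weight[k] for k in share)
    return sum(share[k] * weight[k] * effect[k] for k in share) / norm


if __name__ == "__main__":
    print(round(population_mean(population_share, true_effect), 3))
    print(round(expected_mean_of_run_studies(
        population_share, true_effect, selection_weight), 3))
```

Under these assumptions the population average is 0.26, while the studies that get run average about 0.41. To correct for that gap we would need both the shares and the selection weights, and we have neither.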
What can we estimate?
Given a set of studies, we can ask what the average effect of those studies is. We have to worry, of course, about publication bias; p-curve is just the tool for that. If we apply p-curve to a set of studies, it tells us what effect we expect to get if we run those same studies again.
To generalize beyond the data requires judgment rather than statistics.
Judgment can account for non-randomly run studies in a way that statistics cannot.
- Running studies with a set of stimuli instead of a single stimulus is nevertheless very important, but for construct rather than external validity. Running a set of stimuli reduces the risk of stumbling on the single confounded stimulus that works. Check out the excellent “Stimulus Sampling” paper by Wells and Windschitl.