This post is an introduction to a series of posts about meta-analysis [1]. We think that many, perhaps most, meta-analyses in the behavioral sciences are invalid. In this introductory post, we make that case with arguments. In subsequent posts, we will make that case by presenting examples taken from published meta-analyses.
We have recently written a short article for Nature Reviews Psychology in which we briefly described some fundamental problems with meta-analysis, and proposed an alternative way to generate more productive and less misleading literature reviews (.htm). Because of space constraints, in that article we couldn’t fully articulate our concerns with meta-analysis, and we were unable to include many examples. But we can do that here, over the course of a few posts.
What’s Wrong With Meta-Analyses
When, years ago, we first started thinking about meta-analysis, we presumed, just like most researchers, that it was an imperfect but useful technique. But then we started looking at it closely, and we were surprised to realize – as others before us had realized – that this “gold standard” method of scholarship is frequently invalid and often misleading [2].
Meta-analysis has many problems. For example, meta-analyses can exacerbate the consequences of p-hacking and publication bias (Vosgerau et al 2019; .htm), and common methods of correcting for those biases work only in theory, not in practice (see Data Colada [30],[58],[59]). But in this series, we will focus our attention on only two of the many problems: (1) lack of quality control, and (2) the averaging of incommensurable results. We should say at the outset that we do not think that these problems are present in all meta-analyses. In particular, meta-analyses may be informative when they combine studies that have identical manipulations and measures. Our critiques apply most forcefully to meta-analyses that combine studies with different manipulations or outcomes, as is done in many meta-analyses in the social sciences.
Two of the Problems That Invalidate Meta-Analytic Averages
1. Some Studies Are More Valid Than Others
Some studies in the scientific literature have clean designs and provide valid tests of the meta-analytic hypothesis. Many studies, however, do not, as they suffer from confounds, demand effects, invalid statistical analyses, reporting errors, data fraud, etc. (see, e.g., many papers that you have reviewed). In addition, some studies provide valid tests of a specific hypothesis, but not a valid test of the hypothesis being investigated in the meta-analysis [3].
When we average valid with invalid studies, the resulting average is invalid. This is in part because invalid results can be extreme and have a disproportionate effect on the overall mean, and in part because, unlike random error, various forms of invalidity are not expected to cancel each other out. The invalidity of a study with a demand effect does not cancel out the invalidity of a study with an inappropriate statistical procedure. Indeed, in a literature in which some results are easier to publish than their opposites (i.e., arguably most literatures), different sources of invalidity are likely to bias results in the same direction rather than in opposite directions.
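To put toy numbers on that, here is a minimal simulation – made-up effect sizes, not data from any real meta-analysis – in which twenty valid studies estimate a small true effect with random error, and five invalid studies (say, ones contaminated by demand effects) systematically overestimate it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Twenty valid studies of a true effect of d = 0.10: noisy, but unbiased.
valid = rng.normal(loc=0.10, scale=0.15, size=20)

# Five invalid studies (e.g., demand effects) that systematically inflate the estimate.
invalid = rng.normal(loc=0.60, scale=0.15, size=5)

print(f"average of the valid studies: {valid.mean():.2f}")   # close to 0.10
print(f"meta-analytic average:        {np.concatenate([valid, invalid]).mean():.2f}")  # pulled upward by the invalid studies
```

The valid studies’ random errors wash out; the invalid studies’ bias does not. Adding more valid studies dilutes the bias, but it never removes it.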
Meta-analysts frequently detail the criteria that need to be met for inclusion, but they are rarely so vigilant about decisions to exclude studies for invalid designs or analyses, let alone because the underlying data might contain errors or are possibly fraudulent. Indeed, rather than attempting to evaluate individual studies and separate the wheat from the chaff, many meta-analysts attempt to be comprehensive, actively seeking out and incorporating studies that would otherwise – and deservedly – have gone unnoticed or unpublished. Nobly intentioned, this procedure will lead to the inclusion of studies that have been evaluated to a lower standard, never evaluated at all, or evaluated and found to be invalid. In other words, many meta-analysts act as though 100% of studies conducted anywhere, by anyone, are valid, and thus that the average of any set of located studies is valid. But that is probably not true.
To give a quick example of how some meta-analyses lack quality control, a paper on the controversial literature of ‘behavioral priming’ (Weingarten et al., 2016, .htm) located 283 results, and none of them were excluded due to quality considerations. That is, 100% of the located behavioral priming results were presumed to be valid.
2. Combining Incommensurable Results
Averaging results from very similar studies – e.g., studies with identical operationalizations of the independent and dependent variables – may yield a meaningful (and more precise) estimate of the effect size of interest. But in some literatures the studies are quite different, with different manipulations, populations, dependent variables and even research questions. What is the meaning of an average effect in such cases? What is being estimated?
To see the problem with averaging across disparate studies, it is useful to consider an example outside the set of usual research questions. Let’s imagine a meta-analysis on “the effect” of walking.
Walking has many effects. It can cause people to (1) get closer to the kitchen, (2) burn some calories, and (3) fall off a cliff. If we really wanted to, we could measure (1), (2) and (3), convert what we measured into Cohen’s ds, and then compute a meta-analytic average. But what would you learn from the statement “the average effect of walking is d = .43”? Now consider the claim that “the average effect of nudging is d = .43”. That result appears in the abstract of a meta-analysis that we will discuss in our next post. Does that statistical tidbit really convey more information than our imaginary claim about ‘the effect’ of walking? We don’t think so.
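Returning to our imaginary walking meta-analysis, here is a toy sketch – made-up data, and a simple unweighted average (the weighting is not the problem) – showing how mechanically one can convert three incommensurable outcomes into Cohen’s ds and produce “the effect of walking”.

```python
import numpy as np

rng = np.random.default_rng(0)

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Three made-up "studies" of walking vs. sitting, each measuring a different outcome.
studies = {
    "meters closer to the kitchen": (rng.normal(8, 3, 200),    rng.normal(0, 3, 200)),
    "calories burned":              (rng.normal(120, 40, 200), rng.normal(70, 40, 200)),
    "falls off a cliff":            (rng.poisson(0.3, 200),    rng.poisson(0.02, 200)),
}

ds = {outcome: cohens_d(walking, sitting) for outcome, (walking, sitting) in studies.items()}
for outcome, d in ds.items():
    print(f"{outcome:30s} d = {d:.2f}")

# A precise-looking number that answers no question anyone is asking.
print(f"\n'The effect of walking':       d = {np.mean(list(ds.values())):.2f}")
```

The code runs and a number comes out, but it tells you nothing about whether you should go for a walk.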
Meta-analysts typically try to account for some forms of variation in study design by, for example, conducting moderator analyses in which they report averages across different subsets of results. But accounting for some differences routinely means neglecting others. A notable example is that meta-analysts frequently average the effects of interventions while ignoring the size of those interventions. For example, they might average the effects of a blatant reminder with the effects of a subtle reminder, they might report that average as d = .25, and they might summarize that as indicating that reminders work but the effect isn’t big. As sensible as that might seem, it is meaningless and potentially misleading, as blatant reminders may have large effects and subtle reminders may not work at all.
Attempting to average across manipulations of different magnitudes is a bit like trying to answer the question of whether objects are on average light or heavy. It’s not a meaningful question, and it doesn’t have a meaningful answer. Indeed, the answer you compute depends on the objects you sample, just like meta-analysts’ averages depend on the particular – and non-random – constellation of studies that happen to be in their sample. If most studies in the literature happen to use blatant reminders, then the average effect of reminders will be big. If most studies happen to use subtle reminders, then the average effect will be small. If the goal is to simply compute the average effect size observed in this non-random sample of studies that you happen to get your hands on, then you can do that. But that is not the goal. The goal seems to be to ask, “In general, what is the effect of reminders?” That question has no meaningful answer. And so we shouldn’t try to answer it.
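With toy numbers of our own (again, not from any actual meta-analysis), the dependence on the sample is easy to see:

```python
import numpy as np

# Made-up per-study effect sizes: blatant reminders work, subtle ones barely do.
BLATANT_D, SUBTLE_D = 0.60, 0.05

def average_reminder_effect(n_blatant, n_subtle):
    """Unweighted meta-analytic mean for a given mix of studies."""
    return np.mean([BLATANT_D] * n_blatant + [SUBTLE_D] * n_subtle)

print(f"{average_reminder_effect(16, 4):.2f}")   # 0.49 -> "reminders work"
print(f"{average_reminder_effect(4, 16):.2f}")   # 0.16 -> "reminders barely work"
```

Same kind of manipulation, completely different “average effect,” depending on which studies happened to be run and published.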
Conclusion
We have no vendetta against meta-analysis. We have a vendetta against meaningless means. If done for exploratory purposes, meta-analytic efforts can potentially generate novel hypotheses that can be tested in future research; eventually, that new research may bear fruit and generate important new discoveries [4]. But meta-analyses are usually not pitched as exploratory efforts, but rather as the definitive take on a research question. And if those meta-analyses are invalid – as we think many of them are – then this is bad. Like, really bad.
Our next post in this series will be on Thursday, when we discuss a recent meta-analysis investigating the effects of nudges on behavior change.
Footnotes.
- Meta-analysis can mean many things, but in this series we use it to refer to the ubiquitous practice of *averaging* across effect sizes from different studies. So really when we are critiquing meta-analysis, we are critiquing meta-analytic *averaging*. [↩]
- For an earlier exposition of similar concerns, see e.g., Sharpe (1997; .htm). We hope the inclusion of many concrete examples in this series will be more persuasive than past critiques have been. [↩]
- For example, imagine an experiment that gives some people a $20 bill and other people a $1 bill and observes increased spending in the former condition. If that paper is interpreted as “increased cash reserves increase spending” we would consider it valid, and wouldn’t object to its inclusion in the “cash reserve” meta-analysis. If, however, it was interpreted as money priming (i.e., people in the former condition are primed with higher denominations and therefore spend more) we would see the paper as invalid as a test of that hypothesis, and certainly invalid to include in a priming meta-analysis. [↩]
- Indeed, we are big fans of forest plots that display effect sizes across studies, for those plots can sometimes encourage the formation of testable hypotheses about what is driving variation in effect sizes. [↩]