[63] “Many Labs” Overestimated The Importance of Hidden Moderators

Are hidden moderators a thing? Do experiments intended to be identical lead to inexplicably different results?

Back in 2014, the “Many Labs” project (.pdf) reported an ambitious attempt to answer these questions. More than 30 different labs ran the same set of studies and the paper presented the results side-by-side. They did not find any evidence that hidden moderators explain failures to replicate, but did conclude that hidden moderators play a large role in studies that do replicate.

Statistically savvy observers now cite the Many Labs paper as evidence that hidden moderators, “unobserved heterogeneity”, are a big deal. For example, McShane & Böckenholt (.pdf) cite only the Many Labs paper to justify this paragraph: “While accounting for heterogeneity has long been regarded as important in meta-analyses of sets of studies that consist of […] [conceptual] replications, there is mounting evidence […] this is also the case [with] sets of [studies] that use identical or similar materials.” (p.1050)

Similarly, van Aert, Wicherts, and van Assen (.pdf) conclude heterogeneity is something to worry about in meta-analysis by pointing out that “in 50% of the replicated psychological studies in the Many Labs Replication Project, heterogeneity was present” (p.718).

In this post I re-analyze the Many Labs data and conclude the authors substantially over-estimated the importance of hidden moderators in their data.

Aside:  This post was delayed a few weeks because I couldn’t reproduce some results in the Many Labs paper.  See footnote for details [1].

How do you measure hidden moderators?
In meta-analysis one typically tests for the presence of hidden moderators, “unobserved heterogeneity”, by comparing how much the dependent variable jumps around across studies to how much it jumps around within studies (this is analogous to ANOVA, if that helps). Intuitively, when the differences are bigger across studies than within, we conclude that there is a hidden moderator across studies.

This was the approach taken by Many Labs [2]. Specifically, they reported a statistic called I2 for each study. I2 measures the percentage of the variation across studies that is surprising.  For example, if a meta-analysis has I2=40%, then 60% of the observed differences across studies is attributed to chance, and 40% is attributed to hidden moderators.
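
For concreteness, here is a minimal sketch (not the Many Labs code) of how I2 is obtained in practice, using the metafor package; the effect sizes and variances below are made up for illustration:

    # minimal sketch: I2 for a toy meta-analysis (made-up numbers), using the metafor package
    library(metafor)
    yi <- c(.40, .55, .30, .65, .20)      # made-up study-level effect sizes (Cohen's d)
    vi <- c(.020, .030, .025, .040, .015) # made-up sampling variances
    m  <- rma(yi = yi, vi = vi)           # random-effects meta-analysis
    m$I2                                  # share of across-study variation attributed to heterogeneity
    m$QEp                                 # p-value of the Q-test for heterogeneity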

Aside: in my opinion the I2 is kind of pointless.
We want to know if heterogeneity is substantial in absolute, not relative terms. No matter how inconsequential a hidden moderator is, as you increase the sample size of the underlying studies, you will decrease the variation due to chance, and thus increase I2. Any moderator, no matter how small, can approach I2=100% with a big enough sample. To me, saying that a particular design is rich in heterogeneity because it has I2=60% is a lot like saying that a particular person is rich because she has something that is 60% gold, without specifying if the thing is her earring or her life-sized statue of Donald Trump. But I don’t know very much about gold. (One could report, instead, the estimated standard deviation of effect size across studies, ‘tau’ / τ.) Most meta-analyses rely on I2, so this is in no way a criticism of the Many Labs paper in particular.

Criticisms of I2 along these lines have been made before, see e.g., Rucker et al (.pdf) or Borenstein et al (.pdf).

I2 in Many Labs
Below is a portion of Table 3 in their paper:


For example, the first row shows that for the first anchoring experiment, 59.8% of the variation across labs is being attributed to some hidden moderator, and the remaining 40.2% to chance.

Going down the table, we see that just over half the studies have a significant I2. Notably, four of these involve questions where participants are given an anchor and then asked to generate an open-ended numerical estimate.

For some time I have wondered if the strange distributions of responses that one tends to get with anchoring questions (bumpy, skewed, some potentially huge outliers; see histograms .png), or the data-cleaning that such responses led the Many Labs authors to take, may have biased upwards the apparent role of hidden moderators for these variables.

For this post I looked into it, and it seems like the answer is: ay, squared.

Shuffling data.
In essence, I wanted to answer the question: “If there were no hidden moderators what-so-ever in the anchoring questions in Many Labs, how likely would it be that the authors would find (false-positive) evidence for them anyway?”

To answer this question I could run simulations with made up data. But because the data were posted, there is a much better solution: run simulations with the real data.   Importantly, I run the simulations “under the null,” where the data are minimally modified to ensure there is in fact no hidden moderator, and then we assess if I2 manages to realize that (it does not). This is essentially a randomization/reallocation/permutation test [3].

The posted “raw” datafile (.csv | 38Mb) has more than 6000 rows, one for each participant, across all labs. The columns have the variables for each of the 13 studies. There is also a column indicating which lab the participant is from. To run my simulations I shuffle that “lab” column. That is, I randomly sort that column, and only that column, keeping everything else in the spreadsheet intact. This creates a placebo lab column which cannot be correlated with any of the effects. With the shuffled column the effects are all, by construction, homogeneous, because each observation is equally likely to originate in any “lab.” This means that variation within lab must be entirely comparable to variation across labs, and thus that the true I2 is equal to zero.  When testing for heterogeneity in this shuffled dataset we are asking: are observations randomly labeled “Brian’s Lab” systematically different from those randomly labeled “Fred’s Lab”? Of course not.
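
In code, the shuffle is a single line. Here is a minimal sketch (the file and column names are placeholders, not the ones in the posted .csv):

    # placebo-lab shuffle: re-sort the lab column only, leave everything else intact
    ml <- read.csv("ManyLabs.csv")   # hypothetical file name
    ml$lab <- sample(ml$lab)         # shuffled "lab" labels; true heterogeneity is now zero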

This approach sounds kinda cool and kinda out-there. Indeed it is super cool, but it is as old as it gets. It is closely related to permutation tests, which were developed in the 1930s, when hypothesis testing was just getting started [4].

Results.
I shuffled the dataset, conducted a meta-analysis on each of the four anchoring questions, computed I2, and repeated this 1000 times (R Code).  The first thing we can ask is how many of those meta-analyses led to a significant, false-positive, I2.  How often do they wrongly conclude there is a hidden moderator?

The answer should be, for each of the 4 anchoring questions: “5%”.
But the answers are: 20%, 49%, 47% and 46%.

Yikes.
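
For readers who want to see the mechanics, here is a stripped-down sketch of one shuffle iteration for a single anchoring question (column names are placeholders; the actual analysis, including the data cleaning, is in the R Code linked above):

    # sketch of one shuffle iteration for one (hypothetical) anchoring question
    library(metafor)
    one.shuffle <- function(ml) {
      ml$lab <- sample(ml$lab)                          # placebo lab column
      es <- do.call(rbind, lapply(split(ml, ml$lab), function(d) {
        x <- d$dv[d$condition == "high"]                # high-anchor responses (placeholder names)
        y <- d$dv[d$condition == "low"]                 # low-anchor responses
        escalc(measure = "SMD",
               m1i = mean(x), sd1i = sd(x), n1i = length(x),
               m2i = mean(y), sd2i = sd(y), n2i = length(y))
      }))
      m <- rma(yi, vi, data = es)                       # random-effects meta-analysis across "labs"
      c(I2 = m$I2, p = m$QEp)                           # heterogeneity estimate and its test
    }
    # sims <- t(replicate(1000, one.shuffle(ml)))
    # mean(sims[, "p"] < .05)                           # should be ~5%; for anchoring it is much higher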

The figures below show the distributions of I2 across the simulations.

Figure 1: For the 4 anchoring questions, the hidden-moderators test Many Labs used is invalid: high false-positive rate, inflated estimates of heterogeneity (R Code).

For example, for Anchoring 4, we see that the median estimate is that I2=38.4% of the variance is caused by hidden moderators, and 46% of the time it concludes there is statistically significant heterogeneity. Recall, the right answer is that there is zero heterogeneity, and only 5% of the time we should conclude otherwise. [5],[6],[7].

Validating the shuffle.
To reassure readers the approach I take above is valid, I submitted normally distributed data to the shuffle test. I added three columns to the Many Labs spreadsheets with made up data: In the first, the true anchoring effect was d=0 in all labs. In the second it was d=.5 in all labs. And in the third it was on average d=.5, but it varied across labs with sd(d)=.2 [8].
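
For concreteness, here is roughly how such columns can be generated (a sketch with made-up column names; note that for the heterogeneous column the actual implementation, described in footnote 8, made the effect proportional to the length of the lab’s name, whereas here lab effects are simply drawn from a normal):

    # sketch: three fake dependent variables appended to the (hypothetical) ml data frame
    cond <- rbinom(nrow(ml), 1, .5)                            # random two-cell assignment
    ml$fake.d0 <- rnorm(nrow(ml), mean = 0)                    # true d = 0 in every lab
    ml$fake.d5 <- rnorm(nrow(ml), mean = .5 * cond)            # true d = .5 in every lab
    labs  <- unique(as.character(ml$lab))
    lab.d <- setNames(rnorm(length(labs), mean = .5, sd = .2), labs)          # lab-specific true effects
    ml$fake.het <- rnorm(nrow(ml), mean = lab.d[as.character(ml$lab)] * cond) # d ~ N(.5, .2) across labs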

Recall that because the shuffle-test is done under the null, 5% of the results should be significant (false-positive), and I should get a ton of I2=0% estimates, no matter which of the three true effects are simulated.

That’s exactly what happens.

Figure 2: for normally distributed data, ~5% false-positive rate, and lots of accurate I2=0% estimates (R Code).

I suspect I2 can tolerate fairly non-normal data, but anchoring (or perhaps the intensive way it was ‘cleaned’) was too much for it. I have not looked into which specific aspect of the data, or possibly the data cleaning, disconcerts I2 [9].

Conclusions.
The authors of Many Labs saw the glass half full, concluding hidden moderators were present only in studies with big effect sizes. The statistically savvy authors of the opening paragraphs saw it mostly-empty, warning readers of hidden moderators everywhere. I see the glass mostly-full: the evidence that hidden moderators influenced large-effect anchoring studies appears to be spurious.

We should stop arguing hidden moderators are a thing based on the Many Labs paper.


Author feedback
Our policy is to share drafts of blog posts that discuss someone else’s work with them to solicit feedback (see our updated policy .htm). A constructive dialogue with several authors from Many Labs helped make progress with the reproducibility issues (see footnote 1) and improve the post more generally. They suggested I share the draft with Robbie van Aert, Jelte Wicherts and Marcel Van Assen and I did. We had an interesting discussion on the properties of randomization tests, shared some R Code back and forth, and they alerted me to the existence of the Rucker et al paper cited above.

In various email exchanges the term “hidden moderator” came up. I use the term literally: there are things we don’t see (hidden) that influence the size of the effect (moderators), thus to me it is synonymous with unobserved (hidden) heterogeneity (moderators). Some authors were concerned that “hidden moderator” is a loaded term used to excuse failures to replicate. I revised the writing of the post taking that into account, hopefully making it absolutely clear that the Many Labs authors concluded that hidden moderators are not responsible for the studies that failed to replicate.




Footnotes.

  1. The most important problem I had was that I could not reproduce the key results in Table 3 using the posted data. As the authors explained when I shared a draft of this post, a few weeks after first asking them about this, the study names are mismatched in that table, such that the results for one study are reported in the row for a different study.  A separate issue is that I tried to reproduce some extensive data cleaning performed on the anchoring questions but gave up when noticing large discrepancies between sample sizes described in the supplement and present in the data. The authors have uploaded a new supplement to the OSF where the sample sizes match the posted data files (though not the code itself that would allow reproducing the data cleaning). []
  2. they also compared lab to online, and US vs non-US, but these are obviously observable moderators []
  3. I originally called this “bootstrapping,” and many readers complained. Some were not aware that modifying the data so that the null is true can also be called bootstrapping. Check out section 3 in this “Introduction to the bootstrap world” by Boos (2003) .pdf. But others were aware of it and nevertheless objected because I keep the data fixed without resampling, so to them “bootstrapping” was misleading. “Permutation test” is not perfect either, because permutation tests are associated with trying all permutations. “Reallocation test” is not perfect because reallocation tests usually involve swapping the treatment column, not the lab column. This is a purely semantic issue, as we all know what the test I run consisted of. []
  4. Though note that here I randomize the data to see if a statistical procedure is valid for the data, not to adjust statistical significance []
  5. It is worth noting that the other four variables with statistically significant heterogeneity in Table 3 do not suffer from the same unusual distributions and/or data-cleaning procedures as do the four anchoring questions. But, because they are quite atypical exemplars of psychology experiments, I would personally not base estimates of heterogeneity in psychology experiments in general on them. One is not an experiment, and two involve items that are more culturally dependent than most psychological constructs: reactions to George Washington and to an anti-democracy speech. The fourth is barely significantly heterogeneous: p=.04. But this is just my opinion []
  6. For robustness I re-ran the simulations keeping the number of observations in each condition constant within lab; the results were just as bad. []
  7. One could use these distributions to compute an adjusted p-value for heterogeneity; that is, compute how often we get a p-value as low as the one obtained by Many Labs, or an I2 as high as they obtained, among the shuffled datasets. But I do not do that here because such calculations should build in the data-cleaning procedures, and code for such data cleaning is not available. []
  8. In particular, I made the effect size proportional to the length of the lab’s name. So for the rows in the data where the lab is called “osu” the effect is d=.3, and where it is “Ithaca” the effect is d=.6. As it happens, the average name length is about 5 characters, so this works, and the sd of the implied effect sizes is .18, which I round to .2 in the figure title. []
  9. BTW, I noted that tau makes more sense conceptually than I2. But mathematically, for a given dataset, tau is a simple transformation of I2 (or rather, vice versa: I2 = tau^2/(total variance)), thus if one is statistically invalid, so is the other. Tau is not a more robust statistic than I2 is. []

[62] Two-lines: The First Valid Test of U-Shaped Relationships

Can you have too many options in the menu, too many talented soccer players in a national team, or too many examples in an opening sentence? Social scientists often hypothesize u-shaped relationships like these, where the effect of x on y starts positive and becomes negative, or starts negative and becomes positive. Researchers rely almost exclusively on quadratic regressions to test if a u-shape is present (y = ax + bx²), typically asking if the b term is significant [1].

The problem is that quadratic regressions are not diagnostic regarding u-shapedness. Under realistic circumstances the quadratic has 100% false-positive rate, for example, concluding y=log(x) is u-shaped. In addition, under plausible circumstances, it can obtain 0% power: failing to diagnose a function that is u-shaped as such, even with an infinite sample size [2].

With Leif we wrote Colada[27] on this problem a few years ago. We got started on a solution that I developed further in a recent working paper (SSRN). I believe it constitutes the first general and valid test of u-shaped relationships.

Two-lines
The test consists of estimating two regression lines, one for ‘low’ values of x, another for ‘high’ values of x. A u-shape is present if the two lines have opposite signs and are individually significant. The difficult thing is setting the breakpoint: what counts as a low vs a high x? Most of the work in the project went into developing a procedure to set the breakpoint that would maximize power [3].

The solution is what I coined the “Robin Hood” procedure; it increases the statistical power of the u-shape test by setting a breakpoint that strengthens the weaker line at the expense of the stronger one.
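
To make the logic concrete, here is a minimal sketch in R; the breakpoint below is a simple median split standing in for the Robin Hood procedure, and, as in the paper, the two lines are estimated within a single interrupted regression (the data are made up):

    # minimal sketch of a two-lines test; the median stands in for the Robin Hood breakpoint
    set.seed(1)
    x <- runif(200, 0, 10)
    y <- -(x - 5)^2 + rnorm(200, sd = 3)             # made-up data with a true inverted u-shape
    xc   <- median(x)                                # breakpoint (Robin Hood in the actual test)
    low  <- as.numeric(x <= xc)
    high <- 1 - low
    m <- lm(y ~ 0 + low + high + low:x + high:x)     # interrupted regression: two intercepts, two slopes
    summary(m)       # u-shape: the two slopes on x have opposite signs and are each significant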

The paper (SSRN) discusses its derivation and computation in detail. Here I want to focus on its performance.

Performance
To gauge the performance of this test, I ran a horse-race between the quadratic and the two-lines procedure. The image below previews the results.

The next figure has more details.

It reports the share of simulations that obtain a significant (p<.05) u-shape for many scenarios. For each scenario, I simulated a relationship that is not u-shaped (Panel A), or one that is u-shaped (Panel B). I changed a bunch of things across scenarios, such as the distribution of x, the level of noise, etc. The figure caption has all the details you may want.

 
Panel A shows the quadratic has a comically high false-positive rate. It sees u-shapes everywhere.

Panel B – now excluding the fundamentally invalid quadratic regression – shows that for the two-lines test, Robin Hood is the most powerful way to set the breakpoint. Interestingly, the worst way is what we had first proposed in Colada[27], splitting the two lines at the highest point identified by the quadratic regression. The next worst is to use 3 lines instead of 2 lines.

Demonstrations.
The next figure applies the two-lines procedure to data from two published psych papers in which a u-shaped hypothesis appears to be spuriously supported by the invalid quadratic-regression test. In both cases the first, positive line is highly significant, but the supposed sign reversal for high xs is not close to significant.

It is important to state that the (false-positive) u-shaped hypothesis by Sterling et al. is incidental to their central thesis.

Fortunately the two-lines test does not refute all hypothesized u-shapes. A couple of weeks ago Rogier Kievit tweeted about a paper of his (.htm) which happened to involve testing a u-shaped hypothesis. Upon seeing their Figure 7 I replied:


Rogier’s reply:

Both u-shapes were significant via the two-lines procedure.

Conclusions
Researchers should continue to have u-shaped hypotheses, but they should stop using quadratic regressions to test them. The two-lines procedure, with its Robin Hood optimization, offers an alternative that is both valid and statistically powerful.

  • Full paper: SSRN.
  • Online app (with R code): .html
  • Email me questions or comments.



Frequently Asked Questions about two-lines test.
1. Wouldn’t running a diagnostic plot after the quadratic regression prevent researchers from mis-using it?
Leaving aside the issue that in practice diagnostic plots are almost never reported in papers (thus presumably not run), it is important to note that diagnostic plots are not diagnostic, at least not of u-shapedness. They tell you whether, overall, you have a perfect model (which you never do), not whether you are making the right inference. Because true data are almost never exactly quadratic, diagnostic plots will look slightly off whether a relationship is or is not u-shaped. See concrete example (.pdf).

2. Wouldn’t running 3 lines be better, allowing for the possibility that there is a middle flat section in the U, thus increasing power?
Interestingly, this intuition is wrong: if you have three lines instead of two, you will suffer a dramatic loss of power to detect u-shapes. In Panel B in the figure after the fallen horse, you can see the poorer performance of the 3-line solution.

3. Why fit two-lines to the data, an obviously wrong model, and not a spline, kernel regression, general additive model, etc?
Estimating two regression lines does not assume the true model has two lines. Regression lines are unbiased estimates of the average slope in a region, whether the slope is constant in that region or not. The specification error is thus inconsequential. To visually summarize what the data look like, a spline, kernel regression, etc., is indeed better (and the online app for the two-lines test reports such a line; see Rogier’s plots above). But these lines do not allow testing holistic data properties, such as whether a relationship is u-shaped. In addition, these models require that researchers set arbitrary parameters, such as how much to smooth the data; one arbitrary choice of smoothing may lead to an apparent u-shape while another may not. Flexible models are seldom the right choice for hypothesis testing.

4. Shouldn’t one force the two lines to meet? Surely there is no discontinuity in the true model.
No. It is imperative that the two lines be allowed not to meet; that is, to run an “interrupted” rather than a “segmented” regression. This goes back to Question 3. We are not trying to fit the data overall as well as possible, but merely to compute an average slope within a set of x-values. If you don’t allow the two lines to be interrupted, you are no longer computing two separate means and can have a very high false-positive rate detecting u-shapes (see example .png).





Footnotes.

  1. At least three papers have proposed a slightly more sophisticated test of u-shapedness relying on quadratic regression (Lind & Mehlum .pdf | Miller et al. .pdf | Spiller et al. .pdf). All three propose, actually, a mathematically equivalent solution, and that solution is only valid if the true relationship is quadratic (and not, for example, y=log(x)). This assumption is strong, unjustified, and untestable, and if it is not met, the results are invalid. When I showcase the quadratic regression performing terribly in this post, I am relying on this more sophisticated test. The more commonly used test, simply looking at whether b is significant, fares worse. []
  2. The reason for the poor performance is that quadratic regressions assume the true relationship is, well, quadratic, and if it is not (and why would it be?), anything can happen. This is in contrast to linear regression, which assumes linearity, but if linearity is not met the linear regression is nevertheless interpretable as an unbiased estimate of the average slope. []
  3. The two lines are estimated within a single interrupted regression for greater efficiency []

[61] Why p-curve excludes ps>.05

In a recent working paper, Carter et al (.pdf) proposed that one can better correct for publication bias by including not just p<.05 results, the way p-curve does, but also p>.05 results [1]. Their paper, currently under review, aimed to provide a comprehensive simulation study that compared a variety of bias-correction methods for meta-analysis.

Although the paper is well written and timely, the advice is problematic. Incorporating non-significant results into a tool designed to correct for publication bias requires one to make assumptions about how difficult it is to publish each possible non-significant result. For example, one has to make assumptions about how much more likely an author is to publish a p=.051 than a p=.076, or a p=.09 in the wrong direction than a p=.19 in the right direction, etc. If the assumptions are even slightly wrong, the tool’s performance becomes disastrous [2].

Assumptions and p>.05s
The desire to include p>.05 results in p-curve type analyses is understandable. Doing so would increase our sample sizes (of studies), rendering our estimates more precise. Moreover, we may be intrinsically interested in learning about studies that did not get to p<.05.

So why didn’t we do that when we developed p-curve? Because we wanted a tool that would work well in the real world.  We developed a good tool, because the perfect tool is unattainable.

While we know that the published literature generally does not discriminate among p<.05 results (e.g., p=.01 is not perceptibly easier to publish than is p=.02), we don’t know how much easier it is to publish some non-significant results rather than others.

The downside of p-curve focusing only on p<.05 is that p-curve can “only” tell us about the (large) subset of published results that are statistically significant. The upside is that p-curve actually works.

All p>.05 are not created equal
The simulations reported by Carter et al. assume that all p>.05 findings are equally likely to be published: a p=.051 in the right direction is as likely to be published as a p=.051 in the wrong direction. A p=.07 in the right direction is as likely to be published as a p=.97 in the right direction. If this does not sound implausible to you, we recommend re-reading this paragraph.

Intuitively it is easy to see how getting this assumption wrong will introduce bias. “Imagine” that a p=.06 is easier to publish than is a p=.76. A tool that assumes both results are equally likely to be published will be naively impressed when it sees many more p=.06s than p=.76s, and it will fallaciously conclude there is evidential value when there isn’t any.
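
To see the problem in a quick simulation of my own (not Carter et al.’s code): generate null studies, publish every significant one, and make just-missed results much more publishable than clearly null ones. The published record then contains far more p=.06s than p=.76s, exactly the asymmetry that a method assuming equal publishability of all p>.05 results will misread as evidential value.

    # sketch: null effects + selective publication of "almost significant" results
    set.seed(1)
    p <- replicate(20000, t.test(rnorm(20), rnorm(20))$p.value)    # true effect is zero
    pub.prob  <- ifelse(p < .05, 1, ifelse(p < .10, .30, .02))     # assumed publication probabilities
    published <- p[runif(length(p)) < pub.prob]
    sum(published > .05 & published < .10)    # many p = .06-type results survive
    sum(published > .70 & published < .75)    # very few p = .72-type results do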

A calibration
We ran simulations matching one of the setups considered by Carter et al., and assessed what happens if the publishability of p>.05 results deviated from their assumptions (R Code). The black bar in the figure below shows that if their fantastical assumption were true, the tool would do well, producing a false-positive rate of 5%. The other bars show that under some (slightly) more realistic circumstances, false-positives abound.

One must exclude p>.05
It is obviously not true that all p>.05s are equally publishable. But no alternative assumption is plausible. The mechanisms that influence the publication of p>.05 results are too unknowable, complex, and unstable from paper to paper, to allow one to make sensible assumptions or generate reasonable estimates. The probability of publication depends on the research question, on the authors’ and editors’ idiosyncratic beliefs and standards, on how strong other results in the paper are, on how important the finding is for the paper’s thesis, etc.  Moreover, comparing the 2nd and 3rd bar in the graph above, we see that even minor quantitative differences in a face-valid assumption make a huge difference.

P-curve is not perfect. But it makes minor and sensible assumptions, and it is robust to realistic deviations from those assumptions. Specifically, it assumes that all p<.05 results are equally publishable regardless of the exact p-value they have. This captures how most researchers perceive publication bias to occur (at least in psychology). Its inferences about evidential value are robust to relatively large deviations from this assumption (e.g., if researchers start aiming for p<.045 instead of p<.05, or even p<.035, or even p<.025, p-curve analysis, as implemented in the online app (.htm), will falsely conclude there is evidential value when the null is true no more than 5% of the time; see our “Better P-Curves” paper (SSRN)).

Conclusion
With p-curve we can determine whether a set of p<.05 results have evidential value, and what effect we may expect in a direct replication of those studies.  Those are not the only questions you may want to ask. For example, traditional meta-analysis tools ask what is the average effect of all of the studies that one could possibly run (whatever that means; see Colada[33]), not just those you observe. P-curve does not answer that question. Then again, no existing tool does. At least not even remotely accurately.

P-curve tells you “only” this: If I were to run these statistically significant studies again, what should I expect?



Author feedback.
We shared a draft of this post with Evan Carter, Felix Schönbrodt, Joe Hilgard and Will Gervais. We had an incredibly constructive and valuable discussion, sharing R Code back and forth and jointly editing segments of the post.

We made minor edits after posting in response to readers’ feedback. The original version is archived here: .htm.



Footnotes.

  1. When p-curve is applied to estimate effect size, it is extremely similar to the “one-parameter selection model” of Hedges (1984) (.pdf). []
  2. Their paper is nuanced in many sections, but their recommendations are not. For example, they write in the abstract, “we generally recommend that meta-analysis of data in psychology use the three-parameter selection model.” []

[60] Forthcoming in JPSP: A Non-Diagnostic Audit of Psychological Research

A forthcoming article in the Journal of Personality and Social Psychology has made an effort to characterize changes in the behavior of social and personality researchers over the last decade (.pdf). In this post, we refer to it as “the JPSP article” and to the authors as “the JPSP authors.” The research team, led by Matt Motyl, uses two strategies. In the first, they simply ask a bunch of researchers how they have changed. Fewer dropped dependent variables? More preregistration? The survey is interesting and worth a serious look.

The other strategy they employ is an audit of published research from leading journals in 2003/2004 and again from 2013/2014. The authors select a set of studies and analyze them with a variety of contemporary metrics designed to assess underlying evidence. One of those metrics is p-curve, a tool the three of us developed together (see p-curve.com)  [1]. In a nutshell, p-curve analysis uses the distribution of significant p-values testing the hypotheses of interest in a set of studies to assess the evidential value of the set [2]. We were very interested to see how the JPSP authors had used it.

In any given paper, selecting the test that’s relevant for the hypothesis of interest can be difficult for two reasons. First, sometimes papers simply do not report it [3].  Second, and more commonly, when relevant tests are reported, they are surrounded by lots of other results: e.g., manipulation checks, covariates, and omnibus tests.  Because these analyses do not involve the hypothesis of interest, their results are not relevant for evaluating the evidential value of the hypothesis of interest. But p-curvers often erroneously select them anyway.  To arrive at relevant inferences about something, you have to measure that something, and not measure something else.

As we show below, the JPSP authors too often measured something else. Their results are not diagnostic of the evidential value of the surveyed papers. Selecting irrelevant tests invalidates not only conclusions from p-curve analysis, but from any analysis.

Selecting the right tests
When we first developed p-curve analysis we had some inkling that this would be a serious issue, and so we talked about p-value selection extensively in our paper (see Figure 5, SSRN), user guide (.pdf), and online app instructions (.htm). Unfortunately, authors, and reviewers, are insufficiently attentive to these decisions.

When we review papers using p-curve, about 95% of our review time is spent considering how the p-values were selected. The JPSP authors included 1,800 p-values in their paper, an extraordinary number that we cannot thoroughly review. But an evaluation of even a small number of them makes clear that the results reported in the paper are erroneous. To arrive at a diagnostic result, one would need to go back and verify or correct all 1,800 p-values. One would need to start from scratch.

The JPSP authors posted all the tests they selected (.csv). We first looked at the selection decisions they had rated as “very easy.”  The first decision we checked was wrong. So was the second. Also the third. And the 4th, the 5th, the 6th, the 7th and the 8th.  The ninth was correct.

To convey the intuition for the kinds of selection errors in the JPSP article, and to hopefully prevent other research teams from committing the same mistakes, we will share a few notable examples, categorized by the type of error. This is not an exhaustive list.

Error 1. Selecting the manipulation check
Experimenters often check to make sure that they got the manipulation right before testing its effect on the critical dependent variable. Manipulation checks are not informative about the hypothesis of interest and should not be selected. This is not controversial. For example, the authors of the JPSP article instructed their coders that “manipulation checks should not be counted.” (.pdf)

Unfortunately, the coders did not follow these instructions.

For example, from an original article that manipulated authenticity to find out if it influences subjective well-being, the authors of the JPSP article selected the manipulation check instead of the effect on well being.

Whereas the key test has a familiar p-value of .02, the manipulation check has a supernatural p-value of 10⁻³². P-curve sees those rather differently.

Error 2. Selecting an omnibus test.
Omnibus tests look at multiple means at once and ask “are any of these means different from any of the other means?” Psychological researchers almost never ask questions like that. Thus omnibus tests are almost never the right test to select in psychological research. The authors of the JPSP article selected about 200 of them.

Here is one example.  An original article examined satisfaction with bin Laden’s death. In particular, it tested whether Americans (vs. non-Americans), would more strongly prefer that bin Laden be killed intentionally rather than accidentally [4].

The results:

This is a textbook attenuated interaction prediction: the effect is bigger here than over there. Which interaction to select is nevertheless ambiguous: Should we collapse Germans and Pakistanis into one non-American bin? Should we include “taken to court”? Should we collapse across all forms of killings or do a separate analysis for each type? Etc. The original authors, therefore, had a large menu of potentially valid analyses to choose from, and thus so did the JPSP authors. But they chose an invalid one instead. They selected the F(10,2790)=31.41 omnibus test:

The omnibus test they selected does not test the interaction of interest. It is irrelevant for the original paper, and so it is irrelevant to use to judge the evidential value of that paper. If the original authors were wrong (so Americans and non-Americans actually felt the same way about accidental vs intentional bin Laden’s death), the omnibus test would still be significant if Pakistanis were particularly dissatisfied with the British killing bin Laden, or if the smallest American vs non-American difference was for “killed in airstrike” and the largest for “killed by British”. And so on [5].

Error 3. Selecting the non-focal test
Often researchers interested in interactions first report a simple effect, but only the interaction tests the hypothesis of interest. First the set-up:

So there are two groups of people and for each the researchers measure pro-white bias. The comparison of the two groups is central. But, since that comparison is not the only one reported, there is room for p-curver error. The results:


Three p-values. One shows the presence of a pro-white bias in the control condition, the next shows the presence of a pro-white bias in the experimental condition, and the third compares the experimental condition to the control condition. The third one is clearly the critical test for the researchers, but the JPSP authors pulled the first one. Again, the difference is meaningful: p = .048 vs. p = .00001.

Note: we found many additional striking and consequential errors.  Describing them in easy-to-understand ways is time consuming (about 15 minutes each) but we prepared three more in a powerpoint (.pptx)

Conclusion
On the one hand, it is clear that the JPSP authors took this task very seriously. On the other hand, it is just as clear that they made many meaningful errors, and that the review process fell short of what we should expect from JPSP.

The JPSP authors draw conclusions about the status of social psychological and personality research. We are in no position to say whether their conclusions are right or wrong. But neither are they.



Author feedback.
We shared a draft of this post on May 3rd with Matt Motyl (.htm) and Linda Skitka (.htm); we exchanged several emails but, despite asking several times, did not receive any feedback on the post. They did draft a response, but they declined to share it with us before posting it, and chose not to post it here.

Also, just before emailing Matt about our post last week, he coincidentally emailed us. He indicated that various people had identified (different) errors in their use of p-curve analysis in their paper and asked us to help correct them. Since Matt indicated being interested in fixing such errors before the paper is officially published in JPSP, we do not discuss them here (but may do so in a future post).



Footnotes.

  1. Furthermore, and for full disclosure, Leif is part of a project that has similar goals (https://osf.io/ngdka/). []
  2. Or, if you prefer a slightly larger nutshell: P-curve is a tool which looks at the distribution of critical and significant p-values and makes an assessment of underlying evidential value. With a true null hypothesis, significant p-values will be distributed uniformly between 0 and .05. When the null is false (i.e., an alternative is true), p-values will be distributed with right skew (i.e., more 0<p<.01 than .01<p<.02, etc.). P-curve analysis involves examining the skewness of a distribution of observed p-values in order to draw inferences about the underlying evidential value: more right-skewed means more evidential value; and evidential value, in turn, implies the expectation that direct replications would succeed. []
  3. For example, papers often make predictions about interactions but never actually test the interaction. Nieuwenhuis et al. find that, of 157 papers they looked at that tested interactions, only half (!) reported an interaction (.pdf) []
  4. Here is the relevant passage from the original paper; it involves two hypotheses, the second of which is emphasized more and mentioned in the abstract, so we focus on it here:

    []

  5. A second example of choosing the omnibus test is worth a mention, if only in a footnote. It comes from a paper by alleged fabricateur Larry Sanna. Here is a print-screen of footnote 5 in that paper. The highlighted omnibus test is the only result selected from this study.  The original authors here are very clearly stating that this is not their hypothesis of interest:
    []

[59] PET-PEESE Is Not Like Homeopathy

PET-PEESE is a meta-analytical tool that seeks to correct for publication bias. In a footnote in my previous post (.htm), I referred to it as the homeopathy of meta-analysis. That was unfair and inaccurate.

Unfair because, in the style of our President, I just called PET-PEESE a name instead of describing what I believed was wrong with it. I deviated from one of my rules for ‘menschplaining’ (.htm): “Don’t label, describe.”

Inaccurate because skeptics of homeopathy merely propose that it is ineffective, not harmful. But my argument is not that PET-PEESE is merely ineffective, I believe it is also harmful. It doesn’t just fail to correct for publication bias, it adds substantial bias where none exists.

note: A few hours after this blog went live, James Pustejovsky (.htm) identified a typo in the R Code which affects some results. I have already updated the code and figures below. (I archived the original post: .htm).

PET-PEESE in a NUT-SHELL
Tom Stanley (.htm), later joined by Hristos Doucouliagos, developed PET-PEESE in various papers that have each accumulated 100-400 Google cites (.pdf | .pdf). The procedure consists of running a meta-regression: a regression in which studies are the unit of analysis, with effect size as the dependent variable and its variance as the key predictor [1]. The clever insight by Stanley & Doucouliagos is that the intercept of this regression is the effect we would expect in the absence of noise, thus, our estimate of the -publication bias corrected- true effect [2].
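
As I understand it, the procedure can be sketched in a few lines of R with the metafor package (this is my own stripped-down rendition, not Stanley & Doucouliagos’s code; yi and vi are made-up effect sizes and variances standing in for a real meta-analytic dataset):

    # sketch of the conditional PET-PEESE estimator (my rendition, using metafor)
    library(metafor)
    yi <- c(.45, .30, .62, .25, .51, .38)                    # made-up study effect sizes
    vi <- c(.02, .05, .01, .08, .03, .04)                    # made-up sampling variances
    pet   <- rma(yi, vi, mods = ~ sqrt(vi), method = "FE")   # PET: regress effect size on its SE
    peese <- rma(yi, vi, mods = ~ vi,       method = "FE")   # PEESE: regress effect size on its variance
    # conditional step (see footnote 2): if PET's intercept test rejects, report PEESE's intercept;
    # otherwise report PET's intercept
    est <- if (pet$pval[1] < .05) coef(peese)["intrcpt"] else coef(pet)["intrcpt"]
    est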

PET-PEESE in Psychology
PET-PEESE was developed with the meta-analysis of economics papers in mind (regressions with non-standardized effects). It is possible that some of the problems identified here, which concern meta-analyses of standardized effect sizes (Cohen’s d), do not extend to such settings [3].

Psychologists have started using PET-PEESE recently. For instance, in meta-analyses about religious primes (.pdf), working memory training (.htm), and personality of computer wizzes (.htm). Probably the most famous example is Carter et al.’s meta-analysis of ego depletion, published in JEP:G (.pdf).

In this post I share simulation results that suggest we should not treat PET-PEESE estimates, at least of psychological research, very seriously. It arrives at wholly invalid estimates under too many plausible circumstances. Statistical tools need to be generally valid, or at least valid under predictable circumstances. PET-PEESE, to my understanding, is neither [4].

Results
Let’s start with a baseline case for which PET-PEESE does OK: there is no publication bias, every study examines the exact same effect size, and sample sizes are distributed uniformly between n=12 and n=120 per cell. Below we see that when the true effect is d=0, PET-PEESE correctly estimates it as d̂=0, and as d gets larger, d̂ gets larger (R Code).

About 2 years ago, Will Gervais evaluated PET-PEESE in a thorough blog post (.htm) (which I have cited in papers a few times). He found that in the presence of publication bias PET-PEESE did not perform well, but that in the absence of publication bias it at least did not make things worse. The simulations depicted above are not that different from his.

Recently, however, and by happenstance, I realized that Gervais got lucky with the simulations (or I guess PET-PEESE got lucky) [5]. If we deviate slightly from some of the specifics of the ideal scenario in any of several directions, PET-PEESE no longer performs well even in the absence of publication bias.

For example, imagine that sample sizes don’t go all the way to up n=120 per cell; instead, they go up to only n=50 per cell (as is commonly the case with lab studies) [6]:

A more surprisingly consequential assumption involves the symmetry of sample sizes across studies. Whether there are more small-n than large-n studies, or vice versa, PET-PEESE’s performance suffers quite a bit. For example, if sample sizes look like this:

then PET-PEESE looks like this:


Micro-appendix

1) It looks worse if there are more big n than small n studies (.png).
2) Even if studies have n=50 to n=120, there is noticeable bias if n is skewed across studies (.png)

It is likely, I believe, that real meta-analyses have skewed n distributions; e.g., this is what it looked like in that ego-depletion paper (note: it plots total N, not per-cell):

So far we have assumed all studies have the exact same effect size, say all studies in the d=.4 bin are exactly d=.4. In real life different studies have different effects. For example, a meta-analysis of ego-depletion may include studies with stronger and weaker manipulations that lead to, say, d=.5 and d=.3 respectively. On average the effect may be d=.4, but it moves around. Let’s see what happens if across studies the effect size has a standard deviation of SD=.2.

Micro-appendix
3) If big n studies are more common than small ns: .png
4) If n=12 to n=120 instead of just n=50, .png

Most troubling scenario
Finally, here is what happens when there is publication bias (only observe p<.05)


Micro-appendix
With publication bias,
5) If n goes up to n=120: .png
6) If n is uniform n=12 to n=50 .png
7) If d is homogeneous, sd(d)=0 .png

It does not seem prudent to rely on PET-PEESE, in any way, for analyzing psychological research. It’s an invalid tool under too many scenarios.



Author feedback.
Our policy is to share early drafts of our post with authors whose work we discuss. I shared this post with the creators of PET-PEESE, and also with others familiar with it: Will Gervais, Daniel Lakens, Joe Hilgard, Evan Carter, Mike McCullough and Bob Reed. Their feedback helped me identify an important error in my R Code, avoid some statements that seemed unfair, and become aware of the recent SPPS paper by Tom Stanley (see footnote 4). During this process I also learned, to my dismay, that people seem to believe -incorrectly- that p-curve is invalidated under heterogeneity of effect size. A future post will discuss this issue, impatient readers can check out our p-curve papers, especially Figure 1 in our first paper (here) and Figure S2 in our second (here), which already address it; but evidently insufficiently compellingly.

Last but not least, everyone I contacted was offered an opportunity to reply within this post. Both Tom Stanley (.pdf), and Joe Hilgard (.pdf) did.



Footnotes.

  1. Actually, that’s just PEESE; PET uses the standard error as the predictor []
  2. With PET-PEESE one runs both regressions. If PET is significant, one uses PEESE; if PET is not significant, one uses PET (!). []
  3. Though a working paper by Alinaghi and Reed suggests PET-PEESE performs poorly there as well .pdf []
  4. I shared an early draft of this paper with various peers, including Daniel Lakens and Stanley himself. They both pointed me to a recent paper in SPPS by Stanley (.pdf). It identifies conditions under which PET-PEESE gives bad results. The problems I identify here are different, and much more general than those identified there. Moreover, results presented here seem to directly contradict the conclusions from the SPPS paper. For instance, Stanley proposes that if the observed heterogeneity in studies is I2<80% we should trust PET-PEESE, and yet, in none of the simulations I present here, with utterly invalid results, is I2>80%; thus I would suggest to readers to not follow that advice. Stanley (.pdf) also points out that when there are 20 or fewer studies PET-PEESE should not be used; all my simulations assume 100 studies, and the results do not improve with a smaller sample of studies. []
  5. In particular, when preparing Colada[58] I simulated meta-analyses where, instead of choosing sample size at random, as the funnel plot assumes, researchers choose larger samples to study smaller effects. I found truly spectacularly poor performance by PET-PEESE, much worse than trim-and-fill. Thinking about it, I realized that if researchers do any sort of power calculations, even intuitive ones or ones based on experience, then a symmetric distribution of effect size leads to an asymmetric distribution of sample size. See this illustrative figure (R Code):

    So it seemed worth checking if asymmetry alone, even if researchers were to set sample size at random, led to worse performance for PET-PEESE. And it did. []
  6. e.g., using the d.f. of t-tests from scraped studies as data: back in 2010, the median n in Psych Science was about 18, and around 85% of studies were n<50 []

[58] The Funnel Plot is Invalid Because of This Crazy Assumption: r(n,d)=0

The funnel plot is a beloved meta-analysis tool. It is typically used to answer the question of whether a set of studies exhibits publication bias. That’s a bad question because we always know the answer: it is “obviously yes.” Some researchers publish some null findings, but nobody publishes them all. It is also a bad question because the answer is inconsequential (see Colada[55]). But the focus of this post is that the funnel plot gives an invalid answer to that question. The funnel plot is a valid tool only if all researchers set sample size randomly [1].

What is the funnel plot?
The funnel plot is a scatter-plot with individual studies as dots. A study’s effect size is represented on the x-axis, and its precision is represented on the y-axis. For example, the plot below, from  a 2014 Psych Science paper (.pdf), shows a subset of studies on the cognitive advantage of bilingualism.

The key question people ask when staring at funnel plots is: Is this thing symmetric?

If we observed all studies (i.e., if there was no publication bias), then we would expect the plot to be symmetric because studies with noisier estimates (those lower on the y-axis) should spread symmetrically on either side of the more precise estimates above them. Publication bias kills the symmetry because researchers who preferentially publish significant results will be more likely to drop the imprecisely estimated effects that are close to zero (because they are p > .05), but not those far from zero (because they are p < .05). Thus, the dots in the bottom left (but not in the bottom right) will be missing.

The authors of this 2014 Psych Science paper concluded that publication bias is present in this literature based in part on how asymmetric the above funnel plot is (and in part on their analysis of publication outcomes of conference abstracts).

The assumption
The problem is that the predicted symmetry hinges on an assumption about how sample size is set: that there is no relationship between the effect size being studied, d, and the sample size used to study it, n. Thus, it hinges on the assumption that r(n, d) = 0.

The assumption is false if researchers use larger samples to investigate effects that are harder to detect, for example, if they increase sample size when they switch from measuring an easier-to-influence attitude to a more difficult-to-influence behavior. It is also false if researchers simply adjust sample size of future studies based on how compelling the results were in past studies. If this happens, then r(n,d)<0 [2].

Returning to the bilingualism example, the funnel plot we saw above includes quite different studies; some studied how well young adults play Simon, others the age at which people got Alzheimer’s. The funnel plot above is diagnostic of publication bias only if the sample sizes researchers use to study these disparate outcomes are in no way correlated with effect size. If more difficult-to-detect effects lead to bigger samples, the funnel plot is no longer diagnostic [3].

A calibration
To get a quantitative sense of how serious the problem can be, I run some simulations (R Code).

I generated 100 studies, each with a true effect size drawn from d~N(.6,.15). Researchers don’t know the true effect size, but they guess it; I assume their guesses correlate .6 with the truth, so r(d,dguess)=.6.  Using dguess they set n for 80% power. No publication bias, all studies are reported [4].
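
Here is a condensed version of that simulation (a sketch that approximates, rather than reproduces, the linked R Code; the SE of d is approximated as sqrt(2/n), and guesses are floored at .15 to keep the power calculation well-behaved):

    # sketch: researchers set n based on a noisy guess of d; no publication bias at all
    set.seed(1)
    k <- 100
    d.true  <- rnorm(k, mean = .6, sd = .15)                  # true effect of each study
    d.guess <- pmax(.15, d.true + rnorm(k, 0, .2))            # researcher's guess, r(d, dguess) ~ .6
    n  <- sapply(d.guess, function(d)                         # per-cell n for 80% power, given the guess
            ceiling(power.t.test(delta = d, power = .80)$n))
    se <- sqrt(2 / n)                                         # approximate standard error of Cohen's d
    d.obs <- rnorm(k, d.true, se)                             # observed effects; ALL studies "published"
    library(metafor)
    funnel(rma(yi = d.obs, sei = se))                         # asymmetric funnel despite no publication bias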

The result: a massively asymmetric funnel plot.

That’s just one simulated meta-analysis; here is an image with 100 of them: (.png).

That funnel plot asymmetry above does not tell us “There is publication bias.”
That funnel plot asymmetry above tells us “These researchers are putting some thought into their sample sizes.”

Wait, what about trim and fill?
If you know your meta-analysis tools, you know the most famous tool to correct for publication bias is trim-and-fill, a technique that is entirely dependent on the funnel plot.  In particular, it deletes real studies (trims) and adds fabricated ones (fills), to force the funnel plot to be symmetric. Predictably, it gets it wrong. For the simulations above, where mean(d)=.6, trim-and-fill incorrectly “corrects” the point estimate downward by over 20%, to d̂=.46, because it forces symmetry onto a literature that should not have it (see R Code) [5].
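
Continuing the sketch above, metafor’s trimfill() shows the same behavior: it imputes ‘missing’ studies to restore symmetry and pulls the estimate below the true mean of .6 even though nothing is missing (again an approximation of the linked R Code, not a reproduction):

    # trim-and-fill applied to the bias-free simulation above
    m  <- rma(yi = d.obs, sei = se)
    tf <- trimfill(m)
    c(naive = coef(m), trimfill = coef(tf))   # trim-and-fill "corrects" an estimate that needed no correction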

Bottom line.
Stop using funnel plots to diagnose publication bias.
And stop using trim-and-fill and other procedures that rely on funnel plots to correct for publication bias.


Authors feedback.
Our policy is to share early drafts of our post with authors whose work we discuss. This post is not about the bilingual meta-analysis paper, but it did rely on it, so I contacted the first author, Angela De Bruin. She suggested some valuable clarifications regarding her work that I attempted to incorporate (she also indicated she is interested in running p-curve analysis on follow-up work she is pursuing).



Footnotes.

  1. By “randomly” I mean orthogonally to true effect size, so that the expected correlation between sample size and effect size is zero: r(n,d)=0. []
  2. The problem that asymmetric funnel plots may arise from r(d,n)<0 is mentioned in some methods papers (see e.g., Lau et al. .pdf), but it is usually ignored by funnel-plot users. Perhaps in part because the problem is described as a theoretical possibility, a caveat; but it is a virtual certainty, a deal-breaker. It also doesn’t help that so many sources that explain funnel plots don’t disclose this problem, e.g., the Cochrane handbook for meta-analysis .htm. []
  3. Causality can also go the other way: Given the restriction of a smaller sample, researchers may measure more obviously impacted variables. []
  4. To give you a sense of what assuming r(d,dguess)=.6 implies for researchers’ ability to figure out the sample size they need: for the simulations described here, researchers would set a sample size that is on average off by 38%; for example, the researcher needs n=100, but she runs n=138, or runs n=62, so not super accurate (R Code). []
  5. This post was modified on April 7th, and October 2nd, 2017; you can see an archived copy of the original version here []

[57] Interactions in Logit Regressions: Why Positive May Mean Negative

Of all economics papers published this century, the 10th most cited appeared in Economics Letters, a journal with an impact factor of 0.5.  It makes an inconvenient and counterintuitive point: the sign of the estimate (b̂) of an interaction in a logit/probit regression need not correspond to the sign of its effect on the dependent variable (Ai & Norton 2003, .pdf; 1467 cites).

That is to say, if you run a logit regression like y=logit(b1x1+b2x2+b3x1x2), and get b̂3=.5, a positive interaction estimate, it is possible (and quite likely) that for many xs the impact of the interaction on the dependent variable is negative; that is, that as x1 gets larger, the impact of x2 on y gets smaller.

This post provides an intuition for that reversal, and discusses when it actually matters.

side note: Many economists run “linear probability models” (OLS) instead of logits, to avoid this problem. But that does not fix this problem, it just hides it. I may write about that in a future post.

Buying a house (no math)
Let’s say your decision to buy a house depends on two independent factors: (i) how much you like it (ii) how good an investment it is.

Unbounded scale. If the house decision were on an unbounded scale, say how much to pay for it, liking and investment value would remain independent. If you like the house enough to pay $200k, and in addition it would give you $50k in profits, you’d pay $250k; if the profits were $80k instead of $50k, you’d pay $280k. Two main effects, no interaction [1].

Bounded scale. Now consider, instead of $ paid, measuring how probable it is that you buy the house; a bounded dependent variable (0-1).  Imagine you love the house (Point C in figure below). Given that enthusiasm, a small increase or drop in how good an investment it is, doesn’t affect the probability much. If you felt lukewarm, in contrast (Point B), a moderate increase in the investment quality could make a difference. And in Point A, moderate changes again don’t matter much.

Key intuition: when the dependent variable is bounded [2], the impact of every independent variable moves it closer to/further from that bound, and hence impacts how flat the curve is, how sensitive the dependent variable is to changes in any other variable. Every variable, then, has an interactive effect on all variables, even if they are not meaningfully related to one another and even if interaction effects are not included in the regression equation.

Mechanical vs conceptual interactions
I call interactions that arise from the non-linearity of the model, mechanical interactions, and those that arise from variables actually influencing each other, conceptual interactions.

In life, most conceptual interactions are zero: how much you like the color of the kitchen in a house does not affect how much you care about roomy closets, the natural light in the living room, or the age of the AC system. But, in logit regressions, EVERY mechanical interaction is ≠0; if you love the kitchen enough that you really want to buy the house, you are far to the right in the figure above and so all other attributes now matter less: closets, AC system and natural light all now have less detectable effects on your decision.

In a logit regression, the b̂s one estimates capture only conceptual interactions. When one computes “marginal effects”, that is, when one goes beyond the b̂ to ask how much the dependent variable changes as we change a predictor, one adds in the mechanical interaction effect.

Ai and Norton’s point, then, is that the coefficient may be positive (b̂3>0, conceptual interaction positive) while the marginal effect is negative (conceptual + mechanical negative).

Let’s take this to logit land
Let
y: probability of buying the house
x1: how much you like it
x2: how good an investment it is

and,
y= logit(b1x1+b2x2)  [3]
(note: there is no interaction in the true model, no x1x2 term)

Below I plot that true model, y on x2, keeping x1 constant at x1=0 (R Code for all plots in post).


We are interested in the interaction of x1 with x2: in how x2 affects the impact of x1 on y. Let's add a new line to the figure, keeping x1 fixed at x1=1 instead of x1=0.


For any given investment value, say x2=0, you are more likely to buy the house if you like it more (dashed red vs solid black line). The vertical distance between lines is the impact of x1=1 vs x1=0; one can already see that around the extremes the gap is smaller, so the effect of x1 gets smaller when x2 is very big or very small.
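For readers who want to reproduce the flavor of these two curves without opening the linked R Code, here is a minimal sketch; it assumes b1=b2=1, as in the figures.

```r
# Sketch of the two curves: P(buy) as a function of x2, holding x1 at 0 vs. 1,
# under the no-interaction model y = logit(b1*x1 + b2*x2) with b1 = b2 = 1
x2 <- seq(-4, 4, by = 0.1)
p_x1_0 <- plogis(0 + x2)  # solid line:  x1 = 0
p_x1_1 <- plogis(1 + x2)  # dashed line: x1 = 1

plot(x2, p_x1_0, type = "l", ylim = c(0, 1),
     xlab = "x2 (investment value)", ylab = "Probability of buying")
lines(x2, p_x1_1, lty = 2, col = "red")
legend("topleft", legend = c("x1 = 0", "x1 = 1"),
       lty = c(1, 2), col = c("black", "red"))
```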

Below I add arrows that quantify the vertical gaps at specific x2 values. For example, when x2=-2, going from x1=0 to x1=1 increases the probability of purchase by 15%, and by 23% when x2=-1 [4]

The difference across arrows captures how the impact of x1 changes as we change x2: the interaction. The bottom chart shows the results under the brackets.  Recall that there is no conceptual interaction here (the model is y=logit(x1+x2)), so those interactions, +.08 and -.08 respectively, are purely mechanical.
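Those gaps are easy to verify numerically. A minimal sketch, again assuming b1=b2=1 (the helper name `gap` is just for illustration):

```r
# Vertical gaps (the effect of moving x1 from 0 to 1) at several values of x2,
# under the no-interaction model y = logit(x1 + x2)
gap <- function(x2) plogis(1 + x2) - plogis(0 + x2)

gap(-2)            # ~.15, the left arrow
gap(-1)            # ~.23, the next arrow
gap(-1) - gap(-2)  # ~ +.08, mechanical interaction on the left

gap(1)             # ~.15
gap(2)             # ~.07
gap(2) - gap(1)    # ~ -.08, mechanical interaction on the right
```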

Now: the sign reversal
So far we assumed x1 and x2 were not conceptually related. The figure below shows what happens when they are: y=logit(x1+x2+0.25x1x2). Despite the conceptual interaction being b3=.25 > 0, the total effect of the interaction is positive for low values of x2 (+.11 from x2=-2 to x2=-1) but negative for high values (-.08 from x2=1 to x2=2); there the mechanical interaction dominates.
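The reversal can be checked the same way; a sketch assuming b1=b2=1 and b3=.25:

```r
# Same exercise with a positive conceptual interaction: y = logit(x1 + x2 + 0.25*x1*x2)
gap2 <- function(x2) plogis(1 + x2 + 0.25 * 1 * x2) - plogis(0 + x2 + 0.25 * 0 * x2)

gap2(-1) - gap2(-2)  # ~ +.11: total interaction is positive at low x2
gap2(2)  - gap2(1)   # ~ -.08: negative at high x2, despite b3 = +.25
```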


What to do about this?

Ai & Norton propose not focusing on point estimates at all, not focusing on b̂3=.25. They propose instead computing how much the dependent variable changes with a change in the underlying variables, the marginal effect of the interaction, the one that combines conceptual and mechanical; doing that for every data point, and reporting the average.
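For concreteness, here is a minimal sketch of that computation on simulated data, using the analytic cross-partial derivative of the logistic rather than any particular package; the coefficients and data are made up, and standard errors are omitted.

```r
# Average marginal interaction effect (the cross-partial of P with respect to x1 and x2)
# for y = logit(b1*x1 + b2*x2 + b3*x1*x2); coefficients and data are made up
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
b1 <- 1; b2 <- 1; b3 <- 0.25

eta <- b1 * x1 + b2 * x2 + b3 * x1 * x2
p   <- plogis(eta)
dp  <- p * (1 - p)        # first derivative of the logistic CDF
d2p <- dp * (1 - 2 * p)   # second derivative of the logistic CDF

# conceptual piece (b3 * dp) plus mechanical piece (product of the two slopes times d2p)
inter <- b3 * dp + (b1 + b3 * x2) * (b2 + b3 * x1) * d2p

mean(inter)     # the single averaged number
summary(inter)  # the sign varies across observations even though b3 > 0
```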

In another Econ Letters paper, Greene (2010; .pdf) [5] argues that averaging the interaction is kind of meaningless. He has a point: ask yourself how informative it is to tell a reader that the average of the interaction effects depicted above, +.11 and -.08, is +.015. He suggests plotting the marginal effect for every value instead.

But, such graphs will combine conceptual and mechanical interactions. Do we actually want to do that? It depends on whether we have a basic-research or applied-research question.

What is the research question?
Imagine a researcher examining the benefits of text-messaging the parents of students who miss a homework assignment, and that the researcher is interested in whether messages are less beneficial for high-GPA students (so in the interaction: message*GPA).

An applied research question may be:

How likely is a student to get an A in this class if we text-message his parents when he misses a homework assignment?

For that question, yes, we need to include the mechanical interaction to be accurate. If high-GPA students were going to get an A anyway, then the text message does not increase the probability for them. The ceiling effect is real and should be taken into account. So we need the marginal effect.

A (slightly more) basic-research question may be:

How likely is a student to get more academically involved in this class if we text-message his parents when he misses a homework assignment?

Here grades are just a proxy, a proxy for involvement; if high-GPA students would get an A anyway but, thanks to the text message, become more involved, we want to know that. We do not want the marginal effect on grades, we want the conceptual interaction, we want b̂.

In sum: When asking conceptual or basic-research questions, if b̂ and the marginal effects disagree, go with b̂.



Authors feedback.
Our policy is to contact authors whose work we discuss, asking them to suggest changes and to reply within our blog if they wish. I shared a draft with Chunrong Ai & Edward Norton. Edward replied indicating he appreciated the post, and suggested I tell readers about another article of his that delves further into this issue (.pdf).



Footnotes.

  1. What really matters is linear vs. non-linear scale rather than bounded vs. not, but bounded provides the intuition more clearly. []
  2. As mentioned before, the key is non-linear rather than bounded []
  3. the logit model is y = e^(b1x1+b2x2)/(1+e^(b1x1+b2x2)). []
  4. percentage points, I know, but it’s a pain to write that every time. []
  5. The author of the "Greene" econometrics textbook used in Econ PhD programs .htm []

[56] TWARKing: Test-Weighting After Results are Known

In the last class of the semester I hold a "town-hall" meeting: an open discussion about how to improve the course (content, delivery, grading, etc.). I follow up with a required online poll to "vote" on proposed changes [1].

Grading in my class is old-school. Two tests, each worth 40%, homeworks 20% (graded mostly on a 1/0 completion scale). The downside of this model is that those who do poorly early on get demotivated. Also, a bit of bad luck in a test hurts a lot. During the latest town-hall the idea of having multiple quizzes and dropping the worst was popular. One problem with that model is that students can blow off a quiz entirely. After the town-hall I thought about why students loved the drop-1 idea and whether I could capture the same psychological benefit with a smaller pedagogical loss.

I came up with TWARKing: assigning test weights after results are known [2]. With TWARKing, instead of each test counting 40% for every student, whichever test an individual student did better on gets more weight; so if Julie does better in Test 1 than in Test 2, her Test 1 gets 45% and her Test 2 gets 35%; Jason did better in Test 2, so his Test 2 gets 45% [3]. Dropping a quiz becomes a special case of TWARKing: the worst gets 0% weight.
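For concreteness, a minimal sketch of the arithmetic with hypothetical scores (the function name and the scores are made up; the 45/35/20 split is the one described above):

```r
# TWARKed final grade: the better test gets 45%, the worse one 35%, homework keeps 20%
# (scores below are hypothetical)
twark <- function(test1, test2, homework) {
  0.45 * pmax(test1, test2) + 0.35 * pmin(test1, test2) + 0.20 * homework
}

twark(test1 = 90, test2 = 70, homework = 100)  # 85, vs. 0.4*90 + 0.4*70 + 0.2*100 = 84
```

As the comparison in the last line shows, every student's TWARKed average is at least as high as their 40/40 average, which is the concrete gain students see.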

It polls well
I expected TWARKing to do well in the online poll but was worried students would fall prey to competition-neglect, so I wrote a long question stacking the deck against TWARKing:
[Figure: the poll question and the distribution of responses]

70% of students were in favor, only 15% against (N=92; only 3 students did not complete the poll).

The poll is not anonymous, so I looked at how TWARKing attitudes are correlated with actual performance.

[Figure: Panels A-C, support for TWARKing vs. students' performance]

Panel A shows that students doing better like TWARKing less, but the effect is not as strong as I would have expected. Students liking it 5/5 perform, on average, in the bottom 40%; those liking it 2/5 are in the top 40%.

Panel B shows that students with more uneven performance do like TWARKing more, but the effect is small and unimpressive (Spearman's r=.21, p=.044).

For Panel C I recomputed the final grades had TWARKing been implemented for this semester and saw how the change in ranking correlated with support of TWARKing. It did not. Maybe it was asking too much for this to work as students did not yet know their Test 2 scores.

My read is that students cannot anticipate if it will help vs. hurt them, and they generally like it all the same.

TWARKing could be pedagogically superior.
Tests serve two main roles: motivating students and measuring performance. I think TWARKing could be better on both fronts.

Better measurement. My tests tend to include insight-type questions: students either nail them or fail them. It is hard to get lucky in my tests, I think, hard to get a high score despite not knowing the material. But it is, unfortunately, easy to get unlucky; to get no points on a topic you had a decent understanding of [4].  Giving more weight to the highest test is hence giving more weight to the more accurate of the two tests.  So it could improve the overall validity of the grade.  A student who gets a 90 and a 70 is, I presume, better than one getting 80 in both tests.

This reminded me of what Shugan & Mitra (2009 .pdf) label the "Anna Karenina effect" in their under-appreciated paper (11 Google cites). Their Anna Karenina effect (there are a few, each different from the other) occurs when less favorable outcomes carry less information than more favorable ones; in those situations, measures other than the average, e.g., the max, perform better for out-of-sample prediction. [5]

To get an intuition for this Anna Karenina effect: think about what contains more information, a marathon runner’s best vs worst running time? A researcher’s most vs least cited paper?

Note that one can TWARK within a test, weighting each student's highest-scored answer more. I will.

Motivation. After doing very poorly in a test it must be very motivating to feel that if you study hard you can make this bad performance count less. I speculate that with TWARKing students who underperform in Test 1 are less likely to be demotivated for Test 2 (I will test this next semester, but without random assignment…).  TWARKing has the magical psychological property that the gains are very concrete: every single student gets a higher average with TWARKing than without, and they see that; the losses, in contrast, are abstract and unverifiable (you don't see the students who benefited more than you did, leading to a net loss in your ranking).

Bottom line
Students seem to really like TWARKing.
It may make things better for measurement.
It may improve motivation.

A free happiness boost.





Footnotes.

  1. Like Brexit, the poll in OID290 is not binding []
  2. Obviously the name is inspired by ‘HARKing’: hypothesizing after results are known.  The similarity to Twerking, in contrast, is unintentional, and, given the sincerity of the topic, probably unfortunate. []
  3. I presume someone already does this, not claiming novelty []
  4. Students can still get lucky if I happen to ask on a topic they prepared better for. []
  5. They provide calibrations with real data in sports, academia and movie ratings. Check the paper out. []

[55] The file-drawer problem is unfixable, and that’s OK

The “file-drawer problem” consists of researchers not publishing their p>.05 studies (Rosenthal 1979 .pdf).
P-hacking consists of researchers not reporting their p>.05 analyses for a given study.

P-hacking is easy to stop. File-drawering nearly impossible.
Fortunately, while p-hacking is a real problem, file-drawering is not.

Consequences of p-hacking vs file-drawering
With p-hacking it’s easy to get a p<.05 [1].  Run 1 study, p-hack a bit and it will eventually “work”; whether the effect is real or not.  In “False-Positive Psychology” we showed that a bit of p-hacking gets you p<.05 with more than 60% chance (SSRN).

With file-drawering, in contrast, when there is no real effect, only 1 in 20 studies work. It's hard to be a successful researcher with such a low success rate [2]. It's also hard to fool oneself that the effect of interest is real when 19 in 20 studies fail. There are only so many hidden moderators we can talk ourselves into. Moreover, papers typically have multiple studies. A four-study paper would require file-drawering 76 failed studies. Nuts.
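A back-of-the-envelope version of that arithmetic:

```r
# Back-of-the-envelope file-drawer arithmetic when the effect is not real
p_work <- 0.05   # chance a single study "works" at p < .05
1 / p_work       # expected number of studies run per study that works: 20
4 / p_work       # expected number run to fill a four-study paper: 80
4 / p_work - 4   # expected number left in the file drawer: 76
```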

File-drawering entire studies is not really a problem, which is good news, because the solution for the file-drawer is not really a solution [3].

Study registries: The non-solution to the file-drawer problem
Like genitals & generals, study registries & pre-registrations sound similar but mean different things.

A study registry is a public repository where authors report all studies they run. A pre-registration is a document authors create before running one study, to indicate how that given study will be run. Pre-registration intends to solve p-hacking. Study registries intend to solve the file-drawer problem.

Study registries sound great, until you consider what needs to happen for them to make a difference.

How the study registry is supposed to work
You are reading a paper and get to Study 1. It shows X. You put the paper down, visit the registry, search for the set of all other studies examining X or things similar to X (so maybe search by author, then by keyword, then by dependent variable, then by topic, then by manipulation), then decide which subset of the studies you found are actually relevant for the Study 1 in front of you (e.g., actually studying X, with a similarly clean design, competent enough execution, comparable manipulation and dependent variable, etc.). Then you tabulate the results of those studies found in the registry, and use the meta-analytical statistical tool of your choice  to combine those results with the one from the study still sitting in front of you.  Now you may proceed to reading Study 2.

Sorry, I probably made it sound much easier than it actually is. In real life, researchers don’t comply with registries the way they are supposed to. The studies found in the registry almost surely will lack the info you need to ‘correct’ the paper you are reading.  A year after being completed, about 90% of studies registered in ClinicalTrials.gov do not have the results uploaded to the database (NEJM, 2015 .pdf). Even for the subset of trials where posting results is ‘mandatory’  it does not happen (BMJ, 2012 .pdf), and when results are uploaded, they are often incomplete and inconsistent with the results in the published paper (Ann Int Medicine 2014 .pdf). This sounds bad, but in social science it will be way worse; in medicine the registry is legally required, for us it’s voluntary. Our registries would only include the subset of studies some social scientists choose to register (the rest remain in the file-drawer…).

Study registries in social science fall short of fixing an inconsequential problem, the file-drawer, and they are burdensome both to comply with and to use.

Pre-registration: the solution to p-hacking
Fixing p-hacking is easy: authors disclose how sample size was set & all measures, conditions, and exclusions (“False Positive Psychology” SSRN). No ambiguity, no p-hacking.

For experiments, the best way to disclose is with pre-registrations.  A pre-registration consists of writing down what one wants to do before one does it. In addition to the disclosure items above, one specifies the hypothesis of interest and focal statistical analysis. The pre-registration is then appended to studies that get written-up (and file-drawered with those that don’t). Its role is to demarcate planned from unplanned analysis. One can still explore, but now readers know one was exploring.

Pre-registration is an almost perfect fix for p-hacking, and can be extremely easy to comply with and use.

In AsPredicted it takes 5 minutes to create a pre-registration, half a minute to read it (see sample .pdf). If you pre-register and never publish the study, you can keep your AsPredicted private forever (it’s about p-hacking, not the file-drawer). Over 1000 people created AsPredicteds in 2016.

Summary
– The file-drawer is not really a problem, and study registries don’t come close to fixing it.
– P-hacking is a real problem. Easy-to-create and easy-to-evaluate pre-registrations all but eliminate it.


Uri’s note: post was made public by mistake when uploading the 1st draft.  I did not receive feedback from people I was planning to contact and made several edits after posting. Sorry.



Footnotes.

  1. With p-hacking it is also easy to get a Bayes Factor >3; see "Posterior Hacking" http://DataColada.org/13. []
  2. it’s actually 1 in 40 since usually we make directional predictions and rely on two-sided tests []
  3. p-curve is a statistical remedy to the file-drawer problem and it does work .pdf []

[54] The 90x75x50 heuristic: Noisy & Wasteful Sample Sizes In The “Social Science Replication Project”

An impressive team of researchers is engaging in an impressive task: Replicate 21 social science experiments published in Nature and Science in 2010-2015 (.htm).

The task requires making many difficult decisions, including what sample sizes to use. The authors' current plan is a simple rule: set n for the replication so that it would have 90% power to detect an effect that's 75% as large as the original effect-size estimate.  If "it fails" (p>.05), try again, powering for an effect 50% as big as the original.
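To make the rule concrete, here is a minimal sketch of what it implies for a single, made-up original result; it assumes the second step keeps the 90% power target, and uses base R's power.t.test.

```r
# Sketch of the 90-75-50 rule for one original result (two-cell design),
# assuming a made-up original estimate of Cohen's d = 0.5
d_hat <- 0.5

# Step 1: n per cell for 90% power to detect 75% of the original estimate
ceiling(power.t.test(delta = 0.75 * d_hat, sd = 1, power = 0.90)$n)

# Step 2 (only if the first replication gets p > .05): re-power for 50% of it
ceiling(power.t.test(delta = 0.50 * d_hat, sd = 1, power = 0.90)$n)
```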

In this post I examine the statistical properties of this “90-75-50” heuristic, concluding it is probably not the best solution available. It is noisy and wasteful [1].

Noisy n.
It takes a huge sample to precisely estimate effect size (ballpark: n=3000 per cell, see DataColada[20]). Typical experiments, with much smaller ns, provide extremely noisy estimates of effect size; sample size calculations for replications, based on such estimates, are extremely noisy as well.

As a calibration let’s contrast 90-75-50 with the “Small-Telescopes” approach (.pdf), which requires replications to have 2.5 times the original sample size to ensure 80% power to accept the null. Zero noise.

The figure below illustrates. It considers an original study that was powered at 50% with a sample size of 50 per cell. What sample size will that original study recommend for the first replication (powered 90% for 75% of the observed effect)? The answer is a wide distribution of sample sizes, reflecting the wide distribution of effect size estimates the original could result in [2]. Again, this is the recommendation for replicating the exact same study, with the same true effect and same underlying power; the variance you see for the replication recommendation purely reflects sampling error in the original study (R Code).

We can think of this figure as the roulette wheel being used to set the replication’s sample size.
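A simulation sketch of that roulette wheel (this is not the post's linked R Code, which works with the distribution of estimates directly; the helper name is made up):

```r
# Simulation sketch of the "roulette wheel": what does 90-75-50 recommend when
# the original study has 50% power with n = 50 per cell?
set.seed(123)
n_orig <- 50
d_true <- power.t.test(n = n_orig, power = 0.50)$delta  # true effect giving 50% power

one_recommendation <- function() {
  x  <- rnorm(n_orig, mean = 0)
  y  <- rnorm(n_orig, mean = d_true)
  tt <- t.test(y, x, var.equal = TRUE)
  if (tt$p.value >= .05) return(NA)               # original not significant: no replication
  d_hat <- abs(tt$statistic) * sqrt(2 / n_orig)   # observed effect size estimate
  ceiling(power.t.test(delta = 0.75 * d_hat, power = 0.90)$n)
}

recs <- na.omit(replicate(10000, one_recommendation()))
mean(recs)  # average recommended n per cell
sd(recs)    # noise in that recommendation (the post reports an SD of ~50)
```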

The average sample size recommendations of both procedures are similar: n=125 for the Small Telescopes approach vs. n=133 for 90-75-50. But the heuristic has lots of noise: the standard deviation of its recommendations is 50 observations, more than 1/3 of its average recommendation of 133 [3].

Waste
The 90-75-50 heuristic throws good money after bad, escalating commitment to studies that have already accepted the null.  Consider an original study that is a false positive with n=20. Given the distribution of (p<.05) possible original effect-size estimates, 90-75-50 will on average recommend n=67 per cell for the first replication, and when that one fails (which it will with 97.5% chance, because the original is a false positive), it will run a second replication, now with n=150 participants per cell (R Code).

From the "Small Telescopes" paper (.pdf) we know that if 2.5 times the original n=20 were run in the first replication, i.e., n=50, we would already have an 80% chance to accept the null. So in the vast majority of cases, when replicating it with n=67, we will already have accepted the null; why throw another n=150 at it? That dramatic explosion of sample size for false-positive original findings is about the same for any original n, such that:

False-positive original findings lead to replications with about 12 times as many subjects per-cell when relying on 90-75-50

If the false-positive original was p-hacked, it's worse. The original p-value will be close to p=.05, meaning a smaller estimated original effect size and hence an even larger replication sample size. For instance, if the false-positive original got p=.049, 90-75-50 will trigger replications with 14 times the original sample size (R Code).
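A back-of-the-envelope version of that worst case; it need not match the linked R Code exactly, since that may make different assumptions about the second step, but it lands in the same ballpark.

```r
# A false-positive original with n = 20 per cell that just scraped by at p = .049
n_orig <- 20
t_obs  <- qt(1 - .049 / 2, df = 2 * n_orig - 2)  # t-value implied by p = .049
d_hat  <- t_obs * sqrt(2 / n_orig)               # implied (spurious) effect size estimate

rep1 <- ceiling(power.t.test(delta = 0.75 * d_hat, power = 0.90)$n)  # first replication
rep2 <- ceiling(power.t.test(delta = 0.50 * d_hat, power = 0.90)$n)  # second replication
(rep1 + rep2) / n_orig  # total replication n per cell as a multiple of the original's
```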

Rejecting the null
So far we have focused on power and wasted observations for accepting the null. What if the null is false? The figure below shows power for rejecting the null. We see that if the original study had even mediocre power, say 40%, the gains of going beyond 2.5 times the original are modest. The Small Telescopes approach provides reasonable power to accept and also to reject the null (R Code).

[Figure: power to reject the null, by replication sample size and the original study's power]

Better solution.
Given the purpose (and budget) of this replication effort, the Small-Telescopes recommendation could be increased to 3.5n instead of 2.5n, giving nearly 90% power to accept the null [4].
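A rough check of those power figures, using a normal approximation of the Small-Telescopes test as I read it (accepting the null means the replication estimate is significantly smaller, one-sided alpha=.05, than d33, the effect the original had 33% power to detect; the helper name is made up):

```r
# Rough check of the Small-Telescopes power figures (normal approximation).
# "Accepting the null" = the replication estimate is significantly smaller
# (one-sided, alpha = .05) than d33, the effect the original had 33% power to detect.
small_tel_power <- function(n_orig, multiple) {
  d33    <- power.t.test(n = n_orig, power = 1/3)$delta  # small-but-detectable effect
  se_rep <- sqrt(2 / (multiple * n_orig))                # SE of the replication's estimate
  pnorm(d33 / se_rep - qnorm(.95))                       # power to accept when true d = 0
}

small_tel_power(n_orig = 50, multiple = 2.5)  # ~.78, roughly the 80% figure above
small_tel_power(n_orig = 50, multiple = 3.5)  # ~.89, the "nearly 90%" suggested here
```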

The Small Telescopes approach requires fewer participants overall than 90-75-50 does, is unaffected by statistical noise, and it paves the way to a much needed “Do we accept the null?” mindset to interpreting ‘failed’ replications.



Author feedback.
Our policy is to contact authors whose work we discuss, asking them to suggest changes and to reply within our blog if they wish. I shared a draft with several of the authors behind the Social Science Replication Project and discussed it with a few of them. They helped me clarify the depiction of their sample-size selection heuristic, prompted me to drop a discussion I had involving biased power estimates for the replications, and prompted me, indirectly, to add the entire calculations and discussion of waste included in the post you just read. Their response was prompt and valuable.



Footnotes.

  1. The data-peeking involved in the 2nd replication inflates false-positives a bit, from 5% to about 7%, but since replications involve directional predictions, if they use two-sided tests, it’s fine. []
  2. The calculations behind the figure work as follows. One begins with the true effect size, the one giving the original sample 50% power. Then one computes how likely each possible significant effect size estimate is, that is, the distribution of possible effect size estimates for the original (this comes straight from the non-central distribution). Then one computes for each effect size estimate, the sample size recommendation for the replication that the 90-75-50 heuristic would result in, that is, one based on an effect 75% as big as the estimate, and since we know how likely each estimate is, we know how likely each recommendation is, and that’s what’s plotted. []
  3. How noisy the 90-75-50 heuristic recommendation is depends primarily on the power of the original study and not the specific sample and effect sizes behind such power. If the original study has 50% power, the SD of the recommendation over the average recommendation is ~37% (e.g., 50/133) whether the original had n=50, n=200 or n=500. If underlying power is 80%, the ratio is ~46% for those same three sample sizes. See Section (5) in the R Code []
  4. Could also do the test half-way, after 1.75n, ending study if already conclusive; using a slightly stricter p-value cutoff to maintain desired false-positive rates; hi there @lakens []