A few years ago our Journal Club discussed an interesting methods paper entitled, “Putting Psychology to the Test: Rethinking Model Evaluation Through Benchmarking and Prediction” (.htm). This post describes my attempt to understand what’s happening in Figure 1 of that paper, which shows that extremely simple experiments can generate extremely negative R2s. I learned a…
[133] Heterofriendly: The Intuition for Why You Always Need Robust Standard Errors
When I taught my first PhD-level methods course, I invited students to submit questions about any topic in statistics or methodology. Six out of 10 students asked about the same topic: robust & clustered standard errors. It's clearly a topic they found both important and confusing. Psychologists basically never use robust standard errors. But they…
[132] statuser: R in user-friendly mode
t.test(), the R function for running t-tests, is disconcertingly imperfect. A t-test involves computing the difference between two means. And yet, t.test() does not report… …said difference of means. It reports the p-value for the difference of means, it reports the confidence interval for the difference of means, but not the difference of means itself….
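The quirk the post describes is not unique to R. Python's scipy behaves the same way: `scipy.stats.ttest_ind` returns the test statistic and p-value, but the difference of means must be computed separately. A minimal sketch (assuming scipy is installed; the data here are made up for illustration):

```python
# scipy's ttest_ind, like R's t.test(), reports the statistic and
# p-value but not the difference of means itself.
import numpy as np
from scipy import stats

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])

result = stats.ttest_ind(a, b)
print(result.statistic, result.pvalue)  # what the test reports

# The quantity being tested has to be computed by hand:
diff = a.mean() - b.mean()
print(diff)  # -2.5
```

The point carries over directly: the result object answers "is the difference significant?" without ever stating what the difference is.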
[131] Bending Over Backwards: The Quadratic Puts the U in AI
For a recent journal club in Barcelona, we read a just-published article in the Journal of Experimental Psychology: General (JEP:G). The paper is on the impact of using gen-AI on creativity. The paper proposes an inverted U: people are most creative with moderate levels of AI use. The paper has three studies. Studies 1…
[130] ResearchBox: Even Easier to Use and More Transparently Permanent than Before
Over the past 10 years or so, posting data, code, and materials for published papers has gone from eccentric to mundane. There are a few platforms that enable sharing research files, including ResearchBox. ResearchBox is hosted by the Wharton Credibility Lab, which I co-direct. We also host the pre-registration platform AsPredicted, and a new platform…
[129] P-curve works in practice, but would it work if you dropped a piano on it?
P-curve is a statistical tool we developed about 15 years ago to help rule out selective reporting, be it p-hacking or file-drawering, as the sole explanation for a set of significant results. This post is about a forthcoming critique of p-curve in the statistics journal JASA (pdf). The authors identify four p-curve properties they object…
[128] LinkedOut: The Best Published Audit Study, And Its Interesting Shortcoming
There is a recent QJE paper reporting a LinkedIn audit study comparing responses to requests by Black vs White young males. I loved the paper. At every turn you come across a clever, effortful, and effective solution to a challenge posed by studying discrimination in a field experiment. But, no paper is perfect, and this…
[127] Meaningless Means #4: Correcting Scientific Misinformation
Before we got distracted by things like being sued, we had been working on a series called Meaningless Means, which exposed the fact that meta-analytic averaging is (really) bad. When a meta-analysis says something like, “The average effect of mindsets on academic performance is d = .32”, you should not take it at face value….
[126] Stimulus Plots
When we design experiments, we have to decide how to generate and select the stimuli that we use to test our hypotheses. In a forthcoming JPSP article, “Stimulus Sampling Reimagined” (.htm), we propose that for at least 60 years we have been thinking about stimulus selection in experiments in the wrong way [1]. Specifically, with…
[125] "Complexity" 2: Don't be mean to the median
In Colada [124] I summarized a co-authored critique (with Banki, Walatka and Wu) of a recent AER paper that proposed risk preferences reflect 'complexity' rather than preferences à la Prospect Theory. Ryan Oprea, the AER author, has written a rejoinder (.pdf). Its first main point (pages 5-12) is that our results with medians are 'knife edge' (p.8),…
