Metacritic.com scores and aggregates critics’ reviews of movies, music, and video games. The website provides a summary assessment of the critics’ evaluations, using a scale ranging from 0 to 100. Higher numbers mean that critics were more favorable. In theory, this website is pretty awesome, seemingly leveraging the wisdom of crowds to give consumers the most reliable…
[71] The (Surprising?) Shape of the File Drawer
Let’s start with a question so familiar that you will have answered it before the sentence is even completed: How many studies will a researcher need to run before finding a significant (p<.05) result? (If she is studying a non-existent effect and if she is not p-hacking.) Depending on your sophistication, wariness about being asked…
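Under those assumptions the arithmetic is simple: each study independently has a 5% chance of (falsely) coming out significant, so the number of studies needed follows a geometric distribution, with a mean of 1/.05 = 20 but a mode of 1. A minimal simulation sketch of that answer (the variable names and the 100,000 simulated researchers are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A researcher studies a truly null effect and does not p-hack:
# each independent study has exactly a 5% chance of reaching p < .05.
# How many studies does she run before the first "significant" one?
alpha = 0.05
studies_needed = rng.geometric(alpha, size=100_000)

print("mean studies needed:", studies_needed.mean())                   # ~20 (= 1/alpha)
print("modal studies needed:", np.bincount(studies_needed).argmax())   # 1
print("P(the very first study works):", (studies_needed == 1).mean())  # ~.05
```

The skew is the point: the single most common outcome is that the very first study is significant, even though on average she needs about twenty tries.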
[70] How Many Studies Have Not Been Run? Why We Still Think the Average Effect Does Not Exist
We have argued that, for most effects, it is impossible to identify the average effect (datacolada.org/33). The argument is subtle (but not statistical), and given the number of well-informed people who seem to disagree, perhaps we are simply wrong. This is my effort to explain why we think identifying the average effect is so hard….
[69] Eight things I do to make my open research more findable and understandable
It is now common for researchers to post original materials, data, and/or code behind their published research. That’s obviously great, but open research is often difficult to find and understand. In this post I discuss 8 things I do, in my papers, code, and datafiles, to combat that. Paper 1) Before all method sections, I…
[68] Pilot-Dropping Backfires (So Daryl Bem Probably Did Not Do It)
Uli Schimmack recently identified an interesting pattern in the data from Daryl Bem’s infamous “Feeling the Future” JPSP paper, in which he reported evidence for the existence of extrasensory perception (ESP; .htm)[1]. In each study, the effect size is larger among participants who completed the study earlier (blogpost: .htm). Uli referred to this as the “decline…
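To make that pattern concrete, here is a minimal sketch of one hypothetical pilot-dropping strategy applied to a truly null effect: small pilots are thrown out until one happens to look promising, and only then is the rest of the sample collected. The pilot size, total sample, and “promising” cutoff below are made-up illustration values, not anything taken from Bem’s paper or from Uli’s analysis:

```python
import numpy as np

rng = np.random.default_rng(1)

def pilot_dropping_study(pilot_n=10, total_n=50, cutoff_d=0.5):
    """One hypothetical pilot-dropping researcher studying a null effect:
    re-run small pilots until one looks promising, keep that pilot's data,
    then collect the remaining participants."""
    while True:
        pilot = rng.normal(0, 1, pilot_n)          # scores in effect-size (d) units
        if pilot.mean() >= cutoff_d:               # only promising pilots survive
            rest = rng.normal(0, 1, total_n - pilot_n)
            return pilot, rest

early, late = [], []
for _ in range(2_000):
    pilot, rest = pilot_dropping_study()
    early.append(pilot.mean())
    late.append(rest.mean())

print("mean effect, early (pilot) participants:", np.mean(early))  # inflated, ~0.6
print("mean effect, later participants:        ", np.mean(late))   # ~0
```

The surviving pilots look great and the later participants regress to zero, which is the early-versus-late asymmetry described above; whether such a strategy could plausibly produce Bem’s actual results is the question the post takes up.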
[67] P-curve Handles Heterogeneity Just Fine
A few years ago, we developed p-curve (see p-curve.com), a statistical tool that identifies whether a set of statistically significant findings contains evidential value or whether those results are instead solely attributable to the selective reporting of studies or analyses. It also estimates the true average power of a set of significant findings [1]….
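The intuition behind that tool can be reproduced in a few lines: when the true effect is zero, the statistically significant p-values are distributed uniformly between 0 and .05, whereas a real effect piles the significant p-values up near zero (a right-skewed p-curve). A minimal sketch of that input to p-curve, assuming simple two-cell studies analyzed with z-tests (illustrative numbers, not the p-curve app itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def significant_pvalues(d, n_per_cell, n_studies=50_000, alpha=0.05):
    """Simulate two-cell studies with true effect size d and return the
    p-values of only those studies that came out significant."""
    x = rng.normal(d, 1, size=(n_studies, n_per_cell)).mean(axis=1)
    y = rng.normal(0, 1, size=(n_studies, n_per_cell)).mean(axis=1)
    z = (x - y) / np.sqrt(2 / n_per_cell)
    p = 2 * stats.norm.sf(np.abs(z))
    return p[p < alpha]

for d in (0.0, 0.5):
    p_sig = significant_pvalues(d, n_per_cell=20)
    print(f"d = {d}: share of significant p-values below .025 = {(p_sig < 0.025).mean():.2f}")

# With no true effect, about half the significant p-values fall below .025 (a flat p-curve);
# with a real effect, far more than half do (a right-skewed curve), the mark of evidential value.
```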
[66] Outliers: Evaluating A New P-Curve Of Power Poses
In a forthcoming Psych Science paper, Cuddy, Schultz, & Fosse, hereafter referred to as CSF, p-curved 55 power-posing studies (.pdf | SSRN), concluding that they contain evidential value [1]. Thirty-four of those studies were previously selected and described as “all published tests” (p. 657) by Carney, Cuddy, & Yap (2015; .htm). Joe and Uri p-curved…
[65] Spotlight on Science Journalism: The Health Benefits of Volunteering
I want to comment on a recent article in the New York Times, but along the way I will comment on scientific reporting as well. I think that science reporters frequently fall short in assessing the evidence behind the claims they relay, but as I try to show, assessing evidence is not an easy task….
[64] How To Properly Preregister A Study
P-hacking, the selective reporting of statistically significant analyses, continues to threaten the integrity of our discipline. P-hacking is inevitable whenever (1) a researcher hopes to find evidence for a particular result, (2) there is ambiguity about how exactly to analyze the data, and (3) the researcher does not perfectly plan out his/her analysis in advance….
[63] "Many Labs" Overestimated The Importance of Hidden Moderators
Are hidden moderators a thing? Do experiments intended to be identical lead to inexplicably different results? Back in 2014, the "Many Labs" project (.htm) reported an ambitious attempt to answer these questions. More than 30 different labs ran the same set of studies and the paper presented the results side-by-side. They did not find any…