[28] Confidence Intervals Don’t Change How We Think about Data

Some journals are thinking of discouraging authors from reporting p-values and encouraging or even requiring them to report confidence intervals instead. Would our inferences be better, or even just different, if we reported confidence intervals instead of p-values?

One possibility is that researchers become less obsessed with the arbitrary significant/not-significant dichotomy. We start paying more attention to effect size. We start paying attention to precision. A step in the right direction.

Another possibility is that researchers forced to report confidence intervals will use them as if they were p-values and will only ask “Does the confidence interval include 0?” In this world confidence intervals are worse than p-values, because p=.012, p=.0002, p=.049 all become p<.05. Our analyses become more dichotomous. A step in the wrong direction.

How to test this?
To empirically assess the consequences of forcing researchers to replace p-values with confidence intervals we could randomly impose the requirement on some authors and see what happens.

That’s hard to pull off for a blog post.  Instead, I exploit a quirk in how “mediation analysis” is now reported in psychology. In particular, the statistical program everyone uses to run mediation reports confidence intervals rather than p-values.  How are researchers analyzing those confidence intervals?

Sample: 10 papers
I went to Web-of-Science and found the ten most recent JPSP articles (.html) citing the Preacher and Hayes (2004) article that provided the statistical programs that everyone runs (.pdf).

All ten of them used confidence intervals as dichotomous p-values; none discussed effect size or precision. None discussed the percentage of the effect that was mediated. One even accepted the null of no mediation because the confidence interval included 0 (it also included large effects).

 


This sample suggests confidence intervals do not change how we think of data.

If people don’t care about effect size here…
Unlike other effect-size estimates in the lab, effect size in mediation is intrinsically valuable.

No one asks how much more hot sauce subjects pour for a confederate to consume after watching a film that made them angry, but we do ask how much of that effect is mediated by anger; ideally all of it. [1]

Change the question before you change the answer
If we want researchers to care about effect size and precision, then we have to persuade researchers that effect size and precision are important.

I have not been persuaded yet. Effect size matters outside the lab, for sure. But in the lab, it is not so clear. Our theories don't make quantitative predictions, effect sizes in the lab are not particularly indicative of how important a phenomenon is outside the lab, and to study effect size with even moderate precision we need samples too big to plausibly be run in the lab (see Colada[20]). [2]

My talk at a recent conference (SESP) focused on how research questions should shape the statistical tools we choose to run and report. Here are the slides (.pptx). This post is an extension of Slide #21.


  1. In practice we do not measure things perfectly, so going for 100% mediation is too ambitious.
  2. I do not have anything against reporting confidence intervals alongside p-values. They will probably be ignored by most readers, but a few will be happy to see them, and it is generally good to make people happy (though it is worth pointing out that one can usually easily compute confidence intervals from test results). Descriptive statistics more generally, e.g., means and SDs, should always be reported to catch errors, facilitate meta-analyses, and just generally better understand the results.

[24] P-curve vs. Excessive Significance Test

In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significant Test, with the (critically important) inferences one arrives at with p-curve.

The Many-Labs project is a collaboration of 36 labs around the world, each running replications of 13 published effects in psychology (paper: .pdf; data: .xlsx). [1]

One of the most replicable effects was the Asian Disease problem, a demonstration that people are risk-seeking for losses but risk-averse for gains; it was p<.05 in 31 of 36 labs (we also replicated it in Colada[11]).

Here I apply the Excessive Significance Test and p-curve to those 31 studies (summary table .xlsx).

How The Excessive Significance Test Works
It takes a set of studies (e.g., all studies in a paper) and asks whether too many are statistically significant. For example, say a paper has five studies, all p<.05. Imagine each obtained an effect size that would have given it 50% power. The probability that five out of five studies powered to 50% would all get p<.05 is .5*.5*.5*.5*.5=.03125. So we reject the null of full reporting, meaning that at least one null finding was not reported.
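For concreteness, here is a minimal R sketch of that calculation for the all-significant case described above (the power values are assumed for illustration; this is not the Ioannidis & Trikalinos implementation):

```r
# A paper reports k = 5 significant studies; assume each had 50% power
power <- rep(.50, 5)

# Probability that all five would be significant if every study run were reported
p_all_sig <- prod(power)   # .5^5 = .03125

# Treating that as the test's p-value: below .05, so reject the null of full reporting
p_all_sig
p_all_sig < .05
```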

The excessive significance test was developed by Ioannidis and Trikalinos (.pdf). In psychology it has been popularized by Greg Francis (.html) and Ulrich Schimmack (.html). I have twice been invited to publish commentaries on Francis' use of the test: "It Does not Follow" (.pdf) and "It Really Just Does not Follow" (.pdf).

How p-curve Works
P-curve is a tool that assesses whether, after accounting for p-hacking and file-drawering, a set of statistically significant findings has evidential value. It looks at the distribution of p-values and asks whether that distribution is what we would expect of a set of true findings. In a nutshell, you see more low (e.g., p<.025) than high (e.g., p>.025) significant p-values when an effect is true (for details see www.p-curve.com).
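A quick way to see that intuition is to simulate the p-values of studies of a true effect and keep only the significant ones (a generic sketch with an assumed effect size and sample size, not the p-curve app's code):

```r
set.seed(1)
# Two-cell studies of a true effect (assumed d = .5, n = 50 per cell)
pvals <- replicate(2e4, t.test(rnorm(50, mean = .5), rnorm(50))$p.value)

# p-curve only looks at the significant results
sig <- pvals[pvals < .05]

# Under a true effect, low significant p-values outnumber high ones (right skew)
mean(sig < .025)   # well above 50%
```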

Running both tests
The Excessive Significance Test takes the 31 studies that worked and spits out p=.03: rejecting the null that all studies were reported. It nails it. We know 5 studies were not “reported” and the test infers accordingly. (R Code) [2]

This inference is pointless for two reasons.

First, we always know the answer to the question of whether all studies were published. The answer is always “No.” Some people publish some null findings, but nobody publishes all null findings.

Second, it tells us about researcher behavior, not about the world, and we do science to learn about the world, not to learn about researcher behavior.

The question of interest is not “is there a null finding you are not telling me about?” The question of interest is “do these significant findings you are telling me about have truth value?”

P-curve takes the 31 studies and tells us that, taken as a whole, the studies do support the notion that gain vs. loss framing has an effect on risk preferences.

[Figure: p-curve of the 31 significant Asian Disease replications]

The figure (generated with the online app) shows that consistent with a true effect, there are more low than high p-values among the 31 studies that worked.

The excessive significance test tells you only that the glass is not 100% full.
P-curve tells you whether it has enough water to quench your thirst.


  1. More data: https://osf.io/wx7ck/
  2. Ulrich Schimmack (.pdf) proposes a variation in how the test is conducted, computing power based on each individual effect size rather than pooling. When done this way, the Excessive Significance Test is also significant, p=.01; see R Code link above.

[23] Ceiling Effects and Replications

A recent failure to replicate led to an attention-grabbing debate in psychology.

As you may expect from university professors, some of it involved data.  As you may not expect from university professors, much of it involved saying mean things that would get a child sent to the principal’s office (.pdf).

The hostility in the debate has obscured an interesting empirical question. This post aims to answer that interesting empirical question. [1]

Ceiling effect
The replication (.pdf) was pre-registered; it was evaluated and approved by peers, including the original authors, before being run. The predicted effect was not obtained in two separate replication studies.

The sole issue of contention regarding the data (.xlsx) is that nearly twice as many respondents gave the highest possible answer in the replication as in the original study (about 41% vs. about 23%). In a forthcoming commentary (.pdf), the original author proposes a "ceiling effect" explanation: it is hard to increase something that is already very high.

I re-analyzed the original and replication data to assess this sensible concern.
My read is that the evidence is greatly inconsistent with the ceiling effect explanation.

The experiments
In the original paper (.pdf), participants rated six “dilemmas” involving moral judgments (e.g., How wrong  is it to keep money found in a lost wallet?). These judgments were predicted to become less harsh for people primed with cleanliness (Study 1) or who just washed their hands (Study 2).

The new analysis
In a paper with Joe and Leif (SSRN), we showed that a prominent failure to replicate in economics was invalidated by a ceiling effect. I use the same key analysis here. [2]

It consists of going beyond comparing means and examining all observations instead. The stylized figures below give the intuition. They plot the cumulative percentage of observations for each value of the dependent variable.

The first shows an effect across the board: there is a gap between the curves throughout.
The third shows the absence of an effect: the curves perfectly overlap.

[Stylized figures: three cumulative-distribution plots]
The middle figure captures what a ceiling effect looks like. All values above 2 were brought down to 2, so the lines overlap there, but below the ceiling the gap is still easy to notice.
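A minimal sketch of how such cumulative plots can be drawn in R (the data and variable names, control and treatment, are hypothetical; the post's figures come from the posted .xlsx data):

```r
set.seed(1)
# Hypothetical ratings on a 1-7 scale; the treatment effect gets pushed into the ceiling
control   <- pmax(pmin(round(rnorm(100, mean = 5.0, sd = 1.5)), 7), 1)
treatment <- pmax(pmin(round(rnorm(100, mean = 5.6, sd = 1.5)), 7), 1)

# Cumulative share of observations at or below each scale value, one curve per condition
plot(ecdf(control), verticals = TRUE, do.points = FALSE,
     xlab = "Rating", ylab = "Cumulative share", main = "Ceiling-effect check")
plot(ecdf(treatment), verticals = TRUE, do.points = FALSE, add = TRUE, col = "red")
legend("topleft", legend = c("Control", "Treatment"), col = c("black", "red"), lty = 1)
```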

Let’s now look at real data. Study 1 first: [3]
[Figures: Study 1 cumulative distributions, original and replication]
It is easy to spot the effect in the original data.
It is just as easy to spot the absence of an effect in the replication.

Study 2 is more compelling:
[Figures: Study 2 cumulative distributions, original and replication]

In the original, the effect is largest in the 4-6 range. In the replication, about 60% of the data are in that range, far from the ceiling of 7. But still there is no gap between the lines.

Ceiling analysis by original author
In her forthcoming commentary (.pdf), the original author computes effect size as a percentage and shows it to be smaller in scenarios with higher baseline levels (see her Figure 1). She interprets this as evidence of a ceiling effect.
I don't think that's right.

Dividing something by increasingly larger numbers leads to increasingly smaller ratios, with or without a ceiling. Imagine the effect were constant, completely unaffected by ceiling effects, say a 1-point increase in the morality scale in every scenario. This constant effect would be a smaller percentage in scenarios with a larger baseline; going from 2 to 3 is a 50% increase, whereas going from 9 to 10 is only an 11% increase. [4]

If a store owner gives you $5 off any item, buying a $25 calculator gets you a 20% discount, while buying a $100 jacket gets you only a 5% discount. But there is no ceiling; you are getting $5 off in both cases.

To eliminate the arithmetic confound, I redid this analysis with effect size defined as the difference of means, rather than %, and there was no association between effect size and share of answers at boundary across scenarios (see calculations, .xlsx).

Ceiling analysis by replicators
In their rejoinder (.pdf), the replicators counter by dropping all observations at the ceiling and showing the results are still not significant.
I don’t think that’s right either.

Dropping observations at the boundary lowers power, by a lot, whether there is a ceiling effect or not. In simulations, I saw drops of 30 percentage points and more, say from 50% to 20% power (R Code). So not getting an effect this way does not support the absence of a ceiling-effect problem.
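Here is a rough simulation sketch of that power loss (the effect size, scale, and cell sizes are assumed for illustration; the post links to its own R Code):

```r
set.seed(1)
one_run <- function(n = 50, d = .5, top = 7, drop_ceiling = FALSE) {
  # Latent responses with a true effect, censored at the top of the scale
  control   <- pmin(rnorm(n, mean = 5, sd = 1.5), top)
  treatment <- pmin(rnorm(n, mean = 5 + d * 1.5, sd = 1.5), top)
  if (drop_ceiling) {                      # the rejoinder's approach: discard ceiling scores
    control   <- control[control < top]
    treatment <- treatment[treatment < top]
  }
  t.test(treatment, control)$p.value < .05
}

mean(replicate(5000, one_run(drop_ceiling = FALSE)))  # power keeping all observations
mean(replicate(5000, one_run(drop_ceiling = TRUE)))   # power after dropping ceiling scores: much lower
```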

Tobit
To formally take ceiling effects into account one can use the Tobit model (common in economics for censored data; see Wikipedia). A feature of this approach is that it allows analyzing the data at the scenario level, where the ceiling effect would actually be happening. I ran Tobits on all datasets. The replications still had tiny effect sizes (<1/20th the size of the original), with p-values > .8 (Stata code). [5]
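The post's Tobit models were run in Stata; a roughly equivalent sketch in R, using the AER package's tobit wrapper and simulated data with hypothetical variable names, would look something like this:

```r
library(AER)   # provides tobit(), a wrapper around survival::survreg

set.seed(1)
n <- 200
condition <- rep(c("clean", "neutral"), each = n / 2)           # hypothetical prime conditions
latent    <- 5 + .4 * (condition == "clean") + rnorm(n, sd = 1.5)
rating    <- pmin(latent, 7)                                    # responses pile up at the scale top of 7
d <- data.frame(rating, condition)

# Right-censored Tobit: treats the 7s as censored rather than as exact values
fit <- tobit(rating ~ condition, left = -Inf, right = 7, data = d)
summary(fit)
```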

Authors’ response
Our policy at DataColada is to give drafts of our posts to authors whose work we cover before posting, asking for feedback and providing an opportunity to comment. This causes delays (see footnote 1) but avoids misunderstandings.

The replication authors, Brent Donnellan, Felix Cheung and David Johnson suggested minor modifications to analyses and writing. They are reflected in the version you just read.

The original author, Simone Schnall, suggested a few edits also, and asked me to include this comment from her:

Your analysis still does not acknowledge the key fact: There are significantly more extreme scores in the replication data (38.5% in Study 1, and 44.0% in Study 2) than in the original data. The Tobit analysis is a model-based calculation and makes certain assumptions; it is not based on the empirical data. In the presence of so many extreme scores a null result remains inconclusive.

 


  1. This blog post was drafted on Thursday, May 29th and was sent to the original and replication authors for feedback, offering also an opportunity to comment. The dialogue with Simone Schnall lasted until June 3rd, which is why it appears only today. In the interim, Tal Yarkoni and Yoel Inbar, among others, posted their own independent analyses.
  2. Actually, in that paper it was a floor effect.
  3. The x-axis on these graphs had a typo that we were alerted to by Alex Perrone in August 2014. The current version is correct.
  4. She actually divides by the share of observations at ceiling, but the same intuition and arithmetic apply.
  5. I treat the experiment as nested, with 6 repeated measures for each participant, one per scenario.

[20] We cannot afford to study effect size in the lab

Methods people often say  – in textbooks, task forces, papers, editorials, over coffee, in their sleep – that we should focus more on estimating effect sizes rather than testing for significance.

I am kind of a methods person, and I am kind of going to say the opposite.

Only kind of the opposite because it is not that we shouldn’t try to estimate effect sizes; it is that, in the lab, we can’t afford to.

The sample sizes needed to estimate effect sizes are too big for most researchers most of the time.

With n=20, forget it
The median sample size in published studies in Psychology is about n=20 per cell. [1] There have been many calls over the last few decades to report and discuss effect size in experiments. Does it make sense to push for effect size reporting when we run small samples? I don’t see how.

Arguably the lowest bar for claiming to care about effect size is to distinguish among Small, Medium, and Large effects. And with n=20 we can't even do that.

Cheatsheet: I use Cohen's d to index effect size. d is the number of standard deviations by which the means differ. Small is d=.2, Medium is d=.5, and Large is d=.8.

The figure below shows 95% confidence intervals surrounding Small, Medium and Large estimates when n=20 (see simple R Code).

[Figure: 95% confidence intervals around Small, Medium, and Large effect estimates with n=20 per cell]

Whatever effect we get, we will not be able to rule out effects of a different qualitative size.

Four-digit n’s
It is easy to bash n=20 (please do it often). But just how big an n do we need to study effect size?

I am about to show that the answer has four digits.

It will be rhetorically useful to consider a specific effect size. Let’s go with d=.5. You need n=64 per cell to detect this effect 80% of the time.

If you run the study with n=64, then you will get a confidence interval that will not include zero 80% of the time, but if your estimate is right on the money at d=.5, that confidence interval still will include effects smaller than Small (d<.2) and larger than Large (d>.8). So n=64 is fine for testing whether the effect exists, but not for estimating its size.

Properly powered studies teach you almost nothing about effect size. [2]

[Figure: expected 95% confidence interval around d=.5]
What if we go the extra mile, or three, and power it to 99.9%, running n=205 per cell? This study will almost always produce a significant effect, yet the expected confidence interval is massive, spanning a basically small effect (d=.3) to a basically large effect (d=.7).
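A minimal sketch of these calculations, using base R's power.t.test and a standard large-sample approximation to the standard error of d (the post's exact figures come from its linked R Code):

```r
# Per-cell n needed to detect d = .5 with 80% power
power.t.test(delta = .5, sd = 1, power = .80)$n   # ~64 per cell

# Approximate 95% CI half-width around an observed d with n per cell
ci_half_width <- function(d, n) 1.96 * sqrt(2 / n + d^2 / (4 * n))

ci_half_width(.5, n = 20)    # ~0.63: the CI runs from below 0 to above 1
ci_half_width(.5, n = 64)    # ~0.35: the CI still includes d < .2 and d > .8
ci_half_width(.5, n = 205)   # ~0.20: the CI spans roughly d = .3 to d = .7
```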

To get the kind of confidence interval that actually gives confidence regarding effect size, one that spans, say, ±0.1, we need n=3000 per cell. THREE-THOUSAND (see simple R Code). [3]

In the lab, four-digit per-cell sample sizes are not affordable.

Advocating a focus on effect-size estimation, then, implies advocating for either:
1) Leaving the lab (e.g., mTurk, archival data). [4]
2) Running within-subject designs.

Some may argue effect size is so important we ought to do these things.
But that’s a case to be made, not an implication to be ignored.

UPDATE 2014 05 08: A commentary on this post is available here


  1. Based on the degrees of freedom reported in thousands of test statistics I scraped from Psych Science and JPSP.
  2. Unless you properly power for a trivially small effect by running a gigantic sample.
  3. If you run n=1000, the expected confidence interval spans d=.41 to d=.59.
  4. One way to get big samples is to combine many small samples. Whether one should focus on effect size in meta-analysis is not something that seems controversial enough to be interesting to discuss.

[19] Fake Data: Mendel vs. Stapel

Diederik Stapel, Dirk Smeesters, and Lawrence Sanna published psychology papers with fake data. They each faked in their own idiosyncratic way; nevertheless, their data share something in common. Real data are noisy. Theirs aren't.

Gregor Mendel's data also lack noise (yes, famous peas-experimenter Mendel). Moreover, in a mathematical sense, his data are just as lacking in noise. And yet, while there is no reasonable doubt that Stapel, Smeesters, and Sanna all faked, there is reasonable doubt that Mendel did. Why? Why does the same statistical anomaly make a compelling case against the psychologists but not Mendel?

Because Mendel, unlike the psychologists, had a motive. Mendel’s motive is his alibi.

Excessive similarity
To get a sense for what we are talking about, let's look at the study that first suggested Smeesters was a fabricateur. Twelve groups of participants answered multiple-choice questions. Six were predicted to do well, six poorly.
[Figure: condition means from the retracted Smeesters study]

(See retracted paper .pdf)

Results are as predicted. The lows, however, are too similarly low. Same with the highs. In "Just Post It" (SSRN), I computed that this level of lack of noise had a ~21/100000 chance if the data were real (additional analyses lower the odds to minuscule).

Stapel and Sanna had data with the same problem. Results too similar to be true. Even if population means were identical, samples have sampling error. Lack of sampling error suggests lack of sampling.
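To see how much spread sampling error alone should produce among condition means, here is a generic simulation sketch (the cell size and SD are assumed; this is not the test reported in "Just Post It"):

```r
set.seed(1)
# Six conditions drawn from the SAME population (identical true means), n = 15 per cell, SD = 1
spread <- replicate(1e4, sd(replicate(6, mean(rnorm(15, mean = 0, sd = 1)))))

# Even with identical population means, condition means should differ by roughly
# one standard error (1/sqrt(15) = .26); means far more similar than that are the red flag
mean(spread)            # ~.25 on average
quantile(spread, .01)   # only 1% of honestly sampled datasets are this similar
```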

How Mendel is like Stapel, Smeesters & Sanna
Mendel famously crossed plants and observed the share of baby plants with a given trait. Sometimes he predicted 1/3 would show it, sometimes 1/4. He was, of course, right. The problem, first noted by Ronald Fisher in 1911 (yes, p-value co-inventor Fisher), is that Mendel's data also, unlike real data, lacked sampling error. His data were too close to his predictions.


Recall how Smeesters' data had a ~21/100000 chance if the data were real?
Well, Gregor Mendel's data had a 7/100000 chance (see Fisher's 1936 paper .pdf, especially Table V).

How Mendel is not like Stapel, Smeesters, Sanna
Mendel wanted his data to look like his theory. He had a motive for lacking noise. It made his theory look better.

Imagine Mendel runs an experiment and gets 27% instead of 33% of baby-plants with a trait. He may toss out a plant or two, he may re-run the experiment; he may p-hack. This introduces reasonable doubt. P-hacking is an alternative explanation for Mendel’s anomalous data. (See e.g., Pires & Branco’s 2010 paper suggesting Mendel p-hacked, .pdf).

Sanna, Smeesters, and Stapel, in contrast, lack a motive. The similarity is unrelated to their hypotheses. P-hacking to get their results does not explain the degenerate results they get.

One way to think of this is that Mendel's theory was rich enough to make point predictions, and his data are too similar to these. Psychologists seldom make point predictions; the fabricateurs had their means too similar to each other, not to theory.

Smeesters, for instance, did not need the low conditions to be similar to each other, just different from the high ones, so p-hacking would not get his data to lack noise. [1] Worse: p-hacking makes Smeesters' data look more aberrant.

Before resigning his tenured position, Smeesters told the committee investigating him that he merely dropped extreme observations aiming for better results. If this were true, if he had p-hacked that way, his low means would look too different from each other, not too similar. [2]

If Mendel p-hacked, his data would look the way they look.
If Smeesters p-hacked, his data would look the opposite of the way they look.

This gives reasonable doubt to Mendel being a fabricateur, and eliminates reasonable doubt for Smeesters.

CODA: Stapel’s dissertation
Contradicting the committee that investigated him, Stapel has indicated that he did not fake his dissertation (e.g., in this NYTimes story .pdf).

Check out this table from a (retracted) paper based on that dissertation (.pdf):
[Table: condition means from the retracted 1996 JPSP paper]

These are means from a between-subject design with n<20 per cell. Such small samples produce so much noise as to make it very unlikely to observe means this similar to each other.

Most if not all studies in his dissertation look this way.

 



  1. Generic file-drawering of n.s. results will make means appear slightly surprisingly similar among the significant subset. But not enough to raise red flags (see R code). Also, it will not explain the other anomalies in Smeesters' data (e.g., lack of round numbers, negative correlations in valuations of similar items).
  2. See details in the Nonexplanations section of "Just Post It," SSRN.

[17] No-way Interactions

This post shares a shocking and counterintuitive fact about studies looking at interactions where effects are predicted to get smaller (attenuated interactions).

I needed a working example and went with Fritz Strack et al.'s (1988, .pdf) famous paper [933 Google cites], in which participants rated cartoons as funnier if they saw them while holding a pen with their teeth (facilitating smiles) vs. their lips (inhibiting them).

The paper relies on a sensible and common tactic: Show the effect in Study 1. Then in Study 2 show that a moderator makes it go away or get smaller. Their Study 2 tested if the pen effect got smaller when it was held only after seeing the cartoons (but before rating them).

In hypothesis-testing terms the tactic is:

Study | Statistical test | Example
#1 | Simple effect | People rate cartoons as funnier with the pen held in their teeth vs. lips.
#2 | Two-way interaction | But less so if they hold the pen after seeing the cartoons.

This post’s punch line:
To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.

Power discussions get muddied by uncertainty about effect size. The blue fact above is free of this problem: whatever power Study 1 had, at least twice as many subjects per cell are needed in Study 2 to maintain it. We know this because we are testing the reduction of that same effect.
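A quick simulation sketch of the blue fact (the effect size and cell size are assumed; the post links its own derivations and simulations):

```r
set.seed(1)
d <- .5; n1 <- 50   # Study 1: two cells of n1, true effect d

# Study 1: power of the simple effect
power.t.test(n = n1, delta = d, sd = 1)$power     # ~.70

# Study 2: 2x2 design in which the moderator eliminates the effect
sim_interaction <- function(n) {
  y <- c(rnorm(n, d), rnorm(n, 0),   # moderator level A: effect present (teeth vs. lips = d)
         rnorm(n, 0), rnorm(n, 0))   # moderator level B: effect absent
  pen <- rep(rep(c("teeth", "lips"), each = n), times = 2)
  mod <- rep(c("A", "B"), each = 2 * n)
  summary(lm(y ~ pen * mod))$coefficients["penteeth:modB", "Pr(>|t|)"] < .05
}

mean(replicate(3000, sim_interaction(n1)))       # same per-cell n as Study 1: power drops sharply
mean(replicate(3000, sim_interaction(2 * n1)))   # double the per-cell n: back to Study 1's power
```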

Study 1 with the cartoons had n=31 per cell. [1] Study 2 hence needed to increase to at least n=62 per cell, but instead the authors decreased it to n=21. We should not make much of the fact that the interaction was not significant in Study 2 (Strack et al. do, interpreting the n.s. result as accepting the null of no effect and hence as evidence for their theory).

The math behind the blue fact is simple enough (see math derivations .pdf | R simulations| Excel Simulations).
Let’s focus on consequences.

A multiplicative bummer
Twice as many subjects per cell sounds bad. But it is worse than it sounds. If Study 1 is a simple two-cell design, Study 2 typically has at least four cells (a 2×2 design).
If Study 1 had 100 subjects total (n=50 per cell), Study 2 needs at least 50 × 2 × 4 = 400 subjects total.
If Study 2 instead tests a three-way interaction (attenuation of an attenuated effect), it needs 50 × 2 × 2 × 8 = 1,600 subjects.

With between-subject designs, two-way interactions are ambitious. Three-ways are more like no-way.

How bad is it to ignore this?
VERY.
Running Study 2 with the same per-cell n as Study 1 lowers power by ~1/3.
If Study 1 had 80% power, Study 2 would have 51%.

Why do you keep saying at least?
Because I have assumed the moderator eliminates the effect. If it merely reduces it, things get worse. Fast. If the effect drops by 70%, instead of 100%, you need FOUR times as many subjects in Study 2, again, per cell. If a two-cell Study 1 has 100 total subjects, the 2×2 Study 2 needs 800.

How come so many interaction studies have worked?
In order of speculated likelihood:

1) p-hacking: many interactions are post-dicted ("Bummer, p=.14. Do a median split on father's age… p=.048, nailed it!") or, if predicted, obtained by dropping subjects, measures, or conditions.

2) Bad inferences: Very often people conclude an interaction 'worked' if one effect is p<.05 and the other isn't. Bad reasoning allows underpowered studies to "work."
(Gelman & Stern explain the fallacy .pdf; Nieuwenhuis et al. document that it is common .pdf)

3) Cross-overs: Some studies examine whether an effect reverses rather than merely goes away; those may need only 30%-50% more subjects per cell.

4) Stuff happens: even if power is just 20%, 1 in 5 studies will work.

5) Bigger ns: Perhaps some interaction studies have run twice as many subjects per cell as their Study 1s, or Study 1 was so highly powered that not doubling n still led to decent power.


(you can cite this blogpost using DOI: 10.15200/winn.142559.90552)



  1. Study 1 was a three-cell design, with a pen-in-hand control condition in the middle. Statistical power of a linear trend with three n=30 cells is virtually identical to a t-test on the high-vs-low cells with n=30. The blue fact applies to the cartoons paper all the same.

[15] Citing Prospect Theory

Kahneman and Tversky’s (1979) Prospect Theory (.pdf), with its 9,206 citations, is the most cited article in Econometrica, the prestigious journal in which it appeared. In fact, it is more cited than any article published in any economics journal. [1]

Let's break it down by year.
[Figure: Prospect Theory citations per year]

To be clear, this figure shows that just in 2013, Prospect Theory got about 700 citations.

Kahneman won the Nobel prize in Economics in 2002. This figure suggests a Nobel bump in citations. To examine whether the Nobel bump is real, I got citation data for other papers. I will get to that in about 20 seconds. Let’s not abandon Prospect Theory just yet.

Fan club.
Below we see which researchers and which journals have cited Prospect Theory the most. Leading the way is the late Duncan Luce with his 38 cites.

[Figure: researchers who have cited Prospect Theory the most]

If you are wondering, Kahneman would be ranked 14th with 24 cites, and Tversky 15th with 23. Richard Thaler comes in 33rd place with 16, and Dan Ariely in 58th with 12.

How about journals?

[Figure: journals that have cited Prospect Theory the most]

I think the most surprising top-5 entry is Management Science; it only recently started its Behavioral Economics and Judgment & Decision Making departments.

Not drinking the Kool-Aid
The first article to cite Prospect Theory came out the same year, 1979, in Economics Letters (.pdf). It provided a rational explanation for risk attitudes differing for gains and losses. The story is perfect if one is willing to make ad-hoc assumptions about the irreversibility of decisions and if one is also willing to ignore the fact that Prospect Theory involves small-stakes decisions. Old school.

Correction: an earlier version of this post indicated the Journal of Political Economy did not cite Prospect Theory until 2005; in fact it was cited as early as 1987 (.html).

About that Nobel bump
The first figure in this post suggests the Nobel led to more Prospect Theory cites. I thought I would look at other 1979 Econometrica papers as a "placebo" comparison. It turned out that they also showed a marked and sustained increase in the early 2000s. Hm?

I then realized that Heckman's famous "Sample Selection as Specification Error" paper was also published in Econometrica in 1979 (good year!) and Heckman, it turns out, got the Nobel in 2000, so my placebo was no good. Whether the bump was real or spurious, it was expected to show the same pattern.

So I used Econometrica 1980. The figure below shows that, deflating Prospect Theory cites by cites of all articles published in Econometrica in 1980, the same Nobel bump pattern emerges. Before the Nobel, Prospect Theory was getting about 40% as many cites per year as all 1980 Econometrica articles combined. Since then that share has been rising; in 2013 they were nearly tied.

[Figure: Prospect Theory cites as a ratio of cites to all 1980 Econometrica articles]

Let's take this out of sample: did other econ Nobel laureates get a bump in citations? I looked for laureates from different time periods whose award I thought could be tied to a specific paper.

There seems to be something there, though the Coase cites started increasing a bit early and Akerlof’s a bit late. [2]

[Figures: yearly citations to Coase (1960) and Akerlof's "Lemons" (1970)]



  1. Only two psychology articles have more citations: the Baron and Kenny paper introducing mediation (.pdf) with 21,746, and Bandura's on Self-Efficacy (.html) with 9,879.
  2. The papers: Akerlof, 1970 (.pdf) & Coase, 1960 (.pdf).

[13] Posterior-Hacking

Many believe that while p-hacking invalidates p-values, it does not invalidate Bayesian inference. Many are wrong.

This blog post presents two examples from my new "Posterior-Hacking" (SSRN) paper showing that selective reporting invalidates Bayesian inference as much as it invalidates p-values.

Example 1. Chronological Rejuvenation experiment
In "False-Positive Psychology" (SSRN), Joe, Leif, and I ran experiments to demonstrate how easy p-hacking makes it to obtain statistically significant evidence for any effect, no matter how untrue. In Study 2 we "showed" that undergraduates randomly assigned to listen to the song "When I'm Sixty-Four" became 1.4 years younger (p<.05).

We obtained this absurd result by data-peeking, dropping a condition, and cherry-picking a covariate. p-hacking allowed us to fool Mr. p-value. Would it fool Mrs. Posterior also? What happens if we take the selectively reported result and feed it to a Bayesian calculator?

The figure below shows traditional and Bayesian 95% confidence intervals for the above-mentioned 1.4-years-younger chronological rejuvenation effect. Both point just as strongly (or weakly) toward the absurd effect existing. [1]

[Figure: traditional and Bayesian 95% confidence intervals for the chronological rejuvenation effect]

When researchers p-hack they also posterior-hack

Example 2. Simulating p-hacks
Many Bayesian advocates propose concluding an experiment suggests an effect exists if the data are at least three times more likely under the alternative than under the null hypothesis. This "Bayes factor > 3" approach is philosophically different from, and mathematically more complex than, computing p-values, but it is in practice extremely similar to simply requiring p<.01 for statistical significance. I hence ran simulations assessing how p-hacking facilitates getting p<.01 vs. getting a Bayes factor > 3. [2]

I simulated difference-of-means t-tests p-hacked via data-peeking (getting n=20 per cell, going to n=30 if necessary), cherry-picking among three dependent variables, dropping a condition, and dropping outliers. See R-code.
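Here is a stripped-down sketch of just the data-peeking piece, using the BayesFactor package's default Bayesian t-test (the stopping rule is a simplified assumption, and this is not the paper's full simulation):

```r
library(BayesFactor)   # ttestBF(): default Bayesian two-sample t-test
set.seed(1)

one_peeking_run <- function() {
  x <- rnorm(20); y <- rnorm(20)            # the null is true: both cells from the same population
  if (t.test(x, y)$p.value >= .01) {        # peek: not yet "significant", so add 10 per cell
    x <- c(x, rnorm(10)); y <- c(y, rnorm(10))
  }
  bf <- extractBF(ttestBF(x = x, y = y))$bf
  c(p_01 = t.test(x, y)$p.value < .01,      # false positive by the p < .01 criterion?
    bf_3 = bf > 3)                          # false positive by the Bayes factor > 3 criterion?
}

rowMeans(replicate(5000, one_peeking_run()))   # both false-positive rates are inflated by the peeking
```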

[Figure: false-positive rates for p<.01 and Bayes factor > 3 under each form of p-hacking]
By adding 10 observations to samples of size n=20, a researcher can increase her false-positive rate from the nominal 1% to 1.7%. The probability of getting a Bayes factor > 3 is a comparable 1.8%. Combined with other forms of p-hacking, the ease with which a false finding is obtained increases multiplicatively. A researcher willing to engage in any of the four forms of p-hacking has a 20.1% chance of obtaining p<.01, and a 20.8% chance of obtaining a Bayes factor > 3.

When a researcher p-hacks, she also Bayes-factor-hacks.

Everyone needs disclosure
Andrew Gelman and colleagues, in their influential Bayesian textbook, write:

"A naïve student of Bayesian inference might claim that because all inference is conditional on the observed data, it makes no difference how those data were collected, […] the essential flaw in the argument is that a complete definition of 'the observed data' should include information on how the observed values arose […]"
(p. 203, 2nd edition)

Whether doing traditional or Bayesian statistics, without disclosure, we cannot evaluate evidence.



  1. The Bayesian confidence interval is the "highest density posterior interval," computed using Kruschke's BMLR (.html).
  2. This equivalence is for the default alternative; see Table 1 in Rouder et al., 2009 (.html).

[9] Titleogy: Some facts about titles

Naming things is fun. Not sure why, but it is. I have collaborated in the naming of people, cats, papers, a blog, its posts, and in coining the term “p-hacking.” All were fun to do. So I thought I would write a Colada on titles.

To add color I collected some data. At the end what I wrote was quite boring, so I killed it, but the facts seemed worth sharing. Here they go, in mostly non-contextualized prose.

Cliché titles
I dislike titles with (unmodified) idioms. The figure below shows how frequent some of them are in the Web of Science archive.
[Figure: number of papers using various cliché titles]
Ironically, the most popular (I found), at 970 papers, is “What’s in a name?” …Lack of originality?

Colonization
A colleague once shared his disapproval of the increase in the use of colons in titles. With this post as an excuse, I used Mozenda to scrape ~30,000 psychology paper titles published across 19 journals over 40 years, and computed the fraction including a colon. “Colleague was Wrong: Title Colonization Has Been Stable at about 63% Since the 1970s.” [1]

That factoid took a couple of hours to generate. Data in hand I figured I should answer more questions. Any sense of coherence in this piece disappears with the next pixel.

Have titles gotten longer over time? 
[Figure: average title length by publication year]
Yes. At about 1.5 characters per year (or a tweet a century).
note: controlling for journal fixed effects.

Three less obvious questions to ask
Question 1. What are the two highest scoring Scrabble words used in a Psychology title?
[Figure: highest-scoring Scrabble words in psychology titles]
Hypnotizability (37 points), it turns out, is used in several articles. [2]
Ventriloquized (36 points) appears only in this paper.

Question 2. What is the most frequent last-word in a Psychology paper title?
(try guessing before reading the next line)

This is probably the right place to let you know that the Colada has a Facebook page now.

Winner: 137 titles end with "Tasks"
Runner-up: 70 titles end with "Effect"

Question 3. What’s more commonly used in a Psychology title, “thinking” or “sex”?
Not close.

Sex: 407.
Thinking: 172.

Alright, that's not totally fair; in psychology, sex often refers to gender rather than the activity. Moreover, thinking (172) is, as expected for academic papers, more common than doing (44).
But memory blows sex, thinking, and doing combined out of the water with 2,008 instances; one in 15 psychology titles includes the word memory.


  1. I treated the Journal of Consumer Research as a psychology journal, a decision involving two debatable assumptions.
  2. Shane Frederick indicated via email that this is a vast underestimate that ignores tripling of points; Hypnotizability could get you 729 points.

[4] The Folly of Powering Replications Based on Observed Effect Size

It is common for researchers running replications to set their sample size assuming the effect size the original researchers got is correct. So if the original study found an effect size of d=.73, the replicator assumes the true effect is d=.73 and sets the sample size so as to have, say, a 90% chance of getting a significant result.

This apparently sensible way to power replications is actually deeply misleading.

Why Misleading?
Because of publication bias. Given that (original) research is only publishable if it is significant, published research systematically overestimates effect size (Lane & Dunlap, 1978). For example, if sample size is n=20 per cell, and true effect size is d=.2, published studies will on average estimate the effect to be d=.78. The intuition is that overestimates are more likely to be significant than underestimates, and so more likely to be published.
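A quick simulation sketch of that Lane & Dunlap point, using the numbers from the example above (n = 20 per cell, true d = .2):

```r
set.seed(1)
n <- 20; true_d <- .2

one_study <- function() {
  x <- rnorm(n, mean = true_d); y <- rnorm(n)                   # true effect d = .2 (SD = 1)
  d_obs <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)    # observed Cohen's d
  c(d_obs = d_obs,
    published = t.test(x, y)$p.value < .05 & d_obs > 0)         # "published" = significant, right direction
}

res <- replicate(2e4, one_study())
mean(res["d_obs", ])                          # average observed d across all studies: ~.2
mean(res["d_obs", res["published", ] == 1])   # average among "published" studies: inflated to roughly .78
```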

If we systematically overestimate effect sizes in original work, then we systematically overestimate the power of replications that assume those effects are real.

Let's consider some scenarios. If original research were powered to 50%, a highly optimistic benchmark (Button et al., 2013; Sedlmeier & Gigerenzer, 1989), here is what it looks like:

[Figure: actual power of replications claiming 80%, 90%, or 95% power when original studies have 50% power]
So replications claiming 80% power actually have just 51% (Details | R code).

OK. What if original research were powered at a more realistic level of, say, 35%:
[Figure: actual power of replications when original studies have 35% power]
The figures show that the extent of overclaiming depends on the power of the original study. Because nobody knows what that is, nobody knows how much power a replication claiming 80%, 90% or 95% really has.

A self-righteous counterargument
A replicator may say:

Well, if the original author underpowered her studies, then she is getting what she deserves when the replications fail; it is not my fault my replication is underpowered, it is hers. SHE SHOULD BE DOING POWER ANALYSIS!!!

Three problems.
1. Replications in particular and research in general are not about justice. We should strive to maximize learning, not schadenfreude.

2. The original researcher may have thought the effect was bigger than it is: she thought she had 80% power, but she had only 50%. It is not "fair" to "punish" her for not knowing the effect size she is studying. That's precisely why she is studying it.

3. Even if all original studies had 80% power, most published estimates would be overestimates, and so most replications based on observed effects would still overclaim power. For instance, one in five replications claiming 80% would actually have <50% power (R code).

 

What’s the alternative?
In a recent paper ("Evaluating Replication Results") I put forward a different approach to thinking about replication results altogether. For a replication to fail, it is not enough that p>.05; we also need to conclude the effect is too small to have been detected in the original study (in effect, we need a tight confidence interval around 0). Underpowered replications will tend to fail to reject 0 (be n.s.), but will also tend to fail to reject big effects. In the new approach this result is considered uninformative rather than a "failure to replicate." The paper also derives a simple rule for sample sizes to be properly powered for obtaining informative failures to replicate: 2.5 times the original sample size ensures 80% power for that test. That number is unaffected by publication bias, by how original authors power their studies, and by the study design (e.g., two-proportions vs. ANOVA).

