[120] Off-Label Smirnov: How Many Subjects Show an Effect in Between-Subjects Experiments?

There is a classic statistical test known as the Kolmogorov-Smirnov (KS) test (Wikipedia).

This post is about an off-label use of the KS-test that I don’t think people know about (not even Kolmogorov or Smirnov), and which seems useful for experimentalists in behavioral science and beyond (most useful, I think, for clinical trials and field experiments of policies that could backfire on some people).

If I’m wrong and the off-label use is known, well, that’s a little embarrassing, but please let me know.

Like the t-test, the KS-test can be used to compare two samples¹.

While the t-test involves only differences of means, the KS-test considers any type of difference across samples. So, like the t-test, the KS-test could be statistically significant because the sample means are quite different, but, unlike the t-test, the KS-test could also be statistically significant because the sample variances are quite different, or the share of observations that are exactly 0 are quite different, etc.

The KS test is seldom used, but when it is used, it is used as a a test, as a way to obtain a p-value quantifying whether things are significantly different across groups².

But, there is an off-label use of the KS-test, one that does not use it as a test, but as a way to estimate something. Specifically, it can estimate the share of observations in a between-subjects design that are impacted by the manipulation. So, instead of asking “Is the average of the dependent variable higher in A than in B?”, we can ask “How many people have a higher dependent variable in A than in B”? Again. This is noteworthy because it involves between subjects design where people are only in A or only in B so we don't know if any given individual has a higher DV in A or in B ³.

What does the KS-test do?
The KS test compares the entire distribution of values between samples (the CDFs). It quantifies the biggest observed difference in those distributions. Let's illustrate with data from an old paper of mine (.pdf) in which participants provided their valuation, in a between subject design, for either a $50 Gift Certificate, or for a 50:50 gamble that for sure paid either that $50 certificate, or a $100 one (people, puzzlingly, pay less for the lottery, the objectively superior alternative; this is known as the “Uncertainty Effect”.htm). The figure below has the CDFs. For example, if we were to draw a horizontal line (that unfortunately STATA didn't draw) at the 50^th percentile, the medians are about $10 and $35. About half the people pay $10 or less for the lottery, and about half the people pay $35 or less for the $50 gift certificate.

Fig 1. Distribution of valuations an the D+ and D- in a KS-test.

OK, let's now get to the KS test. It computes the biggest positive and negative differences in these CDFs. The biggest positive difference is about 55%, seen around $10. In the KS-test that's the D⁺ (test) statistic, the biggest positive difference. There is also the D^–, flagged here with the 2^nd arrow; D^–= 18%.

Researchers and statisticians usually ignore those D+ and D- values, they usually do not even report them. But it is those D+ and D- that give us the off-label use. Those D values inform the share of people showing each effect. Returning to Figure 1 above, the Ds mean that at least 55% of participants pay more for the $50 gift certificate than for the lottery, and at least 18% of participants do the opposite.

The KS-test off-label use involves using D+ and D- as bounds on the share of people impacted positively vs negatively by treatment.

I should say I did not discover or prove this fact; a paper published in Econometric Theory in 2010 did. But the authors do not seem to have realized that that's what they proved (they do not mention the KS test in their paper). I wrote my personal journey to acquiring this tidbit of information, but it is long, so I put it behind the green button.

When working on the paper with the data for Figure 1, many years ago, I wanted to estimate how many participants paid more for the gift-certificate than the lottery, and how many did the opposite. But the data were between-subjects, any given participant only valued one of the two, so this was not possible. But I thought perhaps I could “bound” it, so that I could say "at least x% of people paid more for the lottery, and at least y% paid less for it". And I developed an algorithm that did just that. Details in footnote⁴. I then submitted the algorithm idea as a separate paper to the AER (an econ journal). A kind and wise reviewer pointed out that a paper had just been published in the journal "Econometric Theory" providing a more general (and less clunky, my words) solution to the same problem: bounding share of participants showing an effect in between subject designs. I obviously downloaded that paper immediately (.pdf). It was too technical for me to follow in enough detail, but I determined the reviewer was right, and I dropped the project.

But, while scratching my head looking at that Econometric Theory paper, at "lemma 2.1" in particular, something caught my eye which appears to have missed the eyes of the authors of the paper (and anyone else who may have read it). Their formula to find bounds for participants being impacted positively and negatively by treatment, is the same formula as that of the D+ and D- in the KS test.

Here is a visual approximation to what happened on my desk⁵:

I thought “one day I will write about this”, and now, some 15 years later, is that day.

An intuition for this off-label use
The intuition behind the D+ and D- being bounds is actually straightforward, and in my nerdy opinion, intellectually stimulating. Let’s do an example. If we start with a binary dependent variable things are easier.

Let’s say we are studying whether people pay more for red or white wine. We run a study where people are offered either a red wine or a white wine, and are asked: “Would you pay $24 for this bottle?”

Imagine the share of ‘yesses’ is 72% for red and 42% for white.
How many people pay more for red than for white? Well, we don’t know, because for any person we only see their answer for either red or white. But we can bound it.

The only fact to keep in mind is that we estimate a 30% change in the share of people who are willing to pay $24. It's useful to spell out what that 30% of people entails. Those are people that would pay less than $24 for white wine, but would pay more than $24 for red wine. For example someone who would pay $10 for white and $30 for red is there.

It’s possible that 100% of people pay more for red wine in general, but only 30% are in the sweetspot that crosses over $24. For example, if Marjorie pays $10 for White and $20 for red, she shows an effect, but not in our sample, as she would not buy either wine. If Marco pays $26 for White and $30 for red, same idea: exhibits an effect which we do not observe. Because some people exhibit an effect that's not observable, it's possible that 100% of people exhibit the effect.

It’s similarly possible that 75% of people exhibit an effect but we observe only 30%.
It’s similarly possible that 50% of people exhibit an effect but we observe only 30%.
Many things are possible, but usefully, some are impossible⁶.

For example, it is not possible that only 10% of people pay more for red wine. It’s not possible because if only 10% of people paid more for red wine, then we would not see 30% more people paying more than $24 for red than for white, we would see at most a 10% increase, if everyone was in the sweetspot that values white <$24 and red >$24.

This is why we can bound the share showing the effect at 30%.
If 30% more people would pay $24 for red but not for white, then at least 30% of people would pay more for red wine. When you put it this way, it's kind of… …duh! (at least to someone who first thought of this 14 years ago).

Alright, but what if the dependent variable is continuous though? Can we still bound it?
Yes! And it's similarly intuitive. You dichotomize the DV, then do the same duh thing.

Say in our wine study we asked how much they would pay for either the red or the white wine, instead of asking only whether they would pay $24. So we get our data which is a bunch of dollar values instead of yesses and nos, some $29 here, some $7 there. How do we use dichotomization to find the bounds?

Say we dichotomize at $10. We compute how many dollar amounts are above and below that in the two conditions, and find that 80% of people pay $10 or more for red wine, and 55% for white wine. OK, so the duh thing tells us that then at least 25% of people pay more for red. Fine, but why $10? That was arbitrary. OK. We can try every possible price. Dichotomize at $1, then at $2, then at $3…. For each dichotomization you have one estimate, take the biggest of those numbers, and you have D+ from the KS test: the biggest vertical gap in the CDF. And hopefully you now see why that number bounds the share of people showing the effect, because my attempt at an explanation ends with this period.

This may feel p-hacky, the KS test is finding the cutoff by trying everything in the data. Well, it is p-hacky.
But the KS-test takes that into account when it computes its p-value, it adjusts for the 'multiple comparisons' so to speak. Indeed, it is usually considered a conservative test (under-rejects the null).

In Sum
The KS-test computes all the possible dichotomizations you could do of the dependent variable, and keeps track of the two dichotomizations producing the biggest differences between conditions in the cumulative shares of observations. These values, D+ and D- respectively, are bounds on the share of people influenced by the treatment positively and negatively.

Is this useful?
Maybe.

It’s a bound, so it will tend to be a low number. That is, the true value could be that 70% of people are influenced, and the bound could easily be just 20%. Bounds are like that. "At least 1 person will love this post" is not that informative a statement about how good this post is. So I lean to thinking this is more interesting than useful. Though it is an empirical question that depends on the shape of the distributions produced by treatments. I suspect it is potentially useful in two scenarios.

Scenario 1: when the majority of people in a study show an effect in one direction, but a minority of people show a strong effect in the opposite direction, and you care to realize this may be happening. For example, some antidepressants help most patients but exacerbate depression in a minority. The off-label Smirnov would be useful there to pick up that heterogeneity and prevent some disasters that have happened. So seems possibly useful for clinical trials (most people get less depressed, some people get very depressed)⁷

Scenario 2: to remind us that when we show a difference of means we may be focusing on an effect that is shown by a minority of participants. If we got into the habit of reporting off-label KS, we may put more effort in determining whether our findings apply to most, some, or few people.

Author feedback
Our policy (.htm) is to share drafts of blog posts with authors whose work we discuss, in order to solicit suggestions for things we should change prior to posting, I emailed the authors of the Journal of Econometrics authors, but they did not reply. I did not contact Kolmogorov or Smirnov as they died in 1987 and 1966 respectively.

Subscribe to Blog via Email

Footnotes

1
The KS-test can also be used to compare one sample to a theoretical distribution, e.g., to assess if a sample is statistically significantly not-normally distributed. We relied on it to when assessing whether the Ariely insurance driving data were uniformly distributed, see footnote 5 in Colada[98]
2
Or one group, see Footnote 1
3
Technically, the KS test does not estimate the share showing an effect, it bounds it, meaning it tells you its smallest possible value, rather than its expected value. More on this later.
4
The algorithm involved graph theory and matching participants across conditions ('imagine subject 1 in the control is subject 17 in the treatment, would they have paid more or less for the lottery?), and doing that for every participant. It did not require actually doing all possible combinations, the algorithm sorted the data optimally and did it just once.
5
In Lemma 2.1 we need to make delta=0 to get the KS D+ and D- statistic. Delta is the size of the effect, so if delta=0, we are asking how many show a positive vs negative effect. If we make delta=1 we bound the share of participants showing an effect of at least $1, vs less than $1. It is easy to achieve that with the KS-test too, just add delta to the condition, and rerun the KS-test; that is, the KS test can also bound the share of people showing an effect of a particular size or larger
6
For ease of exposition I use the term possible/impossible, but this is ignoring sampling error, so the statements are probabilistic, requiring statistical inference. One way of thinking about it is that D+ and D- are estimates with random error, so D+ is the estimated bound, with random error
7
would this be more or less useful than quantile tests? I don't know

Subscribe to Blog via Email

Footnotes

Related