March 2019 By Jared Wilber

The Permutation Test

A Visual Explanation of Statistical Testing

Statistical tests, also known as hypothesis tests, are used in the design of experiments to measure the effect of some treatment(s) on experimental units. They are employed in a large number of contexts: Oncologists use them to measure the efficacy of new treatment options for cancer. Google uses them to determine which shade of blue is most effective for outgoing links. And entomologists use them to study the sex habits of flies. (Proof that, yes, statistics is definitely very sexy.)

Unfortunately, a lot of statistical tests require complex assumptions and convoluted formulas. This is especially true of the methods taught in introductory courses, which gives the false impression that experimental design is boring and unintuitive. But fret not, my valued reader - not all tests are so bad! In what follows, I present a visual explanation of the permutation test: an awesome nonparametric test that is light on assumptions, widely applicable, and very intuitive.

You're An Alpaca Shepherd Now

You've finally achieved your lifelong dream: you're an alpaca shepherd. And as any alpaca shepherd will tell you, your foremost concern is the wool quality of your herd. (This may or may not be true.)

Word on the street in Cusco is that a popular new shampoo increases the wool quality of your alpaca. But you're no sucker - you're going to find out for sure. You're going to test the difference with statistics.

In statistical testing, we structure experiments in terms of null & alternative hypotheses. Our test will have the following hypothesis schema:

H0: μ_treatment ≤ μ_control
HA: μ_treatment > μ_control

Our null hypothesis claims that the new shampoo does not increase wool quality. The alternative hypothesis claims the opposite: the new shampoo yields superior wool quality.


As a first step, we randomly assign half of our sampled alpaca to the new shampoo, and half to the old.

We say that the alpaca receiving the new shampoo belong to the treatment group, and the others to the control group. The assignment of an alpaca to a given shampoo is known as its treatment assignment.

Randomization of treatment assignment is very important. It removes bias and confounding from our results, and provides the basis for the theory underpinning our statistical test.
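For the code-inclined, here is a minimal sketch of random assignment in Python. The herd size, the alpaca names, and the `assign_treatments` helper are all made up for illustration; only the idea (shuffle, then split in half) comes from the text above.

```python
import random

def assign_treatments(units, seed=None):
    """Randomly split units into two equal-sized groups (hypothetical helper)."""
    rng = random.Random(seed)      # seeded only so the example is repeatable
    shuffled = list(units)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# A made-up herd of 12 alpaca.
herd = [f"alpaca_{i}" for i in range(1, 13)]
groups = assign_treatments(herd, seed=42)

print("New shampoo:", groups["treatment"])
print("Old shampoo:", groups["control"])
```

Every alpaca lands in exactly one group, and which group it lands in depends on nothing but the shuffle.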

Response Values

After giving each alpaca its designated shampoo, we determine if the new shampoo has any effect on wool quality.

In statistics jargon, every experimental unit has a response value. For us, each alpaca is an experimental unit, and its measure of wool quality after using its assigned shampoo is its response value.

We can eyeball these values ourselves and get a feel for any perceived differences between the two shampoos. However, we'll need a more rigorous method to determine if the differences are statistically significant.

Test Statistic

To determine whether or not the new shampoo really is effective, we need a way to quantify the difference between our two groups with a single number.

Luckily for us, such a numerical summary exists: the test statistic.

A benefit of the permutation test is that it allows us to use any numerical value that we want for our test statistic. (Many other tests require complex, specific test statistics.) Because our analysis is fairly straightforward, we'll simply use the difference in mean response values between the two shampoos:

Test Statistic = μ_treatment - μ_control

To obtain our initial test statistic, we simply subtract the mean wool quality of the alpacas that did not use the new shampoo (control group) from the mean wool quality of the alpacas that did (treatment group).
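As a small sketch of that calculation, assuming some made-up wool-quality scores (the numbers and the `mean_difference` name are illustrative, not data from this experiment):

```python
from statistics import mean

def mean_difference(treatment, control):
    """Test statistic: treatment mean minus control mean."""
    return mean(treatment) - mean(control)

# Hypothetical wool-quality scores, four alpaca per group.
treatment_scores = [7.0, 8.0, 9.0, 8.0]   # new shampoo
control_scores = [6.0, 7.0, 7.0, 6.0]     # old shampoo

observed = mean_difference(treatment_scores, control_scores)
print(observed)  # 8.0 - 6.5 = 1.5
```

A positive value means the treatment group scored higher on average, which is the direction our alternative hypothesis cares about.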

The 'P' in 'Permutation'

Enter the most important step of the permutation test, as well as its namesake. (It's also called the 'randomization test'.)

While keeping the same response values we received earlier, we permute (shuffle) the treatment assignments of our alpaca, and re-calculate our test statistic.

We do this because we analyze the results of our experiment relative to the null hypothesis, which posits the new shampoo as having no benefit on wool quality.

While this may seem a bit odd, the logic is quite simple: if the new shampoo truly doesn't improve wool quality, shuffling the shampoo label of our alpaca and recalculating our test statistic won't matter - we'll obtain similar wool quality values for both groups.
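A single shuffle might look like the sketch below. The scores are the same kind of made-up values as before, and `permuted_statistic` is a hypothetical helper: it keeps every response value, reshuffles which ones count as "treatment", and recomputes the mean difference.

```python
import random
from statistics import mean

def permuted_statistic(responses, n_treatment, rng):
    """Shuffle the group labels once and recompute the mean difference."""
    shuffled = list(responses)     # the response values themselves never change
    rng.shuffle(shuffled)          # ...only who is labelled 'treatment' does
    return mean(shuffled[:n_treatment]) - mean(shuffled[n_treatment:])

# Hypothetical scores: first four were treatment, last four control.
responses = [7.0, 8.0, 9.0, 8.0, 6.0, 7.0, 7.0, 6.0]

rng = random.Random(0)             # seeded only for repeatability
one_shuffle = permuted_statistic(responses, 4, rng)
print(one_shuffle)
```

If the shampoo label carries no information, values like this one should look a lot like the statistic we computed from the real labels.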

More Permutations

We repeat this process, permuting our data over and over again, and recalculate a test statistic at each iteration.

Ideally, we'd calculate a test statistic for every possible permutation of shampoo assignment among our alpaca. This would create an exact distribution of all possible test statistics under our null hypothesis.
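For a tiny herd, the exact version is feasible. Here is a sketch using made-up scores and a hypothetical `exact_null_distribution` helper. It enumerates every way of choosing which alpaca form the treatment group; reorderings within a group don't change the group means, so enumerating the splits is enough.

```python
from itertools import combinations
from statistics import mean

def exact_null_distribution(responses, n_treatment):
    """Mean difference for every possible choice of treatment group."""
    indices = range(len(responses))
    stats = []
    for chosen in combinations(indices, n_treatment):
        chosen_set = set(chosen)
        treatment = [responses[i] for i in chosen]
        control = [responses[i] for i in indices if i not in chosen_set]
        stats.append(mean(treatment) - mean(control))
    return stats

# Hypothetical wool-quality scores for 8 alpaca.
responses = [7.0, 8.0, 9.0, 8.0, 6.0, 7.0, 7.0, 6.0]
exact_stats = exact_null_distribution(responses, 4)

print(len(exact_stats))   # 70 ways to split 8 alpaca into two groups of 4
print(max(exact_stats))   # the largest mean difference any split can produce
```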

Unfortunately, the number of possible permutations is often far too large to compute in practice. No worries! Instead, we'll sample enough random permutations to build an approximation to our distribution, which works well in practice.
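To get a feel for the scale, and for the sampling shortcut, here is a rough sketch. The helper names and the scores are assumptions for illustration; the 200 repetitions match the number of test statistics used later in this article.

```python
import math
import random
from statistics import mean

def count_assignments(n_units, n_treatment):
    """Number of distinct ways to choose the treatment group."""
    return math.comb(n_units, n_treatment)

def sample_null_distribution(responses, n_treatment, n_permutations, seed=None):
    """Approximate the null distribution with random shuffles."""
    rng = random.Random(seed)
    pool = list(responses)
    stats = []
    for _ in range(n_permutations):
        rng.shuffle(pool)
        stats.append(mean(pool[:n_treatment]) - mean(pool[n_treatment:]))
    return stats

print(count_assignments(20, 10))    # 184756 splits for just 20 alpaca
print(count_assignments(100, 50))   # astronomically many for 100

# Hypothetical scores again; 200 random shuffles instead of every split.
responses = [7.0, 8.0, 9.0, 8.0, 6.0, 7.0, 7.0, 6.0]
null_stats = sample_null_distribution(responses, 4, n_permutations=200, seed=1)
print(len(null_stats))
```

More shuffles give a smoother approximation at the cost of more computation.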

Test Statistic Distribution

Eventually, after some sufficient number of permutations, we create the approximate test statistic distribution.

This distribution approximates all possible test statistic values we could have seen under the null hypothesis. We can then use this distribution to obtain probabilities associated with different mean-difference values (or whatever calculation you used for your test statistic), where we assume that wool quality does not increase with the new shampoo.

By observing where our initial test statistic falls within this distribution, we obtain the final piece for our test: the magical p-value.

The P-Value

A p-value represents the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. For us, it's the probability of obtaining a difference in wool quality at least as large as the one we saw, assuming the new shampoo did not increase wool quality.

To determine the outcome of our test, we compare our p-value to a significance level. This should be determined a priori, but we'll just say ours is 10%. If the p-value is less than or equal to the significance level, we reject the null hypothesis; the outcome is said to be statistically significant.

For us, a low p-value signals that, assuming the null hypothesis is true, a difference in wool quality at least as large as the one we observed would be unlikely. A high p-value signals the opposite: such an outcome is likely under the null hypothesis.

Our Results

To calculate the p-value for a permutation test, we simply count the number of test statistics that are at least as extreme as our initial test statistic, and divide that number by the total number of test statistics we calculated.

In our case, only sixteen out of our two hundred test statistics were at least as extreme as our initial test statistic.

Thus, our p-value is:

P-Value = 16 / 200

    = 0.08

    = 8%
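The same counting rule, sketched in Python. The `permutation_p_value` name is an assumption, and the list of shuffled statistics is a stand-in built to mirror the 16-out-of-200 count above, not the actual values from the experiment. Because our alternative hypothesis is one-sided, "as or more extreme" here means "greater than or equal to".

```python
def permutation_p_value(observed, null_statistics):
    """Share of shuffled statistics at least as large as the observed one."""
    extreme = sum(1 for s in null_statistics if s >= observed)
    return extreme / len(null_statistics)

# Stand-in values: 16 shuffles at or above the observed statistic, 184 below.
observed = 1.5
null_statistics = [2.0] * 16 + [0.0] * 184

p_value = permutation_p_value(observed, null_statistics)
significance_level = 0.10

print(p_value)                        # 16 / 200 = 0.08
print(p_value <= significance_level)  # True, so we reject the null hypothesis
```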

In other words, if it's truly the case that the new shampoo doesn't improve wool quality, then a difference in wool quality at least as large as the one we initially observed occurs with a probability of only 8%.

That's a fairly low probability. In fact, at our 10% level of significance, we reject our null hypothesis and accept our alternative: the new shampoo does appear to be increasing wool quality. Time to buy some more!


So that's the permutation test, or at least my attempt at explaining it. Pretty cool. Pretty simple. And, hopefully, pretty intuitive.

To recap, the algorithm comprises three steps:

1). Determine & calculate the initial test-statistic.

2). Construct approximate test-statistic distribution.

3). Calculate the p-value.
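Put together, the three steps fit in one short function. This is a sketch under assumptions rather than a reference implementation: the function names, the default of 200 shuffles, and the scores are all illustrative. The `statistic` argument reflects the point made earlier that any numerical summary can serve as the test statistic.

```python
import random
from statistics import mean, median

def mean_diff(treatment, control):
    return mean(treatment) - mean(control)

def median_diff(treatment, control):
    return median(treatment) - median(control)

def permutation_test(treatment, control, statistic=mean_diff,
                     n_permutations=200, seed=None):
    """One-sided permutation test; returns (observed statistic, p-value)."""
    rng = random.Random(seed)
    # Step 1: the initial test statistic, from the real labels.
    observed = statistic(treatment, control)
    # Step 2: approximate null distribution from shuffled labels.
    pool = list(treatment) + list(control)
    n_treatment = len(treatment)
    null_stats = []
    for _ in range(n_permutations):
        rng.shuffle(pool)
        null_stats.append(statistic(pool[:n_treatment], pool[n_treatment:]))
    # Step 3: share of shuffled statistics at least as large as the observed.
    extreme = sum(1 for s in null_stats if s >= observed)
    return observed, extreme / n_permutations

# Hypothetical wool-quality scores.
new_shampoo = [7.0, 8.0, 9.0, 8.0]
old_shampoo = [6.0, 7.0, 7.0, 6.0]

observed, p_value = permutation_test(new_shampoo, old_shampoo, seed=1)
print(observed, p_value)

# Swapping in a different test statistic is a one-argument change.
observed_med, p_value_med = permutation_test(
    new_shampoo, old_shampoo, statistic=median_diff, seed=1)
print(observed_med, p_value_med)
```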

This was not an exhaustive treatment of statistical testing; some things were left out. But I hope it was helpful in explaining the permutation test and, more broadly, in communicating that statistical testing involves more than just memorizing formulae.

In any case, I just hope that at some point you found yourself muttering aloud to yourself, "Woah, statistics is kind of cool." To which I'd respond yes, anonymous reader - you're damn right it is.


Introduction to Design and Analysis of Experiments (George W. Cobb, 1998)

There is only one test! (Allen Downey, 2011)

Permutation Tests (Thomas Leeper, 2013)


D3.js (Mike Bostock)

Rough.js (Preet Shihn)

scrollama.js (Russell Goldenberg)