Cargo cult of p-values, part 3: a way out
Learn how the cult formed and led to replication crisis in part 1.
Learn about statistical power and why no one likes statistics in part 2.
The best thing to come out of the replication crisis is the replicability index. I have to say, I’m proud of humanity: the crisis started in 2011, and we got a full-fledged manual for the index by 2016.
Ulrich Schimmack, its author, has a website about it, where he publishes a lot of his work without a paywall (I know, crazy). You don’t see that often in modern science. The index gives you a mathematical foundation to go “Wait a minute!” and call out someone’s dirty secret. It’s like a superpower.
The replicability index is based on two simple facts:
Statistical power is the chance to detect an effect, which is another way of saying “a ceiling on the success rate of your studies.”
Power is directly determined by your effect size, sample size, and confidence. We talked about it in part 2, but if you want to go deeper, here’s a calculator and some examples, plus a quick back-of-the-envelope sketch right after this list.
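To make that concrete in product terms, here is a rough sketch of a power calculation for a conversion-rate A/B test, using a plain two-sided normal approximation. Every number in it (baseline, lift, sample size) is made up for illustration, not taken from any article:

```python
# Back-of-the-envelope power for a conversion-rate A/B test,
# using a two-sided normal approximation. All numbers here
# (baseline, lift, sample size) are made up for illustration.
from math import sqrt
from scipy.stats import norm

def ab_test_power(p_control, p_variant, n_per_group, alpha=0.05):
    """Chance to detect the difference p_variant - p_control."""
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_variant * (1 - p_variant) / n_per_group)
    z_crit = norm.ppf(1 - alpha / 2)           # ≈ 1.96 for alpha = 0.05
    z_effect = abs(p_variant - p_control) / se
    return 1 - norm.cdf(z_crit - z_effect)

# 5% baseline conversion, hoping to detect a lift to 5.5%,
# with 10,000 users in each group:
print(round(ab_test_power(0.05, 0.055, 10_000), 2))  # ≈ 0.35
```

Even with 10,000 users per group, a modest lift like that gives you only about a one-in-three chance of a significant result. That ceiling is exactly what the R-index exploits.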
Using those two facts, the R-index takes a series of published results and predicts what will happen if you try to reproduce them.
R-Index = 2 * Median Observed Power – Success Rate
Which basically means: it’s suspicious when the reported success rate doesn’t line up with the statistical power of the tests behind it, because power is what actually limits that success rate. These two should match.
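If you prefer code to formulas, here is a minimal sketch of that calculation, assuming you already have an observed-power estimate for every reported test. The function name and the example numbers are mine, not Schimmack’s:

```python
# A minimal sketch of the R-Index itself, assuming you already have
# an observed-power estimate for every reported test. Names and
# numbers are mine, not Schimmack's.
from statistics import median

def r_index(observed_powers, success_rate):
    """R-Index = 2 * median observed power - success rate."""
    return 2 * median(observed_powers) - success_rate

# Four studies, all reported as successes (success rate = 1.0):
print(r_index([0.65, 0.70, 0.72, 0.80], success_rate=1.0))  # ≈ 0.42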
Let’s apply it to the product world. Imagine you see an article claiming that someone slowed down their website and it increased their conversion rate. It’s weird, it’s questionable, but what if it works? You start searching and find several other product case studies reporting the same thing, and not a single one stating the opposite. Is this real? Now you can figure it out.
If those articles publish the stats of their tests, you can calculate their observed power. Let’s say it’s 70%. The success rate is 100%, since every article says that slowing down the site works. That gives you an
R-Index = 2 * 70% – 100% = 40%
A 40% chance to get a similar result if you run the same test. Not too impressive. Something is certainly afoot here.
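How do you get that 70% out of published articles in the first place? One common shortcut is to convert each reported p-value into a z-score and compute post-hoc power from it. Here is a rough sketch of that pipeline; the p-values are invented so the numbers land near the example above, and Schimmack’s actual tooling is more sophisticated than this:

```python
# Rough pipeline from reported p-values to an R-Index, assuming
# two-sided tests at alpha = 0.05 and a normal approximation for
# observed (post-hoc) power. The p-values are invented so the result
# lands near the 70% / 100% example above.
from statistics import median
from scipy.stats import norm

ALPHA = 0.05
Z_CRIT = norm.ppf(1 - ALPHA / 2)  # ≈ 1.96

def observed_power(p_value):
    """Treat the observed z-score as if it were the true effect."""
    z = norm.ppf(1 - p_value / 2)
    return 1 - norm.cdf(Z_CRIT - z)

p_values = [0.04, 0.013, 0.008]   # hypothetical reported test results
powers = [observed_power(p) for p in p_values]
success_rate = 1.0                # every article reports a win

r = 2 * median(powers) - success_rate
print(round(median(powers), 2), round(r, 2))  # ≈ 0.7 and ≈ 0.4
```

Note that a study that just barely clears p = 0.05 has an observed power of only 50%, so a pile of barely-significant wins drags the R-index down fast.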
Maybe many more people tried to slow down their sites, saw that it didn’t help at all, got embarrassed by the idea, and never talked about it.
Maybe the authors of those articles did some p-hacking to inflate their confidence levels.
Maybe they fished for those results by running as many tests as possible until they found something.
There are a lot of dark research patterns they could’ve used. Do you want to know more about them?
We can’t know what happened or who’s to blame, but we can see that replicating this result is a long shot, which is honestly incredible, given that all we did was run some math.
The R-index is a sort of meta-confidence metric: it does to research results what statistical tests do to individual measurements. If each published result is a data point, the R-index draws conclusions about the trend across those points.
And the predictions will get better over time. Remember Dr. Bem from part 1, who had 8 experiments with successful clairvoyance?
There’s a 52% chance to replicate that.
Schimmack himself acknowledges that’s pretty high and says, “it is not clear how questionable research practices influence the R-Index.”
You know what that means? We’ll get even crazier stats magic in a few years. What a time to be alive.