In Defense Of Irreproducible Results


The dangers of p-values and Klingons

Science is hard. Well, lots of things are hard. Baking a good baguette is hard. Remembering everyone’s birthday (before their birthday) is hard too, as are granite boulders. But science has a special trick up its sleeve that makes it quantifiably hard: the p-value. P-values are supposed to help us identify results that are statistically significant, but getting a low p-value is difficult. In the medical sciences, achieving statistical significance usually means having a good question, a lot of patient samples, and doing your assays and calculations well. The first requires cleverness, the second requires resources, and the third requires diligence. Combining all these traits in one researcher, or even one research team, is hard. I, for one, still haven’t figured out what all the bins in my refrigerator are for, let alone how to manage an entire clinical study workflow.

And to make it worse, the scientific community is now taking a very close look at everyone’s p-values to make sure that these low p-values mean what they’re supposed to mean. Not only should results be statistically significant, they should also be reproducible. Unfortunately, it seems that research results frequently are not reproducible. In fact, according to some, we are in the midst of a “reproducibility crisis”. Various studies have suggested that most published results in medicine and the social sciences are not repeatable, despite having nice p-values in the original study.

Does this reproducibility crisis merit pitchforks and torches, or better study design and a philosophical debate? Why not both?!

Why is this? Is the reproducibility crisis due to a mixture of fraud and lazy science, for which pitchforks, lighted torches, and storming the gates are the best response? Or is it more complicated, requiring more agreement on what constitutes good study design and an understanding of just how reproducible scientific results really should be? Probably mostly the latter, but we can add in a little of the former too, just to keep it interesting.

The use of p-values was standardized by Ronald Fisher in the 1920s to help identify studies whose results are statistically significant. A p-value of less than 0.05 is often used as a threshold, and sometimes one will see it interpreted as, “the chance that our hypothesis was wrong was less than 5%.” That’s incorrect. What it actually means is, “the chance of getting these results (or even more extreme results) if our hypothesis was completely wrong was less than 5%” (there are more exact ways of defining a p-value, but that will suffice). The difference between these interpretations is important.

Let’s take an everyday example. Say you’re a science officer on the starship Enterprise, and your ship has a cargo hold full of the wheat hybrid quadrotriticale, destined for the planet Sherman’s World, whose colonization is under dispute between the Federation and the Klingon Empire. You find that the grain hold has been infested by tribbles, which are eating the grain, and that half the tribbles are dead. You hypothesize that, given the known life cycle of tribbles, many more tribbles are dead than one would expect: the grain was likely poisoned by the Klingons. A statistical test is in order! You count 1000 tribbles and find 454 dead and 546 alive. Given their known life span, and that the entire population descended from two tribbles introduced by a crew member just last week (tribbles breed really fast), you expect no more than 10% to be dead (100 tribbles) if they were living normal, happy tribble lives. A standard statistical test would give us a p-value of less than 0.0001 with these numbers, meaning that if our expected death rate of tribbles were correct, the probability of observing such a large number of dead tribbles is very low.
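For the curious, here is a rough sketch of how one might run that test. I haven’t specified which “standard statistical test” the science officer used; a one-sided binomial test against the expected 10% death rate is one reasonable choice, not the only one:

```python
# A minimal sketch of the tribble mortality test, assuming a one-sided
# binomial test against the expected 10% natural death rate.
from scipy.stats import binomtest

dead, total = 454, 1000          # observed counts from the cargo hold
expected_death_rate = 0.10       # what normal, happy tribble lives should give

result = binomtest(dead, total, expected_death_rate, alternative="greater")
print(f"p-value: {result.pvalue:.2e}")   # vanishingly small, far below 0.0001
```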

A low p-value just rules out that tribbles are dying at the expected rate, but not the possibility that it’s all due to not reversing the polarity of the neutron flow*

Does this mean that the Klingons poisoned our grain? Well, it doesn’t look good for them; the p-value is very low. However, it doesn’t necessarily follow that the Klingons are at fault. As described above, the p-value does not test whether our hypothesis (a lot more tribbles are dead than should occur naturally, probably due to those nefarious Klingons) is correct. Instead, the study just shows that it’s very unlikely that we’d see half of the tribbles dead if they had the expected life cycle. It could be that we have a special breed of short-lived tribbles, or that tribbles hate wheat, or that someone reversed the polarity of the neutron plasma flow that runs right past the grain hold, creating a temporal fold such that time moved differently in there (this latter hypothesis has a good chance of being correct, as any Star Trek fan will attest). In other words, a low p-value doesn’t mean our hypothesis is correct, just that we don’t have a good reason to discount it (yet).

Another way our p-value could lead us astray is if we had only a few tribbles to count. Possibly the grain hold had been opened to space to clean it out, or the Chief Engineer had transported the whole lot to the Klingon ship, and we only had a few tribbles left to measure. Say we found six living tribbles and four dead ones. Given that we expected to find no more than one dead tribble, we still get a significant p-value (p=0.0018). However, we can also ask: given that we tested only a few tribbles, how confident are we that we captured the true proportions in all the tribbles (before they were blown into space)? Another calculation shows that, with 95% confidence, the real proportion of dead tribbles lay somewhere between 10% and 70%. That’s a pretty big range, given that we expected to see 10% dead from natural causes. In contrast, when we had 1000 tribbles to measure, our 95% confidence range was between 42% and 48% dead tribbles in the cargo hold.
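Here is a rough sketch of how those confidence ranges can be computed. I’m using a simple normal-approximation (Wald) interval; other methods (Clopper-Pearson, Wilson) would give slightly different bounds, but the picture is the same:

```python
# A sketch of the confidence-interval comparison, using a normal-approximation
# (Wald) interval for the proportion of dead tribbles.
import math

def wald_ci(dead, total, z=1.96):
    p = dead / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

for dead, total in [(454, 1000), (4, 10)]:
    low, high = wald_ci(dead, total)
    print(f"{total:>4} tribbles: {dead/total:.0%} dead, 95% CI ({low:.0%} to {high:.0%})")
```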

“P-hacking” your way towards an implication that Klingons are responsible?

We have some evidence consistent with the Klingons being involved in sabotaging colonization efforts on Sherman’s World (though no definitive proof has been shown yet), but what has this to do with the current reproducibility crisis in science?

First, reproducibility relies on a large population to study. We have seen that sample size can affect the quality of our results, even if p-values remain small. It is now becoming a standard requirement for publication of a scientific study that more than p-values be reported. In our two scenarios, showing the confidence ranges helps us gauge the reliability of the study:

Number of tribbles | Proportion dead (95% confidence range) | p-value
1000               | 45% (42% to 48%)                        | <0.0001
10                 | 40% (10% to 70%)                        | 0.002

Presented this way, the results with only ten tribbles don’t look very impressive. Before risking war with the Klingon Empire, another test with more tribbles is probably in order. This holds for clinical studies too, of course. A good-looking result with a phenomenal p-value may simply be due to a small sample size. Though it’s still unlikely that the tribbles are dying at the rate predicted by what we know of their life cycle, it wouldn’t be at all surprising to find from a larger sampling that only 20% of the tribbles were dead. That’s more than we thought there would be, but not a very exciting number. It’s probably not worth a war with the Klingons simply because they made tribbles sickly.

Second, reproducibility relies on a good hypothesis. Our low p-value in our tribble measurement study doesn’t mean our hypothesis was correct. Setting aside the plasma-flow-induced temporal fold theory, all we’re really confident in is that the grain is probably related to tribbles dying. Tribbles eating the grain died faster than expected; those that didn’t were fine. Given the circumstances, Klingon sabotage is a good guess, but it could be that other aspects of the grain were involved. This is a new space hybrid, just right for the colonization efforts on this new planet, after all. Therefore it’s possible that an effort to show that Klingons are poisoning grain would not be replicated in another study (on another ship colonizing another world, presumably). However, we now know to be looking at the grain for problems. It may not be Klingon poison, but quadrotriticale may still not be all that it’s supposed to be.

These two factors, which may affect our ability to replicate our tribble study, are why we can say that pitchforks may have some utility in confronting the “reproducibility crisis”. Poor study design and analysis can leave a study, even one with really low p-values, with a low chance of being reproduced. The scientific community therefore has an obligation to find better ways of encouraging proper techniques in analysis and reporting, so that the studies that get published have a better chance of being reproducible. There have been many recent recommendations on better reporting practices and study design to help improve the matter. This is a good thing: wasting money (often public money) on studies that don’t mean anything is inefficient and slows the discovery of ‘real’ results. So to those who resort to “p-hacking”, using just the right statistical test or study subset to hit that magical “p-value <0.05” number, mind our wrath!
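To see just how easy p-hacking is, here is a toy simulation of my own (pure noise, no Klingons involved): if you slice your data twenty different ways and keep whichever comparison looks best, you will cross the magical p < 0.05 threshold most of the time, even when every hypothesis is false:

```python
# A toy illustration of p-hacking: run 20 comparisons on pure noise and keep
# the smallest p-value. The chance of at least one "significant" result is
# about 1 - 0.95**20, roughly 64%, even though every null hypothesis is true.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1701)   # NCC-1701, naturally
n_studies, n_tests, hits = 2000, 20, 0

for _ in range(n_studies):
    best_p = min(
        ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    )
    hits += best_p < 0.05

print(f"Fraction of null 'studies' with at least one p < 0.05: {hits / n_studies:.0%}")
```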

An inability to reproduce a study can have many meanings. Not all suggest incompetence.

However, a lack of reproducibility can have a great number of causes, and an equally great number of interpretations. For example, following up our space wheat/tribble study may not end up showing that Klingons are nefarious grain poisoners. It may be that more studies will show that quadrotriticale is unstable in space and breaks down into a lethal tribble poison (which has also been shown to cause an unsightly space rash** in people). Furthermore, tribbles may be more diverse than expected, and some populations get space sick easily, making them more susceptible to space wheat poison. Our study wasn’t reproducible, then, but it was useful. Initially we had no idea that quadrotriticale might have problems. Following a lead based on the dead tribbles, and an idea that it might be due to poisoned grain, we eventually were able to create a complicated, but reproducible, explanation of how quadrotriticale, space travel, and tribbles interact. The initial study was necessary to guide further studies that eventually rooted out just what’s wrong with space wheat.

For a more down-to-earth example, we once did a study looking at the reproducibility of genetic variants that predict risk of cancer. We weren’t looking at the reproducibility crisis per se, but at whether genetic cancer risk studies performed in one country are applicable to another. We found that, by one standard, they often weren’t; most genetic cancer risk studies did not replicate when performed in a different ethnic group. However, we also found that the basic effect of a genetic variant was consistent between ethnic groups; you might not get a p-value <0.05 in another study, but a marker associated with high risk in one group is more likely to predict high risk rather than low risk in another group. Therefore, it seems that these “irreproducible” studies were still useful: the genetic markers they were testing were not ready for clinical use, but they were pointing at consistent biological effects that, if studied more closely, might eventually be useful in the clinic. In particular, it appeared that the genetic variants that had been tested were probably markers for the real risk alleles, which were close to, but not exactly at, the location of the studied alleles.

Basically we need to ask, “How much reproducibility is correct?” This isn’t just a scientific question, but an ethical one as well.

This leads us to an interesting, though often unasked, question underlying the supposed reproducibility crisis: how much reproducibility should be expected from our research? Accepting low reproducibility allows difficult studies to be performed, at the cost of generating a lot of studies with unclear results. Aiming for high reproducibility saves money and time, but means many scientific questions may lie unanswered.

As we have shown, poor sample size is probably a key contributing factor towards a lack of reproducibility. Studies done in small groups can give an apparent effect size larger than is likely to be found in the real world, and thus are less likely to replicate. Therefore it has been proposed that only studies with a very good prior likelihood of being shown to be true should be performed. Under these criteria, a study measuring only a tiny effect, say for a drug that extends the life span of cancer patients by a few months, or that affects only a small percentage of people, should not be performed. This is an issue that is becoming more and more relevant, especially in cancer, as we are finding that many diseases are actually a constellation of related disorders, each of which may affect only a small number of people.
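A quick simulation illustrates this effect-size inflation (sometimes called the “winner’s curse”). The numbers below are my own illustrative assumptions, not from any real trial: many small studies of a genuinely tiny effect are run, and only the ones reaching p < 0.05 are kept; those “significant” studies report an effect several times larger than the truth:

```python
# A sketch of why small significant studies exaggerate effects: simulate many
# underpowered studies of a true effect of 0.2 standard deviations, then look
# at the average estimated effect among those that reached p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
true_effect, n_per_arm = 0.2, 20     # a tiny effect, a small study
significant_estimates = []

for _ in range(5000):
    control = rng.normal(0.0, 1.0, n_per_arm)
    treated = rng.normal(true_effect, 1.0, n_per_arm)
    if ttest_ind(treated, control).pvalue < 0.05:
        significant_estimates.append(treated.mean() - control.mean())

print(f"True effect: {true_effect}")
print(f"Average effect in 'significant' studies: {np.mean(significant_estimates):.2f}")
```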

Or consider our difficulty in framing a good hypothesis when confronted with the dead tribbles (or the cancer risk gene variants). An initial study may not offer much to base an expected result upon. We can see a clear benefit (a planet is waiting for our space wheat), but frankly, the chance that our first hypothesis is going to be correct and reproducible is small. Our best hope is that it will guide a better study the next time. If reproducibility is the main criterion, however, this study may never be performed.

It’s important to recognize that these criteria for deciding whether to perform a study are not primarily scientific, but philosophical. Is the goal of science to efficiently increase the greater good? Or can there be a moral cause to follow a scientific lead, even if it benefits only a small number of people and may be hard to replicate? The former could be defined as a version of pragmatic utilitarianism, the belief that the most moral act is the one that benefits the most people, with “benefit” being something that can be empirically defined. This methodology has the advantage of being quantifiable: almost anything pragmatic (using a philosophical definition of pragmatism) can get numbers wrapped around it. Numbers are easy to communicate; at least, easier than squishier value statements. However, utilitarianism has, at its root, a belief that “useful” and “morally right” are essentially the same. There is not much room for ideas like justice in this mode of thought. It states that any acts of apparent self-sacrifice, altruism, or love, if morally correct, must be oriented towards the good of all. Any acts that benefit only a few are morally wrong, or at least misdirected. Inefficient research is a form of self-sacrifice. This may seem like a fairly bland form of sacrifice, but given the tightness of research budgets, it’s not the smallest sacrifice one might face.

Science is hard. Not only do we need to get a good p-value at the end of all our work, which requires first formulating a good hypothesis and getting a large enough population to study it in, but we have to be able to defend our study on philosophical grounds. Are we limiting ourselves to studies with large effect sizes and for which we can assemble large patient populations, ensuring a high likelihood that a replication of the study would be successful, and thus maximizing the efficiency of research dollars? Or can we justify experiments solely on the basis that we have a promising lead that may benefit people, even when the odds of replication are low? Or when a hypothesis is just at the beginning stages of being formulated (like our space wheat/Klingon poison theory), and we really don’t have enough studies yet to know what, precisely, we need to be testing, can we justify performing the experiment on the basis of laying new ground in uncharted territory?

Improving our research efficiency will help reduce the reproducibility problem. It would also mean setting aside notions like equality and justice when deciding what we should study. Obviously this isn’t an “either/or” decision. Efforts to improve reproducibility, or at least highlight problems in achieving it, are for the good. However, setting the bar for reproducibility too high would likely not give us the type of science we want. Most of us probably have notions of just practice and courageous exploration of new scientific territory as part of our vision of science. It’s likely that some irreproducibility in our research is an unfortunate necessity.

 

* In Star Trek, all problems are solved by reversing the polarity of the neutron flow.

** Putting the word “space” in front of anything occurring in space is, unfortunately, a narrative law in science fiction.
