How Scientists Massage Results with ‘P-Hacking’

The pursuit of science is designed to search for significance in a maze of data. At least, that’s how it should work.

By some accounts, that façade began to crumble in 2010 when a social psychologist from Cornell University, Daryl Bem, published a 10-year analysis in the prestigious Journal of Personality and Social Psychology, demonstrating by widely accepted statistical methods that extrasensory perception (ESP), essentially a “sixth sense,” was an observable phenomenon. Bem’s peers could not replicate the paper’s results and quickly blamed what we now call “p-hacking,” a process of massaging and overanalyzing your data in search of statistically significant, publishable results.

To support or reject a hypothesis, researchers typically aim to establish statistical significance by reporting a “p-value” of less than 0.05, explains Benjamin Baer, a postdoctoral researcher and statistician at the University of Rochester whose recent work addresses this issue. The “p” in p-value stands for probability: it measures how likely it is to see a result at least as extreme as the one observed if the null hypothesis is true and only chance is at work.

For example, if you wanted to test whether or not all roses are red, you would count the number of red roses and roses of other colors in a sample and perform a hypothesis test to compare the values. If this test yields a p-value of less than 0.05, then you have statistically significant reason to claim that only red roses exist—even though evidence outside of your flower sample suggests otherwise.
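To make the 0.05 threshold and the danger of p-hacking concrete, here is a minimal Python sketch. It is an illustration, not a reconstruction of Bem's analysis: the function names, the coin-flip style data, the choice of an exact two-sided binomial test, and the figure of 20 outcomes are all assumptions made for the example. It simulates studies in which nothing real is going on, then compares a researcher who tests one planned outcome with one who measures 20 unrelated outcomes and reports only the best p-value.

```python
import random
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """Two-sided exact binomial p-value: the total probability, under the
    null hypothesis, of every outcome no more likely than the observed one."""
    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
    observed = pmf(successes)
    # Small tolerance so equally likely outcomes are not lost to rounding.
    tail = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)

TRIALS = 100
# Look-up table: the p-value for every possible number of "successes".
P_VALUES = [binomial_p_value(k, TRIALS) for k in range(TRIALS + 1)]

def smallest_p_value(n_outcomes, rng):
    """Simulate one study with no real effect, measure n_outcomes
    unrelated outcomes, and keep only the smallest p-value."""
    best = 1.0
    for _ in range(n_outcomes):
        successes = sum(rng.random() < 0.5 for _ in range(TRIALS))
        best = min(best, P_VALUES[successes])
    return best

def false_positive_rate(n_outcomes, n_studies=500, seed=0):
    """Fraction of simulated null studies that reach p < 0.05."""
    rng = random.Random(seed)
    hits = sum(smallest_p_value(n_outcomes, rng) < 0.05 for _ in range(n_studies))
    return hits / n_studies

honest = false_positive_rate(n_outcomes=1)
hacked = false_positive_rate(n_outcomes=20)
print(f"one planned outcome: {honest:.1%} of null studies look significant")
print(f"best of 20 outcomes: {hacked:.1%} of null studies look significant")
```

If each test had exactly a 5% false-positive rate, 20 independent tests would produce at least one "significant" result with probability 1 - 0.95^20, or about 64%. The exact test used here is somewhat conservative on discrete data, so the simulated rates should come in below those nominal figures, but the gap between the two researchers is the point.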

Misusing p-values to support the idea that ESP exists can be relatively harmless, but when the practice is used in medical trials, it can have far more deadly results, Baer says. “I think the big risk is that a wrong decision can be made,” he explains. “There’s a huge debate going on in science and statistics, trying to figure out how to make sure that this process can happen better and that the decisions are actually based on what they should be.”

Baer was first author on a paper published in late 2021 in the journal PNAS, written with his former Cornell mentor, statistics professor Martin Wells, that examined how a newer statistic could improve the use of p-values. The metric they looked at is called the fragility index, and it’s designed to complement and improve p-values.

This measure describes how vulnerable a data set is to some of its data points flipping from a positive result to a negative one, for example, if a patient who was recorded as helped by a drug had actually felt no effect at all. If changing just a few of these data points is enough to drop a result from statistically significant to not, the result is considered fragile.
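The flipping procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes a 2×2 trial (two arms, event or no event), uses a two-sided Fisher exact test as the significance check, and flips patients one at a time from "no event" to "event" in the arm with fewer events. The function names and the trial numbers are hypothetical.

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum every table that is no more likely than the observed one.
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, tail)

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of single-patient outcome flips needed to lose significance.
    Returns None if the result is not significant to begin with."""
    arms = [[events_a, n_a], [events_b, n_b]]

    def p_value():
        (ea, na), (eb, nb) = arms
        return fisher_exact_p(ea, na - ea, eb, nb - eb)

    p = p_value()
    if p >= alpha:
        return None
    # Modify the arm with fewer events, one patient at a time.
    low = 0 if events_a <= events_b else 1
    flips = 0
    while p < alpha and arms[low][0] < arms[low][1]:
        arms[low][0] += 1
        flips += 1
        p = p_value()
    return flips

# Hypothetical trial: 0 of 10 patients had the event on the drug, 7 of 10 on placebo.
print(fragility_index(0, 10, 7, 10))  # prints 2
```

In this made-up trial the starting p-value is about 0.003, yet reclassifying just two patients pushes it to about 0.07, so an apparently convincing result rests on two outcomes.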

[Figure: p-value curve]

In 2014, physician Michael Walsh first proposed the fragility index in the Journal of Clinical Epidemiology. In the paper, he and his colleagues applied the fragility index to just under 400 randomized controlled trials with statistically significant results and found that one in four had low fragility scores, meaning their findings might not be very reliable or robust.

However, the fragility index has yet to gain much steam in medical trials. Some critics of the approach have emerged, such as Rickey Carter of the Mayo Clinic, who says it is too similar to p-values without providing enough improvement. “The irony is that the fragility index was a p-hacking approach,” Carter says.

To improve the fragility index, Baer, Wells, and colleagues focused on two main elements to answer previous criticisms: making only reasonably feasible modifications to the data, and generalizing the approach to work beyond 2×2 binary tables (which represent the positive or negative results of the control and experimental groups).

Despite the uphill battle the fragility index has fought so far, Baer says he still believes it’s a useful metric for medical statisticians and hopes the improvements made in their recent work will help convince others of this.

“Talking to the victim’s family after a failed operation is a very different [experience] than statisticians sitting at their desks doing math,” says Baer.
