P Values and 'Data Dredging' in Clinical Research

February 24, 2017

We spoke with Dr. Garrett-Mayer about concerns regarding statistical probability (P) values and “data dredging” in clinical research.

Elizabeth Garrett-Mayer, PhD

Professor of Biostatistics and Epidemiology Elizabeth Garrett-Mayer, PhD, is Director of Biostatistics at the Medical University of South Carolina’s Hollings Cancer Center in Charleston, South Carolina. She is a member of the American Society of Clinical Oncology (ASCO) Cancer Research Committee.

-Interviewed by Bryant Furlow

OncoTherapy Network: Statistical P values and their interpretation have become somewhat controversial. What is their value in assessing clinical trial outcomes in oncology?

Dr. Garrett-Mayer: P values can have some value, but should not be interpreted on their own. It is always important to consider the clinical effect size and sample size when interpreting a P value. P values are most useful in a setting where a trial was designed to have an appropriate sample size to address the primary objective of the study. If the study is much larger or smaller than required to answer the primary research question, then the P value can be misleading on its own. ASCO’s perspective paper from 2014, “Raising the Bar for Clinical Trials by Defining Clinically Meaningful Outcomes,” addresses this problem by providing guidance for what would be considered “clinically meaningful” in a number of different patient populations so that trials can be designed with an appropriate sample size.
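To illustrate how sample size can make a P value misleading on its own, here is a minimal sketch in Python. It assumes a simple two-arm comparison of means with a known, equal standard deviation (a z-test). The function names and all numbers are hypothetical and chosen only for illustration; they do not come from any trial discussed in the interview.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_test_p(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """P value for a difference in means between two equal-sized arms.

    Assumes a known, common standard deviation (illustrative only).
    """
    standard_error = sd * math.sqrt(2.0 / n_per_arm)
    return two_sided_p_from_z(mean_diff / standard_error)


# Hypothetical oversized trial: a tiny difference, but a very small P value.
p_oversized = z_test_p(mean_diff=0.5, sd=10.0, n_per_arm=20_000)

# Hypothetical undersized trial: a difference ten times larger, but P > 0.05.
p_undersized = z_test_p(mean_diff=5.0, sd=10.0, n_per_arm=10)

print(f"Oversized trial  (difference 0.5): P = {p_oversized:.2g}")
print(f"Undersized trial (difference 5.0): P = {p_undersized:.2g}")
```

Under these assumptions, the smaller effect is "significant" and the larger one is not, which is why the effect size and sample size need to be read alongside the P value.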

OncoTherapy Network: Does statistical significance typically imply biological significance?

Dr. Garrett-Mayer: Typically? No. But there are so many settings in which P values are reported. In cases where studies are specifically designed to detect a clinically (or biologically) meaningful difference, statistical significance will imply clinical (or biological) significance. But the majority of the P values reported do not fall into this category. Most trials are designed around a single primary objective, for example, to detect a clinically meaningful difference in survival between two treatments. However, when the trial is reported, there are numerous other comparisons made, such as differences in toxicity rates, differences in progression-free survival, etc. The P values from these comparisons should be interpreted cautiously because the sample size was not selected based on those other outcomes. In preclinical research, we are seeing articles in which hundreds of P values are reported from small studies. With so many P values reported, we expect quite a few to be significant by chance alone.

OncoTherapy Network: Is the role of P values different in hypothesis-generating data exploration settings versus confirmatory hypothesis testing?

Dr. Garrett-Mayer: Yes. There is an interesting history of how the P value came to be used as it is today, and it is not used as it was originally intended. R.A. Fisher proposed a P value (without a threshold) to be interpreted qualitatively in conjunction with prior knowledge to interpret new data. Neyman and Pearson proposed setting the “alpha threshold” and rejecting a hypothesis when the P value was less than the threshold, without concern for how small or how large the P value was. But now, we use a conflated approach. We use the threshold approach (usually set at 0.05) and we also use the P value as a judge of the level of evidence (à la Fisher). The biggest problem, however, is that we too often ignore the other important facets of the analysis, such as the effect size, confidence intervals for the effect size, and the sample size, when interpreting our results.

Back to the specific question now: The Neyman-Pearson approach is more consistent with the idea of confirmatory hypothesis testing. A specific study is designed to confirm (or deny) a specific hypothesis and the alpha threshold is set. At the conclusion of the study, the hypothesis is either rejected or not, and because the study was designed around a specific hypothesis, statistical and clinical significance are both supported with a P value less than the threshold. However, in hypothesis-generating type settings, P values should be considered more qualitatively (as per Fisher), where the magnitude of the P value is considered in conjunction with other factors. In these types of early studies, the sample size cannot be selected to accommodate all hypotheses of interest, so statistical significance (or lack thereof) will not necessarily imply clinical (or biological) significance. In these cases, many statisticians, including myself, would encourage researchers to utilize graphical displays of data with summary statistics and confidence intervals.

Does it really make sense to use the same data analysis and interpretation approaches in basic science research as in confirmatory clinical trials? No, and this is why the notion of using and interpreting P values in the same way for all types and stages of research is silly.

OncoTherapy Network: Can you please describe concerns about P value "hacking" or "data dredging," and comment on whether or not you share these concerns (and why or why not)?

Dr. Garrett-Mayer: I definitely have concerns with P value hacking and data dredging. For those not familiar with these approaches, they bring to mind the quote, “If you torture the data long enough, it will confess” (often attributed to Ronald Coase, although his original quote differed slightly). With a large enough dataset (meaning a large enough set of measurements), eventually, one can find a statistically significant association, or one can find a subgroup of individuals for which there is a significant difference in an outcome of interest. As we know from the basic principle of frequentist statistics, with an alpha level of 0.05, 5% of the time we will conclude that we have a significant association when none exists. So, for example, if a researcher searches through 40 genes for an association with cancer, even if all 40 truly have no association with cancer, we would expect, on average, two of these genes to have P values less than 0.05 (40 × 0.05 = 2).

A major problem with this is that these reported and published results become part of our body of knowledge from which we continue our research pursuits. Results from “dredged data” lead people down the wrong path in most cases and resources (including both money and time) are wasted.
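The 40-gene example above can be checked with a short Python sketch. It assumes the 40 tests are independent and that every null hypothesis is true, so each P value is uniformly distributed between 0 and 1. The variable names, the simulation size, and the random seed are illustrative choices, not part of the interview.

```python
import random

ALPHA = 0.05   # significance threshold from the example
N_GENES = 40   # number of genes tested, none truly associated

# Expected number of false positives: number of tests times alpha.
expected_false_positives = N_GENES * ALPHA

# Chance of at least one false positive, assuming independent tests.
prob_at_least_one = 1 - (1 - ALPHA) ** N_GENES


def count_false_positives(rng, n_tests=N_GENES, alpha=ALPHA):
    """Simulate one null study: count P values that fall below alpha."""
    return sum(rng.random() < alpha for _ in range(n_tests))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_sim = 10_000
counts = [count_false_positives(rng) for _ in range(n_sim)]
mean_count = sum(counts) / n_sim
share_with_any = sum(c > 0 for c in counts) / n_sim

print(f"Expected false positives:        {expected_false_positives:.1f}")
print(f"P(at least one false positive):  {prob_at_least_one:.3f}")
print(f"Simulated mean false positives:  {mean_count:.2f}")
print(f"Simulated share with at least 1: {share_with_any:.3f}")
```

Under these assumptions, the chance that at least one of the 40 null genes appears "significant" is about 87%, so a dredged dataset will almost always yield something to report.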

OncoTherapy Network: Are there contexts or applications for which P value concerns are more germane (such as high-throughput -omics studies) than others?

Dr. Garrett-Mayer: Yes, and in the very early days of analysis of high-throughput datasets, these concerns were not fully appreciated. However, it did not take much time for statisticians to jump in and raise concerns. In most high-throughput analysis approaches, one will see that there is control over the false-discovery rate, that is, the expected proportion of identified genes that are “false positives.” In addition, we much more commonly see validation approaches incorporated into high-throughput analyses to avoid reporting spurious findings.

OncoTherapy Network: Are there widely-accepted ways to correct, statistically, for multiple comparisons?

Dr. Garrett-Mayer: The Bonferroni correction has been popular. It is simple to implement, but has the drawback that it is very conservative. Other approaches have gained in popularity, including the Benjamini-Hochberg correction, which is more powerful than the Bonferroni approach, and there are others, too. But, when considering the need for correction, one must consider the context. If one is interrogating a dataset with hundreds or thousands of markers, then one must address the multiplicity issue to define a set of markers. And, if one is performing a randomized phase III trial with two primary clinical outcomes (and the new treatment would be approved if either one yields a significant result), then one needs to correct for multiple comparisons as well. But for most other settings, corrections for multiple comparisons are less necessary. The controversy over P values will hopefully lead us away from the heavy reliance we place on them in medical research, and lead us toward better reporting of results, where clinical and biological significance is emphasized through easy-to-interpret graphical displays.

And, there are other hypothesis testing approaches that do not rely on P values at all, such as Bayesian and Evidential (i.e., Likelihood)-based approaches. These are gaining some popularity in clinical research and will lead to less reliance on P values.
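The two corrections named above can be sketched in a few lines of Python. This is a minimal illustration of the standard Bonferroni and Benjamini-Hochberg step-up procedures, not the method used in any particular study. The function names and the list of P values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis only if its P value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false-discovery rate control.

    Sort the P values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and reject the k hypotheses with the smallest P values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Hypothetical P values for eight markers (illustration only).
example_p = [0.001, 0.008, 0.012, 0.02, 0.03, 0.2, 0.5, 0.9]

n_bonferroni = sum(bonferroni_reject(example_p))
n_bh = sum(benjamini_hochberg_reject(example_p))

print(f"Uncorrected (P < 0.05): {sum(p < 0.05 for p in example_p)} markers")
print(f"Bonferroni:             {n_bonferroni} markers")
print(f"Benjamini-Hochberg:     {n_bh} markers")
```

With these made-up values, Bonferroni keeps one marker while Benjamini-Hochberg keeps five, which shows the sense in which Bonferroni is conservative and Benjamini-Hochberg is more powerful. The two procedures control different error rates (family-wise error versus false-discovery rate), so the choice depends on the context described above.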