The P-value Controversy
Recent years have seen statisticians become increasingly vocal about the limitations of traditional hypothesis tests and their associated p-values. The limitations are not new discoveries; many were well known a century ago, when hypothesis testing was popularized by Ronald Fisher, whose goal was to provide tools that research workers could employ in their own work. Several recent, highly publicized reviews of the scientific literature, however, have demonstrated the weaknesses of p-values, especially given their present overemphasis and their preeminent role in deciding whether research is publishable.
For students of inferential statistics, the issue is complicated by the traditional method of instruction. Statistics is typically taught by introducing the theory of probability, which is used to build up the notion of sampling distributions, which in turn lead to the construction of hypothesis tests, p-values, and confidence intervals. P-values and hypothesis tests supply such courses with a seemingly useful tool and an impressive-sounding conclusion, yet their limitations and alternatives are rarely covered in these introductions.
In the words of the American Statistical Association (ASA),
Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index [e.g. the p-value] should substitute for scientific reasoning. [emphasis added]
The statement has six additional key points:
- Low p-values can indicate incompatibility between the data and the statistical model.
- Researchers are free to treat p = .051 and p = .049 as providing similar, yet quite weak evidence against the statistical model.
- Report confidence intervals (or, equivalently, effect sizes and standard errors). Consider also reporting measures from a Bayesian or decision-theoretic framework.
- Low p-values do not indicate large effect sizes or scientific importance: business, policy, and publication decisions should not be based solely on p-values. [The first sketch following this list makes this concrete.]
- High p-values do not provide good evidence for the null hypothesis; a p-value is sensitive to both sample size and effect size. [High p-values are strikingly poor evidence for the null under Bayesian frameworks.]
- Studies should be transparent about all procedures and should not engage in any form of cherry-picking or p-hacking. [Be wary of decisions made after looking at the data, or alternatively, identify the study as exploratory; the second sketch following this list simulates the cost of cherry-picking.]
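To make the fourth and fifth points concrete, the following sketch runs the same two-sample t-test on the same observed effect at two sample sizes. (The example is ours, not the ASA's; Python with scipy is assumed purely for illustration.)

```python
from scipy import stats

# The same observed effect (a mean difference of 0.1 SD) at two sample sizes.
small = stats.ttest_ind_from_stats(mean1=0.1, std1=1.0, nobs1=20,
                                   mean2=0.0, std2=1.0, nobs2=20)
large = stats.ttest_ind_from_stats(mean1=0.1, std1=1.0, nobs1=20_000,
                                   mean2=0.0, std2=1.0, nobs2=20_000)

print(f"d = 0.1, n = 20 per group:     p = {small.pvalue:.3f}")  # p is about 0.75
print(f"d = 0.1, n = 20,000 per group: p = {large.pvalue:.1e}")  # p is far below .001
```

An identical effect is "nonsignificant" at n = 20 and overwhelmingly "significant" at n = 20,000: the p-value tracked the sample size, not the size or importance of the effect, and the high p-value in the small study reflects low power rather than evidence for the null.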
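The final point can also be simulated. The sketch below (again our illustration, not part of the statement) compares two identical groups on ten pure-noise outcomes and reports a "discovery" whenever the smallest of the ten p-values falls below .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
reps, n, k = 2_000, 30, 10          # k candidate outcomes, all pure noise

false_positives = 0
for _ in range(reps):
    # Two groups with NO true difference on any of the k outcome measures.
    a = rng.normal(size=(k, n))
    b = rng.normal(size=(k, n))
    p_values = [stats.ttest_ind(a[i], b[i]).pvalue for i in range(k)]
    if min(p_values) < 0.05:        # report only the "best" outcome
        false_positives += 1

print(f"False-positive rate with cherry-picking: {false_positives / reps:.2f}")
```

Selecting the best of ten tests inflates the false-positive rate from the nominal 5% to roughly 1 - 0.95^10, about 40%, which is why undisclosed flexibility in analysis is so corrosive.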
Since the release of the ASA statement, The American Statistician, one of the ASA's flagship journals, has published a special issue dedicated to the topic. The lead editorial adds a further directive:
Regardless of whether it was ever useful, a declaration of ‘statistical significance’ has today become meaningless… don’t say it and don’t use it.
The authors intend this to cover all equivalent procedures, not simply the phrase "statistically significant". The ban extends to "significantly different", "p < .05", asterisks used to mark significance, and "nonsignificant", and thus to essentially any traditional form of hypothesis testing.
Graduate students, of course, should seek guidance from their instructors, faculty advisors, granting agencies, and journal guidelines. Yet nothing prevents a student from using a richer set of tools when forming personal beliefs, even if institutions are not yet ready to abandon hypothesis testing.
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129-133. doi:10.1080/00031305.2016.1154108
- Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond "p < 0.05". The American Statistician, 73(sup1), 1-19. doi:10.1080/00031305.2019.1583913