P>0.05 Is Good: The NORD-h Protocol for Several Hypothesis Analysis Based on Known Risks, Costs, and Benefits

Article information

J Prev Med Public Health. 2024;57(6):511-520
Publication date (electronic) : 2024 September 20
doi: https://doi.org/10.3961/jpmph.24.250
1International Committee Against the Misuse of Statistical Significance, Bovezzo, Italy
2Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
Corresponding author: Alessandro Rovetta, International Committee Against the Misuse of Statistical Significance, Via Brede Traversa II, Bovezzo 25073, Italy E-mail: alessandrorovetta@redeev.com
Received 2024 May 18; Revised 2024 July 15; Accepted 2024 August 14.

Abstract

Statistical testing in medicine is a controversial and commonly misunderstood topic. Despite decades of efforts by renowned associations and international experts, fallacies such as nullism, the magnitude fallacy, and dichotomania are still widespread within clinical and epidemiological research. This can lead to serious health errors (e.g., misidentification of adverse reactions). In this regard, our work sheds light on another common interpretive and cognitive error: the fallacy of high significance, understood as the mistaken tendency to prioritize findings that lead to low p-values. Indeed, there are target hypotheses (e.g., a hazard ratio of 0.10) for which a high p-value is an optimal and desirable outcome. Accordingly, we propose a novel method that goes beyond mere null hypothesis testing by assessing the statistical surprise of the experimental result compared to the prediction of several target assumptions. Additionally, we formalize the concept of interval hypotheses based on prior information about costs, risks, and benefits for the stakeholders (NORD-h protocol). The incompatibility graph (or surprisal graph) is adopted in this context. Finally, we discuss the epistemic necessity for a descriptive, (quasi) unconditional approach in statistics, which is essential to draw valid conclusions about the consistency of data with all relevant possibilities, including study limitations. Given these considerations, this new protocol has the potential to significantly impact the production of reliable evidence in public health.

GRAPHICAL ABSTRACT

INTRODUCTION

Context

Statistical significance misuse is a pervasive issue in medicine [1-6]. Despite having been recognized as problems for decades, nullism (exclusive analysis of the null hypothesis of zero effect), the magnitude fallacy (misinterpretation of statistical significance as practical significance), and dichotomania (p-value <0.05 is significant, p-value ≥0.05 is non-significant) persist and consistently lead to serious health consequences, including misidentification of adverse events, illusory replication failures, and mistrust in science [7-20]. Therefore, we propose and discuss alternatives to statistical significance.

A Proper Definition of the P-value

Assuming all background hypotheses hold true (e.g., normality, linearity, absence of biases and confounding, honesty), the p-value is an index of incompatibility between data and the target hypothesis (e.g., the null hypothesis) as assessed by the selected test [14]. P-values nearing 1 (respectively 0) signify low (respectively high) incompatibility. However, a more fitting definition would be to view the p-value as a measure of compatibility, as increasing p-values equate to increasing compatibility [2,9,11-13,15]. The relevance of the term (in)compatibility is rooted in its expression of a degree of (dis)agreement rather than endorsement. For example, discovering a person at the crime scene is consistent with their guilt but it does not serve as evidence of guilt (the situation is also consistent with an attempt to assist or a mere coincidence). Likewise, declaring “these results are compatible with the drug’s effectiveness” conveys a necessary but insufficient condition to substantiate the drug’s effectiveness [9-16]. A second critical concept is the notion of interval estimates in contrast to point estimates, such as assessing the compatibility of data with a range of hypotheses rather than with a single hypothesis [17].

Incompatibility and Surprisal

P-values pose interpretative challenges. For example, the statistical information contained in the difference between p=0.95 and p=0.90 differs from that between p=0.10 and p=0.05, even though Δp is 0.05 in both instances (the information in a probability is translation-invariant only on the logarithmic scale). Also, the p-value is often mistakenly associated with the significance of real phenomena, despite being calculated under the assumption of a world of pure chance. To solve these problems, the s-value (surprisal) has been introduced [10-13,17-19]. Assuming all background hypotheses are true (we call this ideal situation the “utopian scenario”), the surprisal is the number “s” of consecutive heads we should obtain, tossing a fair coin “s” times, to match the unexpectedness of our statistical result compared to the prediction of the target hypothesis. Thus, the researcher is aware that the s-value is merely a benchmark and does not measure the scientific importance of the outcome: It simply “tells us” how surprised we should feel about the observed result compared to the prediction of the adopted model. The relationship between p-values and s-values is then expressed by p=0.5^s, i.e., s=-log2(p). For instance, according to a well-specified model, an s-value of 6.3 indicates that the statistical result is approximately as surprising as getting 6 consecutive heads in 6 fair tosses compared to the target hypothesis prediction. This framework allows us to see that the difference in surprise between the above 2 pairs of p-values is substantial, since Δs1=-log2(0.90)+log2(0.95)=0.15-0.07=0.08, whereas Δs2=-log2(0.05)+log2(0.10)=4.3-3.3=1.
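The conversion between p-values and s-values, and the Δs comparison above, can be sketched in a few lines of Python (a minimal illustration of s=-log2(p); the code is ours, not part of the original analysis):

```python
import math

def s_value(p: float) -> float:
    """Surprisal in bits: s = -log2(p), equivalently p = 0.5**s."""
    return -math.log2(p)

# Equal Δp = 0.05, very different information content:
delta_s1 = s_value(0.90) - s_value(0.95)  # ≈ 0.08 bits
delta_s2 = s_value(0.05) - s_value(0.10)  # exactly 1 bit
print(f"Δs1 ≈ {delta_s1:.2f}, Δs2 = {delta_s2:.2f}")
```

Running it confirms that moving from p=0.10 to p=0.05 carries roughly 12 times more information than moving from p=0.95 to p=0.90.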

Surprisal and Loss Acceptability

There is no universal scale for assessing surprise, as the latter should be evaluated based on the acceptability of situational costs and risks [11]. To illustrate, getting 6 heads in 6 coin tosses is generally highly surprising, making it reasonable to bet on the hypothesis that the coin is rigged in most everyday contexts. Nonetheless, the key question is: How does the decision change if the stakes increase? For example, if the cost of an error were much higher than the benefit of winning the bet, one might raise the surprise requirement to 8 heads in 8 coin tosses—in other words, it is necessary to establish a loss function. Similarly, concerning a certain drug, an s-value of 6 might be considered sufficient if referring to data incompatibility with hypotheses of frequent mild adverse events (e.g., headache) but still insufficient if referring to data incompatibility with hypotheses of frequent serious adverse events (e.g., anaphylaxis). These considerations all depend on the risk-benefit ratio. Ergo, the present manuscript adopts the s-value to accomplish the following objectives: (1) incorporate costs, risks, and benefits in the hypothesis selection, (2) utilize interval hypotheses as opposed to point hypotheses (e.g., the interval [-3, 3] instead of the null hypothesis of zero effect), and (3) evaluate the incompatibility of the outcome with various interval hypotheses of interest.

COMPATIBILITY AND SURPRISAL

Compatibility Intervals Instead of Confidence Intervals

Confidence intervals (CIs) are frequently (mis)used to gauge confidence in results. This unwarranted optimism presupposes that all experimental replications would be equivalent, which is impossible to guarantee in scientific practice due to sources of uncertainty (e.g., human errors) and variability (e.g., living organisms) [16,20,21]. Nevertheless, a CI offers valuable information when employed as a compatibility interval [12,19,22]. A 100(1-α)% compatibility interval (e.g., 95% CI) contains all target hypotheses whose p-value, according to the chosen test, is greater than α (e.g., 0.05). Therefore, a compatibility interval contains all target hypotheses that are more compatible with the data than the interval limits. A 90% CI=(0, 20) “tells us” that the null hypothesis h=0 is as compatible with our experimental data, according to the chosen test, as the non-null hypothesis h=20 (p-value=0.10). All contained hypotheses (e.g., h=1, 2, or 18, 19) have p-values>0.10, meaning they are more compatible with the data than h=0 and h=20. It should be noted that all values inside a CI are not equally compatible [23]: In a 2-sided test, the point estimate is the most compatible (p-value=1), and values near it are more compatible than those near the interval limits. Similarly, not all values outside are equally (in)compatible: The values just outside the limits are practically as compatible as those just inside them, but those far from the limits are much less compatible.
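The smooth grading of compatibility inside and outside an interval can be illustrated with a hedged sketch. The 90% CI=(0, 20) comes from the text; the two-sided z-test (normal model) is our simplifying assumption:

```python
from statistics import NormalDist

def compatibility_p(h: float, est: float, se: float) -> float:
    """Two-sided p-value for target hypothesis h under a normal model."""
    z = abs(est - h) / se
    return 2 * (1 - NormalDist().cdf(z))

# 90% CI = (0, 20): point estimate 10; SE recovered from the 1.645 z-limit.
est = 10.0
se = 10.0 / NormalDist().inv_cdf(0.95)  # ≈ 6.08
for h in (0, 1, 10, 19, 20, 30):
    print(f"h={h:>2}: p ≈ {compatibility_p(h, est, se):.3f}")
```

The limits h=0 and h=20 both yield p=0.10, the point estimate h=10 yields p=1, values just inside (h=1, 19) are only slightly more compatible than the limits, and a distant h=30 is far less compatible.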

Compatibility to Avoid Nullism

The null hypothesis of zero effect is generally deemed the sole focus of interest, whereas different hypotheses can be equally pertinent to the research objectives. Suppose we have a hazard ratio (HR) of 4.5, 95% CI=(0.6, 34), null p-value=0.15 (we call “null p-value” the p-value referring to the null hypothesis). Can we conclude non-significance? No! The null p-value=0.15 only “tells us” that, according to the chosen test, the null hypothesis HR0=1 is quite compatible with our statistical result HR=4.5. However, there are many markedly non-null hypotheses with p-values>0.15, such as HR1=2 or HR2=3 (as they are “closer” than HR0 to the point estimate HR=4.5). Since these hypotheses are clinically relevant (in most contexts), our findings are more consistent with a non-null effect than with a null effect. Nevertheless, the 95% CI of (0.6, 34) spans a vast range of hypotheses with p>0.05, signaling high statistical uncertainty.
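This HR example can be approximately reproduced by back-deriving the log-scale standard error from the reported 95% CI. This is a hedged sketch: the normal model on log-HR is our assumption, and rounding in the published interval makes the values approximate:

```python
import math
from statistics import NormalDist

# Reported summary from the text: HR = 4.5, 95% CI = (0.6, 34).
hr_est, lo, hi = 4.5, 0.6, 34.0
z975 = NormalDist().inv_cdf(0.975)
se_log = (math.log(hi) - math.log(lo)) / (2 * z975)  # ≈ 1.03

def hr_p(hr_target: float) -> float:
    """Two-sided p-value for a target HR (normal model on the log scale)."""
    z = abs(math.log(hr_est) - math.log(hr_target)) / se_log
    return 2 * (1 - NormalDist().cdf(z))

for hr in (1, 2, 3):
    print(f"HR={hr}: p ≈ {hr_p(hr):.2f}")
```

The null HR=1 yields p ≈ 0.15, matching the text, while HR=2 and HR=3 yield even larger p-values, i.e., they are more compatible with the observed HR=4.5 than the null is.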

Surprisal Intervals as Improvements of Compatibility Intervals

Compatibility intervals still present challenges. For instance, although the 99% CI, 95% CI, and 91% CI appear symmetric around the central interval, the associated compatibility requirements differ, since Δs1=-log2(0.01)+log2(0.05)=6.6-4.3=2.3 while Δs2=-log2(0.05)+log2(0.09)=4.3-3.5=0.8. For this reason, the surprisal interval, or s-I (e.g., 4-I=s-interval for s=4), has been introduced [13,22]: This encompasses all target hypotheses that are less surprising than “s” consecutive heads in “s” fair tosses compared to the test result. Considering “r” as the observed experimental result (e.g., the average effect or, in general, the point estimate), if we obtain r=10 and a 4-I=(1, 19) for a 1-sample t-test, we know that, in the utopian scenario, r=10 is as surprising as 4 consecutive heads in 4 fair tosses compared to the hypotheses h=1 and h=19, and less surprising than 4 consecutive heads in 4 fair tosses compared to the hypotheses h=2 or h=18 (since these are contained within the 4-interval). Thus, a straightforward connection can be made between surprisal and compatibility intervals. An s-value of 4 corresponds to a p-value of 0.5^4=0.0625. The associated compatibility interval is therefore a 100(1-0.0625)%≈94% CI. In general, an s-interval corresponds to a 100(1-0.5^s)% compatibility interval [23]. A novel convention for presenting multiple intervals enables the rapid extraction of valuable information. For example, 4-I=(a, b) and 5-I=(c, d) can be contracted into 4|5-I=(a, b|c, d), which shows the set of hypotheses between “a” and “b” that yield an s-value<4 and the set of hypotheses between “c” and “d” that yield an s-value<5 (according to the adopted test). A situation such as 4|5|6-I=(4, 10|2, 12|0, 14) displays the rate at which statistical surprise changes in relation to various hypotheses.
Very different degrees of surprise near small effects can indicate large statistical uncertainty, which translates into an inability to draw useful conclusions about the main goal (even in the utopian scenario): e.g., in the context of blood pressure (mmHg), if the null hypothesis h=0 leads to s=6 and the non-null hypothesis h=0.5 leads to s=2, our result could be simultaneously considered very surprising (like 6 heads in 6 fair coin tosses) and not very surprising (like 2 heads in 2 fair coin tosses) compared to hypotheses of little practical relevance (h=0.0 and 0.5, respectively). Consequently, we cannot determine a global degree of incompatibility with hypotheses of small effects. However, summary papers (e.g., meta-analyses) still require more comprehensive frameworks.
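Using the stated relation between an s-interval and a 100(1-0.5^s)% compatibility interval, surprisal intervals can be computed directly from a point estimate and standard error. The sketch below uses a normal approximation (the text's example uses a 1-sample t-test, so this is illustrative), and the SE is back-derived so that the 4-I matches the text's (1, 19):

```python
from statistics import NormalDist

def s_interval(s: float, est: float, se: float) -> tuple[float, float]:
    """s-interval = 100*(1 - 0.5**s)% compatibility interval (normal model)."""
    z = NormalDist().inv_cdf(1 - 0.5 ** s / 2)
    return est - z * se, est + z * se

# Assumed estimate r = 10, with SE chosen so that the 4-I is (1, 19).
est = 10.0
se = 9.0 / NormalDist().inv_cdf(1 - 0.5 ** 4 / 2)
a, b = s_interval(4, est, se)
c, d = s_interval(5, est, se)
print(f"4|5-I = ({a:.1f}, {b:.1f} | {c:.1f}, {d:.1f})")
```

As expected, the 5-I is wider than the 4-I: demanding a higher degree of surprise before declaring incompatibility admits a broader set of hypotheses.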

Surprisal as a Measure of Incompatibility: Low S-values Can Be Very Good

Surprisal is tied to incompatibility: A result deemed highly surprising is essentially an unexpected outcome, meaning it is highly incompatible with the hypothesis under consideration (through the chosen statistical model). In the utopian scenario, the s-value is an intuitive measure of the statistical incompatibility of a result with a specific hypothesis through a certain test. A higher (respectively lower) s-value indicates greater (lesser) incompatibility. One of the most misunderstood aspects of statistical testing is that highly compatible hypotheses (e.g., p>0.10, i.e., s<3) can be very favorable for scientists. The obsession with seeking “statistical significance” as erroneous proof of practical significance has diverted attention from a trivial point—namely, the hypotheses most compatible with a statistical result are highly relevant as they represent those hypotheses that align better with what occurred in the experiment (as assessed by the chosen test). Indeed, a scientist might establish an ideal scenario where certain hypotheses should be highly compatible with the data (e.g., a substantial reduction in low-density lipoprotein [LDL] cholesterol due to therapy) while others should be highly incompatible (e.g., a null decrease in LDL cholesterol due to therapy).

(In)Compatibility as a Descriptive Non-inferential Measure

For frequentist-inferential statistics to be unambiguously informative about a research objective, it is necessary that all background assumptions hold. However, as discussed in the literature, researchers do not have the means to manage all sources of uncertainty [16,20]. In accordance with Amrhein et al. [16], the present manuscript adopts statistical incompatibility as a non-inferential measure of the discrepancy between the data and the chosen statistical model (which consists of the target and all other background hypotheses). The aim is not to infer the population parameters but to adopt s-values as descriptive statistics of the relationship between observed data and statistical hypotheses. While minimizing uncertainty remains an indispensable task [24], the advantage of the descriptive approach lies in being “unconditional” regarding background assumptions [16,20,25]. The goal is to present all scientific eventualities that are coherent with what happened in the experiment (including limitations). Indeed, even though we are merely interested in the target hypotheses and despite our best efforts to validate the model, other plausible explanations may be equally or more consistent with the data. The task of a good researcher is to present the most complete, unconditional scenario possible. In this regard, we suggest that the “least conditional approach” is most suitable, since this still relies on the authors’ overall ability to formulate conclusions. Inference exists only in the degree of consistency of various studies, and it is essential to understand that the p-value (or s-value) for the null or any other hypothesis does not play a particular role in it [8]. For example, if in 10 different studies, conducted to the best of our capacities, we obtain an HR of approximately 4 and a null p≈0.10 (>0.05) (i.e., null s≈3) each time, the most plausible hypothesis is still HR=4 and not the null hypothesis HR0=1.

THE NULL-OPTIMAL-RISKY-DETRIMENTAL HYPOTHESES PROTOCOL

Decision Types: Health-consequential Versus Non-health Consequential

The scientific process is based on decisions [26]. The very choice to conduct a study implies a decision that entails direct consequences (e.g., allocation of funding) and indirect consequences (e.g., cutting resources for other sectors). Many of these consequences are unknown at the time the decision is made. Initial decisions (e.g., starting a project) and especially intermediate ones (e.g., continuing the project) may be driven by personal beliefs or motivations (e.g., authors’ or financiers’ obstinacy). Science cannot eliminate this uncertainty but can minimize it through pre-study protocols, independent experiments, vigilance, and promoting competence and integrity as values [27]. We define 2 types of decisions based on their implications (Table 1): health-consequential (HC) versus non-health-consequential (NHC).

Major decision types in public health

A similar distinction was introduced by Good [28]: HC decisions could be defined as “terminal” (conclusive), while NHC decisions could be defined as “sequential” (they require further experiments). However, our modification stresses the impact on the stakeholders. HC decisions affect people’s health either directly (e.g., approving a treatment) or indirectly (e.g., rejecting it). These are the most sensitive as they involve patients, the most vulnerable stakeholders. Conversely, NHC decisions involve scientists, funders, institutions, and so forth, which may better withstand a negative impact. HC decisions necessitate multiple consistent studies (e.g., systematic reviews with meta-analyses), causal inference (e.g., randomized controlled trials), and various lines of evidence (e.g., statistical, biochemical, clinical) [26,29-32]. NHC decisions may, of necessity, be informed by single studies (e.g., exploratory surveys), although stronger evidence is preferable.

The Transition From Non-health-consequential to Health-consequential Decisions

Clinical reality often presents blended situations. Let us consider the development of a drug. The process starts with an investigation of the underlying biochemical mechanisms. Next, we proceed to animal testing. Finally, human trials ensue. This phased process represents the gradual transition from NHC to HC decisions. Given the ethical and logistical necessity to possess increasingly robust evidence as the research progressively involves major costs and risks, it is crucial to establish at which “phase” the investigation has to be conducted. Terminal decisions must be weighed against practical considerations such as “Does the beneficial effect justify the therapy’s invasiveness and costs?”

The Null-Optimal-Risky-Detrimental hypotheses (NORD-h) Protocol

We propose a procedure to inform—not make—HC, NHC, and blended decisions: the Null-Optimal-Risky-Detrimental hypotheses (NORD-h) protocol (Figure 1). Its novelty stems from (1) testing all relevant hypotheses to counteract nullism, (2) integrating prior knowledge or reasoned considerations on the risk-benefit ratio to counteract the magnitude fallacy, and (3) adopting multiple indicative thresholds to counteract dichotomania. This also involves publishing a pre-print, marked with a digital object identifier, in which various ranges of point hypotheses are defined: practically null (Hn), optimal (Ho), risky (Hr), and detrimental (Hd). The aim is to limit the impact of biases and conflicts of interest. Adopting multiple hypotheses and evaluation criteria based on costs, risks, and benefits prevents statistical testing from becoming a set of decisions arbitrarily driven by numbers, as it compels the actor—through the choice of Hn, Ho, Hr, and Hd—to constantly relate to the real scenario and the practical consequences of their actions. Furthermore, this makes it possible to consistently identify the hypotheses that are most and least compatible with the observed data. In this regard, we emphasize that the NORD-h protocol is not designed to be universal, but rather to be tailored to each specific study.

Figure. 1.

Null-Optimal-Risky-Detrimental hypotheses (NORD-h) protocol: structure and scope.

A Practical Example

Let us assume we want to evaluate the effectiveness of an LDL cholesterol-lowering drug. We adopt the NORD-h protocol: Hn contains all practically null hypotheses (e.g., consistent with small clinical relevance), Ho contains all optimal hypotheses (e.g., consistent with a favorable risk-benefit ratio), Hr contains all risky hypotheses (i.e., consistent with a too-drastic decrease in the considered time frame), and Hd contains all detrimental hypotheses (i.e., consistent with an increase in cholesterol). Let us consider 3 essential roles for the success of health sciences (Table 2): the clinician (who primarily focuses on patients), the scientist (who primarily focuses on phenomena), and the financier (who primarily focuses on sustainability).

Different types of NORD-h protocols for a hypothetical new treatment for low-density lipoprotein-cholesterol (mg/dL)

Since the desired degree of incompatibility depends on hypotheses and goals, different non-dichotomous surprise thresholds Sn, So, Sr, and Sd can be selected for the respective Hn, Ho, Hr, and Hd. The clinician may hope that the incompatibility of the null hypothesis with the data is at least Sn=4 (consecutive heads) and that the incompatibility of the optimal hypothesis is less than So=3, as they prioritize patients. Hn=[-4, -1] assumes that, given the adverse events, the target effect should be a reduction of at least 5 units and the minimum non-detrimental effect should be a reduction of 1 unit (in mg/dL). Hr=[-∞, -16] considers that a reduction exceeding 15 units becomes dangerous. The scientific protocol sets hypotheses coherent with causal phenomena. Hn=[-3, 3] assumes that oscillations between -3 and 3 are compatible with random events. Hr=[-∞, -21] considers that reductions exceeding 20 units become dangerous for the main stakeholders (the patients), although they remain of scientific interest. Accordingly, the scientist may be more permissive with the thresholds. The financial protocol sets hypotheses based on economic sustainability. Hn=[-5, -3] assumes that the minimum useful reduction is 6 units. Hr=[-∞, -21] identifies results that remain worthy of investment, although these call for more caution or changes (e.g., a reduction in the administered dose). The thresholds could be similar to the clinical ones. The ideal situation is one in which Ho is strongly compatible with our experimental data, while Hn, Hr, and Hd are strongly incompatible. Guided examples for calculating p-values and s-values for various hypotheses are provided in the literature [11,23,33] (Supplemental Material 1).
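As one concrete way to encode such a protocol, the clinician's interval hypotheses can be written as a small lookup structure (Python). The exact boundary assigned to Hd and the treatment of gaps between classes are our reading of the text, so treat this as an illustrative assumption rather than the protocol's definitive encoding:

```python
# Hypothetical clinician NORD-h protocol for the change in LDL
# cholesterol (mg/dL). Hd is assumed to cover everything above the
# minimum non-detrimental reduction of 1 unit; values falling between
# intervals remain unclassified.
CLINICIAN = {
    "Ho": (-15.0, -5.0),           # optimal: favorable risk-benefit ratio
    "Hn": (-4.0, -1.0),            # practically null
    "Hr": (float("-inf"), -16.0),  # risky: too-drastic decrease
    "Hd": (-1.0, float("inf")),    # detrimental: effectively an increase
}

def classify(h: float) -> str:
    """Return the first NORD-h class whose interval contains hypothesis h."""
    for name, (lo, hi) in CLINICIAN.items():
        if lo <= h <= hi:
            return name
    return "unclassified"  # e.g., h = -4.5 falls between Ho and Hn

print(classify(-7.0), classify(-2.0), classify(-20.0), classify(5.0))
```

Making the intervals explicit in code (or in a pre-registered table) is precisely what prevents post-hoc redefinition of which effects count as optimal or risky.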

Application of the NORD-h protocol

Table 3 shows hypothetical pre-treatment versus post-treatment levels in 10 patients. Calculation details are reported in Supplemental Material 1. Since we are concerned about possible detrimental effects, we employ a 1-sided 1-sample t-test (the design is deliberately kept sub-optimal for educational purposes). The test results are shown in Figure 2. We reiterate that “incompatibility thresholds” provide general guidelines to minimize post-hoc interpretations. Nevertheless, lower (respectively higher) s-values for the optimal (respectively non-optimal) hypotheses are better. Similarly, the interval hypotheses are not equivalence intervals (e.g., h=0 is much less impactful than h=10 even if both are classified as detrimental). Therefore, such thresholds are not meant to replace scientific thinking; rather, they should channel it in the right direction.
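For readers who want to reproduce the mechanics (not the exact figures) of the surprisal graph, the sketch below back-derives the standard error from the reported summary (mean decrease of 7 units, 95% CI of width 14) and computes s-values for the clinician's boundary hypotheses. We use a two-sided normal approximation rather than the paper's 1-sided 1-sample t-test, so the numbers are illustrative only and do not reproduce Figure 2:

```python
import math
from statistics import NormalDist

# Reported summary on the change scale: mean r = -7 mg/dL, 95% CI (-14, 0).
# SE back-derived from the t-based CI half-width: 7 / t(0.975, df=9) ≈ 3.09.
r = -7.0
se = 7.0 / 2.262

def s_for_hypothesis(h: float) -> float:
    """Surprisal (bits) of the observed mean under target hypothesis h,
    using a two-sided normal approximation (illustrative only)."""
    z = abs(r - h) / se
    p = 2 * (1 - NormalDist().cdf(z))
    return -math.log2(p)

# Boundary hypotheses of the clinician protocol (limits of Hn, Ho, Hr).
for h in (0, -1, -4, -5, -15, -16):
    print(f"h={h:+3d}: s ≈ {s_for_hypothesis(h):.1f}")
```

Hypotheses near the observed mean (e.g., the Ho limit h=-5) yield low surprisal, while hypotheses far from it (h=0 or h=-16) yield high surprisal, which is the pattern the bars in the surprisal graph are meant to convey.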

Example of low-density lipoprotein-cholesterol levels in the treated group (unit: mg/dL)

Figure. 2.

Null-Optimal-Risky-Detrimental hypotheses clinical protocol for a cholesterol treatment according to the 1-sample t-test. The conditional ideal scenario is as follows: the green bars (Ho) should be below the green line, the blue bars (Hn) should be above the blue line, and the orange and red bars (Hr and Hd, respectively) should be above the burgundy line. The experimental numerical result (the one where the bar is absent), should be within the optimal range (green region). The black crosses represent patients who exhibited cholesterol changes corresponding to the hypothesis shown below (e.g., the 2 black crosses above hypothesis h=-9 indicate that 2 patients in the dataset recorded a real decrease in cholesterol of 9 mg/dL). Hd, detrimental hypotheses; Hn, null hypotheses; Ho, optimal hypotheses; Hr, risky hypotheses.

Traditional hard approach: The average decrease of 7 units is non-significant since (null) p≥0.05 [wrong conclusion].

Traditional soft approach

The average decrease of 7 units, 95% CI=(0, 14), is statistically non-significant since (null) p≥0.05 but may have some clinical importance [post-hoc interpretation highly exposed to biases and conflicts of interest].

NORD-h protocol, conditional approach: Under the clinician protocol, considering the background hypotheses as sufficiently met, the scenario is not exactly ideal since both Ho=[-15, -5] and Hn=[-4, -1] do not markedly disagree with the data (s-value<3 in many cases). However, it is good that the average decrease r=-7 falls within Ho and that only the non-dangerous hypotheses (Hn and Ho) are generally highly compatible with the data (s<4). Therefore, the outcome aligns better with beneficial or non-hazardous effects [too dependent on unverified assumptions].

NORD-h protocol, least conditional approach

Under the clinician protocol, as assessed by the chosen test—with testable statistical assumptions that could be “reasonably met” according to some criteria (e.g., Q-Q plot)—these findings are consistent with a possible non-damaging effect. However, other hypotheses are compatible with the above situation. These include confounding (due to the absence of a control group), a too-small sample size (insufficient to properly evaluate statistical assumptions and to represent the target population well), and unidentified covariates (some patients experienced marked benefits while others did not). Therefore, an illustration of a least conditional conclusion is as follows: These findings align with both the existence of a clinical effect and the violation of some fundamental assumptions. Further studies with greater control over the sources of uncertainty are needed. The slight increase in cholesterol levels in some patients must be taken into consideration.

How to make the associated blended decision

The above situation is far from providing evidence of effectiveness but might justify further research under careful clinical supervision. However, this blended decision heavily depends on the overall context regarding the drug. For example, a solid biological/chemical background and the absence of major adverse events could support studies on larger groups. Conversely, excessive invasiveness of the therapy would necessitate reverting to the previous stage of development or abandoning the research regardless of the statistical outcome (which could still be useful in attempting to outline the reasons for the failure).

Considerations and Remarks

Preselecting interval hypotheses could be flagged as questionable by some, as it could still be influenced by personal beliefs, biases, and conflicts of interest. The authors of this manuscript defend this practice based on 2 main considerations: (1) the cognitive issues affecting statistical testing, and (2) the limited scope of the descriptive reading. This framework forces the abandonment of nullism and promotes a practical view of testing, linked to costs, risks, and benefits. Moreover, it compels researchers to define and establish pre-study protocols. Even admitting possible misuses, the declared non-inferential objective precludes overstatements that could have negative consequences for all stakeholders and for science credibility. Finally, statistical surprise also prevents ritualistic interpretations of significance.

A Cautionary Note: Do Not Make Important Decisions Based on Single Studies

This approach is based on moderation due to the underlying uncertainty that characterizes medical science [20]. In this regard, we emphasize that the research limitations must always be presented alongside the hypothesis of interest, assessing their compatibility with the observed experimental scenario. It is necessary to address such inconvenient hypotheses (e.g., plausible confounding) with the same rigor as one investigates the hypothesis of interest (e.g., therapy effectiveness). Unless dealing with solid evidence like well-conducted meta-analyses of randomized trials, it is not a matter of generalizing results but rather providing a descriptive overview that is as impartial as possible. HC decisions can be reasonable if and only if they stem from several consonant high-quality studies of varying nature. Indeed, quoting Dr. Amrhein, “[...] being modest about our conclusions is one of the most important scientific virtues” [34].

CONCLUSION

The NORD-h protocol is a powerful tool for evaluating the consistency of experimental data with all hypotheses of interest at the scientific and clinical levels, avoiding the flawed practice of considering only the null hypothesis of exactly zero effect and, at most, a single alternative. Furthermore, the NORD-h protocol incorporates cost-benefit analysis into the selection of interval hypotheses, which counters the simplistic classification of outcomes as “significant” or “not significant” based on a mere numerical, uncontextualized criterion. While it does not replace decision-making, this approach is also well-suited for interpreting the results of meta-analyses. Adopting s-values instead of p-values allows for a clearer quantification of the information provided by the data according to the chosen statistical model. In cases where prior knowledge about the investigated phenomenon is limited, these methods still provide a systematic way to describe relationships between data and hypotheses.

Ethics Statement

This study does not need approval from the institutional research ethics committee because it does not involve human participants, animal subjects, or other elements requiring ethical clearance.

Supplemental Materials

Supplemental material is available at https://doi.org/10.3961/jpmph.24.250.

Supplemental Material 1.

Figure S1. One sample t-test.

Figure S2. One sample t-test table (from: https://doi.org/10.1016/j.gloepi.2024.100151).

jpmph-24-250-Supplementary-Material-1.docx

Notes

Conflict of Interest

The authors have no conflicts of interest associated with the material presented in this paper.

Funding

None.

Author Contributions

Both authors contributed equally to conceiving the study, analyzing the data, and writing this paper.

Acknowledgements

We thank Sander Greenland for helpful comments.

References

1. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 2016;31(4):337–350. https://doi.org/10.1007/s10654-016-0149-3.
2. Rovetta A. Statistical significance misuse in public health research: an investigation of the current situation and possible solutions. J Health Policy Outcomes Res 2024;https://doi.org/10.7365/JHPOR.2024.1.7.
3. Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat 2016;70(2):129–133. https://doi.org/10.1080/00031305.2016.1154108.
4. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p<0.05”. Am Stat 2019;73(sup1):1–19. https://doi.org/10.1080/00031305.2019.1583913.
5. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567(7748):305–307. https://doi.org/10.1038/d41586-019-00857-9.
6. McShane BB, Gal D, Gelman A, Robert C, Tackett JL. Abandon statistical significance. Am Stat 2019;73(sup1):235–245. https://doi.org/10.1080/00031305.2018.1527253.
7. Gelman A. The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Pers Soc Psychol Bull 2018;44(1):16–23. https://doi.org/10.1177/0146167217729162.
8. McShane BB, Bradlow ET, Lynch JG, Meyer RJ. “Statistical significance” and statistical reporting: moving beyond binary. J Mark 2023;88(3):1–19. https://doi.org/10.1177/00222429231216910.
9. Greenland S. Divergence versus decision p -values: a distinction worth making in theory and keeping in practice: or, how divergence p -values measure evidence even when decision p -values do not. Scand J Stat 2023;50(1):54–88. https://doi.org/10.1111/sjos.12625.
10. Greenland S. Valid p -values behave exactly as they should: some misleading criticisms of p -values and their resolution with s-values. Am Stat 2019;73(sup1):106–114. https://doi.org/10.1080/00031305.2018.1529625.
11. Rafi Z, Greenland S. Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Med Res Methodol 2020;20(1):244. https://doi.org/10.1186/s12874-020-01105-9.
12. Mansournia MA, Nazemipour M, Etminan M. P-value, compatibility, and s-value. Glob Epidemiol 2022;4:100085. https://doi.org/10.1016/j.gloepi.2022.100085.
13. Rovetta A. S-values and surprisal intervals to replace p -values and confidence intervals: accepted - January 2024. Revstat Stat J [cited 2024 Feb 4]. Available from: https://revstat.ine.pt/index.php/REVSTAT/article/view/669.
14. American Statistical Association. American Statistical Association releases statement on statistical significance and p -values; 2016 [cited 2024 Feb 4]. Available from: https://www.amstat.org/asa/files/pdfs/P-valuestatement.pdf.
15. Rovetta A, Mansournia MA, Vitale A, Accurso V. How to read Pvalues and confidence intervals in public health studies. Saudi J Anaesth 2024;18(3):459–460. https://doi.org/10.4103/sja.sja_128_24.
16. Amrhein V, Trafimow D, Greenland S. Inferential statistics as descriptive statistics: there is no replication crisis if we don’t expect replication. Am Stat 2019;73(sup1):262–270. https://doi.org/10.1080/00031305.2018.1543137.
17. Amrhein V, Greenland S. Discuss practical importance of results based on interval estimates and p -value functions, not only on point estimates and null p -values. J Inf Technol 2022;37(3):316–320. https://doi.org/10.1177/02683962221105904.
18. Cole SR, Edwards JK, Greenland S. Surprise! Am J Epidemiol 2021;190(2):191–193. https://doi.org/10.1093/aje/kwaa136.
19. Greenland S, Mansournia MA, Joffe M. To curb research misreporting, replace significance and confidence by compatibility: a Preventive Medicine Golden Jubilee article. Prev Med 2022;164:107127. https://doi.org/10.1016/j.ypmed.2022.107127.
20. Ting C, Greenland S. Forcing a deterministic frame on probabilistic phenomena: a communication blind spot in media coverage of the “replication crisis”. Sci Commu 2024;46(5):672–684. https://doi.org/10.1177/10755470241239947.
21. Rubin M. “Repeated sampling from the same population?” A critique of Neyman and Pearson’s responses to Fisher. Eur J Philo Sci 2020;10:42. https://doi.org/10.1007/s13194-020-00309-6.
22. Rovetta A. Multiple confidence intervals and surprisal intervals to avoid significance fallacy. Cureus 2024;16(1)e51964. https://doi.org/10.7759/cureus.51964.
23. Rovetta A, Mansournia MA, Vitale A. For a proper use of frequentist inferential statistics in public health. Glob Epidemiol 2024;8:100151. https://doi.org/10.1016/j.gloepi.2024.100151.
24. Mansournia MA, Collins GS, Nielsen RO, Nazemipour M, Jewell NP, Altman DG, et al. A checklist for statistical assessment of medical papers (the CHAMP statement): explanation and elaboration. Br J Sports Med 2021;55(18):1009–1017. https://doi.org/10.1136/bjsports-2020-103652.
25. Greenland S, Rafi Z, Matthews R, Higgs M. To aid scientific inference, emphasize unconditional compatibility descriptions of statistics. ArXiv [Preprint] 2022;[cited 2024 Feb 4]. Available from: https://doi.org/10.48550/arXiv.1909.08583.
26. Greenland S, Hofman A. Multiple comparisons controversies are about context and costs, not frequentism versus Bayesianism. Eur J Epidemiol 2019;34(9):801–808. https://doi.org/10.1007/s10654-019-00552-z.
27. National Academy of Sciences (US), National Academy of Engineering (US) and Institute of Medicine (US) Panel on Scientific Responsibility and the Conduct of Research. Responsible science: ensuring the integrity of the research process: volume I. Washington, D.C.: National Academies Press (US); 1992. https://doi.org/10.17226/1864.
28. Good IJ. Rational decisions. J R Stat Soc Ser B Stat Methodol 1952;14(1):107–114. https://doi.org/10.1111/j.2517-6161.1952.tb00104.x.
29. Bann D, Courtin E, Davies NM, Wright L. Dialling back ‘impact’ claims: researchers should not be compelled to make policy claims based on single studies. Int J Epidemiol 2024;53(1):dyad181. https://doi.org/10.1093/ije/dyad181.
30. Greenland S. Analysis goals, error-cost sensitivity, and analysis hacking: essential considerations in hypothesis testing and multiple comparisons. Paediatr Perinat Epidemiol 2021;35(1):8–23. https://doi.org/10.1111/ppe.12711.
31. Lash TL, Fox MP, MacLehose RF, Maldonado G, McCandless LC, Greenland S. Good practices for quantitative bias analysis. Int J Epidemiol 2014;43(6):1969–1985. https://doi.org/10.1093/ije/dyu149.
32. Mansournia MA, Nazemipour M. Recommendations for accurate reporting in medical research statistics. Lancet 2024;403(10427):611–612. https://doi.org/10.1016/S0140-6736(24)00139-9.
33. Poole C. Beyond the confidence interval. Am J Public Health 1987;77(2):195–199. https://doi.org/10.2105/ajph.77.2.195.
34. Amrhein V. Statistics is for statisticians; 2020 [cited 2024 Feb 4]. Available from: https://isi-web.org/article/statistics-statisticians.


Figure 1.

Null-Optimal-Risky-Detrimental hypotheses (NORD-h) protocol: structure and scope.

Figure 2.

Null-Optimal-Risky-Detrimental hypotheses clinical protocol for a cholesterol treatment according to the 1-sample t-test. The conditional ideal scenario is as follows: the green bars (Ho) should be below the green line, the blue bars (Hn) should be above the blue line, and the orange and red bars (Hr and Hd, respectively) should be above the burgundy line. The experimental numerical result (the one where the bar is absent) should fall within the optimal range (green region). The black crosses represent patients who exhibited cholesterol changes corresponding to the hypothesis shown below them (e.g., the 2 black crosses above the hypothesis h=-9 indicate that 2 patients in the dataset recorded a real decrease in cholesterol of 9 mg/dL). Hd, detrimental hypotheses; Hn, null hypotheses; Ho, optimal hypotheses; Hr, risky hypotheses.

Table 1.

Major decision types in public health

Decision type | Description | Decision rule | Evidence type
Health-consequential | Has an immediate or short-term impact on the population’s health | Always requires many consistent studies | Multiple types are necessary
Non-health-consequential | Has no immediate or short-term impact on the population’s health | Can be informed by single studies | Multiple types are preferable

Table 2.

Different types of NORD-h protocols for a hypothetical new treatment for low-density lipoprotein-cholesterol (mg/dL)

Protocol | Hn | Ho | Hr | Hd | Ideal Sn | Ideal So | Ideal Sr, Sd
Clinician | [-4, -1] | [-15, -5] | [-∞, -16] | [0, +∞] | ≥4 | ≤3 | ≥5
Scientist | [-3, 3] | [-20, -4] | [-∞, -21] | [4, +∞] | ≥4 | ≤4 | ≥5
Financier | [-5, -3] | [-20, -6] | [-∞, -21] | [-2, +∞] | ≥4 | ≤3 | ≥5

NORD-h, Null-Optimal-Risky-Detrimental hypotheses; Hn, null hypotheses; Ho, optimal hypotheses; Hr, risky hypotheses; Hd, detrimental hypotheses; Sn, indicative threshold for the null hypotheses; So, indicative threshold for the optimal hypotheses; Sr, indicative threshold for the risky hypotheses; Sd, indicative threshold for the detrimental hypotheses.
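The interval logic of Table 2 can be sketched as follows (a minimal illustration using the clinician row; the function name and dictionary layout are ours, not the authors' implementation):

```python
def classify(h: float, protocol: dict) -> str:
    """Return the NORD-h class ('Hn', 'Ho', 'Hr', or 'Hd') whose interval
    contains the target hypothesis h, for one protocol row of Table 2."""
    for label, (lo, hi) in protocol.items():
        if lo <= h <= hi:
            return label
    raise ValueError(f"h={h} is not covered by any interval")

# Clinician row of Table 2 (LDL-cholesterol change, mg/dL);
# float infinities stand in for the open-ended bounds.
clinician = {
    "Hn": (-4, -1),              # null: change too small to matter
    "Ho": (-15, -5),             # optimal: clinically desirable decrease
    "Hr": (float("-inf"), -16),  # risky: implausibly large decrease
    "Hd": (0, float("inf")),     # detrimental: any increase
}

print(classify(-9, clinician))  # 'Ho': a 9 mg/dL decrease is optimal
print(classify(2, clinician))   # 'Hd'
```

Note that different stakeholders (clinician, scientist, financier) plug in different interval bounds, which is how the protocol makes the cost-risk-benefit context explicit.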

Table 3.

Example of low-density lipoprotein-cholesterol levels in the treated group (unit: mg/dL)

Patient | Pre-treatment | Post-treatment | Difference
1 | 150 | 133 | -17
2 | 160 | 139 | -21
3 | 155 | 161 | 6
4 | 148 | 151 | 3
5 | 162 | 147 | -15
6 | 158 | 149 | -9
7 | 153 | 157 | 4
8 | 149 | 150 | 1
9 | 157 | 144 | -13
10 | 151 | 142 | -9
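Using the differences in Table 3, the one-sample t statistic against any NORD-h target hypothesis h follows from the standard formula t = (x̄ - h)/(s/√n) (a minimal sketch with Python's standard library; the function name is ours, and the p-value/s-value step for each t would then use the t-distribution with n-1 degrees of freedom):

```python
import math
from statistics import mean, stdev

differences = [-17, -21, 6, 3, -15, -9, 4, 1, -13, -9]  # Table 3, mg/dL

def t_statistic(data, h):
    """One-sample t statistic for the target hypothesis 'mean change = h'."""
    n = len(data)
    return (mean(data) - h) / (stdev(data) / math.sqrt(n))

print(mean(differences))                       # -7.0 mg/dL mean change
print(round(t_statistic(differences, 0), 2))   # -2.27 against the point null h = 0
print(t_statistic(differences, -7))            # 0.0 at the point estimate itself
```

The statistic is zero at the observed mean and grows in magnitude as h moves away from it, which is why surprisal is evaluated separately against each hypothesis in the Hn, Ho, Hr, and Hd intervals rather than against h = 0 alone.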