JPMPH : Journal of Preventive Medicine and Public Health

Special Article
P>0.05 Is Good: The NORD-h Protocol for Several Hypothesis Analysis Based on Known Risks, Costs, and Benefits
Alessandro Rovetta1, Mohammad Ali Mansournia1,2
Journal of Preventive Medicine and Public Health 2024;57(6):511-520.
DOI: https://doi.org/10.3961/jpmph.24.250
Published online: September 20, 2024

1International Committee Against the Misuse of Statistical Significance, Bovezzo, Italy

2Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran

Corresponding author: Alessandro Rovetta, International Committee Against the Misuse of Statistical Significance, Via Brede Traversa II, Bovezzo 25073, Italy. E-mail: alessandrorovetta@redeev.com
• Received: May 18, 2024   • Revised: July 15, 2024   • Accepted: August 14, 2024

Copyright © 2024 The Korean Society for Preventive Medicine

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Statistical testing in medicine is a controversial and commonly misunderstood topic. Despite decades of efforts by renowned associations and international experts, fallacies such as nullism, the magnitude fallacy, and dichotomania are still widespread within clinical and epidemiological research. This can lead to serious health errors (e.g., misidentification of adverse reactions). In this regard, our work sheds light on another common interpretive and cognitive error: the fallacy of high significance, understood as the mistaken tendency to prioritize findings that lead to low p-values. Indeed, there are target hypotheses (e.g., a hazard ratio of 0.10) for which a high p-value is an optimal and desirable outcome. Accordingly, we propose a novel method that goes beyond mere null hypothesis testing by assessing the statistical surprise of the experimental result compared to the prediction of several target assumptions. Additionally, we formalize the concept of interval hypotheses based on prior information about costs, risks, and benefits for the stakeholders (NORD-h protocol). The incompatibility graph (or surprisal graph) is adopted in this context. Finally, we discuss the epistemic necessity for a descriptive, (quasi) unconditional approach in statistics, which is essential to draw valid conclusions about the consistency of data with all relevant possibilities, including study limitations. Given these considerations, this new protocol has the potential to significantly impact the production of reliable evidence in public health.
Context
Statistical significance misuse is a pervasive issue in medicine [1-6]. Despite being recognized problems for decades, nullism (exclusive analysis of the null hypothesis of zero effect), the magnitude fallacy (misinterpretation of statistical significance as practical significance), and dichotomania (p-value <0.05 is significant, p-value≥0.05 is non-significant) persist and consistently lead to serious health consequences, including misidentification of adverse events, illusory replication failures, and mistrust in science [7-20]. Therefore, we propose and discuss alternatives to statistical significance.
A Proper Definition of the P-value
Assuming all background hypotheses hold true (e.g., normality, linearity, absence of biases and confounding, honesty), the p-value is an index of incompatibility between data and the target hypothesis (e.g., the null hypothesis) as assessed by the selected test [14]. P-values nearing 1 (respectively 0) signify low (respectively high) incompatibility. However, a more fitting definition would be to view the p-value as a measure of compatibility, as increasing p-values equate to increasing compatibility [2,9,11-13,15]. The relevance of the term (in)compatibility is rooted in its expression of a degree of (dis)agreement rather than endorsement. For example, discovering a person at the crime scene is consistent with their guilt but it does not serve as evidence of guilt (the situation is also consistent with an attempt to assist or a mere coincidence). Likewise, declaring “these results are compatible with the drug’s effectiveness” conveys a necessary but insufficient condition to substantiate the drug’s effectiveness [9-16]. A second critical concept is the notion of interval estimates in contrast to point estimates, such as assessing the compatibility of data with a range of hypotheses rather than with a single hypothesis [17].
Incompatibility and Surprisal
P-values pose interpretative challenges. For example, the statistical information contained in the difference between p=0.95 and p=0.90 differs from that between p=0.10 and p=0.05, even though Δp is 0.05 in both instances (the information in a probability is translation-invariant only on the logarithmic scale). Also, the p-value is often mistakenly associated with the significance of real phenomena, despite being calculated under the assumption of a world of pure chance. To solve these problems, the s-value (surprisal) has been introduced [10-13,17-19]. Assuming all background hypotheses are true (we call this ideal situation the “utopian scenario”), the surprisal is the number “s” of consecutive heads we should obtain, tossing a fair coin “s” times, to match the unexpectedness of our statistical result compared to the prediction of the target hypothesis. Thus, the researcher is aware that the s-value is merely a benchmark and does not measure the scientific importance of the outcome: It simply “tells us” how surprised we should feel about the observed result compared to the prediction of the adopted model. The relationship between p-values and s-values is then expressed by p=0.5^s, i.e., s=-log2(p). For instance, according to a well-specified model, an s-value of 6.3 indicates that the statistical result is as surprising as getting approximately 6 consecutive heads in 6 fair tosses compared to the target hypothesis prediction. This framework allows us to see that the difference in surprise between the above 2 pairs of p-values is substantial, since Δs1=-log2(0.90)+log2(0.95)=0.15-0.07=0.08, whereas Δs2=-log2(0.05)+log2(0.10)=4.3-3.3=1.
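The conversion between p-values and s-values can be checked numerically. The following is a minimal sketch in Python; the function names `s_value` and `p_from_s` are illustrative choices and do not come from the article.

```python
import math

def s_value(p: float) -> float:
    """Surprisal in bits: s = -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

def p_from_s(s: float) -> float:
    """Inverse relation: p = 0.5 ** s."""
    return 0.5 ** s

# Same difference in p (0.05), very different difference in surprisal.
delta_high = s_value(0.90) - s_value(0.95)  # small: roughly 0.08 bits
delta_low = s_value(0.05) - s_value(0.10)   # one full bit (one extra coin toss)

print(round(delta_high, 2), round(delta_low, 2))
```

Working on the s-value scale makes the asymmetry explicit: halving a p-value always adds exactly one bit of surprisal, wherever on the p-value scale it happens.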
Surprisal and Loss Acceptability
There is no universal scale for assessing surprise, as the latter should be evaluated based on the acceptability of situational costs and risks [11]. To illustrate, getting 6 heads in 6 coin tosses is generally highly surprising, making it reasonable to bet on the hypothesis that the coin is rigged in most everyday contexts. Nonetheless, the key question is: How does the decision change if the stakes increase? For example, if the cost of an error were much higher than the benefit of winning the bet, one might raise the surprise requirement to 8 heads in 8 coin tosses—in other words, it is necessary to establish a loss function. Similarly, concerning a certain drug, an s-value of 6 might be considered sufficient if referring to data incompatibility with hypotheses of frequent mild adverse events (e.g., headache) but still insufficient if referring to data incompatibility with hypotheses of frequent serious adverse events (e.g., anaphylaxis). These considerations all depend on the risk-benefit ratio. Ergo, the present manuscript adopts the s-value to accomplish the following objectives: (1) incorporate costs, risks, and benefits in the hypothesis selection, (2) utilize interval hypotheses as opposed to point hypotheses (e.g., the interval [-3, 3] instead of the null hypothesis of zero effect), and (3) evaluate the incompatibility of the outcome with various interval hypotheses of interest.
Compatibility Intervals Instead of Confidence Intervals
Confidence intervals (CIs) are frequently (mis)used to gauge confidence in results. This unwarranted optimism rests on the expectation that all experimental replications would be equivalent, which is impossible to guarantee in scientific practice due to sources of uncertainty (e.g., human errors) and variability (e.g., living organisms) [16,20,21]. Nevertheless, a CI offers valuable information when employed as a compatibility interval [12,19,22]. A 100(1-α)% compatibility interval (e.g., 95% CI) contains all target assumptions whose p-value, according to the chosen test, is greater than α (e.g., 0.05). Therefore, a compatibility interval contains all target assumptions that are more compatible with data than the interval limits. A 90% CI=(0, 20) “tells us” that the null hypothesis h=0 is as compatible with our experimental data, according to the chosen test, as the non-null hypothesis h=20 (p-value=0.10). All contained hypotheses (e.g., h=1, 2, or 18, 19) have p-values>0.10, meaning they are more compatible with the data than h=0 and h=20. It should be noted that not all values inside a CI are equally compatible [23]: In a 2-sided test, the point estimate is the most compatible (p-value=1), and values near it are more compatible than those near the interval limits. Similarly, not all values outside are equally (in)compatible: The values just outside the limits are practically just as compatible as those just inside the limits, but those far from the limits are much less compatible.
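The reading of a CI as the set of hypotheses with p-value greater than α can be illustrated with a short sketch. It assumes a normal (Wald) approximation, a point estimate of 10, and a standard error chosen so that the 90% limits are 0 and 20; these are assumptions made here for illustration, since the text does not specify the test.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ESTIMATE = 10.0
# Standard error chosen so that the 90% limits fall at 0 and 20 (assumption).
SE = 10.0 / STD_NORMAL.inv_cdf(0.95)

def p_value(h: float) -> float:
    """2-sided p-value for the target hypothesis h under a normal approximation."""
    z = abs(ESTIMATE - h) / SE
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))

# Scan a grid of hypotheses and keep those with p > 0.10.
grid = [x + 0.5 for x in range(-5, 25)]  # -4.5, -3.5, ..., 24.5
inside = [h for h in grid if p_value(h) > 0.10]

print(min(inside), max(inside))          # grid points just inside the 90% limits
print(round(p_value(0.0), 3), round(p_value(20.0), 3), p_value(10.0))
```

The hypotheses retained by the scan are exactly the grid points between the interval limits, and the p-value rises steadily from the limits toward the point estimate.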
Compatibility to Avoid Nullism
The null hypothesis of zero effect is generally deemed the sole focus of interest, whereas there can be different hypotheses equally pertinent to research objectives. Suppose we have a hazard ratio (HR) of 4.5, 95% CI=(0.6, 34), null p-value=0.15 (we call “null p-value” the p-value referring to the null hypothesis). Can we conclude non-significance? No! The null p-value=0.15 only “tells us” that, according to the chosen test, the null hypothesis HR0=1 is quite compatible with our statistical result HR=4.5. However, there are many clearly non-null hypotheses with p-values>0.15, like HR1=2 or HR2=3 (as they are “closer” than HR0 to the point estimate HR=4.5). Since these hypotheses are clinically relevant (in most contexts), our findings are more consistent with a non-null effect than a null effect. Nevertheless, the 95% CI of (0.6, 34) displays a vast range of hypotheses with p>0.05, signaling high statistical uncertainty.
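The comparison between HR0=1, HR1=2, and HR2=3 can be sketched numerically. The example below assumes a Wald test on the log-HR scale, with the standard error back-calculated from the reported 95% CI; the article does not state which test produced these numbers, so the outputs are only approximate illustrations.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
HR_HAT, CI_LOW, CI_HIGH = 4.5, 0.6, 34.0

# Standard error of log(HR), recovered from the width of the 95% CI (assumption).
Z95 = STD_NORMAL.inv_cdf(0.975)
SE_LOG = (math.log(CI_HIGH) - math.log(CI_LOW)) / (2.0 * Z95)

def p_value(hr_hypothesis: float) -> float:
    """2-sided Wald p-value for a target hazard ratio."""
    z = (math.log(HR_HAT) - math.log(hr_hypothesis)) / SE_LOG
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

for hr in (1.0, 2.0, 3.0, 4.5):
    p = p_value(hr)
    s = max(0.0, -math.log2(p))
    print(f"HR={hr}: p={p:.2f}, s={s:.1f}")
```

Under these assumptions, the p-values increase as the target HR moves from 1 toward the point estimate of 4.5, which is the pattern the paragraph describes.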
Surprisal Intervals as Improvements of Compatibility Intervals
Compatibility intervals still present challenges. For instance, although the 99%, 95%, and 91% CIs seem symmetrically spaced around the central interval, the associated compatibility requirements are different since Δs1=-log2(0.01)+log2(0.05)=6.6-4.3=2.3 while Δs2=-log2(0.05)+log2(0.09)=4.3-3.5=0.8. For this reason, the surprisal interval, or s-I, (e.g., 4-I=s-interval for s=4) has been introduced [13,22]: This encompasses all target hypotheses that are less surprising than “s” consecutive heads in “s” fair tosses compared to the test result. Considering “r” as the observed experimental result (e.g., the average effect or, in general, the point estimate), if we obtain r=10 and a 4-I=(1, 19) for a 1-sample t-test we know that, in the utopian scenario, r=10 is as surprising as 4 consecutive heads in 4 fair tosses compared to the hypotheses h=1 and h=19, and less surprising than 4 consecutive heads in 4 fair tosses compared to the hypotheses h=2 or h=18 (since these are contained within the 4-interval). Thus, a straightforward connection can be made between surprisal and compatibility intervals. An s-value of 4 corresponds to a p-value of (0.5)^4≈0.06. The associated compatibility interval is therefore 100(1-0.06)%=94% CI. In general, an s-interval corresponds to a 100(1-0.5^s)% compatibility interval [23]. A novel convention for presenting multiple intervals enables the rapid extraction of valuable information. For example, 4-I=(a, b) and 5-I=(c, d) can be contracted in 4|5-I=(a, b|c, d), which shows the set of hypotheses between “a” and “b” that yield an s-value<4 and the set of hypotheses between “c” and “d” that yield an s-value<5 (according to the adopted test). A situation such as 4|5|6-I=(4, 10|2, 12|0, 14) displays the rate at which statistical surprise changes in relation to various hypotheses.
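The correspondence between s-intervals and compatibility intervals is a one-line conversion in each direction. A minimal sketch follows; the function names are illustrative and not taken from the article.

```python
import math

def compatibility_level(s: float) -> float:
    """Percent level of the compatibility interval matching an s-interval."""
    return 100.0 * (1.0 - 0.5 ** s)

def surprisal_of_level(level_percent: float) -> float:
    """s-value at the limits of a compatibility interval of the given level."""
    alpha = 1.0 - level_percent / 100.0
    return -math.log2(alpha)

for s in (4, 5, 6):
    print(f"{s}-I corresponds to a {compatibility_level(s):.2f}% interval")

for level in (91, 95, 99):
    print(f"{level}% CI limits have s = {surprisal_of_level(level):.1f}")
```

For s=4 the exact level is 93.75%, which the text rounds to 94%.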
Very different degrees of surprise near small effects can indicate large statistical uncertainty, which translates into an inability to draw useful conclusions about the main goal (even in the utopian scenario): e.g., in the context of blood pressure (mmHg), if the null hypothesis h=0 leads to s=6 and the non-null hypothesis h=0.5 leads to s=2, our result could be simultaneously considered very surprising (like 6 heads in 6 fair coin tosses) and not very surprising (like 2 heads in 2 fair coin tosses) compared to hypotheses of little practical relevance (h=0.0 and 0.5, respectively). Consequently, we cannot determine a global degree of incompatibility with hypotheses of small effects. However, summary papers (e.g., meta-analyses) still require more comprehensive frameworks.
Surprisal as a Measure of Incompatibility: Low S-values Can Be Very Good
Surprisal is tied to incompatibility: A result deemed highly surprising is essentially an unexpected outcome, meaning it is highly incompatible with the hypothesis under consideration (through the chosen statistical model). In the utopian scenario, the s-value is an intuitive measure of the statistical incompatibility of a result with a specific hypothesis through a certain test. A higher (respectively lower) s-value indicates greater (lesser) incompatibility. One of the most misunderstood aspects of statistical testing is that highly compatible hypotheses (e.g., p>0.10, i.e., s<3) can be very favorable for scientists. The obsession with seeking “statistical significance” as erroneous proof of practical significance has diverted attention from a trivial point—namely, the hypotheses most compatible with a statistical result are highly relevant as they represent those hypotheses that align better with what occurred in the experiment (as assessed by the chosen test). Indeed, a scientist might establish an ideal scenario where certain hypotheses should be highly compatible with the data (e.g., a substantial reduction in low-density lipoprotein [LDL] cholesterol due to therapy) while others should be highly incompatible (e.g., a null decrease in LDL cholesterol due to therapy).
(In)Compatibility as a Descriptive Non-inferential Measure
For frequentist-inferential statistics to be unambiguously informative about a research objective, it is necessary that all background assumptions hold. However, as discussed in the literature, researchers do not have the means to manage all sources of uncertainty [16,20]. In accordance with Amrhein et al. [16], the present manuscript adopts statistical incompatibility as a non-inferential measure of the discrepancy between the data and the chosen statistical model (which consists of the target and all other background hypotheses). The scope is not to infer the population parameters but to adopt s-values as descriptive statistics of the relationship between observed data and statistical hypotheses. While minimizing uncertainty remains an indispensable task [24], the advantage of the descriptive approach lies in being “unconditional” regarding background assumptions [16,20,25]. The goal is to present all scientific eventualities that are coherent with what happened in the experiment (including limitations). Indeed, even though we are merely interested in the target hypotheses and despite our best efforts to validate the model, other plausible explanations may be equally or more consistent with the data. The task of a good researcher is to present the most complete, unconditional scenario possible. In this regard, we suggest that “the least conditional approach” is most suitable, since this still relies on the authors’ overall ability to formulate conclusions. Inference exists only in the degree of consistency of various studies, and it is essential to understand that the p-value (or s-value) for the null or any other hypothesis does not play a particular role in it [8]. For example, if in 10 different studies, conducted to the best of our capacities, we obtain an HR of approximately 4 and a null p≈0.10 (>0.05) (i.e., null s≈3) each time, the most plausible hypothesis is still HR=4 and not the null hypothesis HR0=1.
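One way to make the arithmetic behind the 10-study example concrete is fixed-effect, inverse-variance pooling on the log-HR scale. This technique is introduced here purely for illustration and is not a method proposed in the article. The sketch further assumes 10 independent studies with identical estimates (HR=4) and identical standard errors, and a normal approximation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_sided_p(z: float) -> float:
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

log_hr = math.log(4.0)

# z-score that gives a 2-sided null p-value of 0.10 in a single study.
z_single = STD_NORMAL.inv_cdf(1.0 - 0.10 / 2.0)
se_single = log_hr / z_single

# Pooling k identical, independent studies shrinks the SE by sqrt(k).
k = 10
se_pooled = se_single / math.sqrt(k)
z_pooled = log_hr / se_pooled

s_single = -math.log2(two_sided_p(z_single))
s_pooled = -math.log2(two_sided_p(z_pooled))

print(f"single study: null s = {s_single:.1f}")
print(f"pooled ({k} studies): HR = {math.exp(log_hr):.1f}, null s = {s_pooled:.1f}")
```

Under these idealized assumptions, the pooled estimate stays at HR=4 while the null hypothesis becomes far more surprising than in any single study, which is consistent with the point that repeated "non-significant" results do not support the null.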
Decision Types: Health-consequential Versus Non-health Consequential
The scientific process is based on decisions [26]. The very choice to conduct a study implies a decision that entails direct consequences (e.g., allocation of funding) and indirect consequences (e.g., cutting resources for other sectors). Many of these consequences are unknown at the time the decision is made. Initial decisions (e.g., starting a project) and especially intermediate ones (e.g., continuing the project) may be driven by personal beliefs or motivations (e.g., authors’ or financiers’ obstinacy). Science cannot eliminate this uncertainty but can minimize it through pre-study protocols, independent experiments, vigilance, and promoting competence and integrity as values [27]. We define 2 types of decisions based on their implications (Table 1): health-consequential (HC) versus non-health-consequential (NHC).
A similar distinction was introduced by Good [28]: HC decisions could be defined as “terminal” (conclusive), while NHC decisions could be defined as “sequential” (they require further experiments). However, our modification stresses the impact on the stakeholders. HC decisions affect people’s health either directly (e.g., approving a treatment) or indirectly (e.g., rejecting it). These are the most sensitive as they involve patients, the most vulnerable stakeholders. Conversely, NHC decisions involve scientists, funders, institutions, and so forth, which may better withstand a negative impact. HC decisions necessitate multiple consistent studies (e.g., systematic reviews with meta-analyses), causal inference (e.g., randomized controlled trials), and various lines of evidence (e.g., statistical, biochemical, clinical, etc.) [26,29-32]. NHC decisions may, out of necessity, be informed by single studies (e.g., exploratory surveys), although stronger evidence is preferable.
The Transition From Non-health-consequential to Health-consequential Decisions
Clinical reality often presents blended situations. Let us consider the development of a drug. The process starts with an investigation of the underlying biochemical mechanisms. Next, we proceed to animal testing. Finally, human trials ensue. This phased process represents the gradual transition from NHC to HC decisions. Given the ethical and logistical necessity to possess increasingly robust evidence as the research progressively involves major costs and risks, it is crucial to establish at which “phase” the investigation has to be conducted. Terminal decisions must be weighed against practical considerations such as “Does the beneficial effect justify the therapy’s invasiveness and costs?”
The Null-Optimal-Risky-Detrimental hypotheses (NORD-h) Protocol
We propose a procedure to inform—not make—HC, NHC, and blended decisions: the Null-Optimal-Risky-Detrimental hypotheses (NORD-h) protocol (Figure 1). Its novelty stems from (1) testing all relevant hypotheses to counteract nullism, (2) integrating prior knowledge or reasoned thoughts on the risk-benefit ratio to counteract the magnitude fallacy, and (3) adopting multiple indicative thresholds to counteract dichotomania. This also involves publishing a pre-print, marked with a digital object identifier, in which various ranges of point hypotheses are defined: Practically null (Hn), optimal (Ho), risky (Hr), and detrimental (Hd). The scope is set to limit the impact of biases and conflicts of interest. Adopting multiple hypotheses and evaluation criteria based on costs, risks, and benefits prevents statistical testing from being a set of decisions arbitrarily driven by numbers, as it compels the actor—through the choice of Hn, Ho, Hr, and Hd—to constantly relate to the real scenario and the practical consequences of their actions. Furthermore, this makes it possible to consistently identify the hypotheses that are most and least compatible with the observed data. In this regard, we emphasize that the NORD-h protocol is not designed to be universal, but rather to be tailored to each specific study.
A Practical Example
Let us assume we want to evaluate the effectiveness of an LDL cholesterol-lowering drug. We adopt the NORD-h protocol: Hn contains all practically null hypotheses (e.g., consistent with small clinical relevance), Ho contains all optimal hypotheses (e.g., consistent with a favorable risk-benefit ratio), Hr contains all risky hypotheses (i.e., consistent with a too-drastic decrease in the considered time frame), and Hd contains all detrimental hypotheses (i.e., consistent with an increase in cholesterol). Let us consider 3 essential roles for the success of health sciences (Table 2): the clinician (who primarily focuses on patients), the scientist (who primarily focuses on phenomena), and the financier (who primarily focuses on sustainability).
Since the desired degree of incompatibility depends on hypotheses and goals, different non-dichotomous surprise thresholds Sn, So, Sr, and Sd for the respective Hn, Ho, Hr, and Hd can be selected. The clinician may hope that the incompatibility of the null hypothesis with the data is at least Sn=4 (consecutive heads) and the incompatibility of the optimal hypothesis is at most So=3 as they prioritize patients. Hn=[-4, -1] assumes that, given the adverse events, the target effect should be a reduction of at least 5 units, and the minimum non-detrimental effect should be a reduction of 1 unit (in mg/dL). Hr=[-∞, -16] considers that a reduction exceeding 15 units becomes dangerous. The scientific protocol sets hypotheses coherent with causal phenomena. Hn=[-3, 3] assumes that oscillations between -3 and 3 are compatible with random events. Hr=[-∞, -21] considers that reductions exceeding 20 units become dangerous for the main stakeholders (the patients), although it remains of scientific interest. Accordingly, the scientist may be more permissive with the thresholds. The financial protocol sets hypotheses based on economic sustainability. Hn=[-5, -3] assumes that the minimum useful reduction is 6 units. Hr=[-∞, -21] establishes results that remain worthy of investment, although these call for more caution or changes (e.g., a reduction in the administered dose). The thresholds could be similar to clinical ones. The ideal situation is the one in which Ho is strongly compatible with our experimental data, while Hn, Hr, and Hd are strongly incompatible. Guided examples for calculating p-values and s-values for various hypotheses are provided in the literature [11,23,33] (Supplemental Material 1).
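The three protocols can be written down as plain data, which makes the pre-registered choices explicit and easy to audit. The sketch below encodes the intervals from Table 2 and classifies a point hypothesis (in mg/dL); the `Protocol` class and the `classify` function are hypothetical names introduced here, not part of the article.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Interval = Tuple[float, float]  # closed interval [low, high]
INF = math.inf

@dataclass(frozen=True)
class Protocol:
    hn: Interval  # practically null hypotheses
    ho: Interval  # optimal hypotheses
    hr: Interval  # risky hypotheses
    hd: Interval  # detrimental hypotheses

# Interval hypotheses as listed in Table 2.
PROTOCOLS = {
    "clinician": Protocol(hn=(-4, -1), ho=(-15, -5), hr=(-INF, -16), hd=(0, INF)),
    "scientist": Protocol(hn=(-3, 3), ho=(-20, -4), hr=(-INF, -21), hd=(4, INF)),
    "financier": Protocol(hn=(-5, -3), ho=(-20, -6), hr=(-INF, -21), hd=(-2, INF)),
}

def classify(protocol_name: str, h: float) -> Optional[str]:
    """Return the label of the interval containing h, or None if h falls in a gap."""
    p = PROTOCOLS[protocol_name]
    for label, (low, high) in (("Hn", p.hn), ("Ho", p.ho), ("Hr", p.hr), ("Hd", p.hd)):
        if low <= h <= high:
            return label
    return None

print(classify("clinician", -7))   # the observed mean change in the worked example
print(classify("scientist", 0))
print(classify("financier", -2))
```

Because Table 2 lists integer endpoints, non-integer hypotheses between two adjacent intervals (e.g., -4.5 under the clinician protocol) are left unclassified in this sketch.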

Application of the NORD-h protocol

Table 3 shows invented pre-treatment versus post-treatment levels in 10 patients. Calculation details are reported in Supplemental Material 1. Since we are concerned about possible detrimental effects, we employ a 2-sided 1-sample t-test (the design is deliberately sub-optimal, for educational purposes). The test results are shown in Figure 2. We reiterate that “incompatibility thresholds” provide general guidelines to minimize post-hoc interpretations. Nevertheless, lower (respectively higher) s-values for the optimal (respectively non-optimal) hypothesis are better. Similarly, the interval hypotheses are not equivalence intervals (e.g., h=0 is much less impactful than h=10 even if both are classified as detrimental). Therefore, such thresholds are not meant to replace scientific thinking; rather, they should channel it in the right direction.
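The s-values for a range of target hypotheses can be computed directly from the differences in Table 3. The sketch below assumes a 2-sided 1-sample t-test on the paired differences (consistent with the 2-sided 95% CI quoted in the text) and evaluates the t distribution by numerical integration so that only the standard library is needed. The figures it produces are illustrative and may differ slightly from those in Supplemental Material 1, for example through rounding.

```python
import math

# Post-treatment minus pre-treatment differences from Table 3 (mg/dL).
DIFFS = [-17, -21, 6, 3, -15, -9, 4, 1, -13, -9]

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    width = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * width, df)
    area = total * width / 3
    return min(1.0, max(1.0 - 2.0 * area, 1e-300))

def s_value_for(h: float, data=DIFFS) -> float:
    """s-value of the target hypothesis 'mean difference = h'."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    t = (mean - h) / (sd / math.sqrt(n))
    return max(0.0, -math.log2(two_sided_p(t, n - 1)))

for h in (-20, -16, -10, -7, -5, -4, -1, 0, 5):
    print(f"h={h:>4}: s={s_value_for(h):.2f}")
```

The s-value is zero at the observed mean change and grows as the target hypothesis moves away from it in either direction, which is the shape of the surprisal graph described for Figure 2.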
Traditional hard approach: The average decrease of 7 units is non-significant since (null) p≥0.05 [wrong conclusion].

Traditional soft approach: The average decrease of 7 units, 95% CI=(0, 14), is statistically non-significant since (null) p≥0.05 but may have some clinical importance [post-hoc interpretation highly exposed to biases and conflicts of interest].
NORD-h protocol, conditional approach: Under the clinician protocol, considering the background hypotheses as sufficiently met, the scenario is not exactly ideal since neither Ho=[-15, -5] nor Hn=[-4, -1] markedly disagrees with the data (s-value<3 in many cases). However, it is good that the average decrease r=-7 falls within Ho and that only the non-dangerous hypotheses (Hn and Ho) are generally highly compatible with the data (s<4). Therefore, the outcome aligns better with beneficial or non-hazardous effects [too dependent on unverified assumptions].

NORD-h protocol, least conditional approach: Under the clinician protocol, as assessed by the chosen test (with testable statistical assumptions that could be “reasonably met” according to some criteria, e.g., a Q-Q plot), these findings are consistent with a possible non-damaging effect. However, other hypotheses are compatible with the above situation. These include confounding (due to the absence of a control group), too-small sample size (insufficient to properly evaluate statistical assumptions and well represent the target population), and unidentified covariates (some patients experienced marked benefits while others did not). Therefore, an illustration of a least conditional conclusion is as follows: These findings align with the existence of a clinical effect and the violation of some fundamental assumptions. Further studies with greater control over the sources of uncertainty are needed. The slight increase in cholesterol levels in some patients must be taken into consideration.

How to make the associated blended decision

The above situation is far from providing evidence of effectiveness but might justify further research under careful clinical supervision. However, this blended decision heavily depends on the overall context regarding the drug. For example, a solid biological/chemical background and the absence of major adverse events could support studies on larger groups. Conversely, excessive invasiveness of the therapy would necessitate reverting to the previous stage of development or abandoning the research regardless of the statistical outcome (which could still be useful in attempting to outline the reasons for the failure).
Considerations and Remarks
Preselecting interval hypotheses could be signaled as questionable by some as it could still be influenced by personal beliefs, biases, and conflicts of interest. The authors of this manuscript defend this practice based on 2 main considerations: (1) the cognitive issues affecting statistical testing, and (2) the limited scope of the descriptive reading. This framework forces the abandonment of nullism and boosts a practical view of testing, linked to costs, risks, and benefits. Moreover, it compels researchers to inform and establish pre-study protocols. Even admitting possible misuses, the declared non-inferential objective precludes overstatements that could have negative consequences for all stakeholders and science credibility. Finally, statistical surprise also prevents ritualistic interpretations of significance.
A Cautionary Note: Do Not Make Important Decisions Based on Single Studies
This approach is based on moderation due to the underlying uncertainty that characterizes medical science [20]. In this regard, we emphasize that the research limitations must always be presented alongside the hypothesis of interest, assessing their compatibility with the observed experimental scenario. It is necessary to address such inconvenient hypotheses (e.g., plausible confounding) with the same rigor as one investigates the hypothesis of interest (e.g., therapy effectiveness). Unless dealing with solid evidence like well-conducted meta-analyses of randomized trials, it is not a matter of generalizing results but rather providing a descriptive overview that is as impartial as possible. HC decisions can be reasonable if and only if they stem from several consonant high-quality studies of varying nature. Indeed, quoting Dr. Amrhein, “[...] being modest about our conclusions is one of the most important scientific virtues” [34].
The NORD-h protocol is a powerful tool for evaluating the consistency of experimental data with all hypotheses of interest at the scientific and clinical levels, avoiding the flawed practice of considering only the null hypothesis of exactly zero effect and, at most, a single alternative. Furthermore, the NORD-h protocol incorporates cost-benefit analysis into the selection of interval hypotheses, in contrast to the simplistic classification of outcomes as “significant” or “not significant” based on a mere numerical, uncontextualized criterion. While it does not replace decision-making, this approach is well-suited for interpreting the results of meta-analyses. Adopting s-values instead of p-values allows for a clearer quantification of the information provided by the data according to the chosen statistical model. In cases where prior knowledge about the investigated phenomenon is limited, these methods still provide a systematic way to describe relationships between data and hypotheses.
Ethics Statement
This study does not need approval from the institutional research ethics committee because it does not involve human participants, animal subjects, or other elements requiring ethical clearance.
Supplemental material is available at https://doi.org/10.3961/jpmph.24.250.
Supplemental Material 1.
Figure S1. One sample t-test.
Figure S2. One sample t-test table (from: https://doi.org/10.1016/j.gloepi.2024.100151).

Conflict of Interest

The authors have no conflicts of interest associated with the material presented in this paper.

Funding

None.

Author Contributions

Both authors contributed equally to conceiving the study, analyzing the data, and writing this paper.

We thank Sander Greenland for helpful comments.
Figure. 1.
Null-Optimal-Risky-Detrimental hypotheses (NORD-h) protocol: structure and scope.
Figure. 2.
Null-Optimal-Risky-Detrimental hypotheses clinical protocol for a cholesterol treatment according to the 1-sample t-test. The conditional ideal scenario is as follows: the green bars (Ho) should be below the green line, the blue bars (Hn) should be above the blue line, and the orange and red bars (Hr and Hd, respectively) should be above the burgundy line. The experimental numerical result (the one where the bar is absent), should be within the optimal range (green region). The black crosses represent patients who exhibited cholesterol changes corresponding to the hypothesis shown below (e.g., the 2 black crosses above hypothesis h=-9 indicate that 2 patients in the dataset recorded a real decrease in cholesterol of 9 mg/dL). Hd, detrimental hypotheses; Hn, null hypotheses; Ho, optimal hypotheses; Hr, risky hypotheses.
Table 1.
Major decision types in public health
Decision type | Description | Decision rule | Evidence type
Health-consequential | Has an immediate or short-term impact on the population’s health | Always requires many consistent studies | Multiple types are necessary
Non-health-consequential | Has no immediate or short-term impact on the population’s health | Can be informed by single studies | Multiple types are preferable
Table 2.
Different types of NORD-h protocols for a hypothetical new treatment for low-density lipoprotein-cholesterol (mg/dL)
Protocol | Hn | Ho | Hr | Hd | Ideal Sn | Ideal So | Ideal Sr, Sd
Clinician | [-4, -1] | [-15, -5] | [-∞, -16] | [0, +∞] | ≥4 | ≤3 | ≥5
Scientist | [-3, 3] | [-20, -4] | [-∞, -21] | [4, +∞] | ≥4 | ≤4 | ≥5
Financier | [-5, -3] | [-20, -6] | [-∞, -21] | [-2, +∞] | ≥4 | ≤3 | ≥5

NORD-h, Null-Optimal-Risky-Detrimental hypotheses; Hn, null hypotheses; Ho, optimal hypotheses; Hr, risky hypotheses; Hd, detrimental hypotheses; Sn, indicative threshold for the null hypotheses; So, indicative threshold for the optimal hypotheses; Sr, indicative threshold for the risky hypotheses; Sd, indicative threshold for the detrimental hypotheses.
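
The interval structure of Table 2 can be written as a small lookup. The Python sketch below is illustrative only and is not the authors' implementation: the names `PROTOCOLS`, `classify`, and `meets_ideal` are hypothetical, the indicative thresholds are assumed to be surprisal values (S-values), and hypotheses are assumed to lie on the integer grid used in the table, so a value such as -4.5 falls between intervals and is returned as unclassified.

```python
import math

INF = math.inf

# Intervals (mg/dL) and indicative thresholds transcribed from Table 2.
# The dictionary layout itself is an assumption made for this example.
PROTOCOLS = {
    "Clinician": {
        "intervals": {"Hn": (-4, -1), "Ho": (-15, -5), "Hr": (-INF, -16), "Hd": (0, INF)},
        "ideal": {"Hn": (">=", 4), "Ho": ("<=", 3), "Hr": (">=", 5), "Hd": (">=", 5)},
    },
    "Scientist": {
        "intervals": {"Hn": (-3, 3), "Ho": (-20, -4), "Hr": (-INF, -21), "Hd": (4, INF)},
        "ideal": {"Hn": (">=", 4), "Ho": ("<=", 4), "Hr": (">=", 5), "Hd": (">=", 5)},
    },
    "Financier": {
        "intervals": {"Hn": (-5, -3), "Ho": (-20, -6), "Hr": (-INF, -21), "Hd": (-2, INF)},
        "ideal": {"Hn": (">=", 4), "Ho": ("<=", 3), "Hr": (">=", 5), "Hd": (">=", 5)},
    },
}


def classify(h, protocol):
    """Return the hypothesis class (Hn, Ho, Hr, Hd) containing h, or None."""
    for label, (low, high) in PROTOCOLS[protocol]["intervals"].items():
        if low <= h <= high:
            return label
    return None  # h lies between the tabulated intervals


def meets_ideal(label, s_value, protocol):
    """Check an S-value against the indicative threshold for its class."""
    op, threshold = PROTOCOLS[protocol]["ideal"][label]
    return s_value >= threshold if op == ">=" else s_value <= threshold


# The same target hypothesis can belong to different classes per stakeholder.
for name in PROTOCOLS:
    print(name, classify(-4, name))
```

With these intervals, a hypothesized change of -4 mg/dL is a null hypothesis for the clinician and the financier but an optimal hypothesis for the scientist, which is the kind of stakeholder dependence the table is meant to convey.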

Table 3.
Example of low-density lipoprotein-cholesterol levels in the treated group (unit: mg/dL)
Patient | Pre-treatment | Post-treatment | Difference
1 | 150 | 133 | -17
2 | 160 | 139 | -21
3 | 155 | 161 | 6
4 | 148 | 151 | 3
5 | 162 | 147 | -15
6 | 158 | 149 | -9
7 | 153 | 157 | 4
8 | 149 | 150 | 1
9 | 157 | 144 | -13
10 | 151 | 142 | -9
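
As a worked illustration of the data in Table 3, the sketch below recomputes the paired differences and then tests several target hypotheses with a 1-sample t-test, converting each p-value to a surprisal. It rests on assumptions not stated in the table: the surprisal is taken to be the S-value s = -log2(p), the two-sided p-value is approximated by integrating the Student t density numerically (Simpson's rule) so that the example needs only the standard library, and the function names and the list of tested hypotheses are hypothetical.

```python
import math
import statistics

# Pre- and post-treatment LDL cholesterol (mg/dL) from Table 3.
pre = [150, 160, 155, 148, 162, 158, 153, 149, 157, 151]
post = [133, 139, 161, 151, 147, 149, 157, 150, 144, 142]
diffs = [after - before for before, after in zip(pre, post)]

n = len(diffs)
mean = statistics.mean(diffs)                 # observed mean change: -7 mg/dL
se = statistics.stdev(diffs) / math.sqrt(n)   # standard error of the mean


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Approximate two-sided p-value via Simpson's rule (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    width = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * width, df)
    central = total * width / 3               # probability mass in (0, |t|)
    return max(0.0, 1.0 - 2.0 * central)


def surprisal(h):
    """Return (p, S) for the target hypothesis 'mean change equals h'."""
    t = (mean - h) / se
    p = two_sided_p(t, n - 1)
    s = math.log2(1 / p) if p > 0 else math.inf
    return p, s


# Hypothetical selection of target hypotheses (mg/dL).
for h in (-16, -9, -7, -4, 0, 4):
    p, s = surprisal(h)
    print(f"h = {h:>3} mg/dL: p = {p:.3f}, S = {s:.2f} bits")
```

Under these assumptions the observed mean difference is -7 mg/dL, so the hypothesis h = -7 gives p = 1 and zero surprisal, which is consistent with the absent bar described in the Figure 2 caption. Hypotheses further from -7 receive smaller p-values and larger S-values.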

    Citations to this article as recorded by  
    • Interaction between opium use and cigarette smoking on bladder cancer: An inverse probability weighting approach based on a multicenter case-control study in Iran
      Rahim Akrami, Maryam Hadji, Hamideh Rashidian, Maryam Nazemipour, Ahmad Naghibzadeh-Tahami, Alireza Ansari-Moghaddam, Kazem Zendehdel, Mohammad Ali Mansournia
      Global Epidemiology.2025; 9: 100182.     CrossRef
