Misinterpretation of Statistical Nonsignificance as a Sign of Potential Bias: Hydroxychloroquine as a Case Study

© Kurtis Hagen 2022
Abstract
The term “statistical significance,” ubiquitous in the medical literature, is often misinterpreted, as is the “p-value” from which it stems. This article explores the implications of results that are numerically positive (e.g., those in the treatment arm do better on average) but not statistically significant. This lack of statistical significance is sometimes interpreted as strong, even decisive, evidence against an effect without due consideration of other factors. Three influential articles on hydroxychloroquine (HCQ) as a treatment for COVID-19 are illustrative. They all involve numerically positive but statistically nonsignificant results that were misinterpreted as strong evidence against HCQ’s efficacy. These and related considerations raise concerns regarding the reliability of academic/medical reasoning around COVID-19 treatments, and more generally, and regarding the potential for bias stemming from conflicts of interest.

Keywords: Statistical significance, p-value, hydroxychloroquine, COVID-19, bias

1. Introduction

The stakes are high in scientific-medical controversies, which can lead to research bias. This is especially true during a global pandemic, such as COVID-19. Lives, health, jobs, reputations, dollars, and political ideologies may be at stake. Researchers positioned to conduct large randomized controlled trials are important participants in the effort to resolve related controversies. However, being so positioned typically involves being embedded in an institutional structure with incentives that may bias the research and the interpretation of that research. It has been shown, for example, that industry-sponsored trials of pharmaceutical products tend to give more favorable results than do independent trials (see Goldacre 2012, 1-5). The so-called “funding effect,” though not definitive, does constitute “prima facie evidence that bias may exist” (Krimsky 2012, 18). A variety of factors, besides the obvious financial and career-based considerations, including ideological commitments, may influence how scientists interpret (or misinterpret) data.
This article focuses on a particular kind of misinterpretation that seems to have taken place, involving numerically positive but statistically nonsignificant results. For example, a higher percentage of people in the control arm of a study may require hospitalization, compared to the treatment arm, without this difference being sufficient to pass a test of statistical significance. Such a result cannot justify confidence that this difference is not a product of chance. However, in the cases in focus here, such a result is interpreted too negatively, as though it constitutes strong evidence against the effect. And this interpretation seems to align with the interests of the pharmaceutical industry, given that the treatment might compete with potentially much more lucrative alternatives. People working in mainstream media also operate within an incentive structure. Pharmaceutical advertising, for example, may influence reporting on related topics.
Improper interpretations of studies that can be accurately attributed to credible scientists involved in prominently published research, especially if these interpretations happen to align with institutional interests or favored ideologies, may be easily amplified by the media, bolstering narratives that may actually be less well-supported than people are led to believe. For these reasons, the idea that bias of some kind may play an important role in shaping the evaluation of COVID-19 treatments is worth considering. The research on the potential efficacy of HCQ in treating COVID-19 is revealing in this regard. Specifically, prominent studies on this topic do seem to be misinterpreted by their authors, in a relatively consistent direction, and these misinterpretations have been substantially amplified by the mainstream media. The discussion below reveals that the concept of statistical (non)significance plays an important role in the misinterpretation. But it also suggests an underlying problem that goes beyond the current practices involving the interpretation of statistical significance.

Although most of the bias-types and concerns discussed below are well known—such as publication bias and p-hacking—this paper considers them from a particular angle. Namely, regarding inexpensive treatments for COVID-19, such as HCQ, the predominant direction of bias may well be the opposite of what is typical; that is, errors may occur systematically in the direction of finding treatments not to be effective. First I briefly consider the potential role of HCQ in the context of the COVID-19 pandemic. I then discuss, over several sections, general issues involving statistical significance and the p-value associated with it.
Then I turn to three studies on HCQ as a treatment for COVID-19 which exemplify the problems described. I argue that the conclusions stated in these studies are misleading; they exaggerate the degree to which the statistical nonsignificance of their results undermines the hypothesis that HCQ is an effective treatment for COVID-19.

2. Background Regarding COVID-19

Responses to the COVID-19 crisis have been numerous and varied. They include mandated mask-wearing, restrictions on gatherings, and the forced closure of businesses. And there has been much emphasis on the development of vaccines. Comparatively little attention has been paid to treatments. However, the expensive drug remdesivir, despite slim evidence of only limited effectiveness (Wang 2020; Lerner 2020), and the existence of various safety concerns (Fan et al. 2020), has received favorable coverage in the media and is recommended by the National Institutes of Health (NIH) for certain types of patients (NIH 2020a). In contrast, promising inexpensive (off-patent) treatments such as those involving hydroxychloroquine and ivermectin have received less mainstream attention, and most of that attention has been negative.1 Although it is not clear that the evidence for their efficacy compares unfavorably to that for remdesivir, the NIH’s COVID-19 Treatment Guidelines Panel recommended against the use of either hydroxychloroquine or ivermectin2 except in a clinical trial (NIH 2020b, 2020c).
There have even been efforts to discourage doctors in the U.S. from prescribing hydroxychloroquine to COVID-19 patients,3 as well as efforts to prevent patients from acquiring the drug once prescribed (Erman 2020). There has also been relatively little attention paid to the potential role of zinc and vitamins C and D in combating the virus, despite considerable evidence of efficacy, especially regarding vitamin D (Walsh 2020).

Regarding HCQ in particular: it is a reasonably safe drug with a long history, and many doctors have reported favorable results using it to treat COVID-19 (often stressing the importance of proper timing, dosage, and other considerations, such as zinc supplementation). Such reports are often dismissed as “anecdotal,” which is how Anthony Fauci characterized them. Resistance to the use of HCQ comes from (ostensibly) authoritative sources. These sources, Fauci included, tend to focus on a few randomized placebo-controlled trials with discouraging results. They often imply that these studies essentially establish HCQ’s ineffectiveness, even though those particular trials appear seriously flawed (see the discussion of the RECOVERY and SOLIDARITY trials below). This is a serious matter, as many lives are at stake. Yale epidemiologist Harvey Risch has gone so far as to argue that “tens of thousands of patients with COVID-19 are dying unnecessarily,” stating that “the situation can be reversed easily and quickly” with the use of HCQ (Risch 2020).

My intention here is to point to specific problems that raise concerns about how the evaluation of HCQ has unfolded. I emphasize that I am not arguing that HCQ ought to be used to treat COVID-19; I am highlighting the potential impact of a certain type of misinterpretation of study results, regardless of what the final determination ought to be about HCQ in particular.
3. P-Values and Statistical Significance

The concept of “statistical significance” and the use of the “p-value” associated with it (both of which are described below) have faced strong criticism for some time.4 And recent calls to stop using the term “statistical significance” have been published in prominent venues. In a 2019 article published in Nature, more than 800 scientists endorsed a call to “retire statistical significance” (Amrhein et al. 2019). They encouraged more detailed and nuanced presentation of data, hoping that this might “help to halt overconfident claims, unwarranted declarations of ‘no difference’.” At about the same time, Ronald Wasserstein, Allen Schirm and Nicole Lazar recommended against the use of the concept “statistical significance,” writing:

The [2016] ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. … Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. (2019, 2)

The 2016 “ASA Statement on Statistical Significance and P-Values” that they refer to explains, “While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted” (Wasserstein 2016, 131). The p-value tells us the probability of obtaining data at least as extreme as those observed in a study, given that the null hypothesis is true, that is, assuming that the hypothesized effect does not exist.
This is different from the probability that the null hypothesis is true given the data—which is closer to what we would like to know. Although these ideas are often confused with each other, their values can be very different. Importantly, the probability that there is a real effect depends not only on the data in a particular study, but also on the probability of an effect independent of that data (as well as the bias involved in the production of the data). While the probability that a treatment has a real effect independent of a particular dataset may be hard to quantify, considering extreme cases can be revealing. Suppose there are two scientific theories. Theory A has clear biological plausibility supported by studies that show that the purported biological mechanisms actually do what would be necessary to produce the effect, and there are at least some exploratory studies that more directly suggest that the theory may well be true. Before the study in question is conducted, Theory A may have a probability that is at least modest, or even better. We can even stipulate for our example that Theory A is highly likely to be true.
Suppose that Theory B, in contrast, seems implausible, so that its prior probability is regarded as extremely low. Now suppose that both these scientific theories are independently studied, and both studies involve experiments yielding positive results with p-values of 0.05. In both cases, the probability of getting data at least as extreme, assuming that the hypothesized effect does not exist, is 5%—that’s what the p-value tells us. We’d like to know the probabilities that the related scientific theories are false. Are they, at least roughly, about 5% in both cases? No. The probability that the implausible theory is false remains high. The positive result is likely a matter of chance. After all, the p-value tells us that we would get a result at least this extreme 5% of the time even if there is no real effect, that is, if the relevant null hypothesis is true. Given the extremely low prior probability of the theory, it is likely that chance explains the result. The data from this study push the probability that the theory is true up from something very close to zero to something still very close to zero. Now consider the probability of Theory A, the biologically plausible theory with evidence already existing in its favor. Its probability was already high (by stipulation) and is further increased by this study.
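The contrast can be made concrete with a small calculation. The following is a minimal sketch in Python, under numbers assumed purely for illustration: the hypothetical study detects a real effect with probability 0.8 (its power) and produces a false positive with probability 0.05. Both figures are stipulations, not values from any actual trial.

    def posterior_real_effect(prior, power=0.8, alpha=0.05):
        """P(effect is real | significant result), by Bayes' rule."""
        return (power * prior) / (power * prior + alpha * (1 - prior))

    for label, prior in [("Theory A (plausible)", 0.90),
                         ("Theory B (implausible)", 0.01)]:
        print(f"{label}: prior {prior:.2f} -> posterior "
              f"{posterior_real_effect(prior):.3f}")
    # Theory A (plausible):   prior 0.90 -> posterior 0.993
    # Theory B (implausible): prior 0.01 -> posterior 0.139

Under these assumptions, the very same “significant” result leaves the implausible theory probably false while making the plausible one nearly certain.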
Whatever the precise numbers, the probability that these theories are true is not the same, and indeed may diverge greatly. The probability of one is close to zero and the probability of the other could be pretty much anything, depending on the particulars, possibly even higher than 95% if the prior probability was very high (Nuzzo 2014, 151). This point is not at all a new one. But it seems that some researchers may not adequately incorporate it into the interpretation of their results. The takeaway, for our purposes, is that the probability that we should assign to a theory may often have more to do with its biological plausibility, the results of other studies, and other such considerations (such as the reasonableness of the study design) than with the p-value of a particular study in isolation.

4. Problems with Statistical Significance

Here I want to highlight, and criticize, two general problematic tendencies in scientific discourse involving the concept of “statistical significance.” One is a misleading abbreviated way of referring to this, the other is an actual reasoning error. I will then show how these problems have manifested in the research on the efficacy of HCQ, and how that has led to inappropriately strong negative conclusions.
Problem 1: In ordinary language, when someone says there is no “significant difference” between two things, that is commonly understood to mean, roughly, that there is no difference that makes a difference. That is, the magnitude of the difference, if there is any at all, is not large enough to matter very much. But that is often not what the phrase means when scientists use it, especially in the medical research literature. Often researchers use the word “significance” as shorthand for “statistical significance,” which indicates a level of uncertainty. The measure for this level of uncertainty is the “p-value.” Assuming the null hypothesis to be true (often meaning that there is no real effect), the p-value indicates the probability of nevertheless obtaining a result at least as extreme. So, if p is 0.05, that means that there is a 5% probability that at least as much of a difference as that actually found would have occurred simply by chance, assuming there really is no difference between the groups being compared. Five percent is generally taken as the (arbitrary) cutoff. If there is any more uncertainty than that, then the difference observed is labeled “not statistically significant,” and often described simply as “not significant,” regardless of the magnitude of the observed difference. (Larger-magnitude differences are more likely to show statistical significance, but it depends also on sample sizes.) The above paragraph merely describes something that is well known, at least to scientists. Nevertheless, the term “statistical significance” may be seriously misleading, especially when presented to lay persons in abbreviated form. This should be fairly obvious, and yet the practice is fairly ubiquitous.
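This definition is easy to verify by simulation. Here is a minimal sketch in Python: when the null hypothesis is true, “significant” differences at the 0.05 level turn up in about 5% of experiments by chance alone. (The group sizes and the number of simulated trials are arbitrary choices for illustration.)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Simulate many two-group experiments in which the null is TRUE:
    # both groups are drawn from the same normal distribution.
    n_sim = 10_000
    hits = 0
    for _ in range(n_sim):
        a = rng.normal(size=50)
        b = rng.normal(size=50)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1

    # "Significant" differences arise about 5% of the time by chance alone.
    print(hits / n_sim)  # ~0.05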
Problem 2: A related problem is perhaps not quite so obvious, although it is hardly unknown. Andrew Anderson describes it this way: In a case where the effect size is in the “meaningful/interesting” range but is not statistically significant, “failure to reject the null is sometimes mistakenly viewed as an acceptance of the null” (Anderson 2019, 119). This phenomenon was in fact a substantial part of the reason 800 scientists called for the retirement of the concept of “statistical significance” in the above-mentioned letter published in Nature. They write, “Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05” (Amrhein et al. 2019, 305). And yet this problem is pervasive. They report, “An analysis of 791 articles across 5 journals found that around half mistakenly assume non-significance means no effect” (307).6

Here is one way the problem manifests: Suppose someone does an experiment and finds that there are more people in the treatment group, as compared to the control group, who manifest a certain pathology (potentially a side effect of the treatment). Suppose the effect is clinically significant; that is, if we could extrapolate from the results of the study, a meaningful number of people would be affected. But also suppose that since it was a small study the difference is not statistically significant. The researchers, requiring stronger evidence before they can claim they have found a real phenomenon, modestly report the observed difference as “not statistically significant.” They might say, in some formulations, that they “found no significant difference” between the treatment and control groups. That much is okay. The problem manifests when this inclination to be conservative is taken too far, and thus is no longer conservative. Rather than the modest claim that no effect can be positively asserted, it becomes the immodest assertion that no effect exists. It is also noteworthy that combining statistically insignificant findings can sometimes lead to a statistically significant one.
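A minimal sketch in Python, with hypothetical counts invented for illustration, shows how this can happen: two small trials, each numerically favorable to the treatment but individually nonsignificant, yield a conventionally significant result once combined.

    from math import sqrt
    from scipy.stats import norm

    def two_prop_p(x1, n1, x2, n2):
        """Two-sided pooled z-test for a difference between two proportions."""
        p1, p2 = x1 / n1, x2 / n2
        p = (x1 + x2) / (n1 + n2)
        z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        return 2 * norm.sf(abs(z))

    # Two hypothetical small trials, each favoring the treatment (10% vs. 18%
    # event rate) but individually "not significant" at the 0.05 level ...
    print(two_prop_p(10, 100, 18, 100))  # ~0.10
    print(two_prop_p(10, 100, 18, 100))  # ~0.10
    # ... yet pooling the same data crosses the conventional threshold:
    print(two_prop_p(20, 200, 36, 200))  # ~0.02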
For example, making precisely this case as it applies to the use of HCQ in outpatient trials, Joseph Ladapo and Harvey Risch write, “[T]he statistically insignificant results from each of the randomized COVID-19 trials of outpatient hydroxychloroquine translates into a statistically significant 24% risk reduction” (Ladapo and Risch 2020; see also Ladapo et al. 2020).

5. P-Values and the Direction of Bias

Errors are not uncommon in scientific research, though efforts should be made to minimize them. When errors appear systematic, tending to skew in a particular direction, this can be considered a form of research bias (Resnik 2000, 178), regardless of whether it is intentional (and thus dishonest) or unintentional (and thus, in some sense, “innocent,” if still potentially harmful). It is worth noting that the intentions of actors are often mysterious, even to the actors themselves. And intentions are generally tied to a complex set of motivations, and thus they are not always easily dichotomized into “intentional” or “unintentional,” even if one has some insight into them. In the cases in question here, while intent will not be adjudicated, the interests at play are sufficiently large to justify worry that they might have some influence on researchers, an influence that the researchers are, or should be, cognizant of to some degree. If this is in fact the case, mistaken conclusions would not be wholly innocent.

Part of what motivates the common (and usually wise) impulse to demand rather strict standards of evidence may be an implicit (and probably mostly correct) assumption regarding the net direction of the potential biases. Ordinarily, bias would tend to favor finding a publishable result, for that is what is good for the researchers and often their funders. Often, especially when studying a treatment, a statistically significant finding is more publishable. It is thus reasonable to be concerned that, if several research groups performed similar experiments but only one of them ended up with a statistically significant finding, which was then published, while the others simply went unreported, then the p-value for the one that did get reported may provide a misleading sense of the likelihood that the result is not a mere product of chance.
This phenomenon is known as “publication bias.” Such issues are common knowledge among scholars who use statistical analysis in their work and are among the reasons that there exist norms against activities that could be described as “p-hacking.” They also account, in part, for the relatively conservative approach to p-values, the requirement of a very low p-value before declaring a positive result. (Nevertheless, if the effect size is clinically meaningful and the sample size is relatively small, that should arguably, absent other serious problems, motivate further study.)

However, the situation is different when the research bias points in the opposite direction, for example, when the researchers’ interests are best served by finding no effect, and especially if the design of the study might arguably be in some way stacked against finding an effect. In such a case, the ordinarily appropriate demand for a very low p-value risks becoming less appropriate. For, if strong biases are arrayed against an effect, and yet a signal of an effect can be discerned (e.g., the people in the treatment arm do better on average, though with less statistical strength than required to justify confidence), one should be less inclined than one might otherwise be to regard this result as suggesting that there is no effect. It would certainly not be strong evidence against the efficacy of the treatment. The data, in this case, should be viewed as analogous to a witness’s “statements against interest,” which have more credibility than ordinary (self-serving) statements. In such cases, an otherwise unimpressive result might reasonably be viewed as positive on account of beating expectations (where these are unusual expectations given unusual circumstances). In the case of studies involving HCQ, there are powerful organizations with strong financial interests in finding that such inexpensive, off-patent drugs are not effective. And the focus seems to be on whether statistical significance has been obtained, with the mistaken assumption that failure to achieve statistical significance implies that the hypothesis that the drug is effective is positively improbable.
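The publication-bias worry sketched above is easy to quantify. A minimal sketch, in which the number of research groups is an assumption chosen purely for illustration:

    # Suppose ten groups independently test a treatment that has NO real
    # effect, and only a "significant" result gets written up.
    alpha, groups = 0.05, 10
    print(1 - (1 - alpha) ** groups)  # ~0.40: a 40% chance that some group
                                      # obtains p < 0.05 despite a null effect

In that scenario, the single published “p < 0.05” vastly overstates the evidence, which is part of why a conservative posture toward isolated significant findings is ordinarily sensible.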
6. Exemplary Models

Before considering the HCQ studies, let’s consider a couple of exemplary cases. Ronald L. Wasserstein, Executive Director of the American Statistical Association, presents an example involving another treatment for COVID-19, in an online presentation that was part of a conference on statistics and data science. This will give us a sense of what the conclusions of the studies examined below arguably should have looked like. Wasserstein discusses a study of two antivirals in combination, lopinavir and ritonavir, for the treatment of severe COVID-19.
The results included a difference of 5.8% in mortality favoring the treatment group, though the results were not “statistically significant,” given the size of the trial. Wasserstein comments, “What always happens with these things is that zero is in there [meaning the 95% confidence interval includes zero], so we get a conclusion that looks like this: …” (Wasserstein 2021). He then describes the study’s conclusion, which states, “In hospitalized adult patients with severe Covid-19, no benefit was observed with lopinavir–ritonavir treatment beyond standard care” (Cao et al. 2020, 1787). Wasserstein comments, “That’s not a conclusion that we should necessarily rush to because a 5.8 percentage point reduction in mortality is a big deal.” He then suggests that a better conclusion would look like the following:

Our estimate of the mortality difference at 28 days was -5.8 percentage points (= 19.2% - 25.0%); thus, adding lopinavir-ritonavir to standard care could result in a clinically large decrease in mortality. However, possible mortality differences that are highly compatible with our data, given our model, ranged from -17.3 (a very large decrease in mortality) to 5.7 (a large increase in mortality). … Further study of this potentially effective treatment is needed.

It should be noted that, in contrast with the studies on HCQ discussed below, the authors of this study did recommend further studies. They also specifically encouraged the testing of these drugs in combination with others (Cao et al. 2020, 1797).
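The interval in Wasserstein’s suggested wording can be reproduced with a few lines of Python. The event counts of 19/99 and 25/100 are assumed here as consistent with the reported 19.2% and 25.0%, not copied from the paper:

    from math import sqrt

    # 28-day mortality in Cao et al. 2020. Event counts of 19/99 and 25/100
    # are assumed here as consistent with the reported 19.2% vs. 25.0%.
    x1, n1 = 19, 99    # lopinavir-ritonavir arm
    x2, n2 = 25, 100   # standard-care arm
    p1, p2 = x1 / n1, x2 / n2

    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"{100 * diff:+.1f} points, 95% CI ({100 * lo:+.1f}, {100 * hi:+.1f})")
    # -> -5.8 points, 95% CI (-17.3, +5.7): the interval in the suggested
    #    conclusion, spanning a large benefit through a modest harm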
A similar point is made by Roger Kirk, in an article that advocates shifting the emphasis in research from “statistical significance” to “practical significance.” Kirk gives a hypothetical example of a medication that may improve the performance of Alzheimer patients on IQ tests.
The mean for the group taking the medication is 13 IQ points better than the control, but the p-value is 0.14—not statistically significant. And yet, Kirk explains:

This information should make any rational researcher think that the data provides some support for the scientific hypothesis. In fact, the best guess that can be made is that the population mean difference is 13 IQ points. A 95% confidence interval for the population mean difference indicates that it is likely to be between -6.3 and 32.3 IQ points. The nonsignificant t test does not mean that there is no difference between the IQs; all it means is that the researcher cannot rule out chance or sampling variability as an explanation for the observed difference. (Kirk 1996, 755)

He later asks: “Will the results replicate? Are they real? There is only one way to find out: Do a replication. Does the medication appear to have promise with Alzheimer patients? I think so” (756). In the following examples involving HCQ, the reasoning process contrasts starkly.

7. Problematic Conclusions in Three HCQ Studies

In an open letter, a large group of qualified professionals—statisticians, MDs, and quantitative researchers—have identified the above-described problem (problem 2) in three significant studies involving HCQ (see Watanabe 2020). They call for a revision of the conclusions of three papers, namely, Boulware et al. 2020, Skipper et al. 2020, and Mitjà et al. 2020. They explain: “[A]ll three hydroxychloroquine (HC) studies showed positive but inconclusive results.” What they mean by “inconclusive” presumably is this: since the results did not meet the (arbitrary) standard required to declare statistical significance, confidence that the observed “positive” difference is indicative of a real effect is not warranted, as it may well be a matter of chance. The point is that, contrary to what these studies claim, they do not provide strong evidence that there is no effect. Watanabe et al. conclude that these studies “might be underpowered,” that is, involve sample sizes that are inadequate to produce a statistically significant result even if the effect size is noteworthy. While this possibility is always available to explain away a high p-value, the authors motivate its potential applicability here by comparing these three studies with a “celebrated” study of a different drug, dexamethasone, that found absolute and relative effect sizes similar to those of the three HCQ studies but enjoyed a sample size more than eight times that of the largest of the three HCQ studies.

Before I summarize how the problem manifests in each of the three HCQ studies, it is worth noting that the authors of an influential study of the more expensive drug remdesivir characterize their findings as weakly positive: “Although not statistically significant, patients receiving remdesivir had a numerically faster time to clinical improvement than those receiving placebo among patients with symptom duration of 10 days or less” (Wang 2020, 1569). Now let’s consider the HCQ studies.
First, in Boulware et al. 2020, the authors twice state that “hydroxychloroquine did not prevent illness compatible with Covid-19 or confirmed infection” under the conditions tested (2020, 517, 522-523). But, as the authors of the above-mentioned letter point out, that is not the proper conclusion. The data is consistent with HCQ preventing COVID-19 in some people. It would be more accurate to say that the study was unable to establish, with a high degree of confidence, that the benefit observed was not a mere product of chance. That could be either because the apparent benefit was not real or because the sample sizes were too small. It could also be because HCQ is beneficial only in certain types of cases or circumstances, or in combination with other treatments. It is always true that a numerically positive finding with a high p-value might possibly be rescued by more data. To determine whether a call for more data is warranted, one may consider how strongly the study was powered. In this case, the sample size was based on a 90% power to detect a 50% relative effect size. In that context, the results suggest that (assuming the protocol was a sound one) HCQ probably does not have an effect of that magnitude, not that it has no effect.8
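A quick calculation makes the “effect of that magnitude” point concrete. The sketch below in Python uses the event counts reported by Boulware et al. (49 of 414 on HCQ vs. 58 of 407 on placebo); the rest is the ordinary normal-approximation interval:

    from math import sqrt

    # Boulware et al. 2020: new illness in 49/414 (11.8%) on HCQ vs.
    # 58/407 (14.3%) on placebo.
    x1, n1 = 49, 414
    x2, n2 = 58, 407
    p1, p2 = x1 / n1, x2 / n2

    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"{100 * diff:+.1f} points, 95% CI ({100 * lo:+.1f}, {100 * hi:+.1f})")
    # -> -2.4 points, 95% CI (-7.0, +2.2). A 50% relative reduction would be
    #    about -7.1 points, just outside the interval: the data weigh against
    #    an effect THAT large, not against any effect at all.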
Oblivious to such nuances, the mainstream media trumpeted the invalid conclusion of the study. Consider a USA Today article entitled, “Fact Check: Hydroxychloroquine Has Not Worked in Treating COVID-19, Studies Show” (Fauzia 2020), which is fairly representative of the mainstream treatment. Under the heading “Recent Studies Show Drug Doesn’t Help,” the article cites Boulware et al. 2020. Though the study itself is described reasonably well, the article does nothing to indicate that the study fails to support the claims made in the article’s title and in the section heading. Yet it is wrong to suggest that this study shows that HCQ does not work. It is not clear that the study even supports the idea that HCQ doesn’t work. What may well be a problem with the study itself, namely, that it is underpowered, is interpreted as a decisive problem with the treatment. Admittedly, in this case, the difference in question was not large (11.8% vs. 14.3%) and was far from statistical significance (p = 0.35). But the point here still applies.

Second, in the case of Mitjà et al. (2020), the authors write, “The results of this randomized controlled trial convincingly rule out any meaningful virological or clinical benefit of HCQ in outpatients with mild Covid-19.” That is pretty strong language. And yet, while the study does undermine the case for virological benefit, the data does not “convincingly rule out” clinical benefit. After all, 7.1% of those in the control arm required hospitalization, compared to 5.9% in the intervention arm (no one in the study died or required mechanical ventilation).
There were only 293 people in the study, so this difference was not statistically significant. However, it is quite possible that the sample size was simply too small for the positive effect to meet the standard of statistical significance. And yet, according to Mitjà et al., this data rules out any meaningful clinical benefit of HCQ in the conditions under consideration. In this case, scientific modesty fails as an explanation for the authors’ conclusion. The conclusion as stated by these researchers would still be too strong even if their results genuinely suggested that HCQ was ineffective, which it is not clear that they do. It seems to function, inappropriately, as an epistemic conversation stopper.
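Just how underpowered the trial was for this endpoint can be estimated with the standard two-proportion sample-size formula. A minimal sketch in Python, taking the reported 7.1% and 5.9% rates at face value:

    from math import sqrt
    from scipy.stats import norm

    # Hospitalization rates in Mitja et al. 2020: 7.1% (control) vs. 5.9%
    # (HCQ). How many participants per arm would be needed to detect a
    # difference of this size with 80% power at alpha = 0.05?
    p1, p2 = 0.071, 0.059
    z_a, z_b = norm.ppf(0.975), norm.ppf(0.80)
    pbar = (p1 + p2) / 2

    n_per_arm = ((z_a * sqrt(2 * pbar * (1 - pbar))
                  + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
                 / (p1 - p2) ** 2)
    print(round(n_per_arm))  # ~6,600 per arm, versus roughly 150 per arm
                             # actually enrolled (293 participants in total)

A trial of 293 people simply cannot speak with confidence about a difference of this size, in either direction.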
Third, Skipper et al. (2020) concluded, “Hydroxychloroquine did not substantially reduce symptom severity in outpatients with early, mild COVID-19” (2020). It seems odd that they chose the word “substantially.” The absolute effect size is 6% and the relative effect size is nearly 20%, which seem intuitively “substantial.” What they presumably mean is that the results were not statistically significant. Next to a graph which appears to show modestly reduced symptoms in the HCQ group as compared to the control group, Skipper et al. write, “The percentage of participants reporting symptoms over time did not statistically differ by use of hydroxychloroquine or placebo” (Skipper et al. 2020, 6). The phrase, “did not statistically differ,” is odd and misleading. The results did differ. Presumably, to say they did not do so “statistically” is just to say that the value for p was greater than 0.05. The very next sentence in the article tells a fuller story (which the graph that accompanies it visually corroborates): “By day 14, the proportion of hydroxychloroquine participants with symptoms was 6 percentage points less than that of placebo participants (24% vs. 30%; P = 0.21).”10 While the difference observed could well be a product of chance, the study does not give us a strong reason to think that HCQ does not reduce the likelihood of having symptoms.

In sum, Skipper et al.’s conclusion, that HCQ “did not substantially reduce symptom severity,” is misleading. The data is compatible with the view that it does reduce symptom severity (see the sketch below). While a careful reader of the study can easily find the relevant numbers and make their own judgments, the authors’ summary statements taken on their own (as they often will be when such a study is mentioned in the media) are misleading. Note that Skipper et al. 2020 has twenty-four co-authors, as does Boulware et al. 2020. And Mitjà et al. 2020 has forty-six. That’s a lot of authors to sign off on misleading representations of what their studies show.

To make matters worse, Skipper et al. write, “This builds on other randomized trial data on hydroxychloroquine, which have not shown any benefit for postexposure prophylaxis.” They offer three citations. One citation is to the above-described study by Boulware et al. (2020). They also cite the so-called “RECOVERY” trial (Horby 2020), which has been criticized for its high doses and its very late-stage usage. Their third citation is to a WHO webpage that is mostly about the WHO-led SOLIDARITY trial, which has issues similar to the RECOVERY trial. At that time, neither the RECOVERY nor the SOLIDARITY trial had been published in a peer-reviewed format, as both trials were halted before completion.

Taken individually, it might be argued that many of the problems described in this section constitute mere quibbles over language, and that these authors have done nothing worse than fail to be as nuanced as they might better have been; nobody’s perfect. But taken together, especially amplified by the media, as they have been, this may constitute a serious problem.
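Returning for a moment to the Skipper et al. day-14 numbers: a crude interval for the 24% vs. 30% difference shows how wide a range of effects the data remain compatible with. The arm sizes of roughly 210 each are an assumption for illustration, not the paper’s exact figures, though the qualitative picture is insensitive to the split:

    from math import sqrt

    # Skipper et al.'s day-14 numbers: 24% (HCQ) vs. 30% (placebo) still
    # symptomatic. Arm sizes of 210 each are an assumption for illustration.
    n1 = n2 = 210
    p1, p2 = 0.24, 0.30

    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"{100 * diff:+.1f} points, 95% CI ({100 * lo:+.1f}, {100 * hi:+.1f})")
    # -> roughly (-14.5, +2.5): compatible with anything from a large
    #    reduction in symptoms to a modest increase, not with a confident
    #    declaration of "no effect"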
The fact that the researchers in these cases drew (at best) overly strong negative conclusions suggests that the researchers may have been biased against finding efficacy. And this means that scrutiny of this research (and perhaps similar research) should focus on the possibility that, in other ways, it may have unfairly favored a negative finding—even though this is the opposite of the more common worry about studies in general. Finally, in a context in which incentives favor a negative finding, and yet positive but statistically insignificant correlations repeatedly arise in studies that are arguably underpowered, the most appropriate response is to call for further study, not to declare the matter settled. And further, regarding HCQ in particular, the fact that the opposite was done suggests that, in these further studies, particular attention should be paid to potential biases.

8. Further Considerations

It is not clear whether the misleading conclusions of these studies are merely innocent errors, or whether they may have been influenced by the interests that are at stake.
Nevertheless, the similarity between them might be thought to suggest a pattern implying the latter interpretation. In this section, I briefly discuss other cases that seem to corroborate this pattern, namely, that studies of HCQ and other promising COVID treatments tend to offer evaluations that are more negative than they should be, sometimes with significant implications. It is worth remembering the situation when the RECOVERY trial was discontinued. On May 22, The Lancet published a study that purported to show that HCQ treatment is harmful. On June 4 it announced that the study would be retracted (see Shamoo 2020, 325-326). The editor in chief of The Lancet, Richard Horton, ultimately described the study as “a monumental fraud” (Rabin 2020), a conclusion that the independent critics who first drew attention to the implausibility of the study’s data reached much more quickly. The announcement that the RECOVERY trial would be wrapping up11 was made on June 5, one day after the Lancet retraction had been announced. The process of deciding to end the trial presumably took longer than one day, however, and thus the apparently fraudulent Lancet paper may well have influenced that decision. The above only scratches the surface of the problems plaguing the research on HCQ. Dig deeper almost anywhere and one will find something questionable.
For example, the “VA Study” (Magagnoli et al. 2020a), which has been cited in criticism of HCQ as a treatment for COVID-19, has serious problems and limitations. For one thing, it does not report the HCQ dosing levels or timing. In addition, HCQ “was more likely to be prescribed to patients with more severe disease.” That’s quite a confounder! However, it is noteworthy that this admission occurs only in an early preprint version of the paper, and not in later versions—though presumably this truth didn’t go away. In that early version, the authors write, “[H]ydroxychloroquine, with or without azithromycin, was more likely to be prescribed to patients with more severe disease, as assessed by baseline ventilatory status and metabolic and hematologic parameters. Thus, as expected, increased mortality was observed in patients treated with hydroxychloroquine, both with and without azithromycin” (Magagnoli et al. 2020b, 12). The issue seems to call for further investigation, as it seems potentially serious enough to render the study essentially worthless. By removing this admission from later versions of the paper, the authors, in effect, conceal a crippling flaw in their study. Then, apparently oblivious to the study’s problems, the media cite it in decidedly negative portrayals of HCQ.12

While I have focused on studies on HCQ, and on statistical significance particularly, the real problem is broader than the concept of statistical significance and the interpretation of p-values. Simply moving away from those is not going to solve it.
For example, the TOGETHER trial on ivermectin uses a Bayesian statistical approach rather than a frequentist one, and thus avoids the concept of “statistical significance” altogether (Reis et al. 2022). Nevertheless, a critique analogous to that given above could be made. For that study seems, at least arguably, weakly supportive of a modestly positive effect,13 namely, a relative risk reduction of about 10% in hospitalization (or a proxy for it).14 Yet that study is interpreted by its authors, and by the mainstream media following them, as establishing ivermectin’s uselessness against COVID-19. For example, the study’s authors write, “We did not find a significantly or clinically meaningful lower risk of medical admission to a hospital or prolonged emergency department observation” (Reis et al. 2022, 7). And Edward Mills, one of the study’s lead researchers, reportedly stated, “There was no indication that ivermectin is clinically useful” (Ellis 2022). Another author, David Boulware, reportedly claimed, “There’s really no sign of any benefit,” and further remarked, “hopefully that will steer the majority of doctors away from ivermectin towards other therapies” (Zimmer 2022). (Note that this is the same David Boulware who led the HCQ trial described above.) The media amplified this message. According to a New York Times report on the study by Carl Zimmer: Ivermectin “showed no sign of alleviating” COVID-19. “The results were clear: Taking ivermectin did not reduce a Covid patient’s risk of ending up in the hospital.” Apparently, the study’s authors regard their conclusion as decisive—it “effectively ruled out the drug as a treatment for Covid, the study’s authors said” (Zimmer 2022). Just as in the cases discussed above involving HCQ, we here have a case in which people in the treatment arm did better on average than those in the control arm, and yet that result is interpreted as establishing the ineffectiveness of the (inexpensive) drug, though in this case without appeal to the concept of “statistical significance.”
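To see how such data can look “weakly positive” rather than decisive, here is a minimal Bayesian sketch in Python using a simple beta-binomial model. The event counts are invented for illustration, chosen only to yield a relative risk near the roughly 10% reduction mentioned above; they are not the trial’s actual figures:

    import numpy as np

    rng = np.random.default_rng(1)

    # Event counts are invented for illustration (NOT the trial's actual
    # figures), chosen to give a relative risk near 0.90:
    x_t, n_t = 100, 679   # events, treatment arm
    x_c, n_c = 111, 679   # events, control arm

    # Flat Beta(1, 1) priors on each arm's event rate; sample the posteriors.
    p_t = rng.beta(1 + x_t, 1 + n_t - x_t, size=200_000)
    p_c = rng.beta(1 + x_c, 1 + n_c - x_c, size=200_000)
    rr = p_t / p_c

    print(f"posterior median RR: {np.median(rr):.2f}")    # ~0.90
    print(f"P(RR < 1):           {(rr < 1).mean():.2f}")  # ~0.79

On these assumed numbers, the posterior puts roughly a four-in-five chance on the treatment arm having the lower event rate: hardly proof of benefit, but not “no sign of any benefit” either.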
The problem, it seems, is not a matter of statistical method, and so a change of practice at that level is unlikely to fix the problem. Deborah Mayo and David Hand go as far as to argue that “attempts to fix statistical practice by abandoning or replacing statistical significance are actually jeopardizing reliability and integrity” (2022, 26). I don’t take a side in that debate. But I suggest the focus should be on the probable cause underlying the frequent misinterpretation. As Mayo and Hand note, with respect to the replication crisis, “It is generally agreed that a large part of the blame for lack of replication in many fields may be traced to biases encouraged by the reward structure” (2022, 28). In a pandemic the stakes are raised, and in unusual circumstances ordinary biases may reverse; thus extra caution and focus would be prudent.

9. Conclusion

Findings that are not statistically significant are often wrongly interpreted as showing the hypothesis to be false. Both medical researchers and journalists make this error. This article highlights three studies in which this error was used to support the notion that HCQ is not an effective treatment for COVID-19. This is an issue that could impact many thousands of lives, perhaps tens of thousands, as Dr. Risch suggests, or even more. In these and analogous cases, the misinterpretation seems one-sided.15 This raises the question of whether bias is influencing both researchers and journalists on this and related issues. There are, after all, potential conflicting interests involved. And the prior probability of at least some degree of research bias is high, given previous research on industry-funded studies.
Science-based medicine is important. But vigilance is required to combat reasoning errors in medical research and to check the biases and other influences that may impact the directionality of such errors. (If the errors tend to cancel each other out, that is one thing; if they line up in the same direction, that is another.) The misuse of statistical significance in the evaluation of HCQ provides an example of the danger. In the cases discussed above, it may be that the misinterpretations are “innocent” in the sense that they are mistakes that could have equally been favorable or unfavorable to the researchers’ interests—after all, Amrhein and colleagues suggest that this type of mistaken conclusion is common (2019, 307). Alternatively, it may be that, for at least some cases, conflicts of interest influence researchers to reach “favorable” conclusions that seem reasonable enough to pass some level of scrutiny (perhaps partly because a bias is shared among relevant parties), rather than reaching a more rationally defensible conclusion that is in some relevant sense “unfavorable.” For any particular case, it is hard to determine which explanation is best. But when a pattern emerges, one may begin to suspect the latter explanation may play some role. The cases discussed above might be thought of as suggesting a pattern which is deserving of further scrutiny.

Acknowledgements

I thank Brian Martin and Lee Basham for feedback on earlier versions of this paper. I also thank Stuart Hurlbert and two anonymous referees for their detailed and helpful comments, suggestions, and constructive criticisms.
Disclosure statement: No potential conflict of interest was reported by the author(s).

Funding: The author(s) reported there is no funding associated with the work featured in this article.
Original at: https://www.tandfonline.com/doi/10.1080/08989621.2022.2155517