Diagnostic Accuracy Studies in Radiology: How to Recognize and Address Potential Sources of Bias

Hindawi Radiology Research and Practice, Volume 2021, Article ID 5801662, 10 pages. https://doi.org/10.1155/2021/5801662

Review Article

Athanasios Pavlou,1,2 Robert M. Kurtz,3 and Jae W. Song3

1 St. Vincent's Medical Center, Bridgeport, CT, USA
2 Frank H. Netter MD School of Medicine, North Haven, CT, USA
3 Hospital of the University of Pennsylvania, Philadelphia, PA, USA

Correspondence should be addressed to Jae W. Song; jae.song@pennmedicine.upenn.edu

Received 2 July 2021; Revised 17 August 2021; Accepted 18 August 2021; Published 7 September 2021

Academic Editor: André Luiz Ferreira Costa

Copyright © 2021 Athanasios Pavlou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Accuracy is an important parameter of a diagnostic test. Studies that attempt to determine a test's accuracy can suffer from various forms of bias. As radiology is a diagnostic specialty, many radiologists may design a diagnostic accuracy study or review one to understand how it may apply to their practice. Radiologists also frequently serve as consultants to other physicians regarding the selection of the most appropriate diagnostic exams. In these roles, understanding how to critically appraise the literature is important for all radiologists. The purpose of this review is to provide a framework for evaluating potential sources of study design bias found in diagnostic accuracy studies and to explain their impact on sensitivity and specificity estimates. To help the reader understand these biases, we also present examples from the radiology literature.

1. Introduction

The accuracy of a diagnostic test refers to how well a test can correctly identify a specific disease. Therefore, it is a crucial parameter to consider when making a decision to perform that test in a clinical setting. Inaccurate diagnostic tests can lead to over- or undertreatment, inflated healthcare costs, and potentially patient harm [1]. Diagnostic accuracy studies attempt to evaluate a test's performance by comparing it to a gold standard. These studies can suffer from biases (e.g., spectrum bias and verification bias) that are different from those affecting studies designed to test the efficacy of therapeutic interventions. Awareness of these biases and how they can impact diagnostic accuracy measures is important. Several studies have quantitatively shown that specific biases can lead to an overestimation or underestimation of accuracy measures [2, 3]. Given that diagnostic accuracy studies help experts and policymakers create guidelines and establish standard-of-care measures [4], it is imperative that readers be aware of these biases and how they can be addressed.

As practitioners of a diagnostic specialty, it is important for radiologists to understand how to appraise diagnostic accuracy studies. Radiologists are frequently consulted by other physicians on which imaging test to order for specific indications and serve to educate and inform others about current standards of care for the diagnostic work-up of many patients. In the era of evidence-based medicine, radiologists are encouraged to keep up with the literature as well as know how to appraise the quality of a diagnostic accuracy study. Moreover, it is equally important to know how applicable the results of a particular diagnostic accuracy study are to the radiologist's own clinical practice [5].

Guidelines and checklists often serve as useful tools to help one be comprehensive and achieve consistency. As such, the Cochrane Collaboration and the Agency for Healthcare Research and Quality have recommended the use of checklists such as the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [6]. This tool helps to assess the risk of bias in diagnostic studies and is organized into four key domains. These domains include evaluating aspects of study design related to (1) patient selection, (2) the index test, (3) the reference standard, and (4) the flow and timing of subjects in a study [7]. Within each domain are specific types of study design biases that should be considered.

In this paper, we use the QUADAS-2 framework to review the study design biases within each domain (see Table 1). We also present examples from the radiology literature.

2. Basic Concepts

The framework for developing a research question in evidence-based medicine follows the PICO model. In diagnostic accuracy studies, PICO stands for P (population), I (index test), C (comparator or reference standard), and O (outcomes). A diagnostic accuracy study compares the index test (the test under investigation) with an established reference test in a specific population and provides outcomes for comparison [8]. The degree to which the outcomes of the study represent true findings among similar individuals outside the study is determined by the validity. There are two main types of validity: internal and external (see Figure 1) [9].
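For readers who prefer a computational view, these outcomes are typically tabulated in a 2 × 2 table of index test results against the reference standard. The short Python sketch below is our own illustration (the function name and the counts are hypothetical, not taken from the paper) of how the accuracy measures discussed in this review derive from such a table:

```python
def accuracy_measures(tp, fp, fn, tn):
    """Basic diagnostic accuracy measures from a 2x2 table.

    tp: index test positive, reference standard positive (true positives)
    fp: index test positive, reference standard negative (false positives)
    fn: index test negative, reference standard positive (false negatives)
    tn: index test negative, reference standard negative (true negatives)
    """
    return {
        "sensitivity": tp / (tp + fn),  # diseased subjects correctly identified
        "specificity": tn / (tn + fp),  # nondiseased subjects correctly identified
        "ppv": tp / (tp + fp),          # probability of disease given a positive test
        "npv": tn / (tn + fn),          # probability of no disease given a negative test
    }

# Hypothetical study: 90 true positives, 20 false positives,
# 10 false negatives, 180 true negatives
print(accuracy_measures(tp=90, fp=20, fn=10, tn=180))
```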
2.1. Internal Validity. The extent to which the observed results are not due to methodological errors is defined as internal validity. The internal validity of a study can be threatened by bias and imprecision (see Figure 2). Bias is considered to be any systematic deviation of an estimate from the true value. If a diagnostic accuracy study suffers from bias, its sensitivity and/or specificity will be consistently under- or overestimated compared to the true value. This means that the error introduced by bias will not balance out upon repetition. Imprecision is the random error that occurs with multiple estimates of a parameter and refers to how far these estimates are from each other, not how far they are from the true value. Because the estimates deviate randomly in opposite directions, repetition will eventually balance out this error [11].

2.2. External Validity. External validity examines whether the findings of a study can be generalized to the population level. If the study's sample is representative of the target population, the results of the study can be generalized to the population from which the sample was drawn, and even beyond that to other similar populations. This is especially important as it determines whether the results of the study can be applied in daily clinical practice [12].

Applicability is also an important consideration when evaluating diagnostic accuracy studies. Careful evaluation of the PIC (Population-Index-Reference) parameters of a study will help determine the extent of applicability of a study to a reader's clinical practice. The patient demographics, the selection and use of the index test, and the test interpretation should be compared between the study and the reader's practice. To allow for this comparison, it is vital that diagnostic accuracy studies report their methods with completeness and transparency, preferably using standardized checklists such as the Standards for the Reporting of Diagnostic Accuracy Studies (STARD) [13]. Using the PIC framework will help the reader assess the external validity as well as gain insight into the applicability of the study. For assessment of internal validity, critically appraising the study design using a four-domain framework is suggested [11]. We now review specific sources of study design bias using the QUADAS-2 framework.

3. Domain 1: Patient Selection

The goal of sampling is to ensure that the sample group is representative of the population of interest. The results of the study are contingent on the studied sample. Thus, sampling methods are a critical part of a study design. Participants should ideally be recruited from a population in a process that ensures no over- or underrepresentation of certain subpopulations [14].

3.1. Sampling Definition and Methods. Sampling is the process of selecting a group of study subjects from the target population. There are two main categories of sampling methods: probability and nonprobability sampling.

In probability sampling methods, all eligible subjects in the target population have equal chances of being selected (e.g., random sampling). The challenge with this type of sampling method is that it requires a comprehensive list or registry of all eligible patients in the target population, from which the subjects are randomly chosen using, for instance, a random number generator [14]. As such registries are rarely available in practice, clinical studies more frequently use nonprobability sampling [15].

In nonprobability sampling methods, the sample is selected in a process that does not guarantee equal chances of selection for each eligible subject in the target population. An example of nonprobability sampling is convenience sampling, where patients are selected based only on accessibility and availability. The selection process for convenience sampling can lead to over- or underrepresentation of certain population attributes and therefore decreases the generalizability of the study results (sampling bias). A special form of convenience sampling, commonly used in clinical research, is consecutive sampling. In this sampling method, for a specified period of time, every subject who meets the predefined inclusion and exclusion criteria is recruited for the study. This sampling method prevents the researchers from "picking and choosing" subjects [15]. An analysis of 31 published meta-analyses showed that nonconsecutive sampling tended to overestimate the diagnostic accuracy of the test by 50% compared to consecutive sampling in diagnostic accuracy studies [16].

The effect of consecutive over nonconsecutive sampling can be seen in a study evaluating deep venous thrombosis (DVT) of the lower extremities. Kline et al. recruited subjects using a consecutive method to compare the diagnostic accuracy of emergency clinician-performed compression ultrasonography for DVT of the lower extremities against whole-leg venous ultrasonography and reported a sensitivity of 70% and a specificity of 89% [17]. By contrast, other studies on the same topic reported almost perfect diagnostic accuracy (sensitivity: 100%; specificity: 91.8–100%) using nonconsecutive samples. These higher accuracy measures could be due to excluding complex cases, excluding patients in whom ultrasound may be difficult to perform, or excluding ambiguous results [18, 19].
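As a rough illustration of this exclusion mechanism, the hypothetical Python simulation below (all probabilities are invented for illustration) assumes the test errs more often in hard-to-scan patients; a convenience sample that never recruits those patients then yields inflated sensitivity and specificity relative to a consecutive sample:

```python
import random

random.seed(0)

def simulate_patient():
    """One hypothetical patient: disease status, scan difficulty, test result."""
    diseased = random.random() < 0.3
    difficult = random.random() < 0.4        # e.g., hard-to-scan patients
    p_correct = 0.75 if difficult else 0.95  # assumed: test errs more on difficult cases
    correct = random.random() < p_correct
    test_positive = diseased if correct else not diseased
    return diseased, difficult, test_positive

population = [simulate_patient() for _ in range(100_000)]

def sens_spec(sample):
    tp = sum(1 for d, _, t in sample if d and t)
    fn = sum(1 for d, _, t in sample if d and not t)
    tn = sum(1 for d, _, t in sample if not d and not t)
    fp = sum(1 for d, _, t in sample if not d and t)
    return tp / (tp + fn), tn / (tn + fp)

consecutive = population[:5000]                           # every eligible subject in a period
convenience = [p for p in population if not p[1]][:5000]  # difficult cases never recruited

print("consecutive sample: sens=%.2f spec=%.2f" % sens_spec(consecutive))
print("convenience sample: sens=%.2f spec=%.2f" % sens_spec(convenience))
```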
Table 1: Types of bias in diagnostic accuracy studies and how to address them.

Spectrum bias: Perform random or consecutive sampling; avoid excluding subjects with ambiguous results.
Information bias: Implement blinding of the researchers to the results of the reference test when interpreting the index test; predetermine the thresholds when designing a study.
Misclassification bias: Predict the direction and degree of deviation of the diagnostic accuracy in a sensitivity analysis and adjust accordingly; create a composite reference standard.
Diagnostic review bias: Implement blinding of the researchers to the results of the index test when interpreting the reference test.
Incorporation bias: Address in the limitations section the possibility of overestimation of accuracy estimates and, if possible, adjust accordingly.
Verification bias: Use the same reference standard for all subjects; if not possible, acknowledge and measure the potential accuracy estimate error.
Attrition bias: Study the characteristics of subjects lost and how they differ from those that remain; perform a sensitivity analysis to calculate the range of diagnostic accuracy estimates as if all withdrawals tested positive or negative.

[Figure 1: Internal and external validity. Precision and lack of bias dictate the internal validity of the study. External validity refers to the process of applying the study results from the study level to the population level. Radiologists can use these results in their own clinical practice for management of individual patients.]

[Figure 2: Precision and bias. Increasing precision reduces the random error, and decreasing bias is equivalent to decreasing systematic error. The higher the precision and the lower the bias, the higher the internal validity of the study. Adapted (modified) from Brandmaier, A. M. et al., "Assessing reliability in neuroimaging research through intraclass effect decomposition (ICED)," eLife, 7, e35718 (2018) [10].]

3.2. Spectrum Bias. Spectrum bias is commonly used to describe the variation in test performance across patient subgroups. Studies that utilize a limited portion of the patient spectrum can be affected by this type of bias. For example, a study that includes only high-risk patients may provide different diagnostic accuracy estimates compared to a study that includes only low-risk patients, as test performance varies across populations [20, 21].

An obvious source of spectrum bias is a patient selection method that leads to a sample that is not representative of the target population. Local referral practices can also remove cases from the initial distribution, narrow the spectrum of patients, and lead to bias [11]. Understanding spectrum bias is important as it can prohibit the generalization of the results from the studied sample to a wider population, especially when studying heterogeneous populations. It has been suggested that "spectrum effect" is a more appropriate term, as the estimate from a narrow spectrum of patients is valid for that specific subgroup [21].

An example of how diagnostic accuracy measurements can be influenced by the patient spectrum is seen in a meta-analysis that studied the accuracy of magnetic resonance imaging (MRI) for detecting silicone breast implant rupture. The authors found that the diagnostic accuracy of MRI in studies that included patients with symptoms of implant rupture was 14 times higher compared to studies that included only asymptomatic patients and two times higher compared to studies that used both symptomatic and asymptomatic patients (screening sample) [2].
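The mechanism can be sketched with a toy simulation. Assuming, purely for illustration, that the probability of a positive index test rises with disease severity, sensitivity estimated on a severe-only sample exceeds that estimated on a representative spectrum (the probabilities and severity scale below are invented, not from any cited study):

```python
import random

random.seed(1)

def test_positive(severity):
    """Hypothetical: probability of a positive index test rises with severity."""
    p_detect = 0.5 + 0.45 * severity  # assumed severity in [0, 1]
    return random.random() < p_detect

def sensitivity(severities, n=50_000):
    # Draw diseased patients from the given severity spectrum and test them
    results = [test_positive(random.choice(severities)) for _ in range(n)]
    return sum(results) / n

mild_spectrum   = [i / 100 for i in range(0, 50)]    # mild/early disease only
severe_spectrum = [i / 100 for i in range(50, 100)]  # "the sickest of the sick"
full_spectrum   = [i / 100 for i in range(0, 100)]   # broad, representative spectrum

print("sensitivity, mild cases only:   %.2f" % sensitivity(mild_spectrum))
print("sensitivity, severe cases only: %.2f" % sensitivity(severe_spectrum))
print("sensitivity, full spectrum:     %.2f" % sensitivity(full_spectrum))
```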
3.3. Case-Control and Cross-Sectional Study Design. In diagnostic accuracy studies, based on the way subjects are recruited, the study design is usually a case-control, cross-sectional, or cohort design. In case-control designs, patients are sampled separately from controls, which introduces spectrum bias. This is because patients tend to be "the sickest of the sick," which leads to sensitivity overestimation, and controls tend to be "the healthiest of the healthy," which leads to specificity overestimation (see Figure 3). In cross-sectional and cohort designs, patients and controls are sampled together from a population based on the presence of a characteristic, regardless of the presence of disease [3, 22]. In a study by Lijmer et al., which reviewed 184 diagnostic accuracy studies for design-related bias, case-control designs tended to overestimate the diagnostic performance of the test threefold compared to studies with a cohort design [3].

An area in radiology where the difference between case-control and cohort designs has been studied is artificial intelligence (AI). As noted by Park [23], utilizing a case-control design for the clinical validation of AI algorithms forces a binary distinction of outcomes that does not accurately represent real-world situations, where disease-simulating conditions and comorbidities may be present. As a result, the diagnostic performance of an AI algorithm may be inflated, and consequently, the generalization of study results to real-world practice may be problematic. Nevertheless, case-control studies are still typically used as initial validation methods for deep learning algorithms, as they are more convenient to perform and allow for establishment of a reference standard [23, 24].

Another limitation of the case-control design is that the positive predictive value (PPV) (the probability that subjects with a positive test truly have the disease) and the negative predictive value (NPV) (the probability that subjects with a negative test truly do not have the disease) cannot be directly measured, as the ratio of cases to controls is set by the investigator and disease prevalence is not reflected in the data (see Figure 4) [22].

[Figure 3: Cross-sectional study design minimizes spectrum bias as cases and controls are not sampled separately from the target population. In a cross-sectional or cohort design, the D+ and D- distributions are closer, as the sample includes the full spectrum of severity; in a case-control design, the distributions are further apart, as mild cases are likely to be ignored. D refers to disease status, with D+ meaning disease is present and D- meaning disease is absent.]

[Figure 4: Cross-sectional and cohort designs allow for the calculation of a negative predictive value (NPV) and a positive predictive value (PPV), as they incorporate meaningful prevalence data. In a cross-sectional or cohort design, the sample is chosen from the target population without regard to the presence of disease, so the ratio of D+ to D- subjects matches the target population (e.g., 1:10 for a prevalence of 10%), and sensitivity, specificity, PPV, and NPV can all be measured. In a case-control design, the D+ and D- patients are sampled separately and their ratio is chosen by the researchers (e.g., 40%), so only sensitivity and specificity can be measured. D refers to disease status, with D+ meaning disease is present and D- meaning disease is absent.]
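When sensitivity and specificity are available, predictive values for a target population can still be recovered with Bayes' rule from an externally known prevalence. A minimal sketch, using hypothetical numbers of our own choosing, shows how strongly PPV and NPV depend on prevalence:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Recover PPV and NPV for a target population via Bayes' rule.

    In a case-control study the case:control ratio is fixed by design,
    so PPV/NPV must be computed from an externally known prevalence.
    """
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_pos
    npv = specificity * (1 - prevalence) / (1 - p_pos)
    return ppv, npv

# Hypothetical test with 90% sensitivity and 90% specificity:
for prev in (0.50, 0.10, 0.01):  # a 1:1 case-control mix vs. realistic prevalences
    ppv, npv = predictive_values(0.90, 0.90, prev)
    print(f"prevalence={prev:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```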
4. Domain 2: Index Test

4.1. Information Bias. An important source of bias when evaluating the index test is the lack of blinding of the investigators to the results of the reference standard for each subject. Knowledge of the reference standard results may influence the interpretation of the index test results. This is also known as information bias. This type of bias can lead to larger deviations when the index test is not an objective measurement and depends on a rater's subjective assessment [25].

Aside from blinding to avoid information bias, it is important for diagnostic accuracy studies to prespecify the threshold used for index test interpretation. A posteriori determination of a threshold in a data-driven way can lead to overestimation of test performance, especially in studies with a small number of subjects. This is because an optimal cutoff may be chosen based on the available results to favor overly optimistic measures of diagnostic accuracy [26].

For example, Kivrak et al. performed a study comparing computed tomography (CT) virtual cystoscopy with conventional cystoscopy for the diagnosis of bladder tumors, which they designed in a rigorous way to avoid introducing information bias. The authors report that the two experienced radiologists, who independently interpreted the virtual cystoscopy (the index test), were blinded to the findings of conventional cystoscopy (the reference standard). Additionally, the virtual cystoscopy was performed and interpreted prior to the conventional cystoscopy, thereby ensuring that the investigators were blinded to the results of the reference test [27].
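A small simulation illustrates the optimism introduced by a data-driven cutoff. Here (with invented test-value distributions) the cutoff that maximizes the Youden index on a 40-subject sample looks better on that sample than on a large independent one:

```python
import random

random.seed(2)

def draw(n, diseased):
    """Hypothetical continuous index test: modest separation between groups."""
    mu = 1.0 if diseased else 0.0
    return [random.gauss(mu, 1.5) for _ in range(n)]

def youden(cases, controls, cutoff):
    sens = sum(v > cutoff for v in cases) / len(cases)
    spec = sum(v <= cutoff for v in controls) / len(controls)
    return sens + spec - 1, sens, spec

# Post hoc: pick the cutoff that looks best on a small study sample...
small_cases, small_controls = draw(20, True), draw(20, False)
candidates = [c / 10 for c in range(-30, 40)]
best_cut = max(candidates, key=lambda c: youden(small_cases, small_controls, c)[0])

# ...then see how the same cutoff performs in a large independent sample.
big_cases, big_controls = draw(50_000, True), draw(50_000, False)
_, se_small, sp_small = youden(small_cases, small_controls, best_cut)
_, se_big, sp_big = youden(big_cases, big_controls, best_cut)
print(f"data-driven cutoff {best_cut:.1f}: sens/spec on study sample = {se_small:.2f}/{sp_small:.2f}")
print(f"same cutoff on independent data:  sens/spec = {se_big:.2f}/{sp_big:.2f}")
```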
4.2. Indeterminate Index Test Results. Patients with indeterminate or ambiguous results should not be excluded from the study, as this could limit the results to an unrepresentative spectrum of extremes and potentially introduce spectrum bias. In this case, it is preferable to transform the 2 × 2 table into a 3 × 2 table and report positive, indeterminate, and negative results separately. To ensure that diagnostic accuracy estimates are not overestimated, a conservative "intention to diagnose" approach should be followed: indeterminate cases that test positive with the reference test are classified as false negatives for the index test, and indeterminate cases that test negative with the reference test are classified as false positives for the index test (see Table 2). In the scenario where the reference test also yields indeterminate results, the table may be extended to a 3 × 3 table to ensure transparent reporting [28].

A meta-analysis by Schuetz et al. [28] pooled coronary CT angiography studies to compare how the handling of nonevaluable results affects diagnostic accuracy estimates. As CT angiography interpretation can involve nonevaluable results, especially in areas with vessel calcifications [29], study authors can consider nonevaluable vessel segments as positive or negative, exclude them from analysis, or even exclude patients with nonevaluable segments altogether. The authors of the meta-analysis found that handling the test results with an "intention to diagnose" approach using a 3 × 2 table yielded lower diagnostic accuracy measures (area under the curve, 0.93) compared to the other approaches (area under the curve, 0.96–0.99) [28].

Table 2: Various approaches for indeterminate index test results and their effect on sensitivity and specificity.

Excluded from analysis: sensitivity increased, specificity increased.
Indeterminate results considered positive: sensitivity increased, specificity decreased.
Indeterminate results considered negative: sensitivity decreased, specificity increased.
"Intention to diagnose" approach: sensitivity decreased, specificity decreased.
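The four approaches in Table 2 can be made explicit with a short sketch that recomputes sensitivity and specificity from a hypothetical 3 × 2 table (the counts are ours, chosen only to show the directions of change) under each handling of the indeterminate results:

```python
def sens_spec(tp, fp, fn, tn):
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 3 x 2 table: index test rows (positive / indeterminate / negative),
# reference standard columns (disease present / disease absent).
tp, fp = 80, 10               # index test positive
ind_dis, ind_nodis = 15, 15   # index test indeterminate
fn, tn = 5, 75                # index test negative

approaches = {
    "excluded from analysis":   sens_spec(tp, fp, fn, tn),
    "indeterminate = positive": sens_spec(tp + ind_dis, fp + ind_nodis, fn, tn),
    "indeterminate = negative": sens_spec(tp, fp, fn + ind_dis, tn + ind_nodis),
    # intention to diagnose: indeterminates are counted against the index test
    "intention to diagnose":    sens_spec(tp, fp + ind_nodis, fn + ind_dis, tn),
}
for name, (se, sp) in approaches.items():
    print(f"{name:26s} sensitivity={se:.2f} specificity={sp:.2f}")
```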
which standard, sampling error is an additional factor which could evaluated the diagnostic accuracy of shunt series radio- lead to false-negative results. +e effect of this bias on the graphs and CT to assess for cerebrospinal fluid shunt diagnostic accuracy estimates can vary depending on malfunction. +e clinical decision to proceed to shunt re- whether the reference and index tests tend to err in the same vision, which was used as the reference standard, was made direction on the same patients or the reference and index test by the neurosurgeons after reviewing the radiograph and CT errors are independent of each other. As a result, sensitivity imaging. Despite the introduction of incorporation bias, this and specificity can be over- or underestimated by this type of decision was reasonable in this study due to the lack of an bias [22]. independent gold standard. +e authors also acknowledged An example of misclassification bias can be found in a this concern in the limitations section by stating possible study by Ai et al. which determined the diagnostic accuracy overestimation of the sensitivity [35]. of chest CT for the diagnosis of Coronavirus Disease 2019 (COVID-19). +e reference standard was a Reverse Tran- 6. Domain 4: Patient Flow and Timing scription Polymerase Chain Reaction (RT-PCR) test, which can give false-negative results in the early stages of the Diagnostic accuracy studies should be designed taking into disease. +e authors calculated the sensitivity of chest CT for account time-dependent changes of the disease on the the diagnosis of COVID-19 to be 97% and the specificity studied population and follow—as much as possible—a 25% but acknowledged in the limitations section that, due to homogeneous approach for all subjects. Intervals between misclassification bias, the sensitivity may have been over- the index and reference test and disturbances in the flow of estimated and the specificity may have been underestimated the study, such as changes in the reference test or with- by solely relying on the results of a single RT-PCR test [31]. drawals, are important sources of bias [7]. Various methods have been proposed to correct for misclassification bias. One suggestion is adjusting the ac- curacy estimates based on external evidence about the degree 6.1.TimingoftheIndexandReferenceTest. +e time interval and direction of the reference standard misclassification. between the conduction of the index and the reference tests Other ways to minimize this bias are to combine multiple should ideally be as short as possible. A long period between tests to a composite reference standard or validate the index the two could lead to misclassification bias, as the disease test usefulness by correlating directly with future clinical might improve or deteriorate during the interval time. An events or other clinical characteristics [32]. interval of a few days could be reasonable for chronic Radiology Research and Practice 7 Table 3: Direction of diagnostic accuracy estimates by type of bias. RDOR from RDOR from Type of bias Sensitivity [3, 16, 22, 26] Specificity [3, 16, 22, 26] Rutjes et al. Lijmer et al. 
5.2. Diagnostic Review Bias. Another important consideration when evaluating the reference standard is whether it is interpreted without knowledge of the index test results. A positive index test may drive raters to search the reference study more carefully for evidence of disease. This is known as diagnostic review bias [25]. As pointed out by Ransohoff and Feinstein [20], an example of this bias can be found in a study by Meadway et al. [33], which evaluated the diagnostic performance of Doppler ultrasound compared to venography. No indication was provided that the venograms were examined independently of the Doppler studies, and thus it is possible that knowledge of the Doppler results affected the venogram diagnoses.

5.3. Incorporation Bias. On some occasions, the index test may be part of the reference standard. The resulting bias is called incorporation bias and leads to overestimation of sensitivity and specificity. Incorporation bias often occurs when the reference standard relies on clinical judgment, as the clinician often uses the index test to arrive at a diagnosis. This bias will result in an overestimation of diagnostic accuracy [34]. An example of this bias in the radiology literature can be found in a study by Mater et al., which evaluated the diagnostic accuracy of shunt series radiographs and CT to assess for cerebrospinal fluid shunt malfunction. The clinical decision to proceed to shunt revision, which was used as the reference standard, was made by the neurosurgeons after reviewing the radiograph and CT imaging. Despite the introduction of incorporation bias, this decision was reasonable in this study due to the lack of an independent gold standard. The authors also acknowledged this concern in the limitations section by stating the possible overestimation of the sensitivity [35].

6. Domain 4: Patient Flow and Timing

Diagnostic accuracy studies should be designed taking into account time-dependent changes of the disease in the studied population and should follow, as much as possible, a homogeneous approach for all subjects. Intervals between the index and reference tests and disturbances in the flow of the study, such as changes in the reference test or withdrawals, are important sources of bias [7].

6.1. Timing of the Index and Reference Test. The time interval between the performance of the index and reference tests should ideally be as short as possible. A long period between the two could lead to misclassification bias, as the disease might improve or deteriorate during the interval. An interval of a few days could be reasonable for chronic diseases but would be problematic for acute diseases. For reference tests that require follow-up to determine whether the disease is present, an appropriate minimum follow-up time should be set for all patients [6]. For example, a systematic review investigated the diagnostic accuracy of MRI in the diagnosis of early multiple sclerosis using clinical follow-up as the reference standard. The average follow-up period in the included studies ranged from 7 months to 14 years, and the authors found that studies with shorter follow-up tended to overestimate the sensitivity and underestimate the specificity [36].

6.2. Verification Bias. Verification bias is a form of bias introduced when not all patients receive the gold standard (partial verification) or when some patients receive a different reference test than the rest (differential verification) [3]. In partial verification bias, if the decision is made to perform the gold standard only for positive index test cases, the sensitivity will be overestimated (fewer false negatives) and the specificity will be underestimated (more false positives) [37]. The effect of differential verification depends on the quality of the different reference tests being used. Using a superior reference test for the positive test results and a different reference test for the negative results will overestimate both sensitivity and specificity [3]. Notably, using the same gold standard for all patients may not be clinically or ethically appropriate. If verification bias cannot be eliminated by choosing a proper study design, it should at least be acknowledged or statistically corrected by the authors [38].

An example can be found in the Prospective Investigation of Pulmonary Embolism Diagnosis (PIOPED) study, which evaluated the diagnostic accuracy of the ventilation-perfusion (V-Q) scan using conventional angiography as a reference standard. Of the 131 patients with near normal/normal results on the V-Q scan, only 57 received angiography (the gold standard). For the remaining 74, an alternative reference standard was used: no evidence of pulmonary embolism during one year of follow-up. The authors calculated that if those 74 patients were included in the analysis, the NPV for a near normal/normal scan would have been 96%, and if not, the NPV would have been 91%. They therefore concluded that the true NPV lies somewhere between those two numbers, but possibly closer to the first [39].

Another area in radiology where partial verification bias has been described is Single Photon Emission Computed Tomography (SPECT) for the diagnosis of coronary artery disease [40–43]. The decision to perform coronary angiography, which is the gold standard for the diagnosis of coronary artery disease, may be affected by the result of a preceding SPECT, which introduces verification bias (also called posttest referral bias). Authors have utilized mathematical formulas (e.g., Begg and Greenes [44] and Diamond [45]) to adjust for this bias, leading to significant changes in the calculated diagnostic accuracy parameters. Miller et al. [42] reported an unadjusted sensitivity of 98% and specificity of 13% for SPECT in coronary artery disease. After correction with the Begg and Greenes formula, the sensitivity dropped to 65% and the specificity increased to 67%, which indicates that verification bias can have a substantial effect on accuracy estimation.
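A minimal sketch of the Begg and Greenes correction follows, assuming (as the method requires) that verification depended only on the index test result; the counts are hypothetical and chosen only to show the direction of the correction:

```python
def begg_greenes(n_pos, n_neg, ver_pos, dis_pos, ver_neg, dis_neg):
    """Begg and Greenes correction for partial verification bias.

    Assumes verification depended only on the index test result.
    n_pos/n_neg: all subjects with a positive/negative index test
    ver_pos/ver_neg: subjects verified with the gold standard in each group
    dis_pos/dis_neg: verified subjects found diseased in each group
    """
    p_dis_pos = dis_pos / ver_pos  # P(disease | index positive), from verified subjects
    p_dis_neg = dis_neg / ver_neg  # P(disease | index negative)
    sens = (p_dis_pos * n_pos) / (p_dis_pos * n_pos + p_dis_neg * n_neg)
    spec = ((1 - p_dis_neg) * n_neg) / (
        (1 - p_dis_pos) * n_pos + (1 - p_dis_neg) * n_neg)
    return sens, spec

# Hypothetical cohort: 400 test-positive subjects (all verified), 600 test-negative
# subjects of whom only 60 were referred to the gold standard.
naive_sens = 200 / (200 + 12)              # from verified subjects only
naive_spec = (60 - 12) / ((60 - 12) + 200)
sens, spec = begg_greenes(n_pos=400, n_neg=600,
                          ver_pos=400, dis_pos=200,
                          ver_neg=60, dis_neg=12)
print(f"naive:     sensitivity={naive_sens:.2f}, specificity={naive_spec:.2f}")
print(f"corrected: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

As in the Miller et al. example, the naive sensitivity is inflated and the naive specificity deflated relative to the corrected estimates.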
6.3. Attrition Bias. An important consideration is whether all patients were included in the analysis. Withdrawals lead to over- or underestimation of accuracy estimates (attrition bias) if the patients lost to follow-up differ in some way from those who remain. It is important for studies to report withdrawals and evaluate their effect on accuracy estimates [46]. An example of the effect of withdrawals on diagnostic accuracy estimates can be found in a study by Kearl et al., which investigated the accuracy of MRI and ultrasound in the diagnosis of appendicitis. Of the 589 patients included, the reference standards, which were pathology reports, surgical diagnosis, or a clinical decision for medical treatment of appendicitis, were not available for 63 patients (10.7%) due to loss to follow-up. The authors acknowledged this limitation and analyzed its effect on diagnostic accuracy. A sensitivity analysis was performed, and the diagnostic accuracy was calculated separately as if all withdrawals had tested positive for appendicitis with the reference standard and as if all withdrawals had tested negative [47].
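This bounding analysis is straightforward to reproduce. The sketch below (with hypothetical counts of our own choosing) recomputes sensitivity and specificity under the two extreme assumptions about the withdrawals:

```python
def sens_spec(tp, fp, fn, tn):
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical study: 2x2 counts for subjects with a reference standard,
# plus withdrawals (lost to follow-up) split by their index test result.
tp, fp, fn, tn = 85, 15, 10, 190
lost_index_pos, lost_index_neg = 12, 25

observed = sens_spec(tp, fp, fn, tn)
# Bound 1: assume every withdrawal was truly disease-positive
all_pos = sens_spec(tp + lost_index_pos, fp, fn + lost_index_neg, tn)
# Bound 2: assume every withdrawal was truly disease-negative
all_neg = sens_spec(tp, fp + lost_index_pos, fn, tn + lost_index_neg)

for label, (se, sp) in [("observed", observed),
                        ("withdrawals all positive", all_pos),
                        ("withdrawals all negative", all_neg)]:
    print(f"{label:25s} sensitivity={se:.2f} specificity={sp:.2f}")
```

The two bounds bracket the range within which the true accuracy estimates must lie, which is the quantity worth reporting.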
7. Direction of Accuracy Measures due to Bias

Knowing the direction in which a given bias shifts diagnostic accuracy measures is a first step in countering its effect when interpreting study results. The general direction in which the diagnostic accuracy estimates may deviate can be predicted and depends on the specific type of bias. Rutjes et al. [16] and Lijmer et al. [3] quantified the effect of several study design biases on diagnostic accuracy measures (see Table 3). They used the Relative Diagnostic Odds Ratio (RDOR) as a parameter to compare studies with a specific methodological shortcoming to those without. An RDOR greater than one indicates that the diagnostic accuracy parameters are overestimated in the study, while an RDOR less than one indicates that they are underestimated. The limitation of the RDOR is that important biases with opposing effects on sensitivity and specificity may not cause significant directional changes in the RDOR, which will remain close to one. This may explain why both of these studies failed to detect statistically significant changes in the RDOR for some forms of bias [3] (see Table 3).

Table 3: Direction of diagnostic accuracy estimates by type of bias. RDOR: Relative Diagnostic Odds Ratio; CI: confidence interval. Effects on sensitivity and specificity are summarized from [3, 16, 22, 26].

Sampling bias (consecutive over nonconsecutive sampling):
Sensitivity and specificity: increase if complex cases are excluded; decrease if clear-cut cases are excluded.
RDOR from Rutjes et al. [16]: 1.5, 95% CI (1.0–2.1). RDOR from Lijmer et al. [3]: 0.9, 95% CI (0.7–1.1).

Spectrum bias:
Sensitivity: increases when severe cases are overrepresented in the patient sample ("the sickest of the sick").
Specificity: increases when healthy controls are overrepresented in the patient sample ("the healthiest of the healthy").
RDOR from Rutjes et al. [16]: 4.9, 95% CI (0.6–37.3). RDOR from Lijmer et al. [3]: 3.0, 95% CI (2.0–4.5).

Information bias, lack of blinding:
Sensitivity: variable. Specificity: variable.
RDOR from Rutjes et al. [16]: 1.1, 95% CI (0.8–1.6). RDOR from Lijmer et al. [3]: 1.3, 95% CI (1.0–1.9).

Information bias, post hoc definition of cutoff:
Sensitivity: increases. Specificity: increases.
RDOR from Rutjes et al. [16]: 1.3, 95% CI (0.8–1.9). RDOR from Lijmer et al. [3]: not studied.

Misclassification bias (imperfect gold standard):
Sensitivity and specificity: increase if errors in the index and reference tests are correlated; decrease if the errors are independent.
RDOR: not studied in either analysis.

Incorporation bias:
Sensitivity: increases. Specificity: increases.
RDOR from Rutjes et al. [16]: 1.4, 95% CI (0.7–2.8). RDOR from Lijmer et al. [3]: not studied.

Verification bias, differential (i.e., different reference standards):
Sensitivity and specificity: increase if the gold standard is used for positive index results and a different reference test (e.g., noninvasive and less expensive) is used for negative index results.
RDOR from Rutjes et al. [16]: 1.6, 95% CI (0.9–2.9). RDOR from Lijmer et al. [3]: 2.2, 95% CI (1.5–3.3).

Verification bias, partial:
Sensitivity: increases. Specificity: decreases.
RDOR from Rutjes et al. [16]: 1.1, 95% CI (0.7–1.7). RDOR from Lijmer et al. [3]: 1.0, 95% CI (0.8–1.3).
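The diagnostic odds ratio (DOR) underlying the RDOR, and its blind spot, can be seen in a few lines of Python (the sensitivity/specificity pairs below are hypothetical): opposite shifts in sensitivity and specificity can leave the DOR, and therefore the RDOR, essentially unchanged.

```python
def diagnostic_odds_ratio(sens, spec):
    """DOR = (sens / (1 - sens)) / ((1 - spec) / spec)."""
    return (sens / (1 - sens)) / ((1 - spec) / spec)

# Hypothetical comparison: studies with a design flaw vs. studies without it
dor_flawed = diagnostic_odds_ratio(0.95, 0.85)
dor_sound = diagnostic_odds_ratio(0.88, 0.80)
print(f"RDOR = {dor_flawed / dor_sound:.2f}")  # > 1: the flaw inflates accuracy

# The blind spot: sensitivity and specificity trade places, DOR is identical
print(diagnostic_odds_ratio(0.90, 0.70))  # ~21.0
print(diagnostic_odds_ratio(0.70, 0.90))  # ~21.0
```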
8. Conclusion

Diagnostic accuracy studies can suffer from many forms of bias. QUADAS-2 provides a useful framework for thinking about study design biases. Patient selection, the index test, the reference test, and patient flow/timing are the four main domains to be evaluated in each study, as they cover the primary sources of systematic error in diagnostic accuracy studies. Potential sources of bias should be acknowledged by authors, and their effect on test performance should be estimated and reported. Readers are encouraged to become familiar with the biases that can be found in diagnostic accuracy studies and to critically assess such studies before applying their conclusions to their own clinical practice.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] A. S. Saber Tehrani, H. Lee, S. C. Mathews et al., "25-year summary of US malpractice claims for diagnostic errors 1986–2010: an analysis from the National Practitioner Data Bank," BMJ Quality and Safety, vol. 22, pp. 672–680, 2013.
[2] J. W. Song, H. M. Kim, L. T. Bellfi, and K. C. Chung, "The effect of study design biases on the diagnostic accuracy of magnetic resonance imaging for detecting silicone breast implant ruptures: a meta-analysis," Plastic and Reconstructive Surgery, vol. 127, pp. 1029–1044, 2011.
[3] J. G. Lijmer, B. W. Mol, S. Heisterkamp et al., "Empirical evidence of design-related bias in studies of diagnostic tests," Journal of the American Medical Association, vol. 282, pp. 1061–1066, 1999.
[4] R. K. Owen, N. J. Cooper, T. J. Quinn, R. Lees, and A. J. Sutton, "Network meta-analysis of diagnostic test accuracy studies identifies and ranks the optimal diagnostic tests and thresholds for health care policy and decision-making," Journal of Clinical Epidemiology, vol. 99, pp. 64–74, 2018.
[5] Evidence-Based Radiology Working Group, "Evidence-based radiology: a new approach to the practice of radiology," Radiology, vol. 220, pp. 566–575, 2001.
[6] J. B. Reitsma, A. W. S. Rutjes, P. Whiting, V. V. Vlassov, M. M. G. Leeflang, and J. J. Deeks, "Chapter 9: assessing methodological quality," in Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0.0, J. J. Deeks, P. M. Bossuyt, and C. Gatsonis, Eds., The Cochrane Collaboration, London, UK, 2009.
[7] P. F. Whiting, A. W. S. Rutjes, M. E. Westwood et al., "QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies," Annals of Internal Medicine, vol. 155, pp. 529–536, 2011.
[8] S. Aslam and P. Emmanuel, "Formulating a researchable question: a critical step for facilitating good clinical research," Indian Journal of Sexually Transmitted Diseases and AIDS, vol. 31, pp. 47–50, 2010.
[9] C. M. Patino and J. C. Ferreira, "Internal and external validity: can you apply research study results to your patients?" Jornal Brasileiro de Pneumologia, vol. 44, p. 183, 2018.
[10] A. M. Brandmaier, E. Wenger, N. C. Bodammer, S. Kühn, N. Raz, and U. Lindenberger, "Assessing reliability in neuroimaging research through intra-class effect decomposition (ICED)," eLife, vol. 7, Article ID e35718, 2018.
[11] R. L. Schmidt and R. E. Factor, "Understanding sources of bias in diagnostic accuracy studies," Archives of Pathology & Laboratory Medicine, vol. 137, pp. 558–565, 2013.
[12] C. Andrade, "Internal, external, and ecological validity in research design, conduct, and evaluation," Indian Journal of Psychological Medicine, vol. 40, pp. 498–499, 2018.
[13] J. F. Cohen, D. A. Korevaar, D. G. Altman et al., "STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration," BMJ Open, vol. 6, Article ID e012799, 2016.
[14] M. Elfil and A. Negida, "Sampling methods in clinical research; an educational review," Emergency, vol. 5, no. 1, p. e52, 2017.
[15] K. Mathieson, "Making sense of biostatistics: probability versus nonprobability sampling," Journal of Clinical Research Best Practices, vol. 10, no. 8, pp. 1–2, 2014.
[16] A. W. S. Rutjes, J. B. Reitsma, M. Di Nisio, N. Smidt, J. C. van Rijn, and P. M. M. Bossuyt, "Evidence of bias and variation in diagnostic accuracy studies," Canadian Medical Association Journal, vol. 174, pp. 469–476, 2006.
[17] J. A. Kline, P. M. O'Malley, V. S. Tayal, G. R. Snead, and A. M. Mitchell, "Emergency clinician-performed compression ultrasonography for deep venous thrombosis of the lower extremity," Annals of Emergency Medicine, vol. 52, pp. 437–445, 2008.
[18] S. Farahmand, M. Farnia, S. Shahriaran, and P. Khashayar, "The accuracy of limited B-mode compression technique in diagnosing deep venous thrombosis in lower extremities," The American Journal of Emergency Medicine, vol. 29, pp. 687–690, 2011.
[19] T. Jang, M. Docherty, C. Aubin, and G. Polites, "Resident-performed compression ultrasonography for the detection of proximal deep vein thrombosis: fast and accurate," Academic Emergency Medicine, vol. 11, pp. 319–322, 2004.
[20] D. F. Ransohoff and A. R. Feinstein, "Problems of spectrum and bias in evaluating the efficacy of diagnostic tests," New England Journal of Medicine, vol. 299, pp. 926–930, 1978.
[21] S. A. Mulherin and W. C. Miller, "Spectrum bias or spectrum effect? Subgroup variation in diagnostic test evaluation," Annals of Internal Medicine, vol. 137, pp. 598–602, 2002.
[22] M. A. Kohn, C. R. Carpenter, and T. B. Newman, "Understanding the direction of bias in studies of diagnostic test accuracy," Academic Emergency Medicine, vol. 20, pp. 1194–1206, 2013.
[23] S. H. Park, "Diagnostic case-control versus diagnostic cohort studies for clinical validation of artificial intelligence algorithm performance," Radiology, vol. 290, no. 1, pp. 272–273, 2019.
[24] J. G. Nam, S. Park, E. J. Hwang et al., "Development and validation of deep learning–based automatic detection algorithm for malignant pulmonary nodules on chest radiographs," Radiology, vol. 290, no. 1, pp. 218–228, 2019.
[25] P. Whiting, A. W. S. Rutjes, J. B. Reitsma, A. S. Glas, P. M. M. Bossuyt, and J. Kleijnen, "Sources of variation and bias in studies of diagnostic accuracy," Annals of Internal Medicine, vol. 140, pp. 189–202, 2004.
[26] M. M. G. Leeflang, K. G. M. Moons, J. B. Reitsma, and A. H. Zwinderman, "Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: mechanisms, magnitude, and solutions," Clinical Chemistry, vol. 54, pp. 729–737, 2008.
[27] A. S. Kivrak, D. Kiresi, D. Emlik, K. Odev, and M. Kilinc, "Comparison of CT virtual cystoscopy of the contrast material-filled bladder with conventional cystoscopy in the diagnosis of bladder tumours," Clinical Radiology, vol. 64, pp. 30–37, 2009.
[28] G. M. Schuetz, P. Schlattmann, and M. Dewey, "Use of 3 × 2 tables with an intention to diagnose approach to assess clinical performance of diagnostic tests: meta-analytical evaluation of coronary CT angiography studies," BMJ, vol. 345, Article ID e6717, 2012.
[29] C. Strong, A. Ferreira, and R. C. Teles, "Diagnostic accuracy of computed tomography angiography for the exclusion of coronary artery disease in candidates for transcatheter aortic valve implantation," Scientific Reports, vol. 9, Article ID 19942, 2019.
[30] C. Biesheuvel, L. Irwig, and P. Bossuyt, "Observed differences in diagnostic test accuracy between patient subgroups: is it real or due to reference standard misclassification?" Clinical Chemistry, vol. 53, pp. 1725–1729, 2007.
[31] T. Ai, Z. Yang, H. Hou, C. Zhan, C. Chen, and W. Lv, "Correlation of chest CT and RT-PCR testing for Coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases," Radiology, vol. 296, pp. E32–E40, 2020.
[32] J. B. Reitsma, A. W. S. Rutjes, K. S. Khan, A. Coomarasamy, and P. M. Bossuyt, "A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard," Journal of Clinical Epidemiology, vol. 62, pp. 797–806, 2009.
[33] J. Meadway, A. N. Nicolaides, C. J. Walker, and J. D. O'Connell, "Value of Doppler ultrasound in diagnosis of clinically suspected deep vein thrombosis," British Medical Journal, vol. 4, pp. 552–554, 1975.
[34] A. Worster and C. Carpenter, "Incorporation bias in studies of diagnostic tests: how to avoid being biased about bias," CJEM, vol. 10, pp. 174–175, 2008.
[35] A. Mater, M. Shroff, S. Al-Farsi, J. Drake, and R. D. Goldman, "Test characteristics of neuroimaging in the emergency department evaluation of children for cerebrospinal fluid shunt malfunction," Canadian Journal of Emergency Medicine, vol. 10, pp. 131–135, 2008.
[36] P. Whiting, R. Harbord, C. Main, J. J. Deeks, G. Filippini, and M. Egger, "Accuracy of magnetic resonance imaging for the diagnosis of multiple sclerosis: systematic review," BMJ, vol. 332, pp. 875–884, 2006.
[37] A. S. Kosinski and H. X. Barnhart, "Accounting for nonignorable verification bias in assessment of diagnostic tests," Biometrics, vol. 59, pp. 163–171, 2003.
[38] J. M. Petscavage, M. L. Richardson, and R. B. Carr, "Verification bias: an underrecognized source of error in assessing the efficacy of medical imaging," Academic Radiology, vol. 18, pp. 343–346, 2011.
[39] PIOPED Investigators, "Value of the ventilation/perfusion scan in acute pulmonary embolism. Results of the prospective investigation of pulmonary embolism diagnosis (PIOPED)," Journal of the American Medical Association, vol. 263, pp. 2753–2759, 1990.
[40] M. P. Cecil, A. S. Kosinski, M. T. Jones et al., "The importance of work-up (verification) bias correction in assessing the accuracy of SPECT thallium-201 testing for the diagnosis of coronary artery disease," Journal of Clinical Epidemiology, vol. 49, no. 7, pp. 735–742, 1996.
[41] C. Santana-Boado, J. Candell-Riera, J. Castell-Conesa et al., "Diagnostic accuracy of technetium-99m-MIBI myocardial SPECT in women and men," Journal of Nuclear Medicine, vol. 39, no. 5, pp. 751–755, 1998.
[42] T. D. Miller, D. O. Hodge, T. F. Christian, J. J. Milavetz, K. R. Bailey, and R. J. Gibbons, "Effects of adjustment for referral bias on the sensitivity and specificity of single photon emission computed tomography for the diagnosis of coronary artery disease," The American Journal of Medicine, vol. 112, no. 4, pp. 290–297, 2002.
[43] J. A. Ladapo, S. Blecker, M. R. Elashoff et al., "Clinical implications of referral bias in the diagnostic performance of exercise testing for coronary artery disease," Journal of the American Heart Association, vol. 2, no. 6, Article ID e000505, 2013.
[44] C. B. Begg and R. A. Greenes, "Assessment of diagnostic tests when disease verification is subject to selection bias," Biometrics, vol. 39, no. 1, pp. 207–215, 1983.
[45] G. A. Diamond, "Reverend Bayes' silent majority. An alternative factor affecting sensitivity and specificity of exercise electrocardiography," The American Journal of Cardiology, vol. 57, no. 13, pp. 1175–1180, 1986.
[46] O. Aliu and K. C. Chung, "Assessing strength of evidence in diagnostic tests," Plastic and Reconstructive Surgery, vol. 129, pp. 989e–998e, 2012.
[47] Y. L. Kearl, I. Claudius, S. Behar et al., "Accuracy of magnetic resonance imaging and ultrasound for appendicitis in diagnostic and nondiagnostic studies," Academic Emergency Medicine, vol. 23, pp. 179–185, 2016.

Diagnostic Accuracy Studies in Radiology: How to Recognize and Address Potential Sources of Bias

Loading next page...
 
/lp/hindawi-publishing-corporation/diagnostic-accuracy-studies-in-radiology-how-to-recognize-and-address-Fuj8HBepCr
Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2021 Athanasios Pavlou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
2090-1941
eISSN
2090-195X
DOI
10.1155/2021/5801662
Publisher site
See Article on Publisher Site

Abstract

Hindawi Radiology Research and Practice Volume 2021, Article ID 5801662, 10 pages https://doi.org/10.1155/2021/5801662 Review Article Diagnostic Accuracy Studies in Radiology: How to Recognize and Address Potential Sources of Bias 1,2 3 3 Athanasios Pavlou , Robert M. Kurtz , and Jae W. Song St. Vincent’s Medical Center, Bridgeport, CT, USA Frank H. Netter MD School of Medicine, North Haven, CT, USA Hospital of the University of Pennsylvania, Philadelphia, PA, USA Correspondence should be addressed to Jae W. Song; jae.song@pennmedicine.upenn.edu Received 2 July 2021; Revised 17 August 2021; Accepted 18 August 2021; Published 7 September 2021 Academic Editor: Andre´ Luiz Ferreira Costa Copyright © 2021 Athanasios Pavlou et al. +is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Accuracy is an important parameter of a diagnostic test. Studies that attempt to determine a test’s accuracy can suffer from various forms of bias. As radiology is a diagnostic specialty, many radiologists may design a diagnostic accuracy study or review one to understand how it may apply to their practice. Radiologists also frequently serve as consultants to other physicians regarding the selection of the most appropriate diagnostic exams. In these roles, understanding how to critically appraise the literature is important for all radiologists. +e purpose of this review is to provide a framework for evaluating potential sources of study design biases that are found in diagnostic accuracy studies and to explain their impact on sensitivity and specificity estimates. To help the reader understand these biases, we also present examples from the radiology literature. As practitioners of a diagnostic specialty, it is important 1.Introduction for radiologists to understand how to appraise diagnostic +e accuracy of a diagnostic test refers to how well a test can accuracy studies. Radiologists are frequently consulted by correctly identify a specific disease. +erefore, it is a crucial other physicians on which imaging test to order for specific parameter to consider when making a decision to perform indications and serve to educate and inform others about that test in a clinical setting. Inaccurate diagnostic tests can current standards of care for the diagnostic work-up of many lead to over- or undertreatment, inflated healthcare costs, patients. In the era of evidence-based medicine, radiologists and potentially patient harm [1]. Diagnostic accuracy studies are encouraged to keep up with the literature as well as know attempt to evaluate a test’s performance by comparing it to a how to appraise the quality of a diagnostic accuracy study. gold standard. +ese studies can suffer from biases (e.g., Moreover, it is equally important to know how applicable spectrum bias and verification bias) that are different from the results of a particular diagnostic accuracy study are to the radiologist’s own clinical practice [5]. those affecting studies designed to test the efficacy of therapeutic interventions. Awareness of these biases and Guidelines and checklists often serve as useful tools to how they can impact diagnostic accuracy measures is im- help one be comprehensive and achieve consistency. As portant. 
Several studies have quantitatively shown that such, the Cochrane Collaboration and Agency for Health- specific biases can lead to an overestimation or underesti- care Research and Quality has recommended the use of mation of accuracy measures [2, 3]. Given that diagnostic checklists such as the Quality Assessment of Diagnostic accuracy studies help experts and policymakers to create Accuracy Studies 2 (QUADAS-2) tool [6]. +is tool helps to guidelines and establish standard-of-care measures [4], it is assess the risk of bias in diagnostic studies and is organized imperative that readers be aware of these biases and how into 4 key domains. +ese domains include evaluating as- they can be addressed. pects of study design related to (1) patient selection, (2) the 2 Radiology Research and Practice as the Standards for the Reporting of Diagnostic Accuracy index test, (3) reference standard, and the (4) flow and timing of subjects in a study [7]. Within each domain are studies (STARD) [13]. Using the PIC framework will help the reader assess the specific types of study design biases that should be considered. external validity as well as gain insight into the applicability In this paper, we use the QUADAS-2 framework to of the study. For assessment of internal validity, critically review the study design biases within each domain (see appraising the study design using a four-domain framework Table 1). We will also present examples from the radiology is suggested [11]. We now review specific sources of study literature. design biases using the QUADAS-2 framework. 2.Basic Concepts 3. Domain 1: Patient Selection +e framework for developing a research question in ev- +e goal of sampling is to ensure that the sample group is idence-based medicine follows the PICO model. In diag- representative of the population of interest. +e results of the nostic accuracy studies, PICO stands for P (population), I study are contingent on the studied sample. +us, sampling (index test), C (comparator or reference standard), and O methods are a critical part of a study design. Participants (outcomes). A diagnostic accuracy study compares the should ideally be recruited from a population in a process index test (the test under investigation) with an established that ensures no over- or underrepresentation of certain reference test on a specific population and provides out- subpopulations [14]. comes for comparison [8]. +e degree to which the out- comes of the study represent true findings among similar individuals outside the study is determined by the validity. 3.1. Sampling Definition and Methods. Sampling is the +ere are two main types of validity: internal and external process of selecting a group of study subjects from the target (see Figure 1) [9]. population. +ere are two main categories of sampling methods: probability and nonprobability sampling. In probability sampling methods, all eligible subjects in 2.1. Internal Validity. +e extent to which the observed the target population have equal chances to be selected (e.g., results are not due to methodological errors is defined as random sampling). +e challenge with this type of sampling internal validity. +e internal validity of a study can be method is that it requires the presence of a comprehensive threatened by bias and imprecision (see Figure 2). Bias is list or registry of all eligible patients in the target population, considered to be any systematic deviation of an estimate from which the subjects are randomly chosen using, for from the true value. 
If a diagnostic accuracy study suffers instance, a random number generator [14]. As such regis- from bias, its sensitivity and/or specificity will be consis- tries are rarely available in practice, clinical studies more tently under- or overestimated compared to the true value. frequently use nonprobability sampling [15]. +is means that the error introduced by bias will not balance In nonprobability sampling methods, the sample is se- out upon repetition. Imprecision is the random error that lected in a process that does not guarantee equal chances to occurs with multiple estimates of a parameter and refers to be selected for each eligible subject in the target population. how far these estimates are from each other, not how far they An example of nonprobability sampling is convenience are from the true value. Because of the random deviation of sampling, where patients are selected only based on ac- the estimates towards opposite directions, repetition will cessibility and availability. +e selection process for con- eventually balance out this error [11]. venience sampling can lead to over- or underrepresentation of certain population attributes and therefore decreases the 2.2. External Validity. External validity examines whether generalizability of the study results (sampling bias). A special the findings of a study can be generalized to the population form of convenience sampling, commonly used in clinical level. If the study’s sample is representative of the target research, is consecutive sampling. In this sampling method, population, the results of the study can be generalized to the for a specified period of time, every subject who meets the population from which the sample was drawn and even predefined inclusion and exclusion criteria is recruited for beyond that to other similar populations. +is is especially the study. +is sampling method prevents the researchers important as it determines whether the results of the study from “picking and choosing” subjects [15]. Analysis of 31 can be applied in daily clinical practice [12]. published meta-analyses showed that nonconsecutive Applicability is also an important consideration when sampling tended to overestimate the diagnostic accuracy of evaluating diagnostic accuray studies. Careful evaluation of the test by 50% compared to consecutive sampling in di- the PIC (Population-Index-Reference) parameters of a study agnostic accuracy studies [16]. will help determine the extent of applicability of a study to a +e effect of consecutive over nonconsecutive sampling reader’s clinical practice. +e patient demographics, selec- can be seen in a study evaluating deep venous thrombosis tion and use of the index test, and test interpretation should (DVT) of the lower extremities. Kline et al. recruited subjects be compared between the study and the reader’s practice. To using a consecutive method to compare the diagnostic ac- allow for this comparison, it is vital that diagnostic accuracy curacy of emergency clinician-performed compression ul- studies report their methods with completeness and trasonography for DVT of the lower extremities against transparency, preferably using standardized checklists such whole-leg venous ultrasonography and reported a sensitivity Radiology Research and Practice 3 Table 1: Types of bias in diagnostic accuracy studies and how to address them. 
Analysis of 31 published meta-analyses showed that nonconsecutive sampling tended to overestimate the diagnostic accuracy of the test by 50% compared to consecutive sampling in diagnostic accuracy studies [16].

The effect of consecutive over nonconsecutive sampling can be seen in a study evaluating deep venous thrombosis (DVT) of the lower extremities. Kline et al. recruited subjects using a consecutive method to compare the diagnostic accuracy of emergency clinician-performed compression ultrasonography for DVT of the lower extremities against whole-leg venous ultrasonography and reported a sensitivity of 70% and a specificity of 89% [17]. By contrast, other studies on the same topic reported almost perfect diagnostic accuracy (sensitivity: 100%; specificity: 91.8–100%) using a nonconsecutive sample. These higher accuracy measures could be due to excluding complex cases, excluding patients in whom ultrasound may be difficult to perform, or excluding ambiguous results [18, 19].

3.2. Spectrum Bias. Spectrum bias is commonly used to describe the variation in test performance across patient subgroups. Studies that utilize a limited portion of the patient spectrum can be affected by this type of bias. For example, a study that includes only high-risk patients may provide different diagnostic accuracy estimates compared to a study that includes only low-risk patients, as test performance varies across populations [20, 21].

An obvious source of spectrum bias is a patient selection method that leads to a sample that is not representative of the target population. Local referral practices can also remove cases from the initial distribution, narrow the spectrum of patients, and lead to bias [11]. Understanding spectrum bias is important as it can prohibit the generalization of the results from the studied sample to a wider population, especially when studying heterogeneous populations. It has been suggested that "spectrum effect" is a more appropriate term, as the estimate from a narrow spectrum of patients is valid for that specific subgroup [21].

An example of how diagnostic accuracy measurements can be influenced by the patient spectrum is seen in a meta-analysis that studied the accuracy of magnetic resonance imaging (MRI) in detecting silicone breast implant rupture. The authors found that the diagnostic accuracy of MRI in studies that included patients with symptoms of implant rupture was 14 times higher compared to studies that included only asymptomatic patients and two times higher compared to studies that used both symptomatic and asymptomatic patients (screening sample) [2].

3.3. Case-Control and Cross-Sectional Study Design. In diagnostic accuracy studies, based on the way subjects are recruited, the study design is usually a case-control, cross-sectional, or cohort design. In case-control designs, patients are sampled separately from controls, which introduces spectrum bias. This is because patients tend to be "the sickest of the sick," which leads to sensitivity overestimation, and controls tend to be the "healthiest of the healthy," which leads to specificity overestimation (see Figure 3). In cross-sectional and cohort designs, patients and controls are sampled together from a population based on the presence of a characteristic, regardless of the presence of disease [3, 22]. In a study by Lijmer et al., which reviewed 184 diagnostic accuracy studies for design-related bias, case-control designs tended to overestimate the diagnostic performance of the test threefold compared to studies with a cohort design [3].

Figure 3: Cross-sectional study design minimizes spectrum bias, as cases and controls are not sampled separately from the target population. In a cross-sectional or cohort design, the D+ and D- distributions of disease severity lie closer together because the sample includes the full spectrum of severity; in a case-control design, they lie further apart because mild cases are likely to be missed. D refers to disease status, with D+ meaning disease is present and D- meaning disease is absent.

An area in radiology where the difference between case-control and cohort designs has been studied is artificial intelligence (AI). As noted by Park [23], utilizing a case-control design for the clinical validation of AI algorithms forces a binary distinction of outcomes that does not accurately represent real-world situations, where disease-simulating conditions and comorbidities may be present. As a result, the diagnostic performance of an AI algorithm may be inflated, and consequently, the generalization of study results to real-world practice may be problematic. Nevertheless, case-control studies are still typically used as initial validation methods for deep learning algorithms, as they are more convenient to perform and allow for establishment of a reference standard [23, 24].

Another limitation of the case-control design is that the positive predictive value (PPV) (the probability that subjects with a positive test truly have the disease) and the negative predictive value (NPV) (the probability that subjects with a negative test truly do not have the disease) cannot be directly measured, as the ratio of cases to controls is set by the investigator and disease prevalence is not reflected in the data (see Figure 4) [22].

Figure 4: Cross-sectional and cohort designs allow for the calculation of an NPV and a PPV, as they incorporate meaningful prevalence data: the sample is chosen from the target population without regard to the presence of disease, so the D+ to D- ratio matches the target population, and sensitivity, specificity, PPV, and NPV can all be measured. In a case-control design, D+ and D- patients are sampled separately and their ratio is chosen by the researchers (e.g., 40% in a population with 10% prevalence), so only sensitivity and specificity can be measured. D refers to disease status, with D+ meaning disease is present and D- meaning disease is absent.
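The computation below is a minimal sketch (with hypothetical counts, not taken from any cited study) of why a case-control table yields sensitivity and specificity but not predictive values: PPV and NPV must instead be derived via Bayes' theorem from an externally supplied prevalence.

    def accuracy_metrics(tp, fp, fn, tn):
        """Sensitivity and specificity are proportions within the diseased and
        nondiseased groups, so they do not depend on disease prevalence."""
        return tp / (tp + fn), tn / (tn + fp)

    def ppv_npv(sensitivity, specificity, prevalence):
        """Predictive values via Bayes' theorem, using an external prevalence.
        In a case-control study the case:control ratio is fixed by the
        investigators, so predictive values cannot be read off the table."""
        p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
        ppv = sensitivity * prevalence / p_pos
        npv = specificity * (1 - prevalence) / (1 - p_pos)
        return ppv, npv

    # Hypothetical case-control study: 40 cases and 60 controls by design.
    se, sp = accuracy_metrics(tp=36, fp=6, fn=4, tn=54)
    print(f"sensitivity={se:.2f}, specificity={sp:.2f}")   # 0.90, 0.90

    # Reading PPV off this table (36 / 42 = 0.86) would be misleading; at an
    # assumed target-population prevalence of 10%, the PPV is much lower.
    ppv, npv = ppv_npv(se, sp, prevalence=0.10)
    print(f"PPV={ppv:.2f}, NPV={npv:.2f}")                 # 0.50, 0.99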
4. Domain 2: Index Test

4.1. Information Bias. An important source of bias when evaluating the index test is the lack of blinding of the investigators to the results of the reference standard for each subject. Knowledge of the reference standard results may influence the interpretation of the index test results. This is also known as information bias. This type of bias can lead to larger deviations when the index test is not an objective measurement and depends on a rater's subjective assessment [25].

Aside from blinding to avoid information bias, it is important for diagnostic accuracy studies to prespecify the threshold used for the index test interpretation. A posteriori determination of a threshold in a data-driven way can lead to overestimation of test performance, especially in studies with a small number of subjects. This is because an optimal cutoff may be chosen based on the available results to favor overly optimistic measures of diagnostic accuracy [26].

For example, Kivrak et al. performed a study comparing computed tomography (CT) virtual cystoscopy with conventional cystoscopy for the diagnosis of bladder tumors, which they designed in a rigorous way to avoid introducing information bias. The authors report that the two experienced radiologists, who independently interpreted the virtual cystoscopy (the index test), were blinded to the findings of conventional cystoscopy (the reference standard). Additionally, the virtual cystoscopy was performed and interpreted prior to the conventional cystoscopy, thereby ensuring that the investigators were blinded to the results of the reference test [27].
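The inflation caused by a data-driven cutoff can be demonstrated with a small, hypothetical simulation (the score distributions are invented and are not from the Kivrak study): the cutoff that maximizes apparent performance on a small sample looks considerably worse on independent data.

    import random

    random.seed(0)

    def simulate(n):
        """Hypothetical continuous index-test scores: diseased patients score
        higher on average (mean 1.0) than healthy patients (mean 0.0)."""
        diseased = [random.gauss(1.0, 1.0) for _ in range(n)]
        healthy = [random.gauss(0.0, 1.0) for _ in range(n)]
        return diseased, healthy

    def youden(diseased, healthy, cutoff):
        """Youden's index (sensitivity + specificity - 1) at a given cutoff."""
        sensitivity = sum(x > cutoff for x in diseased) / len(diseased)
        specificity = sum(x <= cutoff for x in healthy) / len(healthy)
        return sensitivity + specificity - 1

    # Post hoc cutoff: chosen to maximize apparent performance on the small
    # study sample itself (20 diseased + 20 healthy patients).
    d_small, h_small = simulate(20)
    cutoff = max(d_small + h_small, key=lambda c: youden(d_small, h_small, c))
    print(f"apparent Youden index (derivation sample): {youden(d_small, h_small, cutoff):.2f}")

    # The same cutoff evaluated on a large independent sample performs worse;
    # for these distributions the theoretical maximum is about 0.38.
    d_test, h_test = simulate(50_000)
    print(f"Youden index (independent sample):         {youden(d_test, h_test, cutoff):.2f}")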
4.2. Indeterminate Index Test Results. Patients with indeterminate or ambiguous results should not be excluded from the study, as this could limit the results to an unrepresentative spectrum of extremes and potentially introduce spectrum bias. In this case, it is preferable to transform the 2 × 2 table into a 3 × 2 table and report positive, indeterminate, and negative results separately. To ensure that diagnostic accuracy estimates are not overestimated, a conservative "intention to diagnose" approach should be followed: indeterminate cases that test positive with the reference test are classified as false negatives for the index test, and indeterminate cases that test negative with the reference test are classified as false positives for the index test (see Table 2). In the scenario where the reference test also yields indeterminate results, the table may be extended to a 3 × 3 table to ensure transparent reporting [28].

Table 2: Various approaches to indeterminate index test results and their effect on sensitivity and specificity.
- Indeterminate results excluded from analysis: sensitivity increased, specificity increased.
- Indeterminate results considered positive: sensitivity increased, specificity decreased.
- Indeterminate results considered negative: sensitivity decreased, specificity increased.
- "Intention to diagnose" approach: sensitivity decreased, specificity decreased.

A meta-analysis by Schuetz et al. [28] pooled coronary CT angiography studies to compare how the handling of nonevaluable results affects diagnostic accuracy estimates. As CT angiography interpretation can involve nonevaluable test results, especially in areas with vessel calcifications [29], authors can consider nonevaluable vessel segments as positive or negative, exclude them from the analysis, or even exclude patients with nonevaluable segments altogether. The authors of this meta-analysis found that handling the test results with an "intention to diagnose" approach using a 3 × 2 table yielded lower diagnostic accuracy measures (area under the curve 0.93) compared to the other approaches (area under the curve 0.96–0.99) [28].
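As a concrete sketch (hypothetical counts only), the snippet below collapses a 3 × 2 table under two of the strategies in Table 2 and shows that the "intention to diagnose" approach yields the more conservative estimates:

    def se_sp(tp, fp, fn, tn):
        """Sensitivity and specificity from 2x2 counts."""
        return tp / (tp + fn), tn / (tn + fp)

    # Hypothetical 3x2 table: index test result (rows) against the reference
    # standard (disease present / absent).
    table = {
        "positive":      {"disease": 80, "no_disease": 10},
        "indeterminate": {"disease": 15, "no_disease": 15},
        "negative":      {"disease": 5,  "no_disease": 75},
    }
    pos, ind, neg = table["positive"], table["indeterminate"], table["negative"]

    # Strategy 1: exclude indeterminate results (inflates both measures).
    se, sp = se_sp(pos["disease"], pos["no_disease"], neg["disease"], neg["no_disease"])
    print(f"indeterminates excluded: se={se:.2f}, sp={sp:.2f}")   # 0.94, 0.88

    # Strategy 2: "intention to diagnose": diseased indeterminates become
    # false negatives and nondiseased indeterminates become false positives.
    se, sp = se_sp(
        pos["disease"],                         # true positives
        pos["no_disease"] + ind["no_disease"],  # false positives
        neg["disease"] + ind["disease"],        # false negatives
        neg["no_disease"],                      # true negatives
    )
    print(f"intention to diagnose:   se={se:.2f}, sp={sp:.2f}")   # 0.80, 0.75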
5. Domain 3: Reference Standard

The reference test represents the gold standard to which the index test is compared. The assumption is that the reference standard is 100% accurate, so any disagreement with the results of the index test is attributed to the limited sensitivity or specificity of the latter. However, reference standards that perfectly differentiate between patients with and without the target condition are rare, and thus some patients will inevitably be misclassified [30].

5.1. Misclassification Bias. Misclassification bias, which is also called imperfect gold standard bias, occurs due to errors in the reference test. The reference test may be susceptible to errors either due to its interpretation or due to technical limitations. For example, an imaging exam can give erroneous results because of inexperienced readers or limited resolution. If pathology is used as a reference standard, sampling error is an additional factor that could lead to false-negative results. The effect of this bias on the diagnostic accuracy estimates can vary depending on whether the reference and index tests tend to err in the same direction on the same patients or their errors are independent of each other. As a result, sensitivity and specificity can be over- or underestimated by this type of bias [22].

An example of misclassification bias can be found in a study by Ai et al., which determined the diagnostic accuracy of chest CT for the diagnosis of Coronavirus Disease 2019 (COVID-19). The reference standard was a Reverse Transcription Polymerase Chain Reaction (RT-PCR) test, which can give false-negative results in the early stages of the disease. The authors calculated the sensitivity of chest CT for the diagnosis of COVID-19 to be 97% and the specificity 25%, but acknowledged in the limitations section that, due to misclassification bias, the sensitivity may have been overestimated and the specificity underestimated by solely relying on the results of a single RT-PCR test [31].

Various methods have been proposed to correct for misclassification bias. One suggestion is adjusting the accuracy estimates based on external evidence about the degree and direction of the reference standard misclassification. Other ways to minimize this bias are to combine multiple tests into a composite reference standard or to validate the usefulness of the index test by correlating it directly with future clinical events or other clinical characteristics [32].
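A simulation makes the mechanism visible. The hypothetical Python sketch below (all test characteristics are assumed values, not taken from the Ai study) grades a 90%/90% index test against an imperfect reference whose errors are independent of the index test; the apparent sensitivity and specificity both fall below their true values, consistent with the direction described above.

    import random

    random.seed(1)

    # Assumed characteristics for this illustration only.
    TRUE_SE, TRUE_SP = 0.90, 0.90   # index test vs. the actual disease state
    REF_SE, REF_SP = 0.70, 0.99     # imperfect reference standard
    PREVALENCE = 0.30

    tp = fp = fn = tn = 0
    for _ in range(200_000):
        disease = random.random() < PREVALENCE
        # Index and reference errors are independent in this simulation.
        index_pos = random.random() < (TRUE_SE if disease else 1 - TRUE_SP)
        ref_pos = random.random() < (REF_SE if disease else 1 - REF_SP)
        if index_pos and ref_pos:
            tp += 1
        elif index_pos:
            fp += 1
        elif ref_pos:
            fn += 1
        else:
            tn += 1

    # Both apparent values fall below the true 0.90 because the index test is
    # graded against a flawed referee (expected: se ~0.87, sp ~0.81).
    print(f"apparent sensitivity: {tp / (tp + fn):.2f}")
    print(f"apparent specificity: {tn / (tn + fp):.2f}")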
which standard, sampling error is an additional factor which could evaluated the diagnostic accuracy of shunt series radio- lead to false-negative results. +e effect of this bias on the graphs and CT to assess for cerebrospinal fluid shunt diagnostic accuracy estimates can vary depending on malfunction. +e clinical decision to proceed to shunt re- whether the reference and index tests tend to err in the same vision, which was used as the reference standard, was made direction on the same patients or the reference and index test by the neurosurgeons after reviewing the radiograph and CT errors are independent of each other. As a result, sensitivity imaging. Despite the introduction of incorporation bias, this and specificity can be over- or underestimated by this type of decision was reasonable in this study due to the lack of an bias [22]. independent gold standard. +e authors also acknowledged An example of misclassification bias can be found in a this concern in the limitations section by stating possible study by Ai et al. which determined the diagnostic accuracy overestimation of the sensitivity [35]. of chest CT for the diagnosis of Coronavirus Disease 2019 (COVID-19). +e reference standard was a Reverse Tran- 6. Domain 4: Patient Flow and Timing scription Polymerase Chain Reaction (RT-PCR) test, which can give false-negative results in the early stages of the Diagnostic accuracy studies should be designed taking into disease. +e authors calculated the sensitivity of chest CT for account time-dependent changes of the disease on the the diagnosis of COVID-19 to be 97% and the specificity studied population and follow—as much as possible—a 25% but acknowledged in the limitations section that, due to homogeneous approach for all subjects. Intervals between misclassification bias, the sensitivity may have been over- the index and reference test and disturbances in the flow of estimated and the specificity may have been underestimated the study, such as changes in the reference test or with- by solely relying on the results of a single RT-PCR test [31]. drawals, are important sources of bias [7]. Various methods have been proposed to correct for misclassification bias. One suggestion is adjusting the ac- curacy estimates based on external evidence about the degree 6.1.TimingoftheIndexandReferenceTest. +e time interval and direction of the reference standard misclassification. between the conduction of the index and the reference tests Other ways to minimize this bias are to combine multiple should ideally be as short as possible. A long period between tests to a composite reference standard or validate the index the two could lead to misclassification bias, as the disease test usefulness by correlating directly with future clinical might improve or deteriorate during the interval time. An events or other clinical characteristics [32]. interval of a few days could be reasonable for chronic Radiology Research and Practice 7 Table 3: Direction of diagnostic accuracy estimates by type of bias. RDOR from RDOR from Type of bias Sensitivity [3, 16, 22, 26] Specificity [3, 16, 22, 26] Rutjes et al. Lijmer et al. 
6.2. Verification Bias. Verification bias is a form of bias introduced when not all patients receive the gold standard (partial verification) or when some patients receive a different reference test than the rest (differential verification) [3]. In partial verification bias, if the decision is made to perform the gold standard only for positive index test cases, the sensitivity will be overestimated (fewer false negatives) and the specificity will be underestimated (more false positives) [37]. The effect of differential verification depends on the quality of the different reference tests being used. Using a superior reference test for the positive index test results and a different reference test for the negative results will overestimate both sensitivity and specificity [3]. Notably, using the same gold standard for all patients may not be clinically or ethically appropriate. If verification bias cannot be eliminated by choosing a proper study design, it should at least be acknowledged or statistically corrected by the authors [38].

An example can be found in the Prospective Investigation of Pulmonary Embolism Diagnosis (PIOPED) study, which evaluated the diagnostic accuracy of the ventilation-perfusion (V-Q) scan using conventional angiography as the reference standard. Of the 131 patients with near-normal/normal results on the V-Q scan, only 57 received angiography (the gold standard). For the remaining 74, an alternative reference standard was used: no evidence of pulmonary embolism during one year of follow-up. The authors calculated that if those 74 patients were included in the analysis, the NPV for a near-normal/normal scan would have been 96%, and if not, the NPV would have been 91%. They concluded that the true NPV lies somewhere between those two numbers, but possibly closer to the first [39].

Another area in radiology where partial verification bias has been described is Single Photon Emission Computed Tomography (SPECT) for the diagnosis of coronary artery disease [40–43]. The decision to perform coronary angiography, which is the gold standard for the diagnosis of coronary artery disease, may be affected by the result of a preceding SPECT, which introduces verification bias (also called posttest referral bias). Authors have utilized mathematical formulas (e.g., Begg and Greenes [44] and Diamond [45]) to adjust for this bias, leading to significant changes in the calculated diagnostic accuracy parameters. Miller et al. [42] reported an unadjusted sensitivity of 98% and specificity of 13% for SPECT in coronary artery disease. After correction with the Begg and Greenes formula, the sensitivity dropped to 65% and the specificity increased to 67%, which indicates that verification bias can have a substantial effect on accuracy estimation.
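For intuition, here is a minimal sketch of the Begg and Greenes correction [44] with hypothetical counts (not the Miller et al. data). It rests on the assumption that the decision to verify depends only on the index test result, so the disease probabilities observed in the verified subsets can be extrapolated to all tested patients.

    def begg_greenes(n_pos, v_pos, d_pos, n_neg, v_neg, d_neg):
        """Begg and Greenes correction for partial verification bias.

        Assumes verification depends only on the index test result.
          n_pos, n_neg: all patients with positive/negative index tests
          v_pos, v_neg: how many of each group received the gold standard
          d_pos, d_neg: verified patients found to have the disease
        """
        p_d_pos = d_pos / v_pos              # P(disease | index positive)
        p_d_neg = d_neg / v_neg              # P(disease | index negative)
        diseased = n_pos * p_d_pos + n_neg * p_d_neg
        healthy = n_pos * (1 - p_d_pos) + n_neg * (1 - p_d_neg)
        return n_pos * p_d_pos / diseased, n_neg * (1 - p_d_neg) / healthy

    # Hypothetical workup: nearly all index-positive patients are verified but
    # only 10% of index-negative patients, mimicking posttest referral.
    n_pos, v_pos, d_pos = 300, 290, 200
    n_neg, v_neg, d_neg = 700, 70, 7

    naive_se = d_pos / (d_pos + d_neg)                                # ~0.97
    naive_sp = (v_neg - d_neg) / ((v_pos - d_pos) + (v_neg - d_neg))  # ~0.41
    se, sp = begg_greenes(n_pos, v_pos, d_pos, n_neg, v_neg, d_neg)
    print(f"naive (verified only): se={naive_se:.2f}, sp={naive_sp:.2f}")
    print(f"Begg-Greenes adjusted: se={se:.2f}, sp={sp:.2f}")

As in the Miller et al. example, the adjusted sensitivity falls and the adjusted specificity rises once the unverified, mostly index-negative patients are taken into account.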
6.3. Attrition Bias. An important consideration is whether all patients were included in the analysis. Withdrawals lead to over- or underestimation of the accuracy estimates (attrition bias) if the patients lost to follow-up differ in some way from those who remain. It is important for studies to report withdrawals and evaluate their effect on the accuracy estimates [46]. An example of the effect of withdrawals on diagnostic accuracy estimates can be found in a study by Kearl et al., which investigated the accuracy of MRI and ultrasound in the diagnosis of appendicitis. Of the 589 patients included, the reference standards, which were pathology reports, surgical diagnosis, or the clinical decision for medical treatment of appendicitis, were not available for 63 patients (10.7%) due to loss to follow-up. The authors acknowledged this limitation and analyzed its effect on diagnostic accuracy: a sensitivity analysis was performed, and the diagnostic accuracy was calculated separately as if all withdrawals were positive for appendicitis by the reference standard as well as if all withdrawals were negative [47].
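A worst-case/best-case sensitivity analysis of this kind can be sketched as follows (hypothetical counts, not the Kearl et al. data); the two scenarios bracket the range within which the true accuracy must lie.

    def se_sp(tp, fp, fn, tn):
        return tp / (tp + fn), tn / (tn + fp)

    # Hypothetical study: 2x2 counts for patients with a verified reference
    # standard, plus withdrawals whose index test result is known but whose
    # reference standard was never obtained.
    tp, fp, fn, tn = 90, 20, 10, 180
    lost_index_pos, lost_index_neg = 8, 12

    # Scenario 1: every withdrawal actually had the disease, so their
    # positive index tests become TPs and their negative ones become FNs.
    se1, sp1 = se_sp(tp + lost_index_pos, fp, fn + lost_index_neg, tn)
    # Scenario 2: no withdrawal had the disease, so their positive index
    # tests become FPs and their negative ones become TNs.
    se2, sp2 = se_sp(tp, fp + lost_index_pos, fn, tn + lost_index_neg)

    print(f"all withdrawals positive: se={se1:.2f}, sp={sp1:.2f}")
    print(f"all withdrawals negative: se={se2:.2f}, sp={sp2:.2f}")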
7. Direction of Accuracy Measures due to Bias

Knowing the direction in which diagnostic accuracy measures deviate is a first step in countering the effect of bias in our interpretation of study results. The general direction towards which the diagnostic accuracy estimates may deviate can be predicted and depends on the specific type of bias. Rutjes et al. [16] and Lijmer et al. [3] quantified the effect of several study design biases on diagnostic accuracy measures (see Table 3). They used the Relative Diagnostic Odds Ratio (RDOR) as a parameter to compare studies with a specific methodological shortcoming to those without. An RDOR greater than one indicates that the diagnostic accuracy parameters are overestimated in the flawed studies, while an RDOR less than one indicates that they are underestimated. The limitation of the RDOR is that important biases with opposing effects on sensitivity and specificity may not cause significant directional changes in the RDOR, which will remain close to one. This may explain why both of these studies failed to detect statistically significant changes in the RDOR for some forms of bias [3] (see Table 3).

Table 3: Direction of diagnostic accuracy estimates by type of bias. Directions for sensitivity and specificity are drawn from [3, 16, 22, 26]; RDOR values (with 95% CI) are from Rutjes et al. [16] and Lijmer et al. [3].
- Sampling bias (consecutive over nonconsecutive sampling): sensitivity and specificity increase if complex cases are excluded and decrease if clear-cut cases are excluded. RDOR: 1.5 (1.0–2.1) in Rutjes et al.; 0.9 (0.7–1.1) in Lijmer et al.
- Spectrum bias: sensitivity increases when severe cases are overrepresented in the patient sample ("the sickest of the sick"); specificity increases when healthy controls are overrepresented ("the healthiest of the healthy"). RDOR: 4.9 (0.6–37.3) in Rutjes et al.; 3.0 (2.0–4.5) in Lijmer et al.
- Information bias (lack of blinding): variable effect on sensitivity and specificity. RDOR: 1.1 (0.8–1.6) in Rutjes et al.; 1.3 (1.0–1.9) in Lijmer et al.
- Information bias (post hoc definition of cutoff): sensitivity and specificity increase. RDOR: 1.3 (0.8–1.9) in Rutjes et al.; not studied by Lijmer et al.
- Misclassification bias (imperfect gold standard): sensitivity and specificity increase if the errors of the index and reference tests are correlated and decrease if the errors are independent. RDOR: not studied in either analysis.
- Incorporation bias: sensitivity and specificity increase. RDOR: 1.4 (0.7–2.8) in Rutjes et al.; not studied by Lijmer et al.
- Verification bias, differential (i.e., different reference standards): sensitivity and specificity increase if the gold standard is used for positive index results and a different reference test (e.g., noninvasive and less expensive) is used for negative index results. RDOR: 1.6 (0.9–2.9) in Rutjes et al.; 2.2 (1.5–3.3) in Lijmer et al.
- Verification bias, partial: sensitivity increases, specificity decreases. RDOR: 1.1 (0.7–1.7) in Rutjes et al.; 1.0 (0.8–1.3) in Lijmer et al.
RDOR: Relative Diagnostic Odds Ratio; CI: confidence interval.
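The DOR and RDOR computations themselves are straightforward; the sketch below (with invented accuracy values) also illustrates the caveat that opposing shifts in sensitivity and specificity can leave the RDOR near one:

    def diagnostic_odds_ratio(sensitivity, specificity):
        """DOR = (Se / (1 - Se)) / ((1 - Sp) / Sp)."""
        return (sensitivity / (1 - sensitivity)) / ((1 - specificity) / specificity)

    # Hypothetical: studies with a design flaw report higher accuracy than
    # studies without it; the RDOR is the ratio of their DORs.
    dor_flawed = diagnostic_odds_ratio(0.95, 0.90)
    dor_sound = diagnostic_odds_ratio(0.85, 0.80)
    print(f"RDOR = {dor_flawed / dor_sound:.1f}")   # > 1: accuracy overestimated

    # Caveat: opposing shifts in Se and Sp can cancel in the RDOR even
    # though both measures are biased.
    dor_a = diagnostic_odds_ratio(0.90, 0.70)   # higher Se, lower Sp
    dor_b = diagnostic_odds_ratio(0.70, 0.90)   # lower Se, higher Sp
    print(f"RDOR = {dor_a / dor_b:.2f}")            # exactly 1 by symmetry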
8. Conclusion

Diagnostic accuracy studies can suffer from many forms of bias. QUADAS-2 provides a useful framework for thinking about study design biases. Patient selection, the index test, the reference test, and patient flow/timing are the four main domains to be evaluated in each study, as they cover the primary sources of systematic error in diagnostic accuracy studies. Potential sources of bias should be acknowledged by authors, and their effect on test performance should be estimated and reported. Readers are encouraged to become familiar with the biases that can be found in diagnostic accuracy studies and to critically assess studies before applying their conclusions to their own clinical practice.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] A. S. Saber Tehrani, H. Lee, S. C. Mathews et al., "25-year summary of US malpractice claims for diagnostic errors 1986–2010: an analysis from the National Practitioner Data Bank," BMJ Quality and Safety, vol. 22, pp. 672–680, 2013.
[2] J. W. Song, H. M. Kim, L. T. Bellfi, and K. C. Chung, "The effect of study design biases on the diagnostic accuracy of magnetic resonance imaging for detecting silicone breast implant ruptures: a meta-analysis," Plastic and Reconstructive Surgery, vol. 127, pp. 1029–1044, 2011.
[3] J. G. Lijmer, B. W. Mol, S. Heisterkamp et al., "Empirical evidence of design-related bias in studies of diagnostic tests," Journal of the American Medical Association, vol. 282, pp. 1061–1066, 1999.
[4] R. K. Owen, N. J. Cooper, T. J. Quinn, R. Lees, and A. J. Sutton, "Network meta-analysis of diagnostic test accuracy studies identifies and ranks the optimal diagnostic tests and thresholds for health care policy and decision-making," Journal of Clinical Epidemiology, vol. 99, pp. 64–74, 2018.
[5] Evidence-Based Radiology Working Group, "Evidence-based radiology: a new approach to the practice of radiology," Radiology, vol. 220, pp. 566–575, 2001.
[6] J. B. Reitsma, A. W. S. Rutjes, P. Whiting, V. V. Vlassov, M. M. G. Leeflang, and J. J. Deeks, "Chapter 9: assessing methodological quality," in Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0.0, J. J. Deeks, P. M. Bossuyt, and C. Gatsonis, Eds., The Cochrane Collaboration, London, UK, 2009.
[7] P. F. Whiting, A. W. S. Rutjes, M. E. Westwood et al., "QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies," Annals of Internal Medicine, vol. 155, pp. 529–536, 2011.
[8] S. Aslam and P. Emmanuel, "Formulating a researchable question: a critical step for facilitating good clinical research," Indian Journal of Sexually Transmitted Diseases and AIDS, vol. 31, pp. 47–50, 2010.
[9] C. M. Patino and J. C. Ferreira, "Internal and external validity: can you apply research study results to your patients?" Jornal Brasileiro de Pneumologia, vol. 44, p. 183, 2018.
[10] A. M. Brandmaier, E. Wenger, N. C. Bodammer, S. Kühn, N. Raz, and U. Lindenberger, "Assessing reliability in neuroimaging research through intra-class effect decomposition (ICED)," eLife, vol. 7, Article ID e35718, 2018.
[11] R. L. Schmidt and R. E. Factor, "Understanding sources of bias in diagnostic accuracy studies," Archives of Pathology & Laboratory Medicine, vol. 137, pp. 558–565, 2013.
[12] C. Andrade, "Internal, external, and ecological validity in research design, conduct, and evaluation," Indian Journal of Psychological Medicine, vol. 40, pp. 498–499, 2018.
[13] J. F. Cohen, D. A. Korevaar, D. G. Altman et al., "STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration," BMJ Open, vol. 6, Article ID e012799, 2016.
[14] M. Elfil and A. Negida, "Sampling methods in clinical research; an educational review," Emergency, vol. 5, no. 1, p. e52, 2017.
[15] K. Mathieson, "Making sense of biostatistics: probability versus nonprobability sampling," Journal of Clinical Research Best Practices, vol. 10, no. 8, pp. 1–2, 2014.
[16] A. W. S. Rutjes, J. B. Reitsma, M. Di Nisio, N. Smidt, J. C. van Rijn, and P. M. M. Bossuyt, "Evidence of bias and variation in diagnostic accuracy studies," Canadian Medical Association Journal, vol. 174, pp. 469–476, 2006.
[17] J. A. Kline, P. M. O'Malley, V. S. Tayal, G. R. Snead, and A. M. Mitchell, "Emergency clinician-performed compression ultrasonography for deep venous thrombosis of the lower extremity," Annals of Emergency Medicine, vol. 52, pp. 437–445, 2008.
[18] S. Farahmand, M. Farnia, S. Shahriaran, and P. Khashayar, "The accuracy of limited B-mode compression technique in diagnosing deep venous thrombosis in lower extremities," The American Journal of Emergency Medicine, vol. 29, pp. 687–690, 2011.
[19] T. Jang, M. Docherty, C. Aubin, and G. Polites, "Resident-performed compression ultrasonography for the detection of proximal deep vein thrombosis: fast and accurate," Academic Emergency Medicine, vol. 11, pp. 319–322, 2004.
[20] D. F. Ransohoff and A. R. Feinstein, "Problems of spectrum and bias in evaluating the efficacy of diagnostic tests," New England Journal of Medicine, vol. 299, pp. 926–930, 1978.
[21] S. A. Mulherin and W. C. Miller, "Spectrum bias or spectrum effect? Subgroup variation in diagnostic test evaluation," Annals of Internal Medicine, vol. 137, pp. 598–602, 2002.
[22] M. A. Kohn, C. R. Carpenter, and T. B. Newman, "Understanding the direction of bias in studies of diagnostic test accuracy," Academic Emergency Medicine, vol. 20, pp. 1194–1206, 2013.
[23] S. H. Park, "Diagnostic case-control versus diagnostic cohort studies for clinical validation of artificial intelligence algorithm performance," Radiology, vol. 290, no. 1, pp. 272–273, 2019.
[24] J. G. Nam, S. Park, E. J. Hwang et al., "Development and validation of deep learning-based automatic detection algorithm for malignant pulmonary nodules on chest radiographs," Radiology, vol. 290, no. 1, pp. 218–228, 2019.
[25] P. Whiting, A. W. S. Rutjes, J. B. Reitsma, A. S. Glas, P. M. M. Bossuyt, and J. Kleijnen, "Sources of variation and bias in studies of diagnostic accuracy," Annals of Internal Medicine, vol. 140, pp. 189–202, 2004.
[26] M. M. G. Leeflang, K. G. M. Moons, J. B. Reitsma, and A. H. Zwinderman, "Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: mechanisms, magnitude, and solutions," Clinical Chemistry, vol. 54, pp. 729–737, 2008.
[27] A. S. Kivrak, D. Kiresi, D. Emlik, K. Odev, and M. Kilinc, "Comparison of CT virtual cystoscopy of the contrast material-filled bladder with conventional cystoscopy in the diagnosis of bladder tumours," Clinical Radiology, vol. 64, pp. 30–37, 2009.
[28] G. M. Schuetz, P. Schlattmann, and M. Dewey, "Use of 3 × 2 tables with an intention to diagnose approach to assess clinical performance of diagnostic tests: meta-analytical evaluation of coronary CT angiography studies," BMJ, vol. 345, Article ID e6717, 2012.
[29] C. Strong, A. Ferreira, and R. C. Teles, "Diagnostic accuracy of computed tomography angiography for the exclusion of coronary artery disease in candidates for transcatheter aortic valve implantation," Scientific Reports, vol. 9, Article ID 19942, 2019.
[30] C. Biesheuvel, L. Irwig, and P. Bossuyt, "Observed differences in diagnostic test accuracy between patient subgroups: is it real or due to reference standard misclassification?" Clinical Chemistry, vol. 53, pp. 1725–1729, 2007.
[31] T. Ai, Z. Yang, H. Hou, C. Zhan, C. Chen, and W. Lv, "Correlation of chest CT and RT-PCR testing for Coronavirus Disease 2019 (COVID-19) in China: a report of 1014 cases," Radiology, vol. 296, pp. E32–E40, 2020.
[32] J. B. Reitsma, A. W. S. Rutjes, K. S. Khan, A. Coomarasamy, and P. M. Bossuyt, "A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard," Journal of Clinical Epidemiology, vol. 62, pp. 797–806, 2009.
[33] J. Meadway, A. N. Nicolaides, C. J. Walker, and J. D. O'Connell, "Value of Doppler ultrasound in diagnosis of clinically suspected deep vein thrombosis," British Medical Journal, vol. 4, pp. 552–554, 1975.
[34] A. Worster and C. Carpenter, "Incorporation bias in studies of diagnostic tests: how to avoid being biased about bias," CJEM, vol. 10, pp. 174–175, 2008.
[35] A. Mater, M. Shroff, S. Al-Farsi, J. Drake, and R. D. Goldman, "Test characteristics of neuroimaging in the emergency department evaluation of children for cerebrospinal fluid shunt malfunction," Canadian Journal of Emergency Medicine, vol. 10, pp. 131–135, 2008.
[36] P. Whiting, R. Harbord, C. Main, J. J. Deeks, G. Filippini, and M. Egger, "Accuracy of magnetic resonance imaging for the diagnosis of multiple sclerosis: systematic review," BMJ, vol. 332, pp. 875–884, 2006.
[37] A. S. Kosinski and H. X. Barnhart, "Accounting for nonignorable verification bias in assessment of diagnostic tests," Biometrics, vol. 59, pp. 163–171, 2003.
[38] J. M. Petscavage, M. L. Richardson, and R. B. Carr, "Verification bias: an underrecognized source of error in assessing the efficacy of medical imaging," Academic Radiology, vol. 18, pp. 343–346, 2011.
[39] PIOPED Investigators, "Value of the ventilation/perfusion scan in acute pulmonary embolism. Results of the prospective investigation of pulmonary embolism diagnosis (PIOPED)," Journal of the American Medical Association, vol. 263, pp. 2753–2759, 1990.
[40] M. P. Cecil, A. S. Kosinski, M. T. Jones et al., "The importance of work-up (verification) bias correction in assessing the accuracy of SPECT thallium-201 testing for the diagnosis of coronary artery disease," Journal of Clinical Epidemiology, vol. 49, no. 7, pp. 735–742, 1996.
[41] C. Santana-Boado, J. Candell-Riera, J. Castell-Conesa et al., "Diagnostic accuracy of technetium-99m-MIBI myocardial SPECT in women and men," Journal of Nuclear Medicine, vol. 39, no. 5, pp. 751–755, 1998.
[42] T. D. Miller, D. O. Hodge, T. F. Christian, J. J. Milavetz, K. R. Bailey, and R. J. Gibbons, "Effects of adjustment for referral bias on the sensitivity and specificity of single photon emission computed tomography for the diagnosis of coronary artery disease," The American Journal of Medicine, vol. 112, no. 4, pp. 290–297, 2002.
[43] J. A. Ladapo, S. Blecker, M. R. Elashoff et al., "Clinical implications of referral bias in the diagnostic performance of exercise testing for coronary artery disease," Journal of the American Heart Association, vol. 2, no. 6, Article ID e000505, 2013.
[44] C. B. Begg and R. A. Greenes, "Assessment of diagnostic tests when disease verification is subject to selection bias," Biometrics, vol. 39, no. 1, pp. 207–215, 1983.
[45] G. A. Diamond, "Reverend Bayes' silent majority. An alternative factor affecting sensitivity and specificity of exercise electrocardiography," The American Journal of Cardiology, vol. 57, no. 13, pp. 1175–1180, 1986.
[46] O. Aliu and K. C. Chung, "Assessing strength of evidence in diagnostic tests," Plastic and Reconstructive Surgery, vol. 129, pp. 989e–998e, 2012.
[47] Y. L. Kearl, I. Claudius, S. Behar et al., "Accuracy of magnetic resonance imaging and ultrasound for appendicitis in diagnostic and nondiagnostic studies," Academic Emergency Medicine, vol. 23, pp. 179–185, 2016.
