Physician understanding, explainability, and trust in a hypothetical machine learning risk calculator

Abstract

Objective: Implementation of machine learning (ML) may be limited by patients' right to "meaningful information about the logic involved" when ML influences healthcare decisions. Given the complexity of healthcare decisions, it is likely that ML outputs will need to be understood and trusted by physicians, and then explained to patients. We therefore investigated the association between physician understanding of ML outputs, their ability to explain these to patients, and their willingness to trust the ML outputs, using various ML explainability methods.

Materials and Methods: We designed a survey for physicians with a diagnostic dilemma that could be resolved by an ML risk calculator. Physicians were asked to rate their understanding, explainability, and trust in response to 3 different ML outputs. One ML output had no explanation of its logic (the control) and 2 ML outputs used different model-agnostic explainability methods. The relationships among understanding, explainability, and trust were assessed using Cochran-Mantel-Haenszel tests of association.

Results: The survey was sent to 1315 physicians, and 170 (13%) provided completed surveys. There were significant associations between physician understanding and explainability (P < .001), between physician understanding and trust (P < .001), and between explainability and trust (P < .001). ML outputs that used model-agnostic explainability methods were preferred by 88% of physicians when compared with the control condition; however, no particular ML explainability method had a greater influence on intended physician behavior.

Conclusions: Physician understanding, explainability, and trust in ML risk calculators are related. Physicians preferred ML outputs accompanied by model-agnostic explanations, but the explainability method did not alter intended physician behavior.

Keywords: artificial intelligence, explainability, interpretability, decision support, medicine

INTRODUCTION

BACKGROUND AND SIGNIFICANCE

Traditional risk calculators, such as the Wells' Criteria and PERC Rule scores for pulmonary embolism, are commonly used in clinical practice to support decision making in cases of diagnostic uncertainty.1–3 Machine learning (ML) may improve the accuracy of risk assessment4; however, implementation of ML raises complex clinical, ethical, and legal questions because of the lack of understanding about how ML models generate outputs, commonly referred to as the black box problem.5–7 This has become a particular issue since the introduction of the European Union General Data Protection Regulation, which requires that patients have a right to "meaningful information about the logic involved" when ML is used.8 Therefore, unless the black box problem can be remedied, the patient right to explanation cannot be fulfilled and the healthcare industry cannot fully benefit from these powerful technologies.
Fortunately, ML explainability methods have been developed to address the black box problem.9–13 For example, model-agnostic explainability methods can derive post hoc explanations from black box models.14 Broadly, this involves training an interpretable model on the predictions of a black box model and perturbing its inputs to see how the black box model reacts.15–17 Examples of these methods include permutation variable importance (VI), individual conditional expectation (ICE) plots, local interpretable model-agnostic explanations (LIME), and Shapley values (SVs).17–20 While these methods have been explored in a number of contexts, there is currently limited evidence as to whether they adequately explain ML-derived risk estimates in a clinical context.21

Importantly, clinical decisions represent a complex synthesis of basic sciences, clinical evidence, and patient preferences.22,23 If these highly individualized decisions were to be automated with ML, physicians would likely need to explain the decision-making process to patients, at least in the near term.22,24,25 Indeed, many published examples of ML applications in healthcare still require physician interpretation.26–30 Therefore, in order to satisfy patients' rights to a meaningful explanation, ML risk calculators will first need to "explain" their output in a way that physicians can understand. It is not yet known whether currently available model-agnostic explanations can provide physicians with an adequate understanding of ML outputs, nor whether physician understanding of ML outputs leads to the ability to explain them to patients. Finally, few studies have explored whether improving physician understanding of ML outputs influences physician behavior.

OBJECTIVES

In the context of a clinical ML risk calculator, we aimed to investigate the associations among (1) physician understanding of an ML output (termed physician understanding); (2) physician ability to explain an ML output to patients (termed explainability); and (3) physician intended behavior (termed trust). We also aimed to explore whether model-agnostic explainability methods influenced the intended behavior of physicians, and whether a particular model-agnostic explainability method was preferred by physicians.

MATERIALS AND METHODS

Hypothetical ML risk calculator for pulmonary embolism

To assess the relationships among physician understanding, explainability, and trust in ML risk calculators, we designed a hypothetical ML risk calculator based on the Wells' Criteria and PERC Rule to use in a survey-based clinical scenario for physicians.1,2 Based on the risk estimated by the ML risk calculator, a clinical recommendation of either (1) "reassurance and discharge recommended" or (2) "computed tomography pulmonary angiogram recommended" was made if the patient was at low risk or not at low risk of pulmonary embolism, respectively. We combined a risk estimate with a clinical recommendation because we believed that it would allow us to assess intended physician behavior and, therefore, trust in the ML output. In addition, the combination of a risk estimate and a clinical recommendation is commonly used with non-ML risk calculators, including online versions of the Wells' Criteria and PERC Rule.31,32
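To make the output format concrete, the pairing of a risk estimate with one of the two recommendations can be thought of as a simple thresholding step. The sketch below is illustrative only: the calculator in this study was hypothetical, so the 1% cut-off, function names, and data structure are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the study's calculator was hypothetical, so the
# threshold and wording below are assumptions mirroring the survey text.
from dataclasses import dataclass

LOW_RISK_THRESHOLD = 0.01  # assumed cut-off matching the "<1% chance" shown to participants


@dataclass
class MLOutput:
    risk: float           # predicted probability of pulmonary embolism
    recommendation: str   # clinical recommendation paired with the risk estimate


def recommend(risk: float) -> MLOutput:
    """Pair a risk estimate with one of the two recommendations used in the survey."""
    if risk < LOW_RISK_THRESHOLD:
        return MLOutput(risk, "Low risk of pulmonary embolism: reassurance and discharge recommended")
    return MLOutput(risk, "Not low risk of pulmonary embolism: computed tomography pulmonary angiogram recommended")


print(recommend(0.004).recommendation)  # a <1% estimate maps to the discharge recommendation
```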
Survey instrument

Based on the described hypothetical ML risk calculator, we designed an online survey (Supplementary Appendix) that provided participants with the following hypothetical clinical scenario:

You are a GP [general practitioner] who has reviewed a 50-year-old woman presenting with shortness of breath. After a history, examination, laboratory tests, ECG [electrocardiogram], and a chest x-ray, you are comfortable you have excluded the most concerning diagnoses. However, you are still considering pulmonary embolism. The practice has installed a piece of software that uses artificial intelligence to assist with ruling out pulmonary embolism. It can stratify patients as either: (1) low risk of pulmonary embolism: reassurance and discharge recommended; or (2) not low risk of pulmonary embolism: computed tomography pulmonary angiogram recommended. The software automatically analyses the electronic record, including your documented history, examination, and laboratory tests, and provides its recommendation.

The rationale for this clinical scenario was that pulmonary embolism is a disease encountered in community, medical, and surgical settings and is potentially life-threatening, so decision making around diagnosis and management is critical.33 We provided limited clinical information in the scenario to prevent participants from agreeing or disagreeing with the ML output for reasons other than the explainability method we used. Each participant was shown 3 consecutive ML outputs, each of which included an ML risk estimate combined with a clinical recommendation. The 3 consecutive ML outputs were identical except for the method used to visualize and explain the "logic" of the risk estimate. After each ML output, the participant was asked the following questions with select-choice answers:

To what degree does the software's decision make sense to you? [Not at all/Very little/Somewhat/To a great extent]
Would you be able to explain the software's decision to the patient? [Yes/No]
Would you follow the software's recommendation? [Yes/No]

The first ML output for all participants was as follows:

Your patient has a low risk (<1% chance) of pulmonary embolism. They should be reassured and followed up in the community as you deem appropriate. This recommendation is based on a cohort of 10 000 patients who were investigated for pulmonary embolism, of whom 1000 had a similar risk profile. The software has been externally validated.

This ML output was regarded as the control condition because only the estimated risk of pulmonary embolism (ie, <1% chance) and the population that the ML model was derived from were shown. No explanation of the logic behind the ML model's risk calculation was shown. In our scenario, the accuracy of the ML model and the size of the derivation cohort were exaggerated in order to ensure that physicians' potential lack of trust was not simply due to poor model performance. For the second and third ML outputs, the control output was shown to the participant along with a model-agnostic explanation. This model-agnostic explanation included (1) a graphical visualization and (2) a brief textual explanation of what the visualization was demonstrating. The model-agnostic explanations were hypothetical. They were developed by W.K.D. and N.B., who are practicing physicians, and Q.T., who is a data scientist, and were tested on a small number of practicing physicians to ensure they were clinically valid.
For the second ML output, participants were randomized to 1 of 2 global model-agnostic explanations, either VI (Figure 1A) or an ICE plot (Figure 1B). For the third ML output, participants were randomized to 1 of 2 local model-agnostic explanations, either LIME (Figure 1C) or SVs (Figure 1D). Broadly, global explanations demonstrate feature importance across a population, while local explanations explain individual predictions.34 Finally, participants were asked to select which ML output they preferred and why. Demographics and medical subspecialty were also assessed.

Figure 1. Graphical visualizations shown to participants, each with the following complementary explanation. (A) For variable importance (VI), the visualization shows the relative importance of each clinical factor used by the model. This is general information about the software's logic. (B) For individual conditional expectation (ICE) plots, the visualizations show the average and standard deviation of the software's predictions for different values of the most influential clinical factors used by the model. This is general information about the software's logic. (C) For local interpretable model-agnostic explanations (LIME), the visualization shows the positive or negative relative impact of the most influential clinical factors used by the software to estimate your patient's risk of pulmonary embolism (PE). This profile is specific to your patient. (D) For Shapley values (SVs), the visualization shows the positive or negative contribution of each clinical factor to the risk estimated by the software. The sum of the bars is equal to your patient's PE risk. This profile is specific to your patient.
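The four panels in Figure 1 correspond to widely used model-agnostic techniques. The study's figures were hand-crafted mock-ups rather than outputs of a real model, but a minimal sketch of how similar explanations could be generated with common open-source tooling (scikit-learn, plus the lime and shap packages) might look like the following; the synthetic data, model, and feature names are placeholders, not the study's.

```python
# Minimal sketch of the 4 explanation types in Figure 1 using generic tooling.
# Everything here (data, model, feature names) is a placeholder: the study's
# visualizations were hypothetical mock-ups, not outputs of a fitted model.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

rng = np.random.default_rng(0)
features = ["heart_rate", "o2_saturation", "d_dimer", "age"]  # assumed feature names
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=features)
y = ((X["d_dimer"] + 0.5 * X["heart_rate"] + rng.normal(scale=0.5, size=500)) > 1).astype(int)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# (A) Permutation variable importance: a global ranking of clinical factors.
vi = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(dict(zip(features, vi.importances_mean.round(3))))

# (B) ICE plot: per-patient prediction curves as one factor is varied.
PartialDependenceDisplay.from_estimator(model, X, ["d_dimer"], kind="individual")

# (C) LIME: a local surrogate explanation for a single patient.
from lime.lime_tabular import LimeTabularExplainer

lime_explainer = LimeTabularExplainer(X.values, feature_names=features, mode="classification")
print(lime_explainer.explain_instance(X.iloc[0].values, model.predict_proba, num_features=4).as_list())

# (D) Shapley values: additive per-factor contributions that, together with the
# base value, sum to the model's prediction for that patient.
import shap

shap_values = shap.TreeExplainer(model).shap_values(X.iloc[:1])
print(shap_values)  # format varies by shap version (list per class or a 3D array)
```

The global methods (A, B) summarize the model across the whole cohort, whereas the local methods (C, D) decompose the single prediction shown to the physician, which is why the text accompanying panel D notes that the bars sum to the patient's estimated risk.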
Participants

The relevant institutional review boards approved this study. We surveyed all physicians employed in Northland District Health Board (approximately 104 Resident Medical Officers and 150 Senior Medical Officers), Waitematā District Health Board (approximately 415 Resident Medical Officers and 502 Senior Medical Officers), and Te Tai Tokerau and Manaia Primary Health Organisations (approximately 151 general practitioners combined) in New Zealand between March and May of 2019. These physicians serve a population of approximately 800 000 people. The survey was emailed to all physicians via a SurveyGizmo link.35

Descriptive statistics and analysis of comments

Statistical analysis was performed using data analysis (Pandas and StatsModels) and visualization (Seaborn and WordCloud) packages in the Python ecosystem, in addition to stats and samplesizeCMH in the R ecosystem. The associations among physician understanding, explainability, and trust were assessed using Cochran-Mantel-Haenszel tests. Because repeated measures were collected from each participant (ie, 1 for each of the 3 ML output types), the response data were stratified according to the ML output type (ie, control condition, global explanation, and local explanation). In this way, 2 × 4 × 3 contingency tables were generated for physician understanding vs explainability and for physician understanding vs trust, and 2 × 2 × 3 contingency tables were generated for explainability vs trust. For 2 × J × K (J > 2) tables, the generalized Cochran-Mantel-Haenszel test was applied.36 To assess the reliability of each test, the statistical power of each test was calculated with a conventional significance level of .05. Binomial exact tests were used to assess whether physicians preferred specific explainability methods. To assess the influence of explainability method on intended physician behavior (trust), McNemar's test was applied to paired responses from the same participants under 2 different scenarios (eg, control vs global explanation). If the number of changes in responses was <25, a binomial distribution was used to obtain an exact P value; otherwise, a chi-square distribution was used to obtain an approximate P value.
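The 2 × 2 × K analyses and the paired comparisons described above map onto standard library routines; a minimal sketch is given below. The counts are invented for illustration (the published contingency tables are not reproduced here), and the generalized 2 × J × K Cochran-Mantel-Haenszel test for the 4-level understanding ratings was run in R in the study, for which statsmodels has no direct equivalent.

```python
# Sketch of the main tests described above, with invented counts. The
# generalized 2 x J x K CMH test (for the 4-level understanding ratings) was
# run in R in the study; statsmodels covers the 2 x 2 x K case shown here.
import numpy as np
from scipy.stats import binomtest
from statsmodels.stats.contingency_tables import StratifiedTable, mcnemar

# Explainability (yes/no) vs trust (yes/no), one 2x2 table per ML output type.
strata = [
    np.array([[120, 25], [10, 15]]),  # control condition (invented counts)
    np.array([[118, 22], [12, 18]]),  # global explanation (invented counts)
    np.array([[125, 20], [8, 17]]),   # local explanation (invented counts)
]
cmh = StratifiedTable(strata).test_null_odds()  # Cochran-Mantel-Haenszel test of association
print(cmh.statistic, cmh.pvalue)

# McNemar's test on paired trust responses (eg, control vs global explanation);
# exact=True uses the binomial distribution, as the paper did when <25 responses changed.
paired = np.array([[130, 10], [12, 18]])  # invented paired yes/no counts
print(mcnemar(paired, exact=True).pvalue)

# Binomial exact test: did more physicians than expected by chance prefer an
# explained output? (137 of 156 is reconstructed from the reported 87.8%.)
print(binomtest(137, n=156, p=2 / 3, alternative="greater").pvalue)
```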
A word cloud was used to explore the reasons why participants preferred certain ML outputs. A natural language processing package was used to process the responses. The plain text was tokenized, de-capitalized, and stemmed. After removing all punctuation, digits, stop words, and short words with <3 letters, the frequency of each token and bigram was counted and used to set the font size in the word cloud proportionally. We kept only tokens or bigrams that appeared at least 5 times in the collected responses.
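A minimal version of this text-processing pipeline might look like the sketch below. The authors do not name their natural language processing package, so the use of NLTK (for stop words, stemming, and bigrams) and the WordCloud package here is an assumption, and the sample responses are invented.

```python
# Sketch of the word-cloud pipeline described above. The study's NLP package is
# not named, so NLTK and the wordcloud package are assumptions; the sample
# responses below are invented.
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import bigrams
from wordcloud import WordCloud

nltk.download("stopwords", quiet=True)


def token_and_bigram_counts(responses, min_count=5):
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))
    strip_chars = string.punctuation + string.digits
    counts = Counter()
    for text in responses:
        words = [w.strip(strip_chars) for w in text.lower().split()]  # tokenize and de-capitalize
        tokens = [stemmer.stem(w) for w in words if w and w not in stops and len(w) >= 3]
        counts.update(tokens)                                      # unigram frequencies
        counts.update(" ".join(pair) for pair in bigrams(tokens))  # bigram frequencies
    return {term: n for term, n in counts.items() if n >= min_count}  # keep terms seen >= 5 times


freqs = token_and_bigram_counts(["Easy to understand and specific to my patient."] * 6)
WordCloud(width=800, height=400).generate_from_frequencies(freqs).to_file("word_cloud.png")
```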
RESULTS

Participants

The survey was emailed to 1315 participants (Figure 2). There were 249 (18.9%) responses, among which 79 (31.7%) were incomplete and 170 (68.3%) were complete. A total of 100 participants (58.8%) were male. The largest age group was 25-34 years (32.4%), followed by 45-54 years (27.6%). The most common medical subspecialties that responded were "Medicine" (20%), followed by "Other" (17.6%) and "Anaesthetics" and "Psychiatry" (both 11.8%) (Table 1). Of the 170 physicians who completed the survey, all (100%) were provided with 3 consecutive ML outputs, of which the first was always the control output (no model-agnostic explanation), the second was always a global model-agnostic explanation (VI or ICE), and the third was always a local model-agnostic explanation (LIME or SVs). A total of 37 (21.8%) were randomized to VI and LIME; 33 (19.4%) were randomized to VI and SVs; 54 (31.8%) were randomized to ICE and LIME; and 46 (27.0%) were randomized to ICE and SVs. A total of 510 responses (3 ML outputs per physician who completed the survey) were used to assess the relationships among the concepts of physician understanding, explainability, and trust.

Figure 2. Flow chart of survey response and randomization. ICE: individual conditional expectation plots; LIME: local interpretable model-agnostic explanations; SV: Shapley values; VI: variable importance.

Table 1. Demographic characteristics of the participants, n (%)

Age
  18-24 y: 3 (1.8)
  25-34 y: 55 (32.4)
  35-44 y: 34 (20.0)
  45-54 y: 47 (27.6)
  55-64 y: 22 (12.9)
  65-74 y: 5 (2.9)
  75 y or older: 3 (1.8)
  Missing: 1 (0.6)
Sex/gender
  Male: 100 (58.8)
  Female: 64 (37.6)
  Nonbinary: 1 (0.6)
  Missing: 5 (2.9)
Subspecialty
  Medicine: 34 (20)
  Other: 30 (17.6)
  Anesthetics: 20 (11.8)
  Psychiatry: 20 (11.8)
  Surgery: 16 (9.4)
  Emergency medicine: 15 (8.8)
  Pediatrics: 11 (6.5)
  Radiology: 9 (5.3)
  General practice: 6 (3.5)
  Obstetrics and gynecology: 5 (2.9)
  Pathology: 3 (1.8)
  Ophthalmology: 1 (0.6)

Physician understanding and explainability

Physicians who reported higher levels of understanding in response to the question "To what degree does the software's decision make sense to you?" were more likely to answer "Yes" to the question "Would you be able to explain the software's decision to the patient?" (Figure 3). The relationship between physician understanding and explainability was statistically significant, controlling for all ML output types (Cochran-Mantel-Haenszel χ2(3) = 156.3, P < .001, power = 0.99).

Figure 3. Relationship between physician understanding and explainability. (A) Control condition. (B) Global explanation. (C) Local explanation.

Physician understanding and trust

Physicians who reported higher levels of understanding in response to the question "To what degree does the software's decision make sense to you?" were more likely to answer "Yes" to the question "Would you follow the software's recommendation?" (Figure 4). The relationship between physician understanding and trust was statistically significant, controlling for all ML output types (Cochran-Mantel-Haenszel χ2(3) = 128.7, P < .001, power = 0.99).

Figure 4. Relationship between physician understanding and trust. (A) Control condition. (B) Global explanation. (C) Local explanation.

Explainability and trust

Physicians who responded "Yes" to the question "Would you be able to explain the software's decision to the patient?" were more likely to respond "Yes" to the question "Would you follow the software's recommendation?" (Figure 5). The relationship between explainability and trust was statistically significant, controlling for all ML output types (Cochran-Mantel-Haenszel χ2(1) = 61.2, P < .001, power = 0.99). No particular ML explainability method had a greater influence on intended physician behavior (Table 2).

Figure 5. Relationship between explainability and trust. (A) Control condition. (B) Global explanation. (C) Local explanation.

Table 2. Influence of explainability methods on intended physician behavior (trust)

Compared scenarios (observations): P value
  Control vs global explanation (170): .851
  Control vs local explanation (170): .856
  Control vs VI (70): .774
  Control vs ICE (100): 1.0
  Control vs LIME (91): .454
  Control vs SVs (79): .180
  VI vs LIME (37): .109
  VI vs SVs (33): .625
  ICE vs LIME (54): .754
  ICE vs SVs (46): .070
ICE: individual conditional expectation; LIME: local interpretable model-agnostic explanations; SV: Shapley value; VI: variable importance.

Preferred explainability method

There were 156 (91.8%) physicians who rated their preferred explainability method. Of these, 87.8% preferred an ML output that contained a model-agnostic explanation, compared with 12.2% who preferred the control output (no model-agnostic explanation).
Binomial exact testing showed that the preference for any model-agnostic explainability method was significantly higher than two-thirds (P < .001), the proportion expected by chance given that 2 of the 3 outputs shown included an explanation. Local explanations (LIME [32.1%] and SVs [29.9%]) were preferred over global explanations (VI [18.2%] and ICE [19.7%]), and binomial exact testing showed that the preference for local explanations (62.0%) was significantly higher than 0.5 among those who preferred a model-agnostic explainability method (P = .003). Even when no model-agnostic explanation was provided (ie, before being shown a model-agnostic explanation), physicians commonly reported that ML outputs were understandable (93% rating "somewhat" or "to a great extent"), explainable (85%), and trustworthy (76%).

Figure 6 shows a word cloud generated from the responses to the question "Why do you prefer this software recommendation?" We noticed trends toward (1) the simplicity of the visualization (easy, simple, easy to understand), (2) the specificity of the explanation (specific, specific patient), and (3) the confirmation of clinical knowledge (explain, understand, information).

Figure 6. Word cloud generated from the responses to the question, "Why do you prefer this software recommendation?"

DISCUSSION

In the context of an ML risk calculator, this study found that physicians' perceived understanding of ML outputs (physician understanding), their ability to explain ML outputs to a patient (explainability), and their intended behavior (trust) are significantly related. While physicians did prefer to have an explanation of the ML model's logic (88% of physicians), no particular ML explainability method had a greater influence on intended physician behavior. These findings suggest that ML explainability is important to physicians, but that factors other than explainability, such as ML accuracy, might be more important to them. This hypothesis needs to be tested in future studies.

Physician understanding and intended physician behavior (trust)

Several other studies have explored the link between physician understanding and trust in ML. A survey of 191 Chinese physicians found that participants' ability to understand ML decision support tools correlated with their attitudes about the usefulness of ML but not with whether they would actually use ML tools.37 In studies of non-ML decision support systems, increased physician skepticism about the quality of the underlying evidence led to reduced motivation to use them.38,39 Finally, a qualitative study and several opinion articles from physicians outlined their desire to understand an ML model's logic before they would follow its recommendations, especially if its recommendations were to differ from their own opinion.24,40,41 Our study also demonstrated this trend, with no physician reporting that they would follow the ML output if they rated their understanding as "not at all." Whilst this evidence base is small and heterogeneous, it suggests an association between physician understanding of ML outputs and intended physician behavior.

Explainability

Despite the legal requirements outlined in the General Data Protection Regulation and the opinions described previously, many experts debate the necessity of ML explainability.
Those in favor of explainability reason that even highly performant ML models are prone to error, such as confounding in their derivation datasets or incorrect application of datasets to new populations.24,41,42 Therefore, providing more information about the logic underlying an ML output for the physician to verify may reduce the incidence of these errors and, incidentally, may also address concerns of physician deskilling from automation complacency.40,42 However, others point out that the enormous volume and complexity of pathophysiological and medical evidence involved in clinical decision making is far beyond the capacity of any physician to recall, even if they are able to review relevant information at the bedside using decision aids.23 Furthermore, evidence used to support clinical decision making is often empirically derived from randomized clinical trials, without a pathophysiological understanding of why a treatment works.23 This means that the current standard of care involves physicians trusting the results of trials in which they do not completely understand each component that influenced the final result. This limits their ability to ensure that patients are truly informed about healthcare decisions.23,43 Therefore, if ML models were to be incorporated into clinical practice at this standard, they may not necessarily need a comprehensive explanation of the logic underlying their output.23

Seventy-six percent of physicians in our survey reported that the ML output was trustworthy when no model-agnostic explanation was provided. This suggests that the "empirical" evidence reported by the ML risk calculator (that the patient had a low risk [<1%] of pulmonary embolism based on a cohort of 10 000 patients who were investigated for pulmonary embolism) was often adequate to make a clinical decision. However, this approach is unlikely to satisfy the patient's right to an explanation of the "meaningful logic" behind an ML model's output.8,44 There may be a mismatch between what physicians perceive as "sufficient" information to make clinical decisions and what is required by current legal frameworks.

Preferred explainability methods

No particular ML explainability method had a greater influence on intended physician behavior. This is in contrast to previous research showing that the method used for framing and displaying data could influence physician decision making.45 However, among the 88% of physicians who preferred a model-agnostic explanation, 62% preferred a local explanation over a global explanation. There may be several reasons for this. First, as our qualitative data suggested, having a simple explanation that is specific to the patient may be preferred. With global explanations, the physician must interpret where their patient lies on a continuum of risk for each clinical feature. This requires an additional cognitive step compared with local explanations, where this step is done for the physician. Second, local explanations may be more intuitive to physicians because they align more closely with physicians' current decision-making processes.
The widely accepted hypothetico-deductive model involves generating several hypotheses (ie, a differential diagnosis) for the patient's presenting complaint based on the physician's previous experience, and then testing each hypothesis against the patient's clinical features to find which fits best.46,47 Local explanations fit neatly into this model, whereby a physician can easily test a particular hypothesis for the patient in question. Indeed, phrases such as "confirmation" were used in participants' descriptions of why local explanations were preferred.

Limitations of our study

The key limitations of this study relate to the generalizability of the results and the design of the clinical scenario. The results may not generalize well to other settings for several reasons. The sample size was small, which was compounded by the response and completion rates of 18.9% and 12.9%, respectively. This could have resulted in a sampling bias that we were unable to quantify, as we did not have access to the demographic data of the nonresponders. Our analysis also aggregated data from a wide range of specialties and demographics, potentially missing important signals from subgroups. Finally, biases in healthcare processes, such as how clinical data are recorded and by whom, mean that the trends we observed for this specific ML risk calculator and clinical situation may not generalize well to other settings.48–51

The other main limitation is that the design of the hypothetical clinical scenario may have influenced participants' responses in several ways. The first factor is that the ML outputs were always presented in the same order: "no explanation" (control), "global explanation," and "local explanation." It is well described that the order of survey questions can lead to survey response effects.52 The impact that this had on our results is unknown, but it could perhaps have been mitigated if the order of the ML outputs had been randomized. The second factor is the performance of the ML risk calculator. In our scenario, we described an ML risk calculator that was superior to existing pulmonary embolism risk calculators to avoid participant skepticism of the underlying evidence of the model.1–3 However, several studies suggest that decision support tool performance significantly affects physician use, so the high levels of trust seen in our study may be at least partly attributable to the high performance of our ML risk calculator.37,53,54 The third factor is how a physician might respond to an uncertain or unexpected risk estimate or clinical recommendation. For simplicity, our scenario combined a precise risk estimate with an expected clinical recommendation. However, given that risk calculators are intended for use in situations where relevant information is conflicting or absent,55,56 it is unlikely that ML models will always provide such a high degree of precision, which, at times, could lead to unexpected clinical recommendations. Further research is required to understand the impact of different ML explainability methods in such a scenario. The fourth factor, which we did not explore in the survey, is how such an ML risk calculator might be incorporated into the clinical workflow.
Actual ML risk calculators will face barriers regarding how the relevant patient data would be extracted from the clinical record and how easily the risk calculator could be accessed by physicians.55,57 Evidence from the decision support literature suggests that such feasibility issues have a significant impact on physician behavior.38,53,55,57,58 These factors highlight the many complexities involved in the implementation of ML risk calculators into clinical practice. Although they are important, the purpose of our study was not to comprehensively quantify these factors, as this would have made our survey impractical. Rather, the emphasis of our study was exploring physicians' responses to contemporary model-agnostic explanations.

CONCLUSIONS

This study found that, in the healthcare setting, physician understanding, explainability, and trust in ML are related. Physicians preferred model-agnostic explanations rather than having no explanation of ML logic. However, many physicians were willing to trust a highly performant ML model in the absence of an explanation of its logic. This demonstrates a discrepancy between what physicians and the law may deem to be "sufficient understanding" of ML logic when applied in healthcare settings. Further research is needed to validate these findings.

FUNDING

The study was funded by a grant from Precision Driven Health, which is a public-private partnership between the Auckland Regional Health Boards, the University of Auckland, and Orion Health.

AUTHOR CONTRIBUTIONS

WKD and NB designed the study (along with QT and RR), oversaw data collection and analysis, contributed to data interpretation, and drafted the manuscript. NH performed data analysis, contributed to data interpretation, and revised the manuscript for important intellectual content. QT designed the study (along with WKD, NB, and RR), performed data analysis, contributed to data interpretation, and revised the manuscript for important intellectual content. GS contributed to data interpretation and revised the manuscript for important intellectual content. RR is the senior investigator of the study, designed the study (along with WKD, NB, and QT), contributed to data interpretation, and revised the manuscript for important intellectual content.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

ACKNOWLEDGMENTS

We would like to thank the physicians who participated in this study from Northland District Health Board, Waitematā District Health Board, and Te Tai Tokerau and Manaia Primary Health Organisations.

CONFLICT OF INTEREST STATEMENT

WKD and NB were contracted by Precision Driven Health to conduct the study. NH, QT, and RR were employed by Orion Health. Precision Driven Health had the right to review the publication before submission; however, the researchers had full access to the data at all times and no changes were made to the researchers' version of the manuscript.

REFERENCES

1. Wells PS, Anderson DR, Rodger M, et al. Excluding pulmonary embolism at the bedside without diagnostic imaging: management of patients with suspected pulmonary embolism presenting to the emergency department by using a simple clinical model and d-dimer. Ann Intern Med 2001;135(2):98–107.
2. Kline JA, Mitchell AM, Kabrhel C, et al. Clinical criteria to prevent unnecessary diagnostic testing in emergency department patients with suspected pulmonary embolism. J Thromb Haemost 2004;2(8):1247–55.
3. Cooper J. Improving the diagnosis of pulmonary embolism in the emergency department. BMJ Qual Improv Rep 2015;4(1):1–6.
4. Banerjee I, Sofela M, Yang J, et al. Development and performance of the pulmonary embolism result forecast model (PERFORM) for computed tomography clinical decision support. JAMA Netw Open 2019;2(8):e198719.
5. Sullivan HR, Schweikart SJ. Are current tort liability doctrines adequate for addressing injury caused by AI? AMA J Ethics 2019;21(2):E160–6.
6. Anderson M, Anderson SL. How should AI be developed, validated, and implemented in patient care? AMA J Ethics 2019;21(2):125–30.
7. Price WN. Medical malpractice and black-box medicine. In: Cohen GI, Lynch HF, Vayena E, Gasser U, eds. Big Data, Health Law, and Bioethics. Cambridge, United Kingdom: Cambridge University Press; 2018:295–306.
8. Selbst AD, Powles J. Meaningful information and the right to explanation. Int Data Priv Law 2017;7(4):233–42.
9. Thurier Q, Hua N, Boyle L, et al. Inspecting a machine learning based clinical risk calculator: a practical perspective. In: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS). IEEE; 2019:325–30; Cordoba, Spain.
10. Lundberg SM, Nair B, Vavilala MS, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2018;2(10):749–60.
11. Albers DJ, Levine ME, Stuart A, et al. Mechanistic machine learning: how data assimilation leverages physiologic knowledge using Bayesian inference to forecast the future, infer the present, and phenotype. J Am Med Inform Assoc 2018;25(10):1392–401.
12. Albers DJ, Levine M, Gluckman B, et al. Personalized glucose forecasting for type 2 diabetes using data assimilation. PLoS Comput Biol 2017;13(4):e1005232.
13. Zeevi D, Korem T, Zmora N, et al. Personalized nutrition by prediction of glycemic responses. Cell 2015;163(5):1079–94.
14. Ribeiro MT, Singh S, Guestrin C. Model-agnostic interpretability of machine learning. arXiv [stat.ML]. 2016. http://arxiv.org/abs/1606.05386.
15. Baehrens D, Schroeter T, Harmeling S, et al. How to explain individual classification decisions. J Mach Learn Res 2010;11:1803–31.
16. Krause J, Perer A, Ng K. Interacting with predictions: visual inspection of black-box machine learning models. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. New York, NY: ACM; 2016:5686–97.
17. Ribeiro MT, Singh S, Guestrin C. Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: ACM; 2016:1135–44.
18. Breiman L. Random forests. Mach Learn 2001;45(1):5–32.
19. Goldstein A, Kapelner A, Bleich J, et al. Peeking inside the black box: visualizing statistical learning with plots of individual conditional expectation. J Comput Graph Stat 2015;24(1):44–65.
20. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30. New York, NY: Curran Associates; 2017:4765–74.
21. Wang D, Yang Q, Abdul A, et al. Designing theory-driven user-centric explainable AI. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. New York, NY: ACM; 2019:1–15.
22. Plsek PE, Greenhalgh T. The challenge of complexity in health care. BMJ 2001;323(7313):625–8.
23. London AJ. Artificial intelligence and black-box medical decisions: accuracy versus explainability. Hastings Cent Rep 2019;49(1):15–21.
24. Verghese A, Shah NH, Harrington RA. What this computer needs is a physician: humanism and artificial intelligence. JAMA 2018;319(1):19–20.
25. Diprose W, Buist N. Artificial intelligence in medicine: humans need not apply? N Z Med J 2016;129(1434):73–6.
26. Nemati S, Holder A, Razmi F, et al. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med 2018;46(4):547–53.
27. Rucco M, Sousa-Rodrigues D, Merelli E, et al. Neural hypernetwork approach for pulmonary embolism diagnosis. BMC Res Notes 2015;8(1):617.
28. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542(7639):115–8.
29. Larson DB, Chen MC, Lungren MP, et al. Performance of a deep-learning neural network model in assessing skeletal maturity on pediatric hand radiographs. Radiology 2018;287(1):313–22.
30. Swaminathan S, Qirko K, Smith T, et al. A machine learning approach to triaging patients with chronic obstructive pulmonary disease. PLoS One 2017;12(11):e0188532.
31. MDCalc. Wells' Criteria for Pulmonary Embolism. 2019. https://www.mdcalc.com/wells-criteria-pulmonary-embolism. Accessed September 1, 2019.
32. MDCalc. PERC Rule for Pulmonary Embolism. 2019. https://www.mdcalc.com/perc-rule-pulmonary-embolism. Accessed September 1, 2019.
33. Carson JL, Kelley MA, Duff A, et al. The clinical course of pulmonary embolism. N Engl J Med 1992;326(19):1240–5.
34. Hall P, Gill N, Kurka M, Phan W. Machine Learning Interpretability with H2O Driverless AI. Mountain View, CA: H2O.ai; 2019.
35. SurveyGizmo. Online Survey Software & Tools. 2019. https://www.surveygizmo.com/. Accessed June 15, 2019.
36. Lovric M. International Encyclopedia of Statistical Science. Berlin, Germany: Springer; 2011.
37. Fan W, Liu J, Zhu S, et al. Investigating the impacting factors for the healthcare professionals to adopt artificial intelligence-based medical diagnosis support system (AIMDSS). Ann Oper Res 2018. doi: 10.1007/s10479-018-2818-y.
38. Lugtenberg M, Weenink J-W, van der Weijden T, et al. Implementation of multiple-domain covering computerized decision support systems in primary care: a focus group study on perceived barriers. BMC Med Inform Decis Mak 2015;15(1):82.
39. Voruganti T, O'Brien MA, Straus SE, et al. Primary care physicians' perspectives on computer-based health risk assessment tools for chronic diseases: a mixed methods study. J Innov Health Inform 2015;22(3):333–9.
40. Xie Y, Gao G, Chen XA. Outlining the design space of explainable intelligent systems for medical diagnosis. arXiv [cs.HC]. 2019. http://arxiv.org/abs/1902.06019.
41. Cabitza F, Rasoini R, Gensini GF. Unintended consequences of machine learning in medicine. JAMA 2017;318(6):517–8.
42. Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva-Atanasova K. Artificial intelligence, bias and clinical safety. BMJ Qual Saf 2019;28(3):231–7.
43. Diprose W, Verster F. The preventive-pill paradox: how shared decision making could increase cardiovascular morbidity and mortality. Circulation 2016;134(21):1599–600.
44. Voigt P, Von Dem Bussche A. The EU General Data Protection Regulation (GDPR): A Practical Guide. 1st ed. Cham, Switzerland: Springer International; 2017.
45. Ancker JS, Senathirajah Y, Kukafka R, et al. Design features of graphs in health risk communication: a systematic review. J Am Med Inform Assoc 2006;13(6):608–18.
46. Barrows HS, Tamblyn RM. Problem-Based Learning: An Approach to Medical Education. Berlin, Germany: Springer; 1980.
47. Elstein AS, Schwartz A. Clinical problem solving and diagnostic decision making: selective review of the cognitive literature. BMJ 2002;324(7339):729–32.
48. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc 2013;20(1):117–21.
49. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 2018;361:k1479.
50. Hripcsak G, Albers DJ. Correlating electronic health record concepts with healthcare process events. J Am Med Inform Assoc 2013;20(e2):e311–8.
51. Collins SA, Cato K, Albers D, et al. Relationship between nursing documentation and patients' mortality. Am J Crit Care 2013;22(4):306–13.
52. Tourangeau R, Rips LJ, Rasinski K. The Psychology of Survey Response. Cambridge, United Kingdom: Cambridge University Press; 2000.
53. Esmaeilzadeh P, Sambasivan M, Kumar N, et al. Adoption of clinical decision support systems in a developing country: antecedents and outcomes of physician's threat to perceived professional autonomy. Int J Med Inform 2015;84(8):548–60.
54. Chang I-C, Hwang H-G, Hung W-F, et al. Physicians' acceptance of pharmacokinetics-based clinical decision support systems. Expert Syst Appl 2007;33(2):296–303.
55. Press A, McCullagh L, et al. Usability testing of a complex clinical decision support tool in the emergency department: lessons learned. JMIR Human Factors 2015;2(2):e14.
56. West AF. Clinical decision-making: coping with uncertainty. Postgrad Med J 2002;78(920):319–21.
57. Arts DL, Medlock SK, van Weert H, et al. Acceptance and barriers pertaining to a general practice decision support system for multiple clinical conditions: a mixed methods evaluation. PLoS One 2018;13(4):e0193187.
58. O'Sullivan DM, Doyle JS, Michalowski WJ, et al. Assessing the motivation of MDs to use computer-based support at the point-of-care in the emergency department. AMIA Annu Symp Proc 2011;2011:1045–54.

Author notes: These authors contributed equally.

© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

Journal of the American Medical Informatics Association
Publisher: Oxford University Press
ISSN: 1067-5027; eISSN: 1527-974X
DOI: 10.1093/jamia/ocz229

Abstract

Abstract Objective Implementation of machine learning (ML) may be limited by patients’ right to “meaningful information about the logic involved” when ML influences healthcare decisions. Given the complexity of healthcare decisions, it is likely that ML outputs will need to be understood and trusted by physicians, and then explained to patients. We therefore investigated the association between physician understanding of ML outputs, their ability to explain these to patients, and their willingness to trust the ML outputs, using various ML explainability methods. Materials and Methods We designed a survey for physicians with a diagnostic dilemma that could be resolved by an ML risk calculator. Physicians were asked to rate their understanding, explainability, and trust in response to 3 different ML outputs. One ML output had no explanation of its logic (the control) and 2 ML outputs used different model-agnostic explainability methods. The relationships among understanding, explainability, and trust were assessed using Cochran-Mantel-Haenszel tests of association. Results The survey was sent to 1315 physicians, and 170 (13%) provided completed surveys. There were significant associations between physician understanding and explainability (P < .001), between physician understanding and trust (P < .001), and between explainability and trust (P < .001). ML outputs that used model-agnostic explainability methods were preferred by 88% of physicians when compared with the control condition; however, no particular ML explainability method had a greater influence on intended physician behavior. Conclusions Physician understanding, explainability, and trust in ML risk calculators are related. Physicians preferred ML outputs accompanied by model-agnostic explanations but the explainability method did not alter intended physician behavior. artificial intelligence, explainability, interpretability, decision support, medicine INTRODUCTION BACKGROUND AND SIGNIFICANCE In clinical practice, traditional risk calculators, such as the Wells’ Criteria and PERC Rule scores for pulmonary embolism, are commonly used in clinical practice to support decision making in cases of diagnostic uncertainty.1–3 Machine learning (ML) may improve the accuracy of risk assessment4; however, implementation of ML raises complex clinical, ethical, and legal questions because of the lack of understanding about how ML models generate outputs—commonly referred to as the black box problem.5–7 This has become a particular issue since the European Union Global Data Protection Regulation has been introduced, which requires that patients have a right to “meaningful information about the logic involved” when ML is used.8 Therefore, unless the black box problem can be remedied, the patient right to explanation cannot be fulfilled and the healthcare industry cannot fully benefit from these powerful technologies. 
Fortunately, ML explainability methods have been developed to address the black box problem.9–13 For example, model-agnostic explainability methods can derive post hoc explanations from black box models.14 Broadly, this involves training an interpretable model based on the predictions of a black box model and perturbing its inputs to see how the black box model reacts.15–17 Examples of these methods include permutation variable importance (VI), individual conditional expectation (ICE) plots, local interpretable model-agnostic explanations (LIME), and Shapley values (SVs).17–20 While these methods have been explored in a number of contexts, there is currently limited evidence as to whether they adequately explain ML-derived risk estimates in a clinical context.21 Importantly, clinical decisions represent a complex synthesis of basic sciences, clinical evidence, and patient preferences.22,23 If these highly individualized decisions were to be automated with ML, physicians would likely need to explain the decision-making process to patients, at least in the near term.22,24,25 Indeed, many published examples of ML applications in healthcare still require physician interpretation.26–30 Therefore, in order to satisfy patients’ rights to a meaningful explanation, ML risk calculators will first need to “explain” their output in such a way that physicians can understand. It is not yet known if currently available model-agnostic explanations can provide physicians with an adequate understanding of ML outputs, nor if physician understanding of ML outputs leads to the ability to explain them to patients. Finally, few studies have explored whether improving physician understanding of ML outputs influences physician behavior. OBJECTIVES In the context of a clinical ML risk calculator, we aimed to investigate the association among (1) physician understanding of an ML output (termed physician understanding); (2) physician ability to explain an ML output to patients (termed explainability), and (3) and physician intended behavior (termed trust). We also aimed to explore whether model-agnostic explainability methods influenced the intended behavior of physicians, and whether a particular model-agnostic explainability method was preferred by physicians. MATERIALS AND METHODS Hypothetical ML risk calculator for pulmonary embolism To assess the relationships among physician understanding, explainability, and trust in ML risk calculators, we designed a hypothetical ML risk calculator based on the Wells’ Criteria and PERC Rule to use in a survey-based clinical scenario for physicians.1,2 Based on the risk estimated by the ML risk calculator, a clinical recommendation of either (1) “reassurance and discharge recommended” or (2) “computed tomography pulmonary angiogram recommended” was made if the patient was low risk of pulmonary embolism or not low risk of pulmonary embolism, respectively. We combined a risk estimate with a clinical recommendation as we believed that it would allow us to assess intended physician behavior and, therefore, trust in the ML output. 
In addition, the combination of a risk estimate and a clinical recommendation is commonly used with non-ML risk calculators, including online versions of the Wells’ Criteria and PERC Rule.31,32 Survey instrument Based on the described hypothetical ML risk calculator, we designed an online survey (Supplementary Appendix) that provided participants with the following hypothetical clinical scenario: You are a GP [general practitioner] who has reviewed a 50-year-old woman presenting with shortness of breath. After a history, examination, laboratory tests, ECG [electrocardiogram], and a chest x-ray, you are comfortable you have excluded the most concerning diagnoses. However, you are still considering pulmonary embolism. The practice has installed a piece of software that uses artificial intelligence to assist with ruling out pulmonary embolism. It can stratify patients as either; (1) low risk of pulmonary embolism: reassurance and discharge recommended; or (2) not low risk of pulmonary embolism: computed tomography pulmonary angiogram recommended. The software automatically analyses the electronic record, including your documented history, examination, and laboratory tests, and provides its recommendation. The rationale for this clinical scenario was that pulmonary embolism is a disease encountered in community, medical, and surgical settings and is potentially life-threatening, so decision making around diagnosis and management is critical.33 We provided limited clinical information in the scenario to prevent participants from agreeing or disagreeing with the ML output for reasons other than the explainability method we used. Each participant was shown 3 consecutive ML outputs, which included an ML risk estimate combined with a clinical recommendation. The 3 consecutive ML outputs were identical except for the method used to visualize and explain the “logic” of the risk estimate. After each ML output, the participant was asked the following questions with select-choice answers: To what degree does the software’s decision make sense to you? [Not at all/Very little/Somewhat/To a great extent] Would you be able to explain the software’s decision to the patient? [Yes/No] Would you follow the software’s recommendation? [Yes/No] The first ML output for all participants was as follows: Your patient has a low risk (<1% chance) of pulmonary embolism. They should be reassured and followed up in the community as you deem appropriate. This recommendation is based on a cohort of 10 000 patients who were investigated for pulmonary embolism, of whom 1000 had a similar risk profile. The software has been externally validated. This ML output was regarded as the control condition because only the estimated risk of pulmonary embolism (ie, <1% chance) and the population that the ML model was derived from were shown. No explanation of the logic behind the ML model’s risk calculation was shown. In our scenario, the accuracy of the ML model and size of the derivation cohort were exaggerated in order to ensure that physicians’ potential lack of trust was not simply due to poor model performance. For the second and third ML outputs, the control output was shown to the participant along with a model-agnostic explanation. This model-agnostic explanation included (1) a graphical visualization and (2) a brief textual explanation of what the visualization was demonstrating. The model-agnostic explanations were hypothetical. They were developed by W.K.D. 
They were developed by W.K.D. and N.B., who are practicing physicians, and Q.T., who is a data scientist, and were tested on a small number of practicing physicians to ensure that they were clinically valid. For the second ML output, participants were randomized to 1 of 2 global model-agnostic explanations: VI (Figure 1A) or an ICE plot (Figure 1B). For the third ML output, participants were randomized to 1 of 2 local model-agnostic explanations: LIME (Figure 1C) or SVs (Figure 1D). Broadly, global explanations demonstrate feature importance across a population, while local explanations explain individual predictions.34 Finally, participants were asked to select which ML output they preferred and why. Demographics and medical subspecialty were also assessed.

Figure 1. Graphical visualizations shown to participants, with the following complementary explanations. (A) For variable importance (VI), the visualization shows the relative importance of each clinical factor used by the model. This is general information about the software’s logic. (B) For individual conditional expectation (ICE) plots, the visualizations show the average and standard deviation of the software’s predictions for different values of the most influential clinical factors used by the model. This is general information about the software’s logic. (C) For local interpretable model-agnostic explanations (LIME), the visualization shows the positive or negative relative impact of the most influential clinical factors used by the software to estimate your patient’s risk of pulmonary embolism (PE). This profile is specific to your patient. (D) For Shapley values (SVs), the visualization shows the positive or negative contribution of each clinical factor to the risk estimated by the software. The sum of the bars is equal to your patient’s PE risk. This profile is specific to your patient.
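The 4 visualization types in Figure 1 correspond to standard model-agnostic techniques available in open-source libraries. As a rough illustration only (this is not the hypothetical calculator shown to participants; the toy model, features, and library choices below are assumptions), comparable VI, ICE, LIME, and Shapley value outputs can be generated for a generic scikit-learn classifier as follows.

```python
# Illustrative only: a toy classifier standing in for a hypothetical PE risk model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # assumed features, eg, heart rate, O2 saturation, d-dimer, age
y = (X[:, 2] + 0.5 * X[:, 0] + rng.normal(size=1000) > 1).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# (A) Permutation variable importance: a global ranking of features.
vi = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("VI:", vi.importances_mean)

# (B) ICE curves: how each individual prediction changes as one feature is varied.
PartialDependenceDisplay.from_estimator(model, X, features=[2], kind="individual")

# (C) LIME: a local surrogate explanation for a single patient.
from lime.lime_tabular import LimeTabularExplainer  # pip install lime
lime_explainer = LimeTabularExplainer(X, mode="classification")
print(lime_explainer.explain_instance(X[0], model.predict_proba).as_list())

# (D) Shapley values: additive per-feature contributions for the same patient.
import shap                                         # pip install shap
print("SV:", shap.TreeExplainer(model).shap_values(X[:1]))
```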
Participants
The relevant institutional review boards approved this study. We surveyed all physicians employed in Northland District Health Board (approximately 104 Resident Medical Officers and 150 Senior Medical Officers), Waitematā District Health Board (approximately 415 Resident Medical Officers and 502 Senior Medical Officers), and Te Tai Tokerau and Manaia Primary Health Organisations (approximately 151 general practitioners combined) in New Zealand between March and May of 2019. These physicians serve a population of approximately 800 000 people. The survey was emailed to all physicians via a SurveyGizmo link.35

Descriptive statistics and analysis of comments
Statistical analysis was performed using data analysis (Pandas and StatsModels) and visualization (Seaborn and WordCloud) packages in the Python ecosystem, in addition to the stats and samplesizeCMH packages in the R ecosystem. The associations between physician understanding, explainability, and trust were assessed using Cochran-Mantel-Haenszel tests. Because repeated measures were collected from each participant (ie, 1 for each of the 3 ML output types), the response data were stratified according to the ML output type (ie, control condition, global explanation, and local explanation). Accordingly, 2 × 4 × 3 contingency tables were generated for physician understanding vs explainability and for physician understanding vs trust, and 2 × 2 × 3 contingency tables were generated for explainability vs trust. For 2 × J × K (J > 2) tables, the generalized Cochran-Mantel-Haenszel test was applied.36 To assess the reliability of each test, statistical power was calculated at a conventional significance level of .05. Binomial exact tests were used to assess whether physicians preferred specific explainability methods. To assess the influence of explainability method on intended physician behavior (trust), McNemar’s test was applied to paired responses from the same participants under 2 different scenarios (eg, control vs global explanation). If the number of changed responses was <25, a binomial distribution was used to obtain an exact P value; otherwise, a chi-square distribution was used to obtain an approximate P value. A word cloud was used to explore the reasons why participants preferred certain ML outputs. A natural language processing package was used to process the responses: the plain text was tokenized, decapitalized, and stemmed; after removing all punctuation, digits, stop words, and words with <3 letters, the frequency of each token and bigram was counted and used to set the font size in the word cloud proportionally. Only tokens and bigrams that appeared at least 5 times in the collected responses were kept.
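As a minimal sketch of this analysis pipeline (the counts below are illustrative, not the study data), the named Python packages can run the 2 × 2 × 3 Cochran-Mantel-Haenszel test, McNemar’s test, the binomial exact test, and the word-cloud preprocessing roughly as follows; the generalized 2 × J × K test is not included in StatsModels and is omitted here, and the text preprocessing assumes the NLTK stemmer and stop-word list.

```python
import re
from collections import Counter

import numpy as np
from scipy.stats import binomtest
from statsmodels.stats.contingency_tables import StratifiedTable, mcnemar

# Cochran-Mantel-Haenszel test for the 2 x 2 x 3 case (explainability vs trust).
# One 2 x 2 table per ML output type; rows = "can explain" yes/no,
# columns = "would follow" yes/no. Counts are made up for illustration.
strata = [
    np.array([[110, 20], [15, 25]]),   # control condition
    np.array([[120, 15], [12, 23]]),   # global explanation
    np.array([[125, 12], [10, 23]]),   # local explanation
]
cmh = StratifiedTable(strata).test_null_odds(correction=True)
print("CMH chi-square:", cmh.statistic, "P:", cmh.pvalue)

# McNemar's test on paired trust responses (eg, control vs global explanation);
# an exact binomial P value is used when there are <25 discordant pairs.
paired = np.array([[118, 9], [11, 32]])
exact = (paired[0, 1] + paired[1, 0]) < 25
print("McNemar P:", mcnemar(paired, exact=exact).pvalue)

# Binomial exact test: did more than two-thirds prefer an explained output?
# (eg, ~137 of 156 respondents, reconstructed from the reported 87.8%).
print("Binomial P:", binomtest(137, n=156, p=2 / 3, alternative="greater").pvalue)

# Word-cloud preprocessing: tokenize, decapitalize, stem, drop punctuation,
# digits, stop words, and words with <3 letters, then count tokens and bigrams.
from nltk.corpus import stopwords      # nltk.download("stopwords") once
from nltk.stem import PorterStemmer

def term_frequencies(responses, min_count=5):
    stemmer, stops = PorterStemmer(), set(stopwords.words("english"))
    counts = Counter()
    for text in responses:
        tokens = [stemmer.stem(t) for t in re.findall(r"[a-z]{3,}", text.lower())
                  if t not in stops]
        counts.update(tokens)
        counts.update(" ".join(b) for b in zip(tokens, tokens[1:]))
    return {term: n for term, n in counts.items() if n >= min_count}

# from wordcloud import WordCloud
# WordCloud().generate_from_frequencies(term_frequencies(free_text_answers))
```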
RESULTS
Participants
The survey was emailed to 1315 participants (Figure 2). There were 249 (18.9%) responses, of which 79 (31.7%) were incomplete and 170 (68.3%) were complete. A total of 100 participants (58.8%) were male. The largest age group was 25-34 years (32.4%), followed by 45-54 years (27.6%). The most common medical subspecialties were “Medicine” (20%), followed by “Other” (17.6%) and “Anaesthetics” and “Psychiatry” (both 11.8%) (Table 1). All 170 physicians who completed the survey were shown 3 consecutive ML outputs, of which the first was always the control output (no model-agnostic explanation), the second was always a global model-agnostic explanation (VI or ICE), and the third was always a local model-agnostic explanation (LIME or SVs). A total of 37 (21.8%) were randomized to VI and LIME, 33 (19.4%) to VI and SVs, 54 (31.8%) to ICE and LIME, and 46 (27.0%) to ICE and SVs. In total, 510 responses (3 ML outputs per physician who completed the survey) were used to assess the relationships among physician understanding, explainability, and trust.

Figure 2. Flow chart of survey response and randomization. ICE: individual conditional expectation plots; LIME: local interpretable model-agnostic explanations; SV: Shapley values; VI: variable importance.

Table 1. Demographic characteristics of the participants, n (%)
Age: 18-24 y, 3 (1.8); 25-34 y, 55 (32.4); 35-44 y, 34 (20.0); 45-54 y, 47 (27.6); 55-64 y, 22 (12.9); 65-74 y, 5 (2.9); 75 y or older, 3 (1.8); missing, 1 (0.6).
Sex/gender: male, 100 (58.8); female, 64 (37.6); nonbinary, 1 (0.6); missing, 5 (2.9).
Subspecialty: medicine, 34 (20.0); other, 30 (17.6); anesthetics, 20 (11.8); psychiatry, 20 (11.8); surgery, 16 (9.4); emergency medicine, 15 (8.8); pediatrics, 11 (6.5); radiology, 9 (5.3); general practice, 6 (3.5); obstetrics and gynecology, 5 (2.9); pathology, 3 (1.8); ophthalmology, 1 (0.6).

Physician understanding and explainability
Physicians who reported higher levels of understanding in response to the question “To what degree does the software’s decision make sense to you?” were more likely to answer “Yes” to the question “Would you be able to explain the software’s decision to the patient?” (Figure 3). The relationship between physician understanding and explainability was statistically significant, controlling for all ML output types (Cochran-Mantel-Haenszel χ²₃ = 156.3, P < .001, power = 0.99).

Figure 3. Relationship between physician understanding and explainability. (A) Control condition. (B) Global explanation. (C) Local explanation.
Physician understanding and trust
Physicians who reported higher levels of understanding in response to the question “To what degree does the software’s decision make sense to you?” were more likely to answer “Yes” to the question “Would you follow the software’s recommendation?” (Figure 4). The relationship between physician understanding and trust was statistically significant, controlling for all ML output types (Cochran-Mantel-Haenszel χ²₃ = 128.7, P < .001, power = 0.99).

Figure 4. Relationship between physician understanding and trust. (A) Control condition. (B) Global explanation. (C) Local explanation.

Explainability and trust
Physicians who responded “Yes” to the question “Would you be able to explain the software’s decision to the patient?” were more likely to respond “Yes” to the question “Would you follow the software’s recommendation?” (Figure 5). The relationship between explainability and trust was statistically significant, controlling for all ML output types (Cochran-Mantel-Haenszel χ²₁ = 61.2, P < .001, power = 0.99). No particular ML explainability method had a greater influence on intended physician behavior (Table 2).

Figure 5. Relationship between explainability and trust. (A) Control condition. (B) Global explanation. (C) Local explanation.

Table 2. Influence of explainability methods on intended physician behavior (trust): compared scenarios (observations), P value
Control vs global explanation (170), P = .851
Control vs local explanation (170), P = .856
Control vs VI (70), P = .774
Control vs ICE (100), P = 1.0
Control vs LIME (91), P = .454
Control vs SVs (79), P = .180
VI vs LIME (37), P = .109
VI vs SVs (33), P = .625
ICE vs LIME (54), P = .754
ICE vs SVs (46), P = .070
ICE: individual conditional expectation; LIME: local interpretable model-agnostic explanations; SV: Shapley value; VI: variable importance.

Preferred explainability method
A total of 156 (91.8%) physicians rated their preferred explainability method. Of these, 87.8% preferred an ML output that contained a model-agnostic explanation, compared with 12.2% who preferred the control output (no model-agnostic explanation).
Binomial exact testing showed that the preference for a model-agnostic explainability method was significantly higher than two-thirds, the proportion expected by chance given that 2 of the 3 outputs contained such an explanation (P < .001). Local explanations (LIME [32.1%] and SVs [29.9%]) were preferred over global explanations (VI [18.2%] and ICE [19.7%]), and binomial exact testing showed that the preference for local explanations (62.0%) was significantly higher than 0.5 among those who preferred a model-agnostic explainability method (P = .003). Even when no model-agnostic explanation was provided (ie, before being shown a model-agnostic explanation), physicians commonly reported that ML outputs were understandable (93% rating “somewhat” or “to a great extent”), explainable (85%), and trustworthy (76%). Figure 6 shows a word cloud generated from the responses to the question “Why do you prefer this software recommendation?” We noticed trends toward (1) the simplicity of the visualization (easy, simple, easy to understand), (2) the specificity of the explanation (specific, specific patient), and (3) the confirmation of clinical knowledge (explain, understand, information).

Figure 6. Word cloud generated from the responses to the question, “Why do you prefer this software recommendation?”

DISCUSSION
In the context of an ML risk calculator, this study found that physicians’ perceived understanding of ML outputs (physician understanding), their ability to explain ML outputs to a patient (explainability), and their intended behavior (trust) are significantly related. While physicians did prefer to have an explanation of the ML model’s logic (88% of physicians), no particular ML explainability method had a greater influence on intended physician behavior. These findings suggest that ML explainability is important to physicians, but that other factors, such as ML accuracy, might matter more. This hypothesis needs to be tested in future studies.

Physician understanding and intended physician behavior (trust)
Several other studies have explored the link between physician understanding and trust in ML. A survey of 191 Chinese physicians found that participants’ ability to understand ML decision support tools correlated with their perceptions of ML usefulness but not with whether they would actually use ML tools.37 In studies of non-ML decision support systems, increased physician skepticism about the quality of the underlying evidence led to reduced motivation to use them.38,39 Finally, a qualitative study and several opinion articles from physicians outlined their desire to understand an ML model’s logic before they would follow its recommendations, especially if those recommendations differed from their own opinion.24,40,41 Our study also demonstrated this trend, with no physician reporting that they would follow the ML output if they rated their understanding as “not at all.” While this evidence base is small and heterogeneous, it suggests an association between physician understanding of ML outputs and intended physician behavior.

Explainability
Despite the legal requirements outlined in the General Data Protection Regulation and the opinions described previously, many experts debate the necessity of ML explainability.
Those in favor of explainability reason that even highly performant ML models are prone to error, such as confounding in their derivation datasets or incorrect application of datasets to new populations.24,41,42 Therefore, providing more information about the logic underlying an ML output for the physician to verify may reduce the incidence of these errors and, incidentally, may also address concerns about physician deskilling from automation complacency.40,42 However, others point out that the enormous volume and complexity of pathophysiological and medical evidence involved in clinical decision making is far beyond the capacity of any physician to recall, even if they are able to review relevant information at the bedside using decision aids.23 Furthermore, evidence used to support clinical decision making is often derived empirically from randomized clinical trials, without a pathophysiological understanding of why a treatment works.23 This means that the current standard of care involves physicians trusting the results of trials in which they do not completely understand each component that influenced the final result, which limits their ability to ensure that patients are truly informed about healthcare decisions.23,43 Therefore, if ML models were to be incorporated into clinical practice at this standard, they may not necessarily need a comprehensive explanation of the logic underlying their output.23

Seventy-six percent of physicians in our survey reported that the ML output was trustworthy when no model-agnostic explanation was provided. This suggests that the “empirical” evidence reported by the ML risk calculator (that the patient had a low risk [<1%] of pulmonary embolism, based on a cohort of 10 000 patients who were investigated for pulmonary embolism) was often adequate to make a clinical decision. However, this approach is unlikely to satisfy the patient’s right to an explanation of the “meaningful logic” behind an ML model’s output.8,44 There may be a mismatch between what physicians perceive as “sufficient” information for clinical decisions and what is required by current legal frameworks.

Preferred explainability methods
No particular ML explainability method had a greater influence on intended physician behavior. This contrasts with previous research showing that the method used for framing and displaying data can influence physician decision making.45 However, among the 88% of physicians who preferred a model-agnostic explanation, 62% preferred a local explanation over a global explanation. There may be several reasons for this. First, as our qualitative data suggested, a simple explanation that is specific to the patient may be preferred. With global explanations, the physician must interpret where their patient lies on a continuum of risk for each clinical feature; this requires an additional cognitive step compared with local explanations, where this step is done for the physician. Second, local explanations may be more intuitive to physicians because they align more closely with physicians’ current decision-making processes.
The widely accepted hypothetico-deductive model involves generating several hypotheses (ie, a differential diagnosis) for the patient’s presenting complaint based on the physician’s previous experience, and then testing each hypothesis against the patient’s clinical features to find which fits best.46,47 Local explanations fit neatly into this model, whereby a physician can easily test a particular hypothesis for the patient in question. Indeed, phrases such as “confirmation” were used in participants’ descriptions of why local explanations were preferred.

Limitations of our study
The key limitations of this study relate to the generalizability of the results and the design of the clinical scenario. The results may not generalize well to other settings for several reasons. The sample size was small, which was compounded by the response and completion rates of 18.9% and 12.9%, respectively. This could have resulted in a sampling bias that we were unable to quantify, as we did not have access to the demographic data of the nonresponders. Our analysis also aggregated data from a wide range of specialties and demographics, potentially missing important signals from subgroups. Finally, biases in healthcare processes, such as how clinical data are recorded and by whom, mean that the trends we observed for this specific ML risk calculator and clinical situation may not generalize well to other settings.48–51

The other main limitation is that the design of the hypothetical clinical scenario may have influenced participants’ responses in several ways. The first factor is that the ML outputs were always presented in the same order: “no explanation” (control), “global explanation,” and “local explanation.” It is well described that the order of survey questions can lead to survey response effects.52 The impact this had on our results is unknown, but it could perhaps have been mitigated if the order of the ML outputs had been randomized. The second factor is the performance of the ML risk calculator. In our scenario, we described an ML risk calculator that was superior to existing PE risk calculators to avoid participant skepticism about the underlying evidence of the model.1–3 However, several studies suggest that decision support tool performance significantly affects physician use, so the high levels of trust seen in our study may be at least partly attributable to the high performance of our ML risk calculator.37,53,54 The third factor is how a physician might respond to an uncertain or unexpected risk estimate or clinical recommendation. For simplicity, our scenario combined a precise risk estimate with an expected clinical recommendation. However, given that risk calculators are intended for use in situations where relevant information is conflicting or absent,55,56 it is unlikely that ML models will always provide such a high degree of precision, which, at times, could lead to unexpected clinical recommendations. Further research is required to understand the impact of different ML explainability methods in such a scenario. The fourth factor, which we did not explore in the survey, is how such an ML risk calculator might be incorporated into the clinical workflow.
Actual ML risk calculators will face barriers regarding how the relevant patient data would be extracted from the clinical record and how easily the risk calculator could be accessed by physicians.55,57 Evidence from the decision support literature suggests that such feasibility issues have a significant impact on physician behavior.38,53,55,57,58 These factors highlight the many complexities involved in implementing ML risk calculators in clinical practice. Although important, the purpose of our study was not to comprehensively quantify these factors, as this would have made our survey impractical. Rather, the emphasis of our study was exploring physicians’ responses to contemporary model-agnostic explanations.

CONCLUSIONS
This study found that, in the healthcare setting, physician understanding, explainability, and trust in ML are related. Physicians preferred model-agnostic explanations rather than having no explanation of ML logic. However, many physicians were willing to trust a highly performant ML model in the absence of an explanation of its logic. This demonstrates a discrepancy between what physicians and the law may deem to be “sufficient understanding” of ML logic when applied in healthcare settings. Further research is needed to validate these findings.

FUNDING
The study was funded by a grant from Precision Driven Health, which is a public-private partnership between the Auckland Regional Health Boards, the University of Auckland, and Orion Health.

AUTHOR CONTRIBUTIONS
WKD and NB designed the study (along with QT and RR), oversaw data collection and analysis, contributed to data interpretation, and drafted the manuscript. NH performed data analysis, contributed to data interpretation, and revised the manuscript for important intellectual content. QT designed the study (along with WKD, NB, and RR), performed data analysis, contributed to data interpretation, and revised the manuscript for important intellectual content. GS contributed to data interpretation and revised the manuscript for important intellectual content. RR is the senior investigator of the study, designed the study (along with WKD, NB, and QT), contributed to data interpretation, and revised the manuscript for important intellectual content.

SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.

ACKNOWLEDGMENTS
We would like to thank the physicians who participated in this study from Northland District Health Board, Waitematā District Health Board, and Te Tai Tokerau and Manaia Primary Health Organisations.

CONFLICT OF INTEREST STATEMENT
WKD and NB were contracted by Precision Driven Health to conduct the study. NH, QT, and RR were employed by Orion Health. Precision Driven Health had the right to review the publication before submission; however, the researchers had full access to the data at all times, and no changes were made to the researchers’ version of the manuscript.

REFERENCES
1. Wells PS, Anderson DR, Rodger M, et al. Excluding pulmonary embolism at the bedside without diagnostic imaging: management of patients with suspected pulmonary embolism presenting to the emergency department by using a simple clinical model and d-dimer. Ann Intern Med 2001;135(2):98-107.
2. Kline JA, Mitchell AM, Kabrhel C, et al. Clinical criteria to prevent unnecessary diagnostic testing in emergency department patients with suspected pulmonary embolism. J Thromb Haemost 2004;2(8):1247-55.
3. Cooper J. Improving the diagnosis of pulmonary embolism in the emergency department. BMJ Qual Improv Rep 2015;4(1):1-6.
4. Banerjee I, Sofela M, Yang J, et al. Development and performance of the pulmonary embolism result forecast model (PERFORM) for computed tomography clinical decision support. JAMA Netw Open 2019;2(8):e198719.
5. Sullivan HR, Schweikart SJ. Are current tort liability doctrines adequate for addressing injury caused by AI? AMA J Ethics 2019;21(2):E160-6.
6. Anderson M, Anderson SL. How should AI be developed, validated, and implemented in patient care? AMA J Ethics 2019;21(2):125-30.
7. Price WN. Medical malpractice and black-box medicine. In: Cohen GI, Lynch HF, Vayena E, Gasser U, eds. Big Data, Health Law, and Bioethics. Cambridge, United Kingdom: Cambridge University Press; 2018: 295-306.
8. Selbst AD, Powles J. Meaningful information and the right to explanation. Int Data Priv Law 2017;7(4):233-42.
9. Thurier Q, Hua N, Boyle L, et al. Inspecting a machine learning based clinical risk calculator: a practical perspective. In: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS); Cordoba, Spain. IEEE; 2019: 325-30.
10. Lundberg SM, Nair B, Vavilala MS, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2018;2(10):749-60.
11. Albers DJ, Levine ME, Stuart A, et al. Mechanistic machine learning: how data assimilation leverages physiologic knowledge using Bayesian inference to forecast the future, infer the present, and phenotype. J Am Med Inform Assoc 2018;25(10):1392-401.
12. Albers DJ, Levine M, Gluckman B, et al. Personalized glucose forecasting for type 2 diabetes using data assimilation. PLoS Comput Biol 2017;13(4):e1005232.
13. Zeevi D, Korem T, Zmora N, et al. Personalized nutrition by prediction of glycemic responses. Cell 2015;163(5):1079-94.
14. Ribeiro MT, Singh S, Guestrin C. Model-agnostic interpretability of machine learning. arXiv [stat.ML]. 2016. http://arxiv.org/abs/1606.05386.
15. Baehrens D, Schroeter T, Harmeling S, et al. How to explain individual classification decisions. J Mach Learn Res 2010;11:1803-31.
16. Krause J, Perer A, Ng K. Interacting with predictions: visual inspection of black-box machine learning models. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. New York, NY: ACM; 2016: 5686-97.
17. Ribeiro MT, Singh S, Guestrin C. Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: ACM; 2016: 1135-44.
18. Breiman L. Random forests. Mach Learn 2001;45(1):5-32.
19. Goldstein A, Kapelner A, Bleich J, et al. Peeking inside the black box: visualizing statistical learning with plots of individual conditional expectation. J Comput Graph Stat 2015;24(1):44-65.
20. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30. New York, NY: Curran Associates; 2017: 4765-74.
21. Wang D, Yang Q, Abdul A, et al. Designing theory-driven user-centric explainable AI. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. New York, NY: ACM; 2019: 1-15.
22. Plsek PE, Greenhalgh T. The challenge of complexity in health care. BMJ 2001;323(7313):625-8.
23. London AJ. Artificial intelligence and black-box medical decisions: accuracy versus explainability. Hastings Cent Rep 2019;49(1):15-21.
24. Verghese A, Shah NH, Harrington RA. What this computer needs is a physician: humanism and artificial intelligence. JAMA 2018;319(1):19-20.
25. Diprose W, Buist N. Artificial intelligence in medicine: humans need not apply? N Z Med J 2016;129(1434):73-6.
26. Nemati S, Holder A, Razmi F, et al. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med 2018;46(4):547-53.
27. Rucco M, Sousa-Rodrigues D, Merelli E, et al. Neural hypernetwork approach for pulmonary embolism diagnosis. BMC Res Notes 2015;8(1):617.
28. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542(7639):115-8.
29. Larson DB, Chen MC, Lungren MP, et al. Performance of a deep-learning neural network model in assessing skeletal maturity on pediatric hand radiographs. Radiology 2018;287(1):313-22.
30. Swaminathan S, Qirko K, Smith T, et al. A machine learning approach to triaging patients with chronic obstructive pulmonary disease. PLoS One 2017;12(11):e0188532.
31. MDCalc. Wells’ Criteria for Pulmonary Embolism. 2019. https://www.mdcalc.com/wells-criteria-pulmonary-embolism. Accessed September 1, 2019.
32. MDCalc. PERC Rule for Pulmonary Embolism. 2019. https://www.mdcalc.com/perc-rule-pulmonary-embolism. Accessed September 1, 2019.
33. Carson JL, Kelley MA, Duff A, et al. The clinical course of pulmonary embolism. N Engl J Med 1992;326(19):1240-5.
34. Hall P, Gill N, Kurka M, Phan W. Machine Learning Interpretability with H2O Driverless AI. Mountain View, CA: H2O.ai; 2019.
35. SurveyGizmo. Online Survey Software & Tools. 2019. https://www.surveygizmo.com/. Accessed June 15, 2019.
36. Lovric M. International Encyclopedia of Statistical Science. Berlin, Germany: Springer; 2011.
37. Fan W, Liu J, Zhu S, et al. Investigating the impacting factors for the healthcare professionals to adopt artificial intelligence-based medical diagnosis support system (AIMDSS). Ann Oper Res 2018. doi: 10.1007/s10479-018-2818-y.
38. Lugtenberg M, Weenink J-W, van der Weijden T, et al. Implementation of multiple-domain covering computerized decision support systems in primary care: a focus group study on perceived barriers. BMC Med Inform Decis Mak 2015;15(1):82.
39. Voruganti T, O’Brien MA, Straus SE, et al. Primary care physicians’ perspectives on computer-based health risk assessment tools for chronic diseases: a mixed methods study. J Innov Health Inform 2015;22(3):333-9.
40. Xie Y, Gao G, Chen XA. Outlining the design space of explainable intelligent systems for medical diagnosis. arXiv [cs.HC]. 2019. http://arxiv.org/abs/1902.06019.
41. Cabitza F, Rasoini R, Gensini GF. Unintended consequences of machine learning in medicine. JAMA 2017;318(6):517-8.
42. Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva-Atanasova K. Artificial intelligence, bias and clinical safety. BMJ Qual Saf 2019;28(3):231-7.
43. Diprose W, Verster F. The preventive-pill paradox: how shared decision making could increase cardiovascular morbidity and mortality. Circulation 2016;134(21):1599-600.
44. Voigt P, Von Dem Bussche A. The EU General Data Protection Regulation (GDPR): A Practical Guide. 1st ed. Cham, Switzerland: Springer International; 2017.
45. Ancker JS, Senathirajah Y, Kukafka R, et al. Design features of graphs in health risk communication: a systematic review. J Am Med Inform Assoc 2006;13(6):608-18.
46. Barrows HS, Tamblyn RM. Problem-Based Learning: An Approach to Medical Education. Berlin, Germany: Springer; 1980.
47. Elstein AS, Schwartz A. Clinical problem solving and diagnostic decision making: selective review of the cognitive literature. BMJ 2002;324(7339):729-32.
48. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc 2013;20(1):117-21.
49. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 2018;361:k1479.
50. Hripcsak G, Albers DJ. Correlating electronic health record concepts with healthcare process events. J Am Med Inform Assoc 2013;20(e2):e311-8.
51. Collins SA, Cato K, Albers D, et al. Relationship between nursing documentation and patients’ mortality. Am J Crit Care 2013;22(4):306-13.
52. Tourangeau R, Rips LJ, Rasinski K. The Psychology of Survey Response. Cambridge, United Kingdom: Cambridge University Press; 2000.
53. Esmaeilzadeh P, Sambasivan M, Kumar N, et al. Adoption of clinical decision support systems in a developing country: antecedents and outcomes of physician’s threat to perceived professional autonomy. Int J Med Inform 2015;84(8):548-60.
54. Chang I-C, Hwang H-G, Hung W-F, et al. Physicians’ acceptance of pharmacokinetics-based clinical decision support systems. Expert Syst Appl 2007;33(2):296-303.
55. Press A, McCullagh L, et al. Usability testing of a complex clinical decision support tool in the emergency department: lessons learned. JMIR Human Factors 2015;2(2):e14.
56. West AF. Clinical decision-making: coping with uncertainty. Postgrad Med J 2002;78(920):319-21.
57. Arts DL, Medlock SK, van Weert H, et al. Acceptance and barriers pertaining to a general practice decision support system for multiple clinical conditions: a mixed methods evaluation. PLoS One 2018;13(4):e0193187.
58. O'Sullivan DM, Doyle JS, Michalowski WJ, et al. Assessing the motivation of MDs to use computer-based support at the point-of-care in the emergency department. AMIA Annu Symp Proc 2011;2011:1045-54.

Author notes: These authors contributed equally.
© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).
