A customizable deep learning model for nosocomial risk prediction from critical care notes with indirect supervision

Goodwin, Travis R; Demner-Fushman, Dina

doi:10.1093/jamia/ocaa004

2020-04-01

Abstract

Objective

Reliable longitudinal risk prediction for hospitalized patients is needed to provide quality care. Our goal is to develop a generalizable model capable of leveraging clinical notes to predict healthcare-associated diseases 24–96 hours in advance.

Methods

We developed a reCurrent Additive Network for Temporal RIsk Prediction (CANTRIP) to predict the risk of hospital acquired (occurring ≥ 48 hours after admission) acute kidney injury, pressure injury, or anemia ≥ 24 hours before it is implicated by the patient’s chart, labs, or notes. We rely on the MIMIC III critical care database and extract distinct positive and negative cohorts for each disease. We retrospectively determine the date-of-event using structured and unstructured criteria and use it as a form of indirect supervision to train and evaluate CANTRIP to predict disease risk using clinical notes.

Results

Our experiments indicate that CANTRIP, operating on text alone, obtains 74%–87% area under the curve and 77%–85% Specificity. Baseline shallow models showed lower performance on all metrics, while bidirectional long short-term memory obtained the highest Sensitivity at the cost of significantly lower Specificity and Precision.

Discussion

Proper model architecture allows clinical text to be successfully harnessed to predict nosocomial disease, outperforming shallow models and obtaining similar performance to disease-specific models reported in the literature.

Conclusion

Clinical text on its own can provide a competitive alternative to traditional structured features (eg, lab values, vital signs). CANTRIP is able to generalize across nosocomial diseases without disease-specific feature extraction and is available at https://github.com/h4ste/cantrip.
Keywords: deep learning, machine learning, artificial intelligence, natural language processing, medical informatics, decision support systems, clinical

OBJECTIVE

The Centers for Disease Control (CDC) estimates that 1 in every 25 acute care hospitalizations results in a healthcare-associated infection (HAI) and that at least 50% of HAIs are preventable.1,2 Not only are HAIs estimated to cost over $9.8 billion USD annually,2 but they are used to measure quality of care by the Centers for Medicare and Medicaid Services (CMS), with failure to prevent HAIs potentially resulting in financial penalties to the offending hospital. In addition to infections, other types of preventable hospital acquired or associated disease have been reported with high prevalence.3–5 Predicting such nosocomial (ie, hospital acquired) diseases has the potential to reduce costs and improve outcomes.

Predictive modeling is an active area of medical informatics research with over 107 studies published between 2011 and 2017.6 However, the majority of risk prediction frameworks in use today were developed before the adoption of the electronic health record (EHR) and typically rely on a small number of risk factors (eg, signs, social factors, basic measurements) easily assessable by the physician.7 Likewise, automatic approaches typically rely on hand-chosen disease-specific features easily extracted from the structured portions of the EHR (eg, laboratory results, vital signs, and chart information).

By contrast, we were interested in discovering whether the information documented in unstructured clinical narratives could supplement or exceed the traditional information contained in structured (ie, tabular) parts of the EHR by enabling more robust prognostication of disease. Clinical notes typically document or summarize the most important positive and negative observations, potential diagnoses, findings, and treatments about the patient.
Moreover, they often provide interpretation of the low-level information present in the patient’s chart or labs. More importantly, unlike structured data, unstructured data can provide important nuanced and contextual information not available in a tabular format, including degrees of belief (eg, suggesting possible diagnoses or conditional treatments), relationships (eg, indicating which aspects of the clinical picture are being addressed with specific interventions), and interpretations (eg, indicating that a typically abnormal lab value is effectively normal given the patient’s history). This type of rich data is notoriously difficult to incorporate in traditional data-science systems,8 but it is ideally poised for deep learning, which can automatically discover and extract significant and meaningful features from raw data.9

To overcome the limitations of structured data and capture the longitudinal information in clinical notes, we present and evaluate a deep learning model harnessing clinical text for temporal risk prediction: reCurrent Additive Network for Temporal RIsk Prediction (CANTRIP). We show how CANTRIP can be trained without direct ground-truth risk labels to predict 3 nosocomial diseases 24–96 hours in advance: hospital acquired acute kidney injury (HAAKI), hospital acquired pressure injury (HAPI), and hospital acquired anemia (HAA). Note: in this study, we are interested only in predicting when and if the patient will develop nosocomial disease; we are not determining causality.

BACKGROUND AND SIGNIFICANCE

Risk prediction from EHR data has received considerable attention over the last decade,6 with the majority of approaches predicting a specific outcome or single disease.
In a review of 107 risk prediction studies, Goldstein et al (2017) found that (a) most studies did not fully utilize the depth of information available about patients in the EHR, instead relying on a small predefined list of variables; and (b) most models neglected to consider longitudinal measures.6 Few studies considered clinical text. The use of clinical text was previously explored by Goodwin and Harabagiu to predict congestive heart failure for diabetic patients.10–12 Their methods, too, relied on a small number of predefined features. By contrast, CANTRIP does not rely on any prespecified set of features opting instead to consider all observations documented in each clinical note. This allows CANTRIP to be trained to potentially predict a large variety of diseases without requiring disease-specific feature engineering. To ensure the generalizability of CANTRIP, we apply the model to 3 common nosocomial diseases: hospital acquired acute kidney injury (HAAKI), hospital acquired pressure injury (HAPI), and hospital acquired anemia (HAA)—each with their own cohorts and experimental results. 
Hospital acquired acute kidney injury (HAAKI)

Acute kidney injury (AKI) affects as many as 20% of all hospitalizations, resulting in an estimated cost of $10 billion annually.5,13 AKI is associated with increased mortality, end-stage renal disease, and chronic kidney disease.13 It has been shown that even small increases in serum creatinine are associated with long-term damage and increased mortality.13 Current criteria for AKI, however, rely on markers of established kidney damage or impaired function, necessitating new approaches for earlier prediction of AKI before significant kidney damage is established.14

Prior work on AKI prediction has largely focused on limited patient populations and a small number of standard features.15–18 Mohamadlou et al (2018) used a Gradient Boosting Machine19 to predict severe AKI using the English National Health Service criteria as their gold standard and relying on vital signs and creatinine values as features.20 Tomašev et al (2019) present a deep learning approach using the KDIGO21 criteria as a gold standard and relying on historical aggregates of 29 numeric structured data elements.22 By contrast, our approach is, to our knowledge, the first to predict AKI or HAAKI using clinical notes. Moreover, ours is the first approach to predict AKI without relying on extracting AKI-specific features.
Hospital acquired anemia (HAA)

A substantial number of hospital patients with normal HgB on admission become anemic during the course of their hospitalization, resulting in increased average length of stay by 10%–88%, hospital charges by 6%–80%, and risk of in-hospital mortality by 51%–228%, depending on HAA severity.3 HAA can result from a large number of factors, such as blood loss (including phlebotomy), erythropoietin deficiencies, nutritional deficiencies, hemolysis, and coagulation abnormalities.23 Thavendiranathan et al (2005) found that phlebotomy is highly associated with changes in HgB and hematocrit, noting a mean decrease of 7.0 g/L HgB and 1.9% hematocrit with every 100 mL of blood drawn, while McEvoy et al (2013) indicate that critical care patients average 40–70 mL of blood drawn daily and that every 50 mL of blood drawn increases their risk of moderate to severe HAA by 18%.24,25 Consequently, the ability to automatically predict HAA would enable physicians to switch to small volume phlebotomy tubes, minimize blood loss from in-dwelling catheters, and reduce blood tests.26 Indeed, Chant et al (2006) found that even small decreases in phlebotomy volume were associated with significantly reduced transfusion requirements in patients with prolonged stays.27

Although there has been some work on predicting anemias, such as classifying iron deficiency anemia using artificial neural networks28 or predicting moderate to severe anemia for patients with ulcerative colitis using logistic regression (LR),29 we were unable to find any prior work on developing automatic methods for predicting hospital acquired anemia, whether using structured or unstructured data.
Hospital acquired pressure injury (HAPI)

The development of pressure injuries (ie, pressure ulcers or bed sores) can lead to several complications, including sepsis, cellulitis, osteomyelitis, pain, and depression.30 The mortality rate has been noted to be as high as 60% within 1 year of hospital discharge for older patients who develop a pressure ulcer during their stay.31 The Braden scale is the most widely used risk assessment scale for pressure ulcers.32 However, in an external evaluation, Hyun et al (2013) found that the Braden scale shows “insufficient predictive validity and poor accuracy in discriminating intensive care patients at risk of pressure ulcers developing.”33 Keller et al (2002) reported that, “there are no conclusive studies on the identification of pressure ulcer risk factors. None of the existing risk-assessment scales were developed especially for use in intensive care unit (ICU) patients.”4 Automatic prediction of pressure injuries was explored by Schoonhoven et al (2006),34 wherein LR was applied to a small number of structured features. By contrast, we show that our data-driven deep learning approach can reliably detect pressure ulcers for ICU patients without physician interaction or pre-specified feature extraction, allowing for potentially improved patient outcomes.

METHODS

We first present our cohort selection and data preprocessing approaches and then our proposed model and the evaluation against several baselines.

Cohort selection

We selected our retrospective cohort, illustrated in Figure 1, from the MIMIC-III critical care database.35,36 MIMIC, developed by the Massachusetts Institute of Technology (MIT) Lab for Computational Physiology to support research in intelligent patient monitoring, is a freely available database containing deidentified health data associated with 46 520 patients.
After excluding admissions with fewer than 2 days of notes, our final cohorts consisted of 34 878 hospital admissions for HAAKI (34.2% prevalence); 35 218 for HAPI (11.1% prevalence); and 38 385 for HAA (39.8% prevalence).

Figure 1. Cohort selection diagram for each experimental cohort where (a), (b), and (c) describe the criteria used to distinguish between the positive and negative admissions for each disease; (d), (e), and (f) present the proportions of positive hospital admissions that met each type of criteria; discharge ICD-9 criteria were based on the Clinical Classifications Software (CCS) diagnosis and procedure categorization scheme provided as part of the Healthcare Cost and Utilization Project (HCUP); and DOE refers to the Date-of-Event for each disease. Admissions that met only ICD-9 criteria were omitted from this study as we were unable to determine their DOE.

Data preparation and preprocessing

To account for irregular gaps in the patient’s hospital visit, we adopt an abstract representation of the patient’s hospital visit which we call their clinical chronology.
We represent the chronology $C$ as a discrete, discontiguous sequence of $L$ snapshots, $s_1, s_2, \cdots, s_L$, where each snapshot encodes the clinical observations documented in any clinical notes produced on the same (calendar) day, and a sequence of elapsed times, $\delta_1, \delta_2, \cdots, \delta_L$, such that $\delta_i$ encodes the number of hours between $s_i$ and $s_{i-1}$ and $\delta_0$ encodes the number of hours between hospital admission and the first clinical note.

Natural language preprocessing

In this work, to evaluate the impact of clinical notes for predicting disease risk, we only considered the clinical observations documented in clinical notes. We extracted the set of observations from each clinical note in 4 steps, illustrated in Figure 2. An initial set of medical concepts corresponding to Unified Medical Language System (UMLS)37 entities was detected using MetaMap Lite.38 In order to account for the physician’s beliefs about each concept, we used FastContext,39 a high-performance reimplementation of ConText,40 to detect the following semantic attributes:

- Negation, indicating whether the observation was affirmed or negated;
- Certainty, indicating whether the author was certain or uncertain;
- Temporality, indicating whether the observation occurred in the present, the past, or is hypothetical; and
- Experiencer, indicating whether the observation was associated with the patient or someone else (eg, family).

Figure 2. Natural language preprocessing used to extract observations from clinical notes.
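As a rough illustration of the clinical chronology defined at the start of this section, the sketch below groups notes by calendar day into snapshots and records the hours elapsed before each snapshot. The names `Chronology` and `build_chronology` are hypothetical, and measuring elapsed time from the first note of each day is an assumption, since the text does not state which timestamp represents a snapshot.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass
class Chronology:
    snapshots: list  # one set of observation identifiers per calendar day
    deltas: list     # hours elapsed before each snapshot


def build_chronology(admit_time, notes):
    """Build a chronology from (timestamp, set_of_observations) pairs.

    Assumption: a snapshot is timestamped by the first note of its day.
    """
    notes = sorted(notes, key=lambda note: note[0])
    snapshots, deltas = [], []
    previous = admit_time  # the first delta is measured from admission
    for _, group in groupby(notes, key=lambda note: note[0].date()):
        group = list(group)
        first_time = group[0][0]
        # Merge the observations of every note written on the same day.
        snapshots.append(set().union(*(obs for _, obs in group)))
        deltas.append((first_time - previous).total_seconds() / 3600.0)
        previous = first_time
    return Chronology(snapshots, deltas)
```

Days without any note simply produce no snapshot, which is what makes the sequence discontiguous; the gap shows up only as a larger elapsed time.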
Sections were recognized and normalized using a large number of hand-crafted regular expression rules previously created for InfoBot.41 We then filtered out all observations that (1) were not affirmed, certain, present, and associated with the patient; (2) occurred in a section corresponding to consults, family history, past medical history, or social history; (3) had a UMLS semantic type not corresponding to a medical problem, intervention, drug, or anatomic region; or (4) belonged to InfoBot’s medical stop word list. Semantic types, rules, and stop words are provided in online Supplementary Appendix A.

Determining the Date-of-Event

We determined the Date-of-Event (DOE) as the first date on which the disease is documented in a clinical note, or evidenced by the patient’s labs or chart. Specifically, for each disease, we defined 1 or more (a) seed concepts in the UMLS hierarchy, (b) lexical patterns, and (c) structured criteria using the laboratory, chart, and/or vital sign information in MIMIC. We determined the DOE as the first date on which (1) any observation extracted from a clinical note associated with that date descends from any of the UMLS seed concepts; (2) any observation or any text in the note contains any of the lexical patterns not immediately followed by a colon (to rule out structural matches, eg, “bed sore: none”); or (3) the structured criteria are met. Table 1 provides the seed concepts, lexical patterns, and structured criteria associated with each disease.

Table 1. Criteria used to detect HAAKI, HAPI, and HAA from clinical notes or structured (eg, chart, laboratory) data, where CUI refers to a concept unique identifier in UMLS

| Disease | UMLS Seed CUI | Lexical pattern(s) | Structured criteria |
|---|---|---|---|
| HAAKI | C0022660 (Kidney Failure, Acute) | kidney failure, renal failure, kidney injury, renal injury, AKI | KDIGO11,12 |
| HAPI | C0011127 (Pressure Ulcer) | bed sore, bed ulcer, pressure sore, pressure ulcer, decub* sore, decub* ulcer | NPUAP13 |
| HAA | C0002871 (Anemia) | anemia, anaemia, HAA | WHO14 |

Abbreviations: AKI, acute kidney injury; CUI, concept unique identifier; HAA, hospital acquired anemia; HAAKI, hospital acquired acute kidney injury; HAPI, hospital acquired pressure injury; KDIGO, Kidney Disease Improving Global Outcomes; NPUAP, National Pressure Ulcer Advisory Panel; UMLS, Unified Medical Language System; WHO, World Health Organization. * represents a regular expression wildcard.

Encoding elapsed times

We encoded elapsed times using the sinusoidal representation proposed in Vaswani et al (2017),42 wherein the number of hours elapsed since the previous note, that is, $\delta_i$, is represented as a $K$-dimensional vector consisting of pairs of sinusoidal projections with different frequencies:

$$\delta_i^{2j} = \sin\left(h_i / 10000^{2j/32}\right)$$

$$\delta_i^{2j+1} = \cos\left(h_i / 10000^{2j/32}\right)$$

where $j \in [0, K-1]$ is the index into the vector $\delta_i$, and $h_i$ is the number of hours between $s_i$ and $s_{i+1}$. This representation was chosen because, for any offset $k$, $\delta_i^{j+k}$ reduces to a linear function of $\delta_i^{j}$. We also experimented with other encoding schemes and found the sinusoidal version to be the most effective across all evaluated systems.

Creating positive and negative examples

To train and evaluate our model without manually quantifying the risk of disease for each snapshot in each patient’s chronology, we used the DOE as a form of indirect supervision to produce positive and negative examples, as illustrated in Figure 3.
Specifically, for each positive admission (ie, admissions with chronologies in which the patient eventually develops the disease) we created a labeled example by:

1. Truncating each chronology to end at the last snapshot occurring 24–96 hours before the DOE;
2. Defining the prediction window $\Delta$ as the elapsed time (in hours) between the final snapshot (after truncation) and the DOE; and
3. Assigning the label $y = 1$.

Figure 3. How the prediction windows $\Delta$ and chronologies are determined for positive and negative examples.

To create negative examples, we first grouped positive admissions into buckets based on demographic and admission information including the patient’s age, sex, and race as well as their admitting ICU, source of admission (ie, clinic, physician, transfer, or other), type of admission (ie, elective, emergency, urgent), Oxford Acute Severity of Illness Score,43 and type of insurance (ie, government, private, Medicaid, Medicare, or self pay). For each bucket $b$, we assumed the Time-to-Event (TTE, ie, the number of hours elapsed from hospital admission to DOE) followed a Gamma prior distribution (ie, $\mathrm{TTE} \sim \Gamma(k_b, \theta_b)$) and determined $k_b$ and $\theta_b$ using maximum likelihood estimates over each positive example in the bucket. This allowed us to create labels for our negative examples by:

1. Determining which bucket $b$ each negative example belonged to;
2. Sampling $\mathrm{TTE}' \sim \Gamma(k_b, \theta_b)$;
3. Defining the DOE as either (a) the date obtained by projecting $\mathrm{TTE}'$ from the date of hospital admission or (b) the discharge date, whichever occurred first;
4. Truncating the chronology to end at the snapshot 24–96 hours before the DOE; and
5. Defining $\Delta$ as the hours elapsed between the final snapshot (after truncation) and the DOE.
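The negative-example procedure above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, time is measured in hours instead of calendar dates, the shape parameter uses a standard closed-form approximation to the Gamma maximum likelihood estimate instead of an exact solver, and returning `None` when no snapshot falls in the 24–96 hour range is a guess at behavior the text does not specify.

```python
import math
import random


def fit_gamma(ttes):
    """Approximate Gamma MLE (shape k, scale theta) for one bucket's TTEs.

    Assumption: closed-form approximation of the shape estimate; the
    scale estimate theta = mean / k is exact given k.
    """
    mean = sum(ttes) / len(ttes)
    s = math.log(mean) - sum(math.log(t) for t in ttes) / len(ttes)
    if s <= 0:
        raise ValueError("TTEs must be positive and not all identical")
    k = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    return k, mean / k


def negative_example(snapshot_hours, discharge_hour, k, theta, rng=random):
    """Return (number of snapshots kept, delta in hours) or None.

    snapshot_hours: hours from admission to each snapshot, ascending.
    """
    # Pseudo date-of-event: sampled TTE or discharge, whichever is first.
    doe = min(rng.gammavariate(k, theta), discharge_hour)
    # Snapshots lying 24-96 hours before the pseudo date-of-event.
    in_range = [i for i, h in enumerate(snapshot_hours) if 24 <= doe - h <= 96]
    if not in_range:
        return None  # assumption: such admissions yield no example
    last = in_range[-1]
    return last + 1, doe - snapshot_hours[last]
```

Sampling the pseudo date-of-event from the bucket's own positive examples keeps the distribution of prediction windows comparable between classes, so a model cannot separate positives from negatives by chronology length alone.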
Note: negative examples assigned to a bucket without any positive examples were excluded (filtered) from the experiments for that disease (corresponding to “Demographic Filtering” in Figure 1).

Computational approaches

We evaluated 4 computational approaches for predicting nosocomial disease, namely: (1) our proposed CANTRIP, (2) a bidirectional LSTM44 network, and 2 shallow learning systems: (3) LR and (4) a support vector machine45 (SVM).

Recurrent additive network for temporal risk prediction

Given a clinical chronology $C = \{s_1, \cdots, s_L; \delta_1, \cdots, \delta_L\}$ and a prediction window $\Delta$, CANTRIP predicts the probability (ie, risk) that the patient will develop the disease during that window. We factorize this probability into 3 components, each corresponding to 1 of the modules illustrated in Figure 4:

$$P(y \mid C, \Delta) = \overbrace{P(y \mid r)}^{\text{Prediction}} \cdot \overbrace{P(r \mid e_1, \delta_1, e_2, \delta_2, \cdots, e_L, \Delta)}^{\text{Progression}} \cdot \overbrace{\prod_{i=1}^{L} P(e_i \mid s_i \in C)}^{\text{Encoding}}$$

where $y$ indicates whether the patient develops the disease (such that $y = 1$ if the patient develops the disease, and $y = 0$ otherwise), $e_1, \cdots, e_L$ denotes the internal representations of the snapshots $s_1, \cdots, s_L$ learned by the Encoding module, $r$ is the internal representation of the clinical picture of the patient produced by the Progression module, and $\delta_2, \cdots, \delta_L$ represents the elapsed time between successive snapshots. Details on each module are provided below.

Figure 4. The reCurrent Additive Network for Temporal RIsk Prediction (CANTRIP).

Encoding clinical snapshots

The goal of the Encoding module in CANTRIP is to learn an optimal encoding of individual clinical snapshots. Formally, we define the vocabulary $V$ as the set of all unique clinical observations documented in the positive examples. This allows us to represent each clinical snapshot $s$ as a $|V|$-length binary vector such that the $v$-th element in $s$ indicates whether the $v$-th observation in $V$ was observed in that snapshot.
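The binary snapshot representation just described can be sketched in a few lines. The function names are hypothetical, and the sorted ordering of the vocabulary and the dropping of observations absent from the positive examples are assumptions made for illustration; the identifiers in the usage note are the seed CUIs from Table 1.

```python
def build_vocabulary(positive_chronologies):
    """Map each unique observation in the positive examples to an index.

    positive_chronologies: iterable of chronologies, each a list of
    snapshots, each snapshot a set of observation identifiers.
    """
    observations = {
        obs
        for chronology in positive_chronologies
        for snapshot in chronology
        for obs in snapshot
    }
    # Assumption: sorted order, chosen only to make indices reproducible.
    return {obs: index for index, obs in enumerate(sorted(observations))}


def encode_snapshot(snapshot, vocabulary):
    """Return a binary vector with one element per vocabulary entry."""
    vector = [0] * len(vocabulary)
    for obs in snapshot:
        index = vocabulary.get(obs)
        if index is not None:  # unseen observations are dropped
            vector[index] = 1
    return vector
```

For example, a vocabulary built from one chronology whose snapshots are `{"C0002871", "C0022660"}` and `{"C0011127"}` has 3 entries, and a later snapshot containing only `"C0022660"` is encoded with a single 1 in the position of that concept.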
CANTRIP incorporates a Residual Highway Network46 (RHN) to learn an embedding $e_i$ for each clinical snapshot $s_i$. Highway networks allow information to flow “around” or across multiple layers, enabling networks with hundreds of layers to be trained efficiently. We used 10 dense layers with batch normalization, L1 regularization, and single-depth residual connections to produce the encoding of the clinical snapshot $e_i$. All dense layers used Gaussian error linear unit47 activations as in Devlin et al (2019).48

Modeling disease progression

To account for the fact that clinical snapshots provide only an incomplete view of the clinical picture of the patient (eg, an EEG report is unlikely to describe a pressure injury or indicate anemia), we must infer the patient’s clinical picture by combining and accumulating information from each embedded clinical snapshot to model the progression of their disease. We accomplished this by (1) casting the inferred clinical picture of the patient as the memory of a Recurrent Neural Network (RNN) and (2) training the RNN to accumulate information about the progression of the patient’s disease by processing each snapshot sequentially. Formally, for each encoded snapshot $e_i \in \{e_1, e_2, \cdots, e_L\}$, the RNN is trained to predict the progression of the patient’s disease after elapsed time $\delta_{i+1} \in \{\delta_2, \delta_3, \ldots, \Delta\}$, such that when considering the final (ie, most recent) encoded snapshot $e_L$ and prediction window $\Delta$, the final output of the RNN, $r$, encodes sufficient information to estimate the probability that the patient will develop the disease within the prediction window $\Delta$.

We used a recurrent additive network49 (RAN) as our RNN implementation in CANTRIP. RANs are a substantial simplification of LSTM44 units and gated recurrent units50 with nearly half the number of learnable parameters and have been shown to yield a number of advantages including avoiding the vanishing gradient problem, improving model performance, and significantly reducing model complexity.
We believe that these properties make them ideally suited for deep learning with limited datasets.

Predicting disease risk

As all of the heavy lifting is accomplished by the Progression and Encoding modules, the disease risk is calculated by estimating the probability that the patient will develop the disease within the prediction window $\Delta$ using a logistic sigmoid projection:

$$P(y \mid r) = \sigma(w_p r + b_p) = \frac{1}{1 + e^{-(w_p r + b_p)}}$$

where $y$ denotes whether the patient develops the disease (ie, $y = 1$ if the patient develops the disease, and $y = 0$ otherwise), $r$ is the encoding of the inferred clinical picture produced by the Progression module, and $w_p$ and $b_p$ denote the learned weight vector and bias value.

Bidirectional LSTM

We also explored the use of a bidirectional LSTM44 network, using a single embedding layer for observations and the final state of the LSTM to predict disease risk as in CANTRIP. The bidirectional LSTM closely resembles CANTRIP if the RHN in the Encoding module were replaced by a single dense layer, and the RAN in the Progression module were replaced with a bidirectional LSTM.

Shallow learning approaches

We evaluated 2 shallow learning approaches: LR and SVMs.45 Both approaches used the set of observations in the final snapshot before the prediction window as their input features.

Evaluation

For each cohort we created training, development, and testing datasets using a stratified 8:1:1 random split based on the demographic and admission criteria illustrated in Figure 3 and, for positive examples, the type of label(s) associated with that chronology as illustrated in Figure 1d–f.51–53 For each system, we report the performance on the test set using the hyperparameters that provided the highest MCC (defined below) on the development set. Hyperparameter optimization is described in online Supplementary Appendix B.
We incorporated temperature scaling for probability calibration using the development set.54 Metrics We report 7 metrics to evaluate the performance of each system for each cohort: Accuracy, Sensitivity (the true positive rate, ie, Recall), Specificity (the true negative rate), Precision (the positive predictive value), the F1 measure (the harmonic mean of Precision and Recall), the area under the Receiver Operating Characteristic (ROC) curve (AUC), and the Mathews correlation coefficient55 (MCC, a balanced measure useful for comparing systems on imbalanced data56). Due to the data imbalance (ie, the low prevalence of HAA, HAPI, and HAAKI) in our dataset, we primarily relied on MCC to compare systems. Additional details on metrics are provided in online Supplementary Appendix C. RESULTS Table 2 presents the performance obtained using the best configuration of each system for each disease: HAA, HAPI, and HAAKI. Across all 3 diseases, CANTRIP obtains the highest MCC, F1 ⁠, Accuracy, and AUC. Interestingly, bi-LSTM obtains the highest Sensitivity for HAA and HAAKI, at the cost of having the lowest Specificity across the board. By contrast, the SVM exhibited the weakest performance for all 3 diseases. Interestingly, the SVM retained ≥ 90% of the positive examples as support vectors when training on all 3 diseases, indicating that nosocomial disease prediction from clinical notes cannot be accomplished focusing only on the most representative examples. Table 2. Performance of each evaluated system when predicting HAA, HAPI, and HAAKI 24–96 hours before documented in the clinical notes or directly evidenced by laboratory or chart data Disease . System . Accuracy . AUC . Sensitivity . Specificity . Precision . F1 . MCC . 
HAA LR 64.58% 65.55% 47.55% 74.51% 52.09% 49.72% 0.2252 SVM 57.03% 55.23% 37.46% 69.21% 43.09% 40.08% 0.0687 biLSTM 62.78% 70.01% 72.03% 57.34% 49.85% 58.92% 0.2844 CANTRIP 69.64% 74.61% 57.56% 76.75% 59.30% 58.42% 0.3453 HAPI LR 76.87% 74.34% 57.91% 79.64% 29.37% 38.98% 0.2887 SVM 78.32% 62.78% 24.05% 86.26% 20.38% 22.06% 0.0961 biLSTM 78.77% 80.84% 70.57% 79.99% 34.41% 46.27% 0.3844 CANTRIP 83.61% 87.05% 71.83% 85.36% 42.19% 53.16% 0.4632 HAAKI LR 64.84% 64.49% 44.99% 77.20% 55.12% 49.55% 0.2327 SVM 57.07% 55.24% 37.69% 69.13% 43.18% 40.25% 0.0703 biLSTM 67.50% 74.04% 71.97% 64.69% 56.16% 63.09% 0.3569 CANTRIP 73.68% 79.10% 61.72% 81.20% 67.35% 64.40% 0.4370 Disease . System . Accuracy . AUC . Sensitivity . Specificity . Precision . F1 . MCC . HAA LR 64.58% 65.55% 47.55% 74.51% 52.09% 49.72% 0.2252 SVM 57.03% 55.23% 37.46% 69.21% 43.09% 40.08% 0.0687 biLSTM 62.78% 70.01% 72.03% 57.34% 49.85% 58.92% 0.2844 CANTRIP 69.64% 74.61% 57.56% 76.75% 59.30% 58.42% 0.3453 HAPI LR 76.87% 74.34% 57.91% 79.64% 29.37% 38.98% 0.2887 SVM 78.32% 62.78% 24.05% 86.26% 20.38% 22.06% 0.0961 biLSTM 78.77% 80.84% 70.57% 79.99% 34.41% 46.27% 0.3844 CANTRIP 83.61% 87.05% 71.83% 85.36% 42.19% 53.16% 0.4632 HAAKI LR 64.84% 64.49% 44.99% 77.20% 55.12% 49.55% 0.2327 SVM 57.07% 55.24% 37.69% 69.13% 43.18% 40.25% 0.0703 biLSTM 67.50% 74.04% 71.97% 64.69% 56.16% 63.09% 0.3569 CANTRIP 73.68% 79.10% 61.72% 81.20% 67.35% 64.40% 0.4370 Abbreviations: AUC, area under the curve; CANTRIP, reCurrent Additive Network for Temporal RIsk Prediction; biLSTM, bidirectional Long Short-Term Memory network; HAA, hospital acquired anemia; HAAKI, hospital acquired acute kidney injury; HAPI, hospital acquired pressure injury; LR, logistic regression; MCC, Matthews correlation coefficient; SVM, support vector machine. Open in new tab Table 2. 
Performance of each evaluated system when predicting HAA, HAPI, and HAAKI 24–96 hours before documented in the clinical notes or directly evidenced by laboratory or chart data Disease . System . Accuracy . AUC . Sensitivity . Specificity . Precision . F1 . MCC . HAA LR 64.58% 65.55% 47.55% 74.51% 52.09% 49.72% 0.2252 SVM 57.03% 55.23% 37.46% 69.21% 43.09% 40.08% 0.0687 biLSTM 62.78% 70.01% 72.03% 57.34% 49.85% 58.92% 0.2844 CANTRIP 69.64% 74.61% 57.56% 76.75% 59.30% 58.42% 0.3453 HAPI LR 76.87% 74.34% 57.91% 79.64% 29.37% 38.98% 0.2887 SVM 78.32% 62.78% 24.05% 86.26% 20.38% 22.06% 0.0961 biLSTM 78.77% 80.84% 70.57% 79.99% 34.41% 46.27% 0.3844 CANTRIP 83.61% 87.05% 71.83% 85.36% 42.19% 53.16% 0.4632 HAAKI LR 64.84% 64.49% 44.99% 77.20% 55.12% 49.55% 0.2327 SVM 57.07% 55.24% 37.69% 69.13% 43.18% 40.25% 0.0703 biLSTM 67.50% 74.04% 71.97% 64.69% 56.16% 63.09% 0.3569 CANTRIP 73.68% 79.10% 61.72% 81.20% 67.35% 64.40% 0.4370 Disease . System . Accuracy . AUC . Sensitivity . Specificity . Precision . F1 . MCC . 
Figure 5 provides insights on system performance as illustrated by receiver operating characteristic and precision-recall curves.

Figure 5. System performance as illustrated by receiver operating characteristic (ROC) and precision-recall (PR) curves with different decision thresholds indicated. Selecting the decision threshold can be accomplished by (1) determining a clinically actionable level of Precision for the target disease and (2) comparing potential thresholds on ROC and PR curves, with optimal thresholds between 33% and 50% for CANTRIP. In our experiments, we used a threshold of 50%.
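The two-step threshold selection described in the Figure 5 caption can be sketched roughly as follows. This is an illustrative assumption about how such a procedure might be coded, not the authors' implementation: the helper names, the toy labels and scores, and the candidate thresholds are all hypothetical. Among the candidates that reach a required Precision, the sketch keeps the one with the highest Sensitivity.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall (Sensitivity) when predicting positive
    for every score greater than or equal to the threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def select_threshold(labels, scores, min_precision, candidates):
    """Return (threshold, precision, recall) for the candidate that meets
    min_precision with the highest recall, or None if none qualifies.
    Ties on recall keep the earliest candidate in the list."""
    best = None
    for t in candidates:
        p, r = precision_recall_at(labels, scores, t)
        if p >= min_precision and (best is None or r > best[2]):
            best = (t, p, r)
    return best


# Toy data (hypothetical): 1 = disease developed, 0 = it did not.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.90, 0.80, 0.60, 0.55, 0.45, 0.40, 0.30, 0.10]

# Step 1: fix a clinically actionable Precision (0.70 is an assumed value).
# Step 2: compare candidate thresholds against that requirement.
print(select_threshold(labels, scores, 0.70, [0.33, 0.40, 0.50]))
```

In practice the Precision requirement would be set per disease and per intended use, and the candidates would be read off validation-set ROC and PR curves rather than a fixed list.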
DISCUSSION

As shown by Table 2, the highest performance for all systems was obtained when predicting HAPI, followed by HAAKI, and finally by HAA. This is unsurprising given that pressure injury risk factors such as mobility, color, texture, and wound care are often documented in free text. By contrast, anemia is typically defined in terms of HgB and hematocrit, which are only sporadically and inconsistently documented in clinical narratives.

In terms of HAA, Khan et al (2017) [29] report an AUC of 69% when predicting moderate to severe anemia based on structured data (including diagnosis of mild anemia) using LR for 789 patients with newly diagnosed ulcerative colitis. CANTRIP obtains similar performance (75% AUC) despite also detecting mild anemias and without considering structured data such as albumin, HgB, or hematocrit.

When predicting HAPI, Schoonhoven et al (2006) [34] report an AUC of 70% when using LR and a rule-based classifier on hand-chosen features. A retrospective analysis of the Braden scale reports an AUC of 62%, with 18% Precision and 29% F1 [57]. In another retrospective study of the Braden scale, Hyun et al (2013) [33] report an AUC of 67%, with 14% Precision and 24% F1. While these results all use different criteria for pressure injury classification as well as different data sets and study designs, we can see that despite requiring no manual labeling, feature extraction, or physician interaction, CANTRIP obtains similar AUC (87%) and significantly higher Precision (42%) and F1 (53%) compared to both manual and rule-based prediction systems designed specifically for predicting pressure injuries.
This suggests that deep learning is able to extract meaningful signals for predicting pressure injury from clinical texts.

Finally, for HAAKI, Mohamadlou et al (2018) [20] report an AUC of 67% with 80% Accuracy when predicting moderate to severe AKI up to 72 hours in advance based on structured data. CANTRIP obtains higher AUC (79%) when predicting any type of AKI (including mild) up to 96 hours in advance, based only on clinical notes. DeepMind reports an ROC AUC of 93% with 33% Precision and a Precision-Recall AUC of 29.7% when predicting AKI up to 48 hours in advance using data collected for 703 782 adult patients from all available sites in the US Department of Veterans Affairs and including 620 000 features [22]. By contrast, with significantly less data, a more diverse population, and no AKI-specific feature extraction, CANTRIP obtains lower AUC (79%) but more than double the Precision (67%) and more than double the Precision-Recall AUC (74%) when predicting up to 96 hours in advance, indicating that clinical text alone provides important cues for predicting HAAKI.

In terms of probability calibration, we measured an Expected Calibration Error [54] of 2.69%, 1.66%, and 2.36% for HAA, HAPI, and HAAKI, respectively, indicating that CANTRIP underestimated the empirical probability of each disease and may benefit from site-specific calibration. Likewise, selection of a decision threshold depends on both the disease and the intended use of the model, as guided by Figure 5. Reliability plots are provided in online Supplementary Appendix D.

Overall, the relative performance of CANTRIP, when compared to baseline and disease-specific models reported in the literature, suggests not only that CANTRIP can generalize across nosocomial diseases, but also that clinical notes provide meaningful information for prognostication of disease.

Limitations

The primary limitation of this study is that all systems relied only on features extracted from clinical notes.
This was a deliberate design decision: while our ultimate goal is to combine textual and structured features, we were interested in first examining the power of text alone for predicting nosocomial disease. An additional limitation is that features only indicated the presence or absence of observations, signs, interventions, etc, meaning that values reported in the text, such as “HgB: 7.5,” are not available to the model. Rather than parsing and extracting this information, in future work we aim to combine both clinical notes and structured data as features. For computational reasons, we chose to exclude hypothetical and negated mentions of observations; we expect that providing explicitly negated information to the model may further improve Specificity, which we are also exploring in future work. Finally, we plan to explore and further validate CANTRIP using patient data from other clinical sites.

CONCLUSION

We presented and evaluated a deep learning model that harnesses clinical notes to predict nosocomial disease in critical care patients. We showed how CANTRIP can be trained without direct ground-truth risk labels to predict 3 nosocomial diseases 24–96 hours in advance: HAAKI, HAPI, and HAA. Our experimental results indicate not only that CANTRIP outperforms traditional (shallow) learning approaches and a competitive deep learning baseline, but also that, despite considering only non-disease-specific features extracted from clinical notes, CANTRIP obtains performance competitive with disease-specific systems relying on hand-chosen structured features or hand-crafted rules.

FUNDING

This work was supported by the intramural research program at the US National Library of Medicine, National Institutes of Health.

AUTHOR CONTRIBUTIONS

TG and DDF conceptualized the study. DDF oversaw study design and reviewed and helped analyze the findings.
TG designed and implemented the systems, collected and processed the data, performed data analysis, and conducted the experiments and evaluations. TG and DDF contributed to writing and editing of the manuscript.

ACKNOWLEDGMENTS

This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). We would like to thank Dr. Clement McDonald for his advice on criteria coding in MIMIC. Figure 2 was designed using assets from MacroVector at Freepik.com.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

1 Magill SS, Edwards JR, Bamberg W, et al. Multistate point-prevalence survey of health care–associated infections. N Engl J Med 2014; 370 (13): 1198–208.
2 Schmier JK, Hulme-Lowe CK, Semenova S, et al. Estimated hospital costs associated with preventable health care-associated infections if health care antiseptic products were unavailable. Clinicoecon Outcomes Res 2016; 8: 197–205.
3 Henderson JM, Blackstone EH, Hixson ED, et al. Hospital-acquired anemia: prevalence, outcomes, and healthcare implications. J Hosp Med 2013; 8: 506–12.
4 Keller B, Wille J, van Ramshorst B, et al. Pressure ulcers in intensive care patients: a review of risks and prevention. Intensive Care Med 2002; 28 (10): 1379–88.
5 Silver SA, Long J, Zheng Y, et al. Cost of acute kidney injury in hospitalized patients. J Hosp Med 2017; 12 (2): 70–6.
6 Goldstein BA, Navar AM, Pencina MJ, et al. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inform Assoc 2017; 24 (1): 198–208.
7 Chase HS, Mitrani LR, Lu GG, et al. Early recognition of multiple sclerosis using natural language processing of the electronic health record. BMC Med Inform Decis Mak 2017; 17 (1): 24.
8 Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform 2009; 42 (5): 760–72.
9 Collobert R, Weston J, Bottou L, et al. Natural language processing (almost) from scratch. J Mach Learn Res 2011; 12: 2493–537.
10 Goodwin T, Harabagiu SM. A probabilistic reasoning method for predicting the progression of clinical findings from electronic medical records. In: Proceedings of the American Medical Informatics Association (AMIA) Joint Summit on Clinical Research Informatics; March 25–27, 2015; San Francisco, California.
11 Goodwin T, Harabagiu SM. A predictive chronological model of multiple clinical observations. In: International Conference on Healthcare Informatics (ICHI); October 21–23, 2015; Dallas, TX.
12 Goodwin T, Harabagiu SM. Inferring the interactions of risk factors from EHRs. AMIA Jt Summits Transl Sci Proc 2016; 2016: 78–87.
13 Chertow GM, Burdick E, Honour M, et al. Acute kidney injury, mortality, length of stay, and costs in hospitalized patients. J Am Soc Nephrol 2005; 16 (11): 3365–70.
14 Mehta RL, Pascual MT, Soroko S, et al. Spectrum of acute renal failure in the intensive care unit: the PICARD experience. Kidney Int 2004; 66 (4): 1613–21.
15 Koyner JL, Adhikari R, Edelson DP, et al. Development of a multicenter ward-based AKI prediction model. Clin J Am Soc Nephrol 2016; 11 (11): 1935–43.
16 Flechet M, Güiza F, Schetz M, et al. AKIpredictor, an online prognostic calculator for acute kidney injury in adult critically ill patients: development, validation and comparison to serum neutrophil gelatinase-associated lipocalin. Intensive Care Med 2017; 43 (6): 764–73.
17 Kate RJ, Perez RM, Mazumdar D, et al. Prediction and detection models for acute kidney injury in hospitalized older adults. BMC Med Inform Decis Mak 2016; 16 (1): 39.
18 Porter CJ, Juurlink I, Bisset LH, et al. A real-time electronic alert to improve detection of acute kidney injury in a large teaching hospital. Nephrol Dial Transplant 2014; 29: 1888–93.
19 Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001; 29 (5): 1189–232.
20 Mohamadlou H, Lynn-Palevsky A, Barton C, et al. Prediction of acute kidney injury with a machine learning algorithm using electronic health record data. Can J Kidney Health Dis 2018; 5. doi: 10.1177/2054358118776326.
21 Khwaja A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin Pract 2012; 120: c179–84.
22 Tomašev N, Glorot X, Rae JW, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 2019; 572 (7767): 116–9.
23 Rawal G, Kumar R, Yadav S, et al. Anemia in intensive care: a review of current concepts. J Crit Care Med 2016; 2 (3): 109–14.
24 Thavendiranathan P, Bagai A, Ebidia A, et al. Do blood tests cause anemia in hospitalized patients? The effect of diagnostic phlebotomy on hemoglobin and hematocrit levels. J Gen Intern Med 2005; 20 (6): 520–4.
25 McEvoy MT, Shander A. Anemia, bleeding, and blood transfusion in the intensive care unit: causes, risks, costs, and new strategies. Am J Crit Care 2013; 22 (6): eS1–13.
26 Harber CR, Sosnowski KJ, Hegde RM. Highly conservative phlebotomy in adult intensive care: a prospective randomized controlled trial. Anaesth Intensive Care 2006; 34 (4): 434–7.
27 Chant C, Wilson G, Friedrich JO. Anemia, transfusion, and phlebotomy practices in critically ill patients with prolonged ICU length of stay: a cohort study. Crit Care 2006; 10 (5): R140.
28 Azarkhish I, Raoufy MR, Gharibzadeh S. Artificial intelligence models for predicting iron deficiency anemia and iron serum level based on accessible laboratory data. J Med Syst 2012; 36 (3): 2057–61.
29 Khan N, Patel D, Shah Y, et al. A novel model for predicting incident moderate to severe anemia and iron deficiency in patients with newly diagnosed ulcerative colitis. Dig Dis Sci 2017; 62 (5): 1295–304.
30 Brem H, Maggi J, Nierman D, et al. High cost of stage IV pressure ulcers. Am J Surg 2010; 200 (4): 473–7.
31 Thomas DR, Goode PS, Tarquine PH, et al. Hospital-acquired pressure ulcers and risk of death. J Am Geriatr Soc 1996; 44 (12): 1435–40.
32 Bergstrom N, Demuth PJ, Braden BJ. A clinical trial of the Braden Scale for Predicting Pressure Sore Risk. Nurs Clin North Am 1987; 22 (2): 417–28.
33 Hyun S, Vermillion B, Newton C, et al. Predictive validity of the Braden scale for patients in intensive care units. Am J Crit Care 2013; 22: 514–20.
34 Schoonhoven L, Grobbee DE, Donders ART, et al. Prediction of pressure ulcer development in hospitalized patients: a tool for risk assessment. Qual Saf Health Care 2006; 15 (1): 65–70.
35 Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3 (1): 160035.
36 Goldberger AL, Amaral LA, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 2000; 101 (23): E215–20.
37 Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med 1993; 32: 281–91.
38 Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc 2017; 24 (4): 841–4.
39 Shi J, Hurdle JF. Trie-based rule processing for clinical NLP: a use-case study of n-trie, making the ConText algorithm more efficient and scalable. J Biomed Inform 2018; 85: 106–13.
40 Chapman WW, Hillert D, Velupillai S, et al. Extending the NegEx lexicon for multiple languages. Stud Health Technol Inform 2013; 192: 677–81.
41 Demner-Fushman D, Seckman C, Fisher C, et al. A prototype system to support evidence-based practice. AMIA Annu Symp Proc 2008; 2008: 151–5.
42 Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017; December 4–9, 2017: 5998–6008; Long Beach, CA. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf Accessed September 18, 2019.
43 Johnson AEW, Kramer AA, Clifford GD. A new severity of illness scale using a subset of Acute Physiology and Chronic Health Evaluation data elements shows comparable predictive accuracy. Crit Care Med 2013; 41 (7): 1711–8.
44 Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997; 9 (8): 1735–80.
45 Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995; 20 (3): 273–97.
46 Srivastava RK, Greff K, Schmidhuber J. Training very deep networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA: MIT Press; 2015: 2377–85.
47 Hendrycks D, Gimpel K. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415. Published Online First: 27 June 2016. http://arxiv.org/abs/1606.08415 Accessed September 12, 2019.
48 Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); June 2–7, 2019; Minneapolis, MN.
49 Lee K, Levy O, Zettlemoyer L. Recurrent Additive Networks. arXiv preprint arXiv:1705.07393. Published Online First: 21 May 2017.
50 Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Gordon G, Dunson D, Dudík M, eds. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics; April 11–13, 2011; PMLR: Fort Lauderdale, FL. http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf Accessed September 21, 2016.
51 Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. In: Gunopulos D, Hofmann T, Malerba D, et al., eds. Machine Learning and Knowledge Discovery in Databases. Heidelberg: Springer; 2011: 145–58.
52 Szymański P, Kajdanowicz T. A network perspective on stratification of multi-label data. In: Torgo L, Krawczyk B, Branco P, et al., eds. Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications. ECML-PKDD. Skopje, Macedonia: PMLR; 2017: 22–35.
53 Szymański P, Kajdanowicz T. A scikit-based Python environment for performing multi-label classification. arXiv preprint arXiv:1702.01460. Published Online First: 5 February 2017. http://arxiv.org/abs/1702.01460 Accessed September 16, 2019.
54 Guo C, Pleiss G, Sun Y, et al. On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org 2017: 1321–30. http://dl.acm.org/citation.cfm?id=3305381.3305518 Accessed November 27, 2019.
55 Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta BBA - Protein Struct 1975; 405 (2): 442–51.
56 Boughorbel S, Jarray F, El-Anbari M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS One 2017; 12 (6): e0177678.
57 Cho I, Noh M. Braden Scale: evaluation of clinical usefulness in an intensive care unit. J Adv Nurs 2010; 66 (2): 293–302.

Published by Oxford University Press on behalf of the American Medical Informatics Association 2020. This work is written by US Government employees and is in the public domain in the US. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal of the American Medical Informatics Association, Oxford University Press


Publisher: Oxford University Press
Copyright: Published by Oxford University Press on behalf of the American Medical Informatics Association 2020. This work is written by US Government employees and is in the public domain in the US.
ISSN: 1067-5027
eISSN: 1527-974X
DOI: 10.1093/jamia/ocaa004

Abstract

Objective: Reliable longitudinal risk prediction for hospitalized patients is needed to provide quality care. Our goal is to develop a generalizable model capable of leveraging clinical notes to predict healthcare-associated diseases 24–96 hours in advance.

Methods: We developed a reCurrent Additive Network for Temporal RIsk Prediction (CANTRIP) to predict the risk of hospital acquired (occurring ≥ 48 hours after admission) acute kidney injury, pressure injury, or anemia ≥ 24 hours before it is implicated by the patient’s chart, labs, or notes. We rely on the MIMIC III critical care database and extract distinct positive and negative cohorts for each disease. We retrospectively determine the date-of-event using structured and unstructured criteria and use it as a form of indirect supervision to train and evaluate CANTRIP to predict disease risk using clinical notes.

Results: Our experiments indicate that CANTRIP, operating on text alone, obtains 74%–87% area under the curve and 77%–85% Specificity. Baseline shallow models showed lower performance on all metrics, while bidirectional long short-term memory obtained the highest Sensitivity at the cost of significantly lower Specificity and Precision.

Discussion: Proper model architecture allows clinical text to be successfully harnessed to predict nosocomial disease, outperforming shallow models and obtaining similar performance to disease-specific models reported in the literature.

Conclusion: Clinical text on its own can provide a competitive alternative to traditional structured features (eg, lab values, vital signs). CANTRIP is able to generalize across nosocomial diseases without disease-specific feature extraction and is available at https://github.com/h4ste/cantrip.
Keywords: deep learning, machine learning, artificial intelligence, natural language processing, medical informatics, decision support systems, clinical

OBJECTIVE

The Centers for Disease Control (CDC) estimates that 1 in every 25 acute care hospitalizations results in a healthcare-associated infection (HAI) and that at least 50% of HAIs are preventable.1,2 Not only are HAIs estimated to cost over $9.8 billion USD annually,2 but they are used to measure quality of care by the Centers for Medicare and Medicaid Services (CMS), with failure to prevent HAIs potentially resulting in financial penalties to the offending hospital. In addition to infections, other types of preventable hospital acquired or associated disease have been reported with high prevalence.3–5 Predicting such nosocomial (ie, hospital acquired) diseases has the potential to reduce costs and improve outcomes.

Predictive modeling is an active area of medical informatics research with over 107 studies published between 2011 and 2017.6 However, the majority of risk prediction frameworks in use today were developed before the adoption of the electronic health record (EHR) and typically rely on a small number of risk factors (eg, signs, social factors, basic measurements) easily assessable by the physician.7 Likewise, automatic approaches typically rely on extracting hand-chosen disease-specific features easily extracted from the structured portions of the EHR (eg, laboratory results, vital signs, and chart information).

By contrast, we were interested in discovering whether the information documented in unstructured clinical narratives could supplement or exceed the traditional information contained in structured (ie, tabular) parts of the EHR by enabling more robust prognostication of disease. Clinical notes typically document or summarize the most important positive and negative observations, potential diagnoses, findings, and treatments about the patient.
Moreover, they often provide interpretation of the low-level information present in the patient’s chart or labs. More importantly, unlike structured data, unstructured data can provide important nuanced and contextual information not available in a tabular format including degrees of belief (eg, suggesting possible diagnoses or conditional treatments), relationships (eg, indicating which aspects of the clinical picture are being addressed with specific interventions), and interpretations (eg, indicating that a typically abnormal lab value is effectively normal given the patient’s history). This type of rich data is notoriously difficult to incorporate in traditional data-science systems,8 but it is ideally poised for deep learning, which can automatically discover and extract significant and meaningful features from raw data.9

To overcome the limitations of structured data and capture the longitudinal information in clinical notes, we present and evaluate a deep learning model harnessing clinical text for temporal risk prediction: reCurrent Additive Network for Temporal RIsk Prediction (CANTRIP). We show how CANTRIP can be trained without direct ground-truth risk labels to predict 3 nosocomial diseases 24–96 hours in advance: hospital acquired acute kidney injury (HAAKI), hospital acquired pressure injury (HAPI), and hospital acquired anemia (HAA). Note: in this study, we are interested only in predicting when and if the patient will develop nosocomial disease; we are not determining causality.

BACKGROUND AND SIGNIFICANCE

Risk prediction from EHR data has received considerable attention over the last decade,6 with the majority of approaches predicting a specific outcome or single disease.
In a review of 107 risk prediction studies, Goldstein et al (2017) found that (a) most studies did not fully utilize the depth of information available about patients in the EHR, instead relying on a small predefined list of variables; and (b) most models neglected to consider longitudinal measures.6 Few studies considered clinical text. The use of clinical text was previously explored by Goodwin and Harabagiu to predict congestive heart failure for diabetic patients.10–12 Their methods, too, relied on a small number of predefined features. By contrast, CANTRIP does not rely on any prespecified set of features opting instead to consider all observations documented in each clinical note. This allows CANTRIP to be trained to potentially predict a large variety of diseases without requiring disease-specific feature engineering. To ensure the generalizability of CANTRIP, we apply the model to 3 common nosocomial diseases: hospital acquired acute kidney injury (HAAKI), hospital acquired pressure injury (HAPI), and hospital acquired anemia (HAA)—each with their own cohorts and experimental results. 
Hospital acquired acute kidney injury (HAAKI)

Acute kidney injury (AKI) affects as many as 20% of all hospitalizations resulting in an estimated cost of $10 billion annually.5,13 AKI is associated with increased mortality, end-stage renal disease, and chronic kidney disease.13 It has been shown that even small increases in serum creatinine are associated with long-term damage and increased mortality.13 Current criteria for AKI, however, rely on markers of established kidney damage or impaired function, necessitating new approaches for earlier prediction of AKI before significant kidney damage is established.14

Prior work on AKI prediction has largely focused on limited patient populations and a small number of standard features.15–18 Mohamadlou et al (2018) used a Gradient Boosting Machine19 to predict severe AKI using the English National Health Service criteria as their gold standard and relying on vital signs and creatinine values as features.20 Tomašev et al (2019) present a deep learning approach using the KDIGO21 criteria as a gold standard and relying on historical aggregates of 29 numeric structured data elements.22 By contrast, our approach is the first to our knowledge to predict AKI or HAAKI using clinical notes. Moreover, ours is the first approach to predict AKI without relying on extracting AKI-specific features.
Hospital acquired anemia (HAA)

A substantial number of hospital patients with normal HgB on admission become anemic during the course of their hospitalization, resulting in increased average length of stay by 10%–88%, hospital charges by 6%–80%, and risk of in-hospital mortality by 51%–228%, depending on HAA severity.3 HAA can result from a large number of factors, such as blood loss (including phlebotomy), erythropoietin deficiencies, nutritional deficiencies, hemolysis, and coagulation abnormalities.23 Thavendiranathan et al (2005) found that phlebotomy is highly associated with changes in HgB and hematocrit, noting a mean decrease of 7.0 g/L HgB and 1.9% hematocrit with every 100 mL of blood drawn, while McEvoy et al (2013) indicate that critical care patients average 40–70 mL of blood drawn daily and that every 50 mL of blood drawn increases their risk of moderate to severe HAA by 18%.24,25 Consequently, the ability to automatically predict HAA would enable physicians to switch to small-volume phlebotomy tubes, minimize blood loss from in-dwelling catheters, and reduce blood tests.26 Indeed, Chant et al (2006) found that even small decreases in phlebotomy volume were associated with significantly reduced transfusion requirements in patients with prolonged stays.27

Although there has been some work on predicting anemias, such as classifying iron deficiency anemia using artificial neural networks,28 or predicting moderate to severe anemia for patients with ulcerative colitis using logistic regression (LR),29 we were unable to find any prior work on developing automatic methods for predicting hospital acquired anemia, whether using structured or unstructured data.
Hospital acquired pressure injury (HAPI)

The development of pressure injuries (ie, pressure ulcers or bed sores) can lead to several complications, including sepsis, cellulitis, osteomyelitis, pain, and depression.30 The mortality rate has been noted to be as high as 60% within 1 year of hospital discharge for older patients who develop a pressure ulcer during their stay.31 The Braden scale is the most widely used risk assessment scale for pressure ulcers.32 However, in an external evaluation, Hyun et al (2013) found that the Braden scale shows “insufficient predictive validity and poor accuracy in discriminating intensive care patients at risk of pressure ulcers developing.”33 Keller et al (2002) reported that, “there are no conclusive studies on the identification of pressure ulcer risk factors. None of the existing risk-assessment scales were developed especially for use in intensive care unit (ICU) patients.”4 Automatic prediction of pressure injuries was explored by Schoonhoven et al (2006),34 wherein LR was applied to a small number of structured features. By contrast, we show that our data-driven deep learning approach can reliably detect pressure ulcers for ICU patients without physician interaction or pre-specified feature extraction, allowing for potentially improved patient outcomes.

METHODS

We first present our cohort selection and data preprocessing approaches and then our proposed model and the evaluation against several baselines.

Cohort selection

We selected our retrospective cohort, illustrated in Figure 1, from the MIMIC-III critical care database.35,36 MIMIC, developed by the Massachusetts Institute of Technology (MIT) Laboratory for Computational Physiology to support research in intelligent patient monitoring, is a freely available database containing deidentified health data associated with 46 520 patients.
After excluding admissions with fewer than 2 days of notes, our final cohorts consisted of 34 878 hospital admissions for HAAKI (34.2% prevalence); 35 218 for HAPI (11.1% prevalence); and 38 385 for HAA (39.8% prevalence).

Figure 1. Cohort selection diagram for each experimental cohort where (a), (b), and (c) describe the criteria used to distinguish between the positive and negative admissions for each disease; (d), (e), and (f) present the proportions of positive hospital admissions that met each type of criteria; discharge ICD-9 criteria were based on the Clinical Classifications Software (CCS) diagnosis and procedure categorization scheme provided as part of the Healthcare Cost and Utilization Project (HCUP); and DOE refers to the Date-of-Event for each disease. Admissions that met only ICD-9 criteria were omitted from this study as we were unable to determine their DOE.

Data preparation and preprocessing

To account for irregular gaps in the patient’s hospital visit, we adopt an abstract representation of the patient’s hospital visit which we call their clinical chronology.
We represent the chronology $C$ as a discrete, discontiguous sequence of $L$ snapshots, $s_1, s_2, \cdots, s_L$, where each snapshot encodes the clinical observations documented in any clinical notes produced on the same (calendar) day, and a sequence of elapsed times, $\delta_1, \delta_2, \cdots, \delta_L$, such that $\delta_i$ (for $i > 1$) encodes the number of hours between $s_i$ and $s_{i-1}$, and $\delta_1$ encodes the number of hours between hospital admission and the first clinical note.

Natural language preprocessing

In this work, to evaluate the impact of clinical notes for predicting disease risk, we only considered the clinical observations documented in clinical notes. We extracted the set of observations from each clinical note in 4 steps, illustrated in Figure 2.

Figure 2. Natural language preprocessing used to extract observations from clinical notes.

An initial set of medical concepts corresponding to Unified Medical Language System (UMLS)37 entities was detected using MetaMap Lite.38 In order to account for the physician’s beliefs about each concept, we used FastContext,39 a high-performance reimplementation of ConText,40 to detect the following semantic attributes:
(1) Negation, indicating whether the observation was affirmed or negated;
(2) Certainty, indicating whether the author was certain or uncertain;
(3) Temporality, indicating whether the observation occurred in the present, the past, or is hypothetical; and
(4) Experiencer, indicating whether the observation was associated with the patient or someone else (eg, family).
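As a rough illustration of how extracted observations could be assembled into the clinical chronology described under Data preparation and preprocessing, the sketch below groups notes by calendar day into snapshots and computes the elapsed hours between them. The `Note` class, the `build_chronology` function, and the choice to timestamp each snapshot by its earliest note are assumptions made for this example; they are not taken from the CANTRIP implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass(frozen=True)
class Note:
    """Hypothetical container for the observations extracted from one note."""
    charted_at: datetime
    observations: frozenset


def build_chronology(admitted_at, notes):
    """Group notes by calendar day into snapshots and compute elapsed hours.

    Returns (snapshots, deltas): snapshots[i] is the set of observations
    documented on the i-th day that has notes, and deltas[i] is the number of
    hours since the previous snapshot (since admission for the first one).
    Assumption: a snapshot is timestamped by its earliest note.
    """
    ordered = sorted(notes, key=lambda n: n.charted_at)
    snapshots, deltas = [], []
    previous = admitted_at
    for _, day_notes in groupby(ordered, key=lambda n: n.charted_at.date()):
        day_notes = list(day_notes)
        first = day_notes[0].charted_at
        # Merge the observations of every note written on this calendar day.
        snapshots.append(frozenset().union(*(n.observations for n in day_notes)))
        deltas.append((first - previous).total_seconds() / 3600.0)
        previous = first
    return snapshots, deltas
```

Days without notes simply produce no snapshot, which is what makes the sequence discontiguous; the gap is carried by the corresponding elapsed time instead.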
Sections were recognized and normalized using a large number of hand-crafted regular expression rules previously created for InfoBot.41 We then filtered out all observations that (1) were not affirmed, certain, present, and associated with the patient; (2) occurred in a section corresponding to consults, family history, past medical history, or social history; (3) had a UMLS semantic type not corresponding to a medical problem, intervention, drug, or anatomic region; or (4) belonged to InfoBot’s medical stop word list. Semantic types, rules, and stop words are provided in online Supplementary Appendix A.

Determining the Date-of-Event

We determined the Date-of-Event (DOE) as the first date on which the disease is documented in a clinical note or evidenced by the patient’s labs or chart. Specifically, for each disease, we defined 1 or more (a) seed concepts in the UMLS hierarchy, (b) lexical patterns, and (c) structured criteria using the laboratory, chart, and/or vital sign information in MIMIC. We determined the DOE as the first date on which (1) any observation extracted from a clinical note associated with that date descends from any of the UMLS seed concepts; (2) any observation or any text in the note contains any of the lexical patterns not immediately followed by a colon (to rule out structural matches, eg, “bed sore: none”); or (3) the structured criteria are met. Table 1 provides the seed concepts, lexical patterns, and structured criteria associated with each disease.

Table 1. Criteria used to detect HAAKI, HAPI, and HAA from clinical notes or structured (eg, chart, laboratory) data, where CUI refers to a concept unique identifier in UMLS

| Disease | UMLS Seed CUI | Lexical pattern(s) | Structured criteria |
|---|---|---|---|
| HAAKI | C0022660 (Kidney Failure, Acute) | kidney failure, renal failure, kidney injury, renal injury, AKI | KDIGO11,12 |
| HAPI | C0011127 (Pressure Ulcer) | bed sore, bed ulcer, pressure sore, pressure ulcer, decub* sore, decub* ulcer | NPUAP13 |
| HAA | C0002871 (Anemia) | anemia, anaemia, HAA | WHO14 |

Abbreviations: AKI, acute kidney injury; CUI, concept unique identifier; HAA, hospital acquired anemia; HAAKI, hospital acquired acute kidney injury; HAPI, hospital acquired pressure injury; KDIGO, Kidney Disease Improving Global Outcomes; NPUAP, National Pressure Ulcer Advisory Panel; UMLS, Unified Medical Language System; WHO, World Health Organization. *represents a regular expression wildcard.

Encoding elapsed times

We encoded elapsed times using the sinusoidal representation proposed in Vaswani et al (2017),42 wherein the number of hours elapsed since the previous note, that is, $\delta_i$, is represented as a $K$-dimensional vector consisting of pairs of sinusoidal projections with different frequencies:

$$\delta_{i,2j} = \sin\left(h_i / 10000^{2j/32}\right)$$

$$\delta_{i,2j+1} = \cos\left(h_i / 10000^{2j/32}\right)$$

where $j \in [0, K-1]$ is the index into the vector $\delta_i$, and $h_i$ is the number of hours between $s_i$ and $s_{i+1}$. This representation was chosen because, for any offset $k$, $\delta_{i,j+k}$ reduces to a linear function of $\delta_{i,j}$. We also experimented with other encoding schemes and found the sinusoidal version to be the most effective across all evaluated systems.

Creating positive and negative examples

To train and evaluate our model without manually quantifying the risk of disease for each snapshot in each patient’s chronology, we used the DOE as a form of indirect supervision to produce positive and negative examples, as illustrated in Figure 3.
Specifically, for each positive admission (ie, admissions with chronologies in which the patient eventually develops the disease) we created a labeled example by:
(1) Truncating each chronology to end at the last snapshot occurring 24–96 hours before the DOE;
(2) Defining the prediction window $\Delta$ as the elapsed time (in hours) between the final snapshot (after truncation) and the DOE; and
(3) Assigning the label $y = 1$.

Figure 3. How the prediction windows $\Delta$ and chronologies are determined for positive and negative examples.

To create negative examples, we first grouped positive admissions into buckets based on demographic and admission information including the patient’s age, sex, and race as well as their admitting ICU, source of admission (ie, clinic, physician, transfer, or other), type of admission (ie, elective, emergency, urgent), Oxford Acute Severity of Illness Score,43 and type of insurance (ie, government, private, Medicaid, Medicare, or self pay). For each bucket $b$, we assumed the Time-to-Event (TTE, ie, the number of hours elapsed from hospital admission to DOE) followed a Gamma prior distribution (ie, $\mathrm{TTE} \sim \Gamma(k_b, \theta_b)$) and determined $k_b$ and $\theta_b$ using maximum likelihood estimates over each positive example in the bucket. This allowed us to create labels for our negative examples by:
(1) Determining which bucket $b$ each negative example belonged to;
(2) Sampling $\mathrm{TTE}' \sim \Gamma(k_b, \theta_b)$;
(3) Defining the DOE as either (a) the date obtained by projecting $\mathrm{TTE}'$ from the date of hospital admission or (b) the discharge date, whichever occurred first;
(4) Truncating the chronology to end at the snapshot 24–96 hours before the DOE; and
(5) Defining $\Delta$ as the hours elapsed between the final snapshot (after truncation) and the DOE.
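The negative-example procedure above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: `fit_gamma` uses a well-known closed-form approximation to the Gamma shape estimate rather than an exact maximum likelihood solver, the function and parameter names are hypothetical, timestamps are used where the text speaks of dates, and the snapshot kept is assumed to be the last one falling 24–96 hours before the sampled DOE.

```python
import math
import random
from datetime import datetime, timedelta


def fit_gamma(hours):
    """Approximate maximum likelihood fit of Gamma(k, theta) to positive values.

    Uses a closed-form approximation to the shape estimate; an exact fit
    would refine k numerically (eg, with Newton's method).
    """
    mean = sum(hours) / len(hours)
    s = math.log(mean) - sum(math.log(h) for h in hours) / len(hours)
    if s <= 0:
        raise ValueError("need at least two distinct positive values")
    k = (3 - s + math.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)
    return k, mean / k  # given k, theta = mean / k is the likelihood maximizer


def make_negative_example(admitted_at, discharged_at, snapshot_times, k, theta, rng):
    """Sample a pseudo Date-of-Event and truncate the chronology before it.

    Returns (number of snapshots kept, prediction window in hours), or None
    when no snapshot falls 24-96 hours before the sampled event.
    """
    tte = rng.gammavariate(k, theta)  # sampled Time-to-Event, in hours
    doe = min(admitted_at + timedelta(hours=tte), discharged_at)
    # Walk backwards so that the latest qualifying snapshot is chosen.
    for index in range(len(snapshot_times) - 1, -1, -1):
        window = (doe - snapshot_times[index]).total_seconds() / 3600.0
        if 24 <= window <= 96:
            return index + 1, window
    return None
```

In this sketch `rng` would typically be a `random.Random` instance seeded per experiment, and `k`, `theta` would come from calling `fit_gamma` on the TTE values of the positive admissions in the same bucket.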
Note: negative examples assigned to a bucket without any positive examples were excluded (filtered) from the experiments for that disease (corresponding to “Demographic Filtering” in Figure 1).

Computational approaches

We evaluated 4 computational approaches for predicting nosocomial disease, namely: (1) our proposed CANTRIP, (2) a bidirectional LSTM44 network, and 2 shallow learning systems, (3) LR and (4) a support vector machine45 (SVM).

Recurrent additive network for temporal risk prediction

Given a clinical chronology $C = \{s_1, \cdots, s_L; \delta_1, \cdots, \delta_L\}$ and a prediction window $\Delta$, CANTRIP predicts the probability (ie, risk) that the patient will develop the disease during that window. We factorize this probability into 3 components, each corresponding to 1 of the modules illustrated in Figure 4:

$$P(y \mid C, \Delta) = \overbrace{P(y \mid r)}^{\text{Prediction}} \cdot \overbrace{P(r \mid e_1, \delta_1, e_2, \delta_2, \cdots, e_L, \Delta)}^{\text{Progression}} \cdot \overbrace{\prod_{i=1}^{L} P(e_i \mid s_i \in C)}^{\text{Encoding}}$$

where $y$ indicates whether the patient develops the disease (such that $y = 1$ if the patient develops the disease, and $y = 0$ otherwise), $e_1, \cdots, e_L$ denotes the internal representations of the snapshots $s_1, \cdots, s_L$ learned by the Encoding module, $r$ is the internal representation of the clinical picture of the patient produced by the Progression module, and $\delta_2, \cdots, \delta_L$ represents the elapsed time between successive snapshots. Details on each module are provided below.

Figure 4. The reCurrent Additive Network for Temporal RIsk Prediction (CANTRIP).

Encoding clinical snapshots

The goal of the Encoding module in CANTRIP is to learn an optimal encoding of individual clinical snapshots. Formally, we define the vocabulary $V$ as the set of all unique clinical observations documented in the positive examples. This allows us to represent each clinical snapshot $s$ as a $|V|$-length binary vector such that the $v$-th element in $s$ indicates whether the $v$-th observation in $V$ was observed in that snapshot.
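The binary snapshot representation just described amounts to a multi-hot encoding over the vocabulary. The following minimal sketch shows one way to build it; the function names are hypothetical, sorting the vocabulary is only a convenience for reproducible indices, and silently dropping observations that never occur in the positive examples is an assumption of this example.

```python
def build_vocabulary(positive_snapshots):
    """Map each unique observation in the positive examples to an index."""
    observations = sorted({obs for snapshot in positive_snapshots for obs in snapshot})
    return {obs: index for index, obs in enumerate(observations)}


def encode_snapshot(snapshot, vocabulary):
    """Return a |V|-length binary vector for one snapshot.

    Observations outside the vocabulary are ignored (an assumption here).
    """
    vector = [0] * len(vocabulary)
    for obs in snapshot:
        index = vocabulary.get(obs)
        if index is not None:
            vector[index] = 1
    return vector
```

Because most snapshots mention only a small fraction of the vocabulary, a practical implementation would likely store these vectors in a sparse format before feeding them to the Encoding module.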
CANTRIP incorporates a Residual Highway Network46 (RHN) to learn an embedding $e_i$ for each clinical snapshot $s_i$. Highway networks allow information to flow “around” or across multiple layers, enabling networks with hundreds of layers to be trained efficiently. We used 10 dense layers with batch normalization, L1 regularization, and single-depth residual connections to produce the encoding of the clinical snapshot $e_i$. All dense layers used Gaussian error linear unit47 activations as in Devlin et al (2019).48

Modeling disease progression

To account for the fact that clinical snapshots provide only an incomplete view of the clinical picture of the patient (eg, an EEG report is unlikely to describe a pressure injury or indicate anemia), we must infer the patient’s clinical picture by combining and accumulating information from each embedded clinical snapshot to model the progression of their disease. We accomplished this by (1) casting the inferred clinical picture of the patient as the memory of a Recurrent Neural Network (RNN) and (2) training the RNN to accumulate information about the progression of the patient’s disease by processing each snapshot sequentially. Formally, for each encoded snapshot $e_i \in \{e_1, e_2, \cdots, e_L\}$, the RNN is trained to predict the progression of the patient’s disease after elapsed time $\delta_{i+1} \in \{\delta_2, \delta_3, \ldots, \Delta\}$, such that when considering the final (ie, most recent) encoded snapshot $e_L$ and prediction window $\Delta$, the final output of the RNN, $r$, encodes sufficient information to estimate the probability that the patient will develop the disease within the prediction window $\Delta$.

We used a recurrent additive network49 (RAN) as our RNN implementation in CANTRIP. RANs are a substantial simplification of LSTM44 units and gated recurrent units50 with nearly half the number of learnable parameters and have been shown to yield a number of advantages including avoiding the vanishing gradient problem, improving model performance, and significantly reducing model complexity.
We believe that these properties make them ideally suited for deep learning with limited datasets.

Predicting disease risk

As all of the heavy lifting is accomplished by the Progression and Encoding modules, the disease risk is calculated by estimating the probability that the patient will develop the disease within the prediction window $\Delta$ using a logistic sigmoid projection:

$$P(y \mid r) = \sigma(w_p r + b_p) = \frac{1}{1 + e^{-(w_p r + b_p)}}$$

where $y$ denotes whether the patient develops the disease (ie, $y = 1$ if the patient develops the disease, and $y = 0$ otherwise), $r$ is the encoding of the inferred clinical picture produced by the Progression module, and $w_p$ and $b_p$ denote the learned weight vector and bias value.

Bidirectional LSTM

We also explored the use of a bidirectional LSTM44 network, using a single embedding layer for observations and the final state of the LSTM to predict disease risk as in CANTRIP. The bidirectional LSTM closely resembles CANTRIP if the RHN in the Encoding module were replaced by a single dense layer, and the RAN in the Progression module were replaced with a bidirectional LSTM.

Shallow learning approaches

We evaluated 2 shallow learning approaches: LR and SVMs.45 Both approaches used the set of observations in the final snapshot before the prediction window as their input features.

Evaluation

For each cohort we created training, development, and testing datasets using a stratified 8:1:1 random split based on the demographic and admission criteria illustrated in Figure 3 and, for positive examples, the type of label(s) associated with that chronology as illustrated in Figure 1d–f.51–53 For each system, we report the performance on the test set using the hyperparameters that provided the highest MCC (defined below) on the development set. Hyperparameter optimization is described in online Supplementary Appendix B.
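Selecting hyperparameters by development-set MCC, as described above, could look roughly like the following sketch. The formula is the standard Matthews correlation coefficient computed from confusion-matrix counts; the `select_best` helper, the configuration names, and the convention of returning 0 when the denominator vanishes are assumptions for illustration, not details from the paper.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when any marginal count is zero
    return (tp * tn - fp * fn) / denominator


def select_best(configs):
    """Pick the configuration name with the highest development-set MCC.

    `configs` maps a hypothetical configuration name to its development-set
    confusion counts (tp, fp, tn, fn).
    """
    return max(configs, key=lambda name: matthews_corrcoef(*configs[name]))
```

Unlike Accuracy, this score stays near 0 for a classifier that always predicts the majority class, which is why it is a reasonable selection criterion for cohorts such as HAPI with 11.1% prevalence.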
We incorporated temperature scaling for probability calibration using the development set [54].

Metrics

We report 7 metrics to evaluate the performance of each system for each cohort: Accuracy, Sensitivity (the true positive rate, ie, Recall), Specificity (the true negative rate), Precision (the positive predictive value), the F1 measure (the harmonic mean of Precision and Recall), the area under the receiver operating characteristic (ROC) curve (AUC), and the Matthews correlation coefficient [55] (MCC, a balanced measure useful for comparing systems on imbalanced data [56]). Due to the data imbalance (ie, the low prevalence of HAA, HAPI, and HAAKI) in our dataset, we primarily relied on MCC to compare systems. Additional details on metrics are provided in online Supplementary Appendix C.

RESULTS

Table 2 presents the performance obtained using the best configuration of each system for each disease: HAA, HAPI, and HAAKI. Across all 3 diseases, CANTRIP obtains the highest MCC, Accuracy, AUC, and Precision. It also obtains the highest F1 for HAPI and HAAKI; for HAA, the F1 of biLSTM is marginally higher (58.92% vs 58.42%). biLSTM obtains the highest Sensitivity for HAA and HAAKI, at the cost of the lowest Specificity for those 2 diseases (for HAPI, LR has the lowest Specificity at 79.64%, narrowly below the 79.99% of biLSTM). By contrast, the SVM exhibited the lowest MCC, AUC, and F1 for all 3 diseases. Notably, the SVM retained ≥ 90% of the positive examples as support vectors when training on all 3 diseases, indicating that nosocomial disease prediction from clinical notes cannot be accomplished by focusing only on the most representative examples.

Table 2. Performance of each evaluated system when predicting HAA, HAPI, and HAAKI 24–96 hours before the disease is documented in the clinical notes or directly evidenced by laboratory or chart data

| Disease | System | Accuracy | AUC | Sensitivity | Specificity | Precision | F1 | MCC |
|---|---|---|---|---|---|---|---|---|
| HAA | LR | 64.58% | 65.55% | 47.55% | 74.51% | 52.09% | 49.72% | 0.2252 |
| HAA | SVM | 57.03% | 55.23% | 37.46% | 69.21% | 43.09% | 40.08% | 0.0687 |
| HAA | biLSTM | 62.78% | 70.01% | 72.03% | 57.34% | 49.85% | 58.92% | 0.2844 |
| HAA | CANTRIP | 69.64% | 74.61% | 57.56% | 76.75% | 59.30% | 58.42% | 0.3453 |
| HAPI | LR | 76.87% | 74.34% | 57.91% | 79.64% | 29.37% | 38.98% | 0.2887 |
| HAPI | SVM | 78.32% | 62.78% | 24.05% | 86.26% | 20.38% | 22.06% | 0.0961 |
| HAPI | biLSTM | 78.77% | 80.84% | 70.57% | 79.99% | 34.41% | 46.27% | 0.3844 |
| HAPI | CANTRIP | 83.61% | 87.05% | 71.83% | 85.36% | 42.19% | 53.16% | 0.4632 |
| HAAKI | LR | 64.84% | 64.49% | 44.99% | 77.20% | 55.12% | 49.55% | 0.2327 |
| HAAKI | SVM | 57.07% | 55.24% | 37.69% | 69.13% | 43.18% | 40.25% | 0.0703 |
| HAAKI | biLSTM | 67.50% | 74.04% | 71.97% | 64.69% | 56.16% | 63.09% | 0.3569 |
| HAAKI | CANTRIP | 73.68% | 79.10% | 61.72% | 81.20% | 67.35% | 64.40% | 0.4370 |

Abbreviations: AUC, area under the curve; CANTRIP, reCurrent Additive Network for Temporal RIsk Prediction; biLSTM, bidirectional Long Short-Term Memory network; HAA, hospital acquired anemia; HAAKI, hospital acquired acute kidney injury; HAPI, hospital acquired pressure injury; LR, logistic regression; MCC, Matthews correlation coefficient; SVM, support vector machine.

Figure 5 provides insights on system performance as illustrated by receiver operating characteristic and precision-recall curves.

Figure 5. System performance as illustrated by receiver operating characteristic (ROC) and precision-recall (PR) curves with different decision thresholds indicated. Selecting the decision threshold can be accomplished by (1) determining a clinically actionable level of Precision for the target disease and (2) comparing potential thresholds on ROC and PR curves, with optimal thresholds between 33% and 50% for CANTRIP. In our experiments, we used a threshold of 50%.
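As an illustration of how the threshold-dependent metrics reported in Table 2 relate to one another, the following minimal Python sketch derives them from confusion-matrix counts at a chosen decision threshold. It is not the authors' evaluation code: the function names (`confusion_counts`, `safe_div`, `threshold_metrics`) are hypothetical, and the labels and predicted risks are made-up toy values rather than data from the study. AUC is omitted because it is computed across all thresholds rather than at a single one.

```python
from math import sqrt


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Threshold predicted risks and count (TP, FP, TN, FN)."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def threshold_metrics(tp, fp, tn, fn):
    """Compute the six threshold-dependent metrics from confusion counts."""
    sensitivity = safe_div(tp, tp + fn)   # true positive rate (Recall)
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)     # positive predictive value
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    # F1 is the harmonic mean of Precision and Recall.
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    # MCC uses all four cells, which makes it informative on imbalanced data.
    mcc = safe_div(
        tp * tn - fp * fn,
        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Toy, made-up data (hypothetical): 3 positive and 7 negative examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.45, 0.3, 0.2, 0.2, 0.1, 0.1]

# Compare the two ends of the 33%-50% threshold range discussed for Figure 5.
for threshold in (0.33, 0.50):
    counts = confusion_counts(y_true, y_prob, threshold)
    metrics = threshold_metrics(*counts)
    print(threshold, counts, {k: round(v, 4) for k, v in metrics.items()})
```

On this toy data, lowering the threshold from 50% to 33% raises Sensitivity and lowers Specificity, which is the trade-off a threshold choice has to balance against a clinically actionable level of Precision.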
DISCUSSION

As shown by Table 2, the highest performance for all systems (as measured by MCC) was obtained when predicting HAPI, followed by HAAKI, and finally by HAA. This is unsurprising given that pressure injury risk factors such as mobility, color, texture, and wound care are often documented in free text. By contrast, anemia is typically defined in terms of HgB and hematocrit, which are only sporadically and inconsistently documented in clinical narratives.

In terms of HAA, Khan et al (2017) [29] report an AUC of 69% when predicting moderate to severe anemia based on structured data (including diagnosis of mild anemia) using LR for 789 patients with newly diagnosed ulcerative colitis. CANTRIP obtains similar performance (75% AUC) despite also detecting mild anemias and without considering structured data such as albumin, HgB, or hematocrit.

When predicting HAPI, Schoonhoven et al (2006) [34] report an AUC of 70% when using LR and a rule-based classifier on hand-chosen features. A retrospective analysis of the Braden scale reports an AUC of 62%, with 18% Precision and 29% F1 [57]. In another retrospective study of the Braden scale, Hyun et al (2013) [33] report an AUC of 67%, with 14% Precision and 24% F1. These results all use different criteria for pressure injury classification as well as different data sets and study designs, so the comparison is only indicative. Even so, despite requiring no manual labeling, feature extraction, or physician interaction, CANTRIP obtains a higher AUC (87%) and substantially higher Precision (42%) and F1 (53%) than both manual and rule-based prediction systems designed specifically for predicting pressure injuries.
This suggests that deep learning is able to extract meaningful signals for predicting pressure injury from clinical texts.

Finally, for HAAKI, Mohamadlou et al (2018) [20] report an AUC of 67% with 80% Accuracy when predicting moderate to severe AKI up to 72 hours in advance based on structured data. CANTRIP obtains higher AUC (79%) when predicting any type of AKI (including mild) up to 96 hours in advance, based only on clinical notes. DeepMind reports an ROC AUC of 93% with 33% Precision and a Precision-Recall AUC of 29.7% when predicting AKI up to 48 hours in advance using data collected for 703 782 adult patients from all available sites in the US Department of Veterans Affairs and including 620 000 features [22]. By contrast, with significantly less data, a more diverse population, and no AKI-specific feature extraction, CANTRIP obtains lower AUC (79%) but more than double the Precision (67%) and Precision-Recall AUC (74%) when predicting up to 96 hours in advance, indicating that clinical text alone provides important cues for predicting HAAKI.

In terms of probability calibration, we measured an Expected Calibration Error [54] of 2.69%, 1.66%, and 2.36% for HAA, HAPI, and HAAKI, respectively, indicating that CANTRIP under-estimated the empirical probability of each disease and may benefit from site-specific calibration. Likewise, selection of a decision threshold depends on both the disease and the intended use of the model as guided by Figure 5. Reliability plots are provided in online Supplementary Appendix D.

Overall, the relative performance of CANTRIP, when compared to baseline and disease-specific models reported in the literature, suggests not only that CANTRIP can generalize across nosocomial diseases, but also that clinical notes provide meaningful information for prognostication of disease.

Limitations

The primary limitation of this study is the fact that all systems relied only on features extracted from clinical notes.
This was a deliberate design decision: while our ultimate goal is to combine textual and structured features, we were interested in first examining the power of text alone for predicting nosocomial disease. An additional limitation is that features only indicated the presence or absence of observations, signs, interventions, etc, meaning that values reported in the text, such as “HgB: 7.5,” are not available to the model. Rather than parsing and extracting this information, we aim in future work to combine both clinical notes and structured data as features. For computational reasons, we chose to exclude hypothetical and negated mentions of observations; we expect that providing explicitly negated information to the model may further improve Specificity, and we are exploring this as well. Finally, we plan to explore and further validate CANTRIP using patient data from other clinical sites.

CONCLUSION

We presented and evaluated a deep learning model that harnesses clinical notes to predict nosocomial disease for critical care patients. We showed how CANTRIP can be trained without direct ground-truth risk labels to predict 3 nosocomial diseases 24–96 hours in advance: HAAKI, HAPI, and HAA. Our experimental results indicate not only that CANTRIP outperforms traditional (shallow) learning approaches and a competitive deep learning baseline, but also that, despite considering only non-disease-specific features extracted from clinical notes, CANTRIP performs competitively with disease-specific systems relying on hand-chosen structured features or hand-crafted rules.

FUNDING

This work was supported by the intramural research program at the US National Library of Medicine, National Institutes of Health.

AUTHOR CONTRIBUTIONS

TG and DDF conceptualized the study. DDF oversaw study design and reviewed and helped analyze the findings.
TG designed and implemented the systems, collected and processed the data, performed data analysis, and conducted the experiments and evaluations. TG and DDF contributed to writing and editing of the manuscript.

ACKNOWLEDGMENTS

This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). We would like to thank Dr. Clement McDonald for his advice on criteria coding in MIMIC. Figure 2 was designed using assets from MacroVector at Freepik.com.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

1 Magill SS, Edwards JR, Bamberg W, et al. Multistate point-prevalence survey of health care–associated infections. N Engl J Med 2014; 370 (13): 1198–208.
2 Schmier JK, Hulme-Lowe CK, Semenova S, et al. Estimated hospital costs associated with preventable health care-associated infections if health care antiseptic products were unavailable. Clinicoecon Outcomes Res 2016; 8: 197–205.
3 Henderson JM, Blackstone EH, Hixson ED, et al. Hospital-acquired anemia: prevalence, outcomes, and healthcare implications. J Hosp Med 2013; 8: 506–12.
4 Keller B, Wille J, van Ramshorst B, et al. Pressure ulcers in intensive care patients: a review of risks and prevention. Intensive Care Med 2002; 28 (10): 1379–88.
5 Silver SA, Long J, Zheng Y, et al. Cost of acute kidney injury in hospitalized patients. J Hosp Med 2017; 12 (2): 70–6.
6 Goldstein BA, Navar AM, Pencina MJ, et al. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J Am Med Inform Assoc 2017; 24 (1): 198–208.
7 Chase HS, Mitrani LR, Lu GG, et al. Early recognition of multiple sclerosis using natural language processing of the electronic health record. BMC Med Inform Decis Mak 2017; 17 (1): 24.
8 Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform 2009; 42 (5): 760–72.
9 Collobert R, Weston J, Bottou L, et al. Natural language processing (almost) from scratch. J Mach Learn Res 2011; 12: 2493–537.
10 Goodwin T, Harabagiu SM. A probabilistic reasoning method for predicting the progression of clinical findings from electronic medical records. In: Proceedings of the American Medical Informatics Association (AMIA) Joint Summit on Clinical Research Informatics; March 25–27, 2015; San Francisco, California.
11 Goodwin T, Harabagiu SM. A predictive chronological model of multiple clinical observations. In: International Conference on Healthcare Informatics (ICHI); October 21–23, 2015; Dallas, TX.
12 Goodwin T, Harabagiu SM. Inferring the interactions of risk factors from EHRs. AMIA Jt Summits Transl Sci Proc 2016; 2016: 78–87.
13 Chertow GM, Burdick E, Honour M, et al. Acute kidney injury, mortality, length of stay, and costs in hospitalized patients. J Am Soc Nephrol 2005; 16 (11): 3365–70.
14 Mehta RL, Pascual MT, Soroko S, et al. Spectrum of acute renal failure in the intensive care unit: the PICARD experience. Kidney Int 2004; 66 (4): 1613–21.
15 Koyner JL, Adhikari R, Edelson DP, et al. Development of a multicenter ward-based AKI prediction model. Clin J Am Soc Nephrol 2016; 11 (11): 1935–43.
16 Flechet M, Güiza F, Schetz M, et al. AKIpredictor, an online prognostic calculator for acute kidney injury in adult critically ill patients: development, validation and comparison to serum neutrophil gelatinase-associated lipocalin. Intensive Care Med 2017; 43 (6): 764–73.
17 Kate RJ, Perez RM, Mazumdar D, et al. Prediction and detection models for acute kidney injury in hospitalized older adults. BMC Med Inform Decis Mak 2016; 16 (1): 39.
18 Porter CJ, Juurlink I, Bisset LH, et al. A real-time electronic alert to improve detection of acute kidney injury in a large teaching hospital. Nephrol Dial Transplant 2014; 29: 1888–93.
19 Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001; 29 (5): 1189–232.
20 Mohamadlou H, Lynn-Palevsky A, Barton C, et al. Prediction of acute kidney injury with a machine learning algorithm using electronic health record data. Can J Kidney Health Dis 2018; 5. doi: 10.1177/2054358118776326.
21 Khwaja A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin Pract 2012; 120: c179–84.
22 Tomašev N, Glorot X, Rae JW, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 2019; 572 (7767): 116–9.
23 Rawal G, Kumar R, Yadav S, et al. Anemia in intensive care: a review of current concepts. J Crit Care Med 2016; 2 (3): 109–14.
24 Thavendiranathan P, Bagai A, Ebidia A, et al. Do blood tests cause anemia in hospitalized patients? The effect of diagnostic phlebotomy on hemoglobin and hematocrit levels. J Gen Intern Med 2005; 20 (6): 520–4.
25 McEvoy MT, Shander A. Anemia, bleeding, and blood transfusion in the intensive care unit: causes, risks, costs, and new strategies. Am J Crit Care 2013; 22 (6): eS1–13.
26 Harber CR, Sosnowski KJ, Hegde RM. Highly conservative phlebotomy in adult intensive care: a prospective randomized controlled trial. Anaesth Intensive Care 2006; 34 (4): 434–7.
27 Chant C, Wilson G, Friedrich JO. Anemia, transfusion, and phlebotomy practices in critically ill patients with prolonged ICU length of stay: a cohort study. Crit Care 2006; 10 (5): R140.
28 Azarkhish I, Raoufy MR, Gharibzadeh S. Artificial intelligence models for predicting iron deficiency anemia and iron serum level based on accessible laboratory data. J Med Syst 2012; 36 (3): 2057–61.
29 Khan N, Patel D, Shah Y, et al. A novel model for predicting incident moderate to severe anemia and iron deficiency in patients with newly diagnosed ulcerative colitis. Dig Dis Sci 2017; 62 (5): 1295–304.
30 Brem H, Maggi J, Nierman D, et al. High cost of stage IV pressure ulcers. Am J Surg 2010; 200 (4): 473–7.
31 Thomas DR, Goode PS, Tarquine PH, et al. Hospital-acquired pressure ulcers and risk of death. J Am Geriatr Soc 1996; 44 (12): 1435–40.
32 Bergstrom N, Demuth PJ, Braden BJ. A clinical trial of the Braden Scale for Predicting Pressure Sore Risk. Nurs Clin North Am 1987; 22 (2): 417–28.
33 Hyun S, Vermillion B, Newton C, et al. Predictive validity of the Braden scale for patients in intensive care units. Am J Crit Care 2013; 22: 514–20.
34 Schoonhoven L, Grobbee DE, Donders ART, et al. Prediction of pressure ulcer development in hospitalized patients: a tool for risk assessment. Qual Saf Health Care 2006; 15 (1): 65–70.
35 Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3 (1): 160035.
36 Goldberger AL, Amaral LA, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 2000; 101 (23): E215–20.
37 Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med 1993; 32: 281–91.
38 Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc 2017; 24 (4): 841–4.
39 Shi J, Hurdle JF. Trie-based rule processing for clinical NLP: a use-case study of n-trie, making the ConText algorithm more efficient and scalable. J Biomed Inform 2018; 85: 106–13.
40 Chapman WW, Hillert D, Velupillai S, et al. Extending the NegEx lexicon for multiple languages. Stud Health Technol Inform 2013; 192: 677–81.
41 Demner-Fushman D, Seckman C, Fisher C, et al. A prototype system to support evidence-based practice. AMIA Annu Symp Proc 2008; 2008: 151–5.
42 Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, et al., eds. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017; December 4–9, 2017: 5998–6008; Long Beach, CA. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf Accessed September 18, 2019.
43 Johnson AEW, Kramer AA, Clifford GD. A new severity of illness scale using a subset of Acute Physiology and Chronic Health Evaluation data elements shows comparable predictive accuracy. Crit Care Med 2013; 41 (7): 1711–8.
44 Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997; 9 (8): 1735–80.
45 Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995; 20 (3): 273–97.
46 Srivastava RK, Greff K, Schmidhuber J. Training very deep networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA: MIT Press; 2015: 2377–85.
47 Hendrycks D, Gimpel K. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415. Published Online First: 27 June 2016. http://arxiv.org/abs/1606.08415 Accessed September 12, 2019.
48 Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); June 2–7, 2019; Minneapolis, MN.
49 Lee K, Levy O, Zettlemoyer L. Recurrent Additive Networks. arXiv preprint arXiv:1705.07393. Published Online First: 21 May 2017.
50 Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Gordon G, Dunson D, Dudík M, eds. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics; April 11–13, 2011; PMLR: Fort Lauderdale, FL. http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf Accessed September 21, 2016.
51 Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. In: Gunopulos D, Hofmann T, Malerba D, et al., eds. Machine Learning and Knowledge Discovery in Databases. Heidelberg: Springer; 2011: 145–58.
52 Szymański P, Kajdanowicz T. A network perspective on stratification of multi-label data. In: Torgo L, Krawczyk B, Branco P, et al., eds. Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications. ECML-PKDD. Skopje, Macedonia: PMLR; 2017: 22–35.
53 Szymański P, Kajdanowicz T. A scikit-based Python environment for performing multi-label classification. arXiv preprint arXiv:1702.01460. Published Online First: 5 February 2017. http://arxiv.org/abs/1702.01460 Accessed September 16, 2019.
54 Guo C, Pleiss G, Sun Y, et al. On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org; 2017: 1321–30. http://dl.acm.org/citation.cfm?id=3305381.3305518 Accessed November 27, 2019.
55 Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta BBA - Protein Struct 1975; 405 (2): 442–51.
56 Boughorbel S, Jarray F, El-Anbari M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS One 2017; 12 (6): e0177678.
57 Cho I, Noh M. Braden Scale: evaluation of clinical usefulness in an intensive care unit. J Adv Nurs 2010; 66 (2): 293–302.

Published by Oxford University Press on behalf of the American Medical Informatics Association 2020. This work is written by US Government employees and is in the public domain in the US. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

Journal of the American Medical Informatics Association – Oxford University Press

Published: Apr 1, 2020
