SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research

Downloaded from https://academic.oup.com/jamia/article/25/5/530/4817428 by DeepDyve user on 18 July 2022

Journal of the American Medical Informatics Association, 25(5), 2018, 530–537
doi: 10.1093/jamia/ocx160
Advance Access Publication Date: 19 January 2018
Research and Applications

Honghan Wu (1,2), Giulia Toti (3), Katherine I Morley (3,4), Zina M Ibrahim (1,5), Amos Folarin (1,5), Richard Jackson (1), Ismail Kartoglu (6), Asha Agrawal (7), Clive Stringer (7), Darren Gale (7), Genevieve Gorrell (8), Angus Roberts (8), Matthew Broadbent (9), Robert Stewart (9,10), and Richard JB Dobson (1,5)

(1) Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
(2) School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
(3) National Addiction Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
(4) Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Australia
(5) Farr Institute of Health Informatics Research, University College London, London, UK
(6) InterDigital Europe, London, UK
(7) King’s College Hospital NHS Foundation Trust, London, UK
(8) Department of Computer Science, University of Sheffield, Sheffield, UK
(9) South London and Maudsley NHS Foundation Trust, London, UK
(10) Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK

Corresponding Author: Dr Honghan Wu, Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, SE5 8AF, UK. E-mail: honghan.wu@kcl.ac.uk. Phone: +442078480924.

The manuscript has been submitted to JAMIA.

Received 1 October 2017; Revised 28 November 2017; Editorial Decision 22 December 2017; Accepted 8 January 2018

ABSTRACT

Objective: Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs.

Methods: SemEHR implements a generic information extraction (IE) and retrieval infrastructure by identifying contextualized mentions of a wide range of biomedical concepts within EHRs. Natural language processing annotations are further assembled at the patient level and extended with EHR-specific knowledge to generate a timeline for each patient. The semantic data are serviced via ontology-based search and analytics interfaces.

Results: SemEHR has been deployed at a number of UK hospitals, including the Clinical Record Interactive Search, an anonymized replica of the EHR of the UK South London and Maudsley National Health Service Foundation Trust, one of Europe’s largest providers of mental health services. In 2 Clinical Record Interactive Search–based studies, SemEHR achieved 93% (hepatitis C) and 99% (HIV) F-measure results in identifying true positive patients. At King’s College Hospital in London, as part of the CogStack program (github.com/cogstack), SemEHR is being used to recruit patients into the UK Department of Health 100 000 Genomes Project (genomicsengland.co.uk). The validation study suggests that the tool can validate previously recruited cases and is very fast at searching phenotypes; time for recruitment criteria checking was reduced from days to minutes.
Validated on open intensive care EHR data, Medical Information Mart for Intensive Care III, the vital signs extracted by SemEHR can achieve around 97% accuracy.

Conclusion: Results from the multiple case studies demonstrate SemEHR’s efficiency: weeks or months of work can be done within hours or minutes in some cases. SemEHR provides a more comprehensive view of patients, bringing in more and unexpected insight compared to study-oriented bespoke IE systems. SemEHR is open source, available at https://github.com/CogStack/SemEHR.

Key words: secondary use of EHR, information extraction, NLP, semantic search, ontology, FHIR, patient recruitment

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

BACKGROUND

The opportunity for secondary use of the wealth of information contained within electronic health records (EHRs) has attracted researchers interested in investigating approaches to provide more tailored and timely care, improve efficiency of services, and derive new scientific and medical insights [1–4]. In addition to structured data contained within relational database tables (such as International Classification of Diseases, Tenth Revision [ICD-10] diagnoses codes), EHR documents are filled with unstructured clinical notes, such as nursing records, radiology reports, and discharge summaries [5–7]. These notes add a richness and depth to EHR-based studies, providing data and insight beyond what is available within the thin layer of data stored within structured fields.

Deriving actionable insights from the EHR, including the unstructured component, is challenging. It requires bringing together expertise in the clinical domain, the underlying health care information systems, and text analytics techniques, eg, natural language processing (NLP). For example, the Clinical Record Interactive Search (CRIS) system, an anonymized replica of the EHR used in the South London and Maudsley (SLaM) National Health Service (NHS) Foundation Trust in the UK, was designed to support clinical and scientific studies. Since its launch in 2009, a large number of studies ([9–13], to name a few) have used the CRIS resource in conjunction with NLP or text-mining techniques. Although these studies answered different clinical questions, the technical requirements for extracting, structuring, and making sense of the data largely overlapped, and included (1) preprocessing and cleansing corpus-related documents (eg, removing misleading form headings from scanned documents); (2) compiling and recognizing common medical terminology (eg, the antipsychotic medication identification problem is almost the same in [10,11]); and (3) deriving patient-level clinical signals from document-level NLP annotations (eg, understanding that a medication prescribed at admission was removed from the patient’s discharge medication list).

As unstructured EHR data are inevitably needed by many research projects and clinical studies, more cost-effective and systematic solutions are needed to address the common challenges presented by different use cases, while also ensuring that study-specific requirements are not compromised by the unified approach. To address such challenges, we propose SemEHR, a semantic search and analytical system that generates a complete and processable view of patients from their clinical notes.

To realize a general-purpose biomedical information extraction (IE) system for EHRs, there are at least 3 fundamental challenges: (1) syntactic heterogeneity: how to effectively access multimodal/multisource EHR data that are almost certainly heterogeneous in format, data model, and access interface; (2) knowledge coverage: how to cover all possible biomedical concepts that are required by potential use cases; and (3) context capturing: how to represent and capture the contexts associated with extracted concepts and determine which are critical to understand the clinical domain. To address these challenges, SemEHR uses a production infrastructure that integrates our previous work in the CogStack pipeline to harmonize and cleanse heterogeneous records, using them to identify contextualized mentions (negation, temporality, and experiencer) of a wide range of biomedical concepts, including Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) (http://www.snomed.org/snomed-ct), ICD-10 (http://apps.who.int/classifications/icd10/browse/2010/en), Logical Observation Identifiers Names and Codes (LOINC) (https://loinc.org/), and Drug Ontology (https://ontology.atlassian.net/wiki/spaces/DRON/overview). In addition, SemEHR automatically associates semantic types of annotations and their clinical contexts (derived from documents or sections) with dedicated extraction rules, which enables better IE capabilities, such as populating structured vital sign data from observation notes.

It is well appreciated that a one-size-fits-all approach needs to be adapted to work effectively in different scenarios. Therefore, to serve different use cases well, we require the capability to extend the terminology of the general-purpose IE system to cover unseen concepts, deal with language specificities in a subcorpus, support use case–specific extraction requirements, and enable performance fine-tuning, eg, by incorporating specific knowledge or researchers’ expertise. SemEHR provides a study-based (use case–specific) learning engine that enables iterative learning and feedback. It collects user feedback and uses rule-based and machine learning techniques to tackle study-specific challenges and requirements in a continuous manner.

A few hurdles prevent the effective consumption of extracted data from general-purpose IE systems in scientific research and clinical studies. To fulfill requirements by various studies, developing general-purpose IE systems is inevitable in order to adopt large terminologies that users might not be familiar with. This poses challenges in (1) mapping look-up concepts to terminology terms, (2) translating clinical relations to term associations, and (3) exploiting terminology semantics to bring unexpected or unperceived new insights. At the consumption level, SemEHR implements an ontology-based semantic search component to tackle such challenges.

Last, and probably most important, EHRs represent a timeline of multiple patient interactions with services. As such, the document-level IE results should be integrated at the patient level to incorporate temporal and macrocontextual information (which reports, which visits, etc., as opposed to the sentence-based contextual information discussed above). Only after this integration is the EHR IE task complete. However, this requires a thorough understanding of the EHR system. SemEHR provides a multiperspective view of each patient by assembling NLP annotations at the patient level as longitudinal views and compiling structured medical profiles. Both the NLP results and the patient timeline are made available via an ontology-based search system, which effectively turns common IE tasks into semantic search queries. The interface provides a multiperspective view of each patient by assembling NLP annotations at the patient level as longitudinal views and compiling structured medical profiles.

METHOD

Data model and longitudinal patient views

As depicted in Figure 1A, SemEHR is built upon 6 types of entities: patient, clinical note, concept, concept mention, medical profile, and profile aspect. Each patient is associated with a list of dated and typed clinical notes. From these notes, SemEHR identifies mentions of a wide range of biomedical concepts from the Unified Medical Language System (UMLS) [15,16], a compendium of many controlled vocabularies, including SNOMED CT, ICD-10, LOINC, Drug Ontology, and Gene Ontology. By analyzing the context of its appearance, each mention is associated with 3 pieces of dimensional contextual information: negation, temporality, and experiencer. Highlighted in green in Figure 1A, the associations between concepts (eg, Steatohepatitis is a liver disease; Ribavirin is a drug for treating hepatitis C) are made available to conduct semantically enriched computations by incorporating the various biomedical ontologies and Linked Open Data (https://en.wikipedia.org/wiki/Linked_data) such as DBpedia [17] and Wikidata [18]. SemEHR derives periodical medical profiles from a patient’s clinical notes, automatically generated medical summaries consisting of a set of profile aspects (sections describing different aspects of a medical profile, eg, past medical history, medications, etc.) for a defined period of time. Concept mentions are assigned to these aspects according to their appearance in the original clinical notes. As the rectangle boxes in Figure 1A show, SemEHR entities are mapped to Fast Healthcare Interoperability Resources (FHIR) (https://www.hl7.org/fhir/overview.html) entities whenever possible.

Figure 1. (A) SemEHR data model: entities (patient, clinical note, concept, and concept mentions) and their associations. (B) SemEHR generates 2 longitudinal views for each patient: concept mentions grouped in typed and dated documents (upper part), and concept mentions grouped in structured (discharge) summaries (lower part).

Based on this data model, SemEHR populates 2 longitudinal views (shown in Figure 1B) for each patient. As shown in the upper part of Figure 1B, the first view is generated directly from the raw data. Concept mentions are organized in a list of clinical notes that are located on a timeline according to their date attributes (eg, the created date/time of the clinical notes). Wherever possible, types of clinical notes (such as GP Letter, Radiology, or Discharge Summary) are presented.

The second view (lower part of Figure 1B) is designed to convey structured summaries for a patient, each of which summarizes the patient’s medical history/conditions in a period of time (eg, an inpatient hospital stay). A summary is composed of groups of concept mentions, where each group is about one particular aspect of the patient’s medical profile, eg, past medical history, medications, or physical exams. Preferably, such summaries are derived from discharge summaries. When discharge summaries are not available, an automated summarization approach is applied to generate the summaries based on the contextual information of the concept types and concept mentions. Automated summaries are differentiated from those generated from discharge summaries. Supplementary Material 2 describes the detailed process of automated medical profile generation.

Architecture: generic and adaptive information extraction and retrieval

As illustrated in Figure 2, SemEHR is composed of 3 subsystems: the producing subsystem, the continuous learning subsystem, and the consuming subsystem.

The producing subsystem

Essentially, the producing subsystem extracts free-text clinical notes from heterogeneous underlying EHR systems, populating the data model described in the previous section. This task is performed in 3 main steps: data retrieval, information extraction, and semantic indexing. CogStack [14], a data harmonization and enterprise search toolkit for EHRs, is adopted in the data retrieval step to provide a unified interface with unstructured EHR data, which is often very heterogeneous in format and distributed in storage. Each document that flows out from the data retrieval component is fed into the NLP pipeline, which embeds Bio-YODIE (https://gate.ac.uk/applications/bio-yodie.html), an NLP pipeline dedicated to annotating UMLS concepts in clinical notes (“documents” hereafter). (Bio-YODIE was developed as part of the EU KConnect project, in which GG, AR, HW, RS, and RD are involved.) Emerging from the NLP pipeline are the documents and concept mentions extracted from them, which are then analyzed by the Semantic Index component before being indexed. The analysis involves deriving document types (eg, Radiology, GP Letter, or Discharge Summary), parsing document structure (eg, identifying headed blocks from discharge summaries), and associating concept mentions with document structures. The analysis results, document content, and NLP outputs are finally indexed by an Elasticsearch (https://www.elastic.co/products/elasticsearch) cluster. Patient-level summaries are generated as described in the previous section. These summaries are updated as new documents are added to the index.
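The data model described above (patients with dated, typed clinical notes whose concept mentions carry negation, temporality, and experiencer context) can be illustrated with a minimal sketch. This is not SemEHR's actual schema or code: the class names, field names, and context labels ("recent", "patient", and so on) are assumptions chosen for illustration, and the rule for a "positive" mention is one plausible reading of the three context dimensions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class ConceptMention:
    """One mention of a UMLS concept plus its three context dimensions."""
    cui: str                      # UMLS concept identifier
    negated: bool = False
    temporality: str = "recent"   # assumed labels: recent / historical / hypothetical
    experiencer: str = "patient"  # assumed labels: patient / other


@dataclass
class ClinicalNote:
    """A dated, typed clinical note holding the mentions extracted from it."""
    patient_id: str
    doc_type: str
    created: date
    mentions: List[ConceptMention] = field(default_factory=list)


def is_positive(mention: ConceptMention) -> bool:
    """Assumed rule: affirmed, current, and about the patient."""
    return (
        not mention.negated
        and mention.temporality == "recent"
        and mention.experiencer == "patient"
    )


def longitudinal_view(notes: List[ClinicalNote]) -> Dict[str, List[ClinicalNote]]:
    """Group notes per patient, ordered by creation date (the first view)."""
    timeline: Dict[str, List[ClinicalNote]] = defaultdict(list)
    for note in sorted(notes, key=lambda n: n.created):
        timeline[note.patient_id].append(note)
    return dict(timeline)


# Toy data; the patient IDs and dates are invented for the example.
notes = [
    ClinicalNote("p1", "Discharge Summary", date(2015, 3, 2),
                 [ConceptMention("C0019196")]),
    ClinicalNote("p1", "GP Letter", date(2014, 1, 5),
                 [ConceptMention("C0019196", negated=True)]),
    ClinicalNote("p2", "Radiology", date(2016, 6, 1)),
]
view = longitudinal_view(notes)
```

Keeping the context dimensions on the mention rather than on the note is what lets the same index answer both "any mention" and "positive mention only" queries without re-running the NLP pipeline.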
Figure 2. The architecture of SemEHR is composed of 3 subsystems: (1) the producing subsystem (upper part of the figure), creation of SemEHR semantic index by harmonizing, natural language processing, and indexing EHR data; (2) the continuous learning subsystem, addressing study-specific requirements and supporting fine-tuning for separate studies; and (3) the consuming subsystem (lower part), supporting tailored care, patient recruitment, and clinical research by semantic searching and study-based continuous learning.

SemEHR aims to produce annotations with accurate contextual information. Three components work collectively to achieve this goal: the Bio-YODIE pipeline captures the sentence/paragraph-level contexts (eg, negation, hypothetical mentions); the semantic index’s analyzer brings in section/document-level context (eg, past medical history, laboratory results); and the continuous learning subsystem (described in the next subsection) learns the contexts from user-assessed annotations (see Supplementary Material 1 for details).

The continuous learning subsystem

To accommodate the uniqueness of the IE requirements of different studies, SemEHR is designed with a continuous learning subsystem to iteratively address study-specific issues. The system collects and analyzes user feedback from an annotation component embedded within the user interface. Based on the analyzed feedback, 2 components are used to improve the IE results. The first is a rule engine, which generates and applies rules for filtering out unwanted results, eg, removing concept mentions based on their original string or surrounding text. The second component is a machine learning engine (a bidirectional recurrent neural network model), which takes user feedback as training data, applies the trained model on the study’s corpus, and populates a confidence value for each concept mention. Confidence values are used as quantified indicators in analytic components for populating results. The user interface for collecting feedback and the continuous learning mechanisms are explained in detail in Supplementary Material 1.

The consuming subsystem

This subsystem consists of a set of components that utilize IE results and clinical knowledge (accessed from biomedical ontology and Linked Open Data application programming interfaces) to support tasks such as patient characterization or trial recruitment. A consuming task is called a “study” in SemEHR. Each study will have its own storage within SemEHR’s Study Knowledge Graph (KG) (bottom of the Storage section in Figure 2), which stores its study parameters (eg, cohort definition and metadata), search settings (eg, query concepts), study results (eg, selected cohort and exported features), and customized rules (eg, regular expressions to remove unwanted annotations). There is also a common KG (Common KG in Figure 2), where sharable knowledge or efforts (such as manually selected concepts of alcohol-related liver diseases or postprocessing rules for improving NLP results) are made available to other studies.

Key functionalities of the consuming subsystem include the following:

• Translating search terms to query concepts. This translates the user’s keyword searches (which are often ambiguous or incomplete) into semantically clear concepts (identified using UMLS Concept Unique Identifiers [CUIs]). The correct translation is essential to ensure the soundness and completeness of search and analytics results. Unfortunately, in the clinical scenario, it is often not a trivial task to compile an accurate and complete concept list even for a single clinical signal. For example, one SemEHR case study needs to look up patients with alcohol-related liver disease. Given a general clinical term such as “liver disease,” it would be time-consuming to compile a list of all subtypes of liver disease that are also alcohol-related. As depicted in section A of Figure 3, SemEHR provides 2 functions for supporting concept translation: (1) matching search terms to concepts, which is enhanced with logical reasoning to automatically include semantically related concepts and EHR-based exclusion to remove concepts that do not exist in EHRs of the study cohort; and (2) validating automatically populated lists, to allow manual assessment by the researchers.

• Selecting and summarizing a cohort. Each query submitted to SemEHR will result in a cohort, a list of patients who match the query. As shown in Figure 3B, a summary table is generated for the matching cohort. Each row summarizes a patient, and the first column shows the patient ID. The second one shows the total number of mentions of the search concepts within this patient’s EHR, followed by numbers of 4 contextualized mentions: positive mentions, history/hypothetical mentions, negated mentions, and mentions associated with other experiencers. Clicking on the numbers brings the user to the clinical notes, where corresponding mentions are highlighted (lower part of Figure 3B).

• Generating patient views and structured medical profile. As a generic IE and retrieval platform, SemEHR processes all EHR records for patients and tries to identify a wide range of biomedical concepts from them. This enables it to produce a panorama for each patient. As shown in Figure 3C, 3 different views are generated for each patient:

  • The first view is the longitudinal document view (upper part of Figure 3C), which lists all patient documents in chronological order, labels documents using their types, and ticks those documents that match the query. This view delivers the abundance of a patient’s records, the prevalence of matched documents, and their temporal distributions.

  • The second view is the structured medical profile (lower part of Figure 3C), which is automatically derived from the patient’s clinical notes and structured using extended FHIR discharge summary format (23 sections of the FHIR discharge summary [http://hl7.org/fhir/us/ccda/2017Jan/StructureDefinition-CCDA-on-FHIR-Discharge-Summary.html] are extended with an additional 8 headings). This structured summary enhances SemEHR’s search and IE ability. For example, by constraining the search field to “Family History,” one can get a cohort of patients with a family history of a certain disorder. In addition, knowing that a piece of text appeared in the “Hospital Discharge Physical,” sophisticated rules can be applied to extract more structured data, such as vital signs.

  • The third view is the view of vital signs and other measurements (middle part of Figure 3C). This is automatically generated by applying IE rules on the latest structured summary of a patient.

Figure 3. Screenshots of key functionalities provided by the consuming subsystem. (A) Identifying query concepts (UMLS CUIs): facilities to ensure the correct and complete concepts are used in the query to derive accurate clinical findings. (A1) Concept search for matching a user search term to one or more ontology (UMLS) concepts; logical reasoning is implemented to enable the automated inclusion of semantically related concepts (eg, hepatocellular damage is liver damage). (A2) Concept validation component for checking and approving the automated inferred concepts based on the aim and criteria of the clinical study (eg, only retain alcohol-related liver conditions for addiction analytics). (B) Selecting and summarizing cohort (the full text in the screenshot has been deliberately rewritten to avoid leaking sensitive patient data). A summary table is generated for a user query where each row summarizes the numbers of total mentions and contextualized mentions for one patient. (C) Patient timeline: longitudinal document view (upper), structured medical profile view (based on FHIR discharge summary format), and the view of latest vital signs and other measurements.

Based on these key functionalities, SemEHR provides a set of search interfaces to surface the clinical variables hidden in clinical notes. A typical query, such as “return all patients with a family history of hepatitis C,” previously might have required the end user to have NLP expertise, eg, be able to do named entity recognition for “hepatitis C” that must be mentioned in the context of “family history.” Using SemEHR, the end user can put in a simple keyword search: “hepatitis C.” To fulfill this search, SemEHR will pull out the cohort of relevant patients, populate patient-level summaries (ie, numbers of contextualized concept mentions, such as patient has 16 total mentions of the disease, 15 of them positive and 1 about a family member), and provide a link to each mention in the original source clinical note (similar to the UI illustrated in Figure 3B).

RESULTS

This section reports the experiments and results from 3 EHR systems focusing on evaluating SemEHR’s capacities in semantic search, analytics, and clinical decision-making support. The evaluation on its natural language processing (Bio-YODIE) performance is available in Supplementary Material 3.

Studies conducted on CRIS data of South London and Maudsley Hospital

SemEHR has been deployed on the anonymized psychiatric records database CRIS, which contains a total of 18 million free-text documents from South London and Maudsley Hospital, one of Europe’s largest mental health providers (serving 1.2 million residents). In the CRIS clinical notes, SemEHR identified 46 million mentions of concepts, the predominant ones being pharmacologic substances (16 million), mental or behavioral dysfunction (12 million), and sign or symptom (3.8 million). In a CRIS-based liver disease study, SemEHR identified (in the context of an information retrieval task) 94 instances out of 100 hepatitis C–positive patients that were manually annotated (based on structured blood test data). In an HIV study, a random 1000-patient cohort was selected, and SemEHR identified 21 out of 23 true positive (verifiable via structured blood test data) HIV patients using 2 search concepts, HIV Pos (UMLS code: C0019699) (20 true positives) and HIV diagnosis (UMLS code: C0920550) (8 true positives). SemEHR integrates document-level NLP annotations at the patient level to generate an integral view of patients. Table 1 presents the results of 2 experiments designed to evaluate the effectiveness of such integration on 2 case studies, hepatitis C and HIV. The results show that the number of positive mentions of diseases at the patient level is a good feature for supervised learning methods (naive Bayes or decision table) to classify whether a patient suffers from a disease or not. (The results reported in Table 1 are of a classification task, which is different from the previous information retrieval task.)

Table 1. Given a disease (identified by one or more UMLS concepts, ie, search concepts), SemEHR can generate a summary table for a cohort of patients, which, for each patient, gives the number of positive mentions of the search concepts within all of his/her EHR documents. Using this number as the only feature, we classify whether a patient suffers from a disease or not.

(a) Hepatitis C, Class (200)
Class                       Precision   Recall   F-measure
Hepatitis C positive (33)   0.857       0.522    0.649
Hepatitis C unknown (177)   0.941       0.989    0.964
Weighted avg.               0.931       0.935    0.928

(b) HIV, Class (1000)
Class                       Precision   Recall   F-measure
HIV positive (76)           0.985       0.855    0.915
HIV unknown (924)           0.988       0.999    0.994
Weighted avg.               0.988       0.988    0.988

(a) Two hundred CRIS patients evaluated for hepatitis C; classification model: naive Bayes; test method: 10-fold cross-validation; search concepts: C0019196, C2148557, C0220847. This shows the results of a 200-patient cohort for hepatitis C infection.
(b) One thousand CRIS patients evaluated for HIV; classification model: decision table; test method: 10-fold cross-validation; search concepts: C0019699, C0920550. This shows the results of a 1000-patient cohort for HIV.

Study conducted at King’s College Hospital, London

At King’s College Hospital, SemEHR is being used to assess eligibility and subsequently recruit patients into the 100 000 Genomes Project (https://www.genomicsengland.co.uk/). Here, an open SPARQL endpoint is integrated to map UMLS concepts to Human Phenotype Ontology terms, inclusion criteria for recruitment, and the concepts necessary to populate complex phenotype models. The preliminary validation study suggests that the tool is able to validate previously submitted cases and is very fast at searching phenotypes (providing results within seconds), an operation that previously required manual assessment of patient records. For example, the time to check the recruitment criteria for a patient is reduced significantly from days to minutes for dermatology disorders, for which the inclusion/exclusion criteria contain 120 phenotypes, on average. In addition, semantic reasoning (eg, expanding search concepts with more specific concepts) has been found to be helpful for identifying 2 specific phenotypes, neutropenia and hypertension.

Studies conducted on MIMIC-III data

We deployed a SemEHR instance on MIMIC-III, an intensive care EHR dataset anonymized from 2 US-based hospitals and made public for research purposes. MIMIC-III contains about 2 million free-text clinical notes and comprises very good structured data, including high-resolution laboratory measurements for most patients. To evaluate the performance of SemEHR’s structured medical profile, we randomly selected 100 patients and assessed the accuracy of automatically extracted laboratory measurements in their SemEHR medical profiles. The results are presented in Table 2. Eleven types of laboratory measurements were manually selected for this evaluation, which contains popular tests such as hematocrit and relatively rare ones such as blood urea. First, we compared the extracted measurement values with those stored in the MIMIC-III structured data. [...] verification are reported in the last column of Table 2. The average accuracy was improved to 97%.

Table 2. The performance of SemEHR laboratory measurement extraction on MIMIC-III data: 11 measurements are studied (first column); 100 patients were randomly selected for this study.

Laboratory measurement (UMLS label) | MIMIC-III label | # Correct (structured data comparison) | # Incorrect (structured data comparison) | # Actually correct (manually verified) | # Total extracted measurements | Accuracy, structured data comparison (%) | Accuracy, manually verified (%)
Hematocrit | Hematocrit | 38 | 5 | 4 | 43 | 88.37 | 97.67
Platelets | Platelet count | 1 | 1 | 1 | 2 | 50.00 | 100.00
Sodium | Sodium | 15 | 0 | 0 | 15 | 100.00 | 100.00
Mean corpuscular hemoglobin concentration | Mean corpuscular hemoglobin concentration | 35 | 1 | 0 | 36 | 97.22 | 97.22
Alanine aminotransferase | Alanine aminotransferase | 19 | 3 | 2 | 22 | 86.36 | 95.45
Red blood cell distribution width | Red blood cell distribution width | 35 | 1 | 0 | 36 | 97.22 | 97.22
Serum aspartate aminotransferase | Aspartate aminotransferase | 20 | 2 | 1 | 22 | 90.91 | 95.45
Chloride | Chloride | 15 | 0 | 0 | 15 | 100.00 | 100.00
Blood urea | Urea nitrogen | 3 | 0 | 0 | 3 | 100.00 | 100.00
Leukocytes | White blood cells | 34 | 5 | 4 | 39 | 87.18 | 97.44
Glucose | Glucose | 18 | 3 | 0 | 21 | 85.71 | 85.71
Average accuracy | | | | | | 89.36 | 96.93

The extracted results were assessed by 2 steps: (1) comparing with the structured data (querying lab events table in MIMIC-III; accuracy reported in the 7th column), and (2) manually checking not-matched items in the first step (accuracy reported in the last column).

The manual verification revealed that extracting vital signs from clinical notes can complement structured data in MIMIC-III; there are 6 cases where the measurements are extracted from free text but missing in the structured data. In general, SemEHR can reveal various types of structured data that usually are not or cannot be recorded in structured EHRs, such as family history and social history. In Table 3, for the above 100 randomly selected patients, we summarize the number of semantic entities identified in 5 sections of SemEHR medical profiles which are usually not recorded in structured EHRs.

Table 3. The number of extracted semantic entities in 5 sections of SemEHR medical profiles of the 100 randomly selected MIMIC-III patients, which are usually not recorded in structured EHRs.

Section | # Total annotations | Top 5 semantic types by frequency (count)
Admission medications | 1475 | Temporal concept (442); Pharmacologic substance (393); Finding (194); Clinical drug (121); Health care–related organization (51)
Family history | 156 | Finding (42); Disease or syndrome (33); Neoplastic process (28); Pharmacologic substance (14); Clinical attribute (9)
Social history | 445 | Finding (132); Temporal concept (86); Pharmacologic substance (58); Clinical attribute (30); Individual behavior (28)
History of past illness | 1575 | Disease or syndrome (337); Finding (189); Temporal concept (182); Therapeutic or preventive procedure (180); Body part, organ, or organ component (96)
Hospital discharge instructions | 1162 | Clinical attribute (359); Temporal concept (158); Health care–related organization (133); Health care activity (126); Finding (79)

DISCUSSION

SemEHR has been deployed or is in the process of being deployed in a number of NHS Trust EHR systems, including South London and Maudsley, King’s College Hospital, University College London Hospitals, and Guy’s Hospital. Results and feedback from the multiple SemEHR use cases have shown its effectiveness in automating lengthy manual tasks without jeopardizing accuracy. Queries are returned at a rapid enough rate to enable iterative tailoring to achieve high specificity. Moreover, according to our case studies at SLaM, SemEHR has achieved similar accuracy to bespoke NLP applications built upon TextHunter. With a system powered by ontological semantics, researchers can make use of semantically associated concepts to improve results, eg, in the CRIS-based liver disease study, the inclusion of 8 drugs used for treating liver disease helped to find more patients. Our case studies show that building a unified framework like SemEHR realizes a more cost-effective approach to dealing with common IE challenges and significantly lowers the barrier for researchers, coders, and clinicians to access knowledge residing within unstructured clinical notes.
SemEHR has great potential in A patient usually has multiple values of the same measurement that enabling the efficient and effective secondary use of EHRs to im- are tested at different times, and it should be noted that as long as prove health care services. Furthermore, SemEHR-like systems initi- the extracted value appears within the list of all values from the ate a collaborative learning platform, as advocated by Moseley and structured data, the extraction is deemed correct; otherwise, it is in- et al., enabling studies to be conducted in a cooperative way rather correct. The result of the first step is presented in the second to last than having resources remain in isolated silos. column. The average accuracy using structured data verification is SemEHR provides different patient views, with the aim of pre- 89%. For those incorrect extractions, we applied a second step of senting a more continuous representation of the patient’s treatment 21,22 manual assessment. This step identified some false negative results timeline. Such views may reveal data quality issues to research- from the first step caused by factors such as decimal rounding (3 ers or clinicians so that necessary actions can be taken before deriv- cases), different units (2 cases), and missing laboratory events in ing conclusions. For example, the longitudinal document view gives structured data (6 cases). The accuracies based on the manual verifi- a quick overview of how abundant or detailed a patient’s EHR is, Downloaded from https://academic.oup.com/jamia/article/25/5/530/4817428 by DeepDyve user on 18 July 2022 Journal of the American Medical Informatics Association, 2018, Vol. 25, No. 5 537 2. Mathias JS, Gossett D, Baker DW. Use of electronic health record data to which helps to identify patients who have incomplete records and evaluate overuse of cervical cancer screening. J Am Med Inform Assoc. might need to be removed from studies. However, data quality 2012;19:e96–101. 
issues such as data incompleteness, inconsistency, and inaccuracy 3. Pawloski PA, Thomas AJ, Kane S, et al. Predicting neutropenia risk in need to be addressed in a systematic way; making users aware of the patients with cancer using electronic data. J Am Med Inform Assoc. potential issues is only the first step. In our future work, we will in- 2017;24:e129–35. vestigate approaches to tackling challenges such as checking auto- 4. Bilal U, D ıez J, et al., the HHH Research Group. Population cardiovascu- mated patient-level consistency, bearing in mind that some of the lar health and urban environments: the Heart Healthy Hoods exploratory 23,24 challenges require wider-scope (eg, institution-level) attention. study in Madrid, Spain. BMC Med Res Methodol. 2016;16(1):104. 5. Hebbring SJ, Rastegar-Mojarad M, Ye Z, et al. Application of clinical text data for phenome-wide association studies (PheWASs). Bioinformatics. CONCLUSION 2015;31:1981–87. 6. Abhyankar S, Demner-Fushman D, Callaghan FM, et al. Combining In this paper, we presented SemEHR, a unified information extraction structured and unstructured data to identify a cohort of ICU patients who and semantic search system for obtaining clinical insight from unstruc- received dialysis. J Am Med Inform Assoc. 2014;21:801–07. tured clinical notes. With a dedicated architecture and the incorpora- 7. Scheurwegs E, Luyckx K, Luyten L, et al. Data integration of structured tion of semantic analytics, SemEHR effectively turns IE tasks into and unstructured sources for assigning clinical codes to patient stays. JAm (iterative) ontology-based searches, which significantly lowers the bar- Med Inform Assoc. 2016;23:e11–19. riers to secondary use of unstructured EHR data. The system has been 8. Stewart R, Soremekun M, Perera G, et al. 
The South London and Mauds- deployed in several NHS hospitals in the UK and a number of case ley NHS Foundation Trust Biomedical Research Centre (SLAM BRC) case studies have been initiated, including patient recruitment for the UK register: development and descriptive data. BMC Psychiatry. 2009;9:51. government’s 100 000 Genomes Project. Results and feedback demon- 9. Wu H, Ibrahim ZM, Iqbal E, et al. Encoding medication episodes for ad- strate that SemEHR can efficiently perform the task of cohort selec- verse drug event prediction. In: Research and Development in Intelligent Systems XXXIII. New York: Springer; 2016: 245–50. tion and patient characterization with high accuracy. SemEHR is open 10. Kadra G, Stewart R, Shetty H, et al. Extracting antipsychotic polyphar- source; all nonsensitive data relating to its verifications have been pub- macy data from electronic health records: developing and evaluating a lished in its online repository: https://github.com/CogStack/SemEHR. novel process. BMC Psychiatry. 2015;15:166. 11. Iqbal E, Mallah R, Jackson RG, et al. Identification of adverse drug events FUNDING STATEMENT from free text electronic patient records and information in a large mental health case register. PLoS One. 2015;10:e0134208. This work was supported by the Medical Research Council (grant 12. Jackson RG, Patel R, Jayatilleke N, et al. Natural language processing to number MC_PC_14089 and MR/L014815/1), the National Institute extract symptoms of severe mental illness from clinical text: the Clinical for Health Research Biomedical Research Centre at South London Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) and Maudsley NHS Foundation Trust and King’s College London, project. BMJ Open. 2017;7:e012012. the European Union’s Horizon 2020 (grant number 644753 KCon- 13. Jackson MSc RG, Ball M, Patel R, et al. 
TextHunter: a user friendly tool nect), the Wellcome Trust Seed Award in Science (grant number for extracting generic concepts from free text in clinical research. AMIA Annu Symp Proc. 2014;2014:729–38. 109823/Z/15/Z), the National Institute for Health Research Univer- 14. Jackson R, Kartoglu IE, Agrawal A, et al. CogStack: experiences of sity College London Hospital’s Biomedical Research Centre, Arthri- deploying integrated information retrieval and extraction services in a tis Research UK, the British Heart Foundation, Cancer Research large National Health Service Foundation Trust Hospital. bioRxiv. 2017. UK, the Chief Scientist Office, the Economic and Social Research 15. Bodenreider O. The Unified Medical Language System (UMLS): integrat- Council, the Engineering and Physical Sciences Research Council, ing biomedical terminology. Nucleic Acids Res. 2004;32:D267–70. the National Institute for Social Care and Health Research, and the 16. Lindberg DAB, Humphreys BL. The Unified Medical Language System Wellcome Trust (grant number MR/K006584/1). (UMLS) and computer-based patient records. In: Aspects of the Computer-based Patient Record. New York: Springer; 1992: 165–75. 17. Auer S, Bizer C, Kobilarov G, et al. DBpedia: a nucleus for a web of open COMPETING INTERESTS STATEMENT data. In: The Semantic Web: Lecture Notes in Computer Science. Berlin, We declare no competing interests. Heidelberg: Springer; 2007: 722–35. 18. Vrande ci c D, Kro ¨ tzsch M. Wikidata. Commun ACM. 2014;57:78–85. 19. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible CONTRIBUTORSHIP STATEMENT critical care database. Sci Data. 2016;3:160035. 20. Moseley ET, Hsu DJ, Stone DJ, et al. Beyond open big data: addressing HW, IK, AF, AR, GG, RJ, ZI, and AA were involved in development unreliable research. J Med Internet Res. 2014;16(11):e259. of SemEHR or components that are used by the system. GT and 21. Botsis T, Hartvigsen G, Chen F, et al. 
Secondary use of EHR: data quality KIM led the liver disease and HIV studies on the SLaM EHR. RS led issues and informatics opportunities. Summit Translat Bioinform. an autoimmune study on the SLaM EHR; CS and DG led the 2010;2010:1. 100 000 Genomes Project study at King’s College Hospital. MB and 22. Nair S, Hsu D, Celi LA. Challenges and opportunities in secondary anal- RS provided the access and computational resources for accessing yses of electronic health record data. In: Secondary Analysis of Elec- the SLaM EHR. AR, RS, and RJBD secured funding for this re- tronic Health Records. Cham: Springer; 2016. https://rd.springer.com/ search. All authors contributed to the abstract. chapter/10.1007/978-3-319-43742-2_3/fulltext.html. Accessed June 14, 2017. 23. Cresswell KM, Bates DW, Sheikh A. Ten key considerations for the suc- REFERENCES cessful optimization of large-scale health information technology. JAm Med Inform Assoc. 2017;24:182–87. 1. Warner JL, Wang L, Pao W, et al. CUSTOM-SEQ: a prototype for oncol- 24. Cresswell K, Bates DW, Sheikh A. Six ways for governments to get value ogy rapid learning in a comprehensive EHR environment. J Am Med In- form Assoc. 2016;23:692–700. from health IT. Lancet. 2016;387:2074–75. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of the American Medical Informatics Association Oxford University Press

SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research



Publisher
Oxford University Press
Copyright
Copyright © 2022 American Medical Informatics Association
ISSN
1067-5027
eISSN
1527-974X
DOI
10.1093/jamia/ocx160

Abstract

Journal of the American Medical Informatics Association, 25(5), 2018, 530–537
doi: 10.1093/jamia/ocx160
Advance Access Publication Date: 19 January 2018
Research and Applications

SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research*

Honghan Wu (1,2), Giulia Toti (3), Katherine I Morley (3,4), Zina M Ibrahim (1,5), Amos Folarin (1,5), Richard Jackson (1), Ismail Kartoglu (6), Asha Agrawal (7), Clive Stringer (7), Darren Gale (7), Genevieve Gorrell (8), Angus Roberts (8), Matthew Broadbent (9), Robert Stewart (9,10), and Richard JB Dobson (1,5)

(1) Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
(2) School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
(3) National Addiction Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
(4) Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Australia
(5) Farr Institute of Health Informatics Research, University College London, London, UK
(6) InterDigital Europe, London, UK
(7) King’s College Hospital NHS Foundation Trust, London, UK
(8) Department of Computer Science, University of Sheffield, Sheffield, UK
(9) South London and Maudsley NHS Foundation Trust, London, UK
(10) Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK

Corresponding Author: Dr Honghan Wu, Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, SE5 8AF, UK. E-mail: honghan.wu@kcl.ac.uk. Phone: +442078480924.
*The manuscript has been submitted to JAMIA
Received 1 October 2017; Revised 28 November 2017; Editorial Decision 22 December 2017; Accepted 8 January 2018

ABSTRACT
Objective: Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs.
Methods: SemEHR implements a generic information extraction (IE) and retrieval infrastructure by identifying contextualized mentions of a wide range of biomedical concepts within EHRs. Natural language processing annotations are further assembled at the patient level and extended with EHR-specific knowledge to generate a timeline for each patient. The semantic data are serviced via ontology-based search and analytics interfaces.
Results: SemEHR has been deployed at a number of UK hospitals, including the Clinical Record Interactive Search, an anonymized replica of the EHR of the UK South London and Maudsley National Health Service Foundation Trust, one of Europe’s largest providers of mental health services. In 2 Clinical Record Interactive Search–based studies, SemEHR achieved 93% (hepatitis C) and 99% (HIV) F-measure results in identifying true positive patients. At King’s College Hospital in London, as part of the CogStack program (github.com/cogstack), SemEHR is being used to recruit patients into the UK Department of Health 100 000 Genomes Project (genomicsengland.co.uk). The validation study suggests that the tool can validate previously recruited cases and is very fast at searching phenotypes; time for recruitment criteria checking was reduced from days to minutes. Validated on open intensive care EHR data, Medical Information Mart for Intensive Care III, the vital signs extracted by SemEHR can achieve around 97% accuracy.
Conclusion: Results from the multiple case studies demonstrate SemEHR’s efficiency: weeks or months of work can be done within hours or minutes in some cases. SemEHR provides a more comprehensive view of patients, bringing in more and unexpected insight compared to study-oriented bespoke IE systems. SemEHR is open source, available at https://github.com/CogStack/SemEHR.
Key words: secondary use of EHR, information extraction, NLP, semantic search, ontology, FHIR, patient recruitment

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

BACKGROUND
The opportunity for secondary use of the wealth of information contained within electronic health records (EHRs) has attracted researchers interested in investigating approaches to provide more tailored and timely care, improve efficiency of services, and derive new scientific and medical insights. [1–4] In addition to structured data contained within relational database tables (such as International Classification of Diseases, Tenth Revision [ICD-10] diagnoses codes), EHR documents are filled with unstructured clinical notes, such as nursing records, radiology reports, and discharge summaries. [5–7] These notes add a richness and depth to EHR-based studies, providing data and insight beyond what is available within the thin layer of data stored within structured fields.

Deriving actionable insights from the EHR, including the unstructured component, is challenging. It requires bringing together expertise in the clinical domain, the underlying health care information systems, and text analytics techniques, eg, natural language processing (NLP). For example, the Clinical Record Interactive Search (CRIS) system, an anonymized replica of the EHR used in the South London and Maudsley (SLaM) National Health Service (NHS) Foundation Trust in the UK, was designed to support clinical and scientific studies. Since its launch in 2009, a large number of studies ([9–13], to name a few) have used the CRIS resource in conjunction with NLP or text-mining techniques. Although these studies answered different clinical questions, the technical requirements for extracting, structuring, and making sense of the data largely overlapped, and included (1) preprocessing and cleansing corpus-related documents (eg, removing misleading form headings from scanned documents); (2) compiling and recognizing common medical terminology (eg, the antipsychotic medication identification problem is almost the same in [10,11]); and (3) deriving patient-level clinical signals from document-level NLP annotations (eg, understanding that a medication prescribed at admission was removed from the patient’s discharge medication list).

As unstructured EHR data are inevitably needed by many research projects and clinical studies, more cost-effective and systematic solutions are needed to address the common challenges presented by different use cases, while also ensuring that study-specific requirements are not compromised by the unified approach. To address such challenges, we propose SemEHR, a semantic search and analytical system that generates a complete and processable view of patients from their clinical notes.

To realize a general-purpose biomedical information extraction (IE) system for EHRs, there are at least 3 fundamental challenges: (1) syntactic heterogeneity: how to effectively access multimodal/multisource EHR data that are almost certainly heterogeneous in format, data model, and access interface; (2) knowledge coverage: how to cover all possible biomedical concepts that are required by potential use cases; and (3) context capturing: how to represent and capture the contexts associated with extracted concepts and determine which are critical to understand the clinical domain. To address these challenges, SemEHR uses a production infrastructure that integrates our previous work in the CogStack pipeline to harmonize and cleanse heterogeneous records, using them to identify contextualized mentions (negation, temporality, and experiencer) of a wide range of biomedical concepts, including Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) (http://www.snomed.org/snomed-ct), ICD-10 (http://apps.who.int/classifications/icd10/browse/2010/en), Logical Observation Identifiers Names and Codes (LOINC) (https://loinc.org/), and Drug Ontology (https://ontology.atlassian.net/wiki/spaces/DRON/overview). In addition, SemEHR automatically associates semantic types of annotations and their clinical contexts (derived from documents or sections) with dedicated extraction rules, which enables better IE capabilities, such as populating structured vital sign data from observation notes.

It is well appreciated that a one-size-fits-all approach needs to be adapted to work effectively in different scenarios. Therefore, to serve different use cases well, we require the capability to extend the terminology of the general-purpose IE system to cover unseen concepts, deal with language specificities in a subcorpus, support use case–specific extraction requirements, and enable performance fine-tuning, eg, by incorporating specific knowledge or researchers’ expertise. SemEHR provides a study-based (use case–specific) learning engine that enables iterative learning and feedback. It collects user feedback and uses rule-based and machine learning techniques to tackle study-specific challenges and requirements in a continuous manner.

A few hurdles prevent the effective consumption of extracted data from general-purpose IE systems in scientific research and clinical studies. To fulfill requirements by various studies, developing general-purpose IE systems is inevitable in order to adopt large terminologies that users might not be familiar with. This poses challenges in (1) mapping look-up concepts to terminology terms, (2) translating clinical relations to term associations, and (3) exploiting terminology semantics to bring unexpected or unperceived new insights. At the consumption level, SemEHR implements an ontology-based semantic search component to tackle such challenges.

Last, and probably most important, EHRs represent a timeline of multiple patient interactions with services. As such, the document-level IE results should be integrated at the patient level to incorporate temporal and macrocontextual information (which reports, which visits, etc., as opposed to the sentence-based contextual information discussed above). Only after this integration is the EHR IE task complete. However, this requires a thorough understanding of the EHR system. SemEHR provides a multiperspective view of each patient by assembling NLP annotations at the patient level as longitudinal views and compiling structured medical profiles. Both the NLP results and the patient timeline are made available via an ontology-based search system, which effectively turns common IE tasks into semantic search queries. The interface provides a multiperspective view of each patient by assembling NLP annotations at the patient level as longitudinal views and compiling structured medical profiles.

Figure 1. (A) SemEHR data model: entities (patient, clinical note, concept, and concept mentions) and their associations. (B) SemEHR generates 2 longitudinal views for each patient: concept mentions grouped in typed and dated documents (upper part), and concept mentions grouped in structured (discharge) summaries (lower part).

METHOD
Data model and longitudinal patient views
As depicted in Figure 1A, SemEHR is built upon 6 types of entities: patient, clinical note, concept, concept mention, medical profile, and profile aspect. Each patient is associated with a list of dated and typed clinical notes. From these notes, SemEHR identifies mentions of a wide range of biomedical concepts from the Unified Medical Language System (UMLS), [15,16] a compendium of many controlled vocabularies, including SNOMED CT, ICD-10, LOINC, Drug Ontology, and Gene Ontology. By analyzing the context of its appearance, each mention is associated with 3 pieces of dimensional contextual information: negation, temporality, and experiencer. Highlighted in green in Figure 1A, the associations between concepts (eg, Steatohepatitis is a liver disease; Ribavirin is a drug for treating hepatitis C) are made available to conduct semantically enriched computations by incorporating the various biomedical ontologies and Linked Open Data (https://en.wikipedia.org/wiki/Linked_data) such as DBpedia [17] and Wikidata. [18] SemEHR derives periodical medical profiles from a patient’s clinical notes, automatically generated medical summaries consisting of a set of profile aspects (sections describing different aspects of a medical profile, eg, past medical history, medications, etc.) for a defined period of time. Concept mentions are assigned to these aspects according to their appearance in the original clinical notes. As the rectangle boxes in Figure 1A show, SemEHR entities are mapped to Fast Healthcare Interoperability Resources (FHIR) (https://www.hl7.org/fhir/overview.html) entities whenever possible.

Based on this data model, SemEHR populates 2 longitudinal views (shown in Figure 1B) for each patient. As shown in the upper part of Figure 1B, the first view is generated directly from the raw data. Concept mentions are organized in a list of clinical notes that are located on a timeline according to their date attributes (eg, the created date/time of the clinical notes). Wherever possible, types of clinical notes (such as GP Letter, Radiology, or Discharge Summary) are presented.

The second view (lower part of Figure 1B) is designed to convey structured summaries for a patient, each of which summarizes the patient’s medical history/conditions in a period of time (eg, an inpatient hospital stay). A summary is composed of groups of concept mentions, where each group is about one particular aspect of the patient’s medical profile, eg, past medical history, medications, or physical exams. Preferably, such summaries are derived from discharge summaries. When discharge summaries are not available, an automated summarization approach is applied to generate the summaries based on the contextual information of the concept types and concept mentions. Automated summaries are differentiated from those generated from discharge summaries. Supplementary Material 2 describes the detailed process of automated medical profile generation.

Architecture: generic and adaptive information extraction and retrieval
As illustrated in Figure 2, SemEHR is composed of 3 subsystems: the producing subsystem, the continuous learning subsystem, and the consuming subsystem.

The producing subsystem
Essentially, the producing subsystem extracts free-text clinical notes from heterogeneous underlying EHR systems, populating the data model described in the previous section. This task is performed in 3 main steps: data retrieval, information extraction, and semantic indexing. CogStack, [14] a data harmonization and enterprise search toolkit for EHRs, is adopted in the data retrieval step to provide a unified interface with unstructured EHR data, which is often very heterogeneous in format and distributed in storage. Each document that flows out from the data retrieval component is fed into the NLP pipeline, which embeds Bio-YODIE (https://gate.ac.uk/applications/bio-yodie.html), an NLP pipeline dedicated to annotating UMLS concepts in clinical notes (“documents” hereafter). (Bio-YODIE was developed as part of the EU KConnect project, in which GG, AR, HW, RS, and RD are involved.) Emerging from the NLP pipeline are the documents and concept mentions extracted from them, which are then analyzed by the Semantic Index component before being indexed. The analysis involves deriving document types (eg, Radiology, GP Letter, or Discharge Summary), parsing document structure (eg, identifying headed blocks from discharge summaries), and associating concept mentions with document structures. The analysis results, document content, and NLP outputs are finally indexed by an Elasticsearch (https://www.elastic.co/products/elasticsearch) cluster. Patient-level summaries are generated as described in the previous section. These summaries are updated as new documents are added to the index.
Figure 2. The architecture of SemEHR is composed of 3 subsystems: (1) the producing subsystem (upper part of the figure), which creates the SemEHR semantic index by harmonizing EHR data, applying natural language processing, and indexing the results; (2) the continuous learning subsystem, which addresses study-specific requirements and supports fine-tuning for separate studies; and (3) the consuming subsystem (lower part), which supports tailored care, patient recruitment, and clinical research by semantic searching and study-based continuous learning.

SemEHR aims to produce annotations with accurate contextual information. Three components work collectively to achieve this goal: the Bio-YODIE pipeline captures the sentence/paragraph-level contexts (eg, negation, hypothetical mentions); the semantic index’s analyzer brings in section/document-level context (eg, past medical history, laboratory results); and the continuous learning subsystem (described in the next subsection) learns the contexts from user-assessed annotations (see Supplementary Material 1 for details).

The continuous learning subsystem

To accommodate the uniqueness of the IE requirements of different studies, SemEHR is designed with a continuous learning subsystem to iteratively address study-specific issues. The system collects and analyzes user feedback from an annotation component embedded within the user interface. Based on the analyzed feedback, 2 components are used to improve the IE results. The first is a rule engine, which generates and applies rules for filtering out unwanted results, eg, removing concept mentions based on their original string or surrounding text. The second component is a machine learning engine (a bidirectional recurrent neural network model), which takes user feedback as training data, applies the trained model on the study’s corpus, and populates a confidence value for each concept mention. Confidence values are used as quantified indicators in analytic components for populating results. The user interface for collecting feedback and the continuous learning mechanisms are explained in detail in Supplementary Material 1.

The consuming subsystem

This subsystem consists of a set of components that utilize IE results and clinical knowledge (accessed from biomedical ontology and Linked Open Data application programming interfaces) to support tasks such as patient characterization or trial recruitment. A consuming task is called a “study” in SemEHR. Each study will have its own storage within SemEHR’s Study Knowledge Graph (KG) (bottom of the Storage section in Figure 2), which stores its study parameters (eg, cohort definition and metadata), search settings (eg, query concepts), study results (eg, selected cohort and exported features), and customized rules (eg, regular expressions to remove unwanted annotations). There is also a common KG (Common KG in Figure 2), where sharable knowledge or efforts (such as manually selected concepts of alcohol-related liver diseases or postprocessing rules for improving NLP results) are made available to other studies.

Key functionalities of the consuming subsystem include the following:

• Translating search terms to query concepts. This translates the user’s keyword searches (which are often ambiguous or incomplete) into semantically clear concepts (identified using UMLS Concept Unique Identifiers [CUIs]). The correct translation is essential to ensure the soundness and completeness of search and analytics results. Unfortunately, in the clinical scenario, it is often not a trivial task to compile an accurate and complete concept list even for a single clinical signal. For example, one SemEHR case study needs to look up patients with alcohol-related liver disease. Given a general clinical term such as “liver disease,” it would be time-consuming to compile a list of all subtypes of liver disease that are also alcohol-related. As depicted in section A of Figure 3, SemEHR provides 2 functions for supporting concept translation: (1) matching search terms to concepts, which is enhanced with logical reasoning to automatically include semantically related concepts and EHR-based exclusion to remove concepts that do not exist in EHRs of the study cohort; and (2) validating automatically populated lists, to allow manual assessment by the researchers.

• Selecting and summarizing a cohort. Each query submitted to SemEHR will result in a cohort, a list of patients who match the query. As shown in Figure 3B, a summary table is generated for the matching cohort. Each row summarizes a patient, and the first column shows the patient ID. The second one shows the total number of mentions of the search concepts within this patient’s EHR, followed by the counts for 4 types of contextualized mentions: positive mentions, history/hypothetical mentions, negated mentions, and mentions associated with other experiencers. Clicking on the numbers brings the user to the clinical notes, where corresponding mentions are highlighted (lower part of Figure 3B).

• Generating patient views and structured medical profile. As a generic IE and retrieval platform, SemEHR processes all EHR records for patients and tries to identify a wide range of biomedical concepts from them. This enables it to produce a panorama for each patient. As shown in Figure 3C, 3 different views are generated for each patient:

  • The first view is the longitudinal document view (upper part of Figure 3C), which lists all patient documents in chronological order, labels documents using their types, and ticks those documents that match the query. This view conveys the abundance of a patient’s records, the prevalence of matched documents, and their temporal distributions.

  • The second view is the structured medical profile (lower part of Figure 3C), which is automatically derived from the patient’s clinical notes and structured using an extended FHIR discharge summary format (23 sections of the FHIR discharge summary [http://hl7.org/fhir/us/ccda/2017Jan/StructureDefinition-CCDA-on-FHIR-Discharge-Summary.html] are extended with an additional 8 headings). This structured summary enhances SemEHR’s search and IE ability. For example, by constraining the search field to “Family History,” one can get a cohort of patients with a family history of a certain disorder. In addition, knowing that a piece of text appeared in the “Hospital Discharge Physical,” sophisticated rules can be applied to extract more structured data, such as vital signs.

  • The third view is the view of vital signs and other measurements (middle part of Figure 3C). This is automatically generated by applying IE rules on the latest structured summary of a patient.

Figure 3. Screenshots of key functionalities provided by the consuming subsystem. (A) Identifying query concepts (UMLS CUIs): facilities to ensure the correct and complete concepts are used in the query to derive accurate clinical findings. (A1) Concept search for matching a user search term to one or more ontology (UMLS) concepts; logical reasoning is implemented to enable the automated inclusion of semantically related concepts (eg, hepatocellular damage is liver damage). (A2) Concept validation component for checking and approving the automatically inferred concepts based on the aim and criteria of the clinical study (eg, only retain alcohol-related liver conditions for addiction analytics). (B) Selecting and summarizing cohort (the full text in the screenshot has been deliberately rewritten to avoid leaking sensitive patient data). A summary table is generated for a user query where each row summarizes the numbers of total mentions and contextualized mentions for one patient. (C) Patient timeline: longitudinal document view (upper), structured medical profile view (based on FHIR discharge summary format), and the view of latest vital signs and other measurements.

Based on these key functionalities, SemEHR provides a set of search interfaces to surface the clinical variables hidden in clinical notes. A typical query, such as “return all patients with a family history of hepatitis C,” previously might have required the end user to have NLP expertise, eg, be able to do named entity recognition for “hepatitis C” that must be mentioned in the context of “family history.” Using SemEHR, the end user can put in a simple keyword search: “hepatitis C.” To fulfill this search, SemEHR will pull out the cohort of relevant patients, populate patient-level summaries (ie, numbers of contextualized concept mentions, eg, a patient has 16 total mentions of the disease, 15 of them positive and 1 about a family member), and provide a link to each mention in the original source clinical note (similar to the UI illustrated in Figure 3B).

RESULTS

This section reports the experiments and results from 3 EHR systems focusing on evaluating SemEHR’s capacities in semantic search, analytics, and clinical decision-making support. The evaluation on its natural language processing (Bio-YODIE) performance is available in Supplementary Material 3.

Studies conducted on CRIS data of South London and Maudsley Hospital

SemEHR has been deployed on the anonymized psychiatric records database CRIS, which contains a total of 18 million free-text documents from South London and Maudsley Hospital, one of Europe’s largest mental health providers (serving 1.2 million residents). In the CRIS clinical notes, SemEHR identified 46 million mentions of concepts, the predominant ones being pharmacologic substances (16 million), mental or behavioral dysfunction (12 million), and sign or symptom (3.8 million). In a CRIS-based liver disease study, SemEHR identified (in the context of an information retrieval task) 94 instances out of 100 hepatitis C–positive patients that were manually annotated (based on structured blood test data). In an HIV study, a random 1000-patient cohort was selected, and SemEHR identified 21 out of 23 true positive (verifiable via structured blood test data) HIV patients using 2 search concepts, HIV Pos (UMLS code: C0019699) (20 true positives) and HIV diagnosis (UMLS code: C0920550) (8 true positives). SemEHR integrates document-level NLP annotations at the patient level to generate an integral view of patients. Table 1 presents the results of 2 experiments designed to evaluate the effectiveness of such integration on 2 case studies, hepatitis C and HIV. The results show that the number of positive mentions of diseases at the patient level is a good feature for supervised learning methods (naive Bayes or decision table) to classify whether a patient suffers from a disease or not. (The results reported in Table 1 are of a classification task, which is different from the previous information retrieval task.)

Table 1. Given a disease (identified by one or more UMLS concepts, ie, search concepts), SemEHR can generate a summary table for a cohort of patients, which, for each patient, gives the number of positive mentions of the search concepts within all of his/her EHR documents. Using this number as the only feature, we classify whether a patient suffers from a disease or not.

(a) Hepatitis C
Class (200) | Precision | Recall | F-measure
Hepatitis C positive (33) | 0.857 | 0.522 | 0.649
Hepatitis C unknown (177) | 0.941 | 0.989 | 0.964
Weighted avg. | 0.931 | 0.935 | 0.928

(b) HIV
Class (1000) | Precision | Recall | F-measure
HIV positive (76) | 0.985 | 0.855 | 0.915
HIV unknown (924) | 0.988 | 0.999 | 0.994
Weighted avg. | 0.988 | 0.988 | 0.988

(a) Two hundred CRIS patients evaluated for hepatitis C; classification model: naive Bayes; test method: 10-fold cross-validation; search concepts: C0019196, C2148557, C0220847. This shows the results of a 200-patient cohort for hepatitis C infection.
(b) One thousand CRIS patients evaluated for HIV; classification model: decision table; test method: 10-fold cross-validation; search concepts: C0019699, C0920550. This shows the results of a 1000-patient cohort for HIV.

Study conducted at King’s College Hospital, London

At King’s College Hospital, SemEHR is being used to assess eligibility and subsequently recruit patients into the 100 000 Genomes Project (https://www.genomicsengland.co.uk/). Here, an open SPARQL endpoint is integrated to map UMLS concepts to Human Phenotype Ontology terms, inclusion criteria for recruitment, and the concepts necessary to populate complex phenotype models. The preliminary validation study suggests that the tool is able to validate previously submitted cases and is very fast at searching phenotypes (providing results within seconds), an operation that previously required manual assessment of patient records. For example, the time to check the recruitment criteria for a patient is reduced significantly from days to minutes for dermatology disorders, for which the inclusion/exclusion criteria contain 120 phenotypes, on average. In addition, semantic reasoning (eg, expanding search concepts with more specific concepts) has been found to be helpful for identifying 2 specific phenotypes, neutropenia and hypertension.

Studies conducted on MIMIC-III data

We deployed a SemEHR instance on MIMIC-III, an intensive care EHR dataset anonymized from 2 US-based hospitals and made public for research purposes. MIMIC-III contains about 2 million free-text clinical notes and comprises very good structured data, including high-resolution laboratory measurements for most patients. To evaluate the performance of SemEHR’s structured medical profile, we randomly selected 100 patients and assessed the accuracy of automatically extracted laboratory measurements in their SemEHR medical profiles. The results are presented in Table 2. Eleven types of laboratory measurements were manually selected for this evaluation, including common tests such as hematocrit and relatively rare ones such as blood urea. First, we compared the extracted measurement values with those stored in the MIMIC-III structured data. A patient usually has multiple values of the same measurement that are tested at different times, and it should be noted that as long as the extracted value appears within the list of all values from the structured data, the extraction is deemed correct; otherwise, it is incorrect. The result of the first step is presented in the second to last column. The average accuracy using structured data verification is 89%. For those incorrect extractions, we applied a second step of manual assessment. This step identified some false negative results from the first step caused by factors such as decimal rounding (3 cases), different units (2 cases), and missing laboratory events in structured data (6 cases). The accuracies based on the manual verification are reported in the last column of Table 2. The average accuracy was improved to 97%.

Table 2. The performance of SemEHR laboratory measurement extraction on MIMIC-III data: 11 measurements are studied (first column); 100 patients were randomly selected for this study.

Laboratory measurements (UMLS label) | MIMIC-III label | # Correct (structured data comparison) | # Incorrect (structured data comparison) | # Actually correct (manually verified) | # Total extracted measurements | Accuracy (structured data comparison) (%) | Accuracy (manually verified) (%)
Hematocrit | Hematocrit | 38 | 5 | 4 | 43 | 88.37 | 97.67
Platelets | Platelet count | 1 | 1 | 1 | 2 | 50.00 | 100.00
Sodium | Sodium | 15 | 0 | 0 | 15 | 100.00 | 100.00
Mean corpuscular hemoglobin concentration | Mean corpuscular hemoglobin concentration | 35 | 1 | 0 | 36 | 97.22 | 97.22
Alanine aminotransferase | Alanine aminotransferase | 19 | 3 | 2 | 22 | 86.36 | 95.45
Red blood cell distribution width | Red blood cell distribution width | 35 | 1 | 0 | 36 | 97.22 | 97.22
Serum aspartate aminotransferase | Aspartate aminotransferase | 20 | 2 | 1 | 22 | 90.91 | 95.45
Chloride | Chloride | 15 | 0 | 0 | 15 | 100.00 | 100.00
Blood urea | Urea nitrogen | 3 | 0 | 0 | 3 | 100.00 | 100.00
Leukocytes | White blood cells | 34 | 5 | 4 | 39 | 87.18 | 97.44
Glucose | Glucose | 18 | 3 | 0 | 21 | 85.71 | 85.71
Average accuracy | | | | | | 89.36 | 96.93

The extracted results were assessed in 2 steps: (1) comparing with the structured data (querying lab events table in MIMIC-III; accuracy reported in the 7th column), and (2) manually checking not-matched items in the first step (accuracy reported in the last column).

The manual verification revealed that extracting vital signs from clinical notes can complement structured data in MIMIC-III; there are 6 cases where the measurements are extracted from free text but missing in the structured data. In general, SemEHR can reveal various types of structured data that usually are not or cannot be recorded in structured EHRs, such as family history and social history. In Table 3, for the above 100 randomly selected patients, we summarize the number of semantic entities identified in 5 sections of SemEHR medical profiles which are usually not recorded in structured EHRs.

Table 3. The number of extracted semantic entities in 5 sections of SemEHR medical profiles of the 100 randomly selected MIMIC-III patients, which are usually not recorded in structured EHRs. Below the totals, rows 1 to 5 list the top 5 semantic types by frequency in each section, with counts in parentheses.

 | Admission medications | Family history | Social history | History of past illness | Hospital discharge instructions
# Total annotations | 1475 | 156 | 445 | 1575 | 1162
1 | Temporal concept (442) | Finding (42) | Finding (132) | Disease or syndrome (337) | Clinical attribute (359)
2 | Pharmacologic substance (393) | Disease or syndrome (33) | Temporal concept (86) | Finding (189) | Temporal concept (158)
3 | Finding (194) | Neoplastic process (28) | Pharmacologic substance (58) | Temporal concept (182) | Health care–related organization (133)
4 | Clinical drug (121) | Pharmacologic substance (14) | Clinical attribute (30) | Therapeutic or preventive procedure (180) | Health care activity (126)
5 | Health care–related organization (51) | Clinical attribute (9) | Individual behavior (28) | Body part, organ, or organ component (96) | Finding (79)

DISCUSSION

SemEHR has been deployed or is in the process of being deployed in a number of NHS Trust EHR systems, including South London and Maudsley, King’s College Hospital, University College London Hospitals, and Guy’s Hospital. Results and feedback from the multiple SemEHR use cases have shown its effectiveness in automating lengthy manual tasks without jeopardizing accuracy. Queries are returned at a rapid enough rate to enable iterative tailoring to achieve high specificity. Moreover, according to our case studies at SLaM, SemEHR has achieved similar accuracy to bespoke NLP applications built upon TextHunter. With a system powered by ontological semantics, researchers can make use of semantically associated concepts to improve results, eg, in the CRIS-based liver disease study, the inclusion of 8 drugs used for treating liver disease helped to find more patients.

Our case studies show that building a unified framework like SemEHR realizes a more cost-effective approach to dealing with common IE challenges and significantly lowers the barrier for researchers, coders, and clinicians to access knowledge residing within unstructured clinical notes. SemEHR has great potential in enabling the efficient and effective secondary use of EHRs to improve health care services. Furthermore, SemEHR-like systems initiate a collaborative learning platform, as advocated by Moseley et al., enabling studies to be conducted in a cooperative way rather than having resources remain in isolated silos.

SemEHR provides different patient views, with the aim of presenting a more continuous representation of the patient’s treatment timeline. Such views may reveal data quality issues [21,22] to researchers or clinicians so that necessary actions can be taken before deriving conclusions. For example, the longitudinal document view gives a quick overview of how abundant or detailed a patient’s EHR is, which helps to identify patients who have incomplete records and might need to be removed from studies. However, data quality issues such as data incompleteness, inconsistency, and inaccuracy need to be addressed in a systematic way; making users aware of the potential issues is only the first step. In our future work, we will investigate approaches to tackling challenges such as checking automated patient-level consistency, bearing in mind that some of the challenges require wider-scope (eg, institution-level) attention [23,24].

CONCLUSION

In this paper, we presented SemEHR, a unified information extraction and semantic search system for obtaining clinical insight from unstructured clinical notes. With a dedicated architecture and the incorporation of semantic analytics, SemEHR effectively turns IE tasks into (iterative) ontology-based searches, which significantly lowers the barriers to secondary use of unstructured EHR data. The system has been deployed in several NHS hospitals in the UK and a number of case studies have been initiated, including patient recruitment for the UK government’s 100 000 Genomes Project. Results and feedback demonstrate that SemEHR can efficiently perform the task of cohort selection and patient characterization with high accuracy. SemEHR is open source; all nonsensitive data relating to its verifications have been published in its online repository: https://github.com/CogStack/SemEHR.

FUNDING STATEMENT

This work was supported by the Medical Research Council (grant number MC_PC_14089 and MR/L014815/1), the National Institute for Health Research Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, the European Union’s Horizon 2020 (grant number 644753 KConnect), the Wellcome Trust Seed Award in Science (grant number 109823/Z/15/Z), the National Institute for Health Research University College London Hospital’s Biomedical Research Centre, Arthritis Research UK, the British Heart Foundation, Cancer Research UK, the Chief Scientist Office, the Economic and Social Research Council, the Engineering and Physical Sciences Research Council, the National Institute for Social Care and Health Research, and the Wellcome Trust (grant number MR/K006584/1).

COMPETING INTERESTS STATEMENT

We declare no competing interests.

CONTRIBUTORSHIP STATEMENT

HW, IK, AF, AR, GG, RJ, ZI, and AA were involved in development of SemEHR or components that are used by the system. GT and KIM led the liver disease and HIV studies on the SLaM EHR. RS led an autoimmune study on the SLaM EHR; CS and DG led the 100 000 Genomes Project study at King’s College Hospital. MB and RS provided the access and computational resources for accessing the SLaM EHR. AR, RS, and RJBD secured funding for this research. All authors contributed to the abstract.

REFERENCES

1. Warner JL, Wang L, Pao W, et al. CUSTOM-SEQ: a prototype for oncology rapid learning in a comprehensive EHR environment. J Am Med Inform Assoc. 2016;23:692–700.
2. Mathias JS, Gossett D, Baker DW. Use of electronic health record data to evaluate overuse of cervical cancer screening. J Am Med Inform Assoc. 2012;19:e96–101.
3. Pawloski PA, Thomas AJ, Kane S, et al. Predicting neutropenia risk in patients with cancer using electronic data. J Am Med Inform Assoc. 2017;24:e129–35.
4. Bilal U, Díez J, et al., the HHH Research Group. Population cardiovascular health and urban environments: the Heart Healthy Hoods exploratory study in Madrid, Spain. BMC Med Res Methodol. 2016;16(1):104.
5. Hebbring SJ, Rastegar-Mojarad M, Ye Z, et al. Application of clinical text data for phenome-wide association studies (PheWASs). Bioinformatics. 2015;31:1981–87.
6. Abhyankar S, Demner-Fushman D, Callaghan FM, et al. Combining structured and unstructured data to identify a cohort of ICU patients who received dialysis. J Am Med Inform Assoc. 2014;21:801–07.
7. Scheurwegs E, Luyckx K, Luyten L, et al. Data integration of structured and unstructured sources for assigning clinical codes to patient stays. J Am Med Inform Assoc. 2016;23:e11–19.
8. Stewart R, Soremekun M, Perera G, et al. The South London and Maudsley NHS Foundation Trust Biomedical Research Centre (SLAM BRC) case register: development and descriptive data. BMC Psychiatry. 2009;9:51.
9. Wu H, Ibrahim ZM, Iqbal E, et al. Encoding medication episodes for adverse drug event prediction. In: Research and Development in Intelligent Systems XXXIII. New York: Springer; 2016: 245–50.
10. Kadra G, Stewart R, Shetty H, et al. Extracting antipsychotic polypharmacy data from electronic health records: developing and evaluating a novel process. BMC Psychiatry. 2015;15:166.
11. Iqbal E, Mallah R, Jackson RG, et al. Identification of adverse drug events from free text electronic patient records and information in a large mental health case register. PLoS One. 2015;10:e0134208.
12. Jackson RG, Patel R, Jayatilleke N, et al. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open. 2017;7:e012012.
13. Jackson MSc RG, Ball M, Patel R, et al. TextHunter: a user friendly tool for extracting generic concepts from free text in clinical research. AMIA Annu Symp Proc. 2014;2014:729–38.
14. Jackson R, Kartoglu IE, Agrawal A, et al. CogStack: experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust Hospital. bioRxiv. 2017.
15. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–70.
16. Lindberg DAB, Humphreys BL. The Unified Medical Language System (UMLS) and computer-based patient records. In: Aspects of the Computer-based Patient Record. New York: Springer; 1992: 165–75.
17. Auer S, Bizer C, Kobilarov G, et al. DBpedia: a nucleus for a web of open data. In: The Semantic Web: Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2007: 722–35.
18. Vrandečić D, Krötzsch M. Wikidata. Commun ACM. 2014;57:78–85.
19. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
20. Moseley ET, Hsu DJ, Stone DJ, et al. Beyond open big data: addressing unreliable research. J Med Internet Res. 2014;16(11):e259.
21. Botsis T, Hartvigsen G, Chen F, et al. Secondary use of EHR: data quality issues and informatics opportunities. Summit Translat Bioinform. 2010;2010:1.
22. Nair S, Hsu D, Celi LA. Challenges and opportunities in secondary analyses of electronic health record data. In: Secondary Analysis of Electronic Health Records. Cham: Springer; 2016. https://rd.springer.com/chapter/10.1007/978-3-319-43742-2_3/fulltext.html. Accessed June 14, 2017.
23. Cresswell KM, Bates DW, Sheikh A. Ten key considerations for the successful optimization of large-scale health information technology. J Am Med Inform Assoc. 2017;24:182–87.
24. Cresswell K, Bates DW, Sheikh A. Six ways for governments to get value from health IT. Lancet. 2016;387:2074–75.

Journal

Journal of the American Medical Informatics Association, Oxford University Press

Published: May 1, 2018
