Development of a global infectious disease activity database using natural language processing, machine learning, and human expertise

Abstract

Objective: We assessed whether machine learning can be utilized to allow efficient extraction of infectious disease activity information from online media reports.

Materials and Methods: We curated a data set of labeled media reports (n = 8322) indicating which articles contain updates about disease activity. We trained a classifier on this data set. To validate our system, we used a held-out test set and compared our articles to the World Health Organization (WHO) Disease Outbreak News reports.

Results: Our classifier achieved a recall and precision of 88.8% and 86.1%, respectively. The overall surveillance system detected 94% of the WHO-identified outbreaks that were covered by online media (89% of all WHO-identified outbreaks) and did so 43.4 (IQR: 9.5–61) days earlier on average.

Discussion: We constructed a global, real-time disease activity database surveilling 114 illnesses and syndromes. We must further assess our system for bias, representativeness, granularity, and accuracy.

Conclusion: Machine learning, natural language processing, and human expertise can be used to efficiently identify disease activity from digital media reports.

Keywords: machine learning, public health surveillance, communicable diseases, internet, health information systems

INTRODUCTION

Intervening rapidly in the course of an infectious disease epidemic significantly reduces cases and deaths. Event-based surveillance (ie, disease surveillance using online media reports and digital information sources) can facilitate timelier identification of outbreaks1–5 and of new areas of disease spread.6 The challenges of sustaining a high-quality, easily usable feed of online news about global infectious disease events relate to its volume, its variable veracity, and differences in language and syntax.
Event-based surveillance requires separating the signal (ie, current, real infectious disease outbreaks) from the noise (ie, news that may pertain to an infectious disease, or that uses the name of a particular disease, but is not about a current outbreak). As machine learning algorithms become more powerful, large data sets of historical disease activity are also needed to fuel predictions of the risks associated with epidemics.7,8 Failing to distinguish media reports about general infectious disease information from those pertaining to current disease activity hinders the manual review of articles and the creation of disease activity databases.

While a number of other systems gather, process, and disseminate surveillance information from online sources using text extraction and categorization,2 our system has been specifically designed to produce a curated feed of reports explicitly about infectious disease activity. To achieve this, we applied natural language processing (NLP) and machine learning (ML) models to classify articles. We collected media records from the Global Database of Events, Language, and Tone 2.0 (GDELT) Global Knowledge Graph (GKG) Version 2.0.9,10 We curated a list of alternate names for 114 infectious diseases. Articles that did not contain a disease name or an alternate name in the title were deemed irrelevant and were not included in further steps. However, due to the nuances of distinguishing the remaining articles containing general infectious disease information from those about disease activity, we could not rely solely on key words within the text. Instead, we used supervised learning to classify articles based on their relevance. Supervised learning algorithms infer a function that maps inputs (e.g., article titles) to outputs (e.g., whether the article contains disease activity information) from labeled data.
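The title screening step just described can be sketched as follows. The disease list and alternate names below are illustrative stand-ins, not the actual 114-disease database:

```python
# Sketch of the title keyword filter: an article is retained for further
# processing only if its (translated) title mentions a monitored disease or
# one of its alternate names. The two-disease name list is illustrative only.
ALTERNATE_NAMES = {
    "ebola": {"ebola", "ebola virus disease", "evd"},
    "avian influenza": {"avian influenza", "bird flu", "h5n1", "h7n9"},
}

def diseases_in_title(title):
    """Return the set of monitored diseases mentioned in an article title."""
    t = title.lower()
    return {disease
            for disease, names in ALTERNATE_NAMES.items()
            if any(name in t for name in names)}

def is_relevant(title):
    # Titles mentioning no monitored disease are deemed irrelevant.
    return bool(diseases_in_title(title))
```

Naive substring matching like this is only a sketch; a production filter would need word-boundary handling and per-language normalization.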
Using this approach, we could distinguish between articles containing content about disease activity, general infectious disease information, and irrelevant articles. Finally, we built a user interface to allow clinical experts to verify articles clustered by disease, location, and time, as well as to supplement case count and location data using official reports. An overview of the system can be seen in Figure 1.

Figure 1. The modules of our event-based surveillance system.

MATERIALS AND METHODS

Media report acquisition

The GDELT Project provides records from over 25 000 broadcast, print, and web outlets in 65 languages, updated every 15 minutes.10 In addition to cataloging the world’s media reports, GDELT also performs information extraction, using key words to associate each article with “tags” such as relevant locations, organizations, and people.9,10 Relevant to our purposes, GDELT tags specific diseases mentioned in each news article. From the potentially relevant articles with URLs acquired from GDELT, we scraped each title and body individually from the original source and used the Google Translate application programming interface to translate titles into English. We also performed automated data cleaning to handle duplicate records and missing data.

Disease activity filters

Next, we developed filters designed to achieve 2 goals: (1) provide a media feed containing information only about disease activity, while (2) ensuring relevant reports are not lost. The filters are as follows:

Filter 1-GDELT tags

We identified 114 priority diseases and syndromes to monitor, emphasizing emerging infectious diseases, highly communicable diseases, and virulent diseases with a high potential for morbidity and/or mortality. Articles with GDELT tags corresponding to these priority diseases were retained for further processing.
Filter 2-title key words

We created a database of alternate names for the 114 infectious diseases. Articles that did not contain a disease name or an alternate name in the title were classified as irrelevant.

Filter 3-machine learning

Since identifying articles that contain updates about disease activity requires sorting articles based on their semantic content, supervised ML algorithms and transfer learning were employed to detect relevant titles. We developed labeling criteria for differentiating news articles explicitly about disease activity and trained a team of approximately 20 people to label the data. Determining optimal labeling criteria required a number of iterations over several months. This enabled the full team to label the entire data set using a standardized set of rules. We assigned each disease-specific data set in duplicate to 2 different labelers, and discrepancies in labeling between the matching data sets were evaluated and resolved to the label that met the preestablished criteria.

The resulting training data set contained 8322 labeled titles, representing all articles on each of the 34 diseases with the highest volume in the GDELT feed over 10 months. The individual disease-specific data sets ranged from 200 to 300 articles. We then trained classifiers on 70% (n = 6051 media article titles) of this data set (i.e., the training set), with the remaining 30% (n = 2271) held back for model validation (i.e., the test set). The distribution of diseases in our training and test sets was kept fixed. Specifically, we assessed the performance of classifiers including naïve Bayes, support vector machines, and bidirectional long short-term memory (LSTM) recurrent neural networks. An ensemble of linear classifiers trained on disease-specific titles, linear classifiers trained on all titles, and a bidirectional LSTM trained on all titles showed the highest overall F1 score and was implemented in our system.
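The disease-stratified 70/30 split can be sketched as follows. The function name and record layout are ours; the paper does not specify tooling:

```python
import random
from collections import defaultdict

def stratified_split(records, train_frac=0.7, seed=0):
    """Split labeled (title, disease, label) records into training and test
    sets while keeping the per-disease distribution fixed, as described above."""
    by_disease = defaultdict(list)
    for record in records:
        by_disease[record[1]].append(record)   # group records by disease
    rng = random.Random(seed)
    train, test = [], []
    for disease_records in by_disease.values():
        rng.shuffle(disease_records)           # randomize within each disease
        cut = round(len(disease_records) * train_frac)
        train.extend(disease_records[:cut])
        test.extend(disease_records[cut:])
    return train, test
```

Splitting within each disease group (rather than over the pooled set) is what keeps the disease distribution identical between the two sets.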
To handle the diseases for which there were low volumes of news articles (fewer than 5 articles per day, with a variance of less than 10 articles over 10 months), we turned off the disease-specific components of our classifier and applied the remaining ensemble of classifiers to the articles tagged with these low-volume diseases. We prioritized model sensitivity over specificity, since missing a disease activity report was of greater concern than having some general infectious disease information or irrelevant articles in our feed.

User interface

We built a user interface to allow clinical experts to review the feed and manually tabulate disease activity data. The user interface was designed to achieve 3 objectives:

1. Review the output of our classifier and correct mistakes when they occur.
2. Verify clusters of articles that each contain information about a single disease event in a specific location.
3. Identify new or updated disease outbreak activity and tabulate case counts and fatalities by location.

All 3 objectives are performed daily, while reviews of spam and general disease information articles are performed on a monthly basis.

Validation

Before splitting our data into training and test sets, we removed all articles from our labeled data that did not contain a disease name in their titles. We did not stratify the train-test split by label because the 2 classes were relatively well balanced. We used the test set to assess the recall, precision, and F1 score of our classifier with respect to the disease activity class, defined as follows:

Recall = True Positives / (True Positives + False Negatives)
Precision = True Positives / (True Positives + False Positives)
F1 = 2 / (1/Precision + 1/Recall)

Recall is equivalent to sensitivity. Precision, or the positive predictive value, is a measure of how much noise is in our predicted positive class.
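The three metrics defined above can be computed directly from the confusion counts. A minimal illustration (the counts used in the test are made up):

```python
def classification_metrics(tp, fp, fn):
    """Recall, precision, and F1 exactly as defined above.
    F1 is the harmonic mean of precision and recall."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 / (1 / precision + 1 / recall)
    return {"recall": recall, "precision": precision, "f1": f1}

# With the paper's overall recall (0.88) and precision (0.86),
# F1 = 2 / (1/0.86 + 1/0.88), which rounds to the reported 0.87.
```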
We use precision instead of specificity because there are more reports on general infectious disease information than on disease activity: a few false positives would only slightly decrease specificity but would dramatically increase the proportion of articles in our feed that do not contain disease activity information. F1 is the harmonic mean of precision and recall, which balances the trade-off between meeting one goal at the cost of the other.

To review the classifier’s performance on articles about diseases with low volumes of reporting activity, we randomly selected 117 articles predicted to be about general infectious disease information or irrelevant and 107 articles predicted to be about disease activity. We retroactively labeled this small sample and assessed our performance.

Finally, we compared our GDELT-derived feed to the WHO Disease Outbreak News (DON)11 reports published from July 2017 to June 2018 and examined how many officially reported WHO outbreaks were captured. The DON reports provide reliable information on the outbreaks most relevant to the international community.

RESULTS

Supervised learning validation

In the supervised learning context, we achieved a recall of 0.88, a precision of 0.86, and an F1 score of 0.87. A baseline classifier that predicts all articles to contain infectious disease activity information would have a recall, precision, and F1 score of 1.00, 0.31, and 0.47, respectively. The precision and recall scores, disaggregated by disease, can be seen in Table 1. Our classifier’s precision improves on the baseline classifier’s precision for all diseases. For those diseases without a specific classifier, the recall, precision, and F1 scores on the out-of-sample data were 0.84, 0.70, and 0.76, respectively.

Table 1.
Classifier test set precision, recall, and F1 score for articles by disease (with the number of articles in the test set), compared to a baseline classifier where all articles are predicted to contain infectious disease activity information (a).

Disease name             | Baseline precision | Precision | Improvement in precision | Recall | F1 score | Articles
All                      | 0.31 | 0.86 | 0.55 | 0.88 | 0.87 | 2271
Anthrax                  | 0.26 | 0.70 | 0.44 | 0.78 | 0.74 |   34
Avian influenza          | 0.49 | 0.84 | 0.35 | 0.89 | 0.86 |   69
Cholera                  | 0.34 | 0.88 | 0.54 | 0.88 | 0.88 |   50
Dengue                   | 0.49 | 0.76 | 0.27 | 1.00 | 0.86 |   63
Diphtheria               | 0.45 | 0.86 | 0.40 | 0.80 | 0.83 |   33
Ebola                    | 0.28 | 0.73 | 0.45 | 0.92 | 0.81 |  169
Hemorrhagic fever        | 0.22 | 0.92 | 0.70 | 0.86 | 0.89 |   18
Hepatitis A              | 0.40 | 0.87 | 0.47 | 0.89 | 0.88 |  117
Hepatitis B              | 0.06 | 0.50 | 0.44 | 1.00 | 0.67 |   36
Hepatitis C              | 0.04 | 0.50 | 0.46 | 1.00 | 0.67 |   47
Hepatitis E              | 0.36 | 0.92 | 0.56 | 0.85 | 0.88 |   76
HIV                      | 0.05 | 0.40 | 0.35 | 0.67 | 0.50 |   55
Lassa fever              | 0.40 | 0.93 | 0.53 | 0.86 | 0.89 |  129
Leprosy                  | 0.06 | 0.40 | 0.34 | 1.00 | 0.57 |   34
Listeria                 | 0.33 | 0.92 | 0.59 | 0.94 | 0.93 |  106
Lyme disease             | 0.07 | 0.67 | 0.60 | 1.00 | 0.80 |   59
Malaria falciparum       | 0.16 | 1.00 | 0.84 | 0.43 | 0.60 |   90
Marburg                  | 0.39 | 0.89 | 0.50 | 0.85 | 0.87 |   51
Measles                  | 0.49 | 0.84 | 0.35 | 0.81 | 0.82 |   53
Meningococcal meningitis | 0.41 | 0.94 | 0.53 | 0.96 | 0.95 |  114
Monkeypox                | 0.46 | 0.88 | 0.42 | 0.87 | 0.88 |  160
Mumps                    | 0.21 | 0.96 | 0.75 | 0.93 | 0.94 |  103
Norovirus                | 0.28 | 0.89 | 0.61 | 0.93 | 0.91 |   61
Plague                   | 0.18 | 0.33 | 0.15 | 1.00 | 0.50 |   22
Rabies                   | 0.29 | 0.89 | 0.60 | 0.91 | 0.90 |   65
Seasonal influenza       | 0.44 | 0.87 | 0.43 | 0.88 | 0.88 |  137
Swine influenza          | 0.50 | 0.71 | 0.21 | 0.71 | 0.71 |   14
Tickborne diseases       | 0.17 | 0.80 | 0.63 | 0.89 | 0.84 |  104
Tuberculosis             | 0.31 | 0.73 | 0.43 | 0.73 | 0.73 |   49
Typhoid                  | 0.21 | 1.00 | 0.79 | 1.00 | 1.00 |   19
West Nile virus          | 0.14 | 0.97 | 0.83 | 0.91 | 0.94 |   37
Yellow fever             | 0.31 | 0.73 | 0.43 | 0.92 | 0.81 |   39
Zika                     | 0.17 | 0.80 | 0.63 | 0.80 | 0.80 |   58

(a) We only report the baseline classifier’s precision because its recall is 1.00 by definition.

WHO disease outbreak news comparison

We identified 37 unique outbreaks from the WHO DON reports. Out of the 37 outbreaks, 89% were covered by online news outlets before the WHO reported the outbreak. Considering only the disease activity reported by online media, our system detected 94% of events before they were reported by the WHO. Among all events reported by the media before the WHO, the first online article was reported a mean of 43.4 (IQR: 9.5–61) days earlier.

DISCUSSION

Though recall is essential for a useful and safe surveillance system, the high precision of reports about disease activity allows for efficient manual review and data collection.
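The timeliness comparison against the WHO DON reports can be sketched as follows. The outbreak ids and dates are illustrative, not the study's actual events:

```python
from datetime import date
from statistics import mean

def detection_lead_times(events):
    """Days by which the first online media article preceded the WHO DON
    report, for each outbreak with prior media coverage. `events` maps an
    outbreak id to (first_media_date, who_don_date); outbreaks the media
    did not cover beforehand (first_media_date is None) are skipped."""
    return [(don - media).days
            for media, don in events.values()
            if media is not None and media < don]

events = {  # illustrative outbreaks only
    "outbreak_a": (date(2018, 1, 1), date(2018, 2, 14)),
    "outbreak_b": (date(2018, 3, 10), date(2018, 3, 20)),
    "outbreak_c": (None, date(2018, 5, 1)),  # no prior media coverage
}
lead_times = detection_lead_times(events)
```

The fraction of outbreaks with prior media coverage (here 2 of 3) and the mean of the lead times would be reported alongside these values.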
Assessing the classifier’s performance on a disease-by-disease basis is also challenging due to a lack of available data for some diseases: those with smaller test sets showed higher variance in precision and recall. The recall of the models was also lower for certain diseases (e.g., malaria falciparum and HIV), which warrants further model improvement, since we risk missing outbreaks. However, for most diseases of interest, our system correctly retrieved articles containing information about disease activity. Because multiple articles are usually written about a single disease event, we suspect that in practice our system detects a higher proportion of events than our recall suggests. To this end, most infectious disease events of international concern were identified by our system well in advance of official WHO reports.

With respect to diseases for which we did not curate labeled data, we found that although overall performance was lower, recall was still satisfactory. To achieve an adequate recall for the low-volume diseases, some sacrifice in precision was necessary. We considered this an acceptable trade-off given the relatively small proportion of articles in the feed that pertain to infectious diseases with low media coverage. If a future outbreak of a historically low-volume disease occurs and the media responds by publishing more reports than usual, we could label data and train a specific classifier for the disease in question. Also, by continually collecting more labeled data, we hope to improve the performance of our model over time.
Finally, while Google Translate is known to differ in performance across languages, recent advances in NLP have dramatically improved translation quality.12 We plan to assess the quality of translation and the performance of the ML ensemble on translated articles, and will consider alternatives (e.g., training ML models on a subset of articles in their native language) if deemed necessary.

Event-based surveillance should not be considered a replacement for traditional indicator-based surveillance, but rather complementary to routinely collected public health surveillance data.2 Though we achieved our goal of creating a global feed of disease activity media reports, general concerns with digital disease surveillance include geographic representativeness, reporting bias, and reliability.13 News outlets report on only a proportion of unusual local disease activity, and this coverage is neither complete nor uniformly distributed. These shortcomings were evident in our validation: 11% of internationally relevant outbreaks reported by the WHO were not covered by local media beforehand. Although nontraditional data sources can provide timelier information than official sources, news reports are prone to error. In particular, media reports may fail to distinguish between suspected and confirmed cases, leading to an inaccurate picture of an epidemic and its trajectory. Furthermore, media interest may wane over the course of an outbreak, providing less complete information as the outbreak progresses.
Other limitations inherent to online infectious disease reporting include day-of-week and holiday effects, “crowding out” by other notable outbreaks,14 and a lack of online and digital media coverage in less developed locations.15 To mitigate these concerns, supplementary data from official government health websites and reports from the medical and health communities (e.g., ProMED-mail16) augment the media-based data in our system, and human expert review and validation is performed. Once we have accumulated a substantial amount of disease activity data, we will further assess our system for bias, representativeness, granularity, accuracy, and timeliness.

Ultimately, the combination of NLP, ML, and human expertise allows us to create a verified, international, and near real-time event-based infectious disease activity database, which can be supplemented with additional public or private data sources. The focused nature of the feed may also allow automated extraction of case counts using temporal topic trends.17 A historical database can eventually be utilized to develop predictive models of potential local and international disease spread. Currently, a public application programming interface is available for retrieving active disease event data globally, by city or disease, and by city for any distant disease events that pose a risk based on connectivity through air travel.18

CONCLUSIONS

Using NLP, ML, and human expertise, we have begun the creation of a global infectious disease activity database. Event-based surveillance systems struggle to handle the diversity of epidemiological media reports.1,13 Advances in ML allow us to explicitly distinguish between different types of media reports, making it easier to monitor, record, and quantitatively study disease activity.
Now that it is feasible to create a database of disease activity in a highly automated fashion, our system provides a new opportunity to apply ML and forecasting models to near real-time event-based surveillance data to enhance public health capacity at a global scale.

FUNDING

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

AUTHOR CONTRIBUTIONS

JF, AT, ZHP, and KK contributed to the design and analysis of the study. Jack Forsyth (JF) contributed to data acquisition. All authors contributed to the preparation of the manuscript.

ACKNOWLEDGEMENTS

The authors would like to thank Valerie Riabova and Xingliang Huang for their help developing the database infrastructure and Greg Hines for helping with the technical edits of the manuscript. The authors sincerely thank Kalev Leetaru for adding several diseases to the GDELT Global Knowledge Graph at our request. The authors also greatly appreciate the constructive comments received from 3 reviewers.

Conflict of interest statement

The authors have no competing interests to declare.

REFERENCES

1. Yan SJ, Chughtai AA, Macintyre CR. Utility and potential of rapid epidemic intelligence from internet-based sources. Int J Infect Dis 2017; 63: 77–87.
2. O’Shea J. Digital disease detection: a systematic review of event-based internet biosurveillance systems. Int J Med Inform 2017; 101: 15–22.
3. Barboza P, Vaillant L, Mawudeku A, et al. Evaluation of epidemic intelligence systems integrated in the early alerting and reporting project for the detection of A/H5N1 influenza events. PLoS ONE 2013; 8(3): e57252.
4. Lyon A, Nunn M, Grossel G, Burgman M. Comparison of web-based biosecurity intelligence systems: BioCaster, EpiSPIDER and HealthMap. Transbound Emerg Dis 2012; 59(3): 223–32.
5. Mondor L, Brownstein JS, Chan E, et al. Timeliness of nongovernmental versus governmental global outbreak communications. Emerg Infect Dis 2012; 18(7): 1184–7.
6. Hoen AG, Keller M, Verma AD, Buckeridge DL, Brownstein JS. Electronic event-based surveillance for monitoring dengue, Latin America. Emerg Infect Dis 2012; 18(7): 1147–50.
7. Bansal S, Chowell G, Simonsen L, Vespignani A, Viboud C. Big data for infectious disease surveillance and modeling. J Infect Dis 2016; 214(Suppl 4): S375–9.
8. Hay SI, George DB, Moyes CL, Brownstein JS. Big data opportunities for global infectious disease surveillance. PLoS Med 2013; 10(4): e1001413.
9. Leetaru K, Schrodt PA. GDELT: Global Data on Events, Location and Tone, 1979–2012. 2013. http://data.gdeltproject.org/documentation/ISA.2013.GDELT.pdf. Accessed September 17, 2018.
10. The GDELT Project. 2013–2018. https://www.gdeltproject.org/. Accessed September 17, 2018.
11. World Health Organization. Disease Outbreak News. 2018. http://www.who.int/csr/don/en/. Accessed September 20, 2018.
12. Wu Y, Schuster M, Chen Z, et al. Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. arXiv:1609.08144; 2016.
13. Hartley DM, Nelson NP, Arthur RR, et al. An overview of internet biosurveillance. Clin Microbiol Infect 2013; 19(11): 1006–13.
14. Scales D, Zelenev A, Brownstein JS. Quantifying the effect of media limitations on outbreak data in a global online web-crawling epidemic intelligence system, 2008–2011. J Emerg Health Threats 2013; 6(1): 21621.
15. Schwind JS, Wolking DJ, Brownstein JS, Mazet JA, Smith WA; PREDICT Consortium. Evaluation of local media surveillance for improved disease recognition and monitoring in global hotspot regions. PLoS One 2014; 9(10): e110236.
16. International Society for Infectious Diseases. ProMED-mail. 2010. https://www.promedmail.org/. Accessed September 20, 2018.
17. Ghosh S, Chakraborty P, Nsoesie EO, et al. Temporal topic modeling to assess associations between news trends and infectious disease outbreaks. Sci Rep 2017; 7: 40841.
18. BlueDot Inc. API Portal. https://bluedot-dev-api.portal.azure-api.net/. Accessed March 1, 2019.

Journal of the American Medical Informatics Association, Oxford University Press.

© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com

Publisher: Oxford University Press
ISSN: 1067-5027
eISSN: 1527-974X
DOI: 10.1093/jamia/ocz112
Publisher site
See Article on Publisher Site

Abstract

Abstract Objective We assessed whether machine learning can be utilized to allow efficient extraction of infectious disease activity information from online media reports. Materials and Methods We curated a data set of labeled media reports (n = 8322) indicating which articles contain updates about disease activity. We trained a classifier on this data set. To validate our system, we used a held out test set and compared our articles to the World Health Organization Disease Outbreak News reports. Results Our classifier achieved a recall and precision of 88.8% and 86.1%, respectively. The overall surveillance system detected 94% of the outbreaks identified by the WHO covered by online media (89%) and did so 43.4 (IQR: 9.5–61) days earlier on average. Discussion We constructed a global real-time disease activity database surveilling 114 illnesses and syndromes. We must further assess our system for bias, representativeness, granularity, and accuracy. Conclusion Machine learning, natural language processing, and human expertise can be used to efficiently identify disease activity from digital media reports. machine learning, public health surveillance, communicable diseases, internet, health information systems INTRODUCTION Intervening rapidly in the course of an infectious disease epidemic significantly reduces cases and deaths. Event-based surveillance (ie, disease surveillance using online media reports and digital information sources) can facilitate timelier identification of outbreaks1–5 and new areas of disease spread.6 Challenges in sustaining a high quality and easily usable feed of online news about global infectious disease events relate to the volume, range in veracity, and language differences in syntax. Event-based surveillance requires separating the signal (ie, current, real infectious disease outbreaks) from the noise (ie, news that may pertain to an infectious disease or which uses the name of a particular disease but is not about a current outbreak). 
As machine learning algorithms become more powerful, large data sets of historical disease activity are also needed to fuel predictions of risks associated with epidemics.7,8 Failing to distinguish media reports about general infectious disease information from those pertaining to current disease activity hinders the manual review of articles and creation of disease activity databases. While a number of other systems gather, process and disseminate surveillance information from online sources using text extraction and categorization,2 our system has been specifically designed towards a curated feed of reports explicitly about infectious disease activity. To achieve this, we applied natural language processing (NLP) and machine learning (ML) models to classify articles. We collected media records from the Global Database of Events, Language, and Tone 2.0 (GDELT) Global Knowledge Graph (GKG) Version 2.0.9,10 We curated a list of alternate names for 114 infectious diseases. Articles that did not contain a disease name or an alternate name in the title were deemed irrelevant and were not included in further steps. However, due to nuances of distinguishing between remaining articles containing general infectious disease information and those about disease activity, we could not solely rely on key words within the text. Instead, we used supervised learning to classify articles based on their relevance. Supervised learning algorithms infer a function that maps inputs (e.g., article titles) to outputs (e.g., whether the article contains disease activity information) from labeled data. Using this approach, we could distinguish between articles containing content about disease activity, general infectious disease information, and irrelevant articles. Finally, we built a user interface to allow clinical experts to verify articles clustered by disease, location, and time, as well as supplement case count and locations data using official reports. 
An overview of the system can be seen in Figure 1.

Figure 1. The modules of our event-based surveillance system.

MATERIALS AND METHODS

Media report acquisition

The GDELT Project provides records from over 25 000 broadcast, print, and web outlets in 65 languages, updated every 15 minutes.10 In addition to cataloging the world’s media reports, GDELT also performs information extraction, using key words to associate each article with “tags” such as relevant locations, organizations, and people.9,10 Relevant to our purposes, GDELT tags specific diseases mentioned in each news article. From the potentially relevant articles with URLs acquired from GDELT, we scraped each title and body individually from the original source and used the Google Translate application programming interface to translate titles into English. We also performed automated data cleaning to handle duplicate records and missing data.

Disease activity filters

Next, we developed filters designed to achieve 2 goals: (1) provide a media feed containing information only about disease activity, while (2) ensuring relevant reports are not lost. The filters are as follows:

Filter 1: GDELT tags

We identified 114 priority diseases and syndromes to monitor, emphasizing emerging infectious diseases, highly communicable diseases, and virulent diseases with a high potential for morbidity and/or mortality. Articles with GDELT tags corresponding to these priority diseases were retained for further processing.

Filter 2: Title key words

We created a database of alternate names for the 114 infectious diseases. Articles that did not contain a disease name or an alternate name in the title were classified as irrelevant.
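Filter 2 can be sketched as a simple alias lookup over the title. The disease names and aliases below are a tiny invented sample, not the study's full 114-disease database.

```python
# Hypothetical alias table mapping alternate names to a canonical disease name.
ALTERNATE_NAMES = {
    "bird flu": "avian influenza",
    "h5n1": "avian influenza",
    "monkeypox": "monkeypox",
    "ebola": "ebola",
}

def match_title(title: str):
    """Return the canonical disease name mentioned in the title, or None."""
    lowered = title.lower()
    for alias, canonical in ALTERNATE_NAMES.items():
        if alias in lowered:
            return canonical
    return None  # no disease name: the article is classified as irrelevant
```

A production version would need word-boundary matching and per-language handling, but the retain-or-discard logic is the same.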
Filter 3: Machine learning

Since identifying articles that contain updates about disease activity requires sorting articles based on their semantic content, supervised ML algorithms and transfer learning were employed to detect relevant titles. We developed labeling criteria for differentiating news articles explicitly about disease activity and trained a team of approximately 20 people to label the data. Determining the optimal labeling criteria required a number of iterations over several months; this enabled the full team to label the entire data set using a standardized set of rules. We assigned data sets by disease in duplicate to 2 different labelers, and discrepancies in labeling between matching data sets were evaluated and resolved to the label that met the preestablished criteria. The resulting training data set contained 8322 labeled titles representing all articles on each of the 34 diseases with the highest volume in the GDELT feed over 10 months; the individual disease-specific data sets ranged from 200 to 300 articles. We then trained classifiers on 70% (n = 6051 media article titles) of this data set (ie, the training set), with the remaining 30% (n = 2271) held back for model validation (ie, the test set). The distribution of diseases in our training and test sets was kept fixed. Specifically, we assessed the performance of naïve Bayes classifiers, support vector machines, and bidirectional long short-term memory (BiLSTM) recurrent neural networks. An ensemble of linear classifiers trained on disease-specific titles, linear classifiers trained on all titles, and a BiLSTM trained on all titles showed the highest overall F1 score and was implemented in our system.
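Keeping the disease distribution fixed across the 70/30 split corresponds to stratified sampling on the disease label. A sketch with invented records:

```python
from sklearn.model_selection import train_test_split

# Hypothetical (title, disease) records; stratifying on the disease field keeps
# the disease mix identical in the training and test sets.
diseases = ["cholera"] * 10 + ["measles"] * 10
titles = [f"{disease} headline {i}" for i, disease in enumerate(diseases)]

train_titles, test_titles, train_diseases, test_diseases = train_test_split(
    titles, diseases, test_size=0.3, stratify=diseases, random_state=0
)
# Yields 14 training titles (7 per disease) and 6 test titles (3 per disease).
```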
To handle the diseases for which there were low volumes of news articles (fewer than 5 articles per day, with variance of less than 10 articles over 10 months), we turned off the disease-specific components of our classifier and applied the remaining ensemble of classifiers to the articles tagged with these low-volume diseases. We prioritized model sensitivity over specificity, since missing a disease activity report was of greater concern than having some general infectious disease information or irrelevant articles in our feed.

User interface

We built a user interface to allow clinical experts to review the feed and manually tabulate disease activity data. The user interface was designed to achieve 3 objectives:

1. Review the output of our classifier and correct mistakes when they occur.
2. Verify clusters of articles that each contain information about a single disease event in a specific location.
3. Identify new or updated disease outbreak activity and tabulate case counts and fatalities by location.

All 3 objectives are performed daily, while reviews of spam and general disease information articles are performed on a monthly basis.

Validation

Before splitting our data into training and test sets, we removed all articles from our labeled data that did not contain a disease name in their titles. We did not stratify the train-test split by label because the 2 classes were relatively well balanced. We used the test set to assess the recall, precision, and F1 score of our classifier with respect to the disease activity class, defined as follows:

Recall = True Positives / (True Positives + False Negatives)
Precision = True Positives / (True Positives + False Positives)
F1 = 2 / (1/Precision + 1/Recall)

Recall is equivalent to sensitivity. Precision, or the positive predictive value, is a measure of how much noise is in our predicted positive class.
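The three metric definitions above translate directly into code. The confusion-matrix counts in the usage example are illustrative only, not the study's actual results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 exactly as defined in the text
    (true negatives are not needed for any of the three)."""
    recall = tp / (tp + fn)                # equivalent to sensitivity
    precision = tp / (tp + fp)             # positive predictive value
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two
    return precision, recall, f1

# Illustrative counts only (not the study's confusion matrix):
precision, recall, f1 = precision_recall_f1(tp=80, fp=13, fn=10)
```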
We use precision instead of specificity because there are far more reports containing general infectious disease information than disease activity. A few false positives would only slightly decrease the specificity but would dramatically increase the proportion of articles in our feed that do not contain disease activity information. F1 is the harmonic mean of precision and recall, which balances the trade-off between meeting one goal at the cost of the other.

To review the classifier’s performance on articles about diseases with low volumes of reporting activity, we randomly selected 117 articles predicted to be about general infectious disease information or irrelevant and 107 articles predicted to be about disease activity. We retroactively labeled this small sample and assessed our performance.

Finally, we compared our GDELT-derived feed to the WHO Disease Outbreak News (DON)11 reports published from July 2017 to June 2018 and examined how many WHO officially reported outbreaks were captured. The DON reports provide reliable information on the outbreaks most relevant to the international community.

RESULTS

Supervised learning validation

In the supervised learning context, we achieved a recall score of 0.88, a precision score of 0.86, and an F1 score of 0.87. A baseline classifier that predicts all articles to contain infectious disease activity information would have a recall, precision, and F1 score of 1.00, 0.31, and 0.47, respectively. The precision and recall scores, disaggregated by disease, can be seen in Table 1. Our classifier’s precision improves on the baseline classifier’s precision for all diseases. For the diseases without a specific classifier, the recall, precision, and F1 scores for the out-of-sample data were 0.84, 0.70, and 0.76, respectively.

Table 1.
Classifier test set precision, recall, and F1 score for articles by disease (with number of articles in the test set), compared to a baseline classifier in which all articles are predicted to contain infectious disease activity information.a

| Disease name | Baseline precision | Precision | Improvement in precision | Recall | F1 score | Number of articles |
| --- | --- | --- | --- | --- | --- | --- |
| All | 0.31 | 0.86 | 0.55 | 0.88 | 0.87 | 2271 |
| Anthrax | 0.26 | 0.70 | 0.44 | 0.78 | 0.74 | 34 |
| Avian influenza | 0.49 | 0.84 | 0.35 | 0.89 | 0.86 | 69 |
| Cholera | 0.34 | 0.88 | 0.54 | 0.88 | 0.88 | 50 |
| Dengue | 0.49 | 0.76 | 0.27 | 1.00 | 0.86 | 63 |
| Diphtheria | 0.45 | 0.86 | 0.40 | 0.80 | 0.83 | 33 |
| Ebola | 0.28 | 0.73 | 0.45 | 0.92 | 0.81 | 169 |
| Hemorrhagic fever | 0.22 | 0.92 | 0.70 | 0.86 | 0.89 | 18 |
| Hepatitis A | 0.40 | 0.87 | 0.47 | 0.89 | 0.88 | 117 |
| Hepatitis B | 0.06 | 0.50 | 0.44 | 1.00 | 0.67 | 36 |
| Hepatitis C | 0.04 | 0.50 | 0.46 | 1.00 | 0.67 | 47 |
| Hepatitis E | 0.36 | 0.92 | 0.56 | 0.85 | 0.88 | 76 |
| HIV | 0.05 | 0.40 | 0.35 | 0.67 | 0.50 | 55 |
| Lassa fever | 0.40 | 0.93 | 0.53 | 0.86 | 0.89 | 129 |
| Leprosy | 0.06 | 0.40 | 0.34 | 1.00 | 0.57 | 34 |
| Listeria | 0.33 | 0.92 | 0.59 | 0.94 | 0.93 | 106 |
| Lyme disease | 0.07 | 0.67 | 0.60 | 1.00 | 0.80 | 59 |
| Malaria falciparum | 0.16 | 1.00 | 0.84 | 0.43 | 0.60 | 90 |
| Marburg | 0.39 | 0.89 | 0.50 | 0.85 | 0.87 | 51 |
| Measles | 0.49 | 0.84 | 0.35 | 0.81 | 0.82 | 53 |
| Meningococcal meningitis | 0.41 | 0.94 | 0.53 | 0.96 | 0.95 | 114 |
| Monkeypox | 0.46 | 0.88 | 0.42 | 0.87 | 0.88 | 160 |
| Mumps | 0.21 | 0.96 | 0.75 | 0.93 | 0.94 | 103 |
| Norovirus | 0.28 | 0.89 | 0.61 | 0.93 | 0.91 | 61 |
| Plague | 0.18 | 0.33 | 0.15 | 1.00 | 0.50 | 22 |
| Rabies | 0.29 | 0.89 | 0.60 | 0.91 | 0.90 | 65 |
| Seasonal influenza | 0.44 | 0.87 | 0.43 | 0.88 | 0.88 | 137 |
| Swine influenza | 0.50 | 0.71 | 0.21 | 0.71 | 0.71 | 14 |
| Tickborne diseases | 0.17 | 0.80 | 0.63 | 0.89 | 0.84 | 104 |
| Tuberculosis | 0.31 | 0.73 | 0.43 | 0.73 | 0.73 | 49 |
| Typhoid | 0.21 | 1.00 | 0.79 | 1.00 | 1.00 | 19 |
| West Nile virus | 0.14 | 0.97 | 0.83 | 0.91 | 0.94 | 37 |
| Yellow fever | 0.31 | 0.73 | 0.43 | 0.92 | 0.81 | 39 |
| Zika | 0.17 | 0.80 | 0.63 | 0.80 | 0.80 | 58 |

a We only report the baseline classifier’s precision because its recall is 1.00 by definition.

WHO Disease Outbreak News comparison

We identified 37 unique outbreaks from the WHO DON reports. Of these 37 outbreaks, 89% were covered by online news outlets before the WHO reported the outbreak. Considering only the disease activity reported by online media, our system detected 94% of events before they were reported by the WHO. Among all events reported by the media before the WHO, the first online article appeared a mean of 43.4 (IQR: 9.5–61) days earlier.

DISCUSSION

Though recall is essential for a useful and safe surveillance system, the high precision of reports about disease activity allows for efficient manual review and data collection.
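The timeliness comparison against the WHO DON reports reduces to a date difference per outbreak. The events and dates below are invented placeholders, not the study's actual data.

```python
from datetime import date
from statistics import mean

# Hypothetical outbreaks: (event, first online media report, WHO DON publication).
events = [
    ("Lassa fever, country A", date(2018, 1, 5), date(2018, 1, 23)),
    ("Cholera, country B", date(2017, 8, 1), date(2017, 9, 29)),
]

# A positive lead time means the media reported before the WHO did.
lead_times = [(who - media).days for _, media, who in events]
mean_lead_time = mean(lead_times)
```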
Assessing the classifier’s performance on a disease-by-disease basis is also challenging due to a lack of available data for some diseases: those with smaller test sets showed higher variance in precision and recall scores. The recall of the models was also lower for certain diseases (eg, malaria falciparum and HIV), which warrants further model improvement, since we risk missing outbreaks. However, for most diseases of interest, our system correctly retrieved articles containing information about disease activity. Because multiple articles are usually written about a single disease event, we suspect that in practice our system detects a higher proportion of events than our recall suggests. Indeed, most infectious disease events of international concern were identified by our system well in advance of official reports by the WHO.

With respect to diseases for which we did not curate labeled data, we found that although overall performance was lower, recall was still satisfactory. To achieve an adequate recall score for the low-volume diseases, some sacrifice in precision was necessary. We considered this an acceptable trade-off given the relatively small proportion of articles in the feed that pertain to infectious diseases with low media coverage. If a future outbreak occurs for a historically low-volume disease and the media responds by publishing more reports than usual, we could label data and train a specific classifier for the disease in question. Also, by collecting more labeled data on a continual basis, we hope to improve the performance of our model over time.
Finally, while Google Translate is known to differ in performance across languages, recent advances in NLP have dramatically improved machine translation quality.12 We nonetheless plan to assess the quality of translation and the performance of the ML ensemble on translated articles, and will consider alternatives (eg, training ML models on a subset of articles in their native language) if deemed necessary.

Event-based surveillance should not be considered a replacement for traditional indicator-based surveillance, but rather complementary to routinely collected public health surveillance data.2 Though we achieved our goal of creating a global feed of disease activity media reports, general concerns with digital disease surveillance include geographic representativeness, reporting bias, and reliability.13 News outlets report on only a portion of unusual local disease activity, and this coverage is neither complete nor uniformly distributed. These shortcomings were evident in our validation, given that 11% of internationally relevant outbreaks reported by the WHO were not covered by local media beforehand. Although nontraditional data sources can provide timelier information than official sources, news reports are prone to error. In particular, media reports may fail to distinguish between suspected and confirmed cases, leading to an inaccurate picture of an epidemic and its trajectory. Furthermore, media interest may wane during the course of an outbreak, providing less complete information as the outbreak progresses.
Other limitations inherent to online infectious disease reporting include day-of-week and holiday effects, “crowding out” by other notable outbreaks,14 and a lack of online and digital media coverage in less developed locations.15 To mitigate these concerns, supplementary data from official government health websites and reports from the medical and health communities (eg, ProMED-mail16) augment the media-based data in our system, and human expert review and validation are performed. Once we have accumulated a substantial amount of disease activity data, we will further assess our system for bias, representativeness, granularity, accuracy, and timeliness.

Ultimately, the combination of NLP, ML, and human expertise allows us to create a verified, international, and near real-time event-based infectious disease activity database, which can be supplemented with additional public or private data sources. The focused nature of the feed may also allow case counts to be extracted in an automated fashion using temporal topic trends.17 A historical database can eventually be utilized to develop predictive models of potential local and international disease spread. Currently, a public application programming interface is available for retrieving active disease event data globally, by city or disease, and by city for any distant disease events that pose a risk based on connectivity through air travel.18

CONCLUSIONS

Using NLP, ML, and human expertise, we have begun the creation of a global infectious disease activity database. Event-based surveillance systems struggle to handle the diversity of epidemiological media reports.1,13 Advances in ML allow us to explicitly distinguish between different types of media reports, making it easier to monitor, record, and quantitatively study disease activity.
Now that it is feasible to create a database of disease activity in a highly automated fashion, our system provides a new opportunity to apply ML and forecasting models using near real-time event-based surveillance data to enhance public health capacity at a global scale.

FUNDING

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

AUTHOR CONTRIBUTIONS

JF, AT, ZHP, and KK contributed to the design and analysis of the study. Jack Forsyth (JF) contributed to data acquisition. All authors contributed to the preparation of the manuscript.

ACKNOWLEDGEMENTS

The authors would like to thank Valerie Riabova and Xingliang Huang for their help developing the database infrastructure and Greg Hines for helping with the technical edits of the manuscript. The authors sincerely thank Kalev Leetaru for adding several diseases to the GDELT Global Knowledge Graph at our request. The authors also greatly appreciate the constructive comments received from 3 reviewers.

Conflict of interest statement

The authors have no competing interests to declare.

REFERENCES

1. Yan SJ, Chughtai AA, Macintyre CR. Utility and potential of rapid epidemic intelligence from internet-based sources. Int J Infect Dis 2017; 63: 77–87.
2. O’Shea J. Digital disease detection: a systematic review of event-based internet biosurveillance systems. Int J Med Inform 2017; 101: 15–22.
3. Barboza P, Vaillant L, Mawudeku A, et al. Evaluation of epidemic intelligence systems integrated in the early alerting and reporting project for the detection of A/H5N1 influenza events. PLoS ONE 2013; 8 (3): e57252.
4. Lyon A, Nunn M, Grossel G, Burgman M. Comparison of web-based biosecurity intelligence systems: BioCaster, EpiSPIDER and HealthMap. Transbound Emerg Dis 2012; 59 (3): 223–32.
5. Mondor L, Brownstein JS, Chan E, et al. Timeliness of nongovernmental versus governmental global outbreak communications. Emerg Infect Dis 2012; 18 (7): 1184–7.
6. Hoen AG, Keller M, Verma AD, Buckeridge DL, Brownstein JS. Electronic event-based surveillance for monitoring dengue, Latin America. Emerg Infect Dis 2012; 18 (7): 1147–50.
7. Bansal S, Chowell G, Simonsen L, Vespignani A, Viboud C. Big data for infectious disease surveillance and modeling. J Infect Dis 2016; 214 (Suppl 4): S375–9.
8. Hay SI, George DB, Moyes CL, Brownstein JS. Big data opportunities for global infectious disease surveillance. PLoS Med 2013; 10 (4): e1001413.
9. Leetaru K, Schrodt PA. GDELT: Global Data on Events, Location and Tone 1979–2012. 2013. http://data.gdeltproject.org/documentation/ISA.2013.GDELT.pdf. Accessed September 17, 2018.
10. The GDELT Project. 2013–2018. https://www.gdeltproject.org/. Accessed September 17, 2018.
11. World Health Organization. Disease Outbreak News. 2018. http://www.who.int/csr/don/en/. Accessed September 20, 2018.
12. Wu Y, Schuster M, Chen Z, et al. Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. arXiv:1609.08144; 2016.
13. Hartley DM, Nelson NP, Arthur RR, et al. An overview of internet biosurveillance. Clin Microbiol Infect 2013; 19 (11): 1006–13.
14. Scales D, Zelenev A, Brownstein JS. Quantifying the effect of media limitations on outbreak data in a global online web-crawling epidemic intelligence system, 2008–2011. J Emerg Health Threats 2013; 6 (1): 21621.
15. Schwind JS, Wolking DJ, Brownstein JS, Mazet JA, Smith WA; PREDICT Consortium. Evaluation of local media surveillance for improved disease recognition and monitoring in global hotspot regions. PLoS One 2014; 9 (10): e110236.
16. International Society for Infectious Diseases. ProMED-mail. 2010. https://www.promedmail.org/. Accessed September 20, 2018.
17. Ghosh S, Chakraborty P, Nsoesie EO, et al. Temporal topic modeling to assess associations between news trends and infectious disease outbreaks. Sci Rep 2017; 7: 40841.
18. BlueDot Inc. API Portal. https://bluedot-dev-api.portal.azure-api.net/. Accessed March 1, 2019.

© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. This article is published and distributed under the terms of the Oxford University Press Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

Journal of the American Medical Informatics Association (Oxford University Press). Published: November 1, 2019.