Systematic Review of Privacy-Preserving Distributed Machine Learning From Federated Databases in Health Care

review articles — abstract

Fadila Zerka, PhD; Samir Barakat, MSc, PhD; Sean Walsh, MSc, PhD; Marta Bogowicz, PhD; Ralph T. H. Leijenaar, MSc, PhD; Arthur Jochems, PhD; Benjamin Miraglio, PhD; David Townend, LLB, MPhil, PhD; and Philippe Lambin, MD, PhD

Big data for health care is one of the potential solutions to deal with the numerous challenges of health care, such as rising cost, aging population, precision medicine, universal health coverage, and the increase of noncommunicable diseases. However, data centralization for big data raises privacy and regulatory concerns. Covered topics include (1) an introduction to privacy of patient data and distributed learning as a potential solution to preserving these data, a description of the legal context for patient data research, and a definition of machine/deep learning concepts; (2) a presentation of the adopted review protocol; (3) a presentation of the search results; and (4) a discussion of the findings, limitations of the review, and future perspectives. Distributed learning from federated databases makes data centralization unnecessary. Distributed algorithms iteratively analyze separate databases, essentially sharing research questions and answers between databases instead of sharing the data. In other words, one can learn from separate and isolated datasets without patient data ever leaving the individual clinical institutes. Distributed learning promises great potential to facilitate big data for medical applications, in particular for international consortiums. Our purpose is to review the major implementations of distributed learning in health care. JCO Clin Cancer Inform 4:184-200.
© 2020 by American Society of Clinical Oncology. Licensed under the Creative Commons Attribution 4.0 License.

ASSOCIATED CONTENT: Appendix. Author affiliations and support information (if applicable) appear at the end of this article. Accepted on January 16, 2020, and published at ascopubs.org/journal/cci on March 5, 2020: DOI https://doi.org/10.1200/CCI.19.00047

INTRODUCTION

Law and ethics seek to produce a governance framework for the processing of patient data that offers a solution to the issues that arise between the competing desires of individuals in society for privacy and advances in health care. Traditional safeguards to achieve this governance have come from, for example, the anonymization of data or informed consent. These are not adequate safeguards for the new big data and artificial intelligence methodologies in research; it is increasingly difficult to create anonymous data (rather than pseudonymized/coded data) or to maintain anonymity against re-identification (through linking of datasets causing accidental or deliberate re-identification). The technology of big data and artificial intelligence, however, itself increasingly offers safeguards to solve the governance problem. In this article we explore how privacy-preserving distributed machine learning from federated databases might assist governance in health care. The article first outlines the basic parameters of the law and ethics issues and then discusses machine learning and deep learning. Thereafter, the results of the review are presented and discussed.

The methodology for this research is as follows: distributed machine learning is an evolving field in computing, with 665 articles published between 2001 and 2018; the study is based on a literature search, focuses on the medical applications of distributed machine learning, and provides an up-to-date summary of the field.

CONTEXT

Key Objective: Review the contribution of distributed learning to preserving data privacy in health care.

Knowledge Generated: Data in health care are strongly protected; therefore, access to medical data is restricted by law and ethics. This restriction has led to a change in research practice to adapt to new regulations. Distributed learning makes it possible to learn from medical data without these data ever leaving the medical institutions.

Relevance: Distributed learning allows learning from medical data while guaranteeing preservation of patient privacy.

THE LEGAL CONTEXT FOR PATIENT DATA RESEARCH

The challenges in law and ethics in relation to big data and artificial intelligence are well documented and discussed.1-16 The issue is one of balance: privacy of health data versus access to data for research. This issue is likely to become more pronounced with the foreseeable developments in health care, notably in relation to rising cost, aging population, precision medicine, universal health coverage, and the increase of noncommunicable diseases. However, recent developments in law, for example, in the European Union's General Data Protection Regulation (GDPR), appear to maintain the traditional approach that seems to favor individualism above solidarity. Individualism is strengthened in the new legislation. There is a narrowing of the definition of informed consent in Article 4.11 of the GDPR, with the unclear inclusion of the necessity for broad consent in scientific research included in Recital 33.

In relation to the continuing ambiguity of the unclear legal landscape for research using and reusing large datasets and linking between datasets, the GDPR is not clear in the area of re-identification of individuals. For the GDPR, part of the problem is clear: when data have the potential, when added to other data, to identify an individual, then those data are personal data and subject to regulation. The question is, is this absolute (any possibility, regardless of remoteness), or is there a reasonableness test? Recital 26 includes such a reasonableness test: "To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."16a

From this overview of legal difficulties, it is clear that there are obstacles to processing data in big data, machine learning, and artificial intelligence methodologies and environments. It must be stressed that the object is not to circumvent the rights of patients or to suggest that privacy should be ignored. The difficulty is that where the law is unclear, there is a tendency toward restrictive readings of the law to avoid liability, and, in the case of the methodologies and applications of data science discussed here, the effect of unclear law and restrictive interpretations of the law will be to block potentially important medical and scientific developments and research. Each of the uncertainties will require regulators to take a position on the best interpretation of the meaning of the law according to the available safeguards. The question for the data science community is, how far can that community itself address concerns about privacy, about re-identification, and about safeguarding the autonomy of individuals and their legitimate expectations of dignity in their treatment through the proper treatment of their personal data? How far distributed learning might contribute a suitable safeguard is the question addressed in the remainder of this paper.

MACHINE LEARNING

Machine learning comes from the possibility of applying algorithms to raw data to acquire knowledge. These algorithms are implemented to support decision making in different domains, including health care, manufacturing, education, financial modeling, and marketing.2,3 In medical disciplines, machine learning has contributed to improving the efficiency of clinical trials and decision-making processes.4 Some examples of machine learning applications in medicine are the localization of thoracic diseases,5 early diagnosis of Alzheimer disease,6 personalized treatment,7,8 outcome prediction, and automated radiology reports.9

There are three main categories of machine learning algorithms. First, in supervised learning, the algorithm generates a function for mapping input variables to output variables. In unsupervised learning, the applied algorithms do not have any outcome variable to estimate, and the algorithms generate a function mapping the structure of the data. The third type is referred to as reinforcement learning, whereby, in the absence of a training dataset, the algorithm trains itself by learning from experience to make increasingly improved decisions. A reinforcement agent decides what action to perform to accomplish a given task.10,11 Table 1 provides a brief description of selected popular machine learning algorithms across the three categories.

DEEP LEARNING

Deep learning is a subset of machine learning, which, in turn, is a subset of artificial intelligence, as represented in Figure 1. The learning process of a deep neural network architecture cascades through multiple nodes in multiple layers, where nodes and layers use the output of the previous nodes and layers as input. The output of a node is calculated by applying an activation function to the weighted average of this node's inputs. As described by Andrew Ng, "The analogy to deep learning is that the rocket engine is the deep learning models and the fuel is the huge amounts of data that we can feed in to these algorithms," meaning that the more data are fed into the model, the better the performance. Yet this continuous improvement of performance with the amount of data does not hold for traditional machine learning algorithms, which reach a steady performance level that does not improve as the amount of training data increases.

TABLE 1. Examples of Machine Learning Algorithms

Supervised learning:
- SVM (distributed version available: yes).83-85 An algorithm performing classification tasks by composing hyperplanes in a multidimensional space that separate cases of different class labels. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables.2,3
- Logistic regression (yes).86,87 An algorithm used for discrete value estimation tasks based on a given set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function that maps probabilities to values lying between 0 and 1. Hence, it is also known as logit regression.81,82
- Decision tree (yes). A nonparametric method used for both classification and regression problems. As the name suggests, it uses a tree-like decision model, splitting a dataset into two or more subsets on the basis of conditional control statements. It can also be used to visually represent decision-making processes.
- Random forest (no). A method performing classification and regression tasks. A random forest is a composition of two initials: forest, because it represents a collection of decision trees, and random, because the forest randomly selects observations and features from which it builds various decision trees. The results are then averaged. Each decision tree in the forest has access to a random set of the training data and chooses a class, and the most selected class is then the predicted class.
- KNN (yes).88,89 KNN can be used for regression problems; however, it is widely used for classification problems. In KNN, the assumption is that similar data elements are close to each other. Given K (a positive integer) and a test observation, KNN first groups the K closest elements to the test observation. Then, in the case of regression, it returns the mean of the K labels; in the case of classification, it returns the mode of the K labels.90-93

Unsupervised learning:
- K-means (yes). An algorithm mainly used for clustering in data mining. The K-means name comes from its functionality, which is partitioning N observations into K clusters, where each and every observation is part of the cluster with the nearest mean.
- Apriori algorithm (yes). A classic algorithm in data mining, used for mining frequent item groups and relevant association rules in a transactional database.

Reinforcement learning:
- MDP (no). Introduced in the 1950s, MDP is a discrete stochastic control process providing a framework for modeling decision making when final outcomes are ambiguous. Given S (current state) and S′ (new state), the decision process proceeds in steps. The process has a state at each step, and the decision maker can choose any available action in S. The process then moves randomly into S′, the new state; the chosen action influences the probability of moving from S to S′. In other words, the next state depends only on the current state and the action taken by the decision maker, not on previous states, satisfying the Markov property from which the algorithm's name derives.
- Q-learning (yes). Useful for optimizing action selection in any finite MDP. The Q-learning algorithm provides agents in a process the ability to know what action to take in what situation.

Abbreviations: KNN, K-nearest neighbors; MDP, Markov decision process; SVM, support vector machine.
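Several of the algorithms in Table 1 have distributed variants in which only model parameters, never patient records, leave each site. As a minimal illustrative sketch of that general idea (not a reproduction of any implementation reviewed here), the following pure-Python example trains a logistic regression model independently at two simulated "hospitals" by gradient descent and then averages the weights on a master node; all data and names are hypothetical.

```python
import math

def local_gradient_step(weights, X, y, lr=0.5):
    """One full-batch gradient-descent step of logistic regression on local data."""
    n = len(X)
    grad = [0.0] * len(weights)
    for xi, yi in zip(X, y):
        z = sum(w * x for w, x in zip(weights, xi))
        p = 1.0 / (1.0 + math.exp(-z))       # sigmoid prediction
        for j, xj in enumerate(xi):
            grad[j] += (p - yi) * xj / n     # averaged gradient of log loss
    return [w - lr * g for w, g in zip(weights, grad)]

def train_local(X, y, rounds=200):
    """Each hospital trains on its own data; only the weights leave the site."""
    w = [0.0, 0.0]
    for _ in range(rounds):
        w = local_gradient_step(w, X, y)
    return w

def federated_average(models):
    """The master node combines site models by simple parameter averaging."""
    return [sum(ws) / len(models) for ws in zip(*models)]

# Hypothetical toy data: feature vector = [bias term, measurement];
# the outcome is 1 when the measurement is positive.
hospital_1 = ([[1.0, -2.0], [1.0, -1.0], [1.0, 1.5], [1.0, 2.5]], [0, 0, 1, 1])
hospital_2 = ([[1.0, -3.0], [1.0, -0.5], [1.0, 0.5], [1.0, 3.0]], [0, 0, 1, 1])

w1 = train_local(*hospital_1)
w2 = train_local(*hospital_2)
w_global = federated_average([w1, w2])

def predict(w, xi):
    return 1 if sum(a * b for a, b in zip(w, xi)) > 0 else 0
```

Real distributed implementations iterate this exchange until convergence and add safeguards around the shared parameters; the sketch shows only the core point that learning can proceed without pooling the raw records.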
FIG 1. Relationship between artificial intelligence, machine learning, and deep learning. Artificial intelligence: machine programs able to imitate human intelligence. Machine learning: algorithms able to learn from examples. Deep learning: a set of learning techniques inspired by biological neural networks.

DISTRIBUTED MACHINE LEARNING

A large quantity of training data is required for machine learning to be applied, especially in outcome modeling, where multiple factors influence learning. Provided there are sufficient and appropriate data, machine learning typically results in accurate and generalizable models.20,21 However, the sensitivity of the personal data greatly hinders the conventional centralized approach to machine learning, whereby all data are gathered in a single data store. Distributed machine learning resolves legal and ethical privacy concerns by learning without the personal data ever leaving the firewall of the medical centers.

The euroCAT23 and ukCAT24 projects are proof of distributed learning being successfully implemented in clinical settings to overcome data access restrictions. The purpose of the euroCAT project was to predict patient outcomes (eg, post-radiotherapy dyspnea for patients with lung cancer) by learning from data stored within clinics without sharing any of the medical data.

METHODS AND MATERIAL SELECTION

A PubMed search was performed to collect relevant studies concerning the utilization of distributed machine learning in medicine. We used the search strings "distributed learning," "distributed machine learning," and "privacy preserving data mining." The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement25 was adopted to select and compare the distributed learning literature. The PRISMA flow diagram and checklist are slightly modified and presented in Appendix Figure A1 and Appendix Table A1, respectively. The last search for distributed machine learning articles was performed on February 28, 2019.

SEARCH RESULTS

A total of 127 articles were identified in PubMed using the search query ("distributed learning" OR "distributed machine learning" OR "privacy preserving data mining"). Six papers were screened; a brief summary of each article is presented in Table 2.

DISTRIBUTED DEEP LEARNING

Training a deep learning model typically requires thousands to millions of data points and is therefore computationally expensive as well as time consuming. These challenges can be mitigated with different approaches. First, because it is possible to train deep learning models in a parallelized fashion,26 using dedicated hardware (graphics processing units, tensor processing units) reduces the computational time. Second, as the memory of this dedicated hardware is often limited, it is possible to divide the training data into subsets called batches. In this situation, the training process iterates over the batches, only considering the data of one batch at each iteration.27 On top of easing the computing burden, using small batches during training improves the model's ability to generalize.28

These approaches address computation challenges but do not necessarily preserve data privacy. As for machine learning, deep learning can be distributed to protect patient data.29,30 Moreover, distributed deep learning also improves computing performance, as in the case of wireless sensor networks, where centralized learning is inefficient in terms of both communication and energy.31,32

DISTRIBUTED LEARNING

Distributed learning ensures data safety by sharing only mathematical parameters (or metadata), and never the actual data or any data that might enable tracing back patient information (such as patient ID, name, or date of birth). In other words, distributed algorithms iteratively analyze separate databases and return the same solution as if the data were centralized, essentially sharing research questions and answers between databases instead of data. Also, before proceeding with the learning process, researchers must make sure all data have been successfully anonymized and secured, by means of hashing algorithms and semantic web techniques, respectively, as can be seen in Figure 2, in addition to post-processing methods to address multicenter variabilities.

An example of distributed deep learning in the medical domain is that of Chang et al,33 who deployed a deep learning model across four medical institutions for image classification purposes using three distinct datasets: retinal fundus, mammography, and ImageNet. The results were compared with the same deep learning model trained on centrally hosted data. The comparison showed that the distributed model's accuracy is similar to that of the centrally hosted model. In a different study, McClure et al34 developed a distributed deep neural network model to reproduce FreeSurfer brain segmentation. FreeSurfer is an open-source tool for preprocessing and analyzing (segmentation, thickness estimation, and so on) human brain magnetic resonance images.
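The batch-wise training described above — splitting a local dataset into fixed-size subsets and visiting one subset per iteration — can be sketched in a few lines. This is a generic illustration, not code from any of the reviewed systems; the record values are hypothetical stand-ins for patient data.

```python
import random

def iterate_minibatches(data, batch_size, seed=0):
    """Shuffle the local dataset once and yield it one batch at a time,
    so only batch_size records must fit in accelerator memory per step."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)   # fixed seed for reproducibility
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

# Hypothetical local dataset of 10 records (represented here as integers).
records = list(range(10))
batches = list(iterate_minibatches(records, batch_size=4))
# Yields three batches of sizes 4, 4, and 2; together they cover
# every record exactly once, one epoch of mini-batch training.
```

In a distributed setting, each site would run such an iterator over its own data and share only the resulting parameter updates, which also limits how much information any single shared gradient can reveal.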
TABLE 2. Summary of Methods and Results of Distributed Machine Learning Studies Grouping More Than One Health Care Center

Jochems et al.
Data and target: Clinical data from 287 patients with lung cancer, treated with curative intent with CRT or RT alone, collected and stored in five medical institutes: MAASTRO (the Netherlands), Jessa (Belgium), Liège (Belgium), Aachen (Germany), and Eindhoven (the Netherlands). Target: predict dyspnea.
Methods and distributed learning approach: A Bayesian network model adapted for distributed learning using data from five hospitals. Patient data were extracted and stored locally in the hospitals; only the weights were sent to the master server.
Tools: Varian Learning Portal.
Accomplishments and results: AUC, 0.61.

Deist et al.
Data and target: Clinical data from 268 patients with NSCLC from five medical institutes: Aachen (Germany), Eindhoven (the Netherlands), Hasselt (Belgium), Liège (Belgium), and Maastricht (the Netherlands). Target: predict dyspnea grade ≥ 2.
Methods and distributed learning approach: The Alternating Direction Method of Multipliers was used to learn SVM models. The data were processed simultaneously in local databases; the updated model parameters were then sent to the master machine to compare and update them and to check whether the learning process had converged. The process was repeated until convergence criteria were met.
Tools: Varian Learning Portal.
Accomplishments and results: AUC, 0.62 for the training set and 0.66 for the validation set.

Dluhoš et al.
Data and target: 258 patients with first-episode schizophrenia and 222 healthy controls originating from four datasets: two datasets from University Hospital Brno (Czech Republic), one from University Medical Center Utrecht (the Netherlands), and the last originating from the Prague Psychiatric Center and Psychiatric Hospital Bohnice. Target: classification of patients with first-episode schizophrenia.
Methods and distributed learning approach: All images were preprocessed (normalized, segmented, and standardized). Four local SVM models were created; then multisample models (a joint model and a meta model) were built from the individual models. This process was repeated four times, each time using three datasets for training and the remaining one for validation.
Tools: VBM8 toolbox; MATLAB Statistics and Machine Learning Toolbox.
Accomplishments and results: Joint and meta models had similar classification performance, which was better than the performance of local models.

Jochems et al.
Data and target: Clinical data from 698 patients with lung cancer, treated with curative intent with CRT or RT alone, collected and stored in two medical institutes: MAASTRO (the Netherlands) and Michigan University (United States). Target: prediction of NSCLC 2-year survival after radiation therapy.
Methods and distributed learning approach: Distributed learning for a Bayesian network using data from the hospitals. The model used T category, N category, age, total tumor dose, and WHO performance status for predictions.
Tools: Varian Learning Portal.
Accomplishments and results: AUC, 0.662. The discriminative performance of centralized and distributed models on the validation set was similar.

Brisimi et al.
Data and target: Electronic health records from Boston Medical Center of patients with at least one heart-related diagnosis between 2005 and 2010; the data are distributed between 10 hospitals. Target: prediction of cardiac events.
Methods and distributed learning approach: A soft-margin l1-regularized sparse SVM classifier. An iterative cPDS algorithm was developed for solving the large-scale SVM problem in a decentralized fashion. The system then predicted a patient's hospitalization for cardiac events in the upcoming calendar year.
Tools: Not provided.
Accomplishments and results: AUC, 0.56. cPDS converged faster than centralized methods.

Tagliaferri et al.
Data and target: 227 variables extracted from thyroid cancer data from six Italian cancer centers; each variable has four properties: name, form, type of field, and levels. Target: prediction of survival and toxicity.
Methods and distributed learning approach: Inferential regression analysis. Thyroid COBRA is based on the COBRA Storage System, with new software, BOA ("Beyond Ontology Awareness"), supporting two different models: a cloud-based large-database model and a distributed learning model. The Learning Analyzer Proxy (a module of BOA used only in distributed mode) sends algorithms directly to local research proxies, taking back from them only the results of each iteration step, with no need to work with shared data in the cloud.
Tools: COBRA framework.
Accomplishments and results: Thyroid COBRA.

Abbreviations: AUC, area under the curve; BOA, Beyond Ontology Awareness; COBRA, Consortium for Brachytherapy Data Analysis; cPDS, cluster Primal Dual Splitting; CRT, chemoradiation; NSCLC, non–small-cell lung cancer; RT, radiotherapy; SVM, support vector machine.

FIG 2. Schematic representation of the processes in a transparent distributed learning network. (A) Data preparation steps (local data extraction, data anonymization via hashing, semantic web, FAIR principles). (B) Distributed learning network composed of three hospitals, each equipped with a learning machine that communicates with a master machine responsible for sending model parameters and checking convergence criteria. (C) Flowchart of the distributed learning network described in B. (D) Example of an action that can be tracked by blockchain (designed and implemented according to needs agreed among network members), keeping all network participants aware of any new activity in the network. DB, database; FAIR, findable, accessible, interoperable, reusable.

The results demonstrated performance improvement on the test datasets. Similar to the previous study, brain tumor segmentation was successfully performed using distributed deep learning across 10 institutions (BraTS distribution).

In distributed deep learning, the training weights are combined to train a final model, and the raw data are never exposed.35,37 In the case of sharing the local gradients, it might be possible to retrieve estimations of the original data from these gradients. Training the local models on batches may prevent retrieving all the data from the gradients, as these gradients correspond to single batches rather than all the local data. However, an optimal batch size needs to be considered to assure both data safety and the model's ability to generalize.28,39,40

PRIVACY AND INTEGRATION OF DISTRIBUTED LEARNING NETWORKS

Privacy in a distributed learning network addresses three main areas: data privacy, the implemented model's privacy, and the model's output privacy. Data privacy is achieved by means of data anonymization and by the data never leaving the medical institutions. The distributed learning model can be secured by applying differential privacy techniques,41 preventing leakage of weights during training, and cryptographic techniques. These cryptographic techniques provide a set of multiparty protocols that ensure security of the computations and communication. Once the model is ready, not only can the network participants use it to learn from their data, but this learning should be able to be performed locally and under highly private and secure conditions to protect the model's output.

The users of a machine/deep learning model are not necessarily the model's developers. Hence, documentation and the integration of automated data eligibility tests have two important assets:

- The documentation ensures a clear view of what the model is designed for, a technical description of the model, and its use.
- The eligibility tests are important to ensure that correct input data are extracted and provided before executing the model. In euroCAT,23 a distributed learning expert installed quality control via data extraction pipelines at every participant point in the network. The pipeline automatically allowed data records fulfilling the model training eligibility criteria to be used in training. The experts also tested the extraction pipeline thoroughly in addition to the machine learning testing. Additionally, post-processing compensation methods corrected for the variations caused by the use of different local protocols.

DISCUSSION

If one examines oncology, for instance, cancer is clearly one of the greatest challenges facing health care. More than 16 million new cancer cases were reported in 2017 alone.
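Two of the safeguards named above — hashing identifiers before training (Figure 2) and perturbing shared weights in the spirit of differential privacy — can be sketched briefly. This is a simplified illustration with an entirely hypothetical salt and noise scale; a production system would use keyed hashes managed per site and noise calibrated to a formal privacy budget.

```python
import hashlib
import random

def pseudonymize(patient_id: str, site_salt: str) -> str:
    """Replace a direct identifier with a salted SHA-256 digest before any
    record is used for training; the salt never leaves the hospital."""
    return hashlib.sha256((site_salt + patient_id).encode("utf-8")).hexdigest()

def perturb_weights(weights, noise_scale=0.01, seed=42):
    """Add Gaussian noise to model weights before they are shared with the
    master node, in the style of differential privacy (scale illustrative)."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, noise_scale) for w in weights]

# Hypothetical identifier and weight vector.
token = pseudonymize("patient-0042", site_salt="hypothetical-site-secret")
noisy = perturb_weights([0.8, -1.2, 0.3])
```

The digest is deterministic within a site (so records can still be linked locally) but, without the salt, cannot be mapped back to the identifier; the noisy weights limit what any single shared update reveals about individual records.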
security of the computations and communication. Once the

This number climbed to 18.1 million cases in 2018. This increasing number of cancer incidences means that there are undoubtedly sufficient data worldwide to put machine/deep learning to meaningful work. However, as highlighted earlier, this requires access to the data and, as also highlighted earlier, distributed learning enables this in a manner that resolves legal and ethical concerns. Nonetheless, the integration of distributed learning into health care is much slower than in other fields, which raises the question of why this should be. Here, we summarize a set of methodologies to facilitate the adoption of distributed learning and provide future directions.

CURRENT STATE OF MEDICAL DATA STORAGE AND PREPROCESSING

Information Communication Technology
Every hospital has its own storage devices and architecture. In this case, the information communication technology preparation for distributed learning requires significant energy, time, and manpower, which can be costly. This same process (data acquisition and preprocessing) needs to be repeated for each participating hospital, and, subsequently, medical data standardization protocols need to be developed and adopted for this implementation process.

WHY NOT PUBLICLY SHARE MEDICAL DATA?
Some studies were conducted trying to facilitate and secure data-sharing procedures to encourage related researchers and organizations to publicly share their data and embrace transparency, by proposing data-sharing procedures and protocols aiming to harmonize regulatory frameworks and research governance.54,55 Despite the efforts made toward data-sharing globalization, the sociocultural issues surrounding data sharing remain pertinent. Large clinical trials also face limitations in their data collection capabilities46-48 because of limited data storage capacities and manpower. To retrospectively perform additional analysis, all the participating centers need to be contacted again, which is time consuming and delays research.

Furthermore, medical institutions prefer not to share patient data, to ensure privacy protection. This is, of course, in no small part about ensuring the trust and confidence of patients, who display a wide range of sensitivities toward the use of their personal data.

Make the Data Readable: Findable, Accessible, Interoperable, Reusable Data Principles
One way to enable a virtuous circle network effect is to embrace another community engaged in synergistic activities (joining a distributed learning network is worthwhile if it links to another large network). The Findable, Accessible, Interoperable, Reusable (FAIR) Guiding Principles for data management and stewardship have gained substantial interest, but delivering scientific protocols and workflows that are aligned with these principles is a significant undertaking. A description of the FAIR principles is represented in Figure 3. Technological solutions are urgently needed that will enable researchers to explore, consume, and produce FAIR data in a reliable and efficient manner, to publish and reuse computational workflows, and to define and share scientific protocols as workflow templates. Such solutions will address emerging concerns about the nonreproducibility of scientific research, particularly in data science (eg, poorly published data, incomplete workflow descriptions, limited ability to perform meta-analyses, and an overall lack of reproducibility).51,52 Because workflows are fundamental to research activities, FAIR has broad applicability, which is vital in the context of distributed learning with medical data.

[FIG 3. Description of findable, accessible, interoperable, reusable (FAIR) principles. Findable: descriptive metadata, persistent identifiers. Accessible: access status, rights and license. Interoperable: XML standards, including data documentation. Reusable: usage standards definition (what can and cannot be used), participant consent management, risk management, specifying what to share.]

ORGANIZATIONAL CHANGE MANAGEMENT
The adoption of distributed learning will require a change in organizational management (such as making use of the newest data standardization techniques and adapting the roles of employees to more technically oriented tasks, such as data retrieval). Provided knowledge and understanding of proper change management concepts, health care providers can implement the latter successfully. Change management principles, such as defining a global vision, networking, and continuous communication, could facilitate the integration of new technologies and build up clinical capabilities. However, this process of change management can be complicated, because it requires the involvement of multiple health care centers from different countries and continents. This diversity can trigger a fear of loss (one of the major factors in financial decision making), which stems from differences of opinion and regulation and from the absence of data standardization, making the processes of data acquisition and preprocessing harder. In addition, a lack of knowledge about the new technology leads to resistance to accepting the change and innovation.60,61 Therefore, it is important to help health care organizations understand the need for distributed learning by explaining the context of the change, from traditional ways of learning to distributed learning, and a long-term vision of the improvements that it can bring, including time and money savings for both hospitals and patients. This could in turn improve patient lives, in addition to enabling more studies on research databases to consolidate proof of the safety and quality of distributed models.
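To make the core mechanism concrete, the following minimal sketch (in Python, with purely illustrative data, function names, and a simple parameter-averaging scheme; it is not taken from any of the reviewed implementations) shows one way a master could iteratively learn from several hospitals by exchanging only model parameters, never patient records:

```python
def local_update(w, data, lr=0.1, epochs=20):
    """One hospital's step: refine the model on local data only.

    Only the updated parameter vector w leaves the site; the
    patient-level records in `data` never do.
    """
    w = list(w)
    for _ in range(epochs):
        g = [0.0, 0.0]
        for (x0, x1), y in data:
            err = w[0] * x0 + w[1] * x1 - y   # linear-model residual
            g[0] += err * x0 / len(data)
            g[1] += err * x1 / len(data)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
    return w

def federated_round(w, sites):
    """Master step: broadcast w, collect each site's update, average."""
    updates = [local_update(w, d) for d in sites]
    return [sum(u[i] for u in updates) / len(updates) for i in range(2)]

# Three toy "hospitals"; every site's data follow y = 2*x0 - x1.
sites = [
    [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0)],
    [((1.0, 1.0), 1.0), ((1.0, -1.0), 3.0)],
    [((2.0, 1.0), 3.0), ((0.0, 1.0), -1.0)],
]

w = [0.0, 0.0]
for _ in range(50):               # master <-> sites iterations
    w = federated_round(w, sites)
# w now approximates [2.0, -1.0] without any raw data leaving a site
```

The same question-and-answer pattern underlies the distributed implementations reviewed here, although the real systems differ in the statistics exchanged (gradients, sufficient statistics, or model weights) and add security layers around the communication.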
As can be seen in Table 2, distributed learning has been applied to train different models that can predict different outcomes for a variety of pathologies, including lung cancer,23,62,63,63a thyroid cancer,64 cardiac events,65 and schizophrenia, in addition to the continuous development of tools and algorithms facilitating the adoption of distributed learning, such as the variant learning portal, the alternating direction method of multipliers algorithm, as well as the application of FAIR data principles. The cited studies provide proof that distributed learning can ensure patient data privacy and guarantee that accurate models are built that are the equivalent of centralized models.

LIMITATIONS OF THE EXISTING DISTRIBUTED LEARNING IMPLEMENTATIONS
A shared limitation of the studies presented in Table 2 is that the number of institutes involved in the distributed network is rather small: the size of the network varies from four to 10 institutions. With few medical institutes involved, the models were trained using the data of only a few hundred patients. By promoting the use of distributed learning, it should instead be possible to train the models using data from thousands or even millions of patients.

FUTURE PERSPECTIVES
An automated monitoring system accessible by the partners or medical centers participating in the distributed learning network can promote transparency, traceability, and trust.67 Recent advances in information technology, such as blockchain, can be integrated into a distributed learning network. Blockchain allows trusted partners to visualize the history of the transactions and actions taken in the distributed network. This integration of blockchain should help in easing the resistance to the new distributed technology among health care workers, as it provides both provenance and enforceable governance.

In 2008, Satoshi Nakamoto introduced the concept of a peer-to-peer electronic cash system known as Bitcoin.69 Blockchain was made famous as the public transaction ledger of this cryptocurrency.69,70 It ensures security by using cryptography in a decentralized, immutable distributed ledger technology.71 It is easy to manage, as it can be made public, whereby any individual can participate, or it can be made private, where all participants are known to each other. It is an efficient monitoring system, as records cannot be deleted from the chain. By these means, blockchain exceeds its application as a cryptocurrency to become a permanent, trustful tracing system. Figure 4 illustrates a visual representation of blockchain.

[FIG 4. Visual representation of blockchain: a hospital's (H1) request to provide more data is represented as a block, broadcast to every party in the network, and chained to the last existing block; training can then start using the extra data, and the chain provides an indelible and transparent record of the requests by H1. Adapted from Rennock et al.]
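The append-only, tamper-evident property described above can be illustrated with a toy hash chain (a deliberately simplified sketch using Python's standard hashlib; the records and linking scheme are illustrative, and a real blockchain additionally involves consensus, signatures, and replication across parties):

```python
import hashlib
import json

def make_block(record, prev_hash):
    """Append-only ledger entry: each block commits to its predecessor."""
    block = {"record": record, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps((record, prev_hash), sort_keys=True).encode()
    ).hexdigest()
    return block

def chain_is_valid(chain):
    """Recompute every hash; any edited or removed block breaks the links."""
    for i, block in enumerate(chain):
        expected = hashlib.sha256(
            json.dumps((block["record"], block["prev_hash"]), sort_keys=True).encode()
        ).hexdigest()
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

# A hospital's actions in the network, recorded as chained blocks.
chain = [make_block("H1 requests to provide more data", "0" * 64)]
chain.append(make_block("master approves request", chain[-1]["hash"]))
chain.append(make_block("training starts on the extra data", chain[-1]["hash"]))

assert chain_is_valid(chain)
chain[1]["record"] = "master rejects request"   # attempted tampering...
assert not chain_is_valid(chain)               # ...is immediately detectable
```

Because every block commits to the hash of the previous one, rewriting any historical record invalidates the remainder of the chain, which is what makes such a ledger useful as the transparent, indelible audit trail discussed here.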
Boulos et al71 demonstrated how blockchain could be used to contribute in health care: securing patient information and provider identities, managing health supply chains, monetizing clinical research and data (giving patients the choice to share), processing claims, detecting fraud, and managing prescriptions (replacing incorrect and outdated data). In addition to the above-mentioned uses of blockchain, it has also been used to maintain security and scalability of clinical data sharing,73 secure medical record sharing,74 prevent drug counterfeiting,75 and secure a patient's location.76

It is essential that the use of distributed machine/deep learning and blockchain be harmonized with the available security-preserving technologies (ie, continuous development and cybersecurity), which begins at the user level (use strong passwords, connect using only trusted networks, and so on) and ends with more complex information technology infrastructures (such as data anonymization and user ID encryption). Cybersecurity is a key aspect of preserving privacy and ensuring safety and trust among patients and health care systems. Continuous development, or postmarketing surveillance, can be seen as the set of checks and integrations that should occur once a distributed learning network is launched. This practice should make it possible to identify any weak security measures in the network or non-up-to-date features that may require re-implementation.

The distributed learning and blockchain technologies presented here show that there are emerging data science solutions that begin to meet the concerns and shortcomings of the law. The problems of re-identification are greatly reduced and managed through these technologies. Clearly, there are conceptual issues of understanding the impact of these technologies on privacy, and the relationship between privacy and confidentiality, but there are significant technical developments for the regulators to consider that could answer a number of their concerns.

SUMMARY
Currently, a combination of regulations and ethics makes it difficult to share data even for scientific research purposes. The issues relate to the legal basis for processing and anonymization. Specifically, there has been reluctance to move away from informed consent as the legal basis for processing toward processing in the public interest, and there are concerns about the re-identification of individuals where data are de-identified and then shared in aggregated environments. A solution could be to allow researchers to train their machine learning programs without the data ever having to leave the clinics, which in this paper we have established as distributed learning. This safe practice makes it possible to learn from medical data and can be applied across various medical disciplines. A limitation to its application, however, is that medical centers need to be convinced to participate in such a practice, and regulators also need to know that suitable safeguards have been established.79,80 Moreover, as can be seen in Table 2, even with the use of distributed learning, the size of the data pool learned from remains rather small. In the future, the integration of blockchain technology into distributed learning networks could be considered, as it ensures transparency and traceability while following FAIR data principles and can facilitate the implementation of distributed learning.

AFFILIATIONS
1 The D-Lab, Department of Precision Medicine, GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands
2 Oncoradiomics, Liege, Belgium
3 Department of Radiation Oncology, University Hospital Zurich and University of Zurich, Zurich, Switzerland
4 Department of Health, Ethics, and Society, CAPHRI (Care and Public Health Research Institute), Maastricht University, Maastricht, The Netherlands

AUTHOR CONTRIBUTIONS
Conception and design: All authors
Financial support: Sean Walsh
Administrative support: Sean Walsh
Provision of study material or patients: Sean Walsh
Collection and assembly of data: Fadila Zerka, Samir Barakat, Ralph T.H.
Leijenaar
Data analysis and interpretation: Fadila Zerka, Samir Barakat, Ralph T.H. Leijenaar, David Townend, Philippe Lambin
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors

CORRESPONDING AUTHOR
Fadila Zerka, PhD, Universiteit Maastricht, Postbus 616, Maastricht 6200 MD, the Netherlands; e-mail: f.zerka@maastrichtuniversity.nl.

SUPPORT
Supported by European Research Council advanced grant ERC-ADG-2015 Grant No. 694812, Hypoximmuno; the Dutch technology Foundation Stichting Technische Wetenschappen Grant No. P14-19 Radiomics STRaTegy, which is the applied science division of the Dutch Research Council (De Nederlandse Organisatie voor Wetenschappelijk Onderzoek) and the Technology Program of the Ministry of Economic Affairs; Small and Medium-Sized Enterprises Phase 2 RAIL Grant No. 673780; EUROSTARS, DART Grant No. E10116 and DECIDE Grant No. E11541; the European Program PREDICT ITN Grant No. 766276; Third Joint Transnational Call 2016 JTC2016 "CLEARLY" Grant No. UM 2017-8295; Interreg V-A Euregio Meuse-Rhine "Euradiomics" Grant No. EMR4; and the Scientific Exchange from Swiss National Science Foundation Grant No. IZSEZ0_180524.

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.
Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).

Fadila Zerka
Employment: Oncoradiomics
Research Funding: PREDICT

Samir Barakat
Employment: PtTheragnostic
Leadership: PtTheragnostic
Research Funding: ptTheragnostic

Sean Walsh
Employment: Oncoradiomics
Leadership: Oncoradiomics
Stock and Other Ownership Interests: Oncoradiomics
Research Funding: Varian Medical Systems (Inst)

Ralph T. H. Leijenaar
Employment: Oncoradiomics
Leadership: Oncoradiomics
Stock and Other Ownership Interests: Oncoradiomics
Patents, Royalties, Other Intellectual Property: Image analysis method supporting illness development prediction for a neoplasm in a human or animal body (PCT/NL2014/050728)

Arthur Jochems
Stock and Other Ownership Interests: Oncoradiomics, Medical Cloud Company

Benjamin Miraglio
Employment: OncoRadiomics

Philippe Lambin
Employment: Convert Pharmaceuticals
Leadership: DNAmito
Stock and Other Ownership Interests: BHV, Oncoradiomics, Convert Pharmaceuticals, The Medical Cloud Company
Honoraria: Varian Medical
Consulting or Advisory Role: BHV, Oncoradiomics
Patents, Royalties, Other Intellectual Property: Co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Oncoradiomics and one issued patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito, three nonpatentable inventions (software) licensed to ptTheragnostic/DNAmito, Oncoradiomics, and Health Innovation Ventures.
Travel, Accommodations, Expenses: ptTheragnostic, Elekta, Varian Medical

David Townend
Consulting or Advisory Role: Newron Pharmaceuticals (I)

No other potential conflicts of interest were reported.

ACKNOWLEDGMENT
We thank Simone Moorman for editing the manuscript.

REFERENCES
1. Mitchell TM: Machine Learning (international ed, reprint). New York, NY, McGraw-Hill, 1997
2. Boyd S, Parikh N, Chu E, et al: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3:1-122, 2010
3.
Cardoso I, Almeida E, Allende-Cid H, et al: Analysis of machine learning algorithms for diagnosis of diffuse lung diseases. Methods Inf Med 57:272-279, 2018
4. Wang X, Peng Y, Lu L, et al: ChestX-Ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Presented at the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 21-26, 2017
5. Ding Y, Sohn JH, Kawczynski MG, et al: A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain. Radiology 290:456-464, 2019
6. Emmert-Streib F, Dehmer M: A machine learning perspective on personalized medicine: An automized, comprehensive knowledge base with ontology for pattern recognition. Mach Learn Knowl Extr 1:149-156, 2018
7. Deist TM, Dankers FJWM, Valdes G, et al: Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers. Med Phys 45:3449-3459, 2018
8. Lambin P, van Stiphout RG, Starmans MH, et al: Predicting outcomes in radiation oncology multifactorial decision support systems. Nat Rev Clin Oncol 10:27-40, 2013
9. Wang S, Summers RM: Machine learning and radiology. Med Image Anal 16:933-951, 2012
10. James G, Witten D, Hastie T, et al: An Introduction to Statistical Learning: With Applications in R. New York, NY, Springer, 2017
11. Sutton RS, Barto AG: Reinforcement Learning: An Introduction. https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
12. Deng L: Deep learning: Methods and applications. Foundations and Trends in Signal Processing 7:197-387, 2014
13. LeCun Y, Bengio Y, Hinton G: Deep learning. Nature 521:436-444, 2015
14. Garling C: Andrew Ng: Why 'deep learning' is a mandate for humans, not just machines. Wired, 2015. https://www.wired.com/brandlab/2015/05/andrew-ng-deep-learning-mandate-humans-not-just-machines/
15. Pesapane F, Codari M, Sardanelli F: Artificial intelligence in medical imaging: Threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp 2:35, 2018
16. Liberati A, Altman DG, Tetzlaff J, et al: The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration. PLoS Med 6:e1000100, 2009
16a. Intersoft Consulting: General Data Protection Regulation: Recitals. https://gdpr-info.eu/recitals/no-26/
17. MAASTRO Clinic: euroCAT: Distributed learning. https://youtu.be/nQpqMIuHyOk
18. Rennock MJW, Cohn A, Butcher JR: Blockchain technology and regulatory investigations. https://www.steptoe.com/images/content/1/7/v2/171967/LIT-FebMar18-Feature-Blockchain.pdf
19. Orlhac F, Frouin F, Nioche C, et al: Validation of a method to compensate multicenter effects affecting CT radiomics. Radiology 291:53-59, 2019
20. Goodfellow I, Bengio Y, Courville A: Deep Learning. https://www.deeplearningbook.org/
21. Lambin P, Roelofs E, Reymen B, et al: Rapid Learning health care in oncology - an approach towards decision support systems enabling customised radiotherapy. Radiother Oncol 109:159-164, 2013
22. Lustberg T, van Soest J, Jochems A, et al: Big Data in radiation therapy: Challenges and opportunities. Br J Radiol 90:20160689, 2017
23. Deist TM, Jochems A, van Soest J, et al: Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT. Clin Transl Radiat Oncol 4:24-31, 2017
24. Price G, van Herk M, Faivre-Finn C: Data mining in oncology: The ukCAT project and the practicalities of working with routine patient data. Clin Oncol (R Coll Radiol) 29:814-817, 2017
25. Dean J, Corrado G, Monga R, et al: Large scale distributed deep networks. Advances in Neural Information Processing Systems 25, 2012, pp 1223-1231. https://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012
26. Cireşan D, Meier U, Schmidhuber J: Multi-column deep neural networks for image classification. http://arxiv.org/abs/1202.2745
27. Radiuk PM: Impact of training set batch size on the performance of convolutional neural networks for diverse datasets. Information Technology and Management Science 20:20-24, 2017
28. Keskar NS, Mudigere D, Nocedal J, et al: On large-batch training for deep learning: Generalization gap and sharp minima. http://arxiv.org/abs/1609.04836
29. Papernot N, Abadi M, Erlingsson U, et al: Semi-supervised knowledge transfer for deep learning from private training data. http://arxiv.org/abs/1610.05755
30. Shokri R, Shmatikov V: Privacy-preserving deep learning, in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS '15). Denver, CO, ACM Press, 2015, pp 1310-1321
31. Predd JB, Kulkarni SB, Poor HV: Distributed learning in wireless sensor networks. IEEE Signal Process Mag 23:56-69, 2006
32. Ji X, Hou C, Hou Y, et al: A distributed learning method for l1-regularized kernel machine over wireless sensor networks. Sensors (Basel) 16:1021, 2016
33. Chang K, Balachandar N, Lam C, et al: Distributed deep learning networks among institutions for medical imaging. J Am Med Inform Assoc 25:945-954, 2018
34. McClure P, Zheng CY, Kaczmarzyk J, et al: Distributed weight consolidation: A brain segmentation case study. https://arxiv.org/abs/1805.10863
35. FreeSurferWiki: FreeSurfer. http://freesurfer.net/fswiki/FreeSurferWiki
36. Sheller MJ, Reina GA, Edwards B, et al: Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. http://arxiv.org/abs/1810.04304
37. Li W, Milletarì F, Xu D, et al: Privacy-preserving federated brain tumour segmentation. http://arxiv.org/abs/1910.00962
38. Abadi M, Chu A, Goodfellow I, et al: Deep learning with differential privacy, in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). 2016, pp 308-318
39. Mishkin D, Sergievskiy N, Matas J: Systematic evaluation of convolution neural network advances on the ImageNet. Comput Vis Image Underst 161:11-19
40. Lin T, Stich SU, Patel KK, et al: Don't use large mini-batches, use local SGD. http://arxiv.org/abs/1808.07217
41. Biryukov A, De Cannière C, Winkler WE, et al: Discretionary access control policies (DAC), in van Tilborg HCA, Jajodia S (eds): Encyclopedia of Cryptography and Security. Boston, MA, Springer, 2011, pp 356-358
42. Pinkas B: Cryptographic techniques for privacy-preserving data mining. SIGKDD Explor 4:12-19, 2002
43. Siegel RL, Miller KD, Jemal A: Cancer statistics, 2017. CA Cancer J Clin 67:7-30, 2017
44. Bray F, Ferlay J, Soerjomataram I, et al: Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 68:394-424, 2018
45. Siegel R, DeSantis C, Virgo K, et al: Cancer treatment and survivorship statistics, 2012. CA Cancer J Clin 62:220-241, 2012
46. Shortliffe EH, Barnett GO: Medical data: Their acquisition, storage, and use, in Shortliffe EH, Perreault LE (eds): Medical Informatics. New York, NY, Springer, 2001, pp 41-75
47. Shabani M, Vears D, Borry P: Raw genomic data: Storage, access, and sharing. Trends Genet 34:8-10, 2018
48. Langer SG: Challenges for data storage in medical imaging research. J Digit Imaging 24:203-207, 2011
49. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3:160018, 2016
50. Wilkinson MD, Sansone S-A, Schultes E, et al: A design framework and exemplar metrics for FAIRness. Sci Data 5:180118, 2018
51. Dumontier M, Gray AJG, Marshall MS, et al: The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331, 2016
52. Jagodnik KM, Koplev S, Jenkins SL, et al: Developing a framework for digital objects in the Big Data to Knowledge (BD2K) commons: Report from the Commons Framework Pilots workshop. J Biomed Inform 71:49-57, 2017
53. Polanin JR, Terzian M: A data-sharing agreement helps to increase researchers' willingness to share primary data: Results from a randomized controlled trial. J Clin Epidemiol 106:60-69, 2018
54. Azzariti DR, Riggs ER, Niehaus A, et al: Points to consider for sharing variant-level information from clinical genetic testing with ClinVar. Cold Spring Harb Mol Case Stud 4:a002345, 2018
55. Boue S, Byrne M, Hayes AW, et al: Embracing transparency through data sharing. Int J Toxicol 10.1177/1091581818803880
56. Poline J-B, Breeze JL, Ghosh S, et al: Data sharing in neuroimaging research. Front Neuroinform 6:9, 2012
57. Cutts FT, Enwere G, Zaman SMA, et al: Operational challenges in large clinical trials: Examples and lessons learned from the Gambia pneumococcal vaccine trial. PLoS Clin Trials 1:e16, 2006
58. Xia W, Wan Z, Yin Z, et al: It's all in the timing: Calibrating temporal penalties for biomedical data sharing. J Am Med Inform Assoc 25:25-31, 2018
59. Fleishon H, Muroff LR, Patel SS: Change management for radiologists. J Am Coll Radiol 14:1229-1233, 2017
60. Delaney R, D'Agostino R: The challenges of integrating new technology into an organization. https://digitalcommons.lasalle.edu/cgi/viewcontent.cgi?article=1024&context=mathcompcapstones
61. Agboola A, Salawu R: Managing deviant behavior and resistance to change. Int J Bus Manage 6:235, 2010
62. Jochems A, Deist TM, van Soest J, et al: Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the hospital - a real life proof of concept. Radiother Oncol 121:459-467, 2016
63. Jochems A, Deist TM, El Naqa I, et al: Developing and validating a survival prediction model for NSCLC patients through distributed learning across 3 countries. Int J Radiat Oncol Biol Phys 99:344-352, 2017
63a. Deist TM, Dankers FJWM, Ojha P, et al: Distributed learning on 20 000+ lung cancer patients - The Personal Health Train. Radiother Oncol 144:189-200
64. Tagliaferri L, Gobitti C, Colloca GF, et al: A new standardized data collection system for interdisciplinary thyroid cancer management: Thyroid COBRA. Eur J Intern Med 53:73-78, 2018
65. Brisimi TS, Chen R, Mela T, et al: Federated learning of predictive models from federated Electronic Health Records. Int J Med Inform 112:59-67, 2018
66. Dluhos P, Schwarz D, Cahn W, et al: Multi-center machine learning in imaging psychiatry: A meta-model approach. Neuroimage 155:10-24, 2017
67. Dhillon V, Metcalf D, Hooper M: Blockchain in health care, in Dhillon V, Metcalf D, Hooper M (eds): Blockchain Enabled Applications: Understand the Blockchain Ecosystem and How to Make It Work for You. Berkeley, CA, Apress, 2017, pp 125-138
68. Lugan S, Desbordes P, Tormo LXR, et al: Secure architectures implementing trusted coalitions for blockchained distributed learning (TCLearn). http://arxiv.org/abs/1906.07690
69. Nakamoto S: Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf
70. Gordon WJ, Catalini C: Blockchain technology for healthcare: Facilitating the transition to patient-driven interoperability. Comput Struct Biotechnol J 16:224-230, 2018
71. Kamel Boulos MN, Wilson JT, Clauson KA: Geospatial blockchain: Promises, challenges, and scenarios in health and healthcare. Int J Health Geogr 17:25
72. Pirtle C, Ehrenfeld J: Blockchain for healthcare: The next generation of medical records? J Med Syst 42:172, 2018
73. Zhang P, White J, Schmidt DC, et al: FHIRChain: Applying blockchain to securely and scalably share clinical data. Comput Struct Biotechnol J 16:267-278
74. Dubovitskaya A, Xu Z, Ryu S, et al: Secure and trustable electronic medical records sharing using blockchain. AMIA Annu Symp Proc 2017:650-659, 2018
75. Vruddhula S: Application of on-dose identification and blockchain to prevent drug counterfeiting. Pathog Glob Health 112:161, 2018
76. Ji Y, Zhang J, Ma J, et al: BMPLS: Blockchain-based multi-level privacy-preserving location sharing scheme for telecare medical information systems. J Med Syst 42:147, 2018
77. Coventry L, Branley D: Cybersecurity in healthcare: A narrative review of trends, threats and ways forward. Maturitas 113:48-52, 2018
78. Jalali MS, Kaiser JP: Cybersecurity in hospitals: A systematic, organizational perspective. J Med Internet Res 20:e10059, 2018
79. Vlahović-Palčevski V, Mentzer D: Postmarketing surveillance, in Seyberth HW, Rane A, Schwab M (eds): Pediatric Clinical Pharmacology. Berlin, Springer, 2011, pp 339-351
80. Parkash R, Thibault B, Philippon F, et al: Canadian Registry of Implantable Electronic Device outcomes: Surveillance of high-voltage leads. Can J Cardiol 34:808-811, 2018
81. Ing EB, Ing R: The use of a nomogram to visually interpret a logistic regression prediction model for giant cell arteritis. Neuroophthalmology 42:284-286, 2018
82. Tirzite M, Bukovskis M, Strazda G, et al: Detection of lung cancer with electronic nose and logistic regression analysis. J Breath Res 13:016006, 2018
83. Ji Z, Jiang X, Wang S, et al: Differentially private distributed logistic regression using private and public data. BMC Med Genomics 7:S14, 2014 (suppl 1)
84. Jiang W, Li P, Wang S, et al: WebGLORE: A web service for Grid LOgistic REgression. Bioinformatics 29:3238-3240, 2013
85. Wang S, Jiang X, Wu Y, et al: EXpectation Propagation LOgistic REgRession (EXPLORER): Distributed privacy-preserving online model learning. J Biomed Inform 46:480-496, 2013
86. Desai A, Chaudhary S: Distributed decision tree, in Proceedings of the Ninth Annual ACM India Conference. Gandhinagar, India, ACM Press, 2016, pp 43-50
87. Caragea D, Silvescu A, Honavar V: Decision tree induction from distributed heterogeneous autonomous data sources, in Abraham A, Franke K, Koppen M (eds): Intelligent Systems Design and Applications. Berlin, Springer, 2003, pp 341-350
88. Plaku E, Kavraki LE: Distributed computation of the knn graph for large high-dimensional point sets. J Parallel Distrib Comput 67:346-359, 2007
89. Xiong L, Chitti S, Liu L: Mining multiple private databases using a kNN classifier, in Proceedings of the 2007 ACM Symposium on Applied Computing (SAC '07). Seoul, Korea, ACM Press, 2007, p 435
90. Huang Z: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2:283-304, 1998
91. Jagannathan G, Wright RN: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). Chicago, IL, ACM Press, 2005, p 593
92. Jin R, Goswami A, Agrawal G: Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10:17-40, 2006
93. Jagannathan G, Pillaipakkamnatt K, Wright RN: A new privacy-preserving distributed k-clustering algorithm, in Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2006, pp 494-498
94. Ye Y, Chiang C-C: A parallel apriori algorithm for frequent itemsets mining, in Fourth International Conference on Software Engineering Research, Management and Applications (SERA '06). Seattle, WA, IEEE, 2006, pp 87-94
95. Cheung DW, Ng VT, Fu AW, et al: Efficient mining of association rules in distributed databases. IEEE Trans Knowl Data Eng 8:911-922, 1996
96. Bellman R: A Markovian decision process. Indiana Univ Math J 6:679-684, 1957
97. Puterman ML: Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, John Wiley & Sons, 2014
98. Watkins CJCH, Dayan P: Q-learning. Mach Learn 8:279-292, 1992
99. Lauer M, Riedmiller M: An algorithm for distributed reinforcement learning in cooperative multi-agent systems, in Proceedings of the Seventeenth International Conference on Machine Learning. Burlington, MA, Morgan Kaufmann, 2000, pp 535-542. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.772

APPENDIX

[FIG A1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2009 flow diagram: 127 records identified through database searching and 0 additional records through other sources; 127 records after duplicates removed; 121 records excluded at screening; 6 full-text articles assessed for eligibility, with 0 excluded with reasons; 6 studies included in qualitative synthesis; 6 studies included in quantitative synthesis (meta-analysis).]

TABLE A2. PRISMA 2009 Checklist (item No., Section/Topic, Checklist Item; Reported on Page No. in parentheses)

Title
1. Title: Identify the report as a systematic review, meta-analysis, or both. (Page 1)
Abstract
2. Structured summary: Provide a structured summary including, as applicable: background; objectives; data sources; study eligibility criteria, participants, and interventions; study appraisal and synthesis methods; results; limitations; conclusions and implications of key findings; systematic review registration number. (Page 1)
Introduction
3. Rationale: Describe the rationale for the review in the context of what is already known. (Pages 1-5)
4. Objectives: Provide an explicit statement of questions being addressed with reference to PICOS. (Page 2)
Methods
5. Protocol and registration: Indicate if a review protocol exists, if and where it can be accessed (eg, Web address), and, if available, provide registration information including registration number. (Page 5)
6. Eligibility criteria: Specify study characteristics (eg, PICOS, length of follow-up) and report characteristics (eg, years considered, language, publication status) used as criteria for eligibility, giving rationale. (Page 5)
7. Information sources: Describe all information sources (eg, databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched. (Page 5)
8. Search: Present full electronic search strategy for at least one database, including any limits used, such that it could be repeated. (Page 5)
9. Study selection: State the process for selecting studies (ie, screening, eligibility, included in systematic review, and, if applicable, included in the meta-analysis). (Page 5 and Fig A1)
10. Data collection process: Describe method of data extraction from reports (eg, piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators. (Page 5)
11. Data items: List and define all variables for which data were sought (eg, PICOS, funding sources) and any assumptions and simplifications made. (N/A)
12. Risk of bias in individual studies: Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level) and how this information is to be used in any data synthesis. (N/A)
13. Summary measures: State the principal summary measures (eg, risk ratio, difference in means). (N/A)
14. Synthesis of results: Describe the methods of handling data and combining results of studies, if done, including measures of consistency (eg, I²) for each meta-analysis. (Page 5)
15. Risk of bias across studies: Specify any assessment of risk of bias that may affect the cumulative evidence (eg, publication bias, selective reporting within studies). (N/A)
16. Additional analyses: Describe methods of additional analyses (eg, sensitivity or subgroup analyses, meta-regression), if done, indicating which were prespecified. (N/A)
Results
17. Study selection: Give numbers of studies screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally with a flow diagram. (Page 5 and Fig A1)
18. Study characteristics: For each study, present characteristics for which data were extracted (eg, study size, PICOS, follow-up period) and provide the citations. (Pages 5-8)
19. Risk of bias within studies: Present data on risk of bias of each study and, if available, any outcome-level assessment (see item 12). (N/A)
20. Results of individual studies: For all outcomes considered (benefits or harms), present, for each study: (a) simple summary data for each intervention group, and (b) effect estimates and confidence intervals, ideally with a forest plot. (Pages 5-8)
21. Synthesis of results: Present results of each meta-analysis done, including confidence intervals and measures of consistency. (Pages 5-8)
22. Risk of bias across studies: Present results of any assessment of risk of bias across studies (see item 15). (N/A)
23. Additional analysis: Give results of additional analyses, if done (eg, sensitivity or subgroup analyses, meta-regression [see item 16]). (N/A)
Discussion
24. Summary of evidence: Summarize the main findings, including the strength of evidence for each main outcome; consider their relevance to key groups (eg, health care providers, users, and policy makers). (Page 8)
25. Limitations: Discuss limitations at study and outcome level (eg, risk of bias), and at review level (eg, incomplete retrieval of identified research, reporting bias). (Page 10)
26. Conclusions: Provide a general interpretation of the results in the context of other evidence, and implications for future research. (Page 11)
Funding
27. Funding: Describe sources of funding for the systematic review and other support (eg, supply of data); role of funders for the systematic review. (Page 11)

Abbreviations: N/A, not applicable; PICOS, participants, interventions, comparisons, outcomes, and study design; PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

© 2020 by American Society of Clinical Oncology

Publisher: Wolters Kluwer Health
Copyright: © 2020 American Society of Clinical Oncology
ISSN: 2473-4276
DOI: 10.1200/CCI.19.00047

© 2020 by American Society of Clinical Oncology. Licensed under the Creative Commons Attribution 4.0 License.

ASSOCIATED CONTENT: Appendix. Author affiliations and support information (if applicable) appear at the end of this article. Accepted on January 16, 2020, and published at ascopubs.org/journal/cci on March 5, 2020: DOI https://doi.org/10.1200/CCI.19.00047

INTRODUCTION

Law and ethics seek to produce a governance framework for the processing of patient data that resolves the issues arising between the competing desires of individuals in society for privacy and for advances in health care. Traditional safeguards to achieve this governance have come from, for example, the anonymization of data or informed consent. These are not adequate safeguards for the new big data and artificial intelligence methodologies in research; it is increasingly difficult to create anonymous data (rather than pseudonymized/coded data) or to maintain anonymity against re-identification (through linking of datasets causing accidental or deliberate re-identification). The technology of big data and artificial intelligence, however, itself increasingly offers safeguards to solve the governance problem. In this article we explore how privacy-preserving distributed machine learning from federated databases might assist governance in health care. The article first outlines the basic parameters of the law and ethics issues and then discusses machine learning and deep learning. Thereafter, the results of the review are presented and discussed.

The methodology for this research reflects the fact that distributed machine learning is an evolving field in computing, with 665 articles published between 2001 and 2018; the study is based on a literature search, focuses on the medical applications of distributed machine learning, and provides an up-to-date summary of the field.

CONTEXT

Key Objective: Review the contribution of distributed learning to preserving data privacy in health care.

Knowledge Generated: Data in health care are strongly protected; therefore, access to medical data is restricted by law and ethics. This restriction has led to a change in research practice to adapt to new regulations. Distributed learning makes it possible to learn from medical data without these data ever leaving the medical institutions.

Relevance: Distributed learning allows learning from medical data while guaranteeing preservation of patient privacy.

THE LEGAL CONTEXT FOR PATIENT DATA RESEARCH

The challenges in law and ethics in relation to big data and artificial intelligence are well documented and discussed.1-16 The issue is one of balance: privacy of health data versus access to data for research. This issue is likely to become more pronounced with the foreseeable developments in health care, notably in relation to rising cost, aging population, precision medicine, universal health coverage, and the increase of noncommunicable diseases. However, recent developments in law, for example, the European Union's General Data Protection Regulation (GDPR), appear to maintain the traditional approach that seems to favor individualism above solidarity. Individualism is strengthened in the new legislation. There is a narrowing of the definition of informed consent in Article 4.11 of the GDPR, with the unclear inclusion of the necessity for broad consent in scientific research in Recital 33.

In relation to the continuing ambiguity of the unclear legal landscape for research using and reusing large datasets and linking between datasets, the GDPR is not clear in the area of re-identification of individuals. For the GDPR, part of the problem is clear: when data have the potential, when added to other data, to identify an individual, then those data are personal data and subject to regulation. The question is, is this absolute (any possibility, regardless of remoteness), or is there a reasonableness test? Recital 26 includes such a reasonableness test: "To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."16a

From this overview of legal difficulties, it is clear that there are obstacles to processing data in big data, machine learning, and artificial intelligence methodologies and environments. It must be stressed that the object is not to circumvent the rights of patients or to suggest that privacy should be ignored. The difficulty is that where the law is unclear, there is a tendency toward restrictive readings of the law to avoid liability, and, in the case of the methodologies and applications of data science discussed here, the effect of unclear law and restrictive interpretations of the law will be to block potentially important medical and scientific developments and research. Each of the uncertainties will require regulators to take a position on the best interpretation of the meaning of the law according to the available safeguards. The question for the data science community is, how far can that community itself address concerns about privacy, about re-identification, and about safeguarding the autonomy of individuals and their legitimate expectations to dignity in their treatment through the proper treatment of their personal data? How far distributed learning might contribute a suitable safeguard is the question addressed in the remainder of this paper.

MACHINE LEARNING

Machine learning comes from the possibility of applying algorithms to raw data to acquire knowledge. These algorithms are implemented to support decision making in different domains, including health care, manufacturing, education, financial modeling, and marketing.2,3 In medical disciplines, machine learning has contributed to improving the efficiency of clinical trials and decision-making processes. Some examples of machine learning applications in medicine are the localization of thoracic diseases, early diagnosis of Alzheimer disease,5 personalized treatment,6 outcome prediction,7,8 and automated radiology reports.9

There are three main categories of machine learning algorithms. First, in supervised learning, the algorithm generates a function for mapping input variables to output variables. In unsupervised learning, the applied algorithms do not have any outcome variable to estimate, and the algorithms generate a function mapping the structure of the data. The third type is referred to as reinforcement learning, whereby, in the absence of a training dataset, the algorithm trains itself by learning from experience to make increasingly improved decisions; a reinforcement agent decides what action to perform to accomplish a given task.10,11 Table 1 provides a brief description of selected popular machine learning algorithms across the three categories.

DEEP LEARNING

Deep learning is a subset of machine learning, which, in turn, is a subset of artificial intelligence, as represented in Figure 1. The learning process of a deep neural network architecture cascades through multiple nodes in multiple layers, where nodes and layers use the output of the previous nodes and layers as input. The output of a node is calculated by applying an activation function to the weighted average of this node's input. As described by Andrew Ng, "The analogy to deep learning is that the rocket engine is the deep learning models and the fuel is the huge amounts of data that we can feed in to these algorithms," meaning that the more data are fed into the model, the better the performance. Yet, this continuous improvement of performance with the amount of data does not hold for traditional machine learning algorithms, which reach a steady performance level that does not improve with the increase of the amount of training data.
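The node computation described above, an activation function applied to a weighted combination of a node's inputs, can be illustrated with a minimal NumPy sketch. The weights, biases, and choice of a sigmoid activation below are illustrative assumptions for demonstration only, not values from any model discussed in this review:

```python
import numpy as np

def node_output(inputs, weights, bias):
    """Output of a single node: an activation function (here, sigmoid)
    applied to the weighted sum of the node's inputs plus a bias."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes z into (0, 1)

# A tiny two-layer cascade: the second node uses the first node's output
# as its input, mirroring how layers feed forward in a deep network.
x = np.array([0.5, -1.2, 3.0])                              # raw input
h = node_output(x, np.array([0.4, 0.1, -0.2]), bias=0.05)   # hidden node
y = node_output(np.array([h]), np.array([1.5]), bias=-0.3)  # output node
```

In a real deep network the same computation is repeated across many nodes per layer and many layers, and the weights are learned from data rather than fixed by hand.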
TABLE 1. Examples of Machine Learning Algorithms (algorithm, description, and whether a distributed implementation is available)

Supervised learning
- SVM (distributed: yes): An algorithm performing classification tasks by composing hyperplanes in a multidimensional space that separate cases of different class labels. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables.
- Logistic regression (distributed: yes): An algorithm used for discrete value estimation tasks based on a given set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function that maps probabilities to values lying between 0 and 1; hence, it is also known as logit regression.
- Decision tree (distributed: yes): A nonparametric method used for both classification and regression problems. As the name suggests, it uses a tree-like decision model, splitting a dataset into two or more subsets on the basis of conditional control statements. It can also be used to visually represent decision-making processes.
- Random forest (distributed: no): A method performing classification and regression tasks. A random forest is a composition of two initials: forest, because it represents a collection of decision trees, and random, because the forest randomly selects the observations and features from which it builds the various decision trees; the results are then averaged. Each decision tree in the forest has access to a random set of the training data and chooses a class, and the most selected class is the predicted class.
- KNN (distributed: yes): KNN can be used for regression problems; however, it is widely used for classification problems. In KNN, the assumption is that similar data elements are close to each other. Given K (a positive integer) and a test observation, KNN first groups the K closest elements to the test observation. Then, in the case of regression, it returns the mean of the K labels, or, in the case of classification, the mode of the K labels.

Unsupervised learning
- K-means (distributed: yes): An algorithm mainly used for clustering in data mining. The K-means name comes from its functionality, which is partitioning N observations into K clusters, where each and every observation belongs to the cluster with the nearest mean.
- Apriori algorithm (distributed: yes): A classic algorithm in data mining, used for mining frequent item groups and relevant association rules in a transactional database.

Reinforcement learning
- MDP (distributed: no): Introduced in the 1950s, the MDP is a discrete stochastic control process providing a framework for modeling decision making when final outcomes are ambiguous. Given S (current state) and S′ (new state), the decision process proceeds in steps. The process has a state at each step, and the decision maker can choose any available action in S; the process then moves randomly into the new state S′, and the chosen action influences the probability of moving from S to S′. In other words, the next state depends only on the current state and the action taken by the decision maker, not on the previous states, satisfying the Markov property, from which the algorithm takes its name.
- Q-learning (distributed: yes): Useful for optimizing action selection in any finite MDP. The Q-learning algorithm gives agents in a process the ability to know what action to take in what situation.

Abbreviations: KNN, K-nearest neighbors; MDP, Markov decision process; SVM, support vector machine.
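As an illustration of one of the algorithms in Table 1, the KNN rule (mode of the K nearest labels for classification, mean for regression) can be sketched in a few lines of Python. The toy dataset is invented for demonstration and carries no clinical meaning:

```python
from collections import Counter

def knn_predict(train, query, k, task="classification"):
    """KNN as described in Table 1: find the k training points closest to
    the query, then return the mode (classification) or the mean
    (regression) of their labels. train is a list of (features, label)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda p: dist(p[0], query))[:k]
    labels = [label for _, label in neighbors]
    if task == "classification":
        return Counter(labels).most_common(1)[0][0]  # mode of the k labels
    return sum(labels) / k                           # mean of the k labels

# Two well-separated toy clusters, labeled "A" and "B".
data = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
knn_predict(data, (1, 0), k=3)  # → "A"
```

The "similar elements are close to each other" assumption from the table is what makes the distance computation meaningful; in practice, features are usually normalized first so no single variable dominates the distance.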
FIG 1. Relationship between artificial intelligence (machine programs able to imitate human intelligence), machine learning (algorithms able to learn from examples), and deep learning (a set of learning techniques inspired by biological neural networks): each is a subset of the one before.

Distributed Machine Learning

A large quantity of training data is required for machine learning to be applied, especially in outcome modeling, where multiple factors influence learning. Provided there are sufficient and appropriate data, machine learning typically results in accurate and generalizable models.20,21 However, the sensitivity of personal data greatly hinders the conventional centralized approach to machine learning, whereby all data are gathered in a single data store. Distributed machine learning resolves legal and ethical privacy concerns by learning without the personal data ever leaving the firewall of the medical centers.

The euroCAT23 and ukCAT24 projects are a proof of distributed learning being successfully implemented in clinical settings to overcome data access restrictions. The purpose of the euroCAT project was to predict patient outcomes (eg, post-radiotherapy dyspnea for patients with lung cancer) by learning from data stored within clinics without sharing any of the medical data.

Distributed Deep Learning

Training a deep learning model typically requires thousands to millions of data points and is therefore computationally expensive as well as time consuming. These challenges can be mitigated with different approaches. First, because it is possible to train deep learning models in a parallelized fashion, using dedicated hardware (graphics processing units, tensor processing units) reduces the computational time. Second, as the memory of this dedicated hardware is often limited, it is possible to divide the training data into subsets called batches. In this situation, the training process iterates over the batches, only considering the data of one batch at each iteration. On top of easing the computing burden, using small batches during training improves the model's ability to generalize.28

These approaches address computation challenges but do not necessarily preserve data privacy. As for machine learning, deep learning can be distributed to protect patient data.29,30 Moreover, distributed deep learning also improves computing performance, as in the case of wireless sensor networks, where centralized learning is inefficient in terms of both communication and energy.31,32

METHODS AND MATERIAL SELECTION

A PubMed search was performed to collect relevant studies concerning the utilization of distributed machine learning in medicine. We used the search strings "distributed learning," "distributed machine learning," and "privacy preserving data mining." The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement was adopted to select and compare the distributed learning literature. The PRISMA flow diagram and checklist are slightly modified and presented in Appendix Figure A1 and Appendix Table A2, respectively. The last search for distributed machine learning articles was performed on February 28, 2019.

SEARCH RESULTS

A total of 127 articles were identified in PubMed using the search query ("distributed learning" OR "distributed machine learning" OR "privacy preserving data mining"). Six papers were retained after screening; a brief summary of each article is presented in Table 2.
DISTRIBUTED LEARNING

Distributed learning ensures data safety by sharing only mathematical parameters (or metadata) and not the actual data, nor in any instance data that might enable tracing back patient information (such as patient ID, name, or date of birth). In other words, distributed algorithms iteratively analyze separate databases and return the same solution as if the data were centralized, essentially sharing research questions and answers between databases instead of data. Also, before proceeding with the learning process, researchers must make sure all data have been successfully anonymized and secured by means of hashing algorithms and semantic web techniques, respectively, as can be seen in Figure 2, in addition to post-processing methods to address multicenter variabilities.

An example of distributed deep learning in the medical domain is that of Chang et al,33 who deployed a deep learning model across four medical institutions for image classification purposes using three distinct datasets: retinal fundus, mammography, and ImageNet. The results were compared with the same deep learning model trained on centrally hosted data. The comparison showed that the accuracy of the distributed model is similar to that of the centrally hosted model. In a different study, McClure et al34 developed a distributed deep neural network model to reproduce FreeSurfer brain segmentation. FreeSurfer is an open-source tool for preprocessing and analyzing (segmentation, thickness estimation, and so on) human brain magnetic resonance images.
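The iterative question-and-answer cycle described above, in which each site fits the shared model on its own data and returns only parameters that a master machine combines and redistributes, can be sketched with a toy parameter-averaging scheme (in the spirit of federated averaging). The logistic-regression model, learning rate, round count, and simulated hospital datasets below are all illustrative assumptions, not the euroCAT or Varian Learning Portal implementation:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=50):
    """One hospital refines the shared model on its own data (logistic
    regression via gradient descent). Only the weights leave the site;
    the patient-level data (X, y) never do."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)     # gradient step
    return w

def federated_round(weights, hospitals):
    """Master sends the current weights out and averages the returned
    local weights into an updated shared model."""
    updates = [local_update(weights, X, y) for X, y in hospitals]
    return np.mean(updates, axis=0)

# Three simulated hospital datasets that stay behind their firewalls.
rng = np.random.default_rng(1)
hospitals = []
for _ in range(3):
    X = rng.normal(size=(40, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    hospitals.append((X, y))

w = np.zeros(2)
for _ in range(10):  # repeat until convergence criteria are met
    w = federated_round(w, hospitals)
```

Real deployments add the safeguards discussed in this review on top of this loop: anonymization before training, secure transport of the parameters, and convergence checks on the master machine.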
TABLE 2. Summary of Methods and Results of Distributed Machine Learning Studies Grouping More Than One Health Care Center (reference; data and target; methods and distributed learning approach; tools; accomplishments and results)

Jochems et al
- Data and target: Clinical data from 287 patients with lung cancer, treated with curative intent with CRT or RT alone, collected and stored in five medical institutes: MAASTRO (the Netherlands), Jessa (Belgium), Liège (Belgium), Aachen (Germany), and Eindhoven (the Netherlands). Target: predict dyspnea.
- Methods: A Bayesian network model adapted for distributed learning using data from five hospitals. Patient data were extracted and stored locally in the hospitals; only the weights were sent to the master server.
- Tools: Varian Learning Portal.
- Results: AUC, 0.61.

Deist et al
- Data and target: Clinical data from 268 patients with NSCLC from five medical institutes: Aachen (Germany), Eindhoven (the Netherlands), Hasselt (Belgium), Liège (Belgium), and Maastricht (the Netherlands). Target: predict dyspnea grade ≥ 2.
- Methods: The Alternating Direction Method of Multipliers was used to learn SVM models. The data were processed simultaneously in the local databases; the updated model parameters were then sent to the master machine to be compared and updated and to check whether the learning process had converged. The process was repeated until convergence criteria were met.
- Tools: Varian Learning Portal.
- Results: AUC, 0.62 for the training set and 0.66 for the validation set.

Dluhoš et al
- Data and target: 258 patients with first-episode schizophrenia and 222 healthy controls originating from four datasets: two datasets from University Hospital Brno (Czech Republic), one from University Medical Center Utrecht (the Netherlands), and the last from the Prague Psychiatric Center and Psychiatric Hospital Bohnice. Target: classification of patients with first-episode schizophrenia.
- Methods: All images were preprocessed (normalized, segmented, and standardized). Four local SVM models were created; multisample models (a joint model and a meta model) were then built from the individual models. This process was repeated four times, each time using three datasets for training and the remaining one for validation.
- Tools: VBM8 toolbox; MATLAB Statistics and Machine Learning Toolbox.
- Results: The joint and meta models had similar classification performance, which was better than the performance of the local models.

Jochems et al
- Data and target: Clinical data from 698 patients with lung cancer, treated with curative intent with CRT or RT alone, collected and stored in two medical institutes: MAASTRO (the Netherlands) and Michigan University (United States). Target: prediction of NSCLC 2-year survival after radiation therapy.
- Methods: Distributed learning for a Bayesian network using data from three hospitals. The model used the T category, N category, age, total tumor dose, and WHO performance status for predictions.
- Tools: Varian Learning Portal.
- Results: AUC, 0.662. The discriminative performance of the centralized and distributed models on the validation set was similar.

Brisimi et al
- Data and target: Electronic health records from Boston Medical Center of patients with at least one heart-related diagnosis between 2005 and 2010; the data are distributed between 10 hospitals. Target: prediction of cardiac events (a patient's hospitalization for cardiac events in the upcoming calendar year).
- Methods: A soft-margin l1-regularized sparse SVM classifier; an iterative cPDS algorithm was developed to solve the large-scale SVM problem in a decentralized fashion.
- Tools: Not provided.
- Results: AUC, 0.56. cPDS converged faster than centralized methods.

Tagliaferri et al
- Data and target: 227 variables extracted from thyroid cancer data from six Italian cancer centers, each variable with four properties: name, form, type of field, and levels. Target: prediction of survival and toxicity.
- Methods: Inferential regression analysis. The Learning Analyzer Proxy (a module of BOA available only in distributed mode) sends algorithms directly to local research proxies, taking back from them only the results of each iteration step, with no need to work with shared data in the cloud.
- Tools: COBRA framework.
- Results: Thyroid COBRA, based on the COBRA Storage System, and new software, BOA ("Beyond Ontology Awareness"), supporting two different models: a cloud-based large-database model and a distributed learning model.

Abbreviations: AUC, area under the curve; BOA, Beyond Ontology Awareness; COBRA, Consortium for Brachytherapy Data Analysis; cPDS, cluster Primal Dual Splitting; CRT, chemoradiation; NSCLC, non-small-cell lung cancer; RT, radiotherapy; SVM, support vector machine.
FIG 2. Schematic representation of the processes in a transparent distributed learning network. (A) Data preparation steps at each hospital: local data extraction (images, forms, databases), data anonymization, hashing algorithms, semantic web techniques, and FAIR data. (B) Distributed learning network composed of three hospitals, each equipped with a learning machine that communicates with a master machine responsible for sending model parameters and checking convergence criteria. (C) Flowchart of the distributed learning network described in B: the master sends parameters, the hospitals learn locally and send back local models, the master waits for all local models, updates the parameters, checks whether convergence criteria are reached, and either repeats the cycle or creates the final model. (D) Example of an action tracked by blockchain (designed and implemented according to needs agreed among network members): every new action in the network is appended to the previous records and cannot be removed, keeping all network participants aware of any new activity. DB, database; FAIR, findable, accessible, interoperable, reusable.

The results demonstrated performance improvement on the test datasets. Similar to the previous study, brain tumor segmentation was successfully performed using distributed deep learning across 10 institutions (BraTS distribution).

In distributed deep learning, the training weights are combined to train a final model, and the raw data are never exposed.35,37 In the case of sharing the local gradients, it might be possible to retrieve estimations of the original data from these gradients. Training the local models on batches may prevent retrieving all the data from the gradients, as these gradients correspond to single batches rather than all the local data. However, setting an optimal batch size needs to be considered to assure data safety and the model's ability to generalize.28,39,40

PRIVACY AND INTEGRATION OF DISTRIBUTED LEARNING NETWORKS

Privacy in a distributed learning network addresses three main areas: data privacy, the implemented model's privacy, and the model's output privacy. Data privacy is achieved by means of data anonymization and by data never leaving the medical institutions. The distributed learning model can be secured by applying differential privacy techniques,41 preventing leakage of weights during training, and by cryptographic techniques. These cryptographic techniques provide a set of multiparty protocols that ensure the security of the computations and communication. Once the model is ready, not only can the network participants use it to learn from their data, but this learning should be able to be performed locally and under highly private and secure conditions to protect the model's output.

The users of a machine/deep learning model are not necessarily the model's developers. Hence, documentation and the integration of automated data eligibility tests have two important assets:

- The documentation ensures a clear view of what the model is designed for, a technical description of the model, and its use.
- The eligibility tests are important to ensure that correct input data are extracted and provided before executing the model. In euroCAT,23 a distributed learning expert installed quality control via data extraction pipelines at every participant point in the network. The pipeline automatically allowed data records fulfilling the model training eligibility criteria to be used in the training. The experts also tested the extraction pipeline thoroughly in addition to the machine learning testing. However, there were post-processing compensation methods to correct for the variations caused by using different local protocols.

DISCUSSION

If one examines oncology, for instance, cancer is clearly one of the greatest challenges facing health care. More than 16 million new cancer cases were reported in 2017 alone. This number climbed to 18.1 million cases in 2018.
security of the computations and communication. Once the This number climbed to 18.1 million cases in 2018. This JCO Clinical Cancer Informatics 191 Zerka et al increasing number of cancer incidences means that to publish and reuse computational workflows, and to there are undoubtedly sufficient data worldwide to put define and share scientific protocols as workflow templates. machine/deep learning to meaningful work. However, as Such solutions will address emerging concerns about the highlighted earlier, this requires access to the data and, as nonreproducibility of scientific research, particularly in data also highlighted earlier, distributed learning enables this in science (eg, poorly published data, incomplete workflow a manner that resolves legal and ethical concerns. None- descriptions, limited ability to perform meta-analyses, and an 51,52 theless, integration of distributed learning into health care is overall lack of reproducibility). Because workflows are much slower compared with other fields, which raises the fundamental to research activities, FAIR has broad applica- question of why this should be. Here, we summarize a set bility, which is vital in the context of distributed learning with of methodologies to facilitate the adoption of distributed medical data. learning and provide future directions. WHY NOT PUBLICLY SHARE MEDICAL DATA? CURRENT STATE OF MEDICAL DATA STORAGE Some studies were conducted trying to facilitate and secure AND PREPROCESSING data-sharing procedures to encourage related researchers and organizations to publicly share their data and embrace Information Communication Technology transparency, by proposing data-sharing procedures and 38,39 Every hospital has its own storage devices and architecture. protocols aiming to harmonize regulatory frameworks and In this case, the information communication technology 54,55 research governance. 
CURRENT STATE OF MEDICAL DATA STORAGE AND PREPROCESSING

Information Communication Technology

Every hospital has its own storage devices and architecture.38,39 In this case, the information communication technology preparation for distributed learning requires significant energy, time, and manpower, which can be costly. This same process (data acquisition and preprocessing) needs to be repeated for each participating hospital; consequently, medical data standardization protocols need to be developed and adopted for this implementation process.

Make the Data Readable: Findable, Accessible, Interoperable, Reusable Data Principles

One way to enable a virtuous circle network effect is to embrace another community engaged in synergistic activities (joining a distributed learning network is worthwhile if it links to another large network). The Findable, Accessible, Interoperable, Reusable (FAIR) Guiding Principles for data management and stewardship have gained substantial interest,49,50 but delivering scientific protocols and workflows that are aligned with these principles is a significant undertaking. A description of the FAIR principles is presented in Figure 3.
Technological solutions are urgently needed that will enable researchers to explore, consume, and produce FAIR data in a reliable and efficient manner; to publish and reuse computational workflows; and to define and share scientific protocols as workflow templates. Such solutions will address emerging concerns about the nonreproducibility of scientific research, particularly in data science (eg, poorly published data, incomplete workflow descriptions, limited ability to perform meta-analyses, and an overall lack of reproducibility).51,52 Because workflows are fundamental to research activities, FAIR has broad applicability, which is vital in the context of distributed learning with medical data.

[FIG 3. Description of findable, accessible, interoperable, reusable (FAIR) principles.]

WHY NOT PUBLICLY SHARE MEDICAL DATA?

Some studies have tried to facilitate and secure data-sharing procedures to encourage researchers and organizations to publicly share their data and embrace transparency, by proposing data-sharing procedures and protocols aiming to harmonize regulatory frameworks and research governance.54,55 Despite the efforts made toward data-sharing globalization, the sociocultural issues surrounding data sharing remain pertinent. Large clinical trials also face limitations in their data collection capabilities because of limited data storage capacities and manpower.46-48 To retrospectively perform additional analyses, all the participating centers need to be contacted again, which is time consuming and delays research.

Furthermore, medical institutions prefer not to share patient data to ensure privacy protection. This is, of course, in no small part about ensuring the trust and confidence of patients, who display a wide range of sensitivities toward the use of their personal data.

ORGANIZATIONAL CHANGE MANAGEMENT

The adoption of distributed learning will require a change in organizational management (such as making use of the newest data standardization techniques and adapting the roles of employees to more technically oriented tasks, such as data retrieval). Provided knowledge and understanding of proper change management concepts, health care providers can implement the latter successfully. Change management principles, such as defining a global vision, networking, and continuous communication, could facilitate the integration of new technologies and build up clinical capabilities. However, this process of change management can be complicated, because it requires the involvement of multiple health care centers from different countries and continents. This diversity can trigger a fear of loss (one of the major factors of financial decision making), which stems from differences of opinion and regulation and from the absence of data standardization, making the processes of data acquisition and preprocessing harder. In addition, the lack of knowledge about the new technology leads to resistance to accepting change and innovation.60,61 Therefore, it is important to help health care organizations understand the need for distributed learning by explaining the context of the change, in terms of moving from traditional ways of learning to distributed learning, and a long-term vision of the improvements that it can bring, including time and money savings for both hospitals and patients. This could in turn improve patient lives, in addition to enabling more studies on research databases to consolidate proof of the safety and quality of distributed models.

As can be seen in Table 2, distributed learning has been applied to train models that predict different outcomes for a variety of pathologies, including lung cancer,23,62,63,63a thyroid cancer,64 cardiac events,65 and schizophrenia, in addition to the continuous development of tools and algorithms facilitating the adoption of distributed learning, such as the variant learning portal and the alternating direction method of multipliers algorithm, as well as the application of FAIR data principles. The cited studies provide proof that distributed learning can ensure patient data privacy and guarantee that accurate models are built that are the equivalent of centralized models.
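The pattern shared by the studies cited above — each institute trains on its own data and only model parameters travel — can be sketched as federated averaging. This is a toy illustration under stated assumptions (a two-hospital linear model with made-up data; hospital names, learning rate, and round count are all hypothetical), not the algorithm of any specific cited study:

```python
def local_update(weights, data, lr=0.1):
    # One pass of stochastic gradient descent on one hospital's private data.
    # Toy model: y ~ w . x with squared error; only `weights` leave the site.
    w = list(weights)
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def federated_average(global_w, site_datasets, rounds=100):
    # Each round: broadcast the global model, train locally at every site,
    # then average the returned weights, weighted by local sample count.
    for _ in range(rounds):
        updates = [(local_update(global_w, d), len(d)) for d in site_datasets]
        total = sum(n for _, n in updates)
        global_w = [sum(w[i] * n for w, n in updates) / total
                    for i in range(len(global_w))]
    return global_w

# Two hypothetical hospitals whose data follow the same rule y = 2*x1 + 1*x2.
hospital_a = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0)]
hospital_b = [([1.0, 1.0], 3.0), ([2.0, 0.0], 4.0)]
model = federated_average([0.0, 0.0], [hospital_a, hospital_b])
```

The coordinating "master" sees only weight vectors, never the `hospital_*` records, which is what makes the approach attractive under the privacy constraints discussed in this review.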
LIMITATIONS OF THE EXISTING DISTRIBUTED LEARNING IMPLEMENTATIONS

A shared limitation of the studies presented in Table 2 is that the number of institutes involved in each distributed network is rather small: the size of the network varies from four to 10 institutions. With few medical institutes involved, the models were trained using the data of only a few hundred patients. By promoting the use of distributed learning, it should instead become possible to train models using data from thousands or even millions of patients.

FUTURE PERSPECTIVES

An automated monitoring system accessible by the partners or medical centers participating in the distributed learning network can promote transparency, traceability, and trust.67 Recent advances in information technology, such as blockchain, can be integrated into a distributed learning network. Blockchain allows trusted partners to visualize the history of the transactions and actions taken in the distributed network. This integration of blockchain should help ease resistance to the new distributed technology among health care workers, as it provides both provenance and enforceable governance.68

In 2008, Satoshi Nakamoto69 introduced the concept of a peer-to-peer electronic cash system known as Bitcoin. Blockchain was made famous as the public transaction ledger of this cryptocurrency.69,70 It ensures security by using cryptography in a decentralized, immutable distributed ledger technology.71 It is easy to manage, as it can be made public, whereby any individual can participate, or private, where all participants are known to each other. It is an efficient monitoring system, as records cannot be deleted from the chain. By these means, blockchain exceeds its application as a cryptocurrency and becomes a permanent, trustworthy tracing system. Figure 4 presents a visual representation of blockchain.

[FIG 4. Visual representation of blockchain. Adapted from Rennock et al.18]

Boulos et al71 demonstrated how blockchain could contribute to health care: securing patient information and provider identities, managing health supply chains, monetizing clinical research and data (giving patients the choice to share), processing claims, detecting fraud, and managing prescriptions (replacing incorrect and outdated data). In addition to the above-mentioned uses of blockchain, it has also been used to maintain security and scalability of clinical data sharing, secure medical record sharing,74 prevent drug counterfeiting,75 and secure a patient's location.76

It is essential that the use of distributed machine/deep learning and blockchain be harmonized with the available security-preserving technologies (ie, continuous development and cybersecurity), which begins at the user level (use strong passwords, connect using only trusted networks, and so on) and ends with more complex information technology infrastructures (such as data anonymization and user ID encryption). Cybersecurity is a key aspect in preserving privacy and ensuring safety and trust among patients and health care systems. Continuous development, or postmarketing surveillance, can be seen as the set of checks and integrations that should occur once a distributed learning network is launched. This practice should make it possible to identify any weak security measures in the network, or non-up-to-date features that may require re-implementation.79,80
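The append-only, tamper-evident ledger property invoked above can be illustrated with a minimal hash chain. This is a deliberately simplified sketch under stated assumptions — no consensus protocol, networking, or signatures, and the event strings are hypothetical — showing only why records, once chained, cannot be silently edited or deleted:

```python
import hashlib
import json

def add_block(chain, record):
    # Append a record, chaining it to the previous block's hash.
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    digest = hashlib.sha256(
        json.dumps({"record": record, "prev_hash": prev_hash},
                   sort_keys=True).encode()).hexdigest()
    chain.append({"record": record, "prev_hash": prev_hash, "hash": digest})
    return chain

def verify(chain):
    # Recompute every hash; any edited or deleted block breaks the chain.
    prev_hash = "0" * 64
    for block in chain:
        expected = hashlib.sha256(
            json.dumps({"record": block["record"], "prev_hash": prev_hash},
                       sort_keys=True).encode()).hexdigest()
        if block["hash"] != expected or block["prev_hash"] != prev_hash:
            return False
        prev_hash = block["hash"]
    return True

ledger = []
add_block(ledger, "H1 requests to contribute 120 new records")    # hypothetical events
add_block(ledger, "master approves request; training round started")
assert verify(ledger)
ledger[0]["record"] = "H1 requests to contribute 5 records"       # tampering...
assert not verify(ledger)                                         # ...is detected
```

In a distributed learning network, such a ledger would record requests and training actions rather than currency transactions, giving partners the shared, auditable history described in this section.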
This practice application, however, is that medical centers need to be should make it possible to identify any weak security convinced to participate in such practice, and regulators measures in the network or non-up-to-date features that also need to know suitable safeguards have been estab- 79,80 may require re-implementation. lished. Moreover, as can be seen in Table 2, even with the The distributed learning and blockchain technologies use of distributed learning, the size of the data pool learned presented here show that there are emerging data science from remains rather small. In the future, the integration of solutions that begin to meet the concerns and shortcom- blockchain technology to distributed learning networks ings of the law. The problems of re-identification are greatly could be considered, as it ensures transparency and reduced and managed through the technologies. Clearly, traceability while following FAIR data principles and can there are conceptual issues of understanding the impact of facilitate the implementation of distributed learning. 8295; Interreg V-A Euregio Meuse-Rhine “Euradiomics” Grant No. AFFILIATIONS 1 EMR4; and the Scientific Exchange from Swiss National Science The D-Lab, Department of Precision Medicine, GROW School for Foundation Grant No. IZSEZ0_180524. Oncology and Developmental Biology, Maastricht University Medical Centre, Maastricht, The Netherlands Oncoradiomics, Liege, Belgium AUTHOR CONTRIBUTIONS Department of Radiation Oncology, University Hospital Zurich and Conception and design: All authors University of Zurich, Zurich, Switzerland Financial support: Sean Walsh Department of Health, Ethics, and Society, CAPHRI (Care and Public Administrative support: Sean Walsh Health Research Institute), Maastricht University, Maastricht, The Provision of study material or patients: Sean Walsh Netherlands Collection and assembly of data: Fadila Zerka, Samir Barakat, Ralph T.H. 
Leijenaar Data analysis and interpretation: Fadila Zerka, Samir Barakat, CORRESPONDING AUTHOR Ralph T.H. Leijenaar, David Townend, Philippe Lambin Fadila Zerka, PhD, Universiteit Maastricht, Postbus 616, Maastricht Manuscript writing: All authors 6200 MD, the Netherlands; e-mail: f.zerka@maastrichtuniversity.nl. Final approval of manuscript: All authors Accountable for all aspects of the work: All authors SUPPORT Supported by European Research Council advanced grant ERC-ADG- AUTHORS’ DISCLOSURES OF POTENTIAL CONFLICTS OF 2015 Grant No. 694812, Hypoximmuno; the Dutch technology INTEREST Foundation Stichting Technische Wetenschappen Grant No. P14-19 The following represents disclosure information provided by authors of Radiomics STRaTegy, which is the applied science division of Dutch this manuscript. All relationships are considered compensated unless Research Council (De Nederlandse Organisatie voor Wetenschappelijk) otherwise noted. Relationships are self-held unless noted. I = Immediate the Technology Program of the Ministry of Economic Affairs; Small and Family Member, Inst = My Institution. Relationships may not relate to the Medium-Sized Enterprises Phase 2 RAIL Grant No. 673780; subject matter of this manuscript. For more information about ASCO’s EUROSTARS, DART Grant No. E10116 and DECIDE Grant No. E11541; conflict of interest policy, please refer to www.asco.org/rwc or ascopubs. the European Program PREDICT ITN Grant No. 766276; Third Joint org/cci/author-center. Transnational Call 2016 JTC2016 “CLEARLY” Grant No. UM 2017- 194 © 2020 by American Society of Clinical Oncology Distributed Learning in Health Care Open Payments is a public database containing information reported by Benjamin Miraglio companies about payments made to US-licensed physicians (Open Employment: OncoRadiomics Payments). 
Fadila Zerka
Employment: Oncoradiomics
Research Funding: PREDICT

Samir Barakat
Employment: PtTheragnostic
Leadership: PtTheragnostic
Research Funding: ptTheragnostic

Sean Walsh
Employment: Oncoradiomics
Leadership: Oncoradiomics
Stock and Other Ownership Interests: Oncoradiomics
Research Funding: Varian Medical Systems (Inst)

Ralph T. H. Leijenaar
Employment: Oncoradiomics
Leadership: Oncoradiomics
Stock and Other Ownership Interests: Oncoradiomics
Patents, Royalties, Other Intellectual Property: Image analysis method supporting illness development prediction for a neoplasm in a human or animal body (PCT/NL2014/050728)

Arthur Jochems
Stock and Other Ownership Interests: Oncoradiomics, Medical Cloud Company

Benjamin Miraglio
Employment: OncoRadiomics

Philippe Lambin
Employment: Convert Pharmaceuticals
Leadership: DNAmito
Stock and Other Ownership Interests: BHV, Oncoradiomics, Convert Pharmaceuticals, The Medical Cloud Company
Honoraria: Varian Medical
Consulting or Advisory Role: BHV, Oncoradiomics
Patents, Royalties, Other Intellectual Property: Co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Oncoradiomics, and one issued patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito; three nonpatentable inventions (software) licensed to ptTheragnostic/DNAmito, Oncoradiomics, and Health Innovation Ventures
Travel, Accommodations, Expenses: ptTheragnostic, Elekta, Varian Medical

David Townend
Consulting or Advisory Role: Newron Pharmaceuticals (I)

No other potential conflicts of interest were reported.

ACKNOWLEDGMENT
We thank Simone Moorman for editing the manuscript.

REFERENCES
1. Mitchell TM: Machine Learning (international ed) [reprint]. New York, NY, McGraw-Hill, 1997
2. Boyd S, Parikh N, Chu E, et al: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3:1-122, 2010
3.
Cardoso I, Almeida E, Allende-Cid H, et al: Analysis of machine learning algorithms for diagnosis of diffuse lung diseases. Methods Inf Med 57:272-279, 2018
4. Wang X, Peng Y, Lu L, et al: ChestX-Ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Presented at the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 21-26, 2017
5. Ding Y, Sohn JH, Kawczynski MG, et al: A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain. Radiology 290:456-464, 2019
6. Emmert-Streib F, Dehmer M: A machine learning perspective on personalized medicine: An automized, comprehensive knowledge base with ontology for pattern recognition. Mach Learn Knowl Extr 1:149-156, 2018
7. Deist TM, Dankers FJWM, Valdes G, et al: Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers. Med Phys 45:3449-3459, 2018
8. Lambin P, van Stiphout RG, Starmans MH, et al: Predicting outcomes in radiation oncology: Multifactorial decision support systems. Nat Rev Clin Oncol 10:27-40, 2013
9. Wang S, Summers RM: Machine learning and radiology. Med Image Anal 16:933-951, 2012
10. James G, Witten D, Hastie T, et al: An Introduction to Statistical Learning: With Applications in R. New York, NY, Springer, 2017
11. Sutton RS, Barto AG: Reinforcement Learning: An Introduction. https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
12. Deng L: Deep learning: Methods and applications. Foundations and Trends in Signal Processing 7:197-387, 2014
13. LeCun Y, Bengio Y, Hinton G: Deep learning. Nature 521:436-444, 2015
14. Garling C: Andrew Ng: Why 'deep learning' is a mandate for humans, not just machines. Wired, 2015. https://www.wired.com/brandlab/2015/05/andrew-ng-deep-learning-mandate-humans-not-just-machines/
15. Pesapane F, Codari M, Sardanelli F: Artificial intelligence in medical imaging: Threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp 2:35, 2018
16. Liberati A, Altman DG, Tetzlaff J, et al: The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration. PLoS Med 6:e1000100, 2009
16a. Intersoft Consulting: General Data Protection Regulation: Recitals. https://gdpr-info.eu/recitals/no-26/
17. MAASTRO Clinic: euroCAT: Distributed learning. https://youtu.be/nQpqMIuHyOk
18. Rennock MJW, Cohn A, Butcher JR: Blockchain technology and regulatory investigations. https://www.steptoe.com/images/content/1/7/v2/171967/LIT-FebMar18-Feature-Blockchain.pdf
19. Orlhac F, Frouin F, Nioche C, et al: Validation of a method to compensate multicenter effects affecting CT radiomics. Radiology 291:53-59, 2019
20. Goodfellow I, Bengio Y, Courville A: Deep Learning. https://www.deeplearningbook.org/
21. Lambin P, Roelofs E, Reymen B, et al: Rapid Learning health care in oncology - An approach towards decision support systems enabling customised radiotherapy. Radiother Oncol 109:159-164, 2013
22. Lustberg T, van Soest J, Jochems A, et al: Big Data in radiation therapy: Challenges and opportunities. Br J Radiol 90:20160689, 2017
23. Deist TM, Jochems A, van Soest J, et al: Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT. Clin Transl Radiat Oncol 4:24-31, 2017
24. Price G, van Herk M, Faivre-Finn C: Data mining in oncology: The ukCAT project and the practicalities of working with routine patient data. Clin Oncol (R Coll Radiol) 29:814-817, 2017
25. Dean J, Corrado G, Monga R, et al: Large scale distributed deep networks, in Advances in Neural Information Processing Systems 25, 2012, pp 1223-1231. https://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012
26. Cireşan D, Meier U, Schmidhuber J: Multi-column deep neural networks for image classification. http://arxiv.org/abs/1202.2745
27. Radiuk PM: Impact of training set batch size on the performance of convolutional neural networks for diverse datasets. Information Technology and Management Science 20:20-24, 2017
28. Keskar NS, Mudigere D, Nocedal J, et al: On large-batch training for deep learning: Generalization gap and sharp minima. http://arxiv.org/abs/1609.04836
29. Papernot N, Abadi M, Erlingsson U, et al: Semi-supervised knowledge transfer for deep learning from private training data. http://arxiv.org/abs/1610.05755
30. Shokri R, Shmatikov V: Privacy-preserving deep learning, in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS '15). Denver, CO, ACM Press, 2015, pp 1310-1321
31. Predd JB, Kulkarni SB, Poor HV: Distributed learning in wireless sensor networks. IEEE Signal Process Mag 23:56-69, 2006
32. Ji X, Hou C, Hou Y, et al: A distributed learning method for L1-regularized kernel machine over wireless sensor networks. Sensors (Basel) 16:1021, 2016
33. Chang K, Balachandar N, Lam C, et al: Distributed deep learning networks among institutions for medical imaging. J Am Med Inform Assoc 25:945-954, 2018
34. McClure P, Zheng CY, Kaczmarzyk J, et al: Distributed weight consolidation: A brain segmentation case study. https://arxiv.org/abs/1805.10863
35. FreeSurferWiki: FreeSurfer. http://freesurfer.net/fswiki/FreeSurferWiki
36. Sheller MJ, Reina GA, Edwards B, et al: Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation. http://arxiv.org/abs/1810.04304
37. Li W, Milletarì F, Xu D, et al: Privacy-preserving federated brain tumour segmentation. http://arxiv.org/abs/1910.00962
38. Abadi M, Chu A, Goodfellow I, et al: Deep learning with differential privacy, in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16), 2016, pp 308-318
39. Mishkin D, Sergievskiy N, Matas J: Systematic evaluation of convolution neural network advances on the ImageNet. Comput Vis Image Underst 161:11-19
40. Lin T, Stich SU, Patel KK, et al: Don't use large mini-batches, use local SGD. http://arxiv.org/abs/1808.07217
41. Biryukov A, De Cannière C, Winkler WE, et al: Discretionary access control policies (DAC), in van Tilborg HCA, Jajodia S (eds): Encyclopedia of Cryptography and Security. Boston, MA, Springer, 2011, pp 356-358
42. Pinkas B: Cryptographic techniques for privacy-preserving data mining. SIGKDD Explor 4:12-19, 2002
43. Siegel RL, Miller KD, Jemal A: Cancer statistics, 2017. CA Cancer J Clin 67:7-30, 2017
44. Bray F, Ferlay J, Soerjomataram I, et al: Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 68:394-424, 2018
45. Siegel R, DeSantis C, Virgo K, et al: Cancer treatment and survivorship statistics, 2012. CA Cancer J Clin 62:220-241, 2012
46. Shortliffe EH, Barnett GO: Medical data: Their acquisition, storage, and use, in Shortliffe EH, Perreault LE (eds): Medical Informatics. New York, NY, Springer, 2001, pp 41-75
47. Shabani M, Vears D, Borry P: Raw genomic data: Storage, access, and sharing. Trends Genet 34:8-10, 2018
48. Langer SG: Challenges for data storage in medical imaging research. J Digit Imaging 24:203-207, 2011
49. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3:160018, 2016
50. Wilkinson MD, Sansone S-A, Schultes E, et al: A design framework and exemplar metrics for FAIRness. Sci Data 5:180118, 2018
51. Dumontier M, Gray AJG, Marshall MS, et al: The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331, 2016
52. Jagodnik KM, Koplev S, Jenkins SL, et al: Developing a framework for digital objects in the Big Data to Knowledge (BD2K) commons: Report from the Commons Framework Pilots workshop. J Biomed Inform 71:49-57, 2017
53. Polanin JR, Terzian M: A data-sharing agreement helps to increase researchers' willingness to share primary data: Results from a randomized controlled trial. J Clin Epidemiol 106:60-69, 2018
54. Azzariti DR, Riggs ER, Niehaus A, et al: Points to consider for sharing variant-level information from clinical genetic testing with ClinVar. Cold Spring Harb Mol Case Stud 4:a002345, 2018
55. Boue S, Byrne M, Hayes AW, et al: Embracing transparency through data sharing. Int J Toxicol 10.1177/1091581818803880
56. Poline J-B, Breeze JL, Ghosh S, et al: Data sharing in neuroimaging research. Front Neuroinform 6:9, 2012
57. Cutts FT, Enwere G, Zaman SMA, et al: Operational challenges in large clinical trials: Examples and lessons learned from the Gambia pneumococcal vaccine trial. PLoS Clin Trials 1:e16, 2006
58. Xia W, Wan Z, Yin Z, et al: It's all in the timing: Calibrating temporal penalties for biomedical data sharing. J Am Med Inform Assoc 25:25-31, 2018
59. Fleishon H, Muroff LR, Patel SS: Change management for radiologists. J Am Coll Radiol 14:1229-1233, 2017
60. Delaney R, D'Agostino R: The challenges of integrating new technology into an organization. https://digitalcommons.lasalle.edu/cgi/viewcontent.cgi?article=1024&context=mathcompcapstones
61. Agboola A, Salawu R: Managing deviant behavior and resistance to change. Int J Bus Manage 6:235, 2010
62. Jochems A, Deist TM, van Soest J, et al: Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the hospital - A real life proof of concept. Radiother Oncol 121:459-467, 2016
63. Jochems A, Deist TM, El Naqa I, et al: Developing and validating a survival prediction model for NSCLC patients through distributed learning across 3 countries. Int J Radiat Oncol Biol Phys 99:344-352, 2017
63a. Deist TM, Dankers FJWM, Ojha P, et al: Distributed learning on 20 000+ lung cancer patients - The Personal Health Train. Radiother Oncol 144:189-200
64. Tagliaferri L, Gobitti C, Colloca GF, et al: A new standardized data collection system for interdisciplinary thyroid cancer management: Thyroid COBRA. Eur J Intern Med 53:73-78, 2018
65. Brisimi TS, Chen R, Mela T, et al: Federated learning of predictive models from federated Electronic Health Records. Int J Med Inform 112:59-67, 2018
66. Dluhos P, Schwarz D, Cahn W, et al: Multi-center machine learning in imaging psychiatry: A meta-model approach. Neuroimage 155:10-24, 2017
67. Dhillon V, Metcalf D, Hooper M: Blockchain in health care, in Dhillon V, Metcalf D, Hooper M (eds): Blockchain Enabled Applications: Understand the Blockchain Ecosystem and How to Make it Work for You. Berkeley, CA, Apress, 2017, pp 125-138
68. Lugan S, Desbordes P, Tormo LXR, et al: Secure architectures implementing trusted coalitions for blockchained distributed learning (TCLearn). http://arxiv.org/abs/1906.07690
69. Nakamoto S: Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf
70. Gordon WJ, Catalini C: Blockchain technology for healthcare: Facilitating the transition to patient-driven interoperability. Comput Struct Biotechnol J 16:224-230, 2018
71. Kamel Boulos MN, Wilson JT, Clauson KA: Geospatial blockchain: Promises, challenges, and scenarios in health and healthcare. Int J Health Geogr 17:25
72. Pirtle C, Ehrenfeld J: Blockchain for healthcare: The next generation of medical records? J Med Syst 42:172, 2018
73. Zhang P, White J, Schmidt DC, et al: FHIRChain: Applying blockchain to securely and scalably share clinical data. Comput Struct Biotechnol J 16:267-278
74. Dubovitskaya A, Xu Z, Ryu S, et al: Secure and trustable electronic medical records sharing using blockchain. AMIA Annu Symp Proc 2017:650-659, 2018
75. Vruddhula S: Application of on-dose identification and blockchain to prevent drug counterfeiting. Pathog Glob Health 112:161, 2018
76. Ji Y, Zhang J, Ma J, et al: BMPLS: Blockchain-based multi-level privacy-preserving location sharing scheme for telecare medical information systems. J Med Syst 42:147, 2018
77. Coventry L, Branley D: Cybersecurity in healthcare: A narrative review of trends, threats and ways forward. Maturitas 113:48-52, 2018
78. Jalali MS, Kaiser JP: Cybersecurity in hospitals: A systematic, organizational perspective. J Med Internet Res 20:e10059, 2018
79. Vlahović-Palčevski V, Mentzer D: Postmarketing surveillance, in Seyberth HW, Rane A, Schwab M (eds): Pediatric Clinical Pharmacology. Berlin, Springer, 2011, pp 339-351
80. Parkash R, Thibault B, Philippon F, et al: Canadian Registry of Implantable Electronic Device outcomes: Surveillance of high-voltage leads. Can J Cardiol 34:808-811, 2018
81. Ing EB, Ing R: The use of a nomogram to visually interpret a logistic regression prediction model for giant cell arteritis. Neuroophthalmology 42:284-286, 2018
82. Tirzite M, Bukovskis M, Strazda G, et al: Detection of lung cancer with electronic nose and logistic regression analysis. J Breath Res 13:016006, 2018
83. Ji Z, Jiang X, Wang S, et al: Differentially private distributed logistic regression using private and public data. BMC Med Genomics 7:S14, 2014 (suppl 1)
84. Jiang W, Li P, Wang S, et al: WebGLORE: A web service for Grid LOgistic REgression. Bioinformatics 29:3238-3240, 2013
85. Wang S, Jiang X, Wu Y, et al: EXpectation Propagation LOgistic REgRession (EXPLORER): Distributed privacy-preserving online model learning. J Biomed Inform 46:480-496, 2013
86. Desai A, Chaudhary S: Distributed decision tree, in Proceedings of the Ninth Annual ACM India Conference. Gandhinagar, India, ACM Press, 2016, pp 43-50
87. Caragea D, Silvescu A, Honavar V: Decision tree induction from distributed heterogeneous autonomous data sources, in Abraham A, Franke K, Koppen M (eds): Intelligent Systems Design and Applications. Berlin, Springer, 2003, pp 341-350
88. Plaku E, Kavraki LE: Distributed computation of the knn graph for large high-dimensional point sets. J Parallel Distrib Comput 67:346-359, 2007
89. Xiong L, Chitti S, Liu L: Mining multiple private databases using a kNN classifier, in Proceedings of the 2007 ACM Symposium on Applied Computing (SAC '07). Seoul, Korea, ACM Press, 2007, p 435
90. Huang Z: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2:283-304, 1998
91. Jagannathan G, Wright RN: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). Chicago, IL, ACM Press, 2005, p 593
92. Jin R, Goswami A, Agrawal G: Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10:17-40, 2006
93. Jagannathan G, Pillaipakkamnatt K, Wright RN: A new privacy-preserving distributed k-clustering algorithm, in Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2006, pp 494-498
94. Ye Y, Chiang C-C: A parallel apriori algorithm for frequent itemsets mining, in Fourth International Conference on Software Engineering Research, Management and Applications (SERA '06). Seattle, WA, IEEE, 2006, pp 87-94
95. Cheung DW, Ng VT, Fu AW, et al: Efficient mining of association rules in distributed databases. IEEE Trans Knowl Data Eng 8:911-922, 1996
96. Bellman R: A Markovian decision process. Indiana Univ Math J 6:679-684, 1957
97.
Puterman ML: Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, John Wiley & Sons, 2014 98. Watkins CJCH, Dayan P: Q-learning. Mach Learn 8:279-292, 1992 99. Lauer M, Riedmiller M: An algorithm for distributed reinforcement learning in cooperative multi-agent systems, in Proceedings of the Seventeenth International Conference on Machine Learning. Burlington, MA, Morgan Kaufmann, 2000, pp 535-542. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.772 nn n JCO Clinical Cancer Informatics 197 Zerka et al APPENDIX Records identified through Additional records identified database searching through other sources (n = 127) (n = 0) Records after duplicates removed (n = 127) Records screened Records excluded (n = 6) (n = 121) Full-text articles assessed Full-text articles excluded, for eligibility with reasons (n = 6) (n = 0) Studies included in qualitative synthesis (n = 6) Studies included in quantitative synthesis (meta-analysis) (n = 6) FIG A1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2009 flow diagram. 198 © 2020 by American Society of Clinical Oncology Included Eligibility Screening Identification Distributed Learning in Health Care TABLE A2. PRISMA 2009 Checklist Reported on Section/Topic No. Checklist Item Page No. Title Title 1 Identify the report as a systematic review, meta-analysis, or both. 1 Abstract Structured summary 2 Provide a structured summary including, as applicable: background; 1 objectives; data sources; study eligibility criteria, participants, and interventions; study appraisal and synthesis methods; results; limitations; conclusions and implications of key findings; systematic review registration number. Introduction Rationale 3 Describe the rationale for the review in the context of what is already 1-5 known. Objectives 4 Provide an explicit statement of questions being addressed with reference 2 to PICOS. 
Methods
  Protocol and registration | 5 | Indicate if a review protocol exists, if and where it can be accessed (eg, Web address), and, if available, provide registration information including registration number. | 5
  Eligibility criteria | 6 | Specify study characteristics (eg, PICOS, length of follow-up) and report characteristics (eg, years considered, language, publication status) used as criteria for eligibility, giving rationale. | 5
  Information sources | 7 | Describe all information sources (eg, databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched. | 5
  Search | 8 | Present full electronic search strategy for at least one database, including any limits used, such that it could be repeated. | 5
  Study selection | 9 | State the process for selecting studies (ie, screening, eligibility, included in systematic review, and, if applicable, included in the meta-analysis). | 5 (and Fig A1)
  Data collection process | 10 | Describe method of data extraction from reports (eg, piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators. | 5
  Data items | 11 | List and define all variables for which data were sought (eg, PICOS, funding sources) and any assumptions and simplifications made. | N/A
  Risk of bias in individual studies | 12 | Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level) and how this information is to be used in any data synthesis. | N/A
  Summary measures | 13 | State the principal summary measures (eg, risk ratio, difference in means). | N/A
  Synthesis of results | 14 | Describe the methods of handling data and combining results of studies, if done, including measures of consistency (eg, I²) for each meta-analysis. | 5
  Risk of bias across studies | 15 | Specify any assessment of risk of bias that may affect the cumulative evidence (eg, publication bias, selective reporting within studies). | N/A
  Additional analyses | 16 | Describe methods of additional analyses (eg, sensitivity or subgroup analyses, meta-regression), if done, indicating which were prespecified. | N/A
Results
  Study selection | 17 | Give numbers of studies screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally with a flow diagram. | 5 (and Fig A1)
  Study characteristics | 18 | For each study, present characteristics for which data were extracted (eg, study size, PICOS, follow-up period) and provide the citations. | 5-8
  Risk of bias within studies | 19 | Present data on risk of bias of each study and, if available, any outcome-level assessment (see item 12). | N/A
  Results of individual studies | 20 | For all outcomes considered (benefits or harms), present, for each study: (a) simple summary data for each intervention group, and (b) effect estimates and confidence intervals, ideally with a forest plot. | 5-8
  Synthesis of results | 21 | Present results of each meta-analysis done, including confidence intervals and measures of consistency. | 5-8
  Risk of bias across studies | 22 | Present results of any assessment of risk of bias across studies (see item 15). | N/A
  Additional analysis | 23 | Give results of additional analyses, if done (eg, sensitivity or subgroup analyses, meta-regression [see item 16]). | N/A
Discussion
  Summary of evidence | 24 | Summarize the main findings, including the strength of evidence for each main outcome; consider their relevance to key groups (eg, health care providers, users, and policy makers). | 8
  Limitations | 25 | Discuss limitations at study and outcome level (eg, risk of bias), and at review level (eg, incomplete retrieval of identified research, reporting bias). | 10
  Conclusions | 26 | Provide a general interpretation of the results in the context of other evidence, and implications for future research. | 11
Funding
  Funding | 27 | Describe sources of funding for the systematic review and other support (eg, supply of data); role of funders for the systematic review. | 11
Abbreviations: N/A, not applicable; PICOS, participants, interventions, comparisons, outcomes, and study design; PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.
© 2020 by American Society of Clinical Oncology

Journal: JCO Clinical Cancer Informatics (Wolters Kluwer Health)

Published: Mar 5, 2020
