
Data mining with Random Forests as a methodology for biomedical signal classification

Abstract

As the contribution of specific parameters is not known and significant intersubject variability is expected, a decision system allowing adaptation to subject and environment conditions has to be designed to evaluate biomedical signal classification. A decision support system has to be trained in its desired functionality before being used for patient monitoring evaluation. This paper describes a decision system based on data mining with Random Forests that allows such adaptation to subject and environment conditions. This methodology may lead to specific system scoring by an artificial intelligence-supported patient monitoring evaluation system, which may help guide decisions concerning future treatment and improve patients' quality of life.

Keywords: data mining; patient monitoring; Random Forests.

*Corresponding author: Klaudia Proniewska, Jagiellonian University Medical College, Krakow, Poland, E-mail: klaudia.proniewska@uj.edu.pl

Introduction

The acquisition of bioelectrical signals and the development of new signal processing techniques continue to excite physicians and engineers alike, as over the years these have helped reveal new information regarding various diseases [1]. Current trends in patient care emphasize overall quality of life as a goal of treatment; therefore, various forms of continuous patient monitoring have become increasingly popular in health care and e-health [2]. The aims of continuous patient monitoring have been known for many years and have expanded from the monitoring of a single biomarker to the detection and classification of morphological disorders in whole acquired signals, such as cardiac arrhythmias or the dynamic assessment of ST-segment changes. Present-day patient home care monitoring systems are still under development, and the integrated data acquisition modules are also able to store data for retrospective analysis.

The domain of patient monitoring and e-health [3], which involves health-care administration and support, education, health-care delivery, and research, is based on clinical information systems such as electronic medical records, decision support, and monitoring of clinical and institutional practice, including disease management, services, remote patient monitoring, teleconsultations, and homecare [4]. Statistical models based on a patient's biomedical signals allow the classification of disorders. To facilitate this, appropriate methods for processing and analyzing biomedical signals that have been used previously [5] are being developed or are awaiting validation in real life. One of the methods that can be used for prediction and classification is Random Forests, which is described in this work.

Random Forests methodology

Random Forests has become a popular technique for classification, prediction, variable selection, and the investigation of variable importance. It is a powerful approach to data exploration, analysis, and predictive modeling. The method was developed by Leo Breiman (the father of CART) at the University of California, Berkeley [6]. The main advantages of Random Forests are [7]:
– Clustering,
– Analysis of small-sample data,
– Good performance with a large number of variables,
– Automated identification of important predictors,
– Generation of strong predictive models.

A Random Forests model is a collection of trees following specific rules for:
– Tree growing,
– Tree combination,
– Self-testing,
– Postprocessing (a minimal code sketch of this scheme is given after this list).
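The paper's own experiments were run in Statistica [16]; purely as an illustration of the tree-growing, combination, and voting rules above, the following is a minimal Python sketch using scikit-learn's DecisionTreeClassifier as the base learner. The function names, the default settings, and the assumption of NumPy arrays with binary 0/1 labels are ours, not the author's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=300, subsample=0.7, seed=0):
    """Tree growing and combination: fit n_trees trees, each on a
    different random subsample of the training data."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap-style subsample: cases drawn with replacement.
        idx = rng.choice(len(X), size=int(subsample * len(X)), replace=True)
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # splitter chosen from a random predictor subset
            random_state=int(rng.integers(2**31)),
        )
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def forest_score(forest, X):
    """Voting: each tree casts a yes/no vote; the predicted probability
    is the percentage of 'yes' (class 1) votes."""
    votes = np.stack([tree.predict(X) for tree in forest])
    return votes.mean(axis=0)
```

In practice one would use a ready-made ensemble class rather than hand-rolling this loop; the sketch only makes the growing and voting rules explicit.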
The accuracy of a random forest depends on the strength of the individual tree classifiers and on a measure of the dependence between them. A single tree classifier may have accuracy only slightly better than a random choice of class; combining trees grown using random features can produce improved accuracy. To improve accuracy, the injected randomness has to minimize the correlation between trees while maintaining their strength. The forests studied here grow each tree using randomly selected inputs, or combinations of inputs, at each node.

The Random Forests method starts with a machine learning technique called a "decision tree" [8]. In a decision tree, input is entered at the top and, as it traverses down the tree, the data are bucketed into smaller and smaller sets, as in Figure 1. The basics of Random Forests tree growing are as follows:
– Each parent node is split into no more than two children.
– Each tree is grown at least partially at random.
– Each tree is based on a different random subsample of the training data.
– The split selection process is designed so that the splitter at any node is determined partly at random.

The best splitter from the eligible random subset is used to split the node; even if that splitter is not particularly good, the result is still two child nodes [9]. Because a categorical variable can be coded into several indicator variables, it becomes as likely as a numeric variable to be selected in node splitting. In growing an ordinary decision tree, we normally conduct exhaustive searches across all possible predictors to find the best possible partition of the data in each node. Suppose that, instead of always picking the best splitter, we picked the splitter at random: this would guarantee that different trees are rather dissimilar to each other.

Random Forests tree evaluation is done by performing the following operations on each node [10]:
– Splitting the node on the best eligible splitter,
– Finding a new list of random, eligible predictors,
– Defining an eligible predictor set,
– Selecting the strongest branch of the tree.

Random Forests has a very simple mechanism:
– Grow many trees (in the presented study, 300 trees).
– Each tree casts a vote at its terminal nodes (yes or no).
– The count of yes votes is the Random Forests score.
– The predicted probability is the percentage of yes votes.

Random Forests testing works as follows:
– Each tree is grown on about 70% of the original training data.
– The remaining 30% of the data (a different 30% for each tree) are available to test that single tree.
– These held-out data are used to calibrate the performance of each tree and to check how often each record is classified correctly.

Random Forests testing is repeated hundreds of times, with a different random partitioning of the data each time [11, 12]. The small samples of the record types of interest allow self-testing.

Figure 1: The Random Forests concept. Starting from all the data, a small subset of variables is selected at random at each node, and the variable (and the value of that variable) that optimizes the split is chosen.

Table 1: Properties of the Random Forests method [16] (an illustrative mapping onto a common implementation follows the table).
– Number of predictors: Specifies the number of predictors used in the tree models. The default value is a subset of the total number of predictor variables selected for analysis.
– Number of trees: Specifies the number of simple trees built gradually in subsequent steps of the analysis. The results make it possible to identify which individual trees used in the model (graph and tree structure) should be displayed in the output node.
– Random test data proportion: The portion of the data set randomly selected as the test sample. This parameter applies only if use of the test sample is enabled.
– Subsample proportion: Specifies the proportion of cases drawn for the bootstrap algorithm in subsequent cycles of model construction. The bootstrap cases are drawn with replacement from the cases contained in the original data set.
– Minimum number of cases: Controls node division by specifying the minimum number of observations a node must contain for its division to be permissible.
– Minimum number in child node: Specifies the minimum cardinality of a node resulting from a division; a division is acceptable only if both resulting child nodes satisfy this condition.
– Maximum number of levels: The maximum depth of the tree, used to stop division based on the number of levels. Each division is checked against this value; if the depth would exceed it, the search for divisions ends.
– Maximum number of nodes: Used to stop division based on the total number of nodes. Each division is checked against this parameter; if the number of nodes would exceed it, the search for divisions ends.
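The properties in Table 1 are Statistica-specific [16], but most have rough analogues in other Random Forests implementations. Purely as an assumption for illustration, they might map onto scikit-learn's RandomForestClassifier as follows; the numeric values are placeholders, not settings reported in the paper.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=300,      # number of trees (300 in the presented study)
    max_features=8,        # number of predictors tried at each split
    max_samples=0.7,       # subsample proportion drawn for each tree ...
    bootstrap=True,        # ... with replacement, as in Table 1
    min_samples_split=10,  # minimum number of cases to allow a division
    min_samples_leaf=5,    # minimum number in each resulting child node
    max_depth=10,          # maximum number of levels of the tree
    max_leaf_nodes=100,    # rough analogue of the maximum number of nodes
    oob_score=True,        # self-testing on the cases left out of each tree
)
```

The "random test data proportion" row has no constructor argument here; holding out a separate test set (e.g., with sklearn.model_selection.train_test_split) plays that role.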
The Random Forests method can be used to define statistical models [13]. One of the components of the prediction system is a classifier, which is currently an ensemble Random Forests classifier [14].

1. Data mining classification with constructed Random Forests models

The goal of the built models was to achieve the highest possible sensitivity (prediction value) and specificity (the ability to distinguish one event group from the others) [15]. The Random Forests method used in the following example takes the eight properties listed in Table 1 (which can be changed depending on the model). Initially, the Random Forests method is fed with all the acquired data, grouped into the different types of events [17].

2. Validation in patient monitoring environment

To perform validation of data mining with the Random Forests method, a set of parameters derived from quantitative analysis of the biomedical signals has to be prepared [18, 19]. All parameters extracted by the dedicated analyses are considered as inputs to the predictive statistical models used to find the best possible classification of the defined disorders, based on the methods used in the planned study [20, 21].
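Section 1 states that the models were built for the highest possible sensitivity and specificity. As a small sketch of how those two criteria are computed (the labels and event groupings here are placeholders, not those of the study), both follow directly from a binary confusion matrix:

```python
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity: proportion of true events detected.
    Specificity: proportion of non-events correctly rejected."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)
```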
To perform validation of the acquired data in a correct way, some assumptions are made:
– Definition of the assumed disorders to be grouped,
– Assessment of the feasibility of applying methods for processing and analyzing the biomedical signal,
– Implementation of various signal processing methods for the disorders in question,
– Development of a measurement environment at home to acquire the data,
– Validation of the various processing schemes and selection of an optimal recognition method for the selected types,
– Verification of the results from the recorded signals,
– Presentation and analysis of statistical models dedicated to the recognition of abnormalities.

Summary

The development of techniques for analyzing the information contained in biomedical signals is usually based on methods that support decision making. These techniques are implemented in the following stages of signal processing: registration of the biomedical signal, filtering, segmentation, windowing, feature extraction, and classification. The development and implementation of a classifier selection method based on the Random Forests methodology presented in this paper is a new approach to biomedical signal classification and can be further extended to remote patient monitoring systems.

Author contributions: The author has accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

References

10. Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics 2012;99:323–9.
11. Ho TK. Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, 1995:278–82. doi:10.1109/ICDAR.1995.598994.
12. Saffari A, Leistner C, Santner J, Godec M, Bischof H. On-line random forests. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), 2009:1393–400. doi:10.1109/ICCVW.2009.5457447.
13. Genuer R, Poggi J-M, Tuleau C. Random Forests: some methodological insights. Inria Research Report 6729, 2008:32. arXiv:0811.3619v1 [stat.ML]. ISSN 0249-6399.
14. Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinform 2008;9:307.
15. Amaratunga D, Cabrera J, Lee YS. Enriched random forests. Bioinformatics 2008;24:2010–4.
16. StatSoft. Statistica 10 Manual. [Online]. Available at: www.statsoft.com. Accessed: 6 April 2016.
17. Lin Y, Jeon Y. Random Forests and adaptive nearest neighbors. J Am Stat Assoc 2006;101:578–90.
18. Cutler A, Cutler DR, Stevens JR. Random Forests. In: Ensemble machine learning. Cambridge, MA, USA: Academic Press, 2012:157–75. doi:10.1007/978-1-4419-9326-7.
19. Boström H. Calibrating random forests. In: Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA 2008), 2008:121–6. doi:10.1109/ICMLA.2008.107.
20. Biau G. Analysis of a Random Forests model. J Mach Learn Res 2012;13:1063–95.
21. Abdulsalam H, Skillicorn DB, Martin P. Streaming Random Forests. In: Proceedings of the International Database Engineering and Applications Symposium (IDEAS), 2007:225–32. doi:10.1109/IDEAS.2007.4318108.


Bio-Algorithms and Med-Systems, Volume 12 (2) – Jun 1, 2016



Publisher: de Gruyter
Copyright: © 2016
ISSN: 1895-9091
eISSN: 1896-530X
DOI: 10.1515/bams-2016-0005

