Computer-aided analysis of data from evaluation sheets of subjects with autism spectrum disorders

Abstract

In this paper, we deal with the problem of the initial analysis of data from evaluation sheets of subjects with autism spectrum disorders (ASDs). In the research, we use an original evaluation sheet including questions about competencies grouped into 17 spheres. The initial analysis is focused on the data preprocessing step, including the filtration of cases based on consistency factors. This approach enables us to obtain simpler classifiers in terms of their size (the number of nodes and leaves in decision trees and the number of classification rules).

Keywords: autism spectrum disorders; classification rules; classification trees; preprocessing.

*Corresponding author: Krzysztof Pancerz, Faculty of Mathematics and Natural Sciences, University of Rzeszow, Prof. S. Pigonia Str. 1, 35-310 Rzeszow, Poland. E-mail: kpancerz@ur.edu.pl
Aneta Derkacz and Olga Mich: University of Management and Administration, Zamosc, Poland
Jerzy Gomula: Cardinal Stefan Wyszynski University, Warsaw, Poland

Introduction

Autism is a brain development disorder that impairs social interaction and communication and causes restricted and repetitive behaviors, all starting before a child is 3 years old. Starting in May 2013 [i.e. the date of publication of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5)], all autism disorders were merged into the one umbrella diagnosis of autism spectrum disorders (ASDs). ASDs can dramatically affect a child's life as well as that of their family, school, friends, and the wider community. The main aim of our research is to adapt computational intelligence methods for computer-aided decision support in the diagnosis and therapy of persons with ASDs. Computer-based decision support (CDS) is defined as the use of a computer to bring relevant knowledge to bear on the health care and well-being of a patient (cf. [1]).
In the first step of our research, we are interested in the initial analysis of data from evaluation sheets of subjects with ASDs. The evaluation sheet used in the experiment is an original sheet including more than 300 questions about the competencies of the subjects, grouped into 17 spheres (self-service, communication, cognitive, physical, and the sphere responsible for functioning in the social and family environment, among others). The initial analysis is focused on the data preprocessing step. Data preprocessing is an important and basic stage in data mining and machine learning approaches, consisting of different operations that affect the quality of results (e.g. in classification tasks). In general, we can distinguish a number of data preprocessing operations (cf. [2-4]), including case (instance) selection, outlier detection, feature (attribute) selection, feature transformation, feature generation, data cleaning, data discretization, data normalization, data integration, and dealing with missing values. In the first step of our research, we have concentrated on the case selection (filtration) problem in terms of building classifiers. An increasing number and complexity of classification rules make them difficult for domain experts to validate; reducing the number of rules embedded in the classifiers is therefore an important problem. Several approaches have been proposed to reduce the number of cases in the training set for building classifiers; a survey of algorithms is given, among others, in [5]. However, due to the specificity of our data, we have used in the experiments our original approach to the selection of cases, presented in [6]. This approach is based on splitting the set of input cases into two subsets consisting of unambiguous and boundary cases, respectively.
Boundary cases are candidates for removal from the process of training the classifier because, although they are assigned to one decision class, they are also close to other decision classes with respect to consistency factors. To differentiate the two subsets of cases, we have used an approach based on consistency factors defined in terms of appropriate lower approximations of sets (cf. [7, 8]). The main goal of this approach is to obtain simpler classifiers in terms of their size (the number of nodes and leaves in decision trees and the number of classification rules) without a significant loss of classification ability.

In the current stage of our research, each sphere is treated separately. For each sphere, the training data (used to train classifiers) are stored in a tabular form formally called a decision table. A decision table represents a decision system in Pawlak's form (cf. [9]). We used the following formal definition of a decision system. A decision system DS is a tuple DS = (U, C, D, V_descr, V_dec, f_inf, f_dec), where U is a nonempty, finite set of cases; C is a nonempty, finite set of descriptive attributes; D is a nonempty, finite set of decision attributes; V_descr = ∪_{c∈C} V_c, where V_c is the set of values of the descriptive attribute c; V_dec = ∪_{d∈D} V_d, where V_d is the set of values of the decision attribute d; f_inf: C × U → V_descr is an information function such that f_inf(c, u) ∈ V_c for each c ∈ C and u ∈ U; and f_dec: D × U → V_dec is a decision function such that f_dec(d, u) ∈ V_d for each d ∈ D and u ∈ U.

Input data

Experiments testing the relative effectiveness of our approach have been performed on data describing more than 70 cases (subjects) classified into three categories: high-functioning, medium-functioning, or low-functioning autism.
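As a concrete illustration, the decision-system definition above can be sketched directly in code. Everything below (case names, attribute names, values) is invented for the example and is not taken from the study data:

```python
# Minimal sketch of a Pawlak-style decision system
# DS = (U, C, D, V_descr, V_dec, f_inf, f_dec).
# All identifiers and values are illustrative, not taken from the study.

U = ["u1", "u2", "u3"]            # cases
C = ["a1", "a2"]                  # descriptive attributes
D = ["autism_level"]              # decision attributes

# f_inf: C x U -> V_descr, using the sheet's 0/25/50/100 competency scale
f_inf = {
    ("a1", "u1"): 0,   ("a2", "u1"): 25,
    ("a1", "u2"): 50,  ("a2", "u2"): 100,
    ("a1", "u3"): 100, ("a2", "u3"): 100,
}

# f_dec: D x U -> V_dec
f_dec = {
    ("autism_level", "u1"): "LOW",
    ("autism_level", "u2"): "MEDIUM",
    ("autism_level", "u3"): "HIGH",
}

# V_c for each descriptive attribute c (here: the values actually occurring)
V = {c: {f_inf[(c, u)] for u in U} for c in C}
print(sorted(V["a1"]))                 # [0, 50, 100]
print(f_dec[("autism_level", "u2")])   # MEDIUM
```

In the study itself, one such decision system is built per sphere, with one row per subject and one decision attribute holding the autism category.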
Each subject has been evaluated using the authors' original sheet including questions about competencies grouped into 17 spheres marked with Roman numerals (only the spheres used in our experiments are listed; the remaining spheres are omitted):
- VI. Support for active communication.
- VII. Active communication concerning objects, people, parts of the body.
- VIII. Imitation, the length and complexity of the utterance.
- IX. Needs, emotions, moods.
- X. Object communication (the level of specific symbols).
- XI. Symbolic communication.
- XII. Requests.
- XIII. Choices.
- XIV. Communication in a pair (with a peer, with an adult).
- XV. Social communication competences.
- XVI. Communication in a group and in social situations (in a team, at school, in the closest social environment).
- XVIII. Vocabulary.
- XIX. The degree of effectiveness of information.
- XX. The degree of motivation to communicate.
- XXI. The degree and type of hint in communication.
- XXII. Building the utterance: the degree of its complexity and functionality.
- XXIII. Dialogues.

Each case x is described by a data vector a(x) consisting of more than 300 descriptive attributes: a(x) = (a1(x), a2(x), ..., am(x)). Four values of the descriptive attributes are possible, namely 0, 25, 50, and 100, with the following meaning: 0, not performed; 25, performed after physical help; 50, performed after verbal help/demonstration; and 100, performed unaided. To each case x, we have assigned one of the decision values determining the category of autism: LOW (low-functioning autism), MEDIUM (medium-functioning autism), or HIGH (high-functioning autism).

Methodology and tools

The proposed methodology of an initial analysis of data to select the cases used to build a classifier consists of the following main stages (cf.
[6]):
- Calculating the consistency factors of cases included in the decision subsystem corresponding to class Y with the knowledge included in the decision subsystem corresponding to class X, where Y ≠ X and X, Y ∈ {LOW, MEDIUM, HIGH};
- Dividing the set of all cases into two subsets, a subset of unambiguous cases and a subset of boundary cases (see Figure 1), according to the calculated consistency factors;
- Training a classifier using the set of unambiguous cases; and
- Testing the classifier using the set of all cases (i.e. before the distinction between unambiguous and boundary cases).

Figure 1: Dividing a set of all cases (regions: high-functioning (HIGH), medium-functioning (MEDIUM), low-functioning (LOW), and boundary cases).

Calculating the consistency factors of cases and dividing the set of all cases into two subsets (unambiguous and boundary cases) were carried out using our specialized tool called the Classification and Prediction Software System (CLAPSS) [10]. CLAPSS is a computer tool for solving different classification and prediction problems using, among others, some specialized approaches based mainly on rough set theory [11]. The tool was designed for the Java platform (see Figure 2). In CLAPSS, we have implemented the algorithm for splitting the set of cases included in a decision system into two subsets consisting of unambiguous and boundary cases [6]. Let us consider a decision system of the form DS = (U, C, D, V_descr, V_dec, f_inf, f_dec), where D = {d} and V_dec = {v_d1, v_d2, ..., v_dk}. In our case, V_dec = {LOW, MEDIUM, HIGH}. By Div_U^d, we denote the division of U into disjoint subsets according to the values of d. We assume that a threshold value from the interval [0, 1] is given. The output of the algorithm is two sets U_unamb ⊆ U and U_bound ⊆ U, where U_unamb ∩ U_bound = ∅ and U_unamb ∪ U_bound = U. U_unamb and U_bound are the subsets of U consisting of unambiguous and boundary cases, respectively, with respect to the threshold.
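Once the consistency factors of each case with the foreign decision classes are known, the division step itself reduces to thresholding. The following sketch illustrates this; the factor values and the strict comparison against the threshold are assumptions for illustration, not details taken from [6]:

```python
# Split cases into unambiguous and boundary subsets given, for each case,
# its maximal consistency factor with the classes it does NOT belong to.
# A case is boundary when it is too consistent with some foreign class.
# Factor values and the strict '<' comparison are illustrative assumptions.

def split_cases(foreign_consistency, threshold):
    """foreign_consistency: {case: max consistency factor with a foreign class}."""
    unamb = {u for u, f in foreign_consistency.items() if f < threshold}
    bound = set(foreign_consistency) - unamb
    return unamb, bound

factors = {"u1": 0.10, "u2": 0.80, "u3": 0.30, "u4": 0.95}
U_unamb, U_bound = split_cases(factors, threshold=0.5)
print(sorted(U_unamb), sorted(U_bound))  # ['u1', 'u3'] ['u2', 'u4']
```

By construction the two sets are disjoint and together cover all cases, matching the properties U_unamb ∩ U_bound = ∅ and U_unamb ∪ U_bound = U stated above.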
The algorithm is based on consistency factors, computed as follows. Computing the consistency factor of a given case u* with respect to a set X_j of cases is based on the knowledge included in the subsystem DS|X_j of the decision system DS restricted to X_j. This knowledge is expressed by all minimal rules true and realizable in DS|X_j (see [12-14]). The importance (relevance) of the rules extracted from DS|X_j that are not satisfied by the new case is calculated: the greater the importance of these rules, the smaller the consistency factor of the new case with the knowledge included in DS|X_j. The importance of the set of rules not satisfied by the new case is determined by means of a strength factor of this set of rules in DS|X_j. For detailed information on how the consistency factor is calculated, we refer the reader to [6].

Figure 2: Screenshot of CLAPSS.

To build classifiers using the set of unambiguous cases, we used two machine learning tools: Orange, a comprehensive, component-based software suite for machine learning and data mining [15], and RSES, a tool set for analyzing data with the use of methods from rough set theory [16].

Results

In the experiments, we used the data described in the "Input data" section. The results of applying the algorithm that splits the set of cases included in a decision system into two subsets consisting of unambiguous and boundary cases are collected, for each sphere separately, in Table 1. One can see that, for some spheres, the number of cases was significantly reduced.

Table 1: Results of dividing the set of all cases into two subsets consisting of unambiguous and boundary cases.

Data set (sphere)   Number of unambiguous cases   Number of boundary cases
VI                  60                            13
VII                 56                            17
VIII                73                             0
IX                  70                             3
X                   23                            50
XI                  59                            14
XII                 50                            23
XIII                64                             9
XIV                 69                             4
XV                  45                            28
XVI                 58                            15
XVIII               42                            31
XIX                 58                            15
XX                  62                            11
XXI                 53                            20
XXII                63                            10
XXIII               57                            16
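The consistency factor idea described in the Methodology section can be sketched as follows. Note that the concrete strength definition used here (summed support of the unsatisfied rules, normalized by the class size) is a simplification of the one in [6], and the rules and values are invented:

```python
# Sketch of the consistency factor of a new case u* with a class X_j:
# it decreases as the strength of the minimal rules of DS|X_j that u*
# fails to satisfy grows. The strength definition below (support of
# unsatisfied rules over the class size) simplifies the one in [6];
# all rules and values are illustrative.

def satisfies(case, conditions):
    """conditions: {attribute: required value}; case: {attribute: value}."""
    return all(case.get(a) == v for a, v in conditions.items())

def consistency_factor(case, rules, n_class_cases):
    """rules: list of (conditions, support) pairs mined from DS|X_j."""
    unsatisfied = sum(sup for cond, sup in rules if not satisfies(case, cond))
    strength = min(1.0, unsatisfied / n_class_cases)
    return 1.0 - strength

rules = [({"a1": 100}, 3), ({"a2": 50}, 2)]  # minimal rules with supports
case = {"a1": 100, "a2": 0}                  # satisfies only the first rule
print(consistency_factor(case, rules, n_class_cases=5))  # 0.6
```

A case belonging to class Y whose factor with some foreign class X_j is high is then a candidate boundary case, since the knowledge of X_j barely contradicts it.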
In Orange, we have used an algorithm for the generation of decision trees based on the Gini criterion [4] for attribute selection. The following pruning parameters have been set: a minimum of 2 instances in leaves and a depth limit of 100. In RSES, we have used the LEM2 algorithm [17] for rule generation. LEM2, one of the most frequently used rule induction algorithms, explores the search space of attribute-value pairs and is based on the lower and upper approximations of decision classes defined in rough set theory. The expected degree of coverage of the training set by the derived rules was set to 0.9. In the classification process, conflicts were resolved by standard voting (each rule has as many votes as supporting cases). The process of training and testing a classifier in Orange is shown in Figure 3. The results of the experiments in Orange are collected, for each sphere separately, in Table 2. The process of training and testing a classifier in RSES is shown in Figure 4. The results of the experiments in RSES are collected, for each sphere separately, in Table 3.

Figure 3: Process of training and testing a classifier in Orange.

Table 2: Results of experiments in Orange (CA, classification accuracy).

                    Tree generated from all cases    Tree generated from unambiguous cases
Data set (sphere)   Nodes   Leaves   CA              Nodes   Leaves   CA
VI                  35      18       0.836           27      14       0.836
VII                 23      12       0.877           13       7       0.890
VIII                17       9       0.945           17       9       0.945
IX                  25      13       0.945           23      12       0.945
X                   27      14       0.616            7       4       0.616
XI                  31      16       0.836           23      12       0.836
XII                 33      17       0.767           27      14       0.767
XIII                13       7       0.945           13       7       0.945
XIV                 33      17       0.890           31      16       0.890
XV                  27      14       0.822           23      12       0.822
XVI                 29      15       0.890           23      12       0.890
XVIII               33      17       0.822           13       7       0.822
XIX                 37      19       0.877           29      15       0.877
XX                  35      18       0.863           35      18       0.863
XXI                 31      16       0.808           25      13       0.808
XXII                27      14       0.918           21      11       0.890
XXIII               37      19       0.863           27      14       0.863
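The Gini criterion used for attribute selection in the trees can be illustrated directly (a minimal sketch with invented class labels; Orange computes this internally when choosing the splitting attribute at each node):

```python
# Gini impurity of a set of labels, and the weighted impurity of a candidate
# split: attribute selection picks the split minimizing the weighted impurity
# of the resulting branches. Labels below are illustrative.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(branches):
    n = sum(len(b) for b in branches)
    return sum(len(b) / n * gini(b) for b in branches)

# A candidate split of 8 cases on some competency attribute:
left = ["LOW", "LOW", "LOW", "MEDIUM"]
right = ["HIGH", "HIGH", "MEDIUM", "HIGH"]

print(round(gini(left + right), 3))            # 0.656 (impurity before split)
print(round(weighted_gini([left, right]), 3))  # 0.375 (impurity after split)
```

A split that lowers the weighted impurity this much would be a strong candidate; a pure branch (a single class) has impurity 0 and becomes a leaf, subject to the pruning parameters above.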
To assess the complexity of the classifiers, we have taken into consideration the number of rules and the mean rule premise length (for rule-based classifiers) and the numbers of nodes and leaves (for decision tree-based classifiers). For example, one can compare the complexity of the decision trees for sphere VII shown in Figures 5 and 6; it is worth noting that the classification power of the two trees is comparable. In general, the case selection procedure in the preprocessing step decreases the complexity of the classifiers. In the case of decision trees, classification accuracy is generally not lost. In the case of rules generated by LEM2, the case selection procedure positively influences classification accuracy, although the coverage factor is sometimes smaller: instead of making mistaken decisions on some cases, the classifier leaves them unclassified.

Figure 4: Process of training and testing a classifier in RSES.

As mentioned earlier, in our experiments, each data set has been treated separately. This enabled us to assess the evaluation sheet with respect to individual spheres. The results can be used in the further development of the sheet, in which questions may be added, removed, or modified. In particular, spheres with a relatively large number of boundary cases (i.e. spheres X and XVIII) should be checked.

Conclusions

We have shown in this paper one of the data preprocessing steps for the classification of cases with ASDs. In this step,

Figure 5: A decision tree generated for sphere VII based on all cases.

Table 3: Results of experiments in RSES (CA, classification accuracy).
                    Rules from all cases (LEM2)      Rules from unambiguous cases (LEM2)
Data set (sphere)   Rules   CA      Coverage         Rules   CA      Coverage
VI                  24      0.909   0.904            22      1.000   0.740
VII                 17      0.896   0.918            11      0.873   0.863
VIII                17      1.000   0.945            17      1.000   0.945
IX                  17      0.984   0.877            14      0.985   0.890
X                   14      0.561   0.904             5      0.818   0.452
XI                  27      0.925   0.918            17      0.925   0.918
XII                 27      0.818   0.904            21      0.940   0.685
XIII                15      0.985   0.890             9      0.877   0.890
XIV                 19      0.985   0.904            17      1.000   0.849
XV                  19      0.879   0.904            14      0.926   0.740
XVI                 14      0.924   0.904            14      1.000   0.712
XVIII               23      0.851   0.918             8      0.900   0.685
XIX                 28      0.940   0.918            19      0.964   0.753
XX                  28      0.942   0.945            17      0.934   0.836
XXI                 18      0.862   0.890            12      0.812   0.877
XXII                18      0.955   0.918            15      0.952   0.849
XXIII               21      0.926   0.932            18      0.949   0.808

Figure 6: A decision tree generated for sphere VII based on unambiguous cases.

a procedure for the selection of cases based on our original approach was applied. The main goal was to obtain simpler classifiers, enabling domain experts to validate the classification rules from the diagnostic point of view. The fundamental goal of the whole project is to build a computer tool based on a classifier ensemble combining a wide range of approaches. An important part of this tool will be a data preprocessing block offering a variety of operations.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

References

7. Pancerz K. Extensions of information systems: the rough set perspective. Trans Rough Sets 2009;X:157-68.
8. Piątek Ł, Pancerz K, Owsiany G.
Validation of data categorization using extensions of information systems: experiments on melanocytic skin lesion data. In: Federated Conference on Computer Science and Information Systems, 18-21 September 2011, Szczecin, Poland, 2011:147-51.
9. Pawlak Z. Rough sets. Theoretical aspects of reasoning about data. Dordrecht: Kluwer Academic Publishers, 1991.
10. Pancerz K. On selected functionality of the Classification and Prediction Software System (CLAPSS). In: International Conference on Information and Digital Technologies, 7-9 July 2015, Zilina, Slovakia, 2015:267-74.
11. Pawlak Z, Skowron A. Rudiments of rough sets. Inf Sci 2007;177:3-27.
12. Suraj Z, Pancerz K, Owsiany G. On consistent and partially consistent extensions of information systems. In: Ślęzak D, Wang G, Szczuka M, Düntsch I, Yao Y, editors. Rough sets, fuzzy sets, data mining, and granular computing. Ser. LNAI 3641. Berlin/Heidelberg: Springer-Verlag, 2005:224-33.
13. Moshkov M, Skowron A, Suraj Z. On testing membership to maximal consistent extensions of information systems. In: Greco S, Hata Y, Hirano S, Inuiguchi M, Miyamoto S, Nguyen HS, Slowinski R, editors. Rough sets and current trends in computing. Ser. LNAI 4259. Berlin/Heidelberg: Springer-Verlag, 2006:85-90.
14. Suraj Z. Some remarks on extensions and restrictions of information systems. In: Ziarko W, Yao Y, editors. Rough sets and current trends in computing. Ser. LNAI 2005. Berlin/Heidelberg: Springer-Verlag, 2001:204-11.
15. Demšar J, Curk T, Erjavec A, Gorup Č, Hočevar T, Milutinovič M, et al. Orange: data mining toolbox in Python. J Mach Learn Res 2013;14:2349-53.
16. Bazan JG, Szczuka MS. The rough set exploration system. In: Transactions on rough sets III. Ser. LNAI 3400. Berlin/Heidelberg: Springer-Verlag, 2005:37-56.
17. Grzymala-Busse J. A new version of the rule induction system LERS. Fundam Inf 1997;31:27-39.

Journal: Bio-Algorithms and Med-Systems
Publisher: de Gruyter
Copyright: © 2016
ISSN: 1895-9091
eISSN: 1896-530X
DOI: 10.1515/bams-2016-0011

Abstract

In this paper, we deal with the problem of the initial analysis of data from evaluation sheets of subjects with autism spectrum disorders (ASDs). In the research, we use an original evaluation sheet including questions about competencies grouped into 17 spheres. An initial analysis is focused on the data preprocessing step including the filtration of cases based on consistency factors. This approach enables us to obtain simpler classifiers in terms of their size (a number of nodes and leaves in decision trees and a number of classification rules). Keywords: autism spectrum disorders; classification rules; classification trees; preprocessing. Introduction Autism is a brain development disorder that impairs social interaction and communication and causes restricted and repetitive behaviors, all starting before a child is 3 years old. Starting in May 2013 [i.e. the date of publication of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5)], all autism disorders were merged into one umbrella diagnosis of autism spectrum disorders (ASDs). ASDs can dramatically affect a child's life as well as that of their families, schools, friends, and the wider community. The main aim of our research is to adapt computational intelligence methods for computer-aided decision support in the diagnosis and therapy of persons with ASDs. Computer-based decision support (CDS) is defined as the use of a computer to bring relevant knowledge *Corresponding author: Krzysztof Pancerz, Faculty of Mathematics and Natural Sciences, University of Rzeszow, Prof. S. Pigonia Str. 1, 35-310 Rzeszow, Poland, E-mail: kpancerz@ur.edu.pl Aneta Derkacz and Olga Mich: University of Management and Administration, Zamosc, Poland Jerzy Gomula: Cardinal Stefan Wyszynski University, Warsaw, Poland to bear on the health care and well-being of a patient (cf. [1]). 
In the first step of our research, we are interested in the initial analysis of data from evaluation sheets of subjects with ASDs. The evaluation sheet we used in the experiment is an original sheet including questions (more than 300) about the competencies of the subjects grouped into 17 spheres (self-service, communication, cognitive, physical, and the sphere responsible for functioning in the social and family environment, among others). An initial analysis is focused on the data preprocessing step. Data preprocessing is an important and basic stage in data mining and machine learning approaches. The data preprocessing step consists of different operations affecting the quality of results (e.g. in classification tasks). In general, we can distinguish a number of data preprocessing operations (cf. [2­4]), including case (instance) selection, outlier detection, feature (attribute) selection, feature transformation, feature generation, data cleaning, data discretization, data normalization, data integration, and dealing with missing values, etc. In the first step of our research, we have concentrated on a case selection (filtration) problem in terms of building classifiers. It is obvious that an increasing number and complexity of classification rules make them difficult to be validated by domain experts. Therefore, there is an important problem to reduce the number of rules embedded in the classifiers. Several approaches have been proposed to reduce the number of cases in the training set for building classifiers. A survey of algorithms is given, among others, in [5]. However, due to data specificity, we have used in the experiments our original approach, as presented in [6], for the selection of cases. This approach is based on splitting a set of input cases into two subsets consisting of unambiguous and boundary cases, respectively. 
The boundary cases are subjected to be removed from the process of training the classifier because they are assigned to one decision class but are also close to other decision classes with respect to consistency factors. To differentiate two subsets of cases, we have used an approach based on consistency factors defined in terms of appropriate lower approximations of sets (cf. [7, 8]). The main goal of this approach is to obtain simpler classifiers in terms of 110Pancerz et al.: Analysis of data from evaluation sheets of subjects with ASDs their size (a number of nodes and leaves in decision trees and a number of classification rules) but without a significant loss of their classification ability. LOW=low-functioning autism, MEDIUM=mediumfunctioning autism, and HIGH=high-functioning autism. In the current stage of our research, each sphere is treated separately. For each sphere, the training data (which are used to train classifiers) are stored in a tabular form that is formally called a decision table. A decision table represents a decision system in the Pawlak's form (cf. [9]). We used the following formal definition of a decision system. A decision system DS is a tuple DS=(U, C, D, Vdescr, Vdec, finf, fdec), where U is a nonempty, finite set of cases; C is a nonempty, finite set of descriptive attributes; D is a nonempty, finite set of decision attributes; Vdescr=cCVc, such that Vc is a set of values of the descriptive attribute c; Vdec=dDVd, such that Vd is a set of values of the decision attribute d; finf:C×UVdescr is an information function such that finf(c, u)Vc for each cC and uU; and fdec:D×UVdec is a decision function such that fdec(d, u)Vd for each dD and uU. Input data Experiments testing the relative effectiveness of our approach have been performed on data describing more than 70 cases (subjects) classified into three categories: high-functioning, medium-functioning, or low-functioning autism. 
Each subject has been evaluated using an author's original sheet including questions about competencies grouped into 17 spheres marked with roman numerals (only spheres used in our experiments are listed, and the remaining spheres are omitted): ­ VI. Support for active communication. ­ VII. Active communication concerning objects, people, parts of the body. ­ VIII. Imitation, the length and complexity of the utterance. ­ IX. Needs, emotions, moods. ­ X. Object communication (the level of specific symbols). ­ XI. Symbolic communication. ­ XII. Requests. ­ XIII. Choices. ­ XIV. Communication in a pair (with contemporary, with an adult). ­ XV. Social communication competences. ­ XVI. Communication in a group and in social situations (in a team, at school, in the closest social environment). ­ XVIII. Vocabulary. ­ XIX. The degree of effectiveness of information. ­ XX. The degree of motivation to communicate. ­ XXI. The degree and type of hint in communication. ­ XXII. Building the utterance ­ the degree of its complexity and functionality. ­ XXIII. Dialogues. Each case x is described by a data vector a(x) consisting of more than 300 descriptive attributes: a(x)={a1(x), a2(x), ..., am(x)}. Four values of descriptive attributes are possible, namely 0, 25, 50, and 100. They have the following meaning: 0, not performed; 25, performed after physical help; 50, performed after verbal help/demonstration; and 100, performed unaided. To each case x, we have assigned one of the decision values determining the category of autism: Methodology and tools The proposed methodology of an initial analysis of data to select cases used to build a classifier consists of the following main stages (cf. 
[6]): ­ Calculating the consistency factors of cases included in the decision subsystem corresponding to class Y with the knowledge included in the decision subsystem corresponding to class X, where YX and X, Y{LOW, MEDIUM, HIGH}; ­ Dividing a set of all cases into two subsets: a subset of and a subset of boundary cases (see Figure 1) according to the calculated consistency factors; High-functioning (HIGH ) Boundary cases Low-functioning (LOW ) Medium-functioning (MEDIUM ) Figure 1:Dividing a set of all cases. Pancerz et al.: Analysis of data from evaluation sheets of subjects with ASDs111 ­ ­ Training a classifier using a set of ; and Testing the classifier using a set of all cases (i.e. before distinction between unambiguous and boundary cases). Calculating the consistency factors of cases and dividing a set of all cases into two subsets (unambiguous and boundary cases) were carried out using our specialized tool called Classification and Prediction Software System (CLAPSS) [10]. CLAPSS is a computer tool solving different classification and prediction problems using, among others, some specialized approaches based mainly on rough set theory [11]. The tool was designed for the Java platform (see Figure 2). In CLAPSS, we have implemented the algorithm for splitting a set of cases included in a decision system into two subsets consisting of unambiguous and boundary cases [6]. Let us consider a decision system in the form: DS=(U, C, D, Vdescr, Vdec, finf, fdec), where D={d}, Vdec={vd1, vd2, ..., vdk}. In our case, Vdec{LOW, MEDIUM, HIGH}. d By DivU , we denote a division of U into disjoint subsets according to the values of d. We assume that a threshold value [0, 1]. The output of the algorithm is two sets UunambU and UboundU, where UunambUbound= and UunambUbound=U. Uunamb and Ubound are subsets of U consisting of unambiguous and boundary cases, respectively, with respect to the threshold . 
The algorithm has the form: Computing a consistency factor Xj(u*) of a given case u* is based on the knowledge included in a subsystem DS|X of a decision system DS restricted to the set Xj of cases. This knowledge is expressed by all minimal rules true and realizable in DS|X (see [12­14]). The importance (relevance) of rules extracted from the system DS|X , which are not satisfied by the new case, is calculated. If the importance of these rules is greater, the consistency factor of a new case with the knowledge included in DS|X is smaller. The importance of a set of rules not satisfied by the new case is determined by means of a strength factor of this set of j j j j Figure 2:Screenshot of CLAPSS. 112Pancerz et al.: Analysis of data from evaluation sheets of subjects with ASDs rules in DS|Xj. For a detailed information on how the consistency factor is calculated, we refer the readers to [6]. To build classifiers using a set of , we used two machine learning computer tools: Orange, a comprehensive, component-based software suite for machine learning and data mining [15], and RSES, a tool set for analyzing data with the use of methods from rough set theory [16]. Table 1:Results of dividing a set of all cases into two subsets consisting of unambiguous and boundary cases. Data set (sphere) VI VII VIII IX X XI XII XIII XIV XV XVI XVIII XIX XX XXI XXII XXIII Number of 60 56 73 70 23 59 50 64 69 45 58 42 58 62 53 63 57 Number of boundary cases 13 17 0 3 50 14 23 9 4 28 15 31 15 11 20 10 16 Results In the experiments, we used the data described in the "Input data" section. We collected the results of the application of the algorithm splitting a set of cases included in a decision system into two subsets consisting of unambiguous and boundary cases, for each sphere separately, in Table 1. One can see that, for some spheres, a number of cases were significantly reduced. 
In Orange, we have used an algorithm for the generation of decision trees based on the Gini criterion [4] for attribute selection. The following values of pruning parameters have been set: minimum instances in leaves 2 and limit of the depth 100. In RSES, we have used the LEM2 algorithm [17] for rule generation. LEM2 is most frequently used for rule induction. LEM2 explores the search space of the attribute-values pairs. It is based on lower and upper approximations of decision classes defined in rough set theory. The expected degree of coverage of the training set by derived rules was set to 0.9. In the classification Figure 3:Process of training and testing a classifier in Orange. Pancerz et al.: Analysis of data from evaluation sheets of subjects with ASDs113 Table 2:Results of experiments in Orange. Data set (sphere) VI VII VIII IX X XI XII XIII XIV XV XVI XVIII XIX XX XXI XXII XXIII Classification tree generated from all cases Number of nodes 35 23 17 25 27 31 33 13 33 27 29 33 37 35 31 27 37 Number of leaves 18 12 9 13 14 16 17 7 17 14 15 17 19 18 16 14 19 CA 0.836 0.877 0.945 0.945 0.616 0.836 0.767 0.945 0.890 0.822 0.890 0.822 0.877 0.863 0.808 0.918 0.863 Classification tree generated from Number of nodes 27 13 17 23 7 23 27 13 31 23 23 13 29 35 25 21 27 Number of leaves 14 7 9 12 4 12 14 7 16 12 12 7 15 18 13 11 14 CA 0.836 0.890 0.945 0.945 0.616 0.836 0.767 0.945 0.890 0.822 0.890 0.822 0.877 0.863 0.808 0.890 0.863 process, conflicts were resolved by standard voting (each rule has as many votes as supporting cases). The process of training and testing a classifier in Orange is shown in Figure 3. The results of the experiments in Orange are collected, for each sphere separately, in Table 2. The process of training and testing a classifier in RSES is shown in Figure 4. The results of the experiments in RSES are collected, for each sphere separately, in Table 3. 
To assess the complexity of the classifiers, we took into consideration the number of rules and the mean length of rule premises (for rule-based classifiers) as well as the numbers of nodes and leaves (for decision tree-based classifiers). For example, one can compare the complexity of the decision trees for sphere VII shown in Figures 5 and 6; it is worth noting that the classification power of both trees is comparable. In general, the case selection procedure in the preprocessing step decreases the complexity of the classifiers. In the case of decision trees, classification accuracy is generally not lost. In the case of rules generated by LEM2, the case selection procedure positively influences classification accuracy, although the coverage factor is sometimes smaller. This means that, instead of making mistaken decisions, the classifier leaves those cases unclassified.

Figure 4: Process of training and testing a classifier in RSES.

As mentioned earlier, each data set was treated separately in our experiments. This enabled us to assess the evaluation sheet with respect to individual spheres. The results can be used in the further development of the sheet: in the future, questions may be added, removed, or modified. In particular, spheres with a relatively large number of boundary cases (i.e. spheres X and XVIII) should be checked.

Conclusions

In this paper, we have shown one of the data preprocessing steps for the classification of cases with ASDs. In this step, a procedure for the selection of cases, based on our original approach, was applied.

Figure 5: A decision tree generated for sphere VII based on all cases.

Table 3: Results of experiments in RSES.
                All cases                  Unambiguous cases
Sphere   Rules   CA      Coverage   Rules   CA      Coverage
VI       24      0.909   0.904      22      1.000   0.740
VII      17      0.896   0.918      11      0.873   0.863
VIII     17      1.000   0.945      17      1.000   0.945
IX       17      0.984   0.877      14      0.985   0.890
X        14      0.561   0.904       5      0.818   0.452
XI       27      0.925   0.918      17      0.925   0.918
XII      27      0.818   0.904      21      0.940   0.685
XIII     15      0.985   0.890       9      0.877   0.890
XIV      19      0.985   0.904      17      1.000   0.849
XV       19      0.879   0.904      14      0.926   0.740
XVI      14      0.924   0.904      14      1.000   0.712
XVIII    23      0.851   0.918       8      0.900   0.685
XIX      28      0.940   0.918      19      0.964   0.753
XX       28      0.942   0.945      17      0.934   0.836
XXI      18      0.862   0.890      12      0.812   0.877
XXII     18      0.955   0.918      15      0.952   0.849
XXIII    21      0.926   0.932      18      0.949   0.808

Figure 6: A decision tree generated for sphere VII based on unambiguous cases.

The main goal was to obtain simpler classifiers, enabling domain experts to validate classification rules from the diagnostic point of view. The fundamental goal of the whole project is to build a computer tool based on a classifier ensemble combining a wide range of approaches. An important part of this tool will be a data preprocessing block offering a variety of operations.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

7. Pancerz K. Extensions of information systems: the rough set perspective. Trans Rough Sets 2009;X:157-68.
8. Piątek Ł, Pancerz K, Owsiany G.
Validation of data categorization using extensions of information systems: experiments on melanocytic skin lesion data. In: Federated Conference on Computer Science and Information Systems, 18-21 September 2011, Szczecin, Poland, 2011:147-51.
9. Pawlak Z. Rough sets. Theoretical aspects of reasoning about data. Dordrecht: Kluwer Academic Publishers, 1991.
10. Pancerz K. On selected functionality of the Classification and Prediction Software System (CLAPSS). In: International Conference on Information and Digital Technologies, 7-9 July 2015, Zilina, Slovakia, 2015:267-74.
11. Pawlak Z, Skowron A. Rudiments of rough sets. Inf Sci 2007;177:3-27.
12. Suraj Z, Pancerz K, Owsiany G. On consistent and partially consistent extensions of information systems. In: Ślęzak D, Wang G, Szczuka M, Düntsch I, Yao Y, editors. Rough sets, fuzzy sets, data mining, and granular computing. Ser. LNAI 3641. Berlin/Heidelberg: Springer-Verlag, 2005:224-33.
13. Moshkov M, Skowron A, Suraj Z. On testing membership to maximal consistent extensions of information systems. In: Greco S, Hata Y, Hirano S, Inuiguchi M, Miyamoto S, Nguyen HS, Słowiński R, editors. Rough sets and current trends in computing. Ser. LNAI 4259. Berlin/Heidelberg: Springer-Verlag, 2006:85-90.
14. Suraj Z. Some remarks on extensions and restrictions of information systems. In: Ziarko W, Yao Y, editors. Rough sets and current trends in computing. Ser. LNAI 2005. Berlin/Heidelberg: Springer-Verlag, 2001:204-11.
15. Demšar J, Curk T, Erjavec A, Gorup Č, Hočevar T, Milutinović M, et al. Orange: data mining toolbox in Python. J Mach Learn Res 2013;14:2349-53.
16. Bazan JG, Szczuka MS. The rough set exploration system. In: Transactions on rough sets III. Ser. LNAI 3400. Berlin/Heidelberg: Springer-Verlag, 2005:37-56.
17. Grzymala-Busse J. A new version of the rule induction system LERS. Fundam Inform 1997;31:27-39.

Journal: Bio-Algorithms and Med-Systems (De Gruyter)

Published: Sep 1, 2016
