Fuzzy Inf. Eng. (2010) 3: 229-247
DOI 10.1007/s12543-010-0047-4

ORIGINAL ARTICLE

Evolutionary Methods for Unsupervised Feature Selection Using Sammon's Stress Function

Amit Saxena · Nikhil R. Pal · Megha Vora

Received: 13 April 2010 / Revised: 21 May 2010 / Accepted: 19 August 2010
© Springer-Verlag Berlin Heidelberg and Fuzzy Information and Engineering Branch of the Operations Research Society of China 2010

Abstract  In this paper, four methods are proposed for feature selection in an unsupervised manner using genetic algorithms. The proposed methods do not use class label information but select a set of features using a task-independent criterion that can preserve the geometric structure (topology) of the original data in the reduced feature space. One component of the fitness function is Sammon's stress function, which tries to preserve the topology of the high-dimensional data when it is reduced to a lower-dimensional representation. In this context, in addition to using a fitness criterion, we also explore the utility of an unfitness criterion for selecting chromosomes for genetic operations. This ensures higher diversity in the population and helps unfit chromosomes become more fit. We use four different ways to evaluate the quality of the selected features: Sammon error, correlation between the inter-point distances in the two spaces, a measure of preservation of the cluster structure found in the original and reduced spaces, and classifier performance. The proposed methods are tested on six real data sets with dimensionality varying between 9 and 60. The selected features are found to be excellent in terms of preservation of topology (inter-point geometry), cluster structure and classifier performance. We do not compare our methods with other methods because, unlike other methods, we check the quality of the selected features in four different ways, by finding how well the selected features preserve the "structure" of the original data.

Keywords  Dimensionality reduction · Feature analysis · Genetic algorithm · Classification techniques

Amit Saxena, Department of Computer Science and Information Technology, Guru Ghasidas University, Bilaspur, India. email: amitsaxena65@rediffmail.com
Nikhil R. Pal, Electronics and Communication Sciences Unit, Indian Statistical Institute, 203 B. T. Road, Calcutta, India. email: nrpal59@gmail.com
Megha Vora, Department of Computer Science and Engineering, Indian Institute of Technology-Madras, Chennai 600036, India. email: meghavora25@gmail.com

1. Introduction

One of the major problems in mining large databases is the dimension of the data. More often than not, some features do not affect the performance of a classifier, and there can be features that are derogatory in nature and degrade the performance of classifiers. Thus one can have indifferent features, bad features and highly correlated/redundant features. Removing such features can not only improve the performance of the system but also make the learning task much simpler. More specifically, the performance of a classifier or a prediction system depends on several factors: (i) the number of training instances, (ii) the dimensionality, i.e., the number of features, and (iii) the complexity of the classifier. With an increase in dimensionality the hyper-volume increases exponentially, so high-dimensional data demand a large number of training samples for an adequate representation of the input space.
Dimensionality reduction can be done mainly in two ways: selecting a small but important subset of features, or generating (extracting) lower-dimensional data that preserve the distinguishing characteristics of the original higher-dimensional data [1]. Dimensionality reduction not only helps in the design of a classifier/predictor, it also helps other exploratory data analysis. It can help in assessing clustering tendency, as well as in deciding on the number of clusters by looking at a scatter plot of the lower-dimensional data. Feature extraction and data projection can be viewed as an implicit or explicit mapping from a p-dimensional input space to a q-dimensional (q <= p) output space such that some criterion is optimized. A large number of approaches for feature extraction and data projection are available in the pattern recognition literature [2, 3, 4, 5, 6, 7]. These approaches differ from each other in the nature of the mapping function, how it is learned, and what optimization criterion is used. The mapping can be linear or nonlinear, and can be learned through either supervised or unsupervised methods. Feature analysis methods can use different tools such as neural networks, fuzzy logic, evolutionary algorithms and statistical methods [1, 6, 7, 12, 15, 17, 19, 25, 29, 30].

Feature selection leads to savings in measurement cost, because some of the features are discarded. Another advantage of selection is that the selected features retain their original interpretation, which is important for understanding the underlying process generating the data. On the other hand, extracted features sometimes have better discriminating capability, leading to better performance, but these new features may not have any clear physical meaning. In practice it is often found that additional features actually degrade the performance of a classifier designed using class-conditional density estimates when the training set is small relative to the dimensionality [26, 27]. This usually happens because classifiers estimate the class-conditional densities from the available training data. Thus, if the dimensionality is increased while the size of the training set is kept fixed, the number of unknown parameters automatically increases and the reliability of the estimates decreases. As a consequence, the performance of a classifier constructed from a fixed number of training instances may degrade with increasing dimensionality; Trunk provided a nice example illustrating this [28]. Thus, if the number of features is increased, the error is likely to increase when the parameters of the class-conditional densities are estimated from a finite number of training samples. Consequently, we should always try to select only a small number of salient features when a limited training set is given.

When a feature selection method uses class information, we call it supervised feature selection. Although the majority of feature selection methods are supervised in nature, there has been a substantial amount of work using unsupervised methods [8, 9, 10, 11, 12, 13, 14, 15]. In [8], a method is proposed to partition the original feature set into distinct subsets or clusters so that features in one cluster are highly similar while those in different clusters are dissimilar; a single feature is then selected from each cluster to form a reduced feature subset. Feature selection for clustering is discussed in [9].
Dy and Brodley [10] presented a wrapper framework for feature selection. They also compared two feature selection criteria: scatter separability, which prefers feature subsets whose cluster centroids are far apart, and maximum likelihood, which prefers feature subsets leading to clusters that best fit Gaussian models. In [11], several methods are discussed for feature selection based on maximum entropy and maximum likelihood criteria. Pal et al. [12] proposed an unsupervised neuro-fuzzy feature ranking method. They used a criterion measuring the similarity between two patterns in the original feature space and in a transformed feature space; the transformed feature space is obtained by multiplying each feature by a coefficient w in [0, 1], which is learned through a feed-forward neural network. After training, the features are ranked according to the values of these weights: higher values of w indicate higher importance and hence higher rank. Using this ranking, the required number of features is selected. A correlation-based approach to feature selection (CFS) is presented in [13]. CFS uses the features' predictive performances and inter-correlations to guide its search for a good subset of features. Experiments on discrete- and continuous-class data sets reveal that CFS can drastically reduce the dimensionality of data sets while maintaining or improving the performance of learning algorithms. Heydorn [14] gives a definition of redundancy between two random variables X and Y, and uses this definition to define a test of redundancy; the test can be used to eliminate redundant features without degrading classifier performance. Features that are linearly dependent on other features do not contribute toward pattern classification by linear techniques. To detect linearly dependent features, a measure of linear dependence is proposed in [15] and used as an aid to feature selection; speaker verification experiments demonstrate the usefulness of employing such a measure in practice. Wei and Billings [40] proposed an interesting unsupervised feature selection method that attempts to select a subset of features that can approximate the data in the original feature space assuming a linear model. The search process is driven by repeated computation of the squared correlation matrix and orthogonalization of variables.

Feature selection is also important in other areas, such as finding cluster structure in data or other exploratory data analysis. Therefore, the general objective should be to select a subset of features so that the "inherent characteristics" of the original data are preserved in the lower-dimensional data. This demands the use of a task-independent criterion to select features. If we can select features in this manner, then we expect a clustering algorithm to find similar partitions of the original data and of the data with the selected feature subset. Similarly, the performance of a classifier designed using the original data and the lower-dimensional data should also be similar. Such feature selection techniques are known as unsupervised feature selection.

In this paper we consider the feature selection problem in an unsupervised framework. The paper is organized as follows. Section 2 describes some existing feature selection methods and Section 3 explains the proposed methods using genetic algorithms.
We use Sammon's stress function as the criterion. To validate our methods, we compare the performance of the 1-nearest neighbor algorithm (a supervised method), as well as the performance of a clustering algorithm, using data in the reduced space and in the original space. We also use the correlation coefficient to assess the quality of the selected features. The validation methods are presented in Section 4. Our experimental results are reported in Section 5 and, finally, the paper is concluded in Section 6.

2. Some Existing Feature Selection Techniques

The problem of feature selection can be formulated as follows. Given a data set $X \subset R^p$ (i.e., each $x \in X$ has $p$ features), we have to select a subset of features of size $q$ that leads to the smallest (or highest, as the case may be) value of some criterion. Let $F$ be the given set of features and let $f \subset F$ be the selected set of features of cardinality $q$. Let the feature selection criterion for the data set $X$ be represented by $J(F, X)$, where a lower value of $J(\cdot)$ indicates a better selection. When the training instances are labeled, we can use the label information; for unlabeled data, this cannot be done.

According to Kohavi and John [31], feature selection schemes can be broadly classified into two categories: wrapper models and filter models. Wrapper models use a feedback principle to evaluate a selected feature subset; typically the classifier itself is used to measure the goodness of the selected subset, either on a test data set or on the training data [31]. Filter models use some intrinsic property of the data set that is assumed to affect the ultimate performance of the target system (which may be a classifier or something else). Filter models are therefore more suitable for unsupervised feature selection. Mostly the preferred evaluation criterion is the performance of a classifier, i.e., the number of misclassifications. However, this depends on the choice of the particular classifier, so the selected subset may differ if a different classifier is chosen. If the problem at hand is not classification, class labels may not be available. Because of the absence of class labels in the training data, methods proposed for supervised feature selection may not be applicable in the unsupervised mode. Even when the problem is classifier design, it is better if the selected features work equally well with different classifiers. This can be achieved if we can capture the "inherent characteristics" of the data without using a particular classifier, i.e., without using class labels. So it is better to select features in an unsupervised manner. Since no target outputs are used, the feature selection system has to preserve some task-independent properties of the data, and the choice of the evaluation criterion therefore becomes critical.

Devaney and Ram [32] proposed an unsupervised feature-ranking scheme. Based on the work of Gluck and Corter [33], they formulated a measure called Category Utility to evaluate the goodness of a feature subset.
In each step one feature is added to the existing feature set and COBWEB [34] is run using this feature subset. The Category Utility of the partition created at the first level is computed, and the feature subset yielding the highest score is retained. The iteration continues until there is no significant improvement in the Category Utility score.

Talavera proposed another unsupervised method of feature ranking for categorical data [35] using Fisher's feature dependency measure [34]. The algorithm exploits the assumption that, in the absence of class labels, features exhibiting higher dependencies on other features are more relevant from a clustering point of view. Based on some definitions of distinctness and cohesiveness, Talavera [35] formulated an expression for the relevance of a feature that captures its dependency on other features; the features are then ranked by this relevance. A similar approach can be found in the paper by Pena et al. [36], who proposed a feature selection scheme in connection with clustering using Conditional Gaussian Networks (CGNs). Before learning the network, the authors preprocess the training data set by selecting only the features that are relevant from the point of view of learning CGNs. The basic assumption is that features exhibiting low correlation with other features can be considered irrelevant. Based on this assumption, they derived a relevance measure and a relevance threshold; features whose relevance exceeds the threshold are selected. Dash et al. proposed an entropy-based unsupervised feature ranking scheme applicable to both categorical and numerical data [37]. They used a sequential backward selection scheme that iteratively rejects the least important feature according to the entropy measure.

In this paper, we propose a new approach to unsupervised feature selection that preserves the topology of the data. A genetic algorithm is used to select a subset of features, taking the Sammon stress/error as the fitness function. The data set with the reduced set of features is then evaluated using classification (1-nearest neighbor, 1-NN) and clustering (k-means) techniques. The correlation coefficient between the proximity matrices of the original data set and the reduced one is also computed to check how well the topology of the data set is preserved in the reduced dimension.

3. Proposed Feature Selection Methods

3.1. Feature Selection Criterion

The most important task in designing an unsupervised feature selection algorithm is to decide on the criterion for selection. We should use criteria such that the structure of the data in the original space (the topology of the data in the original space) is preserved in the reduced space. A very interesting structure-preserving criterion is Sammon's error function [3]. Note that the original Sammon's method was developed for feature extraction, not for selection. Let $X = \{x_k \mid x_k = (x_{k1}, x_{k2}, \dots, x_{kp})^T,\ k = 1, 2, \dots, n\}$ be the set of $n$ vectors in the original space and $Y = \{y_k \mid y_k = (y_{k1}, y_{k2}, \dots, y_{kq})^T,\ k = 1, 2, \dots, n\}$ be the unknown data vectors in the reduced space. Let $d^*_{ij} = d(x_i, x_j)$, $x_i, x_j \in X$, and $d_{ij} = d(y_i, y_j)$, $y_i, y_j \in Y$, where $d(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$. The Sammon error $E$ is given by

\[
E = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j} \frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}}. \qquad (1)
\]

Sammon estimated $Y$ by minimizing Eq. 1 using a gradient descent algorithm. Since Eq. 1 preserves the inter-point distances, we want to use it for feature selection. Clearly, $E$ is zero when all features are selected. Here our problem is: given a value of $q$, select a subset of $q$ features so that $E$ in Eq. 1 is minimized. We use genetic algorithms for this purpose.
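The computation in Eq. 1 is direct to express in code. The following is a minimal Python sketch (the authors' implementation was in C, so the function name and the use of SciPy's `pdist` are our own choices); the "reduced space" is simply the original data restricted to the selected feature columns, as in the proposed methods.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X, mask):
    """Sammon error E of Eq. 1 for a feature subset.

    X    : (n, p) data matrix in the original space.
    mask : length-p boolean vector; True marks a selected feature,
           so the reduced space is X restricted to those columns.
    """
    d_orig = pdist(X)           # d*_ij over all p features
    d_red = pdist(X[:, mask])   # d_ij over the selected features only
    nz = d_orig > 0             # skip coincident points (d*_ij = 0)
    return np.sum((d_orig[nz] - d_red[nz]) ** 2 / d_orig[nz]) / d_orig.sum()
```

For a chromosome encoded as a 0/1 bit string, the mask is just `chrom.astype(bool)`.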
A Genetic Algorithm (GA) is an evolutionary algorithm (EA) that optimizes a fitness function to find the solution of a problem. Different evolutionary algorithms have been used for feature selection [16, 17, 18, 19, 20, 21, 22]. In a typical GA, each chromosome represents a prospective solution of the problem [23]. The problem is associated with a fitness function; higher (or lower, depending on the formulation) fitness refers to a better solution. The set of chromosomes is usually called a population. The population goes through a series of iterations involving crossover and mutation operations to find better solutions. At a certain fitness level, or after a certain number of iterations, the procedure is stopped and the chromosome giving the best solution is taken as the solution to the problem. We propose here four simple, yet effective, methods for unsupervised feature selection, as outlined next.

3.1.1. Method 1

Suppose we want to select about $q$ features from a set of $p$ features. We can proceed as follows. To initialize a chromosome we randomly select an integer $r$ between 1 and $q$ and then randomly select $r$ distinct integers between 1 and $p$. If a selected integer is $k$, the $k$th bit of the chromosome is set to 1 to indicate that the $k$th feature is selected. We then go through the process of selection, crossover and mutation a fixed number of times, or until some other termination criteria are satisfied. Since a lower Sammon error (SE) indicates a better chromosome, we define an inverse function of the SE as the fitness. More specifically, let $TE$ be the total Sammon error over all chromosomes in the population. Then the fitness $IE_i$ of the $i$th chromosome is defined as $IE_i = TE - E_i$, where $E_i$ is the Sammon error associated with the $i$th chromosome. Selection is done with probability proportional to fitness. In this algorithm we want a string with higher fitness to have a higher probability of taking part in genetic operations; hence, to select chromosomes for crossover, we use the roulette-wheel selection scheme with probability proportional to $IE_i$.

Once the GA terminates, different chromosomes may involve different numbers of features. We group the chromosomes according to the number of features involved. For example, Group 1 may have $k_1$ chromosomes with $n_1$ features and Group 2 may have $k_2$ chromosomes with $n_2$ features; there may be $m$ such groups, where Group $m$ has $k_m$ chromosomes with $n_m$ features. We then find the group $G_M$ such that $|q - n_M| = \min_i |q - n_i|$. There could be two such values of $M$, say $M_1$ and $M_2$. To select the best feature set we use the best chromosome from groups $M_1$ and $M_2$, i.e., the one resulting in the smallest Sammon error. A minimal sketch of this initialization and fitness computation is given below; the full algorithm follows it.
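This sketch (our own naming; a hypothetical seed) illustrates the chromosome initialization and the inverse fitness $IE_i = TE - E_i$ just described:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is arbitrary

def init_chromosome(p, q):
    """Method 1 initialization: draw r in [1, q] at random, then
    set r randomly chosen bits out of p to 1."""
    r = int(rng.integers(1, q + 1))
    chrom = np.zeros(p, dtype=bool)
    chrom[rng.choice(p, size=r, replace=False)] = True
    return chrom

def fitness_from_errors(errors):
    """IE_i = TE - E_i, where TE is the total Sammon error of the
    population; a lower error thus yields a higher fitness."""
    errors = np.asarray(errors, dtype=float)
    return errors.sum() - errors
```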
Algorithm for Method 1

1) Creation of the initial population of size N: We start with a desired number of features $q$, randomly select an integer $r$, $1 \le r \le q$, and then randomly select $r$ distinct integers between 1 and $p$. If a selected integer is $k$, the $k$th bit of the chromosome is set to 1 to indicate that the $k$th feature is selected. In this way, $N$ chromosomes are generated.

2) Perform the crossover operation with probability of crossover $P_c$: Compute the Sammon error $E_i$ of each chromosome, $i = 1, 2, \dots, N$, the total error $TE = \sum_i E_i$, and the fitness of the $i$th chromosome $IE_i = TE - E_i$, $i = 1, 2, \dots, N$. Apply the roulette-wheel method to select a pair of chromosomes with probability proportional to fitness, $IE_i$, for a possible crossover operation. Generate a random number $R$ in [0, 1]. If $R$ is less than or equal to $P_c$ (the probability of crossover), use one-point crossover to generate two offspring for the next generation; otherwise drop this pair from the crossover operation. This process is repeated $N/2$ times.

3) Perform the mutation operation with probability of mutation $P_m$: We apply bit-by-bit mutation to a chromosome. For each bit in the chromosome, generate a random number $R$ in [0, 1]. If $R$ is less than or equal to $P_m$ (the probability of mutation, a fixed small quantity; 0.001 and 0.003 are used here), flip the bit.

4) Selection of chromosomes for the next generation: The initial population of $N$ chromosomes gets modified by crossover and mutation. We combine the initial and modified populations and sort all chromosomes in ascending order of SE (i.e., fittest first, and so on). We pick a fixed number of the best chromosomes from the beginning of this combined population (40% of $N$, to be exact), then randomly pick the remaining 60% of $N$ chromosomes from the remaining chromosomes in the combined population. These two sets (the best 40% of $N$, plus a random 60% of $N$) jointly constitute the final population of $N$ chromosomes for the next generation.

5) Repeat Steps 2 through 4 until the termination condition is satisfied.

6) Feature selection: At the end of each run, we select a set of features as follows. In the final population there are $N$ chromosomes, each representing some number of features. We group the chromosomes according to the number of features involved. For example, Group 1 may have $k_1$ chromosomes with $n_1$ features and Group 2 may have $k_2$ chromosomes with $n_2$ features; there may be $m$ such groups, where Group $m$ has $k_m$ chromosomes with $n_m$ features. We then find the group $G_M$ such that $|q - n_M| = \min_i \{|q - n_i|\}$. There could be two such values of $M$, say $M_1$ and $M_2$. To select the best feature set we use the best chromosome from groups $M_1$ and $M_2$, i.e., the one resulting in the smallest Sammon error.

3.1.2. Method 2

Since Method 1 does not guarantee that on termination the best solution will have exactly $q$ features, we propose a modified method based on a special-type mutation operator, called pair-mutation; it does not use any recombination operation. In this case, every chromosome in the population is initialized with exactly $q$ randomly selected features. The GA performs no crossover operation, only the pair-mutation operation, which proceeds as follows: randomly select one position from the set of $q$ positions with bit value 1, randomly select another position from the remaining positions, and swap the two bit values. This changes one randomly selected 0 bit to 1 and one randomly selected 1 bit to 0, so the number of features in a chromosome remains exactly $q$. There is one more significant difference from the previous method. Unlike a conventional GA, where fitter strings have a higher probability of selection, here we want poorer strings to have more chances to take part in the genetic operations (here, mutation) so that they can improve themselves. Thus, for the purpose of selecting chromosomes for mutation, unfit strings are given higher priority. This promotes diversity in the population and helps unfit chromosomes become more fit. After the selection of chromosomes for the next generation, the mutation operation is repeated $N$ (the size of the population) times in the usual manner. A sketch of the two selection schemes and of the pair-mutation operator is given below; the algorithm follows.
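As a hedged illustration (names ours, continuing the earlier sketches), `roulette_pick` covers both the fitness-proportional choice of Methods 1 and 3 and the unfitness-proportional choice of Methods 2 and 4, and `pair_mutate` keeps the feature count fixed:

```python
import numpy as np

rng = np.random.default_rng(42)

def roulette_pick(weights):
    """Roulette-wheel selection: pass IE_i (fitness) for Methods 1
    and 3, or the raw Sammon errors (unfitness) for Methods 2 and 4,
    which deliberately favors poorer chromosomes."""
    w = np.asarray(weights, dtype=float)
    return int(rng.choice(len(w), p=w / w.sum()))

def pair_mutate(chrom):
    """Pair-mutation (Method 2): swap one randomly chosen 1-bit with
    one randomly chosen 0-bit, so exactly q bits stay set."""
    chrom = chrom.copy()
    ones, zeros = np.flatnonzero(chrom), np.flatnonzero(~chrom)
    if ones.size and zeros.size:
        chrom[rng.choice(ones)] = False
        chrom[rng.choice(zeros)] = True
    return chrom
```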
Algorithm for Method 2

1) Generation of the initial population: Randomly generate $N$ chromosomes to form the initial population such that each chromosome has exactly $q$ features, and compute the Sammon error of each chromosome.

2) Mutation operation with mutation probability $P_m$: Compute the Sammon error of each chromosome. Since we select unfit chromosomes here, the procedure is slightly different from that of Method 1: normalize the Sammon errors of all chromosomes and calculate their cumulative values, then randomly select a chromosome for mutation with probability proportional to its Sammon error, i.e., proportional to its unfitness. Draw a random number $R$ in the range [0, 1]. If $R$ is less than or equal to $P_m$, perform pair-mutation on this chromosome. Repeat this process $N$ times.

3) Selection of chromosomes for the next generation: Exactly the same as Step 4 of Method 1.

4) Repeat Steps 2 and 3 until the termination criterion is satisfied.

5) Feature selection: Find the chromosome with the minimum SE and select the features in that chromosome.

3.1.3. Method 3

In this method, instead of using only the mutation operation, we use both crossover and the usual mutation operation, as in Method 1. However, we add a penalty term to the Sammon error to get an augmented version of the Sammon error, as in Eq. 2:

\[
E_i = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j} \frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}} + \gamma \cdot f(q - l_i). \qquad (2)
\]

In Eq. 2, $l_i$ is the number of features selected by the $i$th chromosome and $E_i$ is the fitness of the $i$th chromosome (the lower the value, the better the fitness). Here $\gamma$ is a positive constant. A simple choice for $f$ is $f(q - l_i) = (q - l_i)^2$; one can also use an exponential function. If we use Eq. 2, the penalty term keeps the number of selected features close to $q$. In this case, we experiment with two versions: we select the chromosomes for crossover and mutation operations with (i) probability proportional to fitness, as well as with (ii) probability proportional to unfitness (thus two approaches). If we have no idea of the desired number of features, we can use $f$ as a function of the number of selected features $l_i$ alone, e.g., $f(l_i) = l_i^2$ or $f(l_i) = e^{l_i/p}$; many other choices are possible. In Method 3, as in Method 1, we select chromosomes for mutation and crossover based on fitness (not unfitness).

Algorithm for Method 3

1) Initialize the population with $N$ chromosomes, where each chromosome represents exactly $q$ features selected randomly.

2) Perform the crossover operation with probability of crossover $P_c$: Same as Step 2 of Method 1, except that the fitness is computed using Eq. 2.

3) Perform the mutation operation with probability of mutation $P_m$: Same as Step 3 of Method 1, but the fitness is computed using Eq. 2.

4) Selection of chromosomes for the next generation: Same as Step 4 of Method 1.

5) Repeat Steps 2 through 4 until the termination condition is satisfied.

6) Feature selection: Exactly the same as Step 6 of Method 1.

3.1.4. Method 4

This algorithm is exactly the same as Method 3, except that instead of the fitness we use the unfitness of chromosomes, as in Method 2, to compute the probability of selecting chromosomes for the genetic operations. A sketch of the augmented error of Eq. 2 is given below.
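Under the same assumptions as the earlier sketches, the augmented error of Eq. 2 with the quadratic penalty $f(q - l_i) = (q - l_i)^2$ and the value $\gamma = 0.004$ from Table 2 might look like this:

```python
import numpy as np
from scipy.spatial.distance import pdist

def penalized_error(X, mask, q, gamma=0.004):
    """Augmented Sammon error of Eq. 2 (Methods 3 and 4) with the
    quadratic penalty; gamma = 0.004 as in Table 2."""
    d_orig, d_red = pdist(X), pdist(X[:, mask])
    nz = d_orig > 0
    stress = np.sum((d_orig[nz] - d_red[nz]) ** 2 / d_orig[nz]) / d_orig.sum()
    l_i = int(mask.sum())   # number of features this chromosome selects
    return stress + gamma * (q - l_i) ** 2
```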
4. Validation Methods

If the feature selection method is supervised, then validation is typically done using either the classification error (if the problem is classification) or the prediction error (if the problem is of the function approximation/prediction type). For an unsupervised method, some other criterion characterizing the geometry of the data should be used. We use several criteria: topology preservation (i.e., preservation of the distance geometry between the original and reduced spaces), measured by Sammon's error and the correlation coefficient, and cluster structure preservation. In addition, since all of our data sets have class labels, we also assess the quality of the features in terms of the performance of a classifier.

4.1. Topology Preservation

For this we use two indices: Sammon's error and the correlation coefficient. Sammon's error measures how well the inter-point distances are preserved, so the Sammon error SE, as defined in Eq. 1, can be used as a measure of the quality of the selected features. A lower value of SE indicates higher topology preservation and hence a better set of selected features.

To assess how well the topology of the original space is preserved in the reduced space, we proceed as follows. In the original space we have $L = n(n-1)/2$ distinct distances $d^*_{ij}$, and similarly we have $L$ corresponding distances $d_{ij}$ in the reduced space. Let $\bar{d}^*$ and $\bar{d}$ be the averages of the $L$ distances in the original and reduced spaces, respectively. The correlation coefficient between the two sets of distances is computed as in Eq. 3:

\[
r = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d^*_{ij} - \bar{d}^*)(d_{ij} - \bar{d})}{\sqrt{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d^*_{ij} - \bar{d}^*)^2 \; \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (d_{ij} - \bar{d})^2}}. \qquad (3)
\]

The correlation coefficient lies between -1 and +1, and a value close to unity indicates a better choice of features.
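Eq. 3 is the ordinary Pearson correlation between the two vectors of $L = n(n-1)/2$ pairwise distances, so it can be sketched directly (again assuming the subspace view of the reduced space; the helper name is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_correlation(X, mask):
    """Correlation coefficient r of Eq. 3 between the inter-point
    distances in the original space and in the reduced space."""
    d_orig = pdist(X)           # the L = n(n-1)/2 distances d*_ij
    d_red = pdist(X[:, mask])   # the corresponding d_ij
    return float(np.corrcoef(d_orig, d_red)[0, 1])
```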
4.2. Cluster Structure Preservation

If the selected features are good in the sense of preserving the structure of the data, then in both spaces (original and reduced) we should find a similar cluster structure. If the data set is distributed over $c$ classes, we cluster the data using the k-means algorithm with $k = c$, assuming that each class imposes some cluster structure (which may not always be true); if the data are not associated with any class structure, the number of clusters has to be fixed. We then compare the two partitions (in the original space and in the reduced space) to find how different they are, taking the cluster structure found in the original space as the primary cluster structure. To find the clusters in the original space we initialize the algorithm with a set of randomly selected data points (the MATLAB implementation of k-means); to find the clusters in the reduced space, we use the projections of the cluster centers initially used in the original space. This is likely to reduce the effect of initialization on the final clustering results in the reduced dimension and to encourage a similar partition in the reduced space. We now compute the confusion matrix $C = [C_{kl}]$, where $C_{kl}$ is the number of points in the $k$th cluster of the original space that are placed in the $l$th cluster found in the reduced space. We then re-label the centroids and realign the confusion matrix using the realignment algorithm described below, and compute the sum of the off-diagonal entries of the realigned confusion matrix. Let this sum be $T$; then $M = (T/n) \times 100$ is the percentage of points that are placed differently by the two partitions. We call $100 - M$ the cluster-preservation index (CPI), and we report the mean and standard deviation of the CPI.

Realignment Algorithm

1) Construct a confusion matrix C of dimension c x c, where c is the number of clusters (here, the number of classes), as follows:

   for i = 1, 2, ..., n do
       k <- cluster label of x_i in X obtained in the original space
       l <- cluster label of y_i in Y, the point in the reduced space corresponding to x_i
       C_kl <- C_kl + 1
   end for

   Now normalize C to C' by rows:

   for i = 1, 2, ..., c do
       for j = 1, 2, ..., c do
           C'_ij <- C_ij / sum_k C_ik
       end for
   end for

2) Obtain the realigned matrix R (a c x c matrix) from C using C' as follows:

   for i = 1, 2, ..., c do
       select the column maxcol that contains the maximum of the i-th row of C'
       for j = 1, 2, ..., c do
           R_ji <- C_j,maxcol        (copy column maxcol of C into column i of R)
       end for
       for j = 1, 2, ..., c do
           C'_j,maxcol <- -1         (so each column can be matched only once)
       end for
   end for

The sum of the off-diagonal entries of the realigned confusion matrix R gives the disagreement between the cluster structures found in X and in Y. The realignment step establishes a proper correspondence between the clusters.
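The following sketch gives one reading of the pseudocode (our interpretation: the search for maxcol is done on the row-normalized matrix C', while the raw counts of C are copied into R, so that the off-diagonal sum T is a number of points):

```python
import numpy as np

def cluster_preservation_index(labels_orig, labels_red, c):
    """CPI of Section 4.2: confusion matrix, greedy realignment,
    then 100 minus the percentage of misplaced points."""
    C = np.zeros((c, c))
    for k, l in zip(labels_orig, labels_red):
        C[k, l] += 1                          # C_kl: orig cluster k -> reduced cluster l
    Cn = C / C.sum(axis=1, keepdims=True)     # row-normalized C'
    R = np.zeros_like(C)
    for i in range(c):
        maxcol = int(np.argmax(Cn[i]))        # best-matching column of row i
        R[:, i] = C[:, maxcol]                # copy raw counts into column i of R
        Cn[:, maxcol] = -1.0                  # each column may be matched only once
    T = R.sum() - np.trace(R)                 # off-diagonal entries = disagreement
    return 100.0 - 100.0 * T / len(labels_orig)
```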
4.3. Classifier Performance

If the selected features are good in the sense of preserving the geometry of the original data, then any classification method that makes decisions using distances is likely to yield comparable performance with all features and with the selected features. To test this, we have used the 1-nearest neighbor classifier with five-fold cross-validation to estimate the performance of the classifier: the data set is divided randomly into five (almost) equal groups, four of which are used as the training set while the remaining group is used as the test set. This cross-validation experiment is repeated 10 times. To make the comparison fair, we have used the same partition in the original and reduced spaces.

5. Results

5.1. Data Sets

To validate our methods, we have used six real data sets obtained from the UCI Machine Learning Repository [24].

1) Wisconsin Breast Cancer (WBC): 699 samples distributed over two classes. Sixteen of the instances have missing values and were removed; all reported results are computed on the remaining 683 data points. Each data point is represented by nine attributes.

2) Glass: 214 points with nine features distributed over six classes.

3) Wine: 178 samples in 13 dimensions distributed over three classes. These data represent the results of chemical analysis of wines grown in a particular region of Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

4) Ionosphere: The data represent autocorrelation functions of radar measurements; the task is to classify them into two classes denoting passages or obstructions in the ionosphere. There are 351 data patterns with 34 attributes distributed over two classes.

5) Sonar: 208 patterns obtained by bouncing sonar signals off a metal cylinder and off rocks at various angles and under various conditions. Each pattern is represented by 60 attributes, each giving the energy within a particular frequency band.

6) Vehicle: 846 points distributed over four classes, each represented by 18 attributes. The attribute values are normalized.

These data sets are summarized in Table 1, which also includes the population size and number of generations used for each data set. For data sets with lower dimensionality (less than 10) we have used a population of size 10 and only 10 generations; for the other data sets, with higher dimensionality, we have used larger populations and 50 generations.

Table 1: Summary of the data sets used.

Dataset      #Class  #Features  Size of data set (by class)   Population size  No. of generations
Glass        6       9          214 (70+76+17+13+9+29)        10               10
WBC          2       9          683 (444+239)                 10               10
Wine         3       13         178 (59+72+47)                40               50
Ionosphere   2       34         351 (225+126)                 30               50
Sonar        2       60         208 (97+111)                  40               50
Vehicle      4       18         846 (212+217+218+199)         40               50

5.2. Summarization of the Performance Indexes

The length of a chromosome for a data set is equal to the number of features in the data set; for the Glass data, for example, the length is nine. Table 2 lists the choices of the various parameters used for all data sets.

Table 2: Description of the parameters used in the GA.

Parameter  Description                    Value
N          Population size                50
P_c        Probability of crossover       0.8
P_m        Probability of mutation        0.003 (Methods 1, 3, 4); 0.7 (Method 2)
P_select   Probability of selection       0.07
gamma      Constant of penalty function   0.004

On termination of the evolution, the best chromosome (best in terms of Sammon's error) is picked to find the associated set of features. We use these selected features to compute four performance indexes: Sammon's error, correlation, cluster preservation and classification accuracy with a 1-NN classifier, as described earlier. This completes one trial of a method. The entire procedure is repeated ten times; in other words, given a data set and a method, we make ten trials of the method on that data set. We then compute the performance indexes on each of the 10 sets of selected features as follows.

Sammon's error and correlation: Let the feature sets selected in the 10 trials be $F_1, F_2, \dots, F_{10}$. For each feature set $F_k$, $k = 1, 2, \dots, 10$, we compute the Sammon error $SE_k$ and the correlation $r_k$ between the inter-point distances in the original space and in the reduced space. Column 4 of Table 3 reports the average and standard deviation of the 10 values $SE_k$; similarly, column 5 gives the mean and standard deviation of the 10 correlation values $r_k$.

Cluster preservation: The output of k-means clustering depends on the initialization used, so to compute cluster preservation we run the k-means algorithm 10 times for each of the 10 feature sets obtained from the trials. In other words, for each feature set $F_k$, $k = 1, 2, \dots, 10$, we use 10 initializations $I_1, I_2, \dots, I_{10}$. Thus for $F_k$ we get 10 cluster preservation values $CP_{k,i}$, $i = 1, 2, \dots, 10$. Let $CP_{avg,k} = \text{average}\{CP_{k,i};\ i = 1, 2, \dots, 10\}$. Column 6 of Table 3 reports, for each data set, the mean and standard deviation of $CP_{avg,k}$, $k = 1, 2, \dots, 10$.

Classifier performance: For the 1-NN classifier, for each feature set $F_k$, $k = 1, 2, \dots, 10$, we use five-fold cross-validation, and this five-fold cross-validation experiment is repeated for 10 rounds. Let the average cross-validation error of round $i$ be $E_{k,i}$, $i = 1, 2, \dots, 10$, and let $E_{avg,k} = \text{average}\{E_{k,i};\ i = 1, 2, \dots, 10\}$. As for cluster preservation, column 7 of Table 3 reports the mean and standard deviation of $E_{avg,k}$, $k = 1, 2, \dots, 10$.
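The classifier-performance index can be reproduced along these lines with scikit-learn (our substitution; the paper uses its own C code with the same random partition shared between the two spaces, and it does not state that the folds were stratified):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def one_nn_cv(X, y, mask, rounds=10):
    """Mean and SD over `rounds` repetitions of five-fold
    cross-validated 1-NN accuracy on the selected features."""
    scores = []
    for r in range(rounds):
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=r)
        clf = KNeighborsClassifier(n_neighbors=1)
        scores.append(cross_val_score(clf, X[:, mask], y, cv=cv).mean())
    return float(np.mean(scores)), float(np.std(scores))
```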
The very low values of Sammon's error, together with the very high values of correlation, in Table 3 for the Glass data reveal that, on average, just 4 features can do a very good job of structure preservation. In other words, the distance geometries in the original data space and in the reduced space are likely to be similar. If that is really true, then the performance of a 1-NN classifier in the original space and in the reduced space is expected to be almost the same, and Table 3 indeed reveals that. In fact, for two data sets, the average performance of the 1-NN classifier with the features selected by the four methods is marginally better than that with all features.

In Table 3 several entries appear as (0.00). These values are not exactly zero, but become zero when rounded to two digits; this must not be interpreted to mean that in each of these cases the same feature set was selected in all ten rounds. The same holds where the correlation is reported as 1.0: this, too, is obtained after rounding.

Overall, the performance of the algorithms is similar in nature on all data sets. Here are some key observations. The average Sammon's error and correlation are very good for almost all data sets: with 4 algorithms and 6 data sets there are 24 cases in total, and in all but four of them the average correlation is more than 0.95. Note that Sammon's error and correlation are directly related to structure preservation, and each algorithm did a good job for all data sets. From Table 3 we also find that the average cluster preservation is very good: in 20 of the 24 cases it is more than 80%, and in 18 cases more than 90%. Although our algorithms pay no attention to class structure, the average performance of the 1-NN classifier using the reduced set of features and using the entire feature set is almost the same for most data sets. In fact, for the Ionosphere and WBC data sets, the reduced sets of features selected by all four algorithms resulted in better average classifier performance than using the entire feature set.
Table 3: Performance evaluation of the four methods. Columns: data set; method; average number of selected features; Sammon error, mean (SD); correlation, mean (SD); cluster preservation % CP_avg,k, mean (SD); 1-NN classifier E_avg,k with the selected features, mean (SD); 1-NN classifier with all features.

Data set    Method  Avg #feat  Sammon err    Correlation   Cluster pres. %  1-NN (selected)  1-NN (all)
Glass       1       3.8        0.09 (0.01)   0.93 (0.04)   69.3 (4.9)       64.8 (4.43)      71.91
            2       4          0.02 (0.01)   0.98 (0.01)   73.72 (1.96)     68.92 (2.25)
            3       3.8        0.05 (0.03)   0.93 (0.04)   69.04 (6.2)      63.5 (6.15)
            4       3.7        0.03 (0.02)   0.94 (0.05)   70.03 (6.6)      63.2 (6.8)
WBC         1       4.7        0.1 (0.05)    0.95 (0.03)   94.7 (2.3)       95.1 (0.8)       95.8
            2       4          0.09 (0.09)   0.96 (0.01)   92.06 (0.18)     96.07 (0.09)
            3       4.5        0.06 (0.02)   0.95 (0.03)   94.0 (2.4)       94.7 (1.4)
            4       4.5        0.04 (0.01)   0.95 (0.02)   94.2 (3.4)       95.0 (0.7)
Wine        1       4.9        0.00 (0.00)   1.0 (0)       100 (0)          72.82 (1.02)     74.22
            2       5          0.00 (0.00)   1.0 (0.00)    100 (0)          72.74 (0.64)
            3       4.9        0.00 (0.00)   1.0 (0.00)    100 (0)          72.44 (1.32)
            4       5          0.00 (0.00)   0.99 (0.00)   100 (0)          72.61 (1.56)
Ionosphere  1       16.8       0.07 (0.00)   0.98 (0.01)   97.2 (1.09)      87.53 (0.90)     86.5
            2       17         0.06 (0.00)   0.97 (0.00)   98.5 (0.82)      87.22 (1.33)
            3       17.2       0.07 (0.01)   0.97 (0.00)   97.4 (1.27)      88.08 (1.44)
            4       17.1       0.08 (0.01)   0.97 (0.01)   97.3 (1.19)      87.25 (1.44)
Sonar       1       18.2       0.1 (0.02)    0.93 (0.015)  89.4 (5.09)      80.71 (2.06)     81.29
            2       18         0.08 (0.01)   0.94 (0.02)   90.9 (3.74)      79.02 (2.25)
            3       18         0.05 (0.01)   0.94 (0.01)   89 (6.53)        79.76 (2.32)
            4       18         0.03 (0.00)   0.94 (0.01)   92.1 (3.65)      79.89 (1.68)
Vehicle     1       9.5        0.00 (0.00)   0.99 (0.00)   99.8 (0.06)      60.86 (1.51)     64.17
            2       9          0.00 (0.00)   0.99 (0.00)   99.8 (0.12)      60.91 (1.64)
            3       9.1        0.00 (0.00)   0.99 (0.00)   99.6 (0.22)      58.39 (2.78)
            4       9          0.00 (0.00)   0.99 (0.00)   99.4 (0.47)      59.43 (2.94)

In 23 of the 24 cases, the difference between the average 1-NN performance using the reduced feature set and using the entire feature set is less than 5%, although the unsupervised method did not use the class information. The average cluster preservation is worst for the Glass data, between 69% and 74%. The reason can be seen from the scatter plot in Fig. 1, which depicts the first two principal components of the Glass data. Fig. 1 reveals that this data set does not have any clear cluster structure, and when this happens the output of clustering algorithms such as k-means becomes very sensitive to initialization. Even though the cluster preservation is quite poor, the average classifier performance using 3-4 features is about 68%, while that using all 9 features is about 72%.

Fig. 1  Scatter plot of the Glass data set (first two principal components).

Table 4: Average time (in seconds) taken by the different methods over 10 runs.

Dataset      Method 1  Method 2  Method 3  Method 4
Glass        7.56      8.52      7.4       7.43
WBC          67.2      73.7      67.1      67.8
Wine         98.94     107.44    99.38     94.25
Ionosphere   593.86    555.75    600.84    593.72
Sonar        440.75    404.69    441.32    437.31
Vehicle      2760.00   2866.13   2780.18   2743.13

Table 4 shows the average (over 10 runs) time in seconds taken by the different methods for the different data sets. All programs were developed in C, and all computations were done on a PC with an Intel Core 2 Duo processor (2.80 GHz). The number of generations for a given data set is kept the same for all four methods; these values are shown in the last column of Table 1. Table 4 shows that, except for the Ionosphere and Sonar data with their higher dimensionality, Method 2 takes more time than the other methods.

6. Conclusion and Discussion

In this paper we have proposed four unsupervised methods for feature selection using genetic algorithms. The most important issue in unsupervised feature selection is the choice of a criterion function.
The criterion for feature selection should be such that both the cluster structure and the performance of classifiers are preserved in the reduced space. To achieve this, we have used Sammon's stress function, which attempts to preserve the topology of the data by preserving the inter-point distances. In this context, in addition to the commonly used fitness criterion, we have explored the use of an unfitness criterion for the selection of chromosomes for genetic operations. This is done to promote diversity in the new population and to give unfit strings higher chances to become fit. The proposed methods are evaluated on a set of six data sets with widely varying dimensionality. Since the methods are unsupervised, we have evaluated the quality of the selected features using four criteria. Two of the criteria (Sammon's error and the correlation between inter-point distances) measure the extent to which the topology of the original data space is preserved in the reduced space. The third criterion measures how well the cluster structure in the original space is preserved in the reduced space, while the remaining criterion assesses how well a classifier (here 1-NN) performs in the reduced space compared with its performance in the original data space. All four criteria reveal that the proposed methods do an excellent job of selecting useful features.

References

1. Pal N R (2002) Fuzzy logic approaches to structure preserving dimensionality reduction. IEEE Transactions on Fuzzy Systems 10(3): 277-286
2. Jain A K, Dubes R C (1988) Algorithms for clustering data. Prentice-Hall, Upper Saddle River, NJ
3. Sammon J W (1969) A nonlinear mapping for data structure analysis. IEEE Transactions on Computers C-18: 401-409
4. Schachter B (1978) A nonlinear mapping algorithm for large databases. Computer Graphics and Image Processing 7: 271-278
5. Pykett C E (1978) Improving the efficiency of Sammon's nonlinear mapping by using clustering archetypes. Electronics Letters 14: 799-800
6. Pal N R (1999) Soft computing for feature analysis. Fuzzy Sets and Systems 103: 201-221
7. Muni D P, Pal N R, Das J (2004) A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation 8(2): 183-196
8. Mitra P, Murthy C A, Pal S K (2002) Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3): 301-312
9. Dash M, Liu H (2000) Feature selection for clustering. Proceedings of the Asia Pacific Conference on Knowledge Discovery and Data Mining: 110-121
10. Dy J G, Brodley C E (2000) Feature subset selection and order identification for unsupervised learning. Proceedings of the 17th International Conference on Machine Learning
11. Basu S, Micchelli C A, Olsen P (2000) Maximum entropy and maximum likelihood criteria for feature selection and multivariate data. Proceedings of the IEEE International Symposium on Circuits and Systems: 267-270
12. Pal S K, De R K, Basak J (2000) Unsupervised feature evaluation: a neuro-fuzzy approach. IEEE Transactions on Neural Networks 11(2): 366-376
13. Hall M A (2000) Correlation-based feature selection for discrete and numeric class machine learning. Proceedings of the 17th International Conference on Machine Learning
14. Heydorn R P (1971) Redundancy in feature extraction. IEEE Transactions on Computers C-20: 1051-1054
15. Das S K (1971) Feature selection with a linear dependence measure.
IEEE Transactions on Computers C-20: 1106-1109
16. Muni D P, Pal N R, Das J (2006) Genetic programming for simultaneous feature selection and classifier design. IEEE Transactions on Systems, Man, and Cybernetics, Part B 36(1): 1-12
17. Siedlecki W, Sklansky J (1989) A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters 10: 335-347
18. Siedlecki W, Sklansky J (1988) On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence 2(2): 197-220
19. Casillas J, Cordon O, Del Jesus M J, Herrera F (2001) Genetic feature selection in a fuzzy rule-based classification system learning process for high-dimensional problems. Information Sciences 136: 135-157
20. Pal N R, Nandi S, Kundu M K (1998) Self-crossover: a new genetic operator and its application to feature selection. International Journal of Systems Science 29(2): 207-212
21. Ahluwalia M, Bull L (2001) Coevolving functions in genetic programming. Journal of Systems Architecture 47(7): 573-585
22. Sherrah J, Bogner R E, Bouzerdoum A (1996) Automatic selection of features for classification using genetic programming. Proceedings of the Australian New Zealand Conference on Intelligent Information Systems: 284-287
23. Goldberg D (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA
24. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
25. Bishop C M (1995) Neural networks for pattern recognition. Clarendon Press, Oxford
26. Raudys S J, Pikelis V (1980) On dimensionality, sample size, classification error, and complexity of classification algorithms in pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2: 243-251
27. Raudys S J, Jain A K (1991) Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(2): 252-264
28. Trunk G V (1979) A problem of dimensionality: a simple example. IEEE Transactions on Pattern Analysis and Machine Intelligence 1(3): 306-307
29. Chakraborty D, Pal N R (2002) Designing rule-based classifiers with on-line feature selection: a neuro-fuzzy approach. Proceedings of Advances in Soft Computing-AFSS. Springer: 251-259
30. Pal N R, Chintalapudi K (1997) A connectionist system for feature selection. Neural, Parallel and Scientific Computations 5: 359-382
31. Kohavi R, John G H (1997) Wrappers for feature subset selection. Artificial Intelligence 97: 273-324
32. Devaney M, Ram A (1997) Efficient feature selection in conceptual clustering. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN
33. Gluck M A, Corter J E (1985) Information, uncertainty and the utility of categories. Proceedings of the Seventh Annual Conference of the Cognitive Science Society, Irvine, CA. Lawrence Erlbaum Associates: 283-287
34. Fisher D (1987) Knowledge acquisition via incremental conceptual clustering. Machine Learning 2(2): 139-172
35. Talavera L (2000) Dependency-based feature selection for clustering symbolic data. Intelligent Data Analysis 4: 19-28
36. Pena J M, Lozano J A, Larranaga P, Inza I (2001) Dimensionality reduction in unsupervised learning of conditional Gaussian networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6)
37. Dash M, Liu H, Yao J (1997) Dimensionality reduction of unsupervised data. Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence (ICTAI)
38. De R K, Pal N R, Pal S K (1997) Feature analysis: neural network and fuzzy set theoretic approaches. Pattern Recognition 30(10): 1579-1590
39. Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537
40. Wei H L, Billings S A (2007) Feature subset selection and ranking for data dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(1): 162-166