
A simple and efficient algorithm for gene selection using sparse logistic regression

BIOINFORMATICS, Vol. 19 no. 17 2003, pages 2246–2253
DOI: 10.1093/bioinformatics/btg308

S. K. Shevade¹ and S. S. Keerthi²,*

¹Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India and ²Control Division, Department of Mechanical Engineering, National University of Singapore, Singapore 117576, Republic of Singapore

Received on November 30, 2002; revised on March 12, 2003; accepted on May 29, 2003

*To whom correspondence should be addressed.

ABSTRACT

Motivation: This paper gives a new and efficient algorithm for the sparse logistic regression problem. The proposed algorithm is based on the Gauss–Seidel method and is asymptotically convergent. It is simple and extremely easy to implement; it neither uses any sophisticated mathematical programming software nor needs any matrix operations. It can be applied to a variety of real-world problems like identifying marker genes and building a classifier in the context of cancer diagnosis using microarray data.

Results: The gene selection method suggested in this paper is demonstrated on two real-world data sets and the results were found to be consistent with the literature.

Availability: The implementation of this algorithm is available at the site http://guppy.mpe.nus.edu.sg/~mpessk/SparseLOGREG.shtml

Contact: mpessk@nus.edu.sg

Supplementary Information: Supplementary material is available at the site http://guppy.mpe.nus.edu.sg/~mpessk/SparseLOGREG.shtml

INTRODUCTION

Logistic regression is a powerful discriminative method. It also has a direct probabilistic interpretation built into its model. One of the advantages of logistic regression is that it provides the user with explicit probabilities of classification apart from the class label information. Moreover, it can be easily extended to the multi-category classification problem. In this paper the problem of logistic regression is addressed with the application of gene expression data particularly in mind.

The development of microarray technology has created a wealth of gene expression data. Typically, these data sets involve thousands of genes (features) while the number of tissue samples is in the range 10–100. The objective is to design a classifier which separates the tissue samples into predefined classes (e.g. tumour and normal). Different machine learning techniques like Parzen windows, Fisher's linear discriminant, and decision trees have been applied to solve this classification problem (Brown et al., 2000). Since the data dimension is very large, Support Vector Machines (SVMs) have been found to be very useful for this classification problem (Brown et al., 2000; Furey et al., 2000). Apart from the classification task, it is also important to remove the irrelevant genes from the data so as to simplify the inference. One of the aims of the microarray data experiments is to identify a small subset of informative genes, called marker genes, which discriminate between the tumour and the normal tissues, or between different kinds of tumour tissues. Identification of marker genes from a small set of samples becomes a difficult task and there is a need to develop an efficient classification algorithm for such an application where the number of features is much larger than the number of samples. Furey et al. (2000) used Fisher score to perform feature selection prior to training. This method was compared against feature selection using the radius–margin bound for SVMs by Weston et al. (2001). Guyon et al. (2002) introduced an algorithm, called Recursive Feature Elimination (RFE), where features are successively eliminated during training of a sequence of SVM classifiers. Li et al. (2002) introduced two Bayesian classification algorithms which also incorporate automatic feature selection. In their work, the Bayesian technique of automatic relevance determination (ARD) was used to perform feature selection.
The main aim of this paper is to develop an efficient algorithm for the sparse logistic regression problem. The proposed algorithm is very much in the spirit of the Gauss–Seidel method (Bertsekas and Tsitsiklis, 1989) for solving unconstrained optimization problems. The algorithm is simple, extremely easy to implement, and can be used for feature selection as well as classifier design, especially for the microarray data set. It does not need any sophisticated LP or QP solver, nor does it involve any matrix operations, which is a drawback of some of the previously suggested methods. Our algorithm also has the advantage of not using any extra matrix storage and therefore it can be easily used in cases where the number of features used is very high.

The remaining sections of the paper contain the following: description of the problem formulation and the optimality conditions of the problem; details of the proposed algorithm; the feature selection and the classifier design procedure; computational experiments; and concluding remarks.

PROBLEM FORMULATION AND OPTIMALITY

In this paper we focus on the two-category classification problem. The ideas discussed here can be easily extended to the multi-category problem and those details will be addressed in a future paper. We will now describe the problem formulation using microarray tissue classification as an example. Given a microarray data set containing m tissue samples, with each tissue sample represented by the expression levels of N genes, the goal is to design a classifier which separates the tissue samples into two predefined classes. Let {(x̃_i, y_i)}_{i=1}^{m} denote the training set, where x̃_i is the ith input pattern, x̃_i ∈ R^N, and y_i is the corresponding target value; y_i = 1 means x̃_i is in class 1 and y_i = −1 means x̃_i is in class 2. Note that the vector x̃_i contains the expression values of N genes for the ith tissue sample and x̃_ij denotes the expression value of gene j for the ith tissue sample.
To conveniently take into account the bias term in the regression function, let us define x^T = (1, x̃^T). Let us specify the functional form of the regression function f(x) as a linear model,

    f(x_i) = Σ_{j=0}^{N} α_j x_ij                                          (2.1)

where α denotes the weight vector. We address the first element (corresponding to the bias term) of the vectors x and α as the 'zeroth element'. To avoid notational clutter, throughout this paper we use the index i to denote the elements of the training set (i = 1, ..., m) and the index j to represent the elements of x and α (j = 0, ..., N, unless specified explicitly).

Typically for the microarray data, N ≫ m. That is, the training samples lie in a very high-dimensional space. Therefore, linear separability of the two classes may not be a problem. Hence, it is appropriate to consider using a linear classifier in the input space directly without transforming it into a higher dimensional feature space. Furthermore, for such data sets it is important to note that finding a linear classifier in the input space involves estimation of a large number of parameters [in Equation (2.1)] using a very small number of training examples. It is therefore possible to derive different linear classifiers for the same problem as the problem is underdetermined. How the proposed algorithm takes care of the effect of multiple solutions is discussed in the section on feature selection and classifier design.

For diagnostic purposes, it is important to have a classifier which uses as few features (genes) as possible. Since we are interested in finding a small subset of genes which enables us to do good classification it is appropriate to use a sparse model of the regression function, that is, in the final function f(x) in (2.1), only a small number of αs corresponding to relevant features will have non-zero values. Therefore, we solve the sparse logistic regression problem, which can be formulated as the following optimization problem:

    min_α  ρ = Σ_i g(−y_i f(x_i))   subject to   Σ_{j=1}^{N} |α_j| ≤ t      (2.2)

where t ≥ 0 is a parameter that is tuned using techniques such as cross-validation and the function g is given by:

    g(ξ) = log(1 + e^ξ)                                                     (2.3)

which is the negative log-likelihood function associated with the probabilistic model

    Prob(y|x) = 1 / (1 + e^{−y·f(x)})                                       (2.4)

Once the αs are determined by solving (2.2), the class of the test sample, x̄, is +1 if f(x̄) > 0 and −1 otherwise.
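As a concrete illustration of (2.1)–(2.4), the following sketch evaluates the linear model, the loss g and the resulting class prediction in NumPy. This is not the authors' released code; the array layout (rows are samples, column 0 holds the constant 1 for the bias term) is an assumption made for the example.

```python
import numpy as np

def augment(X_tilde):
    """Prepend the constant 'zeroth element' 1 to every sample, so that x^T = (1, x~^T)."""
    return np.hstack([np.ones((X_tilde.shape[0], 1)), X_tilde])

def f(X, alpha):
    """Linear model (2.1): f(x_i) = sum_{j=0..N} alpha_j x_ij."""
    return X @ alpha

def g(xi):
    """Negative log-likelihood (2.3): g(xi) = log(1 + exp(xi)), computed stably."""
    return np.logaddexp(0.0, xi)

def rho(alpha, X, y):
    """Data term of (2.2): sum_i g(-y_i f(x_i)), with y_i in {+1, -1}."""
    return np.sum(g(-y * f(X, alpha)))

def predict(X, alpha):
    """Class of a test sample: +1 if f(x) > 0, and -1 otherwise."""
    return np.where(f(X, alpha) > 0, 1, -1)
```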
The formulation (2.2) was first suggested by Tibshirani (1996) as an extension of the 'LASSO' (Least Absolute Shrinkage and Selection Operator) method for the linear regression problem. The LASSO estimator for the regression problem is obtained by solving the following optimization problem:

    min_α  Σ_i (y_i − f(x_i))²   subject to   Σ_{j=1}^{N} |α_j| ≤ t         (2.5)

In Tibshirani (1996), two iterative algorithms were suggested for the solution of (2.5). One of these algorithms treats (2.5) as a problem in N + 1 variables and 2^N constraints. In each iteration it solves a linear least squares problem subject to a subset of constraints E. Once a solution is found in an iteration, a constraint violated by the current solution is added to the set E and a new feasible point is found. The linear least squares problem is then resolved subject to the new set of constraints E. This procedure is repeated until the optimality conditions are satisfied. Note that, during each iteration one constraint is added to the set E and there exist only a finite (2^N) number of constraints. Therefore, this algorithm will converge in a finite number of steps. Use of an active set method was also suggested for the solution of (2.5), where every time the set E contains only the set of equality constraints satisfied by the current point. The second algorithm, suggested in Tibshirani (1996), states (2.5) as a problem in 2N + 1 variables and 2N + 1 constraints, which can be solved using any quadratic programming solver.

In Osborne et al. (2000), the problem (2.5) was treated as a convex programming problem and an efficient algorithm for computing the linear model coefficients was given using duality theory. This algorithm was stated only for the cases where the objective function is a quadratic loss function. These ideas, however, can be extended to problems with non-quadratic objective functions, for example (2.2), by optimizing the quadratic approximation of the cost function in (2.2) at any given point and repeating this procedure until convergence. Recently, Roth (2002) extended the ideas given in Osborne et al. (2000) to a more general class of cost functions including the one given in (2.2). The proof of convergence of this algorithm for a general class of loss functions is also given therein. In this method, the problem (2.2) is transformed to an L1-constrained least squares problem which can then be solved using the iteratively re-weighted least squares (IRLS) method on a transformed set of variables. To avoid solving large dimensional constrained least squares problems, especially for the microarray data, Roth (2002) uses a wrapper approach where a maximum violating variable is added to a small variable set, X, and the constrained least squares problem is solved with respect to the variables in the set X only. This procedure is repeated till the optimality conditions are satisfied.

All of the above methods work well and are very efficient; but they rely on mathematical programming solvers that require detailed matrix operations. The main contribution of this paper is to utilize the special structure of (2.2) and devise a simple algorithm which is extremely easy to implement; it neither uses any mathematical programming package nor needs any matrix operations. The simplicity of our algorithm is evident from the pseudocode. See the supplementary information for details.

Using optimality conditions it can be shown that there exists a γ > 0 for which (2.2) is equivalent to the following unconstrained optimization problem:

    min_α  W = γ Σ_{j=1}^{N} |α_j| + Σ_i g(−y_i f(x_i))                     (2.6)

This means that the family of classifiers obtained by varying t in (2.2) and the family obtained by varying γ in (2.6) are the same. The addition of the penalty term, Σ_j |α_j|, to the original objective function can be seen as putting a Laplacian prior over the vector α. The above formulation thus promotes the choice of a sparse model, where the final αs are either large or zero. In this paper, we shall concentrate on the model in (2.6).

We solve (2.6) directly without converting it into its dual. The problem (2.6) can be solved using an appropriate unconstrained nonlinear programming technique. We choose to basically use the Gauss–Seidel method, which uses a coordinate-wise descent approach, mainly because the method is extremely easy to implement while also being very efficient. In this method, one variable is optimized at a time keeping the other variables fixed, and the process is repeated till the optimality conditions are satisfied. Asymptotic convergence of this method for a more general version of the problem has been proved in (Bertsekas and Tsitsiklis, 1989; Chapter 3, Proposition 4.1). It should be noted that strict convexity of g plays an important role in this proof.
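Continuing the same sketch, the unconstrained objective (2.6) is a one-line function; note that the zeroth (bias) coefficient is left out of the penalty, as in the sum over j = 1, ..., N. The data layout is the same assumption as before.

```python
import numpy as np

def W(alpha, X, y, gamma):
    """Objective (2.6): gamma * sum_{j>=1} |alpha_j| plus the logistic loss of (2.2)-(2.3)."""
    return gamma * np.sum(np.abs(alpha[1:])) + np.sum(np.logaddexp(0.0, -y * (X @ alpha)))
```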
We now derive the first order optimality conditions for (2.6). Since (2.6) is a convex programming problem these conditions are both necessary and sufficient for optimality. Let us define

    ξ_i = −y_i f(x_i),     F_j = Σ_i  y_i x_ij / (1 + e^{−ξ_i})             (2.7)

The first order optimality conditions for the problem (2.6) can be easily derived from geometry: (i) since W is differentiable with respect to α_0, ∂W/∂α_0 = 0; (ii) if j > 0 and α_j ≠ 0 then, since W is differentiable with respect to α_j at such an α_j, we have ∂W/∂α_j = 0; and (iii) if j > 0 and α_j = 0 then, since W is only directionally differentiable with respect to α_j at α_j = 0, we require the right side derivative of W with respect to α_j to be non-negative and the left side derivative to be non-positive. These conditions can be rewritten as the following algebraic conditions:

    F_j = 0              if j = 0
    F_j = γ              if α_j > 0, j > 0
    F_j = −γ             if α_j < 0, j > 0
    −γ ≤ F_j ≤ γ         if α_j = 0, j > 0

Thus, if we define

    viol_j = |F_j|         if j = 0
           = |γ − F_j|     if α_j > 0, j > 0
           = |γ + F_j|     if α_j < 0, j > 0                                (2.8)
           = ψ_j           if α_j = 0, j > 0

where ψ_j = max(F_j − γ, −γ − F_j, 0), then the first order optimality conditions can be compactly written as

    viol_j = 0   ∀j                                                         (2.9)

Since, in asymptotically convergent procedures, it is hard to achieve exact optimality in finite time, it is usual to stop when the optimality conditions are satisfied up to some tolerance, τ. For our purpose, we will take it that any algorithm used to solve (2.6) will terminate successfully when

    viol_j ≤ τ   ∀j.                                                        (2.10)

This can be used as a stopping condition for (2.6). Throughout, we will refer to optimality as optimality with tolerance.
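The quantities in (2.7)–(2.8) and the stopping test (2.10) translate directly into code. The sketch below keeps the same assumed data layout as the earlier snippets (column 0 of X is the bias column); the default tolerance value is only a placeholder.

```python
import numpy as np

def xi_values(alpha, X, y):
    """xi_i = -y_i f(x_i), as in (2.7)."""
    return -y * (X @ alpha)

def F_values(X, y, xi):
    """F_j = sum_i y_i x_ij / (1 + exp(-xi_i)), as in (2.7), for every j at once."""
    return X.T @ (y / (1.0 + np.exp(-xi)))

def violations(alpha, X, y, gamma):
    """viol_j from (2.8); the optimality condition (2.9) holds when every entry is zero."""
    F = F_values(X, y, xi_values(alpha, X, y))
    viol = np.empty_like(alpha)
    viol[0] = abs(F[0])                                   # unpenalized bias, j = 0
    pos, neg, zero = alpha[1:] > 0, alpha[1:] < 0, alpha[1:] == 0
    viol[1:][pos] = np.abs(gamma - F[1:][pos])
    viol[1:][neg] = np.abs(gamma + F[1:][neg])
    psi = np.maximum(np.maximum(F[1:] - gamma, -gamma - F[1:]), 0.0)
    viol[1:][zero] = psi[zero]
    return viol

def satisfies_tolerance(alpha, X, y, gamma, tau=1e-3):
    """Stopping condition (2.10): viol_j <= tau for all j."""
    return bool(np.all(violations(alpha, X, y, gamma) <= tau))
```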
In the next section, we describe the actual details of our algorithm for solving the problem (2.6).

A SIMPLE ALGORITHM FOR SPARSE LOGISTIC REGRESSION

The algorithmic ideas discussed here are based on the Gauss–Seidel method used for solving unconstrained optimization problems. In solving (2.6) using the Gauss–Seidel method, one variable α_j which violates the optimality conditions is chosen and the optimization subproblem is solved with respect to this variable α_j alone, keeping the other αs fixed. This procedure is repeated as long as there exists a variable which violates the optimality conditions. The method terminates when the optimality conditions (2.9) are satisfied. Note that the objective function in (2.6) is strictly convex and it will strictly decrease at every step when the optimization subproblem is solved with respect to one variable. Repeated application of this procedure for the problem (2.6) will make sure that the algorithm will converge asymptotically (Bertsekas and Tsitsiklis, 1989) to the solution. Of course, if the stopping condition (2.10) is used then the algorithm will terminate in finite time.

For a given α, let us define the following sets: I_z = {j : α_j = 0, j > 0} and I_nz = {0} ∪ {j : α_j ≠ 0, j > 0}. Also, let I = I_z ∪ I_nz. The key to efficiently solving (2.6) using the Gauss–Seidel method is the selection of the variable α_j in each iteration with respect to which the objective function is optimized. At optimality, it is expected that the resulting model is sparse, that is, there will be only a few weights α_j with non-zero values. It is important that the algorithm spends most of its time adjusting the non-zero αs and making the subset I_nz self-consistent as far as optimality of the variables in this subset is concerned. Therefore, it is appropriate to find the maximum violating variable, say α_v, in the set I_z and then solve (2.6) using the variables in the set I_nz ∪ {v} only, until the optimality conditions for these variables are satisfied. This procedure can be repeated till no violator remains in the set I_z, at which point the algorithm can be terminated. This procedure can be best explained using Algorithm 1.

Algorithm 1
Input: Training Examples
Initialize αs to 0.
while Optimality Violator exists in I_z
    Find the maximum violator, v, in I_z
    repeat
        Optimize W w.r.t. α_v
        Find the maximum violator, v, in I_nz
    until No violator exists in I_nz
end while
Output: A set of αs for the function in (2.1)

In this algorithm, all αs are set to zero initially, so at the start all the penalized variables lie in the set I_z. The algorithm can be thought of as a two-loop approach. The type I loop runs over the variables in the set I_z to choose the maximum violator, v. In the type II loop, W is optimized with respect to α_v, thus modifying the set I_nz, and the maximum violator in the set I_nz is then found. This procedure in the type II loop is then repeated until no violators are found in the set I_nz. The algorithm thus alternates between the type I and type II loops until no violators exist in either of the sets, I_z and I_nz.

Once the maximum violating variable in a given set is chosen, some non-linear optimization technique needs to be used to solve the unconstrained optimization problem (2.6) with respect to a single variable. Note that the objective function is convex. A combination of the bisection method and the Newton method was used in our algorithm. In this method, two points L and H for which the derivative of the objective function has opposite signs are chosen. This ensures that the root always lies in a bracketed interval, [L, H]. The Newton method is tried first, and a check is made in the algorithm to make sure the iteration obtains a solution in this interval of interest. If the Newton method takes a step outside the interval we do not accept this next point, and instead resort to the bisection method. In the bisection method, the next point is chosen to be the midpoint of the given interval. Depending upon the sign of the derivative of the objective function at this point, the new interval is decided. Now, the abovementioned procedure, starting from the trying of the Newton step, is repeated. Since it is always ensured that the root lies in the interval of interest, the method is guaranteed to reach the solution.

It is important to note that the objective function, W, has different right-hand and left-hand derivatives with respect to α_j at α_j = 0. Therefore, if a non-zero variable attains a value of 0 when our algorithm is executed, it is necessary to see whether further progress in the objective function can be made by altering the same variable (but now in the opposite direction). This will make sure that after successful termination of the type II loop the optimality conditions for the variables in I_nz are satisfied.

Note from (2.7) and (2.8) that for checking the optimality conditions for each variable it is necessary to calculate ξ_i for all the examples. Since this calculation is done repeatedly, the efficient implementation of the algorithm requires that ξ_i is stored in memory for all the examples. After a single variable, say α_j, is updated it becomes necessary to update ξ_i ∀i. This can be done by using the following simple equation:

    ξ_i^new = ξ_i^old + (α_j^old − α_j^new) y_i x_ij                        (3.1)

Let us now comment on the speed of our algorithm for (2.6). γ in (2.6) is a hyperparameter and one cannot identify any nominal value for it as a starting point. For a given value of γ and a data set containing 72 samples with 7129 features per sample, it took about 20 s to solve (2.6) on a SUN UltraSparc III CPU running at 750 MHz. Once the solution at a certain value of γ is found, the solution at another close-by value of γ can be efficiently found using the previous solution as the starting point. This idea is very useful for improving efficiency when γ needs to be tuned using cross-validation; see the section below.
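The sketch below assembles Algorithm 1, the safeguarded Newton/bisection one-variable solve and the incremental update (3.1) into a single routine. It is an illustrative reading of the pseudocode rather than the authors' released implementation; the bracket initialization, the doubling step used to widen the bracket and the iteration caps are choices made only for this example, and both classes are assumed to be present in y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def solve_sparse_logreg(X, y, gamma, tau=1e-3, max_outer=10000):
    """Two-loop Gauss-Seidel solver for (2.6) in the spirit of Algorithm 1.
    X carries the constant bias column in position 0; y has entries +1/-1."""
    m, n = X.shape
    alpha = np.zeros(n)
    xi = np.zeros(m)                       # xi_i = -y_i f(x_i); zero because alpha = 0

    def F(j):                              # F_j of (2.7), from the stored xi values
        return np.dot(y * X[:, j], sigmoid(xi))

    def viol(j):                           # viol_j of (2.8)
        Fj = F(j)
        if j == 0:
            return abs(Fj)
        if alpha[j] > 0:
            return abs(gamma - Fj)
        if alpha[j] < 0:
            return abs(gamma + Fj)
        return max(Fj - gamma, -gamma - Fj, 0.0)

    def minimize_coordinate(j, sgn):
        # Minimize W over alpha_j restricted to the side sign(alpha_j) = sgn (for j > 0),
        # or over the whole line for the unpenalized bias (j = 0, sgn = 0), by Newton
        # steps bracketed in [L, H] with bisection as the safeguard.
        a_old, yx = alpha[j], y * X[:, j]
        def d(a):                          # one-sided derivative of W w.r.t. alpha_j
            return gamma * sgn - np.dot(yx, sigmoid(xi + (a_old - a) * yx))
        def d2(a):                         # curvature, used for the Newton step
            s = sigmoid(xi + (a_old - a) * yx)
            return np.dot(X[:, j] ** 2, s * (1.0 - s)) + 1e-12
        if j > 0 and sgn * d(0.0) >= 0.0:
            return 0.0                     # the minimum of this piece sits at the kink
        L = H = a_old if (j == 0 or a_old * sgn > 0) else 0.0
        step = 1.0
        while d(L) > 0.0:                  # widen downwards until d(L) <= 0
            L -= step
            step *= 2.0
        step = 1.0
        while d(H) < 0.0:                  # widen upwards until d(H) >= 0
            H += step
            step *= 2.0
        a = 0.5 * (L + H)
        for _ in range(100):
            a_newton = a - d(a) / d2(a)
            a = a_newton if L < a_newton < H else 0.5 * (L + H)   # reject and bisect
            if d(a) > 0.0:
                H = a
            else:
                L = a
            if H - L < 1e-10:
                break
        return a

    def optimize(j):                       # 'Optimize W w.r.t. alpha_v' of Algorithm 1
        if j == 0:
            a_new = minimize_coordinate(0, 0.0)
        else:
            sgn = np.sign(alpha[j]) if alpha[j] != 0 else (1.0 if F(j) > gamma else -1.0)
            a_new = minimize_coordinate(j, sgn)
            if a_new == 0.0:               # the variable hit 0: examine the other side
                a_new = minimize_coordinate(j, -sgn)
        xi[:] = xi + (alpha[j] - a_new) * y * X[:, j]             # update rule (3.1)
        alpha[j] = a_new

    for _ in range(max_outer):                                    # type I loop over I_z
        I_z = [j for j in range(1, n) if alpha[j] == 0]
        if not I_z or viol(max(I_z, key=viol)) <= tau:
            return alpha
        v = max(I_z, key=viol)
        for _ in range(100 * n):                                  # type II loop over I_nz
            optimize(v)
            I_nz = [0] + [j for j in range(1, n) if alpha[j] != 0]
            v = max(I_nz, key=viol)
            if viol(v) <= tau:
                break
    return alpha
```

On a standardized data matrix with the bias column prepended (see the earlier augment sketch), solve_sparse_logreg(X, y, gamma) would return a weight vector whose non-zero entries indicate the selected genes.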
FEATURE SELECTION AND CLASSIFIER DESIGN

Before we discuss the procedure for feature selection it is important to note that the solution of (2.6) is useful for the tasks of feature selection as well as classifier design. Since (2.6) automatically introduces sparseness into the model, the non-zero αs in the final solution of (2.6) help us in deciding the relevant features for classification. To identify the relevant features, we proceed as follows.

To design a classifier and identify the relevant features using (2.6), it is important to find the appropriate value of γ in (2.6). This can be done using techniques like k-fold cross-validation. For this purpose, the entire set of training samples is divided into k segments. For a given value of γ, the problem (2.6) is solved k times, each time using a version of the training set in which one of the segments is omitted. Each of these k classifiers is then tested on the samples from the segment which was omitted during training, and the 'validation error' is found by averaging the test set results over all the k classifiers. This procedure is repeated for different values of γ and the optimal value γ* is chosen to be the one which gives minimum validation error. The final classifier is then obtained by solving (2.6) using all the training samples and setting γ to γ*.

For a given value of γ, each of the k classifiers uses a set of genes (corresponding to non-zero αs) for classification. We assign a count of one to each gene if it is selected by a classifier at γ*. Since there are k different classifiers designed at γ*, a gene can get a maximum count of k, meaning that it is used in all the k classifiers. To avoid any bias arising from a particular k-fold split of the training data, the above procedure is repeated 100 times, each time splitting the data differently into k folds and doing the cross-validation. The counts assigned to a gene over the 100 different experiments are then added to give a relevance count for every gene. Thus a gene can have a maximum relevance count of 100k since the k-fold cross-validation was repeated 100 times.
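A sketch of the γ-tuning and relevance-count bookkeeping described above. It reuses solve_sparse_logreg and predict from the earlier sketches; the splitting scheme, the random seed and the γ grid (gamma_grid in the closing comment) are illustrative choices, not part of the paper.

```python
import numpy as np

def kfold_indices(m, k, rng):
    """Random split of {0, ..., m-1} into k roughly equal folds."""
    return np.array_split(rng.permutation(m), k)

def validation_error(X, y, gamma, k, rng):
    """k-fold validation error of (2.6) at a given gamma."""
    errors = []
    for held_out in kfold_indices(len(y), k, rng):
        train = np.setdiff1d(np.arange(len(y)), held_out)
        alpha = solve_sparse_logreg(X[train], y[train], gamma)
        errors.append(np.mean(predict(X[held_out], alpha) != y[held_out]))
    return float(np.mean(errors))

def relevance_counts(X, y, gamma_star, k=3, repeats=100, seed=0):
    """Repeat the k-fold experiment `repeats` times at gamma*; a gene's relevance count
    is the number of fold classifiers (at most repeats * k) in which it is selected."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    counts = np.zeros(n - 1, dtype=int)            # one count per gene; bias excluded
    for _ in range(repeats):
        for held_out in kfold_indices(m, k, rng):
            train = np.setdiff1d(np.arange(m), held_out)
            alpha = solve_sparse_logreg(X[train], y[train], gamma_star)
            counts += (alpha[1:] != 0)
    return counts

# gamma* is then the grid value with the smallest validation error, for example:
#   rng = np.random.default_rng(0)
#   gamma_star = min(gamma_grid, key=lambda g: validation_error(X, y, g, 3, rng))
```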
One of the aims of using the microarray data is to develop a classifier using as few genes as possible. It is therefore important to rank the genes using some method and then select some of these genes in a systematic way so as to design a classifier which generalizes well on unseen data. As the relevance count of a gene reflects its importance in some sense, it is appropriate to rank the genes based on the relevance count. Having decided the order of importance of the genes, the next step is to choose the 'right' set of genes to design a classifier. For this purpose, the following systematic method can be used.

Let S denote the set of features being analysed. To examine the performance of these features, the k-fold cross-validation experiment [on (2.6)] is repeated 100 times using only the features in the set S. To avoid any bias towards a particular value of γ, the validation error at every value of γ is averaged over those 100 experiments. The minimum of these validation error values over the different choices of γ then gives the average validation error for the set S. Ideally, one would like to identify the small set of features for which the average validation error is minimum. For this purpose, the top ranked genes (ranked according to relevance count) are added to the set S one by one, and the average validation error is calculated. The procedure is continued until adding the next top ranked gene to the set S no longer reduces the average validation error significantly (a sketch of this selection loop is given at the end of this section). This final set S of genes is then used along with all the training examples to design a classifier for diagnostic purposes.

Note that this final set of genes is derived from all the training samples available. Typically for the microarray data set, the training set size is very small and there are no test samples available. Therefore, one needs to use the entire training data for classifier design. The average validation error mentioned earlier uses the entire training set for gene selection and so it is not an 'unbiased estimate'. To examine how good the feature selection procedure is, it becomes necessary to give an unbiased estimate of this procedure. This estimate can be obtained using the techniques of external cross-validation suggested in (Ambroise and McLachlan, 2002). For this purpose, the training data is split into P different segments (P = 10 is a good choice). The gene selection procedure described above is executed P times, each time using the training samples derived by omitting one of the P segments and testing the final classifier (obtained with the reduced set of genes) on the samples from the segment which was omitted during training. This procedure gives an unbiased estimate of the performance of the gene selection method described earlier. We will call this estimate the average test error. As one might expect, the average test error is usually higher than the average validation error.¹

¹This issue is particularly important for microarray data since the number of examples is usually small.
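The incremental selection loop sketched below grows the feature set gene by gene, as described above. Here avg_validation_error stands for a callable computing the average validation error of a candidate gene set (the minimum over γ of the 100-run validation error restricted to that set); both it and the min_gain threshold, which stands in for "does not reduce the error significantly", are assumptions of this sketch.

```python
def select_genes(ranked_genes, avg_validation_error, min_gain=0.0):
    """Add top-ranked genes (highest relevance count first) one by one and stop as soon
    as the next gene fails to lower the average validation error by at least min_gain."""
    S = [ranked_genes[0]]
    best = avg_validation_error(S)
    for gene in ranked_genes[1:]:
        err = avg_validation_error(S + [gene])
        if best - err < min_gain:
            break
        S.append(gene)
        best = err
    return S, best
```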
NUMERICAL EXPERIMENTS

In this section we report the test performance of the proposed gene selection algorithm on some publicly available data sets: colon cancer and breast cancer. As the problem formulation (2.6) and the one used in (Roth, 2002) are the same, the results of different algorithms for the same formulation are expected to be close. The algorithm for the solution of (2.6) and the actual method of selecting the relevant genes used by Roth (2002), however, are quite different from our method described in the previous section.

For the colon cancer data set (Alon et al., 1999) the task is to distinguish tumour from normal tissue using microarray data with 2000 features per sample. The original data consisted of 22 normal and 40 cancer tissues. This dataset is available at http://lara.enm.bris.ac.uk/colin. We also evaluated our algorithm on the breast cancer data set (West et al., 2001). The original data set consisted of 49 tumour samples. The number of genes representing each sample is 7129. The aim is to classify these tumour samples into estrogen receptor-positive (ER+) and estrogen receptor-negative (ER−). West et al. (2001) observed that in five tumour sample cases, the classification results using immunohistochemistry and protein immunoblotting assay conflicted. Therefore, we decided to exclude these five samples from our training set. This data set is available at the following site: http://mgm.duke.edu/genome/dna_micro/work/.

Each of these data sets was pre-processed using the following procedure. The microarray data set was arranged as a matrix with m rows and N columns. Each row of this matrix was standardized to have mean zero and unit variance. Finally, each column of the consequent matrix was standardized to have mean zero and unit variance. This transformed data was then used for all the experiments.
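A minimal sketch of the stated pre-processing, assuming the raw data matrix has m sample rows and N gene columns; the small eps guard against zero variance is an addition for the example and is not mentioned in the paper.

```python
import numpy as np

def standardize(data, eps=1e-12):
    """Standardize every row to zero mean and unit variance, then every column."""
    data = np.asarray(data, dtype=float)
    data = (data - data.mean(axis=1, keepdims=True)) / (data.std(axis=1, keepdims=True) + eps)
    data = (data - data.mean(axis=0, keepdims=True)) / (data.std(axis=0, keepdims=True) + eps)
    return data
```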
The results of the gene selection algorithm on these different data sets are reported in Figures 1 and 2 and Tables 1 and 2. As discussed in the previous section, the relevance count of every gene was obtained by repeating the k-fold cross-validation procedure 100 times, where k was set to 3. Tables 1 and 2 denote the important genes used in the classifier design for each of the data sets along with their relevance counts.

[Figure 1 not reproduced.] Fig. 1. Variation of average validation error on the colon cancer data set, shown against the number of genes used in the feature set. The average validation error reached the minimum when eight genes, listed in Table 1, were used.

[Figure 2 not reproduced.] Fig. 2. Variation of average validation error on the breast cancer data set, shown against the number of genes used in the feature set. The average validation error reached the minimum when six genes were used. These genes are listed in Table 2.

Table 1. Selected top eight genes with their relevance counts for the colon cancer data set

Sr. no.  Gene annotation                                                     Relevance count
1        Homo sapiens mRNA for GCAP-II/uroguanylin precursor                 229
2        COLLAGEN ALPHA 2(XI) CHAIN (H.sapiens)                              165
3        MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gallus)                       151
4        PLACENTAL FOLATE TRANSPORTER (H.sapiens)                            145
5        ATP SYNTHASE COUPLING FACTOR 6, MITOCHONDRIAL PRECURSOR (HUMAN)     124
6        GELSOLIN PRECURSOR, PLASMA (HUMAN)                                  117
7        Human mRNA for ORF, complete cds                                    107
8        60S RIBOSOMAL PROTEIN L24 (Arabidopsis thaliana)                    100

Table 2. Selected top six genes with their relevance counts for the breast cancer data set

Sr. no.  Gene annotation                                                     Relevance count
1        Human splicing factor SRp40-3 mRNA, complete cds                    171
2        H.sapiens mRNA for cathepsin C                                      151
3        Human mRNA for KIAA0182 gene, partial cds                           133
4        Y box binding protein-1 (YB-1) mRNA                                 130
5        Human transforming growth factor-beta 3 (TGF-beta3) mRNA, complete cds  127
6        Human microtubule-associated protein tau mRNA, complete cds         115

For the colon cancer data set, the top three ranked genes are same as those obtained in (Li et al., 2002), although the gene selection procedure adopted there was different than the one we used. The main reason for this is that the classification hypothesis need not be unique as the samples in gene expression data typically lie in a high-dimensional space. Moreover, if the two genes are co-regulated then it is possible that one of them might get selected during the gene selection process.

Some comments about the genes selected for the breast cancer data set (shown in Table 2) are worthy of mention. Cathepsins have been shown to be good markers in cancer (Kos and Schweiger, 2002). Recently, it has been shown that YB-1, also called Nuclease-sensitive element-binding protein 1 (NSEP1), is elevated in breast cancer; see the following site: http://www.wfubmc.edu/pathology/faculty/berquin.htm. Moreover, microtubule associated protein is the cell cycle control gene, and therefore is associated with cancer. TGF-β has been shown to be a critical cytokine in the development and progression of many epithelial malignancies, including breast cancer (Arrick and Derynck, 1996).

The generalization behaviour of our algorithm was examined using the technique of external cross-validation described at the end of the previous section. The average test error for each of the data sets is reported in Table 3. We also tested our algorithm on some other publicly available data sets. See the supplementary information for more details.

Table 3. Average test error for different data sets

Data set        Average test error
Colon cancer    0.177
Breast cancer   0.181

CONCLUSION

We have presented a simple and efficient training algorithm for feature selection and classification using sparse logistic regression. The algorithm is robust in the sense that on the data sets we have tried there was not even a single case of failure. It is useful, especially for problems where the number of features is much higher than the number of training set examples. The algorithm, when applied to gene microarray data sets, gives results comparable with other methods. It can be easily extended to the case where the aim is to identify the genes which discriminate between different kinds of tumour tissues (multi-category classification problem). The proposed algorithm can also be used for solving other problem formulations where strictly convex cost functions are used, e.g. the LASSO problem formulation where the quadratic loss function is used. In fact, in the case where the quadratic loss function is used, the one-dimensional optimization problem becomes very simple since its solution can be reached in one Newton step. We are currently investigating the extension of the proposed algorithm to sparse versions of kernelized formulations like LS-SVM (Suykens and Vandewalle, 1999) and kLOGREG (Roth, 2001, http://www.informatik.uni-bonn.de/~roth/dagm01.pdf).
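For the quadratic-loss (LASSO) case mentioned above, the one-variable subproblem is a quadratic plus an absolute value, so its exact minimizer has a closed form (a soft-thresholding step, equivalent to a single Newton step on each sign piece). The derivation below is standard and is not taken from the paper; it assumes the penalized objective γ Σ_j |α_j| + Σ_i (y_i − f(x_i))² with the bias left unpenalized.

```python
import numpy as np

def lasso_coordinate_update(j, alpha, X, y, gamma):
    """Exact minimizer over alpha_j alone of gamma * sum|alpha| + sum_i (y_i - f(x_i))^2."""
    r = y - X @ alpha + alpha[j] * X[:, j]      # residuals with alpha_j's contribution removed
    c = np.dot(X[:, j], r)
    s = np.dot(X[:, j], X[:, j])
    if j == 0:                                  # unpenalized bias term
        return c / s
    return np.sign(c) * max(abs(c) - gamma / 2.0, 0.0) / s
```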
ACKNOWLEDGEMENTS

The work of the first author was partially supported by the Genome Institute of Singapore. The authors thank Phil Long for his valuable comments and suggestions on the draft of this paper.

REFERENCES

Alon,U., Barkai,N., Notterman,D., Gish,K., Ybarra,S., Mack,D. and Levine,A. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 6745–6750.

Ambroise,C. and McLachlan,G.J. (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl Acad. Sci. USA, 99, 6562–6566.

Arrick,A.B. and Derynck,R. (1996) The biological role of transforming growth factor beta in cancer development. In Waxman,J. (ed.), Molecular Endocrinology of Cancer. Cambridge University Press, New York, USA, pp. 51–78.

Bertsekas,D.P. and Tsitsiklis,J.N. (1989) Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, USA.

Brown,M., Grundy,W., Lin,D., Cristianini,N., Sugnet,C., Furey,T., Ares,M.,Jr and Haussler,D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.

Furey,T., Cristianini,N., Duffy,N., Bednarski,D., Schummer,M. and Haussler,D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.

Guyon,I., Weston,J., Barnhill,S. and Vapnik,V. (2002) Gene selection for cancer classification using support vector machines. Mach. Learning, 46, 389–422.

Kos,J. and Schweiger,A. (2002) Cathepsins and cystatins in extracellular fluids—useful biological markers in cancer. Radiol. Oncol., 36, 176–179.

Li,Y., Campbell,C. and Tipping,M. (2002) Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18, 1332–1339.

Osborne,M., Presnell,B. and Turlach,B. (2000) On the LASSO and its dual. J. Comput. Graphical Stat., 9, 319–337.

Roth,V. (2001) Probabilistic discriminative kernel classifiers for multi-class problems. In Radig,B. and Florczyk,S. (eds), Pattern Recognition—DAGM'01. Springer, pp. 246–253.

Roth,V. (2002) The generalized LASSO: a wrapper approach to gene selection for microarray data. Technical Report IAI-TR-2002-8, University of Bonn, Computer Science III, Bonn, Germany.

Suykens,J. and Vandewalle,J. (1999) Least squares support vector machine classifiers. Neural Process. Lett., 9, 293–300.

Tibshirani,R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58, 267–288.

West,M., Blanchette,C., Dressman,H., Huang,E., Ishida,S., Spang,R., Zuzan,H., Olson,J.A.,Jr, Marks,J.R. and Nevins,J.R. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA, 98, 11462–11467.

Weston,J., Mukherjee,S., Chapelle,O., Pontil,M., Poggio,T. and Vapnik,V. (2001) Feature selection for SVMs. Adv. Neural Inform. Process. Syst., 13, 668–674.

Vol. 19 no. 17 2003, pages 2246–2253 BIOINFORMATICS DOI: 10.1093/bioinformatics/btg308 A simple and efficient algorithm for gene selection using sparse logistic regression 1 2,∗ S. K. Shevade and S. S. Keerthi Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India and Control Division, Department of Mechanical Engineering, National University of Singapore, Singapore 117576, Republic of Singapore Received on November 30, 2002; revised on March 12, 2003; accepted on May 29, 2003 ABSTRACT learning techniques like Parzen windows, Fisher’s linear dis- Motivation: This paper gives a new and efficient algorithm criminant, and decision trees have been applied to solve this for the sparse logistic regression problem. The proposed classification problem (Brown et al., 2000). Since the data algorithm is based on the Gauss–Seidel method and is dimension is very large, Support Vector Machines (SVMs) asymptotically convergent. It is simple and extremely easy have been found to be very useful for this classification prob- to implement; it neither uses any sophisticated mathematical lem (Brown et al., 2000; Furey et al., 2000). Apart from the programming software nor needs any matrix operations. It can classification task, it is also important to remove the irrelevant be applied to a variety of real-world problems like identifying genes from the data so as to simplify the inference. One of marker genes and building a classifier in the context of cancer the aims of the microarray data experiments is to identify a diagnosis using microarray data. small subset of informative genes, called marker genes, which Results: The gene selection method suggested in this paper discriminate between the tumour and the normal tissues, or is demonstrated on two real-world data sets and the results between different kinds of tumour tissues. Identification of were found to be consistent with the literature. marker genes from a small set of samples becomes a difficult Availability: The implementation of this algorithm is task and there is a need to develop an efficient classification available at the site http://guppy.mpe.nus.edu.sg/ mpessk/ algorithm for such an application where the number of fea- SparseLOGREG.shtml tures is much larger than the number of samples. Furey et al. Contact: mpessk@nus.edu.sg (2000) used Fisher score to perform feature selection prior Supplementary Information: Supplementary material is to training. This method was compared against feature selec- available at the site http://guppy.mpe.nus.edu.sg/ mpessk/ tion using radius-margin bound for SVMs by Weston et al. SparseLOGREG.shtml (2001). Guyon et al. (2002) introduced an algorithm, called Recursive Feature Elimination (RFE), where features are suc- cessively eliminated during training of a sequence of SVM INTRODUCTION classifiers. Li et al. (2002) introduced two Bayesian classi- Logistic regression is a powerful discriminative method. It fication algorithms which also incorporate automatic feature also has a direct probabilistic interpretation built into its selection. In their work, the Bayesian technique of automatic model. One of the advantages of logistic regression is that relevance determination (ARD) was used to perform feature it provides the user with explicit probabilities of classification selection. apart from the class label information. Moreover, it can be The main aim of this paper is to develop an efficient easily extended to the multi-category classification problem. 
algorithm for the sparse logistic regression problem. The pro- In this paper the problem of logistic regression is addressed posed algorithm is very much in the spirit of the Gauss–Seidel with the application of gene expression data particularly method (Bertsekas and Tsitsiklis, 1989) for solving uncon- in mind. strained optimization problems. The algorithm is simple, The development of microarray technology has created a extremely easy to implement, and can be used for feature selec- wealth of gene expression data. Typically, these data sets tion as well as classifier design, especially for the microarray involve thousands of genes (features) while the number of data set. It does not need any sophisticated LP or QP solver, tissue samples is in the range 10–100. The objective is to nor does it involve any matrix operations which is a drawback design a classifier which separates the tissue samples into pre- of some of the previously suggested methods. Our algorithm defined classes (e.g. tumour and normal). Different machine also has the advantage of not using any extra matrix storage and therefore it can be easily used in cases where the number of features used is very high. To whom correspondence should be addressed. 2246 Published by Oxford University Press 2003. Gene selection using sparse logistic regression The remaining sections of the paper contain the follow- interested in finding a small subset of genes which enables us ing: description of the problem formulation and the optimality to do good classification it is appropriate to use a sparse model conditions of the problem; details of the proposed algorithm; of the regression function, that is, in the final function f(x) the feature selection and the classifier design procedure; in (2.1), only a small number of αs corresponding to relevant computational experiments and, concluding remarks. features will have non-zero values. Therefore, we solve the sparse logistic regression problem, which can be formulated as the following optimization problem: PROBLEM FORMULATION AND OPTIMALITY min ρ = g − y f(x ) i i In this paper we focus on the two category classification prob- α i lem. The ideas discussed here can be easily extended to the s.t. |α |≤ t (2.2) multi-category problem and those details will be addressed in j =1 a future paper. We will now describe the problem formulation where t ≥ 0 is a parameter that is tuned using techniques such using microarray tissue classification as an example. Given a as cross-validation and the function g is given by: microarray data set containing m tissue samples, with each tis- sue sample represented by the expression levels of N genes, g(ξ ) = log(1 + e ) (2.3) the goal is to design a classifier which separates the tissue samples into two predefined classes. Let {(x ˜ , y )} denote i i i=1 which is the negative log-likelihood function associated with the training set, where x ˜ is the ith input pattern, x ˜ ∈ R and i i the probabilistic model y is the corresponding target value; y = 1 means x ˜ is in class i i i 1 and y =−1 means x ˜ is in class 2. Note that the vector x ˜ i i i Prob(y|x) = (2.4) contains the expression values of N genes for the ith tissue −y ·f(x) 1 + e sample and x ˜ denotes the expression value of gene j for the ij ith tissue sample. To conveniently take into account the bias Once the αs are determined by solving (2.2), the class of the T T term in the regression function, let us define x = (1, x ˜ ). test sample, x ¯,is +1if f(x) ¯ > 0 and −1 otherwise. 
Let us specify the functional form of the regression function The formulation (2.2) was first suggested by Tibshirani f(x) as a linear model, (1996) as an extension of the ‘LASSO’ (Least Absolute Shrinkage and Selection Operator) method for the linear regression problem. The LASSO estimator for the regression f(x ) = α x (2.1) i j ij problem is obtained by solving the following optimization j =0 problem: where α denotes the weight vector. We address the first ele- ment (corresponding to the bias term) of the vectors x and α min y − f(x ) s.t. |α |≤ t (2.5) i i j as the ‘zeroth element’. To avoid notational clutter, through- i j =1 out this paper we use the index i to denote the elements of the training set (i = 1, ... , m) and the index j to repres- In Tibshirani (1996), two iterative algorithms were sugges- ent the elements of x and α (j = 0, ... , N , unless specified ted for the solution of (2.5). One of these algorithms treats explicitly). (2.5) as a problem in N + 1 variables and 2 constraints. In Typically for the microarray data, N  m. That is, the each iteration it solves a linear least squares problem subject training samples lie in a very high-dimensional space. There- to a subset of constraints E. Once a solution is found in an fore, linear separability of two classes may not be a problem. iteration, a constraint violated by the current solution is added Hence, it is appropriate to consider using a linear classi- to the set E and a new feasible point is found. The linear fier in the input space directly without transforming it into least squares problem is then resolved subject to the new set a higher dimensional feature space. Furthermore, for such of constraints E. This procedure is repeated until the optim- data sets it is important to note that finding a linear classi- ality conditions are satisfied. Note that, during each iteration fier in the input space involves estimation of a large number one constraint is added to the set E and there exist only a of parameters [in Equation (2.1)] using a very small number finite (2 ) number of constraints. Therefore, this algorithm of training examples. It is therefore possible to derive differ- will converge in a finite number of steps. Use of active set ent linear classifiers for the same problem as the problem is method was also suggested for the solution of (2.5) where underdetermined. How the proposed algorithm takes care of every time the set E contains only the set of equality con- the effect of multiple solutions is discussed in the section on straints satisfied by the current point. The second algorithm, feature selection and classifier design. suggested in Tibshirani (1996), states (2.5) as a problem in For diagnostic purposes, it is important to have a classifier 2N + 1 variables and 2N + 1 constraints, which can be solved which uses as few features (genes) as possible. Since we are using any quadratic programming solver. 2247 S.K.Shevade and S.S.Keerthi In Osborne et al. (2000), the problem (2.5) was treated as a keeping the other variables fixed, and the process is repeated convex programming problem and an efficient algorithm for till the optimality conditions are satisfied. Asymptotic conver- computing the linear model coefficients was given using dual- gence of this method for a more general version of the problem ity theory. This algorithm was stated only for the cases where has been proved in (Bertsekas and Tsitsiklis, 1989; Chapter 3, the objective function is a quadratic loss function. These ideas, Proposition 4.1). 
It should be noted that strict convexity of g however, can be extended to problems with non-quadratic plays an important role in this proof. objective functions, for example (2.2), by optimizing the We now derive the first order optimality conditions for (2.6). quadratic approximation of the cost function in (2.2) at any Since (2.6) is a convex programming problem these condi- given point and repeating this procedure until convergence. tions are both necessary and sufficient for optimality. Let us Recently, Roth (2002) extended the ideas given in Osborne define et al. (2000) to a more general class of cost functions includ- ξ =−y f(x ) i i i ing the one given in (2.2). The proof of convergence of this algorithm to a general class of loss functions is also given (2.7) F = y x j i ij therein. In this method, the problem (2.2) is transformed to a 1 + e L constrained least squares problem which can then be solved using iteratively re-weighted least squares (IRLS) method on The first order optimality conditions for the problem (2.6) can a transformed set of variables. To avoid solving large dimen- be easily derived from geometry: (i) since W is differentiable sional constrained least squares problems, especially for the with respect to α , ∂W /∂α = 0; (ii) if j> 0 and α = 0, 0 0 j microarray data, Roth (2002) uses a wrapper approach where then, since W is differentiable with respect to α at such a α , j j a maximum violating variable is added to a small variable we have ∂W /∂α = 0 and (iii) if j> 0 and α = 0, then, j j set, X, and the constrained least squares problem is solved since W is only directionally differentiable with respect to α with respect to the variables in the set X only. This procedure at α = 0, we require the right side derivative of W with is repeated till the optimality conditions are satisfied. respect to α to be non-negative and the left side derivative All of the above methods work well and are very effi- to be non-positive. These conditions can be rewritten as the cient; but they rely on mathematical programming solvers following algebraic conditions. that require detailed matrix operations. The main contribu- tion of this paper is to utilize the special structure of (2.2) F =0if j = 0 and devise a simple algorithm which is extremely easy to F = γ if α > 0, j> 0 j j implement; it neither uses any mathematical programming F =−γ if α < 0, j> 0 package nor needs any matrix operations. The simplicity j j of our algorithm is evident from the pseudocode. See the − γ ≤ F ≤ γ if α = 0, j> 0 j j supplementary information for details. Using optimality conditions it can be shown that there exists Thus, if we define a γ> 0 for which, (2.2) is equivalent to the following viol =|F | if j = 0 j j unconstrained optimization problem: =|γ − F | if α > 0, j> 0 N j j (2.8) min W = γ |α |+ g − y f(x ) (2.6) j i i =|γ + F | if α < 0, j> 0 j j j =1 i = ψ if α = 0, j> 0 j j This means that the family of classifiers obtained by varying t where ψ = max(F − γ , −γ − F ,0), then the first order in (2.2) and the family obtained by varying γ in (2.6) are j j j optimality conditions can be compactly written as the same. The addition of the penalty term, |α |, to the original objective function can be seen as putting a Laplacian viol = 0 ∀j (2.9) prior over the vector α. The above formulation thus promotes the choice of a sparse model, where the final αs are either Since, in asymptotically convergent procedures it is hard to large or zero. 
In this paper, we shall concentrate on the model achieve exact optimality in finite time, it is usual to stop when in (2.6). the optimality conditions are satisfied up to some tolerance, τ . We solve (2.6) directly without converting it into its For our purpose, we will take it that any algorithm used to dual. The problem (2.6) can be solved using an appro- solve (2.6) will terminate successfully when priate unconstrained nonlinear programming technique. We choose to basically use the Gauss–Seidel method which viol ≤ τ ∀j . (2.10) uses coordinate-wise descent approach, mainly because the method is extremely easy to implement while also being very This can be used as a stopping condition for (2.6). Throughout, efficient. In this method, one variable is optimized at a time we will refer to optimality as optimality with tolerance. 2248 Gene selection using sparse logistic regression In the next section, we describe the actual details of our In this algorithm, all αs are set to zero initially which implies algorithm for solving the problem (2.6). that only the set I exists. The algorithm can be thought of as a two-loop approach. The type I loop runs over the variables in the set I to choose the maximum violator, v. In the type II A SIMPLE ALGORITHM FOR SPARSE loop, W is optimized with respect to α , thus modifying the set LOGISTIC REGRESSION I and the maximum violator in the set I is then found. This nz nz procedure in the type II loop is then repeated until no violators The algorithmic ideas discussed here are based on the Gauss– are found in the set I . The algorithm thus alternates between nz Seidel method used for solving unconstrained optimization the type I and type II loops until no violators exist in either of problem. In solving (2.6) using Gauss–Seidel method, one the sets, I and I . z nz variable α which violates the optimality conditions is chosen Once the maximum violating variable in a given set is and the optimization subproblem is solved with respect to this chosen, some non-linear optimization technique needs to be variable α alone, keeping the other αs fixed. This procedure used to solve the unconstrained optimization problem (2.6) is repeated as long as there exists a variable which violates with respect to a single variable. Note that the objective func- the optimality conditions. The method terminates when the tion is convex. A combination of the bisection method and optimality conditions (2.9) are satisfied. Note that the object- Newton method was used in our algorithm. In this method, ive function in (2.6) is strictly convex and it will strictly two points L and H for which the derivative of the objective decrease at every step when the optimization subproblem function has opposite signs are chosen. This ensures that the is solved with respect to one variable. Repeated application root always lies in a bracketed interval, [L, H ]. The Newton of this procedure for the problem (2.6) will make sure that method is tried first, and a check is made in the algorithm to the algorithm will converge asymptotically (Bertsekas and make sure the iteration obtains a solution in this interval of Tsitsiklis, 1989) to the solution. Of course, if the stopping interest. If the Newton method takes a step outside the interval condition (2.10) is used then the algorithm will terminate in we do not accept this next point, and instead resort to the bisec- finite time. tion method. 
In the bisection method, the next point is chosen Foragiven α, let us define the following sets: I ={j : to be the midpoint of the given interval. Depending upon the α = 0,j> 0}; and I ={0}∪{j : α = 0,j> 0}. j nz j sign of the derivative of the objective function at this point the Also, let I = I ∪ I . The key to efficiently solving (2.6) z nz new interval is decided. Now, the abovementioned procedure using Gauss–Seidel method is the selection of the variable α starting from the trying of the Newton step is repeated. Since in each iteration with respect to which the objective function it is always ensured that the root lies in the interval of interest, is optimized. At optimality, it is expected that the resulting the method is guaranteed to reach the solution. model is sparse, that is, there will be only a few weights It is important to note that the objective function, W , has α with non-zero values. It is important that the algorithm different right-hand and left-hand derivatives with respect to spends most of its time adjusting the non-zero αs and mak- α at α = 0. Therefore, if the non-zero variable attains a j j ing the subset I self-consistent as far as optimality of the nz value of 0 when our algorithm is executed it is necessary to see variables in this subset is concerned. Therefore, it is appro- whether further progress in the objective function can be made priate to find the maximum violating variable, say α in by altering the same variable (but now in the opposite direc- the set I and then solve (2.6) using the variables in the set tion). This will make sure that after successful termination of I ∪{v} only, until the optimality conditions for these vari- nz the type II loop the optimality conditions for the variables in ables are satisfied. This procedure can be repeated till no I are satisfied. nz violator remains in the set I , at which point the algorithm Note from (2.7) and (2.8) that for checking the optimality can be terminated. This procedure can be best explained using conditions for each variable it is necessary to calculate ξ for Algorithm 1. all the examples. Since this calculation is done repeatedly, the efficient implementation of the algorithm requires that ξ Algorithm 1 is stored in the memory for all the examples. After a single Input Training Examples variable, say α , is updated it becomes necessary to update Initialize αsto0. ξ ∀i. This can be done by using the following simple equation: while Optimality Violator exists in I new old old new ξ = ξ + (α − α )y x (3.1) Find the maximum violator, v,in I i ij z i i j j repeat Let us now comment on the speed of our algorithm for Optimize W w.r.t. α (2.6). γ in (2.6) is a hyperparameter and one cannot identify Find the maximum violator, v,in I nz any nominal value for it as a starting point. For a given value until No violator exists in I nz of γ and a data set containing 72 samples with 7129 fea- end while tures per sample, it took about 20 s to solve (2.6) on a SUN UltraSparc III CPU running on 750 MHz machine. Once the Output A set of αs for the function in (2.1) 2249 S.K.Shevade and S.S.Keerthi solution at a certain value of γ is found, the solution at another step is to choose the ‘right’ set of genes to design a classi- close-by value of γ can be efficiently found using the previ- fier. For this purpose, the following systematic method can ous solution as the starting point. This idea is very useful be used. 
FEATURE SELECTION AND CLASSIFIER DESIGN

Before we discuss the procedure for feature selection, it is important to note that the solution of (2.6) is useful for the tasks of feature selection as well as classifier design. Since (2.6) automatically introduces sparseness into the model, the non-zero αs in the final solution of (2.6) help us in deciding the relevant features for classification. To identify the relevant features, we proceed as follows.

To design a classifier and identify the relevant features using (2.6), it is important to find an appropriate value of γ in (2.6). This can be done using techniques like k-fold cross-validation. For this purpose, the entire set of training samples is divided into k segments. For a given value of γ, the problem (2.6) is solved k times, each time using a version of the training set in which one of the segments is omitted. Each of these k classifiers is then tested on the samples from the segment which was omitted during training, and the 'validation error' is found by averaging the test set results over all the k classifiers. This procedure is repeated for different values of γ, and the optimal value γ* is chosen to be the one which gives the minimum validation error. The final classifier is then obtained by solving (2.6) using all the training samples and setting γ to γ*.
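A minimal sketch of this tuning loop is given below. The function fit_sparse_logreg is a hypothetical placeholder for a solver of (2.6) that returns the vector of αs (with the bias at index 0), the labels are assumed to be coded as ±1, and the grid of γ values and the random seed are likewise our own choices for illustration.

```python
import numpy as np

def select_gamma_by_cv(X, y, gammas, fit_sparse_logreg, k=3, seed=0):
    """k-fold cross-validation over a grid of gamma values (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    cv_error = {}
    for gamma in gammas:
        fold_errors = []
        for f in range(k):
            val_idx = folds[f]
            trn_idx = np.hstack([folds[i] for i in range(k) if i != f])
            alpha = fit_sparse_logreg(X[trn_idx], y[trn_idx], gamma)
            margin = alpha[0] + X[val_idx] @ alpha[1:]   # bias assumed at index 0
            pred = np.where(margin >= 0, 1, -1)          # labels assumed in {-1, +1}
            fold_errors.append(np.mean(pred != y[val_idx]))
        cv_error[gamma] = np.mean(fold_errors)
    gamma_star = min(cv_error, key=cv_error.get)         # minimizer of validation error
    return gamma_star, cv_error
```

Warm-starting the solver at neighbouring values of γ, as noted at the end of the previous section, makes this grid search considerably cheaper.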
For a given value of γ*, each of the k classifiers uses a set of genes (corresponding to non-zero αs) for classification. We assign a count of one to each gene if it is selected by a classifier at γ*. Since there are k different classifiers designed at γ*, a gene can get a maximum count of k, meaning that it is used by all k classifiers. To avoid any bias arising from a particular k-fold split of the training data, the above procedure is repeated 100 times, each time splitting the data differently into k folds and doing the cross-validation. The counts assigned to a gene in the 100 different experiments are then added to give a relevance count for every gene. Thus a gene can have a maximum relevance count of 100k, since the k-fold cross-validation was repeated 100 times.

One of the aims of using the microarray data is to develop a classifier using as few genes as possible. It is therefore important to rank the genes using some method and then select some of these genes in a systematic way, so as to design a classifier which generalizes well on unseen data. (This issue is particularly important for microarray data, since the number of examples is usually small.) As the relevance count of a gene reflects its importance in some sense, it is appropriate to rank the genes based on the relevance count. Having decided the order of importance of the genes, the next step is to choose the 'right' set of genes to design a classifier. For this purpose, the following systematic method can be used.

Let S denote the set of features being analysed. To examine the performance of these features, the k-fold cross-validation experiment on (2.6) is repeated 100 times using only the features in the set S. To avoid any bias towards a particular value of γ, the validation error at every value of γ is averaged over those 100 experiments. The minimum of these averaged validation error values over the different choices of γ then gives the average validation error for the set S. Ideally, one would like to identify a small set of features for which the average validation error is minimum. For this purpose, the top ranked genes (ranked according to relevance count) are added to the set S one by one, and the average validation error is calculated after each addition. Genes are added until adding the next top ranked gene no longer reduces the average validation error significantly. This final set S of genes is then used, along with all the training examples, to design a classifier for diagnostic purposes.

Note that this final set of genes is derived from all the training samples available. Typically, for a microarray data set, the training set size is very small and there are no test samples available; therefore, one needs to use the entire training data for classifier design. The average validation error mentioned earlier uses the entire training set for gene selection, and so it is not an 'unbiased estimate'. To examine how good the feature selection procedure is, it becomes necessary to give an unbiased estimate of its performance. This estimate can be obtained using the technique of external cross-validation suggested in Ambroise and McLachlan (2002). For this purpose, the training data is split into P different segments (P = 10 is a good choice). The gene selection procedure described above is executed P times, each time using the training samples obtained by omitting one of the P segments and testing the final classifier (obtained with the reduced set of genes) on the samples from the segment which was omitted during training. This procedure gives an unbiased estimate of the performance of the gene selection method described earlier. We will call this estimate the average test error. As one might expect, the average test error is usually higher than the average validation error.
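The incremental growth of the set S described above can be sketched as follows. Here avg_validation_error is a hypothetical placeholder for the average validation error of a candidate gene set (the minimum over γ of the validation error averaged over 100 repeated k-fold runs), ranked_genes is the list of gene indices sorted by decreasing relevance count, and the threshold for a 'significant' reduction is our own illustrative choice.

```python
def grow_gene_set(ranked_genes, avg_validation_error, min_improvement=0.005):
    """Add top-ranked genes one by one until the error stops improving (sketch)."""
    S = [ranked_genes[0]]
    best_err = avg_validation_error(S)
    for g in ranked_genes[1:]:
        err = avg_validation_error(S + [g])
        if best_err - err < min_improvement:
            break  # the next gene no longer helps significantly
        S.append(g)
        best_err = err
    return S, best_err
```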
NUMERICAL EXPERIMENTS

In this section we report the test performance of the proposed gene selection algorithm on two publicly available data sets: colon cancer and breast cancer. As the problem formulation (2.6) and the one used in Roth (2002) are the same, the results of different algorithms for the same formulation are expected to be close. The algorithm for the solution of (2.6) and the actual method of selecting the relevant genes used by Roth (2002), however, are quite different from our method described in the previous section.

For the colon cancer data set (Alon et al., 1999) the task is to distinguish tumour from normal tissue using microarray data with 2000 features per sample. The original data consisted of 22 normal and 40 cancer tissues. This data set is available at http://lara.enm.bris.ac.uk/colin.

We also evaluated our algorithm on the breast cancer data set (West et al., 2001). The original data set consisted of 49 tumour samples, each represented by 7129 genes. The aim is to classify these tumour samples into estrogen receptor-positive (ER+) and estrogen receptor-negative (ER−). West et al. (2001) observed that in five tumour sample cases the classification results using immunohistochemistry and protein immunoblotting assay conflicted; we therefore decided to exclude these five samples from our training set. This data set is available at http://mgm.duke.edu/genome/dna_micro/work/.

Each of these data sets was pre-processed using the following procedure. The microarray data set was arranged as a matrix with m rows and N columns. Each row of this matrix was standardized to have mean zero and unit variance. Then each column of the resulting matrix was standardized to have mean zero and unit variance. This transformed data was used for all the experiments.
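A minimal sketch of this standardization, assuming the rows of the matrix are the m samples and the columns the N genes (the row/column convention is not stated explicitly here), and that no row or column is constant:

```python
import numpy as np

def standardize(data):
    """Row-wise, then column-wise standardization to zero mean and unit variance."""
    X = np.asarray(data, dtype=float)
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)  # rows
    X = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)  # columns
    return X
```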
The results of the gene selection algorithm on these data sets are reported in Figures 1 and 2 and Tables 1 and 2. As discussed in the previous section, the relevance count of every gene was obtained by repeating the k-fold cross-validation procedure 100 times, with k set to 3. Tables 1 and 2 list the important genes used in the classifier design for each data set, along with their relevance counts.

Fig. 1. Variation of the average validation error on the colon cancer data set against the number of genes used in the feature set. The average validation error reached its minimum when eight genes, listed in Table 1, were used.

Fig. 2. Variation of the average validation error on the breast cancer data set against the number of genes used in the feature set. The average validation error reached its minimum when six genes, listed in Table 2, were used.

Table 1. Selected top eight genes with their relevance counts for the colon cancer data set

Sr. no.  Gene annotation                                                    Relevance count
1        Homo sapiens mRNA for GCAP-II/uroguanylin precursor                229
2        COLLAGEN ALPHA 2(XI) CHAIN (H.sapiens)                             165
3        MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gallus)                      151
4        PLACENTAL FOLATE TRANSPORTER (H.sapiens)                           145
5        ATP SYNTHASE COUPLING FACTOR 6, MITOCHONDRIAL PRECURSOR (HUMAN)    124
6        GELSOLIN PRECURSOR, PLASMA (HUMAN)                                 117
7        Human mRNA for ORF, complete cds                                   107
8        60S RIBOSOMAL PROTEIN L24 (Arabidopsis thaliana)                   100

Table 2. Selected top six genes with their relevance counts for the breast cancer data set

Sr. no.  Gene annotation                                                    Relevance count
1        Human splicing factor SRp40-3 mRNA, complete cds                   171
2        H.sapiens mRNA for cathepsin C                                     151
3        Human mRNA for KIAA0182 gene, partial cds                          133
4        Y box binding protein-1 (YB-1) mRNA                                130
5        Human transforming growth factor-beta 3 (TGF-beta3) mRNA, complete cds  127
6        Human microtubule-associated protein tau mRNA, complete cds        115

For the colon cancer data set, the top three ranked genes are the same as those obtained in Li et al. (2002), although the gene selection procedure adopted there was different from the one we used. The main reason for this is that the classification hypothesis need not be unique, since the samples in gene expression data typically lie in a high-dimensional space. Moreover, if two genes are co-regulated, then it is possible that only one of them gets selected during the gene selection process.

Some comments about the genes selected for the breast cancer data set (shown in Table 2) are worthy of mention. Cathepsins have been shown to be good markers in cancer (Kos and Schweiger, 2002). Recently, it has been shown that YB-1, also called nuclease-sensitive element-binding protein 1 (NSEP1), is elevated in breast cancer; see http://www.wfubmc.edu/pathology/faculty/berquin.htm. Moreover, microtubule-associated protein is a cell cycle control gene and is therefore associated with cancer. TGF-β has been shown to be a critical cytokine in the development and progression of many epithelial malignancies, including breast cancer (Arrick and Derynck, 1996).

The generalization behaviour of our algorithm was examined using the technique of external cross-validation described at the end of the previous section. The average test error for each of the data sets is reported in Table 3. We also tested our algorithm on some other publicly available data sets; see the supplementary information for more details.

Table 3. Average test error for the different data sets

Data set        Average test error
Colon cancer    0.177
Breast cancer   0.181

CONCLUSION

We have presented a simple and efficient training algorithm for feature selection and classification using sparse logistic regression. The algorithm is robust in the sense that on the data sets we have tried there was not a single case of failure. It is especially useful for problems where the number of features is much larger than the number of training examples. The algorithm, when applied to gene microarray data sets, gives results comparable with other methods. It can easily be extended to the case where the aim is to identify genes which discriminate between different kinds of tumour tissues (the multi-category classification problem). The proposed algorithm can also be used to solve other problem formulations in which strictly convex cost functions are used, e.g. the LASSO formulation, where a quadratic loss function is used. In fact, when the quadratic loss function is used, the one-dimensional optimization problem becomes very simple, since its solution can be reached in one Newton step. We are currently investigating the extension of the proposed algorithm to sparse versions of kernelized formulations like LS-SVM (Suykens and Vandewalle, 1999) and kLOGREG (Roth, 2001, http://www.informatik.uni-bonn.de/ roth/dagm01.pdf).
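As an illustration of the remark above about the quadratic-loss case (and not something taken from the paper), the following sketch shows the closed-form coordinate update for a LASSO-type objective 0.5·||y − Xα||² + γ·Σ|α_j|. With this loss the one-dimensional subproblem is solved exactly in a single soft-thresholding step, and a residual vector can be maintained in the same spirit as the ξ update (3.1).

```python
import numpy as np

def lasso_coordinate_update(X, r, alpha, j, gamma):
    """Exact minimizer of the 1-D LASSO subproblem in a penalized coordinate j.

    Assumed objective (our own, not from the paper):
        0.5 * ||y - X @ alpha||**2 + gamma * sum(abs(alpha[1:]))
    r is the current residual y - X @ alpha and is kept consistent on return.
    """
    xj = X[:, j]
    denom = xj @ xj
    z = alpha[j] + (xj @ r) / denom                        # unpenalized 1-D minimizer
    new = np.sign(z) * max(abs(z) - gamma / denom, 0.0)    # soft thresholding
    r += (alpha[j] - new) * xj                             # update the stored residual
    alpha[j] = new
    return alpha, r
```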
ACKNOWLEDGEMENTS

The work of the first author was partially supported by the Genome Institute of Singapore. The authors thank Phil Long for his valuable comments and suggestions on the draft of this paper.

REFERENCES

Alon,U., Barkai,N., Notterman,D., Gish,K., Ybarra,S., Mack,D. and Levine,A. (1999) Broad patterns of gene expression revealed by clustering analysis of tumour and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 6745–6750.

Ambroise,C. and McLachlan,G.J. (2002) Selection bias in gene extraction on the basis of microarray gene expression data. Proc. Natl Acad. Sci. USA, 99, 6562–6566.

Arrick,A.B. and Derynck,R. (1996) The biological role of transforming growth factor beta in cancer development. In Waxman,J. (ed.), Molecular Endocrinology of Cancer. Cambridge University Press, New York, pp. 51–78.

Bertsekas,D.P. and Tsitsiklis,J.N. (1989) Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ.

Brown,M., Grundy,W., Lin,D., Cristianini,N., Sugnet,C., Furey,T., Ares,M.,Jr and Haussler,D. (2000) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.

Furey,T., Cristianini,N., Duffy,N., Bednarski,D., Schummer,M. and Haussler,D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.

Guyon,I., Weston,J., Barnhill,S. and Vapnik,V. (2002) Gene selection for cancer classification using support vector machines. Mach. Learning, 46, 389–422.

Kos,J. and Schweiger,A. (2002) Cathepsins and cystatins in extracellular fluids—useful biological markers in cancer. Radiol. Oncol., 36, 176–179.

Li,Y., Campbell,C. and Tipping,M. (2002) Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18, 1332–1339.

Osborne,M., Presnell,B. and Turlach,B. (2000) On the LASSO and its dual. J. Comput. Graphical Stat., 9, 319–337.

Roth,V. (2001) Probabilistic discriminative kernel classifiers for multi-class problems. In Radig,B. and Florczyk,S. (eds), Pattern Recognition—DAGM'01. Springer, pp. 246–253.

Roth,V. (2002) The generalized LASSO: a wrapper approach to gene selection for microarray data. Tech. Rep. IAI-TR-2002-8, University of Bonn, Computer Science III, Bonn, Germany.

Suykens,J. and Vandewalle,J. (1999) Least squares support vector machine classifiers. Neural Process. Lett., 9, 293–300.

Tibshirani,R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58, 267–288.

West,M., Blanchette,C., Dressman,H., Huang,E., Ishida,S., Spang,R., Zuzan,H., Olson,A.J.,Jr, Marks,J.R. and Nevins,J.R. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA, 98, 11462–11467.

Weston,J., Mukherjee,S., Chapelle,O., Pontil,M., Poggio,T. and Vapnik,V. (2001) Feature selection for SVMs. Adv. Neural Inform. Process. Syst., 13, 668–674.
