Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Component retention in principal component analysis with application to cDNA microarray data

Component retention in principal component analysis with application to cDNA microarray data Shannon entropy is used to provide an estimate of the number of interpretable components in a principal component analysis. In addition, several ad hoc stopping rules for dimension determination are reviewed and a modification of the broken stick model is presented. The modification incorporates a test for the presence of an "effective degeneracy" among the subspaces spanned by the eigenvectors of the correlation matrix of the data set then allocates the total variance among subspaces. A summary of the performance of the methods applied to both published microarray data sets and to simulated data is given. This article was reviewed by Orly Alter, John Spouge (nominated by Eugene Koonin), David Horn and Roy Varshavsky (both nominated by O. Alter). 1 Background In his development of PCA, Pearson [1] was interested in Principal component analysis (PCA) is a 100 year old constructing a line or a plane that "best fits" a system of mathematical technique credited to Karl Pearson [1] and points in q-dimensional space. Geometrically, this its properties as well as the interpretation of components amounts to repositioning the origin at the centroid of the have been investigated extensively [2-9]. The technique points in q-dimensional space and then rotating the coor- has found application in many diverse fields such as ecol- dinate axes in such a way as to satisfy the maximal vari- ogy, economics, psychology, meteorology, oceanography, ance property. Statistically speaking, PCA represents a and zoology. More recently it has been applied to the transformation of a set of q correlated variables into linear analysis of data obtained from cDNA microarray experi- combinations of a set of q pair-wise uncorrelated variables ments [10-18]. cDNA microarray experiments provide a called principal components. Components are con- snapshot in time of gene expression levels across poten- structed so that the first component explains the largest tially thousands of genes and several time steps [19]. To amount of total variance in the data and each subsequent assist in the data analysis, PCA (among other techniques) component is constructed so as to explain the largest is generally employed as both a descriptive and data amount of the remaining variance while remaining uncor- reduction technique. The focus of this letter will be on the related with (orthogonal to) previously constructed com- latter. ponents. Page 1 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 We define the dimension of the data set to be equal to the and summarize the results. We also introduce a modifica- number of principal components. The set of q principal tion of the broken stick model which incorporates the components is often reduced to a set of size k, where 1 ≤ k notion of degenerate subspaces in component retention. <<> q. The objective of dimension reduction is to make Finally, we introduce and include in the summary a novel analysis and interpretation easier, while at the same time application of statistical entropy to provide a new heuris- retaining most of the information (variation) contained tic measure of the number of interpretable components. in the data. Clearly, the closer the value of k is to q the bet- ter the PCA model will fit the data since more information 2 Mathematical methods has been retained, while the closer k is to 1, the simpler 2.1 Principal component analysis the model. Each principal component represents a linear combina- tion of the original variables with the first principal com- Many methods, both heuristic and statistically based, ponent defined as the linear combination with maximal have been proposed to determine the number k, that is, sample variance among all linear combinations of the var- the number of "meaningful" components. Some methods iables. The next principal component represents the linear can be easily computed while others are computationally combination that explains the maximal sample variance intensive. Methods include (among others): the broken that remains unexplained by the first with the additional stick model, the Kaiser-Guttman test, Log-Eigenvalue condition that it is orthogonal to the first [27]. Each sub- (LEV) diagram, Velicer's Partial Correlation Procedure, sequent component is determined in a similar fashion. If Cattell's SCREE test, cross-validation, bootstrapping tech- we have a q-dimensional space, we expect to have q prin- niques, cumulative percentage of total of variance, and cipal components due to sampling variation. Bartlett's test for equality of eigenvalues. For a description of these and other methods see [[7], Section 2.8] and [[9], The following derivation can be found in [[27], pp. 373– Section 6.1]. For convenience, a brief overview of the tech- 374], Jolliffe [9] or Basilevsky [26]. Let X be a (p × q) niques considered in this paper is given in the appendices. matrix that contains the observed expression of the i-th gene in its i-th row. Denote by g the i-th observation and Most techniques either suffer from an inherent subjectiv- let S be the sample covariance matrix of X. For a particular , we seek ity or have a tendency to under estimate or over estimate observation g the true dimension of the data [20]. Ferré [21] concludes that there is no ideal solution to the problem of dimen- GG z = a g + a g + ... + a g = (1) ag 1 i,1 2 i,2 p i,p i sionality in a PCA, while Jolliffe [9] notes "... it remains true that attempts to construct rules having more sound GG GG T T statistical foundations seem, at present, to offer little such that var(z) = var(ag ) is maximal subject to aa = advantage over simpler rules in most circumstances." A 1. That is, we maximize the expression comparison of the accuracy of certain methods based on real and simulated data can be found in [20-24]. GG G G TT aSa−− λ a a12() () Data reduction is frequently instrumental in revealing where λ is a Lagrange multiplier. Differentiating with mathematical structure. The challenge is to balance the accuracy (or fit) of the model with ease of analysis and the respect to a leads to the familiar eigenvalue problem potential loss of information. To confound matters, even GG random data may appear to have structure due to sam- Sa−= λ a03 () pling variation. Karr and Martin [25] note that the percent variance attributed to principal components derived from So λ is an eigenvalue of S and real data may not be substantially greater than that GG G G G G TT T derived from randomly generated data. They caution that aSa=== a λλ a a a λ 4 is its correspond- () most biologists could, given a set of random data, gener- ing eigenvector. Since ate plausible "post-facto" explanations for high loadings in "variables." Basilevsky [26] cautions that it is not neces- GG G G G G TT T aSa=== a λλ a a a λ () 4 sarily true that mathematical structure implies a physical process; however, the articles mentioned above provide we see that to maximize the expression we should choose examples of the successful implementation of the tech- the largest eigenvalue and its associated eigenvector. Pro- nique. ceed in a similar fashion to determine all q eigenvalues and eigenvectors. In this report, we apply nine ad-hoc methods to previ- ously published and publicly available microarray data Page 2 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 2.1.1 Data preprocessing sionality in three of the four patterned matrices used in his Data obtained from cDNA microarray experiments are fre- study, giving underestimates in the other. He reported quently "polished" or pre-processed. This may include, that overall, the model was one of the two most accurate but is not limited to: log transformations, the use of under consideration. Bartkowiak [22] claims that the bro- weights and metrics, mean centering of rows (genes) or ken stick model applied to hydro-meteorical data pro- columns (arrays), and normalization, which sets the mag- vided an underestimate of the dimensionality of the data. nitude of a row or column vector equal to one. The term Her claim is based on the fact that other heuristic tech- data preprocessing varies from author to author and its niques generally gave higher numbers (2 versus 5 to 6). merits and implications can be found in [28-32]. However, it should be noted that the true dimension of the data is unknown. Ferré [21] suggests that since PCA is It is important to note that such operations will affect the used primarily for descriptive rather than predictive pur- eigensystem of the data matrix. A simple example is pro- poses, which has been the case with microarray data anal- vided by comparing the singular spectrum from a singular ysis, any solution less than the true dimension is value decomposition (SVD) with that of a traditional acceptable. PCA. Note that PCA can be considered as a special case of singular value decomposition [32]. In SVD one computes The broken-stick model has the advantage of being the eigensystem of X X, where the p × q matrix X contains extremely easy to calculate and implement. Consider the the gene expression data. In PCA one computes the eigen- closed interval J = [0,1]. Suppose J is partitioned into n system of S = M M/(p - 1), where M equals the re-scaled subintervals by randomly selecting n - 1 points from a uni- and column centered (column means are zero) matrix X. form distribution in the same interval. Arrange the subin- The matrix S is recognized as the sample covariance matrix tervals according to length in descending order and of the data. Figure 1 illustrates the eigenvalues (expressed denote by L the length of the k-th subinterval. Then the as a percent of total dispersion) obtained from a PCA and expected value of L is [37] an SVD on both the raw and log base-two transformed [33] elutriation data set of the budding yeast Saccharomy- ces cerevisiae [34]. Note the robustness of PCA. In Figure EL = . 5 () () k ∑ nj 1.a, which is an SVD performed on the raw data, we see jk = the dominance of the first mode. In general, the further Figure 2 provides an illustration of the broken stick distri- the mean is from the origin, the larger the largest singular bution for n = 20 subintervals graphed along with eigen- value will be in a SVD relative to the others [7]. values obtained from the covariance matrix of a random matrix. The elements of the random matrix are drawn 2.2 Broken stick model from a uniform distribution on the interval [0,1]. The bars The so-called broken stick model has been referred to as a represent the values from the broken stick distribution; resource apportionment model [35] and was first pre- the circles represent the eigenvalues of the random matrix. sented as such by MacArthur [36] in the study of the struc- In this case, no component would be retained since the ture of animal communities, specifically, bird species proportion of variance explained by the first (largest) from various regions. Frontier [37] proposed comparing eigenvalue falls below the first value given by the broken eigenvalues from a PCA to values given by the broken stick model. stick distribution. The apportioned resource is the total variance of the data set (the variance is considered a 2.3 Modified broken stick model resource shared among the principal components). Since Consider a subspace spanned by eigenvectors associated each eigenvalue of a PCA represents a measure of each with a set of "nearly equal" eigenvalues that are "well sep- components' variance, a component is retained if its asso- arated" from all other eigenvalues. Such a subspace is well ciated eigenvalue is larger than the value given by the bro- defined in that it is orthogonal to the subspaces spanned ken stick distribution. An example of broken stick by the remaining eigenvectors; however, individual prin- distribution with a plot can be found in Section 3.2. cipal components within that subspace are unstable [9]. This instability is described in North et al. [38] where a As with all methods currently in use, the broken stick first order approximation to estimate how sample eigen- model has drawbacks and advantages. Since the model values and eigenvectors differ from their exact quantities does not consider sample size, Franklin et al. [23] con- is derived. This "rule of thumb" estimate is tends that the broken stick distribution cannot really 1/2 model sampling distributions of eigenvalues. The model δλ ~ λ (2/N) (6) also has a tendency to underestimate the dimension of the data [20-22]. However, Jackson [20] claims that the bro- where N is the sample size and λ is an eigenvalue. The ken stick model accurately determined the correct dimen- interpretation given by North et al. [38] is that "... if a Page 3 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 (a). SVD on raw Data (b). PCA on raw data 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Component Number Component Number (c). SVD on log−transformed data (d). PCA on log−transformed data 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Component Number Component Number SVD Figure 1 and PCA SVD and PCA. (a) SVD performed on the elutriation data set; (b) PCA on the elutriation data set; (c) SVD on the log base two transformed data; (d) PCA on the log base two data. group of true eigenvalues lie within one or two δλ of each (6), an estimate of the sampling error is determined and other then they form an 'effectively degenerate multiplex,' those eigenvalues which lie within 1.5 of each other is and sample eigenvectors are a random mixture of the true noted (note that the value of the spacing, 1.5, is somewhat eigenvectors." arbitrary. In their report, North et al. [38] suggest using a value between 1 and 2.) Components are then grouped As noted previously, the broken stick model has been into subspaces preserving the order determined by the referred to as a resource apportionment model, and in maximum variance property of PCA. Subspaces are particular, the resource to be apportioned among the spanned by either a single eigenvector, or in the case of an components is the total variance. We modify this "effective degeneracy," by multiple eigenvectors. Denote approach by considering the variance as apportioned these subspaces by W . For each W we sum the eigenval- i i among individual subspaces. ues associated with the eigenvectors spanning that space. We then repartition the broken stick model to match the Once the eigenvalues, λ , have been computed, the spac- subspaces and then apply the broken stick model to each ing between them, λ - λ , is calculated. Using Equation subspace, requiring that the sum of the eigenvalues asso- i+1 i Page 4 of 21 (page number not for citation purposes) Percent of total Percent of Total Percent of total Percent of total Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 0.2 0.15 0.1 0.05 0 2 4 6 8 10 12 14 16 18 20 Component Number The Figure 2 broken stick method The broken stick method. The broken stick distribution (bars) with eigenvalues obtained from a uniform random matrix of size 500 × 20. ciated with that subspace exceed the value given by the 2.4.1 Information and uncertainty broken stick model. In the context of information theory, information is a measure of what can be communicated rather than what 2.4 Statistical entropy and dimensionality is communicated [42]. Information can also be though of In physics, the entropy of a closed system is a measure of as a term used to describe a process that selects one or disorder. An increase in entropy corresponds to an more objects from a set of object. For example, consider a increase in disorder which is accompanied by a decrease balanced six-sided die. We will consider the die as a in information. In the branch of applied probability device, D, that can produce with equal probability any ele- known as information theory [39-42], the concept of ment of the set S = {1, 2, 3,4, 5, 6}. Of course, observing entropy is used to provide a measure of the uncertainty or a tossed die and noting the number of the top face is a information of a system. Note that uncertainty and infor- finite discrete probability experiment with sample space mation are used synonymously, the reason for which is S where each sample point is equally likely to occur. explained below. The word system is used to imply a com- plete discrete random experiment. Denote the probability model for an experiment with out- comes e ,...,e with associated probabilities p ,...,p as 1 n 1 n Page 5 of 21 (page number not for citation purposes) Percent of total Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 ee " e ⎛ ⎞ 12 n . () 7 Hp() ,...,p =− p log p . () 11 ⎜ ⎟ 12 Nk ∑ k pp " p ⎝ 12 n ⎠ k=1 In the example of the balanced six-sided die we have Equation (11) represents the average information content or average uncertainty of a discrete system. The quantity is considered a measure of information or uncertainty 12 3 4 5 6 ⎛ ⎞ . 8 () depending upon whether we consider ourselves in the ⎜ ⎟ 16////// 16 16 16 16 16 ⎝ ⎠ moment before the experiment (uncertainty) or in a moment after the experiment (information), [54]. Prior to performing the experiment, we are uncertain as to its outcome. Once the die is tossed and we receive infor- 2.4.3 Derivation of H mation regarding the outcome, the uncertainty decreases. Statistical entropy is derived from the negative binomial As a measure of this uncertainty we can say that the device distribution where an experiment with two equally likely has an "uncertainty of six symbols" [43]. Now consider a outcomes, labeled success or failure, is considered. Basi- "fair" coin. This "device," which we will call C, produces levsky [26] shows that with x representing the first two symbols with equal likelihood from the set S = {h, t} number on which a "success" is observed, the probability and we say this device has an "uncertainty of two sym- th of observing success on the x trial is given by f(x) = p = (l/ bols." We denote this probability model as 2) . Upon solving for x we have x = -log p, and the expected or total entropy of a system, H, is ht ⎛ ⎞ . 9 () ⎜ ⎟ 12// 12 ⎝ ⎠ Hf =− xx = p log , 12 () () ∑ ∑ Both models represent uniform distributions (the out- k 2 x k=1 comes in the respective models have equal probabilities), but it is inferred that device D is a finite scheme with where p log is defined to be 0 if p = 0. k 2 k greater uncertainty than device C (an "uncertainty of six symbols" versus an "uncertainty of two symbols"). Conse- quently, we expect device D to convey more information. 2.4.4 Basic properties To see this, consider (as an approximation to the amount It is possible to derive the form of the function H by of information conveyed) the average minimum number assuming it possesses four basic properties, [41]: (i) con- of binary questions that would be required to ascertain the outcome of each experiment. In the case of device D, tinuity, (ii) symmetry, (iii) an extremal property, and (iv) the average minimum number of questions is 2.4 while in additivity. Continuity requires that the measure of uncer- the case of device C only one question is required. Now tainty varies continuously if the probabilities of the out- consider an oddly minted coin with identical sides (say comes of an experiment are varied in a continuous way. heads on either side). The model for this device is Symmetry states that the measure must be invariant to the order of the p s, that is, H(p , p ,...,p ) = H(p , p ,...,p ). k 1 2 N 2 1 N ht ⎛ ⎞ . 10 () Additivity requires that given the following three H func- ⎜ ⎟ ⎝ ⎠ tions defined on the same probability space Since heads, h, is the only possible outcome, we consider H (p , p ,...,p ), this as a "device of one symbol." Notice that this device 1 1 2 N carries no information and contains no element of uncer- H (p , p ,...,p , q , q ,...,q ), (13) tainty. We need not pose a question to ascertain the out- 2 1 2 N 1 2 M come of the experiment. Thus, a function that attempts to H (q /p , q /p ,...,q /p ), quantify the information or uncertainty of a system will 3 1 N 2 N M N depend on the cardinality of the sample space and the the relationship H = H + p H holds. Notice that this probability distribution. 2 1 N 3 implies H ≥ H , that is, partitioning events into sub- 2 3 events cannot decrease the entropy of the system [39]. The 2.4.2 Entropy: a measure of information content (or uncertainty) extremal property, which we now describe, will be used in Every probability model (or device) describes a state of our development of the information dimension described uncertainty [41]. Shannon [42] provided a measure for below. First, notice that since 0 ≤ p ≤ 1 for all k, a mini- such uncertainty, which is known as statistical entropy mum value of 0 is attained when p = 1 and p = p =  = (often referred to as Shannon's entropy). Its functional 1 2 3 p = 0, so H ≥ 0. As an upper bound, we have form is given by Page 6 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 11 1 1 ⎛ ⎞ Hp ,p ,...,p ≤ H , ,..., = log N, 14 () () pp == " = () 18 12 N ⎜ ⎟ 2 1 n NN N ⎝ ⎠ that is, a uniform distribution of probabilities provides an pp == " =01 , 9 () upper bound on the uncertainty measure of all discrete nN +1 probability models whose sample space has cardinality of at most N. This relationship can be proved either with dif- Inserting these values into H and solving for n yields ferential calculus [39] or from Jensen's inequality −p 0 k N N nN== p . () 20 ⎛ ⎞ 0 ∏ ϕϕ ⎜ a ⎟ ≤ a 15 k=1 () () ∑∑ k k ⎜ ⎟ N N k== 11 k ⎝ ⎠ 2.4.6 A geometric example which is valid for any convex function ϕ [41]. Consider the following geometric example. The surface of a three-dimensional ellipsoid is parameterized by the 2.4.5 The information dimension, n 0 equations What we coin as, the "information dimension," n , is pre- sented as a novel (but heuristic) measure for the interpret- x(φ,θ) = R sin(φ)cos(θ), able components in a principal component analysis. We assume that a principal component analysis has been per- y(φ,θ) = R sin(φ)sin(θ), (21) formed on a microarray data set and that our objective is to reduce the dimension of the data by retaining "mean- z(φ,θ) = R cos(φ). ingful" components. This involves setting one or more of the eigenvalues associated with the low variance compo- Points are distributed along the surface of the ellipsoid nents to zero. Let λ , λ ,...,λ represent the eigenvalues according to the above parametrization and are tabulated 1 2 N from a PCA of the data. Define for each k in the matrix X of size (4584 × 3). Set R = R = R = 1, then x y z (21) is a parametrization representing the surface of the unit sphere centered at the origin. Gradually deform the p = . 16 () sphere by changing the values of R subject to the con- ∑ j j=1 straint R R R = 1, which gives ellipsoids of constant vol- x y z ume (equal to 4π/3). We summarize the results in Table 1. The p s satisfy 0 ≤ p ≤ 1, (k = 1,...,N) and p = 1 . k k ∑ k Notice that for the case R = R = R = 1, which represents K =1 x y z the unit sphere, n = 3. The gradual deformation of the We view the distribution of the eigenvalues expressed as a sphere has an information dimension of approximately proportion of total variance as a discrete probability two for the values: R = 2, R = 1, R = 1/2. This suggests that x y z model. the magnitude along the z-axis has become sufficiently small relative to the x- and y-axes, that it may be discarded We begin by normalizing the entropy measure [10] for for information purposes. Thus, a projection onto the xy- some fixed n = N using its extremal property to get plane may provide sufficient information regarding the shape of the object. For R = 8, R = 1, R = 1/8 the object x y z H 1 begins to "look" one dimensional with n = 1.09. With H = =− pp log . 17 () 0 ∑kk 2 logNN log () () 22 this configuration, most of the variance lies along the x- k=1 axis. The values of H will vary between 0 and 1 inclusive. 3 Results and discussion We calculate the entropy of the probability space using In this section we apply the information dimension, the broken stick model, the modified broken stick model, Equation (17) to obtain the functional value H , where 0 Bartlett's Test, Kaiser-Guttman, Jolliffe's modification of ≤ H ≤ 1. (Note that at either extreme the dimension is Kaiser-Guttman, Velicer's minimum average partial known.) Next, we deform the original distribution of (MAP) criteria, Cattell's scree test, parallel analysis, cumu- eigenvalues so that the following holds lative percent of variance explained, and Log-eigenvalue diagram techniques to published yeast cdc15 cell-cycle and elutriation-synchronized cell cycle data sets [34], sporulation data set [44], serum-treated human fibroblast Page 7 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 1: Dimension suggested by the Information Dimension (R , R , R ) (1,1,1) (8/5,1,5/8) (2,1,1/2) (3,1,1/3) (8,1,1/8) x y z H(p ,p ,p ) 1.000 0.781 0.608 0.348 0.074 1 2 3 p = λ /(λ + λ + λ ) 0.333 0.648 0.762 0.890 0.984 1 1 1 2 3 p = λ /(λ + λ + λ ) 0.333 0.253 0.191 0.099 0.015 2 2 1 2 3 p = λ /(λ + λ + λ ) 0.333 0.099 0.048 0.010 0.000 3 3 1 2 3 n 3.00 2.36 1.95 1.47 1.09 data set [45], and the cancer cell lines data sets [46]. These nomials were constructed using a Gramm-Schmidt proc- data sets have been previously explored [10-12,47]. ess [48] with norm Before attempting to reduce the dimension of the data, we px()q ()x dx = δ,2()3 jk jk first consider whether a PCA is appropriate, that is, a data set with very high information content will not lend itself to significant dimension reduction, at least not without where δ is the Kronecker delta function. The functional jk some non-trivial loss of information. In their study, Alter form of the polynomials are: et al. [10] addresses the issue by considering the normal- ized entropy (presented above) of a data set, which is a px()=− α 32() x 1 , measure of the complexity or redundancy of the data. The px()=− α 56x 6x+1 , 22 () index ranges in value from 0 to 1 with values near zero px()=− α 72() x 1 10x −10x x + 1 , 33 () indicating low information content versus values near 1 () 43 2 px=− α 9 210x 420x+ 270x− 60x+ 3 , () 44() which indicate a highly disordered or random data set. In 5 43 2 px()=− α 11 252x 6 630xx+− 560 210x+ 30x− 1, 55() this form, the entropy can only be use to give the 65 4 32 researcher a "feeling" for the potential for dimension px=− α 13 924x 2772x+ 3150x− 16 680xx+− 420 42x+ 1 , ()() reduction. For example, what level of dimension reduc- where the α 's are applied to each functional value and tion is implied by an entropy reading of 0.3 versus 0.4? represent uniform random variables drawn from the interval [0.5,1.5]. The remaining 5,400 rows are popu- Another measure is presented in Jackson [7] and credited lated with random numbers drawn from a uniform distri- to Gleason and Staelin for use with the q × q correlation bution on the interval [-3,3]. Figure 3 provides an matrix, R, and is given by illustration of the polynomials in the presence of Gaus- sian noise (σ = 0.25). Rq − A singular value decomposition was performed on X , for ϑ= , () 22 qq − 1 () σ ranging from 0.0 to 1.0 by 0.1 increments. In the absence of Gaussian noise (σ = 0), the information The statistic also ranges in value from 0 to 1. If there is lit- dimension predicts the dimension of the data to be n = tle or no correlation among the variables, the statistic will 5.9, which compares favorably with the true dimension of be close to 0; a set of highly correlated variables will have 6. It should be noted, however, that like other stopping a statistic close to 1. The statistic is given by Jackson [7] criteria, the information dimension is a function of the asserts that the distribution of the statistic is unknown, noise present in the data. Figure 4 illustrates this depend- but may be useful in comparing data sets. ence when the number of assays is 15. The information dimension (line with circle markers), Jolliffe's modifica- 3.1 Stopping rules applied to synthetic data tion of the Guttman-Kaiser rule (line with star markers) In this section we apply the stopping criteria to a (6000 × and LEV (line with square markers) are plotted against 15) matrix, X , where X is populated with simulated data. noise level, measured in standard deviations. The predic- σ σ The simulation model can be expressed as X = Y + N , tions given by both the information dimension and Gutt- σ σ where N is a random matrix representing Gaussian noise man-Kaiser's rule increase as the noise level increases, and whose entries were drawn from a standard normal while LEV drops sharply. The reason LEV decreases is that distribution with zero mean and standard deviation, σ. higher noise levels cause the distribution of the eigenval- The matrix Y was constructed by populating the first 600 ues to look uniform. The results of applying all of the rows with values from six orthonormal polynomials. Each stopping techniques to the matrix X for σ = 0 and σ = polynomial populates 100 rows of the matrix. The poly- 0.25 are summarized in Table 2. Page 8 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Linear Quadratic Cubic 4 4 4 2 2 2 0 0 0 −2 −2 −2 −4 −4 −4 0 0.5 1 0 0.5 1 0 0.5 1 Quartic Quintic Sixth Degree Polynomial 4 4 4 2 2 2 0 0 0 −2 −2 −2 −4 −4 −4 0 0.5 1 0 0.5 1 0 0.5 1 Si Figure 3 mulated data Simulated data. Six orthonormal polynomials with gaussian Noise. 3.2 Yeast cdc15 cell-cycle data set that it requires the first seven eigenvalues to account for A PCA was performed on the genes identified in Spellman over 90% of the variance in the data. The Kaiser-Guttman [34] responsible for cell cycle regulation in yeast samples. test retains the first four eigenvalues which represents the The cdc15 data set contains p = 799 rows (genes) and q = number of eigenvalues obtained from the correlation 15 columns representing equally spaced time points. The matrix that exceeds unity. To incorporate the effect of sam- unpolished data set appears to have a high information ple variance Jolliffe [9] suggests that the appropriate content as suggested by the normalized entropy which is number to retain are those eigenvalues whose value .7264 and the Gleason-Staelin statistic which is 0.3683. exceed 0.7. Jolliffe's modification of Kaiser-Guttman Therefore, we should expect the stopping criteria to indi- would indicate that the first five eigenvalues are signifi- cate that significant dimension reduction may not be pos- cant. Parallel analysis compares the eigenvalues obtained sible. from either the correlation or covariance matrix of the data to those obtain from a matrix whose entries are Eigenvalues based on both the covariance and correlation drawn from a uniform random distribution. matrices are given in Table 2. From the given data we see Page 9 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 sigma level Predi Figure 4 cted dimension for simulated data Predicted dimension for simulated data. Predicted dimension versus noise Level. Shown is the information dimension (line with circle markers), Jolliffe's modification of the Guttman-Kaiser rule (line with star markers) and LEV (line with square markers) plotted against noise level, measured in standard deviations. Note that the predictions given by both the information dimension and Guttman-Kaiser's rule increase as the noise level increases, while LEV drops sharply (see Text). Cattell's scree test looks for an inflection point in the 6.b). For each eigenvalue λ , we graph log (λ ) against j and j j graph of the eigenvalues, which are plotted in descending look for the point at which the eigenvalues decay linearly. order. Figure 5 illustrates Cattell's scree test for eigenval- The method is based on the conjecture that the eigenval- ues obtained from the correlation matrix and from the ues associated with eigenvectors that are dominated by covariance matrix respectively. By inspecting the differ- noise will decay geometrically. The LEV diagram is subject ences of the differences between eigenvalues, we see that to interpretation and may suggest retaining 0, 3, or 9 com- the first inflection point occurs between the fourth and ponents. fifth eigenvalues. Therefore, the scree test gives a dimen- sion of five. Figure 7 illustrates Velicer's minimum average partial cor- relation statistic. It is based upon the average of the Figure 6 contains graphs of the Log-eigenvalue diagram, squared partial correlations between q variables after the LEV, where the eigenvalues are obtained from the correla- first m components have been removed [52]. The sum- tion matrix (Figure 6.a) and the covariance matrix (Figure mary statistic is given by Page 10 of 21 (page number not for citation purposes) Dimension Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 2: Summary of results Column No. 1 2 3 4 5 6 7 8 Data Set Y + N Y + N alpha cdc15 elutriation fibro sporula tumor 0 .25 Broken Stick, (BS) 1 1 2 4 3 2 1 3 Modified BS 1 1 2 4 3 2 1 3 Velicer's MAP 2/8 2 3 5 3 3 2 8/10 Kaiser-Guttman, (KG) 2 2 3 4 3 3 2 9 Jolliffe's KG 7 7 4 5 4 3 3 14 LEV Diagram 6/8 6/8 4/5 5 4/5 4/6 3 12/21 Parallel Analysis 1 1 5 4 3 3 2 8 Scree Test 8 8 5 5 4 6 4 7 Info Dimension 5.9 7.3 11.1 7.2 6.4 3.0 2.6 17.3 Gleason-Staelin Stat .525 .45 .34 .37 .38 .54 .58 .37 Normalized Entropy .917 .941 .779 .726 .706 .438 .493 .696 80% of Var. 4 4 9 4 5 3 3 19 90% of Var. 7 7 14 7 9 5 3 33 Bartlett's Test 15 15 22 15 14 12 7 60 The table contains the results of twelve stopping rules along with two measures of data information content for six cDNA microarray data sets. We recommend looking for a consensus among the rules given in the upper portion of the table, while avoiding rules based on cumulative percent of variation explained or Barlett's test. Synthetic data sets are summarized in columns 1 (no noise) and 2 (Gaussian noise, μ = 0 and σ = 0.25). The matrix sizes for columns 1 through 8 are : (6000 × 15), (6000 × 15), (4579 × 22), (799 × 15), (5981 × 14), (517 × 12), (6118 × 7), and (1375 × 60). through fifteen lengths. This would also suggest that we f = r , () 25 accept or reject the entire subspace. Of course the tail of mi ∑∑()j qq − 1 () ij ≠ the distribution can never exceed that suggested by the broken stick distribution. The broken stick model suggests a dimension of four for the cdc15 data. where r is the element in the i-th row and j-th column of ij 3.3 Summary of results For six microarray data sets the matrix of partial correlations and co-variances. The Table 2 summarizes the results of the stopping criteria for pattern of the statistic given in Figure 7 for cdc15 cell cycle six microarray data sets. Note that Bartlett's test fails to dis- data is typical in that the statistic first declines then rises. card any components. The null hypothesis that all roots Once the statistic begins to rise it indicates that additional are equal is rejected at every stage of the test. The large principal components represent more variance than cov- sample size of each data set was a major factor for all roots ariance [7]. Therefore, no components are retained after testing out to be significantly different. The broken stick the average squared partial correlation reaches a mini- model consistently retained the fewest number of compo- mum. Here the minimum occurs at j = 5, which suggests nents, which appears consistent with comments in the lit- erature. The results of the modified broken stick model retaining the first five principal components. were identical to that of the original model since the first Figure 8. a shows the eigenvalues from the covariance few eigenvalues in each data set appear to be well sepa- matrix along with error bars representing the sampling rated, at least with respect to Equation (6). Since no effec- error estimate suggested by North et al. [38] and presented tive degenerate subspaces were identified, all subspaces in Section 2.3 above. Figure 8.b shows the eigenvalues matched those of the original model. The cumulative per- obtained from the covariance matrix superimposed on the cent of variance at the 90% level retains the greatest broken stick distribution. Applying the rule of thumb number of components while components retained at the method given in North et al. [38], we find that the first six 80% level appear to be more consistent with other rules. eigenvalues are sufficiently separated and may be treated Regardless of the cutoff level chosen, this method is com- as individual subspaces. The remaining eigenvalues are pletely arbitrary and appears to be without merit. While close compared to their sampling error. Therefore, when the LEV diagram is less subjective, it is often difficult to applying the broken stick model we require that the total interpret. Kaiser-Guttman, Jolliffe's modification of Kai- variance of the effectively degenerate subspace spanned by ser-Guttman, Cattell's scree test, parallel analysis and the associated eigenvectors to exceed the value suggested Velicer's MAP consistently retained similar numbers of by the broken stick model for the sum of the seventh components. The information dimension gave compara- Page 11 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Cattell's scree Figure 5 test Cattell's scree test. Cattell's scree test using eigenvalues obtained from (a) the correlation matrix and from (b) the covari- ance matrix of the yeast cdc15 cell-cycle data set. Since the first inflection point occurs between the fourth and fifth eigenval- ues, the implied dimension is five. Cattell's scree test using eigenvalues obtained from (a) the correlation matrix and from (b) the covariance matrix of the yeast cdc15 cell-cycle data set. Since the first inflection point occurs between the fourth and fifth eigenvalues, the implied dimension is five. ble results. It often retained the most components of the Our analysis shows that the broken stick model and five aforementioned rules suggesting that it may provides Velicer's MAP consistently retained the fewest number of an upper bound on the interpretable number of compo- components while stopping rules based on percent varia- nents. tion explained and Bartlett's test retained the greatest number of components. We do not recommend the use of 4 Conclusion Bartlett's test (as presented here) or those based on the Principal component analysis is a powerful descriptive cumulative percent of variation explained. Due to the and data reduction technique for the analysis of microar- large sample size (typical of microarray data), Bartlett's ray data. Twelve stopping rules to determine the appropri- test failed to discard any components, while rules based ate level of data reduction were presented including a new on percent variation explained are completely arbitrary in heuristic model based on statistical entropy. While the nature. issue of component retention remains unresolved, the information dimension provides a reasonable and con- For the analysis of cDNA microarray data, we do not rec- servative estimate of the upper bound of the true dimen- ommend any one stopping technique. Instead, we suggest sion of the data. that one look for a "consensus dimension" given by the Page 12 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Th Figure 6 e LEV Diagram The LEV Diagram. LEV Diagram using eigenvalues obtained from (a) the correlation matrix and from (b) the covariance matrix of the yeast cdc15 cell-cycle data set. The fifth through fifteen eigenvalues lie approximately on a line indicating a dimen- sion of five. modified broken stick model, Velicer's MAP, Jolliffe mod- A A brief overview of stopping techniques ification of Kaiser-Guttman, the LEV diagram, parallel A.1 Introduction analysis, the scree test and the information dimension Franklin et al. [23] suggests that when a researcher uses while using the information dimension as an upper PCA for data analysis, the most critical problem faced is bound for the number of components to retain. Comput- determining the number of components to retain. Indeed, ing all seven stopping rules is an easy task and Matlab rou- retaining too many components potentially leads to an tines are available from the authors. attempt to ascribe physical meaning to what may be noth- ing more than noise in the data set, while retaining too As a guiding example consider the results from the cdc15 few components may cause the researcher to discard valu- cDNA data set given in Table 3. For the cdc15 data set, the able information. Many methods have been proposed to consensus is split between four and five; however, given address the question of component selection and a brief that the information dimension is seven, it appears rea- review is given here. A more extensive review can be found sonable to choose five as the appropriate dimension in in [[7], Section 2.8]; [[9], Section 6.1]; [[6], Chapter 5]. which to work. These methods may be categorized as either heuristic or statistical approaches [20]. The statistical methods may be Page 13 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Velicer's minimum average part Figure 7 ial statistic Velicer's minimum average partial statistic. Velicer's minimum average partial statistic displays a minimum value at five, indicating that the implied dimension is five. Page 14 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Th Figure 8 e cdc15 yeast cell-cycle data set The cdc15 yeast cell-cycle data set. (a) Error bars about the eigenvalues obtained from the covariance matrix of the cdc15 yeast cell-cycle data set illustrating North et al. (1982) "rule of thumb" estimate with δ = 1.5. The spacing between the second and third eigenvalues indicate a possible degenerate subspace spanned by the associated eigenvectors, (b) is a graph of the bro- ken stick model (circles) plotted against the eigenvalues (bars) obtained from the covariance matrix of the yeast cdc15 data set. The broken stick model (and the modified broken stick model) indicate a dimension of four. further partitioned into two groups: those that make A. 2 Scree test assumptions regarding the distribution of the data and The scree test is a graphical technique attributed to Cattell those that do not make such assumptions. Jolliffe [9] crit- [49] who described it in term of retaining the correct icizes the former stating that the distributional assump- number of factors in a factor analysis. However, it is tions are often unrealistic and adds that these methods widely used in PCA [9]. While a scree graph is simple to tend to over-estimate the number of components. The lat- construct, its interpretation may be highly subjective. Let ter methods tend to be computationally intensive (for λ represent the k-th eigenvalue obtained from a covari- example, cross validation and bootstrapping). Our ance or correlation matrix. A graph of λ against k is approach is to consider only heuristic methods here with known as a scree graph. The location on the graph where the exception of Velicer's Partial Correlation test, which is a sharp change in slope occurs in the line segments join- a statistical method that does not require distributional ing the points is referred to as an elbow. The value of k at assumptions nor is it computationally intensive. We which this occurs represents the number of components present below a brief discussion of the heuristic tech- that should be retained in the PCA. Jackson [7] notes that niques for determining the dimensionality of data sets as the scree test is a graphical substitute for a significance well as Velicer's Partial Correlation test. test. He points out that interpretation might be con- Page 15 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 3: Eigenvalues of yeast cdc15 cell cycle data No. Eigenvalue (Covariance) Percent of Total Cumulated Percentage Eigenvalue (Correlation) Random 1 1.9162 32.25 32.25 5.0025 0.1004 2 1.1681 19.66 51.91 2.8896 0.0997 3 0.8560 14.41 66.32 2.4140 0.0975 4 0.8320 14.00 80.32 1.7371 0.0940 5 0.3295 5.55 85.87 0.7499 0.0905 6 0.2087 3.51 89.38 0.5495 0.0870 7 0.1490 2.51 91.89 0.3663 0.0841 8 0.1337 2.25 94.14 0.3064 0.0831 9 0.0881 1.48 95.62 0.2706 0.0808 10 0.0842 1.42 97.04 0.2206 0.0801 11 0.0580 0.98 98.02 0.1507 0.0779 12 0.0402 0.68 98.69 0.1292 0.0750 13 0.0341 0.57 99.27 0.0930 0.0727 14 0.0273 0.46 99.73 0.0731 0.0702 15 0.0162 0.27 100.00 0.0464 0.0650 founded in cases where the scree graph either does not t(k) because an appropriate measure of a lack-of-fit is have a clearly defined break or has more than one break. λ (see Jolliffe 2002, pp. 113). ∑ i Also, if the first few roots are widely separated, it may be ik =+1 difficult to interpret where the elbow occurred due to a A.4 Average eigenvalue (Guttman-Kaiser rule and Jolliffe's loss in detail caused by scaling. This problem might be Rule) remedied using the LEV described below. The most common stopping criterion in PCA is the Gutt- man-Kaiser criterion [7]. Principal components associated A.3 Proportion of total variance explained In a PCA model, each eigenvalue represents the level of with eigenvalues derived from a covariance matrix, and variation explained by the associated principal compo- that are larger in magnitude than the average of the eigen- values, are retained. In the case of eigenvalues derived nent. A simple and popular stopping rule is based on the from a correlation matrix, the average is one. Therefore, proportion of the total variance explained by the principal any principal component associated with an eigenvalue components retained in the model. If k components are retained, then we may represent the cumulative variance whose magnitude is greater than one is retained. explained by the first k PC's by Based on simulation studies, Jolliffe [9] modified this rule using a cut-off of 70% of the average root to allow for sam- ∑ i t = 26 pling variation. Rencher [27] states that this method () trace S () works well in practice but when it errs, it is likely to retain too many components. It is also noted that in cases where where S is the sample covariance matrix. The researcher the data set contains a large number of variables that are decides on a satisfactory value for t(k) and then deter- not highly correlated, the technique tends to over estimate mines k accordingly. The obvious problem with the tech- the number of components. Table 4 lists eigenvalues in nique is deciding on an appropriate t(k). In practice it is descending order of magnitude from the correlation common to select levels between 70% to 95% [9]. Jackson matrix associated with a (300 × 9) random data matrix. [7] argues strongly against the use of this method except The elements of the random matrix were drawn uniformly over the interval [0, 1] and a PCA performed on the corre- possibly for exploratory purposes when little is known lation matrix. Note that the first four eigenvalues have val- about the population of the data. An obvious problem ues that exceed 1 and all nine eigenvalues have values that occurs when several eigenvalues are of similar magnitude. exceed 0.7. Thus, Kaiser's rule and its modification suggest For example, suppose for some k = k , t(k ) = 0.50 and the * * the existence of "significant PCs" from randomly gener- remaining q - k eigenvalues have approximately the same ated data – a criticism that calls into question its validity magnitude. Can one justify adding more components [20,25,50,51]. until some predetermined value of t(k) is reached? Jolliffe A.5 Log-eigenvalue diagram, LEV [9] points out that the rule is equivalent to looking at the An adaptation of the scree graph is the log-eigenvalue dia- spectral decomposition of S. Determining how many gram, where log(λ ) is plotted against k. It is based on the terms to include in the decomposition is closely related to Page 16 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 4: Eigenvalues from a random matrix. No. 1 23 45 67 89 Eigenvalue 1.21 1.20 1.13 1.03 0.96 0.93 0.89 0.86 0.77 conjecture that eigenvalues corresponding to 'noise' Reviewers' comments should decay geometrically, therefore, those eigenvalues Orly Alter review should appear linear. Farmer [50] investigated the proce- R. Cangelosi and A. Goriely present two novel mathemat- dure by studying LEV diagrams from different groupings ical methods for estimating the statistically significant of 6000 random numbers. He contends that the LEV dia- dimension of a matrix. One method is based on the Shan- gram is useful in determining the dimension of the data. non entropy of the matrix, and is derived from fundamen- tal principles of information theory. The other method is A.6 Velicer's partial correlation test a modification of the "broken stick" model, and is derived Velicer [52] proposed a test based on the partial correla- from fundamental principles of probability. Also pre- tions among the q original variables with one or more sented are computational estimations of the dimensions principal components removed. The criterion proposed is of six well-studied DNA microarray datasets using these two novel methods as well as ten previous methods. f = r , 27 () Estimating the statistically significant dimension of a mi ∑∑()j qq − 1 () ij ≠ given matrix is a key step in the mathematical modeling of data, e.g., as the authors note, for data interpretation as where r is the partial correlation between the i-th and j- ij well as for estimating missing data. The question of how best to estimate the dimension of a matrix is still an open th variables. Jackson [7] notes that the logic behind question. This open question is faced in most analyses of Velicer's test is that as long as f is decreasing, the partial DNA microarray data (and other large-scale modern data- correlations are declining faster than the residual vari- sets). The work presented here is not only an extensive ances. This means that the test will terminate when, on the analysis of this open question. It is also the first work, to average, additional principal components would repre- the best of my knowledge, to address this key open ques- sent more variance than covariance. Jolliffe [9] warns that tion in the context of DNA microarray data analysis. I the procedure is plausible for use in a factor analysis, but expect it will have a significant impact on this field of research, and recommend its publication. may underestimate the number of principal components in a PCA. This is because it will not retain principal com- For example, R. Cangelosi and A. Goriely show that, in ponents dominated by a single variable whose correla- estimating the number of eigenvectors which are of statis- tions with other variables are close to zero. tical significance in the PCA analysis of DNA microarray data, the method of cumulative percent of variance A.7 Bartlett's equality of roots test should not be used. Unfortunately, this very method is It has been argued in the literature (see North, [38]) that used in an algorithm which estimates missing DNA eigenvalues that are equal to each other should be treated microarray data by fitting the available data with cumula- as a unit, that is, they should either all be retained or all tive-percent-of-variance- selected eigenvectors [Troyan- discarded. A stopping rule can be formulated where the skaya et al., Bioinformatics 17, 520 (2001)]. This might be last m eigenvalues are tested for equality. Jackson [7] one explanation for the superior performance of other presents a form of a test developed by Bartlett [53] which PCA and SVD-based algorithms for estimating DNA is microarray data [e.g., Kim et al., Bioinformatics 15, 187 (2005)]. ⎡ ⎤ q λ ∑ j jk =+1 2 ⎢ ⎥ χν =− lnλ +−ν qk ln 28 () () ∑ () j In another example, R. Cangelosi and A. Goriely estimate ⎢ ⎥ qk − jk −+1 ⎢ ⎥ that there are two eigenvectors which are of statistical sig- ⎣ ⎦ nificance in the yeast cdc15 cell-cycle dataset of 799 genes where χ has (1/2) (q - k - 1)(q - k - 2) degrees of freedom × 15 time points. Their mathematical estimation is in and v represents the number of degrees of freedom associ- agreement with the previous biological experimental ated with the covariance matrix. [Spellman et al., MBC 9, 3273 (1998)] as well as compu- tational [Holter et al., PNAS 97, 8409 (2000)] interpreta- Authors' contributions tions of this dataset. R.C. and A.G. performed research and wrote the paper Page 17 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Declaration of competing interests: I declare that I have no them in general accord, if that is the case), it is undesirable competing interests. to report all estimators for any large set of results. The information dimension's property of increasing with John Spouge's review (John Spouge was nominated by noise makes it undesirable as an estimator, and it can not Eugene Koonin) be recommended. The main value of the paper therefore This paper reviews several methods based on principal resides in its useful review and its software tools. component analysis (PCA) for determining the "true" dimensionality of a matrix subject to statistical noise, with Answers to John Spouge's review specific application to microarray data. It also offers two The main point of the reviewer is the suggestion that the new candidates for estimating the dimensionality, called information dimension's undesirable property of increas- "information dimension" and the " modified broken stick ing with noise makes it undesirable as an estimator. We model". analyze the information in detail and indeed reached the conclusion that its prediction increases with noise. In the Section 2.1 nicely summarizes matrix methods for reduc- preprint reviewed by Dr. Spouge, we only considered the ing dimensionality in microarray data. It describes why effect of noise on the information dimension. It is crucial PCA is preferable to a singular value decomposition (a to note that ALL methods are functions of the noise level change in the intensities of microarray data affects the sin- present in the data. In the new and final version of the gular value decomposition, but not PCA). manuscript, we study the effect of noise on two other methods (Jolliffe's modification of he Guttman-Kaiser Section 2.2 analyzes the broken stick model. Section 2.3 rule and LEV). It clearly appears that in one case the esti- explains in intuitive terms the authors' "modified broken mator increases with noise and in the other one, it stick model", but the algorithm became clear to me only decreases with noise (both effects are undesirable and when it was applied to data later in the paper. The broken unavoidable). The message to the practitioner is the same, stick model has the counterintuitive property of determin- understand the signal to noise ratio of the data and act ing dimensionality without regard to the amount of data, accordingly. We conclude that the information dimension implicitly ignoring the ability of increased data to could still be of interest as an estimator. improve signal-to-noise. The modified broken stick model therefore has some intuitive appeal. David Horn and Roy Varshavsky joint review (both reviewers were nominated by O. Alter) Section 2.4 explains the authors' information dimension. This paper discusses an important problem in data analy- The derivation is thorough, but the resulting measure is sis using PCA. The term 'component retention' that the purely heuristic, as the authors point out. In the end, authors use in the title is usually referred to as dimen- despite the theoretical gloss, it is a just formula, without sional truncation or, in more general terms, as data com- any desirable theoretical properties or intuitive interpreta- pression. The problem is to find the desired truncation tion. level to assure optimal results for applications such as clustering, classification or various prediction tasks. The evaluation of the novel measures therefore depends on their empirical performance, found in the Results and The paper contains a very exhaustive review of the history Discussion. Systematic responses to variables irrelevant to of PCA and describes many recipes for truncation pro- the (known) dimensionality of synthetic data become of posed over the 100 years since PCA was introduced. The central interest. In particular, the authors show data that authors propose also one method of their own, based on their information dimension increases systematically with the use of the entropy of correlation eigenvalues. A com- noise, clearly an undesirable property. The authors also parison of all methods is presented in Table 2, including test the dimensionality estimators on real microarray 14 criteria applied to 6 microarray experiments. This table data. They conclude that six dimensionality measures are demonstrates that the results of their proposed 'informa- in rough accord, with three outliers: Bartlett's test, cumu- tion dimension' are very different from those of most lative percent of variation explained, and the information other truncation methods. dimension (which tends to be higher than other estima- tors). They therefore propose the information dimension We appreciate the quality of the review presented in this as an upper bound for the true dimensionality, with a paper, and we recommend that it should be viewed and consensus estimate being derived from the remaining presented as such. But we have quite a few reservations measures. regarding the presentation in general and their novel method in particular. The choice of dimensionality measure is purely empirical. While it is desirable to check all estimators (and report Page 18 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 1. The motivation for dimensional reduction is briefly Answers to Horn's and Varshavsky's review mentioned in the introduction, but this point is not elab- We would like to thank the reviewers for their careful orated later on in the paper. As a result, the paper lacks a reading of our manuscript and their positive criticisms. target function according to which one could measure the We have modified our manuscript and follow most of performance of the various methods displayed in Table 2. their recommendations. We believe one should test methods according to how well they perform, rather than according to consensus. Specifically, we answer each comment by the reviewer: Performance can be measured on data, but only if a per- formance function is defined, e.g. the best Jaccard score 1. The status of the paper as a review or a regular article. achieved for classification of the data within an SVM approach. Clearly many other criteria can be suggested, It is true that the paper contains a comprehensive survey and results may vary from one dataset to another, but this of the literature and many references. Nevertheless, we is the only valid scientific approach to decide on what believe that the article contains sufficiently many new methods should be used. We believe that it is necessary for results to be seen as a regular journal article. Both the the authors to discuss this issue before the paper is modified broken-stick method and the information accepted for publication. dimension are new and important results for the field of cDNA analysis. 2. All truncation methods are heuristic. Also the new sta- tistical method proposed here is heuristic, as the authors 2. The main criticism of the paper is that we did not test admit. An example presented in Table 1 looks nice, and the performance of the different methods against some should be regarded as some justification; however the benchmark. novel method's disagreement with most other methods (in Table 2) raises the suspicion that the performance of To answer this problem we performed extensive bench- the new method, once scrutinized by some performance marking of the different methods against noisy simulated criterion on real data, may be bad. The authors are aware data for which the true signal and its dimension was of this point and they suggest using their method as an known. We have added this analysis in the paper where upper bound criterion, with which to decide if their pro- we provide an explicit example. This example clearly posed 'consensus dimension' makes sense. This, by itself, establishes our previous claim that the information has a very limited advantage. dimension provides a useful upper bound for the true sig- nal dimension (whereas other traditional methods such 3. The abstract does not represent faithfully the paper. The as Velicer's underestimate the true dimension). Upper new method is based on an 'entropy' measure but this is bounds are extremely important in data analysis as they not really Shannon entropy because no probability is provide a reference point with respect to which other involved. It gives the impression that the new method is methods can be compared. based on some 'truth' whereas others are ad-hoc which, in our opinion, is wrong (see item 2 above). We suggest that 3. The abstract does not represent the paper. once this paper is recognized as a review paper the abstract will reflect the broad review work done here. We did modify the abstract to clarify the relationship of information dimension that we propose with respect to 4. Some methods are described in the body of the article other methods (it is also a heuristic approach!). Now, (e.g., broken stick model), while others are moved to the with the added analysis and wording, we believe that the appendix (e.g., portion of total variance). This separation abstract is indeed a faithful representation of the paper. is not clear. Unifying these two sections can contribute to the paper readability. 4. Some methods appear in the appendix. In conclusion, since the authors admit that information Indeed, the methods presented in the appendix are the dimension cannot serve as a stopping criterion for PCA one that we review. Since we present a modification of the compression, this paper should not be regarded as pro- broken-stick method along with a new heuristic tech- moting a useful truncation method. Nevertheless, we nique, we believe it is appropriate to describe the broken- believe that it may be very useful and informative in stick method in the main body of the text while relegating reviewing and describing the existing methods, once the other known approaches (only used for comparison) to modifications mentioned above are made. We believe this the appendix. Keeping in mind that this is a regular article could then serve well the interested mathematical biology rather than a review, we believe it is justified. community. Page 19 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 23. Franklin S, Gibson D, Robertson P, Pohlmann J, Fralish J: Parallel Acknowledgements Analysis: a method for determining significant components. This material is based upon work supported by the National Science Foun- J Vegatat Sci 1995:99-106. dation under Grant No. DMS grant #0307427 (A.G.), DMS-0604704 (A. G.) 24. Zwick W, Velicer W: Comparison of five rules for determining and DMS-DMS-IGMS-0623989 (A.G.) and a grant from the BIO5 Institute. the number of components to retain. Psychol Bull 1986, 99:432-446. We would like to thank Jay Hoying for introducing us to the field of micro- 25. Karr J, Martin T: Random number and principal components: arrays and Joseph Watkins for many interesting discussions. further searches for the unicorn. In The use of multivariate statis- tics in wildlife habitat Volume RM-87. Edited by: Capen D. United Forest References Service General Technical Report; 1981:20-24. 26. Basilevsky A: Statistical Factor Analysis and Related Methods: Theory and 1. Pearson K: On lines and planes of closest fit to systems of Applications New York: Wiley-Interscience; 1994. points in space. Phil Mag 1901, 2:559-572. 27. Rencher A: Multivariate Statistical Inference and Applications New York: 2. Hotelling H: Analysis of a complex statistical variable into John Wiley & Sons, Inc; 1998. principal components. J Educ Psych 1933, 26:417-441. 498–520 28. Tinker N, Robert L, Harris GBL: Data pre-processing issues in 3. Rao C: The use and inter etation of principal component anal- microarray analysis. In A Practical Approach to Microarray Data Anal- ysis in applied research. Sankhya A 1964, 26:329-358. ysis Edited by: Berrar DP, Dubitzky W, Granzow M. Kluwer, Norwell, 4. Gower J: Some distance properties of latent root and vector MA; 2003:47-64. methods used in multivariate analysis. Biometrika 1966, 29. Dubitzky W, Granzow M, Downes C, Berrar D: Introduction to 53:325-338. microarray data analysis. In A Practical Approach to Microarray Data 5. Jeffers J: Two case studies in the application of principal com- Analysis Edited by: Berrar DP, Dubitzky W, Granzow M. Kluwer, Nor- ponent analysis. Appl Statist 1967, 16:225-236. well, MA; 2003:91-109. 6. Preisendorfer R, Mobley C: Principal component analysis in meterology 30. Baxter M: Standardization and transformation in principal and oceanography Amsterdam: Elsevier; 1988. component analysis, with applications to archaeometry. Appl 7. Jackson J: A User's Guide to Principal Components New York: John Wiley Statist 1995, 44:513-527. & Sons; 1991. 31. Bro R, Smilde A: Centering and scaling in component analysis. 8. Arnold G, Collins A: Interpretation of transformed axes in mul- J Chemometrics 2003, 17:16-33. tivariate analysis. Appl Statist 1993, 42:381-400. 32. Wall M, Rechsteiner A, Rocha L: Singular value decomposition 9. Jolliffe I: Principal Component Analysis Springer, New York; 2002. and principal component analysis. In A Practical Approach to 10. Alter O, Brown P, Botstein D: Singular value decomposition for Microarray Data Analysis Edited by: Berrar DP, Dubitzky W, Granzow genome-wide expression data processing and modeling. Proc M. Kluwer, Norwell, MA; 2003:91-109. Natl Acad Sci USA 2000, 97:10101-10106. 33. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and dis- 11. Holter N, Mitra M, Maritan A, Cieplak M, Banavar J, Fedoroff N: Fun- play of genome-wide expression patterns. Proc Natl Acad Sci damental patterns underlying gene expression profiles: sim- USA 1998, 95:14863-14868. plicity from complexity. Proc Natl Acad Sci USA 2000, 97:8409-14. 34. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown 12. Crescenzi M, Giuliani A: The main biological determinants of P, B DB, Futcher : Comprehensive identification of cell cycle- tumor line taxonomy elucidated by a principal component regulated genes of the yeast Saccharomyes cerevisiae by analysis of microarray data. FEBS Letters 2001, 507:114-118. microarray hybridization. Mol Biol Cell 1998, 9:3273-3297. 13. Hsiao L, Dangond F, Yoshida T, Hong R, Jensen R, Misra J, Dillon W, 35. Pielou E: Ecological Diversity New York: John Wiley & Sons; 1975. Lee K, Clark K, Haverty P, Weng Z, Mutter G, Frosch M, Macdonald 36. Macarthur R: On the relative abundance of bird species. Proc M, Milford E, Crum C, Bueno R, Pratt R, Mahadevappa M, Warrington Natl Acad Sci USA 1957, 43:293-295. J, Stephanopoulos G, Stephanopoulos G, Gullans S: A compendium 37. Frontier S: Étude de la décroissance des valeurs propres dans of gene expression in normal human tissues. Physiol Genomics une analyse en composantes principales: comparaison avec 2001, 7:97-104. le modèle du bâton brisé. Biol Ecol 1976, 25:67-75. 14. Misra J, Schmitt W, Hwang D, Hsiao L, Gullans S, Stephanopoulos G, 38. North G, Bell T, Cahalan R, Moeng F: Sampling errors in the esti- Stephanopoulos G: Interactive exploration of microarray gene mation of empirical orthogonal functions. Mon Weather Rev expression patterns in a reduced dimensional space. Genome 1982, 110:699-706. Res 2002, 12:1112-1120. 39. Reza F: An Introduction to Information Theory New York: Dover Publi- 15. Chen L, Goryachev A, Sun J, Kim P, Zhang H, Phillips M, Macgregor P, cations, Inc; 1994. Lebel S, Edwards A, Cao Q, Furuya K: Altered expression of 40. Pierce J: An introduction to information theory: symbols, signals and noise genes involved in hepatic morphogenesis and fibrogenesis New York: Dover Publications, Inc; 1980. are identified by cDNA microarray analysis in biliary atresia. 41. Khinchin A: Mathematical Foundations of Information Theory New York: Hepatology 2003, 38(3):567-576. Dover Publications, Inc; 1957. 16. Mori Y, Selaru F, Sato F, Yin J, Simms L, Xu Y, Olaru A, Deacu E, Wang 42. Shannon C: A mathematical theory of communication. Bell Sys- S, Taylor J, Young J, Leggett B, Jass J, Abraham J, Shibata D, Meltzer S: tem Technical Journal 1948, 27:379-423. 623–656. The impact of microsatellite instability on the molecular 43. Schneider T: Information theory primer with an appendix on phenotype of colorecal tumors. Cancer Research 2003, logarithms. Center for Cancer Research Nanobiology Program (CCRNP), 63:4577-4582. [Online] 2005 [http://www.lecb.ncifcrf.gov/toms/paper/primer]. 17. Jiang H, Dang Y, Chen H, Tao L, Sha Q, Chen J, Tsai C, Zhang S: Joint 44. Chu S, Derisi J, Eisen M, Mulholland J, Bolstein D, Brown P, Herskow- analysis of two micorarray gene expression data sets to itz I: The transcriptional program of sporulation in budding select lung adenocarcinoma marker genes. BMC Bioinformatics yeast. Science 1998, 282:699-705. 2004, 5:81. 45. Iyer V, Eisen M, Ross D, Schuler G, Moore T, Lee J, Trent J, Staudt L, 18. Oleksiak M, Roach J, Crawford D: Natural variation in cardiac J Hudson J, Boguski M: The Transcriptional Program in the metabolism and gene expression in Fundulus Heteroclitus. Response of Human Fibroblasts to Serum. Science 1999, Nature Genetics 2005, 37(1):67-72. 283:83-87. 19. Schena M, Shalon D, Davis R, Brown P: Quantitative Monitoring 46. Ross D, Scherf U, Eisen M, Perou C, Rees C, Spellman P, Iyer V, Jeffrey of Gene Expression Patterns with a Complementary DNA S, Rijn MVD, Waltham M, Pergamenschikov A, Lee J, Lashkari D, Microarray. Science 1995, 270:467-470. Shalon D, Myers T, Weinstein J, Botstein D, Brown P: Systematic 20. Jackson D: Stopping rules in principal components analysis: a variation in gene expression patterns in human cancer cell comparison of heuristical and statistical approaches. Ecology lines. Nat Genet 2000, 24:227-235. 1993, 74(8):2204-2214. 47. Raychaurdhuri S, Stuart J, Altman R: Principal component analysis 21. Ferré L: Selection of components in principal component to summarize microarray experiments: application to analysis: A comparison of methods. Computat Statist Data Anal sporulation time series. Pac Symp Biocomput 2000:455-466. 1995, 19:669-682. 48. Lax P: Linear Algebra New York: Wiley; 1996. 22. Bartkowiak A: How to reveal the dimensionality of the data? 49. Cattell B: The scree test for the number of factors. Multiv Behav Applied Stochastic Models and Data Analysis 1991:55-64. Res 1966, 1:245-276. Page 20 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 50. Farmer S: An investigation into the results of principal compo- nent analysis of data derived from random numbers. Statistian 1971, 20:63-72. 51. Stauffer D, Garton E, Steinhorst R: A comparison of principal components from real and random data. Ecology 1985, 66(6):1693-1698. 52. Velicer W: Determining the number of principal components from the matrix of partial correlations. Psychometrika 1975, 41(3):321-327. 53. Bartlett M: Tests of significance in factor analysis. Brit J Psychol Statist Section 1991, 3:77-85. 54. Guiasu Silviu: Information Theory with Applications. New York: McGraw Hill International Book Company; 1977. Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 21 of 21 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Biology Direct Springer Journals

Component retention in principal component analysis with application to cDNA microarray data

Biology Direct , Volume 2 (1) – Jan 17, 2007

Loading next page...
 
/lp/springer-journals/component-retention-in-principal-component-analysis-with-application-ah5HJtofH0

References (66)

Publisher
Springer Journals
Copyright
Copyright © 2007 by Cangelosi and Goriely; licensee BioMed Central Ltd.
Subject
Life Sciences; Life Sciences, general
eISSN
1745-6150
DOI
10.1186/1745-6150-2-2
pmid
17229320
Publisher site
See Article on Publisher Site

Abstract

Shannon entropy is used to provide an estimate of the number of interpretable components in a principal component analysis. In addition, several ad hoc stopping rules for dimension determination are reviewed and a modification of the broken stick model is presented. The modification incorporates a test for the presence of an "effective degeneracy" among the subspaces spanned by the eigenvectors of the correlation matrix of the data set then allocates the total variance among subspaces. A summary of the performance of the methods applied to both published microarray data sets and to simulated data is given. This article was reviewed by Orly Alter, John Spouge (nominated by Eugene Koonin), David Horn and Roy Varshavsky (both nominated by O. Alter). 1 Background In his development of PCA, Pearson [1] was interested in Principal component analysis (PCA) is a 100 year old constructing a line or a plane that "best fits" a system of mathematical technique credited to Karl Pearson [1] and points in q-dimensional space. Geometrically, this its properties as well as the interpretation of components amounts to repositioning the origin at the centroid of the have been investigated extensively [2-9]. The technique points in q-dimensional space and then rotating the coor- has found application in many diverse fields such as ecol- dinate axes in such a way as to satisfy the maximal vari- ogy, economics, psychology, meteorology, oceanography, ance property. Statistically speaking, PCA represents a and zoology. More recently it has been applied to the transformation of a set of q correlated variables into linear analysis of data obtained from cDNA microarray experi- combinations of a set of q pair-wise uncorrelated variables ments [10-18]. cDNA microarray experiments provide a called principal components. Components are con- snapshot in time of gene expression levels across poten- structed so that the first component explains the largest tially thousands of genes and several time steps [19]. To amount of total variance in the data and each subsequent assist in the data analysis, PCA (among other techniques) component is constructed so as to explain the largest is generally employed as both a descriptive and data amount of the remaining variance while remaining uncor- reduction technique. The focus of this letter will be on the related with (orthogonal to) previously constructed com- latter. ponents. Page 1 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 We define the dimension of the data set to be equal to the and summarize the results. We also introduce a modifica- number of principal components. The set of q principal tion of the broken stick model which incorporates the components is often reduced to a set of size k, where 1 ≤ k notion of degenerate subspaces in component retention. <<> q. The objective of dimension reduction is to make Finally, we introduce and include in the summary a novel analysis and interpretation easier, while at the same time application of statistical entropy to provide a new heuris- retaining most of the information (variation) contained tic measure of the number of interpretable components. in the data. Clearly, the closer the value of k is to q the bet- ter the PCA model will fit the data since more information 2 Mathematical methods has been retained, while the closer k is to 1, the simpler 2.1 Principal component analysis the model. Each principal component represents a linear combina- tion of the original variables with the first principal com- Many methods, both heuristic and statistically based, ponent defined as the linear combination with maximal have been proposed to determine the number k, that is, sample variance among all linear combinations of the var- the number of "meaningful" components. Some methods iables. The next principal component represents the linear can be easily computed while others are computationally combination that explains the maximal sample variance intensive. Methods include (among others): the broken that remains unexplained by the first with the additional stick model, the Kaiser-Guttman test, Log-Eigenvalue condition that it is orthogonal to the first [27]. Each sub- (LEV) diagram, Velicer's Partial Correlation Procedure, sequent component is determined in a similar fashion. If Cattell's SCREE test, cross-validation, bootstrapping tech- we have a q-dimensional space, we expect to have q prin- niques, cumulative percentage of total of variance, and cipal components due to sampling variation. Bartlett's test for equality of eigenvalues. For a description of these and other methods see [[7], Section 2.8] and [[9], The following derivation can be found in [[27], pp. 373– Section 6.1]. For convenience, a brief overview of the tech- 374], Jolliffe [9] or Basilevsky [26]. Let X be a (p × q) niques considered in this paper is given in the appendices. matrix that contains the observed expression of the i-th gene in its i-th row. Denote by g the i-th observation and Most techniques either suffer from an inherent subjectiv- let S be the sample covariance matrix of X. For a particular , we seek ity or have a tendency to under estimate or over estimate observation g the true dimension of the data [20]. Ferré [21] concludes that there is no ideal solution to the problem of dimen- GG z = a g + a g + ... + a g = (1) ag 1 i,1 2 i,2 p i,p i sionality in a PCA, while Jolliffe [9] notes "... it remains true that attempts to construct rules having more sound GG GG T T statistical foundations seem, at present, to offer little such that var(z) = var(ag ) is maximal subject to aa = advantage over simpler rules in most circumstances." A 1. That is, we maximize the expression comparison of the accuracy of certain methods based on real and simulated data can be found in [20-24]. GG G G TT aSa−− λ a a12() () Data reduction is frequently instrumental in revealing where λ is a Lagrange multiplier. Differentiating with mathematical structure. The challenge is to balance the accuracy (or fit) of the model with ease of analysis and the respect to a leads to the familiar eigenvalue problem potential loss of information. To confound matters, even GG random data may appear to have structure due to sam- Sa−= λ a03 () pling variation. Karr and Martin [25] note that the percent variance attributed to principal components derived from So λ is an eigenvalue of S and real data may not be substantially greater than that GG G G G G TT T derived from randomly generated data. They caution that aSa=== a λλ a a a λ 4 is its correspond- () most biologists could, given a set of random data, gener- ing eigenvector. Since ate plausible "post-facto" explanations for high loadings in "variables." Basilevsky [26] cautions that it is not neces- GG G G G G TT T aSa=== a λλ a a a λ () 4 sarily true that mathematical structure implies a physical process; however, the articles mentioned above provide we see that to maximize the expression we should choose examples of the successful implementation of the tech- the largest eigenvalue and its associated eigenvector. Pro- nique. ceed in a similar fashion to determine all q eigenvalues and eigenvectors. In this report, we apply nine ad-hoc methods to previ- ously published and publicly available microarray data Page 2 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 2.1.1 Data preprocessing sionality in three of the four patterned matrices used in his Data obtained from cDNA microarray experiments are fre- study, giving underestimates in the other. He reported quently "polished" or pre-processed. This may include, that overall, the model was one of the two most accurate but is not limited to: log transformations, the use of under consideration. Bartkowiak [22] claims that the bro- weights and metrics, mean centering of rows (genes) or ken stick model applied to hydro-meteorical data pro- columns (arrays), and normalization, which sets the mag- vided an underestimate of the dimensionality of the data. nitude of a row or column vector equal to one. The term Her claim is based on the fact that other heuristic tech- data preprocessing varies from author to author and its niques generally gave higher numbers (2 versus 5 to 6). merits and implications can be found in [28-32]. However, it should be noted that the true dimension of the data is unknown. Ferré [21] suggests that since PCA is It is important to note that such operations will affect the used primarily for descriptive rather than predictive pur- eigensystem of the data matrix. A simple example is pro- poses, which has been the case with microarray data anal- vided by comparing the singular spectrum from a singular ysis, any solution less than the true dimension is value decomposition (SVD) with that of a traditional acceptable. PCA. Note that PCA can be considered as a special case of singular value decomposition [32]. In SVD one computes The broken-stick model has the advantage of being the eigensystem of X X, where the p × q matrix X contains extremely easy to calculate and implement. Consider the the gene expression data. In PCA one computes the eigen- closed interval J = [0,1]. Suppose J is partitioned into n system of S = M M/(p - 1), where M equals the re-scaled subintervals by randomly selecting n - 1 points from a uni- and column centered (column means are zero) matrix X. form distribution in the same interval. Arrange the subin- The matrix S is recognized as the sample covariance matrix tervals according to length in descending order and of the data. Figure 1 illustrates the eigenvalues (expressed denote by L the length of the k-th subinterval. Then the as a percent of total dispersion) obtained from a PCA and expected value of L is [37] an SVD on both the raw and log base-two transformed [33] elutriation data set of the budding yeast Saccharomy- ces cerevisiae [34]. Note the robustness of PCA. In Figure EL = . 5 () () k ∑ nj 1.a, which is an SVD performed on the raw data, we see jk = the dominance of the first mode. In general, the further Figure 2 provides an illustration of the broken stick distri- the mean is from the origin, the larger the largest singular bution for n = 20 subintervals graphed along with eigen- value will be in a SVD relative to the others [7]. values obtained from the covariance matrix of a random matrix. The elements of the random matrix are drawn 2.2 Broken stick model from a uniform distribution on the interval [0,1]. The bars The so-called broken stick model has been referred to as a represent the values from the broken stick distribution; resource apportionment model [35] and was first pre- the circles represent the eigenvalues of the random matrix. sented as such by MacArthur [36] in the study of the struc- In this case, no component would be retained since the ture of animal communities, specifically, bird species proportion of variance explained by the first (largest) from various regions. Frontier [37] proposed comparing eigenvalue falls below the first value given by the broken eigenvalues from a PCA to values given by the broken stick model. stick distribution. The apportioned resource is the total variance of the data set (the variance is considered a 2.3 Modified broken stick model resource shared among the principal components). Since Consider a subspace spanned by eigenvectors associated each eigenvalue of a PCA represents a measure of each with a set of "nearly equal" eigenvalues that are "well sep- components' variance, a component is retained if its asso- arated" from all other eigenvalues. Such a subspace is well ciated eigenvalue is larger than the value given by the bro- defined in that it is orthogonal to the subspaces spanned ken stick distribution. An example of broken stick by the remaining eigenvectors; however, individual prin- distribution with a plot can be found in Section 3.2. cipal components within that subspace are unstable [9]. This instability is described in North et al. [38] where a As with all methods currently in use, the broken stick first order approximation to estimate how sample eigen- model has drawbacks and advantages. Since the model values and eigenvectors differ from their exact quantities does not consider sample size, Franklin et al. [23] con- is derived. This "rule of thumb" estimate is tends that the broken stick distribution cannot really 1/2 model sampling distributions of eigenvalues. The model δλ ~ λ (2/N) (6) also has a tendency to underestimate the dimension of the data [20-22]. However, Jackson [20] claims that the bro- where N is the sample size and λ is an eigenvalue. The ken stick model accurately determined the correct dimen- interpretation given by North et al. [38] is that "... if a Page 3 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 (a). SVD on raw Data (b). PCA on raw data 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Component Number Component Number (c). SVD on log−transformed data (d). PCA on log−transformed data 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Component Number Component Number SVD Figure 1 and PCA SVD and PCA. (a) SVD performed on the elutriation data set; (b) PCA on the elutriation data set; (c) SVD on the log base two transformed data; (d) PCA on the log base two data. group of true eigenvalues lie within one or two δλ of each (6), an estimate of the sampling error is determined and other then they form an 'effectively degenerate multiplex,' those eigenvalues which lie within 1.5 of each other is and sample eigenvectors are a random mixture of the true noted (note that the value of the spacing, 1.5, is somewhat eigenvectors." arbitrary. In their report, North et al. [38] suggest using a value between 1 and 2.) Components are then grouped As noted previously, the broken stick model has been into subspaces preserving the order determined by the referred to as a resource apportionment model, and in maximum variance property of PCA. Subspaces are particular, the resource to be apportioned among the spanned by either a single eigenvector, or in the case of an components is the total variance. We modify this "effective degeneracy," by multiple eigenvectors. Denote approach by considering the variance as apportioned these subspaces by W . For each W we sum the eigenval- i i among individual subspaces. ues associated with the eigenvectors spanning that space. We then repartition the broken stick model to match the Once the eigenvalues, λ , have been computed, the spac- subspaces and then apply the broken stick model to each ing between them, λ - λ , is calculated. Using Equation subspace, requiring that the sum of the eigenvalues asso- i+1 i Page 4 of 21 (page number not for citation purposes) Percent of total Percent of Total Percent of total Percent of total Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 0.2 0.15 0.1 0.05 0 2 4 6 8 10 12 14 16 18 20 Component Number The Figure 2 broken stick method The broken stick method. The broken stick distribution (bars) with eigenvalues obtained from a uniform random matrix of size 500 × 20. ciated with that subspace exceed the value given by the 2.4.1 Information and uncertainty broken stick model. In the context of information theory, information is a measure of what can be communicated rather than what 2.4 Statistical entropy and dimensionality is communicated [42]. Information can also be though of In physics, the entropy of a closed system is a measure of as a term used to describe a process that selects one or disorder. An increase in entropy corresponds to an more objects from a set of object. For example, consider a increase in disorder which is accompanied by a decrease balanced six-sided die. We will consider the die as a in information. In the branch of applied probability device, D, that can produce with equal probability any ele- known as information theory [39-42], the concept of ment of the set S = {1, 2, 3,4, 5, 6}. Of course, observing entropy is used to provide a measure of the uncertainty or a tossed die and noting the number of the top face is a information of a system. Note that uncertainty and infor- finite discrete probability experiment with sample space mation are used synonymously, the reason for which is S where each sample point is equally likely to occur. explained below. The word system is used to imply a com- plete discrete random experiment. Denote the probability model for an experiment with out- comes e ,...,e with associated probabilities p ,...,p as 1 n 1 n Page 5 of 21 (page number not for citation purposes) Percent of total Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 ee " e ⎛ ⎞ 12 n . () 7 Hp() ,...,p =− p log p . () 11 ⎜ ⎟ 12 Nk ∑ k pp " p ⎝ 12 n ⎠ k=1 In the example of the balanced six-sided die we have Equation (11) represents the average information content or average uncertainty of a discrete system. The quantity is considered a measure of information or uncertainty 12 3 4 5 6 ⎛ ⎞ . 8 () depending upon whether we consider ourselves in the ⎜ ⎟ 16////// 16 16 16 16 16 ⎝ ⎠ moment before the experiment (uncertainty) or in a moment after the experiment (information), [54]. Prior to performing the experiment, we are uncertain as to its outcome. Once the die is tossed and we receive infor- 2.4.3 Derivation of H mation regarding the outcome, the uncertainty decreases. Statistical entropy is derived from the negative binomial As a measure of this uncertainty we can say that the device distribution where an experiment with two equally likely has an "uncertainty of six symbols" [43]. Now consider a outcomes, labeled success or failure, is considered. Basi- "fair" coin. This "device," which we will call C, produces levsky [26] shows that with x representing the first two symbols with equal likelihood from the set S = {h, t} number on which a "success" is observed, the probability and we say this device has an "uncertainty of two sym- th of observing success on the x trial is given by f(x) = p = (l/ bols." We denote this probability model as 2) . Upon solving for x we have x = -log p, and the expected or total entropy of a system, H, is ht ⎛ ⎞ . 9 () ⎜ ⎟ 12// 12 ⎝ ⎠ Hf =− xx = p log , 12 () () ∑ ∑ Both models represent uniform distributions (the out- k 2 x k=1 comes in the respective models have equal probabilities), but it is inferred that device D is a finite scheme with where p log is defined to be 0 if p = 0. k 2 k greater uncertainty than device C (an "uncertainty of six symbols" versus an "uncertainty of two symbols"). Conse- quently, we expect device D to convey more information. 2.4.4 Basic properties To see this, consider (as an approximation to the amount It is possible to derive the form of the function H by of information conveyed) the average minimum number assuming it possesses four basic properties, [41]: (i) con- of binary questions that would be required to ascertain the outcome of each experiment. In the case of device D, tinuity, (ii) symmetry, (iii) an extremal property, and (iv) the average minimum number of questions is 2.4 while in additivity. Continuity requires that the measure of uncer- the case of device C only one question is required. Now tainty varies continuously if the probabilities of the out- consider an oddly minted coin with identical sides (say comes of an experiment are varied in a continuous way. heads on either side). The model for this device is Symmetry states that the measure must be invariant to the order of the p s, that is, H(p , p ,...,p ) = H(p , p ,...,p ). k 1 2 N 2 1 N ht ⎛ ⎞ . 10 () Additivity requires that given the following three H func- ⎜ ⎟ ⎝ ⎠ tions defined on the same probability space Since heads, h, is the only possible outcome, we consider H (p , p ,...,p ), this as a "device of one symbol." Notice that this device 1 1 2 N carries no information and contains no element of uncer- H (p , p ,...,p , q , q ,...,q ), (13) tainty. We need not pose a question to ascertain the out- 2 1 2 N 1 2 M come of the experiment. Thus, a function that attempts to H (q /p , q /p ,...,q /p ), quantify the information or uncertainty of a system will 3 1 N 2 N M N depend on the cardinality of the sample space and the the relationship H = H + p H holds. Notice that this probability distribution. 2 1 N 3 implies H ≥ H , that is, partitioning events into sub- 2 3 events cannot decrease the entropy of the system [39]. The 2.4.2 Entropy: a measure of information content (or uncertainty) extremal property, which we now describe, will be used in Every probability model (or device) describes a state of our development of the information dimension described uncertainty [41]. Shannon [42] provided a measure for below. First, notice that since 0 ≤ p ≤ 1 for all k, a mini- such uncertainty, which is known as statistical entropy mum value of 0 is attained when p = 1 and p = p =  = (often referred to as Shannon's entropy). Its functional 1 2 3 p = 0, so H ≥ 0. As an upper bound, we have form is given by Page 6 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 11 1 1 ⎛ ⎞ Hp ,p ,...,p ≤ H , ,..., = log N, 14 () () pp == " = () 18 12 N ⎜ ⎟ 2 1 n NN N ⎝ ⎠ that is, a uniform distribution of probabilities provides an pp == " =01 , 9 () upper bound on the uncertainty measure of all discrete nN +1 probability models whose sample space has cardinality of at most N. This relationship can be proved either with dif- Inserting these values into H and solving for n yields ferential calculus [39] or from Jensen's inequality −p 0 k N N nN== p . () 20 ⎛ ⎞ 0 ∏ ϕϕ ⎜ a ⎟ ≤ a 15 k=1 () () ∑∑ k k ⎜ ⎟ N N k== 11 k ⎝ ⎠ 2.4.6 A geometric example which is valid for any convex function ϕ [41]. Consider the following geometric example. The surface of a three-dimensional ellipsoid is parameterized by the 2.4.5 The information dimension, n 0 equations What we coin as, the "information dimension," n , is pre- sented as a novel (but heuristic) measure for the interpret- x(φ,θ) = R sin(φ)cos(θ), able components in a principal component analysis. We assume that a principal component analysis has been per- y(φ,θ) = R sin(φ)sin(θ), (21) formed on a microarray data set and that our objective is to reduce the dimension of the data by retaining "mean- z(φ,θ) = R cos(φ). ingful" components. This involves setting one or more of the eigenvalues associated with the low variance compo- Points are distributed along the surface of the ellipsoid nents to zero. Let λ , λ ,...,λ represent the eigenvalues according to the above parametrization and are tabulated 1 2 N from a PCA of the data. Define for each k in the matrix X of size (4584 × 3). Set R = R = R = 1, then x y z (21) is a parametrization representing the surface of the unit sphere centered at the origin. Gradually deform the p = . 16 () sphere by changing the values of R subject to the con- ∑ j j=1 straint R R R = 1, which gives ellipsoids of constant vol- x y z ume (equal to 4π/3). We summarize the results in Table 1. The p s satisfy 0 ≤ p ≤ 1, (k = 1,...,N) and p = 1 . k k ∑ k Notice that for the case R = R = R = 1, which represents K =1 x y z the unit sphere, n = 3. The gradual deformation of the We view the distribution of the eigenvalues expressed as a sphere has an information dimension of approximately proportion of total variance as a discrete probability two for the values: R = 2, R = 1, R = 1/2. This suggests that x y z model. the magnitude along the z-axis has become sufficiently small relative to the x- and y-axes, that it may be discarded We begin by normalizing the entropy measure [10] for for information purposes. Thus, a projection onto the xy- some fixed n = N using its extremal property to get plane may provide sufficient information regarding the shape of the object. For R = 8, R = 1, R = 1/8 the object x y z H 1 begins to "look" one dimensional with n = 1.09. With H = =− pp log . 17 () 0 ∑kk 2 logNN log () () 22 this configuration, most of the variance lies along the x- k=1 axis. The values of H will vary between 0 and 1 inclusive. 3 Results and discussion We calculate the entropy of the probability space using In this section we apply the information dimension, the broken stick model, the modified broken stick model, Equation (17) to obtain the functional value H , where 0 Bartlett's Test, Kaiser-Guttman, Jolliffe's modification of ≤ H ≤ 1. (Note that at either extreme the dimension is Kaiser-Guttman, Velicer's minimum average partial known.) Next, we deform the original distribution of (MAP) criteria, Cattell's scree test, parallel analysis, cumu- eigenvalues so that the following holds lative percent of variance explained, and Log-eigenvalue diagram techniques to published yeast cdc15 cell-cycle and elutriation-synchronized cell cycle data sets [34], sporulation data set [44], serum-treated human fibroblast Page 7 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 1: Dimension suggested by the Information Dimension (R , R , R ) (1,1,1) (8/5,1,5/8) (2,1,1/2) (3,1,1/3) (8,1,1/8) x y z H(p ,p ,p ) 1.000 0.781 0.608 0.348 0.074 1 2 3 p = λ /(λ + λ + λ ) 0.333 0.648 0.762 0.890 0.984 1 1 1 2 3 p = λ /(λ + λ + λ ) 0.333 0.253 0.191 0.099 0.015 2 2 1 2 3 p = λ /(λ + λ + λ ) 0.333 0.099 0.048 0.010 0.000 3 3 1 2 3 n 3.00 2.36 1.95 1.47 1.09 data set [45], and the cancer cell lines data sets [46]. These nomials were constructed using a Gramm-Schmidt proc- data sets have been previously explored [10-12,47]. ess [48] with norm Before attempting to reduce the dimension of the data, we px()q ()x dx = δ,2()3 jk jk first consider whether a PCA is appropriate, that is, a data set with very high information content will not lend itself to significant dimension reduction, at least not without where δ is the Kronecker delta function. The functional jk some non-trivial loss of information. In their study, Alter form of the polynomials are: et al. [10] addresses the issue by considering the normal- ized entropy (presented above) of a data set, which is a px()=− α 32() x 1 , measure of the complexity or redundancy of the data. The px()=− α 56x 6x+1 , 22 () index ranges in value from 0 to 1 with values near zero px()=− α 72() x 1 10x −10x x + 1 , 33 () indicating low information content versus values near 1 () 43 2 px=− α 9 210x 420x+ 270x− 60x+ 3 , () 44() which indicate a highly disordered or random data set. In 5 43 2 px()=− α 11 252x 6 630xx+− 560 210x+ 30x− 1, 55() this form, the entropy can only be use to give the 65 4 32 researcher a "feeling" for the potential for dimension px=− α 13 924x 2772x+ 3150x− 16 680xx+− 420 42x+ 1 , ()() reduction. For example, what level of dimension reduc- where the α 's are applied to each functional value and tion is implied by an entropy reading of 0.3 versus 0.4? represent uniform random variables drawn from the interval [0.5,1.5]. The remaining 5,400 rows are popu- Another measure is presented in Jackson [7] and credited lated with random numbers drawn from a uniform distri- to Gleason and Staelin for use with the q × q correlation bution on the interval [-3,3]. Figure 3 provides an matrix, R, and is given by illustration of the polynomials in the presence of Gaus- sian noise (σ = 0.25). Rq − A singular value decomposition was performed on X , for ϑ= , () 22 qq − 1 () σ ranging from 0.0 to 1.0 by 0.1 increments. In the absence of Gaussian noise (σ = 0), the information The statistic also ranges in value from 0 to 1. If there is lit- dimension predicts the dimension of the data to be n = tle or no correlation among the variables, the statistic will 5.9, which compares favorably with the true dimension of be close to 0; a set of highly correlated variables will have 6. It should be noted, however, that like other stopping a statistic close to 1. The statistic is given by Jackson [7] criteria, the information dimension is a function of the asserts that the distribution of the statistic is unknown, noise present in the data. Figure 4 illustrates this depend- but may be useful in comparing data sets. ence when the number of assays is 15. The information dimension (line with circle markers), Jolliffe's modifica- 3.1 Stopping rules applied to synthetic data tion of the Guttman-Kaiser rule (line with star markers) In this section we apply the stopping criteria to a (6000 × and LEV (line with square markers) are plotted against 15) matrix, X , where X is populated with simulated data. noise level, measured in standard deviations. The predic- σ σ The simulation model can be expressed as X = Y + N , tions given by both the information dimension and Gutt- σ σ where N is a random matrix representing Gaussian noise man-Kaiser's rule increase as the noise level increases, and whose entries were drawn from a standard normal while LEV drops sharply. The reason LEV decreases is that distribution with zero mean and standard deviation, σ. higher noise levels cause the distribution of the eigenval- The matrix Y was constructed by populating the first 600 ues to look uniform. The results of applying all of the rows with values from six orthonormal polynomials. Each stopping techniques to the matrix X for σ = 0 and σ = polynomial populates 100 rows of the matrix. The poly- 0.25 are summarized in Table 2. Page 8 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Linear Quadratic Cubic 4 4 4 2 2 2 0 0 0 −2 −2 −2 −4 −4 −4 0 0.5 1 0 0.5 1 0 0.5 1 Quartic Quintic Sixth Degree Polynomial 4 4 4 2 2 2 0 0 0 −2 −2 −2 −4 −4 −4 0 0.5 1 0 0.5 1 0 0.5 1 Si Figure 3 mulated data Simulated data. Six orthonormal polynomials with gaussian Noise. 3.2 Yeast cdc15 cell-cycle data set that it requires the first seven eigenvalues to account for A PCA was performed on the genes identified in Spellman over 90% of the variance in the data. The Kaiser-Guttman [34] responsible for cell cycle regulation in yeast samples. test retains the first four eigenvalues which represents the The cdc15 data set contains p = 799 rows (genes) and q = number of eigenvalues obtained from the correlation 15 columns representing equally spaced time points. The matrix that exceeds unity. To incorporate the effect of sam- unpolished data set appears to have a high information ple variance Jolliffe [9] suggests that the appropriate content as suggested by the normalized entropy which is number to retain are those eigenvalues whose value .7264 and the Gleason-Staelin statistic which is 0.3683. exceed 0.7. Jolliffe's modification of Kaiser-Guttman Therefore, we should expect the stopping criteria to indi- would indicate that the first five eigenvalues are signifi- cate that significant dimension reduction may not be pos- cant. Parallel analysis compares the eigenvalues obtained sible. from either the correlation or covariance matrix of the data to those obtain from a matrix whose entries are Eigenvalues based on both the covariance and correlation drawn from a uniform random distribution. matrices are given in Table 2. From the given data we see Page 9 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 sigma level Predi Figure 4 cted dimension for simulated data Predicted dimension for simulated data. Predicted dimension versus noise Level. Shown is the information dimension (line with circle markers), Jolliffe's modification of the Guttman-Kaiser rule (line with star markers) and LEV (line with square markers) plotted against noise level, measured in standard deviations. Note that the predictions given by both the information dimension and Guttman-Kaiser's rule increase as the noise level increases, while LEV drops sharply (see Text). Cattell's scree test looks for an inflection point in the 6.b). For each eigenvalue λ , we graph log (λ ) against j and j j graph of the eigenvalues, which are plotted in descending look for the point at which the eigenvalues decay linearly. order. Figure 5 illustrates Cattell's scree test for eigenval- The method is based on the conjecture that the eigenval- ues obtained from the correlation matrix and from the ues associated with eigenvectors that are dominated by covariance matrix respectively. By inspecting the differ- noise will decay geometrically. The LEV diagram is subject ences of the differences between eigenvalues, we see that to interpretation and may suggest retaining 0, 3, or 9 com- the first inflection point occurs between the fourth and ponents. fifth eigenvalues. Therefore, the scree test gives a dimen- sion of five. Figure 7 illustrates Velicer's minimum average partial cor- relation statistic. It is based upon the average of the Figure 6 contains graphs of the Log-eigenvalue diagram, squared partial correlations between q variables after the LEV, where the eigenvalues are obtained from the correla- first m components have been removed [52]. The sum- tion matrix (Figure 6.a) and the covariance matrix (Figure mary statistic is given by Page 10 of 21 (page number not for citation purposes) Dimension Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 2: Summary of results Column No. 1 2 3 4 5 6 7 8 Data Set Y + N Y + N alpha cdc15 elutriation fibro sporula tumor 0 .25 Broken Stick, (BS) 1 1 2 4 3 2 1 3 Modified BS 1 1 2 4 3 2 1 3 Velicer's MAP 2/8 2 3 5 3 3 2 8/10 Kaiser-Guttman, (KG) 2 2 3 4 3 3 2 9 Jolliffe's KG 7 7 4 5 4 3 3 14 LEV Diagram 6/8 6/8 4/5 5 4/5 4/6 3 12/21 Parallel Analysis 1 1 5 4 3 3 2 8 Scree Test 8 8 5 5 4 6 4 7 Info Dimension 5.9 7.3 11.1 7.2 6.4 3.0 2.6 17.3 Gleason-Staelin Stat .525 .45 .34 .37 .38 .54 .58 .37 Normalized Entropy .917 .941 .779 .726 .706 .438 .493 .696 80% of Var. 4 4 9 4 5 3 3 19 90% of Var. 7 7 14 7 9 5 3 33 Bartlett's Test 15 15 22 15 14 12 7 60 The table contains the results of twelve stopping rules along with two measures of data information content for six cDNA microarray data sets. We recommend looking for a consensus among the rules given in the upper portion of the table, while avoiding rules based on cumulative percent of variation explained or Barlett's test. Synthetic data sets are summarized in columns 1 (no noise) and 2 (Gaussian noise, μ = 0 and σ = 0.25). The matrix sizes for columns 1 through 8 are : (6000 × 15), (6000 × 15), (4579 × 22), (799 × 15), (5981 × 14), (517 × 12), (6118 × 7), and (1375 × 60). through fifteen lengths. This would also suggest that we f = r , () 25 accept or reject the entire subspace. Of course the tail of mi ∑∑()j qq − 1 () ij ≠ the distribution can never exceed that suggested by the broken stick distribution. The broken stick model suggests a dimension of four for the cdc15 data. where r is the element in the i-th row and j-th column of ij 3.3 Summary of results For six microarray data sets the matrix of partial correlations and co-variances. The Table 2 summarizes the results of the stopping criteria for pattern of the statistic given in Figure 7 for cdc15 cell cycle six microarray data sets. Note that Bartlett's test fails to dis- data is typical in that the statistic first declines then rises. card any components. The null hypothesis that all roots Once the statistic begins to rise it indicates that additional are equal is rejected at every stage of the test. The large principal components represent more variance than cov- sample size of each data set was a major factor for all roots ariance [7]. Therefore, no components are retained after testing out to be significantly different. The broken stick the average squared partial correlation reaches a mini- model consistently retained the fewest number of compo- mum. Here the minimum occurs at j = 5, which suggests nents, which appears consistent with comments in the lit- erature. The results of the modified broken stick model retaining the first five principal components. were identical to that of the original model since the first Figure 8. a shows the eigenvalues from the covariance few eigenvalues in each data set appear to be well sepa- matrix along with error bars representing the sampling rated, at least with respect to Equation (6). Since no effec- error estimate suggested by North et al. [38] and presented tive degenerate subspaces were identified, all subspaces in Section 2.3 above. Figure 8.b shows the eigenvalues matched those of the original model. The cumulative per- obtained from the covariance matrix superimposed on the cent of variance at the 90% level retains the greatest broken stick distribution. Applying the rule of thumb number of components while components retained at the method given in North et al. [38], we find that the first six 80% level appear to be more consistent with other rules. eigenvalues are sufficiently separated and may be treated Regardless of the cutoff level chosen, this method is com- as individual subspaces. The remaining eigenvalues are pletely arbitrary and appears to be without merit. While close compared to their sampling error. Therefore, when the LEV diagram is less subjective, it is often difficult to applying the broken stick model we require that the total interpret. Kaiser-Guttman, Jolliffe's modification of Kai- variance of the effectively degenerate subspace spanned by ser-Guttman, Cattell's scree test, parallel analysis and the associated eigenvectors to exceed the value suggested Velicer's MAP consistently retained similar numbers of by the broken stick model for the sum of the seventh components. The information dimension gave compara- Page 11 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Cattell's scree Figure 5 test Cattell's scree test. Cattell's scree test using eigenvalues obtained from (a) the correlation matrix and from (b) the covari- ance matrix of the yeast cdc15 cell-cycle data set. Since the first inflection point occurs between the fourth and fifth eigenval- ues, the implied dimension is five. Cattell's scree test using eigenvalues obtained from (a) the correlation matrix and from (b) the covariance matrix of the yeast cdc15 cell-cycle data set. Since the first inflection point occurs between the fourth and fifth eigenvalues, the implied dimension is five. ble results. It often retained the most components of the Our analysis shows that the broken stick model and five aforementioned rules suggesting that it may provides Velicer's MAP consistently retained the fewest number of an upper bound on the interpretable number of compo- components while stopping rules based on percent varia- nents. tion explained and Bartlett's test retained the greatest number of components. We do not recommend the use of 4 Conclusion Bartlett's test (as presented here) or those based on the Principal component analysis is a powerful descriptive cumulative percent of variation explained. Due to the and data reduction technique for the analysis of microar- large sample size (typical of microarray data), Bartlett's ray data. Twelve stopping rules to determine the appropri- test failed to discard any components, while rules based ate level of data reduction were presented including a new on percent variation explained are completely arbitrary in heuristic model based on statistical entropy. While the nature. issue of component retention remains unresolved, the information dimension provides a reasonable and con- For the analysis of cDNA microarray data, we do not rec- servative estimate of the upper bound of the true dimen- ommend any one stopping technique. Instead, we suggest sion of the data. that one look for a "consensus dimension" given by the Page 12 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Th Figure 6 e LEV Diagram The LEV Diagram. LEV Diagram using eigenvalues obtained from (a) the correlation matrix and from (b) the covariance matrix of the yeast cdc15 cell-cycle data set. The fifth through fifteen eigenvalues lie approximately on a line indicating a dimen- sion of five. modified broken stick model, Velicer's MAP, Jolliffe mod- A A brief overview of stopping techniques ification of Kaiser-Guttman, the LEV diagram, parallel A.1 Introduction analysis, the scree test and the information dimension Franklin et al. [23] suggests that when a researcher uses while using the information dimension as an upper PCA for data analysis, the most critical problem faced is bound for the number of components to retain. Comput- determining the number of components to retain. Indeed, ing all seven stopping rules is an easy task and Matlab rou- retaining too many components potentially leads to an tines are available from the authors. attempt to ascribe physical meaning to what may be noth- ing more than noise in the data set, while retaining too As a guiding example consider the results from the cdc15 few components may cause the researcher to discard valu- cDNA data set given in Table 3. For the cdc15 data set, the able information. Many methods have been proposed to consensus is split between four and five; however, given address the question of component selection and a brief that the information dimension is seven, it appears rea- review is given here. A more extensive review can be found sonable to choose five as the appropriate dimension in in [[7], Section 2.8]; [[9], Section 6.1]; [[6], Chapter 5]. which to work. These methods may be categorized as either heuristic or statistical approaches [20]. The statistical methods may be Page 13 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Velicer's minimum average part Figure 7 ial statistic Velicer's minimum average partial statistic. Velicer's minimum average partial statistic displays a minimum value at five, indicating that the implied dimension is five. Page 14 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Th Figure 8 e cdc15 yeast cell-cycle data set The cdc15 yeast cell-cycle data set. (a) Error bars about the eigenvalues obtained from the covariance matrix of the cdc15 yeast cell-cycle data set illustrating North et al. (1982) "rule of thumb" estimate with δ = 1.5. The spacing between the second and third eigenvalues indicate a possible degenerate subspace spanned by the associated eigenvectors, (b) is a graph of the bro- ken stick model (circles) plotted against the eigenvalues (bars) obtained from the covariance matrix of the yeast cdc15 data set. The broken stick model (and the modified broken stick model) indicate a dimension of four. further partitioned into two groups: those that make A. 2 Scree test assumptions regarding the distribution of the data and The scree test is a graphical technique attributed to Cattell those that do not make such assumptions. Jolliffe [9] crit- [49] who described it in term of retaining the correct icizes the former stating that the distributional assump- number of factors in a factor analysis. However, it is tions are often unrealistic and adds that these methods widely used in PCA [9]. While a scree graph is simple to tend to over-estimate the number of components. The lat- construct, its interpretation may be highly subjective. Let ter methods tend to be computationally intensive (for λ represent the k-th eigenvalue obtained from a covari- example, cross validation and bootstrapping). Our ance or correlation matrix. A graph of λ against k is approach is to consider only heuristic methods here with known as a scree graph. The location on the graph where the exception of Velicer's Partial Correlation test, which is a sharp change in slope occurs in the line segments join- a statistical method that does not require distributional ing the points is referred to as an elbow. The value of k at assumptions nor is it computationally intensive. We which this occurs represents the number of components present below a brief discussion of the heuristic tech- that should be retained in the PCA. Jackson [7] notes that niques for determining the dimensionality of data sets as the scree test is a graphical substitute for a significance well as Velicer's Partial Correlation test. test. He points out that interpretation might be con- Page 15 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 3: Eigenvalues of yeast cdc15 cell cycle data No. Eigenvalue (Covariance) Percent of Total Cumulated Percentage Eigenvalue (Correlation) Random 1 1.9162 32.25 32.25 5.0025 0.1004 2 1.1681 19.66 51.91 2.8896 0.0997 3 0.8560 14.41 66.32 2.4140 0.0975 4 0.8320 14.00 80.32 1.7371 0.0940 5 0.3295 5.55 85.87 0.7499 0.0905 6 0.2087 3.51 89.38 0.5495 0.0870 7 0.1490 2.51 91.89 0.3663 0.0841 8 0.1337 2.25 94.14 0.3064 0.0831 9 0.0881 1.48 95.62 0.2706 0.0808 10 0.0842 1.42 97.04 0.2206 0.0801 11 0.0580 0.98 98.02 0.1507 0.0779 12 0.0402 0.68 98.69 0.1292 0.0750 13 0.0341 0.57 99.27 0.0930 0.0727 14 0.0273 0.46 99.73 0.0731 0.0702 15 0.0162 0.27 100.00 0.0464 0.0650 founded in cases where the scree graph either does not t(k) because an appropriate measure of a lack-of-fit is have a clearly defined break or has more than one break. λ (see Jolliffe 2002, pp. 113). ∑ i Also, if the first few roots are widely separated, it may be ik =+1 difficult to interpret where the elbow occurred due to a A.4 Average eigenvalue (Guttman-Kaiser rule and Jolliffe's loss in detail caused by scaling. This problem might be Rule) remedied using the LEV described below. The most common stopping criterion in PCA is the Gutt- man-Kaiser criterion [7]. Principal components associated A.3 Proportion of total variance explained In a PCA model, each eigenvalue represents the level of with eigenvalues derived from a covariance matrix, and variation explained by the associated principal compo- that are larger in magnitude than the average of the eigen- values, are retained. In the case of eigenvalues derived nent. A simple and popular stopping rule is based on the from a correlation matrix, the average is one. Therefore, proportion of the total variance explained by the principal any principal component associated with an eigenvalue components retained in the model. If k components are retained, then we may represent the cumulative variance whose magnitude is greater than one is retained. explained by the first k PC's by Based on simulation studies, Jolliffe [9] modified this rule using a cut-off of 70% of the average root to allow for sam- ∑ i t = 26 pling variation. Rencher [27] states that this method () trace S () works well in practice but when it errs, it is likely to retain too many components. It is also noted that in cases where where S is the sample covariance matrix. The researcher the data set contains a large number of variables that are decides on a satisfactory value for t(k) and then deter- not highly correlated, the technique tends to over estimate mines k accordingly. The obvious problem with the tech- the number of components. Table 4 lists eigenvalues in nique is deciding on an appropriate t(k). In practice it is descending order of magnitude from the correlation common to select levels between 70% to 95% [9]. Jackson matrix associated with a (300 × 9) random data matrix. [7] argues strongly against the use of this method except The elements of the random matrix were drawn uniformly over the interval [0, 1] and a PCA performed on the corre- possibly for exploratory purposes when little is known lation matrix. Note that the first four eigenvalues have val- about the population of the data. An obvious problem ues that exceed 1 and all nine eigenvalues have values that occurs when several eigenvalues are of similar magnitude. exceed 0.7. Thus, Kaiser's rule and its modification suggest For example, suppose for some k = k , t(k ) = 0.50 and the * * the existence of "significant PCs" from randomly gener- remaining q - k eigenvalues have approximately the same ated data – a criticism that calls into question its validity magnitude. Can one justify adding more components [20,25,50,51]. until some predetermined value of t(k) is reached? Jolliffe A.5 Log-eigenvalue diagram, LEV [9] points out that the rule is equivalent to looking at the An adaptation of the scree graph is the log-eigenvalue dia- spectral decomposition of S. Determining how many gram, where log(λ ) is plotted against k. It is based on the terms to include in the decomposition is closely related to Page 16 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Table 4: Eigenvalues from a random matrix. No. 1 23 45 67 89 Eigenvalue 1.21 1.20 1.13 1.03 0.96 0.93 0.89 0.86 0.77 conjecture that eigenvalues corresponding to 'noise' Reviewers' comments should decay geometrically, therefore, those eigenvalues Orly Alter review should appear linear. Farmer [50] investigated the proce- R. Cangelosi and A. Goriely present two novel mathemat- dure by studying LEV diagrams from different groupings ical methods for estimating the statistically significant of 6000 random numbers. He contends that the LEV dia- dimension of a matrix. One method is based on the Shan- gram is useful in determining the dimension of the data. non entropy of the matrix, and is derived from fundamen- tal principles of information theory. The other method is A.6 Velicer's partial correlation test a modification of the "broken stick" model, and is derived Velicer [52] proposed a test based on the partial correla- from fundamental principles of probability. Also pre- tions among the q original variables with one or more sented are computational estimations of the dimensions principal components removed. The criterion proposed is of six well-studied DNA microarray datasets using these two novel methods as well as ten previous methods. f = r , 27 () Estimating the statistically significant dimension of a mi ∑∑()j qq − 1 () ij ≠ given matrix is a key step in the mathematical modeling of data, e.g., as the authors note, for data interpretation as where r is the partial correlation between the i-th and j- ij well as for estimating missing data. The question of how best to estimate the dimension of a matrix is still an open th variables. Jackson [7] notes that the logic behind question. This open question is faced in most analyses of Velicer's test is that as long as f is decreasing, the partial DNA microarray data (and other large-scale modern data- correlations are declining faster than the residual vari- sets). The work presented here is not only an extensive ances. This means that the test will terminate when, on the analysis of this open question. It is also the first work, to average, additional principal components would repre- the best of my knowledge, to address this key open ques- sent more variance than covariance. Jolliffe [9] warns that tion in the context of DNA microarray data analysis. I the procedure is plausible for use in a factor analysis, but expect it will have a significant impact on this field of research, and recommend its publication. may underestimate the number of principal components in a PCA. This is because it will not retain principal com- For example, R. Cangelosi and A. Goriely show that, in ponents dominated by a single variable whose correla- estimating the number of eigenvectors which are of statis- tions with other variables are close to zero. tical significance in the PCA analysis of DNA microarray data, the method of cumulative percent of variance A.7 Bartlett's equality of roots test should not be used. Unfortunately, this very method is It has been argued in the literature (see North, [38]) that used in an algorithm which estimates missing DNA eigenvalues that are equal to each other should be treated microarray data by fitting the available data with cumula- as a unit, that is, they should either all be retained or all tive-percent-of-variance- selected eigenvectors [Troyan- discarded. A stopping rule can be formulated where the skaya et al., Bioinformatics 17, 520 (2001)]. This might be last m eigenvalues are tested for equality. Jackson [7] one explanation for the superior performance of other presents a form of a test developed by Bartlett [53] which PCA and SVD-based algorithms for estimating DNA is microarray data [e.g., Kim et al., Bioinformatics 15, 187 (2005)]. ⎡ ⎤ q λ ∑ j jk =+1 2 ⎢ ⎥ χν =− lnλ +−ν qk ln 28 () () ∑ () j In another example, R. Cangelosi and A. Goriely estimate ⎢ ⎥ qk − jk −+1 ⎢ ⎥ that there are two eigenvectors which are of statistical sig- ⎣ ⎦ nificance in the yeast cdc15 cell-cycle dataset of 799 genes where χ has (1/2) (q - k - 1)(q - k - 2) degrees of freedom × 15 time points. Their mathematical estimation is in and v represents the number of degrees of freedom associ- agreement with the previous biological experimental ated with the covariance matrix. [Spellman et al., MBC 9, 3273 (1998)] as well as compu- tational [Holter et al., PNAS 97, 8409 (2000)] interpreta- Authors' contributions tions of this dataset. R.C. and A.G. performed research and wrote the paper Page 17 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 Declaration of competing interests: I declare that I have no them in general accord, if that is the case), it is undesirable competing interests. to report all estimators for any large set of results. The information dimension's property of increasing with John Spouge's review (John Spouge was nominated by noise makes it undesirable as an estimator, and it can not Eugene Koonin) be recommended. The main value of the paper therefore This paper reviews several methods based on principal resides in its useful review and its software tools. component analysis (PCA) for determining the "true" dimensionality of a matrix subject to statistical noise, with Answers to John Spouge's review specific application to microarray data. It also offers two The main point of the reviewer is the suggestion that the new candidates for estimating the dimensionality, called information dimension's undesirable property of increas- "information dimension" and the " modified broken stick ing with noise makes it undesirable as an estimator. We model". analyze the information in detail and indeed reached the conclusion that its prediction increases with noise. In the Section 2.1 nicely summarizes matrix methods for reduc- preprint reviewed by Dr. Spouge, we only considered the ing dimensionality in microarray data. It describes why effect of noise on the information dimension. It is crucial PCA is preferable to a singular value decomposition (a to note that ALL methods are functions of the noise level change in the intensities of microarray data affects the sin- present in the data. In the new and final version of the gular value decomposition, but not PCA). manuscript, we study the effect of noise on two other methods (Jolliffe's modification of he Guttman-Kaiser Section 2.2 analyzes the broken stick model. Section 2.3 rule and LEV). It clearly appears that in one case the esti- explains in intuitive terms the authors' "modified broken mator increases with noise and in the other one, it stick model", but the algorithm became clear to me only decreases with noise (both effects are undesirable and when it was applied to data later in the paper. The broken unavoidable). The message to the practitioner is the same, stick model has the counterintuitive property of determin- understand the signal to noise ratio of the data and act ing dimensionality without regard to the amount of data, accordingly. We conclude that the information dimension implicitly ignoring the ability of increased data to could still be of interest as an estimator. improve signal-to-noise. The modified broken stick model therefore has some intuitive appeal. David Horn and Roy Varshavsky joint review (both reviewers were nominated by O. Alter) Section 2.4 explains the authors' information dimension. This paper discusses an important problem in data analy- The derivation is thorough, but the resulting measure is sis using PCA. The term 'component retention' that the purely heuristic, as the authors point out. In the end, authors use in the title is usually referred to as dimen- despite the theoretical gloss, it is a just formula, without sional truncation or, in more general terms, as data com- any desirable theoretical properties or intuitive interpreta- pression. The problem is to find the desired truncation tion. level to assure optimal results for applications such as clustering, classification or various prediction tasks. The evaluation of the novel measures therefore depends on their empirical performance, found in the Results and The paper contains a very exhaustive review of the history Discussion. Systematic responses to variables irrelevant to of PCA and describes many recipes for truncation pro- the (known) dimensionality of synthetic data become of posed over the 100 years since PCA was introduced. The central interest. In particular, the authors show data that authors propose also one method of their own, based on their information dimension increases systematically with the use of the entropy of correlation eigenvalues. A com- noise, clearly an undesirable property. The authors also parison of all methods is presented in Table 2, including test the dimensionality estimators on real microarray 14 criteria applied to 6 microarray experiments. This table data. They conclude that six dimensionality measures are demonstrates that the results of their proposed 'informa- in rough accord, with three outliers: Bartlett's test, cumu- tion dimension' are very different from those of most lative percent of variation explained, and the information other truncation methods. dimension (which tends to be higher than other estima- tors). They therefore propose the information dimension We appreciate the quality of the review presented in this as an upper bound for the true dimensionality, with a paper, and we recommend that it should be viewed and consensus estimate being derived from the remaining presented as such. But we have quite a few reservations measures. regarding the presentation in general and their novel method in particular. The choice of dimensionality measure is purely empirical. While it is desirable to check all estimators (and report Page 18 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 1. The motivation for dimensional reduction is briefly Answers to Horn's and Varshavsky's review mentioned in the introduction, but this point is not elab- We would like to thank the reviewers for their careful orated later on in the paper. As a result, the paper lacks a reading of our manuscript and their positive criticisms. target function according to which one could measure the We have modified our manuscript and follow most of performance of the various methods displayed in Table 2. their recommendations. We believe one should test methods according to how well they perform, rather than according to consensus. Specifically, we answer each comment by the reviewer: Performance can be measured on data, but only if a per- formance function is defined, e.g. the best Jaccard score 1. The status of the paper as a review or a regular article. achieved for classification of the data within an SVM approach. Clearly many other criteria can be suggested, It is true that the paper contains a comprehensive survey and results may vary from one dataset to another, but this of the literature and many references. Nevertheless, we is the only valid scientific approach to decide on what believe that the article contains sufficiently many new methods should be used. We believe that it is necessary for results to be seen as a regular journal article. Both the the authors to discuss this issue before the paper is modified broken-stick method and the information accepted for publication. dimension are new and important results for the field of cDNA analysis. 2. All truncation methods are heuristic. Also the new sta- tistical method proposed here is heuristic, as the authors 2. The main criticism of the paper is that we did not test admit. An example presented in Table 1 looks nice, and the performance of the different methods against some should be regarded as some justification; however the benchmark. novel method's disagreement with most other methods (in Table 2) raises the suspicion that the performance of To answer this problem we performed extensive bench- the new method, once scrutinized by some performance marking of the different methods against noisy simulated criterion on real data, may be bad. The authors are aware data for which the true signal and its dimension was of this point and they suggest using their method as an known. We have added this analysis in the paper where upper bound criterion, with which to decide if their pro- we provide an explicit example. This example clearly posed 'consensus dimension' makes sense. This, by itself, establishes our previous claim that the information has a very limited advantage. dimension provides a useful upper bound for the true sig- nal dimension (whereas other traditional methods such 3. The abstract does not represent faithfully the paper. The as Velicer's underestimate the true dimension). Upper new method is based on an 'entropy' measure but this is bounds are extremely important in data analysis as they not really Shannon entropy because no probability is provide a reference point with respect to which other involved. It gives the impression that the new method is methods can be compared. based on some 'truth' whereas others are ad-hoc which, in our opinion, is wrong (see item 2 above). We suggest that 3. The abstract does not represent the paper. once this paper is recognized as a review paper the abstract will reflect the broad review work done here. We did modify the abstract to clarify the relationship of information dimension that we propose with respect to 4. Some methods are described in the body of the article other methods (it is also a heuristic approach!). Now, (e.g., broken stick model), while others are moved to the with the added analysis and wording, we believe that the appendix (e.g., portion of total variance). This separation abstract is indeed a faithful representation of the paper. is not clear. Unifying these two sections can contribute to the paper readability. 4. Some methods appear in the appendix. In conclusion, since the authors admit that information Indeed, the methods presented in the appendix are the dimension cannot serve as a stopping criterion for PCA one that we review. Since we present a modification of the compression, this paper should not be regarded as pro- broken-stick method along with a new heuristic tech- moting a useful truncation method. Nevertheless, we nique, we believe it is appropriate to describe the broken- believe that it may be very useful and informative in stick method in the main body of the text while relegating reviewing and describing the existing methods, once the other known approaches (only used for comparison) to modifications mentioned above are made. We believe this the appendix. Keeping in mind that this is a regular article could then serve well the interested mathematical biology rather than a review, we believe it is justified. community. Page 19 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 23. Franklin S, Gibson D, Robertson P, Pohlmann J, Fralish J: Parallel Acknowledgements Analysis: a method for determining significant components. This material is based upon work supported by the National Science Foun- J Vegatat Sci 1995:99-106. dation under Grant No. DMS grant #0307427 (A.G.), DMS-0604704 (A. G.) 24. Zwick W, Velicer W: Comparison of five rules for determining and DMS-DMS-IGMS-0623989 (A.G.) and a grant from the BIO5 Institute. the number of components to retain. Psychol Bull 1986, 99:432-446. We would like to thank Jay Hoying for introducing us to the field of micro- 25. Karr J, Martin T: Random number and principal components: arrays and Joseph Watkins for many interesting discussions. further searches for the unicorn. In The use of multivariate statis- tics in wildlife habitat Volume RM-87. Edited by: Capen D. United Forest References Service General Technical Report; 1981:20-24. 26. Basilevsky A: Statistical Factor Analysis and Related Methods: Theory and 1. Pearson K: On lines and planes of closest fit to systems of Applications New York: Wiley-Interscience; 1994. points in space. Phil Mag 1901, 2:559-572. 27. Rencher A: Multivariate Statistical Inference and Applications New York: 2. Hotelling H: Analysis of a complex statistical variable into John Wiley & Sons, Inc; 1998. principal components. J Educ Psych 1933, 26:417-441. 498–520 28. Tinker N, Robert L, Harris GBL: Data pre-processing issues in 3. Rao C: The use and inter etation of principal component anal- microarray analysis. In A Practical Approach to Microarray Data Anal- ysis in applied research. Sankhya A 1964, 26:329-358. ysis Edited by: Berrar DP, Dubitzky W, Granzow M. Kluwer, Norwell, 4. Gower J: Some distance properties of latent root and vector MA; 2003:47-64. methods used in multivariate analysis. Biometrika 1966, 29. Dubitzky W, Granzow M, Downes C, Berrar D: Introduction to 53:325-338. microarray data analysis. In A Practical Approach to Microarray Data 5. Jeffers J: Two case studies in the application of principal com- Analysis Edited by: Berrar DP, Dubitzky W, Granzow M. Kluwer, Nor- ponent analysis. Appl Statist 1967, 16:225-236. well, MA; 2003:91-109. 6. Preisendorfer R, Mobley C: Principal component analysis in meterology 30. Baxter M: Standardization and transformation in principal and oceanography Amsterdam: Elsevier; 1988. component analysis, with applications to archaeometry. Appl 7. Jackson J: A User's Guide to Principal Components New York: John Wiley Statist 1995, 44:513-527. & Sons; 1991. 31. Bro R, Smilde A: Centering and scaling in component analysis. 8. Arnold G, Collins A: Interpretation of transformed axes in mul- J Chemometrics 2003, 17:16-33. tivariate analysis. Appl Statist 1993, 42:381-400. 32. Wall M, Rechsteiner A, Rocha L: Singular value decomposition 9. Jolliffe I: Principal Component Analysis Springer, New York; 2002. and principal component analysis. In A Practical Approach to 10. Alter O, Brown P, Botstein D: Singular value decomposition for Microarray Data Analysis Edited by: Berrar DP, Dubitzky W, Granzow genome-wide expression data processing and modeling. Proc M. Kluwer, Norwell, MA; 2003:91-109. Natl Acad Sci USA 2000, 97:10101-10106. 33. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and dis- 11. Holter N, Mitra M, Maritan A, Cieplak M, Banavar J, Fedoroff N: Fun- play of genome-wide expression patterns. Proc Natl Acad Sci damental patterns underlying gene expression profiles: sim- USA 1998, 95:14863-14868. plicity from complexity. Proc Natl Acad Sci USA 2000, 97:8409-14. 34. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown 12. Crescenzi M, Giuliani A: The main biological determinants of P, B DB, Futcher : Comprehensive identification of cell cycle- tumor line taxonomy elucidated by a principal component regulated genes of the yeast Saccharomyes cerevisiae by analysis of microarray data. FEBS Letters 2001, 507:114-118. microarray hybridization. Mol Biol Cell 1998, 9:3273-3297. 13. Hsiao L, Dangond F, Yoshida T, Hong R, Jensen R, Misra J, Dillon W, 35. Pielou E: Ecological Diversity New York: John Wiley & Sons; 1975. Lee K, Clark K, Haverty P, Weng Z, Mutter G, Frosch M, Macdonald 36. Macarthur R: On the relative abundance of bird species. Proc M, Milford E, Crum C, Bueno R, Pratt R, Mahadevappa M, Warrington Natl Acad Sci USA 1957, 43:293-295. J, Stephanopoulos G, Stephanopoulos G, Gullans S: A compendium 37. Frontier S: Étude de la décroissance des valeurs propres dans of gene expression in normal human tissues. Physiol Genomics une analyse en composantes principales: comparaison avec 2001, 7:97-104. le modèle du bâton brisé. Biol Ecol 1976, 25:67-75. 14. Misra J, Schmitt W, Hwang D, Hsiao L, Gullans S, Stephanopoulos G, 38. North G, Bell T, Cahalan R, Moeng F: Sampling errors in the esti- Stephanopoulos G: Interactive exploration of microarray gene mation of empirical orthogonal functions. Mon Weather Rev expression patterns in a reduced dimensional space. Genome 1982, 110:699-706. Res 2002, 12:1112-1120. 39. Reza F: An Introduction to Information Theory New York: Dover Publi- 15. Chen L, Goryachev A, Sun J, Kim P, Zhang H, Phillips M, Macgregor P, cations, Inc; 1994. Lebel S, Edwards A, Cao Q, Furuya K: Altered expression of 40. Pierce J: An introduction to information theory: symbols, signals and noise genes involved in hepatic morphogenesis and fibrogenesis New York: Dover Publications, Inc; 1980. are identified by cDNA microarray analysis in biliary atresia. 41. Khinchin A: Mathematical Foundations of Information Theory New York: Hepatology 2003, 38(3):567-576. Dover Publications, Inc; 1957. 16. Mori Y, Selaru F, Sato F, Yin J, Simms L, Xu Y, Olaru A, Deacu E, Wang 42. Shannon C: A mathematical theory of communication. Bell Sys- S, Taylor J, Young J, Leggett B, Jass J, Abraham J, Shibata D, Meltzer S: tem Technical Journal 1948, 27:379-423. 623–656. The impact of microsatellite instability on the molecular 43. Schneider T: Information theory primer with an appendix on phenotype of colorecal tumors. Cancer Research 2003, logarithms. Center for Cancer Research Nanobiology Program (CCRNP), 63:4577-4582. [Online] 2005 [http://www.lecb.ncifcrf.gov/toms/paper/primer]. 17. Jiang H, Dang Y, Chen H, Tao L, Sha Q, Chen J, Tsai C, Zhang S: Joint 44. Chu S, Derisi J, Eisen M, Mulholland J, Bolstein D, Brown P, Herskow- analysis of two micorarray gene expression data sets to itz I: The transcriptional program of sporulation in budding select lung adenocarcinoma marker genes. BMC Bioinformatics yeast. Science 1998, 282:699-705. 2004, 5:81. 45. Iyer V, Eisen M, Ross D, Schuler G, Moore T, Lee J, Trent J, Staudt L, 18. Oleksiak M, Roach J, Crawford D: Natural variation in cardiac J Hudson J, Boguski M: The Transcriptional Program in the metabolism and gene expression in Fundulus Heteroclitus. Response of Human Fibroblasts to Serum. Science 1999, Nature Genetics 2005, 37(1):67-72. 283:83-87. 19. Schena M, Shalon D, Davis R, Brown P: Quantitative Monitoring 46. Ross D, Scherf U, Eisen M, Perou C, Rees C, Spellman P, Iyer V, Jeffrey of Gene Expression Patterns with a Complementary DNA S, Rijn MVD, Waltham M, Pergamenschikov A, Lee J, Lashkari D, Microarray. Science 1995, 270:467-470. Shalon D, Myers T, Weinstein J, Botstein D, Brown P: Systematic 20. Jackson D: Stopping rules in principal components analysis: a variation in gene expression patterns in human cancer cell comparison of heuristical and statistical approaches. Ecology lines. Nat Genet 2000, 24:227-235. 1993, 74(8):2204-2214. 47. Raychaurdhuri S, Stuart J, Altman R: Principal component analysis 21. Ferré L: Selection of components in principal component to summarize microarray experiments: application to analysis: A comparison of methods. Computat Statist Data Anal sporulation time series. Pac Symp Biocomput 2000:455-466. 1995, 19:669-682. 48. Lax P: Linear Algebra New York: Wiley; 1996. 22. Bartkowiak A: How to reveal the dimensionality of the data? 49. Cattell B: The scree test for the number of factors. Multiv Behav Applied Stochastic Models and Data Analysis 1991:55-64. Res 1966, 1:245-276. Page 20 of 21 (page number not for citation purposes) Biology Direct 2007, 2:2 http://www.biology-direct.com/content/2/1/2 50. Farmer S: An investigation into the results of principal compo- nent analysis of data derived from random numbers. Statistian 1971, 20:63-72. 51. Stauffer D, Garton E, Steinhorst R: A comparison of principal components from real and random data. Ecology 1985, 66(6):1693-1698. 52. Velicer W: Determining the number of principal components from the matrix of partial correlations. Psychometrika 1975, 41(3):321-327. 53. Bartlett M: Tests of significance in factor analysis. Brit J Psychol Statist Section 1991, 3:77-85. 54. Guiasu Silviu: Information Theory with Applications. New York: McGraw Hill International Book Company; 1977. Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 21 of 21 (page number not for citation purposes)

Journal

Biology DirectSpringer Journals

Published: Jan 17, 2007

There are no references for this article.