Hindawi Publishing Corporation
Advances in Artificial Neural Systems
Volume 2011, Article ID 617427, 8 pages
doi:10.1155/2011/617427

Research Article

Soft Topographic Maps for Clustering and Classifying Bacteria Using Housekeeping Genes

Massimo La Rosa, Riccardo Rizzo, and Alfonso Urso

ICAR-CNR, Consiglio Nazionale delle Ricerche, Viale delle Scienze, Ed.11, 90128 Palermo, Italy

Correspondence should be addressed to Riccardo Rizzo, ricrizzo@pa.icar.cnr.it

Received 11 May 2011; Revised 13 July 2011; Accepted 26 July 2011

Academic Editor: Tomasz G. Smolinski

Copyright © 2011 Massimo La Rosa et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Self-Organizing Map (SOM) algorithm is widely used for building topographic maps of data represented in a vector space, but it does not operate on dissimilarity data. The Soft Topographic Map (STM) algorithm is an extension of SOM to arbitrary distance measures; it creates a map using a set of units, organized in a rectangular lattice, that defines the neighbourhood relationships among the data. In recent years, a new standard for identifying bacteria using genotypic information has begun to be developed. In this approach, the phylogenetic relationships of bacteria can be determined by comparing a stable part of the bacterial genetic code, the so-called “housekeeping genes.” The goal of this work is to build a topographic representation of bacteria clusters, by means of self-organizing maps, starting from genotypic features of housekeeping genes.

1. Introduction

Microbial identification is a fundamental topic in the study of infectious diseases, and new approaches to the analysis of bacterial isolates for identification purposes are currently under development. The classical method for identifying bacterial isolates is based on comparing their morphologic and phenotypic characteristics with those described for type or typical strains. Recent trends, on the other hand, focus on the analysis of the bacterial genotype, taking into account the “housekeeping” genes, which represent a very stable part of the DNA. One of the most widely used genes, which many studies have shown to be especially suitable for taxonomic and identification goals (see Section 2), is the 16S rRNA gene. Employing genotypic features makes it possible to obtain a classification for rare or poorly described bacteria, to classify organisms with an unusual phenotype in a well-defined taxon, and to find misclassifications that can lead to the discovery and description of new pathogens.

In this work, we present a method to build a topographic representation of bacteria clusters and to visualize the relations among them. This topographic map is obtained by considering a single gene of the bacterial genome, the 16S rRNA gene. Since no reliable and well-structured vector space is available to represent nucleotide sequences, the information provided by the gene sequences takes the form of a pairwise dissimilarity matrix. We computed this matrix in terms of string distances by means of well-understood and theoretically sound techniques commonly used in genomics and, in order to produce the topographic representation, we adopted a modified version of the Self-Organizing Map that is able to work with input datasets expressed in terms of dissimilarities.

2. Background

The job of putting scientific names to microbial isolates, namely bacteria identification, is part of the practice of clinical microbiology. The aim is to give insight into the etiological agent causing an infectious disease, in order to find a possibly effective antimicrobial therapy. The traditional method for performing this task depends on comparing an accurate morphologic and phenotypic description of type strains or typical strains with an equally accurate description of the isolate to be identified. Microbiologists used standard references such as Bergey’s Manual of Systematic Bacteriology [1]. In the 1980s, a new standard for identifying bacteria began to be developed by Woese et al. [2]. It was shown that the phylogenetic relationships of bacteria could be determined with genotypic methods, by comparing a stable part of the genetic code. The identification of bacteria based on genotypic methods is generally more accurate than the traditional identification based on phenotypic characteristics. The preferred genetic technique that has emerged is based on the comparison of the bacterial 16S rRNA gene sequence, and, in recent years, several attempts to reorganize the current bacterial taxonomy have been carried out by adopting 16S rRNA gene sequences.

The authors of [1] focused on the study of bacteria belonging to the prokaryotic phyla and applied Principal Component Analysis [3] to matrices of evolutionary distances. The authors of [4–6] carried out an analysis of 16S rRNA gene sequences to classify bacteria with an atypical phenotype: they proposed that two bacterial isolates would belong to different species if the dissimilarity between their 16S rRNA gene sequences was more than 1% and less than 3%. Clustering approaches for DNA sequences were carried out in [7, 8]: the authors considered human endogenous retrovirus sequences and a distance matrix based on the FASTA similarity scores [9]; they then adopted the Median SOM [10], an extension of the Self-Organizing Map (SOM) [11] to nonvectorial data. As those authors note, the Median SOM converges better if the patterns are roughly ordered. This is not an issue for the Soft Topographic Map and the deterministic annealing approach. The Median SOM was also used to cluster protein sequences from the SWISS-PROT database in [12]. The authors of [13] proposed a protein sequence clustering method based on the OPTICS algorithm [14]. A technique for finding functional genomic clusters in RNA expression data, by computing the entropy of gene expression patterns and the mutual information between the RNA expression patterns of each pair of genes, is described in [15]. INPARANOID [16] is another related approach that performs a clustering based on BLAST [17] scores to find orthologs and in-paralogs in two species. Maps were also used to organize biological data in [18], where a map of Gammaproteobacteria is reported; that map is based on a reorganization of the dissimilarity matrix, and some of its results can be obtained with the approach proposed in this work.

Topological representations are not restricted to biological data, however; they can also be adopted, for instance, with video and audio data [19].

The proposed work is an extended version of our preliminary results presented in [20].

3. Methods

3.1. Sequence Alignment and Evolutionary Distance. Sequence alignment is a well-known bioinformatics technique used to compare genomic sequences, even of different lengths, between two different species. In our system, we used two of the most popular alignment algorithms: ClustalW [21], which performs a multiple alignment among all sequences at the same time, and Needleman and Wunsch [22], which provides a pairwise alignment, that is, the best alignment configuration between two sequences.

Once two homologous sequences are aligned, it is possible to compute a distance between them. In the bioinformatics domain there are many types of distance, usually called “evolutionary distances”; they differ from each other in their a priori assumptions.

The simplest kind of distance is the number of substitutions per site, defined as

$$p = \frac{\text{number of different nucleotides}}{\text{total number of compared nucleotides}}. \tag{1}$$

The number of substitutions observed is often smaller than the number of substitutions that have actually taken place. This is due to several genetic phenomena such as multiple substitutions on the same site (multiple hits), convergent substitutions, or retromutations. For these reasons, a series of stochastic methods has been introduced in order to obtain a better estimate of evolutionary distances. In our study, we considered the method proposed in [23], whose a priori assumptions are

(1) all sites evolve in an independent manner;
(2) all sites can change with the same probability;
(3) all kinds of substitution are equally probable;
(4) substitution speed is constant over time.

According to [23], the evolutionary distance d between two nucleotide sequences is equal to

$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,p\right), \tag{2}$$

where p is the number of substitutions per site (1).
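Equations (1) and (2) can be illustrated with a minimal Python sketch. The function names, the two short sequences, and the decision to skip alignment columns containing a gap are assumptions made here for illustration; the text does not say how gaps were treated.

```python
import math


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences, as in (1).

    Columns where either sequence has a gap are skipped; this is an
    assumption of the sketch, not a choice documented in the text.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    different = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            different += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return different / compared


def jukes_cantor(p: float) -> float:
    """Evolutionary distance d = -(3/4) ln(1 - (4/3) p), as in (2)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Hypothetical aligned fragments, for illustration only.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")  # 2 differing sites out of 10
d = jukes_cantor(p)
print(f"p = {p:.3f}, d = {d:.4f}")
```

The corrected distance is always at least as large as p, which reflects the point made above that observed substitutions underestimate the substitutions that actually occurred.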
3.2. Soft Topographic Map Algorithm. A widely used algorithm for topographic maps is Kohonen’s Self-Organizing Map (SOM) algorithm [11], but it does not operate on dissimilarity data. The SOM network builds a projection from an input space to a lattice (usually 2D) of neurons, visualized as a 2D map. Each neuron is a pointer to a position in the input space and is a tile on the map. The input patterns are distributed over the map because each is associated with the nearest neuron in the input space: that neuron is usually referred to as the best-matching unit (bmu). SOM networks are trained using the unsupervised learning paradigm: the labels of the input patterns, if present, are not considered during the training phase.

The SOM is widely used to project input data into a low-dimensional space [24, 25].

Many studies on the SOM algorithm have been carried out since the original paper: according to Luttrell’s work [26], the generation of topographic maps can be interpreted as an optimization problem based on the minimization of a cost function. This cost function represents an energy function, and it takes its minimum when each data point is mapped to the best-matching neuron, thus providing the optimal set of parameters for the map.

An algorithm based on this formulation of the problem was developed by Graepel et al. [27, 28] and provides an extension of SOM to arbitrary distance measures. This algorithm is called Soft Topographic Map (STM) and creates a map using a set of units (neurons or models) organized in a rectangular lattice that defines their neighborhood relationships. STM is able to work with data whose features are expressed in terms of mutual dissimilarity measures. A full description of the algorithm, along with theoretical and practical details, can be found in [27].
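The best-matching-unit assignment described above for the standard, vector-based SOM can be sketched as follows. This is only an illustration of that single step, not of the STM algorithm of [27], which works on dissimilarities and is not detailed in the text; the lattice, the weight vectors, and the patterns are hypothetical.

```python
import math


def best_matching_unit(pattern, codebook):
    """Return the lattice position of the neuron nearest to `pattern`.

    `codebook` maps lattice coordinates (row, col) to weight vectors that
    live in the same input space as the patterns.
    """
    return min(codebook, key=lambda unit: math.dist(pattern, codebook[unit]))


# Hypothetical 2 x 2 lattice with two-dimensional weight vectors.
codebook = {
    (0, 0): (0.0, 0.0),
    (0, 1): (0.0, 1.0),
    (1, 0): (1.0, 0.0),
    (1, 1): (1.0, 1.0),
}
patterns = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

# Each pattern is placed on the map tile of its best-matching unit.
assignment = {p: best_matching_unit(p, codebook) for p in patterns}
print(assignment)
```

The sketch needs coordinates for both patterns and neurons, which is exactly what is missing when only a dissimilarity matrix is available.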
4. Implementation

The Soft Topographic Map algorithm described in this paper needs some tuning; in this section, we give the information necessary for a fruitful use of the algorithm.

4.1. Dataset. The main purpose of our work is to demonstrate that the STM algorithm can be applied to a biological dataset in order to obtain a topographic map useful for visualizing clusters of bacteria belonging to the same order, according to the current taxonomy. The biological dataset is composed of 16S rRNA gene sequences. Each sequence is a text string containing only four types of characters, “A,” “C,” “G,” and “T,” corresponding to the four DNA nucleotides. The information content of the dataset is expressed in terms of a dissimilarity measure, computed according to (2). Following the current taxonomy [1], we focused our attention on a class containing some of the most common and dangerous bacteria related to human pathologies: Gammaproteobacteria, belonging to the Proteobacteria phylum. Table 1 gives a brief description of the experimental dataset: it is composed of 147 type strains, and the corresponding 16S gene sequences were downloaded from the NCBI public nucleotide database, GenBank [29], in FASTA format [30].
4.2. Parameters Setup. A Soft Topographic Map is an array of many neural units, and the patterns to classify are associated with these units at the end of the training phase. In order to reduce processing time, we applied a slightly tuned version of the Soft Topographic Map algorithm: the neighborhood function associated with each neuron is set to zero for neurons outside a previously chosen radius in the grid. The radius was set to 1/3 of the map dimension. As for the other parameters of the algorithm, we set the annealing increasing factor η = 1.1 and the convergence threshold to 10^−5, as suggested in [27]. After several tests we chose, as a good compromise between processing time and clustering quality, a final value of the inverse temperature equal to 10 times the initial value, which leads to 25 learning epochs; finally, we set the width of the neighborhood functions σ to 0.5.

The maps have been drawn using a gray-level scale to represent the distances between the units: the color between two neighboring occupied cells, both horizontally and vertically, is proportional to the average distance between the patterns in those neurons. To be more precise, empty cells are filled with a gray level proportional to the mean distance among the four closest occupied neurons along the vertical and horizontal axes through that empty cell. The gray scale is calibrated so that bright values denote proximity and dark values represent distance. Two sample maps and the distance scale are shown in Figure 1.
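The figure of 25 learning epochs is consistent with a schedule in which the inverse temperature is multiplied by η once per epoch until it reaches 10 times its initial value. That multiplicative schedule, the Gaussian shape of the neighborhood function, and the Euclidean lattice distance are assumptions of the sketch below, as are all function names; the text gives only η, σ, the temperature ratio, and the cut-off radius.

```python
import math


def annealing_epochs(eta: float, final_over_initial: float) -> int:
    """Smallest number of multiplications by `eta` that reaches the target ratio."""
    return math.ceil(math.log(final_over_initial) / math.log(eta))


def truncated_neighborhood(unit_a, unit_b, sigma: float, radius: float) -> float:
    """Unnormalized Gaussian neighborhood on the lattice, zero beyond `radius`."""
    dist = math.dist(unit_a, unit_b)
    if dist > radius:
        return 0.0  # truncation: distant units are ignored to save time
    return math.exp(-(dist ** 2) / (2.0 * sigma ** 2))


epochs = annealing_epochs(eta=1.1, final_over_initial=10.0)

radius = 14 / 3  # one third of the side of a 14 x 14 map
h_near = truncated_neighborhood((0, 0), (0, 1), sigma=0.5, radius=radius)
h_far = truncated_neighborhood((0, 0), (0, 6), sigma=0.5, radius=radius)
print(epochs, h_near, h_far)
```

Since 1.1^24 ≈ 9.85 and 1.1^25 ≈ 10.83, the factor of 10 is first exceeded at the 25th step.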
Table 1: Current taxonomy of the bacteria dataset. We focused on the Gammaproteobacteria class, which is divided into 14 orders. Each order has one or more families. Inside each family, we considered only the type strains, that is, sample species.

| Order name (Gammaproteobacteria) | Number of families | Number of type strains | Code numbers |
|---|---|---|---|
| Chromatiales | 3 | 25 | 1–25 |
| Acidithiobacillales | 2 | 2 | 26, 27 |
| Xanthomonadales | 1 | 11 | 28–38 |
| Cardiobacteriales | 1 | 3 | 39, 40, 41 |
| Thiotrichales | 3 | 11 | 42–52 |
| Legionellales | 2 | 2 | 53, 54 |
| Methylococcales | 1 | 7 | 55–61 |
| Oceanospirillales | 4 | 11 | 62–72 |
| Pseudomonadales | 2 | 7 | 73–79 |
| Alteromonadales | 1 | 13 | 80–92 |
| Vibrionales | 1 | 3 | 93, 94, 95 |
| Aeromonadales | 2 | 7 | 96–102 |
| Enterobacteriales | 1 | 39 | 103–141 |
| Pasteurellales | 1 | 6 | 142–147 |
| Total: 14 orders | 25 | 147 | |

Figure 1: 12 × 12 (left, panel (a)) and 20 × 20 (right, panel (b)) topographic maps of the bacteria dataset. The legend under each map shows the dissimilarity value corresponding to each gray level (0, 0.03, 0.06, 0.09, 0.13, 0.16, 0.19, 0.24, 0.26, 0.29, 0.33, 1). The maximum distance in our dataset is 0.33, so this is the darkest gray level available.

4.3. Map Evaluation Criteria. In order to select the map dimension, it is useful to evaluate how the clustering process evolves with map size. To this end, the number of neural units can be compared with the number of patterns by defining the following ratio:

$$K = \frac{\text{number of patterns to classify}}{\text{number of neural units}}. \tag{3}$$

If K ≥ 1, then each neural unit can have many input patterns, so that each neural unit can be considered as a cluster. In this case, the focus is on the use of all neural units, and the ones that are not used are often referred to as “dead units”.

If K < 1, then a single neural unit cannot be a cluster center, and a cluster is made up of many neural units separated from the others by a set of dead units. Maps with K ≥ 1 are sometimes called KNN-SOM, while those with K < 1 are visualized with a technique called U-Matrix [31].

In our implementation, we started with a ratio K ≈ 2 (using a square 8 × 8 STM map) in order to understand whether the neural units could be identified as clusters. It was difficult to find this correspondence because of the high number of units associated with sequences of different orders. This is highlighted in the center diagram of Figure 2, which shows the number of mixed clusters (i.e., units whose associated sequences belong to different orders).

Neural network results usually depend on the initial weight values. A common procedure to filter out this noise is to train many networks with different initial setups. In our experiments, we used 20 different network initializations for each experiment.
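As a small worked example of (3), the ratio K can be tabulated for square maps over the 147 type strains of Table 1. The function name and the list of map sides are choices made for this sketch.

```python
def pattern_unit_ratio(n_patterns: int, rows: int, cols: int) -> float:
    """Ratio K between the number of patterns and the number of neural units, as in (3)."""
    return n_patterns / (rows * cols)


N_PATTERNS = 147  # type strains in the dataset (Table 1)
sides = [8, 10, 12, 14, 16, 18, 20, 30, 40]

ratios = {side: pattern_unit_ratio(N_PATTERNS, side, side) for side in sides}

for side, k in ratios.items():
    regime = "units act as clusters" if k >= 1 else "clusters span several units"
    print(f"{side:2d} x {side:<2d}  K = {k:.2f}  ({regime})")
```

For the 8 × 8 map this gives K ≈ 2.3, the starting point described above, and K falls below 1 from the 14 × 14 map onward.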
Several methods for evaluating the quality of the mapping are reported in the literature, but they need a metric space (a vector space) where the patterns and the units of the map are represented as vectors. For a short review on topology preservation, see [32]. In our problem, we do not have a feature space where the patterns are placed; we have only a dissimilarity matrix that describes the organization of the patterns.

In order to establish an evaluation criterion for the obtained maps, we noticed that the rows and the columns of the map represent a linear ordering of the patterns, an ordering that should also be present in the dissimilarity matrix. For example, selecting a row of the map gives a set of ordered patterns; the same patterns are used to select the corresponding subset of rows and columns of the dissimilarity matrix. These dissimilarity values can be considered as distance values and make it possible to order the patterns in a linear fashion. A pattern sequence can easily be obtained by applying the Sammon mapping technique on a linear space to the data of the dissimilarity matrix. This sequence should be identical to the one obtained from the map; the two sequences can be compared using Spearman’s rank correlation coefficient [33], defined as

$$\rho = 1 - \left[\frac{6\sum d^{2}}{n\left(n^{2}-1\right)}\right], \tag{4}$$

where d is the difference between the ranks of corresponding values of the compared variables x and y, and n is the number of pairs of values. In the above equation, we consider only the term in the square brackets because we discard the possible inversion between the pattern sequence of the map and the one of the dissimilarity matrix. Averaging the Spearman coefficients over all columns and rows, we obtain a score for a given map. All these scores, calculated for each map geometry and each initialization, are reported in the upper diagram of Figure 2 as a box plot.

By evaluating this coefficient, we can decide which geometry to use. Maps with few neural units are discarded, because some units hold many patterns, which creates ties in the ordering (patterns associated with the same unit do not have any order), while very large maps present a naturally decreasing value because the patterns are very sparse. This effect can be seen in Figure 2 to the right of the thin vertical line.
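The comparison of two orderings through (4) can be sketched as follows. The helper names and the two example orderings are hypothetical, ties are not handled, and the function returns both the bracketed term and ρ because the text uses the bracketed term as the score; how the authors treated ties and inversions in practice is not specified.

```python
def ranks(values):
    """Ranks 1..n of `values` in ascending order (ties are not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_terms(x, y):
    """Return (bracketed term of (4), rho) for two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    bracket = 6 * sum_d2 / (n * (n * n - 1))
    return bracket, 1 - bracket


# Hypothetical positions of five patterns along one map row, and the
# positions of the same patterns in the ordering derived from the
# dissimilarity matrix.
map_position = [1, 2, 3, 4, 5]
matrix_position = [1, 3, 2, 4, 5]

bracket, rho = spearman_terms(map_position, matrix_position)
print(f"bracketed term = {bracket:.3f}, rho = {rho:.3f}")
```

Identical orderings give a bracketed term of 0, so under this reading lower scores indicate better agreement between the map and the dissimilarity matrix.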
5. Results

In this section, we present the results we obtained by applying the techniques described in Section 3 to the bacteria dataset described in Section 4.

Given the dataset described above, we carried out several experiments. We obtained maps of different dimensions, from 8 × 8 up to 45 × 45 neurons, and for every configuration we trained 20 maps in order to avoid dependence on the initial conditions.

Comparing the results obtained from pairwise and multiple alignment, we saw no meaningful differences in the corresponding maps, so we focused only on the evolutionary distances computed from pairwise alignment.

In Figure 1, we can see how the clustering process evolves with map size: first of all, most of the bacteria are classified according to their order in the current taxonomy; then, the number of bacteria belonging to mixed clusters, that is, cells in the map labeled with bacteria of different orders, tends to decrease as the size of the maps increases. We can state that small maps, up to about 10 × 10 according to the chart, do not provide useful results because there are too few available neurons, and consequently the maps are not able to discriminate correctly among different patterns. Indeed, the charts of Figure 2 show too many mixed clusters and high values of the Spearman coefficient for these maps. On the other hand, we noticed that in very large maps (not shown in this paper), from 25 × 25 onward, the topographic maps “lose” their clustering properties because the input patterns tend to spread all over the grid, filling all the available space. Considering the definition of the parameter K given in (3), maps with K < 1 and K ≫ 1 are meaningless.

Figure 2: The upper graph is the box plot of the Spearman coefficient. The center chart shows that the number of bacteria belonging to mixed clusters, that is, cells in the map labeled with bacteria of different orders, tends to decrease as the size of the maps increases. The lower graph shows the processing time in minutes (logarithmic scale). Horizontal axis: map size (8 × 8, 10 × 10, 12 × 12, 14 × 14, 16 × 16, 18 × 18, 20 × 20, 30 × 30, 40 × 40) with the corresponding K value (2.3, 1.47, 1.02, 0.75, 0.57, 0.45, 0.36, 0.16, 0.09).

The map size and the optimal value of the K parameter also depend on the method used to produce the dissimilarity matrix. For example, using the Normalized Compression Distance (NCD) [34], the optimum size of the map can be different, as stated in one of our previous works on this topic [35].

One of the most interesting results is that some anomalies are constant across all the tests, regardless of the dimension of the maps. For example, in small maps (not shown here), the “Alterococcus agarolyticus” bacterium (number 103 in Figure 3) of the “Enterobacteriales” order is incorrectly clustered together with bacteria of other orders, whereas in larger maps it is isolated in an individual cluster, usually at the border of the map and far from its homologous strains (see Figure 3). Another interesting example is given by the “Legionella pneumophila” bacterium (number 54 in Figure 3) of the “Legionellales” order: in all maps, it is located in a corner of the grid and surrounded by a dark gray area. This would suggest that these two bacteria could form new orders, not present in the current taxonomy, or at least new families. The same anomalies are confirmed by Multidimensional Scaling and by the evolutionary tree.

Since the maps provide a visualization of bacteria datasets, any “anomalies” are clearly highlighted as isolated elements standing at the border or in the corners of the map. These anomalies can suggest further experimental trials to biologists, in order to determine whether there are misclassifications in the taxonomy. This does not mean that the proposed method should mainly be used to identify or annotate unknown bacterial species, but rather that the visualization is also able to detect anomalies and, if there are unknown elements, to project them onto the map, thanks to the unsupervised learning feature of the STM algorithm (see Section 3).

The organization of the bacteria in the map finds further confirmation in [18], for example, in the neighborhood of Xanthomonas (33 in Figure 3), Pseudomonas (73), and Enterobacteriales (114); notice also that Buchnera (106) is not in the same compact group as the other Enterobacteriales in the map center, although it is not as distant as depicted in [18].

Considering the evaluation of the maps reported in Figure 2, we chose from the set of 14 × 14 maps the one that presents the absolute minimum of the Spearman coefficient before its natural decrease on the right side of the thin vertical line. This choice also minimizes the number of mixed clusters, as can be seen in the center diagram of Figure 2.
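The notion of mixed clusters used in Figure 2 can be made concrete with a short sketch. The data structures, the function name, and the toy assignment below are hypothetical; only the definition (a unit whose patterns belong to more than one order) and the code numbers of Table 1 come from the text.

```python
from collections import defaultdict


def mixed_cluster_stats(unit_of, order_of):
    """Count mixed units and the bacteria that fall into them.

    `unit_of` maps a pattern code number to its map cell (row, col);
    `order_of` maps the same code number to its taxonomic order.
    """
    members = defaultdict(list)
    for pattern, unit in unit_of.items():
        members[unit].append(pattern)
    mixed_units = [
        unit for unit, patterns in members.items()
        if len({order_of[p] for p in patterns}) > 1
    ]
    bacteria_in_mixed = sum(len(members[unit]) for unit in mixed_units)
    return len(mixed_units), bacteria_in_mixed


# Toy assignment, for illustration only (code numbers follow Table 1).
order_of = {
    1: "Chromatiales", 2: "Chromatiales",
    26: "Acidithiobacillales",
    103: "Enterobacteriales", 104: "Enterobacteriales",
}
unit_of = {1: (0, 0), 2: (0, 0), 26: (0, 0), 103: (3, 3), 104: (3, 3)}

n_mixed_units, n_bacteria = mixed_cluster_stats(unit_of, order_of)
print(n_mixed_units, n_bacteria)
```

In this toy case one cell mixes two orders and holds three bacteria, while the cell containing only Enterobacteriales is not counted.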
5.1. Comparison with Phylogenetic Tree. We compared the chosen 14 × 14 map with the phylogenetic tree of our dataset. In Figure 3, four outlier bacteria can be noticed: “Francisella tularensis” (45), “Legionella pneumophila” (54), “Alterococcus agarolyticus” (103), and “Buchnera aphidicola” (106). The first three bacteria are positioned on the border of the map and far from their homologous strains; the remaining one lies in a single cell surrounded by a dark gray area, which indicates that its actual distance from its neighbors is bigger than it appears.

Figure 3: Comparison between the phylogenetic tree (a) and the selected 14 × 14 map (b). Four outlier bacteria can be noticed: “Francisella tularensis,” “Alterococcus agarolyticus,” “Legionella pneumophila,” and “Buchnera aphidicola.” The first three bacteria are clustered on the border of the map and far from their homologous strains; the remaining one lies in a single cell surrounded by a dark gray area, which indicates that its actual distance from its neighbors is bigger than the one shown in the map. Other bacteria are far from their homologous strains: “Schineria larvae,” “Arhodomonas aquaeolei,” “Halothiobacillus neapolitanus,” and “Nitrosococcus nitrosus.” “Enterobacteriales” and “Pasteurellales” form compact groups in both representations.

Apart from these four elements, in the phylogenetic tree we found other bacteria far from their order, for instance, “Schineria larvae” (34), “Halothiobacillus neapolitanus” (8), “Nitrosococcus nitrosus” (12), and “Arhodomonas aquaeolei” (3): once again, these elements are at the border of the map or in a zone surrounded by a dark gray level. Although “Schineria larvae” (34) and “Halothiobacillus neapolitanus” (8) are coupled in the dendrogram, the map shows that they are actually far apart: this happens because some pairings in the tree are forced and, in this case, do not give useful information. “Schineria larvae” (34) and “Francisella tularensis” (45), whose actual distance is 0.1339, are close in the map, but the surrounding gray level reveals their real distance, as can also be seen in the phylogenetic tree.

Entire orders such as “Enterobacteriales” and “Pasteurellales” form compact groups both in the tree and in the map. Moreover, the “Methylococcales” order, which has one family in the current taxonomy, is divided into two clusters in the map (56, 57 and 55, 58, 59, 60, 61), as reported in the phylogenetic tree and in [18].

Our visualization method therefore makes it possible not only to detect some singular situations, but also to understand their positions relative to all the patterns in the dataset. At first glance, the phylogenetic tree could lead one to conclude, wrongly, that the four outliers described above are far from all the other bacteria but near each other. The map, instead, shows that the four outliers are completely isolated. At the same time, our method provides a very simple way to visualize compact orders and/or families immediately, as previously explained.

This is clear from Figure 3, where the map and the tree contain the same objects but the map is far more readable.
5.2. Comparison with Multidimensional Scaling. Multidimensional Scaling (MDS) is a widely used technique for embedding a dataset, defined only in terms of pairwise distances, in a Euclidean space and plotting it in a 2D (or 3D) plane [36]. For this reason, we compared our two-dimensional topographic representation with a 2D plot of our bacteria dataset obtained through MDS, presented in Figure 4.

First of all, the four outlier bacteria, “Francisella tularensis,” “Buchnera aphidicola,” “Alterococcus agarolyticus,” and “Legionella pneumophila,” are separated from all the other elements. Apart from this evident result, there are not many other similarities with our map or with the phylogenetic tree. Bacteria belonging to the “Pasteurellales” order, for example, which form a well-defined group in the previous visualizations, lie in very distant zones of the MDS plot without any observable relationship. There are some dislocated elements even inside “Enterobacteriales,” though most of them still form a compact group in the central part of the diagram. Moreover, it is difficult to get an idea of the distance among the patterns.

Figure 4: 2D representation of the bacteria dataset obtained through Multidimensional Scaling. Labels in the plot: Enterobacteriales, Enterobacteriales outliers, Francisella tularensis no. 45, Buchnera aphidicola no. 106, Alterococcus agarolyticus no. 103, Legionella pneumophila no. 54. Apart from the four outliers, “Legionella pneumophila,” “Alterococcus agarolyticus,” “Francisella tularensis,” and “Buchnera aphidicola,” which are separated from all the other elements, the remaining bacteria do not show meaningful similarities with the visualizations obtained through the topographic map and the phylogenetic tree.

In conclusion, the MDS plot gives less information than the topographic map and the phylogenetic tree. Because of the distortion introduced by MDS, most of the patterns, with the exception of the four outliers, have lost the distinctive properties discussed in the previous paragraph.
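For readers unfamiliar with MDS, the following is a minimal sketch of the classical (Torgerson) variant, which embeds points from a distance matrix by double centering the squared distances and taking the leading eigenvectors. The text does not say which MDS variant or software was used, so this choice, the power-iteration eigensolver, and every name below are assumptions; the four-point example is hypothetical.

```python
import math
import random


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Embed n points in `dims` dimensions from a symmetric distance matrix.

    Uses double centering, B = -1/2 * J * D^2 * J, followed by power
    iteration with deflation to extract the leading eigenpairs of B.
    Non-positive eigenvalues are skipped, so non-Euclidean input may
    yield fewer informative coordinates.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if eigenvalue <= 1e-12:
            break
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflation: remove the component just found from B.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Hypothetical example: corners of a 3 x 4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
```

When the input distances are Euclidean and two-dimensional, as in this example, the embedding reproduces them up to rotation and reflection; evolutionary distances generally are not, which is one source of the distortion mentioned above.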
6. Conclusion

In recent trends in the definition of bacterial taxonomy, genotypic characteristics are considered very important, and type strains are compared on the basis of the stable part of the genetic code. In this paper, the Soft Topographic Map algorithm has been applied to the visualization and clustering of bacteria according to their genotypic similarity. For the similarity measure, we adopted the 16S rRNA gene sequence, as is common for taxonomic purposes.

A characteristic of the proposed approach is that the topographic map is built from the genetic data using the Soft Topographic Map algorithm working on proximity data, rather than using a vector space representation. The generated maps show that the proposed approach provides a clustering that generally reflects the current taxonomy, with some singular cases. Moreover, the results depend on the size of the maps, since maps that are small or large with regard to the number of input patterns do not give meaningful information. The size of the maps should be chosen so that the ratio between input elements and neurons is K ≈ 1, with a corresponding value of the Spearman coefficient representing a local minimum.

The visualization of the bacteria dataset through the map also allows an easy identification of cases that represent “anomalies” in the input dataset. These anomalies should be investigated further because they could represent an incorrect classification or an incorrect registration in the database. The map also provides a compact representation, in one image, that is useful for visualizing bacteria clusters and their mutual separation, although the evaluation of the distance between clusters is still inaccurate. Furthermore, our system has proved to be a valid alternative to the traditional visualization tools used in bioinformatics, such as phylogenetic trees and 2D plots obtained through MDS.

In future research activities, we intend to extend the analysis to other “housekeeping” genes and to combine different genotypic characteristics in order to obtain a finer clustering and classification. We would also like to use other distance measures, possibly alignment-free, and different clustering algorithms in order to improve the execution time and the quality of the clustering.
References

[1] G. M. Garrity, B. A. Julia, and T. Lilburn, “The revised road map to the manual,” in Bergey’s Manual of Systematic Bacteriology, G. M. Garrity, Ed., pp. 159–187, Springer, New York, NY, USA, 204.
[2] C. R. Woese, E. Stackebrandt, T. J. Macke, and G. E. Fox, “A phylogenetic definition of the major eubacterial taxa,” Systematic and Applied Microbiology, vol. 6, no. 2, pp. 143–151,
[3] I. T. Joliffe, Principal Component Analysis, Springer, New York, NY, USA, 1986.
[4] J. E. Clarridge, “Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases,” Clinical Microbiology Reviews, vol. 17, no. 4, pp. 840–862, 2004.
[5] M. Drancourt, C. Bollet, A. Carlioz, R. Martelin, J.-P. Gayral, and D. Raoult, “16S ribosomal DNA sequence analysis of a large collection of environmental and clinical unidentifiable bacterial isolates,” Journal of Clinical Microbiology, vol. 38, pp. 3623–3630, 2000.
[6] M. Drancourt, P. Berger, and D. Raoult, “Systematic 16S rRNA gene sequencing of atypical clinical isolates identified 27 new bacterial species associated with humans,” Journal of Clinical Microbiology, vol. 42, no. 5, pp. 2197–2202, 2004.
[7] M. Oja, P. Somervuo, S. Kaski, and T. Kohonen, “Clustering of human endogenous retrovirus sequences with median self-organizing map,” in Proceedings of the Workshop on Self-Organizing Maps (WSOM ’03), 2003.
[8] M. Oja, G. O. Sperber, J. Blomberg, and S. Kaski, “Self-organizing map-based discovery and visualization of human endogenous retroviral sequence groups,” International Journal of Neural Systems, vol. 15, no. 3, Article ID 163179, 2005.
[9] W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proceedings of the National Academy of Sciences of the United States of America, vol. 85, no. 8, pp. 2444–2448, 1988.
[10] T. Kohonen and P. Somervuo, “How to make large self-organizing maps for nonvectorial data,” Neural Networks, vol. 15, no. 8-9, pp. 945–952, 2002.
[11] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany,
[12] P. Somervuo and T. Kohonen, “Clustering and visualization of large protein sequence databases by means of an extension of the self-organizing map,” in Proceedings of the 3rd International Conference on Discovery Science, pp. 76–85, 2000.
[13] Y. Chen, K. D. Reilly, A. P. Sprague, and Z. Guan, “Seqoptics: a protein sequence clustering method,” in Proceedings of the 1st International Multi-Symposiums on Computer and Computational Sciences (IMSCCS’06), vol. 1, pp. 69–75, June 2006.
[14] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, “Optics: ordering points to identify the clustering structure,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp.
Analysis VII, vol. 4723 of Lecture Notes in Computer Science, pp. 332–343, 2007.
[21] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994.
[22] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[23] T. H. Jukes and C. R. Cantor, “Evolution of protein molecules,” in Mammalian Protein Metabolism, H. N. Munro, Ed., pp. 21–132, Academic Press, New York, NY, USA, 1969.
[24] A. N. Gorban and A. Zinovyev, “Principal manifolds and graphs in practice: from molecular biology to dynamical systems,” International Journal of Neural Systems, vol. 20, no. 3, pp. 219–232, 2010.
[25] W. Barbakh and C. Fyfe, “Online clustering algorithms,” International Journal of Neural Systems, vol. 18, no. 3, pp. 185–194, 2008.
[26] S. P. Luttrell, “A Bayesian analysis of self-organizing maps,” Neural Computation, vol. 6, pp. 767–794, 1994.
[27] T. Graepel, M. Burger, and K. Obermayer, “Self-organizing maps: generalizations and new optimization techniques,” Neurocomputing, vol. 21, no. 1–3, pp. 173–190, 1998.
[28] T. Graepel and K. Obermayer, “A stochastic self-organizing map for proximity data,” Neural Computation, vol. 11, no. 1, pp. 139–155, 1999.
[29] GenBank, 2007, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide.
[30] Fasta, 2007, http://www.ncbi.nlm.nih.gov/blast/fasta.shtml.
[31] A. Ultsch, “Maps for the visualization of high dimensional data spaces,” in Proceedings of the Workshop on Self-Organizing Maps (WSOM ’03), vol. 3, pp. 225–230, 2003.
[32] D. Vidaurre and J. Muruzábal, “A quick assessment of topology preservation for SOM structures,” IEEE Transactions on Neural Networks, vol. 18, no. 5, pp. 1524–1528, 2007.
[33] E. W. Weisstein, The CRC Concise Encyclopedia of Mathematics, CRC Press, New York, NY, USA, 1999.
[34] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitanyi, “The
49–60, Philadelphia, Pa, USA, similarity metric,” IEEE Transactions on Information Theory, June 1999. vol. 50, no. 12, pp. 3250–3264, 2004. [15] A. J. Butte and I. S. Kohane, “Mutual information rele- [35] M. La Rosa, S. Gaglio, R. Rizzo, and A. Urso, “Normalised vance networks: functional genomic clustering using pairwise compression distance and evolutionary distance of genomic entropy measurements,” in Proceedings of the Pacific Sympo- sequences: comparison of clustering results,” International sium on Biocomputing, vol. 5, pp. 415–426, 2000. Journal of Knowledge Engineering and Soft Data Paradigms, vol. [16] M. Remm, C. E. V. Storm, and E. L. L. Sonnhammer, 1, no. 4, pp. 345–362, 2009. “Automatic clustering of orthologs and in-paralogs from [36] W. S. Torgerson, “Multidimensional scaling: I. Theory and pairwise species comparisons,” Journal of Molecular Biology, method,” Psychometrika, vol. 17, pp. 401–419, 1952. vol. 314, no. 5, pp. 1041–1052, 2001. [17] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lip- man, “Basic local alignment search tool,” JournalofMolecular Biology, vol. 232, pp. 584–599, 1993. [18] G. M. Garrity and T. G. Lilburn, “Self-organizing and self- correcting classifications of biological data,” Bioinformatics, vol. 21, no. 10, pp. 2309–2314, 2005. [19] C. Fyfe,W.Barbakh,W.C.Ooi,and H. Ko,“Topological mappings of video and audio data,” International Journal of Neural Systems, vol. 18, no. 6, pp. 481–489, 2008. [20] M. La Rosa, G. Di Fatta, S. Gaglio, G. M. Giammanco, R. Rizzo, and A. M. 
Urso, “Soft topographic map for clustering and classification of bacteria,” in Advances in Intelligent Data Journal of Advances in Industrial Engineering Multimedia Applied Computational Intelligence and Soft Computing International Journal of The Scientific Distributed World Journal Sensor Networks Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 Advances in Fuzzy Systems Modelling & Simulation in Engineering Hindawi Publishing Corporation Hindawi Publishing Corporation Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Submit your manuscripts at http://www.hindawi.com Journal of Computer Networks and Communications Advances in Artic fi ial Intelligence Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 International Journal of Advances in Biomedical Imaging Artificial Neural Systems International Journal of Computer Games Advances in Advances in Computer Engineering Technology Software Engineering Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 International Journal of Reconfigurable Computing Computational Advances in Journal of Journal of Intelligence and Human-Computer Electrical and Computer Robotics Interaction Neuroscience Engineering Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation 
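The map-sizing guideline stated in the Conclusion (a ratio K between input elements and neurons close to 1) can be illustrated with a minimal sketch. It assumes a square lattice and defines K as the number of input patterns divided by the number of units. The function name choose_map_size and the dataset sizes in the usage lines are hypothetical and do not come from the paper. The sketch covers only the K ≈ 1 criterion, not the Spearman coefficient criterion.

```python
import math
from typing import Tuple


def choose_map_size(n_patterns: int, target_k: float = 1.0) -> Tuple[int, int, float]:
    """Pick a square lattice whose ratio K = n_patterns / (rows * cols)
    is as close as possible to target_k.

    Assumption: the lattice is square, so only the side length is searched.
    Returns (rows, cols, K).
    """
    if n_patterns < 1:
        raise ValueError("n_patterns must be a positive integer")
    if target_k <= 0:
        raise ValueError("target_k must be positive")

    # The ideal (non-integer) side length satisfies side**2 = n_patterns / target_k.
    side_guess = math.sqrt(n_patterns / target_k)

    # Only the two integers around the ideal side length can be optimal.
    candidates = sorted({max(1, math.floor(side_guess)), math.ceil(side_guess)})

    # Keep the candidate whose K is closest to the target.
    best = min(candidates, key=lambda s: abs(n_patterns / (s * s) - target_k))
    return best, best, n_patterns / (best * best)


if __name__ == "__main__":
    # Hypothetical dataset sizes, used only to show the call.
    for n in (147, 400):
        rows, cols, k = choose_map_size(n)
        print(f"{n} patterns -> {rows}x{cols} map, K = {k:.3f}")
```

A rectangular lattice could be handled the same way by searching over pairs (rows, cols) instead of a single side length.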
Advances in Artificial Neural Systems – Hindawi Publishing Corporation
Published: Oct 12, 2011