TreeGenes: A Forest Tree Genome Database

Hindawi Publishing Corporation
International Journal of Plant Genomics
Volume 2008, Article ID 412875, 7 pages
doi:10.1155/2008/412875

Research Article

Jill L. Wegrzyn (1), Jennifer M. Lee (2), Brandon R. Tearse (1), and David B. Neale (1)

(1) Department of Plant Sciences, University of California, Davis, CA 95616, USA
(2) Department of Evolution and Ecology, University of California, Davis, CA 95616, USA

Correspondence should be addressed to David B. Neale, dbneale@ucdavis.edu

Received 17 December 2007; Revised 13 June 2008; Accepted 11 July 2008

Recommended by Shizhong Xu

The Dendrome Project and associated TreeGenes database serve the forest genetics research community through a curated and integrated web-based relational database. The research community is composed of approximately 2 000 members representing over 730 organizations worldwide. The database itself is composed of a wide range of genetic data from many forest trees with focused efforts on commercially important members of the Pinaceae family. The primary data types curated include species, publications, tree and DNA extraction information, genetic maps, molecular markers, ESTs, genotypic, and phenotypic data. There are currently ten main search modules or user access points within this PostgreSQL database. These access points allow users to navigate logically through the related data types. The goals of the Dendrome Project are to (1) provide a comprehensive resource for forest tree genomics data to facilitate gene discovery in related species, (2) develop interfaces that encourage the submission and integration of all genomic data, and to (3) centralize and distribute existing and novel online tools for the research community that both support and ease analysis. Recent developments have focused on increasing data content, functional annotations, data retrieval, and visualization tools.
TreeGenes was developed to provide a centralized web resource with analysis and visualization tools to support data storage and exchange.

Copyright © 2008 Jill L. Wegrzyn et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The TreeGenes database is a resource for all forest trees; however, work to date has focused on specific members of the Pinaceae family. Pinaceae is one of eight families of the order Coniferales (conifer) and includes 10 genera and approximately 220 species. Species of Pinaceae are commercially important to the forest industry and are used for building, packaging, and paper products worldwide. Because of the very large size and complexity of the conifer genomes (10–40 Gb) [1], greater emphasis has been placed on the expressed portion of the genome. This has been achieved through EST and large-scale resequencing projects. The principal species represented include six members of the genus Pinus (taeda, elliottii, radiata, pinaster, sylvestris, and lambertiana), two species from other genera of the Pinaceae (Picea abies and Pseudotsuga menziesii), and one member of another family of conifer (Cryptomeria japonica, Cupressaceae). Recent work has also focused on integrating resources for members of Populus and Eucalyptus. Developing genetic resources for these organisms improves the genetic understanding of individual member species within the same families or genera.

The rapid rate at which genetic data is being generated from large-scale EST and resequencing projects has required corresponding growth in relational databases and associated computational tools. The ability to combine data from different sources facilitates interpretation and potentially allows stronger inferences to be made. When information from several different databases is required, the assembly of data into a format suitable for querying is a challenge. The development of systems for the integration and combined analysis of diverse data remains a priority in bioinformatics. The TreeGenes database helps researchers efficiently analyze, access, integrate, and apply the data. This paper will navigate through the Dendrome (http://dendrome.ucdavis.edu/) and TreeGenes (http://dendrome.ucdavis.edu/treegenes/) interfaces as well as the types of data that can be accessed through these resources.

Figure 1: Diversity of web-based resources available through the Dendrome Project and TreeGenes database. The Dendrome project (http://dendrome.ucdavis.edu/) serves as a community resource and portal to a variety of resources. These include the TreeGenes database followed by the supporting large-scale project pages, and the Dendrome Plone which is a controlled access forum.

2. CONTENT AND ORGANIZATION

2.1. Dendrome project resources

The Dendrome pages are entry points to valuable links, community projects, information forums, custom tools, and the TreeGenes database (http://dendrome.ucdavis.edu/resources/) (Figure 1). The resources available include custom BLAST and FASTA services for sequence similarity searches. These services directly access all publicly available sequence data as well as custom EST or related databases. Users can submit requests for the generation and submission of a searchable database to this repository. An area also exists for the retrieval and submission of community-developed custom scripts of general use for manipulating genetic data. This interface allows authors to post and make available scripts and more advanced programs that they have designed, including relevant documentation. The tools pages are internally curated and provide information on open source and freeware software packages that are relevant to the processing, availability, and analysis of data presented in the TreeGenes database. The links pages allow users to submit and describe online resources and projects relevant to the research community. The links are submitted into one or more relevant genomic categories and are available immediately in the repository. Forms are present throughout the Dendrome site in order to encourage users to modify, add, and correct a variety of information.

The Dendrome pages are the primary source of useful information for the research community such as upcoming events, research opportunities, and the community-curated repository of links. In addition, a discussion forum has recently been implemented to encourage users to submit questions and comments on the database as well as participate in general discussions of conifer genomics. The related Dendrome Plone (http://dendrome.ucdavis.edu/TGPlone), based on the Plone 3.0 content management platform (http://plone.org/), provides a user-friendly environment for investigators to share and obtain information relating to a variety of projects.

2.2. TreeGenes database overview

TreeGenes (http://dendrome.ucdavis.edu/treegenes/) functions through a semi-automated PostgreSQL version 8.1.4-based database that consists of modules to hold a broad range of data and information for trees. This system has a front end consisting of Perl 5.8.5 scripts running in a Linux/Apache/PHP environment. The database is organized into ten different modules that are highly connected in order to ease access and analysis of the data (Figure 2). These modules include sample tracking that holds tree source and DNA extraction information. The sequencing and primers module contains sequences from the resequencing efforts as well as data describing how the sequences were generated and parameters used in their alignment and analysis. The Species module holds the taxonomy information and the Colleague module contains information on laboratories and individuals. There is a comparative map module that uses Cmap [2] to hold and view genetic maps, map relations, and molecular markers. The Literature module stores publications. The EST module stores sequence and annotation information and ties into an automated pipeline that allows users to submit sequences for processing, analysis, and Genbank submission. TreeGenes is being expanded to include more extensive genotype and phenotype data utilizing data models from the Germinate database [3]. This organization allows for a combination of internal curation as well as third-party submissions to validate and maintain current content.

2.2.1. Species module

Individual species and colleague databases exist in the TreeGenes schema and are well integrated with other database modules. Species is a manually curated database of 222 members of the Pinaceae family. It is currently being expanded to encompass more forest trees. The Species module contains detailed information for each entry including range maps and multiple images for each tree. Internal connections to the researchers who study each species and to relevant publications are available in the detail view. External links to NCBI’s extended taxonomy [4] provide direct access to publicly available sequence sets.

2.2.2. Colleague module

The colleague database is a semi-automated directory of nearly 2 000 researchers and 730 organizations. The colleague interfaces allow users to add, modify, and remove the contact information of both the users themselves and their respective organizations. The interface offers the user the ability to query researchers who focus on particular species in addition to their specific research interests.

2.2.3. Comparative maps module

The comparative map module is displayed through a modified version of Cmap [2]. Cmap is one component of the larger Generic Model Organism Database (GMOD) toolkit [5]. This package has the ability to compare across genetic, sequence, and physical maps. At this time, TreeGenes only contains genetic maps. The interactive display allows individuals to select single maps and continue to add additional linkage groups from any species to build a comparative view. Maps can be selected by specific linkage groups or they can be queried by their features. TreeGenes currently houses features (marker types) for over 50 genetic maps including AFLP, ESTP, Isozyme, SSR, RAPD, and RFLP. The final map display is highly interactive. Selecting specific features allows for the display of marker information and sequence information. Detail views also include links to internal databases within TreeGenes including species, literature, colleague, EST, QTLs, and PCR primer information (Figure 3). A standardized nomenclature has been developed to describe each potential map feature. Custom scripts exist to generate and enforce this documented nomenclature during the import process. This internal validation allows for name-based correspondences (markers) to be annotated. These correspondences can be easily viewed during comparisons of the maps themselves as well as in Cmap’s matrix view.

2.2.4. Literature module

The Literature module is responsible for performing regular and automated searches of major relevant publication repositories including Pubmed [6] and Biosis according to a compiled list of keywords. TreeGenes organizes these publications, creates a list of searchable keywords, and builds internal relations to provide a centralized forest genetics publication repository. This resource can be queried on many levels and allows for a customized subscription to preview recently added publications. In addition to querying public repositories, TreeGenes has a unique feature that allows for the submission of manuscripts by the author. These submitted publications undergo the same process of keyword generation. In addition to the submission of new publications, authors are encouraged to submit supplemental information. Each detail view provides automated external linking in addition to the opportunity to provide sequence data, raw mapping files, or accession numbers from public database submissions. This is the primary mechanism by which the curators can organize the information and further populate the related modules. All standardized nomenclatures are enforced during submission including accession IDs and TreeGenes comparative mapping nomenclature.

2.2.5. Expressed sequence tags module

The TreeGenes EST pipeline and database apply a combination of custom and open-source tools integrated into a fully automated processing pipeline (Figure 4). The processing occurs at five unique stages and allows users to submit either original tracefiles or FASTA files. In the latter case, quality scores cannot be considered in the processing. (1) Specific EST nomenclature is enforced during submission of the initial trace files or FASTA text files. (2) Tracefiles are processed to identify a filtered, high-quality clone library as determined by Ewing et al. [7]. (3) Sequence clustering consists of assembling high-quality sequences to produce longer transcripts and reduce overall redundancy. This occurs via two rounds of Cap3 [8] processing. (4) Annotation involves pairwise comparisons of the EST clone library and the EST contig consensus sequences. Sequence identification and annotation are provided by a series of BLAST homology searches (Parallel and Priority BLAST) against user-defined and publicly available sequence databases implemented with NCBI’s BLAST [9]. In general, searches are performed against the Genbank nr protein database [4]; however, users may select custom datasets. The UniGene dataset is derived by selecting the clone that best represents each contig, and the singletons that have unique or no matches are further annotated. This level of annotation consists of Gene Ontology (GO) [10] (with preference given to Plant GO [11] hits when available), KEGG for metabolic pathways [12], Enzyme Commission (EC) [13], and InterPro [14] for conserved protein domains (which includes CDD [15], SMART [16], and Pfam [17]). The final, optional stage of the processing pipeline involves submission to Genbank [4] following the approval of the owner. Users can log in and view the original EST data, the cleansed data, and the analysis results. Tables, formatted text, and links are provided for viewing summary and detailed information. Submission includes the generation, formatting, and actual submission of the required flat files. Since the TreeGenes database is home to many species, the database is organized to maintain independent species and project sets while search interfaces have been developed to support comparative queries.

[Figure 2 diagram labels: Genotype, SNP, Phenotype, Expression, Species, Comparative map, Sample Tracking, Colleague, Literature, Sequencing, EST, Primers, EST in Genbank]

Figure 2: Modular view of the TreeGenes database schema. The TreeGenes database (http://dendrome.ucdavis.edu/treegenes/) is a fully relational PostgreSQL database with a total of ten modules. These modules have connections supported by interfaces that allow queries across these data types. Current development is focused on fully incorporating the genotype and phenotype modules.

Figure 3: Comparative mapping (CMAP) interface applied to Pinaceae. The modified version of Cmap utilized to represent the genetic map unit in TreeGenes is highly interactive. This interface allows for the visualization of comparative map builds including matrix views that display the correspondences between the map sets. Cmap’s internal links also provide detailed supporting information on molecular markers, primer sequences, species, and publications.

2.2.6. Resequencing interface

The Sample Tracking, Primers, Sequencing, and Genotype modules are currently accessible through an interactive interface that includes a growing library of predefined queries (Figure 5). Each template provides a simplified view of an underlying query by means of a text description and one or more searchable fields. The recent increase in high-throughput sequencing projects encouraged the development of interfaces capable of dealing with the analysis of large amounts of data. In short, researchers can query thousands of sequences in a single operation. The desktop-style interface assists with these searches and allows users to customize their results and organize data views. Users can perform multiple searches at one time and combine results. Data types available here currently include ESTs, EST annotation, tracefiles, SNPs, primer sequences, and resequenced amplicons (including DNA extraction and tree sample information) (http://dendrome.ucdavis.edu/interface/). This interface currently accesses information for over 40 000 ESTs and nearly 8 000 resequenced amplicons.

[Figure 4 diagram labels: Sequence processing, Trace files, Sequence clustering, Loading, FASTA files, Annotation, TreeGenes EST module scripts, File preparation, Submission flat files, dbEST, Gene ontology, Custom BLAST]

Figure 4: EST analysis and submission process. The EST processing and submission pipeline can support multiple projects and species (http://dendrome.ucdavis.edu/treegenes/est/). Tracefiles are renamed, assembled, clustered, annotated, and submitted to Genbank upon the submitter’s approval. The web interface allows users to track the progress, view EST sequences, and provide basic annotation-based searches once the data has been loaded into the TreeGenes database.

3. AVAILABILITY, DATA SUBMISSION, FEEDBACK, AND SUPPORT

The Dendrome Project and the TreeGenes database are publicly available and can be accessed at http://dendrome.ucdavis.edu/. From the website, there is access to help in the form of tutorials and a user manual. We encourage researchers to actively participate in making Dendrome and TreeGenes more accessible by submitting data and providing feedback on general usability. We are interested in being able to provide a unique resource to the community, which is fully dependent on individual submissions. TreeGenes has a robust interface to submit a variety of information including sequence data and comparative mapping files through the literature database. More information on this resource can be found at http://dendrome.ucdavis.edu/TreeGenes/literature/. The discussion forum available through the Dendrome Project website provides an opportunity for users to submit suggestions on improvements or additions to the community resources. Input relating to new functionality and additional data sources is welcome here. A help form is available for user-specific queries. This feedback form will automatically send us the inquiry, making it easier to give an accurate response. Further information is available by joining one of the TreeGenes electronic mailing lists (details on the website) or by email to info@treegenes.ucdavis.edu.

4. FUTURE DEVELOPMENT

The structure of TreeGenes permits researchers to rapidly accumulate a wealth of information about a particular object or set of objects. This flexible design facilitates the formulation of new hypotheses for refining subsequent investigations. In addition to refinement and extension of smaller scale investigations, TreeGenes can also facilitate more comprehensive approaches by allowing the investigation of interactions among datasets. TreeGenes is still in a phase of rapid development. Future plans include incorporating standardized gene ontology to describe phenotypic traits unique to forest trees. In addition, robust databases will be available for the submission and visualization of SNP and expression data. The advanced workspace interface will be expanded to accommodate both of these data types as well as tools to ease the analysis and comparison of this information. With an emphasis on meaningful data acquisition and interface design, TreeGenes continues to serve a critical role in the efficient storage and analysis of data by the forest genetics community.

Figure 5: Resequencing interface, panels (a) and (b). The workspace interface environment allows for more complex, batch searches of EST, annotation, tracefile, PCR primer, amplicon sequence, SNP, DNA extraction, and tree sample data types (http://dendrome.ucdavis.edu/interface/). Users can initiate searches based on any of these data types. The data retrieved can be compiled into customized lists, saved for future searches, and downloaded to a local machine for further analysis.

LIST OF ABBREVIATIONS

EST: Expressed sequence tag
GUI: Graphical user interface
GO: Gene ontology
PHP: Hypertext preprocessor
PO: Plant ontology
SNP: Single nucleotide polymorphism

REFERENCES

[1] M. R. Ahuja and D. B. Neale, “Evolution of genome size in conifers,” Silvae Genetica, vol. 54, no. 3, pp. 126–137, 2005.
[2] D. Ware, P. Jaiswal, J. Ni, et al., “Gramene: a resource for comparative grass genomics,” Nucleic Acids Research, vol. 30, no. 1, pp. 103–105, 2002.
[3] J. M. Lee, G. F. Davenport, D. Marshall, et al., “GERMINATE. A generic database for integrating genotypic and phenotypic information for plant genetic resource collections,” Plant Physiology, vol. 139, no. 2, pp. 619–631, 2005.
[4] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “GenBank: update,” Nucleic Acids Research, vol. 32, database issue, pp. D23–D26, 2004.
[5] L. D. Stein, C. Mungall, S. Shu, et al., “The generic genome browser: a building block for a model organism system database,” Genome Research, vol. 12, no. 10, pp. 1599–1610, 2002.
[6] N. C. Putnam, “Searching MEDLINE free on the Internet using the National Library of Medicine’s PubMed,” Clinical Excellence for Nurse Practitioners, vol. 2, no. 5, pp. 314–316.
[7] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, “Base-calling of automated sequencer traces using phred. I. Accuracy assessment,” Genome Research, vol. 8, no. 3, pp. 175–185, 1998.
[8] X. Huang and A. Madan, “CAP3: a DNA sequence assembly program,” Genome Research, vol. 9, no. 9, pp. 868–877, 1999.
[9] M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, and T. L. Madden, “NCBI BLAST: a better web interface,” Nucleic Acids Research, vol. 36, web server issue, pp. W5–W9, 2008.
[10] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
[11] P. Jaiswal, S. Avraham, K. Ilic, et al., “Plant Ontology (PO): a controlled vocabulary of plant structures and growth stages,” Comparative and Functional Genomics, vol. 6, no. 7-8, pp. 388–397, 2005.
[12] J. Wixon and D. Kell, “The Kyoto encyclopedia of genes and genomes—KEGG,” Yeast, vol. 17, no. 1, pp. 48–55, 2000.
[13] S. Grisolia, “Enzyme nomenclature,” Science, vol. 133, no. 3465, pp. 1672–1674, 1961.
[14] N. J. Mulder, R. Apweiler, T. K. Attwood, et al., “InterPro: an integrated documentation resource for protein families, domains and functional sites,” Briefings in Bioinformatics, vol. 3, no. 3, pp. 225–235, 2002.
[15] A. Marchler-Bauer, J. B. Anderson, P. F. Cherukuri, et al., “CDD: a Conserved Domain Database for protein classification,” Nucleic Acids Research, vol. 33, database issue, pp. D192–D196, 2005.
[16] C. P. Ponting, J. Schultz, F. Milpetz, and P. Bork, “SMART: identification and annotation of domains from signalling and extracellular protein sequences,” Nucleic Acids Research, vol. 27, no. 1, pp. 229–232, 1999.
[17] A. Bateman, E. Birney, L. Cerruti, et al., “The Pfam protein families database,” Nucleic Acids Research, vol. 30, no. 1, pp. 276–280, 2002.
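As a rough illustration of the first two EST pipeline stages described in Section 2.2.5 (nomenclature enforcement and quality filtering), the following Python sketch validates read names and trims low-quality ends. It is not the TreeGenes implementation: the naming pattern, the phred cutoff of 20, the minimum length of 100 bases, and every function and type name here are assumptions made for the example, and the simple end-trimming stands in for the phred-based processing cited in the text [7].

```python
import re
from typing import List, NamedTuple, Tuple

# Hypothetical naming rule for illustration only (NOT the TreeGenes
# nomenclature): four-letter species code, library label, numeric clone id.
NAME_PATTERN = re.compile(r"[A-Z]{4}_[A-Za-z0-9]+_[0-9]{4,}")
MIN_QUALITY = 20   # assumed phred cutoff
MIN_LENGTH = 100   # assumed minimum number of bases kept after trimming


class Read(NamedTuple):
    name: str
    bases: str
    quals: List[int]  # one phred score per base


def valid_name(name: str) -> bool:
    """Stage 1: accept only names that follow the (assumed) naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


def trim_low_quality(read: Read, cutoff: int = MIN_QUALITY) -> Read:
    """Stage 2: drop leading and trailing bases scoring below the cutoff."""
    start, end = 0, len(read.bases)
    while start < end and read.quals[start] < cutoff:
        start += 1
    while end > start and read.quals[end - 1] < cutoff:
        end -= 1
    return Read(read.name, read.bases[start:end], read.quals[start:end])


def filter_library(
    reads: List[Read],
    cutoff: int = MIN_QUALITY,
    min_length: int = MIN_LENGTH,
) -> Tuple[List[Read], List[Tuple[str, str]]]:
    """Return (accepted reads, list of (name, reason) for rejected reads)."""
    accepted, rejected = [], []
    for read in reads:
        if not valid_name(read.name):
            rejected.append((read.name, "bad name"))
            continue
        trimmed = trim_low_quality(read, cutoff)
        if len(trimmed.bases) < min_length:
            rejected.append((read.name, "too short after trimming"))
            continue
        accepted.append(trimmed)
    return accepted, rejected
```

Reads that pass both checks would then be candidates for clustering; FASTA submissions without quality scores would skip the trimming step, consistent with the note in the text that quality cannot be considered in that case.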
Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2008 Jill L. Wegrzyn et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
1687-5370
DOI
10.1155/2008/412875
(3) Sequence clustering con- larger Generic Model Organism Database (GMOD) toolkit sists of assembling high-quality sequences to produce longer [5]. This package has the ability to compare across genetic, transcripts and reduce overall redundancy. This occurs via sequence, and physical maps. At this time, TreeGenes only two rounds of Cap3 [8] processing. (4) Annotation involves contains genetic maps. The interactive display allows indi- pairwise comparisons of the EST clone library and the viduals to select single maps and continue to add additional EST contig consensus sequences. Sequence identification and linkage groups from any species to build a comparative view. annotation is provided by a series of BLAST homology Maps can be selected by specific linkage groups or they searches (Parallel and Priority BLAST) against user-defined can be queried by their features. TreeGenes currently houses and publicly available sequence databases implemented with features (marker types) for over 50 genetic maps including NCBI’s BLAST [9]. In general, searches are performed AFLP, ESTP, Isozyme, SSR, RAPD, and RFLP. The final against the Genbank nr protein database [4], however users map display is highly interactive. Selecting specific features may select custom datasets. The UniGene dataset is derived allows for the display of marker information and sequence by selecting the clone that best represents each contig and information. Detail views also include links to internal the singletons that have unique or no matches are further databases within TreeGenes including species, literature, col- annotated. This level of annotation consists of Gene Ontol- league, EST, QTLs, and PCR primer information (Figure 3). ogy (GO) [10] (with preference given to Plant GO [11]hits A standardized nomenclature has been developed to describe when available), KEGG for metabolic pathways [12], Enzyme each potential map feature. 
Custom scripts exist to generate Commission (EC) [13], and InterPro [14]for conserved and enforce this documented nomenclature during the protein domains (which includes CDD [15], SMART [16], import process. This internal validation allows for name- and Pfam [17]). The final stage and optional stage of the based correspondences (markers) to be annotated. These processing pipeline involves submission to Genbank [4] correspondences can be easily viewed during comparisons of following the approval of the owner. Users can login in the maps themselves as well as Cmap’s matrix view. and view the original EST data, the cleansed data and the 4 International Journal of Plant Genomics Genotype SNP Phenotype Expression Species Comparative map Sample Tracking Colleague Literature Sequencing EST Primers EST in Genbank Figure 2: Modular view of the TreeGenes database schema. The TreeGenes database (http://dendrome.ucdavis.edu/treegenes/)isafully relational PostgreSQL database with a total of ten modules. These modules have connections supported by interfaces that allow queries across these data types. Current development is focused on fully incorporating the genotype and phenotype modules. Figure 3: Comparative mapping (CMAP) interface applied to Pinaceae. The modified version of Cmap utilized to represent the genetic map unit in TreeGenes is highly interactive. This interface allows for the visualization of comparative map builds including matrix views that display the correspondences between the map sets. Cmap’s internal links also provide detailed supporting information on molecular markers, primer sequences, species, and publications. analysis results. Tables, formatted text, and links are provided 2.2.6. Resequencing interface for viewing summary and detailed information. Submission includes the generation, formatting, and actual submission The Sample Tracking, Primers, Sequencing, and Genotype of the required flat files. 
Since the TreeGenes database is modules are currently accessible through an interactive home to many species, the database is organized to maintain interface that includes a growing library of predefined independent species and project sets while search interfaces queries (Figure 5). Each template provides a simplified view have been developed to support comparative queries. of an underlying query by means of a text description Jill L. Wegrzyn et al. 5 Sequence processing Trace files Sequence clustering Loading FASTA files Annotation TreeGenes EST module scripts File preparation Submission flat files dbEST Gene ontology Custom BLAST Figure 4: EST analysis and submission process. The EST processing and submission pipeline can support multiple projects and species (http://dendrome.ucdavis.edu/treegenes/est/). Tracefiles are renamed, assembled, clustered, annotated, and submitted to Genbank upon the submitter’s approval. The web interface allows users to track the progress, view EST sequences, and provide basic annotation-based searches once the data has been loaded into the TreeGenes database. and one or more searchable fields. The recent increase cussion forum available through the Dendrome Project in high-throughput sequencing projects encouraged the website provides an opportunity for users to submit sug- development of interfaces capable of dealing with the analysis gestions on improvements or additions to the community of large amounts of data. In short, researchers can query resources. Input relating to new functionality and additional thousands of sequences in a single operation. The desktop data sources are welcome here. A help form is available for style interface assists with these searches and allows users to user-specific queries. This feedback form will automatically customize their results and organize data views. Users can send us the inquiry, making it easier to give an accurate perform multiple searches at one time and combine results. response. 
Further information is available by joining one of Data types available here currently include ESTs, EST anno- the TreeGenes electronic mailing lists (details on the website) tation, tracefiles, SNPs, primer sequences, and resequenced or by email to info@treegenes.ucdavis.edu. amplicons (including DNA extraction and tree sample information) (http://dendrome.ucdavis.edu/interface/). This 4. FUTURE DEVELOPMENT interface currently accesses information for over 40 000 ESTs and nearly 8 000 resequenced amplicons. The structure of TreeGenes permits researchers to rapidly accumulate a wealth of information about a particular 3. AVAILABILITY, DATA SUBMISSION, object or set of objects. This flexible design facilitates the FEEDBACK, AND SUPPORT formulation of new hypotheses for refining subsequent investigations. In addition to refinement and extension of The Dendrome Project and the TreeGenes database are smaller scale investigations, TreeGenes can also facilitate publicly available and can be accessed at http://dendrome more comprehensive approaches by allowing the investiga- .ucdavis.edu/. From the website, there is access to help in tion of interactions among datasets. TreeGenes is still in a the form of tutorials and a user manual. We encourage phase of rapid development. Future plans include incorpo- researchers to actively participate in making Dendrome rating standardized gene ontology to describe phenotypic and TreeGenes more accessible by submitting data and traits unique to forest trees. In addition, robust databases providing feedback on general usability. We are inter- will be available for the submission and visualization of ested in being able to provide a unique resource to SNP and expression data. The advanced workspace interface the community, which is fully dependent on individual will be expanded to accommodate both of these data types submissions. 
TreeGenes has a robust interface to submit as well as tools to ease the analysis and comparison of a variety of information including sequence data and this information. With an emphasis on meaningful data comparative mapping files through the literature database. acquisition and interface design, TreeGenes continues to More information on this resource can be found in serve a critical role in the efficient storage and analysis of data http://dendrome.ucdavis.edu/TreeGenes/literature/. The dis- by the forest genetics community. 6 International Journal of Plant Genomics (a) (b) Figure 5: Resequencing interface. The workspace interface environment allows for more complex, batch searches of EST, annotation, tracefile, PCR primer, amplicon sequence, SNP, DNA extraction, and tree sample data types (http://dendrome.ucdavis.edu/interface/). Users can initiate searches based on any of these data types. The data retrieved can be compiled into customized lists, saved for future searches, and downloaded to a local machine for further analysis. LIST OF ABBREVIATIONS [2] D. Ware, P. Jaiswal, J. Ni, et al., “Gramene: a resource for comparative grass genomics,” Nucleic Acids Research, vol. 30, EST: Expressed sequence tag no. 1, pp. 103–105, 2002. GUI: Graphical user interface [3] J. M. Lee, G. F. Davenport, D. Marshall, et al., “GERMINATE. GO: Gene ontology A generic database for integrating genotypic and phenotypic PHP: Hypertext preprocessor information for plant genetic resource collections,” Plant PO: Plant ontology Physiology, vol. 139, no. 2, pp. 619–631, 2005. SNP: Single nucleotide polymorphism [4] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “GenBank: update,” Nucleic Acids Research, vol. 32, database issue, pp. D23–D26, 2004. REFERENCES [5] L. D. Stein, C. Mungall, S. Shu, et al., “The generic genome [1] M. R. Ahuja and D. B. 
Neale, “Evolution of genome size in browser: a building block for a model organism system data- conifers,” Silvae Genetica, vol. 54, no. 3, pp. 126–137, 2005. base,” Genome Research, vol. 12, no. 10, pp. 1599–1610, 2002. Jill L. Wegrzyn et al. 7 [6] N. C. Putnam, “Searching MEDLINE free on the Internet using the National Library of Medicine’s PubMed,” Clinical Excellence for Nurse Practitioners, vol. 2, no. 5, pp. 314–316, [7] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, “Base- calling of automated sequencer traces using phred. I. Accuracy assessment,” Genome Research, vol. 8, no. 3, pp. 175–185, 1998. [8] X.Huang andA.Madan,“CAP3:aDNAsequenceassembly program,” Genome Research, vol. 9, no. 9, pp. 868–877, 1999. [9] M.Johnson,I.Zaretskaya, Y. Raytselis, Y. Merezhuk,S. McGinnis, and T. L. Madden, “NCBI BLAST: a better web interface,” Nucleic Acids Research,vol. 36, webserverissue,pp. W5–W9, 2008. [10] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000. [11] P. Jaiswal, S. Avraham, K. Ilic, et al., “Plant Ontology (PO): a controlled vocabulary of plant structures and growth stages,” Comparative and Functional Genomics, vol. 6, no. 7-8, pp. 388– 397, 2005. [12] J. Wixon and D. Kell, “The Kyoto encyclopedia of genes and genomes—KEGG,” Yeast, vol. 17, no. 1, pp. 48–55, 2000. [13] S. Grisolia, “Enzyme nomenclature,” Science, vol. 133, no. 3465, pp. 1672–1674, 1961. [14] N. J. Mulder, R. Apweiler, T. K. Attwood, et al., “InterPro: an integrated documentation resource for protein families, domains and functional sites,” Briefings in Bioinformatics, vol. 3, no. 3, pp. 225–235, 2002. [15] A. Marchler-Bauer, J. B. Anderson, P. F. Cherukuri, et al., “CDD: a Conserved Domain Database for protein classifica- tion,” Nucleic Acids Research, vol. 33, database issue, pp. D192– D196, 2005. [16] C. P. Ponting, J. Schultz, F. Milpetz, and P. 
Bork, “SMART: identification and annotation of domains from signalling and extracellular protein sequences,” Nucleic Acids Research, vol. 27, no. 1, pp. 229–232, 1999. [17] A. Bateman, E. Birney, L. Cerruti, et al., “The Pfam protein families database,” Nucleic Acids Research, vol. 30, no. 1, pp. 276–280, 2002. International Journal of Peptides Advances in International Journal of BioMed Stem Cells Virolog y Research International International Genomics Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 Journal of Nucleic Acids International Journal of Zoology Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 Submit your manuscripts at http://www.hindawi.com The Scientific Journal of Signal Transduction World Journal Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 International Journal of Advances in Genetics Anatomy Biochemistry Research International Research International Microbiology Research International Bioinformatics Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 Enzyme Journal of International Journal of Molecular Biology Archaea Research Evolutionary Biology International Marine Biology Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation 
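As an illustration of the name-based correspondences described for the comparative maps module (Section 2.2.3), the following minimal sketch shows how markers shared between two genetic maps could be found once a standardized nomenclature is enforced. It is not the TreeGenes or Cmap implementation. The function names, the normalization rule (trim whitespace, ignore case), the map layout (linkage group to list of marker names), and the marker names themselves are all assumptions made up for this example.

```python
def normalize(name: str) -> str:
    """Hypothetical normalization rule: trim whitespace and ignore case."""
    return name.strip().lower()


def name_based_correspondences(map_a: dict, map_b: dict) -> dict:
    """Return {marker: (linkage group in map_a, linkage group in map_b)}.

    Each map is assumed to be a dict of linkage group -> list of marker names.
    Two markers correspond when their normalized names are identical.
    """
    def index(genetic_map: dict) -> dict:
        # Build a lookup from normalized marker name to its linkage group.
        lookup = {}
        for group, markers in genetic_map.items():
            for marker in markers:
                lookup[normalize(marker)] = group
        return lookup

    index_a, index_b = index(map_a), index(map_b)
    shared = sorted(index_a.keys() & index_b.keys())
    return {name: (index_a[name], index_b[name]) for name in shared}


# Toy maps with invented marker names (not real TreeGenes data).
map_a = {"LG1": ["estp_0001", "ssr_0042"], "LG2": ["rflp_0007"]}
map_b = {"LG3": ["SSR_0042 ", "aflp_0100"], "LG5": ["rflp_0007"]}
correspondences = name_based_correspondences(map_a, map_b)
for marker, (group_a, group_b) in correspondences.items():
    print(marker, group_a, group_b)
```

Matching on a normalized name is only meaningful if names are validated at import time, which is why the nomenclature is enforced before maps are loaded; without that step, spelling variants of the same marker would silently fail to correspond.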

Journal

International Journal of Plant Genomics, Hindawi Publishing Corporation

Published: Aug 18, 2008

References