POPcorn: An Online Resource Providing Access to Distributed and Diverse Maize Project Data

Hindawi Publishing Corporation
International Journal of Plant Genomics
Volume 2011, Article ID 923035, 10 pages
doi:10.1155/2011/923035

Research Article

Ethalinda K. S. Cannon (1), Scott M. Birkett (1), Bremen L. Braun (2), Sateesh Kodavali (1), Douglas M. Jennewein (3), Alper Yilmaz (4), Valentin Antonescu (5), Corina Antonescu (5), Lisa C. Harper (2, 6, 7), Jack M. Gardiner (1, 8), Mary L. Schaeffer (9, 10), Darwin A. Campbell (2), Carson M. Andorf (2), Destri Andorf (1), Damon Lisch (11), Karen E. Koch (12), Donald R. McCarty (12), John Quackenbush (5), Erich Grotewold (4), Carol M. Lushbough (3), Taner Z. Sen (1, 2), and Carolyn J. Lawrence (1, 2)

(1) Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
(2) USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
(3) Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA
(4) Plant Biotechnology Center and Department of Molecular Genetics, The Ohio State University, Columbus, OH 43210, USA
(5) Department of Biostatistics and Computational Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Sm822, Boston, MA 02215, USA
(6) USDA-ARS Plant Gene Expression Center, Albany, CA 94710, USA
(7) Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
(8) School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
(9) USDA-ARS Plant Genetics Research Unit, University of Missouri, Columbia, MO 65211, USA
(10) Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA
(11) Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
(12) Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA

Correspondence should be addressed to Carolyn J. Lawrence, triffid@iastate.edu

Received 16 August 2011; Accepted 29 November 2011

Academic Editor: Pierre Sourdille

Copyright © 2011 Ethalinda K. S. Cannon et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The purpose of the online resource presented here, POPcorn (Project Portal for corn), is to enhance accessibility of maize genetic and genomic resources for plant biologists. Currently, many online locations are difficult to find, some are best searched independently, and individual project websites often degrade over time, sometimes disappearing entirely. The POPcorn site makes available (1) a centralized, web-accessible resource to search and browse descriptions of ongoing maize genomics projects, (2) a single, stand-alone tool that uses Web services and minimal data warehousing to search for sequence matches in online resources of diverse offsite projects, and (3) a set of tools that enables researchers to migrate their data to the long-term model organism database for maize genetic and genomic information: MaizeGDB. Examples demonstrating POPcorn’s utility are provided herein.

1. Introduction

1.1. Need for the POPcorn Resource. In 1998, the National Science Foundation (NSF) launched the Plant Genome Research Program (PGRP), as part of the National Plant Genome Initiative. The establishment of PGRP coincided with an explosion of technologies that allowed large-scale genomic experiments to flourish, and PGRP grants fueled unprecedented advances in plant genomics research. This program was unique in that it strongly encouraged large collaborative projects and required project outcomes to be publicly available.
Largely as the result of NSF’s forward-thinking program, many independent online resources for plant research have been developed in the past 12 years. While this abundance of genomic data has transformed plant science in many ways, it has also created some problems: the plethora of independent websites requires researcher awareness of the various projects and what data each offers. Finding and using these resources is not always straightforward. Most sites use a variety of different tools that are often unique to that resource, each requiring that the researcher learn how to interact with them. In addition, it is also often difficult to use the results from one resource in another, and it is not generally possible to search multiple resources at the same time. Instead, researchers find themselves repeating the same search (e.g., BLAST [1]) at multiple sites in the hopes of locating all information relevant to their research. In addition, when funding for a project ends, the data generated often are not moved to long-term repositories. Thus, project sites degrade over time and sometimes disappear entirely. When the previously accessible data disappear, generated resources are effectively lost. Aggravating these problems is the sheer volume of data available. These problems have been acknowledged by various groups, including the maize research community (reviewed in the 2007 Allerton Report at http://www.maizegdb.org/AllertonReport.doc) and are currently prevalent in many research communities [2–4].

Internet technologies evolved to accommodate the massive quantity of emerging genomics data and to deliver human-readable content of the ever-increasing amount of information. However, machine readability of this content lags dismally behind. Efforts have been underway for more than a decade to improve the machine readability of new types of data with the overarching goal of creating web resources that use standard ontologies that can be processed by machines. Most notable of these are technologies developed under the umbrella of the Semantic Web (http://www.w3.org/2001/sw/) such as the standard model for data interchange called RDF (Resource Description Framework; http://www.w3.org/RDF/) and a mechanism to process the data content called OWL (Web Ontology Language; http://www.w3.org/TR/owl-features/). Although improvements that make content more visual and accessible to humans have been widely adopted, new technologies and standards that enable machine-readable content have been adopted more slowly. Finding relationships, setting standards, and aggregating the complexities introduced by diverse data types is a challenge that has received a great deal of attention. Beavis [5] points out several issues that providers of biological data must address. Indeed, consortia of researchers focused on developing and implementing standards have been formed, including the Genomic Standards Consortium and the Genome Reference Consortium, and an open access journal called Standards in Genomic Sciences (http://standardsingenomics.org/) was founded in 2009. These efforts are actively ongoing, especially in the life sciences, and are gaining momentum, but at this time are not yet adequate for widespread implementation. The number and variety of rapidly evolving efforts for creating common standards is a challenge in its own right.

The overall objective of the POPcorn (PrOject Portal for corn; http://popcorn.maizegdb.org/) project is to develop unified public resources that facilitate access to the outcomes of maize genetics and genomics research projects and to ensure their sustainability by migrating them to MaizeGDB, the maize Model Organism Database (MOD) [6, 7]. POPcorn was designed from the outset to be a 2-year project with specific goals. Because of its short time frame, we did not aim to develop new technologies and solutions but rather to make use of the best existing technologies. This short time frame also forced us to create practical goals. Instead of locating and aggregating data from distributed resources, POPcorn allows access to distributed datasets primarily by sequence queries rather than by classifications or keywords. The advantage of this approach is that it does not require cross-compatibility of often idiosyncratic terminology. Rather, it focuses on a universal feature of genetic and genomic datasets: sequence. No matter how sophisticated or powerful, a resource is not useful unless it is adopted by researchers. For this reason, our primary goal has been to produce a resource that researchers will use to aid their discoveries.

1.2. Definitions. In order to describe the work accomplished as a part of the POPcorn project, a few definitions and discussions of the available technologies are in order:

Provenance. Where and by what means an item originated and the changes that may have occurred as the item moved from one place to another.

Resource. A database, data visualization, data search, data analysis, or any web application that serves or processes information.

Sequence-Indexed Data. Data that are associated with sequence.

Unlike Data. Data in different formats and/or describing different things.

Web Service. A software system to support communication and data transfer among web resources, typically using a standard protocol such as Representational State Transfer (REST) [8] or the Simple Object Access Protocol (SOAP; http://www.w3.org/TR/soap12-part1).

2. Materials and Methods

2.1. Implementation. The POPcorn webpages were modeled after the online database PGROP (Plant Genome Research Outreach Portal; [9]), written in PHP and run on an Apache web server on a virtual machine created by VMware and running Red Hat Enterprise Linux Server release 5.6. The backend data processing scripts were written in Perl 5.8. The MaizeGDB Oracle 11g database is accessed directly via SQL. Web services employed include wwwBLAST (Perl), NCBI’s URLAPI, and a SOAP Web service for BLASTing that was developed by PlantGDB (Java) [10]. Data were passed between scripts and services using XML (http://www.w3.org/TR/2008/REC-xml-20081126/) and JSON (http://json-schema.org/).

2.2. Development Approaches. In order to provide access to projects, project data, and web resources with a hand-curated, searchable database, content had to be loaded into the POPcorn curation database. These data are updated to the production site’s database on the first Tuesday of each month. Currently, 242 project and resource descriptions are made available via POPcorn. Where possible, resources and data are associated with projects, and projects are related to one another where such a relationship makes sense. For example, the Maize Genome Sequencing pilot project (http://www.broadinstitute.org/annotation/plants/maize/) to evaluate strategies for producing a sequence of the Maize genome and to generate genome resources for the community is related to the funded B73 Maize Genome Sequencing project (http://genome.wustl.edu/genomes/view/zea_mays_mays_cv._b73) to sequence the gene space of Zea mays ssp. mays using the public B73 line. Researchers can submit their projects and correct the information residing at POPcorn via email or using links from the POPcorn website. Most projects were identified by POPcorn curators and developers by searching funding awards, attending conference talks, and viewing posters. Other projects and descriptions were provided by researchers directly.

One of POPcorn’s objectives was to enable BLAST searches of multiple target datasets that are distributed across multiple websites from a single page. We used Web services to provide access to these datasets. In addition, distributing the BLAST requests via Web services permits multiple simultaneous BLAST jobs to execute on multiple servers. Most BLAST Web services were implemented with NCBI’s wwwBLAST. One team (BioExtract; [11]) created a custom BLAST Web service for us. We also used NCBI’s URLAPI Web service to run BLAST jobs on the NCBI servers against the most current GenBank data. We adapted CViT software (http://sourceforge.net/apps/mediawiki/cvit/index.php?title=Main_Page; [12]) to display sequence BLAST hits on the overall view of the B73 reference genome assembly.

Because POPcorn makes heavy use of Web services over which we have little control, an automated script checks all Web services daily and reports if any are not responding. Searches and BLAST targets that use Web services all check those services before appearing on pages in option lists. To the extent possible, errors that come from the Web service (e.g., a query too large for the BLAST service) are reported back to the researcher.

To create useful workflows, for example, “locate mutant seed stock containing variations in a gene of interest,” the multistep process was implemented in code to permit repetition of the same series of steps. The topic of workflows has received much interest over the past decade: systems like Taverna [13] and BioExtract [11] have been developed to enable users to create their own workflows for retrieving and analyzing data. The limited scope of the POPcorn project did not allow us to implement user-designed workflows. Instead we chose to “hard code” workflows for what we knew to be common tasks. Since POPcorn was charged to provide a sequence-based search resource for maize researchers, we call this collection of workflows the “Sequence-Indexed Data Search.” In designing tools that enabled researchers to carry out common tasks, we learned through discussions with researchers that they frequently begin a given task using keyword searches via GenBank’s Entrez search service to find sequences. To incorporate this ability into POPcorn, we added a utility that accesses Entrez as a Web service.

2.3. Availability. The data, database, and code that make up POPcorn are in the public domain and are freely available upon request using the “Ask a Question” link at the top of any POPcorn page.

3. Results

3.1. Accomplishments. In developing POPcorn, we addressed four specific problems: (1) inability to locate all projects with data relevant to a particular research problem, (2) repetitive nature of performing the same sequence searches at multiple sites, (3) challenges associated with locating all types of data related to a particular sequence, and (4) issues associated with long-term data storage once individual projects have been completed.

3.1.1. Search for Relevant Projects. In a rapidly evolving research area such as maize genomics and molecular biology, it can be particularly challenging to keep abreast of molecular tools and resources that can accelerate one’s research program. Indeed, many have experienced the frustration of choosing a research path or approach, only to have it trumped or rendered less than cutting edge by the newest technological improvement. Since its inception in 1998, NSF PGRP has supported a rich variety of maize genomics projects, each developing useful tools and having its own project website. While this has moved the field forward by leaps and bounds, it is sometimes difficult to keep abreast of new and potentially useful advances. Most researchers would like to keep current with all the various research projects going on in their field in as efficient a way as possible.

To address this need for the maize and plant biology research communities, the Project Search feature (http://popcorn.maizegdb.org/search/project_search/project_search.php) at POPcorn was created. The Project Search accesses a hand-curated database of maize research projects and resources that is updated monthly and provides maize and plant biology researchers a one-stop shopping resource with reasonable assurance that it contains all publicly available maize resources and tools. To date, POPcorn enables access to 109 projects and 133 resources. Projects and resources are searched as separate entities and can be queried by keyword, investigator, institution, country, and category. Projects also can be accessed by browsing from five precompiled categories: sequencing, mapping, mutation, bioinformatics, and breeding. Given the complementary approaches that exist for searching POPcorn, virtually any approach a researcher might take towards locating a project or resource is likely to yield meaningful information.

3.1.2. Simultaneous Sequence Search at Multiple Websites. One of the initial impacts of the PGRP was the rapid increase in the number of available maize DNA sequences from a wide variety of projects, each with its own particular biological focus. Initially many of the DNA sequences were expressed sequence tags (ESTs) from a wide range of tissues, genetic backgrounds, and treatments, each chosen to meet the specific needs of a particular project. Later, projects focused on genomic sequencing often with the goal of capturing the maize gene space because the Maize Genome Sequencing project [14] was not yet underway. Each of these project types generally made their sequencing results available through a project website prior to publication with eventual submission to GenBank. In many cases, projects generated and/or assembled sequence-associated information that could not be adequately represented and queried at GenBank. Maize and plant biology researchers often found themselves migrating from website to website, mining each for what it could contribute to their research. While the approach was workable for a very small number of projects generating small sequence sets, it quickly became burdensome for researchers to search many projects that housed large sequence datasets. It was especially difficult to compare results from the different sites side-by-side because each website used different parameters by default and displayed customized displays for result sets.

To address this problem and to allow a more rational and focused approach to searching maize sequence resources, POPcorn BLAST (http://popcorn.maizegdb.org/search/sequence_search/home.php?a=BLAST_UI) was developed. This utility permits BLAST searches of sequence resources at multiple sites using a single query. Datasets are searched directly at the host site with host tools or mirrored at MaizeGDB and kept current with update scripts that run at regular intervals. Researchers can upload individual sequences or batch files and search all or any combination of multiple databases (as of this writing, 44 BLAST targets are available), each with its own unique focus on a class or type of maize DNA sequence. Results for each of the databases can be viewed individually. Results are returned either by email or via web interface with a choice of multiple download formats.

3.1.3. Finding Data Associated with Sequence. We developed a set of utilities that carry out multistep searches for sequence-indexed data, that is, data that can be found via sequence (expression patterns, similar sequences, functional annotations, associated locus, phenotype, traits, publications, etc.). For a detailed example of how these tools work, see “4. Example Usage Case: Sequence-indexed Data Search.”

3.1.4. Migrating Data from Completed Projects to MaizeGDB. One unanticipated issue arising from the rapid proliferation of maize genomic tools and resources was a need to maintain generated data after completion of any given project. Although GenBank is tasked with maintaining the sequence information generated by a project, much of what these projects produce is beyond the scope of GenBank’s mission. The vast majority of maize genomics projects are extramurally funded for two to five years. While funded, most of these projects do a good job of making their resources (either informational or physical) available to the maize research community. But what happens to these resources, some developed at considerable expense, when project funds for supporting them have been exhausted? Many projects manage to maintain and distribute resources for a period of time, but ultimately their ability to do so declines and valuable resources can be lost. To address the issue of potential information loss, the POPcorn project developed the ZeAlign tool (http://zealign.maizegdb.org/) to prepare sequence-indexed data for inclusion in MaizeGDB, the final repository for maize data and for the tools and processes developed as a part of the POPcorn project. ZeAlign enables researchers to align their sequences to the latest maize reference genome assembly using BLAST, then submit their alignments to MaizeGDB for public display via the MaizeGDB Genome Browser [15]. In addition to new tools, project records maintained by POPcorn include project expiration dates so that the MaizeGDB team knows when to begin contacting PIs to obtain data that should be brought into MaizeGDB.

3.2. Data, Methodologies, and Tools. Over the course of developing POPcorn, various aspects of data aggregation, query technologies, and standard terminologies were considered. In many cases, selections of particular methodologies and tools were made based upon research needs combined with practicality. Those decisions and selections are discussed in the subsequent sections.

3.2.1. Data. One common approach to aggregating unlike data is to combine it all within a single relational data warehouse. This provides good control over the quality and structure of the data, but costs a great deal in terms of curator time. Another approach is to link databases into a network of federated databases. This distributes the responsibility for maintaining data, but limits access and control over the quality and structure of the data.

POPcorn does not itself catalogue research data; instead it contains information describing data (metadata) and how to access available resources. We were able to make use of MaizeGDB and NCBI [16] warehoused data to search and retrieve information from many projects and to access additional databases: GRASSIUS [17], Gramene [18], MaizeSequence.org (http://www.maizesequence.org/), DFCI [19], Plant Genomics MAGIs [20], PLEXdb [21], PlantGDB [10], the Photosynthesis Mutant Library (PML) [22], and Phytozome (http://www.phytozome.net/search.php). Personnel working at these databases allowed access to their data and, in some instances, installed or permitted us to install Web services on their servers (GRASSIUS, DFCI, PLEXdb) or developed tools (PlantGDB) to enable our access.

3.2.2. Connecting Offsite Resources to the POPcorn Project. Similar to data, tools may be maintained locally or distributed and accessed by various technologies including various Web services. Where possible we have used existing technological solutions to the problems of distributed data and resources, such as Web services and two machine-readable protocols for encoding data, XML and JSON. We chose not to use more sophisticated technologies like SSWAP (Simple Semantic Web Architecture and Protocol) [23, 24] or SPARQL (SPARQL Protocol and RDF Query Language; http://www.w3.org/TR/rdf-sparql-query/) because, although they show promise, there is less likelihood of their long-term success and their current implementations were too limited for our needs. Data available via these technologies tend to be proof-of-concept implementations or limited in scope.

We had hoped to make considerable use of Web services to search and retrieve data from collaborator databases, but we found this approach challenging to implement and not robust. POPcorn has little to no control over Web services provided by other websites, especially in managing, recovering from, and reporting errors back to the researcher. Most of the Web services we found that best suited our needs were those provided by NCBI: eUtils (which we used for searching NCBI Entrez databases; http://www.ncbi.nlm.nih.gov/books/NBK25501/), URLAPI (which we used for executing BLAST searches at NCBI; http://www.ncbi..html), and wwwBLAST (which is freely available for download and can be installed as a BLAST Web service on any database server; http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/). We requested that wwwBLAST be implemented at some of our collaborator websites, including GRASSIUS, DFCI, and the Plant Genomics MAGI site, and we installed wwwBLAST at PLEXdb. We also made use of a BLAST Web service developed by PlantGDB’s BioExtract team to execute BLAST searches against datasets at PlantGDB.

3.2.3. Identifiers. Attaching unique, unambiguous identifiers to sequences has long been challenging. For example, all sequences at NCBI have an accession number (e.g., AF448416 is the accession number for the genomic sequence containing the maize bronze1 locus), a version extension for the accession (e.g., AF448416.1 is the first version of the sequence record and AF448416.2 would represent the second version), and a GI number unique to each version of an accession (e.g., AF448416.1 is assigned GI 18092333, and subsequent versions of the accession would have an entirely unique and unrelated new GI designation assigned). This topic is discussed more fully at NCBI (http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html). We decided to use mainly GenBank accessions for identifiers because the GenBank accession number without a version appended identifies the most up-to-date version of a given sequence. Where no GenBank identifier exists, we use project-specific identifiers. For example, for the most up-to-date maize gene model sequences we use the Gramene gene model names (bz1’s gene model identifier is GRMZM2G165390), both because the availability of those sequences from GenBank lags behind their public release and because researchers use those identifiers frequently in their own research.

3.2.4. Locating Resources. POPcorn’s projects and resources were located and curated into the database by hand. It was not until the site served quite a few project descriptions and resources (∼100 total) that outside groups began to contact us directly to request specific changes to their own projects’ descriptions and to request inclusion of new projects and resources.

3.2.5. Divergent Data Types. POPcorn did not attempt to tackle the problem of integrating divergent data types (also called “unlike data”) except to show all related data located during sequence-based, multistep data searches (called sequence-indexed data searches; described below). To establish relationships between classes of data, POPcorn relied on human curation rather than formalized data descriptions. Indeed, formalized data descriptions did not exist for the majority of the data we accessed.

3.2.6. Verifying Links. One problem with the ubiquitous links in webpages is not knowing what is on the other end of a link or if anything is even there at all. POPcorn addressed the former with hand curation and the latter with an automated script that runs daily to verify that all webpages and Web services listed in the database are responding as expected. Missing pages and services are reported to the POPcorn team for remedy and data searches that rely on missing Web services are taken offline until the problem is corrected. In addition, pages that present more than one search type to the researcher and involve one or more Web services are verified before the page loads in the browser and options are presented.

3.2.7. Versioning and Provenance. A significant problem associated with gathering data from distributed sources is the risk of the “provenance data” being lost along with accompanying credits and citations. Provenance data are a record of the workflow activities invoked, services and databases accessed, datasets used, and other specifics of the computational analysis detailing how, where, when, why, and by whom the results of an experiment were generated [25]. POPcorn strives to give credit to the data providers by providing extensive information in the project pages and, where possible, also gives data version information.

4. Example Usage Case: Sequence-Indexed Data Search

Researchers often want to identify mutant alleles to characterize the function of a gene or set of genes of interest. For example, once a gene or sequence of interest is identified, the characterization of additional alleles can confirm gene function: multiple alleles demonstrating a shared phenotype are excellent confirmation that a particular gene is responsible for a given phenotype. Indeed, characterization of multiple alleles of a gene is often required by journals for publication. For maize, there are several publicly available collections of mutants with transposable element insertions in known locations. POPcorn enables researchers to search various databases simultaneously for sequences of interest that harbor a mutation within or near the query sequence. Outputs contain not only the sequence itself but also identifiers for and access to relevant seed stocks.

In our example, a researcher we will call “Jane Smith” is working to increase the density at which plants can grow and has identified a sequence that confers an “upright leaves” phenotype in some maize lines that could be useful (reviewed in [26]). She would like to determine whether anything about this sequence has been published and if there are publicly available stocks that contain mutations in this sequence and, if so, obtain the seed stocks for her own investigations. To accomplish these goals, Jane goes to the POPcorn homepage: http://popcorn.maizegdb.org/main/index.php, then selects the “Search maize resources associated with a sequence” button under the “Sequence → Biology” header.

On the next page (Figure 1), she pastes the DNA sequence of her gene of interest and selects checkboxes to search both “Mutant seed stocks” and “Publications.” These selections enable simultaneous searches of relevant publications (indexed at both NCBI’s PubMed and MaizeGDB) as well as sequence-indexed insertions of Mutator (Mu) elements [27, 28], Activator/Dissociation (Ac/Ds) elements [29–32], and TILLING point mutations [33] for which seed stocks are available. When she clicks the “Begin search” button, POPcorn carries out the following steps.

First, the input may be FASTA or raw sequence, GenBank IDs, or Gramene maize gene model IDs. If needed, the sequences associated with identifiers are retrieved and all is converted to FASTA.

Next, the following five searches are carried out simultaneously.

(1) UniformMu
    (1.1) If input is one or more gene model names, look up UniformMu insertions directly in the MaizeGDB database and go to step 1.7.
    (1.2) If input is a GenBank identifier, request the FASTA-formatted sequence from GenBank via Web service and go to step 1.3.
    (1.3) BLAST query sequence against the current working gene set to get the gene model name(s) of the sequence [14]. If there is a matching gene model, go to step 1.4. If no gene model is returned, go to step 1.5.
    (1.4) Do a direct lookup in the MaizeGDB database using the gene model names. If successful, go to step 1.7.
    (1.5) BLAST query sequence against the current reference genome assembly.
    (1.6) Get hit coordinates, add 200 upstream and downstream bases, then do a direct lookup in the MaizeGDB database.
    (1.7) Get locus record at MaizeGDB.
    (1.8) Get variation records for that locus at MaizeGDB.
    (1.9) Get UniformMu stock records for each variation at MaizeGDB.

(2) Ac/Ds
    (2.1) BLAST at GenBank against flanking sequence for Ac/Ds.
    (2.2) Search for Ac/Ds record at MaizeGDB using hit accession.
    (2.3) Generate links to PlantGDB for viewing insertion information and ordering stocks.

(3) PML Mu insertions
    (3.1) BLAST at MaizeGDB against current filtered gene models.
    (3.2) Use gene model IDs to link to the PML Mu insertions website to get insertion and seed stock information.

(4) TILLING
    (4.1) BLAST at GenBank against the Maize TILLING project target sequences.
    (4.2) Search for associated locus record at MaizeGDB.
    (4.3) Search for stock record associated with the locus record.

(5) Publications
    (5.1) Execute simultaneous BLAST searches against GenBank’s protein, nucleotide, GSS, and EST databases.
    (5.2) For each hit, search MaizeGDB data for matching loci.
    (5.3) For each locus found, search for associated publications.
    (5.4) For each hit, check the GenBank record for publication information.

Finally, when all searches are completed, display results on one page with a tab control to enable viewing each results set.

Figure 1: Sequence-indexed data search: sequence to stocks and publications. To locate stocks with mutations in or near the sequence of interest as well as relevant publications simultaneously, the process consists of steps demarcated by red circles.

For the example given, results appear directly within the web browser or, if Jane wishes to do other things while waiting for the searches to complete, using an output link that is sent to her via email once the results set has been generated (Figure 1(b)). For the search of maize stocks, there are several possible results. In this case, there are Mu and Ac/Ds insertions in the input sequence. Jane can use the provided links to carefully check the results to see if they are useful. For example, an insertion may be located in an intron and she may only be interested in insertions into exons. Another possible result is an insertion into a sequence that does not contain an established gene model. Lastly, it is possible that no hits will be returned for the query sequence: whereas there are ∼6,500 Mu insertions and 2351 Ac/Ds insertions publicly available, there are currently over 39,000 well-supported gene models (i.e., the “Filtered Gene Set”) in maize [14].
(a): (1) Paste a sequence (FASTA-formatted or raw) or sequence identifier (GenBank accession number or GI) into the “Input sequence” field and indicate whether the sequence is made up of nucleotides or amino acids. (2) Next, select one or more datasets to search. Here, mutant seed stocks and publications are selected. (3) Type an email address into the text box (optional) then click the “Begin search” button. (b): (4) The number of results from each search type is shown. Once a results set has been chosen (here “UniformMu”), (5) BLAST parameters used to produce the results are displayed. (6) The algorithm for conducting the selected search is shown, and (7) identified gene models and stocks are listed along with a snapshot of the MaizeGDB Genome Browser in the region of the mutation to show genomic context. (c): (8) The number of results from each search type is shown. Once a results set has been chosen (here “publications”), (9) BLAST parameters used to produce the results are displayed. (10) The algorithm for conducting the selected search is shown, and (11) identified citations are shown with links to both PubMed and MaizeGDB.

Simultaneously, the results (Figure 1(c)) indicate that Jane has located 8 citations to her best hit: the sequence of the well-characterized liguleless1 gene.

By using this sequence-indexed data search, Jane has been spared several tedious steps and can quickly determine if there are available resources for characterizing the biological function of her sequence of interest. Equally important, Jane’s chances of finding a mutation in her gene of interest are increased by using POPcorn because otherwise she would have had to locate all relevant project websites, which is not a trivial task. In addition, were she to perform the same search of all projects she could locate by hand, each would have its own biases and eccentricities based upon differing default parameters, which would make direct comparisons of results from diverse websites difficult.

5. Conclusions and Future Directions

The technologies used to approach research problems in molecular genetics and plant breeding evolve continually, with sequence becoming the true “coin of the realm.” Various online resources emerge over time, with little more than sequence in common. In order to best support the paradigm shift from the use of molecular markers to genotype by sequencing, online resources must be reworked to accommodate a more sequence-centric perspective.

In its current implementation, POPcorn serves as a centralized web-accessible resource for gaining access to outcomes of maize research projects with an emphasis on the use of DNA and protein sequence for query inputs. The underlying premise is that, by creating a single point of access, researchers save valuable time and incorporate resources previously unknown to them into their analyses. By design, working in tandem with MaizeGDB allowed for the development of a pipeline whereby ongoing projects’ data are accessible via POPcorn during the projects’ funded period and relevant data are simultaneously prepared for inclusion in MaizeGDB at the end of a project’s funded period. In addition, by bringing collaborators’ project data into MaizeGDB at the end of their funding periods, these valuable data are preserved for the long term.

Discussions with maize researchers always indicate that clear, easy-to-use graphical user interfaces (GUIs) are critical for success. Investigators found the sequence-indexed data searches to be very helpful, but many required assistance to perform the searches. Part of this difficulty was conceptual (very few search utilities permit searching multiple data types starting from sequence) and part was due to the GUI. This problem is being addressed in part by integrating the developed sequence-indexed data search into relevant pages at MaizeGDB as well as by modifying the GUI on the main POPcorn search page based on guidance from researchers using the tool.

We set out to make use of the most effective solutions available. All features described here are available via the POPcorn stand-alone website: http://popcorn.maizegdb.org/. This site will be maintained by MaizeGDB personnel, and features developed for POPcorn are currently being incorporated into the MaizeGDB website directly. For example, the search for UniformMu seed stock described above has been integrated into the page describing the UniformMu project. POPcorn BLAST will be merged with the MaizeGDB BLAST resource, and the coded workflows are being integrated as a unit. The POPcorn resource currently is accessed by 114 unique users per day (averaged over a 6 month period).

Going forward, new project and resource records along with new BLAST target databases and new sequence-indexed data searches will be added as the landscape changes. We will continue to monitor funding awards and attend conference talks and poster sessions to learn about projects that should be indexed at POPcorn. The addition of new data and projects to the BLAST databases and the sequence-indexed data search requires that collaborators open their data to POPcorn, typically through a Web service such as wwwBLAST, and work with the MaizeGDB team to ensure that the logical steps required for deploying a sequence-indexed data search of new data types are correct. Integration of developed technologies into the MaizeGDB resource likely will increase usage significantly. We hope that the success of our end product will be judged by its usability and adoption by the research community it seeks to serve.

Acknowledgments

The authors thank Martin Spalding for helping with administrative aspects of project direction, Matthew Steven for contributing to the conceptualization of some of the BLAST capabilities, and all maize researchers who have contributed descriptions of their projects for inclusion in the POPcorn resource (see http://popcorn.maizegdb.org/search/sequence search/webcollaborators.php and http://popcorn.maizegdb.org/search/project search/project search.php). POPcorn funding was provided by NSF DBI 0743804 (POPcorn and CViT; C. Lawrence, PI and T.Z. Sen, co-PI), NSF IOS 0965380 (development of ZeAlign; J. Lynch, PI), NSF IOS 0701405 (GRASSIUS; E. Grotewold, PI), NSF IOS 0649614 (DFCI’s ZmGI; J. Quackenbush, PI), NSF IOS 0606906 (PlantGDB, PGROP, and BioExtract; V. Brendel, PI), NSF DBI 0321711 (MAGI; P. Schnable, PI with help from Eddy Yeh), NSF IOS 0543441 (PLEXdb; J. Dickerson, PI), NSF IOS 0922560 (PML; A. Barkan, PI), and NIH 5 T32 CA106196-05 (A. Yilmaz) along with in-kind support from MaizeGDB and the USDA-ARS.

References

[1] S. F. Altschul, T. L. Madden, A. A. Schäffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
[2] B. Buxton, V. Hayward, I. Pearson et al., “Big data: the next Google. Interview by Duncan Graham-Rowe,” Nature, vol. 455, pp. 8–9, 2008.
[3] C. Lynch, “Big data: how do your data grow?” Nature, vol. 455, no. 7209, pp. 28–29, 2008.
[4] D. Howe, M. Costanzo, P. Fey et al., “Big data: the future of biocuration,” Nature, vol. 455, no. 7209, pp. 47–50, 2008.
[5] W. D. Beavis, Ed., Architectures for Integration of Data and Applications: Lessons from Integration Projects, Columbia, Mo, USA, 2005.
[6] C. J. Lawrence, L. C. Harper, M. L. Schaeffer, T. Z. Sen, T. E. Seigfried, and D. A. Campbell, “MaizeGDB: the maize model organism database for basic, translational, and applied research,” International Journal of Plant Genomics, vol. 2008, Article ID 496957, 10 pages, 2008.
[7] T. Sen, C. Andorf, M. Schaeffer et al., “MaizeGDB becomes “sequence-centric”,” Database, vol. 2009, p. bap020, 2009.
[8] R. Fielding, Architectural styles and the design of network-based software architectures, Doctoral Dissertation, University of California, 2000.
[9] S. B. Baran, C. J. Lawrence, and V. Brendel, “Plant genome research outreach portal. A gateway to plant genome research “outreach” programs and activities,” Plant Physiology, vol. 134, no. 3, p. 889, 2004.
[10] J. Duvick, A. Fu, U. Muppirala et al., “PlantGDB: a resource for comparative plant genomics,” Nucleic Acids Research, vol. 36, no. 1, pp. D959–D965, 2008.
[11] C. Lushbough, M. K. Bergman, C. J. Lawrence, D. Jennewein, and V. Brendel, “BioExtract server—an integrated workflow-enabling system to access and analyze heterogeneous, distributed biomolecular data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 12–24,
[12] Ethalinda K. S. Cannon, “Chromosome visualization tool: a whole genome viewer,” International Journal of Plant Genomics, vol. 2011, Article ID 373875, 4 pages, 2011.
[13] A. Lanzén and T. Oinn, “The Taverna Interaction Service. Enabling manual interaction in workflows,” Bioinformatics, vol. 24, no. 8, pp. 1118–1120, 2008.
[14] P. S. Schnable, D. Ware, R. S. Fulton et al., “The B73 maize genome: complexity, diversity, and dynamics,” Science, vol. 326, no. 5956, pp. 1112–1115, 2009.
[15] T. Sen, L. Harper, M. Schaeffer et al., “Choosing a genome browser for a Model Organism Database: surveying the maize community,” Database, vol. 2010, p. baq007, 2010.
[16] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “GenBank,” Nucleic Acids Research, vol. 36, no. 1, pp. D25–D30, 2008.
[17] A. Yilmaz, M. Y. Nishiyama, B. G. Fuentes et al., “GRASSIUS: a platform for comparative regulatory genomics across the grasses,” Plant Physiology, vol. 149, no. 1, pp. 171–180, 2009.
[18] K. Youens-Clark, E. Buckler, T. Casstevens et al., “Gramene database in 2010: updates and extensions,” Nucleic Acids Research, vol. 39, no. 1, supplement, pp. D1085–D1094, 2011.
[19] C. Antonescu, V. Antonescu, R. Sultana, and J. Quackenbush, “Using the DFCI gene index databases for biological discovery,” Current Protocols in Bioinformatics, vol. 1, pp. Unit1.6.1–36, 2010.
[20] Y. Fu, S. J. Emrich, L. Guo et al., “Quality assessment of maize assembled genomic islands (MAGIs) and large-scale experimental verification of predicted genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 34, pp. 12282–12287, 2005.
[21] R. P. Wise, R. A. Caldo, L. Hong, L. Shen, E. Cannon, and J. A. Dickerson, “BarleyBase/PLEXdb: a unified expression profiling database for plants and plant pathogens,” Methods in Molecular Biology, vol. 406, pp. 347–363, 2007.
[22] R. Williams-Carrier, N. Stiffler, S. Belcher et al., “Use of Illumina sequencing to identify transposon insertions underlying mutant phenotypes in high-copy Mutator lines of maize,” Plant Journal, vol. 63, no. 1, pp. 167–177, 2010.
[23] D. D. G. Gessler, G. S. Schiltz, G. D. May et al., “SSWAP: a simple semantic web architecture and protocol for semantic web services,” BMC Bioinformatics, vol. 10, p. 309, 2009.
[24] R. T. Nelson, S. Avraham, R. C. Shoemaker, G. D. May, D. Ware, and D. D. G. Gessler, “Applications and methods utilizing the Simple Semantic Web Architecture and Protocol (SSWAP) for bioinformatics resource discovery and disparate data and service integration,” BioData Mining, vol. 3, no. 1, article 3, 2010.
[25] R. S. Barga and L. A. Digiampietri, “Automatic capture and efficient storage of e-Science experiment provenance,” Concurrency Computation Practice and Experience, vol. 20, no. 5, pp. 419–429, 2008.
[26] F. Tian, P. J. Bradbury, P. J. Brown et al., “Genome-wide association study of leaf architecture in the maize nested association mapping population,” Nature Genetics, vol. 43, pp. 159–162,
[27] D. R. McCarty, A. Mark Settles, M. Suzuki et al., “Steady-state transposon mutagenesis in inbred maize,” The Plant Journal for Cell and Molecular Biology, vol. 44, no. 1, pp. 52–61, 2005.
[28] A. M. Settles, D. R. Holding, B. C. Tan et al., “Sequence-indexed mutations in maize using the UniformMu transposon-tagging population,” BMC Genomics, vol. 8, article 116,
[29] K. R. Ahern, P. Deewatthanawong, J. Schares et al., “Regional mutagenesis using Dissociation in maize,” Methods, vol. 49, no. 3, pp. 248–254, 2009.
[30] E. Vollbrecht, J. Duvick, J. P. Schares et al., “Genome-wide distribution of transposed Dissociation elements in maize,” Plant Cell, vol. 22, no. 6, pp. 1667–1685, 2010.
[31] J. M. Kolkman, L. J. Conrad, P. R. Farmer et al., “Distribution of Activator (Ac) throughout the maize genome for use in regional mutagenesis,” Genetics, vol. 169, no. 2, pp. 981–995,
[32] M. Cowperthwaite, W. Park, Z. Xu, X. Yan, S. C. Maurais, and H. K. Dooner, “Use of the transposon Ac as a gene-searching engine in the maize genome,” Plant Cell, vol. 14, no. 3, pp. 713–726, 2002.
[33] B. J. Till, S. H. Reynolds, C. Weil et al., “Discovery of induced point mutations in maize genes by TILLING,” BMC Plant Biology, vol. 4, article 12, 2004.
Publisher
Hindawi Publishing Corporation
Copyright
Copyright © 2011 Ethalinda K. S. Cannon et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ISSN
1687-5370
DOI
10.1155/2011/923035

Abstract

Hindawi Publishing Corporation
International Journal of Plant Genomics
Volume 2011, Article ID 923035, 10 pages
doi:10.1155/2011/923035

Research Article

POPcorn: An Online Resource Providing Access to Distributed and Diverse Maize Project Data

Ethalinda K. S. Cannon (1), Scott M. Birkett (1), Bremen L. Braun (2), Sateesh Kodavali (1), Douglas M. Jennewein (3), Alper Yilmaz (4), Valentin Antonescu (5), Corina Antonescu (5), Lisa C. Harper (2, 6, 7), Jack M. Gardiner (1, 8), Mary L. Schaeffer (9, 10), Darwin A. Campbell (2), Carson M. Andorf (2), Destri Andorf (1), Damon Lisch (11), Karen E. Koch (12), Donald R. McCarty (12), John Quackenbush (5), Erich Grotewold (4), Carol M. Lushbough (3), Taner Z. Sen (1, 2), and Carolyn J. Lawrence (1, 2)

(1) Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
(2) USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
(3) Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA
(4) Plant Biotechnology Center and Department of Molecular Genetics, The Ohio State University, Columbus, OH 43210, USA
(5) Department of Biostatistics and Computational Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Sm822, Boston, MA 02215, USA
(6) USDA-ARS Plant Gene Expression Center, Albany, CA 94710, USA
(7) Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
(8) School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
(9) USDA-ARS Plant Genetics Research Unit, University of Missouri, Columbia, MO 65211, USA
(10) Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA
(11) Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
(12) Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA

Correspondence should be addressed to Carolyn J. Lawrence, triffid@iastate.edu

Received 16 August 2011; Accepted 29 November 2011

Academic Editor: Pierre Sourdille

Copyright © 2011 Ethalinda K. S. Cannon et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The purpose of the online resource presented here, POPcorn (Project Portal for corn), is to enhance accessibility of maize genetic and genomic resources for plant biologists. Currently, many online locations are difficult to find, some are best searched independently, and individual project websites often degrade over time, sometimes disappearing entirely. The POPcorn site makes available (1) a centralized, web-accessible resource to search and browse descriptions of ongoing maize genomics projects, (2) a single, stand-alone tool that uses Web services and minimal data warehousing to search for sequence matches in online resources of diverse offsite projects, and (3) a set of tools that enables researchers to migrate their data to the long-term model organism database for maize genetic and genomic information: MaizeGDB. Examples demonstrating POPcorn’s utility are provided herein.

1. Introduction

1.1. Need for the POPcorn Resource. In 1998, the National Science Foundation (NSF) launched the Plant Genome Research Program (PGRP), as part of the National Plant Genome Initiative. The establishment of PGRP coincided with an explosion of technologies that allowed large-scale genomic experiments to flourish, and PGRP grants fueled unprecedented advances in plant genomics research. This program was unique in that it strongly encouraged large collaborative projects and required project outcomes to be publicly available. Largely as the result of NSF’s forward thinking program, many independent online resources for plant research have been developed in the past 12 years. While this abundance of genomic data has transformed plant science in many ways, it has also created some problems: the plethora of independent websites requires researcher awareness of the various projects and what data each offers. Finding and using these resources is not always straightforward. Most sites use a variety of different tools that are often unique to that resource, each requiring that the researcher learn how to interact with them. In addition, it is also often difficult to use the results from one resource in another, and it is not generally possible to search multiple resources at the same time. Instead, researchers find themselves repeating the same search (e.g., BLAST [1]) at multiple sites in the hopes of locating all information relevant to their research. In addition, when funding for a project ends, the data generated often are not moved to long-term repositories. Thus, project sites degrade over time and sometimes disappear entirely. When the previously accessible data disappear, generated resources are effectively lost. Aggravating these problems is the sheer volume of data available. These problems have been acknowledged by various groups, including the maize research community (reviewed in the 2007 Allerton Report at http://www.maizegdb.org/AllertonReport.doc) and are currently prevalent in many research communities [2–4].

Internet technologies evolved to accommodate the massive quantity of emerging genomics data and to deliver human-readable content of the ever-increasing amount of information. However, machine readability of this content lags dismally behind. Efforts have been underway for more than a decade to improve the machine readability of new types of data with the overarching goal of creating web resources that use standard ontologies that can be processed by machines. Most notable of these are technologies developed under the umbrella of the Semantic Web (http://www.w3.org/2001/sw/) such as the standard model for data interchange called RDF (Resource Description Framework; http://www.w3.org/RDF/) and a mechanism to process the data content called OWL (Web Ontology Language; http://www.w3.org/TR/owl-features/). Although improvements that make content more visual and accessible to humans have been widely adopted, new technologies and standards that enable machine-readable content have been adopted more slowly. Finding relationships, setting standards, and aggregating the complexities introduced by diverse data types is a challenge that has received a great deal of attention. Beavis [5] points out several issues that providers of biological data must address. Indeed, consortia of researchers focused on developing and implementing standards have been formed, including the Genomic Standards Consortium and the Genome Reference Consortium, and an open access journal called Standards in Genomic Sciences (http://standardsingenomics.org/) was founded in 2009. These efforts are actively ongoing, especially in the life sciences, and are gaining momentum, but at this time are not yet adequate for widespread implementation. The number and variety of rapidly evolving efforts for creating common standards is a challenge in its own right.

The overall objective of the POPcorn (PrOject Portal for corn; http://popcorn.maizegdb.org/) project is to develop unified public resources that facilitate access to the outcomes of maize genetics and genomics research projects and to ensure their sustainability by migrating them to MaizeGDB, the maize Model Organism Database (MOD) [6, 7]. POPcorn was designed from the outset to be a 2-year project with specific goals. Because of its short time frame, we did not aim to develop new technologies and solutions but rather to make use of the best existing technologies. This short time frame also forced us to create practical goals. Instead of locating and aggregating data from distributed resources, POPcorn allows access to distributed datasets primarily by sequence queries rather than by classifications or keywords. The advantage of this approach is that it does not require cross-compatibility of often idiosyncratic terminology. Rather, it focuses on a universal feature of genetic and genomic datasets: sequence. No matter how sophisticated or powerful, a resource is not useful unless it is adopted by researchers. For this reason, our primary goal has been to produce a resource that researchers will use to aid their discoveries.

1.2. Definitions. In order to describe the work accomplished as a part of the POPcorn project, a few definitions and discussions of the available technologies are in order:

Provenance. Where and by what means an item originated and the changes that may have occurred as the item moved from one place to another.

Resource. A database, data visualization, data search, data analysis, or any web application that serves or processes information.

Sequence-Indexed Data. Data that are associated with sequence.

Unlike Data. Data in different formats and/or describing different things.

Web Service. A software system to support communication and data transfer among web resources, typically using a standard protocol such as the Representational State Transfer called REST [8] or the Simple Object Access Protocol, SOAP (http://www.w3.org/TR/soap12-part1).

2. Materials and Methods

2.1. Implementation. The POPcorn webpages were modeled after the online database PGROP (Plant Genome Research Outreach Portal; [9]), written in PHP and run on an Apache web server on a virtual machine created by VMware and running Red Hat Enterprise Linux Server release 5.6. The backend data processing scripts were written in Perl 5.8. The MaizeGDB Oracle 11g database is accessed directly via SQL. Web services employed include wwwBLAST (Perl), NCBI’s URLAPI, and a SOAP Web service for BLASTing that was developed by PlantGDB (Java) [10]. Data were passed between scripts and services using XML (http://www.w3.org/TR/2008/REC-xml-20081126/) and JSON (http://json-schema.org/).

2.2. Development Approaches. In order to provide access to projects, project data, and web resources with a hand-curated, searchable database, content had to be loaded into the POPcorn curation database. These data are updated to the production site’s database on the first Tuesday of each month. Currently, 242 project and resource descriptions are made available via POPcorn. Where possible, resources and data are associated with projects, and projects are related to one another where such a relationship makes sense. For example, the Maize Genome Sequencing pilot project (http://www.broadinstitute.org/annotation/plants/maize/) to evaluate strategies for producing a sequence of the Maize genome and to generate genome resources for the community is related to the funded B73 Maize Genome Sequencing project (http://genome.wustl.edu/genomes/view/zea mays mays cv. b73) to sequence the gene space of Zea mays ssp. mays using the public B73 line. Researchers can submit their projects and correct the information residing at POPcorn via email or using links from the POPcorn website. Most projects were identified by POPcorn curators and developers by searching funding awards, attending conference talks, and viewing posters. Other projects and descriptions were provided by researchers directly.

One of POPcorn’s objectives was to enable BLAST searches of multiple target datasets that are distributed across multiple websites from a single page. We used Web services to provide access to these datasets. In addition, distributing the BLAST requests via Web services permits multiple simultaneous BLAST jobs to execute on multiple servers. Most BLAST Web services were implemented with NCBI’s wwwBLAST. One team (BioExtract; [11]) created a custom BLAST Web service for us. We also used NCBI’s URLAPI Web service to run BLAST jobs on the NCBI servers against the most current GenBank data. We adapted CViT software (http://sourceforge.net/apps/mediawiki/cvit/index.php?title=Main Page; [12]) to display sequence BLAST hits on the overall view of the B73 reference genome assembly.

Because POPcorn makes heavy use of Web services over which we have little control, an automated script checks all Web services daily and reports if any are not responding. Searches and BLAST targets that use Web services all check those services before appearing on pages in option lists. To the extent possible, errors that come from the Web service (e.g., a query too large for the BLAST service) are reported back to the researcher.

To create useful workflows, for example, “locate mutant seed stock containing variations in a gene of interest,” the multistep process was implemented in code to permit repetition of the same series of steps. The topic of workflows has received much interest over the past decade: systems like Taverna [13] and BioExtract [11] have been developed to enable users to create their own workflows for retrieving and analyzing data. The limited scope of the POPcorn project did not allow us to implement user-designed workflows. Instead we chose to “hard code” workflows for what we knew to be common tasks. Since POPcorn was charged to provide a sequence-based search resource for maize researchers, we call this collection of workflows the “Sequence-Indexed Data Search.” In designing tools that enabled researchers to carry out common tasks, we learned through discussions with researchers that they frequently begin a given task using keyword searches via GenBank’s Entrez search service to find sequences. To incorporate this ability into POPcorn, we added a utility that accesses Entrez as a Web service.

2.3. Availability. The data, database, and code that make up POPcorn are in the public domain and are freely available upon request using the “Ask a Question” link at the top of any POPcorn page.

3. Results

3.1. Accomplishments. In developing POPcorn, we addressed four specific problems: (1) inability to locate all projects with data relevant to a particular research problem, (2) repetitive nature of performing the same sequence searches at multiple sites, (3) challenges associated with locating all types of data related to a particular sequence, and (4) issues associated with long-term data storage once individual projects have been completed.

3.1.1. Search for Relevant Projects. In a rapidly evolving research area such as maize genomics and molecular biology, it can be particularly challenging to keep abreast of molecular tools and resources that can accelerate one’s research program. Indeed, many have experienced the frustration of choosing a research path or approach, only to have it trumped or rendered less than cutting edge by the newest technological improvement. Since its inception in 1998, NSF PGRP has supported a rich variety of maize genomics projects, each developing useful tools and having its own project website. While this has moved the field forward by leaps and bounds, it is sometimes difficult to keep abreast of new and potentially useful advances. Most researchers would like to keep current with all the various research projects going on in their field in as efficient a way as possible.

To address this need for the maize and plant biology research communities, the Project Search feature (http://popcorn.maizegdb.org/search/project search/project search.php) at POPcorn was created. The Project Search accesses a hand-curated database of maize research projects and resources that is updated monthly and provides maize and plant biology researchers a one-stop shopping resource with reasonable assurance that it contains all publicly available maize resources and tools. To date, POPcorn enables access to 109 projects and 133 resources. Projects and resources are searched as separate entities and can be queried by keyword, investigator, institution, country, and category. Projects also can be accessed by browsing from five precompiled categories: sequencing, mapping, mutation, bioinformatics, and breeding. Given the complementary approaches that exist for searching POPcorn, virtually any approach a researcher might take towards locating a project or resource is likely to yield meaningful information.

3.1.2. Simultaneous Sequence Search at Multiple Websites. One of the initial impacts of the PGRP was the rapid increase in the number of available maize DNA sequences from a wide variety of projects, each with its own particular biological focus. Initially many of the DNA sequences were expressed sequence tags (ESTs) from a wide range of tissues, genetic backgrounds, and treatments, each chosen to meet the specific needs of a particular project. Later, projects focused on genomic sequencing often with the goal of capturing the maize gene space because the Maize Genome Sequencing project [14] was not yet underway. Each of these project types generally made their sequencing results available through a project website prior to publication with eventual submission to GenBank. In many cases, projects generated and/or assembled sequence-associated information that could not be adequately represented and queried at GenBank. Maize and plant biology researchers often found themselves migrating from website to website, mining each for what it could contribute to their research. While the approach was workable for a very small number of projects generating small sequence sets, it quickly became burdensome for researchers to search many projects that housed large sequence datasets. It was especially difficult to compare results from the different sites side-by-side because each website used different parameters by default and used customized displays for result sets.

Although GenBank is tasked with maintaining the sequence information generated by a project, much of what these projects produce is beyond the scope of GenBank’s mission. The vast majority of maize genomics projects are extramurally funded for two to five years. While funded, most of these projects do a good job of making their resources (either informational or physical) available to the maize research community. But what happens to these resources, some developed at considerable expense, when project funds for supporting them have been exhausted? Many projects manage to maintain and distribute resources for a period of time, but ultimately their ability to do so declines and valuable resources can be lost. To address the issue of potential information loss, the POPcorn project developed the ZeAlign tool (http://zealign.maizegdb.org/) to prepare sequence-indexed data for inclusion in MaizeGDB, the final repository for maize data and for the tools and processes developed as a part of the POPcorn project. ZeAlign enables researchers to align their sequences to the latest maize reference genome assembly using BLAST, then submit their alignments to MaizeGDB for public display via the MaizeGDB Genome Browser [15]. In addition to new tools, project records maintained by POPcorn include project expiration dates so that the MaizeGDB team knows when to begin contacting PIs to obtain data that should be brought into MaizeGDB.

3.2. Data, Methodologies, and Tools. Over the course of developing POPcorn, various aspects of data aggregation, query technologies, and standard terminologies were considered.
To address this problem and to allow a more rational In many cases, selections of particular methodologies and and focused approach to searching maize sequence resources, tools were made based upon research needs combined with POPcorn BLAST (http://popcorn.maizegdb.org/search/seq- practicality. Those decisions and selections are discussed in uence search/home.php?a=BLAST UI) was developed. This the subsequent sections. utility permits BLAST searches of sequence resources at mul- tiple sites using a single query. Datasets are searched directly 3.2.1. Data. One common approach to aggregating unlike at the host site with host tools or mirrored at MaizeGDB and kept current with update scripts that run at regular intervals. data is to combine it all within a single relational data warehouse. This provides good control over the quality and Researchers can upload individual sequences or batch files structure of the data, but costs a great deal in terms of curator and search all or any combination of multiple databases (as of this writing, 44 BLAST targets are available), each with its time. Another approach is to link databases into a network of federated databases. This distributes the responsibility for ownunique focusonaclassortypeofmaize DNAsequence. maintaining data, but limits access and control over the Results for each of the databases can be viewed individually. Results are returned either by email or via web interface with quality and structure of the data. a choice of multiple download formats. POPcorn does not itself catalogue research data; instead it contains information describing data (metadata) and how to access availableresources.Wewereabletomakeuse 3.1.3. Finding Data Associated with Sequence. 
We developed a set of utilities that carry out multistep searches for seq- of MaizeGDB and NCBI [16] warehoused data to search uence-indexed data, that is, data that can be found via and retrieve information from many projects and to access sequence (expression patterns, similar sequences, functional additional databases: GRASSIUS [17], Gramene [18], Maize- annotations, associated locus, phenotype, traits, publica- Sequence.org (http://www.maizesequence.org/), DFCI [19], tions, etc.). For a detailed example of how these tools work, Plant Genomics MAGIs [20], PLEXdb [21], PlantGDB [10], see “4. Example Usage Case: Sequence-indexed Data Search.” the Photosynthesis Mutant Library (PML) [22], and Phy- tozome (http://www.phytozome.net/search.php). Personnel working at these databases allowed access to their data and, 3.1.4. Migrating Data from Completed Projects to MaizeGDB. One unanticipated issue arising from the rapid proliferation in some instances, installed or permitted us to install Web of maize genomic tools and resources was a need to main- services on their servers (GRASSIUS, DFCI, PLEXdb) or tain generated data after completion of any given project. developed tools (PlantGDB) to enable our access. International Journal of Plant Genomics 5 3.2.2. Connecting Offsite Resources to the POPcorn Project. 3.2.4. Locating Resources. POPcorn’s projects and resources Similar to data, tools may be maintained locally or dis- were located and curated into the database by hand. It was tributed and accessed by various technologies including not until the site served quite a few project descriptions and various Web services. 
Where possible we have used existing resources (∼100 total) that outside groups began to contact technological solutions to the problems of distributed data us directly to request specific changes to their own projects’ and resources, such as Web services and two machine- descriptions and to request inclusion of new projects and readable protocols for encoding data, XML and JSON. We resources. chose not to use more sophisticated technologies like SSWAP (Simple Semantic Web Architecture and Protocol) [23, 24] 3.2.5. Divergent Data Types. POPcorn did not attempt to or SPARQL (SPARQL Protocol and RDF Query Language; tackle the problem of integrating divergent data types (also http://www.w3.org/TR/rdf-sparql-query/) because, although called “unlike data”) except to show all related data located they show promise, there is less likelihood of their long-term during sequence-based, multistep data searches (called success and their current implementations were too limited sequence-indexed data searches; described below). To estab- for our needs. Data available via these technologies tend to lish relationships between classes of data, POPcorn relied on be to proof-of-concept implementations or limited in scope. human curation rather than formalized data descriptions. We had hoped to make considerable use of Web services Indeed, formalized data descriptions did not exist for the to search and retrieve data from collaborator databases, majority of the data we accessed. but we found this approach challenging to implement and not robust. POPcorn has little to no control over Web 3.2.6. Verifying Links. One problem with the ubiquitous links services provided by other websites, especially in managing, in webpages is not knowing what is on the other end of a recovering from, and reporting errors back to the researcher. link or if anything is even there at all. 
POPcorn addressed the Most of the Web services we found that best suited our former with hand curation and the latter with an automated needs were those provided by NCBI: eUtils (which we used script that runs daily to verify that all webpages and Web for searching NCBI Entrez databases; http://www.ncbi.nlm services listed in the database are responding as expected. .nih.gov/books/NBK25501/), URLAPI (which we used for Missing pages and services are reported to the POPcorn executing BLAST searches at NCBI; http://www.ncbi..html), team for remedy and data searches that rely on missing Web and wwwBLAST (which is freely available for download and services are taken offline until the problem is corrected. In can be installed as a BLAST Web service on any database ser- addition, pages that present more than one search type to ver; http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/www- the researcher and involve one or more Web services are blast/). We requested that wwwBLAST be implemented at verified before the pages loads in the browser and options some of our collaborator websites, including GRASSIUS, are presented. DFCI, and the Plant Genomics MAGI site, and we installed wwwBLAST at PLEXdb. We also made use of a BLAST Web 3.2.7. Versioning and Provenance. A significant problem ass- service developed by PlantGDB’s BioExtract team to execute ociated with gathering data from distributed sources is the BLAST searches against datasets at PlantGDB. risk of the “provenance data” being lost along with accom- panying credits and citations. Provenance data are a record 3.2.3. Identifiers. Attaching unique, unambiguous identifiers of the workflow activities invoked, services and databases to sequences has long been challenging. 
For example, accessed, datasets used, and other specifics of the compu- all sequences at NCBI have an accession number (e.g., tational analysis detailing how, where, when, why, and by AF448416 is the accession number for the genomic sequenc- whom the results of an experiment were generated [25]. ing containing the maize bronze1 locus), a version extension POPcorn strives to give credit to the data providers by for the accession (e.g., AF448416.1 is the first version of the providing extensive information in the project pages and, sequence record and AF448416.2 would represent the second where possible, also gives data version information. version), and a GI number unique to each version of an accession (e.g., AF448416.1 is assigned GI 18092333, and subsequent versions of the accession would have an entirely 4. Example Usage Case: unique and unrelated new GI designation assigned). This Sequence-Indexed Data Search topic is discussed more fully at NCBI (http://www.ncbi.nlm .nih.gov/Sitemap/sequenceIDs.html). We decided to use Researchers often want to identify mutant alleles to charac- mainly GenBank accessions for identifiers because the terize the function of a gene or set of genes of interest. For GenBank accession number without a version appended example, once a gene or sequence of interest is identified, the identifies the most up-to-date version of a given sequence. characterization of additional alleles can confirm gene func- Where no GenBank identifier exists, we use project-specific tion: multiple alleles demonstrating a shared phenotype are identifiers. For example, for the most up-to-date maize gene excellent confirmation that a particular gene is responsible model sequences we use the Gramene gene model names for a given phenotype. Indeed, characterization of multiple (bz1’s gene model identifier is GRMZM2G165390), both alleles of a gene is often required by journals for publication. 
because the availability of those sequences from GenBank For maize, there are several publicly available collections lags behind their public release and because researchers use of mutants with transposable element insertions in known those identifiers frequently in their own research. locations. POPcorn enables researchers to search various 6 International Journal of Plant Genomics databases simultaneously for sequences of interest that har- (2) Ac/Ds bor a mutation within or near the query sequence. Outputs contain not only the sequence itself but also identifiers for (2.1) BLAST at GenBank against flanking sequence for Ac/Ds. and access to relevant seed stocks. In our example, a researcher we will call “Jane Smith” (2.2) Search for Ac/Ds record at MaizeGDB using hit is working to increase the density at which plants can grow accession. and has identified a sequence that confers an “upright leaves” (2.3) Generate links to PlantGDB for viewing inser- phenotype in some maize lines that could be useful (reviewed tion information and ordering stocks. in [26]). She would like to determine whether anything about this sequence has been published and if there are publicly (3) PML Mu insertions available stocks that contain mutations in this sequence and, if so, obtain the seed stocks for her own investigations. To (3.1) BLAST at MaizeGDB against current filtered accomplish these goals, Jane goes to the POPcorn homepage: gene models. http://popcorn.maizegdb.org/main/index.php, then selects (3.2) Use gene model IDs to link to the PML Mu in- the “Search maize resources associated with a sequence” sertions website to get insertion and seed stock button under the “Sequence → Biology” header. information. 
On the next page (Figure 1), she pastes the DNA sequence of her gene of interest and selects checkboxes to search (4) TILLING both “Mutant seed stocks” and “Publications.” These selec- tions enable simultaneous searches of relevant publications (4.1) BLAST at GenBank against the Maize TILLING (indexed at both NCBI’s PubMed and MaizeGDB) as well project target sequences. as sequence-indexed insertions of Mutator (Mu) elements (4.2) Search for associated locus record at Maize- [27, 28], Activator/Dissociation (Ac/Ds)elements[29–32], GDB. and TILLING point mutations [33] for which seed stocks (4.3) Search for stock record associated with the locus are available. When she clicks the “Begin search” button, record. POPcorn carries out the following steps. First, the input may be FASTA or raw sequence, GenBank (5) Publications IDs, or Gramene maize gene model IDs. If needed, the seq- uences associated with identifiers are retrieved and all is (5.1) Execute simultaneous BLAST searches against converted to FASTA. GenBank’s protein, nucleotide, GSS, and EST Next, the following five searches are carried out simulta- databases. neously. (5.2) For each hit, search MaizeGDB data for match- (1) UniformMu ing loci. (5.3) For each locus found, search for associated pub- (1.1) If input is one or more gene model names, look lications. up UniformMu insertions directly in the Maize- GDB database and go to step 1.7. (5.4) For each hit, check the GenBank record for publication information. (1.2) If input is a GenBank identifier, request the FASTA-formatted sequence from GenBank via Finally, when all searches are completed, display results Webservice andgotostep1.3. on one page with a tab control to enable viewing each results (1.3) BLAST query sequence against the current set. working gene set to get the gene model name(s) For the example given, results appear directly within the of the sequence [14]. 
If there is a matching web browser or, if Jane wishes to do other things while gene model, go to step 1.4. If no gene model is waiting for the searches to complete, using an output link returned, go to step 1.5. that is sent to her via email once the results set has been (1.4) Do a direct lookup in the MaizeGDB database generated (Figure 1(b)). For the search of maize stocks, using the gene model names. If successful, go to there are several possible results. In this case, there are Mu step 1.7. and Ac/Ds insertions in the input sequence. Jane can use (1.5) BLAST query sequence against the current ref- the provided links to carefully check the results to see if erence genome assembly. they are useful. For example, an insertion may be located (1.6) Get hit coordinates, add 200 upstream and in an intron and she may only be interested in insertions downstream bases, then do a direct lookup in into exons. Another possible result is an insertion into a the MaizeGDB database. sequence that does not contain an established gene model. (1.7) Get locus record at MaizeGDB. Lastly, it is possible that no hits will be returned for the (1.8) Get variation records for that locus at Maize- query sequence: whereas there are ∼6,500 Mu insertions and GDB. 2351 Ac/Ds insertions publicly available, there are currently (1.9) Get UniformMu stock records for each varia- over 39,000 well-supported gene models (i.e., the “Filtered tion at MaizeGDB. Gene Set”) in maize [14]. International Journal of Plant Genomics 7 (a) (b) Figure 1: Continued. 8 International Journal of Plant Genomics (c) Figure 1: Sequence-indexed data search: sequence to stocks and publications. To locate stocks with mutations in or near the sequence of interest as well as relevant publications simultaneously, the process consists of steps demarcated by red circles. 
(a): (1) Paste a sequence (FASTA-formatted or raw) or sequence identifier (GenBank accession number or GI) into the “Input sequence” field and indicate whether the sequence is made up of nucleotides or amino acids. (2) Next, select one or more datasets to search. Here, mutant seed stocks and publications are selected. (3) Type an email address into the text box (optional) then click the “Begin search” button. (b): (4) The number of results from each search type is shown. Once a results set has been chosen (here “UniformMu”), (5) BLAST parameters used to produce the results are displayed. (6) The algorithm for conducting the selected search is shown, and (7) identified gene models and stocks are listed along with a snapshot of the MaizeGDB Genome Browser in the region of the mutation to show genomic context. (c): (8) The number of results from each search type is shown. Once a results set has been chosen (here “publications”), (9) BLAST parameters used to produce the results are displayed. (10) The algorithm for conducting the selected search is shown, and (11) identified citations are shown with links to both PubMed and MaizeGDB.

Simultaneously, the results (Figure 1(c)) indicate that Jane has located 8 citations to her best hit: the sequence of the well-characterized liguleless1 gene.

By using this sequence-indexed data search, Jane has been spared several tedious steps and can quickly determine if there are available resources for characterizing the biological function of her sequence of interest. Equally important, Jane’s chances of finding a mutation in her gene of interest are increased by using POPcorn because otherwise she would have had to locate all relevant project websites, which is not a trivial task. In addition, were she to perform the same search of all projects she could locate by hand, each would have its own biases and eccentricities based upon differing default parameters, which would make direct comparisons of results from diverse websites difficult.

5. Conclusions and Future Directions

The technologies used to approach research problems in molecular genetics and plant breeding evolve continually, with sequence becoming the true “coin of the realm.” Various online resources emerge over time, with little more than sequence in common. In order to best support the paradigm shift from the use of molecular markers to genotype by sequencing, online resources must be reworked to accommodate a more sequence-centric perspective.

In its current implementation, POPcorn serves as a centralized web-accessible resource for gaining access to outcomes of maize research projects with an emphasis on the use of DNA and protein sequence for query inputs. The underlying premise is that, by creating a single point of access, researchers save valuable time and incorporate resources previously unknown to them into their analyses. By design, working in tandem with MaizeGDB allowed for the development of a pipeline whereby ongoing projects’ data are accessible via POPcorn during the projects’ funded period and relevant data are simultaneously prepared for inclusion in MaizeGDB at the end of a project’s funded period. In addition, by bringing collaborators’ project data into MaizeGDB at the end of their funding periods, these valuable data are preserved for the long term.

Discussions with maize researchers always indicate that clear, easy-to-use graphical user interfaces (GUIs) are critical for success. Investigators found the sequence-indexed data searches to be very helpful, but many required assistance to perform the searches. Part of this difficulty was conceptual (very few search utilities permit searching multiple data types starting from sequence) and part was due to the GUI. This problem is being addressed in part by integrating the developed sequence-indexed data search into relevant pages at MaizeGDB as well as by modifying the GUI on the main POPcorn search page based on guidance from researchers using the tool.

We set out to make use of the most effective solutions available. All features described here are available via the POPcorn stand-alone website: http://popcorn.maizegdb.org/. This site will be maintained by MaizeGDB personnel, and features developed for POPcorn are currently being incorporated into the MaizeGDB website directly. For example, the search for UniformMu seed stock described above has been integrated into the page describing the UniformMu project. POPcorn BLAST will be merged with the MaizeGDB BLAST resource, and the coded workflows are being integrated as a unit. The POPcorn resource currently is accessed by 114 unique users per day (averaged over a 6 month period).

Going forward, new project and resource records along with new BLAST target databases and new sequence-indexed data searches will be added as the landscape changes. We will continue to monitor funding awards and attend conference talks and poster sessions to learn about projects that should be indexed at POPcorn. The addition of new data and projects to the BLAST databases and the sequence-indexed data search requires that collaborators open their data to POPcorn, typically through a Web service such as wwwBLAST, and work with the MaizeGDB team to ensure that the logical steps required for deploying a sequence-indexed data search of new data types are correct. Integration of developed technologies into the MaizeGDB resource likely will increase usage significantly. We hope that the success of our end product will be judged by its usability and adoption by the research community it seeks to serve.

Acknowledgments

The authors thank Martin Spalding for helping with administrative aspects of project direction, Matthew Steven for contributing to the conceptualization of some of the BLAST capabilities, and all maize researchers who have contributed descriptions of their projects for inclusion in the POPcorn resource (see http://popcorn.maizegdb.org/search/sequence search/webcollaborators.php and http://popcorn.maizegdb.org/search/project search/project search.php). POPcorn funding was provided by NSF DBI 0743804 (POPcorn and CViT; C. Lawrence, PI and T.Z. Sen, co-PI), NSF IOS 0965380 (development of ZeAlign; J. Lynch, PI), NSF IOS 0701405 (GRASSIUS; E. Grotewold, PI), NSF IOS 0649614 (DFCI’s ZmGI; J. Quackenbush, PI), NSF IOS 0606906 (PlantGDB, PGROP, and BioExtract; V. Brendel, PI), NSF DBI 0321711 (MAGI; P. Schnable, PI with help from Eddy Yeh), NSF IOS 0543441 (PLEXdb; J. Dickerson, PI), NSF IOS 0922560 (PML; A. Barkan, PI), and NIH 5 T32 CA106196-05 (A. Yilmaz) along with in-kind support from MaizeGDB and the USDA-ARS.

References

[1] S. F. Altschul, T. L. Madden, A. A. Schäffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
[2] B. Buxton, V. Hayward, I. Pearson et al., “Big data: the next Google. Interview by Duncan Graham-Rowe,” Nature, vol. 455, pp. 8–9, 2008.
[3] C. Lynch, “Big data: how do your data grow?” Nature, vol. 455, no. 7209, pp. 28–29, 2008.
[4] D. Howe, M. Costanzo, P. Fey et al., “Big data: the future of biocuration,” Nature, vol. 455, no. 7209, pp. 47–50, 2008.
[5] W. D. Beavis, Ed., Architectures for Integration of Data and Applications: Lessons from Integration Projects, Columbia, Mo, USA, 2005.
[6] C. J. Lawrence, L. C. Harper, M. L. Schaeffer, T. Z. Sen, T. E. Seigfried, and D. A. Campbell, “MaizeGDB: the maize model organism database for basic, translational, and applied research,” International Journal of Plant Genomics, vol. 2008, Article ID 496957, 10 pages, 2008.
[7] T. Sen, C. Andorf, M. Schaeffer et al., “MaizeGDB becomes “sequence-centric”,” Database, vol. 2009, p. bap020, 2009.
[8] R. Fielding, Architectural styles and the design of network-based software architectures, Doctoral Dissertation, University of California, 2000.
[9] S. B. Baran, C. J. Lawrence, and V. Brendel, “Plant genome research outreach portal. A gateway to plant genome research “outreach” programs and activities,” Plant Physiology, vol. 134, no. 3, p. 889, 2004.
[10] J. Duvick, A. Fu, U. Muppirala et al., “PlantGDB: a resource for comparative plant genomics,” Nucleic Acids Research, vol. 36, no. 1, pp. D959–D965, 2008.
[11] C. Lushbough, M. K. Bergman, C. J. Lawrence, D. Jennewein, and V. Brendel, “BioExtract server - an integrated workflow-enabling system to access and analyze heterogeneous, distributed biomolecular data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 12–24,
[12] Ethalinda K. S. Cannon, “Chromosome visualization tool: a whole genome viewer,” International Journal of Plant Genomics, vol. 2011, Article ID 373875, 4 pages, 2011.
[13] A. Lanzén and T. Oinn, “The Taverna Interaction Service. Enabling manual interaction in workflows,” Bioinformatics, vol. 24, no. 8, pp. 1118–1120, 2008.
[14] P. S. Schnable, D. Ware, R. S. Fulton et al., “The B73 maize genome: complexity, diversity, and dynamics,” Science, vol. 326, no. 5956, pp. 1112–1115, 2009.
[15] T. Sen, L. Harper, M. Schaeffer et al., “Choosing a genome browser for a Model Organism Database: surveying the maize community,” Database, vol. 2010, p. baq007, 2010.
[16] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, “GenBank,” Nucleic Acids Research, vol. 36, no. 1, pp. D25–D30, 2008.
[17] A. Yilmaz, M. Y. Nishiyama, B. G. Fuentes et al., “GRASSIUS: a platform for comparative regulatory genomics across the grasses,” Plant Physiology, vol. 149, no. 1, pp. 171–180, 2009.
[18] K. Youens-Clark, E. Buckler, T. Casstevens et al., “Gramene database in 2010: updates and extensions,” Nucleic Acids Research, vol. 39, no. 1, supplement, pp. D1085–D1094, 2011.
[19] C. Antonescu, V. Antonescu, R. Sultana, and J. Quackenbush, “Using the DFCI gene index databases for biological discovery,” Current Protocols in Bioinformatics, vol. 1, pp. Unit1.6.1–36, 2010.
[20] Y. Fu, S. J. Emrich, L. Guo et al., “Quality assessment of maize assembled genomic islands (MAGIs) and large-scale experimental verification of predicted genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 34, pp. 12282–12287, 2005.
[21] R. P. Wise, R. A. Caldo, L. Hong, L. Shen, E. Cannon, and J. A. Dickerson, “BarleyBase/PLEXdb: a unified expression profiling database for plants and plant pathogens,” Methods in Molecular Biology, vol. 406, pp. 347–363, 2007.
[22] R. Williams-Carrier, N. Stiffler, S. Belcher et al., “Use of Illumina sequencing to identify transposon insertions underlying mutant phenotypes in high-copy Mutator lines of maize,” Plant Journal, vol. 63, no. 1, pp. 167–177, 2010.
[23] D. D. G. Gessler, G. S. Schiltz, G. D. May et al., “SSWAP: a simple semantic web architecture and protocol for semantic web services,” BMC Bioinformatics, vol. 10, p. 309, 2009.
[24] R. T. Nelson, S. Avraham, R. C. Shoemaker, G. D. May, D. Ware, and D. D. G. Gessler, “Applications and methods utilizing the Simple Semantic Web Architecture and Protocol (SSWAP) for bioinformatics resource discovery and disparate data and service integration,” BioData Mining, vol. 3, no. 1, article 3, 2010.
[25] R. S. Barga and L. A. Digiampietri, “Automatic capture and efficient storage of e-Science experiment provenance,” Concurrency Computation Practice and Experience, vol. 20, no. 5, pp. 419–429, 2008.
[26] F. Tian, P. J. Bradbury, P. J. Brown et al., “Genome-wide association study of leaf architecture in the maize nested association mapping population,” Nature Genetics, vol. 43, pp. 159–162,
[27] D. R. McCarty, A. Mark Settles, M. Suzuki et al., “Steady-state transposon mutagenesis in inbred maize,” The Plant Journal for Cell and Molecular Biology, vol. 44, no. 1, pp. 52–61, 2005.
[28] A. M. Settles, D. R. Holding, B. C. Tan et al., “Sequence-indexed mutations in maize using the UniformMu transposon-tagging population,” BMC Genomics, vol. 8, article 116,
[29] K. R. Ahern, P. Deewatthanawong, J. Schares et al., “Regional mutagenesis using Dissociation in maize,” Methods, vol. 49, no. 3, pp. 248–254, 2009.
[30] E. Vollbrecht, J. Duvick, J. P. Schares et al., “Genome-wide distribution of transposed Dissociation elements in maize,” Plant Cell, vol. 22, no. 6, pp. 1667–1685, 2010.
[31] J. M. Kolkman, L. J. Conrad, P. R. Farmer et al., “Distribution of Activator (Ac) throughout the maize genome for use in regional mutagenesis,” Genetics, vol. 169, no. 2, pp. 981–995,
[32] M. Cowperthwaite, W. Park, Z. Xu, X. Yan, S. C. Maurais, and H. K. Dooner, “Use of the transposon Ac as a gene-searching engine in the maize genome,” Plant Cell, vol. 14, no. 3, pp. 713–726, 2002.
[33] B. J. Till, S. H. Reynolds, C. Weil et al., “Discovery of induced point mutations in maize genes by TILLING,” BMC Plant Biology, vol. 4, article 12, 2004.
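As an illustration only, the UniformMu branch of the sequence-indexed data search in Section 4 (steps 1.1 to 1.9) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not POPcorn's actual code: the `Hit` record, the `Backends` callables, and every function name are hypothetical stand-ins for the BLAST Web services and MaizeGDB lookups. Only the branching order and the 200-base padding come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

FLANK = 200  # bases added upstream and downstream of a genome hit (step 1.6)


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for a BLAST hit on the reference genome assembly."""
    chromosome: str
    start: int
    end: int


@dataclass
class Backends:
    """Hypothetical stand-ins for the Web services and database lookups."""
    fetch_fasta: Callable[[str], str]                  # GenBank ID -> sequence
    blast_gene_models: Callable[[str], List[str]]      # sequence -> gene model names
    blast_genome: Callable[[str], Optional[Hit]]       # sequence -> best genome hit
    loci_by_gene_model: Callable[[List[str]], List[str]]
    loci_by_region: Callable[[Hit], List[str]]
    variations_for: Callable[[str], List[str]]         # locus -> variation records
    stocks_for: Callable[[str], List[str]]             # variation -> stock records


def padded_region(hit: Hit, flank: int = FLANK) -> Hit:
    """Step 1.6: widen a hit by `flank` bases on each side, clamped at base 1."""
    low, high = sorted((hit.start, hit.end))
    return Hit(hit.chromosome, max(1, low - flank), high + flank)


def uniformmu_search(query: str, kind: str, be: Backends) -> List[str]:
    """Return stock records; `kind` is 'gene_model', 'genbank', or 'sequence'."""
    loci: List[str] = []
    if kind == "gene_model":                            # step 1.1
        loci = be.loci_by_gene_model([query])
    else:
        # Step 1.2: GenBank identifiers are first resolved to a sequence.
        seq = be.fetch_fasta(query) if kind == "genbank" else query
        models = be.blast_gene_models(seq)              # step 1.3
        if models:
            loci = be.loci_by_gene_model(models)        # step 1.4
        if not loci:
            hit = be.blast_genome(seq)                  # step 1.5
            if hit is not None:
                loci = be.loci_by_region(padded_region(hit))  # step 1.6
    stocks: List[str] = []
    for locus in loci:                                  # step 1.7
        for variation in be.variations_for(locus):      # step 1.8
            stocks.extend(be.stocks_for(variation))     # step 1.9
    return stocks
```

Passing the lookups in as callables keeps the control flow testable without network access; a real deployment would presumably wrap the relevant Web services and database queries behind them.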

Journal

International Journal of Plant Genomics, Hindawi Publishing Corporation

Published: Dec 27, 2011
