Cross-Chip Probe Matching Tool: A Web-Based Tool for Linking Microarray Probes within and across Plant Species

Hindawi Publishing Corporation
International Journal of Plant Genomics
Volume 2008, Article ID 451327, 7 pages
doi:10.1155/2008/451327

Resource Review

Ruchi Ghanekar (1, 2), Vinodh Srinivasasainagendra (2), and Grier P. Page (2, 3)

(1) Department of Electrical and Computer Engineering, UAB School of Engineering, University of Alabama at Birmingham, 1530 Third Avenue South, Birmingham, AL 35294-4461, USA
(2) Department of Biostatistics, University of Alabama at Birmingham, 1665 University Blvd, Birmingham, AL 35294-0022, USA
(3) Statistics and Epidemiology Unit, RTI International, Oxford Building, Suite 119, 2951 Flowers Road South, Atlanta, GA 30341-5533, USA

Correspondence should be addressed to Grier P. Page, gpage@rti.org

Received 2 November 2007; Accepted 14 August 2008

Recommended by Chunguang Du

The Cross-Chip Probe Matching Tool (CCPMT) is a free, web-based tool that allows plant investigators to rapidly determine if a given gene is present across various microarray platforms, which, of a list of genes, is present on array(s), and which gene a probe or probe set queries and vice versa, and to compare and contrast the gene contents of arrays. The CCPMT also maps a probe or probe sets to a gene or genes within and across species, and permits the mapping of the entire content from one array to another. By using the CCPMT, investigators will have a better understanding of the contents of arrays, a better ability to link data between experiments, the ability to conduct meta-analysis and combine datasets, and an increased ability to conduct data mining projects.

Copyright © 2008 Ruchi Ghanekar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION

Microarrays are an incredibly powerful technology that enables the rapid and relatively accurate measurement of thousands of genes in a single sample. Many different microarray platforms have been developed and each has somewhat different content and format. One key difference is the type of probe used to query a gene's expression; some platforms use a single probe, and others use many probes. The probes may be short (25 base pairs) oligonucleotides (Affymetrix and NimbleGen arrays), long (50–70 bp) oligonucleotides (Operon, Agilent, CATMA), or cDNA clones (AFGC arrays). Each of the formats has its advantages and disadvantages as well as its proponents and opponents. One thing on which everybody agrees is that arrays will be a part of the experimental techniques of plant biologists for years to come.

Since there are many microarray platforms even within a single species, different investigators may use different platforms to address similar or complementary experimental questions, or data may be collected across array types using different platforms. Also, the large number of datasets in the public domain can be used for data mining or meta-analysis if the elements can be connected. However, it is difficult to compare and combine the results due to the difficulty in matching probes across arrays with the genes, or even in determining if a given gene is on a given platform. To make matters worse, while the probe sequences on an array are constant, the genome annotation and gene models are not, and homologous genes may have different names across species. As a result, the matching of probes across arrays is continually evolving and needs continual updating.

Investigators have long realized the problem of linking probes across platforms; as a result, several tools have been developed. These include Keck ARray Manager and Annotator (KARMA) [1], RESOURCERER [2], and GeneSeer [3]. Our tool has several advantages over the other tools. None of the other tools allows investigators to query for genes within a microarray platform, nor do the other tools allow queries by Arabidopsis Genome Initiative (AGI) annotation IDs or by TIGR tentative consensus (TC) gene IDs. Furthermore, our tool sends the results to the investigators by email as well as in a web-based report, making tracking and storage of results easier. More importantly for plant researchers, only RESOURCERER has any provision for the linking of plant array data, but it has fewer array types.

We developed the CCPMT to enable investigators to rapidly determine (1) if a given gene is present across many types of array platforms within and across species, (2) which, of a list of genes, is present on array(s), and (3) which gene a probe or probe set queries. The CCPMT also maps a probe set or probe sets to a gene or genes within and across species, and permits the mapping of the contents from one array to another, both within and across species.

The CCPMT is the first tool exclusively designed for linking probes from plant microarrays within and across microarray platforms and species. As a web-based tool, CCPMT helps investigators query for annotations at the probe level with probe set IDs or at the gene level with gene identifiers such as AGI, EGO [4], and TC IDs. In CCPMT, an investigator can enter either individual or multiple probe set or gene identifiers (separated by commas) in the textbox to query the CCPMT database. Checkboxes for microarray vendors provide the option of selecting multiple arrays while querying the CCPMT database. CCPMT also offers the flexibility to carry out a one-to-one comparison of microarrays. Results are displayed immediately in the web browser and are also sent through email in a csv file format.

CCPMT has a flexible database design, and in the immediate future additional plant arrays will be added to the database; we will revise the underlying annotation and mapping for the probes based upon new genomic information.

By using the CCPMT, investigators will have a better understanding of the contents of arrays, a better ability to link data between experiments, plus the ability to more easily conduct data mining projects.

Figure 1: CCPMT Arabidopsis BLAST workflow. The workflow in CCPMT to get the probe set to AGI mappings is shown. (Workflow: the microarray vendor-provided probe sequences and the TAIR dataset are given to the NCBI BLAST blastn program; the blastn result is the probe set and corresponding AGI mapping.)

2. METHODS

2.1. Arrays selected for initial analysis

Initially we focused upon microarrays with diverse probe types (short and long oligos as well as cDNA) and for both Poplar and Arabidopsis. Poplar and Arabidopsis were chosen due to both having completely sequenced genomes and being relatively closely related species. The arrays in the tool are the Affymetrix Arabidopsis genome (8 K) array, commonly referred to as AG; the Affymetrix Arabidopsis genome ATH1-121501 (25 K) array, commonly referred to as ATH1; the Agilent Arabidopsis 2 Oligo Microarray (V2) G4136B; the Arabidopsis Functional Genomics Consortium (AFGC) array; the Complete Arabidopsis Transcriptome MicroArray (CATMA) array; the Operon Arabidopsis Genome Oligo Set Version 3.0; and the Affymetrix Poplar Genome Array. The array that we are calling AFGC actually represents all cDNA clones used in all of the AFGC arrays, including the 11 k, 13 k, and 16 k arrays.

2.2. Arabidopsis data preprocessing

We obtained the probe set ID, the vendor's corresponding mapping to AGI ID (for Arabidopsis arrays), and the nucleotide sequences of the probe sets (Table 1) directly from the vendors.

In the case of Arabidopsis, all vendors provided the mappings between their probe sets and the corresponding AGI gene identifiers. However, due to evolving genome annotation, we derived a new set of mappings between the probe sets and the corresponding AGI IDs. The steps of the process are illustrated in Figure 1. The mapping was accomplished using the NCBI blastn [5] program. Blastn compares a nucleotide query sequence against a nucleotide sequence database. We used two different databases for blastn analysis. For the Affymetrix and Operon probe sequences, which do not contain introns, the AGI CDS database at TAIR was used as the sequence database due to the lack of introns and UTRs in this database. The AGI CDS dataset is based on the TAIR6.0 release version, and was released in November 2005. For the AFGC and CATMA arrays, which do contain some intronic and UTR sequences, the AGI Transcripts dataset was used. The AGI Transcripts dataset includes all of the coding sequences from Arabidopsis as well as the UTRs. Neither database contained intronic sequence. The AGI Transcripts dataset used the TAIR6.0 release version and was released in November 2005. The blastn expected value and percent identity cut-offs were 10^-4 and 98%, respectively.

Table 1: Web pages from where plant microarray data were downloaded.

| Array | Probe sequence file location | Vendor-provided annotation file location |
| Affymetrix AG | http://www.affymetrix.com/support/technical/byproduct.affx?product=atgenome1 | http://www.affymetrix.com/support/technical/byproduct.affx?product=atgenome1 |
| Affymetrix ATH1 | http://www.affymetrix.com/support/technical/byproduct.affx?product=arab | http://www.affymetrix.com/support/technical/byproduct.affx?product=arab |
| Operon | http://omad.operon.com/download/index.php | http://omad.operon.com/download/index.php |
| CATMA | ftp://ftp.arabidopsis.org/home/tair/Microarrays/CATMA/ | ftp://ftp.arabidopsis.org/home/tair/Microarrays/CATMA/ |
| AFGC | ftp://ftp.arabidopsis.org/home/tair/Microarrays/AFGC/ | ftp://ftp.arabidopsis.org/home/tair/Microarrays/AFGC/ |
| Agilent | NA (do not provide sequence files) | http://www.chem.agilent.com/Scripts/PDS.asp?lPage=37068 |
| Affymetrix Poplar Genome Array | http://www.affymetrix.com/support/technical/byproduct.affx?product=poplar | http://www.affymetrix.com/support/technical/byproduct.affx?product=poplar |

2.3. Poplar data preprocessing

About 27% of the Poplar sequences that have been sequenced have significant homology to Arabidopsis protein-coding sequences [6]. Unlike Arabidopsis, Poplar does not have a universal gene annotation ID; so in CCPMT, Poplar probe sets are mapped within the species using the TIGR TC IDs and across plant species using the EGO database. The Poplar target sequences were sequence-aligned with the TIGR Poplar TC dataset using the blastn program as shown in Figure 2. The blastn expected value and percent identity cut-offs were 10^-4 and 98%, respectively. TIGR also provides a file with a mapping of the EGO ID and the corresponding TCs for all species. From this file, the mappings between EGO IDs and the corresponding Arabidopsis and Poplar TCs were parsed. The mapping of the TCs to EGOs was assumed to be correct. In the future, any plant species with genes mapping to an EGO ID can be easily incorporated into CCPMT. Mapping the Arabidopsis TCs to their corresponding AGI IDs was achieved by using the Arabidopsis TC sequences (TIGR provides this file) and sequence-aligning them with the TAIR “AGI Transcripts” dataset using blastn. With the cut-offs used, there are one-to-many mappings at several stages. A probe set can map to multiple genes, and multiple probe sets can map to one gene (Table 2).

As an example, Figure 3 illustrates the mapping of the Affymetrix Poplar Genome Array with the Affymetrix AG and Affymetrix ATH1 arrays; similar processes are used for the other arrays. Table 3 contains the number of matches that were found between all possible pairs of arrays.

Figure 2: Poplar-Arabidopsis mapping. The workflow shows the steps that were undertaken while mapping the Affymetrix Poplar probe set ID to the Arabidopsis probe set ID. The TIGR EGO ID was used to go across species during mapping. (Workflow: Affymetrix Poplar probe set, then Poplar TC sequence files, then EGO TOG (TIGR ortholog group) ID, then Arabidopsis TC mappings, then the decision "Does TC annotation have AGI?". If yes, query the existing CCPMT database with the AGI to get the corresponding Arabidopsis probe sets from the Arabidopsis microarray vendors. If no, use the TC oligo sequence to BLAST against the TAIR-provided Arabidopsis dataset; the BLAST returns the AGI ID.)

Figure 3: Workflow for the mapping between Affymetrix Poplar, Affymetrix AG, and Affymetrix ATH1 arrays. The steps and counts shown in the figure are:

(1) The Affymetrix Poplar Genome Array has 61414 Poplar probe sets. These were aligned with blastn (E-value = E-4, percent ID = 98%) against the Poplar TC file from TIGR (41375 Poplar TCs), giving 57668 total mappings. TIGR provides 16818 Poplar EGO-TC mappings.
(2) Table A: 21931 total Poplar probe set-TC-EGO mappings. Example rows: PtpAffx.249.8.A1_s_at, TC19207, EGO 894156; Ptp.1492.1.S1_s_at, TC28054, EGO 894523.
(3) The Affymetrix Arabidopsis genome (8 k) target sequences comprise 8297 probe sets and the Affymetrix Arabidopsis genome ATH1-121501 (25 k) target sequences comprise 22810 probe sets. These were aligned with blastn (E-value = E-4, percent ID = 98%) against the TAIR AGI CDS dataset.
(4) Table B: 7998 total AG-AGI mappings and 23664 total ATH1-AGI mappings. Example rows: AG 12936_s_at, ATH1 264474_s_at, AT5G38410; AG 12752_s_at, ATH1 254386_at, AT4G21960.
(5) TIGR provides 28901 Arabidopsis TCs. These were aligned with blastn (E-value = E-4, percent ID = 98%) against the “AGI transcripts” dataset.
(6) Table C: 41651 total TC-AGI mappings. Table D: 18551 Arabidopsis EGO-TC mappings by TIGR. Example rows: TC261045, AT5G38410, EGO 894156; TC251315, AT4G21960, EGO 894523.
(7) Table E (union of Tables B, C, and D): 7823 total AG-AGI-TC-EGO mappings and 20051 total ATH1-AGI-TC-EGO mappings. Example rows: AG 12936_s_at, ATH1 264474_s_at, AT5G38410, TC261045, EGO 894156; AG 12752_s_at, ATH1 254386_at, AT4G21960, TC251315, EGO 894523.
(8) Union of Table A and Table E: 7744 mappings between Affymetrix Poplar Genome Array probe sets and Affymetrix AG probe sets, and 17297 mappings between Affymetrix Poplar Genome Array probe sets and Affymetrix ATH1 probe sets.

| Affymetrix Poplar probe set | Poplar TC | Arabidopsis TC | Arabidopsis AGI | Affymetrix AG probe set | Affymetrix ATH1 probe set |
| PtpAffx.249.8.A1_s_at | TC19207 | TC261045 | AT5G38410 | 12936_s_at | 264474_s_at |
| Ptp.1492.1.S1_s_at | TC28054 | TC251315 | AT4G21960 | 12752_s_at | 254386_at |

Table 2: Comparing microarray vendor and CCPMT mappings.

| Type of match | Affymetrix AG | Affymetrix ATH1 | Operon | CATMA | AFGC |
| Number of probes per array type | 8297 | 22810 | 29954 | 24576 | 19108 |
| Nil entries from vendor (no mapping for these probes) | 141 | 250 | 936 | 2969 | 2823 |
| Absent-vendor; present-blast | 0 | 0 | 0 | 0 | 1 |
| Present-vendor; absent-blast | 850 | 930 | 2335 | 2990 | 10952 |
| Many-vendor; one-blast | 124 | 584 | 0 | 30 | 117 |
| One-vendor; many-blast | 338 | 896 | 480 | 408 | 368 |
| Exact match | 6932 | 20193 | 26138 | 19551 | 6413 |
| Percentage of the vendor mapping numbers | 84% | 89% | 87% | 80% | 34% |

2.4. The CCPMT application

The CCPMT (http://www.ssg.uab.edu/ccpmt/) is composed of three pieces, namely, web pages (front end), core methods, and database (back end). The CCPMT web pages are written in JSP. Once the user hits the submit button, all of the data that have been entered are sent to the core code of Java servlets. The servlets act as the core methods that process the information received from the JSP pages and query the database. MySQL is used as the back-end database to store the microarray mappings. The code underlying the CCPMT is available from the corresponding author by request.

2.5. Using the CCPMT

The CCPMT is designed to be flexible and to allow for linking probes across arrays from a variety of starting data. CCPMT can be queried with probe set IDs, AGI IDs, TIGR EGO IDs, or TC IDs, and output is returned in these formats as well. As CCPMT is a web application, users can type or paste their queries in a textbox and, upon submission of the queries, the results are displayed in a browser-friendly format. One can also compare entire arrays by selecting the input array and the output array from the drop-down menu.

2.6. Example of the use of CCPMT

We illustrate the utility of the CCPMT by mapping the probe set 244904_at, which is found on the Affymetrix AG array, to determine which probe sets on the ATH1 array query the same gene. Step 1 (illustrated in Figure S1 in Supplementary Material available online at doi:10.1155/2008/451327) shows that the user wants to map the input data using Affymetrix probe set IDs. In addition, the user's email address is entered so that the results can also be sent as an attachment in comma-separated file format. The next step (see Figure S2) is to enter the probe set(s), 244904_at in this case, and the species of the probe set, and to indicate the arrays on which to find homologous probe sets (in this example, Affymetrix AG and Affymetrix ATH1 arrays). The results are then displayed as in Figure S3, which shows that the probe set 244904_at was mapped to 244922_s_at and 244923_s_at through the respective AGI IDs and that they map to AT2G07674.

Table 3: Summary table of the number of probes that are linked between the various arrays currently in the CCPMT, from the array in the row to the arrays in the columns. The above- and below-diagonal elements differ slightly because the methods we used, such as blastn and percent identity, are not always reflexive. Cells that are empty here have no value in the source table as extracted.

| From \ To | Affymetrix AG | Affymetrix ATH1 | AFGC | Agilent | CATMA | Operon | Affymetrix Poplar Genome Array |
| Affymetrix AG | - | 7828 | 12170 | 7018 | 7193 | 8361 | 7744 |
| Affymetrix ATH1 | 7827 | - | 30066 | 19188 | 20521 | 24636 | 17279 |
| AFGC | 12171 | 30066 | - | 29622 | 26070 | 30509 | |
| Agilent | 7018 | 19188 | 29622 | - | 18563 | 21371 | |
| CATMA | 7192 | 20521 | 26070 | 18561 | - | 23082 | |
| Operon | 8362 | 24636 | 30509 | 21371 | 23081 | - | |
| Affymetrix Poplar Genome Array | 7744 | 17279 | 17793 | 16912 | 16378 | 17504 | - |

3. DISCUSSION

Microarrays are gaining popularity in plant research. In addition, the requirement of many journals to deposit microarray data into public databases has made large amounts of data available for other investigators to use. But because there are a large number of arrays and array types, it can be difficult to compare data across datasets. We developed the CCPMT to allow investigators to identify common elements between databases rapidly and accurately.

While most vendors provide some mapping of probes to genes, in many cases the annotation is out of date or the companies use different standards for mapping. In some cases, there is considerable difference between our mapping and those provided with the arrays. This is due to at least three reasons.

The first is that sequence, gene models, and annotation, especially for the incompletely sequenced genomes, can change rapidly. As a result, the provided annotation may be out of date. For example, data for CATMA and AFGC, obtained from TAIR at ftp://ftp.arabidopsis.org/home/tair/Microarrays/, had a timestamp of January 2006, but the FASTA-format sequence file has a timestamp of April 2004.

The second reason for differences would be the choice of cut-off for mapping. We used >98% identity and an E score of less than 10^-4 for all but the AFGC arrays. Our choice of >98% is debatable, and somewhat different answers are obtained if other values are used; 98% may identify some paralogous genes, especially across species. It has not been conclusively established what level of sequence similarity is needed between a gene and a probe set for efficient binding. It is known that a single-base-pair difference in a short oligo can (more than 50% of the time, depending on the position of the SNP) destroy most binding. But since Affymetrix arrays usually have 11 sets of short oligos, the nonbinding of a single probe may or may not affect the overall RNA quantitation [7]. Long oligos bind relatively well with a few (1–3 bp) differences, but there are usually no additional probes to provide redundancy. cDNA clones can be quite long and only a portion of the sequence needs to be homologous for binding.

A third source of difference may result from the choice of database used to identify common genes. We used the TIGR EGO, but the NCBI HomoloGene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene) also identifies homologous genes across species. Unfortunately, these databases give slightly different mappings. We have used the TIGR EGO database as it has more plant sequence data and has plant biologists devoted to curating the databases, as opposed to HomoloGene, which is mammal-centric.

Thus, the choice we made about cut-offs is conservative, but we have probably missed some probes with lower homology that actually do bind certain RNAs, and many others identify paralogous genes. As a result of these issues, our mapping is different from those provided by the vendors. The highest overlap is between the mapping provided by Affymetrix and the CCPMT mapping for the Affymetrix ATH1 array at 89%, while the AFGC has the lowest overlap at about 34% (Table 2).

We think that the function allowing direct comparison of complete arrays is very useful for several reasons. One of the reasons why we developed the CCPMT was to allow coexpression analysis across arrays and species. This mapping in the CCPMT will be the basis of our next additions to CressExpress (http://www.cressexpress.org/), and others may use this as well for similar projects. Data from experiments are often collected over time and on different array platforms, so combining them requires the mapping of probes across array platforms. The ability of the CCPMT to map data across platforms greatly facilitates such work.

The annotation and sequence for genes as well as gene models are continuing to evolve, especially as additional species are sequenced. We have set up the CCPMT to allow us to rapidly change the various portions of the database and mapping as data change. We plan to revise the CCPMT based upon new genomic information.

CCPMT currently has six Arabidopsis microarrays and one Poplar microarray. The tool was designed in such a way that one can easily incorporate a new microarray vendor for the current plant species as well as for new plant species. In the near future, we will roll out mapping for all Affymetrix-provided arrays for plant species, as well as for the long oligo arrays from Operon and Agilent.

REFERENCES

[1] K.-H. Cheung, J. Hager, D. Pan, et al., “KARMA: a web server application for comparing and annotating heterogeneous microarray platforms,” Nucleic Acids Research, vol. 32, web server issue, pp. W441–W444, 2004.
[2] J. Tsai, R. Sultana, Y. Lee, et al., “Resourcerer: a database for annotating and linking microarray resources within and across species,” Genome Biology, vol. 2, no. 11, pp. 1–4, 2001.
[3] A. J. Olson, T. Tully, and R. Sachidanandam, “GeneSeer: a sage for gene names and genomic resources,” BMC Genomics, vol. 6, article 134, 2005.
[4] J. Quackenbush, F. Liang, I. Holt, G. Pertea, and J. Upton, “The TIGR Gene Indices: reconstruction and representation of expressed gene sequences,” Nucleic Acids Research, vol. 28, no. 1, pp. 141–145, 2000.
[5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[6] B. Stirling, Z. K. Yang, L. E. Gunter, G. A. Tuskan, and H. D. Bradshaw Jr., “Comparative sequence analysis between orthologous regions of the Arabidopsis and Populus genomes reveals substantial synteny and microcollinearity,” Canadian Journal of Forest Research, vol. 33, no. 11, pp. 2245–2251, 2003.
[7] J. O. Borevitz, D. Liang, D. Plouffe, et al., “Large-scale identification of single-feature polymorphisms in complex genomes,” Genome Research, vol. 13, no. 3, pp. 513–523, 2003.
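Illustrative sketch for Section 2.2 (not the authors' code, which is Java and available from them on request): the step that turns blastn hits into probe set to AGI mappings can be pictured as a filter on the expected value and percent identity cut-offs. The sketch below assumes the hits are in the standard 12-column tabular BLAST layout (query, subject, percent identity, ..., E-value, bit score), assumes the cut-offs are inclusive, and assumes subject IDs carry a splice-variant suffix such as ".1". The function name `probe_to_agi` and the sample hits are hypothetical.

```python
import csv
from collections import defaultdict

# Cut-offs quoted in Section 2.2; treating them as inclusive is an assumption.
E_VALUE_CUTOFF = 1e-4
PERCENT_IDENTITY_CUTOFF = 98.0


def probe_to_agi(blast_lines, max_evalue=E_VALUE_CUTOFF,
                 min_identity=PERCENT_IDENTITY_CUTOFF):
    """Return {probe_set_id: set of AGI IDs} for hits that pass both cut-offs.

    blast_lines: iterable of tab-separated lines in the 12-column tabular
    BLAST layout (column 3 = percent identity, column 11 = E-value).
    """
    mappings = defaultdict(set)
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        probe_set, subject = row[0], row[1]
        identity, evalue = float(row[2]), float(row[10])
        if evalue <= max_evalue and identity >= min_identity:
            # Assumed ID shape: AT5G38410.1 -> gene-level AGI ID AT5G38410
            mappings[probe_set].add(subject.split(".")[0])
    return dict(mappings)


# Hypothetical hits: two pass the cut-offs, two fail (identity or E-value).
sample_hits = [
    "12936_s_at\tAT5G38410.1\t100.00\t25\t0\t0\t1\t25\t301\t325\t2e-07\t50.1",
    "12936_s_at\tAT5G38420.1\t96.00\t25\t1\t0\t1\t25\t301\t325\t1e-05\t42.1",
    "12752_s_at\tAT4G21960.1\t99.10\t110\t1\t0\t1\t110\t10\t119\t3e-50\t200",
    "12752_s_at\tAT1G01010.1\t100.00\t12\t0\t0\t1\t12\t5\t16\t0.5\t24.3",
]
agi_by_probe = probe_to_agi(sample_hits)
print(agi_by_probe)
```

Because each probe set keeps a set of AGI IDs, the one-to-many case described in Section 2.3 (one probe set passing the cut-offs against several genes) is represented without any special handling.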
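Illustrative sketch for Section 2.3 and Figure 3: the cross-species mapping can be read as a chain of relational joins, Poplar probe set to Poplar TC to EGO to Arabidopsis TC to AGI to Arabidopsis probe set. The sketch below is a minimal in-memory version of that idea, not the CCPMT's MySQL implementation; the helper names `join` and `chain` are hypothetical, and the rows are the two example rows given in Figure 3.

```python
from collections import defaultdict
from functools import reduce


def join(left, right):
    """Join two relations of (key, value) pairs where left value == right key."""
    index = defaultdict(set)
    for key, value in right:
        index[key].add(value)
    return {(a, c) for a, b in left for c in index.get(b, ())}


def chain(*relations):
    """Compose relations left to right; one-to-many links are preserved."""
    return reduce(join, relations)


# Example rows taken from Figure 3.
poplar_probe_to_tc = {("PtpAffx.249.8.A1_s_at", "TC19207"),
                      ("Ptp.1492.1.S1_s_at", "TC28054")}
poplar_tc_to_ego = {("TC19207", "894156"), ("TC28054", "894523")}
ego_to_arabidopsis_tc = {("894156", "TC261045"), ("894523", "TC251315")}
arabidopsis_tc_to_agi = {("TC261045", "AT5G38410"), ("TC251315", "AT4G21960")}
agi_to_ath1_probe = {("AT5G38410", "264474_s_at"), ("AT4G21960", "254386_at")}

poplar_to_ath1 = chain(poplar_probe_to_tc, poplar_tc_to_ego,
                       ego_to_arabidopsis_tc, arabidopsis_tc_to_agi,
                       agi_to_ath1_probe)
print(sorted(poplar_to_ath1))
```

Keeping every stage as a set of pairs rather than a dictionary means that a TC linked to several AGI IDs, or an AGI ID queried by several probe sets, simply yields several output pairs, which is one way to picture why the mapping counts in Figure 3 change from stage to stage.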

Publisher
Hindawi Publishing Corporation
ISSN
1687-5370
DOI
10.1155/2008/451327
Publisher site
See Article on Publisher Site

Abstract

Hindawi Publishing Corporation International Journal of Plant Genomics Volume 2008, Article ID 451327, 7 pages doi:10.1155/2008/451327 Resource Review Cross-Chip Probe Matching Tool: A Web-Based Tool for Linking Microarray Probes within and across Plant Species 1, 2 2 2, 3 Ruchi Ghanekar, Vinodh Srinivasasainagendra, and Grier P. Page Department of Electrical and Computer Engineering, UAB School of Engineering, University of Alabama at Birmingham, 1530 Third Avenue South, Birmingham, AL 35294-4461, USA Department of Biostatistics, University of Alabama at Birmingham, 1665 University Blvd, Birmingham, Al 35294-0022, USA Statistics and Epidemiology Unit, RTI International, Oxford Building, Suite 119, 2951 Flowers Road South, Atlanta, GA 30341-5533, USA Correspondence should be addressed to Grier P. Page, gpage@rti.org Received 2 November 2007; Accepted 14 August 2008 Recommended by Chunguang Du The CCPMT is a free, web-based tool that allows plant investigators to rapidly determine if a given gene is present across various microarray platforms, which, of a list of genes, is present on array(s), and which gene a probe or probe set queries and vice versa, and to compare and contrast the gene contents of arrays. The CCPMT also maps a probe or probe sets to a gene or genes within and across species, and permits the mapping of the entire content from one array to another. By using the CCPMT, investigators will have a better understanding of the contents of arrays, a better ability to link data between experiments, ability to conduct meta-analysis and combine datasets, and an increased ability to conduct data mining projects. Copyright © 2008 Ruchi Ghanekar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 1. 
INTRODUCTION that sets in the public domain allow can be used for data mining or meta-analysis if the elements can be connected. Microarrays are an incredibly powerful technology that However, it is difficult to compare and combine the results enables the rapid and relatively accurate measurement of due to the difficultly in matching probes across arrays with thousands of genes in a single sample. Many different the genes, or even to determine if a given gene is on a given platform. To make matters worse, while the probe microarray platforms have been developed and each has somewhat different content and format. One key differ- sequences on an array are constant, the genome annotation ence is the type of probe used to query a gene expres- and gene models are not, and homologous genes may have different names across species. As a result, matching probes sion; some platforms use a single probe, and others use many probes. The probes may be short (25 base pairs) across arrays is continually evolving and needs continuing oligonucleotides (Affymetrix and NimbleGen arrays), long updating. (50–70 bp) oligonucleotides (Operon, Agilent, CATMA), or Investigators have long realized the problem of linking cDNA clones (AFGC arrays). Each of the formats has its probes across platforms; as a result, several tools have been advantages and disadvantages as well as its proponents and developed. These include Keck ARray Manager and Annota- opponents. One thing on which everybody agrees is that tor (KARMA) [1], RESOURCERER [2], and GeneSeer [3]. arrays will be a part of the experimental techniques of plant Our tool has several advantages over the other tools for several reasons. None of the other tools allows investigators biologists for years to come. 
Since there are many microarray platforms even within to query for genes within a microarray platform nor do the a single species, different investigators may use different other tools allow queries by Arabidopsis Genome Initiative platforms to try to address similar or complementary (AGI) annotation IDs or by TIGR tentative consensus (TC) experimental questions or data may be collected across types gene IDs. Furthermore, our tool sends the results to the using different platforms. Also, the large number of datasets investigators by email as well as a web-based report making 2 International Journal of Plant Genomics results’ tracking and storage easier. More importantly for Microarray vendor provided NCBI BLAST plant researchers, only RESOURCERER has any provision blastn program probe sequence for the linking of plant array data, but it has fewer array types. We developed the CCPMT to enable investigators to rapidly determine (1) if a given gene is present across many types of array platforms within and across species, (2) which, of a list of genes, is present on array(s), and (3) which gene Tair dataset a probe or probe set queries. The CCPMT also maps a probe set or probe sets to a gene or genes within and across species, and permits the mapping of the contents from one array to another, both within and across species. Blastn result The CCPMT is the first tool exclusively designed probe set and corresponding for linking probes from plant microarrays within and AGI mapping across microarray platforms and species. A web-based tool, CCPMT, helps investigators query for annotations at probe Figure 1: CCPMT Arabidopsis BLAST workflow. The workflow in level with probe set IDs or even at gene level with gene CCPMT to get the probe set to AGI mappings is shown. identifiers such as AGI, EGO [4], andTCIDs. In CCPMT, an investigator can enter either individual or multiple probe set or gene identifiers (separated by commas) in the textbox to query the CCPMT database. 
Checkboxes for microarray vendors provide the option of selecting multiple arrays while querying the CCPMT database. CCPMT also offers the flexibility to carry out a one-to-one comparison of microarrays. Results are displayed immediately in the web browser and are also sent by email in CSV file format.

CCPMT has a flexible database design, and in the immediate future additional plant arrays will be added to the database; we will also revise the underlying annotation and mapping for the probes based upon new genomic information.

By using the CCPMT, investigators will have a better understanding of the contents of arrays, a better ability to link data between experiments, and the ability to more easily conduct data mining projects.

2. METHODS

2.1. Arrays selected for initial analysis

Initially we focused upon microarrays with diverse probe types (short and long oligos as well as cDNA) and for both Poplar and Arabidopsis. Poplar and Arabidopsis were chosen because both have completely sequenced genomes and are relatively closely related species. The arrays are the Affymetrix Arabidopsis genome (8 K) array, commonly referred to as AG; the Affymetrix Arabidopsis genome ATH1-121501 (25 K) array, commonly referred to as ATH1; the Agilent Arabidopsis 2 Oligo Microarray (V2) G4136B; the Arabidopsis Functional Genomics Consortium (AFGC) array; the Complete Arabidopsis Transcriptome MicroArray (CATMA) array; the Operon Arabidopsis Genome Oligo Set Version 3.0; and the Affymetrix Poplar Genome Array. The array that we are calling AFGC actually represents all cDNA clones used in all of the AFGC arrays, including the 11 k, 13 k, and 16 k arrays.

2.2. Arabidopsis data preprocessing

We obtained the probe set ID, the vendor's corresponding mapping to AGI ID (for Arabidopsis arrays), and the nucleotide sequences of the probe sets (Table 1) directly from the vendors.

Table 1: Web pages from where plant microarray data were downloaded.

| Array | Probe sequence file location | Vendor-provided annotation file location |
|---|---|---|
| Affymetrix AG | http://www.affymetrix.com/support/technical/byproduct.affx?product=atgenome1 | http://www.affymetrix.com/support/technical/byproduct.affx?product=atgenome1 |
| Affymetrix ATH1 | http://www.affymetrix.com/support/technical/byproduct.affx?product=arab | http://www.affymetrix.com/support/technical/byproduct.affx?product=arab |
| Operon | http://omad.operon.com/download/index.php | http://omad.operon.com/download/index.php |
| CATMA | ftp://ftp.arabidopsis.org/home/tair/Microarrays/CATMA/ | ftp://ftp.arabidopsis.org/home/tair/Microarrays/CATMA/ |
| AFGC | ftp://ftp.arabidopsis.org/home/tair/Microarrays/AFGC/ | ftp://ftp.arabidopsis.org/home/tair/Microarrays/AFGC/ |
| Agilent | NA (do not provide sequence files) | http://www.chem.agilent.com/Scripts/PDS.asp?lPage=37068 |
| Affymetrix Poplar Genome Array | http://www.affymetrix.com/support/technical/byproduct.affx?product=poplar | http://www.affymetrix.com/support/technical/byproduct.affx?product=poplar |

In the case of Arabidopsis, all vendors provided the mappings between their probe sets and the corresponding AGI gene identifiers. However, due to evolving genome annotation, we derived a new set of mappings between the probe sets and the corresponding AGI IDs. The steps of the process are illustrated in Figure 1. The mapping was accomplished using the NCBI blastn [5] program. Blastn compares a nucleotide query sequence against a nucleotide sequence database. We used two different databases for blastn analysis. For the Affymetrix and Operon probe sequences, which do not contain introns, the AGI CDS database at TAIR was used as the sequence database because this database lacks introns and UTRs. The AGI CDS dataset is based on the TAIR6.0 release version, and was released in November 2005. For the AFGC and CATMA arrays, which do contain some intronic and UTR sequences, the AGI Transcripts dataset was used. The AGI Transcripts dataset includes all of the coding sequences from Arabidopsis as well as the UTRs. Neither database contained intronic sequence. The AGI Transcripts dataset used the TAIR6.0 release version and was released in November 2005. The blastn expected value and percent identity cut-offs were 10^-4 and 98%, respectively.

2.3. Poplar data preprocessing

About 27% of the Poplar sequences have significant homology to Arabidopsis protein-coding sequences [6] and have been sequenced. Unlike Arabidopsis, Poplar does not have a universal gene annotation ID; so in CCPMT, Poplar probe sets are mapped within the species using the TIGR TC IDs and across plant species using the EGO database. The Poplar target sequences were sequence-aligned with the TIGR Poplar TC dataset using the blastn program, as shown in Figure 2. The blastn expected value and percent identity cut-offs were 10^-4 and 98%, respectively. TIGR also provides a file with a mapping of the EGO ID and the corresponding TCs for all species. From this file, the mappings between EGO IDs and the corresponding Arabidopsis and Poplar TCs were parsed. The mapping of the TC to EGOs was assumed to be correct. In the future, any plant species with genes mapping to an EGO ID can be easily incorporated into CCPMT. Mapping the Arabidopsis TCs to their corresponding AGI IDs was achieved by using the Arabidopsis TC sequences (TIGR provides this file) and sequence-aligning them with the TAIR “AGI Transcripts” dataset using blastn. Based on the cut-offs used, there is one-to-many mapping at several stages. A probe set can map to multiple genes, and multiple probe sets can map to one gene (Table 2).

Figure 2: Poplar-Arabidopsis mapping. The above workflow explains the steps that were undertaken while mapping the Affymetrix Poplar probe set ID with the Arabidopsis probe set ID. TIGR EGO ID was used to go across species during mapping. [Figure not reproduced. Workflow: Affymetrix Poplar probe set, then Poplar TC sequence files, then EGO TOG (TIGR ortholog group) ID, then Arabidopsis TC mappings, then the decision "Does TC annotation have AGI?". If yes, query the existing CCPMT database with the AGI to get the corresponding Arabidopsis probe sets from the Arabidopsis microarray vendors. If no, use the TC oligo sequence to BLAST it against the TAIR-provided Arabidopsis dataset; the BLAST returns the AGI ID.]

As an example, Figure 3 illustrates the mapping of the Affymetrix Poplar Genome Array with the Affymetrix AG and Affymetrix ATH1 arrays; similar processes are used for the other arrays. Table 3 contains the number of matches that were found among all possible pairs of arrays.

Figure 3: Workflow for the mapping between Affymetrix Poplar, Affymetrix AG, and Affymetrix ATH1 arrays. [Figure not reproduced; the steps and counts shown in it are as follows. The Affymetrix Poplar Genome Array has 61414 Poplar probe sets, which were aligned by blastn (E-value = E-4, percent ID = 98%) against the Poplar TC file from TIGR (41375 Poplar TCs); total mappings: 57668. Poplar EGO-TC mappings by TIGR: 16818. Table A, total Poplar probe set-TC-EGO mappings: 21931 (for example, PtpAffx.249.8.A1_s_at, TC19207, EGO 894156; Ptp.1492.1.S1_s_at, TC28054, EGO 894523). The Affymetrix AG target sequences have 8297 probe sets and the Affymetrix ATH1 target sequences have 22810 probe sets; both were aligned by blastn (E-value = E-4, percent ID = 98%) against the TAIR AGI CDS dataset. Table B, total mappings: AG-AGI 7998 and ATH1-AGI 23664 (for example, 12936_s_at and 264474_s_at with AT5G38410; 12752_s_at and 254386_at with AT4G21960). TIGR has 28901 Arabidopsis TCs, which were aligned by blastn (E-value = E-4, percent ID = 98%) against the “AGI transcripts” dataset. Table C, total TC-AGI mappings: 41651. Table D, Arabidopsis EGO-TC mappings by TIGR: 18551 (for example, TC261045, AT5G38410, EGO 894156; TC251315, AT4G21960, EGO 894523). Union of Tables B, C, and D gives Table E: total AG-AGI-TC-EGO mappings 7823 and ATH1-AGI-TC-EGO mappings 20051. Union of Tables A and E gives 7744 mappings between Affymetrix Poplar Genome Array probe sets and Affymetrix AG probe sets, and 17297 mappings between Affymetrix Poplar Genome Array probe sets and Affymetrix ATH1 probe sets (for example, PtpAffx.249.8.A1_s_at, Poplar TC19207, Arabidopsis TC261045, AT5G38410, AG probe set 12936_s_at, ATH1 probe set 264474_s_at; Ptp.1492.1.S1_s_at, Poplar TC28054, Arabidopsis TC251315, AT4G21960, AG probe set 12752_s_at, ATH1 probe set 254386_at).]

Table 2: Comparing microarray vendor and CCPMT mappings.

| Type of match | Affymetrix AG | Affymetrix ATH1 | Operon | CATMA | AFGC |
|---|---|---|---|---|---|
| Number of probes per array type | 8297 | 22810 | 29954 | 24576 | 19108 |
| Nil entries from vendor (no mapping for these probes) | 141 | 250 | 936 | 2969 | 2823 |
| Absent-vendor; present-blast | 0 | 0 | 0 | 0 | 1 |
| Present-vendor; absent-blast | 850 | 930 | 2335 | 2990 | 10952 |
| Many-vendor; one-blast | 124 | 584 | 0 | 30 | 117 |
| One-vendor; many-blast | 338 | 896 | 480 | 408 | 368 |
| Exact match | 6932 | 20193 | 26138 | 19551 | 6413 |
| Percentage of the vendor mapping numbers | 84% | 89% | 87% | 80% | 34% |

2.4. The CCPMT application

The CCPMT (http://www.ssg.uab.edu/ccpmt/) is composed of three pieces, namely, web pages (front end), core methods, and database (back end). The CCPMT web pages are written in JSP. Once the user hits the submit button, all of the data that have been entered are sent to the core code of Java servlets. The servlets act as the core methods that process the information received from the JSP pages and query the database. MySQL is used as the back-end database to store the microarray mappings. The code underlying the CCPMT is available from the corresponding author by request.

2.5. Using the CCPMT

The CCPMT is designed to be flexible and to allow for linking probes across arrays from a variety of starting data. CCPMT can be queried either at the probe set level with probe set IDs or at the gene level with identifiers such as AGI IDs, TIGR EGO IDs, or TC IDs, and output is returned in these formats as well. As CCPMT is a web application, users can type or paste their queries in a textbox and, upon submission of the queries, the results are displayed in a browser-friendly format. One can also compare entire arrays by selecting the input array and the output array from the drop-down menu.

2.6. Example of the use of CCPMT

We illustrate the utility of the CCPMT by mapping the probe set 244904_at that is found on the Affymetrix AG array to determine which probe sets on the ATH1 array query the same gene. Step 1 (illustrated in Figure S1 in Supplementary Material available online at doi:10.1155/2008/451327) shows that the user wants to map the input data using Affymetrix probe set IDs. In addition, the user's email address is entered so that the results can also be sent as an attachment in comma-separated file format. The next step (see Figure S2) is to enter the probe set(s), 244904_at in this case, and the species of the probe set, and to indicate the arrays on which to find homologous probe sets (in this example, Affymetrix AG and Affymetrix ATH1 arrays). The results are then displayed as in Figure S3, which shows that the probe set 244904_at was mapped to 244922_s_at and 244923_s_at through the respective AGI IDs and that they map to AT2G07674.

Table 3: Summary table of the number of probes that are linked between the various arrays currently in the CCPMT from the array in row to the arrays in columns.
The above- and below-diagonal elements are slightly different because the methods we used, such as blastn with a percent identity cut-off, are not always reflexive.

| | Affymetrix AG | Affymetrix ATH1 | AFGC | Agilent | CATMA | Operon | Affymetrix Poplar Genome Array |
|---|---|---|---|---|---|---|---|
| Affymetrix AG | — | 7828 | 12170 | 7018 | 7193 | 8361 | 7744 |
| Affymetrix ATH1 | 7827 | — | 30066 | 19188 | 20521 | 24636 | 17279 |
| AFGC | 12171 | 30066 | — | 29622 | 26070 | 30509 | |
| Agilent | 7018 | 19188 | 29622 | — | 18563 | 21371 | |
| CATMA | 7192 | 20521 | 26070 | 18561 | — | 23082 | |
| Operon | 8362 | 24636 | 30509 | 21371 | 23081 | — | |
| Affymetrix Poplar Genome Array | 7744 | 17279 | 17793 | 16912 | 16378 | 17504 | |

3. DISCUSSION

Microarrays are gaining popularity in plant research. In addition, the requirement of many journals to deposit microarray data into public databases has made large amounts of data available for other investigators to use. But because there are a large number of arrays and array types, it can be difficult to compare data across datasets. We developed the CCPMT to allow investigators to identify common elements between databases rapidly and accurately.

While most vendors provide some mapping of probes to genes, in many cases the annotation is out of date or the companies use different standards for mapping. In some cases, there is considerable difference between our mapping and those provided with the arrays. This is due to at least three reasons. The first is that sequence, gene models, and annotation, especially for the incompletely sequenced genomes, can change rapidly. As a result, the provided annotation may be out of date. For example, data for CATMA and AFGC, obtained from TAIR at ftp://ftp.arabidopsis.org/home/tair/Microarrays/, had a timestamp of January 2006, but the FASTA format file had a timestamp of April 2004. The second reason for differences would be the choice of cut-off for mapping. We used >98% identity and an E score of less than 10^-4 for all but the AFGC arrays. Our choice of >98% is debatable, and somewhat different answers are obtained if other values are used; 98% may identify some paralogous genes, especially across species. It has not been conclusively established what level of sequence similarity is needed between a gene and a probe set for efficient binding. It is known that a single-base-pair difference in a short oligo can (>50% of the time, depending on the position of the SNP) destroy most binding. But since Affymetrix arrays usually have 11 sets of short oligos, the nonbinding of a single probe may or may not affect the overall RNA quantitation [7]. Long oligos bind relatively well with a few (1–3 bp) differences, but there is usually no redundancy from additional probes. cDNA clones can be quite long, and only a portion of the sequence needs to be homologous for binding. A third source of difference may result from the choice of common genes. We used the TIGR EGO, but the NCBI HomoloGene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=homologene) also identifies homologous genes across species. Unfortunately, these databases give slightly different mappings. We have used the TIGR EGO database as it has more plant sequence data and has plant biologists devoted to curating the databases, as opposed to HomoloGene, which is mammal-centric. Thus, the choice we made about cut-offs is conservative, but we have probably missed some probes with lower homology that actually do bind certain RNAs, and many others identify paralogous genes. As a result of these issues, our mapping is different from those provided by the vendor. The highest overlap is between the mapping provided by Affymetrix and the CCPMT mapping for the Affymetrix ATH1 array at 89%, while the AFGC has the lowest overlap at about 34% (Table 2).

We think that the function allowing direct comparison of complete arrays is very useful for several reasons. One of the reasons why we developed the CCPMT was to allow coexpression analysis across arrays and species. This mapping in the CCPMT will be the basis of our next additions to CressExpress (http://www.cressexpress.org/), and others may use this as well for similar projects. Data from experiments are often collected across time, and different array platforms are used, which requires the mapping of probes across array platforms. This ability will be greatly amplified by the ability of the CCPMT to map data across platforms.

The annotation and sequence for genes as well as gene models are continuing to evolve, especially as additional species are sequenced. We have set up the CCPMT to allow us to rapidly change the various portions of the database and mapping as data change. We plan to revise the CCPMT based upon new genomic information.

CCPMT currently has six Arabidopsis microarrays and one Poplar microarray. The tool was designed in such a way that one can easily incorporate a new microarray vendor for the current plant species as well as for new plant species. In the near future, we will roll out mapping for all Affymetrix-provided arrays for plant species, as well as those long oligo arrays from Operon and Agilent.

REFERENCES

[1] K.-H. Cheung, J. Hager, D. Pan, et al., “KARMA: a web server application for comparing and annotating heterogeneous microarray platforms,” Nucleic Acids Research, vol. 32, web server issue, pp. W441–W444, 2004.
[2] J. Tsai, R. Sultana, Y. Lee, et al., “Resourcerer: a database for annotating and linking microarray resources within and across species,” Genome Biology, vol. 2, no. 11, pp. 1–4, 2001.
[3] A. J. Olson, T. Tully, and R. Sachidanandam, “GeneSeer: a sage for gene names and genomic resources,” BMC Genomics, vol. 6, article 134, 2005.
[4] J. Quackenbush, F. Liang, I. Holt, G. Pertea, and J. Upton, “The TIGR Gene Indices: reconstruction and representation of expressed gene sequences,” Nucleic Acids Research, vol. 28, no. 1, pp. 141–145, 2000.
[5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[6] B. Stirling, Z. K. Yang, L. E. Gunter, G. A. Tuskan, and H. D. Bradshaw Jr., “Comparative sequence analysis between orthologous regions of the Arabidopsis and Populus genomes reveals substantial synteny and microcollinearity,” Canadian Journal of Forest Research, vol. 33, no. 11, pp. 2245–2251, 2003.
[7] J. O. Borevitz, D. Liang, D. Plouffe, et al., “Large-scale identification of single-feature polymorphisms in complex genomes,” Genome Research, vol. 13, no. 3, pp. 513–523, 2003.

Journal

International Journal of Plant Genomics, Hindawi Publishing Corporation

Published: Oct 21, 2008

References